PostgreSQL Bulk Imports: How to Move Millions of Rows Without the Wait

Database tutorial - IT technology blog
Database tutorial - IT technology blog

The Bottleneck: Why Standard INSERTs Fail at Scale

I once spent an entire afternoon watching a Python script crawl through a legacy database migration. I had written a basic loop that executed one INSERT statement per row. After 30 minutes, I checked the count: 50,000 records processed, 9.95 million to go.

At that pace, the migration would have taken nearly four days. Standard INSERT statements are perfect for single transactions, but they are incredibly slow for bulk loading. Each statement forces the database to parse the SQL, plan the execution, check constraints, and write to the Write-Ahead Log (WAL).

When you deal with millions of rows, this overhead becomes a literal roadblock. To move faster, you have to shift from individual transactions to batch operations. My goal is to show you how to saturate your hardware’s throughput and get that data moved in minutes, not days.

The Gold Standard: Using the COPY Command

For flat files like CSV or TSV, the COPY command is the fastest tool in the PostgreSQL arsenal. It is a server-side command that bypasses the standard SQL parser and execution planner. Instead of processing each row as a separate event, it streams the data directly into the table.

Using COPY from the psql CLI

If your data file already lives on the database server, you can trigger the import directly from the psql console. It looks like this:

COPY users(first_name, last_name, email) 
FROM '/var/lib/postgres/data/users.csv' 
DELIMITER ',' 
CSV HEADER;

Local files require a slightly different approach. If the file is on your laptop and you need to push it to a remote RDS or Cloud instance, use the \copy meta-command. This reads the file locally and pipes the stream to the server over the network:

psql -h remote-db-instance -U admin -d production_db -c "\copy users FROM 'local_users.csv' WITH (FORMAT csv, HEADER true)"

Application-Level Performance: Multi-Row Inserts

You won’t always have a clean CSV file ready to go. Sometimes your data comes from a live API or is generated dynamically by your application code. Sending one INSERT per record in a loop is a performance killer. You should group your records into batches instead.

A single-row insert might take 1ms. However, a multi-row insert of 1,000 rows might only take 10ms total. That is a 100x speedup just by changing your syntax.

Avoid this pattern:

INSERT INTO orders (id, amount) VALUES (1, 100);
INSERT INTO orders (id, amount) VALUES (2, 200);

Adopt this one:

INSERT INTO orders (id, amount) VALUES 
(1, 100), 
(2, 200), 
(3, 300);

In my experience, a batch size between 1,000 and 5,000 records hits the sweet spot. It balances memory consumption with network efficiency. If you use Python, the psycopg3 library has a copy() method that is significantly faster than execute_batch().

Handling Data Formats Before Import

Messy data is the most common reason imports fail. I often receive files with weird encoding or nested structures that don’t map directly to columns. Before I build a full automation pipeline, I use toolcraft.app/en/tools/data/csv-to-json to quickly verify data structures. It runs entirely in the browser. This means I can check my data without uploading sensitive records to a random server.

Advanced Optimization: Tuning the Database Engine

Even the COPY command can hit a ceiling if your database is configured for heavy read safety rather than heavy write speed. When importing hundreds of millions of rows, you need to loosen the reigns temporarily.

1. Drop Indexes and Constraints

Indexes are the enemy of ingestion speed. Every time you insert a row, PostgreSQL must update every B-tree or Hash index associated with that table. For a 100-million-row import, it is often 10 times faster to drop the indexes, run the import, and then recreate them from scratch.

-- Drop the index first
DROP INDEX idx_user_email;

-- Run your COPY command
COPY users FROM 'massive_data.csv' CSV;

-- Rebuild the index (Postgres does this very efficiently in bulk)
CREATE INDEX idx_user_email ON users(email);

2. Tune the Write-Ahead Log (WAL)

PostgreSQL uses checkpoints to ensure data is safely on disk. During a massive load, these checkpoints happen too frequently, causing “checkpoint spikes” that stall your import. You can give the database more breathing room by increasing the max_wal_size.

-- Increase WAL size to 4GB to reduce checkpoint frequency
SET max_wal_size = '4GB';
SET checkpoint_timeout = '20min';

3. Use UNLOGGED Tables

If you are importing data into a temporary staging table, use the UNLOGGED keyword. These tables do not write to the WAL. This makes them incredibly fast, though the data won’t survive a database crash. For staging data that you can easily recreate, this is a massive win.

CREATE UNLOGGED TABLE temp_import_staging (
    id SERIAL PRIMARY KEY,
    payload JSONB
);

The High-Speed Import Checklist

To get the best results, follow this workflow for your next big data move:

  • Disable Autocommit: Always wrap multi-row inserts in a single transaction block.
  • Strip non-essential indexes: Rebuild them only after the data is in.
  • Increase maintenance_work_mem: Set this to 1GB or more to speed up index recreation.
  • Use COPY: It is the undisputed champion of speed.
  • Run ANALYZE: Once the import finishes, run ANALYZE table_name. This ensures the query planner understands the new data distribution.

By shifting from row-by-row logic to a streaming mindset, I have seen import times drop from six hours to under fifteen minutes. PostgreSQL is built to handle massive throughput; you just have to stop getting in its way.

Share: