
Zero-Downtime Database Migrations


Schema migrations are terrifying on large tables. A naive ALTER TABLE on a 50-million-row table can take an ACCESS EXCLUSIVE lock, blocking all reads and writes for minutes or hours. But with the right approach, you can run most migrations without any downtime at all.

The problem with ALTER TABLE

PostgreSQL acquires different lock levels for different DDL operations. Adding a nullable column takes an ACCESS EXCLUSIVE lock, but holds it only for an instant because the change touches nothing but the catalog. Adding a column with NOT NULL DEFAULT, on the other hand, rewrote the entire table under that same lock on versions before Postgres 11. Knowing your lock levels is step one.

-- Fast: only modifies the catalog, no table rewrite
ALTER TABLE users ADD COLUMN timezone text;

-- Slow: requires a full table rewrite on Postgres < 11
ALTER TABLE users ADD COLUMN status text NOT NULL DEFAULT 'active';

-- Fast on Postgres 11+: the same statement needs no rewrite
ALTER TABLE users ADD COLUMN status text NOT NULL DEFAULT 'active';

The expand-contract pattern

The core idea is to split each migration into two phases. In the expand phase, you add new columns, tables, or indexes without removing anything old. Your application starts writing to both the old and new schema. Once that's deployed and stable, the contract phase removes the old schema and cleans up.

For renaming a column, the process is: add the new column, deploy code that writes to both and reads from the new one with a fallback to the old one, backfill existing rows, switch reads to only the new column, then drop the old column. It's more steps, but each step is individually safe.
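
Here's a minimal sketch of that sequence, assuming a hypothetical rename of users.full_name to display_name; the dual-write step lives in application code, so it appears here only as a comment:

-- Expand: add the new column (catalog-only, safe)
ALTER TABLE users ADD COLUMN display_name text;

-- Deploy application code that writes to both columns and reads
-- display_name with a fallback to full_name.

-- Backfill existing rows (do this in batches; see the next section)
UPDATE users SET display_name = full_name WHERE display_name IS NULL;

-- Once reads have switched to display_name only...
-- Contract: drop the old column
ALTER TABLE users DROP COLUMN full_name;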

Backfilling without locks

Backfilling a new column on millions of rows needs to happen in small batches to avoid long-running transactions and lock contention. The recipe: update in batches of 1,000–10,000 rows with short sleeps between batches, using a WHERE clause that tracks progress by primary key rather than relying on a separate tracking table.

-- Batch backfill without a long-running transaction.
-- COMMIT inside a DO block requires Postgres 11+ and only works
-- when the block runs outside an explicit transaction.
DO $$
DECLARE
  batch_size INT := 5000;
  min_id     BIGINT := 0;
  max_id     BIGINT;
BEGIN
  SELECT MAX(id) INTO max_id FROM users;
  WHILE min_id <= max_id LOOP
    -- Half-open range: exactly batch_size ids per pass
    UPDATE users
    SET status = COALESCE(legacy_status, 'active')
    WHERE id >= min_id AND id < min_id + batch_size;

    COMMIT;  -- release row locks so each batch stays short
    min_id := min_id + batch_size;
    PERFORM pg_sleep(0.1);  -- brief pause to reduce contention
  END LOOP;
END $$;

Verification and rollback

Every migration should ship with a verification query and a rollback plan. Verification checks that the new column has expected values and no unexpected NULLs. The rollback plan isn't just the reverse DDL — it's a sequence that accounts for the fact that new data may have been written. If the migration has been live for three days, simply reverting the schema could lose data written during that window.
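
As a sketch, a verification query for the status backfill above might look like the following; what counts as "expected" (here, zero missing and zero mismatched rows) is an assumption you should adapt to your data:

-- Verify the backfill: no row should be left without a status,
-- and backfilled values should agree with the legacy column.
SELECT
  COUNT(*) FILTER (WHERE status IS NULL)                         AS missing,
  COUNT(*) FILTER (WHERE status IS DISTINCT FROM legacy_status
                     AND legacy_status IS NOT NULL)              AS mismatched
FROM users;
-- Expect missing = 0 and mismatched = 0 before starting the contract phase.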

These patterns have let us run hundreds of migrations on production tables with 100M+ rows without a single incident. The key principle: never do in one step what you can safely decompose into multiple safe steps.