Schema migrations are terrifying on large tables. A naive ALTER TABLE on a 50-million-row table can hold an ACCESS EXCLUSIVE lock for the duration of a full table rewrite, blocking all reads and writes for minutes or hours. But with the right approach, you can run most migrations without any downtime at all.
The problem with ALTER TABLE
PostgreSQL acquires different lock levels for different DDL operations. Adding a nullable column takes only a very brief ACCESS EXCLUSIVE lock (essentially instant), though even a brief lock is dangerous if it queues behind a long-running query, because everything else queues behind it. Adding a column with NOT NULL DEFAULT rewrote the entire table before PostgreSQL 11; from 11 onward, a constant default is a metadata-only change. Knowing your lock levels is step one.
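As a sketch of both cases (the `orders` table and its columns are hypothetical), a `lock_timeout` makes the ALTER fail fast instead of queueing behind a long query:

```sql
-- Fail fast if the lock can't be acquired quickly, rather than
-- queueing and blocking all traffic behind us.
SET lock_timeout = '2s';

-- Metadata-only change: brief ACCESS EXCLUSIVE lock, essentially instant.
ALTER TABLE orders ADD COLUMN shipped_at timestamptz;

-- Metadata-only on PostgreSQL 11+ because the default is a constant;
-- on older versions this rewrites the whole table under lock.
ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'pending';
```

If the timeout fires, the migration aborts cleanly and can simply be retried during a quieter window.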
The expand-contract pattern
The core idea is to split each migration into two phases. In the expand phase, you add new columns, tables, or indexes without removing anything old. Your application starts writing to both the old and new schema. Once that's deployed and stable, the contract phase removes the old schema and cleans up.
For renaming a column, the process is: add the new column, deploy code that writes to both and reads from the new one with a fallback to the old one, backfill existing rows, switch reads to only the new column, then drop the old column. It's more steps, but each step is individually safe.
- Add new schema elements alongside existing ones
- Deploy application code that handles both states
- Backfill existing data in batches
- Validate no code depends on the old schema
- Remove old schema elements in a separate deployment
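The rename walkthrough above might look like this in SQL, assuming a hypothetical `accounts` table where `user_name` becomes `display_name`. The trigger is an optional belt-and-suspenders way to keep the columns in sync at the database level while both code paths exist (the `EXECUTE FUNCTION` syntax requires PostgreSQL 11+):

```sql
-- Expand: add the new column alongside the old one.
ALTER TABLE accounts ADD COLUMN display_name text;

-- Keep old and new columns in sync while both are live.
CREATE OR REPLACE FUNCTION sync_display_name() RETURNS trigger AS $$
BEGIN
  NEW.display_name := COALESCE(NEW.display_name, NEW.user_name);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_sync_display_name
  BEFORE INSERT OR UPDATE ON accounts
  FOR EACH ROW EXECUTE FUNCTION sync_display_name();

-- Contract (a separate deployment, after backfill and validation):
-- DROP TRIGGER accounts_sync_display_name ON accounts;
-- ALTER TABLE accounts DROP COLUMN user_name;
```

The contract statements are commented out deliberately: they belong in a later migration, shipped only after you've verified nothing reads the old column.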
Backfilling without locks
Backfilling a new column on millions of rows needs to happen in small batches to avoid long-running transactions and lock contention. The recipe: update in batches of 1,000–10,000 rows with short sleeps between batches, using a WHERE clause that tracks progress by primary key rather than relying on a separate tracking table.
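One batch of that recipe, continuing the hypothetical `accounts` rename, could look like the following. The driver substitutes `:last_id` (starting at 0), takes the maximum `id` from the `RETURNING` set as the next `:last_id`, sleeps briefly, and repeats until no rows come back:

```sql
-- Keyset pagination: walk the primary key instead of OFFSET or a
-- tracking table. Each batch is its own short transaction.
UPDATE accounts
SET display_name = user_name
WHERE id IN (
  SELECT id
  FROM accounts
  WHERE id > :last_id
    AND display_name IS NULL
  ORDER BY id
  LIMIT 5000
)
RETURNING id;
```

Because each batch commits independently, locks are held for milliseconds, and an interrupted backfill resumes from the last committed `id` with no bookkeeping beyond that one value.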
Verification and rollback
Every migration should ship with a verification query and a rollback plan. Verification checks that the new column has expected values and no unexpected NULLs. The rollback plan isn't just the reverse DDL — it's a sequence that accounts for the fact that new data may have been written. If the migration has been live for three days, simply reverting the schema could lose data written during that window.
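For the rename example, the verification queries are straightforward (table and column names hypothetical); both should return 0 before the contract phase ships:

```sql
-- Rows the backfill missed.
SELECT count(*) AS unbackfilled
FROM accounts
WHERE display_name IS NULL
  AND user_name IS NOT NULL;

-- Rows where old and new columns disagree (NULL-safe comparison).
SELECT count(*) AS mismatched
FROM accounts
WHERE display_name IS DISTINCT FROM user_name;
```

`IS DISTINCT FROM` is used instead of `<>` so that a NULL on either side counts as a mismatch rather than silently dropping out of the comparison.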
These patterns have let us run hundreds of migrations on production tables with 100M+ rows without a single incident. The key principle: never do in one step what you can safely decompose into multiple safe steps.