
PostgreSQL Migration Best Practices for Zero-Downtime Deployments

Zero-downtime schema migrations are not about a single trick. They require a combination of techniques: lock management, phased rollouts, backfill strategies, and CI enforcement. This guide covers the full playbook.

1. Always Set lock_timeout

This is the single most important practice for safe migrations. Every DDL statement that takes a heavyweight table lock (most forms of ALTER TABLE take ACCESS EXCLUSIVE) should be preceded by a lock_timeout:

-- Set a 5-second lock timeout
-- If the lock cannot be acquired in 5s, the statement fails
-- This prevents the lock queue problem (see below)
SET lock_timeout = '5s';

-- On PG 11+, ADD COLUMN with a constant default does not rewrite
-- the table, but it still needs a brief ACCESS EXCLUSIVE lock
ALTER TABLE users ADD COLUMN status TEXT DEFAULT 'active';

-- Reset after DDL
RESET lock_timeout;

Without lock_timeout, a DDL statement will wait indefinitely for a lock. While it waits, every subsequent query that needs a conflicting lock queues behind it. A single long-running analytics query can trigger a cascading outage:

  1. Long query holds ACCESS SHARE on the table
  2. ALTER TABLE waits for ACCESS EXCLUSIVE lock
  3. All new queries queue behind the ALTER TABLE
  4. Connection pool exhausts within seconds
  5. Application-wide outage

With lock_timeout, the ALTER fails fast. Your deployment script can retry after a delay, and the application continues serving traffic in the meantime.

The retry wrapper pattern

-- Retry wrapper: try up to 5 times with increasing delays
DO $$
DECLARE
  retries INT := 0;
  max_retries INT := 5;
BEGIN
  LOOP
    BEGIN
      SET lock_timeout = '5s';
      ALTER TABLE users ADD COLUMN status TEXT DEFAULT 'active';
      RESET lock_timeout;
      RETURN;  -- Success
    EXCEPTION WHEN lock_not_available THEN
      retries := retries + 1;
      IF retries >= max_retries THEN
        RAISE EXCEPTION 'Failed to acquire lock after % attempts', max_retries;
      END IF;
      PERFORM pg_sleep(retries * 2);  -- Backoff: 2s, 4s, 6s, 8s
    END;
  END LOOP;
END $$;
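
The same retry loop can also live in the deployment script instead of the database. Here is a minimal sketch in Python; the run_ddl callable and the LockNotAvailable exception are placeholders for whatever your driver provides (psycopg, for example, surfaces lock timeouts as SQLSTATE 55P03):

```python
import time

class LockNotAvailable(Exception):
    """Placeholder for the driver error raised on SQLSTATE 55P03."""

def run_with_retry(run_ddl, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Run a DDL callable, retrying with linear backoff on lock timeouts."""
    for attempt in range(1, max_retries + 1):
        try:
            return run_ddl()
        except LockNotAvailable:
            if attempt == max_retries:
                raise
            # Same backoff schedule as the SQL version: 2s, 4s, 6s, 8s
            sleep(base_delay * attempt)
```

In practice run_ddl would open a session, execute SET lock_timeout = '5s' followed by the ALTER, and close the session, so each attempt starts with a fresh lock queue position.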

2. Use statement_timeout for Long-Running DDL

While lock_timeout limits how long you wait to acquire a lock, statement_timeout limits how long the statement itself can run. Set both:

-- Prevent DDL from running longer than expected
SET lock_timeout = '5s';
SET statement_timeout = '30s';

ALTER TABLE users ALTER COLUMN email SET NOT NULL;

RESET lock_timeout;
RESET statement_timeout;

3. The Expand-Contract Pattern

The expand-contract pattern (also called parallel change) is the safest way to make backwards-incompatible schema changes. It splits a dangerous change into three deployments: expand (add the new structure alongside the old), migrate (move the application over), and contract (remove the old structure).

Example: Renaming a column

Directly renaming a column breaks all existing queries and application code that references the old name. The expand-contract approach:

-- DEPLOY 1: Expand — add the new column alongside the old one
ALTER TABLE users ADD COLUMN full_name TEXT;

-- Create a trigger to keep both columns in sync
CREATE OR REPLACE FUNCTION sync_name_columns()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    -- Fill in whichever column the writer left empty
    IF NEW.full_name IS NULL THEN
      NEW.full_name := NEW.name;
    ELSIF NEW.name IS NULL THEN
      NEW.name := NEW.full_name;
    END IF;
  ELSE
    -- UPDATE: propagate whichever column actually changed, so writes
    -- through the old column are not lost once both are populated
    IF NEW.name IS DISTINCT FROM OLD.name THEN
      NEW.full_name := NEW.name;
    ELSIF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
      NEW.name := NEW.full_name;
    END IF;
  END IF;
  RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sync_names
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_name_columns();

-- Backfill existing rows (batch this on large tables; see section 4)
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- DEPLOY 2: Migrate — update application to use "full_name"
-- All reads and writes now use the new column name
-- Deploy this code change and verify in production

-- DEPLOY 3: Contract — remove the old column and trigger
DROP TRIGGER trg_sync_names ON users;
DROP FUNCTION sync_name_columns();
ALTER TABLE users DROP COLUMN name;

Example: Changing a column type

-- Changing INT to BIGINT directly acquires ACCESS EXCLUSIVE
-- and rewrites the entire table. Instead, use expand-contract:

-- DEPLOY 1: Add new column
ALTER TABLE orders ADD COLUMN amount_v2 BIGINT;

-- Sync trigger + backfill (same pattern as above)

-- DEPLOY 2: Switch reads/writes to amount_v2

-- DEPLOY 3: Drop old column
ALTER TABLE orders DROP COLUMN amount;
-- Optionally: ALTER TABLE orders RENAME COLUMN amount_v2 TO amount;

4. Batched Backfills

Never run a single UPDATE against millions of rows. A single large UPDATE:

  • Holds ROW EXCLUSIVE locks for the entire duration
  • Generates massive WAL (write-ahead log) volume
  • Creates millions of dead tuples that autovacuum must clean up
  • Can cause replication lag on replicas
  • Might fill up your disk with WAL and dead tuples

-- BAD: Single update that touches all 50M rows
UPDATE users SET status = 'active' WHERE status IS NULL;

-- GOOD: Batched update with pauses for autovacuum
DO $$
DECLARE
  rows_updated INT;
  total_updated INT := 0;
  batch_size INT := 10000;
BEGIN
  LOOP
    UPDATE users
    SET status = 'active'
    WHERE id IN (
      SELECT id FROM users
      WHERE status IS NULL
      ORDER BY id
      LIMIT batch_size
    );

    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    total_updated := total_updated + rows_updated;

    RAISE NOTICE 'Updated % rows (% total)', rows_updated, total_updated;

    EXIT WHEN rows_updated = 0;

    -- Commit each batch so row locks are released promptly
    -- (transaction control in DO blocks requires PG 11+)
    COMMIT;

    -- Give autovacuum time to clean up dead tuples
    PERFORM pg_sleep(0.5);
  END LOOP;
END $$;
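
If you prefer to drive the backfill from application code, iterating over fixed id ranges (keyset batching) avoids the ORDER BY ... LIMIT subquery entirely. A sketch of the range arithmetic in Python; executing each range against the database is left to your driver:

```python
def batch_ranges(min_id, max_id, batch_size):
    """Yield inclusive (start, end) id ranges covering [min_id, max_id].

    Each range maps to one statement such as:
      UPDATE users SET status = 'active'
      WHERE id BETWEEN %(start)s AND %(end)s AND status IS NULL;
    committed individually, with a short sleep between batches.
    """
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield (start, end)
        start = end + 1
```

Ranges with no matching rows finish almost instantly, so this approach tolerates sparse id spaces at the cost of some empty batches.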

5. Safe Constraint Addition

NOT NULL constraints

Adding NOT NULL directly acquires ACCESS EXCLUSIVE and scans the entire table. Use the CHECK constraint pattern instead:

-- BAD: Scans entire table under ACCESS EXCLUSIVE
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

-- GOOD: Two-step pattern
-- Step 1: Add CHECK constraint as NOT VALID (instant, no scan)
ALTER TABLE users ADD CONSTRAINT chk_email_not_null
  CHECK (email IS NOT NULL) NOT VALID;

-- Step 2: Validate the constraint (scans table but only holds
-- SHARE UPDATE EXCLUSIVE — does not block reads or writes)
ALTER TABLE users VALIDATE CONSTRAINT chk_email_not_null;

-- PG 12+: Once a valid CHECK (col IS NOT NULL) exists,
-- you can SET NOT NULL without a table scan:
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT chk_email_not_null;

Foreign key constraints

-- BAD: Validates all existing rows under SHARE ROW EXCLUSIVE
ALTER TABLE orders ADD CONSTRAINT fk_user
  FOREIGN KEY (user_id) REFERENCES users (id);

-- GOOD: Add NOT VALID, then validate separately
ALTER TABLE orders ADD CONSTRAINT fk_user
  FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT fk_user;

Unique constraints

-- BAD: Creates an index + validates under ACCESS EXCLUSIVE
ALTER TABLE users ADD CONSTRAINT uniq_email UNIQUE (email);

-- GOOD: Build the index concurrently, then attach it
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email ON users (email);
ALTER TABLE users ADD CONSTRAINT uniq_email UNIQUE USING INDEX idx_users_email;
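
One caveat: if the CONCURRENTLY build fails partway (for example, because duplicate emails already exist), PostgreSQL leaves behind an INVALID index that must be dropped before retrying. A quick check, assuming the index name from the example above:

```sql
-- Find invalid indexes left behind by failed CONCURRENTLY builds
SELECT indexrelid::regclass
FROM pg_index
WHERE NOT indisvalid;

-- Drop and rebuild once the underlying problem is fixed
DROP INDEX CONCURRENTLY idx_users_email;
```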

6. One DDL Per Transaction

Multiple DDL statements in a single transaction compound lock duration. Each statement acquires its own locks, and all locks are held until COMMIT:

-- BAD: Three DDL statements in one transaction
-- ACCESS EXCLUSIVE held for the entire combined duration
BEGIN;
ALTER TABLE users ADD COLUMN phone TEXT;
ALTER TABLE users ADD COLUMN address TEXT;
ALTER TABLE users ADD COLUMN city TEXT;
COMMIT;

-- GOOD: Separate transactions (or separate migration files)
ALTER TABLE users ADD COLUMN phone TEXT;
-- locks released
ALTER TABLE users ADD COLUMN address TEXT;
-- locks released
ALTER TABLE users ADD COLUMN city TEXT;
-- locks released

7. Test Migrations Against Production Data

Migrations that work on a dev database with 100 rows can cause outages on a production database with 100 million rows. The locking behavior is the same, but the duration is completely different.

Best practices for migration testing:

  • Maintain a staging database with production-sized data (or at least production-like volumes)
  • Time your migrations on staging and set alerts for migrations that take longer than expected
  • Use EXPLAIN on any data migration queries to check for sequential scans
  • Monitor pg_stat_activity and pg_locks during staging runs
  • Run static analysis on migration files in CI to catch common issues early

8. Deployment Ordering

The order of application deployment vs migration execution matters:

  • Add column: run the migration first, then deploy app code (old code simply ignores new columns)
  • Remove column: deploy app code first, then the migration (old code still references the column)
  • Rename column: expand-contract, three deploys (both old and new code need to work)
  • Add NOT NULL: deploy app code first so no NULLs are written, then add the constraint (validation fails if NULLs exist)
  • Add index: migration (with CONCURRENTLY) at any time (indexes are transparent to app code)

9. Monitor During Migrations

Always monitor these metrics during migration execution:

-- Active locks on the table being migrated
SELECT mode, granted, count(*)
FROM pg_locks
WHERE relation = 'users'::regclass
GROUP BY mode, granted;

-- Queries waiting for locks
SELECT pid, query, wait_event_type, wait_event,
       age(now(), query_start) AS duration
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;

-- Replication lag (if using replicas)
SELECT client_addr, state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes_behind,
       replay_lag
FROM pg_stat_replication;

10. Enforce Best Practices in CI

The best migration practices are worthless if they are not enforced. Add automated checks to your CI pipeline that catch common mistakes before they reach production:

  • Missing CONCURRENTLY on index operations
  • Missing lock_timeout before DDL
  • Volatile defaults that cause table rewrites
  • Direct NOT NULL additions without the CHECK pattern
  • Multiple DDL in a single transaction
  • Unbatched backfills
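
To give a flavor of what such a check looks like, here is a toy version of the first rule: flagging CREATE INDEX / DROP INDEX statements that lack CONCURRENTLY. A real tool parses SQL properly; this regex sketch is illustrative only:

```python
import re

# Matches CREATE [UNIQUE] INDEX or DROP INDEX not followed by CONCURRENTLY
INDEX_DDL = re.compile(
    r"\b(CREATE\s+(?:UNIQUE\s+)?INDEX|DROP\s+INDEX)\s+(?!\s*CONCURRENTLY\b)",
    re.IGNORECASE,
)

def check_concurrently(sql: str) -> list[str]:
    """Return one warning per index statement missing CONCURRENTLY."""
    return [
        f"missing CONCURRENTLY: {m.group(0).strip()}..."
        for m in INDEX_DDL.finditer(sql)
    ]
```

Run against a migration file's contents, an empty result means the rule passed; anything else should fail the CI job.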

MigrationPilot checks all of these patterns (and 80 more) as a CLI, GitHub Action, or Node.js library. It runs in under a second, requires no database connection for static analysis, and posts results as PR comments so your team catches issues during code review.

# Add to .github/workflows/migration-check.yml
- uses: mickelsamuel/migrationpilot@v1
  with:
    migration-path: "migrations/*.sql"
    fail-on: critical