Your code can roll back. Your schema migration cannot. Stop shipping them together.

 


The database migration is the one step in your deploy that you cannot simply roll back. A bad release of application code is reversible, because you redeploy the previous build. A schema change mutates shared state, and once it has rewritten a table or dropped a column, there is no previous build to fall back to.

The numbers say teams underestimate this. Gartner reports that 83 per cent of data migration projects either fail outright or overrun their budgets and timelines (BrightCoding, 2025). In February 2024, the email provider Resend ran a migration command that pointed at production by mistake, dropped every table, and caused a roughly twelve-hour outage with five minutes of data lost for good (BrightCoding, 2025).

Most migration outages are quieter than that. The common one is a single ALTER TABLE that takes a long lock on a table that has quietly grown to millions of rows, and freezes the whole application. This post is about running migrations inside a delivery pipeline so neither failure mode can happen. Decouple the schema change from the code deploy, make every change in small backward-compatible steps, and let the pipeline refuse anything dangerous. The pattern has a name. It is expand and contract, sometimes called Parallel Change.

Why a migration is not a deploy

Treating a migration like another deploy step is the root mistake. PostgreSQL takes an ACCESS EXCLUSIVE lock for many ALTER TABLE operations, and that lock blocks every read and write until the operation finishes (Xata, 2025). A column type change rewrites the entire table, and on a ten-gigabyte table that rewrite can run for ten to thirty minutes with the lock held the whole time (DEV, 2026). While that lock is held, incoming requests queue behind it, the connection pool fills, and one DDL statement becomes a service-wide outage.

-- Dangerous: a type change rewrites the whole table under ACCESS EXCLUSIVE
ALTER TABLE orders ALTER COLUMN amount TYPE numeric(12,2);

-- Guard every migration session so it fails fast instead of freezing the app
SET lock_timeout = '2s';

The type change is the trap. The lock_timeout is the seatbelt. Setting it to a low value, under two seconds is common, means a blocked DDL statement gives up rather than holding the queue of reads and writes behind it (Xata, 2025). Some operations have become cheaper over time. Since PostgreSQL 11, adding a column with a default is instant and stored in the catalogue rather than rewriting the table (Di Pinto, 2026), and PostgreSQL 18, released in September 2025, lets you add a NOT NULL column with NOT VALID directly (DEV, 2026).

The trade-off of a strict lock_timeout is that the migration may fail and need a retry, which is the worst case teams report after adopting this discipline. A migration that fails fast and reruns is a far better worst case than a production database that is frozen for twenty minutes. On a tiny table or in a genuine quiet window, a direct alter is fine. The rule is to assume scale, because the table you treat as small is the one that grew.

A code deploy is reversible, because you redeploy the previous image. A schema migration mutates shared state and often cannot be cleanly undone. Treat it as a separate, gated step, never as a side effect of the deploy (iamraghuveer, 2026).

Expand: add the new shape without breaking the old

Expand is the first step. You add the new column, table, or index alongside the existing one, in a way that is purely additive, so the running application carries on unaffected. Nothing is renamed and nothing is removed. This matters because during a rolling deploy the old version of your application is still serving traffic, and the migration has to complete while that old code runs (DeployHQ, 2026). If you drop or rename a column in one step, the old version starts failing at query time the moment the migration lands (DEV, 2025).

-- Step 1, expand: additive only, safe while the old code is still running
ALTER TABLE users ADD COLUMN email_normalised text;   -- nullable, no default, instant on PG 11+

-- Build the index without locking writes (this cannot run inside a transaction)
CREATE INDEX CONCURRENTLY idx_users_email_normalised
    ON users (email_normalised);

Adding a nullable column with no default is a catalogue change on PostgreSQL 11 and later, so it is effectively free. Building the index with CREATE INDEX CONCURRENTLY avoids the exclusive lock a normal index build would take, at the cost that it cannot run inside a transaction and takes an unpredictable amount of time (Defacto, 2025). The old code does not know the new column exists, so it keeps working untouched.

The trade-off is planning. A change that used to be one file is now several, and the concurrent index has to live in its own step outside a transaction, which feels untidy in most migration tools. When the change is genuinely additive and trivial, expand may be the entire migration and you are done.

The rule that makes zero downtime possible: every migration must complete successfully while the old version of the application is still running, and only then do you deploy the new code (DeployHQ, 2026). Old and new code must both work against the schema at the same time.

Migrate: backfill in batches, and gate it in the pipeline

The migrate step fills the new structure from the old one. The danger here is a single UPDATE across millions of rows, which holds a long lock and bloats one enormous transaction. You avoid it by backfilling in bounded batches, each one a short transaction, run repeatedly until there is nothing left to update.

-- Step 2, migrate: backfill in bounded batches, each holding only a short lock.
-- Run this repeatedly (from a job or the app) until it updates zero rows.
UPDATE users
   SET email_normalised = lower(email)
 WHERE id IN (
     SELECT id FROM users
      WHERE email_normalised IS NULL
      ORDER BY id
      LIMIT 5000
 );

The other half of this step is the pipeline. Migrations run as their own job, and the deploy job depends on it, so the new application code never ships against a schema that is not ready (OneUptime, 2026).

# .github/workflows/deploy.yml
jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run schema migrations
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
        run: flyway -url="$DATABASE_URL" -locations=filesystem:./db/migrations migrate

  deploy:
    needs: migrate          # the app ships only after migrations succeed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy application
        run: ./deploy.sh

The needs: migrate line is the whole point. Migrations finish first, the deploy waits for them, and the order can never invert. A tool like Flyway also takes a lock on its own metadata table so two pipeline runs cannot apply migrations at the same time (OneUptime, 2026). During this window the new code dual-writes to both the old and the new column, so the data stays consistent whichever version handles a request.

The trade-off is real complexity for the duration of the transition. The application has to write both columns, and a backfill over a very large table can run for a long time. On a small table you can backfill in a single statement and skip the batching.

Contract: remove the old shape, only once nothing reads it

Contract is the final step, and it belongs in a later deploy, not the same one as expand. Once the new code is fully rolled out and the new column is the source of truth, you remove the old structure. Doing this too early breaks the version still running, and doing it in the same release as expand puts back the coupling you worked to remove.

-- Step 3, contract: a separate, later deploy, once no running version reads the old column.

-- Enforce NOT NULL without a full-table exclusive lock
ALTER TABLE users
  ADD CONSTRAINT users_email_normalised_not_null
  CHECK (email_normalised IS NOT NULL) NOT VALID;   -- instant, no blocking scan

ALTER TABLE users VALIDATE CONSTRAINT users_email_normalised_not_null;  -- scans without blocking writes

-- Finally drop the old column, now that nothing references it
ALTER TABLE users DROP COLUMN email;

Adding the constraint as NOT VALID and then running VALIDATE separates the cheap catalogue change from the table scan, and the validation step takes a weaker lock that does not block writes (DEV, 2026). Dropping the column is a metadata operation and is fast, but it is only safe once you are certain no deployed version still queries it.

The trade-off is that you carry the old column and dual-write for a release or two, which is untidy. The payoff is that every step in the sequence is independently reversible until the very last one. If the old column was never read by the old code, you can contract sooner.

Gate it in CI so the dangerous migration never merges

The pattern only holds if the pipeline enforces it, because under deadline pressure people forget the rules. The fix is a linter that reads migration files and fails the build on dangerous operations. Defacto added the Squawk linter as a pre-commit hook and a migration checklist to their pull request template, and reports that it completely eliminated outages caused by schema migrations (Defacto, 2025).

# Add this job to .github/workflows/deploy.yml
  lint-migrations:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint migrations for unsafe operations
        run: |
          pip install squawk-cli
          squawk db/migrations/*.sql

Squawk flags the exact operations that cause outages: adding a column with a default, adding a non-nullable column from the start, and creating an index without CONCURRENTLY (Defacto, 2025). With the linter failing the build, the gating job ordering the deploy, and a lock table preventing concurrent runs, the safe path becomes the only path. For teams who want this taken further, declarative tools like Atlas treat the schema as code, plan the migration for you, and run the safety checks inside CI (Atlas, 2026).

The trade-off is noise. A linter raises warnings that can be hard to read for engineers who do not know the database internals behind them. That cost is worth paying, because the alternative is learning the same lessons during an incident. It is also worth testing each migration against a copy of production data before it runs, since staging at a fraction of the size will not reproduce a locking problem (OneUptime, 2026).

The part worth sitting with

So the next time you wrap a schema change and a code change into the same deploy, remember which half you can take back. The code you can redeploy in seconds. The migration you cannot, and if it locks a busy table or drops a column the old version still reads, no rollback button will help you. Resend lost five minutes of data and twelve hours of uptime to one migration that ran in the wrong place. The teams that never have that day are not braver or luckier. They stopped treating the migration as part of the deploy, broke every change into expand, migrate, and contract, and let the pipeline refuse anything that locks. The migration is the most dangerous line in your pipeline. Stop running it like the least.

Author note

I am Mohan Gopi, an Associate DevOps Engineer at Frigga Cloud Labs, working across AWS, GCP, and Azure with GitHub Actions as my deployment backbone. I wrote this because schema migrations are the failure I see teams underestimate most. They put real care into rolling back code and almost none into the one step a rollback cannot save. The pattern I keep seeing is a single coupled deploy, a locking alter on a table that has quietly grown to millions of rows, and an outage nobody rehearsed. Expand, migrate, contract is not academic. It is the difference between a migration being a non-event and being your worst on-call night. Let us connect on LinkedIn → Mohan Gopi.

Post a Comment

Previous Post Next Post