GitLab had five different backups. The day they deleted production, not one of them worked.



There is one deploy your team cannot simply roll back, and it is the one that touches your database. Every other change has an undo: revert the code, redeploy the old version, and you are back where you started. A migration does not work like that.

GitLab found this out in public. On 31 January 2017, an engineer ran a cleanup command against what they believed was a replica. It was the production database, and live data began disappearing within seconds. The data they were never able to recover came to roughly 5,000 projects, 5,000 comments, and 700 new accounts. When they reached for their backups, they found that of five separate backup mechanisms, not one worked when it mattered (GitLab, 2017; Bytesized Design, 2025). Recovery took around 18 hours, from a copy that was already six hours old.

The reason migrations are uniquely dangerous is simple: the database is the one part of your system you cannot rebuild from scratch, because it holds the state everything else depends on. This post is about why that makes migrations different from every other deploy, what tends to go wrong, and the handful of principles that keep them from becoming the worst day of your quarter. The focus is the stakes and the discipline, not any particular tool.

Code rolls back. The database does not.

Rolling back a database change is fundamentally different from rolling back code. A stateless application can swap a new binary for the old one and be done. A database carries persistent state that has been changing the entire time, so rolling it back usually means restoring from a backup taken before the deploy, which loses every change made since and brings deleted data back to life (Fluri, 2025).

And many migrations cannot be reversed at all. Dropping a column destroys the data in it, and changing a data type can lose precision permanently, so the undo simply does not exist (Toolshelf, 2025). These are not exotic operations. They are the ordinary contents of a normal migration.

The nuance is that some migrations are perfectly reversible, like adding a column or a table. The danger is that the irreversible ones look identical to the reversible ones in a pull request, so they receive the same quick approval and the same casual treatment, right up until the moment you discover you cannot take one back.
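
One way to stop irreversible migrations from blending in is to have a script flag them before a human reviews the pull request. What follows is a minimal sketch, not a real linter: the pattern list and the function name `flag_irreversible` are assumptions made for illustration, and simple regular expressions will miss plenty of destructive statements.

```python
import re
from typing import List

# Hypothetical, non-exhaustive patterns for statements that usually cannot be
# undone without restoring from a backup.
DESTRUCTIVE_PATTERNS = {
    "drop column": re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    "drop table": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    "truncate": re.compile(r"\bTRUNCATE\b", re.IGNORECASE),
    # Matches only when ALTER COLUMN and TYPE appear on the same line.
    "type change": re.compile(r"\bALTER\s+COLUMN\b.*\bTYPE\b", re.IGNORECASE),
}


def flag_irreversible(migration_sql: str) -> List[str]:
    """Return the labels of destructive patterns found in a migration script."""
    return [
        label
        for label, pattern in DESTRUCTIVE_PATTERNS.items()
        if pattern.search(migration_sql)
    ]
```

A non-empty result would not block the change by itself. It would only mark the pull request as one that deserves a slower review than the code change sitting next to it.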

The damage is often silent, and you find it weeks later

A code bug usually announces itself with an error or a broken page. A bad migration can corrupt data quietly while the application keeps serving traffic, so nobody notices until the bad data has spread. The corruption can go unnoticed for weeks, and by the time it is spotted, a simple restore is impossible because too much has happened since (Modernization Intel, 2026).

This is more common than teams expect. Around 23 per cent of organisations report some data loss during a migration (Cloudficient, 2025), and a test migration that ran flawlessly can still corrupt a slice of production data once it meets the full scale of real records (Monte Carlo, 2025).

The nuance is that silent does not mean rare. It means you cannot rely on alarms to catch it. The safeguard has to be verification built into the migration itself: row counts, checksums, and reconciliation that confirm the data is intact, rather than a hope that something will fail loudly if it goes wrong.
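
As a rough illustration of what built-in verification can look like, the sketch below compares a row count and a checksum between a source table and its migrated copy. It uses Python's standard `sqlite3` module purely so it runs anywhere. The function names and the choice of SHA-256 over each row's `repr` are assumptions for the example, and a large production table would need a chunked or database-side checksum instead.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table, key_column):
    """Return (row_count, sha256 hex digest) over all rows in key order."""
    # Table and column names are interpolated directly, so they must come
    # from trusted migration code, never from user input.
    digest = hashlib.sha256()
    count = 0
    query = f'SELECT * FROM "{table}" ORDER BY "{key_column}"'
    for row in conn.execute(query):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def verify_copy(conn, source, target, key_column="id"):
    """Raise if the target table differs from the source; else return the row count."""
    src = table_fingerprint(conn, source, key_column)
    dst = table_fingerprint(conn, target, key_column)
    if src != dst:
        raise RuntimeError(
            f"{source} has {src[0]} rows, {target} has {dst[0]} rows, "
            "or the checksums differ"
        )
    return src[0]
```

The point of raising rather than logging is that the migration stops at the first mismatch, instead of carrying on and leaving the discrepancy to be found weeks later.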

GitLab's backups had been failing silently for weeks, and no single engineer was responsible for confirming they could be restored, so no one did (Bytesized Design, 2025). A backup you have never restored is not a backup. It is a guess.

Never drop anything in the same breath as the change

The single most useful principle is to make migrations additive and backward-compatible, so old and new code both work correctly at every step. The workhorse pattern for any non-trivial schema change has three phases: add the new structure without touching the old, backfill it and switch the code to it, and only then drop the old column or table, in a separate deploy that can land days or weeks later (IGC).

The order is what protects you. Adding something new before the code uses it is safe; removing something before the code has stopped using it is one of the most common causes of deploy-time incidents (IGC). The mistake is almost always a drop that happened too early.

The nuance is that this is slower and feels like bureaucracy: two or three deploys where one would seem to do. That slowness is exactly the cost of reversibility. At every intermediate step you can stop and go back, which is the one thing you cannot do if you drop the column on day one.
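
The three phases can be sketched as three separate functions, one per deploy. This is an illustrative example on SQLite, not a recipe for any particular database: the `users` and `legacy_emails` tables, the column names, and the function names are all hypothetical. The final step refuses to run until the backfill is provably complete, because it is the only step with no undo.

```python
import sqlite3


def expand(conn):
    """Deploy 1: add the new structure. Old code keeps working untouched."""
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
    conn.commit()


def backfill(conn):
    """Deploy 2: copy data into the new column, then switch the code to it."""
    conn.execute(
        "UPDATE users SET email = ("
        "SELECT address FROM legacy_emails "
        "WHERE legacy_emails.user_id = users.id) "
        "WHERE email IS NULL"
    )
    conn.commit()


def contract(conn):
    """Deploy 3, days or weeks later: the only irreversible step."""
    # IS NOT is SQLite's NULL-safe inequality, so unfilled rows count too.
    missing = conn.execute(
        "SELECT COUNT(*) FROM legacy_emails l "
        "JOIN users u ON u.id = l.user_id "
        "WHERE u.email IS NOT l.address"
    ).fetchone()[0]
    if missing:
        raise RuntimeError(f"{missing} rows not backfilled; refusing to drop")
    conn.execute("DROP TABLE legacy_emails")
    conn.commit()
```

Until `contract` runs, abandoning the change costs nothing more than an unused column. After it runs, the only way back is a restore.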

Add the new column, move the code to it, and drop the old one weeks later as a separate deploy (IGC). The slowness is the point: at every step you can still go back. A migration done in one irreversible step trades that safety for a few saved hours.

Test on a copy of production, not on a hope

A migration that works on a tiny development database tells you almost nothing about how it behaves on millions of real rows. The only honest test is a clone of production. Before a schema change reaches production it must be validated on a pre-production environment that is a true copy of it, or you have no real confidence it will go smoothly (Fluri, 2025).

The pattern in failed migrations is consistent: skipping staging validation shows up in the large majority of catastrophic failures, and teams that test on production-like data eliminate the overwhelming majority of surprises (AriaShaw, 2025). The migration becomes the execution of a proven playbook rather than a live experiment.

The nuance is that a clone is not free, and for a large database it costs time and storage to maintain. But the alternative is testing in production, on live customer data, which is the most expensive test environment there is, billed in downtime and lost records. Downtime alone runs into thousands of dollars a minute for many businesses (Cloudficient, 2025).
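
In miniature, a rehearsal on a clone looks something like the sketch below. It uses the `backup` method of Python's standard `sqlite3` connections to copy a database, runs the migration against the copy only, times it, and runs a check query. The function name `rehearse_on_clone` and its parameters are assumptions for illustration. A real production database would be cloned with that system's own snapshot or dump tooling.

```python
import sqlite3
import time


def rehearse_on_clone(prod_conn, migration_sql, check_sql):
    """Run a migration against a copy of the database and time it.

    Returns (seconds_elapsed, first column of the first row of check_sql).
    """
    clone = sqlite3.connect(":memory:")
    prod_conn.backup(clone)  # full copy; the original is only read
    started = time.perf_counter()
    clone.executescript(migration_sql)
    elapsed = time.perf_counter() - started
    result = clone.execute(check_sql).fetchone()[0]
    clone.close()
    return elapsed, result
```

The elapsed time matters as much as the check result. A migration that takes a second on a development database and forty minutes on the clone is a finding you want before the production run, not during it.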

Have a backup you have actually restored, and a plan for 3am

Because some migrations are irreversible, your real safety net is recovery: a recent backup you have proven you can restore, and a rehearsed runbook. The GitLab lesson is that having backups and being able to restore them are completely different things. Untested backups are the common thread in the worst outcomes, present in the overwhelming majority of total-data-loss scenarios (AriaShaw, 2025).

A pre-rehearsed recovery plan turns hours of panic into minutes of procedure (Modernization Intel, 2026). The fix GitLab eventually landed on was not new technology. It was routine: restore from backups regularly, and treat each successful restore as a measure of reliability in its own right (Eunice, 2025).

The nuance is that this is the least glamorous work in engineering and the easiest to defer, because the payoff only appears on the day something goes wrong, and that is precisely the day you cannot improvise it. A backup is only as good as your last successful restore of it, and if you cannot remember the last one, you are running on faith.
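
A scheduled restore drill can be very small. The sketch below takes a backup to a file, restores it into a fresh database, checks row counts against expected values, and reports how long the restore took. It again uses SQLite's backup API from the Python standard library as a stand-in. The function names and the `expected_counts` parameter are hypothetical, and a real drill would restore into a clean environment with your database's own backup tooling.

```python
import os
import sqlite3
import tempfile
import time


def take_backup(live_conn, backup_path):
    """Write a full copy of the live database to backup_path."""
    target = sqlite3.connect(backup_path)
    live_conn.backup(target)
    target.close()


def restore_drill(backup_path, expected_counts):
    """Restore into a clean database, verify row counts, return seconds taken."""
    started = time.perf_counter()
    source = sqlite3.connect(backup_path)
    restored = sqlite3.connect(":memory:")  # stand-in for a clean environment
    source.backup(restored)
    source.close()
    # Table names come from trusted drill configuration, not user input.
    for table, expected in expected_counts.items():
        actual = restored.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if actual != expected:
            raise RuntimeError(f"{table}: expected {expected} rows, restored {actual}")
    restored.close()
    return time.perf_counter() - started
```

Recording the returned duration on every run gives you two things at once: proof that the backup restores, and a realistic number for how long recovery will take at 3am.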

The part worth sitting with

So the next time a migration is sitting in a pull request next to a routine code change, getting the same quick approval, stop and treat it as what it is: the one deploy with no undo button. The database is the only part of your system you cannot rebuild from a fresh checkout, because it is the part that remembers. That is what makes a careless migration so much more expensive than a careless code change, and it is why the boring habits matter so much here: additive changes you can reverse step by step, a clone of production to test on, and a backup you have personally watched come back to life. GitLab had redundancy on paper and still lost data they could never recover, not because they were careless, but because nobody had checked that the safety net worked until they were already falling. Check yours before you need it. With migrations, there is rarely a second chance to get it right.

Author note

I am Manjunaathaa, an Associate DevOps Engineer at Frigga Cloud Labs. I work across AWS, GCP, and Azure daily, with GitHub Actions as my deployment backbone. My focus is Proactive Resilience, and the database is where that focus earns its keep, because it is the one system where saying "we will just roll back" can end a company. Every practice in this post is something I actually run in production, not something I read about. I wrote this because migrations get treated like ordinary deploys right up until the one that is not. The thing I keep coming back to is that I do not trust a backup I have not restored, the same way I do not trust a smoke alarm I have never tested, so I restore into a clean environment on a schedule and time how long it takes. Additive changes, a production-like clone, and a proven restore are not caution for its own sake. They are the difference between a bad afternoon and a permanent loss. Let's connect on LinkedIn → Manjunaathaa.

DevOps, Infrastructure, Database Migrations, Schema Changes, Data Loss, Backups, Disaster Recovery, Reliability, Rollback, Downtime
