The worst infrastructure outages start the same way: someone ran terraform apply without reading the plan.



On a lot of teams, infrastructure changes still happen the way they did five years ago. An engineer runs terraform apply from their laptop, watches the output scroll, and hopes.

The cost of that shows up as drift, when the real infrastructure quietly diverges from the code that is meant to describe it. The DORA research found that teams dealing with frequent configuration drift had 2.3 times higher change failure rates than teams keeping their infrastructure as code consistent (InfoWorld, 2026). And the pattern behind the worst incidents is consistent: someone applied without reading the plan (Clanker Cloud, 2026).

The fix is to take Terraform off the laptop and put it in the pipeline, with a strict flow: a reviewed plan on every pull request, locked remote state, a gate before apply, an apply of exactly the plan that was reviewed, and a scheduled check that tells you when reality has drifted. This post shows that pipeline, with the Terraform and GitHub Actions configuration to build it. One rule sits underneath all of it. You never apply a plan you have not read.

Lock the state, or two engineers will corrupt it

Terraform keeps a state file that maps your code to real resources. If that file lives on a laptop, or in a bucket with no locking, two applies running at once can overwrite each other and corrupt it, which is one of the worst situations to recover from. The first move is a remote backend with versioning and locking, so concurrent runs are serialised and you can roll the state back if something goes wrong.

# backend.tf
terraform {
  backend "s3" {
    bucket       = "acme-tfstate"
    key          = "prod/network/terraform.tfstate"
    region       = "eu-west-1"
    encrypt      = true
    use_lockfile = true   # S3-native state locking, Terraform 1.10+
  }
}

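The state lives in an encrypted S3 bucket, and use_lockfile turns on locking using the bucket itself: Terraform writes a small lock object next to the state file and removes it when the run ends. Versioning, which is what lets you roll the state back, is a property of the bucket rather than of the backend block, so it has to be enabled on the bucket separately. This is the modern approach: as of Terraform 1.10, S3 can lock the state directly, and the older DynamoDB table is now deprecated and slated for removal in a future version (HashiCorp, 2025).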

The trade-off is a little operational friction. A lock can occasionally get stuck if a run is killed mid-apply, and you have to clear it deliberately with terraform force-unlock and the lock ID from the error message, after checking that no other run is still in progress. That is a rare, well-understood event, and it is a far smaller problem than two applies racing each other into a corrupt state.

Plan on every pull request, and post it for a human to read

A Terraform plan is the real review artifact. The code diff tells you what changed in the configuration. The plan tells you what will actually happen to your infrastructure, including the resources that will be destroyed, which a code diff can hide. So the pipeline runs the plan on every pull request, saves it, and posts it where a reviewer can read it. Peer review on Terraform code is associated with a 30 per cent improvement in quality, but only if someone actually reads the change (InfoWorld, 2026).

# .github/workflows/terraform.yml
on:
  pull_request:
    paths: ["infra/**"]
  push:
    branches: [main]
    paths: ["infra/**"]

permissions:
  contents: read
  id-token: write        # OIDC, so no long-lived cloud keys
  pull-requests: write   # to post the plan on the PR

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run: { working-directory: infra }
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform validate
      - run: terraform plan -out=tfplan -input=false
      - uses: actions/upload-artifact@v4
        with: { name: tfplan, path: infra/tfplan }

The permissions block grants id-token: write so the job can authenticate to the cloud with OIDC rather than static keys; the provider-specific step that exchanges that token for credentials is left out here for brevity. The job then runs terraform plan -out=tfplan to save the exact plan to a file, and uploads that file as an artifact. A small extra step, also left out, posts the readable output of terraform show tfplan as a pull request comment, so the reviewer approves the actual change, not just the code.

The trade-off is plan noise. A large change can produce a long plan, and it is tempting to summarise it down to nothing. Summarise the additions and updates if you like, but never hide the destroys. A deleted database is the one line in the plan that most needs a second pair of eyes.
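One way to summarise without hiding anything dangerous is to work from the machine-readable plan. The sketch below assumes the JSON produced by terraform show -json tfplan, where each entry under resource_changes carries an address and a change.actions list. The function name summarise_plan and the output wording are illustrative, not part of the pipeline above.

```python
from collections import Counter


def summarise_plan(plan: dict) -> str:
    """Collapse creates and updates to counts, but list every destroy by address.

    `plan` is the parsed output of `terraform show -json tfplan`.
    """
    counts = Counter()
    destroys = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "delete" in actions:
            # Covers plain deletes and replacements (delete plus create).
            destroys.append(rc["address"])
        if "create" in actions:
            counts["create"] += 1
        if "update" in actions:
            counts["update"] += 1

    lines = [
        f"{counts['create']} to create, {counts['update']} to update, "
        f"{len(destroys)} to destroy"
    ]
    # Destroys are never folded into the count alone: each gets its own line.
    lines += [f"DESTROY {address}" for address in destroys]
    return "\n".join(lines)
```

In a workflow step you would parse the JSON from terraform show -json tfplan with json.load and put the returned text at the top of the pull request comment, above the full plan output. A replaced resource counts as both a create and a destroy, which matches how Terraform itself reports it.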

The most expensive infrastructure incidents share one pattern: someone applied a plan they never read (Clanker Cloud, 2026). A pipeline cannot force anyone to read, but it can make the plan impossible to skip and the apply impossible to run early.

Apply the saved plan, behind a gate, never a fresh one

When the change merges, the apply runs, and two things have to be true. It must wait for a human to approve it, and it must apply the exact plan that was reviewed, not a freshly generated one. Generating a new plan at apply time means applying something nobody looked at, because the world may have changed since. So the apply job is gated behind a protected environment, and it consumes the saved plan file.

  apply:
    needs: plan
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: production   # a reviewer must approve before this runs
    defaults:
      run: { working-directory: infra }
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - uses: actions/download-artifact@v4
        with: { name: tfplan, path: infra }
      - run: terraform apply -input=false tfplan   # exactly the reviewed plan

On a pull request, only the plan job runs, so reviewers see the plan before anything merges. After the merge to main, the plan runs again against the merged code, and the apply job waits. The environment: production setting maps to a protection rule that pauses the job until a named reviewer approves it. Because artifacts belong to a single workflow run, the file the apply job downloads is the plan from this post-merge run, not the one from the pull request, so the approver should read this run's plan before approving. Then terraform apply tfplan runs that saved plan and nothing else. This is the swap that matters: risky manual steps become a single, reviewable, auditable path through Git (Optimum Partners, 2026).

The trade-off is staleness and latency. If the state changed between plan and apply, for example because another apply landed first, Terraform rejects the saved plan as stale and you re-run the plan, which is the safe behaviour, not a bug. And the approval step adds delay. For production infrastructure, that delay is the feature.
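If you want "exactly the reviewed plan" to be checkable rather than assumed, one optional extra is to record a digest of the plan file when it is produced and compare it just before apply. This is a sketch of that idea, not something the workflow above already does; plan_digest and verify_plan are hypothetical helper names, and where you publish the digest (a job summary, the plan comment) is up to you.

```python
import hashlib
from pathlib import Path


def plan_digest(path) -> str:
    """Return the SHA-256 hex digest of a saved plan file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_plan(path, expected_digest: str) -> None:
    """Stop the run if the plan file is not the one that was reviewed."""
    actual = plan_digest(path)
    if actual != expected_digest:
        raise SystemExit(
            f"plan digest mismatch: expected {expected_digest}, got {actual}"
        )
```

The plan job would print plan_digest("tfplan") next to the readable plan, and the apply job would call verify_plan("tfplan", digest) after downloading the artifact and before running terraform apply.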

Block the insecure change before it merges

Most cloud breaches are not clever. They are misconfigurations: a public bucket, an open security group, an unencrypted volume. Gartner has projected that through 2027, 99 per cent of cloud security failures will be the customer's fault, primarily through misconfiguration, and the 2024 Verizon Data Breach Investigations Report found misconfiguration a growing share of cloud breaches (DEV, 2026). When your infrastructure is code, those mistakes are catchable in the pull request, so you add a scanner that fails the build on them.

  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Scan Terraform for misconfigurations
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: infra
          soft_fail: false   # fail the build when any check fails

Checkov reads your Terraform and checks it against hundreds of policies for insecure settings, failing the pull request before the change can merge. This is the shift-left principle applied to infrastructure: find the problem in a pull request, not in a production incident (DEV, 2026).

The trade-off is false positives. A scanner will flag things that are fine in your context, and the wrong response is to switch it off. The right one is to suppress individual findings with a written justification, so every exception is a recorded decision rather than a silent gap.
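Checkov suppressions are inline comments of the form checkov:skip=CHECK_ID:reason, placed inside the resource block they apply to. Nothing stops someone leaving the reason off, so a small lint can enforce the "written justification" part. The sketch below is an assumption about how you might do that: unjustified_skips and the three-word minimum are illustrative choices, not Checkov features.

```python
import re

# Matches "checkov:skip=<CHECK_ID>" with an optional ":<reason>" up to end of line.
SKIP_RE = re.compile(r"checkov:skip=([A-Za-z0-9_]+)(?::(.*))?")


def unjustified_skips(tf_source: str, min_words: int = 3) -> list[str]:
    """Return the check IDs whose suppression has no meaningful reason attached."""
    missing = []
    for match in SKIP_RE.finditer(tf_source):
        check_id = match.group(1)
        reason = (match.group(2) or "").strip()
        if len(reason.split()) < min_words:
            missing.append(check_id)
    return missing
```

Run over every .tf file in the pull request, a non-empty result fails the build, which keeps each exception a recorded decision rather than a bare skip.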

Detect drift on a schedule, before it detects you

Even with every change going through the pipeline, drift creeps in: someone makes an emergency change in the console, or a cloud service updates a resource on its own. The real infrastructure stops matching the code, and your next apply does something surprising. The fix is to run the plan on a schedule and let Terraform's exit code tell you when reality has moved.

# .github/workflows/drift.yml
on:
  schedule:
    - cron: "0 7 * * *"   # every morning at 07:00 UTC

jobs:
  drift:
    runs-on: ubuntu-latest
    defaults:
      run: { working-directory: infra }
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
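      - name: Detect drift
        id: plan
        # the exitcode output comes from the setup-terraform wrapper (on by default)
        run: terraform plan -detailed-exitcode -input=false
        continue-on-error: true
      - name: Alert if infrastructure has drifted
        if: steps.plan.outputs.exitcode == '2'
        run: echo "Drift detected. Real infrastructure no longer matches code."
      - name: Fail the run if the plan itself errored
        if: steps.plan.outputs.exitcode == '1'
        run: exit 1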

The flag -detailed-exitcode makes terraform plan return 0 when there are no changes, 1 on error, and 2 when there are changes, which on a scheduled run means drift (DEV, 2026). The alert step fires only on a 2, turning drift into a morning notification instead of a nasty surprise during your next deploy. Here the alert is only an echo into the job log, so swap it for whatever actually reaches your team. Pairing this with read-only console permissions for most engineers keeps drift rare to begin with (Optimum Partners, 2026).
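The three exit codes are easy to conflate, and treating an error as "no drift" is the failure mode to avoid. A minimal sketch of the mapping, assuming you wrap the plan in a small script of your own; classify_plan_exit and check_drift are hypothetical names, and the infra directory matches the workflows in this post.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a status."""
    if code == 0:
        return "in-sync"   # plan succeeded, no changes
    if code == 2:
        return "drift"     # plan succeeded, changes present
    return "error"         # 1, or anything unexpected: never report as clean


def check_drift(workdir: str = "infra") -> str:
    """Run the plan in `workdir` and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
    )
    return classify_plan_exit(result.returncode)
```

The design choice worth copying is the default branch: anything that is not a clean 0 or a clean 2 counts as an error, so a broken credential or a failed init shows up as a failed check instead of a quiet morning.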

The trade-off is what you do about it. It is tempting to auto-apply and revert any drift, but in production that can undo a legitimate emergency change someone made for a reason, so automatic remediation should be used cautiously there (DEV, 2026). A sensible default is to alert and review in production, and to auto-correct only in development and test.

Teams with frequent configuration drift had 2.3 times the change failure rate of teams that kept code and reality in sync (InfoWorld, 2026). A scheduled plan turns drift from a surprise during your next apply into an alert you get on a quiet morning.

The part worth sitting with

So go back to the laptop. Right now, on most teams, the thing that decides whether production infrastructure changes safely is one engineer, one terminal, and whether they happened to read the plan before they typed yes. That is not a process, it is a habit, and habits fail on the busy afternoons when it matters most. Teams living with frequent drift were already seeing more than twice the change failure rate. The pipeline does not ask anyone to be more careful. It makes the plan impossible to skip, the apply impossible to run early, the state impossible to corrupt, and the drift impossible to miss for long. Terraform gave you infrastructure as code. Running it on a laptop throws away the half that makes it safe. Put it in the pipeline, and the worst afternoon becomes an ordinary one.

Author note

I am Mohan Gopi, an Associate DevOps Engineer at Frigga Cloud Labs, working across AWS, GCP, and Azure with GitHub Actions as my deployment backbone. I wrote this because infrastructure is the last place many teams still deploy by hand, long after they automated their application releases. The pattern I keep seeing is a shared Terraform state, an apply run from someone's laptop, and a slow build-up of drift that nobody notices until an apply does something nobody expected. A plan you have read and a state you cannot corrupt are not advanced practices. They are the minimum for touching production infrastructure. Put Terraform in the pipeline and the scary part of infrastructure work becomes boring, which is exactly what you want it to be. Let us connect on LinkedIn → Mohan Gopi.
