Every cloud setup must have these 8 things. Most are missing at least three.

Working across AWS, GCP, and Azure means encountering cloud setups in various states of maturity. The ones that cause the most pain during incidents, audits, or team transitions are almost always missing the same things. Not complex things. Foundational things that get skipped in the early months when the team is focused on shipping product and deferred indefinitely because they never feel urgent until they are.

These eight are non-negotiable. They are not nice-to-haves. Every cloud environment running production workloads should have all of them. Some take an afternoon. None require a platform team.


1. Tagged resources

Every resource in every environment needs a consistent set of tags from the moment it is created. The minimum viable tag schema is four keys: environment, team, project, and owner. Without these, cost attribution is impossible, automated governance has nothing to act on, and the answer to "who owns this?" during an incident is always "unclear."

The problem in multi-cloud setups is not the tags themselves but the consistency. Seventeen variations of env, Env, environment, and Environment across AWS, GCP, and Azure make cost reports unreadable. The fix is enforcement at the provider level. On AWS, a Service Control Policy blocks resource creation without required tags. On GCP, Organisation Policies do the same. On Azure, Azure Policy assignments enforce tagging at the subscription level.
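
As a concrete sketch of the AWS side, an SCP that denies EC2 instance creation when the environment tag is absent. The policy name is illustrative, and a real policy would cover more services and all four keys:

resource "aws_organizations_policy" "require_environment_tag" {
  name = "require-environment-tag"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyRunInstancesWithoutEnvironmentTag"
      Effect   = "Deny"
      Action   = ["ec2:RunInstances"]
      Resource = ["arn:aws:ec2:*:*:instance/*"]
      Condition = {
        "Null" = { "aws:RequestTag/environment" = "true" }
      }
    }]
  })
}

# Attach with aws_organizations_policy_attachment to the relevant OU or account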

In Terraform, a shared tagging module ensures every resource gets the required tags without relying on engineers to remember:

locals {
  required_tags = {
    environment = var.environment   # dev / staging / production
    team        = var.team
    project     = var.project
    owner       = var.owner
    managed_by  = "terraform"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = var.instance_type
  tags          = merge(local.required_tags, var.additional_tags)
}
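
A complementary option on AWS is the provider-level default_tags block, which applies the same map to every taggable resource the provider creates even when an individual resource forgets the merge. A sketch using the locals above (the region variable is illustrative):

provider "aws" {
  region = var.aws_region

  # Applied automatically to every taggable resource from this provider
  default_tags {
    tags = local.required_tags
  }
}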

2. Automated backups with verified restores

Unverified backups are not backups. They are optimism. The backup job completing successfully and the data being restorable are two different things, and most teams only discover the gap when they need to restore something. Automated backups need two components: the backup job itself and a scheduled restore test that verifies the backup actually works.

On AWS, RDS automated backups with point-in-time recovery are enabled by default but retention is set to one day unless explicitly configured. For production databases, a minimum of seven days retention with at least one verified restore per month is the baseline. The restore test does not need to restore to production. It restores to a temporary instance, runs a query that validates data integrity, and destroys the instance:

# AWS RDS with backup configuration via Terraform
resource "aws_db_instance" "production" {
  identifier        = "prod-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 100

  backup_retention_period   = 7      # 7 days point-in-time recovery
  backup_window             = "03:00-04:00"  # UTC, low traffic window
  maintenance_window        = "Mon:04:00-Mon:05:00"
  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-db-final-${formatdate("YYYY-MM-DD", timestamp())}"

  tags = local.required_tags
}

On GCP, Cloud SQL automated backups follow the same pattern. On Azure, Azure Backup policies apply to both VMs and databases. The key is ensuring the retention period and backup window are explicitly set in Infrastructure as Code rather than left at console defaults that nobody checks.
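
For reference, a minimal Cloud SQL sketch with the same settings made explicit; the instance name, tier, and region are illustrative:

resource "google_sql_database_instance" "production" {
  name                = "prod-db"
  database_version    = "POSTGRES_15"
  region              = "europe-west2"
  deletion_protection = true

  settings {
    tier = "db-custom-2-8192"

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "03:00"   # UTC, matches the low-traffic window above

      backup_retention_settings {
        retained_backups = 7
      }
    }
  }
}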


3. Staging environment parity

A staging environment that does not match production configuration is not staging. It is a different system with a different name. The bugs it catches are a subset of the bugs production will surface, and the subset it misses is usually the category that causes the most damage: database schema differences, environment variable mismatches, service integration behaviour under real data volumes.

Staging parity means the same Terraform modules, the same Docker images at the same versions, the same environment variable structure with different values, and the same infrastructure topology at a smaller scale. The instance types can be smaller. The replication factor can be lower. The data can be anonymised. The configuration structure must match.

The practical way to enforce this is a shared Terraform module for each service where environment is a variable that changes instance sizing and replica counts but not the resource structure:

module "api_service" {
  source = "../../modules/service"

  environment    = "staging"
  instance_type  = "t3.small"    # Smaller than production t3.xlarge
  replica_count  = 1             # Fewer replicas than production 3
  # All other configuration is identical to production
  image_tag      = var.image_tag  # Same image, same tag
  env_vars       = var.env_vars   # Same keys, staging-specific values
}

When staging and production diverge at the module level, the team ends up debugging production-only failures because the systems are structurally different. The module ensures the structure stays identical even as the sizing changes.
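
For contrast, a sketch of the production call to the same module. Only the sizing values differ; the structure is identical:

module "api_service" {
  source = "../../modules/service"

  environment    = "production"
  instance_type  = "t3.xlarge"
  replica_count  = 3

  image_tag      = var.image_tag
  env_vars       = var.env_vars
}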


4. Secret rotation

Static secrets that never rotate are credentials waiting to be compromised. Database passwords, API keys, service account tokens: any credential that does not have an expiry and a rotation schedule is a long-lived attack surface. The risk is not hypothetical. Leaked credentials in git history, CI logs, and Slack messages are one of the most consistent sources of cloud security incidents.

AWS Secrets Manager supports managed rotation for RDS, Aurora, Redshift, and DocumentDB credentials without writing any custom code. For these services, AWS handles the four-step rotation lifecycle: generate the new credential, update the target system, test the new credential, and promote it to active. The rotation Lambda runs on the schedule you define:

resource "aws_secretsmanager_secret" "db_password" {
  name       = "production/database/password"
  kms_key_id = aws_kms_key.secrets.arn

  tags = local.required_tags
}

resource "aws_secretsmanager_secret_rotation" "db_password" {
  secret_id           = aws_secretsmanager_secret.db_password.id
  rotation_lambda_arn = "arn:aws:lambda:us-east-1:123456789:function:SecretsManagerRDSPostgreSQLRotationSingleUser"

  rotation_rules {
    automatically_after_days = 30
  }
}

On GCP, Secret Manager supports rotation notifications via Pub/Sub with a Cloud Function handling the actual rotation. On Azure, Key Vault integrates with Azure Functions for the same pattern. Applications should always retrieve secrets at runtime via the secrets manager API rather than reading from environment variables set at deploy time, so rotation takes effect without redeployment.
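
On the GCP side, a minimal sketch of the rotation schedule and notification topic; the secret name, schedule, and topic are illustrative, and the Cloud Function that reacts to the notification is not shown:

resource "google_pubsub_topic" "secret_rotation" {
  name = "secret-rotation-events"
}

resource "google_secret_manager_secret" "db_password" {
  secret_id = "production-database-password"

  replication {
    auto {}
  }

  # Secret Manager publishes a message here on each rotation schedule;
  # a Cloud Function subscribed to the topic performs the actual rotation.
  # The Secret Manager service agent also needs roles/pubsub.publisher on this topic.
  topics {
    name = google_pubsub_topic.secret_rotation.id
  }

  rotation {
    next_rotation_time = "2026-01-01T00:00:00Z"  # Illustrative first rotation
    rotation_period    = "2592000s"              # 30 days
  }
}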


5. Cost alerts

Cloud cost anomaly detection alerts before the monthly bill arrives. AWS Cost Anomaly Detection is free and identifies spend patterns that deviate from the established baseline per service, linked account, or cost category. A GPU instance left running, an ephemeral environment that was never cleaned up, or an unexpected data transfer spike all surface as anomalies within hours rather than at the end of the billing cycle.

# AWS Cost Anomaly Detection via Terraform
resource "aws_ce_anomaly_monitor" "service_monitor" {
  name              = "service-level-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name      = "cost-anomaly-alerts"
  frequency = "IMMEDIATE"

  monitor_arn_list = [aws_ce_anomaly_monitor.service_monitor.arn]

  subscriber {
    address = "engineering-alerts@your-company.com"
    type    = "EMAIL"
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
      values        = ["20"]  # Alert when spend is 20% above expected
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}

Budget alerts are a separate and complementary layer. A monthly budget alert at 80% of the expected spend threshold fires before the budget is exceeded. An anomaly detection alert fires when the spend pattern changes unexpectedly regardless of the absolute value. Both should be active. On GCP, Cloud Billing budget alerts serve the same function. On Azure, Azure Cost Management budget alerts apply at the subscription or resource group level.
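
A minimal sketch of the budget side on AWS; the monthly figure and the alert address are illustrative:

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-engineering-budget"
  budget_type  = "COST"
  limit_amount = "5000"      # Illustrative monthly limit in USD
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    notification_type          = "ACTUAL"
    threshold                  = 80             # Fires at 80% of the monthly budget
    threshold_type             = "PERCENTAGE"
    subscriber_email_addresses = ["engineering-alerts@your-company.com"]
  }
}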


6. Access logs

Access logs are the audit trail that answers who accessed what, when, and from where. Without them, a security incident has no investigation surface. With them, the scope of a breach, the timeline of unauthorized access, and the specific resources touched are all answerable from log data.

The minimum viable access logging setup on AWS enables CloudTrail for management events across all regions, S3 server access logging for all production buckets, and VPC Flow Logs for network traffic. All three export to S3 with a retention policy and, ideally, an alert on specific high-risk API calls:

resource "aws_cloudtrail" "main" {
  name                          = "production-trail"
  s3_bucket_name                = aws_s3_bucket.audit_logs.id
  include_global_service_events = true
  is_multi_region_trail         = true
  enable_log_file_validation    = true

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3:::"]  # All S3 objects
    }
  }

  tags = local.required_tags
}

# Alert on high-risk API calls
resource "aws_cloudwatch_metric_alarm" "root_account_usage" {
  alarm_name          = "root-account-usage"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = 1
  metric_name         = "RootAccountUsage"
  namespace           = "CloudTrailMetrics"
  period              = 60
  statistic           = "Sum"
  threshold           = 1
  alarm_actions       = [aws_sns_topic.security_alerts.arn]
}
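
One caveat: the RootAccountUsage metric does not exist on its own. It assumes the trail also delivers to a CloudWatch Logs group and a metric filter publishes the metric. A sketch of that filter, assuming a log group resource named aws_cloudwatch_log_group.cloudtrail is wired into the trail via cloud_watch_logs_group_arn:

resource "aws_cloudwatch_log_metric_filter" "root_account_usage" {
  name           = "root-account-usage"
  log_group_name = aws_cloudwatch_log_group.cloudtrail.name

  # Standard pattern for console or API activity by the root user
  pattern = "{ $.userIdentity.type = \"Root\" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != \"AwsServiceEvent\" }"

  metric_transformation {
    name      = "RootAccountUsage"
    namespace = "CloudTrailMetrics"
    value     = "1"
  }
}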

On GCP, Cloud Audit Logs cover admin activity, data access, and system events per service. On Azure, Azure Monitor Activity Logs serve the equivalent function. The logs themselves are not sufficient without retention policies that keep them long enough for post-incident investigation: 90 days minimum for active storage, one year for cold archive.
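
The retention piece in Terraform is a lifecycle rule on the audit bucket. A sketch, assuming 90 days of active storage followed by roughly a year in cold archive:

resource "aws_s3_bucket_lifecycle_configuration" "audit_logs" {
  bucket = aws_s3_bucket.audit_logs.id

  rule {
    id     = "audit-log-retention"
    status = "Enabled"

    filter {}

    # Move to cold archive after the active investigation window
    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    # Delete after a further year in cold archive
    expiration {
      days = 455
    }
  }
}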


7. Uptime monitoring

Uptime monitoring from outside the infrastructure is the check that confirms services are reachable from the user's perspective, independent of what internal metrics say. Internal health checks and dashboards can show green while DNS misconfiguration, TLS expiry, or a routing change makes the service unreachable from the public internet. External uptime monitoring catches this category of failure that internal observability misses.

The minimum setup is HTTP checks against production endpoints with alerts on response code, response time threshold, and TLS certificate expiry. Checks should run from multiple regions to distinguish a regional network issue from a service-level outage. Open-source options like Uptime Kuma are self-hostable and free. Better Stack provides a managed free tier that covers the basics for small teams without operational overhead.
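
For teams that would rather keep the check in Terraform alongside everything else, a provider-native sketch using a Route 53 health check, which probes from multiple AWS edge locations, plus a CloudWatch alarm; the endpoint, path, and alarm target are illustrative:

resource "aws_route53_health_check" "api" {
  fqdn              = "api.your-company.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 30

  tags = local.required_tags
}

# Route 53 health check metrics are published in us-east-1
resource "aws_cloudwatch_metric_alarm" "api_unreachable" {
  alarm_name          = "api-unreachable"
  namespace           = "AWS/Route53"
  metric_name         = "HealthCheckStatus"
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 1
  comparison_operator = "LessThanThreshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]  # Illustrative SNS topic

  dimensions = {
    HealthCheckId = aws_route53_health_check.api.id
  }
}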

TLS certificate expiry monitoring specifically deserves an alert at 30 days and again at 7 days. Expired certificates cause complete service outages that are trivially preventable with a reminder. Let's Encrypt with cert-manager in Kubernetes handles automatic renewal, but the monitoring alert exists as a fallback for the case where the renewal fails silently.

# cert-manager ClusterIssuer for automatic TLS with Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@your-company.com
    privateKeySecretRef:
      name: letsencrypt-production
    solvers:
      - http01:
          ingress:
            class: nginx

8. A rollback mechanism

Every deployment needs a defined rollback path that any engineer on the team can execute without needing to understand the full deployment history. The rollback mechanism does not need to be exotic. It needs to be documented, tested, and fast.

For Kubernetes deployments, ArgoCD's GitOps model means rollback is reverting a commit. The cluster reconciles back to the previous state automatically. For non-Kubernetes deployments, immutable artifact versioning and a documented rollback procedure are the minimum. Every deployment should produce an artifact tagged with the commit SHA that can be redeployed directly:

# GitHub Actions: tag every image with commit SHA for rollback traceability
- name: Build and push image
  uses: docker/build-push-action@v6
  with:
    push: true
    tags: |
      ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
      ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}

# To roll back: redeploy the previous SHA
# kubectl set image deployment/my-app \
#   app=$REGISTRY/$IMAGE_NAME:$PREVIOUS_SHA

For database migrations, every migration needs a corresponding down migration. For infrastructure changes, reverting the Git history that ArgoCD or Terraform applies restores a previous known-good configuration. The rollback capability is only real if it has been tested. A rollback procedure that has never been exercised under realistic conditions is a procedure that will fail at the worst possible moment.


All eight of these are infrastructure decisions, not product decisions. They do not ship features. They do not generate revenue directly. What they do is prevent the category of incidents that cost days of engineering time to recover from, make compliance conversations manageable instead of painful, and let the team operate the infrastructure with confidence rather than anxiety. Setting them up early is a one-time investment. Retrofitting them into a system that has been running without them for two years is a project.

Author note

Manjunaathaa

Associate DevOps Engineer at Frigga Cloud Labs. Manages infrastructure across AWS, GCP, and Azure with GitHub Actions as the deployment layer. This blog comes from the consistent pattern of encountering cloud setups missing the same foundational pieces across different teams and environments. Each of these eight has come up as a gap during an incident, an audit, or a team handover. None of them are complex to implement. All of them are worth doing before they become urgent.

Let's connect on LinkedIn → Manjunaathaa
