Terraform Best Practices: 15 Mistakes Costing 20+ Hours/Week

Discover the 15 most common Terraform mistakes teams make and how to fix them. Improve IaC workflows, prevent drift, and ship production-ready infrastructure faster.

Abhay Singh · Cloud Architect

Your Terraform Is Technically Working. That's the Problem.

Most Terraform codebases don't fail dramatically. They decay slowly — one hardcoded value here, one skipped lock there, one "I'll refactor this later" module that never gets refactored. Six months in, your team spends more time fighting the codebase than shipping infrastructure.

We audited dozens of engineering teams and found a consistent pattern: the same 15 mistakes were responsible for the majority of wasted hours — debugging drift, untangling state corruption, re-doing work that should have been automated.

This article names every one of them — and for each, explains how to fix it permanently.


Mistake #1: Hardcoding Values Instead of Using Variables

Time wasted per week: 2–3 hours

# ❌ What most teams write
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.large"
  subnet_id     = "subnet-0bb1c79de3EXAMPLE"
}

Hardcoded AMI IDs, instance types, and subnet IDs turn your Terraform into a fragile, environment-specific mess. When you need to deploy to staging, update the AMI, or change regions, you're doing a full find-and-replace across dozens of files — with no safety net.

The fix:

# ✅ Parameterized, reusable
variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.large"
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
  subnet_id     = var.subnet_id
}

Every environment-specific value belongs in variables.tf with a type and a description, plus a default only where one is genuinely safe in every environment. Supply everything else through a .tfvars file per environment, never through inline literals.
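The fix above references data.aws_ami.ubuntu and var.subnet_id without showing where they come from. A minimal sketch of both follows, assuming you want the latest Canonical Ubuntu 22.04 image; the choice of Ubuntu and the name filter are illustrative assumptions, not something the original configuration specifies.

```hcl
# Assumption: look up the newest Ubuntu 22.04 (jammy) AMI published by
# Canonical instead of hardcoding a region-specific AMI ID.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account ID

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# No default here: a subnet ID differs in every environment,
# so each environment's .tfvars file has to supply it.
variable "subnet_id" {
  description = "ID of the subnet the web instance launches into"
  type        = string
}
```

Because most_recent = true is resolved at plan time, a newly published image shows up as a planned instance replacement. Tighten the filter if that is not the behavior you want.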

CloudOps AI prevents this by auto-extracting hardcoded values into variables when generating or importing Terraform code — so you start clean, not technical-debt-first.


Mistake #2: No Remote State Backend

Time wasted per week: 1–2 hours

Local terraform.tfstate is a single-engineer solution masquerading as a team workflow. The moment two people run terraform apply from different machines, you have state divergence — and debugging it is brutal.

The fix:

terraform {
  backend "s3" {
    bucket         = "my-tf-state-prod"
    key            = "infra/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

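Use S3 with a DynamoDB lock table, or Terraform Cloud/Enterprise for a fully managed experience. Terraform 1.10 and later can also lock natively in S3 through the backend's use_lockfile argument. Remote, locked state is non-negotiable for any team with more than one engineer.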
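The backend block assumes the bucket already exists. A minimal sketch of bootstrapping it, usually from a small separate configuration, might look like the following. The tf_state resource labels are hypothetical, and the bucket name simply mirrors the example above.

```hcl
# Bootstrap sketch: the bucket that the s3 backend stores state in.
resource "aws_s3_bucket" "tf_state" {
  bucket = "my-tf-state-prod"
}

# Keep old state versions so a bad write can be rolled back.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest; state files can contain secrets.
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State should never be publicly readable.
resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```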

CloudOps AI generates a ready-to-use backend.tf as part of every code export, with S3 backend, DynamoDB lock table, and encryption configured by default.


Mistake #3: Missing State Locking

Time wasted per week: 2–4 hours (when it breaks)

Even teams with remote backends often skip DynamoDB state locking. The result: two concurrent terraform apply runs corrupt the state file. Recovering from a corrupted state file can take an afternoon.

The fix:

Create the DynamoDB table once:

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

Then reference it in every backend config. Always. No exceptions.


Mistake #4: Not Using Modules

Time wasted per week: 3–4 hours

Copy-pasting the same VPC, security group, or ECS task configuration across five projects isn't reuse — it's five separate things to maintain. One security fix means five PRs. One breaking change in AWS means five broken configs.

The fix:

Structure reusable patterns as modules:

modules/
  vpc/
    main.tf
    variables.tf
    outputs.tf
  rds/
    main.tf
    variables.tf
    outputs.tf
environments/
  prod/
    main.tf   ← calls modules
  staging/
    main.tf   ← calls same modules, different vars

Consume them cleanly:

module "vpc" {
  source = "../../modules/vpc" # local path: the version argument is only valid for registry modules

  cidr_block   = var.vpc_cidr
  environment  = var.environment
  project_name = var.project_name
}

Use the Terraform Registry for battle-tested community modules before writing your own.
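For a local module like the modules/vpc call shown earlier, the module's interface lives in its variables.tf and outputs.tf. A minimal sketch follows, assuming the three inputs used in that call; the vpc_id output name and the aws_vpc.this resource label are hypothetical.

```hcl
# modules/vpc/variables.tf
variable "cidr_block" {
  description = "CIDR range for the VPC, e.g. 10.0.0.0/16"
  type        = string
}

variable "environment" {
  description = "Deployment environment (dev, staging, prod)"
  type        = string
}

variable "project_name" {
  description = "Project name used to build resource names"
  type        = string
}

# modules/vpc/main.tf
resource "aws_vpc" "this" {
  cidr_block = var.cidr_block

  tags = {
    Name = "${var.project_name}-${var.environment}"
  }
}

# modules/vpc/outputs.tf
output "vpc_id" {
  description = "ID of the VPC created by this module"
  value       = aws_vpc.this.id
}
```

Callers then read module.vpc.vpc_id rather than reaching into the module's internal resources.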

CloudOps AI organizes generated code into modules by default — VPC, compute, storage, and IAM are separated from day one.


Mistake #5: Skipping terraform plan Reviews

Time wasted per week: 1–3 hours

Treating terraform plan output as a formality (scrolling past it and hitting apply) is how production resources get accidentally destroyed. A line like "# aws_db_instance.main must be replaced" is easy to miss in 200 lines of diff.

The fix:

Make plan review a formal step:

  • Save plan output: terraform plan -out=tfplan

  • Review with your team in PRs (use terraform show tfplan)

  • Automate plan output as a PR comment in CI/CD

  • Set up Sentinel or OPA policies to block destructive changes without approval

Never run terraform apply without an explicit plan review, especially in production.


Mistake #6: Not Pinning Provider and Module Versions

Time wasted per week: 1–2 hours

# ❌ A breaking change will find you at the worst time
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

An unpinned AWS provider means that terraform init on a new machine six months from now can pull a new major version with breaking changes (unless a committed .terraform.lock.hcl fixes the selection), and your pipeline breaks in ways that are hard to trace.

The fix:

# ✅ Explicit, reproducible
terraform {
  required_version = ">= 1.5.0, < 2.0.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

Pin modules to specific tags, not branches:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.2"   # ✅ pinned tag
}

Use Dependabot or Renovate to open version-bump PRs automatically, so upgrades arrive as small, reviewable changes.
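The version argument only applies to modules installed from a registry. For modules pulled straight from Git, the equivalent pin is a ref query parameter on the source. A sketch, where the repository URL, subdirectory, and tag are all placeholders:

```hcl
module "vpc" {
  # Hypothetical repository. "//modules/vpc" selects a subdirectory and
  # "?ref=v1.2.0" pins the checkout to that tag.
  source = "git::https://github.com/example-org/terraform-modules.git//modules/vpc?ref=v1.2.0"

  cidr_block   = var.vpc_cidr
  environment  = var.environment
  project_name = var.project_name
}
```

Leaving out ref makes Terraform follow the repository's default branch, which is exactly the unpinned behavior this section warns against.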


Mistake #7: Secrets and Credentials in .tf Files

Time wasted per week: 2–5 hours when a breach occurs

# ❌ This gets committed to Git. Every time.
resource "aws_db_instance" "main" {
  username = "admin"
  password = "MyS3cur3P@ssw0rd"
}

Hardcoded passwords, API keys, and tokens in Terraform files eventually end up in Git history — even if you catch them and delete them. Git history is forever.

The fix:

# ✅ Reference secrets, never store them
resource "aws_db_instance" "main" {
  username = var.db_username
  password = var.db_password  # sourced from environment or Vault at plan time
}

Source secrets via:

  • TF_VAR_db_password environment variables in CI/CD

  • HashiCorp Vault provider for dynamic credentials

  • AWS Secrets Manager or SSM Parameter Store via data sources

  • A .tfvars file that is .gitignored and stored in a secrets manager

Add a pre-commit hook using git-secrets or TruffleHog to catch secrets before they are committed.
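Two of the sourcing options above can be sketched in HCL. The first marks the variable as sensitive so Terraform redacts it in plan output; the second reads the password from AWS Secrets Manager instead of a variable. The secret name prod/db/password is a hypothetical placeholder. Either way the value is still written to the state file, which is one more reason to encrypt and restrict the backend.

```hcl
# Option 1: variable fed by TF_VAR_db_password, redacted in CLI output.
variable "db_password" {
  description = "Master password for the database"
  type        = string
  sensitive   = true
}

# Option 2: read the secret at plan time (hypothetical secret name).
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/password"
}

resource "aws_db_instance" "main" {
  # ...
  username = var.db_username

  # Use var.db_password here instead if you go with option 1.
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```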

CloudOps AI flags sensitive attributes during code generation and replaces them with variable references automatically.


Mistake #8: One Giant main.tf File

Time wasted per week: 1–2 hours

A 1,500-line main.tf containing your VPC, EC2 fleet, RDS cluster, IAM policies, CloudWatch alarms, and S3 buckets is not infrastructure as code. It's infrastructure as archaeology.

The fix:

Split by resource concern:

main.tf          ← provider config, backend, data sources
vpc.tf           ← VPC, subnets, route tables, NACLs
compute.tf       ← EC2, ASG, launch templates
database.tf      ← RDS, parameter groups, subnet groups
iam.tf           ← roles, policies, instance profiles
monitoring.tf    ← CloudWatch, SNS, alarms
outputs.tf       ← all outputs
variables.tf     ← all variables

Each file should be independently readable and under 200 lines where possible.


Mistake #9: No Tagging Strategy

Time wasted per week: 1–2 hours

Untagged resources are invisible resources. When your AWS bill spikes, untagged infrastructure makes cost attribution impossible. When an incident fires at 2am, untagged EC2 instances can't be traced to a team, project, or environment.

The fix:

Define a mandatory tagging baseline using default_tags at the provider level — so every resource inherits it automatically:

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Project     = var.project_name
      Environment = var.environment
      ManagedBy   = "terraform"
      Team        = var.team_name
      CostCenter  = var.cost_center
    }
  }
}

This is better than tagging each resource individually — it's enforced, consistent, and automatic.
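Resource-level tags are merged with default_tags, so individual resources only need to declare the tags that are specific to them. A short sketch, using a hypothetical logs bucket:

```hcl
resource "aws_s3_bucket" "logs" {
  bucket = "${var.project_name}-${var.environment}-logs"

  # Merged with the provider's default_tags; on a key collision
  # the value set here wins.
  tags = {
    Name          = "${var.project_name}-logs"
    DataRetention = "90d"
  }
}
```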

CloudOps AI applies default_tags to every generated provider block and prompts you for tag values during setup.


Mistake #10: Ignoring terraform fmt and terraform validate

Time wasted per week: 30 min–1 hour

Inconsistent formatting creates noisy diffs, slows down code reviews, and makes it harder to spot real changes. Skipping terraform validate means syntax errors reach CI/CD instead of being caught locally in seconds.

The fix:

Make both mandatory in your workflow:

# Run before every commit
terraform fmt -recursive
terraform validate

Add to pre-commit hooks:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.83.5
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_tflint

Mistake #11: No CI/CD Pipeline for Terraform

Time wasted per week: 2–3 hours

Teams that apply Terraform manually from local machines can't audit who changed what, when, and why. They also can't enforce plan reviews, policy checks, or automated testing.

The fix:

A minimal Terraform CI/CD pipeline in GitHub Actions:

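name: Terraform

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

# Cloud credentials (for example via OIDC) are omitted for brevity.
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform fmt -check
      - run: terraform validate
      - run: terraform plan -out=tfplan
      - name: Post plan to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        # ... comment tfplan output on PR
      # Jobs do not share a workspace, so hand the saved plan to the apply job
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan

  apply:
    needs: plan
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production   # approval gate, if the environment has required reviewers
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
      - run: terraform apply tfplan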

Every apply to production should require a human approval gate.


Mistake #12: Not Using terraform.tfvars Per Environment

Time wasted per week: 1 hour

Using the same variable values across dev, staging, and production is how you accidentally deploy production-scale infrastructure to a dev sandbox — or worse, point dev workloads at production databases.

The fix:

Maintain per-environment var files:

environments/
  dev.tfvars
  staging.tfvars
  prod.tfvars

# Explicit, never ambiguous
terraform apply -var-file="environments/prod.tfvars"

Use a locals block to derive environment-specific settings from a single environment variable when values follow a pattern, reducing the number of vars you need to maintain.
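A sketch of that locals pattern, assuming a single environment variable drives sizing. The specific instance types, counts, and setting names are illustrative only.

```hcl
variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}

locals {
  # One lookup table instead of three sets of near-identical variables.
  env_settings = {
    dev     = { instance_type = "t3.micro", min_size = 1, multi_az = false }
    staging = { instance_type = "t3.large", min_size = 1, multi_az = false }
    prod    = { instance_type = "t3.large", min_size = 3, multi_az = true }
  }

  # Settings for whichever environment is being planned.
  settings = local.env_settings[var.environment]
}

output "instance_type" {
  description = "Instance type selected for this environment"
  value       = local.settings.instance_type
}
```

Resources then reference local.settings.instance_type, local.settings.min_size, and so on, and each .tfvars file only needs to set environment plus whatever does not follow the pattern.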


Mistake #13: Neglecting Resource Lifecycle Rules

Time wasted per week: 2–3 hours

Without lifecycle rules, Terraform can destroy and recreate stateful resources — databases, Elasticsearch clusters, S3 buckets — in ways that cause downtime or data loss. Terraform doesn't know the difference between "this is just a config server" and "this is your primary database."

The fix:

Use lifecycle rules to protect critical resources:

resource "aws_db_instance" "primary" {
  # ...

  lifecycle {
    prevent_destroy       = true   # fail any plan that would destroy or replace this resource
    create_before_destroy = true   # if a replacement is ever allowed, create the new resource first
    ignore_changes        = [
      snapshot_identifier,         # don't track snapshot drift
    ]
  }
}

prevent_destroy = true on your RDS instance, Elasticsearch domain, and any stateful infrastructure is cheap insurance against an accidental terraform destroy.
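prevent_destroy only guards changes that go through Terraform, and only while the resource block is still present in the configuration. As a complementary sketch, aws_db_instance also exposes safeguards enforced on the AWS side; the snapshot identifier below is a hypothetical value.

```hcl
resource "aws_db_instance" "primary" {
  # ...

  # Enforced by the RDS API itself, so it also blocks console and CLI deletes.
  deletion_protection = true

  # If the instance is ever deleted deliberately, keep a final snapshot.
  skip_final_snapshot       = false
  final_snapshot_identifier = "primary-final-snapshot"

  lifecycle {
    prevent_destroy = true
  }
}
```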


Mistake #14: Not Running tflint or Security Scanning

Time wasted per week: 2–4 hours (incident cost)

Terraform validates HCL syntax but won't catch an S3 bucket with public read access, an unrestricted security group, or an unencrypted EBS volume. These don't fail terraform plan — they become security incidents.

The fix:

Add static analysis to your pipeline:

# tflint — catches resource-level misconfigurations
tflint --init && tflint

# tfsec — security scanning
brew install tfsec
tfsec .

# checkov — compliance-as-code
pip install checkov
checkov -d .

Configure rules to match your organization's security baseline. Fail the pipeline on high-severity findings.
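TFLint's own configuration is written in HCL. A minimal .tflint.hcl sketch that enables the AWS ruleset follows. The plugin version shown is an assumption, so pin whichever release you have vetted.

```hcl
# .tflint.hcl
plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

plugin "aws" {
  enabled = true
  version = "0.27.0" # assumed version; check the ruleset's releases
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

# Example of opting in to a rule that the recommended preset leaves off.
rule "terraform_naming_convention" {
  enabled = true
}
```

Running tflint --init downloads the plugins declared here before the first scan.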


Mistake #15: No Documentation on Modules and Outputs

Time wasted per week: 1–2 hours

Undocumented Terraform modules are black boxes. Your colleague (or future you) has to read through 300 lines of HCL to understand what a module does, what it requires, and what it produces. Multiply that by ten modules and an onboarding engineer, and you've lost days.

The fix:

Use terraform-docs to auto-generate documentation from your code:

brew install terraform-docs
terraform-docs markdown . > README.md

This generates a formatted README.md from your variables.tf and outputs.tf descriptions — which means good variable descriptions become free documentation:

variable "instance_type" {
  description = "EC2 instance type. Use t3.micro for dev, t3.large for prod."
  type        = string
  default     = "t3.micro"

  validation {
    condition     = contains(["t3.micro", "t3.large", "m5.xlarge"], var.instance_type)
    error_message = "Must be an approved instance type."
  }
}
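Outputs get the same treatment. A sketch of outputs.tf entries that terraform-docs would pick up, assuming the module contains an aws_instance.web resource (a hypothetical name here):

```hcl
output "instance_id" {
  description = "ID of the web EC2 instance, for use in target group attachments."
  value       = aws_instance.web.id
}

output "private_ip" {
  description = "Private IP address of the web instance."
  value       = aws_instance.web.private_ip
}
```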

CloudOps AI generates a README.md alongside every module — documenting inputs, outputs, dependencies, and example usage automatically.


The Compounding Cost of Getting This Wrong

Each mistake alone might cost an hour or two. Together, they compound:

| Mistake Category | Weekly Hours Lost |
| --- | --- |
| Hardcoded values / no variables | 2–3 hrs |
| State management issues | 1–4 hrs |
| No modules (copy-paste sprawl) | 3–4 hrs |
| Missing CI/CD and plan reviews | 2–3 hrs |
| Security incidents from no scanning | 2–4 hrs |
| Undocumented modules | 1–2 hrs |
| Formatting and validation gaps | 1 hr |
| Missing tagging (cost attribution) | 1–2 hrs |
| Total | 13–23 hrs/week |

That's a part-time engineer's worth of time, every single week, spent on avoidable friction.


How CloudOps AI Eliminates These Mistakes by Default

The reason most teams make these mistakes isn't carelessness — it's that setting up all of this correctly from scratch takes time that new projects never have. CloudOps AI changes the starting point.

When you generate or import infrastructure with CloudOps AI:

  • Variables are extracted automatically — no hardcoded values

  • A remote backend with state locking is configured out of the box

  • Code is organized into logical modules from day one

  • default_tags are applied to the provider block

  • Sensitive values are identified and replaced with variable references

  • A README.md is generated for every module

  • Output is clean, idiomatic HCL — ready for review, not cleanup

You still write Terraform. You just skip the part where you pay the technical debt tax for six months before getting there.


Quick Reference Checklist

Before shipping any Terraform codebase, run through this list:

  • [ ] All environment-specific values in variables.tf

  • [ ] Remote backend configured with state locking

  • [ ] Provider and module versions pinned

  • [ ] Resources organized into modules by concern

  • [ ] Secrets sourced from Vault, SSM, or environment variables — never in .tf files

  • [ ] default_tags on provider block

  • [ ] prevent_destroy on stateful resources

  • [ ] terraform fmt and terraform validate in pre-commit hooks

  • [ ] CI/CD pipeline with plan review and manual apply gate

  • [ ] tflint and tfsec in the pipeline

  • [ ] terraform-docs generating README.md for all modules

  • [ ] Per-environment .tfvars files

Frequently Asked Questions

What are the most important Terraform best practices for beginners?

Start with three: use variables instead of hardcoded values, configure a remote backend with state locking, and pin your provider versions. These alone prevent the majority of beginner mistakes.

How do I manage Terraform state across multiple teams?

Use Terraform Cloud workspaces or separate S3 state files per team/environment with strict IAM policies. Never share a single state file across teams.
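When one team needs values from another team's state, a read-only lookup keeps the state files separate. A sketch using the terraform_remote_state data source follows. The bucket, key, and vpc_id output are hypothetical and assume the network team publishes that output.

```hcl
# Read-only view of another team's state (hypothetical bucket and key).
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "my-tf-state-prod"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```

The consuming team's IAM role only needs read access to that one state object, which fits the strict-IAM advice above.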

What is terraform fmt and why does it matter?

terraform fmt standardizes HCL formatting according to Terraform's style conventions. It eliminates formatting debates in code reviews and makes diffs meaningful rather than noisy.

Should I use Terraform modules for every resource?

Not necessarily — single-resource modules add overhead without value. Create modules when a pattern is reused across multiple environments or projects, typically grouping 3–10 related resources.

How do I prevent Terraform from destroying production databases?

Use lifecycle { prevent_destroy = true } on all stateful resources. Also enforce this via Sentinel or OPA policies that block destroy operations on tagged production resources without explicit override.

Written by

Abhay Singh

Cloud Architect

Cloud Architect and DevOps specialist with 10+ years of experience in AWS and Azure.
