Moving Beyond Static Analysis for Infrastructure
For years, my Infrastructure as Code (IaC) workflow was predictable: write HCL, run terraform plan, and hit apply if the output looked mostly right. This approach works for basic setups. As your architecture scales, though, a clean plan can give you a false sense of security. It confirms what Terraform intends to change, but it cannot guarantee that a VPC peering connection actually routes traffic or that a database is reachable over port 5432.
The turning point came six months ago. A simple CIDR block change in a shared VPC module passed every manual check but triggered a silent routing conflict. We didn’t catch it until the staging environment collapsed. That failure led us to integrate Terratest. Instead of guessing, we now treat infrastructure like software. We provision real resources, run functional tests against them, and destroy them immediately after.
Since adopting this workflow, our post-deployment ‘hotfix’ rate has dropped by roughly 40%. We no longer cross our fingers during a Friday afternoon deployment. Instead, we rely on a suite of Go-based tests that verify everything from HTTP 200 responses to specific IAM policy attachments before code ever reaches the main branch.
Setting Up the Testing Environment
Terratest is a Go library, so you will need a Go runtime alongside Terraform. If you are a DevOps engineer who hasn’t touched Go, do not be intimidated. Most infrastructure tests follow a highly repetitive pattern. Once you nail the initial structure, you can copy and adapt it for almost any module.
Prerequisites
Ensure your local machine or CI runner has these tools ready:
- Terraform: Use the version that matches your production environment (e.g., v1.5.0+).
- Go: Version 1.18 or later has worked well for me. Newer Terratest releases may require a more recent toolchain, so check the go directive in its go.mod.
- Cloud Credentials: Set your environment variables (like AWS_ACCESS_KEY_ID) to point to a dedicated testing account.
To start, initialize a new Go module in a test/ directory under your project root. This manages your testing dependencies:
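# Create a dedicated test directory
mkdir test && cd test
# Initialize the Go module
go mod init github.com/your-org/infra-tests
# Pull in the Terratest terraform module
go get github.com/gruntwork-io/terratest/modules/terraform
# Pull in the assertion library used by the test below
go get github.com/stretchr/testify/assert
# Once the test file exists, 'go mod tidy' resolves any remaining dependencies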
Configuring Your First Terratest Suite
Terratest follows a strict lifecycle: Deploy, Validate, Undeploy. Let’s test a module that creates an S3 bucket with versioning enabled. We want to prove that the bucket exists and that the versioning status is ‘Enabled’—not just assume it worked because the command finished.
The Terraform Module (main.tf)
Consider this standard module in modules/s3:
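# Declared here for completeness; it may live in variables.tf instead
variable "bucket_name" {
  description = "Globally unique name for the bucket"
  type        = string
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id

  versioning_configuration {
    status = "Enabled"
  }
}

output "bucket_id" {
  value = aws_s3_bucket.this.id
}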
The Go Test Script (s3_test.go)
Create s3_test.go in your test/ folder. I recommend using unique naming conventions for resources to prevent naming collisions when multiple developers run tests simultaneously.
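package test

import (
	"fmt"
	"strings"
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestS3BucketVersioning(t *testing.T) {
	t.Parallel()

	// Region is an assumption; match it to your testing account
	awsRegion := "us-west-2"
	// S3 bucket names must be lowercase, so normalize the random suffix
	bucketName := fmt.Sprintf("terratest-audit-log-%s", strings.ToLower(random.UniqueId()))

	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../modules/s3",
		Vars: map[string]interface{}{
			"bucket_name": bucketName,
		},
		EnvVars: map[string]string{
			"AWS_DEFAULT_REGION": awsRegion,
		},
	})

	// Ensure 'terraform destroy' runs at the end
	defer terraform.Destroy(t, terraformOptions)

	// Trigger 'terraform init' and 'terraform apply'
	terraform.InitAndApply(t, terraformOptions)

	// Retrieve the output variable
	bucketID := terraform.Output(t, terraformOptions, "bucket_id")

	// Assert that the bucket was actually created with the requested name
	assert.Equal(t, bucketName, bucketID)

	// Ask AWS directly whether versioning is enabled
	assert.Equal(t, "Enabled", aws.GetS3BucketVersioning(t, awsRegion, bucketID))
}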
Pay close attention to the defer terraform.Destroy line. It is your insurance policy. If a test fails halfway through, the deferred call still runs and tears the resources down. Without it, you might find a NAT Gateway or an idle EKS cluster lingering in your account, billing by the hour long after the test ends.
Verification and CI/CD Integration
Writing the code is the first step; automating it is where the real ROI appears. Local testing is great for debugging, but the CI/CD pipeline is where these tests act as a final gatekeeper.
Executing the Suite
Run your tests from the test/ directory with this command:
go test -v -timeout 30m
I set the timeout to 30 minutes because Go's default test timeout of 10 minutes is too short for most infrastructure. While standard Go tests finish in seconds, infrastructure tests are at the mercy of cloud providers. Provisioning an RDS instance or a CloudFront distribution can easily take 15 to 20 minutes.
GitHub Actions Workflow
Here is how I configure a typical validation job. It ensures every pull request is tested in a live sandbox before it can be merged.
name: IaC Validation
on: [pull_request]
jobs:
  terratest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version: '1.20'
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          # The wrapper script alters Terraform's stdout, which can break output parsing in Terratest
          terraform_wrapper: false
      - name: Execute Tests
        run: |
          cd test
          go mod tidy
          go test -v -timeout 60m
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_TEST_ACCOUNT_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_TEST_ACCOUNT_SECRET }}
          AWS_REGION: "us-west-2"
Managing “Leaked” Resources
Even with defer, tests can occasionally fail so badly that they leave resources behind. For example, deferred calls never run when the go test timeout fires or the CI runner is killed. To avoid surprise bills, we use a “nuke” strategy: a tool called aws-nuke runs in our testing account and purges any resource older than 24 hours that carries a testing: true tag. Using a completely separate AWS account for testing is also mandatory. It provides a hard boundary that protects your production data from accidental deletion.
Reflections After Six Months
Moving to a test-driven approach for infrastructure requires an upfront investment in time and learning. It will slow down your initial development. However, the trade-off is a massive increase in deployment confidence. My team no longer fears refactoring core networking modules because our safety net catches errors before they reach a single user.
If you are managing high-stakes environments, start small. Pick one critical module—perhaps your security group logic or your load balancer config—and write a single test. Once you see it catch a real-world misconfiguration, you will never want to go back to manual verification.

