AWS ECS Fargate & Terraform: A Production-Ready Deployment Guide

DevOps tutorial - IT technology blog

The Shift to Serverless Containers

Managing EC2 instances for container orchestration often feels like a never-ending cycle of 2 AM pager alerts. You have to patch the OS, monitor disk space, and tweak scaling groups just to keep your Docker containers breathing.

When I migrated my first cluster from self-managed EC2 to AWS ECS Fargate, most of that operational burden vanished almost overnight. Fargate lets you run containers without provisioning or patching a single server. Pair it with Terraform, and you get a repeatable, version-controlled environment that you can rebuild the same way every time.

I’ve found that mastering this stack is the fastest way to build production-ready systems. You get to focus on your code while AWS handles the heavy lifting of infrastructure maintenance. This guide skips the fluff and shows you how to build a functional, auto-scaling Fargate service from scratch.

Quick Start: The 5-Minute Cluster

Everything starts with an ECS cluster. Think of this as your logical sandbox. Unlike traditional clusters, a Fargate-backed cluster doesn’t require you to provision or pay for underlying EC2 capacity upfront.

# provider.tf
provider "aws" {
  region = "us-east-1"
}

# cluster.tf
resource "aws_ecs_cluster" "main" {
  name = "production-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

Fire off terraform init and terraform apply. You now have a namespace ready for action. But a cluster alone is just an empty shell. We need networking and task definitions to actually run a workload.
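
Before running those commands, it can help to pin the Terraform and provider versions so that terraform init resolves the same plugins on every machine. This is a minimal sketch; the file name and both version constraints are assumptions rather than values from my original setup.

```hcl
# versions.tf (hypothetical file name)
terraform {
  # Assumed minimum; adjust to the version your team runs
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Assumed constraint: allows 5.x releases, blocks the next major version
      version = "~> 5.0"
    }
  }
}
```

After terraform init, Terraform writes a .terraform.lock.hcl file. Committing it keeps the selected provider version identical across laptops and CI.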

Building the Infrastructure Foundation

A production-grade Fargate setup relies on three pillars: Networking, IAM Roles, and the Task Definition.

1. Networking for Fargate

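Fargate tasks must live inside a VPC. For a secure setup, place your tasks in private subnets and put an Application Load Balancer (ALB) in public subnets to handle incoming traffic. Note: Because your tasks sit in private subnets, they need a NAT Gateway or VPC endpoints (ECR API, ECR Docker, an S3 gateway endpoint for image layers, and CloudWatch Logs) to pull images from ECR and ship logs. Images hosted on public registries such as Docker Hub need the NAT Gateway route.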

# Simplified VPC setup
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

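# Security group attached to the Fargate tasks
resource "aws_security_group" "ecs_tasks" {
  name   = "ecs-tasks-sg"
  vpc_id = aws_vpc.main.id

  # Accept HTTP only from the load balancer's security group
  # (aws_security_group.alb is not defined in this snippet)
  ingress {
    protocol        = "tcp"
    from_port       = 80
    to_port         = 80
    security_groups = [aws_security_group.alb.id]
  }

  # Allow all outbound traffic so tasks can pull images and ship logs
  egress {
    protocol    = "-1"
    from_port   = 0
    to_port     = 0
    cidr_blocks = ["0.0.0.0/0"]
  }
}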
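
The security group above references aws_security_group.alb, and the service later in this guide expects aws_subnet.private, but neither appears in the snippets shown. One possible shape for them is sketched below. The CIDR range, Availability Zone, and open HTTP ingress are illustrative assumptions.

```hcl
# Private subnet for the Fargate tasks (CIDR and AZ are assumed values)
resource "aws_subnet" "private" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
}

# Security group for the public-facing ALB
resource "aws_security_group" "alb" {
  name   = "alb-sg"
  vpc_id = aws_vpc.main.id

  # Accept HTTP from anywhere; production setups usually terminate HTTPS on 443
  ingress {
    protocol    = "tcp"
    from_port   = 80
    to_port     = 80
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Let the ALB reach its targets
  egress {
    protocol    = "-1"
    from_port   = 0
    to_port     = 0
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

A real deployment would normally spread tasks across at least two private subnets in different Availability Zones. A single subnet is used here only to match the service definition later in the guide.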

2. IAM Roles: Execution vs. Task Role

This is where many engineers get stuck. You need two distinct roles to make this work. The Execution Role is for the ECS agent; it pulls your image and sends logs to CloudWatch. The Task Role is for your application code, allowing it to talk to services like S3 or DynamoDB.

resource "aws_iam_role" "ecs_task_execution_role" {
  name = "ecs-task-execution-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_execution_standard" {
  role       = aws_iam_role.ecs_task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
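
The listing above covers only the Execution Role. For comparison, a Task Role might look like the sketch below. The role name, policy name, and bucket ARN are hypothetical placeholders, and the S3 permission simply illustrates the kind of access your application code might need.

```hcl
# Task role: assumed by the application code inside the container
resource "aws_iam_role" "ecs_task_role" {
  name = "ecs-task-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

# Least-privilege example: read-only access to one hypothetical bucket
resource "aws_iam_role_policy" "task_s3_read" {
  name = "task-s3-read"
  role = aws_iam_role.ecs_task_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "arn:aws:s3:::example-app-bucket/*" # placeholder bucket name
    }]
  })
}
```

To put it to work, set task_role_arn = aws_iam_role.ecs_task_role.arn on the aws_ecs_task_definition resource. Keeping the two roles separate means application permissions never leak into the role that pulls images and writes logs.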

3. The Task Definition and Service

The Task Definition is your container’s DNA. It defines the image, CPU (e.g., 256 for 0.25 vCPU), and memory. The Service then acts as a manager, ensuring your desired number of tasks stay healthy.

resource "aws_ecs_task_definition" "app" {
  family                   = "my-app"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn

  container_definitions = jsonencode([
    {
      name      = "app-container"
      image     = "nginx:latest"
      essential = true
      portMappings = [{
        containerPort = 80
        hostPort      = 80
      }]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          # The log group must already exist; the awslogs driver does not create it by default
          "awslogs-group"         = "/ecs/my-app"
          "awslogs-region"        = "us-east-1"
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])
}

resource "aws_ecs_service" "main" {
  name            = "my-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    security_groups = [aws_security_group.ecs_tasks.id]
    subnets         = [aws_subnet.private.id]
  }
}
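
The listings above assume that a CloudWatch log group named /ecs/my-app already exists, and they do not yet connect the service to the ALB mentioned earlier. A sketch of those two supporting pieces follows. The retention period, target group name, and health check path are assumptions.

```hcl
# Log group named in the container's logConfiguration
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/my-app"
  retention_in_days = 30 # assumed retention period
}

# Target group for the ALB; awsvpc tasks register by IP address
resource "aws_lb_target_group" "app" {
  name        = "my-app-tg"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"

  health_check {
    path    = "/" # assumed health check path
    matcher = "200"
  }
}
```

The service would then attach to the target group with a load_balancer block that sets target_group_arn = aws_lb_target_group.app.arn, container_name = "app-container", and container_port = 80. The ALB itself and its listener are omitted here for brevity.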

Scaling for Peak Demand

Static task counts fail during traffic spikes. If your app gets featured on a major news site, you need to add capacity quickly. AWS Application Auto Scaling adjusts the service's desired count (desired_count in Terraform) based on CloudWatch metrics such as average CPU utilization.

Target tracking is the smartest approach here. It acts like a thermostat for your infrastructure, adding capacity when things get hot and cooling down when traffic drops.

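# Register the service as a scalable target; policies cannot attach without it
resource "aws_appautoscaling_target" "ecs_target" {
  min_capacity       = 2
  max_capacity       = 10 # example ceiling; size it for your workload
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.main.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_policy_cpu" {
  name               = "cpu-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}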

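When average CPU utilization across the service climbs above the 70% target, Application Auto Scaling raises the desired count and ECS starts more tasks. When the load subsides, it scales back in gradually and stops the extra tasks to protect your budget.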
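
CPU is not always the bottleneck. The sketch below adds a second target-tracking policy keyed on memory and sets explicit cooldowns. It assumes the service is already registered as a scalable target with Application Auto Scaling, and the target value and cooldown numbers are illustrative guesses, not tuned recommendations.

```hcl
# Second target-tracking policy, keyed on memory instead of CPU
resource "aws_appautoscaling_policy" "ecs_policy_memory" {
  name               = "memory-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.main.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
    target_value       = 75.0 # assumed target
    scale_out_cooldown = 60   # seconds after a scale-out before another scale-out can start
    scale_in_cooldown  = 300  # seconds after a scale-in before another scale-in can start
  }
}
```

With more than one target-tracking policy attached, the service scales out when any policy asks for it and scales in only when all of them agree. Because autoscaling changes the task count outside Terraform, many setups also add lifecycle { ignore_changes = [desired_count] } to the aws_ecs_service resource so that the next apply does not reset it.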

Hard-Won Lessons from the Field

Running Fargate in production for several years has taught me a few critical lessons that don’t always appear in the documentation.

The ‘:latest’ Tag Trap

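Terraform won’t detect a change if you simply push a new image to the same :latest tag, because the task definition itself is unchanged. Setting force_new_deployment = true in your aws_ecs_service helps, but it only takes effect when Terraform actually updates the service. Pair it with the triggers argument if you want a rollout on every terraform apply. A more predictable option is to tag each build with a unique version or digest, so every release shows up as a real diff in the plan.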
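
As far as I can tell from the AWS provider documentation, force_new_deployment on its own only acts when the service resource is being updated, so it is usually combined with triggers. Here is a sketch of the redeploy-on-every-apply pattern, assuming Terraform 1.5 or newer (for plantimestamp()) and an AWS provider version that supports triggers on aws_ecs_service.

```hcl
# Excerpt: only the arguments relevant to redeployment are shown
resource "aws_ecs_service" "main" {
  # ... existing arguments (name, cluster, task_definition, and so on) ...

  force_new_deployment = true

  # A value that changes on every plan makes Terraform update the service each apply
  triggers = {
    redeployment = plantimestamp()
  }
}
```

The trade-off is that every apply now restarts your tasks, even when nothing else changed. That is acceptable for a small service but noisy for a large one.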

Visibility is Everything

You cannot SSH into a Fargate container. Logs are your primary lifeline (ECS Exec can open a shell in a running task, but only if you enable it ahead of time). Always configure a log driver such as awslogs. Without it, debugging a 500 error becomes guesswork instead of a technical process.

Handle SIGTERM Gracefully

When ECS deploys a new version, it sends a SIGTERM to the old containers. By default, your application has 30 seconds to finish in-flight work before ECS follows up with a SIGKILL. If your app handles long-running jobs, increase stopTimeout in your container definition (Fargate allows up to 120 seconds) to avoid data corruption.
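
For illustration, here is how stopTimeout might be set on a separate worker task. The family name, container name, and image are hypothetical placeholders.

```hcl
# Hypothetical worker task that needs extra time to shut down cleanly
resource "aws_ecs_task_definition" "worker" {
  family                   = "my-worker"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn

  container_definitions = jsonencode([
    {
      name      = "worker-container"
      image     = "example/worker:1.0.0" # placeholder image
      essential = true

      # Seconds between SIGTERM and SIGKILL; the Fargate default is 30
      stopTimeout = 120
    }
  ])
}
```

A longer timeout only helps if the application actually traps SIGTERM and drains its work. If the process ignores the signal, it simply gets killed later.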

Slash Costs with Fargate Spot

Fargate can get expensive if you run many tasks 24/7. For development environments or non-critical background workers, use Fargate Spot. It lets you run on spare AWS capacity at up to a 70% discount, provided your tasks can handle a two-minute termination notice.
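
A sketch of what that could look like for a background worker service is shown below. The service name, task count, and weights are assumptions, and the task definition is reused from earlier purely for illustration.

```hcl
# Make both capacity providers available on the cluster
resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name       = aws_ecs_cluster.main.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]
}

# Hypothetical background worker service that runs mostly on Spot
resource "aws_ecs_service" "worker" {
  name            = "my-worker-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 4

  # launch_type is omitted: it cannot be combined with capacity_provider_strategy

  # Keep at least one task on regular Fargate as a safety net
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 1
    weight            = 1
  }

  # Place roughly three of every four remaining tasks on Spot
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 3
  }

  network_configuration {
    security_groups = [aws_security_group.ecs_tasks.id]
    subnets         = [aws_subnet.private.id]
  }

  # Ensure the providers are attached to the cluster before the service is created
  depends_on = [aws_ecs_cluster_capacity_providers.main]
}
```

The base of 1 on regular Fargate means a Spot reclamation never takes the service to zero, while the weights push most of the remaining tasks onto the cheaper capacity.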

Terraform boilerplate can feel heavy at first. However, the trade-off is a rock-solid environment that scales without manual intervention. Once your HCL files are ready, launching a new microservice takes minutes, not days.
