Ditching Cluster Autoscaler for Karpenter: Faster, Cheaper EKS Scaling

DevOps tutorial - IT technology blog

The Problem with Traditional Scaling

Running Kubernetes at scale is often a fight against waste. You want your apps to stay responsive, but over-provisioning leads to massive AWS bills. For years, the Kubernetes Cluster Autoscaler (CA) was the only real tool for the job. It got us by, but it was never particularly fast or smart.

I’ve spent many stressful shifts watching error rates climb while waiting for the Cluster Autoscaler to react. It has to trigger an AWS Auto Scaling Group (ASG), wait for the EC2 instance to boot, and finally register the node. In a busy production environment, those four or five minutes can feel like an eternity while pods sit in a Pending state.

Karpenter changes the game by cutting out the middleman. It bypasses Auto Scaling Groups entirely and talks directly to the EC2 Fleet API. This approach makes node management faster, more flexible, and significantly cheaper.

How Karpenter Outperforms Cluster Autoscaler

To understand why teams are migrating to Karpenter, we have to look at the fundamental shift in how it handles infrastructure.

The Reactive Nature of CA

The Cluster Autoscaler is strictly reactive. When it spots an “unschedulable” pod, it scans your existing Node Groups. If it finds a match, it increases the “desired capacity” of that ASG. This process is rigid. If your pod needs a specific GPU or a high-memory instance that isn’t in your pre-defined groups, CA simply can’t help you.

The Just-in-Time Logic of Karpenter

Karpenter is proactive and “group-less.” It evaluates the exact requirements of your pending pods—things like CPU, RAM, architecture (ARM64 vs. x86), and Availability Zones. It then asks AWS for the single most cost-effective instance that fits those needs. If a pod requires 2 vCPUs, Karpenter won’t spin up a 16-vCPU m5.4xlarge just because it’s part of a group. It might pick a t3.medium instead, at a small fraction of the larger instance’s hourly cost.
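Everything Karpenter needs to make that decision lives in the pod spec itself. An illustrative fragment (not from a real workload) showing the fields it reads:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: app
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
      resources:
        requests:
          cpu: "2"       # Karpenter sizes the node to fit this
          memory: 4Gi
  nodeSelector:
    kubernetes.io/arch: arm64   # steer the workload toward Graviton instances
```

From the requests and the nodeSelector alone, Karpenter can pick the cheapest instance type and Availability Zone that satisfies the pod—no pre-defined node group required.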

The Trade-offs: Is It Right for You?

The Benefits

  • Blazing Speed: Nodes typically join the cluster in 45-60 seconds. CA often takes 3-5 minutes to complete the same task.
  • Aggressive Bin-packing: Karpenter constantly looks for ways to consolidate. It will move pods to a smaller node and terminate the old one if it saves money.
  • Simplified Config: You can replace dozens of specialized Auto Scaling Groups with a single Karpenter configuration.
  • Smarter Spot Handling: It proactively replaces Spot instances when AWS issues a termination notice, often before the instance actually goes down.

The Challenges

  • AWS-Centric: While the project aims to be platform-neutral, its most mature features are currently exclusive to AWS.
  • Initial Setup: Configuring the IAM roles and OIDC provider requires more precision than clicking “Create Managed Node Group” in the console.
  • Fast-Moving API: Karpenter is evolving quickly. You’ll need to keep an eye on version updates as some configuration schemas have changed recently.

Prerequisites for Success

Before starting the migration, ensure your environment meets these requirements:

  • Kubernetes: Version 1.25 or higher is recommended.
  • Environment: An existing Amazon EKS cluster.
  • Local Tools: helm, kubectl, and an updated aws-cli.
  • Access: Admin permissions to create IAM roles and manage OpenID Connect (OIDC) providers.
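A quick sanity check for the local toolchain (a simple sketch; it only verifies the binaries are on your PATH, not their versions):

```shell
# Report whether each required CLI tool is installed
for tool in helm kubectl aws; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: ok"
  else
    echo "$tool: MISSING"
  fi
done
```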

Step-by-Step Installation

We will focus on deploying the Karpenter controller and setting up the necessary permissions. This guide assumes you are working with an existing EKS cluster.

1. Provision IAM Roles

Karpenter needs permission to provision EC2 instances on your behalf. You need two distinct roles: one for the nodes Karpenter creates, and one for the controller itself — an IRSA role trusted by your cluster’s OIDC provider, whose ARN the Helm install below references as ${KARPENTER_IAM_ROLE_ARN}.

# Create the IAM Role for the nodes
aws iam create-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --assume-role-policy-document file://node-trust-policy.json

# Attach essential policies
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
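The node-trust-policy.json referenced by create-role is the standard EC2 service trust policy — create it before running the commands above:

```shell
# Write the trust policy that lets EC2 instances assume the node role
cat > node-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
```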

2. Deploy the Controller with Helm

Once your IAM roles are mapped, use Helm to install the controller. This component watches the API server for pods that the scheduler can’t place.

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version v0.32.1 \
  --namespace karpenter --create-namespace \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
  --set settings.clusterName=${CLUSTER_NAME} \
  --wait

Note that in the v0.32 chart the settings moved from settings.aws.* to settings.*, and the instance profile is no longer a Helm value — the node IAM role is set on the EC2NodeClass instead.

3. Define Your NodePool

This is the core of your scaling logic. In version 0.32+, Karpenter uses NodePool and EC2NodeClass. The NodePool defines the constraints for your instances, while the EC2NodeClass handles the AWS-specific details: AMI family, subnets, security groups, and the node IAM role.

Create nodepool.yaml:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        name: default
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
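You can also cap the total capacity a NodePool is allowed to launch — a useful guardrail against runaway scaling. The values below are illustrative; the block goes under spec, alongside disruption:

```yaml
  limits:
    cpu: "100"      # max total vCPUs across all nodes from this pool
    memory: 400Gi   # max total memory across all nodes from this pool
```

Once a limit is reached, Karpenter stops provisioning for this pool and pods stay Pending until capacity frees up.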

And the EC2NodeClass to link your subnets:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}

4. Testing the Setup

To see Karpenter in action, apply the NodePool and EC2NodeClass manifests, then scale a dummy deployment. I recommend using the “pause” image, which idles without doing real work — but give it a CPU request, because pods with no resource requests won’t trigger any provisioning.

kubectl create deployment inflate --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7
kubectl set resources deployment inflate --requests=cpu=1
kubectl scale deployment inflate --replicas=15

Monitor the logs with kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter. You will see Karpenter immediately calculate the resource gap and provision a new instance. In my tests, the new node usually shows a Ready status in kubectl get nodes within 40 seconds. When you’re done, scale the deployment back to zero and watch consolidation drain and terminate the now-empty node.

Final Thoughts

Switching from Cluster Autoscaler to Karpenter is like upgrading from a map to a real-time GPS. You stop managing individual pools of servers and start defining the needs of your applications. While the initial IAM setup is a bit rigid, the payoff is a cluster that scales faster and costs less. For anyone running significant production workloads on EKS, Karpenter is no longer just an alternative—it is the modern standard.
