The Problem: The Expensive Guessing Game of Resource Limits
Nothing ruins a DevOps engineer’s morning like a 3:00 AM Slack alert for a CrashLoopBackOff. You check the logs and see the dreaded OOMKilled status—your application ran out of memory, and now it’s stuck. Even worse is the month-end cloud bill. You might find you’re paying for 80% idle CPU because someone set resource requests to 2.0 cores “just to be safe” when the app only uses 0.1 cores.
Setting static resource limits in Kubernetes is rarely accurate. If limits are too low, your application throttles or crashes under load. If they are too high, you waste money and kill cluster density. Static configurations fail because traffic is unpredictable and memory footprints often grow as an application stays online.
Root Cause: Why Static Limits Fail at Scale
Kubernetes relies on requests for scheduling and limits to cap usage. The gap between these two numbers is where most waste happens. Manual tuning works when you have three microservices. It fails miserably when you’re managing fifty. If you over-provision by just 0.5 CPU cores per pod across a 100-pod cluster, you are paying for 50 cores of “ghost” compute that never performs any actual work.
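To make that waste concrete, here is a sketch of the kind of over-provisioned container spec that produces it (the numbers are hypothetical, mirroring the 2.0-core example above):

resources:
  requests:
    cpu: "2000m"    # scheduler reserves 2 full cores per pod, used or not
    memory: "1Gi"
  limits:
    cpu: "2000m"
    memory: "1Gi"
# If actual usage is ~100m, roughly 95% of the reserved CPU is "ghost" compute.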
Most teams struggle with two specific patterns:
- Horizontal Spikes: Sudden traffic bursts that require more instances of the application.
- Vertical Growth: Java or Python apps that need more memory per instance as they process larger datasets or run for extended periods.
Quick Start: Automate Your Scaling in 5 Minutes
You can stop guessing by implementing automated scaling. First, ensure the Metrics Server is running in your cluster. This component provides the telemetry data that autoscalers need to make decisions.
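If Metrics Server isn't already installed, the standard upstream manifest works on most clusters, and kubectl top confirms telemetry is flowing:

# Install Metrics Server from the upstream release manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify metrics are available (may take a minute after install)
kubectl top nodes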
1. Deploy Horizontal Pod Autoscaler (HPA)
HPA changes the number of pods based on demand. Use this command to keep average CPU utilization near 50%, scaling a deployment between 2 and 10 replicas:
kubectl autoscale deployment my-app --cpu-percent=50 --min=2 --max=10
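If you manage resources declaratively, the equivalent autoscaling/v2 manifest looks roughly like this (the object name is an example):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50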
2. Deploy Vertical Pod Autoscaler (VPA)
VPA adjusts the size of individual pods. While HPA adds more workers, VPA gives the existing workers a bigger engine. After installing the VPA components, apply this configuration to monitor your app:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
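Apply the manifest and confirm the object is tracking your deployment (the filename is just an example):

kubectl apply -f my-app-vpa.yaml
kubectl get vpa my-app-vpa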
How the Mechanics Work
Horizontal Pod Autoscaler (HPA)
HPA runs a continuous control loop, querying the Metrics API every 15 seconds. It calculates the necessary replica count using a simple ratio. If your target is 50% and your current usage is 100%, HPA doubles your pod count. It is the best choice for stateless web services where spreading the load across more “workers” is easy.
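The loop's core math is a single ratio, as documented in the upstream HPA algorithm:

desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)
# e.g. 4 pods at 100% CPU with a 50% target: ceil(4 * 100 / 50) = 8 pods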
Vertical Pod Autoscaler (VPA)
VPA is more surgical. It uses three internal components to manage pod sizes:
- Recommender: Analyzes historical usage to suggest the “just right” CPU and memory values.
- Admission Controller: Modifies new pods at creation time to use the recommended values.
- Updater: Evicts pods running with outdated resource settings so they can be recreated at the new, correct size.
Warning: In Auto mode, VPA will restart your pods to apply changes. This can cause brief downtime if you haven’t configured enough replicas or set up Pod Disruption Budgets (PDBs).
Strategic Implementation: Resolving the Conflict
A common pitfall is running HPA and VPA on the same metric, like CPU. This creates a “tug-of-war” effect. VPA tries to increase the CPU per pod, while HPA tries to add more pods to lower the average CPU. They will fight each other, resulting in an unstable environment and fluctuating performance.
The Recommended Strategy
To make them work in harmony, follow these two rules:
- Use HPA for Scaling: Let HPA handle traffic spikes based on CPU or request throughput.
- Use VPA for Rightsizing: Use VPA in Initial or Off mode to manage memory. Memory is non-compressible; if you run out, the app crashes. VPA ensures your memory requests match real-world peaks.
I recommend starting with updateMode: "Off". This allows you to see recommendations without letting the controller restart your pods automatically:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-recommend
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off" # Recommendation only mode
Production Tips for Stability
1. Enforce Pod Disruption Budgets (PDB)
Resizing requires a pod restart. If you use VPA in Auto mode, a PDB is your safety net. It ensures that even during a mass resize, a minimum number of pods remain available to serve traffic.
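A minimal PDB for the example deployment might look like this; the label selector is an assumption and must match your Deployment's pod labels:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1        # keep at least one pod serving during evictions
  selector:
    matchLabels:
      app: my-app        # assumed label; adjust to your pod template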
2. The “Observe First” Rule
Don’t jump straight to automated restarts. Run VPA in Off mode for at least one week. Check the recommendations in your Grafana dashboards and compare them to your current settings. Only move to Auto once the recommendations prove consistent and safe.
3. Manage Memory Carefully
CPU can be throttled, meaning your app just runs slower. Memory cannot. When a pod hits its memory limit, the kernel kills the process immediately. VPA is your best defense here because it learns the app’s peak memory requirements over time and adjusts the floor to prevent those 3:00 AM crashes.
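You can also put guardrails on what VPA is allowed to recommend. This resourcePolicy sketch bounds memory for every container in the pod (the floor and ceiling values are illustrative):

spec:
  resourcePolicy:
    containerPolicies:
      - containerName: "*"              # apply to all containers in the pod
        controlledResources: ["memory"]
        minAllowed:
          memory: "256Mi"               # illustrative floor
        maxAllowed:
          memory: "4Gi"                 # illustrative ceiling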
4. Tune the Stabilization Window
HPA can sometimes scale down too aggressively, causing “thrashing” where pods are constantly created and deleted. Modern Kubernetes versions allow you to tune the stabilizationWindowSeconds. Setting this to 300 seconds (5 minutes) for scale-down operations prevents premature pod termination during temporary traffic dips.
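With the autoscaling/v2 API, the window is set per direction under spec.behavior; a five-minute scale-down window looks like this:

spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before removing pods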
Closing Thoughts
Optimizing Kubernetes isn’t about finding a magic set of numbers. It is about building a system that observes and reacts to reality. By combining HPA for traffic spikes and VPA for long-term rightsizing, you can slash cloud costs while making your applications significantly more resilient. Start small, trust the data, and let the controllers handle the operational burden of resource management.

