Quick Start — Get Argo Rollouts Running in 5 Minutes
It’s 2 AM. Your team just pushed a hotfix and the standard kubectl rollout strategy is giving you cold sweats — ship one bad pod and live traffic takes the hit. I’ve been there. That’s exactly the night I started taking Argo Rollouts seriously.
Argo Rollouts is a Kubernetes controller that lets you control exactly how traffic shifts during a release: canary splitting, blue/green switching, automated metric analysis, and rollback on failure. Standard Kubernetes Deployment objects can’t do any of this natively.
Install the controller and the kubectl plugin first:
```shell
# Install the Argo Rollouts controller
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# Install the kubectl plugin (Linux amd64 shown; on macOS grab the
# kubectl-argo-rollouts-darwin-amd64 or -darwin-arm64 binary instead)
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts

# Verify
kubectl argo rollouts version
```
Next, swap your Deployment manifest for a Rollout object. The pod spec stays identical — you’re mostly changing the kind and adding a strategy block:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:v1.0.0
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 2m}
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100
```
Apply it:
```shell
kubectl apply -f rollout.yaml
kubectl argo rollouts get rollout my-app --watch
```
You now have a working canary rollout: roughly 20% of traffic hits the new version, holds for 2 minutes, ramps to 50%, holds for 5 minutes, then goes full. All without writing a single line of custom logic.
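To see those steps fire, point the rollout at a new image. The `v1.0.1` tag here is just a placeholder for whatever your next build produces:

```shell
# Update the container image to kick off a canary rollout
kubectl argo rollouts set image my-app my-app=my-registry/my-app:v1.0.1

# Watch the steps progress
kubectl argo rollouts get rollout my-app --watch
```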
Deep Dive — How Argo Rollouts Actually Works
The Canary Strategy
Under the hood, Argo Rollouts manages two ReplicaSets: the stable set (current version) and the canary set (new version). Traffic weighting works by adjusting replica counts — at 20% weight with 5 total replicas, you get 1 canary pod and 4 stable pods.
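The arithmetic behind that split can be sketched in a couple of lines — this assumes simple round-to-nearest, which matches the 5-replica example; the controller's exact rounding at edge cases may differ:

```shell
# Illustrative: replica-based weighting at setWeight: 20 with 5 replicas
total=5
weight=20
canary=$(( (total * weight + 50) / 100 ))   # round(5 * 20%) -> 1 canary pod
stable=$(( total - canary ))                # -> 4 stable pods
echo "canary=$canary stable=$stable"
```

Note the granularity problem this exposes: with 5 replicas you can only hit weights in 20% increments, which is exactly why proxy-level traffic routing (next) exists.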
For actual HTTP-level traffic splitting (not just replica-based), you need an ingress controller or service mesh integration. With NGINX Ingress:
```yaml
strategy:
  canary:
    canaryService: my-app-canary
    stableService: my-app-stable
    trafficRouting:
      nginx:
        stableIngress: my-app-ingress
    steps:
      - setWeight: 10
      - pause: {duration: 10m}
      - setWeight: 30
      - pause: {duration: 10m}
      - setWeight: 100
```
Create the two services Argo Rollouts will manage:
```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-stable
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-canary
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```
Argo Rollouts injects NGINX annotations automatically to split traffic at the proxy level — not at the pod count level. This is the difference between “roughly 10%” and “exactly 10%”.
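Concretely, the controller creates a second Ingress carrying the standard NGINX canary annotations and keeps the weight in sync with the current step. A rough sketch of the generated object — the name shown is illustrative; the controller derives it from the rollout and stable ingress names:

```yaml
# Illustrative sketch of the canary Ingress Argo Rollouts generates
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-my-app-ingress-canary   # illustrative name
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # tracks setWeight
```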
The Blue/Green Strategy
Blue/green is conceptually simpler: run the new version (green) alongside the old (blue), then flip the switch. No gradual traffic ramp — it’s all-or-nothing, but you get to validate green before cutting over.
```yaml
strategy:
  blueGreen:
    activeService: my-app-active
    previewService: my-app-preview
    autoPromotionEnabled: false   # Require manual promotion
    scaleDownDelaySeconds: 30     # Keep blue running 30s after promotion
```
After deploying a new image, the green pods spin up behind my-app-preview. Your QA team can hit the preview endpoint directly to validate. When you’re confident:
```shell
# Promote green to active (flip the switch)
kubectl argo rollouts promote my-app

# Or abort and roll back to blue
kubectl argo rollouts abort my-app
```
Pay attention to scaleDownDelaySeconds at 2 AM. It keeps the old pods alive briefly after promotion — long enough for in-flight requests to drain. That buffer also gives you time to spot an immediate problem and abort before the blue pods disappear.
Advanced Usage — Automated Analysis and Rollback
Manual promotion works fine for small teams. When you’re shipping multiple times a day, though, you need the process to check metrics for you. Argo Rollouts has an AnalysisTemplate that can query Prometheus, Datadog, New Relic, or any HTTP endpoint and automatically decide whether to proceed or roll back.
Define an analysis that checks your error rate:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.05   # Less than 5% errors
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```
Wire it into your canary strategy:
```yaml
strategy:
  canary:
    steps:
      - setWeight: 20
      - analysis:
          templates:
            - templateName: error-rate-check
          args:
            - name: service-name
              value: my-app-canary
      - setWeight: 50
      - pause: {duration: 5m}
      - setWeight: 100
```
Now the rollout pauses at 20%, runs five Prometheus queries a minute apart, and automatically aborts if the error rate reaches 5% in more than two of those checks — failureLimit: 2 tolerates two failed measurements before the run is considered Failed. No human needed at 2 AM.
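The pass/fail mechanics are worth internalizing before you trust them in production. A sketch with hypothetical error-rate samples — the loop semantics (fail once failed measurements exceed failureLimit) mirror the template above:

```shell
# Hypothetical measurements; real ones come from the Prometheus query
failure_limit=2
failures=0
verdict="Successful"
for error_rate in 0.01 0.08 0.02 0.09 0.07; do
  # successCondition: result[0] < 0.05 — count a failure when violated
  if awk -v r="$error_rate" 'BEGIN { exit !(r >= 0.05) }'; then
    failures=$((failures + 1))
  fi
  if [ "$failures" -gt "$failure_limit" ]; then
    verdict="Failed"   # the rollout aborts and rolls back here
    break
  fi
done
echo "$verdict"
```

With those samples, the third failed check (0.07) tips the run past failureLimit and the analysis fails.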
Background Analysis for Blue/Green
With blue/green, you can run analysis against the preview environment before production traffic ever sees the new version:
```yaml
strategy:
  blueGreen:
    activeService: my-app-active
    previewService: my-app-preview
    autoPromotionEnabled: false
    prePromotionAnalysis:
      templates:
        - templateName: error-rate-check
      args:
        - name: service-name
          value: my-app-preview
    postPromotionAnalysis:
      templates:
        - templateName: error-rate-check
      args:
        - name: service-name
          value: my-app-active
```
Pre-promotion analysis runs against preview. Post-promotion analysis runs against active for a confirmation window. If post-promotion fails, it triggers an automatic rollback.
Practical Tips From Production
1. Start With Manual Promotion
Don’t add automated analysis on day one. Get comfortable with the mechanics first — manual promote and abort commands. Those operations need to be muscle memory before you hand the wheel to automation. Operators who skip this step tend to misread automated failures and override rollbacks that should have stuck.
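The operations worth drilling until they're reflexive, using the my-app rollout from the examples above:

```shell
kubectl argo rollouts get rollout my-app --watch   # live view of steps and pods
kubectl argo rollouts pause my-app                 # hold the rollout where it is
kubectl argo rollouts promote my-app               # advance past the current pause
kubectl argo rollouts abort my-app                 # fall back to the stable version
kubectl argo rollouts retry rollout my-app         # re-run an aborted rollout
kubectl argo rollouts undo my-app                  # roll back to a previous revision
```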
2. Always Set scaleDownDelaySeconds
The default is 30 seconds. At 1,000+ req/s, bump it to 60–120 seconds. This setting saved me once when a canary silently introduced a memory leak. The spike wasn’t immediate. But those extra two minutes of stable pods being warm let us abort cleanly — zero dropped requests.
```yaml
strategy:
  blueGreen:
    scaleDownDelaySeconds: 120
```
3. Use the Dashboard
Argo Rollouts ships a read-only UI that earns its keep during an incident:
```shell
# Start the dashboard (served locally on localhost:3100)
kubectl argo rollouts dashboard
```
You get a visual timeline of steps, analysis results, and replica counts. When you’re sleep-deprived and trying to explain deployment status on a Slack call, a colour-coded chart beats raw kubectl output every time.
4. Migrate Gradually — Don’t Rewrite Everything at Once
Pick one service. Run it as a Rollout for a sprint, learn where it can fail, then expand. Argo Rollouts is fully compatible with ArgoCD — if you’re already using GitOps, the Rollout manifest just lives in the same repo alongside everything else.
5. Watch the Pause Behavior
A pause: {} with no duration pauses indefinitely until you manually promote. Useful for canaries that need a human sign-off. The trap: if your CI/CD pipeline calls kubectl argo rollouts promote without checking rollout status first, it will promote immediately regardless of metrics.
Always guard the promotion step:
```shell
# Check the current state without blocking, then promote only if the
# rollout is actually paused awaiting sign-off
status=$(kubectl argo rollouts status my-app --watch=false)
case "$status" in
  Paused*) kubectl argo rollouts promote my-app ;;
  *)       echo "Not paused (status: $status) — refusing to promote" ;;
esac
```
Argo Rollouts turns Kubernetes deployments from a binary flip into a controlled, observable process. The first time you watch a bad canary automatically abort and roll back at 3 AM — while you’re still reading the Slack alert — you’ll understand why this matters.
