Every deployment is a risk. The question isn’t whether something will go wrong — it’s how much damage it will cause when it does.
Traditional Kubernetes deployments are all-or-nothing. You push a new version, and within seconds, 100% of your traffic hits the new code. If there’s a bug, everyone sees it. If the service crashes, all users are affected.
Progressive delivery changes this equation. Instead of deploying to everyone at once, you gradually shift traffic to the new version, validating at each step. If something goes wrong, only a fraction of users are affected.
Argo Rollouts brings progressive delivery to Kubernetes as a drop-in replacement for Deployments.
Why Progressive Delivery?
Consider what happens with a standard Deployment during a bug release:
```
Time 0:00 - Deploy new version
Time 0:02 - All pods running new version
Time 0:05 - Errors start appearing
Time 0:08 - Alerts fire
Time 0:15 - Engineer investigates
Time 0:25 - Rollback initiated
Time 0:27 - All pods back to old version
```
Blast radius: 100% of users for ~25 minutes
With progressive delivery:
```
Time 0:00 - Deploy new version (5% traffic)
Time 0:05 - Automated analysis detects errors
Time 0:06 - Automatic rollback
```
Blast radius: 5% of users for ~6 minutes
This is resilience. Not preventing failures, but limiting their impact.
Two Strategies: Canary vs Blue-Green
Argo Rollouts supports multiple strategies. The two most common:
Canary
Traffic shifts gradually from old to new version:
```
Step 1:   5% new,  95% old  (test the waters)
Step 2:  20% new,  80% old  (expand if healthy)
Step 3:  50% new,  50% old  (halfway point)
Step 4: 100% new,   0% old  (full rollout)
```
Best for: Stateless services, high-traffic applications where you want gradual validation.
Blue-Green
Two complete environments, instant switch:
```
Before:  100% blue (old),  0% green (new)
Deploy:  100% blue; green deployed but not receiving traffic
Switch:    0% blue,      100% green
```
Best for: Services requiring instant rollback, database migrations, when you need both versions running simultaneously for testing.
Installing Argo Rollouts
```bash
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
```
For GitOps with ArgoCD, add it as an Application:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argo-rollouts
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://argoproj.github.io/argo-helm
    chart: argo-rollouts
    targetRevision: 2.35.1
    helm:
      values: |
        dashboard:
          enabled: true
  destination:
    server: https://kubernetes.default.svc
    namespace: argo-rollouts
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
Canary Rollout Example
Replace your Deployment with a Rollout:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:v2.0.0
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 20
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
```
This creates a 20-minute gradual rollout:
- Send 5% of traffic to new version, wait 5 minutes
- If healthy, increase to 20%, wait 5 minutes
- If healthy, increase to 50%, wait 5 minutes
- Complete rollout to 100%
At any point, aborting the rollout instantly returns traffic to the old version.
Traffic Management
By default, Argo Rollouts uses replica counts to approximate traffic weight. For precise traffic control, integrate with your ingress controller or service mesh:
With Traefik
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      trafficRouting:
        traefik:
          weightedTraefikServiceName: my-app-weighted
      steps:
        - setWeight: 10
        - pause: { duration: 2m }
        - setWeight: 50
        - pause: { duration: 2m }
```
With supporting TraefikService:
```yaml
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: my-app-weighted
spec:
  weighted:
    services:
      - name: my-app-stable
        port: 80
        weight: 100  # Managed by Argo Rollouts
      - name: my-app-canary
        port: 80
        weight: 0    # Managed by Argo Rollouts
```
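The Rollout also references `my-app-stable` and `my-app-canary` Services, which are never shown above. They are ordinary Services selecting the app's pods; the controller narrows each one's selector to the right ReplicaSet during a rollout. A minimal sketch, with port numbers assumed to match the container port used earlier:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-stable
spec:
  selector:
    app: my-app  # Rollouts adds a rollouts-pod-template-hash selector at runtime
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-canary
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```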
With Nginx Ingress
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      trafficRouting:
        nginx:
          stableIngress: my-app-ingress
```
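The nginx integration expects the referenced Ingress to already exist and route to the stable Service; the controller then creates a parallel canary Ingress carrying the weight annotations. A minimal sketch, with hostname and ingress class as assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: my-app.example.com  # assumed hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-stable
                port:
                  number: 80
```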
With Istio
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: my-app-vsvc
            routes:
              - primary
```
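The named route must already exist in the VirtualService with both stable and canary destinations; Argo Rollouts rewrites the weights at each step. A sketch, reusing the stable/canary service names from earlier and an assumed hostname:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app-vsvc
spec:
  hosts:
    - my-app.example.com  # assumed hostname
  http:
    - name: primary
      route:
        - destination:
            host: my-app-stable
          weight: 100  # managed by Argo Rollouts
        - destination:
            host: my-app-canary
          weight: 0    # managed by Argo Rollouts
```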
Blue-Green Rollout
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:v2.0.0
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
      postPromotionAnalysis:
        templates:
          - templateName: load-test
```
This creates:
- my-app-active: Points to current production version
- my-app-preview: Points to new version for testing
The new version is deployed but receives no production traffic until you promote it.
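The active and preview Services referenced by the strategy are plain Services over the same pod labels; on promotion, the controller repoints the active Service's selector at the new ReplicaSet. A minimal sketch, with port numbers as assumptions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-active
spec:
  selector:
    app: my-app  # Rollouts appends a rollouts-pod-template-hash selector
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```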
Automated Analysis
The real power of progressive delivery is automated rollback. Argo Rollouts can analyze metrics during rollout and abort if something goes wrong.
Analysis Template
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```
This template:
- Queries Prometheus every minute
- Checks if success rate is >= 95%
- Fails the rollout after 3 consecutive failures
Using Analysis in Rollout
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: my-app
        - setWeight: 50
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: my-app
```
Now the rollout:
- Shifts 5% traffic
- Waits 2 minutes
- Runs analysis — if it fails, automatic rollback
- If analysis passes, shifts to 50%
- Runs analysis again
- Completes rollout
Multiple Analysis Metrics
Combine multiple checks:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: comprehensive-check
spec:
  args:
    - name: service-name
  metrics:
    # HTTP success rate
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
    # P99 latency
    - name: latency-p99
      interval: 1m
      successCondition: result[0] < 0.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m])) by (le))
    # Error rate
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```
Job-Based Analysis
For non-Prometheus checks (smoke tests, integration tests):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test
spec:
  metrics:
    - name: smoke-test
      provider:
        job:
          spec:
            backoffLimit: 1
            template:
              spec:
                containers:
                  - name: smoke
                    image: curlimages/curl
                    command:
                      - /bin/sh
                      - -c
                      - |
                        curl -f http://my-app-canary/health || exit 1
                        curl -f http://my-app-canary/api/status || exit 1
                restartPolicy: Never
```
Dashboard and CLI
Monitor rollouts with the Argo Rollouts kubectl plugin:
```bash
# Install the plugin (macOS amd64 shown; pick the matching release asset for your platform)
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-darwin-amd64
chmod +x kubectl-argo-rollouts-darwin-amd64
sudo mv kubectl-argo-rollouts-darwin-amd64 /usr/local/bin/kubectl-argo-rollouts

# Watch rollout progress
kubectl argo rollouts get rollout my-app -w

# Manually promote (if autoPromotionEnabled is false)
kubectl argo rollouts promote my-app

# Abort and roll back
kubectl argo rollouts abort my-app

# View dashboard
kubectl argo rollouts dashboard
```
The dashboard shows real-time traffic distribution and analysis status.
Integration with GitOps
When using ArgoCD, Rollouts work seamlessly. Update the image tag in Git, ArgoCD syncs, and the progressive rollout begins.
```yaml
# In your GitOps repo
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: my-app:v2.1.0  # Update this line
```
The semantic versioning pipeline creates the tag, which triggers an update to the GitOps repo, which ArgoCD syncs, which starts the Rollout.
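The update step in that pipeline can be as simple as rewriting the image tag in place and committing the result. A sketch, where `NEW_TAG` and the manifest path are assumptions about your release job; for demonstration it creates the line it rewrites:

```shell
# Sketch of the CI step that bumps the Rollout image tag (names are assumptions).
NEW_TAG="v2.1.0"          # produced by the semantic-release step
MANIFEST="rollout.yaml"   # path inside the GitOps repo checkout

# For demonstration, create the manifest line this step rewrites:
printf 'image: my-app:v2.0.0\n' > "$MANIFEST"

# Rewrite the tag in place (-i.bak works on both GNU and BSD sed):
sed -i.bak "s|image: my-app:.*|image: my-app:${NEW_TAG}|" "$MANIFEST" && rm -f "${MANIFEST}.bak"

cat "$MANIFEST"
# A real pipeline would now git add/commit/push, and ArgoCD syncs the change.
```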
Notifications
Get notified about rollout events:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  annotations:
    notifications.argoproj.io/subscribe.on-rollout-completed.slack: my-channel
    notifications.argoproj.io/subscribe.on-rollout-aborted.slack: my-channel
```
Configure the notification controller separately (shares configuration with ArgoCD notifications).
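That controller-side configuration lives in a ConfigMap the notification engine watches. A sketch of the shape, assuming a Slack token stored under `slack-token` in the `argo-rollouts-notification-secret` Secret; check the Rollouts notifications docs for the trigger catalog before copying:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap  # name the controller expects
  namespace: argo-rollouts
data:
  service.slack: |
    token: $slack-token  # key in argo-rollouts-notification-secret (assumption)
  template.rollout-completed: |
    message: "Rollout {{.rollout.metadata.name}} completed successfully."
  trigger.on-rollout-completed: |
    - when: rollout.status.phase == 'Healthy'
      send: [rollout-completed]
```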
My Production Setup
Here’s my actual Rollout configuration:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 5
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        traefik:
          weightedTraefikServiceName: api-weighted
      steps:
        # Phase 1: Smoke test
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: smoke-test
        # Phase 2: Limited exposure
        - setWeight: 25
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: api
        # Phase 3: Majority traffic
        - setWeight: 75
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: api
        # Phase 4: Full rollout
        - setWeight: 100
  rollbackWindow:
    revisions: 2
```
Key decisions:
- Multiple analysis phases — Early smoke test, then metric-based validation
- Increasing pause durations — More time at higher traffic percentages
- Rollback window — Can quickly revert to last 2 versions
When Not to Use Progressive Delivery
Progressive delivery adds complexity. Skip it when:
- Breaking database schema changes — You need the whole app on one version
- Single-user applications — No meaningful traffic to split
- Simple internal tools — The overhead isn’t worth it
- Tight coupling between services — When services must upgrade together
For most production services handling real users, progressive delivery is worth the investment.
Why This Matters
Every deployment is a controlled experiment. You’re testing the hypothesis that your new code works in production.
Progressive delivery makes that experiment safer:
- Smaller blast radius — Problems affect fewer users
- Faster detection — Automated analysis catches issues early
- Instant recovery — One command reverts to known-good state
This is resilience in practice. Not hoping deployments succeed, but designing systems that gracefully handle when they don’t.
The best deployment strategy isn’t the one that never fails — it’s the one that minimizes damage when failure happens. Progressive delivery limits your blast radius and gives you time to react.
