Every deployment is a risk. The question isn’t whether something will go wrong — it’s how much damage it will cause when it does.

Traditional Kubernetes Deployments are all-or-nothing. You push a new version, the rolling update replaces every pod, and within minutes 100% of your traffic hits the new code. If there’s a bug, everyone sees it. If the service crashes, all users are affected.

Progressive delivery changes this equation. Instead of deploying to everyone at once, you gradually shift traffic to the new version, validating at each step. If something goes wrong, only a fraction of users are affected.

Argo Rollouts brings progressive delivery to Kubernetes as a near drop-in replacement for Deployments: the spec mirrors a Deployment’s, with the strategy section replaced.

Why Progressive Delivery?

Consider what happens with a standard Deployment during a bug release:

Time 0:00 - Deploy new version
Time 0:02 - All pods running new version
Time 0:05 - Errors start appearing
Time 0:08 - Alerts fire
Time 0:15 - Engineer investigates
Time 0:25 - Rollback initiated
Time 0:27 - All pods back to old version

Blast radius: 100% of users for ~25 minutes

With progressive delivery:

Time 0:00 - Deploy new version (5% traffic)
Time 0:05 - Automated analysis detects errors
Time 0:06 - Automatic rollback

Blast radius: 5% of users for ~6 minutes

This is resilience. Not preventing failures, but limiting their impact.

Two Strategies: Canary vs Blue-Green

Argo Rollouts supports multiple strategies. The two most common:

Canary

Traffic shifts gradually from old to new version:

Step 1:  5% new, 95% old   (test the waters)
Step 2: 20% new, 80% old   (expand if healthy)
Step 3: 50% new, 50% old   (halfway point)
Step 4: 100% new, 0% old   (full rollout)

Best for: Stateless services, high-traffic applications where you want gradual validation.

Blue-Green

Two complete environments, instant switch:

Before:  100% blue (old)    0% green (new)
Deploy:  100% blue          green ready, not receiving traffic
Switch:    0% blue        100% green

Best for: Services requiring instant rollback, database migrations, when you need both versions running simultaneously for testing.

Installing Argo Rollouts

kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

For GitOps with ArgoCD, add it as an Application:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argo-rollouts
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://argoproj.github.io/argo-helm
    chart: argo-rollouts
    targetRevision: 2.35.1
    helm:
      values: |
        dashboard:
          enabled: true
  destination:
    server: https://kubernetes.default.svc
    namespace: argo-rollouts
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Canary Rollout Example

Replace your Deployment with a Rollout:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:v2.0.0
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 20
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100

This creates a roughly 15-minute gradual rollout:

  1. Send 5% of traffic to new version, wait 5 minutes
  2. If healthy, increase to 20%, wait 5 minutes
  3. If healthy, increase to 50%, wait 5 minutes
  4. Complete rollout to 100%

At any point, aborting the rollout instantly returns traffic to the stable version.

Traffic Management

By default, Argo Rollouts approximates traffic weight by scaling replica counts, which limits granularity (with 10 replicas, the smallest step is one pod, i.e. 10%). For precise traffic control, integrate with your ingress:

With Traefik

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      trafficRouting:
        traefik:
          weightedTraefikServiceName: my-app-weighted
      steps:
        - setWeight: 10
        - pause: { duration: 2m }
        - setWeight: 50
        - pause: { duration: 2m }

With supporting TraefikService:

apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: my-app-weighted
spec:
  weighted:
    services:
      - name: my-app-stable
        port: 80
        weight: 100  # Managed by Argo Rollouts
      - name: my-app-canary
        port: 80
        weight: 0    # Managed by Argo Rollouts
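The canaryService and stableService referenced in the Rollout are ordinary Kubernetes Services you create yourself; the controller rewrites their selectors to target the right ReplicaSet. A minimal sketch (port numbers are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-stable
spec:
  selector:
    app: my-app  # the controller adds a pod-template-hash selector for the stable ReplicaSet
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-canary
spec:
  selector:
    app: my-app  # retargeted at the canary ReplicaSet during a rollout
  ports:
    - port: 80
      targetPort: 8080
```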

With Nginx Ingress

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      trafficRouting:
        nginx:
          stableIngress: my-app-ingress

With Istio

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: my-app-vsvc
            routes:
              - primary

Blue-Green Rollout

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:v2.0.0
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
      postPromotionAnalysis:
        templates:
          - templateName: load-test

This creates:

  • my-app-active: Points to current production version
  • my-app-preview: Points to new version for testing

The new version is deployed but receives no production traffic until you promote it.
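The activeService and previewService are likewise plain Kubernetes Services; the controller retargets their selectors as the rollout progresses. A minimal sketch (port numbers are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-active
spec:
  selector:
    app: my-app  # kept pointed at the current production ReplicaSet
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview
spec:
  selector:
    app: my-app  # retargeted at the new ReplicaSet while it awaits promotion
  ports:
    - port: 80
      targetPort: 8080
```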

Automated Analysis

The real power of progressive delivery is automated rollback. Argo Rollouts can analyze metrics during rollout and abort if something goes wrong.

Analysis Template

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))

This template:

  • Queries Prometheus every minute
  • Checks that the success rate is >= 95%
  • Fails the analysis (aborting the rollout) after 3 failed measurements

Using Analysis in Rollout

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: my-app
        - setWeight: 50
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: my-app

Now the rollout:

  1. Shifts 5% traffic
  2. Waits 2 minutes
  3. Runs analysis — if it fails, automatic rollback
  4. If analysis passes, shifts to 50%
  5. Runs analysis again
  6. Completes rollout
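Instead of repeating an inline analysis step after every weight change, the same template can also run continuously in the background for the rest of the rollout. A sketch (startingStep: 2 is an assumption about when monitoring should begin):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      # Background analysis: starts at step 2 and runs for the
      # remainder of the rollout, alongside the weight shifts.
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
        args:
          - name: service-name
            value: my-app
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - setWeight: 50
        - pause: { duration: 5m }
```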

Multiple Analysis Metrics

Combine multiple checks:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: comprehensive-check
spec:
  args:
    - name: service-name
  metrics:
    # HTTP success rate
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))

    # P99 latency
    - name: latency-p99
      interval: 1m
      successCondition: result[0] < 0.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m])) by (le))

    # Error rate
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))

Job-Based Analysis

For non-Prometheus checks (smoke tests, integration tests):

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test
spec:
  metrics:
    - name: smoke-test
      provider:
        job:
          spec:
            backoffLimit: 1
            template:
              spec:
                containers:
                  - name: smoke
                    image: curlimages/curl
                    command:
                      - /bin/sh
                      - -c
                      - |
                        curl -f http://my-app-canary/health || exit 1
                        curl -f http://my-app-canary/api/status || exit 1
                restartPolicy: Never
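Argo Rollouts also ships a web metric provider that polls an HTTP endpoint directly, with no Job to schedule. A sketch (the /health endpoint and its JSON response shape are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: health-check
spec:
  metrics:
    - name: health-check
      interval: 30s
      failureLimit: 2
      successCondition: result == "ok"
      provider:
        web:
          url: http://my-app-canary/health
          jsonPath: "{$.status}"  # result holds the value extracted here
```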

Dashboard and CLI

Monitor rollouts with the Argo Rollouts kubectl plugin:

# Install plugin (macOS amd64 shown; pick the binary for your OS and architecture)
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-darwin-amd64
chmod +x kubectl-argo-rollouts-darwin-amd64
sudo mv kubectl-argo-rollouts-darwin-amd64 /usr/local/bin/kubectl-argo-rollouts

# Watch rollout progress
kubectl argo rollouts get rollout my-app -w

# Manually promote (if autoPromotion is disabled)
kubectl argo rollouts promote my-app

# Abort and rollback
kubectl argo rollouts abort my-app

# View dashboard
kubectl argo rollouts dashboard

The dashboard shows real-time traffic distribution and analysis status.

Integration with GitOps

When using ArgoCD, Rollouts work seamlessly. Update the image tag in Git, ArgoCD syncs, and the progressive rollout begins.

# In your GitOps repo
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: my-app:v2.1.0  # Update this line

The semantic versioning pipeline creates the tag, which triggers an update to the GitOps repo, which ArgoCD syncs, which starts the Rollout.
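That update step is typically a one-line substitution in CI. A sketch (hypothetical file name and tag; in a real pipeline the manifest lives in the cloned GitOps repo):

```shell
#!/bin/sh
# Hypothetical CI step: bump the Rollout image tag in the GitOps repo.
set -eu
NEW_TAG="v2.1.0"

# Stand-in for the manifest in the GitOps repo.
cat > rollout.yaml <<'EOF'
          image: my-app:v2.0.0
EOF

# Rewrite the tag in place (the .bak suffix keeps BSD and GNU sed both happy).
sed -i.bak "s|image: my-app:v[0-9.]*|image: my-app:${NEW_TAG}|" rollout.yaml
grep "image: my-app:${NEW_TAG}" rollout.yaml
```

In practice this is followed by a git commit and push; ArgoCD picks up the change on its next sync.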

Notifications

Get notified about rollout events:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  annotations:
    notifications.argoproj.io/subscribe.on-rollout-completed.slack: my-channel
    notifications.argoproj.io/subscribe.on-rollout-aborted.slack: my-channel

Configure the notification controller separately (shares configuration with ArgoCD notifications).
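The controller reads its configuration from a ConfigMap (with a companion Secret holding tokens) in its own namespace. A sketch of the Slack wiring (template text is an assumption):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
  namespace: argo-rollouts
data:
  service.slack: |
    token: $slack-token  # resolved from argo-rollouts-notification-secret
  template.rollout-completed: |
    message: "Rollout {{.rollout.metadata.name}} completed successfully."
  trigger.on-rollout-completed: |
    - send: [rollout-completed]
```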

My Production Setup

Here’s my actual Rollout configuration:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 5
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        traefik:
          weightedTraefikServiceName: api-weighted
      steps:
        # Phase 1: Smoke test
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: smoke-test

        # Phase 2: Limited exposure
        - setWeight: 25
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: api

        # Phase 3: Majority traffic
        - setWeight: 75
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: api

        # Phase 4: Full rollout
        - setWeight: 100
      rollbackWindow:
        revisions: 2

Key decisions:

  • Multiple analysis phases — Early smoke test, then metric-based validation
  • Increasing pause durations — More time at higher traffic percentages
  • Rollback window — Reverting to either of the last 2 revisions is fast-tracked, skipping the canary steps

When Not to Use Progressive Delivery

Progressive delivery adds complexity. Skip it when:

  • Breaking database schema changes — You need the whole app on one version
  • Single-user applications — No meaningful traffic to split
  • Simple internal tools — The overhead isn’t worth it
  • Tight coupling between services — When services must upgrade together

For most production services handling real users, progressive delivery is worth the investment.

Why This Matters

Every deployment is a controlled experiment. You’re testing the hypothesis that your new code works in production.

Progressive delivery makes that experiment safer:

  • Smaller blast radius — Problems affect fewer users
  • Faster detection — Automated analysis catches issues early
  • Instant recovery — One command reverts to known-good state

This is resilience in practice. Not hoping deployments succeed, but designing systems that gracefully handle when they don’t.


The best deployment strategy isn’t the one that never fails — it’s the one that minimizes damage when failure happens. Progressive delivery limits your blast radius and gives you time to react.