Your cluster looks healthy. Pods are running. Metrics are green. Everything works.

Until a node fails during peak traffic. Or the database connection pool exhausts. Or that one service nobody remembers deploying starts consuming all available memory.

You can wait for these things to happen in production at 3 AM. Or you can break things intentionally, on your terms, and fix the weaknesses before they become outages.

This is chaos engineering.

Why Break Things on Purpose?

Distributed systems fail in distributed ways. You can’t anticipate every failure mode by reading code or drawing architecture diagrams. The only way to know how your system behaves under failure is to make it fail.

Chaos engineering is not:

  • Random destruction
  • Testing in production without preparation
  • Breaking things just to see what happens, with no hypothesis behind it

Chaos engineering is:

  • Hypothesis-driven experimentation — “We believe the system will handle node failure gracefully”
  • Controlled fault injection — Known blast radius, defined duration, rollback ready
  • Learning from failures — Every experiment improves understanding

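Netflix pioneered this with Chaos Monkey, a tool that randomly terminates production instances so that engineers are forced to build services that survive instance loss. The practice has since matured into dedicated tooling and repeatable methods for the kind of hypothesis-driven experiments listed above.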

The Chaos Engineering Process

flowchart TD
    subgraph process["Chaos Engineering Cycle"]
        H["Define Hypothesis"]
        E["Design Experiment"]
        R["Run in Controlled Environment"]
        O["Observe & Measure"]
        L["Learn & Improve"]
    end

    H --> E --> R --> O --> L
    L --> H

1. Define Hypothesis

Start with what you believe to be true:

  • “If one node fails, pods will reschedule within 5 minutes”
  • “If the database becomes slow, the API will degrade gracefully, not crash”
  • “If DNS fails, cached responses will serve requests for 30 seconds”

2. Design Experiment

Define:

  • Blast radius: What can be affected?
  • Duration: How long will the fault persist?
  • Rollback: How do you stop immediately?
  • Metrics: What indicates success or failure?
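
One way to keep these four decisions from staying implicit is to write them down as data before anything is injected. The sketch below is a hypothetical Python structure, not part of Litmus; the class name `ExperimentPlan` and its fields are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentPlan:
    """Hypothetical record of an experiment design (not a Litmus type)."""
    hypothesis: str
    blast_radius: str          # what can be affected
    duration_seconds: int      # how long the fault persists
    rollback: str              # how to stop immediately
    metrics: list[str] = field(default_factory=list)  # success/failure signals

    def missing_fields(self) -> list[str]:
        """Return the names of design decisions that are still empty."""
        missing = []
        if not self.hypothesis.strip():
            missing.append("hypothesis")
        if not self.blast_radius.strip():
            missing.append("blast_radius")
        if self.duration_seconds <= 0:
            missing.append("duration_seconds")
        if not self.rollback.strip():
            missing.append("rollback")
        if not self.metrics:
            missing.append("metrics")
        return missing


plan = ExperimentPlan(
    hypothesis="If one node fails, pods will reschedule within 5 minutes",
    blast_radius="one worker node in staging",
    duration_seconds=60,
    rollback="",  # not decided yet, so the plan is not ready to run
    metrics=["pod reschedule time", "HTTP error rate"],
)
print(plan.missing_fields())
```

An empty list from `missing_fields()` would be a reasonable gate before moving on to step 3.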

3. Run in Controlled Environment

Start small:

  1. Development environment first
  2. Then staging
  3. Production only with safeguards

4. Observe and Measure

Watch your observability stack. Key questions:

  • Did alerts fire correctly?
  • How did latency change?
  • Did error rates spike?
  • How long until recovery?
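
The last question, time to recovery, can be computed from whatever health samples you already collect. This is a minimal sketch, assuming a time-ordered list of `(timestamp_seconds, healthy)` pairs; the function name and the sample data are made up for illustration.

```python
from typing import Optional


def recovery_seconds(samples: list[tuple[float, bool]]) -> Optional[float]:
    """Time from the first unhealthy sample to the next healthy one.

    `samples` is a time-ordered list of (timestamp_seconds, healthy) pairs.
    Returns None if the system never became unhealthy or never recovered.
    """
    failed_at = None
    for timestamp, healthy in samples:
        if failed_at is None:
            if not healthy:
                failed_at = timestamp
        elif healthy:
            return timestamp - failed_at
    return None


# Hypothetical probe results: first failure at t=10, healthy again at t=25.
samples = [(0, True), (5, True), (10, False), (15, False), (20, False), (25, True)]
print(recovery_seconds(samples))
```

The result is only as precise as the sampling interval, so a 5-second probe cannot confirm a 2-second recovery.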

5. Learn and Improve

Document findings. Fix weaknesses. Update runbooks. Then design the next experiment.

Litmus Chaos: Kubernetes-Native Chaos

Litmus is a CNCF project providing chaos engineering for Kubernetes. It’s declarative, GitOps-friendly, and comprehensive.

Installing Litmus

# Add Litmus helm repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install Litmus
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace \
  --set portal.frontend.service.type=ClusterIP

Litmus Architecture

flowchart TD
    subgraph litmus["Litmus Components"]
        subgraph control["Control Plane"]
            Portal["Litmus Portal<br/>(UI/API)"]
            Server["Litmus Server"]
        end
        subgraph exec["Execution Plane"]
            Sub["Subscriber"]
            Runner["Chaos Runner"]
            Exp["Chaos Experiments"]
        end
    end

    Portal --> Server
    Server --> Sub
    Sub --> Runner
    Runner --> Exp
    Exp --> Target["Target Workload"]

ChaosEngine: Declaring Experiments

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'

This deletes nginx pods every 10 seconds for 30 seconds total. With FORCE set to 'false', pods are terminated gracefully rather than force-deleted. Simple, contained, observable.
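
As a back-of-the-envelope check on those two values, the helper below estimates how many deletion rounds to expect. It is a sketch that assumes one round per interval; Litmus's actual loop timing may differ slightly, and the function name is hypothetical.

```python
def deletion_rounds(total_chaos_duration: int, chaos_interval: int) -> int:
    """Rough number of pod-delete rounds, assuming one round per interval."""
    if total_chaos_duration <= 0 or chaos_interval <= 0:
        raise ValueError("duration and interval must be positive")
    # Assumption: at least one round runs even if the interval exceeds the duration.
    return max(1, total_chaos_duration // chaos_interval)


print(deletion_rounds(30, 10))
```

Comparing that count with your replica count and pod startup time gives a first guess at whether the service should stay available throughout.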

Essential Chaos Experiments

Pod-Level Chaos

Pod Delete: Kill pods to test restart behavior

experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: '60'
          - name: CHAOS_INTERVAL
            value: '10'
          - name: PODS_AFFECTED_PERC
            value: '50'  # Kill 50% of pods

Container Kill: Kill specific containers within pods

experiments:
  - name: container-kill
    spec:
      components:
        env:
          - name: TARGET_CONTAINER
            value: 'sidecar'
          - name: CHAOS_INTERVAL
            value: '10'

Node-Level Chaos

Node Drain: Simulate node maintenance

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain-chaos
spec:
  engineState: 'active'
  auxiliaryAppInfo: ''
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: TARGET_NODE
              value: 'worker-1'

Node CPU Hog: Simulate CPU pressure

experiments:
  - name: node-cpu-hog
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: '60'
          - name: NODE_CPU_CORE
            value: '2'  # Consume 2 cores

Network Chaos

Pod Network Loss: Simulate network partitions

experiments:
  - name: pod-network-loss
    spec:
      components:
        env:
          - name: NETWORK_INTERFACE
            value: 'eth0'
          - name: NETWORK_PACKET_LOSS_PERCENTAGE
            value: '100'  # Total loss
          - name: TOTAL_CHAOS_DURATION
            value: '30'

Pod Network Latency: Inject latency

experiments:
  - name: pod-network-latency
    spec:
      components:
        env:
          - name: NETWORK_LATENCY
            value: '300'  # 300ms latency
          - name: JITTER
            value: '100'  # 100ms jitter

Storage Chaos

Disk Fill: Test disk pressure handling

experiments:
  - name: disk-fill
    spec:
      components:
        env:
          - name: FILL_PERCENTAGE
            value: '90'
          - name: TOTAL_CHAOS_DURATION
            value: '60'

Running Your First Experiment

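Let’s run a complete chaos experiment against a sample application. The steps assume Litmus is installed and that the pod-delete ChaosExperiment resource (available from ChaosHub) already exists in the default namespace, because the ChaosEngine only references it by name.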

1. Deploy Test Application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  labels:
    app: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
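            periodSeconds: 5
---
# Service so the probe in step 3 can reach the pods at demo-app.default.svc
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  selector:
    app: demo
  ports:
    - port: 80
      targetPort: 80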

2. Create ServiceAccount for Litmus

apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: litmus-admin
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log", "events"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
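  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  # The chaos runner launches each experiment as a Job
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "get", "list"]
  # Litmus custom resources read and written during the run
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]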
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: litmus-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: litmus-admin
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: default

3. Run Pod Delete Experiment

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: demo-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=demo'
    appkind: 'deployment'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
        probe:
          - name: "check-endpoint"
            type: "httpProbe"
            httpProbe/inputs:
              url: "http://demo-app.default.svc:80"
              insecureSkipVerify: false
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: "Continuous"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 2

4. Observe Results

# Watch chaos engine status
kubectl get chaosengine demo-chaos -w

# Check chaos result
kubectl get chaosresult demo-chaos-pod-delete -o yaml

# Watch pod behavior
kubectl get pods -l app=demo -w
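
If you would rather check the outcome from a script than read YAML, the sketch below pulls the phase and verdict out of a ChaosResult. It assumes the fields sit under `status.experimentStatus` in the JSON form of the resource; verify that against the output of your Litmus version. The function name and sample document are hypothetical.

```python
import json


def summarize_chaos_result(raw_json: str) -> tuple[str, str]:
    """Return (phase, verdict) from a ChaosResult serialized as JSON.

    Assumes the fields live under status.experimentStatus; missing fields
    are reported as "Unknown" instead of raising.
    """
    result = json.loads(raw_json)
    status = result.get("status", {}).get("experimentStatus", {})
    return status.get("phase", "Unknown"), status.get("verdict", "Unknown")


# Hypothetical, trimmed ChaosResult used only to exercise the function.
sample = json.dumps({
    "kind": "ChaosResult",
    "metadata": {"name": "demo-chaos-pod-delete"},
    "status": {"experimentStatus": {"phase": "Completed", "verdict": "Pass"}},
})
phase, verdict = summarize_chaos_result(sample)
print(f"phase={phase} verdict={verdict}")
```

In practice the input would come from the same chaosresult command shown above, with `-o json` in place of `-o yaml`.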

Game Days: Practicing for Real Incidents

Chaos experiments test systems. Game days test people.

Planning a Game Day

  1. Define scope: Which systems, which failure scenarios
  2. Assemble team: Engineers who would respond to real incidents
  3. Prepare runbooks: Document expected responses
  4. Set success criteria: What does “handled well” look like?
  5. Schedule time: Dedicated time, not squeezed between meetings

Game Day Structure

09:00 - Briefing: Explain scope, rules, how to abort
09:15 - Inject failure (team doesn't know which one)
09:15 - Team responds as if real incident
10:00 - Failure revealed, additional scenarios introduced
11:00 - Debrief: What worked, what didn't, what to improve

Example Scenarios

| Scenario | What You Learn |
| --- | --- |
| Kill primary database | Failover speed, data integrity |
| DNS outage | Caching behavior, timeout handling |
| Certificate expires | Monitoring gaps, renewal process |
| Cloud region unavailable | Multi-region readiness |
| Secrets manager down | Application behavior without secrets |

Integrating with GitOps

Chaos experiments can be managed with ArgoCD:

# chaos-experiments/pod-delete.yaml (in Git)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: scheduled-pod-delete
  namespace: chaos-testing
spec:
  engineState: 'stop'  # Manually activated
  # ... experiment config

Workflow:

  1. Define experiments in Git
  2. Sync with ArgoCD
  3. Activate an experiment by changing engineState to 'active' in Git and syncing again
  4. Review results
  5. Commit findings and improvements

Safety Guidelines

Do

  • Start with non-production environments
  • Have rollback procedures ready
  • Monitor closely during experiments
  • Communicate with stakeholders
  • Document everything

Don’t

  • Run experiments during high-traffic periods (unless testing that specifically)
  • Inject chaos without alerting the team
  • Skip the hypothesis step
  • Ignore findings

Abort Conditions

Define when to stop immediately:

  • Error rate exceeds X%
  • Latency exceeds Y seconds
  • Customer complaints received
  • Any unexpected behavior
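
A probe can enforce the first of these automatically:

# Stop the experiment when the 5xx error ratio reaches 10%
probe:
  - name: "error-rate-check"
    type: "promProbe"
    promProbe/inputs:
      endpoint: "http://prometheus:9090"
      # Ratio of 5xx responses to all responses, not the raw 5xx rate
      query: "sum(rate(http_requests_total{status=~'5..'}[1m])) / sum(rate(http_requests_total[1m]))"
      comparator:
        type: "float"
        criteria: "<"
        value: "0.1"  # Probe passes only while the error ratio stays below 10%
    mode: "Continuous"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
      stopOnFailure: true  # Without this, a failed probe only affects the verdict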
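
For conditions that no single probe covers, the same checks can be expressed in a small script that a human or a watchdog runs during the experiment. This is only a sketch: the `AbortThresholds` type, the function name, and the example limits are hypothetical, and the inputs are assumed to come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortThresholds:
    """Hypothetical limits; pick values that match your own SLOs."""
    max_error_ratio: float       # e.g. 0.10 for 10%
    max_latency_seconds: float


def abort_reasons(error_ratio: float, latency_seconds: float,
                  customer_complaints: int,
                  limits: AbortThresholds) -> list[str]:
    """Return every abort condition currently met (empty list = keep going)."""
    reasons = []
    if error_ratio > limits.max_error_ratio:
        reasons.append("error rate above threshold")
    if latency_seconds > limits.max_latency_seconds:
        reasons.append("latency above threshold")
    if customer_complaints > 0:
        reasons.append("customer complaints received")
    return reasons


limits = AbortThresholds(max_error_ratio=0.10, max_latency_seconds=2.0)
print(abort_reasons(0.02, 0.4, 0, limits))   # nothing tripped
print(abort_reasons(0.15, 0.4, 1, limits))   # two conditions tripped
```

The fourth condition, "any unexpected behavior", is deliberately left to human judgment; it is hard to encode and usually the most important one.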

Building a Chaos Culture

Chaos engineering is not a tool. It’s a practice.

Start small:

  1. Run one experiment in development
  2. Document what you learn
  3. Share with the team
  4. Gradually increase scope

Questions to answer:

  • What happens when a node fails?
  • How does the system behave under memory pressure?
  • What if network latency increases 10x?
  • Can we lose a database replica without downtime?

Every experiment that reveals a weakness is a production outage prevented.

My Chaos Experiments

In my homelab cluster, I regularly run:

  1. Weekly pod delete: Random pods in non-critical namespaces
  2. Monthly node drain: Simulate node failure
  3. Quarterly full game day: Multi-failure scenarios

What I’ve learned:

  • Longhorn handles node failures gracefully
  • Some workloads don’t set an appropriate terminationGracePeriodSeconds
  • Alerting often fires too late
  • Recovery is always slower than expected

Why This Matters

Production will test your systems. The only question is whether you test them first.

Chaos engineering turns “we think it’s resilient” into “we know it survives X, Y, and Z failures.” It transforms hope into evidence.

The systems that survive real outages are the ones that practiced for them.


Break things intentionally, on your terms, in daylight. Or wait for them to break accidentally, at 3 AM, during the busiest day of the year. Your choice.