Chaos Engineering: Breaking Your Cluster to Make It Stronger

My dashboard is a wall of green. Pods running, replicas matched, CPU comfortable, no alerts firing. I look at it and feel that small dopamine hit of “everything is fine.” And for the most part, it is fine. The cluster has been up for weeks. Nothing has fallen over.

That green wall is also the most dangerous thing in my homelab, because it tells me nothing about what happens when something goes wrong. It only tells me that, right now, nothing has.

So picture the other version of that morning. A node dies during the one moment of the day when actual load matters. A connection pool quietly exhausts itself. Some service I deployed eight months ago and forgot about starts eating memory until the kernel OOM-killer steps in. None of that shows up on a healthy dashboard. It shows up at 3 AM, when I’m asleep and the only person debugging is future-me, sleep-deprived and angry.

There is another way to find out. I can break things myself, on a Tuesday afternoon, with a coffee in hand and a rollback ready. That’s the whole pitch for chaos engineering: I’d rather discover my weaknesses in daylight, on my terms, than have production discover them for me.

What Breaking Things on Purpose Actually Means

Distributed systems fail in distributed ways. You can read every line of code and draw the prettiest architecture diagram in the world and still have no idea how the thing behaves when a node drops mid-request. The behaviour under failure lives in the gaps between components, and the only way to see it is to cause the failure and watch.

This connects straight to one of the values I keep coming back to on this blog: I want to understand what I run. A green dashboard is a kind of trust, and I don’t fully trust things I haven’t seen fail. Chaos engineering is how I turn “I think it’s resilient” into something I’ve actually watched happen.

A quick clarification, because the name scares people. Chaos engineering does not mean randomly destroying things to see what happens, and it does not mean yeeting fault injection into prod with no plan. Every experiment has three things attached to it:

A hypothesis. “If one node fails, pods reschedule within five minutes.” You’re testing a belief, not gambling.
A controlled blast radius. Known target, defined duration, a rollback you can hit instantly.
Something you learn. Every run either confirms the belief or hands you a bug you didn’t know about. Both are wins.

Netflix made this famous with Chaos Monkey, randomly killing instances so engineers had no choice but to build services that survived it. The tooling has come a long way since then, but the core idea is the same: don’t wait for the failure, schedule it.

The Chaos Engineering Process

flowchart TD
    subgraph process["Chaos Engineering Cycle"]
        H["Define Hypothesis"]
        E["Design Experiment"]
        R["Run in Controlled Environment"]
        O["Observe & Measure"]
        L["Learn & Improve"]
    end

    H --> E --> R --> O --> L
    L --> H

1. Define Hypothesis

Write down what you believe is true before you touch anything. The point is that you’re committing to a prediction, so the result can actually surprise you:

“If one node fails, pods will reschedule within 5 minutes”
“If the database becomes slow, the API will degrade gracefully, not crash”
“If DNS fails, cached responses will serve requests for 30 seconds”
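
One way to keep a hypothesis honest is to write it down as data before the run, so the verdict afterwards is mechanical rather than a matter of mood. A minimal Python sketch; the `Hypothesis` class and its field names are my own illustration and not part of Litmus or any other tool:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A prediction committed to before the experiment runs (hypothetical helper)."""
    statement: str       # the belief, in plain words
    metric: str          # what gets measured during the run
    max_allowed: float   # the prediction holds if the observed value is <= this

    def verdict(self, observed: float) -> str:
        # Compare the observation against the threshold chosen up front.
        return "confirmed" if observed <= self.max_allowed else "refuted"


# First hypothesis from the list above: rescheduling within 5 minutes (300 s).
reschedule = Hypothesis(
    statement="If one node fails, pods will reschedule within 5 minutes",
    metric="seconds_until_all_pods_ready",
    max_allowed=300.0,
)

print(reschedule.verdict(observed=212.0))  # confirmed
print(reschedule.verdict(observed=431.0))  # refuted
```

The useful part is the frozen threshold: once the number is written down, a 431-second recovery can't be quietly reinterpreted as "close enough."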

2. Design Experiment

Now scope it. Four questions:

Blast radius: What can be affected?
Duration: How long will the fault persist?
Rollback: How do you stop immediately?
Metrics: What indicates success or failure?

The rollback question is the one people skip, and it’s the one that keeps a chaos experiment from becoming an actual incident.
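
The four questions can be turned into a pre-flight check that refuses to call a plan ready while any answer is missing. This is a sketch under my own naming (`ExperimentPlan` and `problems` are hypothetical, not a Litmus API):

```python
from dataclasses import dataclass


@dataclass
class ExperimentPlan:
    """Answers to the four scoping questions (hypothetical structure)."""
    blast_radius: str       # what can be affected
    duration_seconds: int   # how long the fault persists
    rollback: str           # how to stop immediately
    metrics: list[str]      # what indicates success or failure

    def problems(self) -> list[str]:
        """Return the reasons this plan is not ready to run (empty list = ready)."""
        found = []
        if not self.blast_radius.strip():
            found.append("blast radius is undefined")
        if self.duration_seconds <= 0:
            found.append("duration must be a positive number of seconds")
        if not self.rollback.strip():
            found.append("no rollback: nothing stops the fault early")
        if not self.metrics:
            found.append("no metrics: success and failure look the same")
        return found


# A plan with the rollback left blank, which is the answer people tend to skip.
plan = ExperimentPlan(
    blast_radius="pods labelled app=demo in namespace default",
    duration_seconds=30,
    rollback="",
    metrics=["http 5xx ratio", "p95 latency"],
)
print(plan.problems())
```

Returning a list of problems instead of raising on the first one means a sloppy plan gets all its gaps reported in one pass.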

3. Run in Controlled Environment

Start small. Dev first, then staging, and only then production, and even then only with safeguards. I broke this rule exactly once and learned why it exists.

4. Observe and Measure

This is where your observability stack earns its keep. If you can’t see what happened, you didn’t run an experiment, you just caused damage. Watch for the obvious things: did the right alerts fire, did latency move, did error rates spike, and how long did recovery actually take versus what you guessed.
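
Comparing the guess with the measured recovery time is simple arithmetic once you have timestamped health samples. A small sketch, assuming you can export (timestamp, healthy) pairs from whatever you monitor with; the function name and sample data are illustrative:

```python
from typing import Optional


def recovery_seconds(samples: list[tuple[float, bool]]) -> Optional[float]:
    """Time from the first unhealthy sample to the next healthy one.

    `samples` is a time-ordered list of (timestamp_seconds, healthy) pairs.
    Returns None if the system never went unhealthy or never recovered.
    """
    failed_at = None
    for timestamp, healthy in samples:
        if failed_at is None and not healthy:
            failed_at = timestamp           # start of the outage
        elif failed_at is not None and healthy:
            return timestamp - failed_at    # first healthy sample after it
    return None


# Health checked every 5 seconds; the fault lands at t=10.
samples = [(0.0, True), (5.0, True), (10.0, False), (15.0, False), (20.0, False), (25.0, True)]
guess = 10.0
measured = recovery_seconds(samples)
print(f"guessed {guess}s, measured {measured}s")
```

The resolution is only as good as the sampling interval, so a 5-second scrape can be off by up to 5 seconds at each end.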

5. Learn and Improve

Write down what you found, fix the weak spot, update the runbook so the next person (probably you, six months from now) isn’t starting from scratch. Then design the next experiment.

Litmus Chaos: Kubernetes-Native Chaos

Litmus is a CNCF project that does chaos engineering the Kubernetes way: you declare experiments as YAML and apply them like anything else. That declarative angle is exactly why I picked it for the homelab. The experiment lives in Git next to the workload it’s attacking, which means it’s inspectable, reviewable, and reproducible instead of being a clever command someone ran once and never wrote down.

Installing Litmus

# Add Litmus helm repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install Litmus
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace \
  --set portal.frontend.service.type=ClusterIP

Litmus Architecture

flowchart TD
    subgraph litmus["Litmus Components"]
        subgraph control["Control Plane"]
            Portal["Litmus Portal<br/>(UI/API)"]
            Server["Litmus Server"]
        end
        subgraph exec["Execution Plane"]
            Sub["Subscriber"]
            Runner["Chaos Runner"]
            Exp["Chaos Experiments"]
        end
    end

    Portal --> Server
    Server --> Sub
    Sub --> Runner
    Runner --> Exp
    Exp --> Target["Target Workload"]

ChaosEngine: Declaring Experiments

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'

This kills nginx pods every 10 seconds for 30 seconds total. Small, contained, and easy to watch, which is exactly where you want to start.
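
As a rough mental model of how those two values interact (my simplification, not a description of Litmus internals), the run works out to about duration divided by interval rounds of deletion:

```python
def deletion_rounds(total_chaos_duration: int, chaos_interval: int) -> list[int]:
    """Approximate second offsets at which a round of pod deletion starts.

    Simplified model: one round at t=0, then one every `chaos_interval`
    seconds until `total_chaos_duration` is used up.
    """
    if total_chaos_duration <= 0 or chaos_interval <= 0:
        raise ValueError("duration and interval must be positive")
    return list(range(0, total_chaos_duration, chaos_interval))


print(deletion_rounds(30, 10))  # [0, 10, 20]
```

Three rounds in thirty seconds only tests anything if a replacement pod becomes ready inside each 10-second gap, so it is worth comparing the interval with your readiness probe timings before running it.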

The Experiments Worth Knowing

You don’t need all of these on day one. But it helps to know the menu, because each one probes a different assumption you’re quietly making about your cluster.

Pod-Level Chaos

Pod Delete: Kill pods to test restart behavior

experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: '60'
          - name: CHAOS_INTERVAL
            value: '10'
          - name: PODS_AFFECTED_PERC
            value: '50'  # Kill 50% of pods

Container Kill: Kill specific containers within pods

experiments:
  - name: container-kill
    spec:
      components:
        env:
          - name: TARGET_CONTAINER
            value: 'sidecar'
          - name: CHAOS_INTERVAL
            value: '10'

Node-Level Chaos

Node Drain: Simulate node maintenance

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain-chaos
spec:
  engineState: 'active'
  auxiliaryAppInfo: ''
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: TARGET_NODE
              value: 'worker-1'

Node CPU Hog: Simulate CPU pressure

experiments:
  - name: node-cpu-hog
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: '60'
          - name: NODE_CPU_CORE
            value: '2'  # Consume 2 cores

Network Chaos

Pod Network Loss: Simulate network partitions

experiments:
  - name: pod-network-loss
    spec:
      components:
        env:
          - name: NETWORK_INTERFACE
            value: 'eth0'
          - name: NETWORK_PACKET_LOSS_PERCENTAGE
            value: '100'  # Total loss
          - name: TOTAL_CHAOS_DURATION
            value: '30'

Pod Network Latency: Inject latency

experiments:
  - name: pod-network-latency
    spec:
      components:
        env:
          - name: NETWORK_LATENCY
            value: '300'  # 300ms latency
          - name: JITTER
            value: '100'  # 100ms jitter
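
Before injecting latency it helps to work out whether the callers can tolerate it at all, so the experiment has a prediction attached. A back-of-the-envelope sketch, assuming jitter means plus or minus that many milliseconds around the injected delay and that each sequential hop pays the delay once; the 40 ms baseline and 500 ms timeout are made-up numbers:

```python
def worst_case_ms(base_ms: float, injected_ms: float, jitter_ms: float, hops: int = 1) -> float:
    """Upper bound on request latency when every hop gets the injected delay."""
    return base_ms + hops * (injected_ms + jitter_ms)


def fits_timeout(timeout_ms: float, base_ms: float, injected_ms: float,
                 jitter_ms: float, hops: int = 1) -> bool:
    """True if the worst case still fits inside the caller's timeout."""
    return worst_case_ms(base_ms, injected_ms, jitter_ms, hops) < timeout_ms


# 300 ms latency with 100 ms jitter, as in the experiment above.
for hops in (1, 2):
    total = worst_case_ms(base_ms=40, injected_ms=300, jitter_ms=100, hops=hops)
    print(hops, total, fits_timeout(500, 40, 300, 100, hops))
```

With these numbers a single hop squeaks in under a 500 ms timeout and a two-hop call chain does not, which is exactly the kind of prediction the experiment then confirms or refutes.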

Storage Chaos

Disk Fill: Test disk pressure handling

experiments:
  - name: disk-fill
    spec:
      components:
        env:
          - name: FILL_PERCENTAGE
            value: '90'
          - name: TOTAL_CHAOS_DURATION
            value: '60'

Running Your First Experiment

Enough menu-reading. Here’s a full experiment against a throwaway app, start to finish, so you can copy it and actually watch something break.

1. Deploy Test Application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  labels:
    app: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
---
# The httpProbe in step 3 calls demo-app.default.svc, so the app needs a Service
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  selector:
    app: demo
  ports:
    - port: 80
      targetPort: 80

2. Create ServiceAccount for Litmus

apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: litmus-admin
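# Check these rules against the RBAC published with the experiment for your Litmus version
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log", "events"]
    verbs: ["create", "delete", "deletecollection", "get", "list", "patch", "update", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  # The chaos runner launches each experiment as a Job
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "deletecollection", "get", "list", "watch"]
  # Engine, experiment, and result objects are read and updated during the run
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]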
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: litmus-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: litmus-admin
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: default

3. Run Pod Delete Experiment

The engine only references the experiment by name, so the pod-delete ChaosExperiment resource has to exist in the same namespace before you apply this (install it from ChaosHub for your Litmus version).

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: demo-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=demo'
    appkind: 'deployment'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
        probe:
          - name: "check-endpoint"
            type: "httpProbe"
            httpProbe/inputs:
              url: "http://demo-app.default.svc:80"
              insecureSkipVerify: false
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: "Continuous"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 2
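
To make the probe settings less abstract, here is a sketch of what a continuous check with retries amounts to. It is a mental model, not Litmus code: I am assuming `retry` means extra attempts after the first one, timing (`interval`, `probeTimeout`) is left out, and the endpoint is faked with a scripted list of status codes:

```python
from typing import Callable, Iterable


def scripted(codes: Iterable[int]) -> Callable[[], int]:
    """Fake endpoint for the demo: returns the given status codes in order."""
    remaining = iter(codes)
    return lambda: next(remaining)


def probe_once(check: Callable[[], int], retry: int, expected: int = 200) -> bool:
    """One evaluation: the first attempt plus up to `retry` further attempts."""
    for _ in range(retry + 1):
        try:
            if check() == expected:
                return True
        except Exception:
            pass  # a timeout or refused connection counts as a failed attempt
    return False


def success_ratio(check: Callable[[], int], evaluations: int, retry: int) -> float:
    """Fraction of evaluations that passed across the whole chaos window."""
    passed = sum(probe_once(check, retry) for _ in range(evaluations))
    return passed / evaluations


# Four evaluations; the third one fails all three of its attempts.
check = scripted([200, 503, 503, 200, 503, 503, 503, 200])
print(success_ratio(check, evaluations=4, retry=2))  # 0.75
```

Retries hide short blips by design, so a probe that passes with retries can still sit on top of a noticeable number of failed individual requests.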

4. Observe Results

# Watch chaos engine status
kubectl get chaosengine demo-chaos -w

# Check chaos result
kubectl get chaosresult demo-chaos-pod-delete -o yaml

# Watch pod behavior
kubectl get pods -l app=demo -w

Game Days: Practicing for Real Incidents

Here’s the part the YAML can’t fix. A chaos experiment tells you whether the system survives. It says nothing about whether the humans behind it know what to do when the pager goes off. The first time anyone touches a runbook should not be during a real outage, and a game day is how you avoid that.

So you take the same controlled-failure idea and aim it at the team instead of the cluster.

Planning a Game Day

Define scope: Which systems, which failure scenarios
Assemble team: Engineers who would respond to real incidents
Prepare runbooks: Document expected responses
Set success criteria: What does “handled well” look like?
Schedule time: Dedicated time, not squeezed between meetings

Game Day Structure

09:00 - Briefing: Explain scope, rules, how to abort
09:15 - Inject failure (team doesn't know which one)
09:15 - Team responds as if real incident
10:00 - Failure revealed, additional scenarios introduced
11:00 - Debrief: What worked, what didn't, what to improve

Example Scenarios

Scenario	What You Learn
Kill primary database	Failover speed, data integrity
DNS outage	Caching behavior, timeout handling
Certificate expires	Monitoring gaps, renewal process
Cloud region unavailable	Multi-region readiness
Secrets manager down	Application behavior without secrets

Keeping It in Git

Since the experiments are just YAML, they belong in version control with everything else. I keep mine managed with ArgoCD, defined but dormant, ready to activate when I want to run them:

# chaos-experiments/pod-delete.yaml (in Git)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: scheduled-pod-delete
  namespace: chaos-testing
spec:
  engineState: 'stop'  # Manually activated
  # ... experiment config

The flow is the same one I use for everything else: define experiments in Git, sync with ArgoCD, flip engineState to active when I want one to run, read the results, then commit whatever fixes or findings came out of it. The experiment and its history live in the same place as the workload it tests, which is the whole point.
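
Flipping the state is a one-line change, which makes it easy to script so the activation itself ends up as a commit. A sketch that edits the manifest as plain text, so it needs nothing beyond the standard library; the helper name is hypothetical, and I am assuming the change is committed and synced rather than applied by hand:

```python
import re

# Matches the engineState line, with or without quotes around the value.
ENGINE_STATE = re.compile(r"^([ \t]*engineState:[ \t]*)(['\"]?)(active|stop)\2", re.MULTILINE)


def set_engine_state(manifest: str, state: str) -> str:
    """Return the manifest text with engineState set to `state`."""
    if state not in ("active", "stop"):
        raise ValueError("engineState must be 'active' or 'stop'")
    updated, count = ENGINE_STATE.subn(
        lambda m: f"{m.group(1)}{m.group(2)}{state}{m.group(2)}", manifest
    )
    if count != 1:
        raise ValueError(f"expected exactly one engineState line, found {count}")
    return updated


manifest = "spec:\n  engineState: 'stop'  # Manually activated\n"
print(set_engine_state(manifest, "active"))
```

Refusing to touch a file with zero or several engineState lines is deliberate: a silent no-op here would look like an experiment that ran and found nothing.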

Staying Safe While Doing This

I want to be honest about the trade-off here, because chaos engineering done carelessly is just an outage you caused yourself. The discipline is what separates the two.

A few rules I actually follow. Start in non-production and stay there until you trust the experiment. Have the rollback ready before you start, not after things go sideways. Watch the experiment live instead of kicking it off and walking away. Tell anyone who might get paged that you’re about to do this. And write down what you found, because an undocumented experiment is just chaos with extra steps.

The mistakes that bite people are equally predictable: running during your actual peak window when you weren’t trying to test peak behaviour, injecting failure without warning the on-call, skipping the hypothesis so you can’t tell signal from noise, and then ignoring the finding because the cluster recovered “well enough.”

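Most important: decide your abort conditions up front. Error rate past some threshold, latency past some ceiling, real users complaining, or honestly any behaviour you didn’t predict. Litmus can enforce some of this for you: a probe that runs throughout the experiment marks the run as failed when its check stops holding, and with stopOnFailure set it halts the experiment instead of letting it run to the end:

# Abort on high error rate
probe:
  - name: "error-rate-check"
    type: "promProbe"
    promProbe/inputs:
      endpoint: "http://prometheus:9090"
      # Ratio of 5xx responses to all responses over the last minute
      query: "sum(rate(http_requests_total{status=~'5..'}[1m])) / sum(rate(http_requests_total[1m]))"
      comparator:
        type: "float"
        criteria: "<"
        value: "0.1"  # Probe passes while fewer than 10% of requests fail
    mode: "Continuous"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
      stopOnFailure: true  # Stop the experiment as soon as the probe fails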
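
For the conditions a probe can't cover, or when I am watching by hand, the same thresholds can live in a few lines of code. A minimal sketch with hypothetical names and limits, where the error figure is a ratio of failed to total requests rather than a raw per-second rate:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortLimits:
    max_error_ratio: float      # e.g. 0.1 means 10% of requests failing
    max_p95_latency_ms: float   # latency ceiling for the 95th percentile


def tripped_conditions(errors: int, total: int, p95_latency_ms: float,
                       limits: AbortLimits) -> list[str]:
    """Return the abort conditions that are tripped; an empty list means carry on."""
    tripped = []
    error_ratio = errors / total if total else 0.0
    if error_ratio > limits.max_error_ratio:
        tripped.append(f"error ratio {error_ratio:.2f} > {limits.max_error_ratio:.2f}")
    if p95_latency_ms > limits.max_p95_latency_ms:
        tripped.append(f"p95 latency {p95_latency_ms:.0f}ms > {limits.max_p95_latency_ms:.0f}ms")
    return tripped


limits = AbortLimits(max_error_ratio=0.1, max_p95_latency_ms=500)
print(tripped_conditions(errors=3, total=100, p95_latency_ms=180, limits=limits))
print(tripped_conditions(errors=15, total=100, p95_latency_ms=900, limits=limits))
```

The limits are fixed before the run for the same reason the hypothesis is: thresholds chosen mid-experiment tend to drift toward whatever the graph is currently showing.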

Why Not Everyone Already Does This

If chaos engineering is so obviously useful, why isn’t it everywhere? The obstacles are real and worth naming, otherwise the whole thing sounds like an easy sell that it isn’t.

The big one is fear, and it’s rational fear. Deliberately breaking a system you’re responsible for feels insane the first time, especially if you’ve never seen it fail safely. “It works” is a genuinely powerful argument against “let’s go break it on purpose.” There’s also the time cost. Designing a real experiment, watching it, documenting it, fixing what it found, none of that is free, and it competes with shipping features. And there’s a culture problem: in a blameful team, causing a failure on purpose is career risk, even when it’s the responsible thing to do.

None of those go away by ignoring them. They go away by starting small enough that the fear has nowhere to stand.

How I Actually Run This in the Homelab

So here’s the achievable version, the one I run on my own cluster. Nothing heroic. A weekly pod-delete against random pods in non-critical namespaces, a monthly node drain to rehearse losing a machine, and roughly once a quarter a proper game day with a couple of stacked failures.
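
The "random pods in non-critical namespaces" part is the only bit with any logic in it. A sketch of the selection, with made-up namespace and pod names and a hypothetical helper; in practice the pod list would come from the cluster:

```python
import random
from typing import Optional


def pick_targets(pods: list[tuple[str, str]], critical_namespaces: set[str],
                 count: int, seed: Optional[int] = None) -> list[tuple[str, str]]:
    """Pick up to `count` random (namespace, pod) pairs outside critical namespaces."""
    rng = random.Random(seed)  # a fixed seed makes a given week's pick reproducible
    candidates = sorted(p for p in pods if p[0] not in critical_namespaces)
    return rng.sample(candidates, min(count, len(candidates)))


# Illustrative inventory only.
pods = [
    ("kube-system", "coredns-abc"),
    ("longhorn-system", "longhorn-manager-xyz"),
    ("blog", "hugo-1"),
    ("blog", "hugo-2"),
    ("media", "jellyfin-0"),
]
critical = {"kube-system", "longhorn-system"}
print(pick_targets(pods, critical, count=2, seed=7))
```

Filtering by an explicit deny-list keeps the blast radius decision in one reviewable place instead of scattered across individual experiment files.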

That tiny cadence has already paid for itself. A few things it taught me:

Longhorn handles a node disappearing far more calmly than I expected, which is reassuring to know rather than hope.
Several of my workloads had a useless terminationGracePeriodSeconds and were getting hard-killed mid-request without me ever noticing in normal operation.
My alerting consistently fires later than it should. Good to find that on a Tuesday instead of during a real incident.
Recovery always takes longer than my optimistic mental model says it will.

Every one of those was a future 3 AM page that I traded for an afternoon of mild, controlled discomfort. That’s the deal chaos engineering offers, and after running it for a while I think it’s a very good deal.

Production is going to test your systems whether you like it or not. Running the test yourself first is the only version where you get to pick the time, hold the rollback, and learn something instead of just surviving.

Break things on purpose, in daylight, with a coffee and a rollback. The alternative is letting them break by themselves at 3 AM on the busiest day of the year. One of those mornings is a lot better than the other.
