Your cluster looks healthy. Pods are running. Metrics are green. Everything works.
Until a node fails during peak traffic. Or the database connection pool exhausts. Or that one service nobody remembers deploying starts consuming all available memory.
You can wait for these things to happen in production at 3 AM. Or you can break things intentionally, on your terms, and fix the weaknesses before they become outages.
This is chaos engineering.
Why Break Things on Purpose?
Distributed systems fail in distributed ways. You can’t anticipate every failure mode by reading code or drawing architecture diagrams. The only way to know how your system behaves under failure is to make it fail.
Chaos engineering is not:
- Random destruction
- Testing in production without preparation
- Breaking things to see what happens
Chaos engineering is:
- Hypothesis-driven experimentation — “We believe the system will handle node failure gracefully”
- Controlled fault injection — Known blast radius, defined duration, rollback ready
- Learning from failures — Every experiment improves understanding
Netflix pioneered this with Chaos Monkey — randomly killing instances to ensure engineers built resilient services. Today, the practice has matured with sophisticated tools and methodologies.
The Chaos Engineering Process
flowchart TD
subgraph process["Chaos Engineering Cycle"]
H["Define Hypothesis"]
E["Design Experiment"]
R["Run in Controlled Environment"]
O["Observe & Measure"]
L["Learn & Improve"]
end
H --> E --> R --> O --> L
L --> H
1. Define Hypothesis
Start with what you believe to be true:
- “If one node fails, pods will reschedule within 5 minutes”
- “If the database becomes slow, the API will degrade gracefully, not crash”
- “If DNS fails, cached responses will serve requests for 30 seconds”
2. Design Experiment
Define:
- Blast radius: What can be affected?
- Duration: How long will the fault persist?
- Rollback: How do you stop immediately?
- Metrics: What indicates success or failure?
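One way to keep these four decisions from being skipped is to write them down in a small structure before anything runs. The sketch below is illustrative only: `ExperimentPlan` and its field names are hypothetical and not part of Litmus or any other tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentPlan:
    """Hypothetical record of the design decisions for one experiment."""
    hypothesis: str
    blast_radius: str        # what is allowed to be affected
    duration_seconds: int    # how long the fault persists
    rollback: str            # how to stop immediately
    metrics: List[str] = field(default_factory=list)  # signals that decide pass/fail

    def missing_fields(self) -> List[str]:
        """Return the names of design decisions that are still empty."""
        missing = []
        if not self.hypothesis.strip():
            missing.append("hypothesis")
        if not self.blast_radius.strip():
            missing.append("blast_radius")
        if self.duration_seconds <= 0:
            missing.append("duration_seconds")
        if not self.rollback.strip():
            missing.append("rollback")
        if not self.metrics:
            missing.append("metrics")
        return missing

    def is_ready(self) -> bool:
        """A plan is ready only when every decision has been filled in."""
        return not self.missing_fields()


plan = ExperimentPlan(
    hypothesis="If one node fails, pods will reschedule within 5 minutes",
    blast_radius="demo namespace, one worker node",
    duration_seconds=60,
    rollback="set engineState to 'stop'",
    metrics=["pod reschedule time", "HTTP error rate"],
)
print(plan.is_ready())
```

A check like `plan.missing_fields()` could run in review or CI so that an experiment without a rollback or without metrics never reaches a cluster.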
3. Run in Controlled Environment
Start small:
- Development environment first
- Then staging
- Production only with safeguards
4. Observe and Measure
Watch your observability stack. Key questions:
- Did alerts fire correctly?
- How did latency change?
- Did error rates spike?
- How long until recovery?
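Two of these questions, the error spike and the time to recovery, can be answered from the same series of measurements. The following is a minimal sketch assuming you can export `(seconds since injection, error rate)` samples from your monitoring system; the function names and the sample data are made up for illustration.

```python
from typing import List, Optional, Tuple

# (seconds since fault injection, error rate between 0.0 and 1.0)
Sample = Tuple[float, float]


def peak_error_rate(samples: List[Sample]) -> float:
    """Highest error rate seen during the experiment (expects a non-empty list)."""
    return max(rate for _, rate in samples)


def recovery_time(samples: List[Sample], threshold: float) -> Optional[float]:
    """Timestamp from which the error rate stays at or below the threshold.

    Returns None when the final sample is still above the threshold,
    meaning the system had not recovered by the end of the window.
    """
    recovered_at = None
    for timestamp, rate in samples:
        if rate > threshold:
            recovered_at = None        # still failing, discard any earlier streak
        elif recovered_at is None:
            recovered_at = timestamp   # first healthy sample of the current streak
    return recovered_at


samples = [(0, 0.0), (10, 0.4), (20, 0.25), (30, 0.02), (40, 0.0)]
print(peak_error_rate(samples))
print(recovery_time(samples, threshold=0.05))
```

Comparing the recovery time against the number in your hypothesis turns "it felt fine" into a pass or fail.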
5. Learn and Improve
Document findings. Fix weaknesses. Update runbooks. Then design the next experiment.
Litmus Chaos: Kubernetes-Native Chaos
Litmus is a CNCF project providing chaos engineering for Kubernetes. It’s declarative, GitOps-friendly, and comprehensive.
Installing Litmus
# Add Litmus helm repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
# Install Litmus
helm install litmus litmuschaos/litmus \
--namespace litmus \
--create-namespace \
--set portal.frontend.service.type=ClusterIP
Litmus Architecture
flowchart TD
subgraph litmus["Litmus Components"]
subgraph control["Control Plane"]
Portal["Litmus Portal<br/>(UI/API)"]
Server["Litmus Server"]
end
subgraph exec["Execution Plane"]
Sub["Subscriber"]
Runner["Chaos Runner"]
Exp["Chaos Experiments"]
end
end
Portal --> Server
Server --> Sub
Sub --> Runner
Runner --> Exp
Exp --> Target["Target Workload"]
ChaosEngine: Declaring Experiments
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
This kills nginx pods every 10 seconds for 30 seconds total. Simple, contained, observable.
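To make the relationship between the two timing values concrete, here is a simplified model of when deletion rounds would start. It assumes a round begins at the start of every interval until the total duration is used up; the real timing in Litmus also depends on how long pods take to terminate and restart, so treat this as arithmetic, not as a description of the runner's internals.

```python
from typing import List


def deletion_offsets(total_chaos_duration: int, chaos_interval: int) -> List[int]:
    """Offsets (in seconds) at which a deletion round starts, under the simplified model."""
    if total_chaos_duration <= 0 or chaos_interval <= 0:
        raise ValueError("duration and interval must be positive")
    return list(range(0, total_chaos_duration, chaos_interval))


# Values from the ChaosEngine in this section: 30 s total, 10 s interval
print(deletion_offsets(30, 10))
```

Under this model the example produces three rounds, which is a useful sanity check before raising the duration or shrinking the interval.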
Essential Chaos Experiments
Pod-Level Chaos
Pod Delete: Kill pods to test restart behavior
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: '60'
          - name: CHAOS_INTERVAL
            value: '10'
          - name: PODS_AFFECTED_PERC
            value: '50' # Kill 50% of pods
Container Kill: Kill specific containers within pods
experiments:
  - name: container-kill
    spec:
      components:
        env:
          - name: TARGET_CONTAINER
            value: 'sidecar'
          - name: CHAOS_INTERVAL
            value: '10'
Node-Level Chaos
Node Drain: Simulate node maintenance
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain-chaos
spec:
  engineState: 'active'
  auxiliaryAppInfo: ''
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: TARGET_NODE
              value: 'worker-1'
Node CPU Hog: Simulate CPU pressure
experiments:
  - name: node-cpu-hog
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: '60'
          - name: NODE_CPU_CORE
            value: '2' # Consume 2 cores
Network Chaos
Pod Network Loss: Simulate network partitions
experiments:
  - name: pod-network-loss
    spec:
      components:
        env:
          - name: NETWORK_INTERFACE
            value: 'eth0'
          - name: NETWORK_PACKET_LOSS_PERCENTAGE
            value: '100' # Total loss
          - name: TOTAL_CHAOS_DURATION
            value: '30'
Pod Network Latency: Inject latency
experiments:
  - name: pod-network-latency
    spec:
      components:
        env:
          - name: NETWORK_LATENCY
            value: '300' # 300ms latency
          - name: JITTER
            value: '100' # 100ms jitter
Storage Chaos
Disk Fill: Test disk pressure handling
experiments:
  - name: disk-fill
    spec:
      components:
        env:
          - name: FILL_PERCENTAGE
            value: '90'
          - name: TOTAL_CHAOS_DURATION
            value: '60'
Running Your First Experiment
Let’s run a complete chaos experiment against a sample application.
1. Deploy Test Application
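apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  labels:
    app: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
---
# Service so the probe URL used in step 3 (demo-app.default.svc) resolves
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  selector:
    app: demo
  ports:
    - port: 80
      targetPort: 80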
2. Create ServiceAccount for Litmus
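apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: litmus-admin
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log", "events"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  # The chaos runner launches experiments as Jobs
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "get", "list"]
  # Access to the Litmus custom resources themselves
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: litmus-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: litmus-admin
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: default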
3. Run Pod Delete Experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: demo-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=demo'
    appkind: 'deployment'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
        probe:
          - name: "check-endpoint"
            type: "httpProbe"
            httpProbe/inputs:
              url: "http://demo-app.default.svc:80"
              insecureSkipVerify: false
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: "Continuous"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 2
4. Observe Results
# Watch chaos engine status
kubectl get chaosengine demo-chaos -w
# Check chaos result
kubectl get chaosresult demo-chaos-pod-delete -o yaml
# Watch pod behavior
kubectl get pods -l app=demo -w
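If you want the verdict in a script instead of reading YAML by eye, the ChaosResult can be fetched as JSON and summarized. This is a sketch under assumptions: the field names (`status.experimentStatus.phase`, `verdict`, `probeSuccessPercentage`) reflect my understanding of the ChaosResult schema and should be checked against the actual output in your cluster, and the sample document below is hand-written, not captured from a real run.

```python
import json
from typing import Any, Dict


def summarize_chaos_result(raw_json: str) -> Dict[str, Any]:
    """Extract the headline fields from a ChaosResult serialized as JSON.

    Missing fields are reported as "Unknown" rather than raising, because
    a result that is still in progress may not carry all of them yet.
    """
    result = json.loads(raw_json)
    status = result.get("status", {}).get("experimentStatus", {})
    return {
        "phase": status.get("phase", "Unknown"),
        "verdict": status.get("verdict", "Unknown"),
        "probe_success": status.get("probeSuccessPercentage", "Unknown"),
    }


# Trimmed, hand-written sample; a real ChaosResult carries many more fields.
sample = json.dumps({
    "status": {
        "experimentStatus": {
            "phase": "Completed",
            "verdict": "Pass",
            "probeSuccessPercentage": "100",
        }
    }
})
print(summarize_chaos_result(sample))
```

In practice the input would come from `kubectl get chaosresult demo-chaos-pod-delete -o json`, for example captured with `subprocess.run(..., capture_output=True, text=True)`.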
Game Days: Practicing for Real Incidents
Chaos experiments test systems. Game days test people.
Planning a Game Day
- Define scope: Which systems, which failure scenarios
- Assemble team: Engineers who would respond to real incidents
- Prepare runbooks: Document expected responses
- Set success criteria: What does “handled well” look like?
- Schedule time: Dedicated time, not squeezed between meetings
Game Day Structure
09:00 - Briefing: Explain scope, rules, how to abort
09:15 - Inject failure (team doesn't know which one)
09:15 - Team responds as if real incident
10:00 - Failure revealed, additional scenarios introduced
11:00 - Debrief: What worked, what didn't, what to improve
Example Scenarios
| Scenario | What You Learn |
|---|---|
| Kill primary database | Failover speed, data integrity |
| DNS outage | Caching behavior, timeout handling |
| Certificate expires | Monitoring gaps, renewal process |
| Cloud region unavailable | Multi-region readiness |
| Secrets manager down | Application behavior without secrets |
Integrating with GitOps
Chaos experiments can be managed with ArgoCD:
# chaos-experiments/pod-delete.yaml (in Git)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: scheduled-pod-delete
  namespace: chaos-testing
spec:
  engineState: 'stop' # Manually activated
  # ... experiment config
Workflow:
- Define experiments in Git
- Sync with ArgoCD
- Activate experiments via engineState: 'active'
- Review results
- Commit findings and improvements
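The activation step comes down to changing one field. The sketch below builds a `kubectl patch` command that flips `engineState` on a live ChaosEngine; the helper name is hypothetical. Note the assumption: this only makes sense when ArgoCD is not configured to revert manual changes automatically. With auto-sync and self-heal enabled, you would change the value in Git and let ArgoCD apply it.

```python
import json
from typing import List


def engine_state_patch_command(name: str, namespace: str, state: str) -> List[str]:
    """Build the kubectl command that sets spec.engineState on a ChaosEngine."""
    if state not in ("active", "stop"):
        raise ValueError("engineState must be 'active' or 'stop'")
    patch = json.dumps({"spec": {"engineState": state}})
    return [
        "kubectl", "patch", "chaosengine", name,
        "-n", namespace,
        "--type", "merge",   # JSON merge patch: only the listed field changes
        "-p", patch,
    ]


cmd = engine_state_patch_command("scheduled-pod-delete", "chaos-testing", "active")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. The same helper with `"stop"` doubles as the rollback command during an experiment.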
Safety Guidelines
Do
- Start with non-production environments
- Have rollback procedures ready
- Monitor closely during experiments
- Communicate with stakeholders
- Document everything
Don’t
- Run experiments during high-traffic periods (unless testing that specifically)
- Inject chaos without alerting the team
- Skip the hypothesis step
- Ignore findings
Abort Conditions
Define when to stop immediately:
- Error rate exceeds X%
- Latency exceeds Y seconds
- Customer complaints received
- Any unexpected behavior
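# Abort on high error rate
probe:
  - name: "error-rate-check"
    type: "promProbe"
    promProbe/inputs:
      endpoint: "http://prometheus:9090"
      # Share of requests answered with a 5xx status over the last minute
      query: "sum(rate(http_requests_total{status=~'5..'}[1m])) / sum(rate(http_requests_total[1m]))"
      comparator:
        type: "float"
        criteria: "<"
        value: "0.1" # Probe passes only while fewer than 10% of requests fail
    mode: "Continuous"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 2
      stopOnFailure: true # End the experiment as soon as the probe fails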
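The abort conditions in this section can also be checked outside the cluster, for example in a script that watches a game day. The sketch below is one possible encoding: `AbortThresholds` and `abort_reasons` are hypothetical names, the threshold values are placeholders for your own X and Y, and "unexpected behavior" stays a flag that a human sets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AbortThresholds:
    max_error_rate: float        # X, as a fraction (0.10 means 10%)
    max_latency_seconds: float   # Y


def abort_reasons(
    error_rate: float,
    latency_seconds: float,
    customer_complaints: int,
    unexpected_behavior: bool,
    limits: AbortThresholds,
) -> List[str]:
    """Return every abort condition that is currently met (empty list means continue)."""
    reasons = []
    if error_rate > limits.max_error_rate:
        reasons.append(f"error rate {error_rate:.1%} exceeds {limits.max_error_rate:.1%}")
    if latency_seconds > limits.max_latency_seconds:
        reasons.append(f"latency {latency_seconds}s exceeds {limits.max_latency_seconds}s")
    if customer_complaints > 0:
        reasons.append(f"{customer_complaints} customer complaint(s) received")
    if unexpected_behavior:
        reasons.append("operator reported unexpected behavior")
    return reasons


limits = AbortThresholds(max_error_rate=0.10, max_latency_seconds=2.0)
print(abort_reasons(0.25, 0.8, 0, False, limits))
```

Returning the reasons instead of a bare boolean means the abort message can go straight into the incident channel and the experiment notes.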
Building a Chaos Culture
Chaos engineering is not a tool. It’s a practice.
Start small:
- Run one experiment in development
- Document what you learn
- Share with the team
- Gradually increase scope
Questions to answer:
- What happens when a node fails?
- How does the system behave under memory pressure?
- What if network latency increases 10x?
- Can we lose a database replica without downtime?
Every experiment that reveals a weakness is a production outage prevented.
My Chaos Experiments
In my homelab cluster, I regularly run:
- Weekly pod delete: Random pods in non-critical namespaces
- Monthly node drain: Simulate node failure
- Quarterly full game day: Multi-failure scenarios
What I’ve learned:
- Longhorn handles node failures gracefully
- Some workloads don’t set a proper terminationGracePeriodSeconds
- Alerting often fires too late
- Recovery is always slower than expected
Why This Matters
Production will test your systems. The only question is whether you test them first.
Chaos engineering turns “we think it’s resilient” into “we know it survives X, Y, and Z failures.” It transforms hope into evidence.
The systems that survive real outages are the ones that practiced for them.
Break things intentionally, on your terms, in daylight. Or wait for them to break accidentally, at 3 AM, during the busiest day of the year. Your choice.
