Kubernetes is designed to be self-healing, but what does that actually mean? More importantly: what happens when the components doing the healing themselves fail?
I’ve run Kubernetes clusters through all kinds of failures — planned, unplanned, and “hold my beer” experiments. Here’s what actually happens when things break.
The Components That Can Fail
Before diving into failure scenarios, let’s map out what we’re working with:
Control Plane:
- kube-apiserver: The API that everything talks to
- etcd: The database storing all cluster state
- kube-scheduler: Decides where pods run
- kube-controller-manager: Runs controllers (ReplicaSet, Deployment, etc.)
- cloud-controller-manager: Cloud provider integrations (if applicable)
Node Components:
- kubelet: Manages pods on each node
- kube-proxy: Handles network rules for Services
- Container runtime: Actually runs containers
Scenario 1: API Server Down
The kube-apiserver is the single point through which all Kubernetes API requests flow. What happens when it dies?
Immediate impact:
- kubectl commands fail
- No new deployments or updates possible
- No new pod scheduling
- Existing pods keep running
What keeps working:
- Running pods continue to run
- Containers stay alive
- Network connectivity between pods
- Services continue to route traffic
What breaks:
- No new pods can be created
- Failed pods won’t be replaced
- Horizontal Pod Autoscaler stops working
- No changes to any resources
```mermaid
flowchart TD
  subgraph api_down["API Server Down"]
    subgraph control["Control Plane"]
      etcd["etcd ✓"]
      sched["scheduler<br/>idle"]
      ctrl["controller-mgr<br/>idle"]
      apiX["✗ API Server"]
    end
    control -->|"no updates"| nodes
    subgraph nodes["Worker Nodes"]
      N1["Node 1<br/>Pods ✓"]
      N2["Node 2<br/>Pods ✓"]
      N3["Node 3<br/>Pods ✓"]
    end
  end
  note["Pods keep running - they don't need the API"]
```
This is the key insight: Kubernetes is designed for API server outages. The system degrades gracefully — workloads keep running, you just can’t change anything.
For HA setups, you want multiple API servers behind a load balancer. See Kubernetes High Availability: stacked vs external etcd for architecture options.
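With kubeadm, for example, the usual pattern is to give the cluster a stable, load-balanced endpoint from day one, so API server replicas can come and go behind it. A minimal sketch (the DNS name and port are placeholders for your load balancer):

```yaml
# kubeadm ClusterConfiguration sketch: kubelets and clients talk to the
# load balancer, which fronts multiple kube-apiserver replicas.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "k8s-api.example.com:6443"  # placeholder LB address
```

If you skip this at install time, retrofitting HA later is painful, because every kubeconfig and kubelet points at a single node's address.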
Scenario 2: etcd Down
etcd is the brain of Kubernetes. All cluster state lives here. This is the scariest failure.
Immediate impact:
- API server can’t read or write state
- Effectively same as API server down
- Existing pods keep running
What keeps working:
- Running pods continue to run
- Containers stay alive
- Network connectivity
- Services work
What breaks:
- Same as API server down, plus:
- Risk of state inconsistency on recovery
- Split-brain scenarios in partial failures
```mermaid
flowchart TD
  subgraph etcd_down["etcd Down"]
    subgraph control["Control Plane"]
      etcdX["etcd ✗"] --> apiX["API ✗"]
      apiX --> other["other components"]
    end
  end
  note["Without etcd, API server cannot function<br/>Nodes continue running - they cache their assignments"]
```
etcd failures are why backups are non-negotiable. See etcd Deep Dive for understanding etcd’s role and backup strategies.
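One way to make backups automatic is a CronJob that runs `etcdctl snapshot save` on a control plane node. This is a sketch assuming a kubeadm layout — the cert paths, image tag, and backup directory are assumptions you should adjust for your cluster:

```yaml
# Sketch: periodic etcd snapshot on a kubeadm control plane node.
# Assumes kubeadm's default cert paths; pin the image to your etcd version.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"              # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.12-0   # match your etcd version
            command:
            - /bin/sh
            - -c
            - |
              ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%s).db \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/backups/etcd
              type: DirectoryOrCreate
```

Ship the snapshots off the node too — a backup that lives only on the control plane disappears with it.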
Scenario 3: Scheduler Down
The kube-scheduler decides where pods run. When it’s down:
Immediate impact:
- New pods stay in Pending state
- No scheduling decisions made
What keeps working:
- Existing pods keep running
- API server functions normally
- You can create resources (they just won’t be scheduled)
What breaks:
- New pods can’t be scheduled
- Rescheduling after node failure doesn’t happen
- HPA creates pods that stay Pending
```mermaid
flowchart TD
  subgraph sched_down["Scheduler Down"]
    apply["kubectl apply deployment"] --> api["API accepts"]
    api --> pending["Pod: Pending..."]
    subgraph queue["Pending Queue"]
      web["Pod: web"]
      apiPod["Pod: api"]
      job["Pod: job"]
      waiting["... waiting"]
    end
  end
  note["Nodes have capacity but nobody assigns pods to them"]
```
In HA setups, you run multiple schedulers with leader election. Only one is active, others wait.
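Leader election is on by default, and its knobs live in the scheduler's configuration file. A sketch with the default timings spelled out:

```yaml
# KubeSchedulerConfiguration sketch; the values shown are the defaults.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  leaseDuration: 15s   # how long an acquired lease is valid
  renewDeadline: 10s   # the active scheduler must renew within this window
  retryPeriod: 2s      # how often standbys retry acquiring the lease
```

You can watch the election live with `kubectl get lease kube-scheduler -n kube-system -o yaml` — the `holderIdentity` field shows which replica is currently active.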
Scenario 4: Controller Manager Down
The controller manager runs all the controllers that make Kubernetes “self-healing.”
Immediate impact:
- ReplicaSet controller stops
- Deployment controller stops
- Node controller stops
- All reconciliation loops stop
What keeps working:
- Existing pods keep running
- Scheduling still works
- API server still works
What breaks:
- Failed pods don’t get replaced
- Deployments don’t roll out
- Node failures aren’t handled
- Orphaned resources aren’t cleaned up
```mermaid
flowchart TD
  subgraph ctrl_down["Controller Manager Down"]
    rs["ReplicaSet: 3 desired, 2 running → no action taken"]
    deploy["Deployment: rollout in progress → stuck"]
    node["Node marked NotReady → pods not evicted"]
  end
  note["The 'self-healing' part of Kubernetes stops"]
```
This is where you notice Kubernetes isn’t magic — it’s just software that runs reconciliation loops. Stop the loops, stop the magic.
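That loop is simple enough to sketch. Here is a toy, local-only version of a reconcile cycle — no real API calls, just the control flow a controller like the ReplicaSet controller runs (in the real thing, "create" is an API request for a new pod):

```shell
# Toy reconciliation loop: observe actual state, compare to desired
# state, act to converge. Simulated with plain variables.
desired=3
running=2   # one replica just died

while [ "$running" -lt "$desired" ]; do
  running=$((running + 1))
  echo "created replica ($running/$desired)"
done
echo "reconciled: $running/$desired running"
```

With the controller manager down, this loop simply never runs — the gap between desired and observed state persists until the process comes back.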
Scenario 5: Kubelet Down on a Node
The kubelet is the Kubernetes agent on each node. When it fails:
Immediate impact (on that node):
- Node marked as NotReady after timeout (default ~40 seconds)
- Pods on that node get evicted (after another timeout)
- No new pods scheduled to that node
What keeps working:
- Containers keep running (they don’t need kubelet)
- Network might still work (depends on CNI)
- Other nodes unaffected
What breaks:
- No pod lifecycle management on that node
- Health checks stop
- Resource updates stop
- Eventually pods are rescheduled elsewhere
```mermaid
flowchart LR
  subgraph kubelet_down["Kubelet Down on Node 2"]
    subgraph N1["Node 1 ✓ Ready"]
      P1["pod ✓"]
    end
    subgraph N2["Node 2 ✗ NotReady"]
      P2["pod ?<br/>orphaned"]
    end
    subgraph N3["Node 3 ✓ Ready"]
      P3["pod ✓<br/>rescheduled"]
    end
    P2 -.->|"rescheduled"| P3
  end
  note["After pod-eviction-timeout, pods get rescheduled"]
```
The interesting part: containers keep running even without kubelet. The kubelet manages them, but doesn’t keep them alive.
Scenario 6: Container Runtime Down
If the container runtime (containerd, CRI-O) fails:
Immediate impact:
- Running containers often survive (containerd hands each one to a per-container shim process), though a hard crash can take them down
- New containers can’t start
- Health checks fail
What happens next:
- kubelet detects failures
- Pods marked as Failed
- Pods get rescheduled to other nodes
This is typically a node-level failure that triggers pod eviction.
Scenario 7: Network Partition
Network partitions are the trickiest failures. A node loses connectivity to the control plane but can still run containers.
What happens:
- Node marked NotReady (can’t reach API server)
- Pods eventually evicted
- But they might still be running on the partitioned node
- Potential for “split brain” — a replacement pod starts on a healthy node while the original may still be running on the partitioned one
```mermaid
flowchart LR
  subgraph partition["Network Partition"]
    subgraph control["Control Plane"]
      cp["Node 2 is NotReady<br/>Evicting pods..."]
    end
    control x--x|"✗"| partitioned
    subgraph partitioned["Partitioned Node"]
      pn["I'm fine, running<br/>these pods..."]
    end
  end
  note["Pod 'web-abc123' now runs on Node 1 AND Node 2<br/>Both think they're the real one"]
```
This is why stateful applications need careful handling. Databases with split-brain can corrupt data.
Failure Timeouts to Know
These timeouts affect how fast Kubernetes reacts to failures:
| Timeout | Default | What it does |
|---|---|---|
| `node-monitor-grace-period` | 40s | How long before marking a node NotReady |
| `pod-eviction-timeout` | 5m | How long before evicting pods from a NotReady node |
| `node-monitor-period` | 5s | How often the node controller checks node status |
For faster failover, you can tune these — but beware of false positives during temporary network issues.
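In current Kubernetes versions, taint-based eviction has replaced the old `pod-eviction-timeout` flag, which means eviction speed is tunable per pod rather than cluster-wide: the node controller taints NotReady/unreachable nodes, and each pod's tolerations decide how long it rides out the taint (default 300s). A sketch that cuts that wait to 30s for one workload:

```yaml
# Pod spec fragment: tolerate a NotReady/unreachable node for only 30s
# before being evicted, instead of the default 300s.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30
```

Worst-case detection-to-eviction then becomes roughly `node-monitor-grace-period` (40s) plus `tolerationSeconds` (30s) — about 70 seconds instead of nearly six minutes. The trade-off is the same: a brief network blip now triggers a full reschedule.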
The Blast Radius Principle
Every failure affects a “blast radius”:
| Component | Blast Radius |
|---|---|
| Container | Single container |
| Pod | All containers in pod |
| Kubelet | All pods on node |
| Node | All pods on node |
| Scheduler | New pod scheduling cluster-wide |
| Controller Manager | Self-healing cluster-wide |
| API Server | All management operations |
| etcd | Everything |
Design your HA accordingly. etcd and API server need the most redundancy.
What This Means for You
- Running workloads are resilient: Existing pods survive most control plane failures
- Management operations aren’t: You need control plane HA for continuous deployment
- etcd is the critical path: Protect it, back it up, monitor it
- Failures cascade: API server down → looks like everything is down
- Timeouts matter: Know your failure detection times
The beauty of Kubernetes is that it was designed with failures in mind. The question isn’t whether components will fail — it’s whether you’ve architected for it.
Testing Failures
Don’t wait for production to find out how your cluster behaves. Test:
```shell
# Simulate API server failure (on a test cluster!)
# On kubeadm clusters the control plane runs as static pods, so you
# can't scale a Deployment; move the manifest out of the static-pod
# directory instead — kubelet stops the component until you move it back.
ssh control-plane 'sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/'

# Simulate scheduler failure the same way
ssh control-plane 'sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/'

# Simulate kubelet failure on a node
ssh node-1 'sudo systemctl stop kubelet'
```
Better yet, use chaos engineering tools like Litmus Chaos for controlled experiments.
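With LitmusChaos, for instance, an experiment is declared as a custom resource. The shape below is a rough sketch of a pod-delete experiment — the field names follow the LitmusChaos CRDs, but verify them against the version you install, and the app labels and service account are placeholders:

```yaml
# Sketch of a LitmusChaos pod-delete experiment (verify fields against
# your installed LitmusChaos version; labels/SA are placeholders).
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: web-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=web"        # placeholder: your workload's label
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: "30"
```

The advantage over ad-hoc `kill` commands is that experiments are repeatable, scoped, and can run in CI against a staging cluster.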
Understanding failure modes isn’t pessimism — it’s engineering. Every system fails. The question is whether you designed for it or get surprised by it.
