When something goes wrong in Kubernetes, it’s often etcd. API server timing out? Check etcd. Pods stuck in pending? Might be etcd. Cluster feels slow? Probably etcd.
Yet most Kubernetes operators treat etcd as a black box. It’s just “the database” that runs alongside the control plane. But understanding etcd makes you dramatically better at operating Kubernetes. Let me take you inside.
What is etcd?
etcd is a distributed key-value store. Think of it as a highly reliable dictionary that multiple servers agree on. Kubernetes uses it to store all cluster state: every pod, deployment, secret, configmap, and custom resource lives in etcd.
The name “etcd” comes from “/etc distributed” — the Unix /etc directory holds system configuration, and etcd is that concept distributed across nodes.
Key characteristics:
- Strongly consistent: Reads and writes are linearizable by default, so every client sees the latest committed state
- Highly available: Survives node failures (with quorum)
- Watch-capable: Clients can subscribe to changes
- Ordered: Operations happen in a strict sequence
The Raft Consensus Algorithm
etcd uses Raft to maintain consistency across nodes. Understanding Raft explains most of etcd’s behavior.
Leader Election
In any etcd cluster, one node is the leader. The leader handles all writes. Followers replicate the leader’s log.
flowchart LR
L["Leader<br/>(etcd1)"] -->|replicates| F1["Follower<br/>(etcd2)"]
L -->|replicates| F2["Follower<br/>(etcd3)"]
L -.->|heartbeats| F1
L -.->|heartbeats| F2
The leader sends heartbeats to followers. If a follower doesn’t hear from the leader for an election timeout (1 second by default, tunable), it becomes a candidate and starts an election.
# See who's the leader
etcdctl endpoint status --cluster -w table
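The follower's side of this handshake fits in a few lines. A toy model (the numbers are etcd's common defaults of a 100 ms heartbeat and a 1 s election timeout, but your cluster may be tuned differently):

```python
# Toy model of a follower's election timer, not etcd's real code.
HEARTBEAT_INTERVAL_MS = 100   # how often a healthy leader pings
ELECTION_TIMEOUT_MS = 1000    # how long a follower waits before running

def follower_state(ms_since_last_heartbeat: int) -> str:
    """A follower that misses heartbeats past the election timeout
    becomes a candidate and requests votes."""
    if ms_since_last_heartbeat < ELECTION_TIMEOUT_MS:
        return "follower"
    return "candidate"

print(follower_state(300))   # leader healthy: still a follower
print(follower_state(1500))  # leader silent too long: candidate
```

In the real implementation the timeout is randomized per node, which is what prevents all followers from starting competing elections at the same instant.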
Log Replication
When a write comes in:
1. Client sends write to the leader
2. Leader appends to its log
3. Leader replicates to followers
4. Once a majority confirms, the leader commits
5. Leader responds to the client
This is why etcd needs a majority (quorum) to write. With 3 nodes, you need 2. With 5, you need 3.
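The quorum arithmetic is worth internalizing. A tiny sketch (not etcd's code):

```python
def quorum(cluster_size: int) -> int:
    """Majority needed to commit a write (and to elect a leader)."""
    return cluster_size // 2 + 1

def can_commit(cluster_size: int, acks: int) -> bool:
    """The leader commits once a majority (counting itself) has
    appended the entry to its log."""
    return acks >= quorum(cluster_size)

for n in (3, 5):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {n - quorum(n)} failures")
```

Note that even-sized clusters buy nothing: 4 nodes need a quorum of 3 and tolerate only 1 failure, same as a 3-node cluster.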
Term Numbers
Each election increments the term number. It’s like a logical clock. If a node sees a higher term, it knows a new leader was elected and updates accordingly.
# Check cluster health and terms
etcdctl endpoint health --cluster -w table
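The term rule itself is simple enough to model. A toy sketch (not etcd's actual Raft implementation):

```python
class TermTracker:
    """Toy model of Raft's term rule."""
    def __init__(self):
        self.term = 0
        self.role = "follower"

    def on_message(self, msg_term: int) -> bool:
        """Return True if the incoming message should be accepted."""
        if msg_term < self.term:
            return False            # stale sender: reject outright
        if msg_term > self.term:
            self.term = msg_term    # a newer election has happened
            self.role = "follower"  # step down if we were leading
        return True
```

Even a node that believes it is the leader steps down on seeing a higher term, which is what makes two leaders accepting writes at once impossible.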
Data Model
etcd stores data as key-value pairs in a flat namespace. Kubernetes uses a hierarchical convention:
/registry/pods/default/nginx-abc123
/registry/pods/kube-system/coredns-xyz789
/registry/deployments/default/my-app
/registry/secrets/default/my-secret
Everything is prefixed with /registry/. The Kubernetes API server is really just a translation layer between the Kubernetes API and etcd’s key-value model.
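As a quick illustration, here is a hypothetical helper that splits a namespaced registry key into its parts. It assumes the common /registry/&lt;resource&gt;/&lt;namespace&gt;/&lt;name&gt; layout shown above; cluster-scoped resources have fewer segments:

```python
def parse_registry_key(key: str) -> tuple[str, str, str]:
    """Split a namespaced Kubernetes etcd key into
    (resource, namespace, name). Hypothetical helper for illustration."""
    parts = key.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "registry":
        raise ValueError(f"unexpected key layout: {key}")
    _registry, resource, namespace, name = parts
    return resource, namespace, name

print(parse_registry_key("/registry/pods/default/nginx-abc123"))
```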
Revisions and MVCC
etcd uses Multi-Version Concurrency Control (MVCC). Every modification creates a new revision. Old revisions are kept (for a while), enabling:
- Watch from revision: “Tell me everything that changed since revision 12345”
- Historical reads: “What was the value at revision 12345?”
- Optimistic locking: “Update only if still at revision 12345”
# Get current revision
etcdctl get / --prefix --limit=1 -w json | jq '.header.revision'
# Read a key with its revision
etcdctl get /registry/pods/default/nginx -w json | jq '.kvs[0].mod_revision'
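The revision semantics above can be modeled with a toy in-memory store. This is a sketch of the behavior, not etcd's actual bbolt-backed implementation:

```python
class MVCCStore:
    """Toy MVCC store: every put bumps a global revision, and old
    versions are retained until compacted."""
    def __init__(self):
        self.revision = 0
        self.history = []  # list of (revision, key, value)

    def put(self, key, value):
        self.revision += 1
        self.history.append((self.revision, key, value))
        return self.revision

    def get(self, key, at_revision=None):
        """Latest value of key at or before at_revision (defaults to head)."""
        rev = self.revision if at_revision is None else at_revision
        for r, k, v in reversed(self.history):
            if k == key and r <= rev:
                return v
        return None

    def compact(self, revision):
        """Drop versions older than `revision`, but never a key's newest."""
        latest = {}
        for r, k, _ in self.history:
            latest[k] = r
        self.history = [(r, k, v) for r, k, v in self.history
                        if r >= revision or r == latest[k]]

s = MVCCStore()
s.put("/registry/pods/default/nginx", "v1")
s.put("/registry/pods/default/nginx", "v2")
print(s.get("/registry/pods/default/nginx", at_revision=1))  # "v1"
print(s.get("/registry/pods/default/nginx"))                 # "v2"
```

After `s.compact(2)`, the historical read at revision 1 is gone while the head value survives; this is exactly the trade-off compaction makes in real etcd.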
Why etcd Performance Matters
Kubernetes is chatty with etcd. Every API call, every controller reconciliation, every watch notification involves etcd. A slow etcd means a slow cluster.
Critical Metrics
fsync latency: Time to persist a write to disk (the etcd_disk_wal_fsync_duration_seconds metric). The 99th percentile should stay under 10ms; sustained values over 20ms cause visible problems.
# Check disk latency
etcdctl check perf --load="s"
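If you want a raw probe of the disk outside etcd entirely, a crude sketch like this measures the write-plus-fsync cycle that etcd's WAL performs on every commit. It is not a substitute for watching etcd's own metrics:

```python
import os
import tempfile
import time

def fsync_latency_ms(samples: int = 20, size: int = 4096) -> float:
    """Median time to write and fsync a small block, in milliseconds.
    A rough probe of what etcd's WAL pays per commit."""
    timings = []
    payload = b"x" * size
    with tempfile.NamedTemporaryFile() as f:
        for _ in range(samples):
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
            timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    return timings[len(timings) // 2]

print(f"median fsync: {fsync_latency_ms():.2f} ms")
```

Run it on the filesystem that actually backs etcd's data-dir; a tmpfs or a different volume will give misleadingly fast numbers.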
Raft proposal latency: Time to commit a write through Raft. Should be <50ms.
Watch count: Number of active watches. Kubernetes uses thousands.
Common Performance Problems
Slow disks: etcd writes to disk synchronously. Use SSDs, not HDDs. NVMe if you can.
Network latency: Raft heartbeats and log replication need low latency. Keep etcd nodes on the same network segment.
Too many objects: A cluster with 100k pods puts more pressure on etcd than one with 1k. Watch cardinality explodes.
Large objects: Secrets with huge certificates, ConfigMaps with megabytes of data — all go through etcd.
Backing Up etcd
If etcd data is lost, your cluster is gone. Backup regularly.
# Snapshot backup
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
# Verify the backup
etcdctl snapshot status /backup/etcd-*.db -w table
For automated backups:
#!/bin/bash
# /etc/cron.daily/etcd-backup
BACKUP_DIR=/var/backups/etcd
RETENTION_DAYS=7
mkdir -p "$BACKUP_DIR"
etcdctl snapshot save "$BACKUP_DIR/etcd-$(date +%Y%m%d-%H%M%S).db"
# Cleanup old backups
find "$BACKUP_DIR" -name "etcd-*.db" -mtime +"$RETENTION_DAYS" -delete
Restoring from Backup
Restoring etcd is serious business. You’re resetting cluster state to a point in time.
# Stop etcd on all nodes first
# Restore on each node into a fresh data-dir, using that node's own --name and peer URLs
# (on etcd >= 3.5, the same restore is also available as `etcdutl snapshot restore`)
etcdctl snapshot restore /backup/etcd-snapshot.db \
--name etcd1 \
--initial-cluster etcd1=https://10.0.0.1:2380,etcd2=https://10.0.0.2:2380,etcd3=https://10.0.0.3:2380 \
--initial-cluster-token etcd-cluster-restored \
--initial-advertise-peer-urls https://10.0.0.1:2380 \
--data-dir /var/lib/etcd-restored
# Start etcd with new data-dir
After restore, everything that happened after the snapshot is lost. Pods might reference deleted deployments. Services might point to non-existent endpoints. The cluster will reconcile, but expect some chaos.
Compaction and Defragmentation
etcd keeps old revisions until compacted. Without compaction, the database grows forever.
# Check current database size
etcdctl endpoint status --cluster -w table
# Compact to revision
etcdctl compaction 123456
# Defragment to reclaim space
etcdctl defrag --cluster
Kubernetes auto-compacts etcd (via the kube-apiserver's --etcd-compaction-interval flag, default 5 minutes), but defragmentation is manual.
Watching etcd
Kubernetes controllers work by watching for changes: controllers watch the API server, and the API server in turn watches etcd. When something changes, etcd notifies its watchers immediately.
# Watch all pod changes
etcdctl watch /registry/pods --prefix
# Watch specific namespace
etcdctl watch /registry/pods/default --prefix
This is how the scheduler knows about new pods instantly. How the kubelet knows about pod assignments. How everything in Kubernetes stays synchronized.
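The "watch from revision" contract is what makes this robust: a restarted watcher can replay everything after the last revision it saw instead of re-listing the world. A toy sketch (event shapes simplified; real etcd events carry full key-value pairs):

```python
def replay_since(events, since_revision):
    """Deliver every recorded event newer than since_revision, in order.
    This is what lets a restarted watcher catch up without a full re-list."""
    return [e for e in events if e["revision"] > since_revision]

events = [
    {"revision": 10, "type": "PUT", "key": "/registry/pods/default/a"},
    {"revision": 11, "type": "PUT", "key": "/registry/pods/default/b"},
    {"revision": 12, "type": "DELETE", "key": "/registry/pods/default/a"},
]
print(replay_since(events, since_revision=10))  # the two events after rev 10
```

This is also why compaction matters to watchers: if the revision a client wants to resume from has been compacted away, it gets an error and must fall back to a full re-list.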
Debugging etcd Issues
High Latency
# Check if disk is the problem
etcdctl check perf --load="s"
# Check cluster health
etcdctl endpoint health --cluster -w table
# Check leader changes (too many = instability)
etcdctl endpoint status --cluster -w table
Disk Space Issues
# Check database size
etcdctl endpoint status --cluster -w table
# Force compaction and defrag
REV=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
etcdctl compaction $REV
etcdctl defrag --cluster
Network Issues
# Check if nodes can reach each other
for ep in https://10.0.0.1:2379 https://10.0.0.2:2379 https://10.0.0.3:2379; do
etcdctl --endpoints=$ep endpoint health
done
etcd in Production
My recommendations for production etcd:
Dedicated SSDs: etcd needs fast, consistent disk I/O. Don’t share disks with other workloads.
Monitor relentlessly: etcd_disk_backend_commit_duration_seconds, etcd_network_peer_round_trip_time_seconds, etcd_server_leader_changes_seen_total.
Regular backups: At least daily. Test restores periodically.
Separate network: If possible, put etcd traffic on a dedicated network to avoid contention.
Right-size the cluster: 3 nodes for most workloads. 5 for critical clusters. More than 5 rarely makes sense.
Conclusion
etcd is simple in concept but critical in execution. It’s the foundation everything else builds on. When etcd is healthy, Kubernetes is healthy. When etcd struggles, everything struggles.
The good news: once you understand Raft consensus, quorum, and the importance of disk latency, most etcd problems become diagnosable. It’s not magic — it’s a well-designed distributed system doing exactly what distributed systems do.
Treat your etcd cluster with respect. Back it up. Monitor it. Give it fast disks. And it will quietly keep your cluster running reliably.
etcd is the source of truth. Everything else in Kubernetes is derived state. Protect your source of truth.
