When something goes wrong in Kubernetes, it’s often etcd. API server timing out? Check etcd. Pods stuck in pending? Might be etcd. Cluster feels slow? Probably etcd.
Yet most Kubernetes operators treat etcd as a black box. It’s just “the database” that runs alongside the control plane. But understanding etcd makes you dramatically better at operating Kubernetes. Let me take you inside.
What is etcd?
etcd is a distributed key-value store. Think of it as a highly reliable dictionary that multiple servers agree on. Kubernetes uses it to store all cluster state: every pod, deployment, secret, configmap, and custom resource lives in etcd.
The name “etcd” comes from “/etc distributed” — the Unix /etc directory holds system configuration, and etcd is that concept distributed across nodes.
Key characteristics:
- Strongly consistent: Reads and writes are linearizable by default, so every client sees the latest committed state
- Highly available: Survives node failures (with quorum)
- Watch-capable: Clients can subscribe to changes
- Ordered: Operations happen in a strict sequence
The Raft Consensus Algorithm
etcd uses Raft to maintain consistency across nodes. Understanding Raft explains most of etcd’s behavior.
Leader Election
In any etcd cluster, one node is the leader. The leader handles all writes. Followers replicate the leader’s log.
flowchart LR
L["Leader<br/>(etcd1)"] -->|replicates| F1["Follower<br/>(etcd2)"]
L -->|replicates| F2["Follower<br/>(etcd3)"]
L -.->|heartbeats| F1
L -.->|heartbeats| F2
The leader sends heartbeats to followers. If a follower doesn’t hear from the leader for an election timeout (1 second by default, tunable), it becomes a candidate and starts an election.
# See who's the leader
etcdctl endpoint status --cluster -w table
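The follower's side of this handshake fits in a few lines. A toy model (the numbers are etcd's common defaults of a 100 ms heartbeat and a 1 s election timeout, but your cluster may be tuned differently):

```python
# Toy model of a follower's election timer, not etcd's real code.
HEARTBEAT_INTERVAL_MS = 100   # how often a healthy leader pings
ELECTION_TIMEOUT_MS = 1000    # how long a follower waits before running

def follower_state(ms_since_last_heartbeat: int) -> str:
    """A follower that misses heartbeats past the election timeout
    becomes a candidate and requests votes."""
    if ms_since_last_heartbeat < ELECTION_TIMEOUT_MS:
        return "follower"
    return "candidate"

print(follower_state(300))   # leader healthy: still a follower
print(follower_state(1500))  # leader silent too long: candidate
```

In the real implementation the timeout is randomized per node, which is what prevents all followers from starting competing elections at the same instant.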
Log Replication
When a write comes in:
1. Client sends write to the leader
2. Leader appends to its log
3. Leader replicates to followers
4. Once a majority confirms, the leader commits
5. Leader responds to the client
This is why etcd needs a majority (quorum) to write. With 3 nodes, you need 2. With 5, you need 3.
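The quorum arithmetic is worth internalizing. A tiny sketch (not etcd's code):

```python
def quorum(cluster_size: int) -> int:
    """Majority needed to commit a write (and to elect a leader)."""
    return cluster_size // 2 + 1

def can_commit(cluster_size: int, acks: int) -> bool:
    """The leader commits once a majority (counting itself) has
    appended the entry to its log."""
    return acks >= quorum(cluster_size)

for n in (3, 5):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {n - quorum(n)} failures")
```

Note that even-sized clusters buy nothing: 4 nodes need a quorum of 3 and tolerate only 1 failure, same as a 3-node cluster.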
Term Numbers
Each election increments the term number. It’s like a logical clock. If a node sees a higher term, it knows a new leader was elected and updates accordingly.
# Check cluster health and terms
etcdctl endpoint health --cluster -w table
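The term rule itself is simple enough to model. A toy sketch (not etcd's actual Raft implementation):

```python
class TermTracker:
    """Toy model of Raft's term rule."""
    def __init__(self):
        self.term = 0
        self.role = "follower"

    def on_message(self, msg_term: int) -> bool:
        """Return True if the incoming message should be accepted."""
        if msg_term < self.term:
            return False            # stale sender: reject outright
        if msg_term > self.term:
            self.term = msg_term    # a newer election has happened
            self.role = "follower"  # step down if we were leading
        return True
```

Even a node that believes it is the leader steps down on seeing a higher term, which is what makes two leaders accepting writes at once impossible.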
Data Model
etcd stores data as key-value pairs in a flat namespace. Kubernetes uses a hierarchical convention:
/registry/pods/default/nginx-abc123
/registry/pods/kube-system/coredns-xyz789
/registry/deployments/default/my-app
/registry/secrets/default/my-secret
Everything is prefixed with /registry/. The Kubernetes API server is really just a translation layer between the Kubernetes API and etcd’s key-value model.
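As a quick illustration, here is a hypothetical helper that splits a namespaced registry key into its parts. It assumes the common /registry/&lt;resource&gt;/&lt;namespace&gt;/&lt;name&gt; layout shown above; cluster-scoped resources have fewer segments:

```python
def parse_registry_key(key: str) -> tuple[str, str, str]:
    """Split a namespaced Kubernetes etcd key into
    (resource, namespace, name). Hypothetical helper for illustration."""
    parts = key.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "registry":
        raise ValueError(f"unexpected key layout: {key}")
    _registry, resource, namespace, name = parts
    return resource, namespace, name

print(parse_registry_key("/registry/pods/default/nginx-abc123"))
```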
Revisions and MVCC
etcd uses Multi-Version Concurrency Control (MVCC). Every modification creates a new revision. Old revisions are kept (for a while), enabling:
- Watch from revision: “Tell me everything that changed since revision 12345”
- Historical reads: “What was the value at revision 12345?”
- Optimistic locking: “Update only if still at revision 12345”
# Get current revision
etcdctl get / --prefix --limit=1 -w json | jq '.header.revision'
# Read a key with its revision
etcdctl get /registry/pods/default/nginx -w json | jq '.kvs[0].mod_revision'
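The revision semantics above can be modeled with a toy in-memory store. This is a sketch of the behavior, not etcd's actual bbolt-backed implementation:

```python
class MVCCStore:
    """Toy MVCC store: every put bumps a global revision, and old
    versions are retained until compacted."""
    def __init__(self):
        self.revision = 0
        self.history = []  # list of (revision, key, value)

    def put(self, key, value):
        self.revision += 1
        self.history.append((self.revision, key, value))
        return self.revision

    def get(self, key, at_revision=None):
        """Latest value of key at or before at_revision (defaults to head)."""
        rev = self.revision if at_revision is None else at_revision
        for r, k, v in reversed(self.history):
            if k == key and r <= rev:
                return v
        return None

    def compact(self, revision):
        """Drop versions older than `revision`, but never a key's newest."""
        latest = {}
        for r, k, _ in self.history:
            latest[k] = r
        self.history = [(r, k, v) for r, k, v in self.history
                        if r >= revision or r == latest[k]]

s = MVCCStore()
s.put("/registry/pods/default/nginx", "v1")
s.put("/registry/pods/default/nginx", "v2")
print(s.get("/registry/pods/default/nginx", at_revision=1))  # "v1"
print(s.get("/registry/pods/default/nginx"))                 # "v2"
```

After `s.compact(2)`, the historical read at revision 1 is gone while the head value survives; this is exactly the trade-off compaction makes in real etcd.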
Why etcd Performance Matters
Kubernetes is chatty with etcd. Every API call, every controller reconciliation, every watch notification involves etcd. A slow etcd means a slow cluster.
Critical Metrics
fsync latency: Time to persist a write to disk (the etcd_disk_wal_fsync_duration_seconds metric). The 99th percentile should stay under 10ms; sustained values over 20ms cause visible problems.
# Check disk latency
etcdctl check perf --load="s"
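If you want a raw probe of the disk outside etcd entirely, a crude sketch like this measures the write-plus-fsync cycle that etcd's WAL performs on every commit. It is not a substitute for watching etcd's own metrics:

```python
import os
import tempfile
import time

def fsync_latency_ms(samples: int = 20, size: int = 4096) -> float:
    """Median time to write and fsync a small block, in milliseconds.
    A rough probe of what etcd's WAL pays per commit."""
    timings = []
    payload = b"x" * size
    with tempfile.NamedTemporaryFile() as f:
        for _ in range(samples):
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
            timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    return timings[len(timings) // 2]

print(f"median fsync: {fsync_latency_ms():.2f} ms")
```

Run it on the filesystem that actually backs etcd's data-dir; a tmpfs or a different volume will give misleadingly fast numbers.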
Raft proposal latency: Time to commit a write through Raft. Should be <50ms.
Watch count: Number of active watches. Kubernetes uses thousands.
Common Performance Problems
Slow disks: etcd writes to disk synchronously. Use SSDs, not HDDs. NVMe if you can.
Network latency: Raft heartbeats and log replication need low latency. Keep etcd nodes on the same network segment.
Too many objects: A cluster with 100k pods puts more pressure on etcd than one with 1k. Watch cardinality explodes.
Large objects: Secrets with huge certificates, ConfigMaps with megabytes of data — all go through etcd.
Backing Up etcd
If etcd data is lost, your cluster is gone. Backup regularly.
# Snapshot backup
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
# Verify the backup
etcdctl snapshot status /backup/etcd-*.db -w table
For automated backups:
#!/bin/bash
# /etc/cron.daily/etcd-backup
BACKUP_DIR=/var/backups/etcd
RETENTION_DAYS=7
mkdir -p "$BACKUP_DIR"
etcdctl snapshot save "$BACKUP_DIR/etcd-$(date +%Y%m%d-%H%M%S).db"
# Cleanup old backups
find "$BACKUP_DIR" -name "etcd-*.db" -mtime +"$RETENTION_DAYS" -delete
Restoring from Backup
Restoring etcd is serious business. You’re resetting cluster state to a point in time.
# Stop etcd on all nodes first
# Restore on each node into a fresh data-dir, using that node's own --name and peer URLs
# (on etcd >= 3.5, the same restore is also available as `etcdutl snapshot restore`)
etcdctl snapshot restore /backup/etcd-snapshot.db \
--name etcd1 \
--initial-cluster etcd1=https://10.0.0.1:2380,etcd2=https://10.0.0.2:2380,etcd3=https://10.0.0.3:2380 \
--initial-cluster-token etcd-cluster-restored \
--initial-advertise-peer-urls https://10.0.0.1:2380 \
--data-dir /var/lib/etcd-restored
# Start etcd with new data-dir
After restore, everything that happened after the snapshot is lost. Pods might reference deleted deployments. Services might point to non-existent endpoints. The cluster will reconcile, but expect some chaos.
Compaction and Defragmentation
etcd keeps old revisions until compacted. Without compaction, the database grows forever.
# Check current database size
etcdctl endpoint status --cluster -w table
# Compact to revision
etcdctl compaction 123456
# Defragment to reclaim space
etcdctl defrag --cluster
Kubernetes auto-compacts etcd (via the kube-apiserver's --etcd-compaction-interval flag, default 5 minutes), but defragmentation is manual.
Watching etcd
Kubernetes controllers work by watching for changes: controllers watch the API server, and the API server in turn watches etcd. When something changes, etcd notifies its watchers immediately.
# Watch all pod changes
etcdctl watch /registry/pods --prefix
# Watch specific namespace
etcdctl watch /registry/pods/default --prefix
This is how the scheduler knows about new pods instantly. How the kubelet knows about pod assignments. How everything in Kubernetes stays synchronized.
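The "watch from revision" contract is what makes this robust: a restarted watcher can replay everything after the last revision it saw instead of re-listing the world. A toy sketch (event shapes simplified; real etcd events carry full key-value pairs):

```python
def replay_since(events, since_revision):
    """Deliver every recorded event newer than since_revision, in order.
    This is what lets a restarted watcher catch up without a full re-list."""
    return [e for e in events if e["revision"] > since_revision]

events = [
    {"revision": 10, "type": "PUT", "key": "/registry/pods/default/a"},
    {"revision": 11, "type": "PUT", "key": "/registry/pods/default/b"},
    {"revision": 12, "type": "DELETE", "key": "/registry/pods/default/a"},
]
print(replay_since(events, since_revision=10))  # the two events after rev 10
```

This is also why compaction matters to watchers: if the revision a client wants to resume from has been compacted away, it gets an error and must fall back to a full re-list.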
Debugging etcd Issues
High Latency
# Check if disk is the problem
etcdctl check perf --load="s"
# Check cluster health
etcdctl endpoint health --cluster -w table
# Check leader changes (too many = instability)
etcdctl endpoint status --cluster -w table
Disk Space Issues
# Check database size
etcdctl endpoint status --cluster -w table
# Force compaction and defrag
REV=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
etcdctl compaction $REV
etcdctl defrag --cluster
Network Issues
# Check if nodes can reach each other
for ep in https://10.0.0.1:2379 https://10.0.0.2:2379 https://10.0.0.3:2379; do
etcdctl --endpoints=$ep endpoint health
done
etcd in Production
My recommendations for production etcd:
Dedicated SSDs: etcd needs fast, consistent disk I/O. Don’t share disks with other workloads.
Monitor relentlessly: etcd_disk_backend_commit_duration_seconds, etcd_network_peer_round_trip_time_seconds, etcd_server_leader_changes_seen_total.
Regular backups: At least daily. Test restores periodically.
Separate network: If possible, put etcd traffic on a dedicated network to avoid contention.
Right-size the cluster: 3 nodes for most workloads. 5 for critical clusters. More than 5 rarely makes sense.
Conclusion
etcd is simple in concept but critical in execution. It’s the foundation everything else builds on. When etcd is healthy, Kubernetes is healthy. When etcd struggles, everything struggles.
The good news: once you understand Raft consensus, quorum, and the importance of disk latency, most etcd problems become diagnosable. It’s not magic — it’s a well-designed distributed system doing exactly what distributed systems do.
Treat your etcd cluster with respect. Back it up. Monitor it. Give it fast disks. And it will quietly keep your cluster running reliably.
etcd is the source of truth. Everything else in Kubernetes is derived state. Protect your source of truth.
