How etcd Actually Works: The Heart of Your Kubernetes Cluster

When something goes wrong in Kubernetes, the trail usually leads back to etcd. API server timing out? Check etcd. Pods stuck in Pending? Might be etcd. Cluster feels sluggish? Probably etcd. For a long time I treated etcd the way most operators do: as a black box that hums along next to the control plane. “The database.” You back it up and otherwise leave it alone. But black boxes feel like splinters to me, and the first time an etcd cluster fell over at 2am I realised I had no idea what I was actually looking at. So I learned. And it turns out the whole thing is built on a handful of ideas that, once they click, make most etcd problems diagnosable instead of terrifying. ...

January 27, 2025 · 8 min read · Tom Meurs

Kubernetes High Availability: Stacked vs External etcd Explained

The first “production” Kubernetes cluster I ran had a single control plane node. It hummed along happily for weeks, right up until a disk failed and took the whole thing with it. Every pod, every service, gone. That outage taught me what “single point of failure” actually feels like, and it pushed me toward a question that I keep seeing trip people up: when you build a cluster that survives node loss, do you run etcd on your control plane nodes, or on dedicated nodes of its own? ...

January 15, 2025 · 8 min read · Tom Meurs