Tom Meurs

Graceful Degradation in Kubernetes: What Happens When Components Fail

Everyone repeats the line that Kubernetes is self-healing. Pods die, they come back. Nodes drop, workloads reschedule. The system reconciles itself toward the state you declared, and most days you never have to think about it. Then one day the thing doing the healing is the thing that breaks. The API server is down. etcd won’t respond. The scheduler is wedged. Now what? This is the question I actually care about, because “self-healing” is only useful if I understand where it stops. I want to know what degrades gracefully and what takes the whole cluster down with it. So I’ve put my clusters through a lot of failures: planned, unplanned, and a few “hold my beer” experiments on hardware I didn’t mind losing. Here is what actually happens when each piece breaks, and why most of it matters less than people fear. ...

Kubernetes Network Policies: A Visual Guide to Pod Security

Picture this: an attacker pops a single pod in your cluster, maybe through a vulnerable image or a leaked token. From that one foothold, they can reach every database, every internal API, every secret-fetching sidecar you run. Nothing stops them, because by default nothing tries to. Network Policies are what stops them, provided your CNI plugin enforces them. They turn “one compromised pod” into “one compromised pod, and that’s it.” Everyone knows they should use them. Almost nobody actually does, because the YAML looks scary and the behaviour feels backwards until the mental model clicks. ...

How etcd Actually Works: The Heart of Your Kubernetes Cluster

When something goes wrong in Kubernetes, the trail usually leads back to etcd. API server timing out? Check etcd. Pods stuck in Pending? Might be etcd. Cluster feels sluggish? Probably etcd. For a long time I treated etcd the way most operators do: as a black box that hums along next to the control plane. “The database.” You back it up and otherwise leave it alone. But black boxes bother me like splinters, and the first time an etcd cluster fell over at 2am, I realised I had no idea what I was actually looking at. So I learned. And it turns out the whole thing is built on a handful of ideas that, once they click, make most etcd problems diagnosable instead of terrifying. ...

Kubernetes High Availability: Stacked vs External etcd Explained

The first “production” Kubernetes cluster I ran had a single control plane node. It hummed along happily for weeks, right up until a disk failed and took the whole thing with it. Every pod, every service, gone. That outage taught me what “single point of failure” actually feels like, and it pushed me toward a question that I keep seeing trip people up: when you build a cluster that survives node loss, do you run etcd on your control plane nodes, or on dedicated nodes of its own? ...