As a kid I had a word for things that fascinated me: unbreakable. Not “indestructible” — that implies something never breaks. Unbreakable is different. It means something that still works even when it’s broken.
I remember exactly when that fascination began. A photo of an A-10 Thunderbolt II, returned from a mission. Half the wing gone. Tail in tatters. Fuselage full of holes. And yet that thing had brought its pilot home.
That’s not luck. That’s design.
Things that refuse to die
Do you know the stories of the Camel Trophy? Land Rover Defenders that were driven hard through Indonesian jungle. Chassis bent, parts broken off, everything covered in mud. And yet they arrived. Not because nothing went wrong along the way — things went wrong constantly. They arrived because they were built to break and still keep driving.
Compare that to modern cars. One sensor fails and you’re stranded by the side of the road, waiting for roadside assistance.
Somewhere we started confusing “robust” with “complex”. More systems, more redundancy, more failovers. But complexity itself is a source of failure. Every extra component is another thing that can break.
Why I do this for work
I build platforms. Kubernetes clusters, GitOps pipelines, that sort of thing. And I notice I’m constantly asking the same question: “What happens if this falls over?”
Not “how do we prevent this from falling over?” — that’s a different question. An important one, but different. I want to know what happens when it falls over anyway. Because it will fall over. Servers crash. Networks act weird. Code has bugs. That’s not pessimism, that’s just how it works.
Most systems I encounter are built for the happy path. Everything works, requests come in, responses go out, logs are green. Nice. But as soon as something goes wrong — and it will — it turns out nobody thought about what should happen then.
Kubernetes and the illusion of self-healing
Kubernetes is often sold as “self-healing”. Pod crashed? No problem, we’ll start a new one. Node goes down? Workloads are automatically moved.
And yes, that’s true. Up to a point.
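A minimal sketch of what that self-healing looks like in practice: a Deployment whose controller replaces dead pods, plus a liveness probe so the kubelet restarts a container that stops responding. The names and image here are illustrative, not from a real setup.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app            # illustrative name
spec:
  replicas: 3               # the controller replaces any pod that dies
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: web
          image: nginx:1.27         # stand-in image
          livenessProbe:            # kubelet restarts the container when this fails
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
```

Note what this does and doesn’t cover: it heals a crashed container or pod, but everything in it assumes the control plane itself is healthy.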
But a Kubernetes cluster is itself just a system that can fail. What if etcd gets corrupted? What if your control plane goes down? What if a network partition forms between your nodes?
Self-healing within a cluster is nice. But the cluster itself is still a single point of failure. And that’s exactly what I’m experimenting with — clusters that, together, are unbreakable, even when individual clusters go down.
But that’s a story for another post.
The question I ask myself
Every time I build something, I try to ask myself: is this unbreakable?
Not perfect. Not indestructible. But can this thing keep doing its job when half of it is on fire?
Usually the answer is no. And then the next question is: what would need to change to make the answer yes?
Sometimes the answer is simple. Sometimes the answer is “this entire design needs to be different”. And sometimes the answer is “that’s too much effort for this specific problem” — also fine, as long as you’re making that choice consciously.
The point isn’t that everything has to be unbreakable. The point is that you think about what happens when it breaks. Before it breaks.
This is also why I’m a fan of chaos engineering — deliberately breaking things to discover what happens then. Better in a controlled setting than at 3 AM.
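To make that concrete, here’s a toy sketch of the idea, not a real chaos tool: pick a random pod from a list of candidates and build the kubectl command that would delete it, with a dry-run flag so you can see what would happen before anything actually breaks. All function names are hypothetical; real tools like Chaos Mesh or Litmus do this far more carefully.

```python
import random
import subprocess

def pick_victim(pods, rng=random):
    """Pick one pod name at random from a list of candidates."""
    if not pods:
        raise ValueError("no pods to choose from")
    return rng.choice(pods)

def kill_pod_command(pod, namespace="default"):
    """Build the kubectl command that deletes the chosen pod."""
    return ["kubectl", "delete", "pod", pod, "-n", namespace]

def chaos_round(pods, namespace="default", dry_run=True):
    """One chaos round: pick a pod and (unless dry_run) delete it."""
    victim = pick_victim(pods)
    cmd = kill_pod_command(victim, namespace)
    if not dry_run:
        subprocess.run(cmd, check=True)  # actually break something
    return cmd
```

Running this with dry_run=True against a staging namespace is exactly the “controlled setting” point: you find out whether the system recovers while you’re watching, not while you’re asleep.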
