resilience, kubernetes, platform engineering, high availability, fault tolerance

Unbreakable - my fascination.

As a kid I had a word for things that fascinated me: unbreakable. Not “indestructible”; that implies something never breaks. Unbreakable is different. It means something that still works even when it is broken. I remember exactly when that fascination began: a photo of an A-10 Thunderbolt II, returned from a mission. Half the wing gone. Tail in tatters. Fuselage full of holes. And yet that thing had brought its pilot home. That’s not luck. That’s design. ...

December 23, 2025 · 3 min read · Tom Meurs

Prometheus and Thanos: Metrics at Scale

You can’t fix what you can’t see. You can’t optimize what you can’t measure. Prometheus is the standard for Kubernetes metrics. It works beautifully, until you need long-term storage, or multiple clusters, or high availability. Then you hit its limits.

Thanos extends Prometheus without replacing it. Keep your existing setup, add Thanos components, get unlimited retention and global querying.

The Problem with Standalone Prometheus

Prometheus has built-in limitations:

- Single node: no native clustering or HA
- Local storage: retention limited by disk size
- Single cluster view: can’t query across clusters
- No downsampling: old data takes as much space as new

For a single small cluster with 2 weeks retention, these aren’t problems. For production multi-cluster environments with compliance requirements, they’re blockers. ...

August 31, 2025 · 6 min read · Tom Meurs

Kubernetes RBAC: Least Privilege in Practice

When everything has cluster-admin, nothing is secure. Kubernetes RBAC (Role-Based Access Control) exists to answer one question: who can do what to which resources? Most clusters answer incorrectly: “everyone can do everything.” This isn’t just a security problem — it’s a resilience problem. When a service account gets compromised, how much damage can it do? When someone runs the wrong command, what’s the blast radius? Least privilege limits that radius. ...

August 19, 2025 · 7 min read · Tom Meurs

Runtime Security with Falco: Detect Suspicious Behavior in Your Cluster

You scanned your images with Trivy. You enforced policies with Kyverno. Your workloads have cryptographic identity via SPIFFE. But what happens after deployment? What if a container gets compromised at runtime? What if an attacker exploits a zero-day? Prevention isn’t enough. You need detection. Falco is a runtime security tool that monitors system calls in your cluster. It sees everything containers do — file access, network connections, process execution — and alerts when something looks wrong. ...

August 7, 2025 · 8 min read · Tom Meurs

SPIFFE and SPIRE: Zero Trust Service Identity

How does Service A know that Service B is actually Service B? In traditional networks, we trusted network location. If traffic came from the right IP, it was legitimate. Zero trust killed that assumption. Now every service must prove its identity, every time, regardless of network position. SPIFFE (Secure Production Identity Framework for Everyone) is a standard for service identity. SPIRE is its production-ready implementation. Together, they give every workload a cryptographic identity — automatically, without static secrets. ...

July 26, 2025 · 7 min read · Tom Meurs