Tom Meurs

Cilium: eBPF Networking for Kubernetes

The first time a service stopped resolving in one of my clusters, I spent an evening reading iptables chains. Hundreds of rules, generated by kube-proxy, evaluated top to bottom. I never found the actual problem. I restarted a node and it went away. That bothered me more than the outage did. I was running something I couldn’t read. That feeling is why I moved to Cilium. It uses eBPF to push networking logic down into the Linux kernel and, with its kube-proxy replacement enabled, handles service routing without those iptables chains. You get better performance, you can actually see what your traffic is doing, and network policies stop being a guessing game. ...

Distributed Tracing with Tempo and OpenTelemetry

Your metrics say something is slow. Your logs say errors happened. Great. Now answer me this: which request actually failed, where did the latency come from, and which service in the chain ate the timeout? Metrics and logs both shrug at that. I hit this wall the first time a checkout flow started timing out under load. Ten services in the path, every one of them green on its own dashboard, and no way to follow a single doomed request from front door to failure. That gap is exactly what distributed tracing fills. It follows one request as it moves through your services and shows you precisely what happened and where it stalled. ...

Loki for Kubernetes Logging: The Prometheus-Like Approach

You’ve got Prometheus for metrics, so you can already see what’s happening across your clusters. Metrics tell you a request latency spiked at 14:32. They don’t tell you the payment service threw a null pointer because someone shipped a config change with a typo. For that you need logs. The default answer for years was Elasticsearch. It’s powerful and flexible, and it indexes every single token in every log line. That full-text index is great until you look at the bill. You pay for it in CPU at ingest, in RAM to keep the index hot, and in storage that grows faster than your actual log volume. I ran an ELK stack in a previous job and spent more time tuning JVM heap sizes than reading logs. ...

Thanos Remote Write: Push-Based Metrics for Edge and Multi-Cluster

In my previous post on Prometheus and Thanos, I set up the sidecar architecture. Thanos Sidecar runs next to Prometheus, uploads TSDB blocks to object storage, and exposes data to the Querier over gRPC. For clusters sitting in the same datacenter with a fat, stable link to your central infrastructure, it’s lovely. Everything pulls. Everything talks to everything. Life is good. Then I started putting Prometheus on clusters at the edge, and life got less good. ...

Declarative Infrastructure as Compliance Documentation: Talos, NixOS, and Audit-Ready Systems

Here’s how an ISO 27001 audit usually goes. Weeks before the auditor shows up, someone starts collecting screenshots. Configuration panels, firewall rules, a dashboard showing patches applied. Then come the Word documents describing what the systems are supposed to do. Then the change tickets, dug out of a ticketing system, each one referencing a vague “server maintenance” that nobody can fully reconstruct six months later. Everyone treats this as the cost of doing business. I did too, for years. ...