Alerting That Works: From Alert Fatigue to Actionable Notifications

Your phone buzzes at 3 AM. You groggily check: “High CPU usage on node-worker-3.” You look at the graph, see it’s been at 75% for 10 minutes, and go back to sleep. Tomorrow, same alert. Next week, you stop checking altogether. This is alert fatigue, and it’s dangerous. When everything alerts, nothing does. Real incidents get lost in the noise. I’ve been on both sides — drowning in alerts, and running systems where pages are rare and always actionable. The difference isn’t better tools. It’s better thinking about what deserves attention. ...

April 16, 2026 · 7 min read · Tom Meurs

Thanos Remote Write: Push-Based Metrics for Edge and Multi-Cluster

In my previous post on Prometheus and Thanos, I covered the sidecar architecture — Thanos Sidecar runs alongside Prometheus, uploads TSDB blocks to object storage, and exposes data to the Querier. It works beautifully for clusters with stable connectivity to your central infrastructure. But what happens when your clusters are at the edge? When they might lose connectivity for hours or days? When you’re running dozens or hundreds of small clusters and don’t want sidecar complexity on each one? ...

March 27, 2026 · 8 min read · Tom Meurs

Prometheus and Thanos: Metrics at Scale

You can’t fix what you can’t see. You can’t optimize what you can’t measure. Prometheus is the standard for Kubernetes metrics. It works beautifully until you need long-term storage, multiple clusters, or high availability. Then you hit its limits. Thanos extends Prometheus without replacing it. Keep your existing setup, add Thanos components, get unlimited retention and global querying.

The Problem with Standalone Prometheus

Prometheus has built-in limitations:

Single node: no native clustering or HA
Local storage: retention limited by disk size
Single cluster view: can’t query across clusters
No downsampling: old data takes as much space as new

For a single small cluster with 2 weeks of retention, these aren’t problems. For production multi-cluster environments with compliance requirements, they’re blockers. ...

August 31, 2025 · 6 min read · Tom Meurs