
Grafana Dashboards That Actually Get Used

You have Grafana. You have Prometheus metrics. You have logs in Loki and traces in Tempo. You also have 47 dashboards that nobody looks at. Dashboard rot is real. Teams create dashboards for every possible metric, every service, every potential issue. Six months later, nobody remembers what half of them show or why they exist. Good dashboards are different. They get opened daily. They answer questions before you ask. They help you understand your system, not just display numbers. ...

May 2, 2026 · 7 min read · Tom Meurs

Alerting That Works: From Alert Fatigue to Actionable Notifications

Your phone buzzes at 3 AM. You groggily check: “High CPU usage on node-worker-3.” You look at the graph, see it’s been at 75% for 10 minutes, and go back to sleep. Tomorrow, same alert. Next week, you stop checking altogether. This is alert fatigue, and it’s dangerous. When everything alerts, nothing does. Real incidents get lost in the noise. I’ve been on both sides — drowning in alerts, and running systems where pages are rare and always actionable. The difference isn’t better tools. It’s better thinking about what deserves attention. ...

April 16, 2026 · 7 min read · Tom Meurs

Distributed Tracing with Tempo and OpenTelemetry

You have metrics telling you something is slow. You have logs telling you errors happened. But which request failed? Where did the latency come from? Which service in the chain caused the timeout? This is where distributed tracing comes in. It follows individual requests as they flow through your microservices, showing you exactly what happened and where.

The Observability Triangle

[Diagram: "Complete Observability". Metrics (Prometheus/Thanos): WHAT is happening. Logs (Loki): WHY it happened. Traces (Tempo): WHERE it happened. The three are linked to one another, and Grafana connects to all three.]

Metrics answer: “What is the error rate? What is the latency?”
Logs answer: “What error message? What was the context?”
Traces answer: “Which service? Which call? What was the path?”

Together, they give you complete understanding. ...

April 4, 2026 · 7 min read · Tom Meurs

Loki for Kubernetes Logging: The Prometheus-Like Approach

You’ve got Prometheus for metrics. You can see what’s happening across your clusters. But when something breaks, metrics tell you that something is wrong — logs tell you why. The traditional answer is Elasticsearch. It’s powerful, flexible, and… expensive. It indexes everything, which means you pay for every byte of log data in CPU, memory, and storage. Loki takes a different approach: index labels, not content. It’s the same philosophy that makes Prometheus efficient for metrics, applied to logs. ...

March 31, 2026 · 7 min read · Tom Meurs

Thanos Remote Write: Push-Based Metrics for Edge and Multi-Cluster

In my previous post on Prometheus and Thanos, I covered the sidecar architecture — Thanos Sidecar runs alongside Prometheus, uploads TSDB blocks to object storage, and exposes data to the Querier. It works beautifully for clusters with stable connectivity to your central infrastructure. But what happens when your clusters are at the edge? When they might lose connectivity for hours or days? When you’re running dozens or hundreds of small clusters and don’t want sidecar complexity on each one? ...

March 27, 2026 · 8 min read · Tom Meurs