Observability

Grafana Dashboards That Actually Get Used

You have Grafana. You have Prometheus metrics. You have logs in Loki and traces in Tempo. The data is all there. You also have 47 dashboards that nobody opens. I have done this to myself more than once. Something breaks at 2 AM, I bolt together a dashboard to see what’s going on, and then it just sits there forever. Multiply that by a year of incidents and a few “let me just add a panel for that” moments, and you end up with a Grafana that’s mostly archaeology. Nobody remembers what half the panels mean. The honest move is to delete most of them, but first it helps to understand what makes the survivors worth keeping. ...

Alerting That Works: From Alert Fatigue to Actionable Notifications

For a while my alerting worked fine. A handful of rules, pages were rare, and when one came in it meant something. Then the cluster grew, I bolted on the Prometheus Operator defaults, and “fine” quietly turned into noise. The tipping point was a 3 AM page. My phone buzzed, I groggily checked it: “High CPU usage on node-worker-3.” I looked at the graph, saw it had been sitting at 75% for ten minutes, and went back to sleep. Next night, same alert. A week later I’d stopped checking at all. ...

Distributed Tracing with Tempo and OpenTelemetry

Your metrics say something is slow. Your logs say errors happened. Great. Now answer me this: which request actually failed, where did the latency come from, and which service in the chain ate the timeout? Metrics and logs both shrug at that. I hit this wall the first time a checkout flow started timing out under load. Ten services in the path, every one of them green on its own dashboard, and no way to follow a single doomed request from front door to failure. That gap is exactly what distributed tracing fills. It follows one request as it moves through your services and shows you precisely what happened and where it stalled. ...

Loki for Kubernetes Logging: The Prometheus-Like Approach

You’ve got Prometheus for metrics, so you can already see what’s happening across your clusters. Metrics tell you a request latency spiked at 14:32. They don’t tell you the payment service threw a null pointer because someone shipped a config change with a typo. For that you need logs. The default answer for years was Elasticsearch. It’s powerful and flexible, and it indexes every single token in every log line. That full-text index is great until you look at the bill. You pay for it in CPU at ingest, in RAM to keep the index hot, and in storage that grows faster than your actual log volume. I ran an ELK stack in a previous job and spent more time tuning JVM heap sizes than reading logs. ...

Thanos Remote Write: Push-Based Metrics for Edge and Multi-Cluster

In my previous post on Prometheus and Thanos, I set up the sidecar architecture. Thanos Sidecar runs next to Prometheus, uploads TSDB blocks to object storage, and exposes data to the Querier over gRPC. For clusters sitting in the same datacenter with a fat, stable link to your central infrastructure, it’s lovely. Everything pulls. Everything talks to everything. Life is good. Then I started putting Prometheus on clusters at the edge, and life got less good. ...