Prometheus

Alerting That Works: From Alert Fatigue to Actionable Notifications

For a while my alerting worked fine. A handful of rules, pages were rare, and when one came in it meant something. Then the cluster grew, I bolted on the Prometheus Operator defaults, and “fine” quietly turned into noise. The tipping point was a 3 AM page. My phone buzzed, I groggily checked it: “High CPU usage on node-worker-3.” I looked at the graph, saw it had been sitting at 75% for ten minutes, and went back to sleep. Next night, same alert. A week later I’d stopped checking at all. ...

Thanos Remote Write: Push-Based Metrics for Edge and Multi-Cluster

In my previous post on Prometheus and Thanos, I set up the sidecar architecture. Thanos Sidecar runs next to Prometheus, uploads TSDB blocks to object storage, and exposes data to the Querier over gRPC. For clusters sitting in the same datacenter with a fat, stable link to your central infrastructure, it’s lovely. Everything pulls. Everything talks to everything. Life is good. Then I started putting Prometheus on clusters at the edge, and life got less good. ...

Prometheus and Thanos: Metrics at Scale

The first time someone asked me “was this slower last month than it is now?”, I had no answer. My Prometheus only remembered two weeks. The data I needed had already aged out of local disk and been deleted. That gap is the whole reason this post exists. Prometheus is the default for Kubernetes metrics, and for good reason. It works beautifully right up until you need long-term storage, or a view across multiple clusters, or genuine high availability. Then you meet the wall. ...