Operations

For a while my alerting worked fine. A handful of rules, pages were rare, and when one came in it meant something. Then the cluster grew, I bolted on the Prometheus Operator defaults, and “fine” quietly turned into noise. The tipping point was a 3 AM page. My phone buzzed, I groggily checked it: “High CPU usage on node-worker-3.” I looked at the graph, saw it had been sitting at 75% for ten minutes, and went back to sleep. Next night, same alert. A week later I’d stopped checking at all. ...

Operations

Chaos Engineering: Breaking Your Cluster to Make It Stronger

Alerting That Works: From Alert Fatigue to Actionable Notifications