Alerting That Works: From Alert Fatigue to Actionable Notifications

For a while my alerting worked fine. A handful of rules, pages were rare, and when one came in it meant something. Then the cluster grew, I bolted on the Prometheus Operator defaults, and “fine” quietly turned into noise.

The tipping point was a 3 AM page. My phone buzzed, I groggily checked it: “High CPU usage on node-worker-3.” I looked at the graph, saw it had been sitting at 75% for ten minutes, and went back to sleep. Next night, same alert. A week later I’d stopped checking at all.

That’s alert fatigue, and it’s the dangerous kind of broken because it looks like it’s working. The pages keep coming, the dashboards keep blinking, and you keep ignoring them. When everything pages, real incidents drown in the noise and nobody notices until a user does.

So which alerts actually deserve to wake a human up at 3 AM? That’s the only question worth answering here, and it has very little to do with the tooling. I’ve run systems at both ends: ones drowning in pages, and ones where a page is rare enough that I trust it the instant it fires. The gap between those two is how you decide what gets to interrupt someone.

This connects straight back to one of the values I keep coming back to on this blog: understanding over trust. An alert whose meaning you don’t understand is a black box bolted onto your pager. You can’t reason about it, so you start ignoring it.

Why default alerts make it worse

Most monitoring setups arrive pre-loaded with alerting rules. The Prometheus Operator ships hundreds of them. They fire on everything:

CPU > 80%
Memory > 80%
Disk > 80%
Pod restarts > 0
Any 5xx error

Good intentions, terrible results. Four things go wrong:

No context. Is 80% CPU bad? Depends entirely on the workload. A batch job pinning a core is doing its job.
No impact. Does this affect a user right now? The alert has no idea, and neither do you at 3 AM.
No action. What am I supposed to do about it? Usually nothing.
Too sensitive. A brief spike fires an alert that resolves itself before you’ve even unlocked your phone.

You end up with pages that don’t need a human, and humans who’ve learned to stop trusting pages. Both halves of that are bad, and they feed each other.

How I decide what gets to page me

The fix isn’t a better tool. It’s a few rules about what an alert is allowed to be.

1. Alert on symptoms, not causes

Bad: “Pod restarted”
Good: “Service error rate > 1%”

Nobody using your service cares whether a pod restarted. They care whether the thing works. So alert on what the user experiences, then go chase the cause once you’ve been woken up for a real reason. A pod that crash-loops while the service stays healthy (because the others picked up the slack) is exactly the kind of self-healing I want, and exactly the kind of thing I don’t want a page for.

2. Alert only on what needs a hand

If nobody has to do anything, it has no business paging anyone. Informational stuff belongs on a dashboard or in a daily review, not on your phone at night.

The test I use is one question: “If I get this alert, what do I actually do?”

Answer is “investigate and probably nothing” → not an alert
Answer is “this specific remediation step” → fine, that’s an alert

3. Every alert needs a runbook

When the thing fires, what do you do about it? If you can’t write that down ahead of time, the alert isn’t finished. This is the low-friction principle applied to 3 AM: future-me, half-awake, should not have to reverse-engineer what past-me meant.

annotations:
  runbook_url: https://wiki.example.com/runbooks/high-error-rate
  summary: "Error rate exceeds SLO"
  # Assumes the rule expression returns a ratio (0.02 renders as "2%")
  description: "Service {{ $labels.service }} error rate is {{ $value | humanizePercentage }}"

4. Severity has to mean something

Severity   Response                       Example
critical   Wake someone up NOW            Service down, data loss imminent
warning    Handle during business hours   Disk filling, certificate expiring
info       Review in daily standup        Unusual pattern, non-urgent

The moment a critical alert fires that didn’t need immediate action, you’ve taught yourself that critical doesn’t really mean critical. Do that a few times and the whole scale is worthless. Guard it.

Tie alerts to SLOs, not raw numbers

The alerts I trust most map directly onto a Service Level Objective. The SLO already encodes “how much badness is acceptable,” so the alert inherits that meaning instead of guessing at a threshold.

# SLO: 99.9% availability
# Error budget: 0.1% = 43 minutes/month

groups:
  - name: slo-alerts
    rules:
      # Fast burn: consuming error budget quickly
      - alert: HighErrorRateFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 1% (fast burn)"
          # $value is the error ratio, not a duration, so format it as a percentage
          description: "Error ratio is {{ $value | humanizePercentage }} against a 0.1% budget"

      # Slow burn: steadily consuming error budget
      - alert: HighErrorRateSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) /
            sum(rate(http_requests_total[1h]))
          ) > 0.001
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Error rate elevated (slow burn)"

The two burn rates do different jobs:

Fast burn: you’re torching the error budget right now, wake someone up.
Slow burn: you’re drifting toward an SLO breach over hours, deal with it during work time.
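
A common refinement, described in Google's SRE Workbook, is to express the threshold as a multiple of the budget rate and to require a long and a short window to agree, so the alert fires on sustained burn and clears quickly once the burn stops. The sketch below assumes the same http_requests_total metric and 99.9% SLO as above. The alert name, group name, and the 14.4x factor (roughly 2% of a 30-day budget spent in one hour) are illustrative choices, not values from my running config.

```yaml
groups:
  - name: slo-burn-rate-sketch          # hypothetical group name
    rules:
      - alert: ErrorBudgetFastBurn      # hypothetical alert name
        # Budget rate for a 99.9% SLO is 0.001 (0.1% of requests may fail).
        # 14.4 * 0.001 = 0.0144, i.e. 14.4 times the sustainable error ratio.
        # At that pace one hour consumes 14.4 / 720 = 2% of a 30-day budget.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning at more than 14.4x the sustainable rate"
```

The 1h window stops a short blip from paging. The 5m window makes the alert stop firing soon after the errors do, instead of lingering until the hour rolls over.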

Some alerts I actually run

Here’s the concrete stuff. Not an exhaustive catalogue, just the categories that have earned a place on my pager.

Availability

# Service is down
- alert: ServiceDown
  expr: up{job="my-service"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.job }} is down"
    runbook_url: https://wiki/runbooks/service-down

# High error rate (user impact)
- alert: HighErrorRate
  expr: |
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) /
    sum by (service) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service }} error rate above 5%"
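
Rules like these can be checked before they ever reach a pager. Below is a minimal sketch of a promtool unit test for ServiceDown. It assumes the rule sits inside a groups: block in a file called availability.rules.yaml; both the file name and the instance label value are made up for the example.

```yaml
# availability_test.yaml (hypothetical file name)
rule_files:
  - availability.rules.yaml   # assumed to contain the ServiceDown rule above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Target is up for two samples, then down from the 2m mark onward
      - series: 'up{job="my-service", instance="a:9090"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # At 2m the target has only just gone down, so `for: 1m` is not yet met
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts: []
      # By 4m the condition has held long enough and the alert is firing
      - eval_time: 4m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-service
              instance: "a:9090"
            exp_annotations:
              summary: "Service my-service is down"
              runbook_url: https://wiki/runbooks/service-down
```

Running promtool test rules availability_test.yaml evaluates the rule against the synthetic series and fails if the firing alerts differ from exp_alerts.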

Latency

# P99 latency too high
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.service }} P99 latency above 2s"

Capacity

These two are predictive, which is the point. I’d rather get a warning during office hours that a disk will fill tomorrow than a critical page at midnight when it already has.

# Disk filling (predictive)
- alert: DiskFillingUp
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24*60*60) < 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Disk {{ $labels.mountpoint }} will fill within 24h"

# Certificate expiring
- alert: CertificateExpiringSoon
  expr: |
    certmanager_certificate_expiration_timestamp_seconds - time() < 7*24*60*60
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Certificate {{ $labels.name }} expires in less than 7 days"

What I deliberately don’t alert on

This list matters as much as the one above. Every rule here is something a default config will happily set up for you, and every one of them is noise.

# DON'T: Generic resource usage
- alert: HighCPU
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8  # Bad: no context

# DON'T: Expected behavior
- alert: PodRestart
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 0  # Bad: restarts are normal

# DON'T: Things that auto-resolve
- alert: PodPending
  expr: kube_pod_status_phase{phase="Pending"} == 1
  for: 1m  # Bad: too short, scheduler will handle it
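
Dropping an alert doesn't have to mean dropping the signal. One option is to keep these numbers as recording rules so they show up on a dashboard without ever reaching Alertmanager. This is a sketch, assuming node_exporter and kube-state-metrics are being scraped; the group and record names are hypothetical and only follow the usual level:metric:operation naming convention.

```yaml
groups:
  - name: dashboard-only-sketch         # hypothetical group name
    rules:
      # Fraction of CPU time spent non-idle per instance, for graphing only
      - record: instance:node_cpu_utilisation:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Container restarts per namespace over the last hour, for the daily review
      - record: namespace:pod_restarts:increase1h
        expr: sum by (namespace) (increase(kube_pod_container_status_restarts_total[1h]))
```

Because these are record: rules rather than alert: rules, nothing here can page anyone.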

Routing it all through Alertmanager

Deciding what’s worth a page is half the work. The other half is making sure the page lands in the right place at the right time. Critical goes to the pager, warning goes to Slack but only during work hours, and info is dropped at the router so it only ever shows up on a dashboard.

# alertmanager.yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical: page immediately
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    # Warning: Slack during work hours
    - match:
        severity: warning
      receiver: 'slack-ops'
      active_time_intervals:
        - work_hours

    # Info: no notification (the 'null' receiver has no integrations)
    - match:
        severity: info
      receiver: 'null'

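receivers:
  # Fallback for alerts that match no severity route. The root route references
  # it, so it must be defined or Alertmanager will reject the config.
  # It has no integrations here; attach one if these should notify somewhere.
  - name: 'default'
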
  - name: 'pagerduty'
    pagerduty_configs:
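      # routing_key is the Events API v2 integration key; the severity field
      # is only sent with v2, not with the older service_key integration
      - routing_key: '<key>'
        severity: '{{ .CommonLabels.severity }}'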

  - name: 'slack-ops'
    slack_configs:
      - api_url: '<webhook>'
        channel: '#ops-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'null'

time_intervals:
  - name: work_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'
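
Alertmanager releases from 0.22 onward deprecate match and match_re in favour of a single matchers list. Here is a sketch of the same routing tree in that syntax, assuming the receiver names and the work_hours interval defined above:

```yaml
route:
  receiver: 'default'            # must exist under receivers:
  group_by: ['alertname', 'severity']
  routes:
    # Critical: page immediately
    - matchers:
        - 'severity = "critical"'
      receiver: 'pagerduty'

    # Warning: Slack, muted outside work hours
    - matchers:
        - 'severity = "warning"'
      receiver: 'slack-ops'
      active_time_intervals:
        - work_hours

    # Info: no notification
    - matchers:
        - 'severity = "info"'
      receiver: 'null'
```

Either way it's worth checking the tree before trusting it. amtool check-config alertmanager.yaml validates the file, and amtool config routes test --config.file=alertmanager.yaml severity=warning prints which receiver a given label set would reach.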

Silencing and inhibition

Two more tools for keeping the noise down: silence what you already know about, and suppress alerts that are just downstream symptoms of a bigger one.

Silences for maintenance

# Create silence during maintenance
amtool silence add alertname=~".+" --duration=2h --comment="Planned maintenance"

# Silence specific service
amtool silence add service="api" --duration=1h --comment="Deploying v2.3.0"

Inhibition rules

When one alert obviously implies a pile of others, suppress the pile. If the whole cluster is down, I don’t need forty separate pages telling me each service on it is also down.

inhibit_rules:
  # If cluster is down, don't alert on individual services
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: '.+'
    equal: ['cluster']

  # If node is down, don't alert on pods on that node
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: 'Pod.+'
    equal: ['node']
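
The equal list only does anything if both alerts actually carry that label with the same value. Below is a sketch of a source alert that would satisfy the second rule, assuming kube-state-metrics is installed. Its kube_node_status_condition metric has a node label. The group name and the 5m grace period are illustrative.

```yaml
groups:
  - name: inhibition-sources-sketch     # hypothetical group name
    rules:
      - alert: NodeDown
        # One series per node and condition, so the resulting alert
        # carries a `node` label for `equal: ['node']` to compare against.
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m                          # assumed grace period
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not Ready"
```

The target side needs the label too. Many kube_pod_* series don't include node, so a Pod alert may need a join against kube_pod_info before this inhibition can match.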

Surfacing alert state in Grafana

Not every alert should page, but you still want to see them. The ones that don’t wake you up live on a dashboard, which is where the info and resolved-but-recent alerts go to be reviewed instead of ignored.

# Prometheus datasource query
ALERTS{alertstate="firing"}

Create an alerts overview panel showing:

Currently firing alerts
Recent alert history
Alert trends over time

Link from alerts to relevant dashboards:

annotations:
  dashboard_url: https://grafana/d/abc123?var-service={{ $labels.service }}

The human side of on-call

All the config in the world doesn’t help if the rotation around it is a mess. A few things I’ve found worth being explicit about.

Rotation structure

Primary    → First responder, handles all pages
Secondary  → Backup if primary doesn't respond in 10 min
Escalation → Management, for major incidents

Response time SLAs

Severity   Acknowledge     Start work
Critical   5 minutes       15 minutes
Warning    4 hours         8 hours
Info       Next standup    When convenient

Post-incident review

Every critical alert is worth a quick look afterwards, even when the fix was easy:

Was the alert necessary?
Did it give enough context?
Was the runbook helpful?
Could the underlying thing be prevented?

If the same alert keeps firing and the answer is always “do nothing,” that’s not a robust system telling you about a recurring problem. That’s a broken alert. Fix it or delete it.

What my setup actually looks like

# Core SLO alerts only. Expressions are abbreviated: error_rate, p99_latency
# and friends stand in for full queries like the ones shown earlier.
groups:
  - name: slo
    rules:
      # Availability: error rate
      - alert: HighErrorRate
        expr: error_rate > 0.01
        for: 5m
        labels:
          severity: critical

      # Latency: P99
      - alert: HighLatency
        expr: p99_latency > 2
        for: 10m
        labels:
          severity: warning

      # Capacity: predictive
      - alert: DiskFillingUp
        expr: disk_will_fill_24h
        labels:
          severity: warning

      # Security: cert expiry
      - alert: CertExpiring
        expr: cert_expires_7d
        labels:
          severity: warning

# That's it. ~10 alert rules total.

Around ten rules, and every one of them earns its keep:

SLO-based, so each alert is tied to something a user would actually feel.
Minimal, because a rule I can’t justify is a rule I delete.
Clear severity, where critical genuinely means get out of bed and warning genuinely means it can wait until morning.
Runbook on everything, so half-asleep me has a next step instead of a riddle.

Digging out if you’re already buried

If you’re reading this from inside a pile of noisy alerts, you don’t fix it by adding more. You fix it by deleting. Here’s roughly the order I’d go in:

Audit what you have. How many alerts fired last month, and how many of those led to anyone doing anything?
Delete the dead ones. If a rule hasn’t fired in 90 days, you probably don’t need it, and if it has fired but nobody acted, you definitely don’t.
Raise the thresholds that always auto-resolve. If an alert clears itself before you respond, it was set too tight.
Add a for: duration to every rule so brief spikes can’t page you.
Merge the near-duplicates. One “high error rate” beats forty per-endpoint variations of the same thing.

The bar to clear is simple: every page should need a human. Anything that doesn’t, get it off the pager.

Why I bother with all this

A noisy alerting setup looks healthy and is quietly rotting your reliability. The engineer on call learns the pages are mostly junk, starts skimming them, and one night skims past the one that mattered. That’s the failure mode, and it’s a people problem dressed up as a tooling problem.

So I aim for alerting that is respectful of someone’s attention, accurate about real impact, and small enough in volume that every page still gets read. That’s the whole game.

Your observability stack with Loki and Tempo produces a firehose of data. Alerting is the valve that turns a sliver of that firehose into human attention. Open it too wide and you’re back at 3 AM, ignoring your own phone.

The best alert is the one you never have to send, because the system healed itself. The second best tells you exactly what’s wrong and exactly what to do about it.
