Your phone buzzes at 3 AM. You groggily check: “High CPU usage on node-worker-3.” You look at the graph, see it’s been at 75% for 10 minutes, and go back to sleep. Tomorrow, same alert. Next week, you stop checking altogether.
This is alert fatigue, and it’s dangerous. When everything alerts, nothing does. Real incidents get lost in the noise.
I’ve been on both sides — drowning in alerts, and running systems where pages are rare and always actionable. The difference isn’t better tools. It’s better thinking about what deserves attention.
The Problem with Default Alerts
Most monitoring setups come with default alerting rules. Prometheus Operator ships with hundreds. They alert on everything:
- CPU > 80%
- Memory > 80%
- Disk > 80%
- Pod restarts > 0
- Any 5xx error
These defaults are well-intentioned but terrible in practice:
- No context — Is 80% CPU bad? Depends on the workload.
- No impact — Does this affect users? Unknown.
- No action — What should I do about it? Also unknown.
- Too sensitive — Brief spikes trigger alerts that resolve before you respond.
The result: pages that don’t need humans, humans who stop trusting pages.
The Alerting Philosophy
Good alerts follow principles:
1. Alert on Symptoms, Not Causes
- Bad: “Pod restarted”
- Good: “Service error rate > 1%”
Users don’t care if pods restart. They care if the service works. Alert on what users experience, then investigate causes afterward.
2. Alert on What Requires Action
If nobody needs to do anything, it shouldn’t page. Save informational alerts for dashboards or daily reviews.
Ask: “If I get this alert, what will I do?”
- If the answer is “investigate and probably nothing” → not an alert
- If the answer is “specific remediation” → valid alert
3. Every Alert Needs a Runbook
When the alert fires, what do you do? If you can’t write it down, the alert isn’t ready.
annotations:
  runbook_url: https://wiki.example.com/runbooks/high-error-rate
  summary: "Error rate exceeds SLO"
  # $value is a ratio (0.012 means 1.2%), so format it as a percentage
  description: "Service {{ $labels.service }} error rate is {{ $value | humanizePercentage }}"
4. Severity Must Mean Something
| Severity | Response | Example |
|---|---|---|
| critical | Wake someone up NOW | Service down, data loss imminent |
| warning | Handle during business hours | Disk filling, certificate expiring |
| info | Review in daily standup | Unusual pattern, non-urgent |
If critical alerts don’t require immediate action, they shouldn’t be critical.
SLO-Based Alerting
The best alerts tie directly to Service Level Objectives:
# SLO: 99.9% availability
# Error budget: 0.1% = 43 minutes/month
groups:
  - name: slo-alerts
    rules:
      # Fast burn: consuming error budget quickly
      - alert: HighErrorRateFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 1% (fast burn)"
          # $value is the error ratio, not a duration
          description: "Current error ratio is {{ $value | humanizePercentage }}; the 99.9% SLO allows 0.1%"
      # Slow burn: steadily consuming error budget
      - alert: HighErrorRateSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) /
            sum(rate(http_requests_total[1h]))
          ) > 0.001
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Error rate elevated (slow burn)"
The logic:
- Fast burn: Immediate problem, wake someone up
- Slow burn: Trending toward SLO breach, handle during work hours
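The thresholds above follow from simple error-budget arithmetic. The sketch below works it through in Python, assuming a 30-day month; the function names are illustrative and not part of any Prometheus tooling.

```python
# Sketch: error-budget arithmetic behind the fast-burn and slow-burn thresholds.
# Assumes a 30-day month; function names are illustrative.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def error_budget_minutes(slo: float) -> float:
    """Minutes of total failure the SLO tolerates per 30-day month."""
    return (1 - slo) * MINUTES_PER_MONTH


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1 - slo)


def days_to_exhaustion(error_ratio: float, slo: float, window_days: float = 30) -> float:
    """Days until the budget is gone if the error ratio stays constant."""
    return window_days / burn_rate(error_ratio, slo)


slo = 0.999
print(round(error_budget_minutes(slo), 1))      # 43.2
print(round(burn_rate(0.01, slo), 1))           # 10.0 (fast-burn threshold)
print(round(days_to_exhaustion(0.01, slo), 1))  # 3.0
print(round(burn_rate(0.001, slo), 1))          # 1.0 (slow-burn threshold)
```

Read this way, the 1% threshold corresponds to burning the budget ten times too fast (gone in about three days), while the 0.1% threshold is the rate at which the budget is spent exactly by month end.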
Practical Alert Examples
Availability Alerts
# Service is down
- alert: ServiceDown
  expr: up{job="my-service"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.job }} is down"
    runbook_url: https://wiki/runbooks/service-down
# High error rate (user impact)
- alert: HighErrorRate
  expr: |
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) /
    sum by (service) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service }} error rate above 5%"
Latency Alerts
# P99 latency too high
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.service }} P99 latency above 2s"
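It helps to know that `histogram_quantile` returns an estimate interpolated from bucket boundaries, not an exact measurement. The Python sketch below is a simplified version of that interpolation for classic histograms; it assumes at least one finite bucket plus `+Inf`, skips the edge cases the real function handles, and uses made-up bucket counts.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the +Inf bucket, as in a Prometheus classic histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0.0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request counts per latency bucket (seconds).
latency_buckets = [(0.5, 900), (1.0, 960), (2.0, 985), (5.0, 998), (math.inf, 1000)]
p99 = histogram_quantile(0.99, latency_buckets)
print(round(p99, 2))  # 3.15
print(p99 > 2)        # True
```

Because the result can only be as precise as the bucket layout, a 2s threshold is most meaningful when 2 is itself a bucket boundary.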
Capacity Alerts
# Disk filling (predictive)
- alert: DiskFillingUp
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24*60*60) < 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Disk {{ $labels.mountpoint }} will fill within 24h"
# Certificate expiring
- alert: CertificateExpiringSoon
  expr: |
    certmanager_certificate_expiration_timestamp_seconds - time() < 7*24*60*60
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Certificate {{ $labels.name }} expires in less than 7 days"
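The `DiskFillingUp` rule relies on `predict_linear`, which fits a straight line through recent samples and extrapolates it. A rough Python sketch of the idea follows; it extrapolates from the last sample's timestamp (Prometheus uses the evaluation time), and the sample data is invented for illustration.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, extrapolated forward.

    Simplification: extrapolates from the last sample's timestamp.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


HOUR = 3600
# Hypothetical free space in GiB, sampled hourly over 6 hours.
shrinking = [(h * HOUR, 100 - 5 * h) for h in range(7)]  # losing 5 GiB/hour
steady = [(h * HOUR, 100) for h in range(7)]             # no change

print(round(predict_linear(shrinking, 24 * HOUR)))  # -50
print(round(predict_linear(steady, 24 * HOUR)))     # 100
```

A negative prediction means the disk is on track to run out inside the window, which is why the rule compares against zero instead of a fixed usage percentage.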
What NOT to Alert On
# DON'T: Generic resource usage
- alert: HighCPU
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8 # Bad: no context
# DON'T: Expected behavior
- alert: PodRestart
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 0 # Bad: restarts are normal
# DON'T: Things that auto-resolve
- alert: PodPending
  expr: kube_pod_status_phase{phase="Pending"} == 1
  for: 1m # Bad: too short, scheduler will handle it
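The `for` clause in the last example decides whether a brief condition ever reaches a human. The Python sketch below simulates the pending-to-firing behaviour under an assumed 30-second evaluation interval; the function name and the sample sequence are illustrative.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Alert state after each rule evaluation.

    The alert is 'pending' while the condition has held for less than for_s,
    'firing' once it has held continuously for at least for_s, and resets to
    'inactive' as soon as one evaluation finds the condition false.
    """
    states = []
    active_since = None
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# A one-minute blip, a recovery, then a sustained problem.
spiky = [True, True, False, True, True, True, True, True]
print(alert_states(spiky, eval_interval_s=30, for_s=120))
```

With `for_s=120` only the final evaluation is `firing`: the early blip never pages, and the sustained problem pages two minutes after it starts.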
Alertmanager Configuration
Route alerts appropriately:
# alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical: page immediately
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    # Warning: Slack during work hours
    - match:
        severity: warning
      receiver: 'slack-ops'
      active_time_intervals:
        - work_hours
    # Info: just log
    - match:
        severity: info
      receiver: 'null'
receivers:
  # Every receiver named in a route must be defined, including the root's
  - name: 'default'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<key>'
        severity: '{{ .CommonLabels.severity }}'
  - name: 'slack-ops'
    slack_configs:
      - api_url: '<webhook>'
        channel: '#ops-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'null'
time_intervals:
  - name: work_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'
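To see which receiver an alert ends up with, it can help to trace the routing tree by hand. The Python sketch below models the matching order for the configuration above: child routes are tried top to bottom, the first match wins unless it sets `continue`, and the parent's receiver is the fallback. It ignores time intervals and grouping, and the dictionary layout is an illustrative stand-in for the YAML.

```python
def matching_receivers(route, labels):
    """Return the receivers an alert with these labels is delivered to."""
    receivers = []
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            receivers.extend(matching_receivers(child, labels))
            if not child.get("continue", False):
                break  # first match wins unless the route says continue
    # If no child route matched, the current route handles the alert itself.
    return receivers or [route["receiver"]]


ROUTE_TREE = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty", "continue": True},
        {"match": {"severity": "warning"}, "receiver": "slack-ops"},
        {"match": {"severity": "info"}, "receiver": "null"},
    ],
}

print(matching_receivers(ROUTE_TREE, {"severity": "critical"}))     # ['pagerduty']
print(matching_receivers(ROUTE_TREE, {"severity": "warning"}))      # ['slack-ops']
print(matching_receivers(ROUTE_TREE, {"alertname": "Unlabelled"}))  # ['default']
```

In this particular tree `continue: true` changes nothing for critical alerts, because no later sibling matches them; it would only matter if another route further down also matched `severity: critical`.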
Silencing and Inhibition
Silences for Maintenance
# Create silence during maintenance
amtool silence add alertname=~".+" --duration=2h --comment="Planned maintenance"
# Silence specific service
amtool silence add service="api" --duration=1h --comment="Deploying v2.3.0"
Inhibition Rules
When one alert implies another, suppress the noise:
inhibit_rules:
  # If cluster is down, don't alert on individual services
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: '.+'
    equal: ['cluster']
  # If node is down, don't alert on pods on that node
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: 'Pod.+'
    equal: ['node']
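The matching logic behind these rules is easy to misread, so here is a simplified Python sketch of it. A target alert is muted when some firing alert matches the source side and agrees with it on every `equal` label; an alert that matches both sides does not mute itself. The helper names and sample alerts are hypothetical, and this is a model of the behaviour, not Alertmanager's code.

```python
import re


def matches(alert, equals=None, regexes=None):
    """True if the alert's labels satisfy every exact and regex matcher."""
    exact_ok = all(alert.get(k, "") == v for k, v in (equals or {}).items())
    regex_ok = all(re.fullmatch(p, alert.get(k, "")) for k, p in (regexes or {}).items())
    return exact_ok and regex_ok


def is_inhibited(target, firing_alerts, rule):
    """True if some firing source alert suppresses the target under this rule."""
    if not matches(target, rule.get("target_match"), rule.get("target_match_re")):
        return False
    target_is_source = matches(target, rule.get("source_match"), rule.get("source_match_re"))
    for source in firing_alerts:
        if not matches(source, rule.get("source_match"), rule.get("source_match_re")):
            continue
        # Labels listed under `equal` must agree (missing counts as empty).
        if any(source.get(label, "") != target.get(label, "") for label in rule["equal"]):
            continue
        source_is_target = matches(source, rule.get("target_match"), rule.get("target_match_re"))
        if target_is_source and source_is_target:
            continue  # alerts matching both sides do not inhibit each other
        return True
    return False


rule = {
    "source_match": {"alertname": "ClusterDown"},
    "target_match_re": {"alertname": ".+"},
    "equal": ["cluster"],
}
cluster_down = {"alertname": "ClusterDown", "cluster": "prod"}
errors_prod = {"alertname": "HighErrorRate", "cluster": "prod", "service": "api"}
errors_staging = {"alertname": "HighErrorRate", "cluster": "staging", "service": "api"}
firing = [cluster_down, errors_prod, errors_staging]

print(is_inhibited(errors_prod, firing, rule))     # True
print(is_inhibited(errors_staging, firing, rule))  # False
print(is_inhibited(cluster_down, firing, rule))    # False
```

The `equal` list is what keeps a production outage from silencing staging alerts, so it is worth checking that both source and target alerts actually carry that label.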
Grafana Integration
Visualize alert state in dashboards:
# Prometheus datasource query
ALERTS{alertstate="firing"}
Create an alerts overview panel showing:
- Currently firing alerts
- Recent alert history
- Alert trends over time
Link from alerts to relevant dashboards:
annotations:
  dashboard_url: https://grafana/d/abc123?var-service={{ $labels.service }}
On-Call Best Practices
Rotation Structure
Primary → First responder, handles all pages
Secondary → Backup if primary doesn't respond in 10 min
Escalation → Management, for major incidents
Response Time SLAs
| Severity | Acknowledge | Start Work |
|---|---|---|
| Critical | 5 minutes | 15 minutes |
| Warning | 4 hours | 8 hours |
| Info | Next standup | When convenient |
Post-Incident Review
Every critical alert should trigger a review:
- Was the alert necessary?
- Did it provide enough context?
- Was the runbook helpful?
- Could this be prevented?
If the same alert fires repeatedly without action, fix or delete it.
My Alerting Setup
# Core SLO alerts only.
# The expr values are shorthand for the full expressions shown earlier.
groups:
  - name: slo
    rules:
      # Availability: error rate
      - alert: HighErrorRate
        expr: error_rate > 0.01
        for: 5m
        labels:
          severity: critical
      # Latency: P99
      - alert: HighLatency
        expr: p99_latency > 2
        for: 10m
        labels:
          severity: warning
      # Capacity: predictive
      - alert: DiskFillingUp
        expr: disk_will_fill_24h
        labels:
          severity: warning
      # Security: cert expiry
      - alert: CertExpiring
        expr: cert_expires_7d
        labels:
          severity: warning
# That's it. ~10 alert rules total.
Key decisions:
- SLO-based — Alerts tied to user impact
- Minimal rules — Each rule earns its place
- Clear severity — Critical means wake up, warning means work hours
- Every alert has a runbook — No guessing at 3 AM
Reducing Alert Fatigue
Steps to clean up:
- Audit current alerts: How many fired last month? How many were actionable?
- Delete unused alerts: If one hasn’t fired in 90 days, it’s probably not needed
- Raise thresholds: If alerts always auto-resolve, the threshold is too sensitive
- Add a `for` duration: Brief spikes shouldn’t page
- Merge similar alerts: One “high error rate” beats per-endpoint alerts
Goal: Every page should require human action.
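The audit step can be as simple as tallying last month's pages. The Python sketch below assumes you can export each page as an alert name plus a flag for whether anyone had to act; the 50% cutoff and the sample data are made up for illustration.

```python
from collections import Counter


def actionable_ratio(pages):
    """Fraction of pages per alert name that required human action.

    pages: list of (alertname, was_actionable) tuples.
    """
    fired = Counter(name for name, _ in pages)
    acted = Counter(name for name, was_actionable in pages if was_actionable)
    return {name: acted[name] / fired[name] for name in fired}


def cleanup_candidates(pages, min_ratio=0.5):
    """Alert names whose pages were actionable less often than min_ratio."""
    ratios = actionable_ratio(pages)
    return sorted(name for name, ratio in ratios.items() if ratio < min_ratio)


# Hypothetical export of one month of pages.
sample_pages = [
    ("HighCPU", False), ("HighCPU", False), ("HighCPU", False), ("HighCPU", True),
    ("HighErrorRate", True), ("HighErrorRate", True),
    ("PodRestart", False),
]

print(cleanup_candidates(sample_pages))  # ['HighCPU', 'PodRestart']
```

Anything on that list is a candidate for a higher threshold, a longer `for`, a demotion to warning, or deletion.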
Why This Matters
Alert fatigue kills reliability. When on-call engineers stop trusting alerts, real incidents get missed. When everyone is paged for everything, nobody takes responsibility.
Good alerting is:
- Respectful of human attention
- Actionable with clear next steps
- Accurate about impact
- Minimal in volume
Your observability stack with Loki and Tempo generates massive data. Alerting is how you turn that data into attention — use it wisely.
The best alert is the one you don’t need to send because the system healed itself. The second best is the one that tells you exactly what’s wrong and how to fix it.
