Your phone buzzes at 3 AM. You groggily check: “High CPU usage on node-worker-3.” You look at the graph, see it’s been at 75% for 10 minutes, and go back to sleep. Tomorrow, same alert. Next week, you stop checking altogether.

This is alert fatigue, and it’s dangerous. When everything alerts, nothing does. Real incidents get lost in the noise.

I’ve been on both sides — drowning in alerts, and running systems where pages are rare and always actionable. The difference isn’t better tools. It’s better thinking about what deserves attention.

The Problem with Default Alerts

Most monitoring setups come with default alerting rules. Prometheus Operator ships with hundreds. They alert on everything:

  • CPU > 80%
  • Memory > 80%
  • Disk > 80%
  • Pod restarts > 0
  • Any 5xx error

These defaults are well-intentioned but terrible in practice:

  1. No context — Is 80% CPU bad? Depends on the workload.
  2. No impact — Does this affect users? Unknown.
  3. No action — What should I do about it? Also unknown.
  4. Too sensitive — Brief spikes trigger alerts that resolve before you respond.

The result: pages that don’t need humans, humans who stop trusting pages.

The Alerting Philosophy

Good alerts follow four principles:

1. Alert on Symptoms, Not Causes

Bad: “Pod restarted”
Good: “Service error rate > 1%”

Users don’t care if pods restart. They care if the service works. Alert on what users experience, then investigate causes afterward.

2. Alert on What Requires Action

If nobody needs to do anything, it shouldn’t page. Save informational alerts for dashboards or daily reviews.

Ask: “If I get this alert, what will I do?”

  • If the answer is “investigate, then probably do nothing” → not an alert
  • If the answer is “specific remediation” → valid alert

3. Every Alert Needs a Runbook

When the alert fires, what do you do? If you can’t write it down, the alert isn’t ready.

annotations:
  runbook_url: https://wiki.example.com/runbooks/high-error-rate
  summary: "Error rate exceeds SLO"
  description: "Service {{ $labels.service }} error rate is {{ $value | humanizePercentage }}"

4. Severity Must Mean Something

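| Severity | Response                     | Example                             |
|----------|------------------------------|-------------------------------------|
| critical | Wake someone up NOW          | Service down, data loss imminent    |
| warning  | Handle during business hours | Disk filling, certificate expiring  |
| info     | Review in daily standup      | Unusual pattern, non-urgent         |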

If critical alerts don’t require immediate action, they shouldn’t be critical.

SLO-Based Alerting

The best alerts tie directly to Service Level Objectives:

# SLO: 99.9% availability
# Error budget: 0.1% ≈ 43 minutes per 30-day month

groups:
  - name: slo-alerts
    rules:
      # Fast burn: consuming error budget quickly
      - alert: HighErrorRateFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 1% (fast burn)"
          description: "Error rate is {{ $value | humanizePercentage }}, at least 10x the rate a 99.9% SLO allows"

      # Slow burn: steadily consuming error budget
      - alert: HighErrorRateSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) /
            sum(rate(http_requests_total[1h]))
          ) > 0.001
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Error rate elevated (slow burn)"

The logic:

  • Fast burn: Immediate problem, wake someone up
  • Slow burn: Trending toward SLO breach, handle during work hours
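
To put numbers on the two rules: a 1% error rate against a 0.1% budget is a 10x burn rate, which would exhaust a 30-day budget in about three days, while a sustained 0.1% is exactly 1x. One common refinement, sketched here as an assumption rather than as part of the setup above, is the multi-window check described in Google's SRE Workbook: page only when both a long and a short window exceed the same burn rate, so the alert needs sustained burn to fire and clears soon after recovery. The 14.4x factor (2% of a 30-day budget consumed in one hour) and the alert name are illustrative.

```yaml
groups:
  - name: slo-burn-rate            # hypothetical group name
    rules:
      # Fires only when BOTH the 1h and the 5m error ratios exceed
      # 14.4x the rate a 99.9% SLO allows (0.001).
      - alert: ErrorBudgetFastBurn   # hypothetical alert name
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning at more than 14.4x the sustainable rate"
```

The long window keeps short blips from paging; the short window makes the alert stop firing within minutes once errors stop.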

Practical Alert Examples

Availability Alerts

# Service is down
- alert: ServiceDown
  expr: up{job="my-service"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.job }} is down"
    runbook_url: https://wiki/runbooks/service-down

# High error rate (user impact)
- alert: HighErrorRate
  expr: |
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) /
    sum by (service) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service }} error rate above 5%"

Latency Alerts

# P99 latency too high
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.service }} P99 latency above 2s"

Capacity Alerts

# Disk filling (predictive)
- alert: DiskFillingUp
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24*60*60) < 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Disk {{ $labels.mountpoint }} will fill within 24h"

# Certificate expiring
- alert: CertificateExpiringSoon
  expr: |
    certmanager_certificate_expiration_timestamp_seconds - time() < 7*24*60*60
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Certificate {{ $labels.name }} expires in less than 7 days"

What NOT to Alert On

# DON'T: Generic resource usage
- alert: HighCPU
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8  # Bad: no context

# DON'T: Expected behavior
- alert: PodRestart
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 0  # Bad: restarts are normal

# DON'T: Things that auto-resolve
- alert: PodPending
  expr: kube_pod_status_phase{phase="Pending"} == 1
  for: 1m  # Bad: too short, scheduler will handle it

Alertmanager Configuration

Route alerts appropriately:

# alertmanager.yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical: page immediately
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    # Warning: Slack during work hours
    - match:
        severity: warning
      receiver: 'slack-ops'
      active_time_intervals:
        - work_hours

    # Info: just log
    - match:
        severity: info
      receiver: 'null'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<key>'
        severity: '{{ .CommonLabels.severity }}'

  - name: 'slack-ops'
    slack_configs:
      - api_url: '<webhook>'
        channel: '#ops-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'null'

time_intervals:
  - name: work_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

Silencing and Inhibition

Silences for Maintenance

# Create silence during maintenance
amtool silence add alertname=~".+" --duration=2h --comment="Planned maintenance"

# Silence specific service
amtool silence add service="api" --duration=1h --comment="Deploying v2.3.0"

Inhibition Rules

When one alert implies another, suppress the noise:

inhibit_rules:
  # If cluster is down, don't alert on individual services
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: '.+'
    equal: ['cluster']

  # If node is down, don't alert on pods on that node
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: 'Pod.+'
    equal: ['node']

Grafana Integration

Visualize alert state in dashboards:

# Prometheus datasource query
ALERTS{alertstate="firing"}

Create an alerts overview panel showing:

  • Currently firing alerts
  • Recent alert history
  • Alert trends over time
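
For the trends panel, one option is a recording rule that tracks how many alerts are firing, so the history stays cheap to query over long ranges. This is a sketch only; the group and rule names are hypothetical.

```yaml
groups:
  - name: alert-overview           # hypothetical group name
    rules:
      # Number of alerts currently firing, per alert name and severity.
      # ALERTS is the series Prometheus itself writes for pending and firing alerts.
      - record: alertname_severity:alerts_firing:count   # hypothetical rule name
        expr: count by (alertname, severity) (ALERTS{alertstate="firing"})
```

Graphing this series over 30 days shows which alerts dominate the noise.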

Link from alerts to relevant dashboards:

annotations:
  dashboard_url: https://grafana/d/abc123?var-service={{ $labels.service }}

On-Call Best Practices

Rotation Structure

Primary    → First responder, handles all pages
Secondary  → Backup if primary doesn't respond in 10 min
Escalation → Management, for major incidents

Response Time SLAs

| Severity | Acknowledge  | Start Work      |
|----------|--------------|-----------------|
| Critical | 5 minutes    | 15 minutes      |
| Warning  | 4 hours      | 8 hours         |
| Info     | Next standup | When convenient |

Post-Incident Review

Every critical alert should trigger a review:

  1. Was the alert necessary?
  2. Did it provide enough context?
  3. Was the runbook helpful?
  4. Could this be prevented?

If the same alert fires repeatedly without action, fix or delete it.

My Alerting Setup

# Core SLO alerts only
# error_rate, p99_latency, disk_will_fill_24h, and cert_expires_7d stand in
# for recording rules or the full expressions shown earlier.
groups:
  - name: slo
    rules:
      # Availability: error rate
      - alert: HighErrorRate
        expr: error_rate > 0.01
        for: 5m
        labels:
          severity: critical

      # Latency: P99
      - alert: HighLatency
        expr: p99_latency > 2
        for: 10m
        labels:
          severity: warning

      # Capacity: predictive
      - alert: DiskFillingUp
        expr: disk_will_fill_24h
        labels:
          severity: warning

      # Security: cert expiry
      - alert: CertExpiring
        expr: cert_expires_7d
        labels:
          severity: warning

# That's it. ~10 alert rules total.

Key decisions:

  • SLO-based — Alerts tied to user impact
  • Minimal rules — Each rule earns its place
  • Clear severity — Critical means wake up, warning means work hours
  • Every alert has a runbook — No guessing at 3 AM

Reducing Alert Fatigue

Steps to clean up:

  1. Audit current alerts — How many fired last month? How many were actionable?
  2. Delete unused alerts — If it hasn’t fired in 90 days, probably not needed
  3. Raise thresholds — If alerts always auto-resolve, threshold is too sensitive
  4. Add for duration — Brief spikes shouldn’t page
  5. Merge similar alerts — One “high error rate” beats per-endpoint alerts
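
Step 4 can be checked before a rule ships. The sketch below is a promtool unit test (run with `promtool test rules`). It assumes the ServiceDown rule from earlier is saved in a file named alerts.yaml, wrapped in a rule group, with the same labels and annotations; both file names are hypothetical. It feeds a synthetic `up` series and asserts that a one-sample blip never fires while a sustained outage does.

```yaml
# alerts_test.yaml (hypothetical file name)
rule_files:
  - alerts.yaml              # assumed to contain the ServiceDown rule with for: 1m

evaluation_interval: 1m

tests:
  # A single failed scrape: the alert goes pending, then clears before `for` elapses.
  - interval: 1m
    input_series:
      - series: 'up{job="my-service"}'
        values: '1 1 0 1 1 1'
    alert_rule_test:
      - eval_time: 3m
        alertname: ServiceDown
        exp_alerts: []       # nothing should be firing

  # A sustained outage: down from minute 2 onward.
  - interval: 1m
    input_series:
      - series: 'up{job="my-service"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      - eval_time: 5m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-service
            exp_annotations:
              summary: "Service my-service is down"
              runbook_url: https://wiki/runbooks/service-down
```

Keeping tests like this next to the rules makes threshold and `for` changes reviewable instead of guesswork.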

Goal: Every page should require human action.

Why This Matters

Alert fatigue kills reliability. When on-call engineers stop trusting alerts, real incidents get missed. When everyone is paged for everything, nobody takes responsibility.

Good alerting is:

  • Respectful of human attention
  • Actionable with clear next steps
  • Accurate about impact
  • Minimal in volume

Your observability stack with Loki and Tempo generates massive amounts of data. Alerting is how you turn that data into attention, so use it wisely.


The best alert is the one you don’t need to send because the system healed itself. The second best is the one that tells you exactly what’s wrong and how to fix it.