You have Grafana. You have Prometheus metrics. You have logs in Loki and traces in Tempo.

You also have 47 dashboards that nobody looks at.

Dashboard rot is real. Teams create dashboards for every possible metric, every service, every potential issue. Six months later, nobody remembers what half of them show or why they exist.

Good dashboards are different. They get opened daily. They answer questions before you ask. They help you understand your system, not just display numbers.

Why Dashboards Fail

Too Many Panels

The dashboard has 30 panels. Each shows a different metric. Together, they show nothing — because nobody can process 30 graphs at once.

Wrong Abstraction Level

The dashboard shows CPU, memory, and disk for each pod. Useful for debugging, useless for understanding if things are working.

No Context

Numbers without context are meaningless. “500 requests/second” — is that good? Bad? Normal?

Built and Forgotten

Created during an incident, never updated. Now shows metrics for services that were renamed two years ago.

Dashboard Design Principles

1. One Purpose Per Dashboard

Every dashboard answers one question:

Question | Dashboard
“Is the system healthy?” | Service overview
“Why is this service slow?” | Service deep dive
“What happened at 3 AM?” | Incident investigation
“Are we meeting SLOs?” | SLO dashboard

Don’t mix purposes. A health-check dashboard shouldn’t have debugging panels.

2. Progressive Disclosure

Start with the answer, drill down for details:

Level 1: Is everything OK? (green/yellow/red)
    ↓
Level 2: Which service has problems?
    ↓
Level 3: What specific metric is wrong?
    ↓
Level 4: Raw data for debugging

Each level is a different dashboard or section, linked together.

3. Context Over Numbers

Raw numbers on their own tell you very little. Add:

  • Thresholds: Color changes when metrics cross SLO boundaries
  • Comparisons: Today vs last week, this deploy vs previous
  • Annotations: Deployments, incidents, config changes marked on graphs

4. Design for Glanceability

If you can’t understand the dashboard in 5 seconds, it’s too complex.

Use:

  • Stat panels for current state (big numbers, color-coded)
  • Time series for trends (when did this start?)
  • Heatmaps for distributions (how latency spreads across buckets over time)

Avoid:

  • Tables with 50 rows
  • Graphs with 20 overlapping lines
  • Pie charts (always avoid pie charts)

Essential Dashboard Types

Service Overview Dashboard

The “is everything OK?” dashboard. One per service or service group.

# Panels (top to bottom, left to right)

Row 1: Status (stat panels)
  - Availability (% uptime last 24h)
  - Error Rate (current, color-coded)
  - P99 Latency (current, color-coded)
  - Active Instances (count)

Row 2: Trends (time series)
  - Request Rate (stacked by status code)
  - Latency (P50, P95, P99 lines)
  - Error Rate (percentage over time)

Row 3: Resources (gauges)
  - CPU Usage (% of limit)
  - Memory Usage (% of limit)
  - Pod Count (current vs desired)

Key queries:

# Availability (percentage of successful requests)
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
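
It helps to know what that P99 number actually is. `histogram_quantile` does not see individual requests, only cumulative bucket counts, so the result is an estimate interpolated inside one bucket. The sketch below is a simplified Python model of that interpolation, not Prometheus's actual code; the bucket boundaries and counts are made up for illustration.

```python
from math import inf


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket,
    mirroring the `le` label on a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations, nothing to estimate

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == inf:
                # The quantile falls past the last finite bucket, so the
                # best available answer is the highest finite boundary.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            in_bucket = count - lower_count
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for buckets at 100ms, 250ms, 500ms, 1s.
example = [(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 1000), (inf, 1000)]
p99 = histogram_quantile(0.99, example)
```

The practical consequence: a P99 panel can never be more precise than the bucket layout. If the SLO threshold is 500ms, it is worth having a bucket boundary at or near 0.5.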

SLO Dashboard

Tracks Service Level Objectives. Critical for meaningful alerting.

Row 1: SLO Status (stat panels)
  - Availability SLO (99.9% target)
  - Latency SLO (P99 < 500ms target)
  - Error Budget Remaining (%)

Row 2: Burn Rate (time series)
  - Error budget consumption rate
  - Projected budget exhaustion

Row 3: SLI Breakdown (time series)
  - Availability over time
  - Latency percentiles over time

Error budget calculation:

# Error budget remaining (for 99.9% SLO over 30 days)
1 - (
  (1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) /
        sum(rate(http_requests_total[30d])))) /
  0.001  # 0.1% error budget
)
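
The arithmetic behind that query, and behind the burn rate panels, fits in a few lines. This is a worked sketch in Python with invented request counts; the function names are my own, and the exhaustion projection assumes the current burn rate simply continues.

```python
def error_ratio(good, total):
    """Share of requests that failed (0.0 when there was no traffic)."""
    return 0.0 if total == 0 else (total - good) / total


def budget_remaining(good, total, slo=0.999):
    """Fraction of the error budget left; negative means the SLO is blown."""
    return 1 - error_ratio(good, total) / (1 - slo)


def burn_rate(good, total, slo=0.999):
    """1.0 means the budget is being spent exactly as fast as the SLO allows."""
    return error_ratio(good, total) / (1 - slo)


def days_until_exhausted(remaining, rate, window_days=30):
    """Project how long the remaining budget lasts at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return max(remaining, 0.0) * window_days / rate


# 30-day window: 10M requests, 4,000 failed -> 40% of the budget is spent.
remaining = budget_remaining(good=9_996_000, total=10_000_000)

# Last hour: 100k requests, 200 failed -> burning at twice the allowed pace.
rate = burn_rate(good=99_800, total=100_000)

days_left = days_until_exhausted(remaining, rate)
```

With these numbers, 60% of the budget is left, the last hour burned at 2x, and the projection comes out at roughly 9 days. That is the kind of number worth putting in a stat panel.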

Kubernetes Cluster Dashboard

Overview of cluster health. Not per-pod details — that’s for debugging.

Row 1: Cluster Status
  - Nodes Ready / Total
  - Pods Running / Scheduled
  - PVCs Bound / Total

Row 2: Resource Utilization
  - CPU: Used / Requested / Allocatable
  - Memory: Used / Requested / Allocatable

Row 3: Workload Health
  - Deployments Available / Total
  - StatefulSets Ready / Total
  - DaemonSets Ready / Total

Row 4: Recent Events (table)
  - Warning events from last hour

Debugging Dashboard

For incident investigation. This is the one place where more detail and more complexity are acceptable.

Row 1: Service Selection (variables)
  - Namespace dropdown
  - Service dropdown
  - Pod dropdown

Row 2: Request Flow
  - Incoming requests by endpoint
  - Outgoing requests by destination
  - Error breakdown by type

Row 3: Resource Details
  - CPU per container (time series)
  - Memory per container (time series)
  - Network I/O (time series)

Row 4: Logs Integration
  - Link to Loki with pre-filtered query
  - Recent error log count

Practical Panel Examples

Request Rate with Status Codes

# Uses the same "status" label as the overview queries
sum by (status) (
  rate(http_requests_total{service="$service"}[5m])
)

Visualization settings:

  • Stack series
  • Color by status: 2xx green, 3xx blue, 4xx yellow, 5xx red

Latency Heatmap

sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)

Visualization: Heatmap

  • Shows latency distribution over time
  • Hot spots reveal latency spikes

Resource Usage vs Limits

# CPU usage percentage of limit
sum(rate(container_cpu_usage_seconds_total{pod=~"$pod"}[5m])) /
sum(kube_pod_container_resource_limits{resource="cpu", pod=~"$pod"}) * 100

Visualization: Gauge

  • Thresholds: 0-70 green, 70-90 yellow, 90-100 red
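
The ranges above leave the boundaries open: is exactly 70% green or yellow? The sketch below picks one reading, where a color applies from its threshold upward, which is how I understand Grafana threshold steps to behave (treat that as an assumption). It also handles the case the query cannot: a container with no CPU limit set.

```python
def cpu_percent_of_limit(used_cores, limit_cores):
    """CPU usage as a percentage of the limit, or None when no limit is set."""
    if limit_cores is None or limit_cores <= 0:
        return None  # the PromQL ratio would return no data here
    return used_cores / limit_cores * 100


def usage_color(percent, warn=70.0, crit=90.0):
    """Map a usage percentage to a gauge color; thresholds are inclusive."""
    if percent is None:
        return "gray"  # hypothetical "no data" color
    if percent >= crit:
        return "red"
    if percent >= warn:
        return "yellow"
    return "green"


color = usage_color(cpu_percent_of_limit(used_cores=1.9, limit_cores=2.0))
```

Usage above 100% is possible for memory-style ratios and briefly for CPU, so the red band should be treated as open-ended rather than capped at 100.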

Pod Restart Indicator

sum(increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1h])) by (pod)

Visualization: Stat panel

  • Show only if > 0
  • Links to pod logs

Dashboard Variables

Variables make dashboards reusable. Essential ones:

# Namespace variable
name: namespace
query: label_values(kube_pod_info, namespace)
multi: true
include_all: true

# Service variable (filtered by namespace)
name: service
query: label_values(http_requests_total{namespace=~"$namespace"}, service)
multi: false

# Time comparison (values must be valid PromQL durations for use with offset)
name: comparison
options:
  - 1h
  - 1d
  - 1w

Use in queries:

# Current value (regex match, because $namespace is a multi-value variable)
sum(rate(http_requests_total{namespace=~"$namespace"}[5m]))

# Comparison value (offset by selected time)
sum(rate(http_requests_total{namespace=~"$namespace"}[5m] offset $comparison))
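
What ends up in the query depends on the variable type. A multi-value variable is expanded into a regex alternation, so it only works with a regex matcher (`=~`), while the value after `offset` has to be a bare PromQL duration such as `1d`. The Python below is a simplified model of that substitution to make the point concrete; Grafana's real interpolation has more formats and varies by data source, and the function names are hypothetical.

```python
import re


def as_regex(values):
    """Join the selected values into an alternation such as (prod|staging)."""
    return "(" + "|".join(re.escape(v) for v in values) + ")"


def render(template, single=None, multi=None):
    """Substitute $name placeholders; a naive stand-in for Grafana's templating."""
    query = template
    for name, value in (single or {}).items():
        query = query.replace("$" + name, value)
    for name, values in (multi or {}).items():
        query = query.replace("$" + name, as_regex(values))
    return query


template = (
    'sum(rate(http_requests_total{namespace=~"$namespace"}[5m] '
    "offset $comparison))"
)
rendered = render(
    template,
    single={"comparison": "1d"},
    multi={"namespace": ["prod", "staging"]},
)
```

Here `rendered` becomes `sum(rate(http_requests_total{namespace=~"(prod|staging)"}[5m] offset 1d))`. With an exact matcher (`=`) the same expansion would look for a namespace literally named `(prod|staging)` and return nothing.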

Annotations for Context

Mark important events on graphs:

# Deployment annotations
name: Deployments
datasource: Prometheus
query: changes(kube_deployment_status_observed_generation{deployment="$service"}[5m]) > 0
tags: deployment

# Alert annotations
name: Alerts
datasource: Alertmanager
filter: service="$service"
tags: alert

Now when latency spikes, you immediately see if it correlates with a deployment.

Linking Dashboards Together

Create navigation between dashboards:

On a service panel, link to the service detail dashboard:

URL: /d/service-detail?var-service=${__field.labels.service}
Title: View ${__field.labels.service} details

On the cluster overview, link to namespace-specific views:

URL: /d/namespace-view?var-namespace=${namespace}
Title: Drill down to namespace

Link to Loki for logs:

URL: /explore?left=["now-1h","now","Loki",{"expr":"{namespace=\"${namespace}\",pod=~\"${pod}\"}"}]
Title: View logs in Explore
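
The Explore link carries JSON inside a query parameter, so quotes and braces need URL encoding once the link is generated outside Grafana (in a runbook or a chat bot, for example). A minimal sketch in Python, assuming the same `left` array layout shown above; the function name and defaults are mine, and newer Grafana releases may expect a different Explore URL format.

```python
import json
from urllib.parse import quote


def loki_explore_url(namespace, pod_regex, datasource="Loki",
                     time_from="now-1h", time_to="now"):
    """Build a URL-encoded Explore link with a pre-filtered LogQL selector."""
    expr = '{namespace="%s",pod=~"%s"}' % (namespace, pod_regex)
    left = [time_from, time_to, datasource, {"expr": expr}]
    # Compact JSON, then percent-encode everything that is not URL-safe.
    payload = json.dumps(left, separators=(",", ":"))
    return "/explore?left=" + quote(payload, safe="")


url = loki_explore_url("prod", "api-.*")
```

Building the selector from parameters keeps the link consistent with whatever the dashboard variables were when the on-call engineer clicked through.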

Dashboard Organization

Folder Structure

├── Overview
│   ├── Platform Health
│   └── SLO Summary
├── Services
│   ├── API Gateway
│   ├── Auth Service
│   └── ...
├── Infrastructure
│   ├── Kubernetes Cluster
│   ├── Nodes
│   └── Storage
├── Debugging
│   ├── Service Debug
│   └── Incident Investigation
└── Archive
    └── (old dashboards, hidden)

Naming Convention

[Category] - [Subject] - [Type]

Examples:
- Services - API Gateway - Overview
- Services - API Gateway - Debug
- Infrastructure - Kubernetes - Nodes
- SLO - Platform - 30d
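
A convention only holds if something checks it. Here is a small sketch of a validator that could run in CI against dashboard titles stored in Git. The parsing rule (exactly three parts separated by " - ") is my reading of the convention above, and the names are hypothetical.

```python
from typing import NamedTuple, Optional


class DashboardName(NamedTuple):
    category: str
    subject: str
    kind: str


def parse_dashboard_name(title: str) -> Optional[DashboardName]:
    """Return the three name parts, or None if the title breaks the convention."""
    parts = [part.strip() for part in title.split(" - ")]
    if len(parts) != 3 or not all(parts):
        return None
    return DashboardName(*parts)


titles = [
    "Services - API Gateway - Overview",
    "SLO - Platform - 30d",
    "My cool dashboard",
]
invalid = [t for t in titles if parse_dashboard_name(t) is None]
```

Anything that lands in `invalid` either gets renamed or becomes a candidate for the Archive folder.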

Starring and Home Dashboard

Set the most important dashboard as the Grafana home. Star frequently used dashboards.

Dashboard as Code

Store dashboards in Git. Deploy with ArgoCD.

Grafana Dashboard ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: service-overview-dashboard
  labels:
    grafana_dashboard: "1"
data:
  service-overview.json: |
    {
      "title": "Service Overview",
      "uid": "service-overview",
      "panels": [
        ...
      ]
    }
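
Hand-writing JSON inside YAML gets painful quickly. One option is to keep each dashboard as a plain JSON file and generate the ConfigMap from it. The sketch below does that in Python and emits the manifest as JSON, which Kubernetes accepts alongside YAML. The function name is hypothetical, and the `grafana_dashboard` label is assumed to match whatever label your Grafana dashboard sidecar is configured to watch.

```python
import json


def dashboard_configmap(dashboard, name, label="grafana_dashboard",
                        label_value="1"):
    """Wrap a dashboard dict in a ConfigMap manifest keyed by its uid."""
    filename = dashboard["uid"] + ".json"
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "labels": {label: label_value}},
        # ConfigMap values must be strings, so the dashboard is serialized.
        "data": {filename: json.dumps(dashboard, indent=2, sort_keys=True)},
    }


dashboard = {"title": "Service Overview", "uid": "service-overview", "panels": []}
manifest = dashboard_configmap(dashboard, name="service-overview-dashboard")
manifest_json = json.dumps(manifest, indent=2)
```

Sorting keys keeps the output stable between runs, which makes Git diffs of dashboard changes readable.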

Grafonnet (Jsonnet library)

local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;

dashboard.new(
  'Service Overview',
  uid='service-overview',
  tags=['service', 'overview'],
)
.addPanel(
  grafana.statPanel.new(
    'Error Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target(
      'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100',
    )
  ),
  gridPos={ x: 0, y: 0, w: 6, h: 4 },
)

My Dashboard Setup

For my homelab cluster, I maintain exactly 5 dashboards:

  1. Home — Everything OK? (one glance)
  2. SLOs — Am I meeting objectives?
  3. Cluster — Kubernetes health
  4. Service Debug — Variable-driven deep dive
  5. Incident — Full observability stack integration

That’s it. I resist creating more.

Each dashboard has a purpose. Each gets used weekly or more. If a dashboard isn’t opened in a month, it goes to the archive, and from there it gets deleted.

Avoiding Dashboard Rot

Regular Review

Monthly: Review all dashboards

  • Which were opened this month?
  • Are queries still valid?
  • Do thresholds match current reality?

Sunset Process

  1. Dashboard not opened in 30 days → move to Archive folder
  2. Dashboard in Archive for 60 days → delete
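
The two rules are simple enough to automate. A sketch of the decision logic in Python, assuming you can get a last-opened date and an archive date for each dashboard from somewhere; where those dates come from is left open, and the function name and return values are my own.

```python
from datetime import date


def sunset_action(today, last_opened, archived_on=None,
                  idle_days=30, archive_days=60):
    """Decide what to do with a dashboard under the two sunset rules."""
    if archived_on is not None:
        # Rule 2: delete once it has sat in Archive long enough.
        if (today - archived_on).days >= archive_days:
            return "delete"
        return "keep-archived"
    # Rule 1: archive once nobody has opened it for idle_days.
    if (today - last_opened).days >= idle_days:
        return "archive"
    return "keep"


today = date(2024, 6, 30)
action = sunset_action(today, last_opened=date(2024, 5, 1))
```

Running something like this monthly and posting the result to the team channel turns the review from a chore into a short list of decisions.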

Documentation

Each dashboard has a description:

Purpose: Quick health check for production services
Audience: On-call engineers
When to use: Daily check, incident first response
Links to: Service Debug dashboard for details

Why This Matters

Dashboards are communication tools. A good dashboard tells a story about your system. A bad dashboard is noise that trains people to ignore monitoring.

Your observability stack collects massive amounts of data. Dashboards are how you make that data useful. Design them with intention, maintain them with discipline, and delete them when they stop earning their place.

The best dashboard is the one that gets looked at.


Metrics are easy to collect. Dashboards are easy to create. Understanding is hard to achieve. Design for understanding.