You have Grafana. You have Prometheus metrics. You have logs in Loki and traces in Tempo.
You also have 47 dashboards that nobody looks at.
Dashboard rot is real. Teams create dashboards for every possible metric, every service, every potential issue. Six months later, nobody remembers what half of them show or why they exist.
Good dashboards are different. They get opened daily. They answer questions before you ask. They help you understand your system, not just display numbers.
Why Dashboards Fail
Too Many Panels
The dashboard has 30 panels. Each shows a different metric. Together, they show nothing — because nobody can process 30 graphs at once.
Wrong Abstraction Level
The dashboard shows CPU, memory, and disk for each pod. Useful for debugging, useless for understanding whether things are working.
No Context
Numbers without context are meaningless. “500 requests/second” — is that good? Bad? Normal?
Built and Forgotten
Created during an incident, never updated. Now shows metrics for services that were renamed two years ago.
Dashboard Design Principles
1. One Purpose Per Dashboard
Every dashboard answers one question:
| Question | Dashboard |
|---|---|
| “Is the system healthy?” | Service overview |
| “Why is this service slow?” | Service deep dive |
| “What happened at 3 AM?” | Incident investigation |
| “Are we meeting SLOs?” | SLO dashboard |
Don’t mix purposes. A health-check dashboard shouldn’t have debugging panels.
2. Progressive Disclosure
Start with the answer, drill down for details:
Level 1: Is everything OK? (green/yellow/red)
↓
Level 2: Which service has problems?
↓
Level 3: What specific metric is wrong?
↓
Level 4: Raw data for debugging
Each level is a different dashboard or section, linked together.
3. Context Over Numbers
Raw numbers are useless on their own. Add:
- Thresholds: Color changes when metrics cross SLO boundaries
- Comparisons: Today vs last week, this deploy vs previous
- Annotations: Deployments, incidents, config changes marked on graphs
4. Design for Glanceability
If you can’t understand the dashboard in 5 seconds, it’s too complex.
Use:
- Stat panels for current state (big numbers, color-coded)
- Time series for trends (when did this start?)
- Heatmaps for distributions (latency buckets over time)
Avoid:
- Tables with 50 rows
- Graphs with 20 overlapping lines
- Pie charts (always avoid pie charts)
Essential Dashboard Types
Service Overview Dashboard
The “is everything OK?” dashboard. One per service or service group.
# Panels (top to bottom, left to right)
Row 1: Status (stat panels)
- Availability (% uptime last 24h)
- Error Rate (current, color-coded)
- P99 Latency (current, color-coded)
- Active Instances (count)
Row 2: Trends (time series)
- Request Rate (stacked by status code)
- Latency (P50, P95, P99 lines)
- Error Rate (percentage over time)
Row 3: Resources (gauges)
- CPU Usage (% of limit)
- Memory Usage (% of limit)
- Pod Count (current vs desired)
Key queries:
# Availability (percentage of successful requests)
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
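To make the P99 query less of a black box, here is a rough Python sketch of the kind of bucket interpolation `histogram_quantile` performs. It is a simplification for illustration, not Prometheus's actual implementation, and the bucket bounds and counts are made-up numbers.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The last bucket is expected to have an upper bound of math.inf,
    mirroring the le="+Inf" bucket of a Prometheus histogram.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper edge (or an empty bucket) to interpolate in:
                # fall back to the last known boundary.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets in seconds (cumulative counts).
example_buckets = [(0.1, 900), (0.5, 990), (1.0, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, example_buckets)  # about 0.5 seconds
```

The takeaway for dashboard design: the reported P99 can only be as precise as the bucket boundaries, so a latency threshold that sits in the middle of a wide bucket is an estimate.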
SLO Dashboard
Tracks Service Level Objectives. Critical for meaningful alerting.
Row 1: SLO Status (stat panels)
- Availability SLO (99.9% target)
- Latency SLO (P99 < 500ms target)
- Error Budget Remaining (%)
Row 2: Burn Rate (time series)
- Error budget consumption rate
- Projected budget exhaustion
Row 3: SLI Breakdown (time series)
- Availability over time
- Latency percentiles over time
Error budget calculation:
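# Error budget remaining (for 99.9% SLO over 30 days)
# Returns a fraction (1 = untouched, 0 = exhausted, negative = overspent);
# multiply by 100 or use a 0.0-1.0 percent unit on the panel.
1 - (
  (1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) /
        sum(rate(http_requests_total[30d])))) /
  0.001 # 0.1% error budget
)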
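The arithmetic behind the error budget and burn rate panels is easier to check with concrete numbers. The sketch below mirrors the PromQL expression in plain Python; the request counts are invented, and the function names are only illustrative.

```python
import math


def error_budget_remaining(total_requests, failed_requests, slo=0.999):
    """Fraction of the error budget left over the SLO window."""
    error_ratio = failed_requests / total_requests
    budget = 1 - slo  # 0.1% of requests may fail for a 99.9% SLO
    return 1 - error_ratio / budget


def burn_rate(total_requests, failed_requests, slo=0.999):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return (failed_requests / total_requests) / (1 - slo)


def days_until_exhausted(remaining_fraction, current_burn_rate, window_days=30):
    """Projected days until the budget hits zero at the current burn rate."""
    if current_burn_rate <= 0:
        return math.inf
    return remaining_fraction * window_days / current_burn_rate


# Hypothetical 30-day totals: 10M requests, 4,000 of them failed.
remaining = error_budget_remaining(10_000_000, 4_000)  # about 0.6
# Hypothetical last hour: 50,000 requests, 100 failed.
rate_now = burn_rate(50_000, 100)  # about 2.0
days_left = days_until_exhausted(remaining, rate_now)  # about 9 days
```

A burn rate of 2 with 60% of the budget left means roughly nine days until exhaustion, which is the kind of number the "Projected budget exhaustion" panel should surface.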
Kubernetes Cluster Dashboard
Overview of cluster health. Not per-pod details — that’s for debugging.
Row 1: Cluster Status
- Nodes Ready / Total
- Pods Running / Scheduled
- PVCs Bound / Total
Row 2: Resource Utilization
- CPU: Used / Requested / Allocatable
- Memory: Used / Requested / Allocatable
Row 3: Workload Health
- Deployments Available / Total
- StatefulSets Ready / Total
- DaemonSets Ready / Total
Row 4: Recent Events (table)
- Warning events from last hour
Debugging Dashboard
For incident investigation. More detailed; complexity is acceptable here.
Row 1: Service Selection (variables)
- Namespace dropdown
- Service dropdown
- Pod dropdown
Row 2: Request Flow
- Incoming requests by endpoint
- Outgoing requests by destination
- Error breakdown by type
Row 3: Resource Details
- CPU per container (time series)
- Memory per container (time series)
- Network I/O (time series)
Row 4: Logs Integration
- Link to Loki with pre-filtered query
- Recent error log count
Practical Panel Examples
Request Rate with Status Codes
sum by (status) (
  rate(http_requests_total{service="$service"}[5m])
)
Visualization settings:
- Stack series
- Color by status: 2xx green, 3xx blue, 4xx yellow, 5xx red
Latency Heatmap
sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
Visualization: Heatmap
- Shows latency distribution over time
- Hot spots reveal latency spikes
Resource Usage vs Limits
# CPU usage percentage of limit
sum(rate(container_cpu_usage_seconds_total{pod=~"$pod"}[5m])) /
sum(kube_pod_container_resource_limits{resource="cpu", pod=~"$pod"}) * 100
Visualization: Gauge
- Thresholds: 0-70 green, 70-90 yellow, 90-100 red
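Threshold ranges like these are ambiguous at the edges: is exactly 70% green or yellow? As far as I know, Grafana treats each threshold step as a lower bound, so a value takes the color of the highest step it has reached. A small Python sketch of that rule, using the same steps as above:

```python
def threshold_color(percent, steps=((0, "green"), (70, "yellow"), (90, "red"))):
    """Return the color of the highest threshold step that percent has reached."""
    ordered = sorted(steps)
    color = ordered[0][1]  # base color for anything below the first step
    for lower_bound, name in ordered:
        if percent >= lower_bound:
            color = name
    return color


colors = [threshold_color(v) for v in (42, 70, 95)]
# 42 stays green, 70 already counts as yellow, 95 is red
```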
Pod Restart Indicator
sum(increase(kube_pod_container_status_restarts_total{namespace=~"$namespace"}[1h])) by (pod)
Visualization: Stat panel
- Show only if > 0
- Links to pod logs
Dashboard Variables
Variables make dashboards reusable. Essential ones:
# Namespace variable
name: namespace
query: label_values(kube_pod_info, namespace)
multi: true
include_all: true
# Service variable (filtered by namespace)
name: service
query: label_values(http_requests_total{namespace=~"$namespace"}, service)
multi: false
# Time comparison (values must be valid PromQL durations for use with offset)
name: comparison
options:
- 1h
- 1d
- 1w
Use in queries:
# Current value (=~ because namespace is a multi-value variable)
sum(rate(http_requests_total{namespace=~"$namespace"}[5m]))
# Comparison value (offset by selected time)
sum(rate(http_requests_total{namespace=~"$namespace"}[5m] offset $comparison))
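Two details in variable-driven queries are easy to get wrong: a multi-value variable expands into a regex alternation, which only works with the `=~` matcher, and `offset` accepts only a PromQL duration such as `1w`. The Python sketch below imitates roughly what the interpolation produces. It is an approximation of Grafana's behavior, the function names are hypothetical, and the duration check is deliberately simplified.

```python
import re

# Kubernetes namespace names are DNS labels, so they contain no regex
# metacharacters that would need escaping inside the alternation.
NAMESPACE = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
# Simplified duration check: a number followed by one unit.
DURATION = re.compile(r"\d+[smhdwy]")


def interpolate_multi_value(values):
    """Join selected values the way a multi-value variable is expanded."""
    for value in values:
        if not NAMESPACE.fullmatch(value):
            raise ValueError(f"unexpected namespace name: {value!r}")
    if len(values) == 1:
        return values[0]
    return "(" + "|".join(values) + ")"


def build_comparison_query(namespaces, offset):
    """Render the comparison query for the selected namespaces and offset."""
    if not DURATION.fullmatch(offset):
        raise ValueError(f"not a PromQL duration: {offset!r}")
    selector = interpolate_multi_value(namespaces)
    return (
        f'sum(rate(http_requests_total{{namespace=~"{selector}"}}[5m] '
        f"offset {offset}))"
    )


query = build_comparison_query(["prod", "staging"], "1w")
# sum(rate(http_requests_total{namespace=~"(prod|staging)"}[5m] offset 1w))
```

A label like "1h ago" would fail the duration check here, just as it would fail to parse in Prometheus.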
Annotations for Context
Mark important events on graphs:
# Deployment annotations
name: Deployments
datasource: Prometheus
query: changes(kube_deployment_status_observed_generation{deployment="$service"}[5m]) > 0
tags: deployment
# Alert annotations
name: Alerts
datasource: Alertmanager
filter: service="$service"
tags: alert
Now when latency spikes, you immediately see if it correlates with a deployment.
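Query-based annotations depend on the right metrics being scraped. An alternative, not part of the setup above, is to have the deploy pipeline push an annotation through Grafana's HTTP API (`POST /api/annotations`). The Python sketch below only builds the request body; the service name, version, and function name are hypothetical, and sending the request with an API token is left out.

```python
import json
from datetime import datetime, timezone


def deployment_annotation(service, version, deployed_at):
    """Build the JSON body for an annotation marking a deployment."""
    return {
        # Grafana expects the timestamp as epoch milliseconds.
        "time": int(deployed_at.timestamp() * 1000),
        "tags": ["deployment", service],
        "text": f"Deployed {service} {version}",
    }


payload = deployment_annotation(
    "api-gateway",
    "v1.4.2",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
)
body = json.dumps(payload)  # what the pipeline would POST
```

Tagging the annotation with the service name lets a dashboard filter annotations by tag, so each service view shows only its own deployments.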
Linking Dashboards Together
Create navigation between dashboards:
Data Links
On a service panel, link to the service detail dashboard:
URL: /d/service-detail?var-service=${__field.labels.service}
Title: View ${__field.labels.service} details
Panel Links
On the cluster overview, link to namespace-specific views:
URL: /d/namespace-view?var-namespace=${namespace}
Title: Drill down to namespace
Explore Links
Link to Loki for logs:
URL: /explore?left=["now-1h","now","Loki",{"expr":"{namespace=\"${namespace}\",pod=~\"${pod}\"}"}]
Title: View logs in Explore
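The Explore link embeds JSON with quotes and braces in the URL, so it needs to be URL-encoded when generated outside Grafana. A minimal Python sketch, assuming the same `left` array layout as the link above (newer Grafana versions may use a different Explore URL scheme) and a hypothetical helper name:

```python
import json
from urllib.parse import quote


def explore_logs_url(namespace, pod_regex, datasource="Loki", base="/explore"):
    """Build a URL-encoded Explore link for a namespace and pod pattern."""
    logql = f'{{namespace="{namespace}",pod=~"{pod_regex}"}}'
    left = ["now-1h", "now", datasource, {"expr": logql}]
    # Compact separators keep the encoded URL short; quote() escapes the rest.
    encoded = quote(json.dumps(left, separators=(",", ":")))
    return f"{base}?left={encoded}"


url = explore_logs_url("prod", "api-.*")
```

Inside a Grafana data link the variables are interpolated for you; a helper like this is mainly useful when links are generated from dashboard-as-code tooling or chat bots.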
Dashboard Organization
Folder Structure
├── Overview
│   ├── Platform Health
│   └── SLO Summary
├── Services
│   ├── API Gateway
│   ├── Auth Service
│   └── ...
├── Infrastructure
│   ├── Kubernetes Cluster
│   ├── Nodes
│   └── Storage
├── Debugging
│   ├── Service Debug
│   └── Incident Investigation
└── Archive
    └── (old dashboards, hidden)
Naming Convention
[Category] - [Subject] - [Type]
Examples:
- Services - API Gateway - Overview
- Services - API Gateway - Debug
- Infrastructure - Kubernetes - Nodes
- SLO - Platform - 30d
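A naming convention only holds if something checks it. The sketch below is one way to validate titles in CI, assuming dashboards are stored as files in Git; the function name is hypothetical.

```python
def parse_dashboard_name(name):
    """Split '[Category] - [Subject] - [Type]' into its three parts."""
    parts = [part.strip() for part in name.split(" - ")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'Category - Subject - Type', got {name!r}")
    return dict(zip(("category", "subject", "type"), parts))


parsed = parse_dashboard_name("Services - API Gateway - Overview")
# {'category': 'Services', 'subject': 'API Gateway', 'type': 'Overview'}
```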
Starring and Home Dashboard
Set the most important dashboard as the Grafana home. Star frequently used dashboards.
Dashboard as Code
Store dashboards in Git. Deploy with Argo CD.
Grafana Dashboard ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: service-overview-dashboard
  labels:
    grafana_dashboard: "1"
data:
  service-overview.json: |
    {
      "title": "Service Overview",
      "uid": "service-overview",
      "panels": [
        ...
      ]
    }
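Hand-editing JSON inside a YAML block gets painful quickly. One option is to generate the ConfigMap from the exported dashboard JSON. The Python sketch below emits the manifest as JSON, which Kubernetes accepts alongside YAML; the function name is hypothetical and the dashboard is a stub with an empty panel list.

```python
import json


def dashboard_configmap(dashboard, name):
    """Wrap a dashboard definition in a ConfigMap picked up via its label."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "labels": {"grafana_dashboard": "1"},
        },
        # The key becomes the file name; the value is the dashboard JSON.
        "data": {f"{dashboard['uid']}.json": json.dumps(dashboard, indent=2)},
    }


stub_dashboard = {"title": "Service Overview", "uid": "service-overview", "panels": []}
manifest = dashboard_configmap(stub_dashboard, "service-overview-dashboard")
manifest_json = json.dumps(manifest, indent=2)  # commit this file to Git
```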
Grafonnet (Jsonnet library)
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;

dashboard.new(
  'Service Overview',
  uid='service-overview',
  tags=['service', 'overview'],
)
.addPanel(
  grafana.statPanel.new(
    'Error Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target(
      'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100',
    )
  ),
  gridPos={ x: 0, y: 0, w: 6, h: 4 },
)
My Dashboard Setup
For my homelab cluster, I maintain exactly 5 dashboards:
- Home — Everything OK? (one glance)
- SLOs — Am I meeting objectives?
- Cluster — Kubernetes health
- Service Debug — Variable-driven deep dive
- Incident — Full observability stack integration
That’s it. I resist creating more.
Each dashboard has a purpose. Each gets used weekly or more. If a dashboard isn’t opened in a month, it goes to the archive, and from there it gets deleted.
Avoiding Dashboard Rot
Regular Review
Monthly: Review all dashboards
- Which were opened this month?
- Are queries still valid?
- Do thresholds match current reality?
Sunset Process
- Dashboard not opened in 30 days → move to Archive folder
- Dashboard in Archive for 60 days → delete
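The sunset rules are simple enough to automate. Here is a minimal Python sketch of the decision, assuming you can obtain a last-opened date and an archive date for each dashboard from somewhere (usage stats, folder metadata); the function name and the dates are made up.

```python
from datetime import date


def sunset_action(today, last_opened, archived_on=None):
    """Decide what to do with a dashboard under the 30/60-day rules."""
    if archived_on is not None:
        # Already archived: delete once it has sat there for 60 days.
        if (today - archived_on).days >= 60:
            return "delete"
        return "keep in archive"
    # Still active: archive after 30 days without being opened.
    if (today - last_opened).days >= 30:
        return "archive"
    return "keep"


today = date(2024, 6, 1)
action = sunset_action(today, last_opened=date(2024, 4, 15))  # "archive"
```

The sketch ignores one case on purpose: an archived dashboard that someone opens again should probably move back out of the archive instead of continuing toward deletion.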
Documentation
Each dashboard has a description:
Purpose: Quick health check for production services
Audience: On-call engineers
When to use: Daily check, incident first response
Links to: Service Debug dashboard for details
Why This Matters
Dashboards are communication tools. A good dashboard tells a story about your system. A bad dashboard is noise that trains people to ignore monitoring.
Your observability stack collects massive amounts of data. Dashboards are how you make that data useful. Design them with intention, maintain them with discipline, and delete them when they stop earning their place.
The best dashboard is the one that gets looked at.
Metrics are easy to collect. Dashboards are easy to create. Understanding is hard to achieve. Design for understanding.
