Grafana Dashboards That Actually Get Used

You have Grafana. You have Prometheus metrics. You have logs in Loki and traces in Tempo. The data is all there.

You also have 47 dashboards that nobody opens.

I have done this to myself more than once. Something breaks at 2 AM, I bolt together a dashboard to see what’s going on, and then it just sits there forever. Multiply that by a year of incidents and a few “let me just add a panel for that” moments, and you end up with a Grafana that’s mostly archaeology. Nobody remembers what half the panels mean. The honest move is to delete most of them, but first it helps to understand what makes the survivors worth keeping.

The dashboards I actually use have one thing in common: I can open one and know within five seconds whether I need to care. That’s the whole bar. Everything below is how I get there, starting from the simplest dashboard that earns its place and layering up to the full setup I run for my homelab cluster.

This connects to something I keep coming back to on this blog. A pile of metrics you can’t read is just another black box, except this one you built yourself. The point of observability is understanding, and a dashboard that nobody can parse fails at exactly that.

Where Dashboards Go Wrong

Before the good patterns, the failure modes. I’ve shipped all of these.

Too many panels. Thirty panels, thirty metrics, and somehow zero answers, because no human reads thirty graphs at a glance. The panels compete for attention and win nothing.

Wrong abstraction level. CPU, memory and disk for every individual pod is great when you’re elbow-deep in a problem and useless when you just want to know if the service is up. Those are different questions and they want different dashboards.

No context. “500 requests/second.” Good? Bad? Normal? A number with nothing to compare it against is decoration.

Built and forgotten. The incident dashboard from two years ago, still querying services that got renamed in a refactor nobody told it about. It shows flat lines and everyone has learned to ignore it.

Start Simple: One Dashboard, One Question

The rule that fixes most of this is boring: each dashboard answers exactly one question.

Question	Dashboard
“Is the system healthy?”	Service overview
“Why is this service slow?”	Service deep dive
“What happened at 3 AM?”	Incident investigation
“Are we meeting SLOs?”	SLO dashboard

Keep the purposes separate. The moment you bolt a debugging panel onto your health-check dashboard, you’ve started the slide back toward 47 dashboards nobody reads.

The simplest useful dashboard is the “is everything OK?” one. A handful of stat panels, big and colour-coded, that go red when something is actually wrong. If that’s the only dashboard you ever build, you’re already ahead of most teams.

Let people drill down instead of cramming it all in

Once the one-glance dashboard works, the natural next question is “OK, but which service?” Don’t answer that on the same screen. Link out to a deeper one. The mental model I use:

Level 1: Is everything OK? (green/yellow/red)
    ↓
Level 2: Which service has problems?
    ↓
Level 3: What specific metric is wrong?
    ↓
Level 4: Raw data for debugging

Each level lives in its own dashboard or section, wired together with links. You stop at whichever level answers your question. Most of the time that’s level 1.

Give every number something to compare against

A raw number tells you almost nothing. The same value can be fine on a Tuesday afternoon and a five-alarm fire at 3 AM. So I add context wherever I can:

Thresholds: the panel changes colour when a metric crosses an SLO boundary, so red means red without you doing arithmetic
Comparisons: today against last week, this deploy against the previous one
Annotations: deployments, incidents and config changes drawn straight onto the graph

That last one has saved me more debugging time than anything else. When latency spikes and there’s a deployment marker sitting right under the spike, you’ve found your culprit before you’ve finished your coffee.

Make it readable in five seconds

If I can’t read a dashboard in five seconds, it’s too complex and I will stop opening it, which defeats the point. Stick to a small vocabulary of panel types:

Stat panels for current state (big numbers, colour-coded)
Time series for trends, so you can see when something started
Heatmaps for distributions like latency percentiles

And avoid the things that look informative but aren’t: tables with 50 rows, graphs with 20 overlapping lines, and pie charts. Always avoid pie charts.

Layer 1: The Dashboards You Actually Need

With the principles in place, here are the dashboards that earn a permanent spot. You don’t need more than a few.

Service overview

The “is everything OK?” dashboard, one per service or service group. This is the one you’ll open most.

# Panels (top to bottom, left to right)

Row 1: Status (stat panels)
  - Availability (% uptime last 24h)
  - Error Rate (current, color-coded)
  - P99 Latency (current, color-coded)
  - Active Instances (count)

Row 2: Trends (time series)
  - Request Rate (stacked by status code)
  - Latency (P50, P95, P99 lines)
  - Error Rate (percentage over time)

Row 3: Resources (gauges)
  - CPU Usage (% of limit)
  - Memory Usage (% of limit)
  - Pod Count (current vs desired)

Key queries:

# Availability (percentage of successful requests)
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
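
If the P99 query feels like magic, it helps to see roughly what histogram_quantile does with those le buckets. This is a simplified Python sketch of the interpolation for classic histograms, not Prometheus's actual code; the bucket bounds and counts are made-up illustration values, and it assumes at least one finite bucket plus a final +Inf bucket.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified model: 0 < q <= 1, buckets are cumulative, and the last
    upper bound is +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lands in the +Inf bucket, so the best
                # available answer is the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical cumulative bucket counts for a latency histogram (seconds).
example = [(0.1, 900), (0.5, 990), (1.0, 998), (math.inf, 1000)]
print(histogram_quantile(0.95, example))
```

The practical takeaway is that the result can never be more precise than your bucket boundaries, so a P99 panel is only as good as the buckets the service exports.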

SLO dashboard

This one tracks Service Level Objectives and tells you whether you’re spending your error budget faster than you can afford. It’s the backbone of meaningful alerting, because you want to page someone when the budget is burning, not every time a single request hiccups.

Row 1: SLO Status (stat panels)
  - Availability SLO (99.9% target)
  - Latency SLO (P99 < 500ms target)
  - Error Budget Remaining (%)

Row 2: Burn Rate (time series)
  - Error budget consumption rate
  - Projected budget exhaustion

Row 3: SLI Breakdown (time series)
  - Availability over time
  - Latency percentiles over time

Error budget calculation:

# Error budget remaining (for 99.9% SLO over 30 days)
1 - (
  (1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) /
        sum(rate(http_requests_total[30d])))) /
  0.001  # 0.1% error budget
)
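
The arithmetic behind that query is easier to check with plain numbers. Here is a small Python sketch of the same calculation, plus the burn rate that the SLO dashboard plots; the function names and the request counts are mine, purely for illustration.

```python
def error_budget_remaining(good, total, slo=0.999):
    """Fraction of the error budget left; negative means it is overspent."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    error_ratio = 1 - good / total
    return 1 - error_ratio / (1 - slo)

def burn_rate(good, total, slo=0.999):
    """Observed error ratio relative to the allowed one.

    1.0 means errors arrive exactly as fast as the SLO permits.
    """
    if total == 0:
        return 0.0
    return (1 - good / total) / (1 - slo)

# Hypothetical 30-day totals: 500 failures out of a million requests.
print(error_budget_remaining(999_500, 1_000_000))
print(burn_rate(999_500, 1_000_000))
```

With a 99.9% target, 500 failures in a million requests uses about half the budget. Double the failures and the budget is gone; anything beyond that goes negative, which is exactly the state you want to be paged about.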

Kubernetes cluster dashboard

A bird’s-eye view of cluster health. Resist the urge to put per-pod detail here. That belongs in the debugging dashboard, and mixing the two is how you end up back at thirty panels.

Row 1: Cluster Status
  - Nodes Ready / Total
  - Pods Running / Scheduled
  - PVCs Bound / Total

Row 2: Resource Utilization
  - CPU: Used / Requested / Allocatable
  - Memory: Used / Requested / Allocatable

Row 3: Workload Health
  - Deployments Available / Total
  - StatefulSets Ready / Total
  - DaemonSets Ready / Total

Row 4: Recent Events (table)
  - Warning events from last hour

Debugging dashboard

This is the one place where complexity is allowed. When you’re investigating an incident you want detail, variables, and the ability to slice things by namespace and pod. The five-second rule doesn’t apply when you’re already neck-deep in a problem.

Row 1: Service Selection (variables)
  - Namespace dropdown
  - Service dropdown
  - Pod dropdown

Row 2: Request Flow
  - Incoming requests by endpoint
  - Outgoing requests by destination
  - Error breakdown by type

Row 3: Resource Details
  - CPU per container (time series)
  - Memory per container (time series)
  - Network I/O (time series)

Row 4: Logs Integration
  - Link to Loki with pre-filtered query
  - Recent error log count

Layer 2: Panels Worth Stealing

Here are the individual panels I reach for again and again. Copy them, point them at your own metrics, done.

Request rate with status codes

sum by (status_code) (
  rate(http_requests_total{service="$service"}[5m])
)

Stack the series and colour them by status: 2xx green, 3xx blue, 4xx yellow, 5xx red. When the graph goes red, you don’t need to read the legend.

Latency heatmap

sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)

Render it as a heatmap. You get the full latency distribution over time, and the hot spots jump out where averages would have hidden them. A p99 spike that a line graph smooths away shows up here as a bright band.

Resource usage vs limits

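# CPU usage as a percentage of the limit
# (container!="" drops cAdvisor's pod-level aggregate series, which would otherwise be counted twice)
sum(rate(container_cpu_usage_seconds_total{pod=~"$pod", container!=""}[5m])) /
sum(kube_pod_container_resource_limits{resource="cpu", pod=~"$pod"}) * 100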

A gauge with thresholds at 0-70 green, 70-90 yellow, 90-100 red. Showing usage as a percentage of the limit rather than raw cores means the same panel works for every service without retuning.
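
One detail worth pinning down is what happens exactly at 70 or 90. As far as I understand Grafana's threshold steps, each colour applies from its own value upward, so the boundary belongs to the higher band. A minimal Python sketch of that rule, with names I made up for illustration:

```python
def threshold_colour(value, steps):
    """Return the colour of the highest step whose lower bound is <= value.

    `steps` is a list of (lower_bound, colour) pairs sorted ascending.
    """
    colour = steps[0][1]
    for lower_bound, step_colour in steps:
        if value >= lower_bound:
            colour = step_colour
    return colour

# The gauge described above: green from 0, yellow from 70, red from 90.
GAUGE_STEPS = [(0, "green"), (70, "yellow"), (90, "red")]

for usage in (42, 70, 89.9, 95):
    print(usage, threshold_colour(usage, GAUGE_STEPS))
```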

Pod restart indicator

sum(increase(kube_pod_container_status_restarts_total{namespace=~"$namespace"}[1h])) by (pod)

A stat panel that only shows up when the count is above zero, linked straight to the pod logs. A pod quietly restarting every few minutes is the kind of slow bleed you want surfaced before it becomes an outage.

Variables: One Dashboard for Every Service

Variables are what stop you from cloning the same dashboard fifty times. Define them once and the dashboard re-targets itself with a dropdown. These three cover most of what I need:

# Namespace variable
name: namespace
query: label_values(kube_pod_info, namespace)
multi: true
include_all: true

# Service variable (filtered by namespace; reads the same service label the panel queries filter on)
name: service
query: label_values(http_requests_total{namespace=~"$namespace"}, service)
multi: false

# Time comparison (custom variable; each value must be a valid PromQL duration)
name: comparison
options:
  - 1h
  - 1d
  - 1w

Then reference them in the queries:

# Current value (=~ because namespace is a multi-value variable)
sum(rate(http_requests_total{namespace=~"$namespace"}[5m]))

# Comparison value (offset by selected time)
sum(rate(http_requests_total{namespace=~"$namespace"}[5m] offset $comparison))

That offset $comparison trick gives you a “vs last week” overlay almost for free, which is the context-over-numbers idea turned into a single query.
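
The one thing that bites here is that whatever the variable expands to must be a bare PromQL duration such as 1w, or the query will not parse. A small Python sketch of the substitution, with a deliberately simplified duration check (real PromQL also accepts compound durations like 1h30m); the helper name is hypothetical:

```python
import re

# Simplified: a single integer followed by one unit.
DURATION = re.compile(r"^\d+(ms|[smhdwy])$")

def comparison_queries(metric, selector, window="5m", offset="1w"):
    """Return (current, comparison) PromQL strings for an overlay panel."""
    if not DURATION.match(offset):
        raise ValueError(f"not a PromQL duration: {offset!r}")
    current = f"sum(rate({metric}{{{selector}}}[{window}]))"
    comparison = f"sum(rate({metric}{{{selector}}}[{window}] offset {offset}))"
    return current, comparison

current, previous = comparison_queries(
    "http_requests_total", 'namespace=~"prod"', offset="1w"
)
print(current)
print(previous)
```

A friendlier dropdown label like "1w ago" is fine as display text, as long as the value that reaches the query is just the duration.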

Annotations: Correlate Without Guessing

Annotations mark events directly on your graphs. Two sources earn their keep:

# Deployment annotations
name: Deployments
datasource: Prometheus
query: changes(kube_deployment_status_observed_generation{deployment="$service"}[5m]) > 0
tags: deployment

# Alert annotations
name: Alerts
datasource: Alertmanager
filter: service="$service"
tags: alert

Now a latency spike with a deployment marker underneath it answers the “what changed?” question before you’ve even opened a terminal. Half my incidents turn out to be “we deployed something,” and the marker tells me that instantly.
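
What your eyes do with those markers is a very small algorithm: look back a few minutes from the spike and see what landed. Sketched in Python, with hypothetical timestamps and a 15-minute window that I picked arbitrarily:

```python
from datetime import datetime, timedelta

def deployments_before(spike_time, deployments, window=timedelta(minutes=15)):
    """Return (time, name) deployments within `window` before the spike, newest first."""
    candidates = [
        (ts, name)
        for ts, name in deployments
        if timedelta(0) <= spike_time - ts <= window
    ]
    return sorted(candidates, reverse=True)

# Hypothetical incident: latency spikes at 03:00.
spike = datetime(2024, 1, 1, 3, 0)
deploys = [
    (datetime(2024, 1, 1, 1, 0), "auth-service"),
    (datetime(2024, 1, 1, 2, 50), "api-gateway"),
    (datetime(2024, 1, 1, 3, 5), "frontend"),
]
print(deployments_before(spike, deploys))
```

The annotation does this for you visually, which is why it beats scrolling through deploy logs at 2 AM.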

Wiring Dashboards Together

This is how progressive disclosure actually works in Grafana: links between dashboards, so the one-glance view can hand you off to the detail view without you hunting through folders.

Data links

On a service panel, link straight to that service’s detail dashboard:

URL: /d/service-detail?var-service=${__field.labels.service}
Title: View ${__field.labels.service} details

Panel links

On the cluster overview, link down into namespace-specific views:

URL: /d/namespace-view?var-namespace=${namespace}
Title: Drill down to namespace

Explore links

And link out to Loki so jumping from a metric to the matching logs is one click, not a context switch:

URL: /explore?left=["now-1h","now","Loki",{"expr":"{namespace=\"${namespace}\",pod=~\"${pod}\"}"}]
Title: View logs in Explore
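
If you ever build that Explore URL outside Grafana (in a runbook generator or a chat bot, say), the JSON has to be URL-encoded. A minimal Python sketch, assuming the same legacy left=[...] array shown above; newer Grafana versions use a different Explore URL layout, so check what your version expects. The function name and defaults are my own.

```python
import json
from urllib.parse import quote, unquote

def explore_url(namespace, pod_regex, datasource="Loki", start="now-1h", end="now"):
    """Build an Explore link with a pre-filtered LogQL selector."""
    expr = f'{{namespace="{namespace}",pod=~"{pod_regex}"}}'
    left = [start, end, datasource, {"expr": expr}]
    # Compact JSON, then percent-encode it so quotes and braces survive.
    return "/explore?left=" + quote(json.dumps(left, separators=(",", ":")))

url = explore_url("prod", "api-.*")
print(url)
# Decoding it again should give back the original structure.
print(json.loads(unquote(url.split("=", 1)[1])))
```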

Keeping It Organised

Once you have more than a handful of dashboards, structure stops being optional. A bit of discipline here is the difference between finding the right dashboard in two seconds and giving up and grepping Prometheus directly.

Folder structure

├── Overview
│   ├── Platform Health
│   └── SLO Summary
├── Services
│   ├── API Gateway
│   ├── Auth Service
│   └── ...
├── Infrastructure
│   ├── Kubernetes Cluster
│   ├── Nodes
│   └── Storage
├── Debugging
│   ├── Service Debug
│   └── Incident Investigation
└── Archive
    └── (old dashboards, hidden)

Naming convention

[Category] - [Subject] - [Type]

Examples:
- Services - API Gateway - Overview
- Services - API Gateway - Debug
- Infrastructure - Kubernetes - Nodes
- SLO - Platform - 30d
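
A convention only holds if something checks it. This is a rough Python sketch of a validator you could run in CI over dashboard titles; the function is hypothetical and assumes none of the three parts contains " - " itself.

```python
def parse_dashboard_name(name):
    """Split '[Category] - [Subject] - [Type]' into its three parts."""
    parts = [part.strip() for part in name.split(" - ")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'Category - Subject - Type', got {name!r}")
    category, subject, kind = parts
    return {"category": category, "subject": subject, "type": kind}

for title in ("Services - API Gateway - Overview", "SLO - Platform - 30d"):
    print(parse_dashboard_name(title))
```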

Starring and the home dashboard

Set your “is everything OK?” dashboard as the Grafana home so it’s the first thing you see when you log in, and star the few you open daily. The faster the important ones are to reach, the less you fall back on bad habits.

Layer 3: Dashboards as Code

Clicking dashboards together in the UI is fine until someone fat-fingers a panel, or your Grafana pod gets rescheduled and you discover the dashboard lived only in its memory. The fix is the same one I apply to everything else: put it in Git and let ArgoCD deploy it. A dashboard you can’t rebuild from source is a dashboard you don’t really own.

Grafana dashboard ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: service-overview-dashboard
  labels:
    grafana_dashboard: "1"
data:
  service-overview.json: |
    {
      "title": "Service Overview",
      "uid": "service-overview",
      "panels": [
        ...
      ]
    }

Drop a ConfigMap with the grafana_dashboard: "1" label and the sidecar picks it up automatically. Your dashboard JSON now lives next to the service it monitors, reviewed in the same merge request that ships the service.
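
If your dashboards are exported as plain JSON files, wrapping them in that ConfigMap is mechanical enough to script. A minimal Python sketch using only the standard library; it emits the manifest as JSON, which kubectl accepts just like YAML. The function name is mine, and the label assumes the sidecar is configured to watch grafana_dashboard as in the example above.

```python
import json

def dashboard_configmap(name, filename, dashboard):
    """Wrap a dashboard dict in a ConfigMap the Grafana sidecar can discover."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "labels": {"grafana_dashboard": "1"},
        },
        # ConfigMap values must be strings, so the dashboard is serialised.
        "data": {filename: json.dumps(dashboard, indent=2)},
    }

# Placeholder dashboard with no panels, for illustration only.
manifest = dashboard_configmap(
    "service-overview-dashboard",
    "service-overview.json",
    {"title": "Service Overview", "uid": "service-overview", "panels": []},
)
print(json.dumps(manifest, indent=2))
```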

Grafonnet (Jsonnet library)

Hand-editing dashboard JSON gets old fast. Grafonnet lets you generate it from code, so a shared panel becomes a function you reuse instead of copy-pasted JSON you forget to keep in sync:

local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;

dashboard.new(
  'Service Overview',
  uid='service-overview',
  tags=['service', 'overview'],
)
.addPanel(
  grafana.statPanel.new(
    'Error Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target(
      'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100',
    )
  ),
  gridPos={ x: 0, y: 0, w: 6, h: 4 },
)

The Full Picture: What I Actually Run

After all the layering, here’s the whole thing in practice. For my homelab cluster I keep exactly five dashboards:

Home - everything OK? Answered in one glance.
SLOs - am I meeting my objectives, and how’s the error budget?
Cluster - Kubernetes health from the top down.
Service Debug - the variable-driven deep dive for when something’s wrong.
Incident - the messy one that pulls metrics, logs and traces together.

That’s the lot. I actively resist making a sixth. Every one of these gets opened weekly or more, which is the only test that matters. The day a dashboard goes a month without being opened, I delete it, and so far I’ve never missed one.

Compare that to the 47-dashboard graveyard from the top of this post. The difference isn’t that I’m collecting less data; my observability stack hoovers up plenty. The difference is that five dashboards I trust beat fifty I’ve trained myself to ignore.

Fighting Dashboard Rot

Five dashboards stay at five only because I keep pruning. Left alone, any Grafana drifts back toward archaeology. A little process keeps it honest.

Review on a schedule

Once a month I go through everything and ask three things: which dashboards got opened, are the queries still pointing at services that exist, and do the thresholds still match how the system actually behaves. Metrics get renamed, SLOs get tightened, and a threshold that made sense a year ago can quietly stop meaning anything.

Sunset what nobody uses

The rule is simple and unsentimental. Not opened in 30 days, it moves to the Archive folder. Sixty days in Archive, it’s gone. Git has the history if I’m ever wrong.
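
The rule is mechanical enough to write down as code. A sketch in Python, assuming you can get a last-opened date per dashboard from somewhere (usage stats, access logs) and that you record when something was archived; both inputs and the function name are assumptions of mine.

```python
from datetime import date

def sunset_action(last_opened, archived_on, today):
    """Apply the 30-days-then-archive, 60-days-then-delete rule."""
    if archived_on is not None:
        if (today - archived_on).days >= 60:
            return "delete"
        return "keep archived"
    if (today - last_opened).days >= 30:
        return "archive"
    return "keep"

today = date(2024, 6, 1)
print(sunset_action(date(2024, 5, 25), None, today))
print(sunset_action(date(2024, 4, 1), None, today))
print(sunset_action(date(2024, 1, 1), date(2024, 3, 1), today))
```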

Write down what each one is for

Every dashboard gets a short description so the next person (often future me, who has forgotten everything) knows why it exists:

Purpose: Quick health check for production services
Audience: On-call engineers
When to use: Daily check, incident first response
Links to: Service Debug dashboard for details

Why I Bother

A dashboard is a communication tool. A good one tells you something true about your system fast enough that you act on it. A bad one is noise, and noise teaches people to ignore the monitoring entirely, which is worse than having none because you think you’re covered.

That’s the thread back to the start. Collecting metrics is easy and Grafana makes drawing graphs easy, so the bottleneck was never data. The hard part is understanding, and a dashboard nobody can read leaves you exactly as blind as no dashboard at all, just with more panels on the screen. Build the few that earn their place, keep them in Git, prune the rest, and the next time something breaks at 2 AM you’ll be glad you can read the answer in five seconds.
