SRE Dashboard
What Is an SRE Dashboard?
An SRE dashboard provides real-time visibility into service reliability through SLO achievement, error budget consumption, latency, and incident metrics.
Typical Dashboard Panels
| Panel | Metric | Purpose |
|-------|--------|---------|
| SLO Achievement | % of successful requests | Is the service meeting its target? |
| Error Budget | Remaining error budget | How much downtime is allowed? |
| Latency (P95) | Milliseconds | How fast is the service? |
| Request Rate | Requests per second | Traffic volume |
| Incident Count | Number of active incidents | Current outage status |
PromQL Queries for SRE Dashboards
SLO Achievement (Last 30 Days)
(
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
) * 100
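To make the ratio concrete, here is a minimal Python sketch of the same arithmetic on raw counts. The function name and the request totals are hypothetical, chosen only for illustration; the query itself works on per-second rates, but the ratio comes out the same.

```python
def slo_achievement(total_requests: int, failed_requests: int) -> float:
    """Return the percentage of requests that were not 5xx failures."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    return (total_requests - failed_requests) / total_requests * 100


# Hypothetical 30-day totals: 2,000,000 requests, 1,500 of them 5xx.
achieved = slo_achievement(2_000_000, 1_500)
print(f"SLO achievement: {achieved:.3f}%")
```

With these numbers the service sits slightly above a 99.9% target.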
Error Budget Consumption
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
)
) * 43200 # 43200 = minutes in 30 days (30 × 1440); a 99.9% SLO allows 43.2 of them
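The query above scales the failure ratio to minutes of the 30-day window. The following sketch works through the same arithmetic, assuming a 99.9% SLO and a hypothetical observed success ratio; the function names are illustrative. It treats failures as if traffic were uniform, so the result is an equivalent number of fully failed minutes rather than measured downtime.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43200


def error_budget_minutes(slo_target: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Total allowed 'bad' minutes for a given SLO target (e.g. 0.999)."""
    return (1 - slo_target) * window_minutes


def consumed_minutes(success_ratio: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Failure ratio scaled to minutes, mirroring the PromQL expression."""
    return (1 - success_ratio) * window_minutes


budget = error_budget_minutes(0.999)   # about 43.2 minutes
consumed = consumed_minutes(0.99925)   # hypothetical success ratio, about 32.4 minutes
remaining = budget - consumed          # about 10.8 minutes

print(f"budget={budget:.1f} consumed={consumed:.1f} remaining={remaining:.1f}")
```

A negative `remaining` value would mean the budget for the window is spent.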
P95 Latency Trend
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
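`histogram_quantile` estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below approximates that idea in plain Python; it is a simplified illustration, not Prometheus's implementation, and the bucket boundaries and counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation within the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for request durations in seconds.
example_buckets = [(0.1, 800), (0.25, 900), (0.5, 960), (1.0, 990), (math.inf, 1000)]
p95 = histogram_quantile(0.95, example_buckets)
print(f"estimated P95: {p95:.3f}s")
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.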
Grafana Dashboard JSON
{
"title": "SRE Overview",
"panels": [
{
"title": "SLO Achievement",
"type": "gauge",
"targets": [{
"expr": "(sum(rate(http_requests_total{status!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d]))) * 100",
"legendFormat": "SLO %"
}],
"thresholds": [
{ "value": 99.9, "color": "green" },
{ "value": 99.0, "color": "yellow" },
{ "value": 0, "color": "red" }
]
},
{
"title": "Error Budget Remaining",
"type": "stat",
"targets": [{
"expr": "43.2 - (1 - (sum(rate(http_requests_total{status!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d])))) * 43200"
}]
}
]
}
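JSON strings cannot contain raw line breaks, and the quotes inside a PromQL label matcher must be escaped. One way to avoid hand-escaping is to build the dashboard as a Python dictionary and serialize it. This sketch mirrors the simplified structure used in this section, not a full Grafana export, and assumes a 99.9% SLO (a 43.2-minute budget over 30 days); the variable names are illustrative.

```python
import json

# Shared success-ratio expression, kept on a single line.
SUCCESS_RATIO = (
    'sum(rate(http_requests_total{status!~"5.."}[30d]))'
    " / sum(rate(http_requests_total[30d]))"
)

dashboard = {
    "title": "SRE Overview",
    "panels": [
        {
            "title": "SLO Achievement",
            "type": "gauge",
            "targets": [
                {"expr": f"({SUCCESS_RATIO}) * 100", "legendFormat": "SLO %"}
            ],
        },
        {
            "title": "Error Budget Remaining",
            "type": "stat",
            # 43.2 = 0.1% of the 43200 minutes in 30 days (assumed 99.9% SLO).
            "targets": [
                {"expr": f"43.2 - (1 - ({SUCCESS_RATIO})) * 43200"}
            ],
        },
    ],
}

# json.dumps escapes the inner double quotes and guarantees valid JSON.
dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

The printed text can be pasted wherever the dashboard JSON is expected, and it round-trips through `json.loads` unchanged.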
Monitoring Key Metrics
The Four Golden Signals
| Signal | What It Measures | Alert Threshold |
|--------|------------------|-----------------|
| Latency | Time to serve a request | P95 > 500ms |
| Traffic | Requests per second | > 2× baseline |
| Errors | Failed request rate | > 1% of total |
| Saturation | Resource utilization | CPU/DB > 80% |
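As a rough illustration of how the thresholds in the table combine, the sketch below checks a single snapshot of measurements against all four. The `Snapshot` class, its field names, and the sample values are hypothetical; in practice these checks would live in alert rules rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    p95_latency_seconds: float
    requests_per_second: float
    baseline_rps: float
    error_ratio: float   # failed requests / total requests
    utilization: float   # 0.0 to 1.0, e.g. CPU or DB utilization


def breached_signals(s: Snapshot) -> list[str]:
    """Return the golden signals whose table threshold is exceeded."""
    checks = {
        "latency": s.p95_latency_seconds > 0.5,                # P95 > 500ms
        "traffic": s.requests_per_second > 2 * s.baseline_rps,  # > 2x baseline
        "errors": s.error_ratio > 0.01,                         # > 1% of total
        "saturation": s.utilization > 0.80,                     # > 80%
    }
    return [name for name, breached in checks.items() if breached]


snapshot = Snapshot(
    p95_latency_seconds=0.62,
    requests_per_second=150,
    baseline_rps=100,
    error_ratio=0.004,
    utilization=0.85,
)
print(breached_signals(snapshot))
```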
Alert Rules
groups:
  - name: sre-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeded 1%"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 500ms"
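The `for: 5m` clause means a rule does not fire the moment its expression becomes true. The alert first goes pending and fires only once the condition has held at every evaluation for the full duration. The following sketch simulates that behavior on a hypothetical series of error-rate readings taken every 60 seconds; it is a simplified model for intuition, not Prometheus's rule evaluator.

```python
def alert_states(evaluations: list[tuple[int, bool]], for_seconds: int) -> list[str]:
    """Map (timestamp_seconds, condition_true) pairs to alert states."""
    states = []
    active_since = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Hypothetical error ratios sampled once per minute; threshold is 1%.
error_ratios = [0.02, 0.02, 0.005, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
evaluations = [(i * 60, ratio > 0.01) for i, ratio in enumerate(error_ratios)]
states = alert_states(evaluations, for_seconds=300)
print(states)
```

In this run the brief recovery at the third reading resets the timer, so the alert fires only at the last evaluation, five minutes after the condition became true again.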
Summary
SRE dashboards in Grafana provide real-time SLO tracking, error budget monitoring, and latency visualization. Use PromQL queries and alert rules to maintain service reliability.
Key takeaways:
- SRE dashboard: SLO %, error budget, P95 latency, request rate, incidents
- PromQL queries calculate SLO, error budget consumption, and latency
- Four golden signals: latency, traffic, errors, saturation
- Alert rules trigger on error rate > 1% or P95 latency > 500ms
- Grafana thresholds: green (>= SLO), yellow (warning), red (critical)
You've completed this course! You now have a complete SRE foundation.
What's Next: Chaos Engineering
The next chapter covers chaos engineering: testing system resilience through controlled failures.
Dashboard Implementation
Step-by-Step Setup
- Install Prometheus and Grafana
# Using Docker Compose
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
- Configure Prometheus to scrape your app
# prometheus.yml
scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:8080']
- Import SRE Dashboard in Grafana
  - Open Grafana at http://localhost:3000
  - Click + → Import
  - Paste the dashboard JSON
  - Select Prometheus data source
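The scrape configuration in the second step assumes the application at `myapp:8080` serves its metrics in the Prometheus text exposition format, by default at the `/metrics` path. As a rough sketch of what that payload looks like for `http_requests_total`, the helper below renders a counter by hand. The function name, help text, and sample values are hypothetical, and a real service would normally use an official Prometheus client library instead.

```python
def render_counter(name: str, help_text: str, samples: list[tuple[dict, float]]) -> str:
    """Render a counter in the Prometheus text exposition format.

    Label values are assumed to need no escaping in this simplified sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts, split by HTTP status code.
payload = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"status": "200"}, 1998500), ({"status": "500"}, 1500)],
)
print(payload, end="")
```

The `status` label is what lets the queries in this section separate 5xx responses from the rest.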
Custom Alert Rules
# prometheus-alerts.yml
# Assumes a 99.9% SLO over 30 days: budget = 0.1% of 43200 minutes = 43.2 minutes.
groups:
  - name: sre
    rules:
      - alert: BudgetExhausted
        expr: |
          43.2 - (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[30d]))
              /
              sum(rate(http_requests_total[30d]))
            )
          ) * 43200 < 0
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Error budget exhausted"
      - alert: BudgetWarning
        expr: |
          43.2 - (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[30d]))
              /
              sum(rate(http_requests_total[30d]))
            )
          ) * 43200 < 4.32 # < 10% of the budget remaining
        for: 6h
        labels:
          severity: warning
Dashboard Best Practices
| Practice | Why |
|----------|-----|
| Focus on SLO, not all metrics | Avoid dashboard overload |
| Use red/yellow/green thresholds | Immediate visual understanding |
| Add alert annotations | See incidents on timeline |
| Combine multiple data sources | Full picture of system health |
| Keep it simple | Too many panels = confusion |
| Share with the team | Everyone sees the same reality |