SRE Dashboard

What Is an SRE Dashboard?

An SRE dashboard provides real-time visibility into service reliability through SLO achievement, error budget consumption, latency, and incident metrics.

Typical Dashboard Panels

| Panel | Metric | Purpose |
|-------|--------|---------|
| SLO Achievement | % of successful requests | Is the service meeting its target? |
| Error Budget | Remaining error budget | How much downtime is allowed? |
| Latency (P95) | Milliseconds | How fast is the service? |
| Request Rate | Requests per second | Traffic volume |
| Incident Count | Number of active incidents | Current outage status |

PromQL Queries for SRE Dashboards

SLO Achievement (Last 30 Days)

(
  sum(rate(http_requests_total{status!~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) * 100

Error Budget Consumption

(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
  )
) * 43200  # 43200 = minutes in 30 days (30 × 1440); result = budget consumed, in minutes (a 99.9% SLO allows 43.2)
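The arithmetic behind this query can be checked by hand. The sketch below assumes a 99.9% target and a 30-day window; the function names are illustrative and not part of any library. It treats the failed-request ratio as if it were a share of the window's minutes, which is the same approximation the query makes.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43200


def error_budget_minutes(slo_target: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Total 'bad' minutes the SLO tolerates over the window (e.g. slo_target=0.999)."""
    return (1 - slo_target) * window_minutes


def consumed_minutes(success_ratio: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Mirror of the PromQL expression: (1 - success ratio) * window minutes."""
    return (1 - success_ratio) * window_minutes


def remaining_fraction(slo_target: float, success_ratio: float) -> float:
    """Share of the budget still unspent; negative once the budget is exhausted."""
    budget = error_budget_minutes(slo_target)
    return (budget - consumed_minutes(success_ratio)) / budget


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 2))        # budget for a 99.9% SLO
    print(round(consumed_minutes(0.9995), 2))           # consumed at 99.95% success
    print(round(remaining_fraction(0.999, 0.9995), 2))  # share of budget left
```

With a 99.9% target the budget is about 43.2 minutes, so a measured success ratio of 99.95% has used roughly half of it.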

P95 Latency Trend

histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
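To make the P95 query less of a black box, here is a simplified sketch of the estimate histogram_quantile produces from cumulative buckets: it finds the bucket containing the target rank and interpolates linearly inside it. This is an illustration of the idea, assuming 0 < q <= 1 and a lowest bucket that starts at 0; the bucket values in the usage note are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the (math.inf, total_count) bucket, like le="+Inf".
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: fall back to the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation between the bucket's lower and upper bound.
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return math.nan
```

For example, with buckets `[(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]`, the 0.95 quantile lands at the top of the 0.25 to 0.5 second bucket. Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.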

Grafana Dashboard JSON

{
  "title": "SRE Overview",
  "panels": [
    {
      "title": "SLO Achievement",
      "type": "gauge",
      "targets": [{
        "expr": "(sum(rate(http_requests_total{status!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d]))) * 100",
        "legendFormat": "SLO %"
      }],
      "thresholds": [
        { "value": 99.9, "color": "green" },
        { "value": 99.0, "color": "yellow" },
        { "value": 0, "color": "red" }
      ]
    },
    {
      "title": "Error Budget Remaining",
      "type": "stat",
      "targets": [{
        "expr": "43.2 - (1 - (sum(rate(http_requests_total{status!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d])))) * 43200"
      }]
    }
  ]
}
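The gauge thresholds above can be read as a simple lookup. The sketch below assumes each threshold applies from its value upward, so the highest threshold at or below the reading wins; the function name is hypothetical and this is not Grafana's own code.

```python
THRESHOLDS = [
    {"value": 99.9, "color": "green"},
    {"value": 99.0, "color": "yellow"},
    {"value": 0, "color": "red"},
]


def threshold_color(reading, steps=THRESHOLDS):
    """Return the color of the highest step whose value is <= reading."""
    chosen = None
    for step in sorted(steps, key=lambda s: s["value"]):
        if reading >= step["value"]:
            chosen = step["color"]
    return chosen


if __name__ == "__main__":
    for reading in (99.95, 99.5, 97.0):
        print(reading, threshold_color(reading))
```

Under that assumption, 99.95 maps to green, 99.5 to yellow, and 97.0 to red.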

Monitoring Key Metrics

The Four Golden Signals

| Signal | What It Measures | Alert Threshold |
|--------|-----------------|----------------|
| Latency | Time to serve a request | P95 > 500ms |
| Traffic | Requests per second | > 2× baseline |
| Errors | Failed request rate | > 1% of total |
| Saturation | Resource utilization | CPU/DB > 80% |

Alert Rules

groups:
  - name: sre-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeded 1%"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 500ms"
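The `for: 5m` clause means the expression must stay true across evaluations for five minutes before the alert fires; until then it is pending, and a single false evaluation resets it. The sketch below is a simplified model of that behavior, with made-up sample values and a hypothetical function name.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each evaluation.

    samples: list of (timestamp_seconds, value) in time order.
    """
    states = []
    active_since = None  # timestamp when the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # condition cleared: start over
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Error ratio sampled once per minute; a 1% threshold held for 5 minutes.
    ratios = [0.005, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.005]
    samples = [(i * 60, r) for i, r in enumerate(ratios)]
    print(alert_states(samples, threshold=0.01, for_seconds=300))
```

In this run the condition first holds at 60 seconds, stays pending through 300 seconds, fires at 360 seconds, and returns to inactive once the ratio drops.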

Summary

SRE dashboards in Grafana provide real-time SLO tracking, error budget monitoring, and latency visualization. Use PromQL queries and alert rules to maintain service reliability.

Key takeaways:

SRE dashboard: SLO %, error budget, P95 latency, request rate, incidents
PromQL queries calculate SLO, error budget consumption, and latency
Four golden signals: latency, traffic, errors, saturation
Alert rules trigger on error rate > 1% or P95 latency > 500ms
Grafana thresholds: green (>= SLO), yellow (warning), red (critical)

You've completed this course! You now have a complete SRE foundation.

What's Next: Chaos Engineering

The next chapter covers chaos engineering — testing system resilience through controlled failures.

Dashboard Implementation

Step-by-Step Setup

Install Prometheus and Grafana

# Using Docker Compose
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      # Demo credentials only; change before exposing Grafana beyond localhost
      - GF_SECURITY_ADMIN_PASSWORD=admin

Configure Prometheus to scrape your app

# prometheus.yml
scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:8080']
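For this scrape config to return data, the app at myapp:8080 has to serve metrics over HTTP, by default at the /metrics path. The sketch below shows roughly what that endpoint returns, using only the Python standard library. The counter values and the `serve` helper are illustrative assumptions; a real service would more likely use an official client library such as prometheus_client.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = Counter()  # status code (as a string) -> number of requests served


def render_metrics(counts):
    """Render request counts in the Prometheus text exposition format."""
    lines = ["# TYPE http_requests_total counter"]
    for status in sorted(counts):
        lines.append(f'http_requests_total{{status="{status}"}} {counts[status]}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the demo endpoint (blocks until interrupted)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` after incrementing `REQUESTS` (for example `REQUESTS["200"] += 1` per handled request) gives Prometheus a `http_requests_total` series with the `status` label that the queries in this chapter rely on.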

Import SRE Dashboard in Grafana

Open Grafana at http://localhost:3000
Click + → Import
Paste the dashboard JSON
Select Prometheus data source

Custom Alert Rules

# prometheus-alerts.yml
groups:
  - name: sre
    rules:
      - alert: BudgetExhausted
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[30d]))
              /
              sum(rate(http_requests_total[30d]))
            )
          ) * 43200 > 43.2  # consumed minutes exceed the 43.2-minute budget of a 99.9% SLO
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Error budget exhausted"

      - alert: BudgetWarning
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[30d]))
              /
              sum(rate(http_requests_total[30d]))
            )
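          ) * 43200 > 38.88  # over 90% of the 43.2-minute budget consumed (< 10% remaining)
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% of error budget remaining"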

Dashboard Best Practices

| Practice | Why |
|----------|-----|
| Focus on SLO, not all metrics | Avoid dashboard overload |
| Use red/yellow/green thresholds | Immediate visual understanding |
| Add alert annotations | See incidents on timeline |
| Combine multiple data sources | Full picture of system health |
| Keep it simple | Too many panels = confusion |
| Share with the team | Everyone sees the same reality |