SRE Dashboard & Runbooks

🔥 Vibe Prompt

"Create an SRE dashboard: SLO compliance, burn rate, on-call status. Automate runbooks for common incidents."

SRE Dashboard Panels

SLO Compliance (30d window)
├── Availability: 99.92% (SLO: 99.9%) ✅
├── Latency p99: 320ms (SLO: 500ms) ✅
└── Error Budget Remaining: 20% ✅

Burn Rate (1h window)
├── Availability: 0.02% budget burned
└── Alert: green (normal)

Incident Summary
├── Last 24h: 0 SEV-1, 1 SEV-2, 3 SEV-3
└── MTTR (30d): 28 minutes

On-Call
├── Primary: Alice (until Mon 9am)
└── Secondary: Bob
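
The compliance and burn-rate panels reduce to two ratios. The following is a minimal sketch of that arithmetic, assuming a request-based availability SLI. The function names and the request counts are illustrative and do not come from any particular monitoring tool.

```python
def error_budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


def burn_rate(error_rate: float, slo: float) -> float:
    """Speed of budget consumption: 1.0 spends exactly the whole budget
    over the SLO window, and values above 1.0 exhaust it early."""
    return error_rate / (1.0 - slo)


if __name__ == "__main__":
    # Hypothetical 30-day totals: 99.92% availability against a 99.9% SLO
    remaining = error_budget_remaining(good=999_200, total=1_000_000, slo=0.999)
    print(f"budget remaining: {remaining:.0%}")
    # Hypothetical 1-hour window with a 0.05% error rate
    print(f"burn rate: {burn_rate(0.0005, 0.999):.2f}")
```

A burn rate below 1.0 means the service would finish the window with budget to spare, which is the condition a "green" alert state typically represents.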

Automated Runbook

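# Automated CPU spike runbook
import subprocess

import requests

NAMESPACE = "production"


def cpu_millicores(value):
    # "250m" means 250 millicores; a bare number is treated as whole cores, defensively
    return float(value[:-1]) if value.endswith("m") else float(value) * 1000


def cpu_runbook():
    # 1. Identify the culprit: the pod with the highest CPU usage
    top_output = subprocess.run(["kubectl", "top", "pods", "-n", NAMESPACE], capture_output=True, text=True, check=True)
    # Skip the header row and any blank lines
    lines = [line for line in top_output.stdout.strip().split("\n")[1:] if line.strip()]
    if not lines:
        print("No pod metrics returned; nothing to do")
        return
    culprit = max(lines, key=lambda line: cpu_millicores(line.split()[1]))
    pod_name = culprit.split()[0]

    # 2. Get recent logs from that pod
    logs = subprocess.run(["kubectl", "logs", "--tail=100", pod_name, "-n", NAMESPACE], capture_output=True, text=True)

    # 3. Match known patterns to pick the recommended follow-up
    if "OOM" in logs.stdout:
        action = "Increase memory limit"
    elif "connection refused" in logs.stdout:
        action = "Restart dependent service"
    else:
        action = "Scale replicas + investigate"

    # 4. Mitigate by scaling out the owning Deployment.
    #    Deployment-managed pods are named <deployment>-<replicaset-hash>-<pod-hash>,
    #    so drop the last two segments to recover the Deployment name.
    deployment = pod_name.rsplit("-", 2)[0]
    subprocess.run(["kubectl", "scale", "deploy", deployment, "--replicas=10", "-n", NAMESPACE], check=True)

    # 5. Post the result and the recommended follow-up to Slack
    slack_msg = {"text": f"🚨 CPU Runbook: {pod_name}\nAction: {action}"}
    requests.post("https://hooks.slack.com/services/...", json=slack_msg, timeout=10)

    print(f"Runbook executed: {action}")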

Runbook Automation Levels

| Level | Description | Example |
|-------|-------------|---------|
| L1 | Manual (read docs) | Wiki page |
| L2 | Semi-automated (click button) | Jenkins job |
| L3 | Full auto (no human) | Auto-scaling |
| L4 | Predictive (prevent) | Load forecasting |
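
One way to make the step from L2 to L3 concrete is an approval gate: the same remediation runs either after a human confirms it or with no human involved. This is a rough sketch only; `AutomationLevel`, `run_step`, and the `confirm` callback are hypothetical names introduced for illustration.

```python
from enum import IntEnum
from typing import Callable


class AutomationLevel(IntEnum):
    MANUAL = 1          # L1: a human reads the docs and acts
    SEMI_AUTOMATED = 2  # L2: a human approves, the tool acts
    FULL_AUTO = 3       # L3: the tool acts without a human
    PREDICTIVE = 4      # L4: the tool acts before the incident happens


def run_step(
    action: Callable[[], str],
    level: AutomationLevel,
    confirm: Callable[[], bool] = lambda: False,
) -> str:
    """Run `action` only if the automation level (and, at L2, a human) allows it."""
    if level == AutomationLevel.MANUAL:
        return "skipped: follow the written runbook"
    if level == AutomationLevel.SEMI_AUTOMATED and not confirm():
        return "skipped: operator declined"
    return action()


if __name__ == "__main__":
    scale_up = lambda: "scaled up"
    # L2: runs because the (simulated) operator approved
    print(run_step(scale_up, AutomationLevel.SEMI_AUTOMATED, confirm=lambda: True))
    # L3: runs with no confirmation at all
    print(run_step(scale_up, AutomationLevel.FULL_AUTO))
```

Keeping the gate separate from the action means a runbook can be promoted from L2 to L3 by changing one argument once the team trusts it.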

Common Runbooks

| Incident | Runbook |
|----------|---------|
| CPU spike | Scale up, check for leak |
| Memory leak | Restart, increase limit, fix code |
| DB slow | Check slow queries, add index |
| Certificate expiry | Auto-renew (cert-manager) |
| Disk full | Clean logs, increase PV |
| Pod crash loop | Check logs, rollback version |

SRE Course Complete! 🎉

  • ✅ SLO & SLI
  • ✅ Incident Response
  • ✅ Capacity Planning
  • ✅ Chaos Engineering
  • ✅ Dashboard & Runbooks

DevOps Track Complete! 🎉

  • ✅ Docker Compose
  • ✅ Kubernetes & Helm
  • ✅ Cloud AWS
  • ✅ Serverless
  • ✅ Monitoring
  • ✅ GitOps
  • ✅ SRE

What Are Runbooks?

Runbooks are documented procedures for operating and troubleshooting systems. They ensure consistent, repeatable responses to common scenarios.

Runbook Structure

| Section | What It Contains |
|---------|-----------------|
| Title | Clear, searchable name |
| Symptoms | What the user/monitoring sees |
| Severity | Impact assessment (SEV-1 to SEV-4) |
| Pre-checks | Quick health checks before diving in |
| Diagnosis | Steps to identify root cause |
| Resolution | Step-by-step fix instructions |
| Verification | How to confirm the fix worked |
| Escalation | Who to contact if unresolved |
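
The same structure can be captured as a small data type, which makes it possible to check that a draft runbook has every section filled in. This is a sketch under assumptions: the `Runbook` class and its `missing_sections` helper are hypothetical, not part of any runbook tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    title: str
    symptoms: List[str]
    severity: str  # "SEV-1" through "SEV-4"
    pre_checks: List[str] = field(default_factory=list)
    diagnosis: List[str] = field(default_factory=list)
    resolution: List[str] = field(default_factory=list)
    verification: List[str] = field(default_factory=list)
    escalation: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of the list-valued sections that are still empty."""
        return [
            name
            for name, value in vars(self).items()
            if isinstance(value, list) and not value
        ]


if __name__ == "__main__":
    draft = Runbook(
        title="High CPU Utilization",
        symptoms=['PagerDuty alert: "CPU > 80% for 5 minutes"'],
        severity="SEV-2",
        diagnosis=["kubectl top pods -n production"],
    )
    print(draft.missing_sections())
```

A check like this could run in CI against runbooks kept in version control, so an incomplete runbook is caught before it is needed during an incident.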

Example Runbooks

High CPU Alert Runbook

# Runbook: High CPU Utilization

## Symptoms
- PagerDuty alert: "CPU > 80% for 5 minutes"
- Users report slow page loads

## Severity: SEV-2

## Pre-checks
1. Check if there was a recent deployment
2. Check if traffic spiked (holiday, promotion)
3. Check if dependent services are healthy

## Diagnosis
```bash
# Check top CPU consumers
top -b -n 1 | head -20

# Check specific process
ps aux --sort=-%cpu | head -10

# Check container resource usage
docker stats --no-stream

# Check Kubernetes pod resource usage
kubectl top pods -n production

# Check recent application logs
journalctl -u myapp -n 50 --no-pager
```

Resolution

Option A: Scale up

kubectl scale deployment myapp --replicas=10 -n production

Option B: Restart the service

kubectl rollout restart deployment myapp -n production

Option C: Identify and fix the code issue

  1. Check recent deployment for code changes
  2. Roll back if a recent change caused the issue:

kubectl rollout undo deployment myapp -n production

Verification

# CPU should drop below threshold within 2 minutes
top -b -n 1 | head -5

# Check alert clears in PagerDuty/Prometheus
# (quote the URL so the shell does not interpret "?" as a glob character)
curl -s 'http://prometheus:9090/api/v1/query?query=...'

Escalation

If unresolved after 15 minutes, escalate to:

  • Senior SRE: @sre-lead
  • Engineering Manager: @eng-manager
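
The verification and escalation steps above amount to "poll until healthy, otherwise hand off". Here is a hedged sketch of that loop. `wait_for_recovery` and its parameters are hypothetical, and `read_cpu_percent` stands in for whatever query the monitoring system actually provides.

```python
import time
from typing import Callable


def wait_for_recovery(
    read_cpu_percent: Callable[[], float],
    threshold: float = 80.0,
    timeout_s: float = 120.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until CPU drops below `threshold`; return False if `timeout_s` passes first."""
    deadline = clock() + timeout_s
    while True:
        if read_cpu_percent() < threshold:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Stand-in readings; a real reader would query the monitoring system
    readings = iter([95.0, 88.0, 72.0])
    recovered = wait_for_recovery(lambda: next(readings), interval_s=0.0)
    print("resolved" if recovered else "escalate to @sre-lead")
```

Passing `sleep` and `clock` as parameters keeps the loop testable without waiting in real time; the 80% threshold and 2-minute timeout mirror the values in the runbook text.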

Database Connection Exhaustion Runbook

# Runbook: Database Connection Pool Exhaustion

## Symptoms
- Application errors: "could not acquire connection from pool"
- Increased query latency
- Intermittent 5xx errors

## Severity: SEV-1

## Diagnosis
```bash
# Check active connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Check max connections
psql -c "SHOW max_connections;"

# Check which queries are running long
psql -c "SELECT pid, now() - query_start AS duration, query, state
         FROM pg_stat_activity
         WHERE state = 'active'
         ORDER BY duration DESC
         LIMIT 10;"

# Check application connection pool status
curl -s http://myapp:8080/health/db | jq .
```

Resolution

Immediate (Stop the bleeding)

-- Terminate sessions that have sat idle in a transaction for more than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - query_start > interval '5 minutes';

Temporary (Increase pool)

# Increase max_connections (a static parameter: it takes effect only after a restart)
# Edit postgresql.conf and restart
# OR use an RDS parameter group, then reboot the instance
aws rds modify-db-parameter-group \
  --db-parameter-group-name myapp-pg \
  --parameters "ParameterName=max_connections,ParameterValue=300,ApplyMethod=pending-reboot"

Permanent (Fix the app)

  1. Check for connection leaks in the code
  2. Ensure connections are returned to pool after use
  3. Add connection pool monitoring (HikariCP, PgBouncer)
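
The second item, returning connections to the pool after use, is the usual cure for leaks. The sketch below shows the pattern with a toy pool built on the standard library. `ConnectionPool` is a hypothetical class for illustration and is not the API of HikariCP, PgBouncer, or any database driver.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Toy fixed-size pool; `factory` stands in for a real database connect call."""

    def __init__(self, factory, size: int):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout: float = 1.0):
        # Raises queue.Empty if the pool stays exhausted for `timeout` seconds
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Runs even when the caller raises, so the connection is never leaked
            self._idle.put(conn)

    def idle_count(self) -> int:
        """Connections currently available; a natural value to export to monitoring."""
        return self._idle.qsize()


if __name__ == "__main__":
    pool = ConnectionPool(factory=object, size=2)
    try:
        with pool.connection() as conn:
            raise RuntimeError("query failed")
    except RuntimeError:
        pass
    print(pool.idle_count())  # both connections are back in the pool
```

Because the release happens in `finally`, an exception in the query path cannot strand a connection, which is the failure mode behind "could not acquire connection from pool".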

Runbook Automation

```bash
#!/bin/bash
# auto-runbook.sh: execute runbook steps automatically

ALERT_NAME="${1:-}"

case "$ALERT_NAME" in
  "HighCPU")
    echo "=== Auto-Runbook: High CPU ==="
    echo "[1] Checking top consumers..."
    ps aux --sort=-%cpu | head -5

    echo "[2] Checking recent deployment..."
    kubectl rollout history deployment/myapp -n production

    echo "[3] Scaling up temporarily..."
    kubectl scale deployment/myapp --replicas=5 -n production
    ;;

  "DBConnectionExhausted")
    echo "=== Auto-Runbook: DB Connection Exhaustion ==="
    echo "[1] Killing idle transactions..."
    psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle in transaction' AND now() - query_start > interval '5 minutes';"

    echo "[2] Notifying team..."
    curl -X POST -H 'Content-Type: application/json' \
      -d '{"text": "⚠️ DB connections exhausted: auto-remediation applied"}' \
      "$SLACK_WEBHOOK"
    ;;

  *)
    # Unknown or missing alert name: fail loudly instead of doing nothing
    echo "No automated runbook for alert: ${ALERT_NAME:-<none>}" >&2
    exit 1
    ;;
esac
```

Summary

Runbooks provide step-by-step procedures for diagnosing and resolving common incidents. Automating runbook steps reduces MTTR (Mean Time to Resolution).

Key takeaways:

  • Runbook sections: symptoms → severity → pre-checks → diagnosis → resolution → verification → escalation
  • High CPU: check processes → scale up or roll back recent deploy
  • DB exhaustion: kill idle transactions → increase pool → fix connection leak
  • Automate runbook steps with shell scripts triggered by alerts
  • Always include verification steps to confirm resolution
  • Escalation path ensures unresolved incidents get senior attention
  • Keep runbooks in version control (Git) for continuous improvement

You've completed this course! You now have a complete SRE foundation.
