SRE Dashboard & Runbooks

🔥 Vibe Prompt

"Create an SRE dashboard: SLO compliance, burn rate, on-call status. Automate runbooks for common incidents."

SRE Dashboard Panels

SLO Compliance (30d window)
├── Availability: 99.92% (SLO: 99.9%) ✅
├── Latency p99: 320ms (SLO: 500ms) ✅
└── Error Budget Remaining: 20% ✅

Burn Rate (1h window)
├── Availability: 0.02% budget burned
└── Alert: green (normal)
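
As a rough sketch of the arithmetic behind the SLO and burn-rate panels, the snippet below derives the remaining error budget and a burn rate from measured rates. The function names are hypothetical, and it assumes the common definition of burn rate as the observed error rate divided by the error rate the SLO allows.

```python
def error_budget_remaining(availability: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_errors = 1.0 - slo            # e.g. 0.001 for a 99.9% SLO
    actual_errors = 1.0 - availability
    return 1.0 - actual_errors / allowed_errors


def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_rate / (1.0 - slo)


if __name__ == "__main__":
    # Illustrative inputs, not taken from a real system.
    print(f"{error_budget_remaining(0.9995, 0.999):.0%} of budget remaining")
    print(f"burn rate: {burn_rate(0.002, 0.999):.1f}x")
```

A burn rate above 1.0 sustained over the whole SLO window would exhaust the budget before the window ends, which is why short-window burn rate is a common alerting signal.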

Incident Summary
├── Last 24h: 0 SEV-1, 1 SEV-2, 3 SEV-3
└── MTTR (30d): 28 minutes
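
The MTTR figure can be computed from incident timestamps. This is a minimal sketch, assuming each incident is recorded as a (detected, resolved) pair; the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average (resolved - detected) over a list of resolved incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incident timestamps, for illustration only.
    sample = [
        (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 20)),
        (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 40)),
    ]
    print(mean_time_to_resolve(sample))
```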

On-Call
├── Primary: Alice (until Mon 9am)
└── Secondary: Bob

Automated Runbook

# Automated CPU spike runbook
import subprocess

import requests

NAMESPACE = "production"

def cpu_runbook():
    # 1. Identify the culprit: the pod using the most CPU
    top_output = subprocess.run(
        ["kubectl", "top", "pods", "-n", NAMESPACE, "--no-headers"],
        capture_output=True, text=True, check=True,
    )
    lines = top_output.stdout.strip().splitlines()
    if not lines:
        print("No pod metrics returned; nothing to do")
        return
    # Second column is CPU in millicores, e.g. "250m"
    culprit = max(lines, key=lambda l: float(l.split()[1].rstrip("m")))
    pod_name = culprit.split()[0]

    # 2. Get recent logs
    logs = subprocess.run(
        ["kubectl", "logs", "--tail=100", pod_name, "-n", NAMESPACE],
        capture_output=True, text=True,
    )

    # 3. Match known log patterns to a recommended follow-up
    if "OOM" in logs.stdout:
        action = "Increase memory limit"
    elif "connection refused" in logs.stdout:
        action = "Restart dependent service"
    else:
        action = "Scale replicas + investigate"

    # 4. Mitigate now by scaling out the owning Deployment.
    # Deployment pods are named <deployment>-<replicaset-hash>-<pod-hash>,
    # so strip the last two segments to get the Deployment name.
    deployment = pod_name.rsplit("-", 2)[0]
    subprocess.run(
        ["kubectl", "scale", "deploy", deployment, "--replicas=10", "-n", NAMESPACE],
        check=True,
    )

    # 5. Post to Slack
    slack_msg = {"text": f"🚨 CPU Runbook: {pod_name}\nAction: {action}"}
    requests.post("https://hooks.slack.com/services/...", json=slack_msg, timeout=10)

    print(f"Runbook executed: scaled {deployment}; follow-up: {action}")
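
The parsing in step 1 is easier to test when it is separated from the kubectl call. This is a minimal sketch, assuming `kubectl top pods` rows of the form `NAME CPU(cores) MEMORY(bytes)` with CPU in millicores, and assuming Deployment-managed pods named `<deployment>-<replicaset-hash>-<pod-hash>`. The function names and sample pod names are hypothetical.

```python
def busiest_pod(top_output: str) -> str:
    """Return the name of the pod with the highest CPU in `kubectl top pods` output."""
    rows = [line.split() for line in top_output.strip().splitlines()]
    # Skip the header row if present (its first column is literally "NAME").
    rows = [r for r in rows if r and r[0] != "NAME"]
    if not rows:
        raise ValueError("no pod rows found")
    return max(rows, key=lambda r: int(r[1].rstrip("m")))[0]


def deployment_of(pod_name: str) -> str:
    """Strip the ReplicaSet hash and pod suffix from a Deployment-managed pod name."""
    return pod_name.rsplit("-", 2)[0]


if __name__ == "__main__":
    sample = """NAME                     CPU(cores)   MEMORY(bytes)
api-7d9c5b6f4-x2k8p      950m         310Mi
worker-5f6d8c9b7-q4w1z   120m         200Mi"""
    pod = busiest_pod(sample)
    print(pod, "->", deployment_of(pod))
```

Keeping these two helpers pure means they can be checked against captured output without access to a cluster.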

Runbook Automation Levels

| Level | Description | Example |
|-------|-------------|---------|
| L1 | Manual (read docs) | Wiki page |
| L2 | Semi-automated (click button) | Jenkins job |
| L3 | Full auto (no human) | Auto-scaling |
| L4 | Predictive (prevent) | Load forecasting |
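
To make the difference between L2 and L3 concrete, here is a small sketch of a step runner that asks a human for approval unless told otherwise. The `run_step` helper and its `auto_approve` flag are hypothetical names, not part of any particular tool.

```python
import subprocess
from typing import Callable


def run_step(cmd: list[str], auto_approve: bool = False,
             confirm: Callable[[str], str] = input) -> bool:
    """Run `cmd` if approved; return True when it ran successfully."""
    if not auto_approve:
        answer = confirm(f"Run {' '.join(cmd)}? [y/N] ")
        if answer.strip().lower() != "y":
            return False          # L2: the human declined
    return subprocess.run(cmd).returncode == 0


if __name__ == "__main__":
    # `echo` stands in for a real remediation command.
    run_step(["echo", "scaling up"], auto_approve=True)
```

With `auto_approve=False` this behaves like an L2 "click button" step; flipping the flag turns the same step into L3 automation.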

Common Runbooks

| Incident | Runbook |
|----------|---------|
| CPU spike | Scale up, check for leak |
| Memory leak | Restart, increase limit, fix code |
| DB slow | Check slow queries, add index |
| Certificate expiry | Auto-renew (cert-manager) |
| Disk full | Clean logs, increase PV |
| Pod crash loop | Check logs, rollback version |
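
One way to wire a table like this into automation is a registry that maps incident types to runbook functions. The sketch below is illustrative only: the incident keys, the `runbook` decorator, and the escalation fallback are assumptions, and the functions return step descriptions instead of executing anything.

```python
from typing import Callable

RUNBOOKS: dict[str, Callable[[], list[str]]] = {}


def runbook(incident: str):
    """Decorator that registers a function as the runbook for `incident`."""
    def register(func):
        RUNBOOKS[incident] = func
        return func
    return register


@runbook("cpu_spike")
def cpu_spike() -> list[str]:
    return ["Scale up", "Check for leak"]


@runbook("disk_full")
def disk_full() -> list[str]:
    return ["Clean logs", "Increase PV"]


def steps_for(incident: str) -> list[str]:
    """Look up the runbook; unknown incidents fall back to escalation."""
    handler = RUNBOOKS.get(incident)
    return handler() if handler else ["Escalate to on-call"]


if __name__ == "__main__":
    print(steps_for("cpu_spike"))
    print(steps_for("unknown_incident"))
```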

SRE Course Complete! 🎉

✅ SLO & SLI
✅ Incident Response
✅ Capacity Planning
✅ Chaos Engineering
✅ Dashboard & Runbooks

DevOps Track Complete! 🎉

✅ Docker Compose
✅ Kubernetes & Helm
✅ Cloud AWS
✅ Serverless
✅ Monitoring
✅ GitOps
✅ SRE

Key Points

Understand the core concepts thoroughly
Practice with hands-on code examples
Apply knowledge to real-world problems
Review and reinforce through exercises

Further Learning

Official documentation
Open source projects on GitHub
Community forums and discussions
Related courses and tutorials

What Are Runbooks?

Runbooks are documented procedures for operating and troubleshooting systems. They ensure consistent, repeatable responses to common scenarios.

Runbook Structure

| Section | What It Contains |
|---------|------------------|
| Title | Clear, searchable name |
| Symptoms | What the user/monitoring sees |
| Severity | Impact assessment (SEV-1 to SEV-4) |
| Pre-checks | Quick health checks before diving in |
| Diagnosis | Steps to identify root cause |
| Resolution | Step-by-step fix instructions |
| Verification | How to confirm the fix worked |
| Escalation | Who to contact if unresolved |
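
If runbooks are stored as structured data, the same sections can be checked automatically. This is a minimal sketch, assuming one field per section from the table; the `Runbook` class and `missing_sections` method are hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    title: str
    symptoms: list[str] = field(default_factory=list)
    severity: str = ""
    pre_checks: list[str] = field(default_factory=list)
    diagnosis: list[str] = field(default_factory=list)
    resolution: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    escalation: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    draft = Runbook(title="High CPU Utilization", severity="SEV-2",
                    symptoms=["CPU > 80% for 5 minutes"])
    print(draft.missing_sections())
```

A check like this could run in CI so that a runbook without verification or escalation steps is flagged before it is merged.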

Example Runbooks

High CPU Alert Runbook

# Runbook: High CPU Utilization

## Symptoms
- PagerDuty alert: "CPU > 80% for 5 minutes"
- Users report slow page loads

## Severity: SEV-2

## Pre-checks
1. Check if there was a recent deployment
2. Check if traffic spiked (holiday, promotion)
3. Check if dependent services are healthy

## Diagnosis
```bash
# Check top CPU consumers
top -b -n 1 | head -20

# Check specific process
ps aux --sort=-%cpu | head -10

# Check container resource usage
docker stats --no-stream

# Check Kubernetes pod resource usage
kubectl top pods -n production

# Check recent application logs
journalctl -u myapp -n 50 --no-pager
```

Resolution

Option A: Scale up

kubectl scale deployment myapp --replicas=10 -n production

Option B: Restart the service

kubectl rollout restart deployment myapp -n production

Option C: Identify and fix the code issue

Check recent deployment for code changes
Rollback if a recent change caused the issue

kubectl rollout undo deployment myapp -n production

Verification

# CPU should drop below threshold within 2 minutes
top -b -n 1 | head -5

# Check alert clears in PagerDuty/Prometheus
curl -s 'http://prometheus:9090/api/v1/query?query=...'

Escalation

If unresolved after 15 minutes, escalate to:

Senior SRE: @sre-lead
Engineering Manager: @eng-manager
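
The verification and escalation steps of this runbook can be combined into one polling loop: keep checking until the system is healthy or the 15-minute escalation deadline passes. This is a sketch under assumptions: `verify_or_escalate` is a hypothetical helper, and the health check is passed in as a callable so that any probe (CPU query, alert status) can be plugged in.

```python
import time
from typing import Callable


def verify_or_escalate(is_healthy: Callable[[], bool],
                       timeout_s: float = 15 * 60,
                       interval_s: float = 30,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll `is_healthy` until it passes or the escalation deadline is reached."""
    deadline = clock() + timeout_s
    while True:
        if is_healthy():
            return "resolved"
        if clock() >= deadline:
            return "escalate"
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical check that succeeds on the third poll.
    results = iter([False, False, True])
    print(verify_or_escalate(lambda: next(results), sleep=lambda _: None))
```

Injecting `clock` and `sleep` keeps the loop testable without waiting for real time to pass.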


Database Connection Exhaustion Runbook

# Runbook: Database Connection Pool Exhaustion

## Symptoms
- Application errors: "could not acquire connection from pool"
- Increased query latency
- Intermittent 5xx errors

## Severity: SEV-1

## Diagnosis
```bash
# psql reads connection settings from PGHOST, PGUSER, PGDATABASE, etc.

# Check active connections against the configured maximum
psql -c "SELECT count(*) FROM pg_stat_activity;"
psql -c "SHOW max_connections;"

# Check which queries have been running longest
psql <<'SQL'
SELECT pid, now() - query_start AS duration,
       query, state
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;
SQL

# Check application connection pool status
curl -s http://myapp:8080/health/db | jq .
```

Resolution

Immediate (Stop the bleeding)

-- Terminate sessions that have been idle in a transaction for more than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - query_start > interval '5 minutes';

Temporary (Increase pool)

# Increase max_connections (a static parameter: takes effect only after a restart)
# Self-managed: edit postgresql.conf and restart PostgreSQL
# On RDS: change the parameter group, then reboot the instance
aws rds modify-db-parameter-group \
  --db-parameter-group-name myapp-pg \
  --parameters "ParameterName=max_connections,ParameterValue=300,ApplyMethod=pending-reboot"

Permanent (Fix the app)

Check for connection leaks in the code
Ensure connections are returned to pool after use
Add connection pool monitoring (HikariCP, PgBouncer)
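
To illustrate what "returned to pool after use" means in code, here is a toy pool whose context manager gives the connection back even when the query raises. It is a sketch, not a real driver: the `Pool` class is hypothetical and plain strings stand in for database connections.

```python
import queue
from contextlib import contextmanager


class Pool:
    """Toy fixed-size pool; strings stand in for real DB connections."""

    def __init__(self, size: int):
        self._free: queue.Queue = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    @contextmanager
    def connection(self, timeout: float = 1.0):
        conn = self._free.get(timeout=timeout)   # raises queue.Empty when exhausted
        try:
            yield conn
        finally:
            self._free.put(conn)                 # returned even if the body raises

    def available(self) -> int:
        return self._free.qsize()


if __name__ == "__main__":
    pool = Pool(size=2)
    try:
        with pool.connection() as conn:
            raise RuntimeError(f"query failed on {conn}")
    except RuntimeError:
        pass
    print(pool.available())
```

A leak typically appears when a connection is acquired outside such a `try`/`finally` (or `with`) block and an exception skips the release call.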


Runbook Automation

```bash
#!/bin/bash
# auto-runbook.sh: execute runbook steps automatically
set -eu

ALERT_NAME="${1:?usage: auto-runbook.sh <alert-name>}"

case "$ALERT_NAME" in
  "HighCPU")
    echo "=== Auto-Runbook: High CPU ==="
    echo "[1] Checking top consumers..."
    ps aux --sort=-%cpu | head -5

    echo "[2] Checking recent deployment..."
    kubectl rollout history deployment/myapp -n production

    echo "[3] Scaling up temporarily..."
    kubectl scale deployment/myapp --replicas=5 -n production
    ;;

  "DBConnectionExhausted")
    echo "=== Auto-Runbook: DB Connection Exhaustion ==="
    echo "[1] Killing idle transactions..."
    psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle in transaction' AND now() - query_start > interval '5 minutes';"

    echo "[2] Notifying team..."
    curl -X POST -H 'Content-Type: application/json' \
      -d '{"text": "⚠️ DB connections exhausted, auto-remediation applied"}' \
      "${SLACK_WEBHOOK:?SLACK_WEBHOOK must be set}"
    ;;

  *)
    # Unknown alerts should fail loudly instead of silently doing nothing
    echo "No auto-runbook for alert: $ALERT_NAME" >&2
    exit 1
    ;;
esac
```

Summary

Runbooks provide step-by-step procedures for diagnosing and resolving common incidents. Automating runbook steps reduces MTTR (Mean Time to Resolution).

Key takeaways:

Runbook sections: symptoms → severity → pre-checks → diagnosis → resolution → verification → escalation
High CPU: check processes → scale up or rollback recent deploy
DB exhaustion: kill idle transactions → increase pool → fix connection leak
Automate runbook steps with shell scripts triggered by alerts
Always include verification steps to confirm resolution
Escalation path ensures unresolved incidents get senior attention
Keep runbooks in version control (Git) for continuous improvement
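
As one possible way to trigger the `auto-runbook.sh` script from alerts, the sketch below turns an alert webhook payload into script invocations. It assumes a Prometheus Alertmanager webhook receiver, whose JSON body carries an `alerts` list with a `status` and an `alertname` label per alert; the `commands_for` function is hypothetical and only builds the commands without running them.

```python
import json

# Alert names that auto-runbook.sh knows how to handle.
KNOWN_ALERTS = {"HighCPU", "DBConnectionExhausted"}


def commands_for(payload: str, script: str = "./auto-runbook.sh") -> list[list[str]]:
    """Build one script invocation per firing alert that has a runbook."""
    body = json.loads(payload)
    commands = []
    for alert in body.get("alerts", []):
        name = alert.get("labels", {}).get("alertname")
        if alert.get("status") == "firing" and name in KNOWN_ALERTS:
            commands.append([script, name])
    return commands


if __name__ == "__main__":
    # Hand-written example payload, trimmed to the fields used here.
    example = json.dumps({"alerts": [
        {"status": "firing", "labels": {"alertname": "HighCPU"}},
        {"status": "resolved", "labels": {"alertname": "DBConnectionExhausted"}},
    ]})
    print(commands_for(example))
```

Filtering on an allow-list of alert names keeps arbitrary label values from reaching the shell script.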

You've completed this course! You now have a complete SRE foundation.
