SRE Dashboard & Runbooks
🔥 Vibe Prompt
"Create an SRE dashboard: SLO compliance, burn rate, on-call status. Automate runbooks for common incidents."
SRE Dashboard Panels
SLO Compliance (30d window)
├── Availability: 99.92% (SLO: 99.9%) ✅
├── Latency p99: 320ms (SLO: 500ms) ✅
└── Error Budget Remaining: 20% ✅
Burn Rate (1h window)
├── Availability: 0.02% budget burned
└── Alert: green (normal)
Incident Summary
├── Last 24h: 0 SEV-1, 1 SEV-2, 3 SEV-3
└── MTTR (30d): 28 minutes
On-Call
├── Primary: Alice (until Mon 9am)
└── Secondary: Bob
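The SLO panels reduce to two ratios: how much of the error budget is left, and how fast it is being spent. The following is a minimal sketch, assuming a request-based SLI (failed requests over total requests); the function names and sample numbers are hypothetical and not taken from any monitoring tool.

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, < 0 = SLO violated)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1.0 - slo) * total
    return 1.0 - failed / allowed_failures


def burn_rate(slo: float, total: int, failed: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would run out exactly at the end of the SLO window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (failed / total) / (1.0 - slo)


if __name__ == "__main__":
    # Hypothetical 30-day totals: 1,000,000 requests, 400 failures, 99.9% SLO.
    print(f"budget remaining: {error_budget_remaining(0.999, 1_000_000, 400):.0%}")
    # Hypothetical last hour: 2,000 requests, 4 failures.
    print(f"1h burn rate: {burn_rate(0.999, 2_000, 4):.1f}x")
```

A burn rate above 1.0 over a short window is the usual trigger for paging, because it means the budget will be exhausted before the SLO window ends if the trend continues.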
Automated Runbook
# Automated CPU spike runbook
import subprocess

import requests


def cpu_runbook():
    # 1. Identify the culprit (kubectl top requires metrics-server)
    top_output = subprocess.run(
        ["kubectl", "top", "pods", "-n", "production"],
        capture_output=True, text=True, check=True,
    )
    # Skip the header row, then find the pod with the highest CPU (millicores, e.g. "250m")
    lines = top_output.stdout.strip().split("\n")[1:]
    if not lines:
        print("No pods found; nothing to do")
        return
    culprit = max(lines, key=lambda l: float(l.split()[1].replace("m", "")))
    pod_name = culprit.split()[0]

    # 2. Get recent logs
    logs = subprocess.run(
        ["kubectl", "logs", "--tail=100", pod_name, "-n", "production"],
        capture_output=True, text=True,
    )

    # 3. Check for a known pattern
    if "OOM" in logs.stdout:
        action = "Increase memory limit"
    elif "connection refused" in logs.stdout:
        action = "Restart dependent service"
    else:
        action = "Scale replicas + investigate"

    # 4. Execute fix: always scale out as a stopgap; `action` is the recommended follow-up.
    # Pod names look like <deployment>-<replicaset-hash>-<pod-suffix>, so strip two suffixes.
    deployment = pod_name.rsplit("-", 2)[0]
    subprocess.run(
        ["kubectl", "scale", "deploy", deployment, "--replicas=10", "-n", "production"],
        check=True,
    )

    # 5. Post to Slack
    slack_msg = {"text": f"🚨 CPU Runbook: {pod_name}\nAction: {action}"}
    requests.post("https://hooks.slack.com/services/...", json=slack_msg, timeout=10)
    print(f"Runbook executed: {action}")
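The text parsing in steps 1 and 4 is the most fragile part of this runbook, so it can help to isolate it in pure functions that are testable without a cluster. This is only a sketch: it assumes `kubectl top pods` prints a header row and reports CPU in millicores, and that the pod belongs to a Deployment (names of the form `<deployment>-<replicaset-hash>-<pod-suffix>`). The helper names and the sample output are hypothetical.

```python
def hottest_pod(top_stdout: str) -> tuple[str, int]:
    """Return (pod_name, cpu_millicores) for the busiest pod in `kubectl top pods` output."""
    rows = [line.split() for line in top_stdout.strip().splitlines()[1:] if line.strip()]
    if not rows:
        raise ValueError("no pods in kubectl top output")
    usage = [(row[0], int(row[1].rstrip("m"))) for row in rows]
    return max(usage, key=lambda item: item[1])


def deployment_of(pod_name: str) -> str:
    """Strip the ReplicaSet hash and pod suffix from a Deployment-managed pod name."""
    parts = pod_name.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected pod name: {pod_name!r}")
    return parts[0]


# Hypothetical sample of `kubectl top pods` output.
SAMPLE = """\
NAME                        CPU(cores)   MEMORY(bytes)
api-7d9c6b5f4d-x2k8p        950m         512Mi
worker-5f6d8c9b7-qwert      120m         256Mi
"""

pod, cpu = hottest_pod(SAMPLE)
print(pod, cpu, deployment_of(pod))
```

Pods owned by a StatefulSet or DaemonSet follow different naming rules, so a production version would more safely read the owner from the pod's metadata than from its name.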
Runbook Automation Levels
| Level | Description | Example |
|-------|-------------|---------|
| L1 | Manual (read docs) | Wiki page |
| L2 | Semi-automated (click button) | Jenkins job |
| L3 | Full auto (no human) | Auto-scaling |
| L4 | Predictive (prevent) | Load forecasting |
Common Runbooks
| Incident | Runbook |
|----------|---------|
| CPU spike | Scale up, check for leak |
| Memory leak | Restart, increase limit, fix code |
| DB slow | Check slow queries, add index |
| Certificate expiry | Auto-renew (cert-manager) |
| Disk full | Clean logs, increase PV |
| Pod crash loop | Check logs, rollback version |
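One way to make this table usable by tooling is a lookup from incident type to ordered steps. The sketch below copies the steps from the table; the incident keys and the `steps_for` helper are hypothetical names chosen for illustration.

```python
# Steps are copied from the table above; the incident keys are hypothetical.
RUNBOOKS: dict[str, list[str]] = {
    "cpu_spike": ["Scale up", "Check for leak"],
    "memory_leak": ["Restart", "Increase limit", "Fix code"],
    "db_slow": ["Check slow queries", "Add index"],
    "certificate_expiry": ["Auto-renew (cert-manager)"],
    "disk_full": ["Clean logs", "Increase PV"],
    "pod_crash_loop": ["Check logs", "Rollback version"],
}


def steps_for(incident: str) -> list[str]:
    """Return the ordered steps for an incident, or a fallback when no runbook exists."""
    return RUNBOOKS.get(incident, ["No runbook registered: escalate to on-call"])


for number, step in enumerate(steps_for("disk_full"), start=1):
    print(f"{number}. {step}")
```

The explicit fallback matters: an alert with no matching runbook should end with a human, not silently do nothing.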
SRE Course Complete! 🎉
- ✅ SLO & SLI
- ✅ Incident Response
- ✅ Capacity Planning
- ✅ Chaos Engineering
- ✅ Dashboard & Runbooks
DevOps Track Complete! 🎉
- ✅ Docker Compose
- ✅ Kubernetes & Helm
- ✅ Cloud AWS
- ✅ Serverless
- ✅ Monitoring
- ✅ GitOps
- ✅ SRE
Key Points
- Understand the core concepts thoroughly
- Practice with hands-on code examples
- Apply knowledge to real-world problems
- Review and reinforce through exercises
Further Learning
- Official documentation
- Open source projects on GitHub
- Community forums and discussions
- Related courses and tutorials
What Are Runbooks?
Runbooks are documented procedures for operating and troubleshooting systems. They ensure consistent, repeatable responses to common scenarios.
Runbook Structure
| Section | What It Contains |
|---------|-----------------|
| Title | Clear, searchable name |
| Symptoms | What the user/monitoring sees |
| Severity | Impact assessment (SEV-1 to SEV-4) |
| Pre-checks | Quick health checks before diving in |
| Diagnosis | Steps to identify root cause |
| Resolution | Step-by-step fix instructions |
| Verification | How to confirm the fix worked |
| Escalation | Who to contact if unresolved |
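The same structure can be expressed as a small data type, which makes it easy to lint runbooks for missing sections before they are merged. This is a sketch under the assumption that runbooks are stored as structured data; the `Runbook` class and its `missing_sections` method are hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    """One field per section of the runbook structure table."""

    title: str
    symptoms: list[str] = field(default_factory=list)
    severity: str = ""  # e.g. "SEV-2"
    pre_checks: list[str] = field(default_factory=list)
    diagnosis: list[str] = field(default_factory=list)
    resolution: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    escalation: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


draft = Runbook(
    title="High CPU Utilization",
    symptoms=["CPU > 80% for 5 minutes"],
    severity="SEV-2",
)
print(draft.missing_sections())
```

A check like this could run in CI so that a runbook without verification or escalation steps never reaches the on-call engineer.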
Example Runbooks
High CPU Alert Runbook
````markdown
# Runbook: High CPU Utilization
## Symptoms
- PagerDuty alert: "CPU > 80% for 5 minutes"
- Users report slow page loads
## Severity: SEV-2
## Pre-checks
1. Check if there was a recent deployment
2. Check if traffic spiked (holiday, promotion)
3. Check if dependent services are healthy
## Diagnosis
```bash
# Check top CPU consumers
top -b -n 1 | head -20
# List processes sorted by CPU usage
ps aux --sort=-%cpu | head -10
# Check container resource usage
docker stats --no-stream
# Check Kubernetes pod resource usage
kubectl top pods -n production
# Check recent application logs
journalctl -u myapp -n 50 --no-pager
```
## Resolution
### Option A: Scale up
```bash
kubectl scale deployment myapp --replicas=10 -n production
```
### Option B: Restart the service
```bash
kubectl rollout restart deployment myapp -n production
```
### Option C: Identify and fix the code issue
- Check the recent deployment for code changes
- Roll back if a recent change caused the issue
```bash
kubectl rollout undo deployment myapp -n production
```
## Verification
```bash
# CPU should drop below threshold within 2 minutes
top -b -n 1 | head -5
# Check that the alert clears in PagerDuty/Prometheus
curl -s 'http://prometheus:9090/api/v1/query?query=...'
```
## Escalation
If unresolved after 15 minutes, escalate to:
- Senior SRE: @sre-lead
- Engineering Manager: @eng-manager
````
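The verification and escalation steps of this runbook can be combined into one polling loop: keep checking CPU until it falls below the alert threshold, and escalate if it has not done so within 15 minutes. The sketch below is hypothetical throughout: `read_cpu_percent` and `escalate` are callables the caller must supply (for example a Prometheus query and a pager call), and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable


def verify_or_escalate(
    read_cpu_percent: Callable[[], float],
    escalate: Callable[[str], None],
    threshold: float = 80.0,      # alert threshold from the runbook
    timeout_s: float = 15 * 60,   # escalation deadline from the runbook
    poll_s: float = 30.0,         # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll CPU until it drops below `threshold`; escalate if it has not by `timeout_s`."""
    deadline = clock() + timeout_s
    while True:
        cpu = read_cpu_percent()
        if cpu < threshold:
            return True  # fix verified
        if clock() >= deadline:
            escalate(f"CPU still at {cpu:.0f}% after {timeout_s / 60:.0f} minutes")
            return False
        sleep(poll_s)


# Usage with canned readings instead of a real metrics query.
readings = iter([95.0, 90.0, 70.0])
pages: list[str] = []
recovered = verify_or_escalate(lambda: next(readings), pages.append, sleep=lambda s: None)
print(recovered, pages)
```

Returning a boolean instead of raising keeps the caller free to decide what happens after escalation, such as leaving the scaled-up replicas in place.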
Database Connection Exhaustion Runbook
````markdown
# Runbook: Database Connection Pool Exhaustion
## Symptoms
- Application errors: "could not acquire connection from pool"
- Increased query latency
- Intermittent 5xx errors
## Severity: SEV-1
## Diagnosis
```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity;
-- Check max connections
SHOW max_connections;
-- Check which queries are running long
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;
```
```bash
# Check application connection pool status
curl -s http://myapp:8080/health/db | jq .
```
## Resolution
### Immediate (stop the bleeding)
```sql
-- Terminate sessions that have been idle in a transaction for more than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - query_start > interval '5 minutes';
```
### Temporary (raise the connection limit)
```bash
# Increase max_connections (requires a restart):
# edit postgresql.conf and restart PostgreSQL,
# OR update the RDS parameter group. max_connections is a static parameter,
# so it needs ApplyMethod=pending-reboot and takes effect after the next reboot.
aws rds modify-db-parameter-group \
  --db-parameter-group-name myapp-pg \
  --parameters "ParameterName=max_connections,ParameterValue=300,ApplyMethod=pending-reboot"
```
### Permanent (fix the app)
- Check for connection leaks in the code
- Ensure connections are returned to the pool after use
- Add connection pool monitoring (HikariCP, PgBouncer)
````
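The permanent fix, returning every connection to the pool, is usually enforced with a scoped acquire/release construct so that an exception cannot leak a connection. The following sketch uses a toy in-memory pool to show the pattern; `TinyPool` is hypothetical and stands in for whatever pool the application's driver or framework provides.

```python
import queue
from contextlib import contextmanager


class TinyPool:
    """Toy fixed-size pool; a real app would use its driver's or framework's pool."""

    def __init__(self, size: int):
        self._free: queue.Queue = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")  # strings stand in for real connections

    @contextmanager
    def connection(self, timeout: float = 1.0):
        try:
            conn = self._free.get(timeout=timeout)
        except queue.Empty:
            raise RuntimeError("could not acquire connection from pool") from None
        try:
            yield conn
        finally:
            self._free.put(conn)  # always returned, even if the caller raised

    def available(self) -> int:
        return self._free.qsize()


pool = TinyPool(2)
try:
    with pool.connection() as conn:
        raise ValueError("query failed")  # simulate an error mid-query
except ValueError:
    pass
print(pool.available())  # both connections are back in the pool
```

Code that acquires a connection without such a scope (and forgets to release it on an error path) is the typical source of the leak this runbook describes.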
Runbook Automation
```bash
#!/bin/bash
# auto-runbook.sh: execute runbook steps automatically
# Usage: ./auto-runbook.sh <alert-name>
set -eu  # stop on errors and on unset variables such as SLACK_WEBHOOK

ALERT_NAME="${1:-}"

case "$ALERT_NAME" in
  "HighCPU")
    echo "=== Auto-Runbook: High CPU ==="
    echo "[1] Checking top consumers..."
    ps aux --sort=-%cpu | head -5
    echo "[2] Checking recent deployment..."
    kubectl rollout history deployment/myapp -n production
    echo "[3] Scaling up temporarily..."
    kubectl scale deployment/myapp --replicas=5 -n production
    ;;
  "DBConnectionExhausted")
    echo "=== Auto-Runbook: DB Connection Exhaustion ==="
    echo "[1] Killing idle transactions..."
    psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle in transaction' AND now() - query_start > interval '5 minutes';"
    echo "[2] Notifying team..."
    curl -X POST -H 'Content-Type: application/json' \
      -d '{"text": "⚠️ DB connections exhausted - auto-remediation applied"}' \
      "$SLACK_WEBHOOK"
    ;;
  *)
    # Unknown or missing alert name: fail loudly instead of doing nothing
    echo "No automated runbook for alert: '$ALERT_NAME'" >&2
    exit 1
    ;;
esac
```
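To trigger the script from alerts, something has to translate an incoming alert into an `auto-runbook.sh` invocation. The sketch below assumes an Alertmanager-style webhook payload (a top-level `alerts` list whose items carry `status` and `labels.alertname`); that payload shape, the `KNOWN_RUNBOOKS` set, and the `runbook_commands` helper are assumptions for illustration, not part of the script above.

```python
# Alert names that auto-runbook.sh handles.
KNOWN_RUNBOOKS = {"HighCPU", "DBConnectionExhausted"}


def runbook_commands(payload: dict, script: str = "./auto-runbook.sh") -> list[list[str]]:
    """Build one argv list per firing alert that has an automated runbook."""
    commands = []
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "")
        if alert.get("status") == "firing" and name in KNOWN_RUNBOOKS:
            commands.append([script, name])
    return commands


# Hypothetical payload: one firing alert with a runbook, one resolved, one unknown.
payload = {
    "alerts": [
        {"status": "firing", "labels": {"alertname": "HighCPU"}},
        {"status": "resolved", "labels": {"alertname": "DBConnectionExhausted"}},
        {"status": "firing", "labels": {"alertname": "DiskFull"}},
    ]
}
for argv in runbook_commands(payload):
    print(" ".join(argv))  # a real handler might call subprocess.run(argv, check=True)
```

Building argv lists from an allowlist, rather than interpolating alert labels into a shell string, keeps untrusted alert content out of the command line.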
Summary
Runbooks provide step-by-step procedures for diagnosing and resolving common incidents. Automating runbook steps reduces MTTR (Mean Time to Resolution).
Key takeaways:
- Runbook sections: symptoms → severity → pre-checks → diagnosis → resolution → verification → escalation
- High CPU: check processes → scale up or rollback recent deploy
- DB exhaustion: kill idle transactions → increase pool → fix connection leak
- Automate runbook steps with shell scripts triggered by alerts
- Always include verification steps to confirm resolution
- Escalation path ensures unresolved incidents get senior attention
- Keep runbooks in version control (Git) for continuous improvement
You've completed this course! You now have a complete SRE foundation.