SRE Dashboard & Runbooks

🔥 Vibe Prompt

"Create an SRE dashboard: SLO compliance, burn rate, on-call status. Automate runbooks for common incidents."

SRE Dashboard Panels

SLO Compliance (30d window)
├── Availability: 99.92% (SLO: 99.9%) ✅
├── Latency p99: 320ms (SLO: 500ms) ✅
└── Error Budget Remaining: 20% ✅

Burn Rate (1h window)
├── Availability: 0.02% budget burned
└── Alert: green (normal)

Incident Summary
├── Last 24h: 0 SEV-1, 1 SEV-2, 3 SEV-3
└── MTTR (30d): 28 minutes

On-Call
├── Primary: Alice (until Mon 9am)
└── Secondary: Bob
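
The SLO Compliance and Burn Rate panels come down to a little arithmetic. The sketch below is a minimal illustration, not the dashboard's actual implementation: the function names are made up for this example, and it assumes the 30-day SLO period shown in the panels. Burn rate is taken here in its common sense, the share of budget consumed in a window divided by the share of the SLO period that window covers, so a value of 1.0 would exhaust the budget exactly at the end of the period.

```python
def error_budget_remaining(availability: float, slo: float) -> float:
    """Fraction of the error budget left; both inputs are fractions, e.g. 0.999."""
    budget = 1.0 - slo             # unavailability the SLO allows
    consumed = 1.0 - availability  # unavailability actually observed
    return 1.0 - consumed / budget


def burn_rate(budget_burned: float, window_hours: float,
              period_hours: float = 30 * 24) -> float:
    """Budget fraction burned in the window, relative to an even spend over the period."""
    return budget_burned / (window_hours / period_hours)


if __name__ == "__main__":
    # Inputs read off the panels: 99.92% availability against a 99.9% SLO,
    # and 0.02% of the budget burned in the last hour.
    print(f"budget remaining: {error_budget_remaining(0.9992, 0.999):.0%}")
    print(f"1h burn rate: {burn_rate(0.0002, window_hours=1):.3f}")
```

A 1h burn of 0.02% works out to a burn rate of roughly 0.144, well under 1.0, which is why the alert stays green.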

Automated Runbook

# Automated CPU spike runbook
import subprocess

import requests

NAMESPACE = "production"


def kubectl(*args):
    """Run kubectl in NAMESPACE and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(["kubectl", *args, "-n", NAMESPACE],
                            capture_output=True, text=True, check=True)
    return result.stdout


def cpu_runbook():
    # 1. Identify the culprit: the pod using the most CPU (skip the header row)
    lines = kubectl("top", "pods").strip().split("\n")[1:]
    if not lines:
        raise RuntimeError("kubectl top returned no pods")
    culprit = max(lines, key=lambda l: float(l.split()[1].replace("m", "")))
    pod_name = culprit.split()[0]
    # Deployment pods are named <deployment>-<replicaset-hash>-<pod-suffix>,
    # so drop the last two segments to get the Deployment name
    deployment = pod_name.rsplit("-", 2)[0]

    # 2. Get recent logs
    logs = kubectl("logs", "--tail=100", pod_name)

    # 3. Check for a known pattern and pick the recommended follow-up
    if "OOM" in logs:
        action = "Increase memory limit"
    elif "connection refused" in logs:
        action = "Restart dependent service"
    else:
        action = "Scale replicas + investigate"

    # 4. Mitigate immediately by scaling out; `action` is left to the on-call engineer
    kubectl("scale", "deploy", deployment, "--replicas=10")

    # 5. Post to Slack
    slack_msg = {"text": f"🚨 CPU Runbook: {pod_name}\n"
                         f"Scaled {deployment} to 10 replicas\n"
                         f"Recommended action: {action}"}
    requests.post("https://hooks.slack.com/services/...", json=slack_msg,
                  timeout=10).raise_for_status()

    print(f"Runbook executed: {action}")
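
Step 1 of the runbook depends on parsing `kubectl top pods` output, which is easy to get subtly wrong, so it can help to pull that logic into pure functions that are testable without a cluster. The following is one possible sketch. The function names and the sample output are invented for illustration, and it assumes the pods belong to a Deployment, whose pods are named `<deployment>-<replicaset-hash>-<pod-suffix>`.

```python
def parse_top_pods(output: str) -> list[tuple[str, int]]:
    """Parse `kubectl top pods` text into (pod_name, cpu_millicores) pairs."""
    pods = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, cpu = fields[0], fields[1]
        # CPU is normally shown in millicores ("250m"); as an assumption,
        # a bare number is treated as whole cores.
        millicores = int(cpu[:-1]) if cpu.endswith("m") else int(float(cpu) * 1000)
        pods.append((name, millicores))
    return pods


def busiest_pod(output: str) -> str:
    """Name of the pod with the highest CPU usage (raises ValueError if there are none)."""
    return max(parse_top_pods(output), key=lambda pod: pod[1])[0]


def deployment_of(pod_name: str) -> str:
    """Strip the ReplicaSet hash and the pod suffix from a Deployment pod name."""
    return pod_name.rsplit("-", 2)[0]


# Made-up sample in the shape kubectl prints; not real cluster data.
SAMPLE = """\
NAME                          CPU(cores)   MEMORY(bytes)
api-server-7d9f8b6c5d-x2k4q   950m         512Mi
worker-5c6d7e8f9a-abcde       120m         256Mi
"""
```

Stripping only one segment from the pod name would yield the ReplicaSet name rather than the Deployment, and `kubectl scale deploy` would then fail to find it.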

Runbook Automation Levels

| Level | Description | Example |
|-------|-------------|---------|
| L1 | Manual (read docs) | Wiki page |
| L2 | Semi-automated (click button) | Jenkins job |
| L3 | Full auto (no human) | Auto-scaling |
| L4 | Predictive (prevent) | Load forecasting |
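
The practical difference between L2 and L3 is whether a human has to approve an action before it runs. As a rough illustration, the same remediation function can serve both levels if the approval gate is optional. The `run_step` helper, its parameters, and `scale_up` are hypothetical names for this sketch, not part of any tool mentioned here.

```python
from typing import Callable, Optional


def run_step(action: Callable[[], str],
             description: str,
             approve: Optional[Callable[[str], bool]] = None) -> str:
    """Run one runbook step.

    approve=None -> L3: run with no human in the loop.
    approve=func -> L2: run only if func(description) returns True.
    """
    if approve is not None and not approve(description):
        return f"skipped: {description}"
    return action()


def ask_on_terminal(description: str) -> bool:
    """An L2 gate: ask the on-call engineer to confirm on the terminal."""
    return input(f"Run '{description}'? [y/N] ").strip().lower() == "y"


def scale_up() -> str:
    # Placeholder for the real remediation, e.g. a kubectl scale call.
    return "scaled up"
```

Calling `run_step(scale_up, "scale api")` runs unattended, while `run_step(scale_up, "scale api", ask_on_terminal)` waits for a `y` first. Starting a new runbook at L2 and dropping the gate once it has proven safe is one cautious way to move up the levels.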

Common Runbooks

| Incident | Runbook |
|----------|---------|
| CPU spike | Scale up, check for leak |
| Memory leak | Restart, increase limit, fix code |
| DB slow | Check slow queries, add index |
| Certificate expiry | Auto-renew (cert-manager) |
| Disk full | Clean logs, increase PV |
| Pod crash loop | Check logs, rollback version |
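
One lightweight way to make this table actionable is to keep it as data, so an alert handler can look up the first-response steps by incident type. The sketch below simply mirrors the table. The `RUNBOOKS` dictionary, its key format, and the `steps_for` helper are illustrative names, and the escalation fallback for unknown incidents is an assumption.

```python
RUNBOOKS: dict[str, list[str]] = {
    "cpu_spike": ["Scale up", "Check for leak"],
    "memory_leak": ["Restart", "Increase limit", "Fix code"],
    "db_slow": ["Check slow queries", "Add index"],
    "certificate_expiry": ["Auto-renew (cert-manager)"],
    "disk_full": ["Clean logs", "Increase PV"],
    "pod_crash_loop": ["Check logs", "Rollback version"],
}


def steps_for(incident: str) -> list[str]:
    """Return the runbook steps for an incident type, or an escalation step if none exists."""
    key = incident.strip().lower().replace(" ", "_")
    return RUNBOOKS.get(key, ["Escalate to on-call: no runbook registered"])
```

For example, `steps_for("CPU spike")` returns the two steps from the first table row, and an unrecognized incident falls through to escalation instead of failing silently.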

SRE Course Complete! 🎉

✅ SLO & SLI
✅ Incident Response
✅ Capacity Planning
✅ Chaos Engineering
✅ Dashboard & Runbooks

DevOps Track Complete! 🎉

✅ Docker Compose
✅ Kubernetes & Helm
✅ Cloud AWS
✅ Serverless
✅ Monitoring
✅ GitOps
✅ SRE

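Runbooks: When the Alert Fires, Does Your Team Know What to Do?

Picture this: it is 3 a.m. and your phone goes off. PagerDuty has raised a P0 alert, and the website is completely unreachable.

If your team has no runbook:

The on-call engineer is jolted awake
They open a laptop and freeze, unsure where to start looking
It takes 30 minutes to find the problem, and by then the outage has already cost hundreds of thousands

If your team has a runbook:

The on-call engineer follows the runbook step by step
Initial diagnosis is done within 5 minutes
Mitigation is applied within 10 minutes
Service is restored, and the engineer goes back to sleep

What Does a Good Runbook Look Like?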

# Runbook: Website 503 - API Service Unreachable

## Impact
- All API requests fail
- Frontend pages load normally, but users cannot log in
- All users are affected

## Severity
P0 - Critical

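## Diagnostic Steps

### Step 1: Check ECS service status
```bash
aws ecs describe-services --cluster production --services api-service
```

- ✅ Healthy → continue to Step 2
- ❌ Service down → skip to Step 4

### Step 2: Check RDS connection count
```bash
aws rds describe-db-instances --db-instance-identifier mydb
```

This command reports instance status only. The current connection count is the `DatabaseConnections` metric in CloudWatch (namespace `AWS/RDS`).

- ✅ Normal → continue to Step 3
- ❌ Connections exhausted → skip to Step 5

### Step 3: Check recent deployments
```bash
git log --oneline -10
```

If there was a recent deployment, consider rolling it back:

```bash
git revert HEAD
```

The revert only creates a new commit locally, so push it and redeploy through the usual pipeline for it to take effect.

### Step 4: Restart the service
```bash
aws ecs update-service --cluster production --service api-service --force-new-deployment
```

### Step 5: Raise the RDS connection limit
```bash
# max_connections is set in the DB parameter group, not on the instance itself.
# "mydb-params" is a placeholder for the custom parameter group attached to mydb.
aws rds modify-db-parameter-group --db-parameter-group-name mydb-params \
  --parameters "ParameterName=max_connections,ParameterValue=500,ApplyMethod=immediate"
```

On engines where `max_connections` is a static parameter (RDS for PostgreSQL, for example), use `ApplyMethod=pending-reboot` and reboot the instance for the change to apply.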

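## Recovery Verification

- [ ] API responds normally (200 OK)
- [ ] Users can log in from the frontend
- [ ] Error rate < 0.1%
- [ ] PagerDuty alert resolved

## Postmortem

- Root cause:
- Preventive measures:
- Monitoring and alerting improvements: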


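### Course Summary
You have completed all five SRE lessons, covering SLI/SLO definition, incident response, capacity planning, chaos engineering, and finally the SRE dashboard and runbooks.

The core of SRE is not tooling but **culture**: applying engineering methods to operations, using automation to reduce manual work, and letting data drive decisions.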