SRE Dashboard & Runbooks

🔥 Vibe Prompt

"Create an SRE dashboard: SLO compliance, burn rate, on-call status. Automate runbooks for common incidents."

SRE Dashboard Panels

SLO Compliance (30d window)
├── Availability: 99.92% (SLO: 99.9%) ✅
├── Latency p99: 320ms (SLO: 500ms) ✅
└── Error Budget Remaining: 20% ✅

Burn Rate (1h window)
├── Availability: 0.02% budget burned
└── Alert: green (normal)

Incident Summary
├── Last 24h: 0 SEV-1, 1 SEV-2, 3 SEV-3
└── MTTR (30d): 28 minutes

On-Call
├── Primary: Alice (until Mon 9am)
└── Secondary: Bob
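
The SLO, burn-rate, and MTTR figures in panels like these can be derived from raw request and incident counts. The following is a minimal sketch, assuming a request-based availability SLI and steady traffic across the window; the function names and the sample numbers are hypothetical and do not come from any particular monitoring tool.

```python
from datetime import timedelta


def error_budget_remaining(total, failed, slo):
    """Fraction of the error budget left in the window (negative if overspent)."""
    allowed_failures = total * (1 - slo)  # failures the SLO tolerates
    return 1 - failed / allowed_failures


def burn_rate(total, failed, slo):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget runs out exactly at the end of the SLO window.
    """
    return (failed / total) / (1 - slo)


def budget_burned(rate, lookback, slo_window):
    """Share of the whole window's budget consumed during `lookback`.

    Assumes traffic is spread evenly over the SLO window.
    """
    return rate * (lookback / slo_window)


def mttr_minutes(durations):
    """Mean time to recovery, in minutes, over a list of incident durations."""
    return sum(durations, timedelta()).total_seconds() / 60 / len(durations)


if __name__ == "__main__":
    # Hypothetical 30-day counts: 1,000,000 requests, 400 failures, 99.9% SLO
    print(f"budget remaining: {error_budget_remaining(1_000_000, 400, 0.999):.0%}")
    # Hypothetical last hour: 10,000 requests, 2 failures
    rate = burn_rate(10_000, 2, 0.999)
    burned = budget_burned(rate, timedelta(hours=1), timedelta(days=30))
    print(f"burn rate: {rate:.2f}, budget burned in 1h: {burned:.3%}")
    incidents = [timedelta(minutes=20), timedelta(minutes=36)]
    print(f"MTTR: {mttr_minutes(incidents):.0f} min")
```

Keeping the formulas as small pure functions makes it easy to check a dashboard value by hand before trusting an alert built on it.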

Automated Runbook

# Automated CPU spike runbook
import subprocess

import requests

NAMESPACE = "production"
SLACK_WEBHOOK = "https://hooks.slack.com/services/..."


def kubectl(*args):
    """Run kubectl in the target namespace and return its stdout."""
    cmd = ["kubectl", *args, "-n", NAMESPACE]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def cpu_millicores(line):
    """Parse the CPU column of a `kubectl top pods` row ("250m" or "2")."""
    cpu = line.split()[1]
    return float(cpu[:-1]) if cpu.endswith("m") else float(cpu) * 1000


def cpu_runbook():
    # 1. Identify the pod using the most CPU
    lines = kubectl("top", "pods", "--no-headers").strip().splitlines()
    if not lines:
        print("No pod metrics returned; nothing to do")
        return
    pod_name = max(lines, key=cpu_millicores).split()[0]

    # 2. Get recent logs
    logs = kubectl("logs", "--tail=100", pod_name)

    # 3. Match known patterns; only the default case is remediated automatically
    if "OOM" in logs:
        action = "Increase memory limit"
    elif "connection refused" in logs:
        action = "Restart dependent service"
    else:
        action = "Scale replicas + investigate"
        # 4. Execute fix. Deployment pods are named
        # <deployment>-<replicaset-hash>-<pod-id>, so drop the last two segments.
        deployment = pod_name.rsplit("-", 2)[0]
        kubectl("scale", "deploy", deployment, "--replicas=10")

    # 5. Post to Slack
    slack_msg = {"text": f"🚨 CPU Runbook: {pod_name}\nAction: {action}"}
    requests.post(SLACK_WEBHOOK, json=slack_msg, timeout=10)

    print(f"Runbook finished: {action}")

Runbook Automation Levels

| Level | Description | Example |
|-------|-------------|---------|
| L1 | Manual (read docs) | Wiki page |
| L2 | Semi-automated (click button) | Jenkins job |
| L3 | Full auto (no human) | Auto-scaling |
| L4 | Predictive (prevent) | Load forecasting |
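
One way to read these levels is as a gate on who is allowed to trigger a remediation. The sketch below is an illustration only: the `Level` enum, the `handle` function, and the returned messages are hypothetical names introduced for this example.

```python
from enum import Enum


class Level(Enum):
    MANUAL = 1      # L1: a human reads the docs and acts
    SEMI_AUTO = 2   # L2: a human triggers a prepared job
    FULL_AUTO = 3   # L3: no human involved
    PREDICTIVE = 4  # L4: runs before the incident happens


def handle(level, remediate, approved=False):
    """Decide whether `remediate` may run for a runbook at the given level."""
    if level is Level.MANUAL:
        return "page on-call with the wiki link"
    if level is Level.SEMI_AUTO and not approved:
        return "waiting for on-call approval"
    # L2 with approval, L3, and L4 all execute the remediation directly
    remediate()
    return "remediation executed"


if __name__ == "__main__":
    print(handle(Level.SEMI_AUTO, lambda: print("scaling up")))
    print(handle(Level.FULL_AUTO, lambda: print("scaling up")))
```

Moving a runbook from L2 to L3 then amounts to removing the approval check once the remediation has proven safe.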

Common Runbooks

| Incident | Runbook |
|----------|---------|
| CPU spike | Scale up, check for leak |
| Memory leak | Restart, increase limit, fix code |
| DB slow | Check slow queries, add index |
| Certificate expiry | Auto-renew (cert-manager) |
| Disk full | Clean logs, increase PV |
| Pod crash loop | Check logs, rollback version |
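
A table like this can also live in code as a lookup that an alert handler consults. The following is a minimal sketch; the dictionary keys and the `steps_for` helper are hypothetical, and the fallback message is an assumption about how an unknown alert might be handled.

```python
# Incident type -> ordered runbook steps, mirroring the table above
RUNBOOKS = {
    "cpu_spike": ["Scale up", "Check for leak"],
    "memory_leak": ["Restart", "Increase limit", "Fix code"],
    "db_slow": ["Check slow queries", "Add index"],
    "certificate_expiry": ["Auto-renew (cert-manager)"],
    "disk_full": ["Clean logs", "Increase PV"],
    "pod_crash_loop": ["Check logs", "Roll back version"],
}


def steps_for(incident_type):
    """Return the runbook steps for an alert, or an escalation fallback."""
    return RUNBOOKS.get(incident_type, ["Escalate to on-call: no runbook found"])


if __name__ == "__main__":
    for number, step in enumerate(steps_for("disk_full"), start=1):
        print(f"{number}. {step}")
```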

SRE Course Complete! 🎉

  • ✅ SLO & SLI
  • ✅ Incident Response
  • ✅ Capacity Planning
  • ✅ Chaos Engineering
  • ✅ Dashboard & Runbooks

DevOps Track Complete! 🎉

  • ✅ Docker Compose
  • ✅ Kubernetes & Helm
  • ✅ Cloud AWS
  • ✅ Serverless
  • ✅ Monitoring
  • ✅ GitOps
  • ✅ SRE

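Runbooks: When the Alert Fires, Does Your Team Know What to Do?

Picture this: it is 3 a.m. and your phone rings. PagerDuty has raised a P0 alert, and the website is completely unreachable.

If your team has no runbook:

  • The on-call engineer is jolted awake
  • They open the laptop and freeze, unsure where to start looking
  • Finding the problem takes 30 minutes, and by then the outage has cost hundreds of thousands

If your team has a runbook:

  • The on-call engineer follows the runbook step by step
  • Initial diagnosis is done within 5 minutes
  • Mitigation is applied within 10 minutes
  • Service recovers, and everyone goes back to sleep

What Does a Good Runbook Look Like?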

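# Runbook: Website 503 - API Service Unreachable

## Impact
- All API requests fail
- Front-end pages load normally, but users cannot log in
- All users are affected

## Severity
P0 - Critical

## Diagnostic Steps

### Step 1: Check the ECS service status
```bash
aws ecs describe-services --cluster production --services api-service
```
- ✅ Healthy → continue
- ❌ Service down → go to Step 4

### Step 2: Check the RDS connection count
```bash
aws rds describe-db-instances --db-instance-identifier mydb
```
This command reports the instance status only. The connection count itself is the `DatabaseConnections` CloudWatch metric for the instance.
- ✅ Normal → continue
- ❌ Connections exhausted → go to Step 5

### Step 3: Check recent deployments
```bash
git log --oneline -10
```
If there was a recent deployment, consider rolling it back:
```bash
git revert HEAD
```
The revert only creates a new commit. The rollback takes effect once that commit is deployed.

### Step 4: Restart the service
```bash
aws ecs update-service --cluster production --service api-service --force-new-deployment
```

### Step 5: Raise the RDS connection limit
`max_connections` is set in the DB parameter group, not through a flag on `modify-db-instance`. `mydb-params` is a placeholder for the custom parameter group attached to `mydb`.
```bash
aws rds modify-db-parameter-group --db-parameter-group-name mydb-params \
  --parameters "ParameterName=max_connections,ParameterValue=500,ApplyMethod=immediate"
```
If the engine treats the parameter as static (PostgreSQL does), use `ApplyMethod=pending-reboot` and reboot the instance.

## Recovery Confirmation
- [ ] API responds normally (200 OK)
- [ ] Front-end login works
- [ ] Error rate < 0.1%
- [ ] PagerDuty alert resolved

## Postmortem
- Root cause:
- Preventive measures:
- Monitoring and alerting improvements: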
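Course Summary

You have completed all five SRE lessons, covering SLI/SLO definition, incident response, capacity planning, chaos engineering, and finally the SRE dashboard and runbooks.

The core of SRE is not tooling but culture: applying engineering methods to operations, using automation to reduce manual work, and letting data drive decisions.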
