SRE Dashboard & Runbooks
🔥 Vibe Prompt
"Create an SRE dashboard: SLO compliance, burn rate, on-call status. Automate runbooks for common incidents."
SRE Dashboard Panels
SLO Compliance (30d window)
├── Availability: 99.967% (SLO: 99.9%) ✅
├── Latency p99: 320ms (SLO: 500ms) ✅
└── Error Budget Remaining: 67% ✅
Burn Rate (1h window)
├── Availability: 0.02% budget burned
└── Alert: green (normal)
Incident Summary
├── Last 24h: 0 SEV-1, 1 SEV-2, 3 SEV-3
└── MTTR (30d): 28 minutes
On-Call
├── Primary: Alice (until Mon 9am)
└── Secondary: Bob
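The SLO compliance and burn-rate panels come down to two small formulas. The sketch below is one way to compute them, assuming availability is measured as the fraction of successful requests; the function names and the sample figures are illustrative and do not come from any particular dashboard tool.

```python
def error_budget_remaining(availability: float, slo: float) -> float:
    """Fraction of the error budget still unspent over the SLO window."""
    allowed_failure = 1.0 - slo          # e.g. 0.001 for a 99.9% SLO
    actual_failure = 1.0 - availability  # observed failure fraction
    return 1.0 - actual_failure / allowed_failure


def burn_rate(error_rate: float, slo: float) -> float:
    """Speed of budget consumption; 1.0 means spending exactly on budget."""
    return error_rate / (1.0 - slo)


if __name__ == "__main__":
    # Illustrative input: 99.95% availability against a 99.9% SLO.
    print(f"budget remaining: {error_budget_remaining(0.9995, 0.999):.0%}")
    # Illustrative input: 0.5% of requests failing in the last hour.
    print(f"burn rate: {burn_rate(0.005, 0.999):.1f}x")
```

A burn rate above 1.0 sustained over the whole window would exhaust the budget before the window ends, which is why short-window burn rate is a common alerting signal.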
Automated Runbook
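```python
# Automated CPU spike runbook
import subprocess

import requests

NAMESPACE = "production"


def cpu_runbook():
    # 1. Identify the culprit: list per-pod CPU usage
    top_output = subprocess.run(
        ["kubectl", "top", "pods", "-n", NAMESPACE],
        capture_output=True, text=True, check=True,
    )
    # Skip the header row, then pick the pod with the highest CPU (millicores, e.g. "250m")
    lines = top_output.stdout.strip().split("\n")[1:]
    if not lines:
        print("No pods found; nothing to do")
        return
    culprit = max(lines, key=lambda l: float(l.split()[1].replace("m", "")))
    pod_name = culprit.split()[0]

    # 2. Get the last 100 log lines from that pod
    logs = subprocess.run(
        ["kubectl", "logs", "--tail=100", pod_name, "-n", NAMESPACE],
        capture_output=True, text=True,
    )

    # 3. Check for a known pattern to decide the recommended follow-up action
    if "OOM" in logs.stdout:
        action = "Increase memory limit"
    elif "connection refused" in logs.stdout:
        action = "Restart dependent service"
    else:
        action = "Scale replicas + investigate"

    # 4. Execute the immediate mitigation: scale out the owning Deployment.
    #    Deployment pods are named <deployment>-<replicaset-hash>-<pod-suffix>,
    #    so strip the last two segments (assumes the pod belongs to a Deployment).
    deployment = pod_name.rsplit("-", 2)[0]
    subprocess.run(
        ["kubectl", "scale", "deploy", deployment, "--replicas=10", "-n", NAMESPACE],
        check=True,
    )

    # 5. Post to Slack
    slack_msg = {"text": f"🚨 CPU Runbook: {pod_name}\nScaled {deployment} to 10 replicas\nAction: {action}"}
    requests.post("https://hooks.slack.com/services/...", json=slack_msg, timeout=10)
    print(f"Runbook executed: {action}")
```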
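The parsing in steps 1 and 4 is the part of this runbook most likely to break, and it can be checked without a cluster. The sketch below is a hypothetical refactoring into pure functions: the sample text only imitates the shape of `kubectl top pods` output, and the helper names are made up for illustration. It assumes pods created by a Deployment, which are named `<deployment>-<replicaset-hash>-<pod-suffix>`.

```python
# Hypothetical sample, shaped like `kubectl top pods` output.
SAMPLE_TOP_OUTPUT = """\
NAME                      CPU(cores)   MEMORY(bytes)
api-7d9f8b6c5d-x2k4p      950m         512Mi
worker-5c6d7e8f9a-abcde   120m         256Mi
"""


def parse_cpu_millicores(value: str) -> float:
    """Convert a CPU column value such as '950m' to millicores.

    A value without the 'm' suffix is treated as whole cores (defensive assumption).
    """
    if value.endswith("m"):
        return float(value[:-1])
    return float(value) * 1000


def busiest_pod(top_output: str) -> str:
    """Return the name of the pod with the highest CPU usage."""
    rows = [line.split() for line in top_output.strip().splitlines()[1:]]
    name, _cpu = max(
        ((row[0], parse_cpu_millicores(row[1])) for row in rows),
        key=lambda pair: pair[1],
    )
    return name


def deployment_name(pod_name: str) -> str:
    """Strip the ReplicaSet hash and pod suffix from a Deployment pod name."""
    return pod_name.rsplit("-", 2)[0]


if __name__ == "__main__":
    pod = busiest_pod(SAMPLE_TOP_OUTPUT)
    print(pod, "->", deployment_name(pod))
```

Keeping the parsing separate from the `subprocess` calls means the risky string handling can be unit-tested before the runbook is ever allowed to touch production.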
Runbook Automation Levels
| Level | Description | Example |
|-------|-------------|---------|
| L1 | Manual (read docs) | Wiki page |
| L2 | Semi-automated (click button) | Jenkins job |
| L3 | Full auto (no human) | Auto-scaling |
| L4 | Predictive (prevent) | Load forecasting |
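One way to make these levels concrete is to treat them as a policy that decides whether a remediation may run without a human. The sketch below is an illustration only: `AutomationLevel`, `run_remediation`, and the returned messages are hypothetical names, not part of any tool mentioned in this lesson.

```python
from enum import IntEnum
from typing import Callable


class AutomationLevel(IntEnum):
    MANUAL = 1      # L1: a human reads the docs and acts
    SEMI_AUTO = 2   # L2: a human clicks a button to trigger the job
    FULL_AUTO = 3   # L3: no human involved
    PREDICTIVE = 4  # L4: acts before the incident happens


def run_remediation(
    level: AutomationLevel,
    action: Callable[[], str],
    approved: bool = False,
) -> str:
    """Run `action` only when the automation level permits it."""
    if level == AutomationLevel.MANUAL:
        return "skipped: follow the wiki runbook by hand"
    if level == AutomationLevel.SEMI_AUTO and not approved:
        return "waiting: human approval required"
    return action()


if __name__ == "__main__":
    scale_up = lambda: "scaled replicas"
    print(run_remediation(AutomationLevel.SEMI_AUTO, scale_up))
    print(run_remediation(AutomationLevel.FULL_AUTO, scale_up))
```

Gating on an explicit level makes it easy to promote a runbook from L2 to L3 by changing one value once the team trusts it.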
Common Runbooks
| Incident | Runbook |
|----------|---------|
| CPU spike | Scale up, check for leak |
| Memory leak | Restart, increase limit, fix code |
| DB slow | Check slow queries, add index |
| Certificate expiry | Auto-renew (cert-manager) |
| Disk full | Clean logs, increase PV |
| Pod crash loop | Check logs, rollback version |
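A table like this can also live in code, so that an alert handler can look up the steps for an incident type. The sketch below mirrors the table as a plain dictionary; the incident keys, the `runbook_for` helper, and the escalation fallback are assumptions made for illustration.

```python
# Incident keys are hypothetical labels; the steps mirror the table above.
RUNBOOKS = {
    "cpu_spike": ["Scale up", "Check for leak"],
    "memory_leak": ["Restart", "Increase limit", "Fix code"],
    "db_slow": ["Check slow queries", "Add index"],
    "certificate_expiry": ["Auto-renew (cert-manager)"],
    "disk_full": ["Clean logs", "Increase PV"],
    "pod_crash_loop": ["Check logs", "Rollback version"],
}


def runbook_for(incident: str) -> list[str]:
    """Return the runbook steps for an incident, or an escalation fallback."""
    return RUNBOOKS.get(incident, ["Escalate to on-call: no runbook found"])


if __name__ == "__main__":
    for step_number, step in enumerate(runbook_for("disk_full"), start=1):
        print(f"{step_number}. {step}")
```

The explicit fallback matters: an incident with no runbook should page a human rather than fail silently.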
SRE Course Complete! 🎉
- ✅ SLO & SLI
- ✅ Incident Response
- ✅ Capacity Planning
- ✅ Chaos Engineering
- ✅ Dashboard & Runbooks
DevOps Track Complete! 🎉
- ✅ Docker Compose
- ✅ Kubernetes & Helm
- ✅ Cloud AWS
- ✅ Serverless
- ✅ Monitoring
- ✅ GitOps
- ✅ SRE
Key Points
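- Put SLO compliance, burn rate, incident summary, and on-call status on a single dashboard
- Write runbooks for common incidents with explicit diagnosis, mitigation, and recovery checks
- Move runbooks up the automation levels, from wiki pages (L1) toward full automation (L3) and prediction (L4)
- Close every incident with a postmortem: root cause, prevention, and alerting improvements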
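Runbook: When the alert fires, does your team know what to do?
Picture this: it is 3 a.m. and your phone goes off. PagerDuty has raised a P0 alert, and the website is completely unreachable.
If your team has no Runbook:
- The on-call engineer is jolted awake
- They open a laptop and stare at it, unsure where to start looking
- It takes 30 minutes to find the problem, and by then the outage has already cost several hundred thousand in losses
If your team has a Runbook:
- The on-call engineer follows the Runbook step by step
- Initial diagnosis is done within 5 minutes
- Mitigation is executed within 10 minutes
- Service is restored, and everyone goes back to sleep
What does a good Runbook look like?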
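# Runbook: Website 503 - API service unreachable
## Impact
- All API requests fail
- Front-end pages load normally, but users cannot log in
- All users are affected
## Severity
P0 - Critical
## Diagnosis steps
### Step 1: Check ECS service status
```bash
aws ecs describe-services --cluster production --services api-service
```
✅ Healthy → continue. ❌ Service crashed → go to Step 4.
### Step 2: Check RDS connection count
```bash
# Instance status
aws rds describe-db-instances --db-instance-identifier mydb
# The connection count is a CloudWatch metric (last 10 minutes; GNU date syntax)
aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=mydb \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 --statistics Maximum
```
✅ Normal → continue. ❌ Connections exhausted → go to Step 5.
### Step 3: Check recent deployments
```bash
git log --oneline -10
```
If there was a recent deployment, consider rolling it back. `git revert` only creates the revert commit, so push it and redeploy through your usual pipeline:
```bash
git revert HEAD
```
### Step 4: Restart the service
```bash
aws ecs update-service --cluster production --service api-service --force-new-deployment
```
### Step 5: Raise the RDS connection limit
```bash
# max_connections is a DB parameter, not a modify-db-instance flag, so it is set on the
# instance's custom parameter group (mydb-params is a placeholder name).
# Use ApplyMethod=pending-reboot on engines where the parameter is static, such as PostgreSQL.
aws rds modify-db-parameter-group --db-parameter-group-name mydb-params \
  --parameters "ParameterName=max_connections,ParameterValue=500,ApplyMethod=immediate"
```
## Recovery confirmation
- [ ] API responds normally (200 OK)
- [ ] Front end can log in
- [ ] Error rate < 0.1%
- [ ] PagerDuty alert closed
## Postmortem
- Root cause:
- Preventive measures:
- Monitoring and alerting improvements: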
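The branching in the diagnosis steps and the recovery checklist can be written down as two small functions, which is a useful first move toward automating this Runbook. This is a minimal sketch under stated assumptions: the function names, the boolean inputs, and the final escalation fallback are hypothetical, and the health signals would have to be gathered separately (for example from the AWS CLI output in each step).

```python
def next_step(ecs_healthy: bool, rds_connections_ok: bool, recent_deploy: bool) -> str:
    """Pick the next action, following the order of the diagnosis steps."""
    if not ecs_healthy:
        return "Step 4: restart the service"
    if not rds_connections_ok:
        return "Step 5: raise the RDS connection limit"
    if recent_deploy:
        return "Step 3: consider rolling back the latest deployment"
    # Fallback not in the Runbook itself: hand over to a human.
    return "Escalate: no known cause matched"


def recovered(status_code: int, can_log_in: bool, error_rate: float, alert_open: bool) -> bool:
    """Recovery checklist: 200 OK, login works, error rate < 0.1%, alert closed."""
    return status_code == 200 and can_log_in and error_rate < 0.001 and not alert_open


if __name__ == "__main__":
    print(next_step(ecs_healthy=True, rds_connections_ok=False, recent_deploy=False))
    print(recovered(status_code=200, can_log_in=True, error_rate=0.0004, alert_open=False))
```

Encoding the checklist as a single predicate makes "are we actually recovered?" a question with one answer instead of four things to remember at 3 a.m.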
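### Course summary
You have completed all five SRE lessons, covering SLI/SLO definition, incident response, capacity planning, chaos engineering, and the SRE dashboard and Runbooks.
The core of SRE is not tooling but **culture**: use engineering methods to manage operations, use automation to reduce manual work, and use data to drive decisions.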