Incident Response
🔥 Vibe Prompt
"Design an incident response process: detection, response, mitigation, postmortem. Set up on-call rotation."
Incident Lifecycle
Detect → Page → Triage → Mitigate → Resolve → Postmortem
(monitor) (PagerDuty) (assign severity) (fix) (done) (learn)
Severity Levels
| Severity | Description | Response Time | Example |
|----------|-------------|---------------|---------|
| SEV-1 | Complete outage | 5 min | No users can log in |
| SEV-2 | Partial outage | 15 min | Slow for 10% of users |
| SEV-3 | Degraded | 1 hour | Feature X is broken |
| SEV-4 | Minor | Next business day | Cosmetic bug |
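The response targets in the table can be encoded as a small lookup so that scripts and paging rules agree on them. The following is a minimal sketch, not a definitive implementation: the function names `response_target_minutes` and `ack_within_target` are hypothetical, and SEV-4 is treated as having no minute-based target because the table gives "next business day".

```shell
#!/bin/bash
# Hypothetical helpers; the values mirror the severity table above.

# Print the response target in minutes, or "none" when the target
# is not expressed in minutes (SEV-4: next business day).
response_target_minutes() {
    case "$1" in
        SEV-1) echo 5 ;;
        SEV-2) echo 15 ;;
        SEV-3) echo 60 ;;
        SEV-4) echo none ;;
        *) echo "unknown severity: $1" >&2; return 2 ;;
    esac
}

# Return 0 if an acknowledgement after $2 minutes met the target for $1.
ack_within_target() {
    local target
    target=$(response_target_minutes "$1") || return 2
    if [ "$target" = "none" ]; then
        return 0
    fi
    [ "$2" -le "$target" ]
}
```

For example, `ack_within_target SEV-1 4` succeeds, while `ack_within_target SEV-1 6` fails because a SEV-1 page should be acknowledged within 5 minutes.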
On-Call Schedule
# PagerDuty rotation
schedule:
  - name: primary
    rotation: 7 days (Mon 9am → next Mon 9am)
    handoff: Slack handoff doc
  - name: secondary (escalation)
    rotation: 7 days (offset)
    shadow: primary
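To make the weekly rotation concrete, the sketch below works out who holds a given slot at a given moment. It rests on several assumptions that are not in the schedule itself: the roster names are placeholders, the function name `on_call` is hypothetical, the secondary is assumed to be offset by one roster position (so this week's secondary is next week's primary), and weeks are treated as a fixed 604800 seconds, which ignores daylight-saving shifts in a local 9am handoff.

```shell
#!/bin/bash
WEEK_SECONDS=$((7 * 24 * 3600))

# Print the engineer holding a slot at a given moment.
# $1: epoch seconds of a known handoff at which the first roster entry
#     became primary
# $2: epoch seconds of the moment to look up (must be >= $1)
# $3: roster offset of the slot (0 = primary, 1 = secondary)
# remaining arguments: the roster, in rotation order
on_call() {
    local weeks=$(( ($2 - $1) / WEEK_SECONDS ))
    local offset=$3
    shift 3
    # Wrap around the roster, then shift so the chosen name is $1.
    local index=$(( (weeks + offset) % $# ))
    shift "$index"
    echo "$1"
}

# Placeholder roster: one week after the anchor handoff, bob is primary
# and carol is secondary.
on_call 0 604800 0 alice bob carol
on_call 0 604800 1 alice bob carol
```

In practice PagerDuty computes this for you; a helper like this is mainly useful for cross-checking a handoff document or generating a calendar.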
Postmortem Template
# Postmortem: [Title]
## Summary
- Date/Time: YYYY-MM-DD HH:MM UTC
- Duration: X hours Y minutes
- Severity: SEV-X
- Impact: X users affected, $Y revenue loss
## Timeline
- HH:MM UTC - Alert fired (p99 > 1s)
- HH:MM UTC - Engineer paged
- HH:MM UTC - Root cause identified
- HH:MM UTC - Mitigation applied
- HH:MM UTC - Service restored
## Root Cause
- [Brief description of what went wrong]
## Action Items
- [ ] Fix X (owner, due date)
- [ ] Add alert for Y
- [ ] Update runbook for Z
## Blameless Culture
- What went well?
- What went wrong?
- What can we improve?
Blameless Postmortem Principles
| Principle | Why |
|-----------|-----|
| No blame | Focus on system, not people |
| Full timeline | Complete picture of events |
| Root cause | Why did each step happen? |
| Action items | Concrete, tracked follow-ups |
| Share widely | Everyone learns |
Best Practices
- Automated alerts for known patterns
- Clear escalation paths
- Postmortem within 48 hours
- Track action items in Jira/Linear
- Regular incident drills (game days)
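A clear escalation path can be written down as a rule that tooling and people can both follow. The sketch below is one possible shape for such a rule, under stated assumptions: the 10-minute timeout and the function name `page_target` are illustrative and do not come from the schedule or the severity tables.

```shell
#!/bin/bash
# Assumed policy (illustrative only): page the secondary if the primary
# has not acknowledged within ESCALATE_AFTER_MIN minutes.
ESCALATE_AFTER_MIN=10

# Print who should be paged next.
# $1: minutes since the page was sent
# $2: "yes" if the page has been acknowledged, anything else otherwise
page_target() {
    if [ "$2" = "yes" ]; then
        echo none
    elif [ "$1" -ge "$ESCALATE_AFTER_MIN" ]; then
        echo secondary
    else
        echo primary
    fi
}
```

Keeping the timeout in one variable makes it easy to rehearse different values during a game day.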
Incident Response Lifecycle
SRE teams follow a structured incident response process to minimize downtime.
The Five Stages
| Stage | What Happens | Goal |
|-------|--------------|------|
| Detection | Monitoring alerts on anomaly | Identify the problem ASAP |
| Triage | Assess severity and impact | Prioritize response |
| Mitigation | Stop the bleeding | Restore service |
| Resolution | Fix root cause | Prevent recurrence |
| Follow-up | Postmortem, action items | Learn and improve |
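The five stages form a strict sequence, which an incident tracker can enforce so that no stage is skipped. The following is a minimal sketch of that idea; the function names `next_stage` and `can_transition` are hypothetical, and the lowercase stage identifiers are an assumed naming convention.

```shell
#!/bin/bash
# Print the stage that follows $1 in the five-stage lifecycle.
next_stage() {
    case "$1" in
        detection)  echo triage ;;
        triage)     echo mitigation ;;
        mitigation) echo resolution ;;
        resolution) echo follow-up ;;
        follow-up)  echo "follow-up is the final stage" >&2; return 1 ;;
        *)          echo "unknown stage: $1" >&2; return 2 ;;
    esac
}

# Return 0 only if moving from stage $1 to stage $2 advances exactly one step.
can_transition() {
    [ "$(next_stage "$1" 2>/dev/null)" = "$2" ]
}
```

With this in place, `can_transition triage mitigation` succeeds, while `can_transition detection resolution` fails because triage and mitigation would be skipped.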
Severity Levels
| Level | Name | Response Time | Example |
|-------|------|---------------|---------|
| SEV-1 | Critical | < 15 min | Site down, data loss |
| SEV-2 | High | < 1 hour | Feature broken, degraded |
| SEV-3 | Medium | < 4 hours | Minor bug, cosmetic |
| SEV-4 | Low | Next sprint | Enhancement, trivial |
Incident Response Runbook
#!/bin/bash
# incident-response.sh: SRE incident response script
# Assumes a Linux host with systemd and procps (vmstat, free, journalctl).
SEVERITY="${1:-SEV-3}"
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Use UTC here as well so the ID matches the start timestamp.
INCIDENT_ID="INC-$(date -u +%Y%m%d-%H%M%S)"
echo "=== Incident $INCIDENT_ID ($SEVERITY) ==="
echo "Started: $TIMESTAMP"

# Step 1: Notify the team
function notify_team() {
    if [ -z "$SLACK_WEBHOOK_URL" ]; then
        echo "SLACK_WEBHOOK_URL is not set; skipping Slack notification"
        return 0
    fi
    local message="{\"text\": \"🚨 *$INCIDENT_ID* [$SEVERITY] - Incident detected at $TIMESTAMP\"}"
    curl -s -X POST -H 'Content-Type: application/json' \
        -d "$message" \
        "$SLACK_WEBHOOK_URL"
}

# Step 2: Check basic health
function check_basic_health() {
    echo "[1/4] Checking server health..."
    # CPU: 100 minus the idle column (15th) of the second vmstat sample
    CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
    echo "CPU: ${CPU_USAGE}%"
    # Memory: used as a percentage of total, from free
    MEM_PCT=$(free | awk '/^Mem:/ {printf "%d", $3 * 100 / $2}')
    echo "Memory: ${MEM_PCT}%"
    # Disk: -P keeps each filesystem on one line for reliable parsing
    DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
    echo "Disk: ${DISK_USAGE}%"
    # Default to 0 if a tool is missing so the comparison stays numeric
    if [ "${CPU_USAGE:-0}" -gt 90 ] || [ "${DISK_USAGE:-0}" -gt 90 ]; then
        echo "⚠️ Resource threshold exceeded!"
        return 1
    fi
    return 0
}

# Step 3: Check application health
function check_app_health() {
    echo "[2/4] Checking application health..."
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 https://app.example.com/health)
    if [ "$HTTP_CODE" != "200" ]; then
        echo "❌ Health check returned HTTP $HTTP_CODE"
        return 1
    fi
    echo "✅ Application healthy"
    return 0
}

# Step 4: Check recent deployments
function check_deployments() {
    echo "[3/4] Checking recent deployments..."
    # Check for deployments in the last hour (run inside the service repo):
    # git log --since="1 hour ago" --oneline
    echo "Done"
}

# Step 5: Gather logs
function gather_logs() {
    echo "[4/4] Gathering recent error logs..."
    journalctl -u myapp --since "1 hour ago" --no-pager | grep -iE "error|exception|timeout|fail" | tail -20
}

notify_team
check_basic_health
check_app_health
check_deployments
gather_logs
echo "=== Incident response initiated ==="
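The runbook calls each check in turn without acting on its return code, so a failed check is easy to miss in the scrollback. One way to surface failures is a thin wrapper that records them and prints a summary at the end. The sketch below is illustrative only; `run_step`, `summarize_steps`, and `FAILED_STEPS` are hypothetical names and are not part of the script above.

```shell
#!/bin/bash
# Comma-separated names of the steps that failed so far.
FAILED_STEPS=""

# Run a named step, report its outcome, and remember failures.
# $1: human-readable step name; remaining arguments: the command to run
run_step() {
    local name="$1" status=0
    shift
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        echo "[ok]   $name"
    else
        echo "[fail] $name (exit $status)"
        FAILED_STEPS="${FAILED_STEPS:+$FAILED_STEPS, }$name"
    fi
}

# Print a one-line summary; return 1 if any step failed.
summarize_steps() {
    if [ -z "$FAILED_STEPS" ]; then
        echo "All checks passed"
    else
        echo "Failed checks: $FAILED_STEPS"
        return 1
    fi
}
```

In the runbook, the bare calls would become `run_step "basic health" check_basic_health`, `run_step "app health" check_app_health`, and so on, followed by `summarize_steps` before the final banner.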
Postmortem Example
# Postmortem: INC-20261215-143022
## Summary
- **Duration**: 35 minutes (14:30 - 15:05 UTC)
- **Impact**: 15% of users experienced 5xx errors
- **Root Cause**: Database connection pool exhausted
## Timeline
| Time | Event |
|------|-------|
| 14:30 | PagerDuty alert: 5xx rate > 5% |
| 14:32 | Engineer on-call acknowledged |
| 14:35 | Identified database connection spike |
| 14:40 | Increased max_connections from 100 to 200 |
| 14:45 | Error rate stabilized |
| 15:05 | Full recovery confirmed |
## Root Cause
A new deployment included a connection leak: connections were not returned to the pool.
## Action Items
- [ ] Add connection pool monitoring alert
- [ ] Fix connection leak in api/users handler
- [ ] Add max_connections auto-scaling
- [ ] Update deployment checklist with DB review
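Response metrics such as time to acknowledge and total duration can be derived directly from the timeline. The sketch below does the arithmetic for the times in this postmortem; the helper names `to_minutes` and `minutes_between` are hypothetical, and both times are assumed to fall on the same UTC day.

```shell
#!/bin/bash
# Convert HH:MM to minutes since midnight. One leading zero is stripped
# from each field so that "08" and "09" are not parsed as octal.
to_minutes() {
    local h=${1%%:*} m=${1##*:}
    echo $(( ${h#0} * 60 + ${m#0} ))
}

# Minutes from HH:MM ($1) to HH:MM ($2) on the same day.
minutes_between() {
    echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}

echo "Time to acknowledge:     $(minutes_between 14:30 14:32) min"   # 2
echo "Time to first mitigation: $(minutes_between 14:30 14:40) min"  # 10
echo "Total duration:          $(minutes_between 14:30 15:05) min"   # 35
```

The 35-minute total matches the duration stated in the summary, which is a quick consistency check worth running on any postmortem before it is shared.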
Summary
Incident response follows a structured lifecycle: detect, triage, mitigate, resolve, follow-up. Automation scripts and postmortems ensure continuous improvement.
Key takeaways:
- Five stages: detection → triage → mitigation → resolution → follow-up
- Severity levels: SEV-1 (critical, <15min) to SEV-4 (low)
- Incident runbook: notify → check health → check deployments → gather logs
- Postmortem: timeline, root cause, action items, no blame
- Automation saves time during high-stress incidents
- Always check recent deployments first
- Gather logs immediately before they rotate
- Postmortems prevent recurrence
What's Next: SRE Dashboard
The next chapter covers building real-time SRE dashboards.