Incident Response

🔥 Vibe Prompt

"Design an incident response process: detection, response, mitigation, postmortem. Set up on-call rotation."

Incident Lifecycle

Detect (monitor) → Page (PagerDuty) → Triage (assign severity) → Mitigate (fix) → Resolve (done) → Postmortem (learn)

Severity Levels

| Severity | Description | Response Time | Example |
|----------|-------------|---------------|---------|
| SEV-1 | Complete outage | 5 min | No users can log in |
| SEV-2 | Partial outage | 15 min | Slow for 10% of users |
| SEV-3 | Degraded | 1 hour | Feature X is broken |
| SEV-4 | Minor | Next business day | Cosmetic bug |
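The response targets in this table can be encoded so that scripts look them up instead of hard-coding them. The following is a minimal sketch: `response_target` is a hypothetical helper name, and the plain-text output format is an assumption made for illustration.

```shell
# Hypothetical helper: map a severity label to the response-time target
# listed in the severity table. Unknown labels produce an error.
response_target() {
  case "$1" in
    SEV-1) echo "5 min" ;;
    SEV-2) echo "15 min" ;;
    SEV-3) echo "1 hour" ;;
    SEV-4) echo "next business day" ;;
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_target "SEV-2"   # prints "15 min"
```

Keeping the mapping in one function means a change to the policy only has to be made in one place.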

On-Call Schedule

# PagerDuty rotation (illustrative outline)
schedule:
  - name: primary
    rotation: 7 days (Mon 9am → next Mon 9am)
    handoff: Slack handoff doc
  
  - name: secondary (escalation)
    rotation: 7 days (offset)
    shadow: primary
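A weekly rotation like the one above reduces to integer arithmetic on timestamps. The sketch below assumes a hypothetical three-person roster, an anchor of Monday 2024-01-01 09:00 UTC (the schedule does not state a time zone), and a secondary who follows the same order one week ahead. None of the names come from PagerDuty.

```shell
# Hypothetical roster and anchor; adjust both to your own team.
ROSTER="alice bob carol"
ANCHOR_EPOCH=1704099600            # Mon 2024-01-01 09:00:00 UTC, start of week 0
WEEK_SECONDS=$((7 * 24 * 3600))

# Usage: on_call_at EPOCH_SECONDS [WEEK_OFFSET]
# Prints the roster member on duty. Assumes EPOCH_SECONDS >= ANCHOR_EPOCH.
on_call_at() {
  local now="$1" offset="${2:-0}"
  local week=$(( (now - ANCHOR_EPOCH) / WEEK_SECONDS ))
  set -- $ROSTER                   # split the roster into positional parameters
  shift $(( (week + offset) % $# ))
  echo "$1"
}

now=$(date +%s)
echo "primary:   $(on_call_at "$now")"
echo "secondary: $(on_call_at "$now" 1)"   # same rotation, offset by one week
```

Because the handoff time is baked into the anchor, the person on duty changes exactly at Monday 09:00 UTC each week.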

Postmortem Template

# Postmortem: [Title]

## Summary
- Date/Time: YYYY-MM-DD HH:MM UTC
- Duration: X hours Y minutes
- Severity: SEV-X
- Impact: X users affected, $Y revenue loss

## Timeline
- HH:MM UTC - Alert fired (p99 > 1s)
- HH:MM UTC - Engineer paged
- HH:MM UTC - Root cause identified
- HH:MM UTC - Mitigation applied
- HH:MM UTC - Service restored

## Root Cause
- [Brief description of what went wrong]

## Action Items
- [ ] Fix X (owner, due date)
- [ ] Add alert for Y
- [ ] Update runbook for Z

## Blameless Culture
- What went well?
- What went wrong?
- What can we improve?

Blameless Postmortem Principles

| Principle | Why |
|-----------|-----|
| No blame | Focus on system, not people |
| Full timeline | Complete picture of events |
| Root cause | Why did each step happen? |
| Action items | Concrete, tracked follow-ups |
| Share widely | Everyone learns |

Best Practices

- Automated alerts for known patterns
- Clear escalation paths
- Postmortem within 48 hours
- Track action items in Jira/Linear
- Regular incident drills (game days)
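The 48-hour postmortem target is easy to check mechanically. This is a small sketch, assuming incidents record their resolution time as epoch seconds; `postmortem_overdue` and the example timestamp are hypothetical.

```shell
POSTMORTEM_WINDOW_SECONDS=$((48 * 3600))

# Usage: postmortem_overdue RESOLVED_EPOCH NOW_EPOCH
# Exits 0 when more than 48 hours have passed since resolution, 1 otherwise.
postmortem_overdue() {
  local resolved="$1" now="$2"
  [ $(( now - resolved )) -gt "$POSTMORTEM_WINDOW_SECONDS" ]
}

# Example with a made-up resolution time
if postmortem_overdue 1700000000 "$(date +%s)"; then
  echo "Postmortem overdue"
fi
```

A check like this could run from a scheduled job and post a reminder to the incident channel.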

Chapter Summary

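After this chapter you should:

- Understand the core concepts and principles of incident response
- Know how to implement severity levels, on-call rotations, and postmortems
- Be familiar with common issues and their solutions
- Be able to apply the process in real projects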

Incident Response Lifecycle

SRE teams follow a structured incident response process to minimize downtime.

The Five Stages

| Stage | What Happens | Goal |
|-------|-------------|------|
| Detection | Monitoring alerts on anomaly | Identify the problem ASAP |
| Triage | Assess severity and impact | Prioritize response |
| Mitigation | Stop the bleeding | Restore service |
| Resolution | Fix root cause | Prevent recurrence |
| Follow-up | Postmortem, action items | Learn and improve |

Severity Levels

| Level | Name | Response Time | Example |
|-------|------|---------------|---------|
| SEV-1 | Critical | < 15 min | Site down, data loss |
| SEV-2 | High | < 1 hour | Feature broken, degraded |
| SEV-3 | Medium | < 4 hours | Minor bug, cosmetic |
| SEV-4 | Low | Next sprint | Enhancement, trivial |

Incident Response Runbook

#!/bin/bash
# incident-response.sh — SRE Incident Response Script

SEVERITY="${1:-SEV-3}"
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Use UTC here as well so the ID matches TIMESTAMP
INCIDENT_ID="INC-$(date -u +%Y%m%d-%H%M%S)"

echo "=== Incident $INCIDENT_ID ($SEVERITY) ==="
echo "Started: $TIMESTAMP"

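# Step 1: Notify the team (requires SLACK_WEBHOOK_URL in the environment)
function notify_team() {
  if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
    echo "SLACK_WEBHOOK_URL is not set; skipping Slack notification" >&2
    return 1
  fi
  local message="{\"text\": \"🚨 *$INCIDENT_ID* [$SEVERITY] - Incident detected at $TIMESTAMP\"}"
  # -sS hides the progress meter but still prints errors
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "$message" \
    "$SLACK_WEBHOOK_URL"
}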

# Step 2: Check basic health
function check_basic_health() {
  echo "[1/4] Checking server health..."
  
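  # CPU (macOS `top`; this field is user CPU only, e.g. "5.26")
  CPU_USAGE=$(top -l 1 | grep "CPU usage" | awk '{print $3}' | sed 's/%//')
  echo "CPU: ${CPU_USAGE}%"

  # Memory (macOS `vm_stat`; counts active pages only)
  MEM_USAGE=$(vm_stat | awk '/Pages active/ {print $3}' | sed 's/\.//')
  MEM_TOTAL=$(sysctl -n hw.memsize)
  # Query the page size instead of assuming 4096; it differs between machines
  PAGE_SIZE=$(sysctl -n hw.pagesize)
  MEM_PCT=$((MEM_USAGE * PAGE_SIZE * 100 / MEM_TOTAL))
  echo "Memory: ${MEM_PCT}%"

  # Disk
  DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
  echo "Disk: ${DISK_USAGE}%"

  # Drop the fractional part of CPU_USAGE: [ -gt ] only compares integers
  if [ "${CPU_USAGE%.*}" -gt 90 ] || [ "$DISK_USAGE" -gt 90 ]; then
    echo "⚠️ Resource threshold exceeded!"
    return 1
  fi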
  return 0
}

# Step 3: Check application health
function check_app_health() {
  echo "[2/4] Checking application health..."
  
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 https://app.example.com/health)
  
  if [ "$HTTP_CODE" != "200" ]; then
    echo "❌ Health check returned HTTP $HTTP_CODE"
    return 1
  fi
  echo "✅ Application healthy"
  return 0
}

# Step 4: Check recent deployments
function check_deployments() {
  echo "[3/4] Checking recent deployments..."
  # Check for deployments in last hour
  # git log --since="1 hour ago" --oneline
  echo "Done"
}

# Step 5: Gather logs
function gather_logs() {
  echo "[4/4] Gathering recent error logs..."
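  # journalctl needs a systemd-based Linux host (the resource checks above use macOS tools)
  # grep -E enables the | alternation
  journalctl -u myapp --since "1 hour ago" --no-pager | grep -iE "error|exception|timeout|fail" | tail -20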
}

notify_team
check_basic_health
check_app_health
check_deployments
gather_logs

echo "=== Incident response initiated ==="
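The script above calls each check but ignores its return code, so a failed check never changes the final message. One way to surface failures is a small wrapper that records which checks failed. This is a sketch only: `run_check`, `FAILED_CHECKS`, and the two stand-in checks are hypothetical and not part of the script above.

```shell
FAILED_CHECKS=""

# Usage: run_check FUNCTION_NAME
# Runs the named check and remembers its name if it returns non-zero.
run_check() {
  local name="$1"
  if ! "$name"; then
    FAILED_CHECKS="$FAILED_CHECKS $name"
  fi
}

# Stand-in checks so the sketch runs on its own. In the runbook these would
# be check_basic_health, check_app_health, and so on.
check_ok()   { return 0; }
check_fail() { return 1; }

run_check check_ok
run_check check_fail

if [ -n "$FAILED_CHECKS" ]; then
  echo "Failed checks:$FAILED_CHECKS"   # prints "Failed checks: check_fail"
fi
```

The list of failed checks could also be appended to the Slack notification so responders see it immediately.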

Postmortem Example

# Postmortem: INC-20261215-143022

## Summary
- **Duration**: 35 minutes (14:30 - 15:05 UTC)
- **Impact**: 15% of users experienced 5xx errors
- **Root Cause**: Database connection pool exhausted

## Timeline
| Time | Event |
|------|-------|
| 14:30 | PagerDuty alert: 5xx rate > 5% |
| 14:32 | Engineer on-call acknowledged |
| 14:35 | Identified database connection spike |
| 14:40 | Increased max_connections from 100 to 200 |
| 14:45 | Error rate stabilized |
| 15:05 | Full recovery confirmed |

## Root Cause
A new deployment included a connection leak — connections were not returned to the pool.

## Action Items
- [ ] Add connection pool monitoring alert
- [ ] Fix connection leak in api/users handler
- [ ] Add max_connections auto-scaling
- [ ] Update deployment checklist with DB review
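The duration in the summary can be derived from the first and last timeline entries rather than computed by hand. A minimal sketch, assuming both times are `HH:MM` on the same UTC day; `to_minutes` and `minutes_between` are hypothetical helper names.

```shell
# Convert HH:MM to minutes since midnight.
to_minutes() {
  local hh="${1%%:*}" mm="${1##*:}"
  # Strip one leading zero so "08" is not read as an invalid octal number.
  hh="${hh#0}"
  mm="${mm#0}"
  echo $(( ${hh:-0} * 60 + ${mm:-0} ))
}

# Usage: minutes_between START END   (same day, END not earlier than START)
minutes_between() {
  echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}

minutes_between 14:30 15:05   # alert fired to full recovery: prints 35
```

An incident that crosses midnight would need full dates or epoch timestamps instead.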

Summary

Incident response follows a structured lifecycle: detect, triage, mitigate, resolve, follow up. Automation scripts and postmortems support continuous improvement.

Key takeaways:

- Five stages: detection → triage → mitigation → resolution → follow-up
- Severity levels: SEV-1 (critical, <15min) to SEV-4 (low)
- Incident runbook: notify → check health → check deployments → gather logs
- Postmortem: timeline, root cause, action items, no blame
- Automation saves time during high-stress incidents
- Check recent deployments early in the investigation
- Gather logs immediately, before they rotate
- Postmortems help prevent recurrence

What's Next: SRE Dashboard

The next chapter covers building real-time SRE dashboards.
