Incident Response

🔥 Vibe Prompt

"Design an incident response process: detection, response, mitigation, postmortem. Set up on-call rotation."

Incident Lifecycle

Detect (monitor) → Page (PagerDuty) → Triage (assign severity) → Mitigate (fix) → Resolve (done) → Postmortem (learn)

Severity Levels

| Severity | Description | Response Time | Example |
|----------|-------------|---------------|---------|
| SEV-1 | Complete outage | 5 min | All users can't log in |
| SEV-2 | Partial outage | 15 min | Slow for 10% of users |
| SEV-3 | Degraded | 1 hour | Feature X is broken |
| SEV-4 | Minor | Next business day | Cosmetic bug |
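
The table above can be encoded so that paging or reporting scripts look up the response target instead of hard-coding it. The following is a minimal sketch; the function name `response_target` is hypothetical, and the values simply mirror the table.

```shell
# Map a severity label to its response target in minutes, mirroring the table above.
# SEV-4 has no minute-based target, so the function prints a label instead.
response_target() {
  case "$1" in
    SEV-1) echo 5 ;;
    SEV-2) echo 15 ;;
    SEV-3) echo 60 ;;
    SEV-4) echo next-business-day ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_target SEV-2   # prints 15
```

Keeping the mapping in one function means a change to the severity policy only has to be made in one place.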

On-Call Schedule

# PagerDuty rotation
schedule:
  - name: primary
    rotation: 7 days (Mon 9am → next Mon 9am)
    handoff: Slack handoff doc
  
  - name: secondary (escalation)
    rotation: 7 days (offset)
    shadow: primary
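
To make the weekly rotation and the offset secondary concrete, here is a small sketch that picks who is on call for a given week. The roster names, the `ROSTER` variable, and the `on_call_for_week` function are all hypothetical; in practice PagerDuty computes this for you.

```shell
# Hypothetical roster for illustration; replace with real team members.
ROSTER="alice bob carol dave"

# Print the roster member for a 0-based week index.
# The optional second argument is an offset, used here for the secondary rotation.
on_call_for_week() {
  week=$1
  offset=${2:-0}
  set -- $ROSTER                      # load the roster into positional parameters
  idx=$(( (week + offset) % $# ))     # wrap around the roster length
  shift "$idx"
  echo "$1"
}

on_call_for_week 0      # primary for week 0: alice
on_call_for_week 0 1    # secondary for week 0 (offset by one): bob
```

With an offset of one, this week's secondary becomes next week's primary, which gives each engineer a shadow week before carrying the pager.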

Postmortem Template

# Postmortem: [Title]

## Summary
- Date/Time: YYYY-MM-DD HH:MM UTC
- Duration: X hours Y minutes
- Severity: SEV-X
- Impact: X users affected, $Y revenue loss

## Timeline
- HH:MM UTC - Alert fired (p99 > 1s)
- HH:MM UTC - Engineer paged
- HH:MM UTC - Root cause identified
- HH:MM UTC - Mitigation applied
- HH:MM UTC - Service restored

## Root Cause
- [Brief description of what went wrong]

## Action Items
- [ ] Fix X (owner, due date)
- [ ] Add alert for Y
- [ ] Update runbook for Z

## Blameless Culture
- What went well?
- What went wrong?
- What can we improve?
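
A template is more likely to be used if creating one takes a single command. The sketch below prints a skeleton with the title and severity filled in; `new_postmortem` is a hypothetical helper name, and the section list is a shortened version of the template above.

```shell
# Print a postmortem skeleton for the given title and severity.
# The remaining fields are left as placeholders for the author to fill in.
new_postmortem() {
  title=$1
  severity=$2
  cat <<EOF
# Postmortem: $title

## Summary
- Date/Time: YYYY-MM-DD HH:MM UTC
- Duration:
- Severity: $severity
- Impact:

## Timeline

## Root Cause

## Action Items
EOF
}

new_postmortem "Checkout outage" SEV-2
```

Redirecting the output to a file (for example `new_postmortem "Checkout outage" SEV-2 > postmortem.md`) gives the on-call engineer a starting document right after the incident.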

Blameless Postmortem Principles

| Principle | Why |
|-----------|-----|
| No blame | Focus on system, not people |
| Full timeline | Complete picture of events |
| Root cause | Why did each step happen? |
| Action items | Concrete, tracked follow-ups |
| Share widely | Everyone learns |

Best Practices

  • Automated alerts for known patterns
  • Clear escalation paths
  • Postmortem within 48 hours
  • Track action items in Jira/Linear
  • Regular incident drills (game days)
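
The 48-hour postmortem window is easy to check mechanically. As a rough sketch, assuming timestamps are stored as Unix epoch seconds, a hypothetical `postmortem_overdue` helper could look like this:

```shell
# Succeed (exit 0) when more than 48 hours have passed since the incident was resolved.
# Arguments: resolution time and, optionally, the current time, both in epoch seconds.
postmortem_overdue() {
  resolved_epoch=$1
  now_epoch=${2:-$(date +%s)}
  limit=$(( 48 * 60 * 60 ))
  [ $(( now_epoch - resolved_epoch )) -gt "$limit" ]
}

# Example: resolved at t=0, checked 49 hours later
if postmortem_overdue 0 $(( 49 * 3600 )); then
  echo "Postmortem is overdue"
fi
```

A check like this could run from a daily job that reminds incident owners before the window closes.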

Incident Response Lifecycle

SRE teams follow a structured incident response process to minimize downtime.

The Five Stages

| Stage | What Happens | Goal |
|-------|--------------|------|
| Detection | Monitoring alerts on anomaly | Identify the problem ASAP |
| Triage | Assess severity and impact | Prioritize response |
| Mitigation | Stop the bleeding | Restore service |
| Resolution | Fix root cause | Prevent recurrence |
| Follow-up | Postmortem, action items | Learn and improve |

Severity Levels

| Level | Name | Response Time | Example |
|-------|------|---------------|---------|
| SEV-1 | Critical | < 15 min | Site down, data loss |
| SEV-2 | High | < 1 hour | Feature broken, degraded |
| SEV-3 | Medium | < 4 hours | Minor bug, cosmetic |
| SEV-4 | Low | Next sprint | Enhancement, trivial |
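
Response-time targets are only useful if someone checks them. The sketch below tests whether an incident has gone past the target in the table above; `sla_breached` is a hypothetical name, and SEV-4 is treated as having no minute-based target.

```shell
# Succeed (exit 0) when the elapsed minutes have reached or passed the
# response target for the given severity, per the table above.
sla_breached() {
  severity=$1
  elapsed_min=$2
  case "$severity" in
    SEV-1) target=15 ;;
    SEV-2) target=60 ;;
    SEV-3) target=240 ;;
    *) return 1 ;;   # SEV-4 and unknown levels: no minute-based target
  esac
  [ "$elapsed_min" -ge "$target" ]
}

if sla_breached SEV-1 20; then
  echo "SEV-1 response target missed, escalate to secondary"
fi
```

Because the targets are written as "less than", an acknowledgement at exactly the target minute counts as a breach in this sketch.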

Incident Response Runbook

#!/bin/bash
# incident-response.sh: SRE incident response script

SEVERITY="${1:-SEV-3}"
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Use UTC here too, so the incident ID matches TIMESTAMP
INCIDENT_ID="INC-$(date -u +%Y%m%d-%H%M%S)"

echo "=== Incident $INCIDENT_ID ($SEVERITY) ==="
echo "Started: $TIMESTAMP"

# Step 1: Notify the team
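function notify_team() {
  # Skip the notification if no webhook is configured, instead of calling curl with an empty URL
  if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
    echo "SLACK_WEBHOOK_URL not set; skipping Slack notification" >&2
    return 0
  fi
  local message="{\"text\": \"🚨 *$INCIDENT_ID* [$SEVERITY] - Incident detected at $TIMESTAMP\"}"
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "$message" \
    "$SLACK_WEBHOOK_URL"
}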

# Step 2: Check basic health
function check_basic_health() {
  echo "[1/4] Checking server health..."
  
  # The commands below assume a Linux host, matching the journalctl call in gather_logs.
  # CPU: 100 minus the idle column of a one-second vmstat sample (an integer)
  CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
  echo "CPU: ${CPU_USAGE}%"

  # Memory: used / total from free, truncated to an integer percentage
  MEM_PCT=$(free | awk '/^Mem:/ {printf "%d", $3 * 100 / $2}')
  echo "Memory: ${MEM_PCT}%"

  # Disk: use% of the root filesystem (POSIX output format, % sign stripped)
  DISK_USAGE=$(df -P / | awk 'NR==2 {sub(/%/, "", $5); print $5}')
  echo "Disk: ${DISK_USAGE}%"

  if [ "$CPU_USAGE" -gt 90 ] || [ "$DISK_USAGE" -gt 90 ]; then
    echo "⚠️ Resource threshold exceeded!"
    return 1
  fi
  return 0
}

# Step 3: Check application health
function check_app_health() {
  echo "[2/4] Checking application health..."
  
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 https://app.example.com/health)
  
  if [ "$HTTP_CODE" != "200" ]; then
    echo "❌ Health check returned HTTP $HTTP_CODE"
    return 1
  fi
  echo "✅ Application healthy"
  return 0
}

# Step 4: Check recent deployments
function check_deployments() {
  echo "[3/4] Checking recent deployments..."
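  # List commits from the last hour as a proxy for recent deployments.
  # This only works when the script runs inside the application's git checkout.
  git log --since="1 hour ago" --oneline 2>/dev/null \
    || echo "Not a git checkout; check your deployment tool's history instead"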
}

# Step 5: Gather logs
function gather_logs() {
  echo "[4/4] Gathering recent error logs..."
  # -E enables alternation with | on any grep, not only GNU grep
  journalctl -u myapp --since "1 hour ago" --no-pager | grep -iE "error|exception|timeout|fail" | tail -20
}

notify_team
check_basic_health
check_app_health
check_deployments
gather_logs

echo "=== Incident response initiated ==="
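
The script above calls each check in turn but never looks at their return codes, so a failed health check does not change what happens next. One possible way to tally failures is sketched below; `run_checks` and the `FAILED_CHECKS` variable are hypothetical names introduced for this example.

```shell
# Run each named check function or command and count how many fail.
# The count is left in FAILED_CHECKS; the function succeeds only if all checks pass.
run_checks() {
  FAILED_CHECKS=0
  for check in "$@"; do
    if ! "$check"; then
      echo "check failed: $check" >&2
      FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
  done
  [ "$FAILED_CHECKS" -eq 0 ]
}

# Demonstration with the shell built-ins true and false standing in for real checks
run_checks true false || echo "Escalate: $FAILED_CHECKS check(s) failed"
```

In the runbook, the bare calls could become `run_checks check_basic_health check_app_health`, with the failure count included in the Slack message or used to raise the severity.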

Postmortem Template

# Postmortem: INC-20261215-143022

## Summary
- **Duration**: 35 minutes (14:30 - 15:05 UTC)
- **Impact**: 15% of users experienced 5xx errors
- **Root Cause**: Database connection pool exhausted

## Timeline
| Time | Event |
|------|-------|
| 14:30 | PagerDuty alert: 5xx rate > 5% |
| 14:32 | Engineer on-call acknowledged |
| 14:35 | Identified database connection spike |
| 14:40 | Increased max_connections from 100 to 200 |
| 14:45 | Error rate stabilized |
| 15:05 | Full recovery confirmed |

## Root Cause
A new deployment included a connection leak: connections were not returned to the pool.

## Action Items
- [ ] Add connection pool monitoring alert
- [ ] Fix connection leak in api/users handler
- [ ] Add max_connections auto-scaling
- [ ] Update deployment checklist with DB review
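
Timelines like the one above make it straightforward to derive response metrics such as time to acknowledge and time to recover. Here is a small sketch, assuming all timestamps are `HH:MM` on the same UTC day; the helper names `to_minutes` and `minutes_between` are hypothetical.

```shell
# Convert HH:MM to minutes since midnight.
# A leading zero is stripped so that values such as "08" are not read as invalid octal.
to_minutes() {
  h=${1%%:*}
  m=${1##*:}
  h=${h#0}; h=${h:-0}
  m=${m#0}; m=${m:-0}
  echo $(( h * 60 + m ))
}

# Minutes elapsed between two HH:MM timestamps on the same day.
minutes_between() {
  echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}

echo "Time to acknowledge: $(minutes_between 14:30 14:32) min"   # 2 min
echo "Time to mitigate:    $(minutes_between 14:30 14:40) min"   # 10 min
echo "Time to recover:     $(minutes_between 14:30 15:05) min"   # 35 min
```

The 35-minute result matches the duration stated in the summary, which is a useful sanity check when writing up an incident.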

Summary

Incident response follows a structured lifecycle: detect, triage, mitigate, resolve, follow-up. Automation scripts and postmortems ensure continuous improvement.

Key takeaways:

  • Five stages: detection → triage → mitigation → resolution → follow-up
  • Severity levels: SEV-1 (critical, <15min) to SEV-4 (low)
  • Incident runbook: notify → check health → check deployments → gather logs
  • Postmortem: timeline, root cause, action items, no blame
  • Automation saves time during high-stress incidents
  • Always check recent deployments first
  • Gather logs immediately before they rotate
  • Postmortems help prevent recurrence

What's Next: SRE Dashboard

The next chapter covers building real-time SRE dashboards.
