Chaos Engineering

🔥 Vibe Prompt

"Run a chaos experiment: kill 2 pods in a 5-replica service. Verify auto-healing and no user impact."

Principles

1. Define steady state (normal metrics)
2. Form hypothesis (system will survive)
3. Introduce failure (kill pod, slow network)
4. Compare to steady state
5. Fix & expand
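The five steps above can be sketched as a small driver function. This is a minimal illustration, not a real framework: `run_experiment`, `ExperimentResult`, and the callback names are hypothetical, and the "system" in the usage lines is a toy dictionary standing in for real metrics and real fault injection.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass
class ExperimentResult:
    baseline: Metrics
    observed: Metrics
    hypothesis_held: bool


def run_experiment(
    measure: Callable[[], Metrics],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    hypothesis: Callable[[Metrics, Metrics], bool],
) -> ExperimentResult:
    """Run one experiment: baseline, inject, observe, roll back, compare."""
    baseline = measure()            # step 1: capture steady state
    inject_fault()                  # step 3: introduce the failure
    try:
        observed = measure()        # measure while the fault is active
    finally:
        remove_fault()              # always roll back, even if measuring fails
    # steps 2 and 4: evaluate the hypothesis against the steady state
    held = hypothesis(baseline, observed)
    return ExperimentResult(baseline, observed, held)


# Toy stand-in for a real system: the error rate rises while the fault is on.
state = {"faulty": False}


def fake_measure() -> Metrics:
    return {"error_rate": 2.0 if state["faulty"] else 0.01}


result = run_experiment(
    measure=fake_measure,
    inject_fault=lambda: state.update(faulty=True),
    remove_fault=lambda: state.update(faulty=False),
    hypothesis=lambda base, obs: obs["error_rate"] < 5.0,
)
print(result.hypothesis_held)  # True
```

Step 5 (fix and expand) stays a human decision: a failed hypothesis points at something to fix, and a passed one is a reason to widen the scope of the next run.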

Chaos Mesh

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-experiment
spec:
  action: pod-kill
  mode: fixed-percent
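  value: "40"  # Kill 40% of matching pods (2 of 5 replicas)
  selector:
    namespaces: ["production"]
    labelSelectors:
      app: api
  # duration and scheduler follow the Chaos Mesh 1.x schema. In 2.x the
  # scheduler field was removed; recurring runs use a separate Schedule resource.
  duration: "60s"
  scheduler:
    cron: "@every 24h"  # Run daily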

Network Chaos

apiVersion: chaos-mesh.org/v1alpha1
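kind: NetworkChaos
metadata:
  name: network-delay  # illustrative name
spec:
  action: delay
  mode: all  # assumption: target every pod matched by the selector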
  delay:
    latency: "500ms"
    jitter: "100ms"
  selector:
    namespaces: ["production"]
    labelSelectors:
      app: api
  duration: "120s"

Steady State Checks

# Pre/post experiment metric comparison
steady_state = {
    "availability": 99.99,  # percent
    "p99_latency": 200,     # milliseconds
    "error_rate": 0.01,     # percent
}

# During experiment, expect:
# - availability > 99.9%
# - p99_latency < 1000ms
# - error_rate < 5%
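The expectations in the comments above can be turned into an automated check. The following is a sketch under assumptions: `EXPERIMENT_LIMITS` and `find_violations` are hypothetical names, and the observed values are made up for illustration.

```python
import operator

# Limits taken from the "During experiment, expect" comments: (comparison, limit).
EXPERIMENT_LIMITS = {
    "availability": (operator.gt, 99.9),  # percent
    "p99_latency": (operator.lt, 1000),   # milliseconds
    "error_rate": (operator.lt, 5),       # percent
}


def find_violations(observed, limits=EXPERIMENT_LIMITS):
    """Return the names of metrics that are missing or break their limit."""
    violations = []
    for name, (compare, limit) in limits.items():
        value = observed.get(name)
        if value is None or not compare(value, limit):
            violations.append(name)
    return violations


# Made-up readings taken while a fault is active.
during_experiment = {"availability": 99.95, "p99_latency": 1250, "error_rate": 0.8}
print(find_violations(during_experiment))  # ['p99_latency']
```

Treating a missing metric as a violation is a deliberate choice here: an experiment whose monitoring has gone silent should not be reported as a pass.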

Chaos Game Day

1. Schedule: Friday 10am (avoid peak)
2. Scope: production (or staging first)
3. Scenarios:
   - Kill 3 pods in a 10-replica deployment
   - Block DB traffic for 30s
   - Introduce 1s latency to external API
   - Fill 80% of disk
4. Team: on-call engineer + observer
5. Outcome: verify auto-scaling, retry, circuit breakers

Tools

| Tool | Type |
|------|------|
| Chaos Mesh | K8s-native (pod, network, stress) |
| Litmus | SRE workflows |
| Gremlin | SaaS (more features) |
| AWS FIS | AWS-native |

Best Practices

- Start small (staging, not production)
- Have a rollback plan (disable chaos immediately)
- Automate experiments (CronJob)
- Share results company-wide
- Build a "Chaos Dashboard" for history
- Never run chaos without monitoring!

Chapter Summary

- Understand the core concepts and principles
- Master the implementation methods and techniques
- Be familiar with common issues and their solutions
- Be able to apply them in real projects

Chaos Engineering Principles

Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions.

Core Principles

| Principle | Description |
|-----------|-------------|
| Define steady state | Measure normal system behavior (latency, throughput, error rate) |
| Hypothesize | Predict what will happen when a fault is introduced |
| Introduce faults | Run controlled experiments (kill servers, inject latency) |
| Measure | Compare results against steady state |
| Automate | Run experiments continuously in production |
| Minimize blast radius | Start small, limit impact |

Chaos Experiment Types

| Experiment | What It Tests | Tool |
|------------|--------------|------|
| Kill a server | Load balancer failover | Chaos Monkey |
| Inject latency | Timeout handling, retries | Chaos Mesh |
| Network partition | Distributed system resilience | Gremlin |
| CPU spike | Auto-scaling, resource limits | stress-ng |
| Disk fill | Disk space monitoring | dd + fallocate |
| Database failure | Connection pool, caching | Network blocking |

Running Chaos Experiments

Using Chaos Mesh

# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-experiment
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: web-server
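  # duration and scheduler are Chaos Mesh 1.x fields; 2.x drops the scheduler
  # field and schedules recurring runs through a separate Schedule resource.
  duration: "60s"
  scheduler:
    cron: "@every 30m"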

kubectl apply -f pod-kill.yaml

# Check experiment status (the resource lives in the chaos-testing namespace)
kubectl get podchaos -n chaos-testing

# Watch events
kubectl describe podchaos pod-kill-experiment -n chaos-testing
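When experiments are generated by a script rather than written by hand, the same manifest can be built as a plain dictionary and written out as JSON, which `kubectl apply -f` accepts alongside YAML. This is a sketch only: `build_pod_kill` is a hypothetical helper, and it emits just the action, mode, and selector fields from the manifest above.

```python
import json


def build_pod_kill(name, namespace, target_namespace, labels, mode="one", value=None):
    """Build a PodChaos pod-kill manifest as a plain dictionary."""
    spec = {
        "action": "pod-kill",
        "mode": mode,
        "selector": {
            "namespaces": [target_namespace],
            "labelSelectors": dict(labels),
        },
    }
    if value is not None:
        # Kept as a string, matching the quoted value used with fixed-percent mode.
        spec["value"] = str(value)
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


manifest = build_pod_kill(
    name="pod-kill-experiment",
    namespace="chaos-testing",
    target_namespace="production",
    labels={"app": "web-server"},
)
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file such as `pod-kill.json` and applying it should behave like the YAML version, assuming the installed Chaos Mesh version accepts the same fields.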

Latency Injection

# http-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-delay
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway
  target: Request
  port: 8080
  # HTTPChaos takes delay as a single duration string; the latency, jitter,
  # and correlation sub-fields belong to NetworkChaos delay actions.
  delay: "1s"
  duration: "5m"

Measuring the Impact

Metrics to Watch

| Metric | Expected During Experiment | Alert if |
|--------|---------------------------|----------|
| Error rate | Increases then recovers | Stays elevated after experiment |
| P95 latency | Spikes during injection | Doesn't return to baseline |
| Request rate | Drops then recovers | Permanently lower |
| Active connections | Spikes | Connection pool exhaustion |
| Auto-scaling events | Triggers | Fails to scale |
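The "Alert if" column mostly asks one question: did the metric come back to its baseline once the experiment ended? A minimal sketch of that check follows. The function names, the 10% tolerance, and all sample values are assumptions chosen for illustration.

```python
def returned_to_baseline(baseline, after, tolerance=0.10):
    """True if `after` is within `tolerance` (a fraction) of `baseline`.

    Deviation in either direction counts, so an unexpected improvement
    is flagged as well as a regression.
    """
    if baseline == 0:
        return after == 0
    return abs(after - baseline) / abs(baseline) <= tolerance


def stuck_metrics(baseline, post_experiment, tolerance=0.10):
    """Names of metrics that have not returned to baseline."""
    return [
        name
        for name, base in baseline.items()
        if not returned_to_baseline(base, post_experiment[name], tolerance)
    ]


# Made-up readings: before the experiment and a few minutes after it ended.
baseline = {"error_rate": 0.5, "p95_latency_ms": 180, "request_rate": 1200}
post = {"error_rate": 0.52, "p95_latency_ms": 260, "request_rate": 1190}
print(stuck_metrics(baseline, post))  # ['p95_latency_ms']
```

A relative tolerance is used so one setting can cover metrics with very different scales. Metrics whose baseline is near zero usually need an absolute threshold instead.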

Game Day Preparation

# Game Day: Database Failover Test

## Scenario
Primary database goes down. Can the application fail over gracefully?

## Hypothesis
The read replica will serve traffic within 30 seconds of primary failure.

## Steps
1. Block network to primary database (port 5432)
2. Monitor application behavior for 5 minutes
3. Verify read-only mode activates
4. Restore connection
5. Verify full recovery

## Success Criteria
- Error rate stays below 5%
- Read operations continue uninterrupted
- Write operations queue and replay after recovery
- Recovery completes within 60 seconds
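Two of the success criteria above (error rate below 5%, recovery within 60 seconds) can be checked from a timeline of error-rate samples. The sketch below makes several assumptions: the function names are hypothetical, "recovered" is taken to mean the error rate is at or below 0.5% and stays there, and the sample timeline is invented.

```python
def recovery_seconds(samples, restored_at, max_error_rate):
    """Seconds from `restored_at` until the error rate settles at or below the limit.

    `samples` is a list of (timestamp_seconds, error_rate_percent) sorted by time.
    Returns None if the error rate never settles within the samples.
    """
    recovered_at = None
    for timestamp, error_rate in samples:
        if timestamp < restored_at:
            continue
        if error_rate <= max_error_rate:
            if recovered_at is None:
                recovered_at = timestamp
        else:
            recovered_at = None  # a relapse resets the clock
    if recovered_at is None:
        return None
    return recovered_at - restored_at


def evaluate_game_day(samples, restored_at):
    """Check the error-rate and recovery-time criteria against a sample timeline."""
    peak = max(error_rate for _, error_rate in samples)
    # 0.5% as the "recovered" level is an assumption, not a value from the plan.
    recovery = recovery_seconds(samples, restored_at, max_error_rate=0.5)
    return {
        "error_rate_below_5_percent": peak < 5.0,
        "recovered_within_60s": recovery is not None and recovery <= 60,
    }


# Invented timeline: fault injected at t=0, connection restored at t=300.
timeline = [(0, 0.1), (60, 4.0), (300, 3.5), (310, 2.0), (330, 0.4), (350, 0.2)]
print(recovery_seconds(timeline, restored_at=300, max_error_rate=0.5))  # 30
print(evaluate_game_day(timeline, restored_at=300))
```

The remaining criteria (reads continuing, writes queuing and replaying) depend on application behavior and need their own probes. This sketch does not cover them.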

Summary

Chaos engineering proactively tests system resilience through controlled experiments. Start small, measure everything, and automate recurring tests.

Key takeaways:

- Principles: define steady state → hypothesize → introduce fault → measure
- Experiments: pod kill, latency injection, network partition, CPU spike
- Chaos Mesh: Kubernetes-native chaos engineering tool
- Game Days: structured failure scenario with success criteria
- Always monitor impact: error rate, latency, connections, scaling
- Automate experiments to run on a schedule
- Start with small blast radius, expand gradually
- Use results to improve system resilience and incident response

What's Next: SRE Dashboard

The next chapter covers building SRE monitoring dashboards.