Chaos Engineering
Vibe Prompt
"Run a chaos experiment: kill 2 pods in a 5-replica service. Verify auto-healing and no user impact."
Principles
1. Define steady state (normal metrics)
2. Form hypothesis (system will survive)
3. Introduce failure (kill pod, slow network)
4. Compare to steady state
5. Fix what broke, then expand the scope of the experiment
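The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real framework: `measure`, `inject`, and `rollback` are hypothetical callables you would supply, and the hypothesis is reduced to a single assumed bound on `error_rate`.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]


def run_experiment(
    measure: Callable[[], Metrics],   # hypothetical: returns current metrics
    inject: Callable[[], None],       # hypothetical: starts the fault
    rollback: Callable[[], None],     # hypothetical: removes the fault
    max_error_rate: float,
) -> bool:
    """Return True if the hypothesis held (error rate stayed within bound)."""
    baseline = measure()              # 1. define steady state
    # 2. hypothesis: error_rate stays <= max_error_rate while the fault is active
    try:
        inject()                      # 3. introduce failure
        observed = measure()          # 4. measure again to compare
    finally:
        rollback()                    # always remove the fault, even on errors
    print(f"error_rate: {baseline['error_rate']} -> {observed['error_rate']}")
    return observed["error_rate"] <= max_error_rate
```

Putting `rollback()` in a `finally` block means the fault is removed even if measurement fails halfway through.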
Chaos Mesh
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-experiment
spec:
  action: pod-kill
  mode: fixed-percent
  value: "40"  # Kill 40% of matching pods (2 of 5 replicas)
  selector:
    namespaces: ["production"]
    labelSelectors:
      app: api
  duration: "60s"
  # Inline scheduler is Chaos Mesh 1.x syntax; 2.x uses a separate Schedule resource.
  scheduler:
    cron: "@every 24h"  # Run daily
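Before applying a percentage-based experiment, it helps to work out how many pods it will actually hit. The helper below is a small sketch; it assumes the selected count is rounded down, which you should confirm against the Chaos Mesh version you run. Both function names are made up for illustration.

```python
def pods_affected(replicas: int, percent: int) -> int:
    """Pods selected by a fixed-percent mode, ASSUMING the count rounds down."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return replicas * percent // 100


def pods_remaining(replicas: int, percent: int) -> int:
    """Pods left to serve traffic while the experiment runs."""
    return replicas - pods_affected(replicas, percent)
```

Under that assumption, `pods_affected(5, 40)` gives 2 and leaves 3 pods serving, while the same 40% against a 2-replica service would select no pods at all.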
Network Chaos
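apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay  # name assumed for illustration
spec:
  action: delay
  mode: all            # assumed: apply to every matching pod
  delay:
    latency: "500ms"
    jitter: "100ms"
  selector:
    namespaces: ["production"]
    labelSelectors:
      app: api
  duration: "120s"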
Steady State Checks
# Pre/post experiment metric comparison
steady_state = {
    "availability": 99.99,  # percent
    "p99_latency": 200,     # milliseconds
    "error_rate": 0.01,     # percent
}
# During experiment, expect:
# - availability > 99.9%
# - p99_latency < 1000ms
# - error_rate < 5%
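The expectations in the comments above can be turned into an explicit check. The sketch below hard-codes those three thresholds; the `observed` dictionary is assumed to come from whatever monitoring system you use, and `violations` is a hypothetical helper name.

```python
# Thresholds copied from the "During experiment, expect" comments.
LIMITS = {
    "availability": (">", 99.9),   # percent
    "p99_latency": ("<", 1000),    # milliseconds
    "error_rate": ("<", 5),        # percent
}


def violations(observed: dict) -> list:
    """Return a description of every metric outside its expected range."""
    failed = []
    for name, (op, limit) in LIMITS.items():
        value = observed[name]
        ok = value > limit if op == ">" else value < limit
        if not ok:
            failed.append(f"{name}={value} (expected {op} {limit})")
    return failed
```

An empty list means the system stayed within the expected envelope; anything else is a finding to investigate before expanding the experiment.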
Chaos Game Day
1. Schedule: Friday 10am (avoid peak)
2. Scope: production (or staging first)
3. Scenarios:
   - Kill 3 pods in a 10-replica deployment
   - Block DB traffic for 30s
   - Introduce 1s latency to external API
   - Fill 80% of disk
4. Team: on-call engineer + observer
5. Outcome: verify auto-scaling, retry, circuit breakers
Tools
| Tool | Type |
|------|------|
| Chaos Mesh | K8s-native (pod, network, stress) |
| Litmus | SRE workflows |
| Gremlin | SaaS (more features) |
| AWS FIS | AWS-native |
Best Practices
- Start small (staging, not production)
- Have a rollback plan (disable chaos immediately)
- Automate experiments (CronJob)
- Share results company-wide
- Build a "Chaos Dashboard" for history
- Never run chaos without monitoring!
Chapter Summary
- Understand the core principles: steady state, hypothesis, fault injection, comparison
- Write Chaos Mesh experiments for pod kills and network delay
- Recognize common failure modes and the checks that catch them
- Plan and run a game day on a real service
Chaos Engineering Principles
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions.
Core Principles
| Principle | Description |
|-----------|-------------|
| Define steady state | Measure normal system behavior (latency, throughput, error rate) |
| Hypothesize | Predict what will happen when a fault is introduced |
| Introduce faults | Run controlled experiments (kill servers, inject latency) |
| Measure | Compare results against steady state |
| Automate | Run experiments continuously in production |
| Minimize blast radius | Start small, limit impact |
Chaos Experiment Types
| Experiment | What It Tests | Tool |
|------------|--------------|------|
| Kill a server | Load balancer failover | Chaos Monkey |
| Inject latency | Timeout handling, retries | Chaos Mesh |
| Network partition | Distributed system resilience | Gremlin |
| CPU spike | Auto-scaling, resource limits | stress-ng |
| Disk fill | Disk space monitoring | dd + fallocate |
| Database failure | Connection pool, caching | Network blocking |
Running Chaos Experiments
Using Chaos Mesh
# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-experiment
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: web-server
  duration: "60s"
  scheduler:
    cron: "@every 30m"
kubectl apply -f pod-kill.yaml
# Check experiment status (the resource lives in the chaos-testing namespace)
kubectl get podchaos -n chaos-testing
# Watch events
kubectl describe podchaos pod-kill-experiment -n chaos-testing
Latency Injection
# http-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-delay
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway
  target: Request
  port: 8080
  # HTTPChaos takes a single delay duration; latency, jitter, and
  # correlation are NetworkChaos delay fields and do not apply here.
  delay: "1000ms"
  duration: "5m"
Measuring the Impact
Metrics to Watch
| Metric | Expected During Experiment | Alert if |
|--------|---------------------------|----------|
| Error rate | Increases then recovers | Stays elevated after experiment |
| P95 latency | Spikes during injection | Doesn't return to baseline |
| Request rate | Drops then recovers | Permanently lower |
| Active connections | Spikes | Connection pool exhaustion |
| Auto-scaling events | Triggers | Fails to scale |
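The "Alert if" column can be approximated by comparing metrics taken after the experiment ends with the pre-experiment baseline. The sketch below covers the first three rows only; the metric names and the 10% tolerance are assumptions chosen for illustration, not values from any particular monitoring tool.

```python
def post_experiment_alerts(baseline: dict, after: dict, tolerance: float = 0.10) -> list:
    """Flag metrics that have not returned to baseline once the fault is removed."""
    alerts = []
    # Higher is worse: alert if still above baseline by more than the tolerance.
    for name in ("error_rate", "p95_latency_ms"):
        if after[name] > baseline[name] * (1 + tolerance):
            alerts.append(f"{name} stays elevated")
    # Lower is worse: alert if traffic has not come back.
    if after["request_rate"] < baseline["request_rate"] * (1 - tolerance):
        alerts.append("request_rate did not recover")
    return alerts
```

The tolerance absorbs normal noise; a baseline of zero for a metric would make any non-zero value trigger an alert, so a small absolute floor may be worth adding in practice.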
Game Day Preparation
# Game Day: Database Failover Test
## Scenario
The primary database goes down. Can the application fail over gracefully?
## Hypothesis
The read replica will serve traffic within 30 seconds of primary failure.
## Steps
1. Block network to primary database (port 5432)
2. Monitor application behavior for 5 minutes
3. Verify read-only mode activates
4. Restore connection
5. Verify full recovery
## Success Criteria
- Error rate stays below 5%
- Read operations continue uninterrupted
- Write operations queue and replay after recovery
- Recovery completes within 60 seconds
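The last criterion is easier to judge with a number than by eye. The sketch below computes recovery time from a series of health-check samples; the `(timestamp, healthy)` sample format and the `restored_at` timestamp are assumptions about how you record the game day, and `recovery_seconds` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def recovery_seconds(
    samples: List[Tuple[float, bool]], restored_at: float
) -> Optional[float]:
    """Seconds from restore until the service is healthy and stays healthy.

    samples: (timestamp_seconds, healthy) pairs in ascending time order.
    Returns None if the service never reaches a stable healthy state.
    """
    recovered_at = None
    for ts, healthy in samples:
        if ts < restored_at:
            continue                  # ignore samples taken before the restore
        if healthy:
            if recovered_at is None:
                recovered_at = ts     # first healthy sample of this streak
        else:
            recovered_at = None       # relapse: recovery has to start over
    return None if recovered_at is None else recovered_at - restored_at
```

A brief healthy blip followed by another failure does not count as recovery, which is why the streak resets on every unhealthy sample. The result can then be compared directly with the 60-second criterion.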
Summary
Chaos engineering proactively tests system resilience through controlled experiments. Start small, measure everything, and automate recurring tests.
Key takeaways:
- Principles: define steady state → hypothesize → introduce fault → measure
- Experiments: pod kill, latency injection, network partition, CPU spike
- Chaos Mesh: Kubernetes-native chaos engineering tool
- Game Days: structured failure scenario with success criteria
- Always monitor impact: error rate, latency, connections, scaling
- Automate experiments to run on a schedule
- Start with small blast radius, expand gradually
- Use results to improve system resilience and incident response
What's Next: SRE Dashboard
The next chapter covers building SRE monitoring dashboards.