SLO & SLI
Vibe Prompt
"Define SLIs and SLOs for an API service: availability 99.9%, latency p99 <500ms. Set up error budget."
SLI (Service Level Indicator)
Availability SLI:
= successful_requests / total_requests * 100
= sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Latency SLI (p99):
= histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
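To make the two definitions concrete, here is a minimal sketch that computes both SLIs from raw request samples instead of Prometheus series. The function names and the sample data are assumptions made for illustration, and the percentile uses the simple nearest-rank method rather than the bucket interpolation that histogram_quantile performs.

```python
import math


def availability_sli(status_codes):
    """Percentage of requests that did not return a 5xx status."""
    if not status_codes:
        raise ValueError("no requests observed")
    good = sum(1 for code in status_codes if not 500 <= code <= 599)
    return 100 * good / len(status_codes)


def latency_percentile(durations_s, pct):
    """Nearest-rank percentile (pct in 1..100) of durations in seconds."""
    if not durations_s:
        raise ValueError("no requests observed")
    ordered = sorted(durations_s)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical sample: 1000 requests, 3 server errors, 20 slow requests
    statuses = [200] * 997 + [500, 502, 503]
    durations = [0.1] * 980 + [0.6] * 20
    print(f"availability: {availability_sli(statuses):.2f}%")
    print(f"p99 latency:  {latency_percentile(durations, 99)}s")
```

In this sample the p99 lands on a slow request, so the service would miss a 500ms latency target even though its median latency is fine.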
SLO (Service Level Objective)
objectives:
  - name: availability
    target: 99.9% # 3 nines
    measurement_window: 30d
    sli: "good_requests / total_requests"
  - name: latency_p99
    target: 500ms
    measurement_window: 30d
    sli: "p99 latency"
  - name: throughput
    target: 1000 req/s
    measurement_window: 1h
Error Budget
Error Budget = 1 - SLO
Example: SLO = 99.9% → Error Budget = 0.1% = ~43.2 min/month (30-day month)
Purpose:
- How much downtime is "allowed"?
- Error budget burned → stop deploying and focus on reliability
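For a request-based SLI, the same budget can be counted in failed requests rather than minutes. The sketch below is one way to do that; the function name and the traffic numbers are hypothetical.

```python
def error_budget_status(slo_percent, total_requests, failed_requests):
    """Return (allowed_failures, consumed_fraction) for a request-based SLO."""
    budget_fraction = (100 - slo_percent) / 100  # about 0.001 for a 99.9% SLO
    allowed_failures = budget_fraction * total_requests
    if allowed_failures == 0:
        # A 100% SLO leaves no budget at all
        return 0.0, float("inf") if failed_requests else 0.0
    return allowed_failures, failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 of them failed
    allowed, consumed = error_budget_status(99.9, 1_000_000, 250)
    print(f"allowed failures: {allowed:.0f}")
    print(f"budget consumed:  {consumed:.0%}")
```

With these numbers the service may fail roughly 1000 requests in the window and has used about a quarter of that allowance.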
Burn Rate Alert
# Fire when the budget burns 3x too fast (3h worth of budget consumed in 1h).
# For a 99.9% SLO the budgeted error ratio is 0.001, so a 3x burn rate is 0.003.
alert: HighErrorBudgetBurn
expr: (1 - (sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))) > (3 * 0.001)
labels: { severity: critical }
annotations:
  summary: "Error budget burn rate > 3x!"
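Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a value of 1 means the budget lasts exactly one window. The following sketch shows the arithmetic behind an alert threshold like the one above; the function names and the 30-day default window are assumptions for illustration.

```python
def burn_rate(error_ratio, slo_percent):
    """How many times faster than 'exactly on budget' the service is failing."""
    budget_fraction = (100 - slo_percent) / 100
    return error_ratio / budget_fraction


def hours_to_exhaustion(rate, window_hours=30 * 24):
    """Hours until a full budget is gone if the burn rate stays constant."""
    return float("inf") if rate <= 0 else window_hours / rate


if __name__ == "__main__":
    # Hypothetical reading: 0.3% of requests failed over the last hour
    rate = burn_rate(0.003, 99.9)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget gone in: {hours_to_exhaustion(rate) / 24:.0f} days")
```

At a sustained 3x burn rate, a 30-day budget is used up in about 10 days, which is why the alert pages long before the SLO itself is breached.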
Multi-SLO Windows
| Window | Purpose |
|--------|---------|
| 1h | Burn rate alerting |
| 7d | Weekly reliability review |
| 30d | SLO compliance reporting |
| 365d | Annual target (e.g., 99.9% = 8.76h downtime) |
Best Practices
- Start with 2-3 key SLIs per service
- Set the SLO slightly below what you can reliably achieve, to leave a buffer
- Use the error budget as a throttle on deployment velocity
- Alert on burn rate, not SLO breach
Error Budget Calculation
The error budget is the amount of downtime allowed by your SLO.
Formula
$$\text{Error Budget} = (1 - \text{SLO}) \times \text{Total Time}$$
| SLO | Monthly Error Budget (30d) | Yearly Error Budget (365d) |
|-----|----------------------------|----------------------------|
| 99.9% (three nines) | 43m 12s | 8h 46m |
| 99.95% | 21m 36s | 4h 23m |
| 99.99% (four nines) | 4m 19s | 52m 34s |
| 99.999% (five nines) | 26s | 5m 15s |
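The table values follow directly from the formula. As a quick check, this sketch recomputes them, assuming a 30-day month and a 365-day year; the function name is made up for the example.

```python
from datetime import timedelta


def error_budget(slo_percent, window):
    """Allowed downtime for an SLO (in percent) over a window (timedelta)."""
    return window * ((100 - slo_percent) / 100)


if __name__ == "__main__":
    month = timedelta(days=30)   # assumption: 30-day month
    year = timedelta(days=365)   # assumption: non-leap year
    for slo in (99.9, 99.95, 99.99, 99.999):
        monthly = error_budget(slo, month)
        yearly = error_budget(slo, year)
        # total_seconds() keeps the output free of microsecond noise
        print(f"{slo}%: {monthly.total_seconds():.0f}s per month, "
              f"{yearly.total_seconds() / 60:.1f}min per year")
```

For 99.9% this prints about 2592 seconds per month, which is the 43m 12s shown in the first row.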
Error Budget Policy
# Error budget policy example
error_budget_policy:
  window: 30d # Rolling 30-day window
  # Action thresholds
  thresholds:
    - level: warning
      consumed: 50% # Alert the team
      action: "Schedule postmortem"
    - level: critical
      consumed: 75% # Freeze all releases
      action: "Stop all deployments until budget recovers"
    - level: exhausted
      consumed: 100%
      action: "Emergency incident review"
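A policy like this is easy to evaluate in code once the consumed fraction of the budget is known. The sketch below mirrors the three thresholds from the policy; the "ok" level for anything under 50% and the function name are assumptions added for the example, not part of the policy file.

```python
# (consumed fraction, level, action), checked from the highest threshold down
THRESHOLDS = [
    (1.00, "exhausted", "Emergency incident review"),
    (0.75, "critical", "Stop all deployments until budget recovers"),
    (0.50, "warning", "Schedule postmortem"),
]


def policy_level(consumed):
    """Map a consumed budget fraction (0.6 = 60%) to a policy level and action."""
    for threshold, level, action in THRESHOLDS:
        if consumed >= threshold:
            return level, action
    return "ok", None  # hypothetical level: below every threshold


if __name__ == "__main__":
    for consumed in (0.10, 0.60, 0.80, 1.20):
        level, action = policy_level(consumed)
        print(f"{consumed:.0%} consumed -> {level}: {action}")
```

Checking the highest threshold first means a service at 80% consumption is reported as critical, not as warning.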
Monitoring SLIs
Instrumentation Examples
# Python: track request count and latency with Prometheus
# (assumes a FastAPI app; the middleware hook below is FastAPI/Starlette style)
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram

app = FastAPI()

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency in seconds',
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

def track_request(method: str, endpoint: str, status: int):
    # Label values are stored as strings, so status 503 matches status=~"5.."
    REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()

@app.middleware("http")
async def monitor_requests(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    response = await call_next(request)
    REQUEST_LATENCY.observe(time.perf_counter() - start)
    track_request(
        method=request.method,
        endpoint=request.url.path,  # raw path; high cardinality if paths embed IDs
        status=response.status_code
    )
    return response
SLI Calculation in PromQL
# Availability SLI (30d window)
(
  sum(rate(http_requests_total{status!~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) * 100
# Latency SLI (p99 over 5m, matching the latency_p99 objective)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# Throughput SLI (requests per second)
sum(rate(http_requests_total[5m]))
SLO Compliance Report
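#!/bin/bash
# slo-report.sh: generate an SLO compliance report
SLO_TARGET=99.9
WINDOW="30d"
WINDOW_MINUTES=43200  # 30 days * 24 h * 60 min

# Query Prometheus for availability (percentage of non-5xx requests)
AVAILABILITY=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=(
    sum(rate(http_requests_total{status!~\"5..\"}[$WINDOW]))
    /
    sum(rate(http_requests_total[$WINDOW]))
  ) * 100" | jq -r '.data.result[0].value[1]')

# Stop early if Prometheus returned no data for the window
if [[ -z "$AVAILABILITY" || "$AVAILABILITY" == "null" ]]; then
  echo "No availability data returned for window $WINDOW" >&2
  exit 1
fi

echo "SLO Compliance Report"
echo "===================="
echo "Window: $WINDOW"
echo "Target: $SLO_TARGET%"
echo "Actual: $AVAILABILITY%"

# Budget is expressed as time-equivalent minutes: percentage points / 100 * window
if (( $(echo "$AVAILABILITY >= $SLO_TARGET" | bc -l) )); then
  echo "Status: ✅ COMPLIANT"
  echo "Budget remaining: $(echo "scale=2; ($AVAILABILITY - $SLO_TARGET) * $WINDOW_MINUTES / 100" | bc) minutes"
else
  echo "Status: ❌ BUDGET EXHAUSTED"
  echo "Overspent: $(echo "scale=2; ($SLO_TARGET - $AVAILABILITY) * $WINDOW_MINUTES / 100" | bc) minutes"
fi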
Summary
SLIs measure service performance, SLOs set targets, and error budgets balance reliability with innovation. Together they form the foundation of SRE practice.
Key takeaways:
- SLI: specific metric (latency, availability, throughput)
- SLO: target value for the SLI over a time window
- Error budget: (1 - SLO) × total time = allowed downtime
- 99.9% SLO = 43 minutes downtime per month
- Error budget policy: warning at 50%, freeze at 75%, emergency at 100%
- Instrument with Prometheus client libraries
- Calculate SLIs with PromQL queries
- Generate compliance reports to track SLO achievement
What's Next: Incident Response
The next chapter covers incident response processes and runbooks.