SLO & SLI

🔥 Vibe Prompt

"Define SLIs and SLOs for an API service: availability 99.9%, latency p99 <500ms. Set up error budget."

SLI (Service Level Indicator)

Availability SLI:
  = successful_requests / total_requests * 100
  = sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

Latency SLI (p99):
  = histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
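
To make the two definitions concrete, here is a minimal sketch that computes both SLIs from raw data instead of Prometheus series. The function names and sample numbers are illustrative assumptions, and the percentile uses the simple nearest-rank method on raw samples, whereas Prometheus estimates quantiles from histogram buckets.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return availability as a percentage of successful requests."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 100.0
    return good_requests / total_requests * 100


def p99_latency(latencies: list[float]) -> float:
    """Return the 99th percentile of raw latency samples (nearest-rank method)."""
    if not latencies:
        raise ValueError("no latency samples")
    ordered = sorted(latencies)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    print(availability_sli(good_requests=999_500, total_requests=1_000_000))
    print(p99_latency([float(ms) for ms in range(1, 101)]))
```

The availability function is the same good/total ratio as the PromQL expression; only the data source differs.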

SLO (Service Level Objective)

objectives:
  - name: availability
    target: 99.9%  # 3 nines
    measurement_window: 30d
    sli: "good_requests / total_requests"
  
  - name: latency_p99
    target: 500ms
    measurement_window: 30d
    sli: "p99 latency"
  
  - name: throughput
    target: 1000 req/s
    measurement_window: 1h

Error Budget

Error Budget = 1 - SLO
Example: SLO = 99.9% → Error Budget = 0.1% = ~43.2 min per 30-day month

Purpose:
- Quantifies how much unreliability is "allowed" within the measurement window
- When the error budget is burned, stop deploying new features and focus on reliability
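
For a request-based SLO, the budget can be tracked as a count of allowed failures. The following is a rough sketch under that assumption; the function name and the example counts are hypothetical.

```python
def error_budget_remaining(slo: float, good: int, total: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo is a ratio such as 0.999. The result is 1.0 when no budget has been
    used, 0.0 when it is exactly exhausted, and negative when overspent.
    """
    if total == 0:
        # Assumption: no traffic means no budget has been consumed.
        return 1.0
    allowed_failures = (1 - slo) * total
    actual_failures = total - good
    if allowed_failures == 0:
        # A 100% SLO leaves no budget at all.
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1 - actual_failures / allowed_failures


if __name__ == "__main__":
    # 500 failures against roughly 1000 allowed: about half the budget is left.
    print(error_budget_remaining(slo=0.999, good=999_500, total=1_000_000))
```

A negative result is the signal to pause feature deployments until the budget recovers.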

Burn Rate Alert

# Burn rate = observed error ratio / allowed error ratio (1 - SLO).
# For a 99.9% SLO, a 3x burn rate means an error ratio above 0.3% (3 * 0.001),
# which would use up a 30-day budget in 10 days.
alert: HighErrorBudgetBurn
expr: (1 - (sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))) > (3 * 0.001)
labels: { severity: critical }
annotations:
  summary: "Error budget burn rate > 3x!"
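
Burn rate is the observed error ratio divided by the error ratio the SLO allows. As a small sketch of that arithmetic (function names are illustrative, and a constant burn rate over a 30-day window is assumed):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Return the days until the budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.003, slo=0.999)
    print(f"burn rate: {rate:.2f}x")
    print(f"budget lasts: {days_to_exhaustion(rate):.1f} days")
```

A burn rate of 1x spends the budget exactly at the end of the window; anything above 1x exhausts it early.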

Multi-SLO Windows

| Window | Purpose |
|--------|---------|
| 1h | Burn rate alerting |
| 7d | Weekly reliability review |
| 30d | SLO compliance reporting |
| 365d | Annual target (e.g., 99.9% = 8.76h downtime) |

Best Practices

  • Start with 2-3 key SLIs per service
  • Set the SLO slightly below what the service reliably achieves, to leave a buffer
  • Use the error budget to throttle deployment velocity
  • Alert on burn rate rather than waiting for an SLO breach

Error Budget Calculation

The error budget is the amount of downtime allowed by your SLO.

Formula

$$\text{Error Budget} = (1 - \text{SLO}) \times \text{Total Time}$$

| SLO | Monthly Error Budget (30 d) | Yearly Error Budget (365 d) |
|-----|-----------------------------|-----------------------------|
| 99.9% (three nines) | 43m 12s | 8h 46m |
| 99.95% | 21m 36s | 4h 23m |
| 99.99% (four nines) | 4m 19s | 52m 34s |
| 99.999% (five nines) | 26s | 5m 15s |
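
The table values follow directly from the formula. A short sketch that recomputes them, assuming a 30-day month and a 365-day year (the function name is illustrative):

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the allowed downtime for an SLO (in percent) over a time window."""
    return window * (1 - slo_percent / 100)


if __name__ == "__main__":
    month = timedelta(days=30)   # assumption: 30-day month
    year = timedelta(days=365)   # assumption: non-leap year
    for slo in (99.9, 99.95, 99.99, 99.999):
        print(slo, error_budget(slo, month), error_budget(slo, year))
```

Using `timedelta` keeps the units explicit, so the same function serves any window length.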

Error Budget Policy

# Error budget policy example
error_budget_policy:
  window: 30d  # Rolling 30-day window
  
  # Action thresholds
  thresholds:
    - level: warning
      consumed: 50%  # Alert the team
      action: "Schedule postmortem"
    
    - level: critical
      consumed: 75%  # Freeze all releases
      action: "Stop all deployments until budget recovers"
    
    - level: exhausted
      consumed: 100%
      action: "Emergency incident review"
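
One way such a policy might be evaluated in code is to map the consumed fraction of the budget to the highest threshold it has crossed. This is only a sketch: the `"ok"` level for consumption below 50% is a hypothetical addition, not part of the policy above.

```python
# Thresholds from the policy, ordered from most to least severe.
THRESHOLDS = [
    (1.00, "exhausted"),
    (0.75, "critical"),
    (0.50, "warning"),
]


def policy_level(consumed: float) -> str:
    """Return the policy level for a consumed budget fraction (0.8 means 80%)."""
    for threshold, level in THRESHOLDS:
        if consumed >= threshold:
            return level
    return "ok"  # hypothetical level: no threshold crossed yet


if __name__ == "__main__":
    for consumed in (0.10, 0.50, 0.80, 1.20):
        print(f"{consumed:.0%} consumed -> {policy_level(consumed)}")
```

Checking the most severe threshold first means a single pass returns the right level without extra comparisons.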

Monitoring SLIs

Instrumentation Examples

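# Python: track request count and latency with Prometheus.
# Assumes a FastAPI application; the middleware decorator below needs `app`.
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram

app = FastAPI()

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency in seconds',
    # Bucket bounds include 0.5 so the 500ms latency target falls on a boundary
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

def track_request(method: str, endpoint: str, status: int) -> None:
    # Label values are strings in Prometheus, so convert the status code explicitly
    REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=str(status)).inc()

@app.middleware("http")
async def monitor_requests(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    response = await call_next(request)
    duration = time.perf_counter() - start

    REQUEST_LATENCY.observe(duration)
    # Note: raw URL paths with IDs in them can create many label values
    track_request(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    )

    return response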

SLI Calculation in PromQL

# Availability SLI (30d window)
(
  sum(rate(http_requests_total{status!~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) * 100

# Latency SLI (p99 over 5m, matching the p99 < 500ms objective)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# Throughput SLI
sum(rate(http_requests_total[5m]))
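
Quantiles computed from histogram buckets are estimates, because only the bucket counts are known. The sketch below illustrates the general idea of linear interpolation inside the bucket that contains the target rank. It is a simplified illustration, not Prometheus's actual implementation, and the bucket counts are made up.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    sample = [(0.1, 900), (0.25, 960), (0.5, 995), (1.0, 1000), (math.inf, 1000)]
    print(estimate_quantile(0.99, sample))
```

Because the estimate can never be more precise than the bucket layout, it helps to place a bucket boundary at the latency target itself, such as 0.5 seconds for a 500ms objective.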

SLO Compliance Report

#!/bin/bash
# slo-report.sh: Generate SLO compliance report

SLO_TARGET=99.9
WINDOW="30d"

# Query Prometheus for availability
AVAILABILITY=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=(
    sum(rate(http_requests_total{status!~\"5..\"}[$WINDOW]))
    /
    sum(rate(http_requests_total[$WINDOW]))
  ) * 100" | jq -r '.data.result[0].value[1]')

echo "SLO Compliance Report"
echo "===================="
echo "Window: $WINDOW"
echo "Target: $SLO_TARGET%"
echo "Actual: $AVAILABILITY%"

# One percentage point of availability equals 25920 s of a 30-day window
# (30 * 24 * 3600 / 100), so the gap to the target converts to equivalent seconds.
if (( $(echo "$AVAILABILITY >= $SLO_TARGET" | bc -l) )); then
  echo "Status: ✅ COMPLIANT"
  echo "Budget remaining: $(echo "($AVAILABILITY - $SLO_TARGET) * 25920" | bc -l) seconds"
else
  echo "Status: ❌ BUDGET EXHAUSTED"
  echo "Overspent: $(echo "($SLO_TARGET - $AVAILABILITY) * 25920" | bc -l) seconds"
fi

Summary

SLIs measure service performance, SLOs set targets, and error budgets balance reliability with innovation. Together they form the foundation of SRE practice.

Key takeaways:

  • SLI: specific metric (latency, availability, throughput)
  • SLO: target value for the SLI over a time window
  • Error budget: (1 - SLO) × total time = allowed downtime
  • 99.9% SLO = about 43 minutes of downtime per 30-day month
  • Error budget policy: warning at 50%, freeze at 75%, emergency at 100%
  • Instrument with Prometheus client libraries
  • Calculate SLIs with PromQL queries
  • Generate compliance reports to track SLO achievement

What's Next: Incident Response

The next chapter covers incident response processes and runbooks.
