Prometheus — Metrics Collection and Alerting
Why Prometheus Matters
Prometheus is the industry standard for metrics collection and alerting in cloud-native environments. It is a CNCF graduated project and the most widely used monitoring system for Kubernetes. Understanding Prometheus is essential for any DevOps, SRE, or platform engineer.
Why this matters for your career:
- Prometheus is the de facto standard metrics system for Kubernetes
- PromQL is a skill that transfers to Thanos, Cortex, and Grafana Cloud
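- Prometheus has its own certification, the Prometheus Certified Associate (PCA), and monitoring skills are useful background for Kubernetes certifications such as CKA and CKAD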
- Time-series monitoring is fundamental to production operations
What Is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit. It collects metrics from configured targets at regular intervals, evaluates rule expressions, displays results, and triggers alerts when conditions are met.
Core Features
| Feature | Description |
|---------|-------------|
| Pull model | Prometheus scrapes metrics from HTTP endpoints (no push) |
| Time-series DB | Stores metrics with timestamps in an efficient format |
| PromQL | Powerful query language for aggregation and analysis |
| AlertManager | Handles alert deduplication, grouping, and routing |
| Service discovery | Auto-discovers targets in Kubernetes, Consul, EC2 |
| Multi-dimensional | Labels provide multiple dimensions for data |
Architecture
Service (exposes /metrics) ◄── Prometheus (scrape every 15s)
                                        │
                                  PromQL queries
                                        │
                              ┌─────────┴─────────┐
                              ▼                   ▼
                           Grafana           AlertManager
                         (dashboards)  (Slack, Email, PagerDuty)
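To make the pull model concrete, here is a minimal sketch of the plain-text body a service might return from /metrics when Prometheus scrapes it. The helper name `render_metric` and the sample values are hypothetical, and label-value escaping is left out; real services normally rely on a client library rather than formatting this by hand.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are not
    escaped here, which a real client library would take care of.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical samples for illustration only
body = render_metric(
    "http_requests_total",
    "Total HTTP requests",
    "counter",
    [
        ({"method": "GET", "status": "200"}, 42),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(body)
```

Each line of the body is one time series: a metric name, a set of labels, and the current value. Prometheus attaches the timestamp at scrape time.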
Installation
With Docker Compose
version: "3.9"
services:
  prometheus:
    image: prom/prometheus:v2.50.1  # pin a full release tag
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
volumes:
  prometheus_data:
Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alerts.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'my-app'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: my-app
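The `keep` relabel action in the `my-app` job can be hard to picture, so the following is a rough sketch of the filtering it performs on discovered pods. The function `keep_targets` and the pod addresses are hypothetical. Prometheus anchors relabel regexes at both ends, which is approximated here with `re.fullmatch`, although Prometheus uses RE2 syntax rather than Python's `re`.

```python
import re


def keep_targets(targets, source_label, regex):
    """Keep only targets whose source label fully matches the regex.

    A target without the label is treated as having an empty value,
    so it only survives if the regex matches the empty string.
    """
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Hypothetical pods as service discovery might present them
discovered = [
    {"__address__": "10.0.0.1:8000", "__meta_kubernetes_pod_label_app": "my-app"},
    {"__address__": "10.0.0.2:8000", "__meta_kubernetes_pod_label_app": "my-app-canary"},
    {"__address__": "10.0.0.3:8000"},
]

kept = keep_targets(discovered, "__meta_kubernetes_pod_label_app", "my-app")
print([t["__address__"] for t in kept])
```

Because the match is anchored, `my-app-canary` is dropped; a regex such as `my-app.*` would be needed to keep it.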
Metric Types
| Type | Description | Example |
|------|-------------|--------|
| Counter | Monotonically increasing value | http_requests_total |
| Gauge | Value that can go up or down | memory_usage_bytes, cpu_usage_percent |
| Histogram | Observes values in configurable buckets | http_request_duration_seconds |
| Summary | Similar to histogram but calculates quantiles client-side | rpc_duration_seconds |
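The histogram row is the least intuitive of the four, so here is a toy sketch of how observations end up in cumulative buckets. The class `SketchHistogram`, the bucket bounds, and the latencies are assumptions for illustration; this is not the client library's implementation.

```python
import math


class SketchHistogram:
    """Toy histogram with cumulative buckets, mirroring _bucket/_sum/_count."""

    def __init__(self, upper_bounds):
        # The +Inf bucket always exists and counts every observation.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1
        # Cumulative: increment every bucket whose upper bound covers the value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1


h = SketchHistogram([0.1, 0.5, 1.0])
for latency in [0.05, 0.3, 0.3, 0.7, 2.5]:
    h.observe(latency)

print(list(zip(h.bounds, h.bucket_counts)))
```

Each bucket holds the number of observations less than or equal to its bound, which is why the counts never decrease from one bucket to the next and the +Inf bucket equals the total count.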
Instrumenting Your Application
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'path', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request latency', ['method', 'path'])
IN_PROGRESS = Gauge('http_requests_in_progress', 'Requests currently being processed')

# Use in an endpoint
def handle_request(method, path):
    IN_PROGRESS.inc()
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    try:
        # Process request
        status = 200
    except Exception:
        status = 500
        raise  # re-raise so the caller still sees the failure
    finally:
        # Record metrics whether the request succeeded or failed
        duration = time.perf_counter() - start
        REQUEST_COUNT.labels(method=method, path=path, status=status).inc()
        REQUEST_DURATION.labels(method=method, path=path).observe(duration)
        IN_PROGRESS.dec()

# Start metrics server on port 8000 (serves /metrics from a background thread)
start_http_server(8000)
PromQL Examples
Basic Queries
# All time-series
http_requests_total
# With label filter
http_requests_total{method="POST", status="200"}
# Regex matching
http_requests_total{method=~"GET|POST", path=~"/api/.*"}
Rate and Increase
# Per-second rate (last 5 min)
rate(http_requests_total[5m])
# Total increase in the last hour
increase(http_requests_total[1h])
# Per-second rate by path
sum by(path) (rate(http_requests_total[5m]))
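To show roughly what `rate()` and `increase()` compute, here is a simplified sketch over raw counter samples. The function names and the sample window are hypothetical. Prometheus also extrapolates to the edges of the range window, which this sketch leaves out, so real results will differ slightly.

```python
def sketch_increase(samples):
    """Approximate increase() over a list of (timestamp, value) counter samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the value after the reset is counted as new growth.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def sketch_rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return sketch_increase(samples) / elapsed


# Hypothetical 15s scrapes; the counter resets between t=30 and t=45
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]

print(sketch_increase(window))
print(sketch_rate(window))
```

The reset handling is why counters should always be queried through `rate()` or `increase()` rather than by subtracting raw values.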
Aggregation
# Sum by service
sum by(service) (rate(http_requests_total[5m]))
# Average across instances
avg by(job) (rate(http_requests_total[5m]))
# Top 5 endpoints (by request rate)
topk(5, sum by(path) (rate(http_requests_total[5m])))
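The aggregation operators above can be pictured as a group-by followed by a sort. The sketch below mimics `sum by(path)` and `topk` on a handful of made-up per-instance rates; the helper names and values are assumptions for illustration.

```python
from collections import defaultdict


def sum_by(series, label):
    """Group (labels, value) pairs by one label and sum the values."""
    grouped = defaultdict(float)
    for labels, value in series:
        grouped[labels.get(label, "")] += value
    return dict(grouped)


def topk(k, grouped):
    """Return the k largest (label_value, total) pairs, largest first."""
    return sorted(grouped.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical per-instance request rates (requests per second)
rates = [
    ({"path": "/api/users", "instance": "a"}, 4.0),
    ({"path": "/api/users", "instance": "b"}, 6.0),
    ({"path": "/api/orders", "instance": "a"}, 2.5),
    ({"path": "/health", "instance": "a"}, 0.5),
]

by_path = sum_by(rates, "path")
print(topk(2, by_path))
```

Note that every label other than `path` disappears from the result, which is also what happens in PromQL when aggregating with `by`.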
Latency Percentiles
# P50 latency
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# P99 latency by path
histogram_quantile(0.99, sum by(le, path) (rate(http_request_duration_seconds_bucket[5m])))
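`histogram_quantile()` estimates a percentile from bucket counts rather than from raw observations. The following sketch shows the linear interpolation idea under simplifying assumptions: the function name and bucket data are hypothetical, the lowest bucket is assumed to start at 0, and edge cases handled by Prometheus are skipped.

```python
import math


def sketch_histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    bound and ending with the +Inf bucket. Requires 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies beyond the last finite bound; report that bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative bucket counts for request latency in seconds
buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]

p50 = sketch_histogram_quantile(0.50, buckets)
p95 = sketch_histogram_quantile(0.95, buckets)
print(p50, p95)
```

The P95 here is capped at 1.0 because it falls into the +Inf bucket, which illustrates why bucket boundaries should cover the latencies you care about.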
AlertManager
Alert Rules
# alerts.yml
groups:
  - name: my-app
    rules:
      - alert: HighErrorRate
        # Share of 5xx responses among all responses, aggregated per job
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 2s on {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 2s)"
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
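The `for:` clause in these rules means an alert must stay true across consecutive evaluations before it fires. The sketch below models that as a small state machine using the inactive, pending, and firing states; the function name and evaluation timeline are hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Return (timestamp, state) for each rule evaluation.

    evaluations is a list of (timestamp, condition_is_true) pairs.
    The alert is pending while the condition holds, fires once it has
    held for at least for_seconds, and resets as soon as it is false.
    """
    active_since = None
    states = []
    for ts, is_true in evaluations:
        if not is_true:
            active_since = None
            states.append((ts, "inactive"))
            continue
        if active_since is None:
            active_since = ts
        state = "firing" if ts - active_since >= for_seconds else "pending"
        states.append((ts, state))
    return states


# Hypothetical evaluations every 15s with a brief recovery at t=45
evaluations = [
    (0, False), (15, True), (30, True), (45, False),
    (60, True), (75, True), (90, True), (105, True), (120, True),
]

for ts, state in alert_states(evaluations, for_seconds=60):
    print(ts, state)
```

The recovery at t=45 resets the timer, so the alert only fires at t=120. This is how `for:` filters out short spikes at the cost of a delayed notification.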
AlertManager Config
# alertmanager.yml
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      repeat_interval: 5m
    - match:
        severity: warning
      receiver: slack
receivers:
  - name: default
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
  - name: pagerduty
    pagerduty_configs:
      - routing_key: 'your-pagerduty-key'
  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#warnings'
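The routing tree above can be read as "first matching child route wins, otherwise use the default receiver". Here is a simplified sketch of that lookup. The function `pick_receiver` is hypothetical, and it ignores nested routes, `continue`, grouping, and timing settings that the real AlertManager also applies.

```python
def pick_receiver(alert_labels, routes, default_receiver):
    """Return the receiver of the first route whose match labels all equal
    the alert's labels, or the default receiver if no route matches."""
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default_receiver


# Mirrors the two child routes in the config above
routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty"},
    {"match": {"severity": "warning"}, "receiver": "slack"},
]

print(pick_receiver({"alertname": "InstanceDown", "severity": "critical"}, routes, "default"))
print(pick_receiver({"alertname": "HighLatency", "severity": "warning"}, routes, "default"))
print(pick_receiver({"alertname": "DiskInfo", "severity": "info"}, routes, "default"))
```

An alert with no matching severity falls through to the `default` receiver, so the top-level route acts as a catch-all.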
Best Practices
| Practice | Reason |
|----------|--------|
| Use the RED method | Rate, Errors, Duration: three key metrics for request-driven services |
| Use the USE method | Utilization, Saturation, Errors — for infrastructure |
| Use labels consistently | service, namespace, environment — standardize |
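| Keep label cardinality low | Each unique label combination is a separate time series; unbounded values (user IDs, raw URLs) inflate memory use and slow queries |
| Set appropriate scrape intervals | Prometheus defaults to 1m; 15s (as configured above) is common. Adjust based on data volume |
| Set retention time | Balance storage cost vs. historical analysis |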
| Use recording rules for expensive queries | Pre-compute complex PromQL |
| Alert on symptoms, not causes | Alert on error rate, not CPU spike |
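The cardinality advice is easier to appreciate with numbers. This sketch estimates the worst-case number of time series for one metric as the product of distinct values per label; the function name and label counts are made up for illustration, and real series counts are usually lower because not every combination occurs.

```python
import math


def max_series_count(distinct_values_per_label):
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return math.prod(distinct_values_per_label.values())


# Hypothetical label sets for http_requests_total
bounded = {"method": 4, "path": 20, "status": 5}
with_user_id = {**bounded, "user_id": 10_000}

print(max_series_count(bounded))
print(max_series_count(with_user_id))
```

Adding a single unbounded label multiplies the series count by the number of values it can take, which is why identifiers such as user IDs belong in logs or traces rather than in metric labels.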
Summary
Prometheus is the industry standard for metrics collection in cloud-native environments. Its pull model, multi-dimensional data model, and powerful PromQL make it the monitoring backbone for Kubernetes. AlertManager handles notification routing and deduplication.
Key takeaways:
- Prometheus scrapes metrics from HTTP endpoints (pull model)
- Metric types: Counter (increasing), Gauge (variable), Histogram (bucketed), Summary (quantiles)
- PromQL: rate(), increase(), histogram_quantile(), sum by(), topk()
- AlertManager deduplicates, groups, and routes alerts to Slack, PagerDuty, email
- RED method: Rate, Errors, Duration for service monitoring
- USE method: Utilization, Saturation, Errors for resource monitoring
- Keep label cardinality low to prevent performance issues
- Use recording rules to pre-compute expensive queries
What's Next: Grafana
The next chapter covers Grafana — creating dashboards, configuring data sources, alerting, and building operational visibility.