Prometheus — Metrics Collection and Alerting

Why Prometheus Matters

Prometheus is the industry standard for metrics collection and alerting in cloud-native environments. It is a CNCF graduated project and the most widely used monitoring system for Kubernetes. Understanding Prometheus is essential for any DevOps, SRE, or platform engineer.

Why this matters for your career:

Prometheus is the de facto standard metrics system for Kubernetes
PromQL skills transfer to Prometheus-compatible systems such as Thanos, Cortex, and Grafana Cloud
Prometheus is covered directly by the Prometheus Certified Associate (PCA) exam, and monitoring fundamentals come up in Kubernetes and cloud certifications
Time-series monitoring is fundamental to production operations

What Is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit. It collects metrics from configured targets at regular intervals, evaluates rule expressions, displays results, and triggers alerts when conditions are met.

Core Features

| Feature | Description |
|---------|-------------|
| Pull model | Prometheus scrapes metrics from HTTP endpoints instead of having services push them |
| Time-series DB | Stores metrics with timestamps in an efficient format |
| PromQL | Powerful query language for aggregation and analysis |
| Alertmanager | Handles alert deduplication, grouping, and routing |
| Service discovery | Auto-discovers targets in Kubernetes, Consul, EC2 |
| Multi-dimensional | Labels provide multiple dimensions for data |

Architecture

Service (exposes /metrics) ◄── Prometheus (scrape every 15s)
                                    │
                    ┌───────────────┴───────────────┐
             PromQL queries                   firing alerts
                    ▼                               ▼
                 Grafana                      Alertmanager
              (dashboards)              (Slack, Email, PagerDuty)
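
To make the pull model concrete, the sketch below renders the kind of plain-text payload a service returns from /metrics when Prometheus scrapes it. It is a simplified illustration rather than the client library's implementation: `render_metric`, `escape_label_value`, and the sample values are hypothetical names and data chosen for this example.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render HELP/TYPE metadata plus one line per (labels, value) sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        pairs = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
        lines.append(f"{name}{{{pairs}}} {value}" if pairs else f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data for illustration
body = render_metric(
    "http_requests_total", "Total HTTP requests", "counter",
    [({"method": "GET", "status": "200"}, 1027),
     ({"method": "POST", "status": "500"}, 3)],
)
print(body)
```

Each line is one time series: a metric name, a set of labels, and the current value. Prometheus attaches the timestamp itself at scrape time.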

Installation

With Docker Compose

version: "3.9"
services:
  prometheus:
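    image: prom/prometheus:v2.50.1
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Mount the rule file referenced by rule_files in prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus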
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

volumes:
  prometheus_data:

Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

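rule_files:
  - "alerts.yml"

# Where firing alerts are sent; 'alertmanager:9093' is an assumed address
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']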

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'my-app'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: my-app
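
The `keep` action in the last job can be read as a filter over discovered targets. The following is a minimal sketch of that logic, assuming the default `;` separator; `keep_target` and the pod data are hypothetical, and Python's `re` module stands in for the RE2 syntax Prometheus actually uses.

```python
import re


def keep_target(labels, source_labels, regex, separator=";"):
    """Return True if the target survives a relabel rule with action: keep."""
    # Join the source label values; a missing label counts as an empty string
    value = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored, so the whole value has to match
    return re.fullmatch(regex, value) is not None


# Hypothetical discovered pods
pods = [
    {"__meta_kubernetes_pod_label_app": "my-app", "__meta_kubernetes_pod_name": "my-app-7d9f"},
    {"__meta_kubernetes_pod_label_app": "my-app-canary", "__meta_kubernetes_pod_name": "canary-1"},
    {"__meta_kubernetes_pod_name": "unlabelled"},
]
kept = [p["__meta_kubernetes_pod_name"] for p in pods
        if keep_target(p, ["__meta_kubernetes_pod_label_app"], "my-app")]
print(kept)
```

Because the match is anchored, a pod labelled `my-app-canary` is dropped; a regex such as `my-app.*` would be needed to include it.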

Metric Types

| Type | Description | Example |
|------|-------------|---------|
| Counter | Monotonically increasing value | http_requests_total |
| Gauge | Value that can go up or down | memory_usage_bytes, cpu_usage_percent |
| Histogram | Observes values in configurable buckets | http_request_duration_seconds |
| Summary | Similar to histogram but calculates quantiles client-side | rpc_duration_seconds |
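
Histograms are the least intuitive of the four types because their buckets are cumulative: each bucket counts every observation less than or equal to its upper bound `le`. A rough sketch of that bookkeeping follows; `SketchHistogram` and the bucket bounds are hypothetical and not taken from any client library.

```python
import math


class SketchHistogram:
    """Toy histogram that mimics the cumulative bucket layout Prometheus exposes."""

    def __init__(self, bounds):
        # An implicit +Inf bucket always exists and ends up equal to the total count
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                # Cumulative: increment every bucket whose bound covers the value
                self.bucket_counts[i] += 1


h = SketchHistogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 2.0):
    h.observe(seconds)
print(list(zip(h.bounds, h.bucket_counts)))
```

A real histogram exposes these as `_bucket{le="..."}`, `_sum`, and `_count` series, which is why one histogram costs several time series per label combination.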

Instrumenting Your Application

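import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Define metrics once at module level; label names are fixed at definition time
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'path', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request latency', ['method', 'path'])
IN_PROGRESS = Gauge('http_requests_in_progress', 'Requests currently being processed')

# Use in an endpoint
def handle_request(method, path):
    IN_PROGRESS.inc()
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    status = 200
    try:
        pass  # Process the request here
    except Exception:
        # Record the failure; a bare "except:" would also swallow KeyboardInterrupt
        status = 500
    finally:
        duration = time.perf_counter() - start
        REQUEST_COUNT.labels(method=method, path=path, status=str(status)).inc()
        REQUEST_DURATION.labels(method=method, path=path).observe(duration)
        IN_PROGRESS.dec()

if __name__ == '__main__':
    # Serve /metrics on port 8000 from a background thread
    start_http_server(8000)
    # Keep the process alive; a real application would run its own server loop here
    while True:
        handle_request('GET', '/')
        time.sleep(1)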

PromQL Examples

Basic Queries

# All time series for this metric
http_requests_total

# With label filter
http_requests_total{method="POST", status="200"}

# Regex matching
http_requests_total{method=~"GET|POST", path=~"/api/.*"}
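
One detail that often surprises newcomers is that regex matchers are fully anchored: `path=~"/api/.*"` matches `/api/users` but not `/v2/api/users`. The sketch below models the four matcher operators under that rule. `matches` and the sample label sets are hypothetical, and Python's `re` is only an approximation of the RE2 syntax PromQL uses.

```python
import re


def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, pattern) matcher."""
    for name, op, pattern in matchers:
        value = labels.get(name, "")  # a missing label behaves like an empty string
        if op == "=":
            ok = value == pattern
        elif op == "!=":
            ok = value != pattern
        elif op == "=~":
            ok = re.fullmatch(pattern, value) is not None
        elif op == "!~":
            ok = re.fullmatch(pattern, value) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True


selector = [("method", "=~", "GET|POST"), ("path", "=~", "/api/.*")]
print(matches({"method": "GET", "path": "/api/users"}, selector))
print(matches({"method": "GET", "path": "/v2/api/users"}, selector))
```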

Rate and Increase

# Per-second rate (last 5 min)
rate(http_requests_total[5m])

# Total increase in the last hour
increase(http_requests_total[1h])

# Per-second rate by path
sum by(path) (rate(http_requests_total[5m]))
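
To see why `rate()` and `increase()` are safe to use on counters that reset when a process restarts, consider the simplified sketch below. It assumes samples arrive as `(timestamp, value)` pairs in time order and deliberately skips the extrapolation to the window boundaries that Prometheus performs, so its results only approximate the real functions. The function names mirror PromQL for readability; the sample data is made up.

```python
def increase(samples):
    """Total counter growth across the samples, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the growth
        total += (cur - prev) if cur >= prev else cur
    return total


def rate(samples):
    """Average per-second growth between the first and last sample."""
    if len(samples) < 2:
        return None  # not enough data points in the window
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Hypothetical scrapes every 15s; the process restarts between t=15 and t=30
samples = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(increase(samples), rate(samples))
```

The requirement for at least two samples is one reason the range in `rate(...[5m])` is usually chosen to be several times the scrape interval.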

Aggregation

# Sum by service
sum by(service) (rate(http_requests_total[5m]))

# Average across instances
avg by(job) (rate(http_requests_total[5m]))

# Top 5 endpoints (by request rate)
topk(5, sum by(path) (rate(http_requests_total[5m])))
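
Conceptually, `sum by(path)` groups series by one label and adds their values, and `topk` keeps the largest results. A small sketch of that idea over an in-memory list follows; `sum_by`, `topk`, and the per-instance rates are hypothetical stand-ins rather than how the query engine is implemented.

```python
from collections import defaultdict


def sum_by(series, label):
    """Group (labels, value) pairs by one label and sum the values per group."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)


def topk(k, totals):
    """Return the k groups with the largest values, biggest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical per-instance request rates (requests per second)
series = [
    ({"path": "/api/users", "instance": "a"}, 12.0),
    ({"path": "/api/users", "instance": "b"}, 8.0),
    ({"path": "/api/orders", "instance": "a"}, 5.0),
    ({"path": "/health", "instance": "a"}, 0.5),
]
print(topk(2, sum_by(series, "path")))
```

Note that the `instance` label disappears from the result: aggregation keeps only the labels named in `by(...)`.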

Latency Percentiles

# P50 latency across all series (aggregate the buckets by le first)
histogram_quantile(0.50, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))

# P95 latency across all series
histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))

# P99 latency by path
histogram_quantile(0.99, sum by(le, path) (rate(http_request_duration_seconds_bucket[5m])))
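
The quantiles returned by `histogram_quantile()` are estimates: the function finds the bucket containing the requested rank and interpolates linearly inside it. The sketch below illustrates that idea under simplifying assumptions (positive bucket bounds, at least one finite bucket, 0 < q < 1). It is not the Prometheus source, and the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (le, cumulative_count) pairs that include +Inf."""
    if not 0 < q < 1:
        raise ValueError("this sketch only handles 0 < q < 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the requested rank
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound
        return buckets[-2][0]
    lower, below = (0.0, 0) if index == 0 else buckets[index - 1]
    # Assume observations are spread evenly within the bucket
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative bucket counts
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

Accuracy therefore depends on bucket boundaries: a P99 that falls into the +Inf bucket is reported as the largest finite bound, so buckets should bracket the latencies that matter for alerting.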

Alertmanager

Alert Rules

# alerts.yml
groups:
  - name: my-app
    rules:
      - alert: HighErrorRate
        expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by(le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 2s on {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 2s)"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
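
The `for:` clause means an alert does not fire the first time its expression is true. It stays pending until the condition has held continuously for the given duration. A minimal sketch of that state machine, assuming one evaluation every 15 seconds, is shown below; `alert_states` and the timeline are hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp, condition_is_true) evaluations to alert states."""
    states = []
    active_since = None
    for timestamp, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the timer
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Hypothetical timeline for InstanceDown (for: 1m), evaluated every 15s
timeline = [(0, False), (15, True), (30, True), (45, True),
            (60, True), (75, True), (90, False)]
print(alert_states(timeline, for_seconds=60))
```

This is why a short `for:` reacts quickly but is noisy, while a long one filters brief blips at the cost of slower notification.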

Alertmanager Config

# alertmanager.yml
route:
  receiver: default
  routes:
    # "matchers" replaces the deprecated "match" field in current Alertmanager releases
    - matchers:
        - severity="critical"
      receiver: pagerduty
      repeat_interval: 5m
    - matchers:
        - severity="warning"
      receiver: slack

receivers:
  - name: default
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

  - name: pagerduty
    pagerduty_configs:
      - routing_key: 'your-pagerduty-key'

  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#warnings'
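
The routing tree above can be read as: try each child route in order, take the first whose labels match, and fall back to the parent's receiver otherwise. The sketch below models just that part, assuming equality matchers represented as a plain dict and ignoring `continue`, grouping, and timing options. `pick_receiver` and the dict layout are hypothetical.

```python
def pick_receiver(route, alert_labels, inherited=None):
    """Walk the routing tree and return the receiver name for an alert."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(alert_labels.get(name) == value for name, value in wanted.items()):
            # The first matching child wins; descend into its own sub-routes
            return pick_receiver(child, alert_labels, receiver)
    # No child matched, so the current node's receiver handles the alert
    return receiver


# The routing tree from the config, expressed as a dict
route = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty"},
        {"match": {"severity": "warning"}, "receiver": "slack"},
    ],
}
for severity in ("critical", "warning", "info"):
    print(severity, "->", pick_receiver(route, {"severity": severity}))
```

An alert with an unexpected severity still lands somewhere, which is the purpose of the top-level `receiver: default`.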

Best Practices

| Practice | Reason |
|----------|--------|
| Use the RED method | Rate, Errors, Duration: three key signals for request-driven services |
| Use the USE method | Utilization, Saturation, Errors: for infrastructure resources |
| Use labels consistently | Standardize names such as service, namespace, and environment |
| Keep label cardinality low | Every unique label combination is a separate time series; high cardinality inflates memory use and slows queries |
| Set appropriate scrape intervals | 15s is a common choice (the built-in default is 1m); adjust based on data volume |
| Set retention time | Balance storage cost against historical analysis |
| Use recording rules for expensive queries | Pre-compute complex PromQL |
| Alert on symptoms, not causes | Alert on user-visible symptoms such as error rate, not on causes such as a CPU spike |
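
The cardinality advice is easier to appreciate with some arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below computes that upper bound. `max_series` and the label values are hypothetical, and real workloads usually produce fewer series because not every combination occurs.

```python
from math import prod


def max_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label values for http_requests_total
labels = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "path": [f"/api/route{i}" for i in range(25)],
    "status": ["200", "400", "404", "500"],
}
baseline = max_series(labels)

# Adding an unbounded label such as a user ID multiplies the total
with_user_id = max_series({**labels, "user_id": range(10_000)})
print(baseline, with_user_id)
```

Going from a few hundred potential series to several million for a single metric is the failure mode the table warns about, so identifiers such as user IDs, request IDs, or raw URLs are better kept in logs or traces.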

Summary

Prometheus is the industry standard for metrics collection in cloud-native environments. Its pull model, multi-dimensional data model, and powerful PromQL make it the monitoring backbone for Kubernetes. Alertmanager handles notification routing and deduplication.

Key takeaways:

Prometheus scrapes metrics from HTTP endpoints (pull model)
Metric types: Counter (increasing), Gauge (variable), Histogram (bucketed), Summary (quantiles)
PromQL: rate(), increase(), histogram_quantile(), sum by(), topk()
Alertmanager deduplicates, groups, and routes alerts to Slack, PagerDuty, email
RED method: Rate, Errors, Duration for service monitoring
USE method: Utilization, Saturation, Errors for resource monitoring
Keep label cardinality low to prevent performance issues
Use recording rules to pre-compute expensive queries

What's Next: Grafana

The next chapter covers Grafana — creating dashboards, configuring data sources, alerting, and building operational visibility.