Prometheus — Metrics Collection and Alerting

Why Prometheus Matters

Prometheus is the industry standard for metrics collection and alerting in cloud-native environments. It is a CNCF graduated project and the most widely used monitoring system for Kubernetes. Understanding Prometheus is essential for any DevOps, SRE, or platform engineer.

Why this matters for your career:

  • Prometheus is the de facto standard metrics system for Kubernetes, and Kubernetes components such as the API server and kubelet expose their metrics in its format
  • PromQL is a skill that transfers to Thanos, Cortex, and Grafana Cloud
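  • Prometheus has its own certification, the Prometheus Certified Associate (PCA), and monitoring fundamentals are useful background for Kubernetes and cloud certification exams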
  • Time-series monitoring is fundamental to production operations

What Is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit. It collects metrics from configured targets at regular intervals, evaluates rule expressions, displays results, and triggers alerts when conditions are met.

Core Features

| Feature | Description |
|---------|-------------|
| Pull model | Prometheus scrapes metrics from HTTP endpoints; short-lived jobs can push through the Pushgateway |
| Time-series DB | Stores metrics with timestamps in an efficient format |
| PromQL | Powerful query language for aggregation and analysis |
| AlertManager | Handles alert deduplication, grouping, and routing |
| Service discovery | Auto-discovers targets in Kubernetes, Consul, EC2 |
| Multi-dimensional | Labels provide multiple dimensions for data |

Architecture

Service (exposes /metrics) ◄── Prometheus (scrape every 15s)
                                                        │
                                                    PromQL queries
                                                        │
                                              ┌─────────┴─────────┐
                                              ▼                   ▼
                                         Grafana           AlertManager
                                        (dashboards)      (Slack, Email, PagerDuty)
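To make the pull model concrete, here is a minimal sketch of what a scrape exchanges: the service returns plain-text samples and the scraper parses them. It uses only the standard library and a simplified form of the text exposition format (no HELP/TYPE lines, escaping, or timestamps). The helpers `render_metrics` and `parse_metrics` are hypothetical names for illustration, not part of Prometheus or any client library.

```python
import re

# Hypothetical helper: renders samples in a simplified form of the
# Prometheus text exposition format (no HELP/TYPE lines, no escaping).
def render_metrics(samples):
    lines = []
    for name, labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

# Hypothetical helper: parses the simplified format back into
# (name, labels, value) tuples, skipping comments and blank lines.
def parse_metrics(text):
    parsed = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        parsed.append((name, labels, float(value)))
    return parsed

samples = [
    ("http_requests_total", {"method": "GET", "status": "200"}, 1027.0),
    ("http_requests_in_progress", {}, 3.0),
]
body = render_metrics(samples)  # what a /metrics endpoint would return
print(body)
```

In a real deployment the client library produces this text and Prometheus does the parsing; the sketch only shows that a scrape is an ordinary HTTP response with one sample per line.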

Installation

With Docker Compose

version: "3.9"
services:
  prometheus:
    image: prom/prometheus:v2.50.1
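    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Mount the alert rules referenced by rule_files in prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus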
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

volumes:
  prometheus_data:

Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

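rule_files:
  - "alerts.yml"

# Send firing alerts to AlertManager. The address below is an assumption;
# point it at wherever your AlertManager instance runs.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']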

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

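  # Kubernetes service discovery needs access to the Kubernetes API, for
  # example Prometheus running in-cluster with permission to list pods.
  - job_name: 'my-app'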
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: my-app
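The `keep` relabel action in the last job can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Prometheus code: `keep_targets` is a hypothetical helper, the source label values are joined with `;` (the default separator), and the regex must match the whole joined value, because Prometheus anchors relabel regexes. Python's `re` syntax differs from the RE2 syntax Prometheus uses, so only simple patterns carry over.

```python
import re

# Hypothetical helper mimicking `action: keep`: a target survives only if
# the joined source label values fully match the regex.
def keep_targets(targets, source_labels, regex, separator=";"):
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):  # anchored match, like Prometheus
            kept.append(labels)
    return kept

# Made-up discovery results for illustration
discovered = [
    {"__meta_kubernetes_pod_label_app": "my-app", "__address__": "10.0.0.5:8000"},
    {"__meta_kubernetes_pod_label_app": "my-app-canary", "__address__": "10.0.0.6:8000"},
    {"__meta_kubernetes_pod_label_app": "billing", "__address__": "10.0.0.7:8000"},
]

kept = keep_targets(discovered, ["__meta_kubernetes_pod_label_app"], "my-app")
print([t["__address__"] for t in kept])
```

Because the match is anchored, the pod labelled `my-app-canary` is dropped; a regex such as `my-app.*` would be needed to keep it.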

Metric Types

| Type | Description | Example |
|------|-------------|---------|
| Counter | Monotonically increasing value | http_requests_total |
| Gauge | Value that can go up or down | memory_usage_bytes, cpu_usage_percent |
| Histogram | Observes values in configurable buckets | http_request_duration_seconds |
| Summary | Similar to histogram but calculates quantiles client-side | rpc_duration_seconds |
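The histogram type is the least intuitive of the four, so a small sketch may help. Each bucket counts every observation less than or equal to its upper bound, which makes the counts cumulative, and a final `+Inf` bucket counts everything. `SketchHistogram` is a hypothetical class for illustration; real client libraries handle this internally and expose the `_bucket`, `_sum`, and `_count` series for you.

```python
import math

class SketchHistogram:
    """Illustrative cumulative-bucket histogram (not a client library class)."""

    def __init__(self, upper_bounds):
        # The +Inf bucket always exists and counts every observation
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total = 0.0   # corresponds to the _sum series
        self.count = 0     # corresponds to the _count series

    def observe(self, value):
        self.total += value
        self.count += 1
        # Cumulative: increment every bucket whose bound covers the value
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

hist = SketchHistogram([0.1, 0.5, 1.0])
for latency in (0.05, 0.3, 0.3, 0.8, 2.0):
    hist.observe(latency)

for bound, count in zip(hist.upper_bounds, hist.bucket_counts):
    print(f'le="{bound}" {count}')
```

The 2.0 second observation lands only in the `+Inf` bucket, which is why bucket boundaries should cover the latencies you care about.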

Instrumenting Your Application

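import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Define metrics. Keep label values bounded: use route templates such as
# "/users/{id}" for `path`, never raw URLs.
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'path', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request latency', ['method', 'path'])
IN_PROGRESS = Gauge('http_requests_in_progress', 'Requests currently being processed')


def handle_request(method, path):
    """Record count, latency, and the in-flight gauge for one request."""
    IN_PROGRESS.inc()
    start = time.perf_counter()
    status = 200
    try:
        # Process the request here (placeholder work)
        time.sleep(random.uniform(0.01, 0.1))
    except Exception:
        status = 500
    finally:
        duration = time.perf_counter() - start
        REQUEST_COUNT.labels(method=method, path=path, status=str(status)).inc()
        REQUEST_DURATION.labels(method=method, path=path).observe(duration)
        IN_PROGRESS.dec()


if __name__ == '__main__':
    # Serve metrics at http://localhost:8000/metrics
    start_http_server(8000)
    # Simulate traffic so the process stays alive and the metrics change
    while True:
        handle_request('GET', '/api/items')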

PromQL Examples

Basic Queries

# All time series for this metric
http_requests_total

# With label filter
http_requests_total{method="POST", status="200"}

# Regex matching
http_requests_total{method=~"GET|POST", path=~"/api/.*"}

Rate and Increase

# Per-second rate (last 5 min)
rate(http_requests_total[5m])

# Total increase in the last hour
increase(http_requests_total[1h])

# Per-second rate by path
sum by(path) (rate(http_requests_total[5m]))
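To show what `increase()` and `rate()` compute from raw counter samples, here is a simplified Python sketch. It handles counter resets (a drop in value is treated as a restart from zero) but deliberately omits the extrapolation to the window boundaries that Prometheus performs, so results will differ slightly from real PromQL. `counter_increase` and `simple_rate` are hypothetical names.

```python
def counter_increase(samples):
    """samples: list of (timestamp_seconds, value), sorted by time."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Counter reset (e.g. process restart): count up from zero
            increase += curr
    return increase

def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return counter_increase(samples) / elapsed

# Made-up samples scraped every 15s; the counter resets before t=45
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(counter_increase(window), simple_rate(window))
```

The reset handling is the reason counters should be queried through `rate()` or `increase()` instead of being graphed directly: the raw value drops to zero on every restart.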

Aggregation

# Sum by service
sum by(service) (rate(http_requests_total[5m]))

# Average across instances
avg by(job) (rate(http_requests_total[5m]))

# Top 5 endpoints (by request rate)
topk(5, sum by(path) (rate(http_requests_total[5m])))
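The aggregation operators are easier to reason about once you see them as a group-by over label sets. The sketch below assumes each series is a (labels, value) pair holding an already computed rate; `sum_by` and `topk` are hypothetical Python stand-ins for the PromQL operators, and the input values are made up.

```python
from collections import defaultdict

def sum_by(label, series):
    """Group series by one label and sum their values."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)

def topk(k, totals):
    """Return the k largest (label_value, total) pairs."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]

# Per-instance request rates (requests per second)
rates = [
    ({"path": "/api/users", "instance": "a"}, 2.0),
    ({"path": "/api/users", "instance": "b"}, 3.0),
    ({"path": "/api/orders", "instance": "a"}, 1.5),
    ({"path": "/healthz", "instance": "a"}, 0.25),
]

by_path = sum_by("path", rates)
print(topk(2, by_path))
```

Note that every label not named in `by(...)` disappears from the result, which is why the `instance` label is gone after aggregation.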

Latency Percentiles

# P50 latency across all series
histogram_quantile(0.50, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))

# P95 latency across all series
histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))

# P99 latency by path
histogram_quantile(0.99, sum by(le, path) (rate(http_request_duration_seconds_bucket[5m])))
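`histogram_quantile()` estimates a quantile by finding the bucket that contains the target rank and interpolating linearly inside it. The following is a sketch of that idea for classic histograms, assuming positive bucket bounds, cumulative counts, and a `+Inf` bucket; it is a simplified reimplementation for illustration, not the Prometheus source.

```python
import math

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), including +Inf."""
    if not 0 < q < 1:
        raise ValueError("this sketch only supports 0 < q < 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Quantile falls in +Inf: report the highest finite bound
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_count = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation within the bucket
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)

# Made-up cumulative bucket counts
latency_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(histogram_quantile(0.5, latency_buckets))
print(histogram_quantile(0.95, latency_buckets))
```

Two consequences follow from the interpolation: the estimate is only as precise as the bucket layout, and any quantile that lands in the `+Inf` bucket is capped at the largest finite bound.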

AlertManager

Alert Rules

# alerts.yml
groups:
  - name: my-app
    rules:
      - alert: HighErrorRate
        expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by(le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 2s on {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 2s)"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
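The `for` clause in these rules means an alert does not fire the moment its expression becomes true. It first goes pending and fires only once the condition has held for the whole duration. A minimal sketch of that state machine, assuming one boolean result per evaluation and a fixed evaluation interval (`alert_states` is a hypothetical helper):

```python
def alert_states(condition_by_eval, for_seconds, eval_interval=15):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None
    for step, condition_true in enumerate(condition_by_eval):
        now = step * eval_interval
        if not condition_true:
            # Any false evaluation resets the timer
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states

# Condition results for eight evaluations, 15s apart, with `for: 1m`
results = [False, True, True, True, True, True, False, True]
print(alert_states(results, for_seconds=60))
```

A single false evaluation resets the timer, so a flapping condition may never fire; that is the trade-off for filtering out short spikes.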

AlertManager Config

# alertmanager.yml
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      repeat_interval: 5m
    - match:
        severity: warning
      receiver: slack

receivers:
  - name: default
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

  - name: pagerduty
    pagerduty_configs:
      - routing_key: 'your-pagerduty-key'

  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#warnings'
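The routing tree above can be read as: walk the child routes in order, descend into the first one whose labels match the alert, and fall back to the parent's receiver when none match. A sketch of that lookup, assuming simple equality matching, a receiver on every route, and no `continue: true` (`pick_receiver` is a hypothetical helper, and the dictionary mirrors the config for illustration):

```python
def pick_receiver(alert_labels, route):
    """Return the receiver of the deepest matching route."""
    for child in route.get("routes", []):
        matchers = child.get("match", {})
        if all(alert_labels.get(k) == v for k, v in matchers.items()):
            # First matching child wins; keep descending from there
            return pick_receiver(alert_labels, child)
    return route["receiver"]

# Mirrors the route section of alertmanager.yml
route = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty"},
        {"match": {"severity": "warning"}, "receiver": "slack"},
    ],
}

for severity in ("critical", "warning", "info"):
    print(severity, "->", pick_receiver({"severity": severity}, route))
```

Alerts with a severity that no child route matches, such as `info` here, land on the top-level `default` receiver, so that receiver should always point somewhere a human will look.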

Best Practices

| Practice | Reason |
|----------|--------|
| Use the RED method | Rate, Errors, Duration: the key signals for request-driven services |
| Use the USE method | Utilization, Saturation, Errors: for infrastructure resources |
| Use labels consistently | Standardize names such as service, namespace, environment |
| Keep label cardinality low | Every unique label combination is a separate time series; high cardinality inflates memory use and slows queries |
| Set appropriate scrape intervals | The built-in default is 1m; 15s is a common choice, adjust based on data volume |
| Set retention time | Balance storage cost against historical analysis |
| Use recording rules for expensive queries | Pre-compute complex PromQL |
| Alert on symptoms, not causes | Alert on error rate, not on a CPU spike |
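The cardinality advice is easiest to appreciate with arithmetic. In the worst case, a metric produces one time series per combination of label values, so the count is the product of the number of values per label. The sketch below uses made-up label sets, and `worst_case_series` is a hypothetical helper; real series counts are usually lower because not every combination occurs.

```python
from math import prod

def worst_case_series(label_values):
    """Upper bound on series for one metric: product of label value counts."""
    return prod(len(values) for values in label_values.values())

# Bounded labels: a manageable number of series
bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "400", "404", "500", "503"],
    "path": [f"/api/route{i}" for i in range(20)],
}
print(worst_case_series(bounded))

# Adding an unbounded label such as a user ID multiplies everything
unbounded = dict(bounded, user_id=range(10_000))
print(worst_case_series(unbounded))
```

A histogram multiplies the figure again, since each label combination carries one series per bucket plus the sum and count series.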

Summary

Prometheus is the industry standard for metrics collection in cloud-native environments. Its pull model, multi-dimensional data model, and powerful PromQL make it the monitoring backbone for Kubernetes. AlertManager handles notification routing and deduplication.

Key takeaways:

  • Prometheus scrapes metrics from HTTP endpoints (pull model)
  • Metric types: Counter (increasing), Gauge (variable), Histogram (bucketed), Summary (quantiles)
  • PromQL: rate(), increase(), histogram_quantile(), sum by(), topk()
  • AlertManager deduplicates, groups, and routes alerts to Slack, PagerDuty, email
  • RED method: Rate, Errors, Duration for service monitoring
  • USE method: Utilization, Saturation, Errors for resource monitoring
  • Keep label cardinality low to prevent performance issues
  • Use recording rules to pre-compute expensive queries

What's Next: Grafana

The next chapter covers Grafana — creating dashboards, configuring data sources, alerting, and building operational visibility.
