Grafana — Dashboards, Alerting, and Visualization

Why Grafana Matters

Grafana is a widely used open-source platform for monitoring and observability. It connects to Prometheus, Loki, Elasticsearch, CloudWatch, and 50+ other data sources, renders their data as dashboards, evaluates alert rules, and gives teams one place to view the health of their infrastructure.

Why this matters for your career:

Grafana is the standard visualization tool for Prometheus and monitoring stacks
Dashboards are how operations teams understand system health at a glance
Grafana skills are expected for SRE, DevOps, and platform engineering roles
Grafana Alerting replaces multiple separate alerting systems with a unified approach

What Is Grafana?

Grafana is an open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts when connected to supported data sources.

Key Features

| Feature | Description |
|---------|-------------|
| Panels | Visualizations: graphs, tables, stats, gauges, heatmaps, bar charts |
| Dashboards | Organized groups of panels for specific views |
| Data sources | Prometheus, Loki, Elasticsearch, CloudWatch, InfluxDB, PostgreSQL, and 50+ |
| Alerting | Unified alert engine with evaluation, routing, and notification |
| Variables | Dynamic filter controls (environment, service, time range) |
| Annotations | Mark events on graphs (deployments, incidents, rollbacks) |
| Folders | Organize dashboards by team, service, or environment |
| Teams | Role-based access control for dashboards and data sources |
| Provisioning | Define dashboards and data sources as code (YAML/JSON) |
| Explore | Ad-hoc query editor for PromQL, LogQL, and other query languages |

Installation

# Docker
docker run -d -p 3000:3000 --name grafana grafana/grafana:10.3

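# Docker Compose (docker-compose.yml)
services:
  grafana:
    image: grafana/grafana:10.3
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      # Local testing only; use a strong password or a secret elsewhere
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel

# Named volume that keeps Grafana's database across container restarts
volumes:
  grafana_data: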

Default login: admin / admin (first login prompts to change password).

Configuring Data Sources

Via UI

1. Log in to Grafana
2. Click Connections → Add new connection
3. Search for your data source (e.g., Prometheus)
4. Enter the URL (e.g., http://prometheus:9090)
5. Click Save & Test

Via Provisioning (as code)

# /etc/grafana/provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false

  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
    editable: false
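
When the same data sources are needed in several environments, the provisioning file can be generated rather than copied by hand. The following is a minimal sketch, assuming a hypothetical `DataSource` helper class and `render_datasources` function (neither is part of Grafana); it only emits the fields used in the file above.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """Hypothetical helper holding the fields used in datasources.yml."""
    name: str
    type: str
    url: str
    is_default: bool = False


def render_datasources(sources: list[DataSource]) -> str:
    """Render a data source provisioning file as YAML text."""
    lines = ["apiVersion: 1", "", "datasources:"]
    for src in sources:
        lines += [
            f"  - name: {src.name}",
            f"    type: {src.type}",
            "    access: proxy",
            f"    url: {src.url}",
            f"    isDefault: {str(src.is_default).lower()}",
            "    editable: false",
        ]
    return "\n".join(lines) + "\n"


sources = [
    DataSource("Prometheus", "prometheus", "http://prometheus:9090", is_default=True),
    DataSource("Loki", "loki", "http://loki:3100"),
    DataSource("Jaeger", "jaeger", "http://jaeger:16686"),
]
provisioning_yaml = render_datasources(sources)
print(provisioning_yaml)
```

The output can be written to the provisioning directory by a deployment script. String templating is enough for these simple values; anything containing special YAML characters would call for a real YAML library.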

Creating Dashboards

Key Panel Types

| Panel Type | Best For |
|-----------|---------|
| Time Series | Line/area graphs showing metrics over time |
| Stat | Single numeric value (current CPU, uptime) |
| Gauge | Value within a range (memory usage 0-100%) |
| Bar Chart | Comparing values across categories |
| Table | Detailed data with sorting and filtering |
| Heatmap | Distribution of values over time (latency heatmap) |
| Logs | Log viewer with highlighting and filtering |
| Traces | Trace visualization (requires Tempo/Jaeger) |

Dashboard Design Best Practices

| Practice | Reason |
|----------|--------|
| Group related metrics together | Logical organization makes dashboards intuitive |
| Use consistent color schemes | Red = bad, Green = good, Yellow = warning |
| Add descriptions to panels | Help other engineers understand what they're seeing |
| Use dashboard variables | Dynamic filtering (env, service, instance) |
| Keep it simple | One dashboard = one concern (service, infrastructure, business) |
| Use templating | Avoid hardcoding label values |
| Add annotations for deployments | Correlate code changes with performance changes |
| Set appropriate time ranges | Default to last 6h, allow custom ranges |

Dashboard Variables

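# Simplified view of the variable list (Dashboard settings → Variables).
# In the dashboard JSON these entries live under templating.list, and refresh
# is numeric: 1 = on dashboard load, 2 = on time range change.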
list:
  - name: namespace
    type: query
    query: label_values(namespace)
    refresh: onDashboardLoad

  - name: service
    type: query
    query: label_values({namespace="$namespace"}, service)
    refresh: onTimeRangeChanged

  - name: pod
    type: query
    query: label_values({namespace="$namespace", service="$service"}, pod)
    refresh: onTimeRangeChanged

This creates cascading filters: select namespace → services update → select service → pods update.
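
The cascading variables above can be sketched as the `templating` section of a dashboard JSON model. This is an illustrative sketch: `query_variable` and `depends_on` are hypothetical helpers, and the plain-string `query` form is an assumption (dashboards exported from the UI usually carry extra fields such as the data source reference).

```python
import json

# Numeric refresh codes used in dashboard JSON.
REFRESH_ON_DASHBOARD_LOAD = 1
REFRESH_ON_TIME_RANGE_CHANGE = 2


def query_variable(name: str, query: str, refresh: int) -> dict:
    """Build one query-type variable entry for templating.list."""
    return {"name": name, "type": "query", "query": query, "refresh": refresh}


def depends_on(variable: dict, other_name: str) -> bool:
    """True if the variable's query references $other_name."""
    return f"${other_name}" in variable["query"]


templating = {
    "list": [
        query_variable(
            "namespace",
            "label_values(namespace)",
            REFRESH_ON_DASHBOARD_LOAD,
        ),
        query_variable(
            "service",
            'label_values({namespace="$namespace"}, service)',
            REFRESH_ON_TIME_RANGE_CHANGE,
        ),
        query_variable(
            "pod",
            'label_values({namespace="$namespace", service="$service"}, pod)',
            REFRESH_ON_TIME_RANGE_CHANGE,
        ),
    ]
}

print(json.dumps({"templating": templating}, indent=2))
```

Each variable references the ones before it through `$name`, which is what makes the filters cascade: changing `namespace` changes the query that populates `service`, and so on down to `pod`.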

Grafana Alerting

Create an Alert Rule

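# Unified alerting (Grafana 8+)
# /etc/grafana/provisioning/alerting/rules.yml
apiVersion: 1
groups:
  - orgId: 1
    name: my-app-alerts
    folder: my-app                      # required: folder that stores the rule group
    interval: 30s
    rules:
      - uid: high_error_rate
        title: "High Error Rate"
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus   # must match the uid of the Prometheus data source
            relativeTimeRange:
              from: 600
              to: 0
            model:
              # Aggregate both sides so the label sets match before dividing
              expr: 'sum by(service) (rate(http_errors_total[5m])) / sum by(service) (rate(http_requests_total[5m])) * 100 > 5'
              instant: true
        # The "> 5" filter returns no series while the error rate is healthy,
        # so the rule reports NoData in that case; use OK to treat it as normal.
        noDataState: NoData
        execErrState: Alerting
        for: 5m
        annotations:
          summary: "Error rate is above 5%"
        labels:
          severity: critical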
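
To make the alert expression concrete, the arithmetic behind it can be worked through by hand. The sketch below is a simplification: it computes a plain average rate between two counter samples, whereas the Prometheus `rate()` function also handles counter resets and extrapolates to the window edges. The sample values and function names are hypothetical.

```python
def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second increase of a counter across the window."""
    return (last - first) / window_seconds


def error_rate_percent(
    errors_first: float,
    errors_last: float,
    requests_first: float,
    requests_last: float,
    window_seconds: float = 300.0,  # matches the [5m] range in the rule
) -> float:
    """Errors as a percentage of requests over the window."""
    error_rate = per_second_rate(errors_first, errors_last, window_seconds)
    request_rate = per_second_rate(requests_first, requests_last, window_seconds)
    if request_rate == 0:
        return 0.0  # no traffic, so no meaningful error percentage
    return error_rate / request_rate * 100


# Hypothetical counter samples taken five minutes apart:
# errors grew by 90, requests grew by 1500.
percent = error_rate_percent(120, 210, 10_000, 11_500)
exceeds_threshold = percent > 5

print(f"error rate: {percent:.2f}%  exceeds 5% threshold: {exceeds_threshold}")
```

With these numbers the error rate is about 6%, so the condition holds. The rule only fires once the condition has held for the full `for: 5m` period.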

Notification Channels

In unified alerting, notification channels are configured as contact points, and notification policies decide which contact point receives each alert based on its labels.

| Channel | When to Use |
|---------|-------------|
| Slack | Team communication, non-urgent |
| Email | Stakeholder notifications, daily reports |
| PagerDuty | On-call escalation for critical alerts |
| Webhook | Custom integrations (Jira, Teams, Discord) |
| Telegram | Simple personal notifications |
| OpsGenie | Incident management integration |
| Pushover | Mobile push notifications |

Annotations

Annotations overlay events on your graphs:

# Via API
# POST /api/annotations
{
  "dashboardUID": "abc123",
  "time": 1700000000000,
  "text": "Deployment v1.2.3 to production",
  "tags": ["deployment", "production"]
}

Automated annotations from CI/CD pipelines help correlate performance changes with deployments.
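
A CI/CD job could post such an annotation with a few lines of Python. This is a minimal sketch using only the standard library; `build_annotation` and `post_annotation` are hypothetical helper names, and the base URL and service account token are assumptions that would come from the pipeline's configuration.

```python
import json
import time
import urllib.request


def build_annotation(text, tags, dashboard_uid=None, when=None) -> dict:
    """Build the JSON body for POST /api/annotations.

    `when` is a Unix timestamp in seconds; Grafana expects milliseconds.
    """
    timestamp = time.time() if when is None else when
    payload = {"time": int(timestamp * 1000), "text": text, "tags": list(tags)}
    if dashboard_uid is not None:
        payload["dashboardUID"] = dashboard_uid
    return payload


def post_annotation(base_url: str, token: str, payload: dict, timeout: float = 10.0) -> int:
    """Send the annotation to Grafana and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


payload = build_annotation(
    "Deployment v1.2.3 to production",
    ["deployment", "production"],
    dashboard_uid="abc123",
    when=1_700_000_000,
)
print(json.dumps(payload, indent=2))

# In a pipeline step (the URL and token are placeholders):
# post_annotation("http://grafana:3000", "<service-account-token>", payload)
```

Leaving out `dashboard_uid` creates an organization-wide annotation, which dashboards can then pick up by tag.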

Provisioning Dashboards as Code

Grafana loads dashboard JSON through a dashboard provider: a YAML file in /etc/grafana/provisioning/dashboards/ whose options.path names the directory to scan for JSON files. The example below assumes a provider that points at this same directory.

// /etc/grafana/provisioning/dashboards/my-app.json
{
  "title": "My App Overview",
  "tags": ["my-app", "production"],
  "timezone": "browser",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [{
        "expr": "sum by(service) (rate(http_requests_total[5m]))",
        "legendFormat": "{{service}}"
      }]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [{
        "expr": "histogram_quantile(0.99, sum by(le, service) (rate(http_request_duration_seconds_bucket[5m])))",
        "legendFormat": "{{service}}"
      }]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 8 },
      "targets": [{
        "expr": "sum by(service) (rate(http_errors_total[5m])) / sum by(service) (rate(http_requests_total[5m])) * 100",
        "legendFormat": "{{service}}"
      }]
    }
  ]
}
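
The `gridPos` values follow a simple rule: the dashboard grid is 24 units wide, `x` and `w` are measured in those units, and `y` grows downward. The sketch below generates a layout like the one above; `timeseries_panel` and `layout_panels` are hypothetical helpers, and unlike the hand-written JSON every panel here gets the same width.

```python
import json

GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 units wide


def timeseries_panel(title: str, expr: str) -> dict:
    """Build a time series panel with a single PromQL target."""
    return {
        "title": title,
        "type": "timeseries",
        "targets": [{"expr": expr, "legendFormat": "{{service}}"}],
    }


def layout_panels(panels: list[dict], per_row: int = 2, height: int = 8) -> list[dict]:
    """Assign gridPos to each panel, filling rows left to right."""
    width = GRID_COLUMNS // per_row
    placed = []
    for index, panel in enumerate(panels):
        row, col = divmod(index, per_row)
        grid_pos = {"h": height, "w": width, "x": col * width, "y": row * height}
        placed.append({**panel, "gridPos": grid_pos})
    return placed


dashboard = {
    "title": "My App Overview",
    "tags": ["my-app", "production"],
    "timezone": "browser",
    "panels": layout_panels([
        timeseries_panel(
            "Request Rate",
            "sum by(service) (rate(http_requests_total[5m]))",
        ),
        timeseries_panel(
            "P99 Latency",
            "histogram_quantile(0.99, sum by(le, service) "
            "(rate(http_request_duration_seconds_bucket[5m])))",
        ),
    ]),
}

print(json.dumps(dashboard, indent=2))
```

Generating the JSON keeps layouts consistent across dashboards and makes reviews in Git shorter, at the cost of one more script to maintain.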

Best Practices

| Practice | Reason |
|----------|--------|
| Provision everything as code | Dashboards in Git = version controlled, auditable |
| Use dashboard variables | One dashboard per service, filter by environment |
| Set up alerting on SLOs | Alert on error budget burn rate, not raw thresholds |
| Add annotations for deployments | Correlate code changes with metric changes |
| Use Explore for ad-hoc queries | Don't clutter dashboards with one-off queries |
| Limit dashboard refresh rate | 15-30s is enough; faster causes unnecessary load |
| Organize with tags and folders | Engineers can find relevant dashboards easily |
| Set appropriate permissions | Read-only for viewers, edit for operators |
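
The "error budget burn rate" mentioned in the table can be illustrated with a short calculation. This sketch uses the common definition (observed error ratio divided by the error ratio the SLO allows); the 99.9% target, the 30-day period, and the function names are assumptions chosen for the example.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO period.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget spent during the window."""
    return rate * window_hours / period_hours


# Assumed SLO: 99.9% success over 30 days; observed: 1.44% of requests failing.
rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
spent_in_one_hour = budget_consumed(rate, window_hours=1)

print(f"burn rate: {rate:.1f}x, budget spent in 1h: {spent_in_one_hour:.1%}")
```

Alerting on burn rate ties the threshold to the SLO: a brief spike that barely touches the budget stays quiet, while a sustained burn that would exhaust it in days pages someone.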

Summary

Grafana provides a unified visualization platform for all your observability data. Connect Prometheus, Loki, and other data sources, build dashboards with variables and annotations, set up alerts, and provision everything as code.

Key takeaways:

Grafana connects to 50+ data sources (Prometheus, Loki, CloudWatch, etc.)
Panel types: Time Series, Stat, Gauge, Bar Chart, Table, Heatmap, Logs, Traces
Dashboard variables provide dynamic filtering (env, service, pod)
Unified alerting evaluates rules and routes notifications
Annotations correlate events (deployments) with metric changes
Provision dashboards and data sources as code (YAML/JSON)
Use consistent layouts and color schemes for readability
One dashboard per concern: service overview, infrastructure, business metrics

What's Next: Loki — Log Aggregation

The next chapter covers Grafana Loki — log aggregation for Kubernetes, LogQL queries, and integrating logs with metrics in Grafana.