Grafana — Dashboards, Alerting, and Visualization
Why Grafana Matters
Grafana is a widely used open-source platform for monitoring and observability. It connects to Prometheus, Loki, Elasticsearch, CloudWatch, and 50+ other data sources, renders their data as dashboards, evaluates alert rules, and gives you a single place to view the health of your entire infrastructure.
Why this matters for your career:
- Grafana is the standard visualization tool for Prometheus and monitoring stacks
- Dashboards are how operations teams understand system health at a glance
- Grafana skills are expected for SRE, DevOps, and platform engineering roles
- Grafana Alerting replaces multiple separate alerting systems with a unified approach
What Is Grafana?
Grafana is an open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts when connected to supported data sources.
Key Features
| Feature | Description |
|---------|-------------|
| Panels | Visualizations: graphs, tables, stats, gauges, heatmaps, bar charts |
| Dashboards | Organized groups of panels for specific views |
| Data sources | Prometheus, Loki, Elasticsearch, CloudWatch, InfluxDB, PostgreSQL, and 50+ |
| Alerting | Unified alert engine with evaluation, routing, and notification |
| Variables | Dynamic filter controls (environment, service, time range) |
| Annotations | Mark events on graphs (deployments, incidents, rollbacks) |
| Folders | Organize dashboards by team, service, or environment |
| Teams | Role-based access control for dashboards and data sources |
| Provisioning | Define dashboards and data sources as code (YAML/JSON) |
| Explore | Ad-hoc query editor for PromQL, LogQL, and other query languages |
Installation
# Docker
docker run -d -p 3000:3000 --name grafana grafana/grafana:10.3
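# Docker Compose
services:
  grafana:
    image: grafana/grafana:10.3
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana                      # dashboards, users, plugins
      - ./grafana/provisioning:/etc/grafana/provisioning   # data sources and dashboards as code
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin                   # change for anything beyond local testing
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel

# Named volume referenced above
volumes:
  grafana_data: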
Default login: admin / admin (first login prompts to change password).
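Before configuring anything, it can help to confirm that the container is actually serving requests. The sketch below polls Grafana's `/api/health` endpoint using only the Python standard library. The base URL `http://localhost:3000` matches the port mapping above. The helper names `is_healthy` and `wait_for_grafana` are hypothetical, and the check assumes the response body is JSON with a `database` field set to `ok`.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(body: str) -> bool:
    """Return True if a /api/health response body reports a working database."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("database") == "ok"


def wait_for_grafana(base_url: str = "http://localhost:3000",
                     attempts: int = 10, delay_s: float = 2.0) -> bool:
    """Poll /api/health until it reports healthy or the attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/api/health", timeout=5) as resp:
                if is_healthy(resp.read().decode("utf-8")):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Grafana is not reachable yet; retry after a pause
        time.sleep(delay_s)
    return False
```

Calling `wait_for_grafana()` from a setup script returns `True` once Grafana answers, which is a reasonable gate before running any provisioning or API calls.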
Configuring Data Sources
Via UI
- Log in to Grafana
- Click Connections → Add new connection
- Search for your data source (e.g., Prometheus)
- Enter the URL (e.g., http://prometheus:9090)
- Click Save & Test
Via Provisioning (as code)
# /etc/grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus          # stable UID so alert rules and dashboards can reference it
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
    editable: false
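The same data sources can also be created through Grafana's HTTP API, which helps when a provisioning directory cannot be mounted. The following is a minimal sketch, assuming a service account token with permission to manage data sources. `build_datasource` and `create_datasource` are hypothetical helper names, and the fields mirror the provisioning file above.

```python
import json
import urllib.request


def build_datasource(name: str, ds_type: str, url: str,
                     is_default: bool = False) -> dict:
    """Build the JSON body for POST /api/datasources."""
    return {
        "name": name,
        "type": ds_type,
        "url": url,
        "access": "proxy",  # Grafana's backend makes the requests, not the browser
        "isDefault": is_default,
    }


def create_datasource(base_url: str, token: str, payload: dict) -> int:
    """Send the payload to Grafana and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


# The three data sources from the provisioning file, expressed as API payloads.
DATASOURCES = [
    build_datasource("Prometheus", "prometheus", "http://prometheus:9090", is_default=True),
    build_datasource("Loki", "loki", "http://loki:3100"),
    build_datasource("Jaeger", "jaeger", "http://jaeger:16686"),
]
```

File provisioning is usually the better default because it lives in Git. The API route is mainly useful for one-off setup or for environments managed by scripts.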
Creating Dashboards
Key Panel Types
| Panel Type | Best For |
|-----------|---------|
| Time Series | Line/area graphs showing metrics over time |
| Stat | Single numeric value (current CPU, uptime) |
| Gauge | Value within a range (memory usage 0-100%) |
| Bar Chart | Comparing values across categories |
| Table | Detailed data with sorting and filtering |
| Heatmap | Distribution of values over time (latency heatmap) |
| Logs | Log viewer with highlighting and filtering |
| Traces | Trace visualization (requires Tempo/Jaeger) |
Dashboard Design Best Practices
| Practice | Reason |
|----------|--------|
| Group related metrics together | Logical organization makes dashboards intuitive |
| Use consistent color schemes | Red = bad, Green = good, Yellow = warning |
| Add descriptions to panels | Help other engineers understand what they're seeing |
| Use dashboard variables | Dynamic filtering (env, service, instance) |
| Keep it simple | One dashboard = one concern (service, infrastructure, business) |
| Use templating | Avoid hardcoding label values |
| Add annotations for deployments | Correlate code changes with performance changes |
| Set appropriate time ranges | Default to last 6h, allow custom ranges |
Dashboard Variables
# In dashboard settings (shown as YAML for readability; Grafana stores
# these entries under templating.list in the dashboard JSON)
list:
  - name: namespace
    type: query
    query: label_values(namespace)
    refresh: onDashboardLoad
  - name: service
    type: query
    query: label_values({namespace="$namespace"}, service)
    refresh: onTimeRangeChanged
  - name: pod
    type: query
    query: label_values({namespace="$namespace", service="$service"}, pod)
    refresh: onTimeRangeChanged
This creates cascading filters: select namespace → services update → select service → pods update.
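To make the cascade concrete, the sketch below builds the three variables in the shape the dashboard JSON model uses and shows how a selected value is substituted into the next query. The numeric refresh codes (1 for dashboard load, 2 for time range change) follow the JSON model. The data source UID `prometheus` is an assumption, and `interpolate` is a deliberately simplified stand-in for Grafana's own variable interpolation.

```python
import re

# Refresh codes used in the dashboard JSON model.
REFRESH_ON_DASHBOARD_LOAD = 1
REFRESH_ON_TIME_RANGE_CHANGE = 2


def query_variable(name: str, query: str, refresh: int) -> dict:
    """Build one query-type variable entry for templating.list."""
    return {
        "name": name,
        "type": "query",
        # Assumption: a Prometheus data source with this UID exists.
        "datasource": {"type": "prometheus", "uid": "prometheus"},
        "query": query,
        "refresh": refresh,
    }


def cascading_variables() -> list:
    """namespace -> service -> pod, each filtered by the selections before it."""
    return [
        query_variable("namespace", "label_values(namespace)",
                       REFRESH_ON_DASHBOARD_LOAD),
        query_variable("service", 'label_values({namespace="$namespace"}, service)',
                       REFRESH_ON_TIME_RANGE_CHANGE),
        query_variable("pod",
                       'label_values({namespace="$namespace", service="$service"}, pod)',
                       REFRESH_ON_TIME_RANGE_CHANGE),
    ]


def interpolate(query: str, selected: dict) -> str:
    """Replace $name placeholders with the currently selected values (simplified)."""
    return re.sub(r"\$(\w+)", lambda m: selected.get(m.group(1), m.group(0)), query)
```

With `{"namespace": "prod"}` selected, the `service` query becomes `label_values({namespace="prod"}, service)`, which is why changing the namespace refreshes the list of services.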
Grafana Alerting
Create an Alert Rule
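# Unified alerting (Grafana 8+)
# Assumed location: /etc/grafana/provisioning/alerting/rules.yml
apiVersion: 1
groups:
  - orgId: 1
    name: my-app-alerts
    folder: my-app                      # required; folder name chosen for illustration
    interval: 30s                       # how often the rules in this group are evaluated
    rules:
      - uid: high_error_rate
        title: "High Error Rate"
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus   # must match the UID of your Prometheus data source
            relativeTimeRange:
              from: 600
              to: 0
            model:
              expr: 'rate(http_errors_total[5m]) / rate(http_requests_total[5m]) * 100 > 5'
              instant: true             # the condition needs one value per series, not a time series
              refId: A
        # A healthy system returns no series for this expr, so OK may suit better than NoData.
        noDataState: NoData
        execErrState: Alerting
        for: 5m                         # condition must hold this long before the alert fires
        annotations:
          summary: "Error rate is above 5%"
        labels:
          severity: critical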
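The interplay between the 30s evaluation interval and the 5m pending period is easy to misread, so here is a small simulation of it. This is a simplified model of the state machine, not Grafana's implementation: the function name `alert_states` is hypothetical, and it ignores the NoData and Error states entirely.

```python
def alert_states(values, threshold=5.0, interval_s=30, for_s=300):
    """Return the rule state after each evaluation.

    values: error-rate percentages, one per evaluation, spaced interval_s apart.
    A breach moves the rule to Pending; it becomes Alerting only after the
    breach has lasted for_s seconds without interruption.
    """
    states = []
    pending_since = None  # time of the first evaluation in the current breach
    for i, value in enumerate(values):
        now = i * interval_s
        if value > threshold:
            if pending_since is None:
                pending_since = now
            held_for = now - pending_since
            states.append("Alerting" if held_for >= for_s else "Pending")
        else:
            pending_since = None  # any healthy evaluation resets the timer
            states.append("Normal")
    return states
```

Under this model, a sustained 6% error rate stays in Pending for the first ten evaluations and only then switches to Alerting, while a spike that recovers within five minutes never fires. That is the trade-off the pending period buys: fewer false pages at the cost of slower detection.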
Notification Channels
| Channel | When to Use |
|---------|-------------|
| Slack | Team communication, non-urgent |
| Email | Stakeholder notifications, daily reports |
| PagerDuty | On-call escalation for critical alerts |
| Webhook | Custom integrations (Jira, Teams, Discord) |
| Telegram | Simple personal notifications |
| OpsGenie | Incident management integration |
| Pushover | Mobile push notifications |
Annotations
Annotations overlay events on your graphs:
# Via API
# POST /api/annotations
{
  "dashboardUID": "abc123",
  "time": 1700000000000,
  "text": "Deployment v1.2.3 to production",
  "tags": ["deployment", "production"]
}
Automated annotations from CI/CD pipelines help correlate performance changes with deployments.
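A CI/CD job could send that request with a few lines of Python. The sketch below uses only the standard library and assumes a service account token is available to the pipeline. `build_annotation` and `post_annotation` are hypothetical helper names; the payload fields match the request body shown above, with `time` expressed in epoch milliseconds.

```python
import json
import time
import urllib.request


def build_annotation(text, tags, dashboard_uid=None, when_s=None) -> dict:
    """Build the JSON body for POST /api/annotations.

    when_s is a Unix timestamp in seconds; Grafana expects milliseconds.
    """
    timestamp_s = time.time() if when_s is None else when_s
    payload = {
        "time": int(timestamp_s * 1000),
        "text": text,
        "tags": list(tags),
    }
    if dashboard_uid:
        # Without a dashboard UID the annotation is organization-wide.
        payload["dashboardUID"] = dashboard_uid
    return payload


def post_annotation(base_url: str, token: str, payload: dict) -> int:
    """Send the annotation to Grafana and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

In a pipeline, the deploy step would call `post_annotation` right after a successful rollout, passing the release version in `text` and tags such as `deployment` so dashboards can filter on them.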
Provisioning Dashboards as Code
// /etc/grafana/provisioning/dashboards/my-app.json
{
  "title": "My App Overview",
  "tags": ["my-app", "production"],
  "timezone": "browser",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [{
        "expr": "sum by(service) (rate(http_requests_total[5m]))",
        "legendFormat": "{{service}}"
      }]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [{
        "expr": "histogram_quantile(0.99, sum by(le, service) (rate(http_request_duration_seconds_bucket[5m])))",
        "legendFormat": "{{service}}"
      }]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 8 },
      "targets": [{
        "expr": "rate(http_errors_total[5m]) / rate(http_requests_total[5m]) * 100",
        "legendFormat": "{{service}}"
      }]
    }
  ]
}
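Hand-editing `gridPos` values gets tedious as dashboards grow, so teams often generate the JSON instead. The sketch below is one possible approach, not an official tool: `timeseries_panel`, `layout`, and `build_dashboard` are hypothetical helpers that place equal-width panels on the 24-column grid and wrap to a new row when a row is full.

```python
import json

GRID_WIDTH = 24  # the dashboard grid is 24 columns wide


def timeseries_panel(title, expr, x, y, w=12, h=8) -> dict:
    """Build one time series panel with a single PromQL target."""
    return {
        "title": title,
        "type": "timeseries",
        "gridPos": {"h": h, "w": w, "x": x, "y": y},
        "targets": [{"expr": expr, "legendFormat": "{{service}}"}],
    }


def layout(panel_specs, w=12, h=8) -> list:
    """Place (title, expr) pairs left to right, wrapping at the grid edge."""
    panels = []
    x = y = 0
    for title, expr in panel_specs:
        if x + w > GRID_WIDTH:
            x, y = 0, y + h  # start a new row
        panels.append(timeseries_panel(title, expr, x, y, w, h))
        x += w
    return panels


def build_dashboard(title, tags, panel_specs) -> dict:
    """Assemble a dashboard document in the same shape as the JSON above."""
    return {
        "title": title,
        "tags": list(tags),
        "timezone": "browser",
        "panels": layout(panel_specs),
    }


def to_json(dashboard: dict) -> str:
    """Serialize for writing into the provisioning directory."""
    return json.dumps(dashboard, indent=2)
```

Writing the result of `to_json(build_dashboard(...))` to a file in the provisioned dashboards directory keeps the layout logic in one place, and the generated JSON can still be reviewed in Git like any hand-written dashboard.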
Best Practices
| Practice | Reason |
|----------|--------|
| Provision everything as code | Dashboards in Git = version controlled, auditable |
| Use dashboard variables | One dashboard per service, filter by environment |
| Set up alerting on SLOs | Alert on error budget burn rate, not raw thresholds |
| Add annotations for deployments | Correlate code changes with metric changes |
| Use Explore for ad-hoc queries | Don't clutter dashboards with one-off queries |
| Limit dashboard refresh rate | 15-30s is enough; faster causes unnecessary load |
| Organize with tags and folders | Engineers can find relevant dashboards easily |
| Set appropriate permissions | Read-only for viewers, edit for operators |
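The burn-rate recommendation in the table is easier to apply with the arithmetic spelled out. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below is a worked example under the common assumption of a 30-day SLO window; the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    error_ratio: fraction of failed requests, e.g. 0.01 for 1%.
    slo_target: availability objective as a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate
```

For a 99.9% SLO, a 1% error ratio burns the budget about 10 times faster than sustainable, which would exhaust a 30-day budget in roughly three days. Alerting on that ratio, rather than on a fixed 1% threshold, ties the alert to the objective the team has actually committed to.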
Summary
Grafana provides a unified visualization platform for all your observability data. Connect Prometheus, Loki, and other data sources, build dashboards with variables and annotations, set up alerts, and provision everything as code.
Key takeaways:
- Grafana connects to 50+ data sources (Prometheus, Loki, CloudWatch, etc.)
- Panel types: Time Series, Stat, Gauge, Bar Chart, Table, Heatmap, Logs, Traces
- Dashboard variables provide dynamic filtering (env, service, pod)
- Unified alerting evaluates rules and routes notifications
- Annotations correlate events (deployments) with metric changes
- Provision dashboards and data sources as code (YAML/JSON)
- Use consistent layouts and color schemes for readability
- One dashboard per concern: service overview, infrastructure, business metrics
What's Next: Loki — Log Aggregation
The next chapter covers Grafana Loki — log aggregation for Kubernetes, LogQL queries, and integrating logs with metrics in Grafana.