Hands-On: Complete Monitoring System
Vibe Prompt
"Help me write a complete docker-compose.yml file that includes Prometheus, Grafana, Loki, Promtail, Jaeger, and Node Exporter."
Complete Compose
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin  # local testing only; change before exposing
      GF_INSTALL_PLUGINS: grafana-lokiexplore-app
    volumes:
      - grafana_data:/var/lib/grafana
  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]
    command: -config.file=/etc/loki/local-config.yaml
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/etc/promtail/config.yaml
    # The image defaults to /etc/promtail/config.yml, so point it at the mounted file
    command: -config.file=/etc/promtail/config.yaml
  jaeger:
    image: jaegertracing/all-in-one:latest
    # 16686 = Jaeger UI, 4317 = OTLP over gRPC
    ports: ["16686:16686", "4317:4317"]
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
  node_exporter:
    image: prom/node-exporter:latest
    ports: ["9100:9100"]
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
volumes:
  prometheus_data:
  grafana_data:
Three Pillars of Observability
Metrics (Prometheus) → Numerical system indicators
Logs (Loki)          → Event records
Traces (Jaeger)      → Request flows
          ↓
Grafana Unified Dashboard
Course Summary
Monitoring course completed!
- ✅ Prometheus metrics collection
- ✅ Grafana dashboards
- ✅ Loki log aggregation
- ✅ OpenTelemetry + Jaeger
- ✅ Complete observability platform
Complete Monitoring Architecture Overview
  User Request
       │
       ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│Nginx Ingress │───▶│  API Server  │───▶│  PostgreSQL  │
│ (Prometheus) │    │  (OTel SDK)  │    │  (exporter)  │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
┌──────────────────────────────────────────────────────┐
│           Prometheus (Metrics Collection)            │
└──────────────────────────────────────────────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Grafana    │    │ Loki (Logs)  │    │    Jaeger    │
│ (Dashboard)  │    │              │    │  (Tracing)   │
└──────────────┘    └──────────────┘    └──────────────┘
       │
       ▼
┌──────────────┐
│ Alertmanager │
│ Slack/Email  │
└──────────────┘
Pricing Estimation (Self-Hosted vs SaaS)
| Solution | Monthly Cost | Advantages | Disadvantages |
|----------|:------------:|------------|---------------|
| Self-Hosted Prometheus+Grafana | $0 (Only EC2 costs) | Full control, unlimited data | Requires maintenance |
| Grafana Cloud (Free) | $0 | No maintenance, 14-day retention | Limited Metrics/Logs |
| Grafana Cloud (Pro) | ~$50/month and up | No maintenance, scalable | Costs grow with data volume |
| Datadog | ~$15/host/month | Most integrated features | Costs rise quickly with high traffic |
| AWS Managed Prometheus | $0.09/million samples | Best AWS integration | AWS-only usage |
Next Steps
- Establish SLO Dashboard: Combine SLI with error budgets
- Design On-Call Rotations: Integrate with PagerDuty/Opsgenie
- Implement Auto Remediation: Use Webhooks to trigger automated fix scripts
Building a Complete Observability Platform
Integrate Prometheus, Grafana, Loki, OpenTelemetry, and Jaeger to unify Metrics (what happened), Logs (why it happened), and Traces (where it happened).
The Three Pillars
| Pillar | Tool | Questions Answered |
|:----|:----|:---------|
| Metrics | Prometheus | CPU spike? Traffic surge? |
| Logs | Loki | Detailed error messages? |
| Traces | OpenTelemetry | Which service is slowest? |
Course Summary
This monitoring course has taken you from Prometheus, Grafana, Loki, and OpenTelemetry to a complete, integrated system. You now have the building blocks to assemble a monitoring solution for your own services.
Understanding Observability: The Foundation of Modern Systems
Observability is not just about monitoring—it's about understanding the internal state of a system through its external outputs. In today's distributed, microservices-based architectures, traditional monitoring approaches are insufficient. You need a holistic view that combines metrics, logs, and traces to diagnose issues quickly and proactively prevent failures.
Why Observability Matters for Business Success
In the cloud-native era, system downtime translates directly to revenue loss. According to industry studies, even a 1-minute outage can cost enterprises thousands to millions of dollars, depending on their scale. Observability platforms like the one we're building provide real-time insights that enable:
- Faster Incident Response: Reduce mean time to detection (MTTD) and mean time to resolution (MTTR) by 50-80% through correlated data.
- Proactive Capacity Planning: Use predictive analytics on metrics to scale resources before bottlenecks occur.
- Improved User Experience: Monitor end-to-end request flows to identify and eliminate latency hotspots.
- Compliance and Auditing: Maintain detailed logs and traces for regulatory requirements (GDPR, HIPAA, SOC2).
- Cost Optimization: Identify underutilized resources and optimize cloud spending through granular usage metrics.
For developers and founders, investing in observability is investing in system reliability, customer satisfaction, and competitive advantage. It transforms reactive firefighting into proactive system management.
The Three Pillars Explained
1. Metrics: The Quantitative Pulse of Your System
Metrics are numerical representations of system behavior over time. Prometheus excels at collecting and storing time-series data, making it ideal for tracking:
- System Health: CPU utilization, memory usage, disk I/O, network throughput
- Application Performance: Request rates, error rates, latency distributions
- Business KPIs: User signups, transaction volumes, conversion rates
Prometheus uses a powerful query language called PromQL that allows you to create complex analytical queries. For example, you can calculate the 99th percentile latency of API requests over the last 5 minutes, or identify services with error rates exceeding 5%.
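The two PromQL examples just mentioned can be sketched as a Prometheus rules file. This is a minimal illustration, not a drop-in file: the metric names `http_request_duration_seconds_bucket` and `http_requests_total`, the `status` label, and the group and rule names are assumptions about how your application is instrumented.

```yaml
# rules/api.yml (hypothetical path); metric and label names are assumptions
groups:
  - name: api_examples
    rules:
      # 99th percentile request latency over the last 5 minutes, per job
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
      # Fire when more than 5% of requests returned a 5xx status for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} error rate above 5%"
```

The same expressions can be pasted directly into the Prometheus expression browser or a Grafana panel; wrapping them in a rules file only makes the results precomputed and alertable.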
2. Logs: The Narrative of System Events
Logs provide the detailed, contextual information that explains what happened at specific points in time. Loki, designed to be lightweight and cost-effective, aggregates logs from all your services and stores them efficiently.
Key advantages of Loki over traditional log solutions:
- Label-based indexing: Instead of building a full-text index over every log line, Loki indexes only a small set of labels, which keeps the index small and storage cheap
- Grafana integration: Seamlessly query logs alongside metrics and traces in the same dashboard
- Scalable storage: Handles petabytes of logs without the operational complexity of the ELK stack
With Promtail as the log shipper, you can collect logs from any source—application logs, system logs, container logs—and route them to Loki with proper labeling for easy querying.
3. Traces: The Journey of a Request
Distributed tracing follows a request as it travels through multiple services, providing a complete picture of its path and performance. Jaeger stores and visualizes those traces and accepts OpenTelemetry (OTLP) data natively. OpenTelemetry supersedes the older OpenTracing standard, so you instrument your applications once and get end-to-end visibility.
Traces are composed of spans, which represent individual operations within a service. Each span has:
- Operation name: What was executed
- Timestamps: Start and end time
- Tags: Key-value metadata (e.g., HTTP status code, user ID)
- Logs (events): Timestamped events recorded within the span
- References: Links to parent and child spans
By analyzing traces, you can identify bottlenecks, understand service dependencies, and debug complex issues that span multiple microservices.
Implementation Strategy Using Vibe Coding
Step 1: Setting Up the Docker Compose Environment
We'll start by creating a docker-compose.yml file that orchestrates all our monitoring components. This approach allows us to run the entire observability stack locally for development and testing.
The compose file defines six services:
- Prometheus: The metrics collection engine that scrapes targets and stores time-series data
- Grafana: The visualization layer that creates dashboards and alerts
- Loki: The log aggregation system that stores and indexes logs
- Promtail: The log collector that reads log files and sends them to Loki
- Jaeger: The tracing backend that collects and visualizes distributed traces
- Node Exporter: The system metrics exporter that provides host-level metrics
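Each service is configured with appropriate ports, volumes, and environment variables. Named volumes (prometheus_data and grafana_data) keep metrics and dashboards when containers are recreated. Loki has no volume in this file, so its logs are lost when the container is removed; add one if you need log data to persist.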
Step 2: Configuring Prometheus
Prometheus requires a configuration file (prometheus.yml) that defines:
- Global settings: Scrape intervals, evaluation intervals
- Scrape configs: Which targets to monitor and how often
- Rule files: Alerting rules and recording rules
- Alerting: Notification endpoints for Alertmanager
For a production setup, you would configure Prometheus to scrape metrics from your application services, databases, and infrastructure components. Each target exposes metrics on an HTTP endpoint (typically /metrics), which Prometheus polls at regular intervals.
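A minimal `prometheus.yml` for the compose file in this lesson might look like the sketch below. The scrape targets use the compose service names and ports defined earlier; the 15s intervals are a common starting point rather than a requirement, and the commented-out rule and Alertmanager sections assume a rules directory and an `alertmanager` service that this lesson's compose file does not define.

```yaml
global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often rules are evaluated

# Assumption: uncomment once a rules directory is mounted into the container
# rule_files:
#   - /etc/prometheus/rules/*.yml

# Assumption: uncomment once an Alertmanager service named "alertmanager" exists
# alerting:
#   alertmanagers:
#     - static_configs:
#         - targets: ["alertmanager:9093"]

scrape_configs:
  # Prometheus scraping its own /metrics endpoint
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  # Host metrics from the node_exporter service in the compose file
  - job_name: node
    static_configs:
      - targets: ["node_exporter:9100"]
```

After starting the stack, the Targets page at http://localhost:9090/targets shows whether each job is being scraped successfully.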
Step 3: Setting Up Grafana
Grafana acts as the central dashboard for all your observability data. Key configuration steps include:
- Initial setup: Access Grafana at http://localhost:3000 with default credentials (admin/admin)
- Data source configuration: Add Prometheus, Loki, and Jaeger as data sources
- Dashboard creation: Build dashboards that combine metrics, logs, and traces
- Alerting: Configure alert rules that trigger notifications based on metric thresholds
Grafana's powerful panel system allows you to create visualizations that show multiple data types side by side. For example, you can have a graph showing CPU usage, a table showing recent errors, and a trace visualization all in one dashboard.
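Data sources can be added by hand in the UI, or provisioned from a file so they exist on first start. The sketch below assumes you mount a file such as `./grafana/datasources.yml` (a hypothetical path) into `/etc/grafana/provisioning/datasources/` in the grafana container; the URLs use the compose service names from this lesson.

```yaml
# Assumed to be mounted at /etc/grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686   # Jaeger query/UI port
```

Provisioning keeps the data source setup in version control, which matters once the stack is recreated regularly.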
Step 4: Configuring Loki and Promtail
Loki requires minimal configuration for basic setups, but for production, you'll want to configure:
- Storage: How logs are stored (local filesystem, S3, etc.)
- Indexing: Label-based indexing strategy
- Ingestion: Rate limits and chunk targets
Promtail configuration involves:
- Scraping configs: Which log files to read
- Position storage: Where to store read positions to avoid re-reading
- Targets: How to label logs for querying
The key is to ensure that all your services are producing logs in a structured format (JSON preferred) with consistent labels that make querying efficient.
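For the `promtail-config.yaml` mounted by the compose file, a minimal sketch could look like this. It tails files under `/var/log` (the directory mounted into the container) and pushes them to the `loki` service; the `job: varlogs` label and the positions file location are illustrative choices, not requirements.

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

# Where Promtail remembers how far it has read in each file
positions:
  filename: /tmp/positions.yaml

# Loki push endpoint, using the compose service name
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs               # label used when querying in Grafana
          __path__: /var/log/*.log   # files to tail
```

With this in place, a LogQL query such as `{job="varlogs"}` in Grafana's Explore view should return the collected lines. Note that `/tmp/positions.yaml` lives inside the container, so read positions are lost when the container is recreated unless that path is put on a volume.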
Step 5: Instrumenting Applications with OpenTelemetry
To enable distributed tracing, you need to instrument your applications using OpenTelemetry SDKs. The process involves:
- Installing SDKs: Add OpenTelemetry libraries to your application dependencies
- Creating tracers: Initialize tracer providers with appropriate configuration
- Adding spans: Wrap key operations in spans to track their execution
- Exporting traces: Configure exporters to send traces to Jaeger
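For Python applications, you would use the opentelemetry-sdk and opentelemetry-exporter-otlp packages. The older opentelemetry-exporter-jaeger package is deprecated, and Jaeger accepts OTLP directly (the compose file enables this and exposes port 4317). For Java, you'd use the OpenTelemetry Java agent or SDK. The goal is to have end-to-end traces that show the complete request flow through your system.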
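On the deployment side, the exporter is usually configured through the standard OpenTelemetry environment variables rather than in code. The fragment below is a sketch of how an application service could be added to the same compose file: the `app` service, its `build: .` context, and the `api-server` service name are hypothetical, while the endpoint points at the Jaeger OTLP gRPC port that the compose file already exposes.

```yaml
services:
  app:                       # hypothetical application service
    build: .                 # assumes a Dockerfile in the project root
    environment:
      OTEL_SERVICE_NAME: api-server              # name shown in the Jaeger UI
      OTEL_TRACES_EXPORTER: otlp
      OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
    depends_on:
      - jaeger
```

These variables only take effect if the application's SDK or auto-instrumentation reads its configuration from the environment, which is the default behavior for the OpenTelemetry agents and most SDK setups.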
Step 6: Integrating Alertmanager for Notifications
Alertmanager handles alerts sent from Prometheus and routes them to appropriate receivers. Configuration involves:
- Receivers: Define notification channels (Slack, Email, PagerDuty)
- Routes: Specify how alerts should be routed based on labels
- Inhibit rules: Prevent notification storms by suppressing related alerts
This ensures that the right people get notified at the right time through their preferred communication channels.
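The three pieces above fit together roughly as in this `alertmanager.yml` sketch. The receiver names, Slack channel, webhook URL, and PagerDuty key are placeholders, and the `severity` label is assumed to be set by your Prometheus alerting rules.

```yaml
route:
  receiver: team-slack           # default receiver
  group_by: [alertname, job]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts go to the on-call receiver instead of the default
    - matchers:
        - 'severity="critical"'
      receiver: oncall-pagerduty

receivers:
  - name: team-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder
        channel: "#alerts"
        send_resolved: true
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME                                # placeholder

inhibit_rules:
  # While a critical alert fires, mute the matching warning-level alert
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: [alertname, job]
```

Alertmanager is not part of this lesson's compose file, so running this requires adding a service for it (for example from the `prom/alertmanager` image) and pointing Prometheus at it.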
Advanced Configuration Considerations
Security Hardening
In production environments, you must secure all components:
- Authentication: Enable Grafana authentication, use OAuth or LDAP
- Authorization: Configure role-based access control (RBAC) in Grafana
- TLS: Enable HTTPS for all services
- Network policies: Restrict access to monitoring endpoints
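A few of these measures can be applied directly in the compose file. The fragment below is a partial sketch showing only the changed keys: it binds Prometheus to the loopback interface so it is not reachable from other hosts, reads the Grafana admin password from an environment variable (the name `GRAFANA_ADMIN_PASSWORD` is an assumption) instead of hard-coding it, and disables anonymous access and self-signup. TLS and OAuth/LDAP need additional configuration not shown here.

```yaml
services:
  prometheus:
    # Publish only on localhost; put a reverse proxy in front for remote access
    ports: ["127.0.0.1:9090:9090"]
  grafana:
    environment:
      # Compose aborts with this message if the variable is unset
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:?set GRAFANA_ADMIN_PASSWORD in .env}
      GF_AUTH_ANONYMOUS_ENABLED: "false"
      GF_USERS_ALLOW_SIGN_UP: "false"
```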
Scaling for Production
As your system grows, you'll need to scale the monitoring stack:
- Prometheus: Use Thanos or Cortex for long-term storage and horizontal scaling
- Loki: Deploy in a distributed mode with multiple ingesters and chunk stores
- Grafana: Run in a clustered configuration with shared databases
- Jaeger: Move from the all-in-one image to separate collectors and query services backed by external storage, running multiple collectors behind a load balancer
Data Retention and Archival
Implement appropriate retention policies:
- Metrics: Keep high-resolution data for 15 days, aggregated data for 1 year
- Logs: Retain logs for 30 days, archive older logs to cold storage
- Traces: Store traces for 7-14 days depending on business requirements
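For metrics, the 15-day window can be set with a Prometheus command-line flag. In the sketch below, note that overriding `command` replaces the image's default arguments, so the config file and storage path flags are restated alongside the retention flag. Loki and Jaeger retention are configured separately in their own settings and are not shown.

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d   # delete blocks older than 15 days
```

Keeping aggregated data for a year generally requires recording rules plus a long-term store such as Thanos or Cortex, since a single Prometheus server applies one retention period to all of its data.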
Cost Optimization Strategies
While self-hosted solutions appear "free," they have hidden costs:
- Infrastructure: EC2 instances, storage volumes, network bandwidth
- Maintenance: Engineering time for updates, backups, troubleshooting
- Opportunity cost: Resources spent on monitoring instead of features
SaaS solutions like Grafana Cloud offer predictable pricing but may become expensive at scale. The optimal approach often involves:
- Self-hosting core components (Prometheus, Loki)
- Using SaaS for advanced features (Grafana Cloud for dashboards, Alertmanager as a service)
- Implementing data sampling and aggregation to reduce storage costs
Real-World Use Cases
E-commerce Platform Monitoring
An e-commerce site would monitor:
- Frontend: Page load times, conversion rates, cart abandonment
- API Layer: Request rates, error rates, authentication failures
- Database: Query performance, connection pool usage, replication lag
- Payment Processing: Transaction success rates, fraud detection metrics
By correlating metrics, logs, and traces, engineers can quickly identify whether a spike in errors is due to a database issue, a third-party API failure, or a code deployment problem.
Microservices Architecture
In a microservices environment, observability is critical because:
- Services are independently deployable and scalable
- Failures can cascade across service boundaries
- Debugging requires understanding request flows across multiple services
Tracing becomes essential to understand how a user request flows through different services, while metrics help identify which