Hands-On: Complete Monitoring System

Vibe Prompt

"Help me write a complete docker-compose.yml file that includes Prometheus, Grafana, Loki, Promtail, Jaeger, and Node Exporter."

Complete Compose

version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
  
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_INSTALL_PLUGINS: grafana-lokiexplore-app
    volumes:
      - grafana_data:/var/lib/grafana
  
  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]
    command: -config.file=/etc/loki/local-config.yaml
  
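  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yaml:/etc/promtail/config.yaml
    # The image's default command reads /etc/promtail/config.yml,
    # so point Promtail at the file mounted above.
    command: -config.file=/etc/promtail/config.yaml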
  
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports: ["16686:16686", "4317:4317"]
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
  
  node_exporter:
    image: prom/node-exporter:latest
    ports: ["9100:9100"]
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'

volumes:
  prometheus_data:
  grafana_data:

Three Pillars of Observability

Metrics (Prometheus) → System numerical indicators
Logs (Loki)        → Event records
Traces (Jaeger)    → Request flows
          ↓
    Grafana Unified Dashboard

Course Summary

Monitoring course completed!

✅ Prometheus metrics collection
✅ Grafana dashboards
✅ Loki log aggregation
✅ OpenTelemetry + Jaeger
✅ Complete observability platform

Complete Monitoring Architecture Overview

User Request
    │
    ▼
┌───────────────┐    ┌──────────────┐    ┌──────────────┐
│ Nginx Ingress │───▶│  API Server  │───▶│  PostgreSQL  │
│ (Prometheus)  │    │ (OTel SDK)   │    │ (exporter)   │
└───────────────┘    └──────────────┘    └──────────────┘
    │                    │                    │
    ▼                    ▼                    ▼
┌─────────────────────────────────────────────────┐
│              Prometheus (Metrics Collection)    │
└─────────────────────────────────────────────────┘
    │                    │                    │
    ▼                    ▼                    ▼
┌─────────────┐  ┌──────────────┐  ┌──────────────┐
│   Grafana   │  │  Loki (Logs) │  │  Jaeger      │
│  (Dashboard)│  │              │  │  (Tracing)   │
└─────────────┘  └──────────────┘  └──────────────┘
    │
    ▼
┌─────────────┐
│ Alertmanager│
│ Slack/Email │
└─────────────┘

Pricing Estimation (Self-Hosted vs SaaS)

| Solution | Monthly Cost | Advantages | Disadvantages |
|----------|:------------:|------------|---------------|
| Self-Hosted Prometheus+Grafana | $0 (Only EC2 costs) | Full control, unlimited data | Requires maintenance |
| Grafana Cloud (Free) | $0 | No maintenance, 14-day retention | Limited Metrics/Logs |
| Grafana Cloud (Pro) | ~$50/month and up | No maintenance, scalable | Costs grow with data volume |
| Datadog | ~$15/host/month | Most integrated features | Costs rise quickly with high traffic |
| AWS Managed Prometheus | $0.09/million samples | Best AWS integration | AWS-only usage |

Next Steps

Establish SLO Dashboard: Combine SLI with error budgets
Design On-Call Rotations: Integrate with PagerDuty/Opsgenie
Implement Auto Remediation: Use Webhooks to trigger automated fix scripts

Building a Complete Observability Platform

Integrate Prometheus, Grafana, Loki, OpenTelemetry, and Jaeger to unify Metrics (what happened), Logs (why it happened), and Traces (where it happened).

The Three Pillars

| Pillar | Tool | Questions Answered |
|:----|:----|:---------|
| Metrics | Prometheus | CPU spike? Traffic surge? |
| Logs | Loki | Detailed error messages? |
| Traces | OpenTelemetry | Which service is slowest? |

Course Summary

This monitoring course progressed from Prometheus through Grafana, Loki, and OpenTelemetry to a complete observability stack. You now have the building blocks for a monitoring solution that covers metrics, logs, and traces.

Understanding Observability: The Foundation of Modern Systems

Observability is not just about monitoring—it's about understanding the internal state of a system through its external outputs. In today's distributed, microservices-based architectures, traditional monitoring approaches are insufficient. You need a holistic view that combines metrics, logs, and traces to diagnose issues quickly and proactively prevent failures.

Why Observability Matters for Business Success

In the cloud-native era, system downtime translates directly into lost revenue. Published industry estimates vary widely, but for larger enterprises the cost of an outage is commonly put at thousands of dollars per minute or more, depending on scale. Observability platforms like the one we're building provide real-time insights that enable:

Faster Incident Response: Reduce mean time to detection (MTTD) and mean time to resolution (MTTR) by correlating metrics, logs, and traces in one place instead of searching each source separately.
Proactive Capacity Planning: Use predictive analytics on metrics to scale resources before bottlenecks occur.
Improved User Experience: Monitor end-to-end request flows to identify and eliminate latency hotspots.
Compliance and Auditing: Maintain detailed logs and traces for regulatory requirements (GDPR, HIPAA, SOC2).
Cost Optimization: Identify underutilized resources and optimize cloud spending through granular usage metrics.

For developers and founders, investing in observability is investing in system reliability, customer satisfaction, and competitive advantage. It transforms reactive firefighting into proactive system management.

The Three Pillars Explained

1. Metrics: The Quantitative Pulse of Your System

Metrics are numerical representations of system behavior over time. Prometheus excels at collecting and storing time-series data, making it ideal for tracking:

System Health: CPU utilization, memory usage, disk I/O, network throughput
Application Performance: Request rates, error rates, latency distributions
Business KPIs: User signups, transaction volumes, conversion rates

Prometheus uses a powerful query language called PromQL that allows you to create complex analytical queries. For example, you can calculate the 99th percentile latency of API requests over the last 5 minutes, or identify services with error rates exceeding 5%.
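
The two queries mentioned above can be written down as a Prometheus rule file. This is a minimal sketch, not a drop-in configuration: the metric names `http_request_duration_seconds_bucket` and `http_requests_total`, and the `status` label, are assumptions about how your application is instrumented, and the rule and alert names are made up for illustration.

```yaml
groups:
  - name: api_examples            # hypothetical group name
    rules:
      # 99th percentile request latency over the last 5 minutes, per job.
      # Assumes a histogram named http_request_duration_seconds.
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Fires when more than 5% of requests returned a 5xx status.
      # Assumes a counter named http_requests_total with a "status" label.
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
```

The `expr` values are plain PromQL, so they can be pasted into the Prometheus expression browser or a Grafana panel to try them before committing them to a rule file.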

2. Logs: The Narrative of System Events

Logs provide the detailed, contextual information that explains what happened at specific points in time. Loki, designed to be lightweight and cost-effective, aggregates logs from all your services and stores them efficiently.

Key advantages of Loki over traditional log solutions:

Label-based indexing: Instead of parsing every log line, Loki indexes only labels, making it much faster and cheaper
Grafana integration: Seamlessly query logs alongside metrics and traces in the same dashboard
Scalable storage: Handles petabytes of logs without the complexity of the ELK stack

With Promtail as the log shipper, you can collect logs from any source—application logs, system logs, container logs—and route them to Loki with proper labeling for easy querying.

3. Traces: The Journey of a Request

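Distributed tracing follows a request as it travels through multiple services, providing a complete picture of its path and performance. Jaeger was originally built around the OpenTracing API, which has since been merged into OpenTelemetry; recent Jaeger versions accept OpenTelemetry's OTLP protocol directly, so you can instrument your applications once with OpenTelemetry and get end-to-end visibility.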

Traces are composed of spans, which represent individual operations within a service. Each span has:

Operation name: What was executed
Timestamps: Start and end time
Tags: Key-value metadata (e.g., HTTP status code, user ID)
Logs: Event timestamps within the span
References: Links to parent and child spans

By analyzing traces, you can identify bottlenecks, understand service dependencies, and debug complex issues that span multiple microservices.

Implementation Strategy Using Vibe Coding

Step 1: Setting Up the Docker Compose Environment

We'll start by creating a docker-compose.yml file that orchestrates all our monitoring components. This approach allows us to run the entire observability stack locally for development and testing.

The compose file defines six services:

Prometheus: The metrics collection engine that scrapes targets and stores time-series data
Grafana: The visualization layer that creates dashboards and alerts
Loki: The log aggregation system that stores and indexes logs
Promtail: The log collector that reads log files and sends them to Loki
Jaeger: The tracing backend that collects and visualizes distributed traces
Node Exporter: The system metrics exporter that provides host-level metrics

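Each service is configured with appropriate ports, volumes, and environment variables. We use named volumes for Prometheus and Grafana, so metrics and dashboards survive when those containers are re-created. Loki has no volume in this compose file, so its logs are lost when the container is removed; add a volume if you need them to persist.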

Step 2: Configuring Prometheus

Prometheus requires a configuration file (prometheus.yml) that defines:

Global settings: Scrape intervals, evaluation intervals
Scrape configs: Which targets to monitor and how often
Rule files: Alerting rules and recording rules
Alerting: Notification endpoints for Alertmanager

For a production setup, you would configure Prometheus to scrape metrics from your application services, databases, and infrastructure components. Each target exposes metrics on an HTTP endpoint (typically /metrics), which Prometheus polls at regular intervals.
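
As a starting point, a minimal `prometheus.yml` for the compose file above might look like the sketch below. The target names assume Prometheus can reach the other containers by their compose service names (`node_exporter`); the rule file path and the `alertmanager` service are hypothetical and not part of the compose file, which is why they are commented out.

```yaml
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often rules are evaluated

# Uncomment once you mount a rules directory (hypothetical path).
# rule_files:
#   - /etc/prometheus/rules/*.yml

# Uncomment once an Alertmanager service exists (not in the compose file above).
# alerting:
#   alertmanagers:
#     - static_configs:
#         - targets: ["alertmanager:9093"]

scrape_configs:
  # Prometheus scraping its own /metrics endpoint.
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Host metrics from the node_exporter service in the compose file.
  - job_name: node
    static_configs:
      - targets: ["node_exporter:9100"]
```

After starting the stack, the Status > Targets page at http://localhost:9090 should list both jobs; if the `node` job stays down, check that the service name matches the one in your compose file.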

Step 3: Setting Up Grafana

Grafana acts as the central dashboard for all your observability data. Key configuration steps include:

Initial setup: Access Grafana at http://localhost:3000 with default credentials (admin/admin)
Data source configuration: Add Prometheus, Loki, and Jaeger as data sources
Dashboard creation: Build dashboards that combine metrics, logs, and traces
Alerting: Configure alert rules that trigger notifications based on metric thresholds

Grafana's powerful panel system allows you to create visualizations that show multiple data types side by side. For example, you can have a graph showing CPU usage, a table showing recent errors, and a trace visualization all in one dashboard.
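
The data source step can also be done from a file instead of the UI, using Grafana's provisioning mechanism. The sketch below assumes you save it under a path of your choosing (for example `./grafana/datasources.yaml`, a hypothetical name) and mount it into `/etc/grafana/provisioning/datasources/` in the Grafana container; the compose file above does not include that mount yet. The URLs assume the compose service names `prometheus`, `loki`, and `jaeger`.

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy               # Grafana's backend makes the requests
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100

  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686    # Jaeger query/UI port from the compose file
```

Provisioned data sources appear on startup, which keeps a local environment reproducible when the `grafana_data` volume is wiped.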

Step 4: Configuring Loki and Promtail

Loki requires minimal configuration for basic setups, but for production, you'll want to configure:

Storage: How logs are stored (local filesystem, S3, etc.)
Indexing: Label-based indexing strategy
Ingestion: Rate limits and chunk targets

Promtail configuration involves:

Scraping configs: Which log files to read
Position storage: Where to store read positions to avoid re-reading
Targets: How to label logs for querying

The key is to ensure that all your services are producing logs in a structured format (JSON preferred) with consistent labels that make querying efficient.
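
A minimal `promtail-config.yaml` covering the three Promtail items above could look like this. It is a sketch for the local compose setup: the push URL assumes the `loki` service name, the `job: varlogs` label is an arbitrary choice, and the positions file lives in `/tmp`, so read positions are lost when the container is re-created.

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

# Position storage: remembers how far each file has been read.
positions:
  filename: /tmp/positions.yaml

# Where to send the logs.
clients:
  - url: http://loki:3100/loki/api/v1/push

# Scraping configs: which files to read and how to label them.
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs                 # label used when querying in Loki
          __path__: /var/log/*.log     # files inside the mounted /var/log
```

With this in place, a query such as `{job="varlogs"}` in Grafana's Explore view should return the collected lines.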

Step 5: Instrumenting Applications with OpenTelemetry

To enable distributed tracing, you need to instrument your applications using OpenTelemetry SDKs. The process involves:

Installing SDKs: Add OpenTelemetry libraries to your application dependencies
Creating tracers: Initialize tracer providers with appropriate configuration
Adding spans: Wrap key operations in spans to track their execution
Exporting traces: Configure exporters to send traces to Jaeger

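For Python applications, you would use the opentelemetry-sdk package together with an OTLP exporter such as opentelemetry-exporter-otlp. The older opentelemetry-exporter-jaeger package is deprecated, and Jaeger accepts OTLP directly (port 4317 in the compose file above). For Java, you'd use the OpenTelemetry Java agent or SDK. The goal is to have end-to-end traces that show the complete request flow through your system.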
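
The exporting step is often configured through OpenTelemetry's standard environment variables rather than in code. The compose fragment below is a sketch under several assumptions: the `api` service and its image are hypothetical, and the application must already use an OpenTelemetry SDK or auto-instrumentation that honors these variables. The endpoint points at the OTLP gRPC port that the Jaeger service in the compose file exposes.

```yaml
services:
  api:                                   # hypothetical application service
    image: example/api:latest            # placeholder image name
    environment:
      OTEL_SERVICE_NAME: api             # name shown in the Jaeger UI
      OTEL_TRACES_EXPORTER: otlp
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
      OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317
    depends_on:
      - jaeger
```

Keeping the endpoint in the environment means the same application image can send traces to a local Jaeger in development and to a different collector in production.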

Step 6: Integrating Alertmanager for Notifications

Alertmanager handles alerts sent from Prometheus and routes them to appropriate receivers. Configuration involves:

Receivers: Define notification channels (Slack, Email, PagerDuty)
Routes: Specify how alerts should be routed based on labels
Inhibit rules: Prevent notification storms by suppressing related alerts

This ensures that the right people get notified at the right time through their preferred communication channels.
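
The three pieces above map onto the top-level keys of an `alertmanager.yml`. The following is a minimal sketch: the receiver names are made up, the Slack webhook URL and PagerDuty key are placeholders you must replace, and the `severity` label values assume your Prometheus alert rules set them.

```yaml
route:
  receiver: team-slack              # default receiver
  group_by: [alertname, job]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Alerts labeled severity="page" go to the on-call receiver instead.
    - matchers:
        - severity="page"
      receiver: oncall-pagerduty

receivers:
  - name: team-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder
        channel: "#alerts"
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME                                # placeholder

inhibit_rules:
  # While a paging alert fires, suppress warnings for the same alert and job.
  - source_matchers:
      - severity="page"
    target_matchers:
      - severity="warning"
    equal: [alertname, job]
```

Alertmanager is not part of the compose file above, so using this also means adding an Alertmanager service and pointing Prometheus's `alerting` section at it.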

Advanced Configuration Considerations

Security Hardening

In production environments, you must secure all components:

Authentication: Enable Grafana authentication, use OAuth or LDAP
Authorization: Configure role-based access control (RBAC) in Grafana
TLS: Enable HTTPS for all services
Network policies: Restrict access to monitoring endpoints
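
For the local compose stack, a first hardening step might look like the fragment below, intended to be merged into the existing service definitions. It is a sketch, not a complete security setup: it only binds the published ports to the loopback interface, disables self sign-up and anonymous access in Grafana, and reads the admin password from an environment variable (`GRAFANA_ADMIN_PASSWORD` is a name chosen here for illustration) instead of hard-coding `admin`.

```yaml
services:
  grafana:
    ports: ["127.0.0.1:3000:3000"]      # reachable from this host only
    environment:
      # Compose aborts with this message if the variable is unset.
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:?set it in .env}"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_AUTH_ANONYMOUS_ENABLED: "false"

  prometheus:
    ports: ["127.0.0.1:9090:9090"]      # Prometheus has no login by default
```

TLS, OAuth or LDAP, and RBAC still need their own configuration; in practice these are often handled by a reverse proxy in front of the stack.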

Scaling for Production

As your system grows, you'll need to scale the monitoring stack:

Prometheus: Use Thanos or Cortex for long-term storage and horizontal scaling
Loki: Deploy in a distributed mode with multiple ingesters and chunk stores
Grafana: Run in a clustered configuration with shared databases
Jaeger: Replace the all-in-one image with separate collector and query services backed by persistent storage

Data Retention and Archival

Implement appropriate retention policies:

Metrics: Keep high-resolution data for 15 days, aggregated data for 1 year
Logs: Retain logs for 30 days, archive older logs to cold storage
Traces: Store traces for 7-14 days depending on business requirements
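
For the metrics tier, Prometheus retention is set with a command-line flag. The compose fragment below sketches the 15-day figure from the list above. Because setting `command` replaces the image's default arguments, the config file and storage path flags are restated; the paths match the volumes in the compose file above.

```yaml
services:
  prometheus:
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d   # drop samples older than 15 days
```

Longer-term, aggregated data is usually kept outside the local TSDB, for example with the Thanos or Cortex setups mentioned under Scaling for Production. Loki and Jaeger have their own retention settings, which are not shown here.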

Cost Optimization Strategies

While self-hosted solutions appear "free," they have hidden costs:

Infrastructure: EC2 instances, storage volumes, network bandwidth
Maintenance: Engineering time for updates, backups, troubleshooting
Opportunity cost: Resources spent on monitoring instead of features

SaaS solutions like Grafana Cloud offer predictable pricing but may become expensive at scale. The optimal approach often involves:

Self-hosting core components (Prometheus, Loki)
Using SaaS for advanced features (Grafana Cloud for dashboards, Alertmanager as a service)
Implementing data sampling and aggregation to reduce storage costs

Real-World Use Cases

E-commerce Platform Monitoring

An e-commerce site would monitor:

Frontend: Page load times, conversion rates, cart abandonment
API Layer: Request rates, error rates, authentication failures
Database: Query performance, connection pool usage, replication lag
Payment Processing: Transaction success rates, fraud detection metrics

By correlating metrics, logs, and traces, engineers can quickly identify whether a spike in errors is due to a database issue, a third-party API failure, or a code deployment problem.

Microservices Architecture

In a microservices environment, observability is critical because:

Services are independently deployable and scalable
Failures can cascade across service boundaries
Debugging requires understanding request flows across multiple services

Tracing becomes essential to understand how a user request flows through different services, while metrics help identify which
