OpenTelemetry — Distributed Tracing and Instrumentation

Why OpenTelemetry Matters

OpenTelemetry (OTel) is the de facto industry standard for observability instrumentation. It provides a single set of APIs, SDKs, and tools for generating, collecting, and exporting telemetry data (traces, metrics, and logs). By contribution activity, it is the second most active CNCF project after Kubernetes.

Why this matters for your career:

  • It is widely adopted: major cloud providers and observability vendors accept OTel data
  • Instrumentation is vendor-neutral, so you can switch backends (Jaeger, Tempo, Datadog, New Relic) without re-instrumenting your code
  • OTel skills are increasingly required for backend and platform engineering roles
  • Distributed tracing is essential for debugging microservices architectures

What Is OpenTelemetry?

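OpenTelemetry is a collection of tools, APIs, and SDKs used to instrument, generate, collect, and export telemetry data. It is not a storage or visualization backend itself: it hands data to tools such as Jaeger, Tempo, or Prometheus.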

Components

| Component | Purpose |
|-----------|---------|
| OTel API | Language-specific interfaces for creating spans, metrics, and logs |
| OTel SDK | Implementation of the API: sampling, processing, and exporting |
| OTel Collector | Vendor-agnostic service (run as an agent or a gateway) that receives, processes, and exports telemetry |
| Instrumentation Libraries | Auto-instrumentation for popular frameworks (Express, Spring, Django) |
| Exporters | Send data to backends (Jaeger, Tempo, Prometheus, Datadog) |

Core Concepts

Traces and Spans

A trace represents the entire journey of a request as it travels through a distributed system. A span represents a single unit of work within a trace.

Trace: POST /api/orders
├── Span: authenticate-user (2ms)
├── Span: validate-order (5ms)
├── Span: process-payment (120ms)
│   ├── Span: call-payment-gateway (115ms)
│   └── Span: update-payment-status (3ms)
├── Span: update-inventory (20ms)
│   └── Span: db-query-update-stock (18ms)
└── Span: send-confirmation (8ms)
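
A tracing backend rebuilds this tree from flat span records: every span carries its parent's span ID, so children can be grouped under their parents. The sketch below shows that reconstruction on the process-payment subtree from the example. The `FlatSpan` record and the short IDs are hypothetical simplifications; real OTLP spans carry trace IDs, timestamps, and much more.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlatSpan:
    # Hypothetical simplified record; real spans also carry a trace ID, timestamps, attributes
    span_id: str
    parent_id: Optional[str]  # None would mark the trace's root span
    name: str
    duration_ms: int


def index_children(spans):
    """Map each parent span ID to the list of its direct child spans."""
    children = defaultdict(list)
    for s in spans:
        if s.parent_id is not None:
            children[s.parent_id].append(s)
    return children


def render(span, children, depth=0):
    """Depth-first walk producing one indented line per span."""
    lines = ["  " * depth + f"{span.name} ({span.duration_ms}ms)"]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


def self_time(span, children):
    """Time spent in the span itself, excluding its direct children."""
    return span.duration_ms - sum(c.duration_ms for c in children.get(span.span_id, []))


# The process-payment subtree from the example trace (IDs are made up)
payment = FlatSpan("b2", "a1", "process-payment", 120)
spans = [
    payment,
    FlatSpan("c3", "b2", "call-payment-gateway", 115),
    FlatSpan("d4", "b2", "update-payment-status", 3),
]
children = index_children(spans)
print("\n".join(render(payment, children)))
print(self_time(payment, children))  # 120 - (115 + 3) = 2
```

Self time is often more telling than raw duration: here almost all of process-payment's 120ms is spent waiting on the payment gateway.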

Span Anatomy

Each span carries:

  • Name: Operation name (e.g., "process-payment")
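  • Span ID: Unique identifier for this span (8 bytes in the W3C Trace Context format)
  • Trace ID: Shared by every span in the same trace, linking them together (16 bytes)
  • Parent Span ID: ID of the parent span, which builds the hierarchy (empty for the root span)
  • Start/End Time: Timestamps from which the span's duration is computed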
  • Attributes: Key-value pairs (e.g., order.id, payment.amount)
  • Events: Timestamped log messages within the span
  • Status: OK, Error, or Unset
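
To make these fields concrete, here is a minimal Python data model of a span, using the W3C Trace Context ID sizes (16-byte trace ID, 8-byte span ID, hex-encoded). The `Span`, `Event`, and `StatusCode` classes are defined here for illustration only; they are not the OpenTelemetry SDK's classes, which you would use in real code.

```python
import secrets
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class StatusCode(Enum):
    UNSET = "Unset"  # default: the span did not report an outcome
    OK = "Ok"
    ERROR = "Error"


@dataclass
class Event:
    name: str
    timestamp_ns: int
    attributes: dict = field(default_factory=dict)


@dataclass
class Span:
    name: str
    trace_id: str                         # 16 random bytes as 32 hex chars, shared by the trace
    parent_span_id: Optional[str] = None  # None for the root span
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))  # 8 bytes -> 16 hex chars
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: StatusCode = StatusCode.UNSET

    def add_event(self, name, attributes=None):
        # Timestamped occurrence inside the span
        self.events.append(Event(name, time.time_ns(), attributes or {}))

    def end(self):
        self.end_ns = time.time_ns()

    @property
    def duration_ms(self):
        # Only valid after end() has been called
        return (self.end_ns - self.start_ns) / 1_000_000


root = Span("POST /api/orders", trace_id=secrets.token_hex(16))
child = Span("process-payment", trace_id=root.trace_id, parent_span_id=root.span_id)
child.attributes["payment.amount"] = 59.99
child.add_event("payment-authorized")
child.status = StatusCode.OK
child.end()
root.end()
print(child.trace_id == root.trace_id, len(root.trace_id), len(child.span_id))  # True 32 16
```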

Instrumentation

Node.js Auto-Instrumentation

// app.js: this setup must run before express, http, or any other instrumented module is required
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');

// Configure the exporter to send spans to the OTel Collector over OTLP/HTTP
const exporter = new OTLPTraceExporter({
  url: 'http://localhost:4318/v1/traces',
});

// Create the tracer provider with a batching span processor (better throughput).
// Current SDKs take processors in the constructor; addSpanProcessor() was removed in SDK 2.x.
const provider = new NodeTracerProvider({
  spanProcessors: [new BatchSpanProcessor(exporter)],
});
provider.register();

// Auto-instrument HTTP, Express, gRPC, and database clients
registerInstrumentations({
  instrumentations: getNodeAutoInstrumentations(),
});

// Now create your Express app as usual
const express = require('express');
const app = express();
// ... incoming requests to every route are traced automatically
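
Auto-instrumentation works by wrapping library functions at load time, which is why the setup has to run before the instrumented modules are used. In Node.js the instrumentation packages patch modules as they are loaded through `require`. The Python sketch below shows the same underlying idea with plain monkey-patching of a hypothetical `HttpClient` class; it records toy span dicts into a list rather than calling a real tracer.

```python
import functools
import time

RECORDED = []  # stands in for a span exporter in this sketch


def instrument(owner, attr_name, span_name):
    """Replace owner.attr_name with a wrapper that records one toy span per call."""
    original = getattr(owner, attr_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "Unset"  # successful calls usually leave the status unset
        try:
            return original(*args, **kwargs)
        except Exception:
            status = "Error"
            raise  # never swallow the application's exception
        finally:
            RECORDED.append({
                "name": span_name,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "status": status,
            })

    setattr(owner, attr_name, wrapper)


class HttpClient:  # hypothetical library class that application code calls
    def get(self, url):
        if "fail" in url:
            raise ConnectionError(url)
        return 200


instrument(HttpClient, "get", "HTTP GET")  # done once at startup, before app code runs

client = HttpClient()
client.get("https://example.com/ok")  # application code itself is unchanged
try:
    client.get("https://example.com/fail")
except ConnectionError:
    pass
print([(s["name"], s["status"]) for s in RECORDED])
# [('HTTP GET', 'Unset'), ('HTTP GET', 'Error')]
```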

Python Auto and Manual Instrumentation

from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Set up the tracer provider; service.name identifies this service in the tracing UI
provider = TracerProvider(resource=Resource.create({SERVICE_NAME: "order-service"}))
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)  # OTLP/gRPC to the Collector
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Auto-instrument Flask (incoming requests)
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

# Auto-instrument outgoing HTTP calls made with the requests library
RequestsInstrumentor().instrument()

# Manual tracing for business logic
tracer = trace.get_tracer(__name__)

def process_order(order_id):
    # start_as_current_span records escaping exceptions and sets ERROR status by default
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.value", 59.99)

        # Nested span for the database call (becomes a child of process-order)
        with tracer.start_as_current_span("db-query") as db_span:
            # Semantic-convention keys (newer spec versions rename db.statement to db.query.text)
            db_span.set_attribute("db.system", "postgresql")
            db_span.set_attribute("db.statement", "SELECT * FROM orders WHERE id = %s")
            # ... execute query

        # Nested span for the external API call
        with tracer.start_as_current_span("payment-gateway") as pay_span:
            pay_span.set_attribute("payment.provider", "stripe")
            pay_span.set_attribute("payment.amount", 59.99)
            # ... call payment API

        return {"status": "success", "order_id": order_id}

OpenTelemetry Collector

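The Collector receives, processes, and exports telemetry data. It acts as a central hub between applications and backends: services send OTLP to the Collector, which buffers, batches, filters, or enriches the data and then forwards it to one or more backends. Each pipeline is defined as a chain of receivers, processors, and exporters.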
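
Inside the Collector, each pipeline passes data from its receivers through its processors in the listed order and then fans it out to every listed exporter. The toy Python model below uses plain dicts instead of OTLP data, and all function names are hypothetical. It has one filtering and one enriching processor, roughly the roles that processors such as `filter` and `attributes` play in a real Collector.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def run_pipeline(records: List[Record],
                 processors: List[Callable[[List[Record]], List[Record]]],
                 exporters: List[Callable[[List[Record]], None]]) -> None:
    """Run records through processors in order, then fan out to every exporter."""
    for process in processors:
        records = process(records)
    for export in exporters:
        export(list(records))  # each exporter receives its own list


def drop_health_checks(records):
    # Filtering processor: discard spans for a hypothetical /healthz endpoint
    return [r for r in records if r.get("http.route") != "/healthz"]


def add_environment(records):
    # Enriching processor: attach a deployment attribute to every record
    return [{**r, "deployment.environment": "prod"} for r in records]


jaeger_sink, debug_sink = [], []
received = [
    {"name": "GET /healthz", "http.route": "/healthz"},
    {"name": "POST /api/orders", "http.route": "/api/orders"},
]
run_pipeline(
    received,
    processors=[drop_health_checks, add_environment],
    exporters=[jaeger_sink.extend, debug_sink.extend],
)
print(len(jaeger_sink), jaeger_sink[0]["deployment.environment"])  # 1 prod
```

Doing this work once in the Collector, instead of in every service, is what makes it useful as a central hub.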

Collector Configuration

# otel-collector-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

# Exporters: send to multiple backends
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # The prometheus exporter ships in the contrib distribution (otelcol-contrib)
  prometheus:
    endpoint: 0.0.0.0:8889

  # debug replaces the deprecated (now removed) logging exporter
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus, debug]
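
The `batch` settings trade a little latency for efficiency: data is held until `send_batch_size` items accumulate or `timeout` expires, whichever comes first. Below is a simplified Python model of those two flush rules. It uses an injected clock in milliseconds so the behavior is deterministic, and its class and method names are hypothetical. The real processor's timer handling differs in detail.

```python
class Batcher:
    """Toy model of the batch processor's flush rules (not the Collector's code).

    Flushes when `send_batch_size` items are buffered, or when `timeout_ms`
    has passed since the first buffered item (a simplification of the real timer).
    """

    def __init__(self, send_batch_size, timeout_ms, export):
        self.send_batch_size = send_batch_size
        self.timeout_ms = timeout_ms
        self.export = export
        self.buffer = []
        self.first_item_at = None

    def add(self, item, now_ms):
        if not self.buffer:
            self.first_item_at = now_ms
        self.buffer.append(item)
        if len(self.buffer) >= self.send_batch_size:
            self.flush()  # size trigger

    def tick(self, now_ms):
        """Called periodically; flushes a partial batch once the timeout expires."""
        if self.buffer and now_ms - self.first_item_at >= self.timeout_ms:
            self.flush()  # time trigger

    def flush(self):
        self.export(self.buffer)
        self.buffer = []
        self.first_item_at = None


batches = []
b = Batcher(send_batch_size=3, timeout_ms=1000, export=batches.append)
for i in range(4):
    b.add(f"span-{i}", now_ms=100 * i)  # the first 3 items fill a batch immediately
b.tick(now_ms=500)   # only 200ms since span-3 arrived: keep waiting
b.tick(now_ms=1300)  # 1000ms elapsed: flush the partial batch
print([len(batch) for batch in batches])  # [3, 1]
```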

Visualization Backends

| Backend | Type | OTel Integration |
|---------|------|------------------|
| Jaeger | Tracing UI and storage | Native OTLP ingest |
| Grafana Tempo | Trace storage (can derive metrics from spans) | Native OTLP; queried through Grafana |
| Grafana | Combined dashboards | Tempo, Prometheus, and Loki data sources |
| SigNoz | Open-source APM | Native OTLP |
| Datadog | Commercial APM | Collector Datadog exporter, or OTLP ingest in the Datadog Agent |
| New Relic | Commercial APM | Native OTLP endpoint |
| AWS X-Ray | Cloud tracing | AWS Distro for OpenTelemetry (ADOT) |

Sampling Strategies

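| Strategy | Description | Trade-offs |
|----------|-------------|------------|
| Head-based | Decision made when the root span starts, then propagated to child spans | Simple and cheap; may drop important traces because the outcome is not yet known |
| Tail-based | Decision made after all spans of a trace have completed | Can keep every error or slow trace; must buffer whole traces (e.g., in the Collector) |
| Probabilistic | Keep a fixed fraction (e.g., 10%) of traces, usually derived from the trace ID | Low overhead, good for high volume; a common form of head-based sampling |
| Rate-limiting | Keep at most N traces per second | Caps cost during traffic spikes |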
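
A tail-based sampler can apply rules that need the finished trace, such as "keep every trace that contains an error". The minimal Python sketch below assumes the complete set of spans for one trace has already been buffered. In practice that buffering is the hard part, and it is usually handled by the Collector's contrib `tail_sampling` processor. The thresholds and names are illustrative.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class CompletedSpan:
    name: str
    duration_ms: float
    is_error: bool


def keep_trace(spans: List[CompletedSpan], slow_ms: float = 1000.0,
               baseline_rate: float = 0.1, rng=random.random) -> bool:
    """Tail-based decision, made only after all spans of the trace have arrived."""
    if any(s.is_error for s in spans):
        return True  # always keep traces that contain an error
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True  # always keep slow traces (the root span is normally the longest)
    return rng() < baseline_rate  # keep a random share of the healthy rest


healthy = [CompletedSpan("POST /api/orders", 155, False)]
failed = healthy + [CompletedSpan("call-payment-gateway", 115, True)]
print(keep_trace(failed), keep_trace(healthy, rng=lambda: 0.5))  # True False
```

A head-based sampler could not implement the first rule, because the error has not happened yet when the decision is made.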

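# Probabilistic sampling in the Collector: a processor (otelcol-contrib), not an exporter setting
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # keep ~10% of traces, decided from a hash of the trace ID

# Then add it to the traces pipeline:
#   processors: [memory_limiter, probabilistic_sampler, batch]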
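
Probabilistic sampling is usually made deterministic by deriving the decision from the trace ID, so services using the same ratio keep or drop the same traces without coordinating. The sketch below follows the approach of the OTel SDKs' `TraceIdRatioBased` sampler, comparing the low 64 bits of the trace ID against a bound. Treat it as an illustration, not the SDK's exact code. In an SDK you would select the built-in sampler instead, for example with `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and `OTEL_TRACES_SAMPLER_ARG=0.1`.

```python
import secrets

TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a 128-bit trace ID


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic ratio sampling: the same trace ID always gets the same decision."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    low_64_bits = int(trace_id_hex, 16) & TRACE_ID_LIMIT
    return low_64_bits < bound


ids = [secrets.token_hex(16) for _ in range(100_000)]  # random 16-byte trace IDs
kept = sum(should_sample(t, 0.1) for t in ids)
print(kept / len(ids))  # close to 0.1
t = ids[0]
print(should_sample(t, 0.1) == should_sample(t, 0.1))  # True: the decision is repeatable
```

Because trace IDs are random, comparing them against a fixed bound keeps the requested fraction on average while costing only one integer comparison per trace.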

Best Practices

| Practice | Reason |
|----------|--------|
| Add instrumentation at the start of a project | Retrofitting it later requires more refactoring |
| Use auto-instrumentation when possible | Less code; covers standard libraries and frameworks |
| Add manual spans for business logic | Visibility into operations that auto-instrumentation cannot see |
| Set meaningful span attributes | Enables filtering and analysis |
| Set span status on errors | Makes failed spans easy to find |
| Use the batch span processor | Exports spans in batches in the background; the simple processor exports each span synchronously as it ends |
| Deploy the OTel Collector | Centralized processing, buffering, and retries |
| Use consistent naming conventions (e.g., the OTel semantic conventions) | Easier to search and correlate |
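
Setting status on errors matters because spans left as Unset are not flagged as failures. The OTel Python SDK's `start_as_current_span` already records the exception and sets ERROR status when an exception escapes the block. For errors you catch and handle, you call `span.record_exception(exc)` and `span.set_status(...)` yourself. The toy context manager below shows the pattern and the `exception.*` event attributes used by the semantic conventions. The `traced` helper and its dict-based spans are hypothetical.

```python
import traceback
from contextlib import contextmanager


@contextmanager
def traced(name, spans):
    """Toy span that records an exception event and ERROR status before re-raising."""
    span = {"name": name, "status": "Unset", "events": []}
    try:
        yield span
    except Exception as exc:
        span["status"] = "Error"
        span["events"].append({
            "name": "exception",
            "exception.type": type(exc).__name__,
            "exception.message": str(exc),
            "exception.stacktrace": traceback.format_exc(),
        })
        raise  # tracing must not swallow the application's error
    finally:
        spans.append(span)


spans = []
try:
    with traced("charge-card", spans):
        raise ValueError("card declined")
except ValueError:
    pass
print(spans[0]["status"], spans[0]["events"][0]["exception.type"])  # Error ValueError
```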

Summary

OpenTelemetry is the industry standard for distributed tracing and observability instrumentation. It provides vendor-neutral APIs and SDKs for generating traces, metrics, and logs. The OTel Collector centralizes processing and export. Combined with a backend such as Jaeger or Tempo for visualization, OTel gives you end-to-end visibility into requests as they cross service boundaries.

Key takeaways:

  • OpenTelemetry is vendor-neutral: switch backends without changing instrumentation
  • Traces = tree of spans showing request flow through services
  • Auto-instrumentation covers HTTP, databases, and frameworks with little or no code change
  • Manual instrumentation adds custom spans for business logic
  • The OTel Collector receives, processes, and exports telemetry data
  • Sampling controls costs (probabilistic, rate-limiting, tail-based)
  • Use consistent span names and attributes for effective analysis
  • Deploy Jaeger or Tempo for trace visualization

What's Next: Full Observability Stack

The next chapter combines Prometheus, Grafana, Loki, and OpenTelemetry into a complete observability stack — deploy with Docker Compose, configure data sources, and build unified dashboards.
