OpenTelemetry — Distributed Tracing and Instrumentation
Why OpenTelemetry Matters
OpenTelemetry (OTel) is the industry standard for observability instrumentation. It provides a single set of APIs, SDKs, and tools for generating, collecting, and exporting telemetry data (traces, metrics, logs). It is the second most active CNCF project after Kubernetes.
Why this matters for your career:
- It is supported by the major cloud providers and observability vendors
- It provides vendor-neutral instrumentation (switch between Jaeger, Tempo, Datadog, New Relic)
- OTel skills are increasingly required for backend and platform engineering roles
- Distributed tracing is essential for debugging microservices architectures
What Is OpenTelemetry?
OpenTelemetry is a collection of tools, APIs, and SDKs used to instrument, generate, collect, and export telemetry data.
Components
| Component | Purpose |
|-----------|---------|
| OTel API | Language-specific interfaces for creating spans, metrics, logs |
| OTel SDK | Implementation of the API with sampling, processing, exporting |
| OTel Collector | Vendor-agnostic agent for receiving, processing, and exporting telemetry |
| Instrumentation Libraries | Auto-instrumentation for popular frameworks (Express, Spring, Django) |
| Exporters | Send data to backends (Jaeger, Tempo, Prometheus, Datadog) |
Core Concepts
Traces and Spans
A trace represents the entire journey of a request as it travels through a distributed system. A span represents a single unit of work within a trace.
Trace: POST /api/orders
├── Span: authenticate-user (2ms)
├── Span: validate-order (5ms)
├── Span: process-payment (120ms)
│   ├── Span: call-payment-gateway (115ms)
│   └── Span: update-payment-status (3ms)
├── Span: update-inventory (20ms)
│   └── Span: db-query-update-stock (18ms)
└── Span: send-confirmation (8ms)
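To make the tree structure concrete, the following is a minimal sketch that models the trace above as nested objects and computes each span's self time (its duration minus the time covered by its direct children). `SpanNode`, `walk`, and the 160 ms root duration are assumptions made for illustration; this is a toy model, not the OpenTelemetry SDK's span type, and it assumes child spans run one after another without overlapping.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SpanNode:
    """Toy span: a name, a total duration, and child spans (not the OTel SDK type)."""
    name: str
    duration_ms: float
    children: List["SpanNode"] = field(default_factory=list)

    def self_time_ms(self) -> float:
        # Time spent in this span itself, excluding its direct children.
        # Assumes children run sequentially and do not overlap.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span: SpanNode, depth: int = 0) -> Iterator[Tuple[int, SpanNode]]:
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# The POST /api/orders trace from the diagram; the 160 ms root duration is assumed.
trace_root = SpanNode("POST /api/orders", 160, [
    SpanNode("authenticate-user", 2),
    SpanNode("validate-order", 5),
    SpanNode("process-payment", 120, [
        SpanNode("call-payment-gateway", 115),
        SpanNode("update-payment-status", 3),
    ]),
    SpanNode("update-inventory", 20, [
        SpanNode("db-query-update-stock", 18),
    ]),
    SpanNode("send-confirmation", 8),
])

# The span with the largest self time is where the request spends most of its time.
slowest = max((s for _, s in walk(trace_root)), key=lambda s: s.self_time_ms())

if __name__ == "__main__":
    for depth, span in walk(trace_root):
        indent = "  " * depth
        print(f"{indent}{span.name}: {span.duration_ms} ms (self {span.self_time_ms()} ms)")
    print("slowest:", slowest.name)
```

Ranking spans by self time rather than total duration points at the leaf doing the work (here the payment gateway call) instead of the parents that merely contain it.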
Span Attributes
Each span carries:
- Name: Operation name (e.g., "process-payment")
- Span ID: Unique identifier
- Trace ID: Links all spans in the same trace
- Parent Span ID: Links to the parent span (hierarchy)
- Start/End Time: Duration calculation
- Attributes: Key-value pairs (e.g., order.id, payment.amount)
- Events: Timestamped log messages within the span
- Status: OK, Error, or Unset
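The fields listed above can be pictured as a small record. The sketch below is a toy model for illustration only: `SpanRecord` and `start_span` are hypothetical names, not the OpenTelemetry SDK classes, and the `order.id` value is made up. It shows how a child span inherits the trace ID, points at its parent through the parent span ID, and records an event and an error status.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanRecord:
    """Toy model of the span fields; not the OpenTelemetry SDK's Span class."""
    name: str
    trace_id: str                      # 128-bit ID shared by every span in the trace
    span_id: str                       # 64-bit ID unique to this span
    parent_span_id: Optional[str]      # None for the root span
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: str = "UNSET"              # UNSET, OK, or ERROR

    def add_event(self, name: str) -> None:
        # An event is a timestamped message attached to the span.
        self.events.append((time.time_ns(), name))

    def end(self) -> None:
        self.end_ns = time.time_ns()

    def duration_ns(self) -> int:
        if self.end_ns is None:
            raise ValueError("span has not ended")
        return self.end_ns - self.start_ns


def start_span(name: str, parent: Optional[SpanRecord] = None) -> SpanRecord:
    """Create a root span, or a child that inherits the parent's trace ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return SpanRecord(name, trace_id, secrets.token_hex(8), parent_id)


root = start_span("process-payment")
root.attributes["order.id"] = "A-1001"       # hypothetical attribute value
child = start_span("call-payment-gateway", parent=root)
try:
    raise TimeoutError("gateway timed out")  # simulated failure
except TimeoutError as exc:
    child.add_event(f"exception: {exc}")
    child.status = "ERROR"
finally:
    child.end()
root.end()
```

In real instrumentation the SDK generates the IDs and timestamps for you; the point of the sketch is only how the trace ID and parent span ID tie spans together.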
Instrumentation
Node.js Auto-Instrumentation
// app.js: this must run at the top of the entry file, before other modules are loaded
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');

// Create and configure tracer provider
const provider = new NodeTracerProvider();

// Configure exporter to send to the OTel Collector (OTLP over HTTP)
const exporter = new OTLPTraceExporter({
  url: 'http://localhost:4318/v1/traces',
});

// Add span processor (batch for performance).
// Note: SDK 2.x removed addSpanProcessor(); there, pass
// { spanProcessors: [...] } to the NodeTracerProvider constructor instead.
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Auto-instrument HTTP, Express, gRPC, database calls
registerInstrumentations({
  instrumentations: getNodeAutoInstrumentations(),
});

// Require Express only after instrumentation is registered so it can be patched
const express = require('express');
const app = express();
// ... all routes are automatically traced
Python Manual Instrumentation
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Set up tracer provider
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Auto-instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

# Auto-instrument outgoing HTTP requests made with the requests library
RequestsInstrumentor().instrument()

# Manual tracing
tracer = trace.get_tracer(__name__)


def process_order(order_id):
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.value", 59.99)

        # Nested span for database call (child of process-order)
        with tracer.start_as_current_span("db-query") as db_span:
            db_span.set_attribute("db.system", "postgresql")
            db_span.set_attribute("db.query", "SELECT * FROM orders WHERE id = %s")
            # ... execute query

        # Nested span for external API call (child of process-order)
        with tracer.start_as_current_span("payment-gateway") as pay_span:
            pay_span.set_attribute("payment.provider", "stripe")
            pay_span.set_attribute("payment.amount", 59.99)
            # ... call payment API

        return {"status": "success", "order_id": order_id}
OpenTelemetry Collector
The Collector receives, processes, and exports telemetry data. It acts as a central hub.
Collector Configuration
# otel-collector-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

# Exporters: send to multiple backends
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  # "debug" replaces the deprecated "logging" exporter in recent Collector releases
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus, debug]
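The `batch` processor settings above (`timeout` and `send_batch_size`) describe a simple idea: buffer telemetry and send it when the batch is full or when it has waited long enough. The following is a rough sketch of that idea only, not the Collector's implementation; `BatchBuffer`, `add`, `tick`, and the explicit `now` argument are hypothetical names chosen so the behaviour can be followed without real timers.

```python
from typing import Callable, List, Optional


class BatchBuffer:
    """Toy model of batching: flush when the batch is full or the timeout has passed."""

    def __init__(self, export: Callable[[List[str]], None],
                 send_batch_size: int = 1024, timeout_s: float = 1.0):
        self.export = export
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.items: List[str] = []
        self.first_item_at: Optional[float] = None  # arrival time of the oldest item

    def add(self, item: str, now: float) -> None:
        if not self.items:
            self.first_item_at = now
        self.items.append(item)
        if len(self.items) >= self.send_batch_size:
            self.flush()

    def tick(self, now: float) -> None:
        # Called periodically; flushes a partial batch once it is older than the timeout.
        if self.items and now - self.first_item_at >= self.timeout_s:
            self.flush()

    def flush(self) -> None:
        batch, self.items = self.items, []
        self.first_item_at = None
        self.export(batch)


exported: List[List[str]] = []
buf = BatchBuffer(exported.append, send_batch_size=3, timeout_s=1.0)

for i in range(4):
    buf.add(f"span-{i}", now=0.0)   # the third add fills the batch and triggers a flush
buf.tick(now=0.5)                   # too early: span-3 stays buffered
buf.tick(now=1.2)                   # timeout elapsed: the partial batch is flushed
```

Larger batches mean fewer export calls, while the timeout bounds how long a span can sit in the buffer when traffic is low.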
Visualization Backends
| Backend | Type | Integration |
|---------|------|-------------|
| Jaeger | Tracing backend and UI | Native OTLP support |
| Grafana Tempo | Tracing backend | Native OTLP + Grafana |
| Grafana | Combined dashboards | Tempo, Prometheus, Loki datasources |
| SigNoz | Open-source APM | Native OTLP |
| Datadog | Commercial APM | OTel → Datadog exporter |
| New Relic | Commercial APM | Native OTLP endpoint |
| AWS X-Ray | Cloud tracing | AWS Distro for OpenTelemetry (ADOT) |
Sampling Strategies
| Strategy | Description | Use Case |
|----------|-------------|----------|
| Head-based | Decision made when the trace starts (at the root span) | Simple, may miss important traces |
| Tail-based | Decision made after the whole trace completes | Can keep all traces with errors, more complex |
| Probabilistic | Sample X% of all traces | Low overhead, good for high volume |
| Rate-limiting | Max N traces per second | Control costs |
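# Probabilistic sampling in the Collector: sampling is configured as a processor,
# not on an exporter. Add probabilistic_sampler to the traces pipeline's processors list.
processors:
  probabilistic_sampler:
    sampling_percentage: 10 # Sample 10% of traces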
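To illustrate the difference between head-based and tail-based decisions, here is a minimal sketch under stated assumptions. `head_sample` and `tail_sample` are hypothetical helpers, spans are represented as plain dictionaries, and the trace IDs are randomly generated; the ratio check is similar in spirit to trace-ID-ratio samplers but is not any SDK's exact algorithm.

```python
import random
from typing import Dict, List


def head_sample(trace_id_hex: str, probability: float) -> bool:
    """Decide at trace start, using only the trace ID.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is kept or dropped as a whole.
    """
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < int(probability * 2**64)


def tail_sample(spans: List[Dict], probability: float) -> bool:
    """Decide after the whole trace is collected: keep every trace that
    contains an error, and apply the ratio check to the rest."""
    if any(span["status"] == "ERROR" for span in spans):
        return True
    return head_sample(spans[0]["trace_id"], probability)


# Head-based: roughly 10% of randomly generated trace IDs are kept.
rng = random.Random(42)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(head_sample(t, 0.1) for t in trace_ids)

# Tail-based: a trace with an error span is kept even at probability 0.
failed_trace = [
    {"trace_id": trace_ids[0], "name": "process-payment", "status": "UNSET"},
    {"trace_id": trace_ids[0], "name": "call-payment-gateway", "status": "ERROR"},
]
ok_trace = [{"trace_id": trace_ids[1], "name": "validate-order", "status": "OK"}]

if __name__ == "__main__":
    print(f"head-based kept {kept} of {len(trace_ids)} traces")
    print("failed trace kept:", tail_sample(failed_trace, 0.0))
    print("ok trace kept:", tail_sample(ok_trace, 0.0))
```

The trade-off shows up in the signatures: the head-based function needs only a trace ID, while the tail-based one needs every span of the trace, which is why tail sampling requires buffering complete traces before deciding.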
Best Practices
| Practice | Reason |
|----------|--------|
| Add instrumentation at the start of a project | Adding later requires more refactoring |
| Use auto-instrumentation when possible | Less code, covers standard libraries |
| Add manual spans for business logic | Custom visibility into important operations |
| Set meaningful span attributes | Enable filtering and analysis |
| Set span status on errors | Easily identify failed spans |
| Use batch span processor | Better performance than simple processor |
| Deploy the OTel Collector | Centralized processing, buffering, retries |
| Use consistent naming conventions | Easier to search and correlate |
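As one way to apply the naming practice, the sketch below derives span names from an HTTP method plus a route template rather than the raw URL, so that requests for different orders share one name, and checks that attribute keys follow a lowercase dotted style such as `order.id`. Both helpers (`span_name`, `is_conventional_key`) and the ID-detection rule are assumptions made for illustration, not part of any OpenTelemetry library.

```python
import re

# Path segments that look like identifiers: all digits, or a UUID-shaped string.
_ID_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})$")

# Lowercase, dot-separated attribute keys such as "order.id" or "payment.amount".
_KEY_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)*$")


def span_name(method: str, path: str) -> str:
    """Build a low-cardinality span name such as 'GET /api/orders/{id}'."""
    segments = ["{id}" if _ID_SEGMENT.match(s) else s for s in path.split("/")]
    return f"{method.upper()} {'/'.join(segments)}"


def is_conventional_key(key: str) -> bool:
    """Return True when an attribute key uses the lowercase dotted style."""
    return bool(_KEY_PATTERN.match(key))


if __name__ == "__main__":
    print(span_name("get", "/api/orders/12345"))
    print(span_name("get", "/api/orders/67890"))
    print(is_conventional_key("order.id"), is_conventional_key("OrderID"))
```

Keeping identifiers in attributes rather than in the span name means a backend can group all calls to the same route and still filter by a specific order.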
Summary
OpenTelemetry is the industry standard for distributed tracing and observability instrumentation. It provides vendor-neutral APIs and SDKs for generating traces, metrics, and logs. The OTel Collector centralizes processing and export. Combined with a backend such as Jaeger or Tempo for visualization, OTel gives you end-to-end visibility into how requests flow through your distributed systems.
Key takeaways:
- OpenTelemetry is vendor-neutral: you can switch backends without changing instrumentation
- Traces = tree of spans showing request flow through services
- Auto-instrumentation covers HTTP, databases, and frameworks with a few lines of setup and no changes to application code
- Manual instrumentation adds custom spans for business logic
- The OTel Collector receives, processes, and exports telemetry data
- Sampling controls costs (probabilistic, rate-limiting, tail-based)
- Use consistent span names and attributes for effective analysis
- Deploy Jaeger or Tempo for trace visualization
What's Next: Full Observability Stack
The next chapter combines Prometheus, Grafana, Loki, and OpenTelemetry into a complete observability stack — deploy with Docker Compose, configure data sources, and build unified dashboards.