Implementing Distributed Tracing in Microservices with NestJS + OpenTelemetry — A Practical Guide to Jaeger & Grafana Tempo Integration
Do you know what the first wall you hit after migrating to microservices looks like? You have logs but can't tell which service slowed things down, and you have error messages but can't trace where they originated — that frustrating feeling. When I first operated a microservices environment, I once wasted half a day chasing down a single incident, manually cross-referencing logs from six different services in Kibana by timestamp.
OpenTelemetry (OTel) lets you quickly pinpoint the root cause in exactly that situation. It's an open-source observability framework maintained by the CNCF that visualizes every path a request takes across a distributed system as a single Trace. Instead of cross-referencing logs from multiple services by timestamp, you can instantly see which services a request passed through and how long each took — all on one screen. Being a vendor-neutral standard also means you can swap backends between Jaeger, Datadog, and Grafana Tempo at will.
By the end of this article, you will have attached OTel to a NestJS service and confirmed Traces locally in Jaeger. If you're already familiar with OTel concepts, feel free to jump straight to the Practical Application section.
Core Concepts
Trace, Span, and Context Propagation
The way distributed tracing works can be summarized in one sentence: when a request enters the first service, a unique trace_id is generated, and this ID is forwarded along with every downstream service call via HTTP headers. Each service records its own processing scope as a Span, and all Spans are reassembled into a single Trace that reveals the full path.
| Concept | Description |
|---|---|
| Trace | The complete journey of a single request across multiple services. Identified by a unique trace_id |
| Span | An individual unit of work within a Trace. Includes start/end times, metadata (attributes), and status |
| Context Propagation | The mechanism for passing Trace IDs and Span IDs between services via HTTP headers (traceparent, tracestate) and similar means |
| OTel Collector | An intermediate gateway that receives, transforms, and routes telemetry from apps. Acts as a decoupling hub between apps and backends |
| OTLP | OpenTelemetry Protocol. The standard telemetry transport format, carried over gRPC (a high-performance binary RPC framework) or HTTP |
What is Context Propagation? When Service A calls Service B over HTTP, it embeds the current context in the request header in the form `traceparent: 00-{trace_id}-{span_id}-01`. Service B reads this header and attaches its own Span to the same Trace.
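As a concrete illustration of the header layout, the sketch below splits a `traceparent` value into its four fields. This is purely illustrative — in a real service the parsing and validation are handled by the OTel SDK's W3C Trace Context propagator, not hand-written code.

```typescript
// Minimal illustration of the W3C traceparent header layout.
// Format: {version}-{trace_id}-{span_id}-{flags}
// Real parsing is done by the SDK's W3CTraceContextPropagator; this sketch
// only shows what the four dash-separated fields contain.
function parseTraceparent(header: string) {
  const [version, traceId, spanId, flags] = header.split('-');
  return { version, traceId, spanId, flags };
}

const parsed = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
console.log(parsed.traceId); // the 32-hex-char trace_id shared by every Span in the Trace
```

Every service that receives this header reuses the same `trace_id` for its own Spans, which is what lets the backend stitch them into one Trace.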
OTel Collector — The Hub That Decouples Apps from Backends
Without a Collector, apps must send data directly to Jaeger or Tempo. Swapping out the backend means changing app code too. With a Collector in the middle, the app always exports only via OTLP, and routing decisions are managed entirely in the Collector's configuration file.
The example below shows how to configure a Collector pipeline using a YAML configuration file.
```yaml
# otel-config.yaml — Collector pipeline configuration
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
  # tail_sampling is not included in the default otelcol distribution —
  # you must use the otel/opentelemetry-collector-contrib image
  tail_sampling:
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
exporters:
  # the jaeger exporter is legacy since Jaeger 1.35; the otlp exporter is recommended
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp/jaeger, otlp/tempo]
```

What is Tail Sampling? Head-based sampling decides whether to collect a request at the moment it enters, whereas tail-based sampling buffers all Spans of a Trace first and then decides based on the full picture. Tail Sampling is what keeps you from missing the Traces that contain errors or slow queries. Note that this processor ships only in the contrib distribution (the `otel/opentelemetry-collector-contrib` image), not in the default distribution (`otelcol`), so be careful when choosing your image.
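To try this pipeline locally, a minimal docker-compose sketch might look like the following. Service names, image tags, and port mappings here are assumptions for illustration, not a production setup; the Tempo service referenced by the config is omitted for brevity.

```yaml
# docker-compose.yaml — illustrative only; image tags and ports are assumptions
services:
  otel-collector:
    # contrib image is required because the pipeline uses tail_sampling
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC in from apps
      - "4318:4318"   # OTLP HTTP in from apps
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
  # a tempo service would be declared similarly (omitted here)
```

With this layout the app only ever talks to `otel-collector:4318`; pointing Traces at a different backend is a Collector config change, not an app change.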
Now that we've covered the concepts, let's move on to actual code.
Practical Application
Example 1: Attaching the OTel SDK to a NestJS Service
Honestly, the part where people make the most mistakes when setting up for the first time is the initialization order. The OTel SDK must be initialized before NestJS loads any modules. This is exactly why import './tracing' must be the very first line in main.ts.
First, install the packages. Pinning versions now will save you from surprises caused by API changes later.
```shell
pnpm add \
  @opentelemetry/sdk-node@0.57.0 \
  @opentelemetry/auto-instrumentations-node@0.57.0 \
  @opentelemetry/exporter-trace-otlp-http@0.57.0 \
  @opentelemetry/resources@1.30.0 \
  @opentelemetry/semantic-conventions@1.30.0 \
  @opentelemetry/api@1.9.0
```

```typescript
// tracing.ts — this file must run before main.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

// SEMRESATTRS_SERVICE_NAME is slated for deprecation in SDK 1.x;
// migrating to ATTR_SERVICE_NAME (semantic-conventions 1.27+) is recommended
const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// On process exit, flush unexported Spans and shut the SDK down cleanly.
// Without this, the Trace of the last request may be lost.
process.on('SIGTERM', () => sdk.shutdown());
process.on('SIGINT', () => sdk.shutdown());
```

```typescript
// main.ts — import './tracing' must be the very first line
import './tracing';
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3000);
}
bootstrap();
```

| Code Point | Description |
|---|---|
| `getNodeAutoInstrumentations()` | Auto-instruments Express, TypeORM, Prisma, Redis, HTTP clients, and more without any code changes |
| `ATTR_SERVICE_NAME` | The name used to distinguish services in Jaeger and Tempo. Must be set to a unique value |
| `OTLPTraceExporter` | Sends Traces to the Collector via OTLP/HTTP. Use the `exporter-trace-otlp-grpc` package if you prefer gRPC |
| `sdk.shutdown()` | Flushes buffered Spans when SIGTERM or SIGINT is received. Omitting this can lose the last request's Trace |
Example 2: Tracing Business Logic with Custom Spans
Auto-instrumentation alone is sometimes not enough. For cases where you want to trace internal business flows in detail — such as payment processing logic — you can add custom Spans. Here's a pattern commonly used in production.
```typescript
// order.service.ts
import { Injectable } from '@nestjs/common';
import { trace, SpanStatusCode } from '@opentelemetry/api';

@Injectable()
export class OrderService {
  async processOrder(orderId: string) {
    const tracer = trace.getTracer('order-service');
    return tracer.startActiveSpan('processOrder', async (span) => {
      span.setAttribute('order.id', orderId);
      try {
        const result = await this.orderRepository.process(orderId);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (err) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end(); // must be called in finally so the Span always closes
      }
    });
  }
}
```

What happens if you forget `span.end()`? The Span never closes, so it is never handed to the exporter and will not appear in the Trace viewer at all, and the SDK keeps holding a reference to the open Span, which can leak memory in the app process. It's worth developing the habit of always placing it in a `finally` block.
Continuing Traces Across Asynchronous RabbitMQ Boundaries
HTTP calls propagate context automatically via headers, but what about message queues? Fortunately, getNodeAutoInstrumentations() includes amqplib instrumentation. When publishing a message, it automatically injects the current Trace context into the message headers, and upon consumption it automatically extracts it — so Traces remain unbroken even across asynchronous boundaries.
I still remember being genuinely impressed the first time I saw the Order Service → RabbitMQ → Notification Service flow connected as a single Trace in Jaeger.
That said, it's too early to be optimistic that "everything works automatically." In cases where you use non-standard response patterns like amqplib's direct-reply-to, or route through an in-house wrapper library rather than amqplib directly, auto-instrumentation may not work. In those situations, you'll need to manually write code to inject and extract context into and from message headers.
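As a minimal sketch of that manual fallback: the real implementation would call `propagation.inject()` on publish and `propagation.extract()` on consume from `@opentelemetry/api`, using the active context. The plain-object helpers below (`injectContext` / `extractContext` are hypothetical names, not library functions) only show the mechanics of carrying the `traceparent` header through AMQP message headers.

```typescript
// Hypothetical sketch of manual context propagation through message headers,
// for wrapper libraries that auto-instrumentation does not cover.
// In production, prefer propagation.inject/extract from '@opentelemetry/api'.

interface AmqpHeaders { [key: string]: string }

// Producer side: copy the current Trace context into the message headers.
function injectContext(headers: AmqpHeaders, traceId: string, spanId: string): AmqpHeaders {
  return { ...headers, traceparent: `00-${traceId}-${spanId}-01` };
}

// Consumer side: read the header back so the consumer Span joins the same Trace.
function extractContext(headers: AmqpHeaders): { traceId: string; spanId: string } | null {
  const tp = headers['traceparent'];
  if (!tp) return null;
  const [, traceId, spanId] = tp.split('-');
  return { traceId, spanId };
}
```

The key point is symmetry: whatever the publisher writes into the headers, the consumer must read back before starting its own Span, or the Trace breaks at the queue boundary.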
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Vendor Neutrality | Instrument once, then swap to any backend — Jaeger, Datadog, New Relic, etc. |
| Auto-Instrumentation | Instruments about 80% of major frameworks — Express, TypeORM, Redis, HTTP clients — without code changes |
| Signal Correlation | Links Traces, Metrics, and Logs via Trace ID, dramatically reducing root cause analysis time |
| CNCF Standard | The second-largest CNCF project after Kubernetes, with a robust community and ecosystem |
Disadvantages and Caveats
The figures below are reference values measured in real service operations. Variance is significant depending on request patterns, instance specs, and sampling ratios, so it's recommended to measure directly in your own staging environment.
| Item | Impact | Mitigation |
|---|---|---|
| CPU Overhead | Can increase by tens of percent over baseline depending on environment | Adjust sampling ratio (1–10%), actively use Batch Processor |
| Memory RSS | A few MB additional | Separate sidecar Collector to reduce app memory burden |
| P99 Latency | May increase slightly under sustained load | Configure async/buffering Exporter settings to prevent app blocking |
| Network Cost | Collector egress costs arise depending on traffic volume | Use Tail Sampling to send only meaningful Traces |
What is the Batch Processor? Instead of exporting Spans one by one immediately, it accumulates them by count or time interval and sends them in batches. This significantly reduces performance overhead by minimizing the number of network calls.
The Most Common Mistakes in Production
- Missing the `tracing.ts` initialization order — If `import './tracing'` is not the first line in `main.ts`, Express will not have been patched by the time NestJS modules load, and HTTP auto-instrumentation will not work.
- Using dynamic values for Span names — When unique Span names like `processOrder-${orderId}` exceed 1,000 entries, the index performance of backends like Jaeger and Tempo degrades sharply. It's recommended to keep Span names fixed (e.g., `processOrder`) and put dynamic values in attributes.
- The app freezing when the Exporter fails in production — If the Collector goes down briefly, the Exporter queue can fill up and block app responses. It's a good idea to familiarize yourself in advance with async/buffering options such as `maxQueueSize` and `scheduledDelayMillis`.
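To show where those knobs live, here is a sketch of a batch-processor configuration object. The numbers are illustrative assumptions to tune against your own load, not recommendations; in the real SDK this object is passed as the second argument of `new BatchSpanProcessor(exporter, config)` from `@opentelemetry/sdk-trace-base`.

```typescript
// Illustrative BatchSpanProcessor tuning — values are assumptions, not advice.
const batchProcessorConfig = {
  maxQueueSize: 2048,         // Spans beyond this are dropped instead of blocking the app
  maxExportBatchSize: 512,    // Spans sent per network call
  scheduledDelayMillis: 5000, // how often buffered Spans are flushed
  exportTimeoutMillis: 30000, // give up on a stuck export instead of backing up the queue
};
console.log(batchProcessorConfig);
```

Dropping Spans under pressure is the deliberate trade-off here: losing some telemetry is better than letting a down Collector slow your request path.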
Closing Thoughts
OpenTelemetry is the most practical choice in a microservices environment for replacing guesswork with data when answering "where did it break?" Investing a few lines of code today can noticeably reduce your entire team's on-call burden.
Three steps you can take right now:
- Install packages: Run the `pnpm add` command above and paste `tracing.ts` into your project — the basic setup is complete.
- Run Jaeger locally: Spin up Jaeger with `docker run -d --name jaeger -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one:latest` and see Traces accumulating for yourself at `http://localhost:16686`.
- Add custom Spans: Attach `tracer.startActiveSpan()` one by one to the core business logic that auto-instrumentation misses, and you can incrementally improve visibility.
Next article: Tuning OTel Collector sampling strategies to match production traffic patterns — designing Head-based vs. Tail-based policies, cost optimization, and a strategy for 100% retention of error Traces.
References
For conceptual understanding
- OpenTelemetry Official Docs — Traces
- OpenTelemetry Official Docs — Context Propagation
- OpenTelemetry Official Docs — Sampling
For practical configuration
- OpenTelemetry NestJS Complete Guide 2026 | SigNoz
- Kubernetes Advanced Observability with OTel, Jaeger, and Tempo | johal.in
For advanced tuning