Implementing Distributed Tracing in Microservices with NestJS + OpenTelemetry — A Practical Guide to Jaeger & Grafana Tempo Integration
Do you know what the first wall you hit after migrating to microservices looks like? You have logs but can't tell which service slowed things down, and you have error messages but can't trace where they originated — that frustrating feeling. When I first operated a microservices environment, I once wasted half a day chasing down a single incident, manually cross-referencing logs from six different services in Kibana by timestamp.
OpenTelemetry (OTel) lets you quickly pinpoint the root cause in exactly that situation. It's an open-source observability framework maintained by the CNCF that visualizes every path a request takes across a distributed system as a single Trace. Instead of cross-referencing logs from multiple services by timestamp, you can instantly see which services a request passed through and how long each took — all on one screen. Being a vendor-neutral standard also means you can swap backends between Jaeger, Datadog, and Grafana Tempo at will.
By the end of this article, you will have attached OTel to a NestJS service and confirmed Traces locally in Jaeger. If you're already familiar with OTel concepts, feel free to jump straight to the Practical Application section.
Core Concepts
Trace, Span, and Context Propagation
The way distributed tracing works can be summarized in one sentence: when a request enters the first service, a unique trace_id is generated, and this ID is forwarded along with every downstream service call via HTTP headers. Each service records its own processing scope as a Span, and all Spans are reassembled into a single Trace that reveals the full path.
| Concept | Description |
|---|---|
| Trace | The complete journey of a single request across multiple services. Identified by a unique trace_id |
| Span | An individual unit of work within a Trace. Includes start/end times, metadata (attributes), and status |
| Context Propagation | The mechanism for passing Trace IDs and Span IDs between services via HTTP headers (traceparent, tracestate) and similar means |
| OTel Collector | An intermediate gateway that receives, transforms, and routes telemetry from apps. Acts as a decoupling hub between apps and backends |
| OTLP | OpenTelemetry Protocol. The standard telemetry transport format, carried over gRPC (a high-performance binary RPC framework) or HTTP |
What is Context Propagation? When Service A calls Service B over HTTP, it embeds the current context in the request header in the form `traceparent: 00-{trace_id}-{span_id}-01`. Service B reads this header and attaches its own Span to the same Trace.
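As a concrete illustration of the header layout, the sketch below splits a `traceparent` value into its four fields. This is purely illustrative — in a real service the parsing and validation are handled by the OTel SDK's W3C Trace Context propagator, not hand-written code.

```typescript
// Minimal illustration of the W3C traceparent header layout.
// Format: {version}-{trace_id}-{span_id}-{flags}
// Real parsing is done by the SDK's W3CTraceContextPropagator; this sketch
// only shows what the four dash-separated fields contain.
function parseTraceparent(header: string) {
  const [version, traceId, spanId, flags] = header.split('-');
  return { version, traceId, spanId, flags };
}

const parsed = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
console.log(parsed.traceId); // the 32-hex-char trace_id shared by every Span in the Trace
```

Every service that receives this header reuses the same `trace_id` for its own Spans, which is what lets the backend stitch them into one Trace.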
OTel Collector — The Hub That Decouples Apps from Backends
Without a Collector, apps must send data directly to Jaeger or Tempo. Swapping out the backend means changing app code too. With a Collector in the middle, the app always exports only via OTLP, and routing decisions are managed entirely in the Collector's configuration file.
The example below shows how to configure a Collector pipeline using a YAML configuration file.
```yaml
# otel-config.yaml — Collector pipeline configuration
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
  # tail_sampling is not included in the default otelcol distribution —
  # you must use the otel/opentelemetry-collector-contrib image
  tail_sampling:
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
exporters:
  # the jaeger exporter is legacy since Jaeger 1.35; the otlp exporter is recommended
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp/jaeger, otlp/tempo]
```

What is Tail Sampling? Head-based sampling decides whether to collect a request at the moment it enters, whereas tail-based sampling buffers all Spans of a Trace first and then decides based on the full picture. Tail Sampling is what keeps you from missing the Traces that contain errors or slow queries. Note that this processor ships only in the contrib distribution (the `otel/opentelemetry-collector-contrib` image), not in the default distribution (`otelcol`), so be careful when choosing your image.
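To try this pipeline locally, a minimal docker-compose sketch might look like the following. Service names, image tags, and port mappings here are assumptions for illustration, not a production setup; the Tempo service referenced by the config is omitted for brevity.

```yaml
# docker-compose.yaml — illustrative only; image tags and ports are assumptions
services:
  otel-collector:
    # contrib image is required because the pipeline uses tail_sampling
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC in from apps
      - "4318:4318"   # OTLP HTTP in from apps
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
  # a tempo service would be declared similarly (omitted here)
```

With this layout the app only ever talks to `otel-collector:4318`; pointing Traces at a different backend is a Collector config change, not an app change.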
Now that we've covered the concepts, let's move on to actual code.
Practical Application
Example 1: Attaching the OTel SDK to a NestJS Service
Honestly, the part where people make the most mistakes when setting up for the first time is the initialization order. The OTel SDK must be initialized before NestJS loads any modules. This is exactly why import './tracing' must be the very first line in main.ts.
First, install the packages. Pinning versions now will save you from surprises caused by API changes later.
```shell
pnpm add \
  @opentelemetry/sdk-node@0.57.0 \
  @opentelemetry/auto-instrumentations-node@0.57.0 \
  @opentelemetry/exporter-trace-otlp-http@0.57.0 \
  @opentelemetry/resources@1.30.0 \
  @opentelemetry/semantic-conventions@1.30.0 \
  @opentelemetry/api@1.9.0
```

```typescript
// tracing.ts — this file must run before main.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

// SEMRESATTRS_SERVICE_NAME is slated for deprecation in SDK 1.x;
// migrating to ATTR_SERVICE_NAME (semantic-conventions 1.27+) is recommended
const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// On process exit, flush unexported Spans and shut the SDK down cleanly.
// Without this, the Trace of the last request may be lost.
process.on('SIGTERM', () => sdk.shutdown());
process.on('SIGINT', () => sdk.shutdown());
```

```typescript
// main.ts — import './tracing' must be the very first line
import './tracing';
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3000);
}
bootstrap();
```

| Code Point | Description |
|---|---|
| `getNodeAutoInstrumentations()` | Auto-instruments Express, TypeORM, Prisma, Redis, HTTP clients, and more without any code changes |
| `ATTR_SERVICE_NAME` | The name used to distinguish services in Jaeger and Tempo. Must be set to a unique value |
| `OTLPTraceExporter` | Sends Traces to the Collector via OTLP/HTTP. Use the `exporter-trace-otlp-grpc` package if you prefer gRPC |
| `sdk.shutdown()` | Flushes buffered Spans when SIGTERM or SIGINT is received. Omitting this can lose the last request's Trace |
Example 2: Tracing Business Logic with Custom Spans
Auto-instrumentation alone is sometimes not enough. For cases where you want to trace internal business flows in detail — such as payment processing logic — you can add custom Spans. Here's a pattern commonly used in production.
```typescript
// order.service.ts
import { Injectable } from '@nestjs/common';
import { trace, SpanStatusCode } from '@opentelemetry/api';

@Injectable()
export class OrderService {
  async processOrder(orderId: string) {
    const tracer = trace.getTracer('order-service');
    return tracer.startActiveSpan('processOrder', async (span) => {
      span.setAttribute('order.id', orderId);
      try {
        const result = await this.orderRepository.process(orderId);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (err) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end(); // must be called in finally so the Span always closes
      }
    });
  }
}
```

What happens if you forget `span.end()`? The Span never closes, so it is never handed to the exporter and will not appear in the Trace viewer at all, and the SDK keeps holding a reference to the open Span, which can leak memory in the app process. It's worth developing the habit of always placing it in a `finally` block.
Continuing Traces Across Asynchronous RabbitMQ Boundaries
HTTP calls propagate context automatically via headers, but what about message queues? Fortunately, getNodeAutoInstrumentations() includes amqplib instrumentation. When publishing a message, it automatically injects the current Trace context into the message headers, and upon consumption it automatically extracts it — so Traces remain unbroken even across asynchronous boundaries.
I still remember being genuinely impressed the first time I saw the Order Service → RabbitMQ → Notification Service flow connected as a single Trace in Jaeger.
That said, it's too early to be optimistic that "everything works automatically." In cases where you use non-standard response patterns like amqplib's direct-reply-to, or route through an in-house wrapper library rather than amqplib directly, auto-instrumentation may not work. In those situations, you'll need to manually write code to inject and extract context into and from message headers.
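As a minimal sketch of that manual fallback: the real implementation would call `propagation.inject()` on publish and `propagation.extract()` on consume from `@opentelemetry/api`, using the active context. The plain-object helpers below (`injectContext` / `extractContext` are hypothetical names, not library functions) only show the mechanics of carrying the `traceparent` header through AMQP message headers.

```typescript
// Hypothetical sketch of manual context propagation through message headers,
// for wrapper libraries that auto-instrumentation does not cover.
// In production, prefer propagation.inject/extract from '@opentelemetry/api'.

interface AmqpHeaders { [key: string]: string }

// Producer side: copy the current Trace context into the message headers.
function injectContext(headers: AmqpHeaders, traceId: string, spanId: string): AmqpHeaders {
  return { ...headers, traceparent: `00-${traceId}-${spanId}-01` };
}

// Consumer side: read the header back so the consumer Span joins the same Trace.
function extractContext(headers: AmqpHeaders): { traceId: string; spanId: string } | null {
  const tp = headers['traceparent'];
  if (!tp) return null;
  const [, traceId, spanId] = tp.split('-');
  return { traceId, spanId };
}
```

The key point is symmetry: whatever the publisher writes into the headers, the consumer must read back before starting its own Span, or the Trace breaks at the queue boundary.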
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Vendor Neutrality | Instrument once, then swap to any backend — Jaeger, Datadog, New Relic, etc. |
| Auto-Instrumentation | Instruments about 80% of major frameworks — Express, TypeORM, Redis, HTTP clients — without code changes |
| Signal Correlation | Links Traces, Metrics, and Logs via Trace ID, dramatically reducing root cause analysis time |
| CNCF Standard | The second-largest CNCF project after Kubernetes, with a robust community and ecosystem |
Disadvantages and Caveats
The figures below are reference values measured in real service operations. Variance is significant depending on request patterns, instance specs, and sampling ratios, so it's recommended to measure directly in your own staging environment.
| Item | Impact | Mitigation |
|---|---|---|
| CPU Overhead | Can increase by tens of percent over baseline depending on environment | Adjust sampling ratio (1–10%), actively use Batch Processor |
| Memory RSS | A few MB additional | Separate sidecar Collector to reduce app memory burden |
| P99 Latency | May increase slightly under sustained load | Configure async/buffering Exporter settings to prevent app blocking |
| Network Cost | Collector egress costs arise depending on traffic volume | Use Tail Sampling to send only meaningful Traces |
What is the Batch Processor? Instead of exporting Spans one by one immediately, it accumulates them by count or time interval and sends them in batches. This significantly reduces performance overhead by minimizing the number of network calls.
The Most Common Mistakes in Production
- Missing the `tracing.ts` initialization order — If `import './tracing'` is not the first line in `main.ts`, Express will not have been patched by the time NestJS modules load, and HTTP auto-instrumentation will not work.
- Using dynamic values for Span names — When unique Span names like `processOrder-${orderId}` exceed 1,000 entries, the index performance of backends like Jaeger and Tempo degrades sharply. It's recommended to keep Span names fixed (e.g., `processOrder`) and put dynamic values in attributes.
- The app freezing when the Exporter fails in production — If the Collector goes down briefly, the Exporter queue can fill up and block app responses. It's a good idea to familiarize yourself in advance with async/buffering options such as `maxQueueSize` and `scheduledDelayMillis`.
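To show where those knobs live, here is a sketch of a batch-processor configuration object. The numbers are illustrative assumptions to tune against your own load, not recommendations; in the real SDK this object is passed as the second argument of `new BatchSpanProcessor(exporter, config)` from `@opentelemetry/sdk-trace-base`.

```typescript
// Illustrative BatchSpanProcessor tuning — values are assumptions, not advice.
const batchProcessorConfig = {
  maxQueueSize: 2048,         // Spans beyond this are dropped instead of blocking the app
  maxExportBatchSize: 512,    // Spans sent per network call
  scheduledDelayMillis: 5000, // how often buffered Spans are flushed
  exportTimeoutMillis: 30000, // give up on a stuck export instead of backing up the queue
};
console.log(batchProcessorConfig);
```

Dropping Spans under pressure is the deliberate trade-off here: losing some telemetry is better than letting a down Collector slow your request path.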
Closing Thoughts
OpenTelemetry is the most practical choice in a microservices environment for replacing guesswork with data when answering "where did it break?" Investing a few lines of code today can noticeably reduce your entire team's on-call burden.
Three steps you can take right now:
- Install packages: Run the `pnpm add` command above and paste `tracing.ts` into your project — the basic setup is complete.
- Run Jaeger locally: Spin up Jaeger with `docker run -d --name jaeger -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one:latest` and see Traces accumulating for yourself at `http://localhost:16686`.
- Add custom Spans: Attach `tracer.startActiveSpan()` one by one to the core business logic that auto-instrumentation misses, and you can incrementally improve visibility.
Next article: Tuning OTel Collector sampling strategies to match production traffic patterns — designing Head-based vs. Tail-based policies, cost optimization, and a strategy for 100% retention of error Traces.
References
For conceptual understanding
- OpenTelemetry Official Docs — Traces
- OpenTelemetry Official Docs — Context Propagation
- OpenTelemetry Official Docs — Sampling
For practical configuration
- OpenTelemetry NestJS Complete Guide 2026 | SigNoz
- Kubernetes Advanced Observability with OTel, Jaeger, and Tempo | johal.in
For advanced tuning