Building LLM Tracing with OpenTelemetry: Tracking RAG and Multi-Agent Flows with the gen_ai Standard

A service connected to GPT-4 suddenly starts giving nonsensical answers. You dig through the logs and find no errors. HTTP response codes are all 200. But users are angry. If you've built LLM-powered systems, you've probably been here. Traditional APM can tell you "the LLM call took 1.2 seconds," but nothing about what actually matters: what prompt went in and what answer came out. And because LLM systems are non-deterministic (the same input can produce a different result every time), reproducing an incident requires capturing the exact input and the model parameters, such as temperature, at the moment it happened.

The solution that emerged for this problem is OpenTelemetry (OTel)'s GenAI Semantic Conventions. The approach layers an LLM-specific semantic layer on top of OTel, the CNCF standard observability framework, to collect traces, metrics, and logs in a structured way. Major vendors like Datadog, Grafana, and AWS have already announced native support, making it the de facto industry standard.

By the end of this article, you'll be able to add tracing to an existing OpenAI project and visualize the full flow of a RAG pipeline in Jaeger. Examples are written in Python; if you're using JavaScript/TypeScript, the same patterns apply, and the official OpenLLMetry repository is the place to look for the equivalent setup.

Core Concepts

Why OTel Takes a Different Approach for LLMs

Tracing a typical service is deterministic. If a DB query is slow, there's a slow query log. Function execution time is roughly consistent given the same input. LLMs are different. Send the same prompt ten times with temperature=0.8 and you get ten different answers. On top of that, cost varies with input token count, and models can change mid-deployment.

Because of these unique characteristics, OTel defines a separate gen_ai.* namespace.

Attribute	Description	Example Value
`gen_ai.system`	AI provider identifier	`openai`, `anthropic`
`gen_ai.request.model`	Requested model	`gpt-4o`
`gen_ai.response.model`	Model that actually responded	`gpt-4o-2024-08-06`
`gen_ai.usage.input_tokens`	Number of input tokens	`1234`
`gen_ai.usage.output_tokens`	Number of output tokens	`256`
`gen_ai.response.finish_reasons`	Reason generation stopped	`["stop"]`, `["length"]`
`gen_ai.operation.name`	Operation type	`chat`, `embeddings`

When I first saw this attribute list, having both gen_ai.request.model and gen_ai.response.model seemed odd. "Doesn't the model you request just respond?" But in practice, these two values diverge more often than you'd expect — when an API provider silently auto-upgrades the model version. If you requested gpt-4o but the response shows gpt-4o-2024-08-06, you can track exactly when a different version started responding. It seems minor, but it's surprisingly useful for catching cases where answer quality subtly shifts after a model upgrade.
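The comparison described above can be sketched in a few lines, assuming the Span attributes have already been exported as plain dictionaries. The helper names `model_drift` and `first_seen` are hypothetical, and treating a dated snapshot as "different" from its alias is a choice made for this sketch, not something the conventions prescribe.

```python
from typing import Mapping, Optional


def model_drift(attrs: Mapping[str, object]) -> Optional[str]:
    """Return the responding model when it differs from the requested one, else None.

    Assumption for this sketch: a dated snapshot such as "gpt-4o-2024-08-06"
    counts as different from the alias "gpt-4o", so alias resolution is visible.
    """
    requested = attrs.get("gen_ai.request.model")
    responded = attrs.get("gen_ai.response.model")
    if not requested or not responded:
        return None  # one of the attributes is missing: nothing to compare
    return str(responded) if responded != requested else None


def first_seen(spans: list) -> dict:
    """Map each responding model to the index of the first Span that reported it."""
    seen = {}
    for index, attrs in enumerate(spans):
        model = attrs.get("gen_ai.response.model")
        if model is not None and str(model) not in seen:
            seen[str(model)] = index
    return seen
```

Run over Spans sorted by start time, `first_seen` gives the point at which a new snapshot started answering, which is the moment to line up against any shift in answer quality.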

How Trace Data Flows

A single LLM call doesn't exist in isolation. It's almost always part of a pipeline: HTTP request → document retrieval → LLM call → response formatting. OTel wraps all of this into a single Trace and represents each step as a Span.

text

[User Request]
    └── Trace (Root Span: HTTP Request)
          ├── Span: RAG Document Retrieval (vector DB query)
          ├── Span: Context Preparation
          ├── Span: LLM Call (includes gen_ai.* attributes)
          │     ├── Event: gen_ai.content.prompt (opt-in)
          │     └── Event: gen_ai.content.completion (opt-in)
          └── Span: Response Formatting

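Privacy Design Principle: The GenAI conventions treat prompt and response content as opt-in, so it is not collected by default. In the official OpenTelemetry Python instrumentation it must be explicitly enabled with the OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true environment variable. This opt-in design prevents PII or medical information from flowing directly into the tracing backend. Defaults differ between instrumentation packages (OpenLLMetry, for example, controls content capture with its own TRACELOOP_TRACE_CONTENT setting), so check the behavior of the package you actually install.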

One recent shift worth noting: older approaches often embedded the full prompt text directly as a Span attribute. But as prompts grew to several KB in size, tracing backend performance visibly degraded. The current spec has evolved toward separating this into Log-based Events. Span attributes carry only metadata; actual content is sent separately as attached Events.

The Role of Each Span Type

Span types are subdivided to reflect the complexity of LLM systems. As multi-agent systems became more common, the need arose to distinguish agent decision-making steps as a separate type — the gen_ai.agent.* conventions were split into their own spec as part of this trend. MCP (Model Context Protocol) tool call tracing is also being incorporated here.

Span Type	Target Operation
LLM inference	Inference calls like `chat`, `completion`
Embeddings	Text vectorization
Retrieval	Vector DB, document search
Tool execution	External tools invoked by agents
Agent steps	`gen_ai.agent.*` — agent decision-making steps

Practical Application

Example 1: End-to-End RAG Pipeline Tracing

When I first built a RAG pipeline, the most frustrating thing was not being able to answer "retrieval worked fine, so why is the answer wrong?" Logging retrieval and LLM calls separately still left the two stages disconnected, making it nearly impossible to establish causality. Once you add OTel, the retrieval stage and LLM call stage are connected within the same trace, and you can see the full picture at a glance.

Let's start with SDK initialization. TracerProvider is the entry point for the tracing pipeline, BatchSpanProcessor batches Spans before sending to reduce network overhead, and OTLPSpanExporter is the component that actually sends Spans to the OTel Collector.

python

import openai  # used below through the module-level client: openai.chat.completions.create(...)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
 
# TracerProvider: entry point for the tracing pipeline
# BatchSpanProcessor: batches Spans before sending (reduces network overhead)
# OTLPSpanExporter: delivers Spans to the OTel Collector
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
 
# Enable automatic injection of gen_ai.* attributes into the OpenAI SDK
OpenAIInstrumentor().instrument()
 
tracer = trace.get_tracer("rag-service", "1.0.0")
 
# Minimal signature example — actual implementation varies by vector DB
def vector_db_search(query: str, k: int) -> list[dict]:
    """Embeds the query and returns k similar documents from the vector DB"""
    ...
 
def build_prompt(query: str, docs: list[dict]) -> str:
    """Combines retrieved documents as context and returns the prompt string"""
    ...
 
def answer_question(query: str) -> str:
    with tracer.start_as_current_span("rag.pipeline") as span:
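        # Caution: this stores the raw query on the Span regardless of the content-capture setting;
        # mask or drop it if queries can contain PII
        span.set_attribute("user.query", query)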
 
        # Retrieval step — must be manually spanned since auto-instrumentation doesn't cover it
        with tracer.start_as_current_span("rag.retrieval") as ret_span:
            docs = vector_db_search(query, k=5)
            ret_span.set_attribute("rag.retrieved_docs", len(docs))
            ret_span.set_attribute("rag.query_vector_dim", 1536)
 
        # LLM call — OpenAIInstrumentor automatically creates a child Span inside this block
        # rag.llm_call (manual) → openai.chat (automatic, with gen_ai.* attributes) forms a parent-child structure
        with tracer.start_as_current_span("rag.llm_call"):
            response = openai.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": build_prompt(query, docs)}],
                temperature=0.3,
            )
 
        return response.choices[0].message.content

Code Point	Role
`OpenAIInstrumentor().instrument()`	Patches the OpenAI SDK — automatically adds `gen_ai.*` attributes to all subsequent `chat.completions.create()` calls
`tracer.start_as_current_span("rag.retrieval")`	Manually creates a Span for the retrieval step, which auto-instrumentation doesn't cover
`ret_span.set_attribute("rag.retrieved_docs", ...)`	Records the number of retrieved results as a Span attribute — can later be cross-analyzed against answer quality
`rag.llm_call` (manual) + OpenAI Span (automatic)	Intentional parent-child nesting. `rag.llm_call` carries business context, while the auto-instrumented Span inside it records the actual API call in detail

With this structure, a trace UI like Jaeger shows "5 documents retrieved → LLM call → 1,847 tokens consumed → response generated" all on one screen. The causal chain — "sparse retrieval caused the LLM to hallucinate" — is visible in a single trace view.
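Once traces are exported, the same connection between retrieval and generation can be queried in bulk. The sketch below assumes Spans have been flattened into dictionaries with `trace_id` and `attributes` keys (a hypothetical export shape, not a Jaeger API) and flags traces whose retrieval step returned few documents.

```python
from collections import defaultdict


def summarize_traces(spans: list) -> dict:
    """Group flat Span records by trace and pull out retrieval and token figures."""
    traces = defaultdict(lambda: {"retrieved_docs": None, "total_tokens": 0})
    for span in spans:
        summary = traces[span["trace_id"]]
        attrs = span.get("attributes", {})
        if "rag.retrieved_docs" in attrs:
            summary["retrieved_docs"] = attrs["rag.retrieved_docs"]
        summary["total_tokens"] += attrs.get("gen_ai.usage.input_tokens", 0)
        summary["total_tokens"] += attrs.get("gen_ai.usage.output_tokens", 0)
    return dict(traces)


def sparse_retrieval_traces(spans: list, min_docs: int = 2) -> list:
    """Return the IDs of traces whose retrieval step found fewer than min_docs documents."""
    return sorted(
        trace_id
        for trace_id, summary in summarize_traces(spans).items()
        if summary["retrieved_docs"] is not None and summary["retrieved_docs"] < min_docs
    )
```

The flagged trace IDs are candidates to open in the trace UI. A low document count is a lead worth checking against the answer, not proof that retrieval caused a bad response.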

Now let's go one level deeper from the application code and look at how to trace the internal behavior of an agent framework.

Example 2: Tracing Multi-Agent Workflows with OpenLLMetry

When you build agents with LangChain, you inevitably hit a moment where you have no idea "what is this thing doing right now?" This is especially true when it loops dozens of times and exceeds the context window. There's an error message, but figuring out which tool it called and what decision it made just before is genuinely difficult. With OpenLLMetry, the framework's internal operations are automatically decomposed into Spans.

python

from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from openai import OpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import agent, task

# One-line initialization enables automatic instrumentation for LangChain/LlamaIndex/CrewAI
Traceloop.init(
    app_name="research-agent",
    api_endpoint="http://otel-collector:4318",
)

# create_react_agent needs a ReAct-style prompt as its third argument.
# Assumption: the public "hwchase17/react" prompt from LangChain Hub; swap in your own prompt as needed.
react_prompt = hub.pull("hwchase17/react")

@agent(name="research_coordinator")
def run_research(topic: str) -> str:
    # The @agent decorator wraps the entire function as a single Agent Span
    # Internal LangChain LLM calls and tool executions are each automatically recorded as child Spans
    llm = ChatOpenAI(model="gpt-4o")
    tools = [...]  # placeholder: supply your own tools here
    agent_executor = AgentExecutor(
        agent=create_react_agent(llm, tools, react_prompt),
        tools=tools,
    )
    return agent_executor.invoke({"input": f"Research about: {topic}"})["output"]

@task(name="summarize_findings")
def summarize(raw_findings: str) -> str:
    # The @task decorator creates a separate Task Span distinct from the agent loop
    # This makes it easy to distinguish agent steps from post-processing stages in the trace
    client = OpenAI()
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {raw_findings}"}]
    ).choices[0].message.content

OpenLLMetry: An OTel-based automatic LLM instrumentation library developed by Traceloop. Supports over 40 providers including OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI. Acquired by ServiceNow in March 2026 but continues to be maintained as open source (Apache 2.0).

The Spans generated by the @agent and @task decorators serve different roles. @agent wraps the entire agent execution flow in a single Span, and the internal LLM calls and tool executions appear under it as automatically generated child Spans. @task marks independent processing steps that sit outside the agent loop. In a trace UI, this gives you a timeline you can read as "the agent called tool A, it failed, it retried with tool B," along with the attempt on which it finally succeeded.
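The timeline reading described above can also be reproduced offline from exported child Spans. This is a minimal sketch that assumes each tool Span has been reduced to a name, a start time, and a success flag; the `ToolSpan` class and both helper functions are hypothetical and not part of OpenLLMetry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolSpan:
    name: str      # tool name as recorded on the Span
    start_ns: int  # Span start time in nanoseconds
    ok: bool       # True when the Span status is not ERROR


def timeline(spans: list) -> list:
    """Render tool calls in start order, numbering each attempt."""
    ordered = sorted(spans, key=lambda s: s.start_ns)
    return [
        f"attempt {i}: {s.name} {'ok' if s.ok else 'failed'}"
        for i, s in enumerate(ordered, start=1)
    ]


def first_success(spans: list) -> Optional[int]:
    """Return the 1-based attempt number of the first successful tool call, if any."""
    for i, s in enumerate(sorted(spans, key=lambda s: s.start_ns), start=1):
        if s.ok:
            return i
    return None
```

Sorting by start time matters because exporters do not guarantee that Spans arrive in execution order.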

Now let's step away from application code and look at what can be done at the Collector level.

Example 3: Aggregating Token Cost Metrics

Honestly, OTel Collector configuration looks intimidating at first glance. But once you've run production without this pipeline and found yourself asking "why is this month's OpenAI bill so high?", you understand why it's worth the investment. Once you can see how many tokens each endpoint and feature consumes, you'll almost certainly find a cost hotspot you didn't expect.

For application developers: This section covers configuration from the perspective of an infrastructure/DevOps person who operates the OTel Collector. If your team has an infrastructure owner, share this section and set it up together.

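gen_ai.usage.input_tokens and gen_ai.usage.output_tokens are recorded as Span attributes. The spanmetrics connector is the bridge component that turns Spans into Prometheus-friendly metrics, but it is worth being precise about what it produces: call counts and duration histograms, labeled with the Span attributes you list as dimensions. It does not sum numeric attribute values, so token totals have to come from elsewhere, typically the gen_ai.client.token.usage metric that GenAI instrumentation can emit directly, or a connector that sums attribute values (such as the contrib sum connector).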

yaml

# otel-collector-config.yaml
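receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      # HTTP receiver for clients that send OTLP over HTTP, such as the Traceloop SDK in Example 2 (port 4318)
      http:
        endpoint: 0.0.0.0:4318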
 
processors:
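  # PII masking: required when prompt capture is enabled
  # Adjust the attribute key to wherever your instrumentation stores prompt text;
  # content recorded as Span events needs the spanevent context instead of span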
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - replace_pattern(attributes["gen_ai.content.prompt"], "\\b\\d{6}-\\d{7}\\b", "***")
 
  # Tail-based sampling: keep every trace that contains an error Span, plus 10% of all traces
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
 
# spanmetrics: connector that derives call-count and duration metrics from Spans
# the dimensions below become metric labels, so the metrics can be sliced by provider, model, and Span name
connectors:
  spanmetrics:
    namespace: gen_ai
    dimensions:
      - name: gen_ai.system
      - name: gen_ai.request.model
      - name: span.name
    metrics_flush_interval: 15s
 
exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
  otlp/jaeger:
    endpoint: "http://jaeger:4317"
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform, tail_sampling]
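      # also deliver Spans to the spanmetrics connector, which is required for metric generation
      # note: the connector only sees Spans that survive tail_sampling, so its metrics describe sampled traffic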
      exporters: [otlp/jaeger, spanmetrics]
    metrics:
      # also collect metrics generated by spanmetrics
      receivers: [otlp, spanmetrics]
      exporters: [prometheusremotewrite]

The key insight in this pipeline is that the Collector handles PII masking and sampling as a middle layer. You don't need to put this logic in your application code, and if you later switch backends from Jaeger to Datadog, the application code doesn't need to change. With the spanmetrics connector in place, you can slice call counts and latency in Grafana by span.name or gen_ai.request.model; combined with token usage metrics, that shows cost by feature. One caveat: in this layout the connector sits behind tail_sampling, so its metrics reflect only the sampled traces. For exact totals, generate metrics in a pipeline that runs before sampling.
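Turning token counts into money is simple arithmetic once the counts are available per feature. The sketch below assumes each record carries a feature name (for example the parent Span name) and the `gen_ai.*` attributes of one LLM call. The price table is entirely hypothetical and must be replaced with your provider's current price sheet.

```python
from collections import defaultdict

# Hypothetical USD prices per one million tokens, for illustration only.
PRICE_PER_MILLION = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def span_cost(attrs: dict) -> float:
    """Estimate the USD cost of one LLM call from its gen_ai.* attributes."""
    price = PRICE_PER_MILLION.get(attrs.get("gen_ai.request.model", ""))
    if price is None:
        return 0.0  # unknown model: leave it unpriced rather than guess
    input_tokens = attrs.get("gen_ai.usage.input_tokens", 0)
    output_tokens = attrs.get("gen_ai.usage.output_tokens", 0)
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_by_feature(records: list) -> dict:
    """Sum the estimated cost per feature name."""
    totals = defaultdict(float)
    for record in records:
        totals[record["feature"]] += span_cost(record["attributes"])
    return dict(totals)
```

Input and output tokens are priced separately because providers usually charge different rates for each, which is also why the conventions record them as two attributes.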

Pros and Cons Analysis

Here is a summary of the constraints encountered and how to address them when applying this to real projects.

Advantages

Item	Details
Vendor independence	Implement `gen_ai.*` standard attributes once and send to any backend — Datadog, Grafana, Jaeger, etc.
Existing APM integration	Existing HTTP/DB Spans and LLM Spans are connected within the same Trace — enabling full service flow visibility
Auto-instrumentation	Auto-instrumentation packages available for 40+ providers including OpenAI, Anthropic, LangChain, and LlamaIndex
Cost tracking	Quantify actual cost per model and per feature using token usage metrics
Standard ecosystem	CNCF project with strong community backing and long-term sustainability

Disadvantages and Caveats

Item	Details	Mitigation
Experimental instability	Most GenAI Semconv attributes are still experimental — attribute names can change in minor versions	Pin convention versions in CI, automate migration scripts
Privacy and compliance	Enabling prompt capture risks leaking PII or trade secrets	Build data masking pipelines, access controls, and retention policies first
Storage and performance overhead	Full prompt storage can make Span sizes several KB to tens of KB	Combine head-based and tail-based sampling, use content hashing
Custom instrumentation unavoidable	Auto-instrumentation only covers LLM SDK call level — RAG chunking, prompt rendering, etc. require manual instrumentation	Add instrumentation for business logic steps with `tracer.start_as_current_span()`
Ecosystem fragmentation	Langfuse, Helicone, LangSmith, etc. still maintain their own proprietary formats	Apply OTel standard first, then use per-tool OTLP adapters

Head-based Sampling: Decides whether to collect a trace at the moment it starts. Fast, but traces that are already dropped cannot be recovered even if an error occurs later.
Tail-based Sampling: Decides whether to collect a trace after the entire trace is complete, based on the outcome. Guarantees error traces are captured, but has higher memory usage.
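The difference between the two strategies comes down to what information is available when the decision is made. The following sketch illustrates that with two hypothetical functions; the hashing scheme is an assumption for this example and is not how the Collector's samplers are implemented.

```python
import hashlib


def head_sample(trace_id: str, percentage: float) -> bool:
    """Head-based: decide from the trace ID alone, before any Span has finished."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in [0, 9999]
    return bucket < percentage * 100


def tail_sample(trace_id: str, span_statuses: list, percentage: float) -> bool:
    """Tail-based: inspect the finished trace first, then fall back to a percentage."""
    if "ERROR" in span_statuses:
        return True  # always keep traces that contain an error
    return head_sample(trace_id, percentage)
```

`head_sample` never sees `span_statuses`, so a trace it drops stays dropped even if a later Span fails. `tail_sample` can rescue that trace, at the cost of buffering every Span until the trace is complete.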

The Most Common Mistakes in Practice

Turning on prompt capture before anything else: Testing with OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true in a development environment and then deploying to production as-is means customer data accumulates in the tracing backend. The masking pipeline comes before enabling capture.
Trusting auto-instrumentation alone and skipping custom Spans: OpenAIInstrumentor captures LLM calls, but not the document retrieval logic or prompt assembly that happens before them. To know "which step took time" in a RAG pipeline, you need manual instrumentation for each business logic stage.
Not pinning convention versions: If you use opentelemetry-instrumentation-openai without version pinning and it auto-upgrades to a release where attribute names changed, Grafana dashboard queries will all break at once. Because names can change between minor versions, it's best to pin an exact version in requirements.txt and upgrade deliberately.
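For the first mistake, it helps to prototype the masking rule before any content capture is switched on. This is a minimal sketch in plain Python that mirrors the shape of the Collector rule in Example 3 (six digits, a hyphen, seven digits). The function names are hypothetical, and the fingerprint is just one way to keep a prompt comparable across traces without storing its text.

```python
import hashlib
import re

# Same shape as the replace_pattern rule in Example 3: six digits, a hyphen, seven digits.
ID_PATTERN = re.compile(r"\b\d{6}-\d{7}\b")


def mask_pii(text: str) -> str:
    """Replace every ID-shaped substring with a fixed placeholder."""
    return ID_PATTERN.sub("***", text)


def fingerprint(text: str) -> dict:
    """Describe a prompt by size and hash only, so the text itself is never stored."""
    data = text.encode("utf-8")
    return {"sha256": hashlib.sha256(data).hexdigest(), "bytes": len(data)}
```

Running real prompts from a staging environment through `mask_pii` is a cheap way to find patterns the rule misses before the same regular expression goes into the Collector config.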

Closing Thoughts

In the beginning, things worked fine without observability. When you call things directly in a development environment, everything seems to work. But once production traffic hits, you repeatedly find yourself with no idea where something went wrong. User complaints come in and all you can say is "we can't reproduce it." That's when you realize observability for LLM systems isn't optional. GenAI Semantic Conventions being experimental might feel like a risk, but with major vendors already announcing native support, starting now is anything but premature.

Three steps you can take right now:

1. Add auto-instrumentation to an existing OpenAI project (time required: ~5 minutes; prerequisites: OpenAI API key, Python environment)
   Install with pip install opentelemetry-instrumentation-openai and add OpenAIInstrumentor().instrument() at the app startup point. Even with just a console exporter attached first, you can immediately see gen_ai.* attributes printed in the terminal.
2. Run Jaeger + OTel Collector locally and visualize traces (time required: ~20 minutes; prerequisites: Docker installed)
   Bring up jaegertracing/all-in-one and the OTel Collector with docker compose up and send traces to the OTLP endpoint (http://localhost:4317). If you have a RAG pipeline, you can see how a trace connecting the retrieval step through the LLM call looks when visualized.
3. Analyze per-feature costs using token usage metrics (time required: ~30 minutes; prerequisites: Step 2 complete, Prometheus/Grafana environment)
   Attach the spanmetrics connector from Example 3 to get call counts and latency by span.name, and pair them with token usage metrics to estimate cost per feature. Slicing daily token consumption by feature in Grafana will reveal endpoints consuming more cost than you expected.



