DEV BAK - TECH BLOG

Building LLM Tracing with OpenTelemetry: Tracking RAG and Multi-Agent Flows with the gen_ai Standard

A service connected to GPT-4 suddenly starts giving nonsensical answers. You dig through the logs and find no errors. HTTP response codes are all 200. But users are angry. If you've built LLM-powered systems, you've probably been here. Traditional APM can tell you "the LLM call took 1.2 seconds," but nothing about what actually matters: what prompt went in and what answer came out. Because LLM systems are non-deterministic (the same input can produce a different result every time), reproducing an incident requires capturing the exact prompt, the model version, and the sampling parameters such as temperature at the moment it happened.

The solution that emerged for this problem is OpenTelemetry (OTel)'s GenAI Semantic Conventions. They add an LLM-specific semantic layer on top of OTel, the CNCF observability framework, so that traces, metrics, and logs are collected in a structured way. Major vendors like Datadog, Grafana, and AWS have already announced native support, which is turning it into the de facto industry standard.

By the end of this article, you'll be able to add tracing to an existing OpenAI project and visualize the full flow of a RAG pipeline in Jaeger. Examples are written in Python; if you're using JavaScript/TypeScript, the same patterns apply, and the OpenLLMetry official repository is the place to start.


Core Concepts

Why OTel Takes a Different Approach for LLMs

Tracing a typical service is deterministic. If a DB query is slow, there's a slow query log. Function execution time is roughly consistent given the same input. LLMs are different. Send the same prompt ten times with temperature=0.8 and you get ten different answers. On top of that, cost varies with input token count, and models can change mid-deployment.

Because of these unique characteristics, OTel defines a separate gen_ai.* namespace.

Attribute | Description | Example Value
gen_ai.system | AI provider identifier | openai, anthropic
gen_ai.request.model | Requested model | gpt-4o
gen_ai.response.model | Model that actually responded | gpt-4o-2024-08-06
gen_ai.usage.input_tokens | Number of input tokens | 1234
gen_ai.usage.output_tokens | Number of output tokens | 256
gen_ai.response.finish_reasons | Reason generation stopped | ["stop"], ["length"]
gen_ai.operation.name | Operation type | chat, embeddings

When I first saw this attribute list, having both gen_ai.request.model and gen_ai.response.model seemed odd. "Doesn't the model you request just respond?" But in practice, these two values diverge more often than you'd expect. A request usually names an alias such as gpt-4o, while the response reports the concrete snapshot that served it, such as gpt-4o-2024-08-06. When the provider silently repoints the alias to a newer snapshot, gen_ai.request.model stays the same and gen_ai.response.model changes, so you can track exactly when a different version started responding. It seems minor, but it's surprisingly useful for catching cases where answer quality subtly shifts after a model upgrade.
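The request/response comparison described above can be checked mechanically. The following is a minimal sketch, assuming span attributes have already been exported as plain dicts; the function name find_model_drift and the sample records are illustrative, not part of any OTel library.

```python
from collections import defaultdict


def find_model_drift(spans: list[dict]) -> dict[str, list[str]]:
    """Map each requested model to the distinct response models seen, in first-seen order.

    Only requested models that resolved to more than one response model are returned.
    """
    seen: dict[str, list[str]] = defaultdict(list)
    for attrs in spans:
        requested = attrs.get("gen_ai.request.model")
        responded = attrs.get("gen_ai.response.model")
        if requested is None or responded is None:
            continue  # span without both attributes, e.g. a retrieval span
        if responded not in seen[requested]:
            seen[requested].append(responded)
    return {req: resp for req, resp in seen.items() if len(resp) > 1}


# Illustrative records: the gpt-4o alias resolved to two different snapshots
spans = [
    {"gen_ai.request.model": "gpt-4o", "gen_ai.response.model": "gpt-4o-2024-05-13"},
    {"gen_ai.request.model": "gpt-4o", "gen_ai.response.model": "gpt-4o-2024-08-06"},
    {"gen_ai.request.model": "gpt-4o-mini", "gen_ai.response.model": "gpt-4o-mini-2024-07-18"},
]
print(find_model_drift(spans))
# {'gpt-4o': ['gpt-4o-2024-05-13', 'gpt-4o-2024-08-06']}
```

Running a check like this over a day of spans is one way to notice an alias change before it shows up as a quality complaint.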

How Trace Data Flows

A single LLM call doesn't exist in isolation. It's almost always part of a pipeline: HTTP request → document retrieval → LLM call → response formatting. OTel wraps all of this into a single Trace and represents each step as a Span.

text
[User Request]
    └── Trace (Root Span: HTTP Request)
          ├── Span: RAG Document Retrieval (vector DB query)
          ├── Span: Context Preparation
          ├── Span: LLM Call (includes gen_ai.* attributes)
          │     ├── Event: gen_ai.content.prompt (opt-in)
          │     └── Event: gen_ai.content.completion (opt-in)
          └── Span: Response Formatting

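Privacy Design Principle: In the official OpenTelemetry GenAI instrumentations, prompt and response content is not collected by default. It must be explicitly enabled with the OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true environment variable. This opt-in design prevents PII or medical information from flowing directly into the tracing backend. Third-party instrumentations may behave differently: OpenLLMetry's packages record content unless TRACELOOP_TRACE_CONTENT=false is set, so check the default of the package you actually install.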

One recent shift worth noting: older approaches often embedded the full prompt text directly as a Span attribute. But as prompts grew to several KB in size, tracing backend performance visibly degraded. The current spec has evolved toward separating this into Log-based Events. Span attributes carry only metadata; actual content is sent separately as attached Events.
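To make the metadata/content split concrete, here is a small sketch in plain Python. It assumes a hypothetical record dict produced by your own LLM wrapper; split_llm_record and the capture_content flag are illustrative names, and the event names are the ones shown in the trace diagram above.

```python
METADATA_KEYS = (
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
)


def split_llm_record(record: dict, capture_content: bool = False) -> tuple[dict, list[dict]]:
    """Return (span_attributes, events).

    Metadata always goes to span attributes. Prompt and completion text is
    emitted as separate events, and only when capture is explicitly enabled.
    """
    attributes = {key: record[key] for key in METADATA_KEYS if key in record}
    events = []
    if capture_content:
        if "prompt" in record:
            events.append({"name": "gen_ai.content.prompt", "body": record["prompt"]})
        if "completion" in record:
            events.append({"name": "gen_ai.content.completion", "body": record["completion"]})
    return attributes, events


record = {
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 1234,
    "gen_ai.usage.output_tokens": 256,
    "prompt": "Summarize the attached contract.",
    "completion": "The contract covers...",
}
attrs, events = split_llm_record(record)
print(len(attrs), len(events))  # 3 0
```

With capture_content=True the same call returns two events, while the span attributes stay the same size regardless of how long the prompt is.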

The Role of Each Span Type

Span types are subdivided to reflect the complexity of LLM systems. As multi-agent systems became more common, the need arose to distinguish agent decision-making steps as a separate type — the gen_ai.agent.* conventions were split into their own spec as part of this trend. MCP (Model Context Protocol) tool call tracing is also being incorporated here.

Span Type | Target Operation
LLM inference | Inference calls like chat, completion
Embeddings | Text vectorization
Retrieval | Vector DB, document search
Tool execution | External tools invoked by agents
Agent steps | gen_ai.agent.*: agent decision-making steps

Practical Application

Example 1: End-to-End RAG Pipeline Tracing

When I first built a RAG pipeline, the most frustrating thing was not being able to answer "retrieval worked fine, so why is the answer wrong?" Logging retrieval and LLM calls separately still left the two stages disconnected, making it nearly impossible to establish causality. Once you add OTel, the retrieval stage and LLM call stage are connected within the same trace, and you can see the full picture at a glance.

Let's start with SDK initialization. TracerProvider is the entry point for the tracing pipeline, BatchSpanProcessor batches Spans before sending to reduce network overhead, and OTLPSpanExporter is the component that actually sends Spans to the OTel Collector.

python
import openai  # answer_question() below calls openai.chat.completions.create()
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
 
# TracerProvider: entry point for the tracing pipeline
# BatchSpanProcessor: batches Spans before sending (reduces network overhead)
# OTLPSpanExporter: delivers Spans to the OTel Collector
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
 
# Enable automatic injection of gen_ai.* attributes into the OpenAI SDK
OpenAIInstrumentor().instrument()
 
tracer = trace.get_tracer("rag-service", "1.0.0")
 
# Minimal signature example — actual implementation varies by vector DB
def vector_db_search(query: str, k: int) -> list[dict]:
    """Embeds the query and returns k similar documents from the vector DB"""
    ...
 
def build_prompt(query: str, docs: list[dict]) -> str:
    """Combines retrieved documents as context and returns the prompt string"""
    ...
 
def answer_question(query: str) -> str:
    with tracer.start_as_current_span("rag.pipeline") as span:
        span.set_attribute("user.query", query)
 
        # Retrieval step — must be manually spanned since auto-instrumentation doesn't cover it
        with tracer.start_as_current_span("rag.retrieval") as ret_span:
            docs = vector_db_search(query, k=5)
            ret_span.set_attribute("rag.retrieved_docs", len(docs))
            ret_span.set_attribute("rag.query_vector_dim", 1536)
 
        # LLM call — OpenAIInstrumentor automatically creates a child Span inside this block
        # rag.llm_call (manual) → openai.chat (automatic, with gen_ai.* attributes) forms a parent-child structure
        with tracer.start_as_current_span("rag.llm_call"):
            response = openai.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": build_prompt(query, docs)}],
                temperature=0.3,
            )
 
        return response.choices[0].message.content

Code Point | Role
OpenAIInstrumentor().instrument() | Patches the OpenAI SDK so that gen_ai.* attributes are automatically added to all subsequent chat.completions.create() calls
tracer.start_as_current_span("rag.retrieval") | Manually creates a Span for the retrieval step, which auto-instrumentation doesn't cover
ret_span.set_attribute("rag.retrieved_docs", ...) | Records the number of retrieved results as a Span attribute, which can later be cross-analyzed against answer quality
rag.llm_call (manual) + OpenAI Span (automatic) | Intentional parent-child nesting. rag.llm_call carries business context, while the auto-instrumented Span inside it records the actual API call in detail

With this structure, a trace UI like Jaeger shows "5 documents retrieved → LLM call → 1,847 tokens consumed → response generated" all on one screen. The causal chain — "sparse retrieval caused the LLM to hallucinate" — is visible in a single trace view.

Now let's go one level deeper from the application code and look at how to trace the internal behavior of an agent framework.

Example 2: Tracing Multi-Agent Workflows with OpenLLMetry

When you build agents with LangChain, you inevitably hit a moment where you have no idea "what is this thing doing right now?" This is especially true when it loops dozens of times and exceeds the context window. There's an error message, but figuring out which tool it called and what decision it made just before is genuinely difficult. With OpenLLMetry, the framework's internal operations are automatically decomposed into Spans.

python
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import agent, task

# One-line initialization enables automatic instrumentation for LangChain/LlamaIndex/CrewAI
Traceloop.init(
    app_name="research-agent",
    api_endpoint="http://otel-collector:4318",
)

@agent(name="research_coordinator")
def run_research(topic: str) -> str:
    # The @agent decorator wraps the entire function as a single Agent Span
    # Internal LangChain LLM calls and tool executions are each automatically recorded as child Spans
    llm = ChatOpenAI(model="gpt-4o")
    tools = [...]  # placeholder: supply your own tool list
    # create_react_agent requires a prompt; this pulls the commonly used ReAct template from LangChain Hub
    prompt = hub.pull("hwchase17/react")
    agent_executor = AgentExecutor(
        agent=create_react_agent(llm, tools, prompt),
        tools=tools,
    )
    return agent_executor.invoke({"input": f"Research about: {topic}"})["output"]
 
@task(name="summarize_findings")
def summarize(raw_findings: str) -> str:
    # The @task decorator creates a separate Task Span distinct from the agent loop
    # This makes it easy to distinguish agent steps from post-processing stages in the trace
    from openai import OpenAI
    client = OpenAI()
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {raw_findings}"}]
    ).choices[0].message.content

OpenLLMetry: An OTel-based automatic LLM instrumentation library developed by Traceloop. Supports over 40 providers and frameworks including OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI. Acquired by ServiceNow in March 2026 but continues to be maintained as open source (Apache 2.0).

The Spans generated by @agent and @task decorators serve different roles. @agent wraps the entire agent execution flow as a single root Span and automatically generates child Spans for internal LLM calls and tool executions. @task is used to mark independent processing steps separate from the agent loop. In a trace UI, you can then read a timeline such as "the agent called tool A, it failed, it retried with tool B, and it succeeded on the second attempt."

Now let's step away from application code and look at what can be done at the Collector level.

Example 3: Aggregating Token Cost Metrics

Honestly, OTel Collector configuration looks intimidating at first glance. But once you've had to ask "why is this month's OpenAI bill so high?" while running in production without this pipeline, you understand why it's worth the investment. Once you can see how many tokens each endpoint or feature consumes, you'll inevitably find a cost hotspot you didn't expect.

For application developers: This section covers configuration from the perspective of an infrastructure/DevOps person who operates the OTel Collector. If your team has an infrastructure owner, share this section and set it up together.

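gen_ai.usage.input_tokens and gen_ai.usage.output_tokens are recorded as Span attributes. The spanmetrics connector is the bridge component that turns Spans into metrics: it emits call counts and duration histograms, and it attaches selected Span attributes, such as gen_ai.request.model, as metric dimensions. It does not sum numeric attribute values, so token totals have to come from elsewhere: either the gen_ai.client.token.usage metric that GenAI-aware instrumentations emit when a MeterProvider with an OTLP exporter is configured, or a component that sums attribute values, such as the sum connector in the Collector contrib distribution.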

yaml
# otel-collector-config.yaml
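receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      # OTLP/HTTP: needed by clients that send to port 4318, as Example 2 does
      http:
        endpoint: 0.0.0.0:4318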
 
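processors:
  # PII masking: required when prompt capture is enabled
  transform:
    error_mode: ignore
    trace_statements:
      # The attribute key that holds prompt text differs by instrumentation,
      # so the pattern is applied to every attribute value on Spans and Span Events
      - context: span
        statements:
          - replace_all_patterns(attributes, "value", "\\b\\d{6}-\\d{7}\\b", "***")
      - context: spanevent
        statements:
          - replace_all_patterns(attributes, "value", "\\b\\d{6}-\\d{7}\\b", "***")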
 
  # Tail-based sampling — 100% of error Spans, only 10% of successful ones
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
 
# spanmetrics: connector that derives call-count and duration metrics from Spans
# the Span attributes listed under dimensions become metric labels in Prometheus
connectors:
  spanmetrics:
    namespace: gen_ai
    dimensions:
      # span.name, service.name, span.kind, and status.code are built-in dimensions;
      # listing them again is rejected as a duplicate
      - name: gen_ai.system
      - name: gen_ai.request.model
    metrics_flush_interval: 15s
 
exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true  # plaintext gRPC; assumes Jaeger is reachable without TLS inside the cluster
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform, tail_sampling]
      # also deliver Spans to the spanmetrics connector — required for metric conversion
      exporters: [otlp/jaeger, spanmetrics]
    metrics:
      # also collect metrics generated by spanmetrics
      receivers: [otlp, spanmetrics]
      exporters: [prometheusremotewrite]

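The key insight in this pipeline is that the Collector handles PII masking and sampling as a middle layer. You don't need to put this logic in your application code, and if you later switch backends from Jaeger to Datadog, the application code doesn't need to change. With the spanmetrics connector in place, you can slice call counts and latency in Grafana by span.name or gen_ai.request.model (with the namespace above, the metrics appear under names such as gen_ai_calls_total). Token totals come from the gen_ai.client.token.usage metric arriving through the otlp receiver, which breaks down by model and operation; a per-feature breakdown additionally needs a feature attribute on that metric or a connector that sums the token attributes per span.name. One caveat: spanmetrics sits after tail_sampling in this pipeline, so its counts reflect only the sampled traces. Feed the connector from a separate, unsampled traces pipeline if you need exact totals.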
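A masking pattern is easier to trust once it has been exercised outside the Collector. The sketch below applies the same regular expression from the transform processor in Python; the pattern looks for six digits, a hyphen, and seven digits, and this simple expression should behave the same in Go's regexp engine, which the Collector uses. The helper name mask_prompt and the sample strings are illustrative.

```python
import re

# Same pattern as in the Collector config, without the YAML string escaping
ID_PATTERN = re.compile(r"\b\d{6}-\d{7}\b")


def mask_prompt(text: str) -> str:
    """Replace every 6-digit, hyphen, 7-digit sequence with a fixed placeholder."""
    return ID_PATTERN.sub("***", text)


print(mask_prompt("My ID is 900101-1234567, please update my record."))
# My ID is ***, please update my record.

print(mask_prompt("Order 12345-678 shipped."))
# Order 12345-678 shipped.
```

Keeping a handful of such cases next to the Collector config makes it obvious what the rule catches and, just as importantly, what it leaves alone.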


Pros and Cons Analysis

Here is a summary of the constraints encountered and how to address them when applying this to real projects.

Advantages

Item | Details
Vendor independence | Implement gen_ai.* standard attributes once and send to any backend: Datadog, Grafana, Jaeger, etc.
Existing APM integration | Existing HTTP/DB Spans and LLM Spans are connected within the same Trace, enabling full service flow visibility
Auto-instrumentation | Auto-instrumentation packages available for 40+ providers and frameworks including OpenAI, Anthropic, LangChain, and LlamaIndex
Cost tracking | Quantify actual cost per model and per feature using token usage metrics
Standard ecosystem | CNCF project with strong community backing and long-term sustainability

Disadvantages and Caveats

Item | Details | Mitigation
Experimental instability | Most GenAI Semconv attributes are still experimental; attribute names can change in minor versions | Pin convention versions in CI, automate migration scripts
Privacy and compliance | Enabling prompt capture risks leaking PII or trade secrets | Build data masking pipelines, access controls, and retention policies first
Storage and performance overhead | Full prompt storage can make Span sizes several KB to tens of KB | Combine head-based and tail-based sampling, use content hashing
Custom instrumentation unavoidable | Auto-instrumentation only covers LLM SDK call level; RAG chunking, prompt rendering, etc. require manual instrumentation | Add instrumentation for business logic steps with tracer.start_as_current_span()
Ecosystem fragmentation | Langfuse, Helicone, LangSmith, etc. still maintain their own proprietary formats | Apply OTel standard first, then use per-tool OTLP adapters

Head-based Sampling: Decides whether to collect a trace at the moment it starts. Fast, but traces that are already dropped cannot be recovered even if an error occurs later.
Tail-based Sampling: Decides whether to collect a trace after the entire trace is complete, based on the outcome. Guarantees error traces are captured, but has higher memory usage.
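The tail-based policy used in Example 3 (keep every error trace, keep roughly 10% of the rest) can be sketched as a single decision function. This is an illustration of the idea only, not how the Collector's tail_sampling processor is implemented; keep_trace and its hashing scheme are assumptions made for the example.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, sample_percent: int = 10) -> bool:
    """Decide after the trace is complete whether to keep it.

    Error traces are always kept. Other traces are kept when a stable hash of
    the trace ID falls into the first `sample_percent` of 100 buckets, so the
    same trace ID always gets the same decision.
    """
    if has_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < sample_percent


print(keep_trace("trace-abc", has_error=True))  # True

# Fraction of successful traces kept across many IDs (expected to be near 0.10)
kept = sum(keep_trace(f"trace-{i}", has_error=False) for i in range(10_000))
print(kept / 10_000)
```

The decision needs the has_error flag, which is only known once the whole trace has finished. That is exactly why tail-based sampling has to buffer spans in memory, and why head-based sampling cannot make this guarantee.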

The Most Common Mistakes in Practice

  1. Turning on prompt capture before anything else: Testing with content capture enabled (OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true in the official instrumentations) in a development environment and then deploying to production as-is means customer data accumulates in the tracing backend. The masking pipeline comes before enabling capture.

  2. Trusting auto-instrumentation alone and skipping custom Spans — OpenAIInstrumentor captures LLM calls, but not the document retrieval logic or prompt assembly that happens before them. To know "which step took time" in a RAG pipeline, you need manual instrumentation for each business logic stage.

  3. Not pinning convention versions — If you use opentelemetry-instrumentation-openai without version pinning and it auto-upgrades to a version where attribute names changed, Grafana dashboard queries will all break at once. It's best to pin down to the minor version in requirements.txt.
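The version-pinning advice in the third item is easy to enforce in CI. Below is a minimal sketch that scans requirements text and reports OpenTelemetry packages that are not pinned with == or ~=. The function name unpinned, the prefix filter, and the version numbers in the sample are illustrative assumptions.

```python
import re

# A line counts as pinned when the package name is followed by == or ~= and a version
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\]]+\s*(==|~=)\s*\d+(\.\d+)+")


def unpinned(requirements: str, prefix: str = "opentelemetry-") -> list[str]:
    """Return requirement lines starting with `prefix` that lack an == or ~= pin."""
    bad = []
    for raw in requirements.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line.startswith(prefix):
            continue
        if not PINNED.match(line):
            bad.append(line)
    return bad


# Illustrative requirements.txt content; the version numbers are placeholders
reqs = """
opentelemetry-sdk==1.27.0
opentelemetry-instrumentation-openai
opentelemetry-exporter-otlp>=1.20
openai==1.40.0
"""
print(unpinned(reqs))
# ['opentelemetry-instrumentation-openai', 'opentelemetry-exporter-otlp>=1.20']
```

Failing the build when this list is non-empty keeps an attribute rename from reaching production dashboards through an unreviewed upgrade.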


Closing Thoughts

In the beginning, things worked fine without observability. When you call things directly in a development environment, everything seems to work. But once production traffic hits, you repeatedly find yourself with no idea where something went wrong. User complaints come in and all you can say is "we can't reproduce it." That's when you realize observability for LLM systems isn't optional. GenAI Semantic Conventions being experimental might feel like a risk, but with major vendors already announcing native support, starting now is anything but premature.

Three steps you can take right now:

  1. Add auto-instrumentation to an existing OpenAI project — Time required: ~5 minutes / Prerequisites: OpenAI API key, Python environment
    Install with pip install opentelemetry-instrumentation-openai and add OpenAIInstrumentor().instrument() at the app startup point. Even with just a console exporter attached first, you can immediately see gen_ai.* attributes printed in the terminal.

  2. Run Jaeger + OTel Collector locally and visualize traces — Time required: ~20 minutes / Prerequisites: Docker installed
    Bring up jaegertracing/all-in-one and the OTel Collector with docker compose up and send traces to the OTLP endpoint (http://localhost:4317). If you have a RAG pipeline, you can see how a trace connecting the retrieval step through the LLM call looks when visualized.

  3. Analyze per-feature costs using token usage metrics — Time required: ~30 minutes / Prerequisites: Step 2 complete, Prometheus/Grafana environment
    Attach the spanmetrics connector from Example 3 to get call counts and latency per span.name, and use token usage metrics (gen_ai.client.token.usage, or the token attributes summed per span.name) for consumption. Slicing daily token consumption by feature in Grafana will reveal endpoints consuming more cost than you expected.
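The per-feature cost analysis in step 3 boils down to a small aggregation. The sketch below does it offline over exported span attributes; the prices are placeholders rather than real rates, and cost_by_feature and the sample spans are illustrative.

```python
from collections import defaultdict

# Placeholder prices in USD per 1M tokens; substitute your provider's current rates
PRICES = {
    "gpt-4o": {"input": 2.0, "output": 8.0},
    "gpt-4o-mini": {"input": 0.5, "output": 1.0},
}


def cost_by_feature(spans: list[dict], prices: dict) -> dict[str, float]:
    """Sum estimated cost per span.name from the token usage attributes."""
    totals: dict[str, float] = defaultdict(float)
    for attrs in spans:
        price = prices[attrs["gen_ai.request.model"]]
        cost = (
            attrs["gen_ai.usage.input_tokens"] * price["input"]
            + attrs["gen_ai.usage.output_tokens"] * price["output"]
        ) / 1_000_000
        totals[attrs["span.name"]] += cost
    return dict(totals)


spans = [
    {"span.name": "rag.llm_call", "gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 1_000_000, "gen_ai.usage.output_tokens": 500_000},
    {"span.name": "summarize_findings", "gen_ai.request.model": "gpt-4o-mini",
     "gen_ai.usage.input_tokens": 2_000_000, "gen_ai.usage.output_tokens": 0},
    {"span.name": "rag.llm_call", "gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 0, "gen_ai.usage.output_tokens": 500_000},
]
print(cost_by_feature(spans, PRICES))
# {'rag.llm_call': 10.0, 'summarize_findings': 1.0}
```

The same grouping is what a Grafana query performs over the metrics; doing it once by hand on a sample of spans is a quick way to sanity-check the dashboard numbers.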


References

  • Inside the LLM Call: GenAI Observability with OpenTelemetry | OpenTelemetry Official Blog
  • AI Agent Observability - Evolving Standards and Best Practices | OpenTelemetry
  • Semantic conventions for generative AI systems | OpenTelemetry Official Docs
  • Semantic conventions for generative client AI spans | OpenTelemetry
  • Semantic Conventions for GenAI agent and framework spans | OpenTelemetry
  • How OpenTelemetry Traces LLM Calls, Agent Reasoning, and MCP Tools | Greptime
  • OpenTelemetry Standardizes LLM Tracing: Implementation Guide | earezki.com
  • OpenTelemetry for LLMs: Complete SRE Guide for 2026 | OpenObserve
  • GitHub - traceloop/openllmetry
  • OpenTelemetry (OTel) for LLM Observability | Langfuse
  • Datadog LLM Observability natively supports OpenTelemetry GenAI Semantic Conventions | Datadog
  • How to Implement RAG Pipeline Tracing with OpenTelemetry | OneUptime
  • OpenTelemetry GenAI Semantic Conventions | MLflow AI Platform
  • The AI Engineer's Guide to LLM Observability with OpenTelemetry | Agenta