Building LLM Tracing with OpenTelemetry: Tracking RAG and Multi-Agent Flows with the gen_ai Standard
A service connected to GPT-4 suddenly starts giving nonsensical answers. You dig through the logs and find no errors. HTTP response codes are all 200. But users are angry. If you've built LLM-powered systems, you've probably been here. Traditional APM can tell you "the LLM call took 1.2 seconds," but nothing about what actually matters — what prompt went in and what answer came out. Because LLM systems are non-deterministic — the same input can produce a different result every time — reproducing an incident requires capturing the exact input, model parameters, and temperature values at the moment it happened.
The solution that emerged for this problem is OpenTelemetry (OTel)'s GenAI Semantic Conventions. The approach adds an LLM-specific semantic layer on top of OTel, the CNCF observability framework, so that traces, metrics, and logs are collected in a structured way. Major vendors like Datadog, Grafana, and AWS have already announced native support, which is turning it into the de facto industry standard.
By the end of this article, you'll be able to add tracing to an existing OpenAI project and visualize the full flow of a RAG pipeline in Jaeger. Examples are written in Python; if you're using JavaScript/TypeScript, the same patterns apply, and the OpenLLMetry official repository is the place to look.
Core Concepts
Why OTel Takes a Different Approach for LLMs
A typical service behaves deterministically, which makes it straightforward to trace. If a DB query is slow, there's a slow query log. Function execution time is roughly consistent given the same input. LLMs are different. Send the same prompt ten times with temperature=0.8 and you get ten different answers. On top of that, cost varies with token count, and models can change mid-deployment.
Because of these unique characteristics, OTel defines a separate gen_ai.* namespace.
| Attribute | Description | Example Value |
|---|---|---|
| gen_ai.system | AI provider identifier | openai, anthropic |
| gen_ai.request.model | Requested model | gpt-4o |
| gen_ai.response.model | Model that actually responded | gpt-4o-2024-08-06 |
| gen_ai.usage.input_tokens | Number of input tokens | 1234 |
| gen_ai.usage.output_tokens | Number of output tokens | 256 |
| gen_ai.response.finish_reasons | Reason generation stopped | ["stop"], ["length"] |
| gen_ai.operation.name | Operation type | chat, embeddings |
When I first saw this attribute list, having both gen_ai.request.model and gen_ai.response.model seemed odd. "Doesn't the model you request just respond?" But in practice, these two values diverge more often than you'd expect — when an API provider silently auto-upgrades the model version. If you requested gpt-4o but the response shows gpt-4o-2024-08-06, you can track exactly when a different version started responding. It seems minor, but it's surprisingly useful for catching cases where answer quality subtly shifts after a model upgrade.
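To make the request/response comparison concrete, here is a minimal sketch that records when each responding model was first seen for a given requested model. It assumes Spans have already been exported as plain dictionaries; `first_seen_by_model` is a hypothetical helper and the timestamps are made-up sample data, not real provider rollout dates.

```python
def first_seen_by_model(spans: list[dict]) -> dict[str, list[tuple[str, str]]]:
    """For each requested model, list every responding model with the time it first appeared."""
    seen: dict[str, list[tuple[str, str]]] = {}
    # ISO-8601 timestamps in the same format sort correctly as strings
    for span in sorted(spans, key=lambda s: s["start_time"]):
        attrs = span["attributes"]
        requested = attrs.get("gen_ai.request.model")
        responded = attrs.get("gen_ai.response.model")
        if requested is None or responded is None:
            continue  # not an LLM span, or the attribute was not recorded
        history = seen.setdefault(requested, [])
        if all(model != responded for model, _ in history):
            history.append((responded, span["start_time"]))
    return seen


# Made-up sample data for illustration only
spans = [
    {"start_time": "2024-10-02T09:30:00Z",
     "attributes": {"gen_ai.request.model": "gpt-4o", "gen_ai.response.model": "gpt-4o-2024-08-06"}},
    {"start_time": "2024-08-01T10:00:00Z",
     "attributes": {"gen_ai.request.model": "gpt-4o", "gen_ai.response.model": "gpt-4o-2024-05-13"}},
    {"start_time": "2024-10-03T11:15:00Z",
     "attributes": {"gen_ai.request.model": "gpt-4o", "gen_ai.response.model": "gpt-4o-2024-08-06"}},
]

history = first_seen_by_model(spans)
for model, first_seen in history["gpt-4o"]:
    print(f"{model} first responded at {first_seen}")
```

A second entry in the history is the point where answer quality is worth re-checking.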
How Trace Data Flows
A single LLM call doesn't exist in isolation. It's almost always part of a pipeline: HTTP request → document retrieval → LLM call → response formatting. OTel wraps all of this into a single Trace and represents each step as a Span.
```
[User Request]
└── Trace (Root Span: HTTP Request)
    ├── Span: RAG Document Retrieval (vector DB query)
    ├── Span: Context Preparation
    ├── Span: LLM Call (includes gen_ai.* attributes)
    │   ├── Event: gen_ai.content.prompt (opt-in)
    │   └── Event: gen_ai.content.completion (opt-in)
    └── Span: Response Formatting
```

Privacy Design Principle: Prompt and response content is not collected by default. It must be explicitly enabled with the OTEL_GENAI_CAPTURE_MESSAGE_CONTENT=true environment variable. The exact variable name depends on the instrumentation package (the official Python instrumentation packages spell it OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT), so check the documentation of the package you install. This opt-in design prevents PII or medical information from flowing directly into the tracing backend.
One recent shift worth noting: older approaches often embedded the full prompt text directly as a Span attribute. But as prompts grew to several KB in size, tracing backend performance visibly degraded. The current spec has evolved toward separating this into Log-based Events. Span attributes carry only metadata; actual content is sent separately as attached Events.
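The split between metadata and content can be sketched in a few lines. This is an illustration of the idea only, not how any instrumentation library is implemented: `build_llm_telemetry` is a hypothetical helper that returns plain dictionaries, and the event names are the ones shown in the trace tree above.

```python
import os


def build_llm_telemetry(request_model, prompt, completion, input_tokens, output_tokens,
                        capture_content=None):
    """Return (span_attributes, events); content goes into events only when opted in."""
    if capture_content is None:
        # Assumption: the flag name used in this article; your package may use a different one
        flag = os.environ.get("OTEL_GENAI_CAPTURE_MESSAGE_CONTENT", "false")
        capture_content = flag.lower() == "true"

    # Span attributes carry small, low-risk metadata only
    attributes = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

    # Potentially large or sensitive content is attached separately, and only on opt-in
    events = []
    if capture_content:
        events.append({"name": "gen_ai.content.prompt", "body": prompt})
        events.append({"name": "gen_ai.content.completion", "body": completion})
    return attributes, events


attrs_off, events_off = build_llm_telemetry("gpt-4o", "What is OTel?", "A framework.", 12, 5,
                                            capture_content=False)
attrs_on, events_on = build_llm_telemetry("gpt-4o", "What is OTel?", "A framework.", 12, 5,
                                          capture_content=True)
print(len(events_off), len(events_on))
```

The attributes are identical in both cases, so dashboards built on metadata keep working whether or not content capture is enabled.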
The Role of Each Span Type
Span types are subdivided to reflect the complexity of LLM systems. As multi-agent systems became more common, the need arose to distinguish agent decision-making steps as a separate type — the gen_ai.agent.* conventions were split into their own spec as part of this trend. MCP (Model Context Protocol) tool call tracing is also being incorporated here.
| Span Type | Target Operation |
|---|---|
| LLM inference | Inference calls like chat, completion |
| Embeddings | Text vectorization |
| Retrieval | Vector DB, document search |
| Tool execution | External tools invoked by agents |
| Agent steps | gen_ai.agent.* — agent decision-making steps |
Practical Application
Example 1: End-to-End RAG Pipeline Tracing
When I first built a RAG pipeline, the most frustrating thing was not being able to answer "retrieval worked fine, so why is the answer wrong?" Logging retrieval and LLM calls separately still left the two stages disconnected, making it nearly impossible to establish causality. Once you add OTel, the retrieval stage and LLM call stage are connected within the same trace, and you can see the full picture at a glance.
Let's start with SDK initialization. TracerProvider is the entry point for the tracing pipeline, BatchSpanProcessor batches Spans before sending to reduce network overhead, and OTLPSpanExporter is the component that actually sends Spans to the OTel Collector.
```python
import openai
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# TracerProvider: entry point for the tracing pipeline
# BatchSpanProcessor: batches Spans before sending (reduces network overhead)
# OTLPSpanExporter: delivers Spans to the OTel Collector
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Enable automatic injection of gen_ai.* attributes into the OpenAI SDK
OpenAIInstrumentor().instrument()

tracer = trace.get_tracer("rag-service", "1.0.0")


# Minimal signature example: actual implementation varies by vector DB
def vector_db_search(query: str, k: int) -> list[dict]:
    """Embeds the query and returns k similar documents from the vector DB"""
    ...


def build_prompt(query: str, docs: list[dict]) -> str:
    """Combines retrieved documents as context and returns the prompt string"""
    ...


def answer_question(query: str) -> str:
    with tracer.start_as_current_span("rag.pipeline") as span:
        # The raw query is user content; mask or drop it if it may contain PII
        span.set_attribute("user.query", query)

        # Retrieval step: must be manually spanned since auto-instrumentation doesn't cover it
        with tracer.start_as_current_span("rag.retrieval") as ret_span:
            docs = vector_db_search(query, k=5)
            ret_span.set_attribute("rag.retrieved_docs", len(docs))
            ret_span.set_attribute("rag.query_vector_dim", 1536)

        # LLM call: OpenAIInstrumentor automatically creates a child Span inside this block
        # rag.llm_call (manual) → openai.chat (automatic, with gen_ai.* attributes) forms a parent-child structure
        with tracer.start_as_current_span("rag.llm_call"):
            # The module-level client reads OPENAI_API_KEY from the environment
            response = openai.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": build_prompt(query, docs)}],
                temperature=0.3,
            )
        return response.choices[0].message.content
```

| Code Point | Role |
|---|---|
| OpenAIInstrumentor().instrument() | Patches the OpenAI SDK so that gen_ai.* attributes are added automatically to all subsequent chat.completions.create() calls |
| tracer.start_as_current_span("rag.retrieval") | Manually creates a Span for the retrieval step, which auto-instrumentation doesn't cover |
| ret_span.set_attribute("rag.retrieved_docs", ...) | Records the number of retrieved results as a Span attribute, which can later be cross-analyzed against answer quality |
| rag.llm_call (manual) + OpenAI Span (automatic) | Intentional parent-child nesting. rag.llm_call carries business context, while the auto-instrumented Span inside it records the actual API call in detail |
With this structure, a trace UI like Jaeger shows "5 documents retrieved → LLM call → 1,847 tokens consumed → response generated" all on one screen. The causal chain — "sparse retrieval caused the LLM to hallucinate" — is visible in a single trace view.
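Once retrieval and the LLM call share a trace, the "sparse retrieval" question can also be asked programmatically. The sketch below assumes traces were exported as a mapping from trace ID to a list of span dictionaries; `flag_sparse_retrieval`, the trace IDs, and the threshold of three documents are all hypothetical choices for illustration.

```python
def flag_sparse_retrieval(traces: dict[str, list[dict]], min_docs: int = 3) -> list[str]:
    """Return IDs of traces whose rag.retrieval span returned fewer than min_docs documents."""
    flagged = []
    for trace_id, spans in traces.items():
        for span in spans:
            if span["name"] != "rag.retrieval":
                continue
            # A missing attribute is treated as zero retrieved documents
            if span["attributes"].get("rag.retrieved_docs", 0) < min_docs:
                flagged.append(trace_id)
            break  # only the first retrieval span of each trace is inspected
    return flagged


# Made-up sample traces using the span names from the example above
traces = {
    "trace-a": [
        {"name": "rag.retrieval", "attributes": {"rag.retrieved_docs": 5}},
        {"name": "rag.llm_call", "attributes": {}},
    ],
    "trace-b": [
        {"name": "rag.retrieval", "attributes": {"rag.retrieved_docs": 1}},
        {"name": "rag.llm_call", "attributes": {}},
    ],
}

suspects = flag_sparse_retrieval(traces)
print(suspects)
```

Traces flagged this way are good candidates to open first when users report a wrong answer.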
Now let's go one level deeper from the application code and look at how to trace the internal behavior of an agent framework.
Example 2: Tracing Multi-Agent Workflows with OpenLLMetry
When you build agents with LangChain, you inevitably hit a moment where you have no idea "what is this thing doing right now?" This is especially true when it loops dozens of times and exceeds the context window. There's an error message, but figuring out which tool it called and what decision it made just before is genuinely difficult. With OpenLLMetry, the framework's internal operations are automatically decomposed into Spans.
```python
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from openai import OpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import agent, task

# One-line initialization enables automatic instrumentation for LangChain/LlamaIndex/CrewAI
Traceloop.init(
    app_name="research-agent",
    api_endpoint="http://otel-collector:4318",
)


@agent(name="research_coordinator")
def run_research(topic: str) -> str:
    # The @agent decorator wraps the entire function as a single Agent Span
    # Internal LangChain LLM calls and tool executions are each automatically recorded as child Spans
    llm = ChatOpenAI(model="gpt-4o")
    agent_executor = AgentExecutor(
        # create_react_agent also needs a ReAct prompt template; it is elided here like the tool list
        agent=create_react_agent(llm, tools=[...], prompt=...),
        tools=[...],
    )
    return agent_executor.invoke({"input": f"Research about: {topic}"})["output"]


@task(name="summarize_findings")
def summarize(raw_findings: str) -> str:
    # The @task decorator creates a separate Task Span distinct from the agent loop
    # This makes it easy to distinguish agent steps from post-processing stages in the trace
    client = OpenAI()
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {raw_findings}"}],
    ).choices[0].message.content
```

OpenLLMetry: An OTel-based automatic LLM instrumentation library developed by Traceloop. Supports over 40 providers including OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI. Acquired by ServiceNow in March 2026 but continues to be maintained as open source (Apache 2.0).
The Spans generated by the @agent and @task decorators serve different roles. @agent wraps the entire agent execution flow as a single root Span and automatically generates child Spans for internal LLM calls and tool executions. @task marks independent processing steps that are separate from the agent loop. In a trace UI, you can read the timeline directly: the agent called tool A, it failed, the agent retried with tool B, and you can see on which attempt it succeeded.
Now let's step away from application code and look at what can be done at the Collector level.
Example 3: Aggregating Token Cost Metrics
Honestly, OTel Collector configuration looks intimidating at first glance. But once you've experienced "why is this month's OpenAI bill so high?" running in production without this pipeline, you understand why it's worth the investment. Once you can see which endpoints are consuming how many tokens by feature, you'll inevitably find a cost hotspot you didn't expect.
For application developers: This section covers configuration from the perspective of an infrastructure/DevOps person who operates the OTel Collector. If your team has an infrastructure owner, share this section and set it up together.
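gen_ai.usage.input_tokens and gen_ai.usage.output_tokens are recorded as Span attributes. The spanmetrics connector is the bridge from traces to Prometheus: it reads Spans and emits call-count and duration metrics labeled with the Span attributes you select as dimensions. Note that it does not sum numeric attributes on its own, so a token counter needs either an additional aggregation step in the Collector or the token-usage metric (gen_ai.client.token.usage) emitted by the instrumentation itself.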
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      # OTLP/HTTP, needed for the Traceloop endpoint on port 4318 in Example 2
      http:
        endpoint: 0.0.0.0:4318

processors:
  # PII masking: required when prompt capture is enabled
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Masks values shaped like six digits, a hyphen, and seven digits
          - replace_pattern(attributes["gen_ai.content.prompt"], "\\b\\d{6}-\\d{7}\\b", "***")

  # Tail-based sampling: 100% of error Spans, only 10% of successful ones
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

# spanmetrics: connector that derives call-count and duration metrics from Spans,
# labeled with the dimensions below. It does not sum numeric attributes such as
# gen_ai.usage.input_tokens by itself.
connectors:
  spanmetrics:
    namespace: gen_ai
    dimensions:
      - name: gen_ai.system
      - name: gen_ai.request.model
      # span.name is already a default dimension, so it is not listed again
    metrics_flush_interval: 15s

exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
  otlp/jaeger:
    endpoint: "http://jaeger:4317"
    tls:
      insecure: true  # plaintext gRPC to a local Jaeger; remove for TLS-enabled backends

service:
  pipelines:
    traces:
      receivers: [otlp]
      # spanmetrics sits after tail_sampling here, so it only sees sampled Spans
      processors: [transform, tail_sampling]
      # also deliver Spans to the spanmetrics connector, required for metric conversion
      exporters: [otlp/jaeger, spanmetrics]
    metrics:
      # also collect metrics generated by spanmetrics
      receivers: [otlp, spanmetrics]
      exporters: [prometheusremotewrite]
```

The key insight in this pipeline is that the Collector handles PII masking and sampling as a middle layer. You don't need to put this logic in your application code, and if you later switch backends from Jaeger to Datadog, the application code doesn't need to change. Thanks to the spanmetrics connector, you can slice call counts and latency in Grafana by span.name or gen_ai.request.model. A counter such as gen_ai_usage_input_tokens_total additionally requires the token attributes to be summed, either by a Collector component that sums attribute values or by the token-usage metric the instrumentation emits; once that counter exists, the same labels show costs by feature.
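Before relying on a masking rule in the Collector, it helps to sanity-check the regular expression against sample strings. The sketch below applies the same pattern as the replace_pattern statement using Python's `re` module. `mask_ids` is a hypothetical helper and the sample strings are made up; the Collector uses Go's regex engine, which can differ from Python's in edge cases, so treat this as a quick check rather than a guarantee.

```python
import re

# Same pattern as in the Collector config: six digits, a hyphen, seven digits,
# bounded by word boundaries so longer digit runs are not partially matched
ID_PATTERN = re.compile(r"\b\d{6}-\d{7}\b")


def mask_ids(text: str) -> str:
    """Replace every match of ID_PATTERN with a fixed placeholder."""
    return ID_PATTERN.sub("***", text)


masked = mask_ids("Customer 900101-1234567 asked about refunds")
untouched = mask_ids("Order 12345-1234567 shipped")  # only five digits before the hyphen
print(masked)
print(untouched)
```

Keeping a handful of such strings as test cases makes it obvious when a later change to the pattern starts leaking or over-masking.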
Pros and Cons Analysis
Here is a summary of the constraints encountered and how to address them when applying this to real projects.
Advantages
| Item | Details |
|---|---|
| Vendor independence | Implement gen_ai.* standard attributes once and send to any backend — Datadog, Grafana, Jaeger, etc. |
| Existing APM integration | Existing HTTP/DB Spans and LLM Spans are connected within the same Trace — enabling full service flow visibility |
| Auto-instrumentation | Auto-instrumentation packages available for 40+ providers including OpenAI, Anthropic, LangChain, and LlamaIndex |
| Cost tracking | Quantify actual cost per model and per feature using token usage metrics |
| Standard ecosystem | CNCF project with strong community backing and long-term sustainability |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Experimental instability | Most GenAI Semconv attributes are still experimental — attribute names can change in minor versions | Pin convention versions in CI, automate migration scripts |
| Privacy and compliance | Enabling prompt capture risks leaking PII or trade secrets | Build data masking pipelines, access controls, and retention policies first |
| Storage and performance overhead | Full prompt storage can make Span sizes several KB to tens of KB | Combine head-based and tail-based sampling, use content hashing |
| Custom instrumentation unavoidable | Auto-instrumentation only covers LLM SDK call level — RAG chunking, prompt rendering, etc. require manual instrumentation | Add instrumentation for business logic steps with tracer.start_as_current_span() |
| Ecosystem fragmentation | Langfuse, Helicone, LangSmith, etc. still maintain their own proprietary formats | Apply OTel standard first, then use per-tool OTLP adapters |
Head-based Sampling: Decides whether to collect a trace at the moment it starts. Fast, but traces that are already dropped cannot be recovered even if an error occurs later.
Tail-based Sampling: Decides whether to collect a trace after the entire trace is complete, based on the outcome. Guarantees error traces are captured, but has higher memory usage.
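The difference between the two strategies comes down to what information is available at decision time. Here is a minimal sketch under simplifying assumptions: the helper names are hypothetical, the hash-bucket approach is just one way to get a stable per-trace decision, and a real Collector additionally buffers Spans in memory while it waits for the trace to finish.

```python
import hashlib


def sampled_bucket(trace_id: str) -> int:
    """Map a trace ID to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def head_decision(trace_id: str, sampling_percentage: float) -> bool:
    """Head-based: decided when the trace starts, before any outcome is known."""
    return sampled_bucket(trace_id) < sampling_percentage * 100


def tail_decision(trace_id: str, span_statuses: list[str], sampling_percentage: float) -> bool:
    """Tail-based: decided after all spans arrived, so error traces can always be kept."""
    if "ERROR" in span_statuses:
        return True
    return sampled_bucket(trace_id) < sampling_percentage * 100


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept_fraction = sum(head_decision(t, 10) for t in trace_ids) / len(trace_ids)
print(f"head-based keeps about {kept_fraction:.1%} of traces")
print(tail_decision("trace-42", ["OK", "ERROR"], 0))  # error traces survive even at 0%
```

Hashing the trace ID keeps the decision consistent across services, which matters when several Collectors see Spans from the same trace.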
The Most Common Mistakes in Practice
- Turning on prompt capture before anything else: Testing OTEL_GENAI_CAPTURE_MESSAGE_CONTENT=true in a development environment and then deploying to production as-is means customer data accumulates in the tracing backend. The masking pipeline comes before enabling capture.
- Trusting auto-instrumentation alone and skipping custom Spans: OpenAIInstrumentor captures LLM calls, but not the document retrieval logic or prompt assembly that happens before them. To know "which step took time" in a RAG pipeline, you need manual instrumentation for each business logic stage.
- Not pinning convention versions: If you use opentelemetry-instrumentation-openai without version pinning and it auto-upgrades to a version where attribute names changed, Grafana dashboard queries will all break at once. It's best to pin down to the minor version in requirements.txt.
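Version pinning can be backed by a small contract check in CI that fails when the attribute names your dashboards rely on disappear. This is a sketch under assumptions: `missing_attributes` and the required set are hypothetical, and the "old" sample uses the prompt/completion token names that earlier convention releases used, to my knowledge.

```python
# Attribute names that dashboards and alerts are assumed to query
REQUIRED_ATTRIBUTES = frozenset({
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
})


def missing_attributes(span_attributes: dict, required=REQUIRED_ATTRIBUTES) -> set[str]:
    """Return the required attribute names that are absent from a captured span."""
    return set(required) - set(span_attributes)


# Sample span as emitted under the current names
current_span = {
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 1234,
    "gen_ai.usage.output_tokens": 256,
}
# Sample span using older token attribute names
old_span = {
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.prompt_tokens": 1234,
    "gen_ai.usage.completion_tokens": 256,
}

print(missing_attributes(current_span))
print(sorted(missing_attributes(old_span)))
```

Running a check like this against one captured Span after every dependency bump surfaces a rename before the dashboards do.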
Closing Thoughts
In the beginning, things worked fine without observability. When you call things directly in a development environment, everything seems to work. But once production traffic hits, you repeatedly find yourself with no idea where something went wrong. User complaints come in and all you can say is "we can't reproduce it." That's when you realize observability for LLM systems isn't optional. GenAI Semantic Conventions being experimental might feel like a risk, but with major vendors already announcing native support, starting now is anything but premature.
Three steps you can take right now:
- Add auto-instrumentation to an existing OpenAI project. Time required: ~5 minutes / Prerequisites: OpenAI API key, Python environment
  Install with pip install opentelemetry-instrumentation-openai and add OpenAIInstrumentor().instrument() at the app startup point. Even with just a console exporter attached first, you can immediately see gen_ai.* attributes printed in the terminal.
- Run Jaeger + OTel Collector locally and visualize traces. Time required: ~20 minutes / Prerequisites: Docker installed
  Bring up jaegertracing/all-in-one and the OTel Collector with docker compose up and send traces to the OTLP endpoint (http://localhost:4317). If you have a RAG pipeline, you can see how a trace connecting the retrieval step through the LLM call looks when visualized.
- Analyze per-feature costs using token usage metrics. Time required: ~30 minutes / Prerequisites: Step 2 complete, Prometheus/Grafana environment
  Attach the spanmetrics connector from Example 3 and, once token usage is exported as a counter such as gen_ai_usage_input_tokens_total, aggregate it by span.name. Slicing daily token consumption by feature in Grafana will reveal endpoints consuming more cost than you expected.
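Turning token counts into money is simple arithmetic once the counts are grouped by span name. The sketch below works on exported span dictionaries instead of Prometheus. `cost_by_feature` is a hypothetical helper, and the prices are placeholder values that must be replaced with your provider's current price list.

```python
from collections import defaultdict

# Placeholder USD prices per 1M tokens, for illustration only
PRICE_PER_MILLION = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def cost_by_feature(spans: list[dict]) -> dict[str, float]:
    """Sum the estimated cost of LLM spans, grouped by span name."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        attrs = span["attributes"]
        price = PRICE_PER_MILLION[attrs["gen_ai.request.model"]]
        cost = (
            attrs["gen_ai.usage.input_tokens"] * price["input"]
            + attrs["gen_ai.usage.output_tokens"] * price["output"]
        ) / 1_000_000
        totals[span["name"]] += cost
    return dict(totals)


# Made-up sample spans using the span names from the examples above
spans = [
    {"name": "rag.llm_call", "attributes": {
        "gen_ai.request.model": "gpt-4o",
        "gen_ai.usage.input_tokens": 1_000_000,
        "gen_ai.usage.output_tokens": 100_000}},
    {"name": "summarize_findings", "attributes": {
        "gen_ai.request.model": "gpt-4o-mini",
        "gen_ai.usage.input_tokens": 2_000_000,
        "gen_ai.usage.output_tokens": 500_000}},
]

costs = cost_by_feature(spans)
for feature, usd in sorted(costs.items(), key=lambda item: item[1], reverse=True):
    print(f"{feature}: ${usd:.2f}")
```

Input and output tokens are priced separately because output tokens are typically more expensive, which is why a chatty feature can dominate the bill even with short prompts.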