The Complete Guide to MCP Server Observability: From Prometheus Metrics and Distributed Tracing to Anomaly Detection
A single complex agent workflow can generate tens to hundreds of tool calls within one user session. Each call adds latency and can produce errors or security-relevant events. If this flow remains a black box, latency spikes and abnormal call patterns go unnoticed until they turn into outages or security incidents. Because MCP is a relatively new protocol built on JSON-RPC 2.0, existing APM tools alone rarely provide enough visibility.
This article walks through the MCP observability pipeline in four stages: OpenTelemetry MCP semantic conventions, Prometheus metric instrumentation, distributed tracing context propagation, and SIEM anomaly detection. By the end, you should be able to configure a pipeline that adds structured logs, metrics, and distributed tracing to an MCP server and detects abnormal tool call patterns in real time.
Before you begin, a note on prerequisites. You should be able to follow most of the examples with basic Python or TypeScript knowledge and some experience with Docker. OTel Collector YAML and PromQL are explained as they appear in each section, so no prior experience with them is required. The target audience is backend or MLOps engineers who are deploying MCP to production or considering doing so.
Key Concepts
MCP and the 3 Key Elements of Observability
MCP (Model Context Protocol) is a standard communication protocol between AI agents and tools designed by Anthropic and released as open source in 2024. It runs on top of JSON-RPC 2.0 and enables AI agents (Clients) to communicate in a standardized manner with MCP servers (Servers) that provide external tools or data sources.
📌 What is a Tool Call? It is the unit operation in which an agent asks the MCP server to execute a tool. On the wire, it is a JSON-RPC request whose method is tools/call.
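To make the shape of a Tool Call concrete, the following is a minimal sketch that builds such a request with the Python standard library. The helper name `build_tool_call` is hypothetical and not part of any MCP SDK; a real client library would normally construct this message for you.

```python
import json
from typing import Any, Dict, Union


def build_tool_call(
    name: str, arguments: Dict[str, Any], request_id: Union[int, str]
) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request asking an MCP server to run one tool.

    Hypothetical helper for illustration only.
    """
    return {
        "jsonrpc": "2.0",
        "method": "tools/call",  # the MCP method that identifies a Tool Call
        "params": {"name": name, "arguments": arguments},
        "id": request_id,
    }


request = build_tool_call("get_weather", {"city": "Seoul"}, request_id=1)
wire_message = json.dumps(request)  # serialized form handed to the transport
```

Every metric, span, and log record discussed in this article describes one such message and its response.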
In an MCP environment, observability consists of three axes.
| Element | Role | Representative Tool |
|---|---|---|
| Structured Logs | Logs of Tool Call Requests/Responses and Error Causes | Loki, Datadog Logs |
| Metrics | Time series aggregation of latency, calls, and error rates | Prometheus, Grafana Mimir |
| Distributed Traces | Linking Agent Inference → Tool Execution Flow | Jaeger, Grafana Tempo |
The overall pipeline structure integrating the three elements is as follows.
MCP Client (agent)
    │  tools/call + _meta.traceparent (W3C Trace Context)
    ▼
MCP Server (Python/FastAPI)
    │  OTLP gRPC (traces, metrics, logs)      │  structured JSON logs
    ▼                                          ▼
OTel Collector                           Datadog Cloud SIEM
    │                                    (anomaly detection rules)
    ├──▶ Prometheus (metrics)
    │        │
    │        └──▶ Grafana (dashboards, alerts)
    │
    ├──▶ Grafana Tempo (distributed traces)
    │
    └──▶ Loki (logs)

OpenTelemetry MCP Semantic Convention
OpenTelemetry defines MCP-specific span attributes and metrics as semantic conventions, documented under the gen-ai/mcp/ path of the Gen AI conventions. Using these MCP-specific conventions instead of the generic RPC conventions is recommended.
⚠️ Stability Notice: Currently, most OTel semantic conventions related to Gen AI are in the experimental state. We recommend pinning your OTel SDK version before production deployment and regularly checking the release notes for convention changes.
Span names use the following format.

{mcp.method.name} {target}
Example: tools/call weather_tool

Here are examples of key span attributes.
| Attribute Key | Meaning | Example Value |
|---|---|---|
| mcp.method.name | Called MCP method | tools/call |
| mcp.tool.name | Name of executed tool | get_weather |
| mcp.session.id | Session identifier (used only as a trace attribute) | sess_abc123 |
| gen_ai.usage.input_tokens | Number of input tokens | 412 |
| error.type | Error classification | timeout |
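The naming rule and the attribute table can be combined into two small helpers. This is a sketch using plain Python, with hypothetical function names; falling back to the bare method name when there is no target is an assumption here, and because the conventions are still experimental, the attribute keys should be checked against the version you pin.

```python
from typing import Dict, Optional, Union

AttrValue = Union[str, int]


def mcp_span_name(method: str, target: Optional[str] = None) -> str:
    """Return '{mcp.method.name} {target}', or only the method if no target is given."""
    return f"{method} {target}" if target else method


def mcp_span_attributes(
    method: str,
    tool_name: str,
    session_id: Optional[str] = None,
    error_type: Optional[str] = None,
) -> Dict[str, AttrValue]:
    """Collect the span attributes from the table above, skipping unset values."""
    attrs: Dict[str, AttrValue] = {
        "mcp.method.name": method,
        "mcp.tool.name": tool_name,
    }
    if session_id:
        attrs["mcp.session.id"] = session_id
    if error_type:
        attrs["error.type"] = error_type
    return attrs


name = mcp_span_name("tools/call", "get_weather")
attrs = mcp_span_attributes("tools/call", "get_weather", session_id="sess_abc123")
```

Keeping the name and attribute logic in one place makes it easier to adjust if the convention changes between SDK releases.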
Distributed Trace Context Propagation
Because MCP is based on JSON-RPC 2.0, trace context propagation has one important limitation.
📌 JSON-RPC 2.0 Brief Overview: In JSON-RPC 2.0, the params object is the standard field that holds method arguments. _meta is not part of JSON-RPC itself; the MCP specification reserves it so that clients and servers can attach additional metadata to a message.
MCP messages are not guaranteed to travel over HTTP (the stdio transport has no headers at all), so W3C Trace Context cannot rely on HTTP headers. Instead, traceparent and tracestate are injected into the params._meta property bag.
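The traceparent value follows the W3C Trace Context layout version-trace_id-parent_id-flags, all in lowercase hex. As a rough sketch of what the propagator checks, the following standard-library parser validates that layout; the names `TraceParent` and `parse_traceparent` are hypothetical, and in a real server the OTel propagator does this work for you.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Parse a four-field traceparent value; return None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid according to the W3C spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest flag bit marks the trace as sampled.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

Placed in params._meta, such a value travels with the request as in the following message.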
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "get_weather",
    "arguments": { "city": "Seoul" },
    "_meta": {
      "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
      "tracestate": "vendor=myapp"
    }
  },
  "id": 1
}

On the server side, this value is parsed and restored to the OTel context.
from typing import Any, TypedDict

from opentelemetry.propagate import extract
from opentelemetry.trace import get_tracer

tracer = get_tracer("mcp-server")


class McpParams(TypedDict, total=False):
    name: str
    arguments: dict[str, Any]
    _meta: dict[str, str]


class McpRequest(TypedDict):
    jsonrpc: str
    method: str
    params: McpParams
    id: int | str


def execute_tool(params: McpParams) -> Any:
    """Replace with the actual tool execution logic."""
    raise NotImplementedError(f"Tool '{params.get('name')}' not implemented")


def handle_tool_call(request: McpRequest) -> Any:
    """Extract the W3C Trace Context from _meta and link it to an OTel span."""
    meta = request.get("params", {}).get("_meta", {})
    # Pass only the keys that are present so the propagator never sees empty strings.
    carrier = {
        key: meta[key] for key in ("traceparent", "tracestate") if meta.get(key)
    }
    ctx = extract(carrier)
    tool_name = request["params"].get("name", "unknown")
    with tracer.start_as_current_span(
        f"tools/call {tool_name}",
        context=ctx,
        attributes={
            "mcp.method.name": "tools/call",
            "mcp.tool.name": tool_name,
        },
    ):
        return execute_tool(request["params"])

Practical Application
The four steps below can be performed sequentially or applied independently. The Python MCP server (Steps 1, 2, and 4) and the TypeScript agent (Step 3) are a combination frequently seen in actual architectures. This is because Python is primarily used for server implementations combined with ML/AI libraries, while TypeScript is mainly used for the agent orchestration layer. If you use only Python, you can implement the TypeScript logic in Step 3 identically using the Python OTel SDK.
Step 1: Instrumenting Prometheus Metrics on a Python MCP Server
This is an example of collecting latency and error rates per tool by combining prometheus-fastapi-instrumentator with a FastAPI-based MCP server.
import time
from typing import Any, TypedDict

from fastapi import FastAPI
from prometheus_client import Counter, Histogram
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()


class ToolCallRequest(TypedDict):
    params: dict[str, Any]


# MCP-specific custom metrics
# Note: use only low-cardinality labels (tool_name, status).
# Record fields with many unique values, such as user_id, as trace
# attributes or log fields rather than Prometheus labels.
tool_duration = Histogram(
    "mcp_tool_invocation_duration_seconds",
    "MCP Tool Call execution time",
    labelnames=["tool_name", "status"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
)
tool_errors = Counter(
    "mcp_tool_errors_total",
    "Total number of MCP Tool Call errors",
    labelnames=["tool_name", "error_type"],
)
tool_calls = Counter(
    "mcp_tool_calls_total",
    "Total number of MCP Tool Calls",
    labelnames=["tool_name"],
)

# Automatically instrument the default FastAPI metrics and expose /metrics
Instrumentator().instrument(app).expose(app)


async def execute_tool(tool_name: str, arguments: dict[str, Any]) -> Any:
    """Replace with the actual tool execution logic."""
    raise NotImplementedError(f"Tool '{tool_name}' not implemented")


@app.post("/mcp/tools/call")
async def call_tool(request: ToolCallRequest) -> Any:
    tool_name = request["params"].get("name", "unknown")
    tool_calls.labels(tool_name=tool_name).inc()
    # perf_counter() is monotonic, so wall-clock adjustments cannot skew durations.
    start = time.perf_counter()
    try:
        result = await execute_tool(
            tool_name, request["params"].get("arguments", {})
        )
        duration = time.perf_counter() - start
        tool_duration.labels(tool_name=tool_name, status="success").observe(duration)
        return result
    except Exception as e:
        duration = time.perf_counter() - start
        tool_duration.labels(tool_name=tool_name, status="error").observe(duration)
        tool_errors.labels(tool_name=tool_name, error_type=type(e).__name__).inc()
        raise

When Prometheus scrapes the metrics exposed at the /metrics endpoint, you can query latency in Grafana using PromQL as shown below.
# p95 latency per tool
histogram_quantile(
  0.95,
  sum by (le, tool_name) (
    rate(mcp_tool_invocation_duration_seconds_bucket[5m])
  )
)

# Errors per second by tool and error type (1-minute window)
sum by (tool_name, error_type) (rate(mcp_tool_errors_total[1m]))

# Call volume trend per tool
sum by (tool_name) (rate(mcp_tool_calls_total[10m]))

💡 What is Cardinality? In Prometheus, it refers to the number of unique label-value combinations, each of which is stored as a separate time series. If you use a label whose value changes on every request, such as session_id, or one that takes thousands of values, such as user_id, the number of time series grows explosively and leads to memory issues. If user-specific analysis is required, record user_id as a trace attribute or log field, and restrict Prometheus labels to those with a small, bounded set of values, such as tool_name and status.
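The effect of a high-cardinality label is easy to estimate, and a tool name taken from a request can be bounded before it becomes a label value. The following is a minimal sketch; the allowlist `KNOWN_TOOLS` and both helper names are hypothetical and would be replaced by your server's registered tools.

```python
from typing import FrozenSet

# Hypothetical allowlist; in practice, derive it from the server's tool registry.
KNOWN_TOOLS: FrozenSet[str] = frozenset({"get_weather", "query_database", "send_report"})


def bounded_tool_label(tool_name: str, known: FrozenSet[str] = KNOWN_TOOLS) -> str:
    """Map unregistered tool names to 'other' so the label set stays bounded."""
    return tool_name if tool_name in known else "other"


def series_upper_bound(*distinct_values_per_label: int) -> int:
    """Upper bound on time series for one metric: product of distinct label values."""
    total = 1
    for count in distinct_values_per_label:
        total *= count
    return total


# 4 tool labels (3 tools + "other") x 2 statuses
safe = series_upper_bound(4, 2)
# The same metric with a user_id label for 10,000 users
exploded = series_upper_bound(4, 2, 10_000)
```

Here `safe` is 8 and `exploded` is 80,000, and a histogram multiplies each of those again by its number of buckets.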
Step 2: Configuring an Integrated Pipeline with OpenTelemetry Collector
By using OTel Collector as a central hub, you can collect metrics, traces, and logs generated by MCP servers into a single pipeline and route them to multiple backends.
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Scrape the MCP server's Prometheus /metrics endpoint
  prometheus:
    config:
      scrape_configs:
        - job_name: "mcp-server"
          scrape_interval: 15s
          static_configs:
            - targets: ["mcp-server:8000"]

processors:
  # PII masking: remove sensitive values from tool arguments.
  # replace_pattern() is OTTL (OpenTelemetry Transformation Language) syntax.
  # The statement is single-quoted so that YAML passes the backslashes through
  # unchanged; always verify the regex escaping in your own environment.
  transform/mask_pii:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - 'replace_pattern(attributes["mcp.tool.arguments"], "\"password\"\\s*:\\s*\"[^\"]+\"", "\"password\": \"[REDACTED]\"")'
  # Batching reduces export overhead
  batch:
    send_batch_size: 1000
    timeout: 10s
  # Memory limit
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  # Send traces to Grafana Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Send metrics via Prometheus Remote Write
  # (Prometheus must be started with --web.enable-remote-write-receiver)
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
  # Send logs to Loki
  # (check that your Collector distribution and version still ship this exporter)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform/mask_pii, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      # memory_limiter should run first so it can refuse data before batching
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      # transform/mask_pii defines only trace_statements above;
      # add log_statements to it if log records also need masking.
      processors: [transform/mask_pii, batch]
      exporters: [loki]

Step 3: Connecting Agent Inference and Tool Calls with End-to-End Trace
This is an example of configuring a parent-child trace structure in TypeScript that connects the Agent Inference span to the MCP Tool Call span.
📌 Node.js Version Note: The example below uses crypto.randomUUID(), which was added in Node.js 14.17.0 and 15.6.0. On older runtimes, install the uuid package and use import { v4 as uuidv4 } from 'uuid' instead.
import { randomUUID } from "crypto";
import {
  trace,
  context,
  defaultTextMapSetter,
  SpanStatusCode,
} from "@opentelemetry/api";
import { W3CTraceContextPropagator } from "@opentelemetry/core";

const tracer = trace.getTracer("mcp-agent", "1.0.0");
const propagator = new W3CTraceContextPropagator();

interface ToolPlan {
  name: string;
  sessionId: string;
  args: Record<string, unknown>;
}

// Calls the /mcp/tools/call endpoint of the MCP server configured in Step 1.
async function callMcpServer(request: unknown): Promise<unknown> {
  // e.g. fetch("http://mcp-server:8000/mcp/tools/call", {
  //   method: "POST",
  //   headers: { "Content-Type": "application/json" },
  //   body: JSON.stringify(request),
  // })
  throw new Error("callMcpServer: replace with the actual HTTP request logic");
}

async function planToolCalls(userQuery: string): Promise<ToolPlan[]> {
  // Replace with the actual LLM inference logic
  return [];
}

async function runAgentWithObservability(userQuery: string): Promise<void> {
  // Start the agent inference span (root)
  return await tracer.startActiveSpan(
    "gen_ai.agent reasoning",
    {
      attributes: {
        "gen_ai.system": "anthropic",
        "gen_ai.request.model": "claude-sonnet-4-6",
        "user.query": userQuery,
      },
    },
    async (agentSpan) => {
      try {
        const toolsToCall = await planToolCalls(userQuery);
        for (const tool of toolsToCall) {
          // Create the Tool Call span as a child of the agent span
          await tracer.startActiveSpan(
            `tools/call ${tool.name}`,
            {
              attributes: {
                "mcp.method.name": "tools/call",
                "mcp.tool.name": tool.name,
                "mcp.session.id": tool.sessionId,
              },
            },
            async (toolSpan) => {
              try {
                // Inject traceparent into _meta (W3C Trace Context propagation)
                const carrier: Record<string, string> = {};
                propagator.inject(context.active(), carrier, defaultTextMapSetter);
                const mcpRequest = {
                  jsonrpc: "2.0",
                  method: "tools/call",
                  params: {
                    name: tool.name,
                    arguments: tool.args,
                    _meta: {
                      traceparent: carrier["traceparent"] ?? "",
                      tracestate: carrier["tracestate"] ?? "",
                    },
                  },
                  id: randomUUID(), // requires Node.js 14.17+
                };
                const result = await callMcpServer(mcpRequest);
                toolSpan.setStatus({ code: SpanStatusCode.OK });
                return result;
              } catch (err) {
                toolSpan.setStatus({
                  code: SpanStatusCode.ERROR,
                  message: String(err),
                });
                toolSpan.recordException(err as Error);
                throw err;
              } finally {
                toolSpan.end();
              }
            }
          );
        }
        agentSpan.setStatus({ code: SpanStatusCode.OK });
      } catch (err) {
        // Mark the root span as failed when planning or any tool call throws
        agentSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
        throw err;
      } finally {
        agentSpan.end();
      }
    }
  );
}

The trace generated with this structure is visualized in Grafana Tempo or Jaeger as a hierarchy like the one below.
gen_ai.agent reasoning (350ms)
├── tools/call get_weather (45ms)
├── tools/call query_database (180ms)
│   └── db.query SELECT ... (120ms)
└── tools/call send_report (80ms)

Step 4: Detecting Tool Call Anomalies with Datadog Cloud SIEM
This is an example of collecting structured logs from an MCP server into Datadog and applying anomaly detection rules.
import json
import logging
from datetime import datetime, timezone
from typing import Any, Optional

# Assumes the application attaches a handler that emits INFO-level records.
logger = logging.getLogger("mcp.security")


def log_tool_call(
    tool_name: str,
    user_id: str,
    session_id: str,
    status: str,
    duration_ms: float,
    error: Optional[str] = None,
) -> None:
    """Emit a structured log that Datadog SIEM can parse."""
    log_entry: dict[str, Any] = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event.category": "mcp.tool_call",
        "mcp.tool.name": tool_name,
        "usr.id": user_id,  # Datadog standard user attribute
        "mcp.session.id": session_id,
        "http.status": 200 if status == "success" else 500,
        "duration_ms": duration_ms,
        "status": status,
    }
    if error:
        log_entry["error.message"] = error
        log_entry["error.kind"] = "MCPToolError"
    logger.info(json.dumps(log_entry))

Below is conceptual pseudocode for detection rules that can be configured in Datadog Cloud SIEM. The syntax may differ from the actual Datadog detection rule query language, so refer to the official Datadog documentation and adapt it to your environment.
# Detection rule 1: sudden surge in calls to a specific tool
# Alert when per-user Tool Calls exceed the baseline by more than 3σ
@event.category:mcp.tool_call
  | anomaly(count, direction:above, threshold:3)
  by usr.id, mcp.tool.name
  over 5m

# Detection rule 2: Tool Call attempts after repeated authentication failures
# Detect tools/call within 5 minutes of a 401 in the same session
@event.category:mcp.tool_call AND @http.status:401
  | sequence by mcp.session.id
  [within 5m] @http.status:401
  [within 5m] @event.category:mcp.tool_call

# Detection rule 3: high-frequency calls at unusual hours (detecting overnight automation)
@event.category:mcp.tool_call
  | @timestamp:[23:00 TO 06:00]
  | count > 100 by usr.id over 10m

Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Standardized Measurement | OTel Semantic Conventions Support MCP, Enabling Measurement Without Vendor Lock-in |
| Reuse existing infrastructure | Connect existing monitoring stacks such as Prometheus, Grafana, and Datadog as is |
| Rich Context | Log Tool Name, Arguments, Token Usage, and Latency in a Single Span |
| Security and Observability Integration | Handle SIEM Anomaly Detection and Performance Monitoring in the Same Pipeline |
| Agent Workflow Visibility | Track multi-step agent executions with end-to-end traces |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Context propagation non-standard | JSON-RPC-based MCP does not natively support W3C Trace Context | Implemented by injecting traceparent into params._meta |
| Argument sensitivity issue | Tool arguments may contain PII and secrets | Apply masking policy with OTel Collector's transform processor |
| Cardinality Explosion | Time Series Count Surges When Using Fields with Many Unique Values as Prometheus Labels | Use Only Labels with Limited Value Types, Like tool_name, status |
| Ecosystem immature | OTel MCP semantic conventions are in the experimental state and automated instrumentation libraries are lacking | Pin the SDK version and regularly monitor emerging libraries such as OpenLIT and opentelemetry-instrumentation-mcp |
| Latency Overhead | 10–30ms overhead may occur in high-frequency workloads when generating a span for every tool call | Mitigated by adjusting batch processor settings and sampling rate |
| Cost Attribution | Separate design required for token tracking to attribute costs by user/team | Aggregate gen_ai.usage.input_tokens attribute as Prometheus Counter |
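The sampling mitigation in the Latency Overhead row can be illustrated with a deterministic, trace-ID-based decision, similar in spirit to the ratio samplers that OTel SDKs provide. This is a stdlib-only sketch with a hypothetical function name; in production you would configure the SDK's own sampler rather than write one.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide from the trace ID alone whether to keep a trace.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is kept or dropped as a whole.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Treat the lower 64 bits of the trace ID as a uniformly distributed number.
    lower_64_bits = int(trace_id_hex[-16:], 16)
    return lower_64_bits < int(ratio * (1 << 64))


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
keep_at_70_percent = should_sample(trace_id, 0.7)
keep_at_50_percent = should_sample(trace_id, 0.5)
```

Because the decision depends only on the trace ID, the agent and the MCP server do not need to coordinate to agree on which traces are recorded.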
The Most Common Mistakes in Practice
- Recording tool arguments directly in logs and spans: Passwords, API keys, and personal information are exposed in telemetry data. It is recommended to set rules that mask sensitive fields in the OTel Collector's transform processor.
- Using high-cardinality values as Prometheus labels: Session IDs or user IDs that take thousands of distinct values or more are not suitable as Prometheus labels. It is recommended to record these identifiers as trace attributes or log fields, and to restrict Prometheus labels to those with a bounded set of values, such as tool_name and status.
- Creating only Tool Call spans without an agent root span: If Tool Call spans are isolated, it is impossible to track which user request triggered the tools and in what order. An effective structure is to open the agent inference as the root span first and then attach the subsequent Tool Calls as child spans.
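For the first mistake, masking can also happen in the server before arguments ever reach a span or log record, as a complement to the Collector rule. The following is a minimal sketch; the key list `SENSITIVE_KEYS` and the function `redact` are hypothetical and would need to match the argument names your tools actually use.

```python
from typing import Any

# Hypothetical list of argument names to mask; extend it for your own tools.
SENSITIVE_KEYS = frozenset({"password", "api_key", "token", "secret"})


def redact(value: Any) -> Any:
    """Return a copy of tool arguments with sensitive fields replaced."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


arguments = {
    "city": "Seoul",
    "password": "hunter2",
    "nested": {"API_KEY": "abc", "items": [{"token": "t", "limit": 5}]},
}
safe_arguments = redact(arguments)
```

Matching on key names misses secrets embedded in free-text values, so this approach reduces exposure but does not replace a review of what each tool receives.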
In Conclusion
MCP Observability goes beyond simple log collection; it is an integrated pipeline that connects the entire flow from agent inference to tool execution into a single trace and detects abnormal patterns in real time. With all four stages in place, latency spikes, error explosions, and abnormal call patterns will not remain as black boxes.
Here are 3 steps you can start right now. Each step can be applied independently or stacked in sequence.
- Add Prometheus Metrics: Expose the /metrics endpoint with pip install prometheus-fastapi-instrumentator and add the mcp_tool_invocation_duration_seconds Histogram to the Tool Call handler. If Prometheus and Grafana are already configured, you can complete the integration in one go.
- OTel Collector Deployment and Trace Pipeline Configuration: Starting from otel-collector-config.yaml in Step 2, connect Grafana Tempo or Jaeger as the trace backend and apply context propagation through params._meta. At this stage, the entire flow of agent inference-tool execution is connected into a single trace.
- Applying Anomaly Detection Rules: If you are using Datadog, you can configure rules in Cloud SIEM by referring to the detection concept in Step 4. If you are using an open-source stack, you can configure alerts for abnormal tool call frequencies using a combination of Grafana Alerts and Prometheus anomaly detection.
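For the open-source route, the 3σ idea behind detection rule 1 in Step 4 can be sketched in a few lines. This is a simplified illustration with a hypothetical function name, assuming you already have per-window call counts; with Prometheus, a comparable check is usually written as an alert rule that compares the current rate against avg_over_time and stddev_over_time of a longer window.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_call_spike(
    history: Sequence[int], current: int, threshold_sigma: float = 3.0
) -> bool:
    """Flag the current window if it exceeds the baseline by more than N sigma."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A perfectly flat baseline: treat any increase as unusual.
        return current > baseline
    return (current - baseline) / spread > threshold_sigma


# Tool Calls per 5-minute window for one user and tool (illustrative numbers)
recent_windows = [10, 12, 9, 11, 10, 13, 11]
spike = is_call_spike(recent_windows, current=40)
normal = is_call_spike(recent_windows, current=12)
```

A fixed 3σ threshold is sensitive to daily and weekly seasonality, so the baseline window should be chosen to cover a comparable period.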
Once the observability pipeline is set up, the next task is to control the MCP requests themselves at the gateway layer. In the next post, we will look at how to deploy Kong AI Gateway 3.12 as an MCP proxy to handle OAuth 2.1 authentication, rate limiting, and team cost attribution in a single layer.