How to Never Miss Errors with Grafana Tempo TraceQL: A Practical Guide for Sampling Environments
When operating distributed systems, you may feel uneasy wondering, "Could error traces be getting dropped because of sampling?" In Head-based sampling environments, whether a trace is saved is decided the moment a request begins — so if an error occurs later, the trace may never be stored at all and simply won't appear in TraceQL search results. You've probably experienced the frustration of an error leaving no trace behind.
This post is written for both backend and frontend developers and introduces two paths — direct TraceQL queries and RED metric-based alerting via Metrics-Generator — with practical code examples, so you never miss an error even in a sampling environment. Frontend developers can use the same detection and alerting methods described here for client-side errors from the browser, once they are collected into Tempo via the OpenTelemetry SDK. This guide covers the most current approaches available, including Tempo 2.9+ and Grafana v12.1 experimental features.
Prerequisites: To follow the hands-on examples in this post, Tempo and Prometheus must already be deployed. This guide assumes an environment where you can directly modify the `tempo.yaml` configuration file and configure Prometheus `remote_write`.
Core Concepts
What Are Grafana Tempo and TraceQL
Grafana Tempo is an open-source backend for storing and querying distributed traces in Jaeger, Zipkin, and OpenTelemetry formats. Because it stores traces in object storage (S3, GCS, etc.) without a separate index, it is highly cost-efficient.
TraceQL is Tempo's dedicated query language, with syntax similar to PromQL and LogQL. It allows you to filter traces based on span attributes, resource metadata, and parent-child span relationships.
```
{ status = error && resource.service.name = "payment-service" }
```

**Basic TraceQL Structure**: Spans are selected using the `{ filter conditions }` form, and aggregation functions are chained with the `|` pipe operator. It reads similarly to PromQL's label selectors, but differs in that it supports queries involving the trace hierarchy.
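For example, the structural descendant operator `>>` finds error spans anywhere beneath the spans of a given service — something a flat label selector cannot express (the service name here is illustrative):

```
{ resource.service.name = "api-gateway" } >> { status = error }
```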
How Sampling Interferes with Error Detection
Sampling is a strategy that records only a subset of traces to storage. Traces that are not stored cannot be queried with TraceQL. This is the core constraint.
| Sampling Method | How It Works | Suitability for Error Detection |
|---|---|---|
| Head-based | Decides whether to save at the moment a request starts | Low — decision is made before knowing whether an error will occur |
| Tail-based | Decides whether to save after the trace completes, based on error presence | High — can prioritize preserving error traces |
| Adaptive | Dynamically switches sampling strategy based on query characteristics (under roadmap discussion) | High — automatic optimization planned |
**Tail-based Sampling**: The OpenTelemetry Collector's `tailsamplingprocessor` supports this. You can configure policies that prioritize preserving traces matching conditions such as error status codes or high latency. Specific Collector configuration will be covered in a follow-up post.
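As a rough preview, a minimal sketch of such a policy in the Collector configuration might look like the following — the `decision_wait` and `threshold_ms` values are illustrative placeholders, not recommendations:

```yaml
# OpenTelemetry Collector — minimal tail_sampling policy sketch (illustrative values)
processors:
  tail_sampling:
    decision_wait: 10s          # Wait this long after the first span before deciding
    policies:
      - name: keep-errors       # Always keep traces that contain an error span
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow         # Always keep unusually slow traces
        type: latency
        latency:
          threshold_ms: 500
```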
TraceQL Metrics and the local-blocks Processor
TraceQL Metrics is a feature that generates time-series metrics directly from traces using the `{ filter } | rate()` form. To use this feature, the `local-blocks` processor must be explicitly enabled in the Tempo configuration.

The configuration below is placed under `overrides.defaults` and applies as the default for all tenants. This section is separate from the top-level `metrics_generator` block that configures the Metrics-Generator component itself (see Example 3 for practical application).
```yaml
# tempo.yaml — Enable local-blocks as tenant default
overrides:
  defaults:
    metrics_generator:
      processors:
        - local-blocks  # Required processor for TraceQL Metrics queries
```

**local-blocks Processor**: Temporarily stores collected trace blocks locally so that TraceQL Metrics queries can aggregate them in real time. Stabilized in Tempo 2.9.
Practical Examples
Example 1: Directly Exploring Error Spans with TraceQL
Using the following query in the Tempo Explore view, you can instantly find error traces for a specific service.
```
{ status = error && resource.service.name = "payment-service" }
```

To filter for 4xx/5xx responses by HTTP status code, write:

```
{ span.http.status_code >= 400 }
```

To narrow errors down to a specific endpoint, combine attributes. Per the OpenTelemetry Semantic Conventions, `url.path` is the recommended attribute for the URL path (`http.url` is deprecated); in TraceQL it is referenced with the `span.` scope:

```
{ status = error && span.url.path =~ ".*/checkout.*" && span.http.status_code >= 500 }
```

| Attribute Key | Description | Example Value |
|---|---|---|
| `status` | OpenTelemetry span status | `error`, `ok`, `unset` |
| `resource.service.name` | Service identifier | `"payment-service"` |
| `span.http.status_code` | HTTP response code | `500`, `>= 400` |
| `span.url.path` | Request path (regex supported) | `=~ ".*/api/.*"` |
Example 2: Generating Error Rate Time Series with TraceQL Metrics
After enabling the `local-blocks` processor, you can draw an error rate graph in a Grafana dashboard using the following query.
```
{ status = error && resource.service.name = "payment-service" } | rate()
```

To compare error rates broken down by service, use the `by()` aggregation.

```
{ status = error } | rate() by (resource.service.name)
```

You can also optimize query performance with Dynamic Sampling hints.

```
{ status = error } | rate() with(sample=true)
```

**`with(sample=true)` Trade-off**: This hint speeds up queries by accepting approximate aggregation results. It is suitable when response speed matters more than precision, such as in dashboard visualizations. It is not recommended for setting alert thresholds where an accurate error count is required.
Example 3: Generating RED Metrics with Metrics-Generator and Setting Up Prometheus Alerts
To aggregate all errors regardless of sampling, the most reliable approach is to enable Tempo's Metrics-Generator and send the generated metrics to Prometheus.
Step 1 — Tempo Configuration (tempo.yaml)
The block below is the top-level `metrics_generator` section, which configures the behavior of the Metrics-Generator component itself. It serves a different role from the `local-blocks` entry under `overrides.defaults` introduced earlier.
```yaml
# tempo.yaml — Metrics-Generator component configuration
metrics_generator:
  processors:
    - service-graphs   # Generates request and failure metrics between services
    - span-metrics     # Generates per-span RED metrics
  storage:
    remote_write:
      - url: http://prometheus:9090/api/v1/write  # Prometheus endpoint
  processor:
    span_metrics:
      dimensions:
        - service.name       # Include only low-cardinality attributes
        - http.status_code
```

Step 2 — Verifying Generated Metrics in Prometheus
Once Metrics-Generator is enabled, the following metrics are automatically generated.
| Metric Name | Description |
|---|---|
| `traces_spanmetrics_calls_total` | Span call count by service and status code |
| `traces_spanmetrics_duration_seconds_bucket` | Latency histogram |
| `traces_service_graph_request_total` | Number of requests between services |
| `traces_service_graph_request_failed_total` | Number of failed requests between services |
Actual label values may vary depending on the Tempo version and configuration. Before writing Alert Rules, query `traces_spanmetrics_calls_total` in the Prometheus UI first to verify the actual values of the `status_code` label.
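For instance, a quick aggregation run in the Prometheus UI surfaces every distinct `status_code` value the generator actually emits (metric and label names here assume the defaults described above):

```promql
# List the distinct status_code values present in the generated span metrics
count by (status_code) (traces_spanmetrics_calls_total)
```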
Step 3 — Writing Prometheus Alert Rules
```yaml
# prometheus-rules.yaml — alerting on Metrics-Generator output
groups:
  - name: tempo-error-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])
            / rate(traces_spanmetrics_calls_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} error rate exceeded 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"
      - alert: PaymentServiceCriticalError
        expr: |
          rate(traces_spanmetrics_calls_total{
            service="payment-service",
            status_code="STATUS_CODE_ERROR"
          }[5m]) > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "payment-service critical error detected"
          description: "Error spans per second: {{ $value | humanize }}"
```

Example 4: Direct TraceQL Alerting in Grafana v12.1 (Experimental Feature)
In Grafana v12.1 and above, you can configure alerts by entering TraceQL Metrics queries directly into an Alert Rule. The UI path is Alerting → Alert rules → New alert rule; selecting Tempo from the data source dropdown in the query editor activates the TraceQL input field.
```
{ status = error && resource.service.name = "checkout-service" } | rate() by (resource.service.name)
```

After entering the query above, setting a Threshold condition of IS ABOVE 0.05 enables trace-native alerting without any metric conversion.
Note: This feature is in Experimental status as of Grafana v12.1. In production environments, it is recommended to run it alongside the Metrics-Generator-based alerting from Example 3.
Pros and Cons Analysis
Advantages
| Feature / Strategy | Details |
|---|---|
| Direct TraceQL exploration | Flexible span attribute-based filtering without a separate pipeline |
| TraceQL Metrics | Instantly generate time series from stored traces; easy dashboard integration |
| Metrics-Generator | Full Prometheus ecosystem integration; exhaustive aggregation independent of sampling |
| Combination with Tail-based Sampling | Improves TraceQL detection accuracy by increasing the preservation rate of error traces |
| Grafana Alerting integration | Manage alert rules, routing, and receivers from a single UI |
Disadvantages and Caveats
| Feature / Strategy | Details | Mitigation |
|---|---|---|
| Head-based sampling drops | Error traces not stored, making TraceQL queries impossible | Switch to Tail-based Sampling or use Metrics-Generator in parallel |
| local-blocks not enabled | TraceQL Metrics queries return empty results | Explicitly add the local-blocks processor under overrides.defaults |
| Metric cardinality explosion | Adding high-cardinality attributes to span-metrics increases Prometheus load | Include only low-cardinality attributes (service.name, http.status_code) in dimensions |
| v12.1 experimental alerting | Direct TraceQL alerting not recommended for production | Use Prometheus Alert Rules until the feature stabilizes |
| Alert threshold distortion | Error rates underreported in low sampling-ratio environments | Set thresholds based on Metrics-Generator (exhaustive aggregation) |
**Cardinality**: Refers to the number of label value combinations in a metric. Adding high-cardinality values like user IDs or `trace_id` as labels can create as many time series as there are users, causing Prometheus memory usage to spike. It is recommended to include only fixed-value attributes such as `service.name`, `http.method`, and `http.status_code` in dimensions.
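If you are unsure whether your current dimensions are safe, a rough cardinality check can be run directly in Prometheus (the metric and label names assume the span-metrics defaults used in this post):

```promql
# Total number of active series for the span call metric
count(traces_spanmetrics_calls_total)

# Series count contributed per service — a sudden jump here usually
# means a high-cardinality dimension was added
count by (service) (traces_spanmetrics_calls_total)
```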
RED Metrics: Refers to three indicators — Rate (requests per second), Error (error rate), and Duration (response time). The Metrics-Generator's span-metrics processor generates all three automatically.
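Under the metric names above, the three RED signals can be derived with PromQL roughly as follows (label names may differ depending on your `dimensions` configuration):

```promql
# Rate — requests per second, per service
sum by (service) (rate(traces_spanmetrics_calls_total[5m]))

# Error — fraction of spans with error status, per service
sum by (service) (rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
  / sum by (service) (rate(traces_spanmetrics_calls_total[5m]))

# Duration — p95 latency, per service
histogram_quantile(0.95,
  sum by (service, le) (rate(traces_spanmetrics_duration_seconds_bucket[5m])))
```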
The Most Common Mistakes in Practice
- **Not enabling `local-blocks`**: If you run a TraceQL Metrics query without adding `local-blocks` to `overrides.defaults.metrics_generator.processors`, you will always get empty results. This setting is separate from the top-level `metrics_generator` block, so it is recommended to verify both locations.
- **Including high-cardinality attributes in dimensions**: Adding unique values like `trace_id`, user IDs, or session IDs to `span_metrics.dimensions` can cause Prometheus OOM errors. It is recommended to include only low-cardinality attributes.
- **Designing alerts based solely on TraceQL exploration results in a head-based sampling environment**: Traces dropped by sampling cannot be queried at all, so even if TraceQL exploration results appear to show "no errors," actual errors may still exist. It is strongly recommended to run Metrics-Generator-based alerting in parallel.
Closing Thoughts
Your system will now have a dual-layer structure capable of detecting errors regardless of sampling. The most reliable error detection strategy in a sampling environment is to flexibly explore stored traces with TraceQL while connecting Metrics-Generator's exhaustive RED metric aggregation to Prometheus alerts.
Three steps you can take right now:
1. **Add the `local-blocks` and `span-metrics` processors to the Tempo configuration file.**
   - Add `local-blocks` to `overrides.defaults.metrics_generator.processors`
   - Add `span-metrics` to the top-level `metrics_generator.processors`
   - Enter the Prometheus endpoint in `storage.remote_write.url`
2. **Run an error rate query in Grafana Explore to confirm that local-blocks is working correctly.**
   - Query: `{ status = error } | rate() by (resource.service.name)`
   - If the result is empty, re-verify the `local-blocks` configuration
3. **Add the `HighErrorRate` Alert Rule to Prometheus and connect it to a Slack or PagerDuty receiver in Grafana Alerting.**
   - It is recommended to start with an initial threshold of `> 0.05` (5%) and adjust according to your service's characteristics
   - Before applying the Alert Rule, verify the actual `status_code` label values in the Prometheus UI first
Next Post: How to maximize the preservation rate of error and high-latency traces by configuring the OpenTelemetry Collector's Tail-based Sampling processor
References
- TraceQL Official Documentation | Grafana Tempo
- TraceQL Metrics Query Guide | Grafana Tempo
- TraceQL Metrics Sampling Guide | Grafana Tempo
- Configure TraceQL Metrics | Grafana Tempo
- Trace-Based Error Diagnosis Guide | Grafana Tempo
- Metrics from Traces Overview | Grafana Tempo
- Metrics-Generator Official Documentation | Grafana Tempo
- Span Metrics Processor | Grafana Tempo
- Trace-Based Alert Examples | Grafana Alerting
- TraceQL Metrics Troubleshooting Guide | Grafana Tempo
- Tempo 2.9 Release Notes | Grafana Tempo
- Tempo Metrics-Generator RED Metrics Configuration Hands-on (Community) | OneUptime Blog