How to Never Miss Errors with Grafana Tempo TraceQL: A Practical Guide for Sampling Environments
When operating distributed systems, you may feel uneasy wondering, "Could error traces be getting dropped because of sampling?" In Head-based sampling environments, whether a trace is saved is decided the moment a request begins — so if an error occurs later, the trace may never be stored at all and simply won't appear in TraceQL search results. You've probably experienced the frustration of an error leaving no trace behind.
This post is written for both backend and frontend developers and introduces two paths — direct TraceQL queries and RED metric-based alerting via Metrics-Generator — with practical code examples, so you never miss an error even in a sampling environment. Frontend developers can use the same detection and alerting methods described here for client-side errors from the browser, once they are collected into Tempo via the OpenTelemetry SDK. This guide covers the most current approaches available, including Tempo 2.9+ and Grafana v12.1 experimental features.
Prerequisites: To follow the hands-on examples in this post, Tempo and Prometheus must already be deployed. This guide assumes an environment where you can directly modify the `tempo.yaml` configuration file and configure Prometheus `remote_write`.
Core Concepts
What Are Grafana Tempo and TraceQL
Grafana Tempo is an open-source backend for storing and querying distributed traces in Jaeger, Zipkin, and OpenTelemetry formats. Because it stores traces in object storage (S3, GCS, etc.) without a separate index, it is highly cost-efficient.
TraceQL is Tempo's dedicated query language, with syntax similar to PromQL and LogQL. It allows you to filter traces based on span attributes, resource metadata, and parent-child span relationships.
```
{ status = error && resource.service.name = "payment-service" }
```

**Basic TraceQL Structure**: Spans are selected using the `{ filter conditions }` form, and aggregation functions are chained with the `|` pipe operator. It reads similarly to PromQL's label selectors, but differs in that it supports queries involving the trace hierarchy.
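For example, the structural descendant operator `>>` finds error spans anywhere beneath the spans of a given service — something a flat label selector cannot express (the service name here is illustrative):

```
{ resource.service.name = "api-gateway" } >> { status = error }
```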
How Sampling Interferes with Error Detection
Sampling is a strategy that records only a subset of traces to storage. Traces that are not stored cannot be queried with TraceQL. This is the core constraint.
| Sampling Method | How It Works | Suitability for Error Detection |
|---|---|---|
| Head-based | Decides whether to save at the moment a request starts | Low — decision is made before knowing whether an error will occur |
| Tail-based | Decides whether to save after the trace completes, based on error presence | High — can prioritize preserving error traces |
| Adaptive | Dynamically switches sampling strategy based on query characteristics (under roadmap discussion) | High — automatic optimization planned |
**Tail-based Sampling**: The OpenTelemetry Collector's `tailsamplingprocessor` supports this. You can configure policies that prioritize preserving traces matching conditions such as error status codes or high latency. Specific Collector configuration will be covered in a follow-up post.
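As a rough preview, a minimal sketch of such a policy in the Collector configuration might look like the following — the `decision_wait` and `threshold_ms` values are illustrative placeholders, not recommendations:

```yaml
# OpenTelemetry Collector — minimal tail_sampling policy sketch (illustrative values)
processors:
  tail_sampling:
    decision_wait: 10s          # Wait this long after the first span before deciding
    policies:
      - name: keep-errors       # Always keep traces that contain an error span
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow         # Always keep unusually slow traces
        type: latency
        latency:
          threshold_ms: 500
```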
TraceQL Metrics and the local-blocks Processor
TraceQL Metrics is a feature that generates time-series metrics directly from traces using the `{ filter } | rate()` form. To use this feature, the `local-blocks` processor must be explicitly enabled in the Tempo configuration.

The configuration below is placed under `overrides.defaults` and applies as the default for all tenants. This section is separate from the top-level `metrics_generator` block that configures the Metrics-Generator component itself (see Example 3 for practical application).
```yaml
# tempo.yaml — Enable local-blocks as tenant default
overrides:
  defaults:
    metrics_generator:
      processors:
        - local-blocks  # Required processor for TraceQL Metrics queries
```

**local-blocks Processor**: Temporarily stores collected trace blocks locally so that TraceQL Metrics queries can aggregate them in real time. Stabilized in Tempo 2.9.
Practical Examples
Example 1: Directly Exploring Error Spans with TraceQL
Using the following query in the Tempo Explore view, you can instantly find error traces for a specific service.
```
{ status = error && resource.service.name = "payment-service" }
```

To filter for 4xx/5xx responses by HTTP status code, write:

```
{ span.http.status_code >= 400 }
```

To narrow errors down to a specific endpoint, combine attributes. Per the OpenTelemetry Semantic Conventions, `url.path` is the recommended attribute for the URL path (`http.url` is deprecated); in TraceQL it is referenced with the `span.` scope:

```
{ status = error && span.url.path =~ ".*/checkout.*" && span.http.status_code >= 500 }
```

| Attribute Key | Description | Example Value |
|---|---|---|
| `status` | OpenTelemetry span status | `error`, `ok`, `unset` |
| `resource.service.name` | Service identifier | `"payment-service"` |
| `span.http.status_code` | HTTP response code | `500`, `>= 400` |
| `span.url.path` | Request path (regex supported) | `=~ ".*/api/.*"` |
Example 2: Generating Error Rate Time Series with TraceQL Metrics
After enabling the `local-blocks` processor, you can draw an error rate graph in a Grafana dashboard using the following query.
```
{ status = error && resource.service.name = "payment-service" } | rate()
```

To compare error rates broken down by service, use the `by()` aggregation.

```
{ status = error } | rate() by (resource.service.name)
```

You can also optimize query performance with Dynamic Sampling hints.

```
{ status = error } | rate() with(sample=true)
```

**`with(sample=true)` Trade-off**: This hint speeds up queries by accepting approximate aggregation results. It is suitable when response speed matters more than precision, such as in dashboard visualizations. It is not recommended for setting alert thresholds where an accurate error count is required.
Example 3: Generating RED Metrics with Metrics-Generator and Setting Up Prometheus Alerts
To aggregate all errors regardless of sampling, the most reliable approach is to enable Tempo's Metrics-Generator and send the generated metrics to Prometheus.
Step 1 — Tempo Configuration (tempo.yaml)
The block below is the top-level `metrics_generator` section, which configures the behavior of the Metrics-Generator component itself. It serves a different role from the `local-blocks` entry under `overrides.defaults` introduced earlier.
```yaml
# tempo.yaml — Metrics-Generator component configuration
metrics_generator:
  processors:
    - service-graphs   # Generates request and failure metrics between services
    - span-metrics     # Generates per-span RED metrics
  storage:
    remote_write:
      - url: http://prometheus:9090/api/v1/write  # Prometheus endpoint
  processor:
    span_metrics:
      dimensions:
        - service.name       # Include only low-cardinality attributes
        - http.status_code
```

Step 2 — Verifying Generated Metrics in Prometheus
Once Metrics-Generator is enabled, the following metrics are automatically generated.
| Metric Name | Description |
|---|---|
| `traces_spanmetrics_calls_total` | Span call count by service and status code |
| `traces_spanmetrics_duration_seconds_bucket` | Latency histogram |
| `traces_service_graph_request_total` | Number of requests between services |
| `traces_service_graph_request_failed_total` | Number of failed requests between services |
Actual label values may vary depending on the Tempo version and configuration. Before writing Alert Rules, query `traces_spanmetrics_calls_total` in the Prometheus UI first to verify the actual values of the `status_code` label.
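For instance, a quick aggregation run in the Prometheus UI surfaces every distinct `status_code` value the generator actually emits (metric and label names here assume the defaults described above):

```promql
# List the distinct status_code values present in the generated span metrics
count by (status_code) (traces_spanmetrics_calls_total)
```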
Step 3 — Writing Prometheus Alert Rules
```yaml
# prometheus-rules.yaml — alerting on Metrics-Generator output
groups:
  - name: tempo-error-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])
            / rate(traces_spanmetrics_calls_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} error rate exceeded 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"
      - alert: PaymentServiceCriticalError
        expr: |
          rate(traces_spanmetrics_calls_total{
            service="payment-service",
            status_code="STATUS_CODE_ERROR"
          }[5m]) > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "payment-service critical error detected"
          description: "Error spans per second: {{ $value | humanize }}"
```

Example 4: Direct TraceQL Alerting in Grafana v12.1 (Experimental Feature)
In Grafana v12.1 and above, you can configure alerts by entering TraceQL Metrics queries directly into an Alert Rule. The UI path is Alerting → Alert rules → New alert rule; selecting Tempo from the data source dropdown in the query editor activates the TraceQL input field.
```
{ status = error && resource.service.name = "checkout-service" } | rate() by (resource.service.name)
```

After entering the query above, setting a Threshold condition of IS ABOVE 0.05 enables trace-native alerting without any metric conversion.
Note: This feature is in Experimental status as of Grafana v12.1. In production environments, it is recommended to run it alongside the Metrics-Generator-based alerting from Example 3.
Pros and Cons Analysis
Advantages
| Feature / Strategy | Details |
|---|---|
| Direct TraceQL exploration | Flexible span attribute-based filtering without a separate pipeline |
| TraceQL Metrics | Instantly generate time series from stored traces; easy dashboard integration |
| Metrics-Generator | Full Prometheus ecosystem integration; exhaustive aggregation independent of sampling |
| Combination with Tail-based Sampling | Improves TraceQL detection accuracy by increasing the preservation rate of error traces |
| Grafana Alerting integration | Manage alert rules, routing, and receivers from a single UI |
Disadvantages and Caveats
| Feature / Strategy | Details | Mitigation |
|---|---|---|
| Head-based sampling drops | Error traces not stored, making TraceQL queries impossible | Switch to Tail-based Sampling or use Metrics-Generator in parallel |
| local-blocks not enabled | TraceQL Metrics queries return empty results | Explicitly add the local-blocks processor under overrides.defaults |
| Metric cardinality explosion | Adding high-cardinality attributes to span-metrics increases Prometheus load | Include only low-cardinality attributes (service.name, http.status_code) in dimensions |
| v12.1 experimental alerting | Direct TraceQL alerting not recommended for production | Use Prometheus Alert Rules until the feature stabilizes |
| Alert threshold distortion | Error rates underreported in low sampling-ratio environments | Set thresholds based on Metrics-Generator (exhaustive aggregation) |
**Cardinality**: Refers to the number of label value combinations in a metric. Adding high-cardinality values like user IDs or `trace_id` as labels can create as many time series as there are users, causing Prometheus memory usage to spike. It is recommended to include only fixed-value attributes such as `service.name`, `http.method`, and `http.status_code` in dimensions.
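If you are unsure whether your current dimensions are safe, a rough cardinality check can be run directly in Prometheus (the metric and label names assume the span-metrics defaults used in this post):

```promql
# Total number of active series for the span call metric
count(traces_spanmetrics_calls_total)

# Series count contributed per service — a sudden jump here usually
# means a high-cardinality dimension was added
count by (service) (traces_spanmetrics_calls_total)
```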
RED Metrics: Refers to three indicators — Rate (requests per second), Error (error rate), and Duration (response time). The Metrics-Generator's span-metrics processor generates all three automatically.
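Under the metric names above, the three RED signals can be derived with PromQL roughly as follows (label names may differ depending on your `dimensions` configuration):

```promql
# Rate — requests per second, per service
sum by (service) (rate(traces_spanmetrics_calls_total[5m]))

# Error — fraction of spans with error status, per service
sum by (service) (rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
  / sum by (service) (rate(traces_spanmetrics_calls_total[5m]))

# Duration — p95 latency, per service
histogram_quantile(0.95,
  sum by (service, le) (rate(traces_spanmetrics_duration_seconds_bucket[5m])))
```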
The Most Common Mistakes in Practice
- **Not enabling `local-blocks`**: If you run a TraceQL Metrics query without adding `local-blocks` to `overrides.defaults.metrics_generator.processors`, you will always get empty results. This setting is separate from the top-level `metrics_generator` block, so it is recommended to verify both locations.
- **Including high-cardinality attributes in dimensions**: Adding unique values like `trace_id`, user IDs, or session IDs to `span_metrics.dimensions` can cause Prometheus OOM errors. It is recommended to include only low-cardinality attributes.
- **Designing alerts based solely on TraceQL exploration results in a head-based sampling environment**: Traces dropped by sampling cannot be queried at all, so even if TraceQL exploration results appear to show "no errors," actual errors may still exist. It is strongly recommended to run Metrics-Generator-based alerting in parallel.
Closing Thoughts
Your system will now have a dual-layer structure capable of detecting errors regardless of sampling. The most reliable error detection strategy in a sampling environment is to flexibly explore stored traces with TraceQL while connecting Metrics-Generator's exhaustive RED metric aggregation to Prometheus alerts.
Three steps you can take right now:
1. **Add the `local-blocks` and `span-metrics` processors to the Tempo configuration file.**
   - Add `local-blocks` to `overrides.defaults.metrics_generator.processors`
   - Add `span-metrics` to the top-level `metrics_generator.processors`
   - Enter the Prometheus endpoint in `storage.remote_write.url`
2. **Run an error rate query in Grafana Explore to confirm that local-blocks is working correctly.**
   - Query: `{ status = error } | rate() by (resource.service.name)`
   - If the result is empty, re-verify the `local-blocks` configuration
3. **Add the `HighErrorRate` Alert Rule to Prometheus and connect it to a Slack or PagerDuty receiver in Grafana Alerting.**
   - It is recommended to start with an initial threshold of `> 0.05` (5%) and adjust according to your service's characteristics
   - Before applying the Alert Rule, verify the actual `status_code` label values in the Prometheus UI first
Next Post: How to maximize the preservation rate of error and high-latency traces by configuring the OpenTelemetry Collector's Tail-based Sampling processor
References
- TraceQL Official Documentation | Grafana Tempo
- TraceQL Metrics Query Guide | Grafana Tempo
- TraceQL Metrics Sampling Guide | Grafana Tempo
- Configure TraceQL Metrics | Grafana Tempo
- Trace-Based Error Diagnosis Guide | Grafana Tempo
- Metrics from Traces Overview | Grafana Tempo
- Metrics-Generator Official Documentation | Grafana Tempo
- Span Metrics Processor | Grafana Tempo
- Trace-Based Alert Examples | Grafana Alerting
- TraceQL Metrics Troubleshooting Guide | Grafana Tempo
- Tempo 2.9 Release Notes | Grafana Tempo
- Tempo Metrics-Generator RED Metrics Configuration Hands-on (Community) | OneUptime Blog