OpenTelemetry Collector Tail-based Sampling: How to Preserve 100% of Errors & Slow Requests While Cutting Storage Costs by 70%
If you've adopted tracing in a distributed system only to find that traces for the most critical errors or slow requests are being dropped by sampling, you're not alone. Traditional Head-based Sampling makes the keep-or-drop decision the moment the first span arrives, with no way of knowing whether that request will later produce an error or exceed 500ms. The result: your most important traces in production get discarded at random.
By properly configuring the OpenTelemetry Collector's Tail-based Sampling processor, you can preserve 100% of error and slow-request traces while cutting storage costs for normal traces by 70% or more. (Based on AWS ADOT operational data — see "Example 1" in the body for details.) This article walks through the core mechanics of tailsamplingprocessor, production-validated policy combinations, and the pitfalls that commonly trip up practitioners.
By the end of this article, you'll be able to configure a 100% error-preservation policy, latency-based filtering, and per-service differential sampling yourself. This article is aimed at developers with distributed tracing experience who are already familiar with OTel fundamentals and YAML configuration. If you're new to the OTel Collector, it's recommended to review the official Getting Started documentation first.
Core Concepts
Head-based Sampling vs Tail-based Sampling
The fundamental difference between sampling strategies comes down to when the decision is made.
| Strategy | Decision Point | Error Preservation | Latency-based Judgment |
|---|---|---|---|
| Head-based | On first span arrival | Not possible | Not possible |
| Tail-based | After trace completion | Possible (100%) | Possible |
Tail-based Sampling: A sampling strategy that waits until all spans comprising a trace have been collected, then examines the full trace context (whether an error occurred, total elapsed time, etc.) before deciding whether to preserve it.
The OTel Collector's tailsamplingprocessor is a processor included in the contrib repository. It buffers traces in memory for the decision_wait duration, then evaluates the configured policies. You can define fine-grained rules by combining 13+ policy types using OR, AND, and composite logic.
The Relationship Between decision_wait and Latency Thresholds
The most important relationship to understand when configuring tailsamplingprocessor is the interaction between decision_wait and threshold_ms.
- `decision_wait`: The time the Collector waits for all spans of a trace to arrive. Once this time elapses, it evaluates policies against the collected spans to decide whether to preserve the trace.
- `threshold_ms`: The threshold for latency-based policies. Traces whose total duration exceeds this value are preserved.
The key point is that if decision_wait is shorter than threshold_ms, latency-based decisions can be incomplete. For example, if threshold_ms: 2000 (preserve traces exceeding 2 seconds), decision_wait should be set to at least 2 seconds, and preferably 3–4 seconds or more. Conversely, setting decision_wait: 30s means that even fast traces completing in 500ms will linger in memory for 30 seconds — a tradeoff worth bearing in mind.
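This relationship can be encoded directly in the config. A minimal sketch, assuming a 2-second latency threshold (the 4-second wait is an illustrative value, not a recommendation for any specific workload):

```yaml
processors:
  tail_sampling:
    # decision_wait must comfortably exceed the largest latency threshold,
    # or slow traces will be evaluated before they are fully assembled.
    decision_wait: 4s                       # ≥ 2× the 2000 ms threshold below
    policies:
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 2000}       # preserve traces slower than 2 s
```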
Policy Evaluation: OR and AND Combinations
Policies operate in OR mode by default. If any single policy renders a sample verdict, that trace is preserved. When multiple conditions must be satisfied simultaneously, wrap them in an and type policy.
```yaml
# AND condition example: selectively preserve errors from payment-service or checkout-service only
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    policies:
      - name: critical-service-errors
        type: and
        and:
          and_sub_policy:
            - name: service-filter
              type: string_attribute
              string_attribute:
                key: service.name
                values: [payment-service, checkout-service]
            - name: error-filter
              type: status_code
              status_code: {status_codes: [ERROR]}
```

Why Tail-based Sampling Now — 2025 Ecosystem Changes
In 2025, the OTel ecosystem saw significant changes that further strengthen the case for Tail Sampling.
- W3C TraceContext Level 2 standard adoption: The lower 56 bits of a TraceID are now guaranteed to be random, and encoding the sampling threshold (`th`) in the `ot` key of the `tracestate` header has been standardized.
- ConsistentProbabilityBased sampler introduced: Mathematically guarantees decision consistency between Head and Tail samplers in distributed systems, reducing the inefficiency of redundantly processing at the Tail stage traces that were already dropped at the Head stage.
- `ottl_condition` policy maturation: Complex conditions can now be expressed without code using OTTL (OpenTelemetry Transformation Language) expressions.
```yaml
# OTTL condition example: spans that are both errors and exceed 1 second
- name: ottl-error-slow
  type: ottl_condition
  ottl_condition:
    error_mode: ignore
    span:
      - |
        span.status.code == STATUS_CODE_ERROR and
        span.duration > Duration("1s")
```

Practical Application
Example 1: Error & Slow Request Preservation + Cost Optimization (AWS ADOT Pattern)
A pattern used in AWS Distro for OpenTelemetry (ADOT) environments to preserve error and slow-request traces while keeping only 5% of normal traces, cutting storage costs by 70% or more.
Caution with `UNSET` status code: Specifying UNSET alongside ERROR, as in `status_codes: [ERROR, UNSET]`, will also preserve most normal spans that have no explicitly set status. Since UNSET is the default value for spans, use `ERROR` alone if you only want to preserve error traces.
```yaml
processors:
  memory_limiter:                # Essential — Tail Sampling is memory-intensive
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  tail_sampling:
    decision_wait: 30s           # Set generously to accommodate maximum trace completion time
    num_traces: 100000
    expected_new_traces_per_sec: 500
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}   # Excludes UNSET; preserves explicit errors only
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 1000}          # Safe: decision_wait (30s) > threshold (1s)
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/backend]
```

| Policy | Type | Role |
|---|---|---|
| `keep-errors` | `status_code` | Preserve all traces with ERROR status code |
| `keep-slow` | `latency` | Preserve all traces exceeding 1 second |
| `sample-rest` | `probabilistic` | Random 5% sampling of remaining normal traces |
Example 2: Per-service Differential Sampling (composite policy)
A pattern that samples the payment service and general APIs at different rates while always preserving errors and slow requests.
Meaning of `rate_allocation.percent`: This value is not a sampling probability. It represents the share of the `max_total_spans_per_second` bandwidth allocated to each policy. The actual sampling ratio is controlled by the `probabilistic` setting inside each sub-policy.
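The distinction can be illustrated with a stripped-down fragment (the numbers are illustrative only):

```yaml
# Sketch: sampling_percentage vs rate_allocation.percent
composite:
  max_total_spans_per_second: 10000       # hard budget for all sampled spans
  composite_sub_policy:
    - name: default-policy
      type: probabilistic
      probabilistic: {sampling_percentage: 5}   # decides WHICH traces are kept (5%)
  rate_allocation:
    - policy: default-policy
      percent: 20    # caps this policy at 20% of the 10,000 span/s budget,
                     # i.e. 2,000 spans/s — it does not change the 5% sampling rate
```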
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    policies:
      - name: composite-policy
        type: composite
        composite:
          max_total_spans_per_second: 10000
          policy_order:
            - errors-policy
            - slow-traces-policy
            - payment-policy
            - default-policy
          composite_sub_policy:
            - name: errors-policy
              type: status_code
              status_code: {status_codes: [ERROR]}
            - name: slow-traces-policy
              type: latency
              latency: {threshold_ms: 500}
            - name: payment-policy          # Sample 50% of payment-service traces
              type: and
              and:
                and_sub_policy:
                  - name: service-match
                    type: string_attribute
                    string_attribute:
                      key: service.name
                      values: [payment-service]
                  - name: payment-rate
                    type: probabilistic
                    probabilistic: {sampling_percentage: 50}
            - name: default-policy          # Sample 5% of all remaining traces
              type: probabilistic
              probabilistic: {sampling_percentage: 5}
          rate_allocation:
            - policy: errors-policy
              percent: 30                   # Allocate 30% of total throughput to the error policy
            - policy: slow-traces-policy
              percent: 30
            - policy: payment-policy
              percent: 20                   # Allocate 20% of total throughput to the payment service policy
            - policy: default-policy
              percent: 20
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/backend]
```

Example 3: Horizontally Scalable Architecture (Grafana Labs Pattern)
When a single Collector instance reaches its processing limit (14,000+ spans per second), apply a dual-layer architecture. This is a structure validated by Grafana Labs on their own infrastructure.
```
[Application]
     │ OTLP
     ▼
[Layer 1 Collector — loadbalancingexporter]
     │ Consistent routing by traceID
     ▼
[Layer 2 Collector cluster — tail_sampling processor]
     │
     ▼
[Grafana Tempo / Backend]
```

`loadbalancingexporter`: Uses the traceID as a key to always route all spans sharing the same traceID to the same Layer 2 Collector instance. Without this component, simply scaling Collectors horizontally causes spans from the same traceID to scatter across multiple instances, resulting in incomplete trace evaluations.
```yaml
# Layer 1 Collector: handles traceID-based routing only (no tail_sampling)
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelcol-tail-sampling.tracing.svc.cluster.local
        port: 4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```
```yaml
# Layer 2 Collector: performs actual Tail Sampling
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4096          # Layer 2 bears the trace buffering burden — allocate generously
    spike_limit_mib: 1024
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 500
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/backend]
```

Pros and Cons
Advantages
| Item | Details |
|---|---|
| 100% error capture | Checks status_code = ERROR after trace completion, enabling complete preservation that is impossible with Head Sampling |
| Precise slow request detection | Measures total trace duration, then selects only those exceeding the threshold |
| Cost optimization | Dramatically reduces storage costs by sampling normal traces at a low rate (5–10%) |
| Versatile policy combinations | Fine-grained per-service control with 13+ policy types and AND/OR/composite combinations |
| OTTL condition support | The mature ottl_condition (2025) enables complex expression-based filtering without code |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Memory usage | RAM consumption is roughly `num_traces` × average trace size (100K traces × 20 KB ≈ 2 GB) | Apply the `memory_limiter` processor; tune `num_traces` |
| Late span drops | Spans arriving after `decision_wait` are unconditionally dropped | Set `decision_wait` comfortably above the expected maximum trace duration |
| Horizontal scaling constraints | Spans from the same traceID distributed across multiple instances result in incomplete evaluations | Apply `loadbalancingexporter` + dual-layer architecture |
| Operational complexity | More policies means harder debugging | Monitor `otelcol_processor_tail_sampling_*` metrics |
Late Span: A span that arrives significantly later than other spans in the same trace due to network delays, asynchronous processing, queue wait times, etc. If `decision_wait` is set too short, these spans arrive after the evaluation is already complete and are discarded.
Most Common Mistakes in Practice
- Setting `decision_wait` shorter than the latency threshold: With `threshold_ms: 2000` but `decision_wait: 1s`, a 2-second slow trace is evaluated before all of its spans have arrived, so it fails the latency check and is dropped. It is recommended to set `decision_wait` to 1.5–2× the expected maximum trace completion time.
- Attempting horizontal scaling without `loadbalancingexporter`: Simply replicating the Tail Sampling Collector across multiple instances causes spans from the same traceID to scatter across instances, leading to incomplete trace evaluations. A dual-layer architecture is mandatory.
- Undersetting `num_traces`: During traffic spikes, exceeding the `num_traces` limit causes the oldest traces to be dropped first. It is advisable to set this to at least `decision_wait (seconds) × traces_per_second × 1.5` based on peak traffic.
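The sizing rules above can be worked through with concrete numbers. A sketch assuming a peak of 500 new traces/sec and roughly 20 KB per buffered trace (both values illustrative, not measurements):

```yaml
# Sizing sketch (illustrative numbers, not recommendations):
#   num_traces ≥ decision_wait × peak traces/sec × 1.5
#              = 30 s × 500/s × 1.5 = 22,500  → round up for headroom
#   memory     ≈ num_traces × avg trace size
#              = 100,000 × ~20 KB ≈ 2 GB     → size limit_mib accordingly
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048          # ≈ the 2 GB estimate above
    spike_limit_mib: 512
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000       # well above the 22,500 minimum
```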
Closing Thoughts
When Tail-based Sampling is applied correctly, you can capture every error and slow-request trace that matters in production while dramatically lowering storage costs. This gives you the foundation to run an observability pipeline that achieves both cost efficiency and reliability.
Three steps you can take right now:
1. Switch to the OTel Collector contrib image: Replace the existing `otel/opentelemetry-collector` image with `otel/opentelemetry-collector-contrib`. The `tailsamplingprocessor` is only included in the contrib repository. For production, it is recommended to pin a specific version rather than using the `latest` tag (e.g., `otel/opentelemetry-collector-contrib:0.100.0`).
2. Start with a minimal set of 3 policies: Apply a simple configuration like Example 1 (error preservation + latency preservation + 5% random), then validate it with `otelcol validate --config=config.yaml`.
3. Build a sampling metrics dashboard: Visualize the metrics below with Prometheus + Grafana to confirm actual retention rates.
| Metric Name | Meaning |
|---|---|
| `otelcol_processor_tail_sampling_count_traces_sampled` | Number of traces preserved per policy |
| `otelcol_processor_tail_sampling_sampling_decision_timer_latency` | Time taken to evaluate policies |
| `otelcol_processor_tail_sampling_late_release_decisions` | Number of drops due to late spans |
Next article: We'll cover how to combine `filterprocessor` and `transformprocessor` in the OTel Collector pipeline to proactively eliminate unnecessary spans before ingestion, and how to control high-cardinality problems, where the number of unique combinations of span attribute values grows excessively, causing storage and query costs to spike.
References
- Tail Sampling Processor README | GitHub (open-telemetry/opentelemetry-collector-contrib)
- Sampling Concepts | OpenTelemetry Official Docs
- Tail Sampling with OpenTelemetry: Why it's useful | OpenTelemetry Blog
- OpenTelemetry Sampling Milestones 2025 | OpenTelemetry Blog
- Tail sampling | Grafana OpenTelemetry Docs
- Scale Alloy tail sampling | Grafana Docs
- Add tail sampling policies and strategies | Grafana Tempo Docs
- How Grafana Labs enables horizontally scalable tail sampling | Grafana Blog
- Scaling the Collector | OpenTelemetry Official Docs
- Getting Started with Advanced Sampling | AWS ADOT
- otelcol.processor.tail_sampling | Grafana Alloy Docs