OpenTelemetry Tail Sampling Deep Dive: Composite Policy Design and Memory Optimization with decision_wait
This article is aimed at DevOps engineers already operating an OTel Collector. If you're using head-based sampling, there's one uncomfortable truth you need to face first. Even if your Collector is quietly dropping traces in production, it's hard to notice — and by the time you do, that day's error traces are already gone. Tail Sampling fundamentally solves this problem, but misconfiguration brings a different kind of trouble. Leaving decision_wait: 30s as-is in a 1,000 TPS environment pushes memory to 600MB, which is a precursor to OOM Kill.
By the end of this article, you'll be able to review a production configuration covering composite policy design, decision_wait tuning, and 2-tier architecture setup in under 30 minutes. Based on the tailsamplingprocessor from the OpenTelemetry Collector contrib package, we'll walk through everything step by step — from policy type characteristics to memory estimation formulas and monitoring metrics.
Core Concepts
Head vs. Tail Sampling — What's the Difference?
Head-based sampling decides whether to sample at the moment the first span is created. It's fast and stateless, making horizontal scaling easy — but it must make decisions without knowing whether the trace contains errors or slow responses.
Tail Sampling makes decisions after all spans have arrived at the Collector. To do this, the entire trace must be held in memory for a period of time, which is why memory management becomes the central challenge.
| Item | Head Sampling | Tail Sampling |
|---|---|---|
| Decision point | At trace start | After all spans are collected |
| Error-based decision | Not possible | Possible |
| Latency-based decision | Not possible | Possible |
| Memory requirement | Low | High |
| Horizontal scaling | Easy | Stateful — requires separate design |
What does stateful mean? It means all spans from the same trace must be gathered at the same Collector instance for a correct sampling decision to be made. Simply increasing the number of replicas is not enough for horizontal scaling.
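To make the constraint concrete, here is a minimal sketch of trace-id-based routing. It is an illustration of the idea, not the actual hashing algorithm the loadbalancing exporter uses; the function name is mine:

```python
import hashlib

def route_trace(trace_id: str, num_collectors: int) -> int:
    """Map a trace_id deterministically to a collector index, so every
    span carrying the same trace_id lands on the same instance."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_collectors

# Two spans of the same trace always route to the same collector:
idx_a = route_trace("4bf92f3577b34da6a3ce929d0e0e4736", 4)
idx_b = route_trace("4bf92f3577b34da6a3ce929d0e0e4736", 4)
assert idx_a == idx_b
```

A plain round-robin load balancer provides no such guarantee, which is why naive replica scaling breaks Tail Sampling.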
Understanding Core Parameters and Estimating Memory
There are three core parameters that control the Tail Sampling Processor.
```yaml
processors:
  tail_sampling:
    decision_wait: 10s                  # Wait time from first span arrival to decision
    num_traces: 100000                  # Maximum number of traces to keep in memory
    expected_new_traces_per_sec: 1000   # For internal map pre-allocation
```

What is `decision_wait`? It is the time to wait from the arrival of the first span until a sampling decision is made. All spans for that trace are expected to arrive within this window. Longer values increase accuracy; shorter values save memory.
`expected_new_traces_per_sec` caution: if this value is set too low relative to the actual incoming rate, the internal map is resized at runtime, which can cause momentary latency spikes. Set it generously, at 120–150% of your actual TPS.
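The sizing rule above can be written as a one-line helper. This is a sketch of my own (the function name and `headroom` parameter are not part of the Collector), not an official formula:

```python
import math

def expected_new_traces_per_sec(measured_tps: float, headroom: float = 1.2) -> int:
    """Size the parameter with 120-150% headroom over measured TPS so the
    internal trace map is pre-allocated large enough to avoid runtime resizes."""
    return math.ceil(measured_tps * headroom)

# 1,000 measured TPS with 20% headroom gives the 1200 used in the examples below
assert expected_new_traces_per_sec(1000) == 1200
```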
It's good practice to estimate memory before going to production.
Required memory ≈ traces_per_second × decision_wait(s) × avg_spans_per_trace × avg_span_size(bytes)

With 1,000 traces per second, a 30s wait, an average of 10 spans per trace, and 2KB per span:

1,000 TPS × 30s × 10 spans × 2KB = 600MB

Under the same conditions, reducing `decision_wait` to 10s brings this down to 200MB, one third of the original. This is why shortening `decision_wait` is the most effective memory reduction strategy. In the calculation above, `num_traces: 100000` accommodates 1,000 TPS × 100s, but with `decision_wait: 10s`, 30,000 is sufficient. Excess `num_traces` wastes memory unnecessarily, so adjust this value after doing the calculation.
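The estimate above is easy to script as a pre-production check. This is a minimal sketch (helper names are mine, and it uses decimal units, 1MB = 1,000KB, to match the article's round numbers):

```python
def estimate_memory_mb(tps: int, decision_wait_s: int, avg_spans: int, avg_span_kb: int) -> float:
    """Back-of-envelope estimate: traces/sec x wait x spans/trace x span size."""
    return tps * decision_wait_s * avg_spans * avg_span_kb / 1000

def required_num_traces(tps: int, decision_wait_s: int, buffer: float = 3.0) -> int:
    """num_traces must at least cover TPS x decision_wait; buffer=3.0 mirrors
    the 30,000 used in the example configs for 1,000 TPS x 10s."""
    return int(tps * decision_wait_s * buffer)

assert estimate_memory_mb(1000, 30, 10, 2) == 600.0   # the 30s scenario above
assert estimate_memory_mb(1000, 10, 10, 2) == 200.0   # 10s cuts it to one third
assert required_num_traces(1000, 10) == 30000          # matches num_traces: 30000
```

Run this with your own measured span counts and sizes before committing to a `decision_wait` value.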
decision_wait starting point formula: service p99 response time × 3
| Situation | Recommended decision_wait |
|---|---|
| General web API (p99 < 1s) | 5s – 10s |
| Mixed workload including batch processing | 15s – 20s |
| Legacy systems (p99 > 5s) | 30s (keep default) |
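One way to encode the rule of thumb and the table's range in a helper (the clamping bounds are my reading of the table, not an official formula):

```python
def initial_decision_wait_s(p99_s: float) -> int:
    """Starting point: p99 x 3, clamped to the 5s-30s span of the table above."""
    return min(30, max(5, round(p99_s * 3)))

assert initial_decision_wait_s(0.8) == 5    # fast web API: lower bound applies
assert initial_decision_wait_s(3.0) == 9    # mid-range workload: p99 x 3
assert initial_decision_wait_s(15.0) == 30  # legacy system: capped at the 30s default
```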
Using GOMEMLIMIT and memory_limiter Together
Using two mechanisms together for memory protection is the current standard pattern.
- `memory_limiter` processor: drops spans when memory exceeds the limit to prevent OOM. It must be placed before `tail_sampling` in the pipeline for backpressure to work correctly.
- `GOMEMLIMIT` environment variable: a soft memory limit for the Go runtime. Setting it to 80–90% of the container memory limit causes the GC to aggressively reclaim memory before an OOM Kill occurs.
```yaml
# Environment variable configuration example (Kubernetes)
env:
  - name: GOMEMLIMIT
    value: "1800MiB"   # When container limit is 2GiB
```

Policy Types at a Glance
| Policy Type | Description | Typical Use Case |
|---|---|---|
| `status_code` | Based on ERROR / UNSET / OK | Preserve all error traces |
| `latency` | Whether `threshold_ms` is exceeded | Capture slow requests |
| `string_attribute` | String attribute values (regex supported) | Filter specific services or paths |
| `numeric_attribute` | Numeric attribute range | Specific user_id ranges, etc. |
| `probabilistic` | Probabilistic ratio | Preserve only a portion of normal traffic |
| `and` | AND combination of multiple policies | Cross-condition filtering |
| `composite` | Rate allocation per policy | Traffic budget distribution |
| `always_sample` | All traces pass through | Fallback within composite |
Practical Application
Example 1: Combining Policies (Error + Latency + Probabilistic Sampling)
This is the most common configuration. Important traces (errors, slow requests) are preserved in full, while only 5% of the rest are sampled. Below is a complete example including `service.pipelines`.
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1800
    spike_limit_mib: 400
  tail_sampling:
    decision_wait: 10s
    num_traces: 30000                  # 1,000 TPS × 10s = 10,000 + buffer
    expected_new_traces_per_sec: 1200
    policies:
      # Preserve all error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Preserve slow traces exceeding 500ms
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 500
      # Sample only 5% of the rest
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/backend]
```

Policy evaluation behavior: every policy is evaluated, and if any one returns `SAMPLE`, the trace is preserved. Policy order does not change the outcome itself, but placing the error policy first communicates intent clearly.
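The combination rule can be sketched in a few lines. This is a deliberately simplified model (real policies also produce inverted decisions, which this ignores; the function names are mine):

```python
# A policy here is a function over a trace's spans returning True (SAMPLE) or False.
def has_error(spans):
    return any(s.get("status") == "ERROR" for s in spans)

def is_slow(spans, threshold_ms=500):
    return max(s["duration_ms"] for s in spans) > threshold_ms

def decide(spans, policies):
    """The trace is kept if ANY policy returns SAMPLE."""
    return any(policy(spans) for policy in policies)

trace = [{"status": "OK", "duration_ms": 120},
         {"status": "ERROR", "duration_ms": 40}]
assert decide(trace, [has_error, is_slow])                                # error wins
assert not decide([{"status": "OK", "duration_ms": 100}], [has_error, is_slow])
```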
| Policy | Role | Notes |
|---|---|---|
| `errors-policy` | Guarantees error traces | Place first to express intent clearly |
| `slow-traces-policy` | Captures performance issues | Set `threshold_ms` based on p99 |
| `probabilistic-policy` | Cost reduction | Evaluated independently, regardless of errors or slow requests |
Example 2: AND Policy — Capture Only Slow Traces from the Payment Service
When you want to target only slow requests from a specific service, use the `and` policy.
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 30000
    expected_new_traces_per_sec: 1200
    policies:
      - name: slow-payment-traces
        type: and
        and:
          and_sub_policy:
            - name: latency-check
              type: latency
              latency:
                threshold_ms: 300
            - name: service-check
              type: string_attribute
              string_attribute:
                key: service.name
                values: [payment-service]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/backend]
```

The `and` policy finalizes sampling only when all sub-policies return `SAMPLE`. The example above preserves only traces that are both from `payment-service` and exceed 300ms.
Example 3: Composite Policy — Allocating Spans Per Second Like a Budget
If you started with the basic configuration from Example 1 but need more precise control over backend costs, you can expand to a composite policy. Use `max_total_spans_per_second` to cap total throughput and assign a ratio to each policy.
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 30000
    expected_new_traces_per_sec: 1200
    policies:
      - name: composite-policy
        type: composite
        composite:
          max_total_spans_per_second: 1000
          policy_order: [errors-policy, slow-traces-policy, probabilistic-policy]
          composite_sub_policy:
            - name: errors-policy
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: slow-traces-policy
              type: latency
              latency:
                threshold_ms: 500
            - name: probabilistic-policy
              type: always_sample
          rate_allocation:
            - policy: errors-policy
              percent: 50   # Up to 500 spans/sec allocated to error traces
            - policy: slow-traces-policy
              percent: 30   # Up to 300 spans/sec allocated to slow traces
            # Caution: the remaining 20% (200 spans/sec) is NOT given to
            # probabilistic-policy, because budget for policies not listed
            # in rate_allocation is discarded, not redistributed.
            # Explicitly declare all sub-policies.
```

`rate_allocation` caution: even if the percentages don't add up to 100%, the remaining budget is not automatically distributed to other policies. Budget for policies not explicitly listed is discarded. Declare all sub-policies explicitly in `rate_allocation`.
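The budget arithmetic is worth seeing in numbers. A sketch of the allocation rule as described above (the function name is mine, not part of the processor):

```python
def allocate_budget(max_total_spans_per_second: int, rate_allocation: dict) -> dict:
    """Each policy listed in rate_allocation gets percent/100 of the total
    span budget; unlisted policies get nothing, and the leftover budget is
    discarded rather than redistributed."""
    return {policy: max_total_spans_per_second * pct // 100
            for policy, pct in rate_allocation.items()}

budget = allocate_budget(1000, {"errors-policy": 50, "slow-traces-policy": 30})
assert budget == {"errors-policy": 500, "slow-traces-policy": 300}
leftover = 1000 - sum(budget.values())
assert leftover == 200   # lost unless probabilistic-policy is declared explicitly
```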
Example 4: Excluding Health Check Paths (invert_match)
You can exclude high-frequency noise requests like /health and /readyz from sampling.
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 30000
    expected_new_traces_per_sec: 1200
    policies:
      # Mark health check paths as NOT_SAMPLED
      - name: exclude-health-check
        type: string_attribute
        string_attribute:
          key: http.target
          values: ["/health", "/readyz", "/metrics"]
          invert_match: true
      # Sample 5% of remaining traces
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

How `invert_match: true` works: this policy marks traces matching the specified values as `NOT_SAMPLED`. Traces that do not match are not decided by this policy and pass on to the next one. An `invert_match` policy alone does not mean "sample everything else," so always configure it alongside other sampling policies.
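The two-outcome behavior described above can be sketched as follows. This is a simplified model (the real processor's decision set is richer, and the function name is mine):

```python
NOT_SAMPLED = "NOT_SAMPLED"   # matching traces are excluded
PENDING = "PENDING"           # left for the next policy to decide

def exclude_matching(attrs: dict, key: str, values: list) -> str:
    """invert_match: true -- drop matches, leave everything else undecided."""
    if attrs.get(key) in values:
        return NOT_SAMPLED
    return PENDING

noise = ["/health", "/readyz", "/metrics"]
assert exclude_matching({"http.target": "/health"}, "http.target", noise) == NOT_SAMPLED
assert exclude_matching({"http.target": "/api/orders"}, "http.target", noise) == PENDING
```

Note that `PENDING` is not `SAMPLED`: without a follow-up policy such as `probabilistic`, the remaining traffic has no policy voting to keep it.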
Example 5: 2-Tier Architecture — Essential Setup for Horizontal Scaling
Because Tail Sampling is stateful, simple horizontal scaling is not possible. If spans from the same trace are distributed across different Collector instances, correct sampling decisions cannot be made. The officially recommended architecture to solve this problem is the 2-tier configuration.
```yaml
# Tier 1: Load balancer Collector - trace_id hash-based routing
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # For development/testing. Production requires TLS certificate configuration.
    resolver:
      k8s:
        service: otel-collector-tailsampling-headless

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

Tier 2 is the Tail Sampling Collector, which applies the `tail_sampling` processor from the examples above:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1800
    spike_limit_mib: 400
  tail_sampling:
    decision_wait: 10s
    num_traces: 30000
    expected_new_traces_per_sec: 1200
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 500
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/backend]
```

What is the `loadbalancing` exporter? It uses the trace_id as a hash key so that all spans of the same trace are always routed to the same Collector instance. It is the core component of the 2-tier Tail Sampling architecture: horizontally scaling Tail Sampling Collectors without this layer results in incorrect sampling decisions.
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Decisions based on complete information | Judgment after all spans are collected — errors and latency can be captured accurately |
| Cost reduction | 80–95% reduction in traces sent to the backend is achievable |
| Flexible policy combinations | AND / Composite can express complex business rules |
| Noise removal | Health checks and high-frequency normal requests can be easily excluded |
Disadvantages and Caveats
| Item | Description | Mitigation |
|---|---|---|
| Memory pressure | Traces reside in memory proportional to `decision_wait` × TPS | Shorten `decision_wait`; use `memory_limiter` + `GOMEMLIMIT` together |
| Drop on `num_traces` exceeded | When the limit is exceeded, older traces are dropped and incomplete traces may be sent | Monitor the `sampling_trace_dropped_too_early_count` metric |
| Stateful constraint | Simple horizontal scaling is not possible | A preceding `loadbalancing` exporter layer is required |
| Risk of missing spans | Spans that don't arrive within `decision_wait` are excluded from the decision | Set generously based on p99 |
Essential Monitoring Metrics
```
# If this metric increases, you need to increase num_traces or shorten decision_wait
otelcol_processor_tail_sampling_sampling_trace_dropped_too_early_count

# Policy evaluation errors - needed when diagnosing policy configuration issues
otelcol_processor_tail_sampling_sampling_policy_evaluation_error_count

# Actual number of sampled traces - for ratio verification
otelcol_processor_tail_sampling_count_traces_sampled
```

Most Common Mistakes in Practice
- Placing `memory_limiter` after `tail_sampling`: `memory_limiter` must come before `tail_sampling` in the pipeline for backpressure to work correctly.
- Horizontally scaling Tail Sampling Collectors without a 2-tier architecture: simply increasing the number of replicas without the `loadbalancing` exporter distributes spans from the same trace across instances, leading to incorrect sampling decisions.
- Using an `invert_match` policy in isolation: `invert_match: true` only excludes matching traces; it is not an instruction to sample the rest. Always configure it alongside other sampling policies.
Closing Thoughts
Before adjusting your head sampling ratio, the right order of operations is to first calculate the decision_wait and memory cost for Tail Sampling. Accurately selecting important traces is more valuable than simply lowering the sampling ratio, and shortening decision_wait is where that starts.
Three steps you can take right now:
- Measure your service's p99 response time, calculate an initial value with `decision_wait = p99 × 3`, and estimate required memory with `TPS × decision_wait × avg_spans × avg_span_size`.
- Apply the basic three-policy configuration in the order `errors-policy → slow-traces-policy → probabilistic-policy`, and place `memory_limiter` at the front of the pipeline.
- Add the `otelcol_processor_tail_sampling_sampling_trace_dropped_too_early_count` metric to a Prometheus dashboard. If this value increases, consider raising `num_traces` or shortening `decision_wait`.
Next article: Designing an autoscaling architecture that integrates Tail Sampling in a Kubernetes environment with KEDA to automatically respond to traffic spikes
References
- Tail Sampling Processor Official README | GitHub — When you need the full policy parameter reference
- OpenTelemetry Official Sampling Concepts Documentation | OpenTelemetry — When encountering the concept of Head vs. Tail Sampling for the first time
- Tail Sampling with OpenTelemetry: Why it's useful | OpenTelemetry Blog — When you need background on the adoption decision
- Tail-Based Sampling: Sizing, Memory Crashes and Cost Model | Michal Drozd — When you want to go deep into memory OOM cases and cost models
- How to Fix the Collector Memory Leak Caused by Tail Sampling Processor | OneUptime — When you need to troubleshoot an actual memory leak
- How to Right-Size CPU and Memory for the OpenTelemetry Collector | OneUptime — When covering Collector resource sizing in general
- Scale Alloy tail sampling | Grafana OpenTelemetry Docs — When implementing the same configuration in a Grafana Alloy environment
- otelcol.processor.tail_sampling | Grafana Alloy Official Documentation — Alloy component parameter reference
- Scaling the Collector | OpenTelemetry Official Documentation — When verifying 2-tier architecture design principles from official documentation
- Traces at Scale: Head or Tail? Sampling Strategies | DEV Community — When taking a broader look at the context of sampling strategy selection