OTel Collector Tail Sampling Memory Optimization: A Configuration Guide for `decision_wait` and `num_traces` to Prevent Production OOM
When operating high-traffic services, you may encounter the OpenTelemetry Collector suddenly restarting with an OOMKilled status. If your monitoring pipeline dies every 30 minutes, the culprit is most likely the parameter configuration of the tailsamplingprocessor. Tail Sampling is a powerful tool for precisely selecting error and latency traces, but when misconfigured, it becomes a memory bomb.
Unlike other optimization guides, this article uses two real production OOM cases and benchmark figures from Michal Drozd to demonstrate — with concrete calculations and configuration examples — that correctly combining decision_wait and num_traces alone can reduce memory usage by up to 75%. From 2-tier architecture configuration and memory_limiter placement as a safety net, to Prometheus alert rules — this is a configuration guide you can take and apply today.
Core Concepts
What Are OTel Collector and Tail Sampling
OpenTelemetry Collector is a vendor-neutral pipeline that collects, processes, and forwards traces, metrics, and logs generated by applications. In a microservices environment, instead of each service sending data directly to a backend (Jaeger, Tempo, etc.), the Collector handles batch processing in the middle.
There are two sampling approaches:
| Approach | Decision Point | Characteristics |
|---|---|---|
| Head Sampling | At trace start | Fast with no memory overhead. However, decisions must be made before knowing whether errors or latency occurred |
| Tail Sampling | After all spans are collected | Can accurately select error and latency traces. However, all spans must be kept in memory until a decision is made |
This is exactly why Tail Sampling consumes so much memory. All spans belonging to a trace must be retained in memory until the sampling decision is made.
Core Parameters of tailsamplingprocessor
Memory usage in tailsamplingprocessor is determined by four parameters:
| Parameter | Default | Role |
|---|---|---|
| `decision_wait` | 30s | Wait time from first span received to sampling decision |
| `num_traces` | 50,000 | Maximum number of traces held simultaneously in memory |
| `expected_new_traces_per_sec` | 0 | Expected number of new traces per second (used to pre-allocate internal data structures) |
| `maximum_trace_size_bytes` | unset | Maximum size of a single trace (traces exceeding it are dropped immediately) |
Memory Usage Calculation Formula
It's important to calculate required memory before changing settings:
```
Required Memory ≈ traces_per_second × decision_wait_seconds × avg_spans_per_trace × bytes_per_span
```

For example, in a 1,000 TPS environment with `decision_wait: 30s`, an average of 10 spans per trace, and 1KB per span:

```
1,000 × 30 × 10 × 1KB = 300MB (600MB with 2x safety margin)
```

Safety margin rule: It is recommended to set the container memory limit to 1.5–2x the calculated requirement. The Collector's own base overhead of approximately 200MB must also be included.
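To avoid redoing this arithmetic by hand, the formula can be wrapped in a small helper. This is a minimal sketch: the function name, the 2x default safety factor, and the 200MiB base-overhead default are illustrative choices, not part of the Collector.

```python
def required_memory_mib(tps: int, decision_wait_s: int,
                        avg_spans_per_trace: int, bytes_per_span: int,
                        base_overhead_mib: int = 200,
                        safety_factor: float = 2.0) -> float:
    """Estimate a container memory limit (MiB) for tail sampling.

    steady_state = bytes held in flight during the decision window.
    A 1.5-2x safety factor and the Collector's base overhead are added on top.
    """
    steady_state_bytes = tps * decision_wait_s * avg_spans_per_trace * bytes_per_span
    steady_state_mib = steady_state_bytes / (1024 * 1024)
    return steady_state_mib * safety_factor + base_overhead_mib

# The article's example: 1,000 TPS, 30s wait, 10 spans/trace, 1KB/span
print(required_memory_mib(1000, 30, 10, 1024))
```

Running the article's example through the helper yields roughly 786MiB as a suggested limit, which matches the "600MB with 2x safety margin" figure once base overhead is added.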
Why Horizontal Scaling Doesn't Work: The Role of loadbalancingexporter
tailsamplingprocessor maintains state in memory. All spans of the same trace must reach the same Collector instance for a correct sampling decision. Standard horizontal scaling (pod replication + round-robin load balancer) splits a single trace across multiple Collectors, breaking the sampling decision entirely.
loadbalancingexporter solves this problem.
| Item | Details |
|---|---|
| How it works | Routes traffic to the same backend Collector consistently, using trace ID as the hash key |
| Role | Distributes traffic in tier 1 (gateway) and ensures state consistency in tier 2 (sampling) |
| Caution | Does not account for backend load, so it is best to set the same memory across all sampling Collectors |
loadbalancingexporter: A prerequisite for safely using Tail Sampling in high-traffic environments. Without this exporter, horizontal scaling actually compromises data integrity.
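To make the trace-ID routing concrete, here is a minimal sketch: hashing the trace ID and reducing it modulo the backend count sends every span of a trace to the same sampling Collector. This is an illustration only; the real loadbalancingexporter uses a consistent-hashing ring so that changing the backend list reshuffles as few traces as possible, and the hostnames below are hypothetical.

```python
import hashlib

BACKENDS = ["sampling-collector-1:4317",
            "sampling-collector-2:4317",
            "sampling-collector-3:4317"]

def route(trace_id: str) -> str:
    """Pick a backend deterministically from the trace ID alone."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(BACKENDS)
    return BACKENDS[index]

# Every span of the same trace lands on the same sampling Collector,
# regardless of which gateway instance received it.
spans = [("4bf92f3577b34da6a3ce929d0e0e4736", "checkout"),
         ("4bf92f3577b34da6a3ce929d0e0e4736", "payment"),
         ("4bf92f3577b34da6a3ce929d0e0e4736", "inventory")]
targets = {route(trace_id) for trace_id, _ in spans}
print(len(targets))  # exactly one backend
```

Because the decision depends only on the trace ID, the gateway tier needs no shared state, which is what makes it freely scalable.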
Practical Application
Example 1: Periodic OOM Every 30 Minutes Caused by decision_wait 60s
Environment: Container memory 512MB, traffic 1,000 TPS
Plugging into the formula immediately reveals the problem:
```
1,000 TPS × 60s × 10 spans × 1KB = 600MB > 512MB (container limit)
```

The Collector repeatedly filled its memory within 30 minutes and cycled through OOMKilled restarts.
```yaml
processors:
  tail_sampling:
    decision_wait: 15s                 # Reduced from 60s → 15s (approximately 75% memory reduction)
    num_traces: 50000                  # 1,000 TPS × 15s = max 15,000 traces → 50,000 provides ample buffer
    expected_new_traces_per_sec: 1000
```

The reason `num_traces: 50000` was kept as-is is that, at 1,000 TPS × 15s, a maximum of 15,000 concurrent traces are needed, making 50,000 already a sufficient buffer. For higher-traffic environments, recalculating this value alongside the others is recommended.
| Before | After | Change |
|---|---|---|
| decision_wait: 60s | decision_wait: 15s | 4x reduction |
| Memory: ~600MB | Memory: ~150MB | 75% reduction |
Memory scales approximately linearly with `decision_wait`: halving the wait time roughly halves memory usage. The actual reduction may vary with GC delays and the span arrival distribution. For environments with heavy batch jobs or asynchronous processing, it is recommended to measure actual trace completion times before making adjustments.
Example 2: Exceeding num_traces Limit During Traffic Spikes
Environment: Default num_traces: 50,000, sudden 5x traffic increase
When the num_traces limit is exceeded, the oldest traces are dropped first. Since error traces are more likely to remain in the buffer longer due to processing delays, a paradoxical situation arises where the traces that should be preserved disappear first.
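This eviction behavior can be reproduced with a toy bounded buffer. The `TraceBuffer` class below is a hypothetical simplification of the processor's internal state, shown only to illustrate oldest-first eviction: the slow error trace that arrived first is exactly the one that gets dropped.

```python
from collections import OrderedDict

class TraceBuffer:
    """Bounded trace buffer that evicts the oldest trace when full."""
    def __init__(self, num_traces: int):
        self.num_traces = num_traces
        self.traces = OrderedDict()   # trace_id -> is_error, in arrival order

    def add(self, trace_id: str, is_error: bool) -> None:
        self.traces[trace_id] = is_error
        if len(self.traces) > self.num_traces:
            # Oldest-first eviction: an old error trace still waiting on its
            # slow spans is dropped before a fresh, healthy trace.
            self.traces.popitem(last=False)

buf = TraceBuffer(num_traces=2)
buf.add("err-1", is_error=True)    # slow error trace arrives first
buf.add("ok-1", is_error=False)
buf.add("ok-2", is_error=False)    # buffer full: "err-1" is evicted
print(list(buf.traces))  # ['ok-1', 'ok-2']
```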
Simply increasing num_traces raises the risk of OOM. The key is to place memory_limiter before tail_sampling to establish a safety net first.
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3600          # 90% of 4GB container
    spike_limit_mib: 800
  tail_sampling:
    decision_wait: 15s
    num_traces: 100000
    expected_new_traces_per_sec: 5000
    maximum_trace_size_bytes: 50000   # Immediately drop traces exceeding 50KB

service:
  pipelines:
    traces:
      processors: [memory_limiter, tail_sampling]   # Order matters
```

The `maximum_trace_size_bytes: 50000` setting prevents abnormal traces packed with debug logs from occupying hundreds of MB on their own. According to Michal Drozd's benchmarks, applying this setting in a 3,000 TPS environment reduced memory from 1.4GB to 890MB.
memory_limiter: A processor that protects the Collector itself from dying due to OOM by rejecting (backpressure) or dropping new data when memory usage reaches the specified limit. It must be positioned before `tail_sampling`.
Example 3: 2-Tier Architecture for High-Traffic Production
For high-traffic environments above 5,000 TPS, handling Tail Sampling with a single Collector reaches its limits. The pattern of separating tier 1 (gateway) and tier 2 (dedicated sampling) has become the current production standard.
The following is configuration for two separate Collector processes — not a single file.
```yaml
# File 1: gateway-collector.yaml (Tier 1 — Gateway Collector)
exporters:
  loadbalancing:
    routing_key: "traceID"   # trace ID-based routing ensures state consistency
    protocol:
      otlp:
        timeout: 1s
    resolver:
      static:
        hostnames:
          - sampling-collector-1:4317
          - sampling-collector-2:4317
          - sampling-collector-3:4317
```

```yaml
# File 2: sampling-collector.yaml (Tier 2 — Sampling Collector)
processors:
  tail_sampling:
    decision_wait: 15s
    num_traces: 200000
    expected_new_traces_per_sec: 3000
    maximum_trace_size_bytes: 100000
    policies:
      # Policies are evaluated with OR: a trace is preserved if any policy matches
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

In this structure, tier 1 is stateless and can be scaled horizontally without restriction, while only tier 2 is subject to vertical scaling (memory increase). It is recommended to set the same memory configuration across all sampling Collectors.
Monitoring and Proactive Response
It's important to receive warnings before an OOM occurs. Registering the following three Prometheus alert rules allows you to respond before issues escalate.
```yaml
# Prometheus alert rules
groups:
  - name: tail_sampling
    rules:
      - alert: TailSamplingMemoryHigh
        expr: otelcol_processor_tail_sampling_count_traces_on_memory > 250000
        for: 5m   # Filters out transient spikes and only fires on sustained excess
        annotations:
          summary: "Traces loaded in memory have exceeded the threshold"
      - alert: TailSamplingDropHigh
        expr: rate(otelcol_processor_tail_sampling_count_spans_dropped[5m]) > 100
        for: 5m
        annotations:
          summary: "Span drops are occurring"
      - alert: TailSamplingDroppedTooEarly
        expr: rate(otelcol_processor_tail_sampling_sampling_trace_dropped_too_early[5m]) > 10
        for: 5m
        annotations:
          summary: "decision_wait is too short or num_traces is insufficient"
```

| Metric | Meaning |
|---|---|
| `otelcol_processor_tail_sampling_count_traces_on_memory` | Current number of traces loaded in memory |
| `otelcol_processor_tail_sampling_sampling_trace_dropped_too_early` | Traces dropped before `decision_wait` elapsed |
| `otelcol_processor_tail_sampling_sampling_trace_removal_age` | How long traces are retained in the buffer |
| `otelcol_process_memory_rss` | Collector's actual RSS memory |
When the sampling_trace_removal_age value begins to approach decision_wait, it signals that drops are imminent.
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Precise sampling | Can probabilistically sample only normal traces while preserving 100% of error and latency traces |
| Predictable memory control | Shortening decision_wait → proportional memory reduction, calculable in advance with the formula |
| Cost reduction | Precisely eliminates unnecessary trace storage costs using policy-based rules |
| Large trace protection | maximum_trace_size_bytes prevents abnormal traces from breaking the entire pipeline |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| State preservation required | Sampling decisions cannot be made if spans are distributed across multiple Collectors | Use loadbalancingexporter for trace ID-based routing |
| No horizontal scaling | Cannot be solved by Collector replication alone | 2-tier separation; vertical scaling for tier 2 |
| Risk of long trace loss | Setting decision_wait too short causes missing batch/async spans | Measure actual trace completion time before configuring |
| Exceeding num_traces limit | Old traces (including errors) are dropped first | Pre-place memory_limiter + set appropriate values based on traffic |
Most Common Mistakes in Practice
- Placing `batch` before `tail_sampling` in the pipeline: Spans from the same trace get split across different batches, tangling sampling decisions. The correct order is `tail_sampling` → `batch`.
- Blindly increasing `num_traces` without `memory_limiter`: `num_traces` is directly proportional to memory. Increasing it without a safety net triggers OOM faster. It is strongly recommended to place `memory_limiter` first, then make adjustments.
- Leaving `expected_new_traces_per_sec` at 0: When this value is 0, pre-allocation of the internal Go map structure is skipped. Dynamic reallocation occurs every time traces increase, raising GC pressure. Setting it to match the actual TPS keeps memory allocation patterns significantly more stable.
Closing Thoughts
Calculating required memory with one formula (TPS × decision_wait × avg_spans × bytes_per_span), building a safety net with memory_limiter, and establishing a proactive response system with alerts — these are the three core pillars of running Tail Sampling without OOM.
Here are 3 steps you can start right now:
1. Check your current metrics: View `otelcol_processor_tail_sampling_count_traces_on_memory` and `otelcol_process_memory_rss` in Grafana, and use the `TPS × decision_wait_seconds × avg_spans × bytes_per_span` formula to compare theoretical versus actual values.
2. Lower `decision_wait` to 15s and place `memory_limiter` first in the pipeline: Most synchronous request chains are fine with 15s. Set `limit_mib` to 85–90% of container memory.
3. Register the `TailSamplingDroppedTooEarly` alert: When this alert fires, it signals that `decision_wait` is too short or `num_traces` is insufficient. Adjust incrementally while watching metric trends.
Next Article
`loadbalancingexporter` Deep Dive: Covers 2-tier architecture operational strategies for resolving traffic skew and evenly distributing load across sampling Collectors. Includes DNS resolver configuration and health check integration.
References
- Tail Sampling Processor README | opentelemetry-collector-contrib — Official reference for all parameters including `decision_wait` and `num_traces`
- Tail-Based Sampling in OpenTelemetry: Sizing, Memory Crashes and Cost Model | Michal Drozd — Source of the benchmark figures used in this article; includes memory profiling results per configuration
- Sampling | OpenTelemetry Official Documentation — Recommended for a deeper understanding of Head Sampling vs Tail Sampling concepts
- Tail Sampling with OpenTelemetry: Why it's useful | OpenTelemetry Blog — Background and use cases for adopting Tail Sampling
- Load Balancing Exporter README | opentelemetry-collector-contrib — Recommended for a deeper understanding of the 2-tier architecture
- Scaling the Collector | OpenTelemetry — Official guide for horizontal and vertical scaling strategies
- Mastering the OpenTelemetry Memory Limiter Processor | Dash0 — Detailed explanation of `memory_limiter` parameters
- otelcol.processor.tail_sampling | Grafana Alloy Documentation — Configuration guide for Grafana Alloy environments
- Gateway deployment pattern | OpenTelemetry — Official architecture guide for the gateway pattern