OTel Collector Tail Sampling Memory Optimization: A Configuration Guide for `decision_wait` and `num_traces` to Prevent Production OOM
When operating high-traffic services, you may encounter the OpenTelemetry Collector suddenly restarting with an OOMKilled status. If your monitoring pipeline dies every 30 minutes, the culprit is most likely the parameter configuration of the tailsamplingprocessor. Tail Sampling is a powerful tool for precisely selecting error and latency traces, but when misconfigured, it becomes a memory bomb.
Unlike other optimization guides, this article uses two real production OOM cases and benchmark figures from Michal Drozd to demonstrate — with concrete calculations and configuration examples — that correctly combining decision_wait and num_traces alone can reduce memory usage by up to 75%. From 2-tier architecture configuration and memory_limiter placement as a safety net, to Prometheus alert rules — this is a configuration guide you can take and apply today.
Core Concepts
What Are OTel Collector and Tail Sampling
OpenTelemetry Collector is a vendor-neutral pipeline that collects, processes, and forwards traces, metrics, and logs generated by applications. In a microservices environment, instead of each service sending data directly to a backend (Jaeger, Tempo, etc.), the Collector handles batch processing in the middle.
There are two sampling approaches:
| Approach | Decision Point | Characteristics |
|---|---|---|
| Head Sampling | At trace start | Fast with no memory overhead. However, decisions must be made before knowing whether errors or latency occurred |
| Tail Sampling | After all spans are collected | Can accurately select error and latency traces. However, all spans must be kept in memory until a decision is made |
This is exactly why Tail Sampling consumes so much memory. All spans belonging to a trace must be retained in memory until the sampling decision is made.
Core Parameters of tailsamplingprocessor
Memory usage in tailsamplingprocessor is determined by four parameters:
| Parameter | Default | Role |
|---|---|---|
| `decision_wait` | 30s | Wait time from first span received to sampling decision |
| `num_traces` | 50,000 | Maximum number of traces held simultaneously in memory |
| `expected_new_traces_per_sec` | 0 | Expected number of new traces per second (used to pre-allocate internal data structures) |
| `maximum_trace_size_bytes` | unset | Maximum size of a single trace (traces exceeding it are dropped immediately) |
Memory Usage Calculation Formula
It's important to calculate required memory before changing settings:
```
Required Memory ≈ traces_per_second × decision_wait_seconds × avg_spans_per_trace × bytes_per_span
```

For example, in a 1,000 TPS environment with `decision_wait: 30s`, an average of 10 spans per trace, and 1KB per span:

```
1,000 × 30 × 10 × 1KB = 300MB (600MB with 2x safety margin)
```

Safety margin rule: It is recommended to set the container memory limit to 1.5–2x the calculated requirement. The Collector's own base overhead of approximately 200MB must also be included.
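To avoid redoing this arithmetic by hand, the formula can be wrapped in a small helper. This is a minimal sketch: the function name, the 2x default safety factor, and the 200MiB base-overhead default are illustrative choices, not part of the Collector.

```python
def required_memory_mib(tps: int, decision_wait_s: int,
                        avg_spans_per_trace: int, bytes_per_span: int,
                        base_overhead_mib: int = 200,
                        safety_factor: float = 2.0) -> float:
    """Estimate a container memory limit (MiB) for tail sampling.

    steady_state = bytes held in flight during the decision window.
    A 1.5-2x safety factor and the Collector's base overhead are added on top.
    """
    steady_state_bytes = tps * decision_wait_s * avg_spans_per_trace * bytes_per_span
    steady_state_mib = steady_state_bytes / (1024 * 1024)
    return steady_state_mib * safety_factor + base_overhead_mib

# The article's example: 1,000 TPS, 30s wait, 10 spans/trace, 1KB/span
print(required_memory_mib(1000, 30, 10, 1024))
```

Running the article's example through the helper yields roughly 786MiB as a suggested limit, which matches the "600MB with 2x safety margin" figure once base overhead is added.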
Why Horizontal Scaling Doesn't Work: The Role of loadbalancingexporter
tailsamplingprocessor maintains state in memory. All spans of the same trace must reach the same Collector instance for a correct sampling decision. Standard horizontal scaling (pod replication + round-robin load balancer) splits a single trace across multiple Collectors, breaking the sampling decision entirely.
loadbalancingexporter solves this problem.
| Item | Details |
|---|---|
| How it works | Routes traffic to the same backend Collector consistently, using trace ID as the hash key |
| Role | Distributes traffic in tier 1 (gateway) and ensures state consistency in tier 2 (sampling) |
| Caution | Does not account for backend load, so it is best to set the same memory across all sampling Collectors |
loadbalancingexporter: A prerequisite for safely using Tail Sampling in high-traffic environments. Without this exporter, horizontal scaling actually compromises data integrity.
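To make the trace-ID routing concrete, here is a minimal sketch: hashing the trace ID and reducing it modulo the backend count sends every span of a trace to the same sampling Collector. This is an illustration only; the real loadbalancingexporter uses a consistent-hashing ring so that changing the backend list reshuffles as few traces as possible, and the hostnames below are hypothetical.

```python
import hashlib

BACKENDS = ["sampling-collector-1:4317",
            "sampling-collector-2:4317",
            "sampling-collector-3:4317"]

def route(trace_id: str) -> str:
    """Pick a backend deterministically from the trace ID alone."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(BACKENDS)
    return BACKENDS[index]

# Every span of the same trace lands on the same sampling Collector,
# regardless of which gateway instance received it.
spans = [("4bf92f3577b34da6a3ce929d0e0e4736", "checkout"),
         ("4bf92f3577b34da6a3ce929d0e0e4736", "payment"),
         ("4bf92f3577b34da6a3ce929d0e0e4736", "inventory")]
targets = {route(trace_id) for trace_id, _ in spans}
print(len(targets))  # exactly one backend
```

Because the decision depends only on the trace ID, the gateway tier needs no shared state, which is what makes it freely scalable.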
Practical Application
Example 1: Periodic OOM Every 30 Minutes Caused by decision_wait 60s
Environment: Container memory 512MB, traffic 1,000 TPS
Plugging into the formula immediately reveals the problem:
```
1,000 TPS × 60s × 10 spans × 1KB = 600MB > 512MB (container limit)
```

The Collector repeatedly filled its memory within 30 minutes and cycled through OOMKilled restarts.
```yaml
processors:
  tail_sampling:
    decision_wait: 15s                 # Reduced from 60s → 15s (approximately 75% memory reduction)
    num_traces: 50000                  # 1,000 TPS × 15s = max 15,000 traces → 50,000 provides ample buffer
    expected_new_traces_per_sec: 1000
```

The reason `num_traces: 50000` was kept as-is is that, at 1,000 TPS × 15s, a maximum of 15,000 concurrent traces are needed, making 50,000 already a sufficient buffer. For higher-traffic environments, recalculating this value alongside the others is recommended.
| Before | After | Change |
|---|---|---|
| decision_wait: 60s | decision_wait: 15s | 4x reduction |
| Memory: ~600MB | Memory: ~150MB | 75% reduction |
Memory scales approximately linearly with `decision_wait`: halving the wait time roughly halves memory usage. The actual reduction may vary with GC delays and the span arrival distribution. For environments with heavy batch jobs or asynchronous processing, it is recommended to measure actual trace completion times before making adjustments.
Example 2: Exceeding num_traces Limit During Traffic Spikes
Environment: Default num_traces: 50,000, sudden 5x traffic increase
When the num_traces limit is exceeded, the oldest traces are dropped first. Since error traces are more likely to remain in the buffer longer due to processing delays, a paradoxical situation arises where the traces that should be preserved disappear first.
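This eviction behavior can be reproduced with a toy bounded buffer. The `TraceBuffer` class below is a hypothetical simplification of the processor's internal state, shown only to illustrate oldest-first eviction: the slow error trace that arrived first is exactly the one that gets dropped.

```python
from collections import OrderedDict

class TraceBuffer:
    """Bounded trace buffer that evicts the oldest trace when full."""
    def __init__(self, num_traces: int):
        self.num_traces = num_traces
        self.traces = OrderedDict()   # trace_id -> is_error, in arrival order

    def add(self, trace_id: str, is_error: bool) -> None:
        self.traces[trace_id] = is_error
        if len(self.traces) > self.num_traces:
            # Oldest-first eviction: an old error trace still waiting on its
            # slow spans is dropped before a fresh, healthy trace.
            self.traces.popitem(last=False)

buf = TraceBuffer(num_traces=2)
buf.add("err-1", is_error=True)    # slow error trace arrives first
buf.add("ok-1", is_error=False)
buf.add("ok-2", is_error=False)    # buffer full: "err-1" is evicted
print(list(buf.traces))  # ['ok-1', 'ok-2']
```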
Simply increasing num_traces raises the risk of OOM. The key is to place memory_limiter before tail_sampling to establish a safety net first.
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3600          # 90% of 4GB container
    spike_limit_mib: 800
  tail_sampling:
    decision_wait: 15s
    num_traces: 100000
    expected_new_traces_per_sec: 5000
    maximum_trace_size_bytes: 50000   # Immediately drop traces exceeding 50KB

service:
  pipelines:
    traces:
      processors: [memory_limiter, tail_sampling]   # Order matters
```

The `maximum_trace_size_bytes: 50000` setting prevents abnormal traces packed with debug logs from occupying hundreds of MB on their own. According to Michal Drozd's benchmarks, applying this setting in a 3,000 TPS environment reduced memory from 1.4GB to 890MB.
memory_limiter: A processor that protects the Collector itself from dying due to OOM by rejecting (backpressure) or dropping new data when memory usage reaches the specified limit. It must be positioned before `tail_sampling`.
Example 3: 2-Tier Architecture for High-Traffic Production
For high-traffic environments above 5,000 TPS, handling Tail Sampling with a single Collector reaches its limits. The pattern of separating tier 1 (gateway) and tier 2 (dedicated sampling) has become the current production standard.
The following is configuration for two separate Collector processes — not a single file.
```yaml
# File 1: gateway-collector.yaml (Tier 1 — Gateway Collector)
exporters:
  loadbalancing:
    routing_key: "traceID"   # trace ID-based routing ensures state consistency
    protocol:
      otlp:
        timeout: 1s
    resolver:
      static:
        hostnames:
          - sampling-collector-1:4317
          - sampling-collector-2:4317
          - sampling-collector-3:4317
```

```yaml
# File 2: sampling-collector.yaml (Tier 2 — Sampling Collector)
processors:
  tail_sampling:
    decision_wait: 15s
    num_traces: 200000
    expected_new_traces_per_sec: 3000
    maximum_trace_size_bytes: 100000
    policies:
      # Policies are evaluated with OR: a trace is preserved if any policy matches
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

In this structure, tier 1 is stateless and can be scaled horizontally without restriction, while only tier 2 is subject to vertical scaling (memory increase). It is recommended to set the same memory configuration across all sampling Collectors.
Monitoring and Proactive Response
It's important to receive warnings before an OOM occurs. Registering the following three Prometheus alert rules allows you to respond before issues escalate.
```yaml
# Prometheus alert rules
groups:
  - name: tail_sampling
    rules:
      - alert: TailSamplingMemoryHigh
        expr: otelcol_processor_tail_sampling_count_traces_on_memory > 250000
        for: 5m   # Filters out transient spikes and only fires on sustained excess
        annotations:
          summary: "Traces loaded in memory have exceeded the threshold"
      - alert: TailSamplingDropHigh
        expr: rate(otelcol_processor_tail_sampling_count_spans_dropped[5m]) > 100
        for: 5m
        annotations:
          summary: "Span drops are occurring"
      - alert: TailSamplingDroppedTooEarly
        expr: rate(otelcol_processor_tail_sampling_sampling_trace_dropped_too_early[5m]) > 10
        for: 5m
        annotations:
          summary: "decision_wait is too short or num_traces is insufficient"
```

| Metric | Meaning |
|---|---|
| `otelcol_processor_tail_sampling_count_traces_on_memory` | Current number of traces loaded in memory |
| `otelcol_processor_tail_sampling_sampling_trace_dropped_too_early` | Traces dropped before `decision_wait` elapsed |
| `otelcol_processor_tail_sampling_sampling_trace_removal_age` | How long traces are retained in the buffer |
| `otelcol_process_memory_rss` | Collector's actual RSS memory |
When the sampling_trace_removal_age value begins to approach decision_wait, it signals that drops are imminent.
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Precise sampling | Can probabilistically sample only normal traces while preserving 100% of error and latency traces |
| Predictable memory control | Shortening decision_wait → proportional memory reduction, calculable in advance with the formula |
| Cost reduction | Precisely eliminates unnecessary trace storage costs using policy-based rules |
| Large trace protection | maximum_trace_size_bytes prevents abnormal traces from breaking the entire pipeline |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| State preservation required | Sampling decisions cannot be made if spans are distributed across multiple Collectors | Use loadbalancingexporter for trace ID-based routing |
| No horizontal scaling | Cannot be solved by Collector replication alone | 2-tier separation; vertical scaling for tier 2 |
| Risk of long trace loss | Setting decision_wait too short causes missing batch/async spans | Measure actual trace completion time before configuring |
| Exceeding num_traces limit | Old traces (including errors) are dropped first | Pre-place memory_limiter + set appropriate values based on traffic |
Most Common Mistakes in Practice
- Placing `batch` before `tail_sampling` in the pipeline: Spans from the same trace get split across different batches, tangling sampling decisions. The correct order is `tail_sampling` → `batch`.
- Blindly increasing `num_traces` without `memory_limiter`: `num_traces` is directly proportional to memory. Increasing it without a safety net triggers OOM faster. It is strongly recommended to place `memory_limiter` first, then make adjustments.
- Leaving `expected_new_traces_per_sec` at 0: When this value is 0, pre-allocation of the internal Go map structure is skipped. Dynamic reallocation occurs every time traces increase, raising GC pressure. Setting it to match the actual TPS keeps memory allocation patterns significantly more stable.
Closing Thoughts
Calculating required memory with one formula (TPS × decision_wait × avg_spans × bytes_per_span), building a safety net with memory_limiter, and establishing a proactive response system with alerts — these are the three core pillars of running Tail Sampling without OOM.
Here are 3 steps you can start right now:
1. Check your current metrics: View `otelcol_processor_tail_sampling_count_traces_on_memory` and `otelcol_process_memory_rss` in Grafana, and use the `TPS × decision_wait_seconds × avg_spans × bytes_per_span` formula to compare theoretical versus actual values.
2. Lower `decision_wait` to 15s and place `memory_limiter` first in the pipeline: Most synchronous request chains are fine with 15s. Set `limit_mib` to 85–90% of container memory.
3. Register the `TailSamplingDroppedTooEarly` alert: When this alert fires, it signals that `decision_wait` is too short or `num_traces` is insufficient. Adjust incrementally while watching metric trends.
Next Article
`loadbalancingexporter` Deep Dive: Covers 2-tier architecture operational strategies for resolving traffic skew and evenly distributing load across sampling Collectors. Includes DNS resolver configuration and health check integration.
References
- Tail Sampling Processor README | opentelemetry-collector-contrib — Official reference for all parameters including `decision_wait` and `num_traces`
- Tail-Based Sampling in OpenTelemetry: Sizing, Memory Crashes and Cost Model | Michal Drozd — Source of the benchmark figures used in this article; includes memory profiling results per configuration
- Sampling | OpenTelemetry Official Documentation — Recommended for a deeper understanding of Head Sampling vs Tail Sampling concepts
- Tail Sampling with OpenTelemetry: Why it's useful | OpenTelemetry Blog — Background and use cases for adopting Tail Sampling
- Load Balancing Exporter README | opentelemetry-collector-contrib — Recommended for a deeper understanding of the 2-tier architecture
- Scaling the Collector | OpenTelemetry — Official guide for horizontal and vertical scaling strategies
- Mastering the OpenTelemetry Memory Limiter Processor | Dash0 — Detailed explanation of `memory_limiter` parameters
- otelcol.processor.tail_sampling | Grafana Alloy Documentation — Configuration guide for Grafana Alloy environments
- Gateway deployment pattern | OpenTelemetry — Official architecture guide for the gateway pattern