OpenTelemetry Tail Sampling Deep Dive: Composite Policy Design and Memory Optimization with decision_wait
This article is aimed at DevOps engineers already operating an OTel Collector. If you're using head-based sampling, there's one uncomfortable truth you need to face first. Even if your Collector is quietly dropping traces in production, it's hard to notice — and by the time you do, that day's error traces are already gone. Tail Sampling fundamentally solves this problem, but misconfiguration brings a different kind of trouble. Leaving decision_wait: 30s as-is in a 1,000 TPS environment pushes memory to 600MB, which is a precursor to OOM Kill.
By the end of this article, you'll be able to review a production configuration covering composite policy design, decision_wait tuning, and 2-tier architecture setup in under 30 minutes. Based on the tailsamplingprocessor from the OpenTelemetry Collector contrib package, we'll walk through everything step by step — from policy type characteristics to memory estimation formulas and monitoring metrics.
Core Concepts
Head vs. Tail Sampling — What's the Difference?
Head-based sampling decides whether to sample at the moment the first span is created. It's fast and stateless, making horizontal scaling easy — but it must make decisions without knowing whether the trace contains errors or slow responses.
Tail Sampling makes decisions after all spans have arrived at the Collector. To do this, the entire trace must be held in memory for a period of time, which is why memory management becomes the central challenge.
| Item | Head Sampling | Tail Sampling |
|---|---|---|
| Decision point | At trace start | After all spans are collected |
| Error-based decision | Not possible | Possible |
| Latency-based decision | Not possible | Possible |
| Memory requirement | Low | High |
| Horizontal scaling | Easy | Stateful — requires separate design |
What does stateful mean? It means all spans from the same trace must be gathered at the same Collector instance for a correct sampling decision to be made. Simply increasing the number of replicas is not enough for horizontal scaling.
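To make the constraint concrete, here is a minimal sketch of trace-id-based routing. It is an illustration of the idea, not the actual hashing algorithm the loadbalancing exporter uses; the function name is mine:

```python
import hashlib

def route_trace(trace_id: str, num_collectors: int) -> int:
    """Map a trace_id deterministically to a collector index, so every
    span carrying the same trace_id lands on the same instance."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_collectors

# Two spans of the same trace always route to the same collector:
idx_a = route_trace("4bf92f3577b34da6a3ce929d0e0e4736", 4)
idx_b = route_trace("4bf92f3577b34da6a3ce929d0e0e4736", 4)
assert idx_a == idx_b
```

A plain round-robin load balancer provides no such guarantee, which is why naive replica scaling breaks Tail Sampling.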
Understanding Core Parameters and Estimating Memory
There are three core parameters that control the Tail Sampling Processor.
```yaml
processors:
  tail_sampling:
    decision_wait: 10s                  # Wait time from first span arrival to decision
    num_traces: 100000                  # Maximum number of traces to keep in memory
    expected_new_traces_per_sec: 1000   # For internal map pre-allocation
```

What is `decision_wait`? It is the time to wait from the arrival of the first span until a sampling decision is made. All spans for that trace are expected to arrive within this window. Longer values increase accuracy; shorter values save memory.
`expected_new_traces_per_sec` caution: if this value is set too low relative to the actual incoming rate, the internal map is resized at runtime, which can cause momentary latency spikes. Set it generously, at 120–150% of your actual TPS.
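The sizing rule above can be written as a one-line helper. This is a sketch of my own (the function name and `headroom` parameter are not part of the Collector), not an official formula:

```python
import math

def expected_new_traces_per_sec(measured_tps: float, headroom: float = 1.2) -> int:
    """Size the parameter with 120-150% headroom over measured TPS so the
    internal trace map is pre-allocated large enough to avoid runtime resizes."""
    return math.ceil(measured_tps * headroom)

# 1,000 measured TPS with 20% headroom gives the 1200 used in the examples below
assert expected_new_traces_per_sec(1000) == 1200
```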
It's good practice to estimate memory before going to production.
Required memory ≈ traces_per_second × decision_wait(s) × avg_spans_per_trace × avg_span_size(bytes)

With 1,000 traces per second, a 30s wait, an average of 10 spans per trace, and 2KB per span:

1,000 TPS × 30s × 10 spans × 2KB = 600MB

Under the same conditions, reducing `decision_wait` to 10s brings this down to 200MB, one third of the original. This is why shortening `decision_wait` is the most effective memory reduction strategy. In the calculation above, `num_traces: 100000` accommodates 1,000 TPS × 100s, but with `decision_wait: 10s`, 30,000 is sufficient. Excess `num_traces` wastes memory unnecessarily, so adjust this value after doing the calculation.
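The estimate above is easy to script as a pre-production check. This is a minimal sketch (helper names are mine, and it uses decimal units, 1MB = 1,000KB, to match the article's round numbers):

```python
def estimate_memory_mb(tps: int, decision_wait_s: int, avg_spans: int, avg_span_kb: int) -> float:
    """Back-of-envelope estimate: traces/sec x wait x spans/trace x span size."""
    return tps * decision_wait_s * avg_spans * avg_span_kb / 1000

def required_num_traces(tps: int, decision_wait_s: int, buffer: float = 3.0) -> int:
    """num_traces must at least cover TPS x decision_wait; buffer=3.0 mirrors
    the 30,000 used in the example configs for 1,000 TPS x 10s."""
    return int(tps * decision_wait_s * buffer)

assert estimate_memory_mb(1000, 30, 10, 2) == 600.0   # the 30s scenario above
assert estimate_memory_mb(1000, 10, 10, 2) == 200.0   # 10s cuts it to one third
assert required_num_traces(1000, 10) == 30000          # matches num_traces: 30000
```

Run this with your own measured span counts and sizes before committing to a `decision_wait` value.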
decision_wait starting point formula: service p99 response time × 3
| Situation | Recommended decision_wait |
|---|---|
| General web API (p99 < 1s) | 5s – 10s |
| Mixed workload including batch processing | 15s – 20s |
| Legacy systems (p99 > 5s) | 30s (keep default) |
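One way to encode the rule of thumb and the table's range in a helper (the clamping bounds are my reading of the table, not an official formula):

```python
def initial_decision_wait_s(p99_s: float) -> int:
    """Starting point: p99 x 3, clamped to the 5s-30s span of the table above."""
    return min(30, max(5, round(p99_s * 3)))

assert initial_decision_wait_s(0.8) == 5    # fast web API: lower bound applies
assert initial_decision_wait_s(3.0) == 9    # mid-range workload: p99 x 3
assert initial_decision_wait_s(15.0) == 30  # legacy system: capped at the 30s default
```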
Using GOMEMLIMIT and memory_limiter Together
Using two mechanisms together for memory protection is the current standard pattern.
- `memory_limiter` processor: drops spans when memory exceeds the limit to prevent OOM. It must be placed before `tail_sampling` in the pipeline for backpressure to work correctly.
- `GOMEMLIMIT` environment variable: a soft memory limit for the Go runtime. Setting it to 80–90% of the container memory limit causes the GC to aggressively reclaim memory before an OOM Kill occurs.
```yaml
# Environment variable configuration example (Kubernetes)
env:
  - name: GOMEMLIMIT
    value: "1800MiB"   # When container limit is 2GiB
```

Policy Types at a Glance
| Policy Type | Description | Typical Use Case |
|---|---|---|
| `status_code` | Based on ERROR / UNSET / OK | Preserve all error traces |
| `latency` | Whether `threshold_ms` is exceeded | Capture slow requests |
| `string_attribute` | String attribute values (regex supported) | Filter specific services or paths |
| `numeric_attribute` | Numeric attribute range | Specific user_id ranges, etc. |
| `probabilistic` | Probabilistic ratio | Preserve only a portion of normal traffic |
| `and` | AND combination of multiple policies | Cross-condition filtering |
| `composite` | Rate allocation per policy | Traffic budget distribution |
| `always_sample` | All traces pass through | Fallback within composite |
Practical Application
Example 1: Combining Policies (Error + Latency + Probabilistic Sampling)
This is the most common configuration. Important traces (errors, slow requests) are preserved in full, while only 5% of the rest are sampled. Below is a complete example including `service.pipelines`.
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1800
    spike_limit_mib: 400
  tail_sampling:
    decision_wait: 10s
    num_traces: 30000                  # 1,000 TPS × 10s = 10,000 + buffer
    expected_new_traces_per_sec: 1200
    policies:
      # Preserve all error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Preserve slow traces exceeding 500ms
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 500
      # Sample only 5% of the rest
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/backend]
```

Policy evaluation behavior: every policy is evaluated, and if any one returns `SAMPLE`, the trace is preserved. Policy order does not change the outcome itself, but placing the error policy first communicates intent clearly.
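The combination rule can be sketched in a few lines. This is a deliberately simplified model (real policies also produce inverted decisions, which this ignores; the function names are mine):

```python
# A policy here is a function over a trace's spans returning True (SAMPLE) or False.
def has_error(spans):
    return any(s.get("status") == "ERROR" for s in spans)

def is_slow(spans, threshold_ms=500):
    return max(s["duration_ms"] for s in spans) > threshold_ms

def decide(spans, policies):
    """The trace is kept if ANY policy returns SAMPLE."""
    return any(policy(spans) for policy in policies)

trace = [{"status": "OK", "duration_ms": 120},
         {"status": "ERROR", "duration_ms": 40}]
assert decide(trace, [has_error, is_slow])                                # error wins
assert not decide([{"status": "OK", "duration_ms": 100}], [has_error, is_slow])
```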
| Policy | Role | Notes |
|---|---|---|
| `errors-policy` | Guarantees error traces | Place first to express intent clearly |
| `slow-traces-policy` | Captures performance issues | Set `threshold_ms` based on p99 |
| `probabilistic-policy` | Cost reduction | Evaluated independently, regardless of errors or slow requests |
Example 2: AND Policy — Capture Only Slow Traces from the Payment Service
When you want to target only slow requests from a specific service, use the `and` policy.
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 30000
    expected_new_traces_per_sec: 1200
    policies:
      - name: slow-payment-traces
        type: and
        and:
          and_sub_policy:
            - name: latency-check
              type: latency
              latency:
                threshold_ms: 300
            - name: service-check
              type: string_attribute
              string_attribute:
                key: service.name
                values: [payment-service]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/backend]
```

The `and` policy finalizes sampling only when all sub-policies return `SAMPLE`. The example above preserves only traces that are both from `payment-service` and exceed 300ms.
Example 3: Composite Policy — Allocating Spans Per Second Like a Budget
If you started with the basic configuration from Example 1 but need more precise control over backend costs, you can expand to a composite policy. Use `max_total_spans_per_second` to cap total throughput and assign a ratio to each policy.
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 30000
    expected_new_traces_per_sec: 1200
    policies:
      - name: composite-policy
        type: composite
        composite:
          max_total_spans_per_second: 1000
          policy_order: [errors-policy, slow-traces-policy, probabilistic-policy]
          composite_sub_policy:
            - name: errors-policy
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: slow-traces-policy
              type: latency
              latency:
                threshold_ms: 500
            - name: probabilistic-policy
              type: always_sample
          rate_allocation:
            - policy: errors-policy
              percent: 50   # Up to 500 spans/sec allocated to error traces
            - policy: slow-traces-policy
              percent: 30   # Up to 300 spans/sec allocated to slow traces
            # Caution: the remaining 20% (200 spans/sec) is NOT given to
            # probabilistic-policy, because budget for policies not listed
            # in rate_allocation is discarded, not redistributed.
            # Explicitly declare all sub-policies.
```

`rate_allocation` caution: even if the percentages don't add up to 100%, the remaining budget is not automatically distributed to other policies. Budget for policies not explicitly listed is discarded. Declare all sub-policies explicitly in `rate_allocation`.
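The budget arithmetic is worth seeing in numbers. A sketch of the allocation rule as described above (the function name is mine, not part of the processor):

```python
def allocate_budget(max_total_spans_per_second: int, rate_allocation: dict) -> dict:
    """Each policy listed in rate_allocation gets percent/100 of the total
    span budget; unlisted policies get nothing, and the leftover budget is
    discarded rather than redistributed."""
    return {policy: max_total_spans_per_second * pct // 100
            for policy, pct in rate_allocation.items()}

budget = allocate_budget(1000, {"errors-policy": 50, "slow-traces-policy": 30})
assert budget == {"errors-policy": 500, "slow-traces-policy": 300}
leftover = 1000 - sum(budget.values())
assert leftover == 200   # lost unless probabilistic-policy is declared explicitly
```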
Example 4: Excluding Health Check Paths (invert_match)
You can exclude high-frequency noise requests like /health and /readyz from sampling.
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 30000
    expected_new_traces_per_sec: 1200
    policies:
      # Mark health check paths as NOT_SAMPLED
      - name: exclude-health-check
        type: string_attribute
        string_attribute:
          key: http.target
          values: ["/health", "/readyz", "/metrics"]
          invert_match: true
      # Sample 5% of remaining traces
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

How `invert_match: true` works: this policy marks traces matching the specified values as `NOT_SAMPLED`. Traces that do not match are not decided by this policy and pass on to the next one. An `invert_match` policy alone does not mean "sample everything else," so always configure it alongside other sampling policies.
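The two-outcome behavior described above can be sketched as follows. This is a simplified model (the real processor's decision set is richer, and the function name is mine):

```python
NOT_SAMPLED = "NOT_SAMPLED"   # matching traces are excluded
PENDING = "PENDING"           # left for the next policy to decide

def exclude_matching(attrs: dict, key: str, values: list) -> str:
    """invert_match: true -- drop matches, leave everything else undecided."""
    if attrs.get(key) in values:
        return NOT_SAMPLED
    return PENDING

noise = ["/health", "/readyz", "/metrics"]
assert exclude_matching({"http.target": "/health"}, "http.target", noise) == NOT_SAMPLED
assert exclude_matching({"http.target": "/api/orders"}, "http.target", noise) == PENDING
```

Note that `PENDING` is not `SAMPLED`: without a follow-up policy such as `probabilistic`, the remaining traffic has no policy voting to keep it.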
Example 5: 2-Tier Architecture — Essential Setup for Horizontal Scaling
Because Tail Sampling is stateful, simple horizontal scaling is not possible. If spans from the same trace are distributed across different Collector instances, correct sampling decisions cannot be made. The officially recommended architecture to solve this problem is the 2-tier configuration.
```yaml
# Tier 1: Load balancer Collector - trace_id hash-based routing
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # For development/testing. Production requires TLS certificate configuration.
    resolver:
      k8s:
        service: otel-collector-tailsampling-headless

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

Tier 2 is the Tail Sampling Collector, which applies the `tail_sampling` processor from the examples above:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1800
    spike_limit_mib: 400
  tail_sampling:
    decision_wait: 10s
    num_traces: 30000
    expected_new_traces_per_sec: 1200
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 500
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/backend]
```

What is the `loadbalancing` exporter? It uses the trace_id as a hash key so that all spans of the same trace are always routed to the same Collector instance. It is the core component of the 2-tier Tail Sampling architecture: horizontally scaling Tail Sampling Collectors without this layer results in incorrect sampling decisions.
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Decisions based on complete information | Judgment after all spans are collected — errors and latency can be captured accurately |
| Cost reduction | 80–95% reduction in traces sent to the backend is achievable |
| Flexible policy combinations | AND / Composite can express complex business rules |
| Noise removal | Health checks and high-frequency normal requests can be easily excluded |
Disadvantages and Caveats
| Item | Description | Mitigation |
|---|---|---|
| Memory pressure | Traces reside in memory proportional to `decision_wait` × TPS | Shorten `decision_wait`; use `memory_limiter` + `GOMEMLIMIT` together |
| Drop on `num_traces` exceeded | When the limit is exceeded, older traces are dropped and incomplete traces may be sent | Monitor the `sampling_trace_dropped_too_early_count` metric |
| Stateful constraint | Simple horizontal scaling is not possible | A preceding `loadbalancing` exporter layer is required |
| Risk of missing spans | Spans that don't arrive within `decision_wait` are excluded from the decision | Set generously based on p99 |
Essential Monitoring Metrics
```
# If this metric increases, you need to increase num_traces or shorten decision_wait
otelcol_processor_tail_sampling_sampling_trace_dropped_too_early_count

# Policy evaluation errors - needed when diagnosing policy configuration issues
otelcol_processor_tail_sampling_sampling_policy_evaluation_error_count

# Actual number of sampled traces - for ratio verification
otelcol_processor_tail_sampling_count_traces_sampled
```

Most Common Mistakes in Practice
- Placing `memory_limiter` after `tail_sampling`: `memory_limiter` must come before `tail_sampling` in the pipeline for backpressure to work correctly.
- Horizontally scaling Tail Sampling Collectors without a 2-tier architecture: simply increasing the number of replicas without the `loadbalancing` exporter distributes spans from the same trace across instances, leading to incorrect sampling decisions.
- Using an `invert_match` policy in isolation: `invert_match: true` only excludes matching traces; it is not an instruction to sample the rest. Always configure it alongside other sampling policies.
Closing Thoughts
Before adjusting your head sampling ratio, the right order of operations is to first calculate the decision_wait and memory cost for Tail Sampling. Accurately selecting important traces is more valuable than simply lowering the sampling ratio, and shortening decision_wait is where that starts.
Three steps you can take right now:
- Measure your service's p99 response time, calculate an initial value with `decision_wait = p99 × 3`, and estimate required memory with `TPS × decision_wait × avg_spans × avg_span_size`.
- Apply the basic three-policy configuration in the order `errors-policy → slow-traces-policy → probabilistic-policy`, and place `memory_limiter` at the front of the pipeline.
- Add the `otelcol_processor_tail_sampling_sampling_trace_dropped_too_early_count` metric to a Prometheus dashboard. If this value increases, consider raising `num_traces` or shortening `decision_wait`.
Next article: Designing an autoscaling architecture that integrates Tail Sampling in a Kubernetes environment with KEDA to automatically respond to traffic spikes
References
- Tail Sampling Processor Official README | GitHub — When you need the full policy parameter reference
- OpenTelemetry Official Sampling Concepts Documentation | OpenTelemetry — When encountering the concept of Head vs. Tail Sampling for the first time
- Tail Sampling with OpenTelemetry: Why it's useful | OpenTelemetry Blog — When you need background on the adoption decision
- Tail-Based Sampling: Sizing, Memory Crashes and Cost Model | Michal Drozd — When you want to go deep into memory OOM cases and cost models
- How to Fix the Collector Memory Leak Caused by Tail Sampling Processor | OneUptime — When you need to troubleshoot an actual memory leak
- How to Right-Size CPU and Memory for the OpenTelemetry Collector | OneUptime — When covering Collector resource sizing in general
- Scale Alloy tail sampling | Grafana OpenTelemetry Docs — When implementing the same configuration in a Grafana Alloy environment
- otelcol.processor.tail_sampling | Grafana Alloy Official Documentation — Alloy component parameter reference
- Scaling the Collector | OpenTelemetry Official Documentation — When verifying 2-tier architecture design principles from official documentation
- Traces at Scale: Head or Tail? Sampling Strategies | DEV Community — When taking a broader look at the context of sampling strategy selection