OpenTelemetry Operator + HPA: 2-Layer Gateway Pattern for Preserving Tail Sampling Accuracy
When operating an Observability pipeline in a Kubernetes environment, there is a dilemma you will inevitably encounter at least once. You want to horizontally scale the OpenTelemetry Collector when traffic spikes, but you hesitate because scaling it up seems like it will break Tail Sampling. The concern that "if I add more Collectors, won't spans from the same trace get scattered?" is a legitimate worry — but this problem can be solved with a well-defined pattern.
By the end of this article, you will be able to directly apply a concrete YAML configuration that satisfies both HPA and Tail Sampling simultaneously using a 2-layer Gateway architecture. It covers how to use the autoscaler block in the OpenTelemetry Operator, queue-based scaling with KEDA, and common mistakes made in production. Note that the OpenTelemetry Operator's TargetAllocator plays the role of distributing Prometheus metric scrape targets in this architecture and is separate from trace routing (details are covered in the appendix section at the end).
This will provide practical insights for those who want to capture every error trace and slow request without gaps, even during unpredictable traffic surges like e-commerce flash sales.
Core Concepts
Why Tail Sampling Conflicts with Horizontal Scaling
Tail Sampling is an approach where all spans of a trace are accumulated in memory, and then the decision of whether to retain them is made retroactively based on criteria such as errors, latency, and attributes. Unlike Head Sampling (which makes an immediate probabilistic decision upfront), Tail Sampling judges based on the full trace context, resulting in much higher quality.
The problem is that spans belonging to the same Trace ID must be gathered in the same Collector instance. If you simply scale a Collector horizontally with Deployment + HPA, a round-robin load balancer scatters spans across multiple instances, making Tail Sampling decisions impossible.
2-Layer Architecture: The Standard Solution Pattern
The way to resolve this conflict is to separate two layers by role.
| Layer | Role | Deployment Type | Scaling Method |
|---|---|---|---|
| Layer 1 (Load Balancing) | Routes spans via traceID hashing | Deployment | Freely scalable with HPA |
| Layer 2 (Tail Sampling) | Aggregates spans and makes sampling decisions | StatefulSet | Scale carefully; manual or VPA recommended |
Layer 1 is stateless and can be freely scaled up and down with HPA. The loadbalancingexporter hashes the Trace ID to consistently route to a specific StatefulSet Pod in Layer 2, preserving Tail Sampling accuracy in Layer 2.
**loadbalancingexporter**: An exporter included in OpenTelemetry Collector Contrib that hashes spans based on `traceID` or `service.name` and delivers them to a consistent endpoint. When used with a DNS-based resolver, it dynamically discovers the list of Pods behind a StatefulSet's Headless Service.

**HPA (Horizontal Pod Autoscaler)**: A built-in Kubernetes resource that automatically increases or decreases the number of Pods based on CPU/memory utilization or custom metrics. Declaring the `spec.autoscaler` block in the OpenTelemetryCollector CRD causes the Operator to automatically create and manage an HPA.

**StatefulSet + Headless Service**: A StatefulSet is a Kubernetes workload resource that assigns each Pod a stable and unique network identifier (e.g., `pod-0`, `pod-1`). When used with a Headless Service, each Pod can be directly addressed via DNS in the format `pod-0.service-name.namespace.svc.cluster.local`, making it suitable for cases requiring consistent routing to a specific Pod, such as a Tail Sampling Collector.
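To make the DNS naming above concrete, a Headless Service fronting a tail-sampling StatefulSet could look like the following minimal sketch. All names here are illustrative, and note that when a Collector runs in `statefulset` mode, the OpenTelemetry Operator creates an equivalent headless Service for you:

```yaml
# Minimal sketch of a Headless Service for a tail-sampling StatefulSet.
# Names and labels are illustrative, not Operator-generated values.
apiVersion: v1
kind: Service
metadata:
  name: tail-sampling-collector-headless
  namespace: observability
spec:
  clusterIP: None            # "Headless": DNS returns individual Pod A records
  selector:
    app.kubernetes.io/name: tail-sampling-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```

With `clusterIP: None`, a DNS lookup of the Service name returns the A records of every ready Pod, and each Pod is individually resolvable as, for example, `tail-sampling-collector-0.tail-sampling-collector-headless.observability.svc.cluster.local`. This is exactly what the loadbalancingexporter's DNS resolver relies on.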
Important Role Distinction: The OpenTelemetry Operator's TargetAllocator is a tool for distributing Prometheus metric scrape targets across multiple Collector instances. It has no direct relation to Tail Sampling routing for traces. Confusing the two makes it easy for architecture design to go wrong.
Practical Application
Example 1: E-commerce Flash Sale Response — 2-Layer Configuration
This assumes an environment where traffic spikes by tens to hundreds of times during sale events. Layer 1 auto-scales with HPA, and Layer 2 runs as a fixed StatefulSet.
Layer 1: Load Balancing Gateway (with HPA)
```yaml
# v1alpha1 is deprecated. For OpenTelemetry Operator v0.80+, using v1beta1 is recommended.
# Note: in v1beta1, `config` is structured YAML, not a string block.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: gateway-lb
  namespace: observability
spec:
  mode: deployment
  autoscaler:
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilization: 60
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
    exporters:
      loadbalancing:
        routing_key: traceID
        protocol:
          otlp:
            tls:
              insecure: true
        resolver:
          dns:
            # Verify the Operator-generated headless Service name with kubectl;
            # it is typically <CR name>-collector-headless.
            hostname: tail-sampling-collector-headless
            port: 4317
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [loadbalancing]
```

Layer 2: Tail Sampling Collector (StatefulSet)
```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: tail-sampling-collector
  namespace: observability
spec:
  mode: statefulset
  replicas: 5
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      tail_sampling:
        decision_wait: 30s
        num_traces: 100000
        policies:
          - name: error-policy
            type: status_code
            status_code: {status_codes: [ERROR]}
          - name: slow-traces
            type: latency
            latency: {threshold_ms: 1000}
          - name: probabilistic-baseline
            type: probabilistic
            probabilistic: {sampling_percentage: 10}
    exporters:
      otlp:
        # Modern Jaeger recommends the OTLP direct receive port (4317).
        # The legacy gRPC port (14250) may not work in newer Jaeger environments.
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [tail_sampling]
          exporters: [otlp]
```

The table below summarizes the meaning and adjustment criteria for the key configuration values.
| Config Key | Example Value | Meaning |
|---|---|---|
| `targetCPUUtilization` | `60` | Scale out when CPU exceeds 60%. Recommended to set conservatively low |
| `decision_wait` | `30s` | Span collection wait time. Longer is more accurate but increases memory consumption. Assuming ~1KB per span: 100K traces × avg 10 spans ≈ 1GB+ |
| `num_traces` | `100000` | Maximum number of traces to keep in memory. Must be tuned together with Pod memory resource settings to prevent OOM |
| `routing_key` | `traceID` | Hashing basis for Layer 1. `traceID` is required for Tail Sampling |
| `hostname` | `*-headless` | Looks up the StatefulSet Pod list via DNS using the Headless Service |
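The coupling between `num_traces` and Pod memory can be made explicit in the Collector spec. The sketch below is illustrative (the resource values and percentages are assumptions, not recommendations): it pairs a memory budget derived from the rough 1GB estimate above with a `memory_limiter` processor placed ahead of `tail_sampling`, so the Collector applies backpressure before the kernel OOM-kills the Pod.

```yaml
# Illustrative sketch: pairing num_traces with Pod memory limits.
# Rough budget: 100k traces × ~10 spans × ~1 KB ≈ 1 GB, so leave headroom.
spec:
  resources:
    requests:
      memory: 4Gi
    limits:
      memory: 4Gi
  config:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75          # start refusing data at 75% of the limit
        spike_limit_percentage: 15    # extra guard band for sudden bursts
      tail_sampling:
        decision_wait: 30s
        num_traces: 100000
    service:
      pipelines:
        traces:
          # memory_limiter must come before tail_sampling in the pipeline
          processors: [memory_limiter, tail_sampling]
```

Refused spans show up in the Collector's own metrics, which is one reason queue- and refusal-based KEDA triggers (covered next) are a better scaling signal than CPU alone.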
Example 2: Queue-Based Autoscaling with KEDA
CPU/memory-based HPA has the limitation of not reflecting the internal state of the Collector. For example, CPU utilization can remain low even while the Collector is dropping data. Connecting Collector internal metrics as triggers for a KEDA ScaledObject enables more precise scaling.
These metrics are exposed in Prometheus format on the Collector's metrics port (default 8888). You can check them directly at `http://<collector-svc>:8888/metrics` or collect them via a Prometheus scrape configuration.
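If you scrape these self-metrics with plain Prometheus, a minimal job might look like the following sketch. The Pod label used in the `keep` rule is an assumption for this article's setup; the `job_name` must match the `job="gateway-collector"` selector used in the KEDA queries.

```yaml
# Hypothetical Prometheus scrape job for the Collector's internal metrics (port 8888).
scrape_configs:
  - job_name: gateway-collector
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [observability]
    relabel_configs:
      # Keep only the gateway Collector Pods (label value is an assumption; adjust to yours).
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: gateway-lb-collector
        action: keep
      # Point the scrape target at the Collector's internal telemetry port.
      - source_labels: [__meta_kubernetes_pod_ip]
        replacement: $1:8888
        target_label: __address__
```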
**KEDA (Kubernetes Event-Driven Autoscaling)**: An open-source framework that scales workloads based on external metric sources (Prometheus, Kafka, SQS, etc.) that Kubernetes' built-in HPA does not support. Declaring trigger conditions in a ScaledObject resource causes KEDA to create and manage an HPA internally.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gateway-collector-scaler
  namespace: observability
spec:
  scaleTargetRef:
    name: gateway-lb-collector
  minReplicaCount: 2
  maxReplicaCount: 30
  cooldownPeriod: 300  # Unit: seconds. Wait time before scale-in
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: otelcol_exporter_queue_size
        threshold: "1000"
        query: >
          sum(otelcol_exporter_queue_size{job="gateway-collector"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        # Metric names may differ depending on OTel Collector version.
        # Below v0.90: otelcol_processor_refused_spans
        # v0.90 and above: the naming scheme has changed; verify the actual name at :8888/metrics.
        metricName: otelcol_processor_refused_spans
        threshold: "100"
        query: >
          rate(otelcol_processor_refused_spans_total{job="gateway-collector"}[1m])
```

3 Common Mistakes
1. **Applying HPA directly to Layer 2 (Tail Sampling)**: When a scale-out event occurs on the StatefulSet layer, some spans may be routed to the wrong Pod until the `loadbalancingexporter`'s DNS-based resolver re-queries the new Pod list (DNS TTL + refresh delay). In-flight spans held in memory during this window may be lost, so it is better to use VPA (Vertical Pod Autoscaler) or manual scaling for Layer 2, and always ensure the `decision_wait` period has elapsed before scaling.

2. **Mistaking TargetAllocator for a trace routing tool**: TargetAllocator is exclusively for distributing Prometheus scrape targets. Trace affinity for Tail Sampling is handled by the `loadbalancingexporter`. Confusing the roles of the two introduces unnecessary complexity.

3. **Setting `decision_wait` and HPA `stabilizationWindowSeconds` independently**: If scale-in begins before `decision_wait` (default 30 seconds) completes, undecided spans will be lost. It is recommended to set `stabilizationWindowSeconds` sufficiently larger than the `decision_wait` value, at a minimum of 300 seconds or more.
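The third mistake can be addressed declaratively. Assuming your Operator version passes a `behavior` block through to the generated `autoscaling/v2` HPA (supported in recent Operator releases; verify against your CRD), a sketch for the scaling layer looks like this. The policy values are illustrative:

```yaml
# Sketch: coordinating scale-in with decision_wait via HPA behavior.
spec:
  autoscaler:
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilization: 60
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300   # well above decision_wait (30s)
        policies:
          - type: Pods
            value: 1                      # drain at most one Pod per period
            periodSeconds: 120
```

Slowing scale-down to one Pod every two minutes gives in-flight traces time to complete their sampling decision before capacity is removed.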
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Automatic capacity response | HPA/KEDA automatically adjusts the number of Collectors during traffic spikes, minimizing data loss |
| Preserved Tail Sampling accuracy | The loadbalancingexporter's traceID hashing ensures spans from the same trace always reach the same Collector |
| Cost efficiency | Maintains minimum replicas during normal periods and scales only during peak times |
| Metric collection consistency | TargetAllocator's Consistent Hashing prevents scrape duplication and gaps |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Tail Sampling layer scale-in risk | Undecided spans in memory may be lost during scale-down | Recommended to set stabilizationWindowSeconds to 300 seconds or more and coordinate so scale-in starts after decision_wait |
| DNS update delay | When Layer 2 scales out, there is a delay of several to tens of seconds before Layer 1 recognizes new Pods | Lowering DNS TTL and shortening the loadbalancing exporter's DNS resolver refresh `interval` can reduce the delay |
| Memory pressure | All spans are kept in memory during `decision_wait`, creating OOM risk during traffic spikes | It is advisable to set HPA thresholds conservatively below 60% and tune `num_traces` to match memory resources |
| Limitation of CPU-based HPA | CPU can remain low even while the Collector is dropping data | Consider using KEDA based on otelcol_exporter_queue_size |
| Misunderstanding of TargetAllocator's role | There are attempts to use TargetAllocator for trace routing, but it is exclusively for metric scraping | It is recommended to clearly document role boundaries in architecture docs |
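For the DNS update delay row, the relevant knobs live in the Layer 1 `loadbalancing` exporter: its DNS resolver exposes `interval` (how often the Pod list is re-resolved) and `timeout` fields. The values below are illustrative:

```yaml
# Sketch: tightening Pod discovery latency in the Layer 1 exporter.
exporters:
  loadbalancing:
    routing_key: traceID
    resolver:
      dns:
        hostname: tail-sampling-collector-headless
        port: 4317
        interval: 5s   # re-query DNS every 5s; shorter = faster new-Pod discovery
        timeout: 1s    # per-lookup timeout
```

A shorter `interval` narrows the window during which spans are hashed against a stale Pod list, at the cost of more frequent DNS queries.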
Appendix: Distributing Prometheus Scrape Targets with TargetAllocator
When the Layer 1 Gateway scales from 3 to 10 instances via HPA, Prometheus metric collection targets must also be automatically redistributed. Enabling TargetAllocator allows targets to be automatically reallocated as the number of Collector instances changes. This feature operates completely independently of the trace pipeline.
```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: gateway-lb
  namespace: observability
spec:
  mode: deployment
  targetAllocator:
    enabled: true
    serviceAccount: opentelemetry-targetallocator-sa
    prometheusCR:
      enabled: true
      podMonitorSelector: {}
      serviceMonitorSelector: {}
    allocationStrategy: consistent-hashing
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: []  # Dynamically injected by the TargetAllocator
        target_allocator:
          # The Operator creates a Service named <CR name>-targetallocator in the same namespace.
          endpoint: http://gateway-lb-targetallocator
          interval: 30s
          collector_id: ${MY_POD_NAME}  # requires the MY_POD_NAME env var (Downward API)
    exporters:
      prometheusremotewrite:
        endpoint: http://thanos-receive:19291/api/v1/receive
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]
```

Closing Thoughts
The 2-layer architecture (a load balancing layer with HPA + a StatefulSet-based Tail Sampling layer) is the current standard pattern for satisfying both horizontal scalability and sampling accuracy simultaneously.
For teams that have not yet adopted this configuration, it is recommended to introduce it incrementally in the following order.
1. **Review your current Collector architecture**: If you are running Tail Sampling with a single Deployment, use `kubectl get otelcol -A` to assess the current state and check whether a `loadbalancingexporter` configuration is present. If it is not, layer separation is needed.

2. **Deploy the Layer 2 StatefulSet first**: Layer 2, which serves as the DNS routing target for the `loadbalancingexporter`, must be brought up first so that traces flow correctly immediately after Layer 1 is deployed. Adjust the `tail-sampling-collector` YAML from the example above to your environment and apply it with `kubectl apply -f tail-sampling.yaml`.

3. **Deploy the Layer 1 load balancer, then consider introducing KEDA**: After Layer 2 is ready, apply the `gateway-lb` YAML and verify HPA creation with `kubectl get hpa -n observability`. Afterward, if the `otelcol_exporter_queue_size` metric is being collected in Prometheus, it is recommended to refer to the ScaledObject YAML above and switch from CPU-based HPA to queue-size-based scaling.
Next article: A guide to optimizing memory usage by tuning `decision_wait` and adjusting `num_traces` in Tail Sampling: Collector configuration for stable operation under high traffic, based on real OOM case analysis and profiling results.
References
Official Documentation
- Horizontal Pod Autoscaling | OpenTelemetry
- Scaling the Collector | OpenTelemetry
- Target Allocator | OpenTelemetry
- Sampling Concepts | OpenTelemetry
- Sampling Milestones 2025 | OpenTelemetry Blog
- tailsamplingprocessor README | GitHub
- TargetAllocator README | GitHub
- Scale Alloy tail sampling | Grafana Docs
- OpenTelemetry Collector Integration | KEDA
- Target Allocator | AWS ADOT
- HPA for OpenTelemetry Collector | IBM Docs
Community