Implementing Alpha Spending Sequential Testing in a Flagger Webhook — Cutting Average Canary Analysis Time by Up to 66% with Statistical Early Exit
Anyone who has managed canary deployments knows this situation. You launch a new version at 10% traffic, glance at the metrics five minutes later, and see the error rate ticking up. The instinct that "there aren't enough samples yet, so wait a bit longer" clashes with the instinct that "the bad signs are already visible, so why wait?" You end up refreshing the dashboard in this ambiguous state, only to press the rollback button manually, with no statistical basis. Or the opposite happens: the automatic threshold fires too late, and more users suffer in the meantime.
The common cause of both scenarios is the "peeking problem" of fixed-sample A/B testing. Checking the data repeatedly inflates the Type I error rate (false positives), undermining statistical reliability, and simple threshold comparisons do nothing to fix it. By combining Sequential Testing and the Alpha Spending function with Flagger's webhook gating layer, every interim analysis point is tested in a statistically valid way, the process exits automatically as soon as a problem is confirmed, and the average experiment duration drops by up to 66%. This article walks through the process step by step, from the mathematical principles of the Alpha Spending function to implementing and integrating the Flagger webhook service.
Key Concepts
Peeking Problem: Why Repeated Checking Is Dangerous
The fixed-sample test assumes the test is performed exactly once, after all N predetermined samples have been collected. In a canary deployment, however, the metric is checked every minute. If the check is repeated k times, the actual Type I error rate ends up far above the nominal significance level α = 0.05.
```python
import numpy as np

def peeking_inflation(alpha: float, num_peeks: int) -> float:
    """Estimate the actual Type I error rate under repeated testing
    (upper bound assuming independent looks — conservative vs. reality)."""
    # Conservative bound that treats each look as independent; with actually
    # correlated cumulative observations the inflation is lower (~0.2 at 12 peeks).
    return 1 - (1 - alpha) ** num_peeks

# Peeking 12 times over 60 minutes at 5-minute intervals:
print(peeking_inflation(0.05, 12))  # ≈ 0.46 (independence upper bound)
# Simulation-based estimates land around 0.2, but either way the
# target error rate of 0.05 is exceeded several times over.
```
Sequential Testing solves this problem by handling interim analyses in a statistically valid way. The core idea is to "spend" the total significance level α little by little at each interim analysis point, distributed so that the total never exceeds α.
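The closed-form bound above can be sanity-checked empirically. The sketch below runs an A/A Monte Carlo (no real effect in either arm) and peeks 12 times at a naive fixed-sample threshold; the numbers of simulations and samples per peek are illustrative choices, not from the article's service.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims, n_per_peek, peeks = 2000, 500, 12
false_positives = 0
for _ in range(n_sims):
    # A/A test: both arms draw from the same distribution (no real effect)
    a = rng.normal(size=n_per_peek * peeks)
    b = rng.normal(size=n_per_peek * peeks)
    for k in range(1, peeks + 1):
        n = k * n_per_peek
        # Z-statistic for the difference in means (known variance 1 per arm)
        z = (a[:n].mean() - b[:n].mean()) / np.sqrt(2 / n)
        if abs(z) > 1.96:  # naive fixed-sample threshold at every peek
            false_positives += 1
            break

print(false_positives / n_sims)  # ≈ 0.2 — roughly 4x the nominal 0.05
```

Because consecutive looks share data, the realized inflation sits well below the independence bound of 0.46, yet still far above 0.05.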
Information Fraction τ ∈ [0, 1]: the ratio of samples collected so far to the total planned sample size. τ = 0.3 means 30% of the plan is complete. It must be computed from sample counts rather than elapsed time, so that statistical validity holds even for services with uneven traffic.
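Computed from sample counts, τ is a one-liner; the `collected` and `target` values here are illustrative:

```python
# Sample-based information fraction; numbers are illustrative
collected, target = 3000, 10000
tau = min(collected / target, 1.0)  # cap at 1.0 once the target is reached
print(tau)  # 0.3
```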
Alpha Spending Function: Three Strategies
The Alpha Spending function proposed by Lan and DeMets (1983) defines the cumulative consumption α(τ) according to the information fraction τ.
It is important to first understand the background. The O'Brien-Fleming (1979) and Pocock methods were originally developed based on the premise of equal-spaced looks. In canary deployments, analysis timing can become irregular depending on traffic volume, and for such cases, the Alpha Spending extension of Lan-DeMets is required. Rather than viewing the three methods simply as "different styles," they should be understood in the context that O'Brien-Fleming/Pocock is suitable when the analysis schedule is fixed, while Lan-DeMets is suitable when the schedule is fluid.
| Method | Initial Conservatism | Early Exit Probability | Final Nominal Significance Level | Recommended Use Case |
|---|---|---|---|---|
| O'Brien-Fleming | Very High | Low (Strong Evidence Required) | Almost Identical to a Single Test | Regression Detection, Safe Stop, Fixed Analysis Schedule |
| Pocock | Low | High (Equal Distribution) | Conservative (Decreases) | Fast Superiority Confirmation, Fixed Analysis Schedule |
| Lan-DeMets | Flexible | Flexible | Flexible | When the number of analyses or timing changes during the experiment |
```python
import numpy as np
from scipy import stats

def obrien_fleming(tau: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming cumulative alpha spending."""
    if tau <= 0:
        return 0.0
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    return 2 * (1 - stats.norm.cdf(z_alpha / np.sqrt(tau)))

def pocock(tau: float, alpha: float = 0.05) -> float:
    """Pocock cumulative alpha spending."""
    return alpha * np.log(1 + (np.e - 1) * tau)

# Cumulative alpha spent by each method at τ=0.1 (10% progress)
print(f"O'Brien-Fleming @ τ=0.1: {obrien_fleming(0.1):.10f}")  # ≈ 6e-10
print(f"Pocock @ τ=0.1: {pocock(0.1):.6f}")  # ≈ 0.007928
```
O'Brien-Fleming spends essentially nothing early on. At τ=0.1 it has consumed only about 6×10⁻¹⁰ of the α budget, so at a first look there the Z-statistic must exceed roughly 6.2 for an early exit — by design, only truly severe regressions get caught in the first 10%. Pocock, by contrast, has already spent about 16% of the total α (≈ 0.0079) at the same point, making an early exit possible at roughly the Z ≈ 2.7 level.
incremental_alpha (incremental consumption) is the core mechanism of this approach. The total α budget (e.g., 0.05) is treated like points: at each interim analysis, the amount that may be spent this time is the difference in cumulative spending since the previous analysis. Alpha spent at one analysis cannot be reused at the next, and even after all analyses the total spent is guaranteed not to exceed 0.05.
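A quick check of this budget property: over any analysis schedule, the increments sum back to exactly the total α. A minimal sketch using the O'Brien-Fleming spending function defined above, with an example four-look schedule:

```python
import numpy as np
from scipy import stats

def obrien_fleming(tau: float, alpha: float = 0.05) -> float:
    # Cumulative O'Brien-Fleming spending, as defined earlier
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    return 2 * (1 - stats.norm.cdf(z_alpha / np.sqrt(tau)))

looks = [0.25, 0.5, 0.75, 1.0]        # an example analysis schedule
spent = [obrien_fleming(t) for t in looks]
increments = np.diff([0.0] + spent)   # per-analysis alpha budgets
print(increments.sum())               # ≈ 0.05 — exactly the total budget
```

The sum telescopes to the cumulative spending at τ = 1, which is α by construction, regardless of how many looks the schedule contains.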
Flagger Gating Layer Architecture
Flagger checks the metrics and calls its webhooks every analysis.interval. This structure maps exactly onto the interim-analysis loop of Sequential Testing.
```
[Flagger Canary Controller]
        │
        │  webhook POST every interval
        ▼
[Sequential Test Service] ←─── [Prometheus / Metrics Store]
        │                        (current canary metrics)
        │
        │  1. Compute current τ (collected samples / target samples)
        │  2. Compute incremental_alpha (this analysis's α budget)
        │  3. Derive the critical Z-value
        │  4. Compare against the observed Z-statistic
        ▼
    |Z| > boundary?
        ├── YES → HTTP 400 → Flagger failure count +1 → auto-rollback once threshold is exceeded
        └── NO  → HTTP 200 → canary proceeds
```
Practical Application
Flagger Canary Analysis Settings
First, register a Sequential Testing webhook to Flagger's Canary resource. Each interval becomes an intermediate analysis point.
```yaml
# canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  progressDeadlineSeconds: 600
  service:
    port: 80
  analysis:
    interval: 1m      # every minute = one interim analysis
    threshold: 3      # roll back after 3 consecutive failures
    maxWeight: 50     # at most 50% traffic
    stepWeight: 10    # +10% per iteration
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
    webhooks:
      - name: sequential-test-gate
        type: rollout # gate during each analysis iteration
        url: http://seq-test-service.monitoring/check
        timeout: 30s
        metadata:
          alpha: "0.05"
          method: "obrien_fleming"
          # Flagger metadata fields only accept strings, so numbers are passed as strings
          target_samples: "10000"
          metric: "error_rate"
          canary_name: "my-service"
```
A type: rollout webhook runs during the analysis on every iteration, before the metric checks (a pre-rollout webhook, by contrast, runs only once, before traffic is first routed to the canary). If the sequential test returns a failure, traffic does not advance and only the failure count is incremented. Once threshold is reached, Flagger executes an automatic rollback.
Webhook Service Implementation: AlphaSpendingTest Core Logic
Below is the FastAPI service called via the Flagger webhook. First, let's take a look at the core class for statistical testing.
Background for readers unfamiliar with Z-statistics: the Z-statistic expresses an observed difference in units of standard errors, indicating how implausible it would be under pure chance. scipy.stats.norm.ppf is the quantile function (inverse CDF) of the normal distribution: given a probability, it returns the Z-value below which that much probability mass lies. |Z| > 1.96 means significance at the α = 0.05 level for a single two-sided test.
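A two-line illustration of the quantile function:

```python
from scipy import stats

# Inverse CDF: the Z-value below which 97.5% of the mass lies
z = stats.norm.ppf(1 - 0.05 / 2)
print(round(z, 2))  # 1.96
```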
```python
# seq_test_service.py — core statistical logic
import numpy as np
from scipy import stats

class AlphaSpendingTest:
    def __init__(self, alpha: float = 0.05, method: str = "obrien_fleming"):
        self.alpha = alpha
        self.method = method

    def spending_function(self, tau: float) -> float:
        """Cumulative alpha to be spent up to information fraction tau."""
        tau = max(tau, 1e-10)  # guard against division by zero
        if self.method == "obrien_fleming":
            z_alpha = stats.norm.ppf(1 - self.alpha / 2)
            return 2 * (1 - stats.norm.cdf(z_alpha / np.sqrt(tau)))
        elif self.method == "pocock":
            return self.alpha * np.log(1 + (np.e - 1) * tau)
        else:
            raise ValueError(f"Unknown method: {self.method}")

    def get_critical_z(self, tau: float, prev_tau: float = 0.0) -> float:
        """
        Critical Z-value for the current interim analysis.
        prev_tau: τ at the previous analysis. The caller is responsible for
        tracking prior spending and passing it in; the default of 0 means
        this is the first analysis. Leaving prev_tau at 0 on every call means
        cumulative spending is not tracked correctly, so it must come from a
        state store (e.g. Redis) or a request parameter.
        """
        # incremental_alpha: the remaining alpha budget spendable at this analysis
        incremental_alpha = (
            self.spending_function(tau) - self.spending_function(prev_tau)
        )
        # Numerical stability: keep the boundary from blowing up to infinity
        incremental_alpha = max(incremental_alpha, 1e-10)
        return stats.norm.ppf(1 - incremental_alpha / 2)

    def evaluate(self, z_stat: float, tau: float, prev_tau: float = 0.0) -> dict:
        boundary = self.get_critical_z(tau, prev_tau)
        return {
            "stop": abs(z_stat) > boundary,
            "z_stat": round(z_stat, 4),
            "boundary": round(boundary, 4),
            "tau": round(tau, 4),
            "incremental_alpha": round(
                self.spending_function(tau) - self.spending_function(prev_tau), 6
            ),
        }

def compute_z_stat(
    canary_error_rate: float,
    baseline_error_rate: float,
    n_canary: int,
    n_baseline: int,
) -> float:
    """
    Z-statistic for a two-sample difference in proportions.
    Why the pooled standard error: under the null hypothesis (the two rates
    are equal), the pooled SE is statistically more efficient. The unpooled
    SE can be preferable when the alternative is true, but for regression
    detection in canary deployments, testing under the null is standard practice.
    """
    p_pool = (
        canary_error_rate * n_canary + baseline_error_rate * n_baseline
    ) / (n_canary + n_baseline)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_canary + 1 / n_baseline))
    if se < 1e-10:
        return 0.0
    return (canary_error_rate - baseline_error_rate) / se
```
The roles of each function can be summarized as follows.
| Function | Role |
|---|---|
| `spending_function(tau)` | Cumulative alpha spent up to τ |
| `get_critical_z(tau, prev_tau)` | Critical Z-value for the current interim analysis; `prev_tau` state management is the caller's responsibility |
| `compute_z_stat()` | Pooled-SE two-sample proportion Z-statistic |
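To get a feel for the pooled-SE formula, here is a quick by-hand computation with hypothetical numbers — a canary at 2.4% errors on 3,000 samples vs. a baseline at 1.0% on 30,000, not taken from any real service:

```python
import numpy as np

# Hypothetical error rates and sample counts
p_c, p_b, n_c, n_b = 0.024, 0.010, 3000, 30000
p_pool = (p_c * n_c + p_b * n_b) / (n_c + n_b)  # pooled proportion under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_b))
z = (p_c - p_b) / se
print(round(z, 2))  # ≈ 6.93 — past any reasonable boundary, so the test stops
```

A regression this large clears even the most conservative early O'Brien-Fleming boundary, which is exactly the kind of case sequential gating is meant to catch quickly.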
Webhook Service Implementation: Prometheus Integration and Endpoints
```python
# seq_test_service.py — FastAPI endpoints and metric collection
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import httpx
import os

app = FastAPI()
PROMETHEUS_URL = os.getenv("PROMETHEUS_URL", "http://prometheus:9090")

async def fetch_canary_metrics(canary_name: str) -> dict:
    """
    Collect canary metrics from Prometheus.
    Note: the label in the queries below (app="{canary_name}") is an example.
    Adjust it to your service's actual label scheme (app, deployment, service, ...).
    The primary deployment is assumed to use the "{canary_name}-primary" label,
    but this can differ in a real Flagger environment.
    """
    async with httpx.AsyncClient() as client:
        error_rate_query = (
            f'sum(rate(http_requests_total{{app="{canary_name}",'
            f'status=~"5.."}}[1m])) / '
            f'sum(rate(http_requests_total{{app="{canary_name}"}}[1m]))'
        )
        sample_count_query = (
            f'sum(increase(http_requests_total{{app="{canary_name}"}}[1h]))'
        )
        resp_error = await client.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": error_rate_query},
        )
        resp_count = await client.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": sample_count_query},
        )
        error_data = resp_error.json()["data"]["result"]
        count_data = resp_count.json()["data"]["result"]
        # Treat an empty query result as "metrics not collected yet"
        if not error_data or not count_data:
            raise ValueError(f"No metrics found for '{canary_name}'. "
                             "Check your label scheme.")
        error_rate = float(error_data[0]["value"][1])
        sample_count = float(count_data[0]["value"][1])
        return {"error_rate": error_rate, "sample_count": sample_count}

class WebhookPayload(BaseModel):
    metadata: dict = {}

@app.post("/check")
async def sequential_test_gate(payload: WebhookPayload):
    meta = payload.metadata
    alpha = float(meta.get("alpha", "0.05"))
    method = meta.get("method", "obrien_fleming")
    # Flagger metadata arrives as strings, so convert to int
    target_samples = int(meta.get("target_samples", "10000"))
    canary_name = meta.get("canary_name", "unknown")
    min_tau = float(meta.get("min_tau", "0.05"))  # minimum warm-up: 5%
    # prev_tau state management: this implementation takes it as a request
    # parameter. In production, keep per-experiment prev_tau in an external
    # state store (e.g. Redis) so incremental spending accumulates correctly
    # across analyses.
    prev_tau = float(meta.get("prev_tau", "0.0"))
    try:
        canary_metrics = await fetch_canary_metrics(canary_name)
        baseline_metrics = await fetch_canary_metrics(f"{canary_name}-primary")
    except Exception as e:
        # Pass on metric-collection failure (fail-open policy).
        # Switch to fail-closed if your service demands it.
        return {"status": "metrics_unavailable", "reason": str(e)}
    n_canary = int(canary_metrics["sample_count"])
    tau = n_canary / target_samples
    # Pass while the minimum warm-up is unmet (tiny τ makes the boundary extreme)
    if tau < min_tau:
        return {"status": "warming_up", "tau": tau, "min_tau": min_tau}
    z_stat = compute_z_stat(
        canary_metrics["error_rate"],
        baseline_metrics["error_rate"],
        n_canary,
        max(int(baseline_metrics["sample_count"]), 1),
    )
    tester = AlphaSpendingTest(alpha=alpha, method=method)
    result = tester.evaluate(z_stat, min(tau, 1.0), prev_tau)
    if result["stop"]:
        raise HTTPException(
            status_code=400,
            detail={
                "reason": "sequential_test_boundary_exceeded",
                **result,
            },
        )
    return {"status": "pass", **result}
```
Boundary Visualization (Standalone Script for Local Analysis)
The code below is a standalone analysis script, separate from the service code (seq_test_service.py). Use it locally before deployment to see which boundaries will apply.
```python
# visualize_boundaries.py — local analysis only; not part of the service
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def obf_boundary(taus, alpha=0.05):
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    boundaries, prev_spend = [], 0
    for tau in taus:
        curr_spend = 2 * (1 - stats.norm.cdf(z_alpha / np.sqrt(tau)))
        inc = max(curr_spend - prev_spend, 1e-10)
        boundaries.append(stats.norm.ppf(1 - inc / 2))
        prev_spend = curr_spend
    return boundaries

def pocock_boundary(taus, alpha=0.05):
    boundaries, prev_spend = [], 0
    for tau in taus:
        curr_spend = alpha * np.log(1 + (np.e - 1) * tau)
        inc = max(curr_spend - prev_spend, 1e-10)
        boundaries.append(stats.norm.ppf(1 - inc / 2))
        prev_spend = curr_spend
    return boundaries

taus = np.linspace(0.01, 1.0, 100)
obf = obf_boundary(taus)
poc = pocock_boundary(taus)

plt.figure(figsize=(10, 5))
plt.plot(taus, obf, label="O'Brien-Fleming", linewidth=2)
plt.plot(taus, poc, label="Pocock", linewidth=2, linestyle="--")
plt.axhline(y=1.96, color="gray", linestyle=":", label="Single test Z=1.96")
plt.xlabel("Information fraction τ")
plt.ylabel("Critical Z-value")
plt.title("Alpha Spending boundary comparison (target_samples=10000)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig("boundary_comparison.png", dpi=150, bbox_inches="tight")
```
In the resulting plot, the O'Brien-Fleming boundary starts extremely high — at the earliest looks it sits in roughly the Z ≈ 4.5–6.5 range, depending on the analysis schedule — so only very strong signals allow an early exit, which makes it ideal for safe stops on severe regressions (error-rate surges, latency spikes). The Pocock boundary is far lower and nearly flat from the first look, which is what makes it suitable for faster superiority confirmation.
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Fast Rollback | Automatic exit as soon as a statistically significant issue is detected; reduces average experiment duration by up to 66% versus a fixed-sample design |
| Minimizing User Harm | Detect issues in a small amount of Canary traffic and block them before exposure to full users |
| Type I Error Control | Solves the peeking problem with Alpha Spending. The total error rate does not exceed the predefined α. |
| Flexible Schedule | The Lan-DeMets method allows changing the number of intermediate analyses and timing during the experiment |
| Futility Stopping Extension | Adding Beta-Spending stops futile experiments early and prevents resource waste |
| Leveraging Existing Infrastructure | Implementable by simply adding a webhook service on top of the Flagger + Prometheus stack |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Initial Conservatism | O'Brien-Fleming's initial boundaries are too high, potentially failing to catch obvious problems | Use O'Brien-Fleming for high-severity metrics (error rates) and Pocock for performance metrics |
| Slightly Larger Sample Size | The total sample size required for equivalent power is somewhat higher than for a single test | Set the target sample size with a 10–15% margin |
| Implementation Complexity | Requires an external statistics service and prev_tau state management; a failure policy must be decided | Explicitly choose fail-open (pass when the service is down) or fail-closed |
| Metric Lag | Prometheus scrape interval (default 15s~1m) affects τ calculation accuracy | Set interval to at least 3 times the scrape interval |
| Multiple Metrics | Applying independent Alpha Spending to several metrics inflates the overall FWER | Bonferroni correction (α/k) or a hierarchical test design |
| Cold Start | When τ is very small, the boundary is extremely high and virtually meaningless | Minimum warm-up period (min_tau ≥ 0.05) must be set |
FWER (Family-Wise Error Rate): the probability of at least one false positive across multiple comparisons. With k metrics, the simplest remedy is the Bonferroni correction, which lowers each test's significance level to α/k. For example, to hold α = 0.05 across 5 metrics, each test uses α = 0.01.
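The arithmetic, as a sketch (the independence assumption is only used to bound the resulting FWER):

```python
alpha, k = 0.05, 5
per_metric = alpha / k                   # Bonferroni: each metric tested at 0.01
fwer_bound = 1 - (1 - per_metric) ** k   # FWER under independent metrics
print(per_metric, round(fwer_bound, 4))  # 0.01 0.049 — back under the 0.05 budget
```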
The Most Common Mistakes in Practice
- Don't let `threshold` and the Sequential Test overlap: an unintended early rollback occurs when Flagger's `threshold` (the allowed accumulated failure count) conflicts with the Sequential Test boundary. Even when the Sequential Test stays inside its boundary, simple metric-threshold violations can fill up `threshold`. Separate the roles of the two mechanisms clearly and, wherever possible, let the Sequential Test result alone decide whether to roll back.
- Don't compute the information fraction from time: with τ = elapsed time / total planned time, results lose statistical validity during low-traffic hours (e.g. early morning) because too few samples have arrived. For services with uneven traffic, τ must be computed from sample counts.
- Don't treat `prev_tau` as 0 on every call: computing incremental spending correctly requires the τ of the previous analysis. If every call uses `prev_tau=0`, each analysis effectively becomes an independent single test, and the Alpha Spending error-rate guarantee breaks. Manage per-experiment `prev_tau` in an external state store such as Redis, or at minimum pass it as a request parameter.
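The prev_tau bookkeeping from the last item can be sketched as follows. This is an in-memory stand-in: the `TauStore` class and its `advance` method are hypothetical names, and in production the dict would be replaced by Redis reads/writes.

```python
class TauStore:
    """Tracks the information fraction of the last completed analysis per experiment."""

    def __init__(self):
        # experiment name -> τ of the most recent analysis (Redis in production)
        self._state: dict[str, float] = {}

    def advance(self, experiment: str, tau: float) -> float:
        """Return the previous τ and record the current one (kept non-decreasing)."""
        prev = self._state.get(experiment, 0.0)
        self._state[experiment] = max(prev, tau)  # ignore stale/out-of-order updates
        return prev

store = TauStore()
print(store.advance("my-service", 0.1))  # 0.0 — first analysis
print(store.advance("my-service", 0.3))  # 0.1 — fed into the spending calculation
```

The returned value is exactly what the webhook should pass as `prev_tau` when computing the incremental alpha for the current analysis.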
In Conclusion
The true value of Sequential Testing lies not in "making decisions faster" but in the statistical guarantee that "the error rate is controlled no matter when the decision is made." Threshold comparisons rest on the intuitive judgment "the numbers look bad right now, so roll back"; Alpha Spending-based Sequential Testing quantifies and controls the long-run probability that such a judgment is wrong. That difference translates into team credibility and operational confidence.
However, the implementation in this article is only a starting point: a single metric and a simple prev_tau pass-through. In real production there comes a point where multi-metric FWER correction, Redis-based state management, and Beta-Spending (futility stopping) paired with Alpha Spending become necessary. That point is not the limit of this approach but a signal to grow to the next stage.
3 Steps to Start Right Now:
- Simulate the Alpha Spending boundaries: run `visualize_boundaries.py` locally to see which Z-values the O'Brien-Fleming and Pocock boundaries require at your service's actual target sample size (`target_samples`). Those numbers make it intuitive which method suits the service.
- Deploy a minimal MVP webhook service: first deploy a version of `fetch_canary_metrics` with the Prometheus integration replaced by hard-coded dummy values, point `webhooks.url` in `canary.yaml` at the actual service address, and run `kubectl apply -f canary.yaml`. Verify that the HTTP 200/400 gating flow works as intended.
- Wire up real metrics and tune min_tau: adjust the queries to your service's actual Prometheus label scheme and tune `min_tau` to the service's average request throughput. The goal is for the first meaningful test to happen at τ ≥ 0.05 based on actual sample counts.
Next Post: A Hierarchical Test Design Method to Control FWER in a Multi-Metric Environment by Early Termination of Ineffective Canaries via Futility Stopping with Beta-Spending
Reference Materials
- Flagger Official Documentation — How it works
- Flagger Official Documentation — Webhooks
- Flagger Official Documentation — Metrics Analysis
- GitHub — fluxcd/flagger
- O'Brien-Fleming boundary | Wikipedia
- Alpha Spending Function approach | Penn State STAT 509
- Alpha-Spending Function definition | Analytics ToolKit
- Error Spending in Sequential Testing Explained | Analytics-Toolkit.com
- Sequential Testing for Early Stopping of Online Experiments | ACM SIGIR 2015
- Understanding Group Sequential Testing | Towards Data Science
- Sequential Testing | Amplitude Explore
- Lan-DeMets Method Programs | University of Wisconsin
- Mastering Progressive Delivery with Istio and Flagger | Medium