Implementing Alpha Spending Sequential Testing in a Flagger Webhook — Cutting Average Canary Analysis Time by Up to 66% with Statistical Early Exit
Anyone who has managed canary deployments knows this situation. You launch a new version at 10% traffic, glance at the metrics five minutes later, and see the error rate ticking up. The instinct that "there aren't enough samples yet, so wait a bit longer" clashes with the instinct that "the bad signs are already visible, so why wait?" You end up refreshing the dashboard in this ambiguous state, only to press the rollback button manually, with no statistical basis. Or the opposite happens: the automatic threshold fires too late, and more users suffer in the meantime.
The common cause of both scenarios is the "peeking problem" of fixed-sample A/B testing. Checking the data repeatedly inflates the Type I error rate (false positives), undermining statistical reliability, and simple threshold comparisons do nothing to fix it. By combining Sequential Testing and the Alpha Spending function with Flagger's webhook gating layer, every interim analysis point is tested in a statistically valid way, the process exits automatically as soon as a problem is confirmed, and the average experiment duration drops by up to 66%. This article walks through the process step by step, from the mathematical principles of the Alpha Spending function to implementing and integrating the Flagger webhook service.
Key Concepts
Peeking Problem: Why Repeated Checking Is Dangerous
The fixed-sample test assumes the test is performed exactly once, after all N predetermined samples have been collected. In a canary deployment, however, the metric is checked every minute. If the check is repeated k times, the actual Type I error rate ends up far above the nominal significance level α = 0.05.
```python
import numpy as np

def peeking_inflation(alpha: float, num_peeks: int) -> float:
    """Estimate the actual Type I error rate under repeated testing
    (upper bound assuming independent looks — conservative vs. reality)."""
    # Conservative bound that treats each look as independent; with actually
    # correlated cumulative observations the inflation is lower (~0.2 at 12 peeks).
    return 1 - (1 - alpha) ** num_peeks

# Peeking 12 times over 60 minutes at 5-minute intervals:
print(peeking_inflation(0.05, 12))  # ≈ 0.46 (independence upper bound)
# Simulation-based estimates land around 0.2, but either way the
# target error rate of 0.05 is exceeded several times over.
```
Sequential Testing solves this problem by handling interim analyses in a statistically valid way. The core idea is to "spend" the total significance level α little by little at each interim analysis point, distributed so that the total never exceeds α.
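The closed-form bound above can be sanity-checked empirically. The sketch below runs an A/A Monte Carlo (no real effect in either arm) and peeks 12 times at a naive fixed-sample threshold; the numbers of simulations and samples per peek are illustrative choices, not from the article's service.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims, n_per_peek, peeks = 2000, 500, 12
false_positives = 0
for _ in range(n_sims):
    # A/A test: both arms draw from the same distribution (no real effect)
    a = rng.normal(size=n_per_peek * peeks)
    b = rng.normal(size=n_per_peek * peeks)
    for k in range(1, peeks + 1):
        n = k * n_per_peek
        # Z-statistic for the difference in means (known variance 1 per arm)
        z = (a[:n].mean() - b[:n].mean()) / np.sqrt(2 / n)
        if abs(z) > 1.96:  # naive fixed-sample threshold at every peek
            false_positives += 1
            break

print(false_positives / n_sims)  # ≈ 0.2 — roughly 4x the nominal 0.05
```

Because consecutive looks share data, the realized inflation sits well below the independence bound of 0.46, yet still far above 0.05.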
Information Fraction τ ∈ [0, 1]: the ratio of samples collected so far to the total planned sample size. τ = 0.3 means 30% of the plan is complete. It must be computed from sample counts rather than elapsed time, so that statistical validity holds even for services with uneven traffic.
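Computed from sample counts, τ is a one-liner; the `collected` and `target` values here are illustrative:

```python
# Sample-based information fraction; numbers are illustrative
collected, target = 3000, 10000
tau = min(collected / target, 1.0)  # cap at 1.0 once the target is reached
print(tau)  # 0.3
```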
Alpha Spending Function: Three Strategies
The Alpha Spending function proposed by Lan and DeMets (1983) defines the cumulative consumption α(τ) according to the information fraction τ.
It is important to first understand the background. The O'Brien-Fleming (1979) and Pocock methods were originally developed based on the premise of equal-spaced looks. In canary deployments, analysis timing can become irregular depending on traffic volume, and for such cases, the Alpha Spending extension of Lan-DeMets is required. Rather than viewing the three methods simply as "different styles," they should be understood in the context that O'Brien-Fleming/Pocock is suitable when the analysis schedule is fixed, while Lan-DeMets is suitable when the schedule is fluid.
| Method | Initial Conservatism | Early Exit Probability | Final Nominal Significance Level | Recommended Use Case |
|---|---|---|---|---|
| O'Brien-Fleming | Very High | Low (Strong Evidence Required) | Almost Identical to a Single Test | Regression Detection, Safe Stop, Fixed Analysis Schedule |
| Pocock | Low | High (Equal Distribution) | Conservative (Decreases) | Fast Superiority Confirmation, Fixed Analysis Schedule |
| Lan-DeMets | Flexible | Flexible | Flexible | When the number of analyses or timing changes during the experiment |
```python
import numpy as np
from scipy import stats

def obrien_fleming(tau: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming cumulative alpha spending."""
    if tau <= 0:
        return 0.0
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    return 2 * (1 - stats.norm.cdf(z_alpha / np.sqrt(tau)))

def pocock(tau: float, alpha: float = 0.05) -> float:
    """Pocock cumulative alpha spending."""
    return alpha * np.log(1 + (np.e - 1) * tau)

# Cumulative alpha spent by each method at τ=0.1 (10% progress)
print(f"O'Brien-Fleming @ τ=0.1: {obrien_fleming(0.1):.10f}")  # ≈ 6e-10
print(f"Pocock @ τ=0.1: {pocock(0.1):.6f}")  # ≈ 0.007928
```
O'Brien-Fleming spends essentially nothing early on. At τ=0.1 it has consumed only about 6×10⁻¹⁰ of the α budget, so at a first look there the Z-statistic must exceed roughly 6.2 for an early exit — by design, only truly severe regressions get caught in the first 10%. Pocock, by contrast, has already spent about 16% of the total α (≈ 0.0079) at the same point, making an early exit possible at roughly the Z ≈ 2.7 level.
incremental_alpha (incremental consumption) is the core mechanism of this approach. The total α budget (e.g., 0.05) is treated like points: at each interim analysis, the amount that may be spent this time is the difference in cumulative spending since the previous analysis. Alpha spent at one analysis cannot be reused at the next, and even after all analyses the total spent is guaranteed not to exceed 0.05.
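A quick check of this budget property: over any analysis schedule, the increments sum back to exactly the total α. A minimal sketch using the O'Brien-Fleming spending function defined above, with an example four-look schedule:

```python
import numpy as np
from scipy import stats

def obrien_fleming(tau: float, alpha: float = 0.05) -> float:
    # Cumulative O'Brien-Fleming spending, as defined earlier
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    return 2 * (1 - stats.norm.cdf(z_alpha / np.sqrt(tau)))

looks = [0.25, 0.5, 0.75, 1.0]        # an example analysis schedule
spent = [obrien_fleming(t) for t in looks]
increments = np.diff([0.0] + spent)   # per-analysis alpha budgets
print(increments.sum())               # ≈ 0.05 — exactly the total budget
```

The sum telescopes to the cumulative spending at τ = 1, which is α by construction, regardless of how many looks the schedule contains.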
Flagger Gating Layer Architecture
Flagger checks the metrics and calls its webhooks every analysis.interval. This structure maps exactly onto the interim-analysis loop of Sequential Testing.
```
[Flagger Canary Controller]
        │
        │  webhook POST every interval
        ▼
[Sequential Test Service] ←─── [Prometheus / Metrics Store]
        │                        (current canary metrics)
        │
        │  1. Compute current τ (collected samples / target samples)
        │  2. Compute incremental_alpha (this analysis's α budget)
        │  3. Derive the critical Z-value
        │  4. Compare against the observed Z-statistic
        ▼
    |Z| > boundary?
        ├── YES → HTTP 400 → Flagger failure count +1 → auto-rollback once threshold is exceeded
        └── NO  → HTTP 200 → canary proceeds
```
Practical Application
Flagger Canary Analysis Settings
First, register a Sequential Testing webhook to Flagger's Canary resource. Each interval becomes an intermediate analysis point.
```yaml
# canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  progressDeadlineSeconds: 600
  service:
    port: 80
  analysis:
    interval: 1m      # every minute = one interim analysis
    threshold: 3      # roll back after 3 consecutive failures
    maxWeight: 50     # at most 50% traffic
    stepWeight: 10    # +10% per iteration
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
    webhooks:
      - name: sequential-test-gate
        type: rollout # gate during each analysis iteration
        url: http://seq-test-service.monitoring/check
        timeout: 30s
        metadata:
          alpha: "0.05"
          method: "obrien_fleming"
          # Flagger metadata fields only accept strings, so numbers are passed as strings
          target_samples: "10000"
          metric: "error_rate"
          canary_name: "my-service"
```
A type: rollout webhook runs during the analysis on every iteration, before the metric checks (a pre-rollout webhook, by contrast, runs only once, before traffic is first routed to the canary). If the sequential test returns a failure, traffic does not advance and only the failure count is incremented. Once threshold is reached, Flagger executes an automatic rollback.
Webhook Service Implementation: AlphaSpendingTest Core Logic
Below is the FastAPI service called via the Flagger webhook. First, let's take a look at the core class for statistical testing.
Background for readers unfamiliar with Z-statistics: the Z-statistic expresses an observed difference in units of standard errors, indicating how implausible it would be under pure chance. scipy.stats.norm.ppf is the quantile function (inverse CDF) of the normal distribution: given a probability, it returns the Z-value below which that much probability mass lies. |Z| > 1.96 means significance at the α = 0.05 level for a single two-sided test.
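A two-line illustration of the quantile function:

```python
from scipy import stats

# Inverse CDF: the Z-value below which 97.5% of the mass lies
z = stats.norm.ppf(1 - 0.05 / 2)
print(round(z, 2))  # 1.96
```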
```python
# seq_test_service.py — core statistical logic
import numpy as np
from scipy import stats

class AlphaSpendingTest:
    def __init__(self, alpha: float = 0.05, method: str = "obrien_fleming"):
        self.alpha = alpha
        self.method = method

    def spending_function(self, tau: float) -> float:
        """Cumulative alpha to be spent up to information fraction tau."""
        tau = max(tau, 1e-10)  # guard against division by zero
        if self.method == "obrien_fleming":
            z_alpha = stats.norm.ppf(1 - self.alpha / 2)
            return 2 * (1 - stats.norm.cdf(z_alpha / np.sqrt(tau)))
        elif self.method == "pocock":
            return self.alpha * np.log(1 + (np.e - 1) * tau)
        else:
            raise ValueError(f"Unknown method: {self.method}")

    def get_critical_z(self, tau: float, prev_tau: float = 0.0) -> float:
        """
        Critical Z-value for the current interim analysis.
        prev_tau: τ at the previous analysis. The caller is responsible for
        tracking prior spending and passing it in; the default of 0 means
        this is the first analysis. Leaving prev_tau at 0 on every call means
        cumulative spending is not tracked correctly, so it must come from a
        state store (e.g. Redis) or a request parameter.
        """
        # incremental_alpha: the remaining alpha budget spendable at this analysis
        incremental_alpha = (
            self.spending_function(tau) - self.spending_function(prev_tau)
        )
        # Numerical stability: keep the boundary from blowing up to infinity
        incremental_alpha = max(incremental_alpha, 1e-10)
        return stats.norm.ppf(1 - incremental_alpha / 2)

    def evaluate(self, z_stat: float, tau: float, prev_tau: float = 0.0) -> dict:
        boundary = self.get_critical_z(tau, prev_tau)
        return {
            "stop": abs(z_stat) > boundary,
            "z_stat": round(z_stat, 4),
            "boundary": round(boundary, 4),
            "tau": round(tau, 4),
            "incremental_alpha": round(
                self.spending_function(tau) - self.spending_function(prev_tau), 6
            ),
        }

def compute_z_stat(
    canary_error_rate: float,
    baseline_error_rate: float,
    n_canary: int,
    n_baseline: int,
) -> float:
    """
    Z-statistic for a two-sample difference in proportions.
    Why the pooled standard error: under the null hypothesis (the two rates
    are equal), the pooled SE is statistically more efficient. The unpooled
    SE can be preferable when the alternative is true, but for regression
    detection in canary deployments, testing under the null is standard practice.
    """
    p_pool = (
        canary_error_rate * n_canary + baseline_error_rate * n_baseline
    ) / (n_canary + n_baseline)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_canary + 1 / n_baseline))
    if se < 1e-10:
        return 0.0
    return (canary_error_rate - baseline_error_rate) / se
```
The roles of each function can be summarized as follows.
| Function | Role |
|---|---|
| `spending_function(tau)` | Cumulative alpha spent up to τ |
| `get_critical_z(tau, prev_tau)` | Critical Z-value for the current interim analysis; `prev_tau` state management is the caller's responsibility |
| `compute_z_stat()` | Pooled-SE two-sample proportion Z-statistic |
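To get a feel for the pooled-SE formula, here is a quick by-hand computation with hypothetical numbers — a canary at 2.4% errors on 3,000 samples vs. a baseline at 1.0% on 30,000, not taken from any real service:

```python
import numpy as np

# Hypothetical error rates and sample counts
p_c, p_b, n_c, n_b = 0.024, 0.010, 3000, 30000
p_pool = (p_c * n_c + p_b * n_b) / (n_c + n_b)  # pooled proportion under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_b))
z = (p_c - p_b) / se
print(round(z, 2))  # ≈ 6.93 — past any reasonable boundary, so the test stops
```

A regression this large clears even the most conservative early O'Brien-Fleming boundary, which is exactly the kind of case sequential gating is meant to catch quickly.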
Webhook Service Implementation: Prometheus Integration and Endpoints
```python
# seq_test_service.py — FastAPI endpoints and metric collection
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import httpx
import os

app = FastAPI()
PROMETHEUS_URL = os.getenv("PROMETHEUS_URL", "http://prometheus:9090")

async def fetch_canary_metrics(canary_name: str) -> dict:
    """
    Collect canary metrics from Prometheus.
    Note: the label in the queries below (app="{canary_name}") is an example.
    Adjust it to your service's actual label scheme (app, deployment, service, ...).
    The primary deployment is assumed to use the "{canary_name}-primary" label,
    but this can differ in a real Flagger environment.
    """
    async with httpx.AsyncClient() as client:
        error_rate_query = (
            f'sum(rate(http_requests_total{{app="{canary_name}",'
            f'status=~"5.."}}[1m])) / '
            f'sum(rate(http_requests_total{{app="{canary_name}"}}[1m]))'
        )
        sample_count_query = (
            f'sum(increase(http_requests_total{{app="{canary_name}"}}[1h]))'
        )
        resp_error = await client.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": error_rate_query},
        )
        resp_count = await client.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": sample_count_query},
        )
        error_data = resp_error.json()["data"]["result"]
        count_data = resp_count.json()["data"]["result"]
        # Treat an empty query result as "metrics not collected yet"
        if not error_data or not count_data:
            raise ValueError(f"No metrics found for '{canary_name}'. "
                             "Check your label scheme.")
        error_rate = float(error_data[0]["value"][1])
        sample_count = float(count_data[0]["value"][1])
        return {"error_rate": error_rate, "sample_count": sample_count}

class WebhookPayload(BaseModel):
    metadata: dict = {}

@app.post("/check")
async def sequential_test_gate(payload: WebhookPayload):
    meta = payload.metadata
    alpha = float(meta.get("alpha", "0.05"))
    method = meta.get("method", "obrien_fleming")
    # Flagger metadata arrives as strings, so convert to int
    target_samples = int(meta.get("target_samples", "10000"))
    canary_name = meta.get("canary_name", "unknown")
    min_tau = float(meta.get("min_tau", "0.05"))  # minimum warm-up: 5%
    # prev_tau state management: this implementation takes it as a request
    # parameter. In production, keep per-experiment prev_tau in an external
    # state store (e.g. Redis) so incremental spending accumulates correctly
    # across analyses.
    prev_tau = float(meta.get("prev_tau", "0.0"))
    try:
        canary_metrics = await fetch_canary_metrics(canary_name)
        baseline_metrics = await fetch_canary_metrics(f"{canary_name}-primary")
    except Exception as e:
        # Pass on metric-collection failure (fail-open policy).
        # Switch to fail-closed if your service demands it.
        return {"status": "metrics_unavailable", "reason": str(e)}
    n_canary = int(canary_metrics["sample_count"])
    tau = n_canary / target_samples
    # Pass while the minimum warm-up is unmet (tiny τ makes the boundary extreme)
    if tau < min_tau:
        return {"status": "warming_up", "tau": tau, "min_tau": min_tau}
    z_stat = compute_z_stat(
        canary_metrics["error_rate"],
        baseline_metrics["error_rate"],
        n_canary,
        max(int(baseline_metrics["sample_count"]), 1),
    )
    tester = AlphaSpendingTest(alpha=alpha, method=method)
    result = tester.evaluate(z_stat, min(tau, 1.0), prev_tau)
    if result["stop"]:
        raise HTTPException(
            status_code=400,
            detail={
                "reason": "sequential_test_boundary_exceeded",
                **result,
            },
        )
    return {"status": "pass", **result}
```
Boundary Visualization (Standalone Script for Local Analysis)
The code below is a standalone analysis script, separate from the service code (seq_test_service.py). Use it locally before deployment to see which boundaries will apply.
```python
# visualize_boundaries.py — local analysis only; not part of the service
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def obf_boundary(taus, alpha=0.05):
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    boundaries, prev_spend = [], 0
    for tau in taus:
        curr_spend = 2 * (1 - stats.norm.cdf(z_alpha / np.sqrt(tau)))
        inc = max(curr_spend - prev_spend, 1e-10)
        boundaries.append(stats.norm.ppf(1 - inc / 2))
        prev_spend = curr_spend
    return boundaries

def pocock_boundary(taus, alpha=0.05):
    boundaries, prev_spend = [], 0
    for tau in taus:
        curr_spend = alpha * np.log(1 + (np.e - 1) * tau)
        inc = max(curr_spend - prev_spend, 1e-10)
        boundaries.append(stats.norm.ppf(1 - inc / 2))
        prev_spend = curr_spend
    return boundaries

taus = np.linspace(0.01, 1.0, 100)
obf = obf_boundary(taus)
poc = pocock_boundary(taus)

plt.figure(figsize=(10, 5))
plt.plot(taus, obf, label="O'Brien-Fleming", linewidth=2)
plt.plot(taus, poc, label="Pocock", linewidth=2, linestyle="--")
plt.axhline(y=1.96, color="gray", linestyle=":", label="Single test Z=1.96")
plt.xlabel("Information fraction τ")
plt.ylabel("Critical Z-value")
plt.title("Alpha Spending boundary comparison (target_samples=10000)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig("boundary_comparison.png", dpi=150, bbox_inches="tight")
```
In the resulting plot, the O'Brien-Fleming boundary starts extremely high — at the earliest looks it sits in roughly the Z ≈ 4.5–6.5 range, depending on the analysis schedule — so only very strong signals allow an early exit, which makes it ideal for safe stops on severe regressions (error-rate surges, latency spikes). The Pocock boundary is far lower and nearly flat from the first look, which is what makes it suitable for faster superiority confirmation.
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Fast Rollback | Automatic exit as soon as a statistically significant issue is detected; reduces average experiment duration by up to 66% versus a fixed-sample design |
| Minimizing User Harm | Detect issues in a small amount of Canary traffic and block them before exposure to full users |
| Type I Error Control | Solves the peeking problem with Alpha Spending. The total error rate does not exceed the predefined α. |
| Flexible Schedule | The Lan-DeMets method allows changing the number of intermediate analyses and timing during the experiment |
| Futility Stopping Extension | Adding Beta-Spending stops futile experiments early and prevents resource waste |
| Leveraging Existing Infrastructure | Implementable by simply adding a webhook service on top of the Flagger + Prometheus stack |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Initial Conservatism | O'Brien-Fleming's initial boundaries are too high, potentially failing to catch obvious problems | Use O'Brien-Fleming for high-severity metrics (error rates) and Pocock for performance metrics |
| Slightly Larger Sample Size | The total sample size required for equivalent power is somewhat higher than for a single test | Set the target sample size with a 10–15% margin |
| Implementation Complexity | Requires an external statistics service and prev_tau state management; a failure policy must be decided | Explicitly choose fail-open (pass when the service is down) or fail-closed |
| Metric Lag | Prometheus scrape interval (default 15s~1m) affects τ calculation accuracy | Set interval to at least 3 times the scrape interval |
| Multiple Metrics | Applying independent Alpha Spending to several metrics inflates the overall FWER | Bonferroni correction (α/k) or a hierarchical test design |
| Cold Start | When τ is very small, the boundary is extremely high and virtually meaningless | Minimum warm-up period (min_tau ≥ 0.05) must be set |
FWER (Family-Wise Error Rate): the probability of at least one false positive across multiple comparisons. With k metrics, the simplest remedy is the Bonferroni correction, which lowers each test's significance level to α/k. For example, to hold α = 0.05 across 5 metrics, each test uses α = 0.01.
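The arithmetic, as a sketch (the independence assumption is only used to bound the resulting FWER):

```python
alpha, k = 0.05, 5
per_metric = alpha / k                   # Bonferroni: each metric tested at 0.01
fwer_bound = 1 - (1 - per_metric) ** k   # FWER under independent metrics
print(per_metric, round(fwer_bound, 4))  # 0.01 0.049 — back under the 0.05 budget
```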
The Most Common Mistakes in Practice
- Don't let `threshold` and the Sequential Test overlap: an unintended early rollback occurs when Flagger's `threshold` (the allowed accumulated failure count) conflicts with the Sequential Test boundary. Even when the Sequential Test stays inside its boundary, simple metric-threshold violations can fill up `threshold`. Separate the roles of the two mechanisms clearly and, wherever possible, let the Sequential Test result alone decide whether to roll back.
- Don't compute the information fraction from time: with τ = elapsed time / total planned time, results lose statistical validity during low-traffic hours (e.g. early morning) because too few samples have arrived. For services with uneven traffic, τ must be computed from sample counts.
- Don't treat `prev_tau` as 0 on every call: computing incremental spending correctly requires the τ of the previous analysis. If every call uses `prev_tau=0`, each analysis effectively becomes an independent single test, and the Alpha Spending error-rate guarantee breaks. Manage per-experiment `prev_tau` in an external state store such as Redis, or at minimum pass it as a request parameter.
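The prev_tau bookkeeping from the last item can be sketched as follows. This is an in-memory stand-in: the `TauStore` class and its `advance` method are hypothetical names, and in production the dict would be replaced by Redis reads/writes.

```python
class TauStore:
    """Tracks the information fraction of the last completed analysis per experiment."""

    def __init__(self):
        # experiment name -> τ of the most recent analysis (Redis in production)
        self._state: dict[str, float] = {}

    def advance(self, experiment: str, tau: float) -> float:
        """Return the previous τ and record the current one (kept non-decreasing)."""
        prev = self._state.get(experiment, 0.0)
        self._state[experiment] = max(prev, tau)  # ignore stale/out-of-order updates
        return prev

store = TauStore()
print(store.advance("my-service", 0.1))  # 0.0 — first analysis
print(store.advance("my-service", 0.3))  # 0.1 — fed into the spending calculation
```

The returned value is exactly what the webhook should pass as `prev_tau` when computing the incremental alpha for the current analysis.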
In Conclusion
The true value of Sequential Testing lies not in "making decisions faster" but in the statistical guarantee that "the error rate is controlled no matter when the decision is made." Threshold comparisons rest on the intuitive judgment "the numbers look bad right now, so roll back"; Alpha Spending-based Sequential Testing quantifies and controls the long-run probability that such a judgment is wrong. That difference translates into team credibility and operational confidence.
However, the implementation in this article is only a starting point: a single metric and a simple prev_tau pass-through. In real production there comes a point where multi-metric FWER correction, Redis-based state management, and Beta-Spending (futility stopping) paired with Alpha Spending become necessary. That point is not the limit of this approach but a signal to grow to the next stage.
3 Steps to Start Right Now:
- Simulate the Alpha Spending boundaries: run `visualize_boundaries.py` locally to see which Z-values the O'Brien-Fleming and Pocock boundaries require at your service's actual target sample size (`target_samples`). Those numbers make it intuitive which method suits the service.
- Deploy a minimal MVP webhook service: first deploy a version of `fetch_canary_metrics` with the Prometheus integration replaced by hard-coded dummy values, point `webhooks.url` in `canary.yaml` at the actual service address, and run `kubectl apply -f canary.yaml`. Verify that the HTTP 200/400 gating flow works as intended.
- Wire up real metrics and tune min_tau: adjust the queries to your service's actual Prometheus label scheme and tune `min_tau` to the service's average request throughput. The goal is for the first meaningful test to happen at τ ≥ 0.05 based on actual sample counts.
Next Post: A Hierarchical Test Design Method to Control FWER in a Multi-Metric Environment by Early Termination of Ineffective Canaries via Futility Stopping with Beta-Spending
Reference Materials
- Flagger Official Documentation — How it works
- Flagger Official Documentation — Webhooks
- Flagger Official Documentation — Metrics Analysis
- GitHub — fluxcd/flagger
- O'Brien-Fleming boundary | Wikipedia
- Alpha Spending Function approach | Penn State STAT 509
- Alpha-Spending Function definition | Analytics ToolKit
- Error Spending in Sequential Testing Explained | Analytics-Toolkit.com
- Sequential Testing for Early Stopping of Online Experiments | ACM SIGIR 2015
- Understanding Group Sequential Testing | Towards Data Science
- Sequential Testing | Amplitude Explore
- Lan-DeMets Method Programs | University of Wisconsin
- Mastering Progressive Delivery with Istio and Flagger | Medium