Implementing Canary Deployment Gating Without Unnecessary Rollbacks with Flagger Webhook — The Complete Guide to Mann-Whitney Statistical Validation Services
Subtitle: Kubernetes Canary Deployment · Python FastAPI · Mann-Whitney U Test · Prometheus · A/B Testing Gating
Any developer who has operated Canary deployments on Kubernetes has likely experienced this at least once. The error rate rose from 0.8% to 1.2%. Should I roll back? Or is it just a traffic spike or temporary noise?
Simple threshold-based gating fails to answer this question. When only 10% of total traffic is routed to the canary, the small sample size is dominated by natural variation: the random noise that appears even when comparing two identical versions. In a service whose P99 latency has a standard deviation of ±30 ms, judging from only 50 canary samples can exceed the threshold with a probability of 30% or more even when nothing is actually wrong. The result is a repeating cycle in which perfectly good deployments get rolled back (false positives) and genuinely bad deployments slip through (false negatives).
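The effect is easy to reproduce. The sketch below uses illustrative numbers (not from any real service): canary windows are drawn from the exact same Normal(180 ms, 30 ms) distribution as the baseline, yet a naive "sample P99 above the true P99" threshold fires far more often than any reasonable alert budget:

```python
import numpy as np

# Simulate naive threshold gating on two IDENTICAL versions.
# Hypothetical latency distribution: Normal(180ms, 30ms), true P99 ~= 249.8ms.
rng = np.random.default_rng(42)
true_p99 = 180 + 2.326 * 30  # z-score of the 99th percentile of a normal

trials = 1000
false_positives = 0
for _ in range(trials):
    canary_window = rng.normal(180, 30, size=50)          # only 50 canary samples
    if np.percentile(canary_window, 99) > true_p99:       # naive threshold check
        false_positives += 1

rate = false_positives / trials
print(f"false positive rate of the naive threshold: {rate:.2f}")
```

The canary is identical to the baseline by construction, so every single trigger here is a spurious rollback.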
This article shows how to build an A/B-test gating layer that is robust to natural variation by wiring a statistical significance testing service, implemented with Python FastAPI and the Mann-Whitney U test, into Flagger's webhook interface. By the end, you will have complete code that plugs directly into your Flagger Canary CRD.
Key Concepts
Before reading this article
This article is intended for teams that are already using Flagger or are considering adopting it. It assumes a basic understanding of Kubernetes Deployment, Canary CRD, and PromQL. For an introduction to Flagger itself, it is recommended to refer to the official documentation first.
Flagger Webhook Gating Mechanism
Flagger calls an external HTTP endpoint during every canary analysis cycle and determines whether to proceed with deployment based solely on the response code.
HTTP 2xx → continue the rollout (increase the traffic weight)
HTTP 4xx/5xx → increment the failure count → roll back once the threshold is exceeded

Webhook types are classified by when they are called and what they are used for.
| Type | When invoked | Main use |
|---|---|---|
| `pre-rollout` | Before traffic starts shifting | Pre-deployment quality check (k6 load test, etc.) |
| `rollout` | Every analysis cycle | Real-time metric-based gating ← the focus of this article |
| `confirm-rollout` | Before advancing to the next step | Manual approval, human intervention in the boundary zone |
| `rollback` | When a rollback occurs | Notification and post-processing |
Core Design Principle: Flagger only looks at the HTTP status code of the webhook response. Any complex operations can be performed internally within the statistics service, and Flagger simply receives the result as 200 or 500. This simple interface is the key to enabling connections to external statistics services.
The Natural Variation Problem and Statistical Solution
If a service whose P99 latency is 180 ms ± 30 ms momentarily records a canary P99 of 210 ms, is that a problem? It could be natural variation within a one-sigma range. Statistical significance testing answers this question mathematically.
- p-value: The probability of observing a difference this large or more extreme by chance, assuming the null hypothesis (that the two distributions are identical) is true. p < 0.05 means "if the null hypothesis were true, a result like this would occur less than 5% of the time"; the lower the value, the stronger the evidence that a real difference exists.
- Confidence Interval (CI): The range of uncertainty regarding the effect size. The larger the sample, the narrower the CI becomes.
- Statistical Power: The probability of detecting an actual difference. It decreases with lower traffic.
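To make the p-value concrete, here is a minimal sketch with synthetic samples: identical canary and baseline distributions yield a large p-value, while a canary where every sample is worse yields a vanishingly small one.

```python
from scipy.stats import mannwhitneyu

baseline = list(range(1, 31))            # synthetic latencies 1..30
identical_canary = list(range(1, 31))    # same values → no real difference
shifted_canary = list(range(31, 61))     # every sample worse than baseline

# H1: baseline is stochastically less than canary, i.e. "the canary is worse"
_, p_same = mannwhitneyu(baseline, identical_canary, alternative="less")
_, p_shift = mannwhitneyu(baseline, shifted_canary, alternative="less")

print(p_same)   # large → no evidence of degradation
print(p_shift)  # tiny → the observed shift is very unlikely to be chance
```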
Which testing method should be chosen?
Infrastructure metrics such as latency and error rate do not follow a normal distribution. They have long tails and are sensitive to outliers.
| Test Method | Distribution Assumptions | Suitable Metrics | Features |
|---|---|---|---|
| Mann-Whitney U | None (non-parametric) | Latency, Error Rate | Adopted by Netflix Kayenta; robust to outliers |
| Welch's t-test | Normal distribution | Response size, etc. | No need for homogeneity of variance assumption, safer than standard t-test |
| Bayesian A/B | None | Conversion Rate, CTR | Intuitive interpretation as "probability that the canary is better" |
Mann-Whitney U Test: A non-parametric test that compares the distributions of two independent samples based on ranks rather than raw values. Because it does not assume normality, it suits skewed distributions such as latency, and a single outlier barely affects the result. This is why Netflix Kayenta adopted it as its core test.
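The rank-based nature is easy to demonstrate with synthetic data: inflating the single largest canary sample to an absurd value leaves the Mann-Whitney p-value untouched, because that value already held the top rank, while the sample mean is dragged tens of thousands of milliseconds off.

```python
from scipy.stats import mannwhitneyu
import numpy as np

baseline = [100 + i for i in range(20)]      # 100..119 ms
canary = [101 + i for i in range(20)]        # 101..120 ms, slight shift
canary_outlier = canary[:-1] + [1_000_000]   # replace the max with a wild outlier

_, p_clean = mannwhitneyu(baseline, canary, alternative="less")
_, p_out = mannwhitneyu(baseline, canary_outlier, alternative="less")

print(p_clean, p_out)  # identical: the rank ordering did not change
print(np.mean(canary), np.mean(canary_outlier))  # means differ by ~50,000 ms
```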
Brief Introduction to Bayesian A/B Testing: The Beta-Bernoulli model is well-suited for metrics dealing with binary outcomes, such as conversion rates and error rates. It sets the prior distribution as Beta(1,1) (uniform prior, no prior knowledge) and updates the posterior distribution to Beta(α+success, β+failure) as the number of successes and failures is observed. Its strength lies in the ability to directly calculate the "probability that the canary's error rate is lower than the baseline," enabling intuitive decision-making without a p-value.
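A minimal sketch of that update rule, with made-up counts: suppose the baseline saw 10 errors in 1,000 requests and the canary 2 errors in 1,000. The posterior probability that the canary's error rate is lower can be estimated by Monte Carlo sampling from the two Beta posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta(1, 1) uniform prior, updated with observed failures/successes.
# Hypothetical counts: baseline 10 errors / 1000 reqs, canary 2 errors / 1000 reqs.
baseline_post = rng.beta(1 + 10, 1 + 990, size=100_000)  # posterior error-rate draws
canary_post = rng.beta(1 + 2, 1 + 998, size=100_000)

p_canary_better = np.mean(canary_post < baseline_post)
print(f"P(canary error rate < baseline) ~= {p_canary_better:.3f}")
```

The resulting probability can be gated directly ("proceed only if P(canary better or equal) exceeds 95%"), with no p-value involved.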
Architecture Overview
The overall flow of the statistical gating layer is as follows.
flowchart LR
F[Flagger Controller] -->|"POST /check\n{name, namespace, metadata}"| S["Statistical\nSignificance Service\n(FastAPI)"]
S -->|PromQL query| P[(Prometheus)]
P -->|quantile time series| S
S -->|"HTTP 200: pass\n{p_value, effect_size}"| F
S -->|"HTTP 500: fail\n{degradation_detected}"| F
F -->|increase traffic| C[Canary Pods]
F -->|roll back| B[Baseline Pods]

Role of each component:
| Component | Role | Implementation |
|---|---|---|
| Flagger Controller | Canary Cycle Management, Webhook Invocation | Flagger CRD |
| Statistical Significance Service | Metric Collection + Statistical Testing + Decision | Python FastAPI |
| Prometheus | Save Canary/Baseline Metrics | Existing Monitoring Stack |
Practical Application
Implementation Option A: Python + FastAPI-based Mann-Whitney Gate
Implement the /check endpoint that Flagger will call as a FastAPI.
Practical Constraints on Prometheus Data Collection: Prometheus does not store individual request latencies. rate(http_request_duration_seconds_bucket[5m]) returns per-bucket counter rates, not per-request samples that could be fed directly into Mann-Whitney. If true per-request samples are required, pair it with a log-based pipeline such as Loki. As a practical alternative, we collect several quantiles of the histogram and use them as approximate distribution samples.
# stat_gate/main.py
import os
import logging
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from scipy.stats import mannwhitneyu  # non-parametric test
from prometheus_api_client import PrometheusConnect  # Prometheus HTTP API client
from prometheus_client import Gauge, start_http_server  # exposes custom metrics
import numpy as np

app = FastAPI()
logger = logging.getLogger(__name__)

# PROMETHEUS_URL: inside a Kubernetes cluster, use the service FQDN.
# Format: http://<service>.<namespace>.svc.cluster.local:<port>
prom = PrometheusConnect(
    url=os.getenv(
        "PROMETHEUS_URL",
        "http://prometheus-server.monitoring.svc.cluster.local:9090"
    )
)

# Expose custom metrics for Grafana visualization (port 8001).
# Add a ServiceMonitor so Prometheus scrapes this port.
gate_p_value = Gauge("canary_gate_p_value", "Mann-Whitney p-value", ["canary", "namespace"])
gate_effect_ms = Gauge("canary_gate_effect_size_ms", "Effect size (ms)", ["canary", "namespace"])
start_http_server(8001)

class GateRequest(BaseModel):
    name: str
    namespace: str
    metadata: dict = {}

def fetch_quantile_samples(
    metric_base: str,
    pod_selector: str,
    duration: str = "5m",
) -> list[float]:
    """
    Collect several quantiles from a Prometheus histogram and return them
    as approximate distribution samples.
    Note: Prometheus does not store individual request latencies.
    The 7 quantiles p10-p99 are used as approximate samples.
    If a more accurate distribution comparison is needed, collect actual
    per-request samples from a log-based pipeline such as Loki instead.
    """
    quantiles = [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]
    samples = []
    for q in quantiles:
        query = (
            f"histogram_quantile({q}, "
            f"rate({metric_base}_bucket{{{pod_selector}}}[{duration}]))"
        )
        try:
            result = prom.custom_query(query)
            for r in result:
                val = r["value"][1]
                if val not in ("NaN", "+Inf", "-Inf"):
                    samples.append(float(val))
        except Exception as e:
            logger.warning("Prometheus query failed (q=%s): %s", q, e)
    return samples

def run_mann_whitney(
    canary: list[float],
    baseline: list[float],
) -> tuple[float, float]:
    """Run the Mann-Whitney U test and return (p_value, effect_size_seconds)."""
    stat, p_value = mannwhitneyu(
        baseline,
        canary,
        alternative="less",  # one-sided: test only "is the canary worse?" → more power
    )
    effect_size = float(np.median(canary) - np.median(baseline))
    return p_value, effect_size

@app.post("/check")
async def statistical_gate(req: GateRequest):
    window = req.metadata.get("window", "5m")
    alpha = float(req.metadata.get("alpha", "0.05"))
    metric = req.metadata.get("metric", "http_request_duration_seconds")
    canary_selector = f'pod=~"{req.name}-[0-9]+-.*",namespace="{req.namespace}"'
    baseline_selector = f'pod=~"{req.name}-primary-[0-9]+-.*",namespace="{req.namespace}"'
    try:
        canary_samples = fetch_quantile_samples(metric, canary_selector, window)
        baseline_samples = fetch_quantile_samples(metric, baseline_selector, window)
    except Exception as e:
        logger.error("Error while collecting metrics: %s", e)
        # Default behavior on a Prometheus outage: pass (fail-open).
        # In production this can be switched to fail-closed (return 500) per team policy.
        return {"result": "metrics_unavailable", "detail": str(e)}
    if len(canary_samples) < 5 or len(baseline_samples) < 5:
        # Too few samples → cannot judge yet; keep waiting for data (pass)
        return {
            "result": "insufficient_data",
            "canary_n": len(canary_samples),
            "baseline_n": len(baseline_samples),
        }
    p_value, effect_size = run_mann_whitney(canary_samples, baseline_samples)
    # Update custom Prometheus metrics (for Grafana visualization)
    gate_p_value.labels(canary=req.name, namespace=req.namespace).set(p_value)
    gate_effect_ms.labels(canary=req.name, namespace=req.namespace).set(effect_size * 1000)
    if p_value < alpha:
        raise HTTPException(
            status_code=500,
            detail={
                "result": "degradation_detected",
                "p_value": round(p_value, 4),
                "effect_size_ms": round(effect_size * 1000, 2),
                "alpha": alpha,
            },
        )
    return {
        "result": "pass",
        "p_value": round(p_value, 4),
        "effect_size_ms": round(effect_size * 1000, 2),
        "canary_n": len(canary_samples),
        "baseline_n": len(baseline_samples),
    }

Key Code Points:
| Code Location | Intent |
|---|---|
| `alternative="less"` | One-sided test: only "is the canary worse?" is tested, halving the p-value for the same data and raising power at the same alpha |
| `n < 5` guard | Too few quantile-based samples to judge → pass |
| `try/except` + fail-open | By default, a Prometheus failure does not block the deployment (adjustable per team policy) |
| `start_http_server(8001)` | Exposes p_value and effect_size as Prometheus metrics → Grafana dashboard integration |
| `raise HTTPException(500)` | The failure signal that Flagger recognizes |
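The custom metrics on port 8001 still need a scrape configuration. A ServiceMonitor along these lines would work; note that the names `stat-svc`, the `monitoring` namespace, and the port name `metrics` are assumptions here and must match your actual Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: stat-gate-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: stat-svc      # must match the labels on your statistics Service
  endpoints:
    - port: metrics      # Service port name pointing at container port 8001
      interval: 30s
```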
Implementation Option B: Flagger Canary CRD Configuration
Connect the statistics service with a webhook of type rollout. Traffic increases only when it passes through the statistics gate during every analysis cycle.
# canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  service:
    port: 80
  analysis:
    interval: 1m      # run an analysis cycle every 60 seconds
    threshold: 3      # roll back after 3 consecutive failures
    maxWeight: 50     # ramp traffic up to at most 50%
    stepWeight: 10    # increase by 10% per cycle
    webhooks:
      - name: statistical-gate
        type: rollout        # called on every analysis cycle
        url: http://stat-svc.monitoring.svc/check
        timeout: 30s         # 30s or more recommended to allow for Prometheus query time
        metadata:
          window: "5m"
          alpha: "0.05"
          metric: "http_request_duration_seconds"
      - name: pre-load-test
        type: pre-rollout    # once, before the first traffic shift
        url: http://flagger-k6-webhook.monitoring.svc/launch
        timeout: 120s
        metadata:
          script: configmap/k6-script/test.js

Meaning of threshold: 3: A rollback is triggered when the statistics gate returns HTTP 500 three consecutive times; one or two transient failures (network latency, Prometheus overload, etc.) will not trigger it. Together with the alpha level, this value is a key knob for controlling sensitivity.
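Under the (optimistic) assumption that consecutive analysis windows are independent, the chance that a healthy canary fails the gate three times in a row is roughly alpha cubed. The back-of-envelope below illustrates this; in reality, overlapping 5m windows are correlated, so treat the number as a lower bound rather than a guarantee:

```python
# Rough probability of a spurious rollback per 3-cycle run,
# assuming independent analysis windows (an optimistic simplification).
alpha = 0.05       # per-cycle false positive rate of the statistical gate
threshold = 3      # consecutive failures Flagger requires before rolling back

p_false_rollback = alpha ** threshold
print(f"~{p_false_rollback:.6f} chance of 3 spurious failures in a row")
```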
Implementation Option C: Boundary Interval Hybrid Manual Approval Gating
When the p-value falls in the "boundary zone" between 0.05 and 0.15, this pattern notifies the team and waits for manual approval instead of deciding automatically.
# stat_gate/main.py — add a confirm-rollout endpoint
import httpx

SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL", "")

# ⚠️ Warning: this dictionary is reset whenever the process restarts.
# In production, replace it with Redis or another external store.
APPROVAL_STORE: dict[str, bool] = {}

async def _run_check(req: GateRequest) -> float:
    """Helper that returns a p-value using the same logic as the /check endpoint."""
    metric = req.metadata.get("metric", "http_request_duration_seconds")
    window = req.metadata.get("window", "5m")
    canary_selector = f'pod=~"{req.name}-[0-9]+-.*",namespace="{req.namespace}"'
    baseline_selector = f'pod=~"{req.name}-primary-[0-9]+-.*",namespace="{req.namespace}"'
    canary_samples = fetch_quantile_samples(metric, canary_selector, window)
    baseline_samples = fetch_quantile_samples(metric, baseline_selector, window)
    if len(canary_samples) < 5 or len(baseline_samples) < 5:
        return 0.5  # not enough samples → return a neutral value
    p_value, _ = run_mann_whitney(canary_samples, baseline_samples)
    return p_value

@app.post("/confirm")
async def confirm_gate(req: GateRequest):
    """Webhook endpoint for the confirm-rollout type."""
    key = f"{req.namespace}/{req.name}"
    # Already manually approved → pass immediately
    if APPROVAL_STORE.get(key):
        del APPROVAL_STORE[key]
        return {"result": "approved"}
    try:
        p_value = await _run_check(req)
    except Exception as e:
        logger.error("confirm gate p-value computation failed: %s", e)
        raise HTTPException(status_code=500, detail={"result": "check_failed"})
    if p_value > 0.15:
        # No evidence of degradation → proceed automatically
        return {"result": "auto_pass", "p_value": round(p_value, 4)}
    elif p_value < 0.05:
        # Strong evidence of degradation → block automatically (matches /check)
        raise HTTPException(
            status_code=500,
            detail={"result": "auto_fail", "p_value": round(p_value, 4)},
        )
    else:
        # Boundary zone: notify Slack, then wait for approval
        await _notify_slack(key, p_value)
        raise HTTPException(
            status_code=500,
            detail={"result": "pending_approval", "p_value": round(p_value, 4)},
        )

@app.post("/approve/{namespace}/{name}")
async def approve(namespace: str, name: str):
    """Called by a Slack bot or an admin UI on manual approval."""
    APPROVAL_STORE[f"{namespace}/{name}"] = True
    return {"result": "stored"}

async def _notify_slack(key: str, p_value: float):
    if not SLACK_WEBHOOK_URL:
        return
    message = {
        "text": (
            f":warning: *{key}* canary p-value={p_value:.3f} (boundary zone 0.05-0.15). "
            f"Manual approval required: `POST /approve/{key}`"
        )
    }
    async with httpx.AsyncClient() as client:
        await client.post(SLACK_WEBHOOK_URL, json=message)

# canary.yaml — add a confirm-rollout webhook
webhooks:
  - name: statistical-gate
    type: rollout
    url: http://stat-svc.monitoring.svc/check
    timeout: 30s
    metadata:
      window: "5m"
      alpha: "0.05"
  - name: human-in-the-loop
    type: confirm-rollout   # approval check before each step proceeds
    url: http://stat-svc.monitoring.svc/confirm
    timeout: 1h             # wait up to 1 hour for manual approval

Implementation Option D: Kayenta-style double gate (p-value + effect size)
The p-value alone is not enough. With a very large sample, even differences too small to matter in practice become statistically significant. Netflix Kayenta therefore uses a double gate: a difference counts as significant only when it passes the statistical test at a 98% confidence level and the effect size falls outside the allowed margin-of-error band.
# stat_gate/kayenta_style.py
from scipy.stats import mannwhitneyu, bootstrap  # bootstrap: non-parametric CI
import numpy as np

def kayenta_style_check(
    baseline: list[float],
    canary: list[float],
    alpha: float = 0.02,                 # Kayenta default: 98% confidence (alpha=0.02)
    allowed_increase_ratio: float = 0.1  # significant if the median rises by 10%+
) -> tuple[bool, dict]:
    """
    Double gate in the spirit of Kayenta's core statistical check:
    1. Mann-Whitney U p-value < alpha
    2. Effect size (relative median difference) > allowed_increase_ratio
    Fail only when BOTH conditions hold.
    """
    stat, p_value = mannwhitneyu(baseline, canary, alternative="less")
    baseline_median = np.median(baseline)
    canary_median = np.median(canary)
    effect_ratio = (canary_median - baseline_median) / (baseline_median + 1e-10)

    # 95% bootstrap confidence interval for the median difference (reported
    # for transparency). The statistic passed to scipy.stats.bootstrap must
    # follow the (*data, axis) signature, and the default method="BCa"
    # supports only one-sample statistics, so "percentile" is used here.
    def diff_medians(*arrays, axis):
        return np.median(arrays[1], axis=axis) - np.median(arrays[0], axis=axis)

    data = (np.array(baseline), np.array(canary))
    ci_result = bootstrap(
        data,
        diff_medians,
        confidence_level=0.95,
        n_resamples=1000,
        method="percentile",  # BCa raises an error for two-sample statistics
        random_state=42,
    )
    ci_low, ci_high = ci_result.confidence_interval

    # Double gate: statistically significant AND effect size over the threshold
    is_degradation = (p_value < alpha) and (effect_ratio > allowed_increase_ratio)
    return not is_degradation, {
        "p_value": round(p_value, 4),
        "effect_ratio": round(effect_ratio, 4),
        "ci_95": [round(ci_low, 4), round(ci_high, 4)],
        "allowed_increase_ratio": allowed_increase_ratio,
    }

Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Noise Robustness | Prevents unnecessary rollbacks by statistically eliminating false positives caused by natural variation |
| Scalability | Uses Flagger's existing webhook interface as is. Can be implemented without changing CRDs |
| Transparency | Automatically configures audit trail by returning p-values and confidence intervals as JSON |
| Flexibility | Different testing methods can be applied depending on the metric type (Latency, Error Rate, CTR) |
| Independence | Since the statistics service is a separate microservice, the testing logic can be changed independently of Flagger upgrades |
| ML Deployment Suitability | The same pattern can gate business metrics (conversion rate, CTR) for LLM and recommendation-model deployments |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Sample size dependence | Low traffic means low statistical power, producing false negatives | Pass on insufficient samples; reduce the required sample size with CUPED |
| Multiple comparison problem | Type I error inflates when several metrics are tested simultaneously | Apply Bonferroni or Holm-Bonferroni correction |
| Sequential testing pitfalls | Alpha inflation occurs under repeated testing | Applying an alpha spending function is essential (covered in the next post) |
| Webhook timeout | A short timeout fails the check before metrics are collected | Set timeout: 30s or higher; optimize Prometheus queries |
| Operational complexity | The statistics service itself now needs availability and monitoring | Autoscale with HPA; add a /health check endpoint |
| Alpha level selection | Too low blocks valid deployments; too high passes bad ones | Start at 0.05; adjust incrementally based on rollback history |
| APPROVAL_STORE volatility | The in-memory dictionary loses approval data on process restart | Replace with Redis in production |
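For the multiple-comparison problem, Holm-Bonferroni can be implemented in a few lines. The sketch below uses made-up p-values for three hypothetical metrics:

```python
def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return a reject/keep decision per p-value using Holm's step-down procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices sorted by p-value
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, keep all larger p-values
    return reject

# Hypothetical per-metric p-values: latency, error rate, CPU
decisions = holm_bonferroni([0.01, 0.04, 0.03])
print(decisions)  # only the 0.01 survives correction here
```

Without correction, both 0.01 and 0.04 would have passed the naive 0.05 cutoff.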
The Most Common Mistakes in Practice
- Judging immediately with too few samples: with canary traffic under 5%, even 30 minutes may not accumulate enough data. Running the test without a minimum-sample-size guard lets a bad distribution pass due to insufficient power. Implement a guard that passes on the `n < threshold` condition.
- Testing for degradation with a two-tailed test: when the only question is "is the canary worse than the baseline?", a two-tailed test doubles the p-value for the same data, cutting effective power at the same alpha. Use a one-tailed test with `alternative="less"`.
- Setting the webhook timeout shorter than the Prometheus query time: aggregation queries can take 10 to 15 seconds when Prometheus is under heavy load. With `timeout: 5s`, the webhook times out before data collection finishes and Flagger counts it as a failure. Set at least `timeout: 30s`.
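The two-tailed mistake is easy to verify numerically with synthetic samples: when the shift is in the tested direction, the two-sided p-value is exactly double the one-sided one, so the one-sided test reaches significance with a smaller observed effect.

```python
from scipy.stats import mannwhitneyu

baseline = [float(i) for i in range(100, 140)]        # synthetic latencies, no ties
canary = [float(i) + 0.5 for i in range(105, 145)]    # shifted upward (worse)

_, p_one = mannwhitneyu(baseline, canary, alternative="less")       # "canary worse?"
_, p_two = mannwhitneyu(baseline, canary, alternative="two-sided")

print(p_one, p_two)  # the two-sided p-value is double the one-sided one
```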
In Conclusion
By connecting a statistical significance testing service to Flagger's webhook interface, you can implement the same level of rigor as Kayenta's core statistical testing methods on your own infrastructure with just a single microservice and dozens of lines of YAML.
3 Steps to Start Right Now:
- Verify the statistics service locally: set up the environment with `pip install fastapi scipy numpy prometheus-api-client uvicorn prometheus-client httpx` and run `uvicorn main:app --reload`. Check the response with:

curl -X POST localhost:8000/check \
  -H "Content-Type: application/json" \
  -d '{"name":"test","namespace":"default","metadata":{"window":"5m","alpha":"0.05","metric":"http_request_duration_seconds"}}'

- Connect a `rollout` webhook in a staging cluster: add a `webhooks` block to the existing Canary CRD with `type: rollout` and `threshold: 3`. Start loose with `alpha: "0.10"` and tighten it while watching the rollback frequency.
- Visualize test results in a Grafana dashboard: the Option A code already exposes the `canary_gate_p_value` and `canary_gate_effect_size_ms` metrics on port 8001. Add a ServiceMonitor so Prometheus scrapes this port, and per-cycle p-value trends become visible in Grafana immediately. As this data accumulates, it becomes the basis for alpha-level tuning.
If you want to know more
CUPED (Controlled-experiment Using Pre-Existing Data): A technique that reduces variance by using metric data from the pre-deployment period as covariates. It is particularly useful for low-traffic services as it allows statistical significance to be reached more quickly with the same traffic. Optimizely and Kameleoon have adopted it as their primary statistical method.
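A minimal CUPED sketch with synthetic data (pre-period latency as covariate X, experiment-period latency as Y): the adjusted metric Y − θ(X − mean(X)), with θ = cov(X, Y)/var(X), keeps the same mean as Y while its sample variance can never exceed that of Y, shrinking by roughly a factor of (1 − ρ²):

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic per-user latency: pre-period X correlated with experiment-period Y.
n = 5_000
x = rng.normal(180, 30, size=n)           # pre-deployment covariate
y = 0.8 * x + rng.normal(40, 15, size=n)  # experiment metric, correlated with x

theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # OLS slope = cov(X,Y)/var(X)
y_cuped = y - theta * (x - x.mean())            # variance-reduced metric, same mean

reduction = 1 - np.var(y_cuped, ddof=1) / np.var(y, ddof=1)
print(f"variance reduced by {reduction:.0%}")
```

With less variance in the gated metric, the same traffic yields narrower confidence intervals and faster significance.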
Alpha Spending Function: A method for allocating the total alpha budget to be used over the entire experimental period in Sequential Testing. When applied to the Flagger gating layer, it allows the overall Type I error rate to be controlled below the target alpha, even when performing repeated tests in every analysis cycle. The implementation will be covered in the next post.
Next Part
A method to implement early stopping by applying Sequential Testing and Alpha Spending functions to the Flagger gating layer — Faster rollback, less user damage
Reference Materials
- Flagger Webhooks | Official Documentation
- Flagger Deployment Strategies (A/B Testing) | Official Documentation
- Flagger GitHub | fluxcd/flagger
- Automated Canary Analysis at Netflix with Kayenta | Netflix TechBlog
- How Canary Judgment Works (Mann-Whitney U) | Spinnaker
- Kayenta Canary Config Documentation | GitHub
- Grafana flagger-k6-webhook | GitHub
- Istio A/B Testing Tutorial | Flagger
- CUPED in A/B Testing | Optimizely
- Best Statistical Model for A/B Testing | AB Tasty
- Argo Rollouts Analysis & Progressive Delivery | Official Documentation
- A/B Testing with Linkerd and Flagger | InfraCloud