Implementing Canary Deployment Gating Without Unnecessary Rollbacks with Flagger Webhook — The Complete Guide to Mann-Whitney Statistical Validation Services
Subtitle: Kubernetes Canary Deployment · Python FastAPI · Mann-Whitney U Test · Prometheus · A/B Testing Gating
Any developer who has operated Canary deployments on Kubernetes has likely experienced this at least once. The error rate rose from 0.8% to 1.2%. Should I roll back? Or is it just a traffic spike or temporary noise?
Simple threshold-based gating fails to answer this question. When only 10% of total traffic is routed to the canary, the small sample size is dominated by natural variation: the random noise that appears even when comparing two identical versions. In a service whose P99 latency has a standard deviation of ±30 ms, judging from only 50 canary samples can exceed the threshold with a probability of 30% or more even when nothing is actually wrong. The result is a repeating cycle in which perfectly good deployments get rolled back (false positives) and genuinely bad deployments slip through (false negatives).
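The effect is easy to reproduce. The sketch below uses illustrative numbers (not from any real service): canary windows are drawn from the exact same Normal(180 ms, 30 ms) distribution as the baseline, yet a naive "sample P99 above the true P99" threshold fires far more often than any reasonable alert budget:

```python
import numpy as np

# Simulate naive threshold gating on two IDENTICAL versions.
# Hypothetical latency distribution: Normal(180ms, 30ms), true P99 ~= 249.8ms.
rng = np.random.default_rng(42)
true_p99 = 180 + 2.326 * 30  # z-score of the 99th percentile of a normal

trials = 1000
false_positives = 0
for _ in range(trials):
    canary_window = rng.normal(180, 30, size=50)          # only 50 canary samples
    if np.percentile(canary_window, 99) > true_p99:       # naive threshold check
        false_positives += 1

rate = false_positives / trials
print(f"false positive rate of the naive threshold: {rate:.2f}")
```

The canary is identical to the baseline by construction, so every single trigger here is a spurious rollback.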
This article shows how to build an A/B-test gating layer that is robust to natural variation by wiring a statistical significance testing service, implemented with Python FastAPI and the Mann-Whitney U test, into Flagger's webhook interface. By the end, you will have complete code that plugs directly into your Flagger Canary CRD.
Key Concepts
Before reading this article
This article is intended for teams that are already using Flagger or are considering adopting it. It assumes a basic understanding of Kubernetes Deployment, Canary CRD, and PromQL. For an introduction to Flagger itself, it is recommended to refer to the official documentation first.
Flagger Webhook Gating Mechanism
Flagger calls an external HTTP endpoint during every canary analysis cycle and determines whether to proceed with deployment based solely on the response code.
HTTP 2xx → continue the rollout (increase the traffic weight)
HTTP 4xx/5xx → increment the failure count → roll back once the threshold is exceeded

Webhook types are classified by when they are called and what they are used for.
| Type | When invoked | Main use |
|---|---|---|
| `pre-rollout` | Before traffic starts shifting | Pre-deployment quality check (k6 load test, etc.) |
| `rollout` | Every analysis cycle | Real-time metric-based gating ← the focus of this article |
| `confirm-rollout` | Before advancing to the next step | Manual approval, human intervention in the boundary zone |
| `rollback` | When a rollback occurs | Notification and post-processing |
Core Design Principle: Flagger only looks at the HTTP status code of the webhook response. Any complex operations can be performed internally within the statistics service, and Flagger simply receives the result as 200 or 500. This simple interface is the key to enabling connections to external statistics services.
The Natural Variation Problem and Statistical Solution
If a service whose P99 latency is 180 ms ± 30 ms momentarily records a canary P99 of 210 ms, is that a problem? It could be natural variation within a one-sigma range. Statistical significance testing answers this question mathematically.
- p-value: The probability of observing a difference this large or more extreme by chance, assuming the null hypothesis (that the two distributions are identical) is true. p < 0.05 means "if the null hypothesis were true, a result like this would occur less than 5% of the time"; the lower the value, the stronger the evidence that a real difference exists.
- Confidence Interval (CI): The range of uncertainty regarding the effect size. The larger the sample, the narrower the CI becomes.
- Statistical Power: The probability of detecting an actual difference. It decreases with lower traffic.
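To make the p-value concrete, here is a minimal sketch with synthetic samples: identical canary and baseline distributions yield a large p-value, while a canary where every sample is worse yields a vanishingly small one.

```python
from scipy.stats import mannwhitneyu

baseline = list(range(1, 31))            # synthetic latencies 1..30
identical_canary = list(range(1, 31))    # same values → no real difference
shifted_canary = list(range(31, 61))     # every sample worse than baseline

# H1: baseline is stochastically less than canary, i.e. "the canary is worse"
_, p_same = mannwhitneyu(baseline, identical_canary, alternative="less")
_, p_shift = mannwhitneyu(baseline, shifted_canary, alternative="less")

print(p_same)   # large → no evidence of degradation
print(p_shift)  # tiny → the observed shift is very unlikely to be chance
```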
Which testing method should be chosen?
Infrastructure metrics such as latency and error rate do not follow a normal distribution. They have long tails and are sensitive to outliers.
| Test Method | Distribution Assumptions | Suitable Metrics | Features |
|---|---|---|---|
| Mann-Whitney U | None (non-parametric) | Latency, Error Rate | Adopted by Netflix Kayenta; robust to outliers |
| Welch's t-test | Normal distribution | Response size, etc. | No need for homogeneity of variance assumption, safer than standard t-test |
| Bayesian A/B | None | Conversion Rate, CTR | Intuitive interpretation as "probability that the canary is better" |
Mann-Whitney U Test: A non-parametric test that compares the distributions of two independent samples based on ranks rather than raw values. Because it does not assume normality, it suits skewed distributions such as latency, and a single outlier barely affects the result. This is why Netflix Kayenta adopted it as its core test.
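The rank-based nature is easy to demonstrate with synthetic data: inflating the single largest canary sample to an absurd value leaves the Mann-Whitney p-value untouched, because that value already held the top rank, while the sample mean is dragged tens of thousands of milliseconds off.

```python
from scipy.stats import mannwhitneyu
import numpy as np

baseline = [100 + i for i in range(20)]      # 100..119 ms
canary = [101 + i for i in range(20)]        # 101..120 ms, slight shift
canary_outlier = canary[:-1] + [1_000_000]   # replace the max with a wild outlier

_, p_clean = mannwhitneyu(baseline, canary, alternative="less")
_, p_out = mannwhitneyu(baseline, canary_outlier, alternative="less")

print(p_clean, p_out)  # identical: the rank ordering did not change
print(np.mean(canary), np.mean(canary_outlier))  # means differ by ~50,000 ms
```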
Brief Introduction to Bayesian A/B Testing: The Beta-Bernoulli model is well-suited for metrics dealing with binary outcomes, such as conversion rates and error rates. It sets the prior distribution as Beta(1,1) (uniform prior, no prior knowledge) and updates the posterior distribution to Beta(α+success, β+failure) as the number of successes and failures is observed. Its strength lies in the ability to directly calculate the "probability that the canary's error rate is lower than the baseline," enabling intuitive decision-making without a p-value.
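A minimal sketch of that update rule, with made-up counts: suppose the baseline saw 10 errors in 1,000 requests and the canary 2 errors in 1,000. The posterior probability that the canary's error rate is lower can be estimated by Monte Carlo sampling from the two Beta posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta(1, 1) uniform prior, updated with observed failures/successes.
# Hypothetical counts: baseline 10 errors / 1000 reqs, canary 2 errors / 1000 reqs.
baseline_post = rng.beta(1 + 10, 1 + 990, size=100_000)  # posterior error-rate draws
canary_post = rng.beta(1 + 2, 1 + 998, size=100_000)

p_canary_better = np.mean(canary_post < baseline_post)
print(f"P(canary error rate < baseline) ~= {p_canary_better:.3f}")
```

The resulting probability can be gated directly ("proceed only if P(canary better or equal) exceeds 95%"), with no p-value involved.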
Architecture Overview
The overall flow of the statistical gating layer is as follows.
flowchart LR
F[Flagger Controller] -->|"POST /check\n{name, namespace, metadata}"| S["Statistical\nSignificance Service\n(FastAPI)"]
S -->|PromQL query| P[(Prometheus)]
P -->|quantile time series| S
S -->|"HTTP 200: pass\n{p_value, effect_size}"| F
S -->|"HTTP 500: fail\n{degradation_detected}"| F
F -->|increase traffic| C[Canary Pods]
F -->|roll back| B[Baseline Pods]

Role of each component:
| Component | Role | Implementation |
|---|---|---|
| Flagger Controller | Canary Cycle Management, Webhook Invocation | Flagger CRD |
| Statistical Significance Service | Metric Collection + Statistical Testing + Decision | Python FastAPI |
| Prometheus | Save Canary/Baseline Metrics | Existing Monitoring Stack |
Practical Application
Implementation Option A: Python + FastAPI-based Mann-Whitney Gate
Implement the /check endpoint that Flagger will call as a FastAPI.
Practical Constraints on Prometheus Data Collection: Prometheus does not store individual request latencies. rate(http_request_duration_seconds_bucket[5m]) returns per-bucket counter rates, not per-request samples that could be fed directly into Mann-Whitney. If true per-request samples are required, pair it with a log-based pipeline such as Loki. As a practical alternative, we collect several quantiles of the histogram and use them as approximate distribution samples.
# stat_gate/main.py
import os
import logging
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from scipy.stats import mannwhitneyu  # non-parametric test
from prometheus_api_client import PrometheusConnect  # Prometheus HTTP API client
from prometheus_client import Gauge, start_http_server  # exposes custom metrics
import numpy as np

app = FastAPI()
logger = logging.getLogger(__name__)

# PROMETHEUS_URL: inside a Kubernetes cluster, use the service FQDN.
# Format: http://<service>.<namespace>.svc.cluster.local:<port>
prom = PrometheusConnect(
    url=os.getenv(
        "PROMETHEUS_URL",
        "http://prometheus-server.monitoring.svc.cluster.local:9090"
    )
)

# Expose custom metrics for Grafana visualization (port 8001).
# Add a ServiceMonitor so Prometheus scrapes this port.
gate_p_value = Gauge("canary_gate_p_value", "Mann-Whitney p-value", ["canary", "namespace"])
gate_effect_ms = Gauge("canary_gate_effect_size_ms", "Effect size (ms)", ["canary", "namespace"])
start_http_server(8001)

class GateRequest(BaseModel):
    name: str
    namespace: str
    metadata: dict = {}

def fetch_quantile_samples(
    metric_base: str,
    pod_selector: str,
    duration: str = "5m",
) -> list[float]:
    """
    Collect several quantiles from a Prometheus histogram and return them
    as approximate distribution samples.
    Note: Prometheus does not store individual request latencies.
    The 7 quantiles p10-p99 are used as approximate samples.
    If a more accurate distribution comparison is needed, collect actual
    per-request samples from a log-based pipeline such as Loki instead.
    """
    quantiles = [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]
    samples = []
    for q in quantiles:
        query = (
            f"histogram_quantile({q}, "
            f"rate({metric_base}_bucket{{{pod_selector}}}[{duration}]))"
        )
        try:
            result = prom.custom_query(query)
            for r in result:
                val = r["value"][1]
                if val not in ("NaN", "+Inf", "-Inf"):
                    samples.append(float(val))
        except Exception as e:
            logger.warning("Prometheus query failed (q=%s): %s", q, e)
    return samples

def run_mann_whitney(
    canary: list[float],
    baseline: list[float],
) -> tuple[float, float]:
    """Run the Mann-Whitney U test and return (p_value, effect_size_seconds)."""
    stat, p_value = mannwhitneyu(
        baseline,
        canary,
        alternative="less",  # one-sided: test only "is the canary worse?" → more power
    )
    effect_size = float(np.median(canary) - np.median(baseline))
    return p_value, effect_size

@app.post("/check")
async def statistical_gate(req: GateRequest):
    window = req.metadata.get("window", "5m")
    alpha = float(req.metadata.get("alpha", "0.05"))
    metric = req.metadata.get("metric", "http_request_duration_seconds")
    canary_selector = f'pod=~"{req.name}-[0-9]+-.*",namespace="{req.namespace}"'
    baseline_selector = f'pod=~"{req.name}-primary-[0-9]+-.*",namespace="{req.namespace}"'
    try:
        canary_samples = fetch_quantile_samples(metric, canary_selector, window)
        baseline_samples = fetch_quantile_samples(metric, baseline_selector, window)
    except Exception as e:
        logger.error("Error while collecting metrics: %s", e)
        # Default behavior on a Prometheus outage: pass (fail-open).
        # In production this can be switched to fail-closed (return 500) per team policy.
        return {"result": "metrics_unavailable", "detail": str(e)}
    if len(canary_samples) < 5 or len(baseline_samples) < 5:
        # Too few samples → cannot judge yet; keep waiting for data (pass)
        return {
            "result": "insufficient_data",
            "canary_n": len(canary_samples),
            "baseline_n": len(baseline_samples),
        }
    p_value, effect_size = run_mann_whitney(canary_samples, baseline_samples)
    # Update custom Prometheus metrics (for Grafana visualization)
    gate_p_value.labels(canary=req.name, namespace=req.namespace).set(p_value)
    gate_effect_ms.labels(canary=req.name, namespace=req.namespace).set(effect_size * 1000)
    if p_value < alpha:
        raise HTTPException(
            status_code=500,
            detail={
                "result": "degradation_detected",
                "p_value": round(p_value, 4),
                "effect_size_ms": round(effect_size * 1000, 2),
                "alpha": alpha,
            },
        )
    return {
        "result": "pass",
        "p_value": round(p_value, 4),
        "effect_size_ms": round(effect_size * 1000, 2),
        "canary_n": len(canary_samples),
        "baseline_n": len(baseline_samples),
    }

Key Code Points:
| Code Location | Intent |
|---|---|
| `alternative="less"` | One-sided test: only "is the canary worse?" is tested, halving the p-value for the same data and raising power at the same alpha |
| `n < 5` guard | Too few quantile-based samples to judge → pass |
| `try/except` + fail-open | By default, a Prometheus failure does not block the deployment (adjustable per team policy) |
| `start_http_server(8001)` | Exposes p_value and effect_size as Prometheus metrics → Grafana dashboard integration |
| `raise HTTPException(500)` | The failure signal that Flagger recognizes |
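The custom metrics on port 8001 still need a scrape configuration. A ServiceMonitor along these lines would work; note that the names `stat-svc`, the `monitoring` namespace, and the port name `metrics` are assumptions here and must match your actual Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: stat-gate-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: stat-svc      # must match the labels on your statistics Service
  endpoints:
    - port: metrics      # Service port name pointing at container port 8001
      interval: 30s
```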
Implementation Option B: Flagger Canary CRD Configuration
Connect the statistics service with a webhook of type rollout. Traffic increases only when it passes through the statistics gate during every analysis cycle.
# canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  service:
    port: 80
  analysis:
    interval: 1m      # run an analysis cycle every 60 seconds
    threshold: 3      # roll back after 3 consecutive failures
    maxWeight: 50     # ramp traffic up to at most 50%
    stepWeight: 10    # increase by 10% per cycle
    webhooks:
      - name: statistical-gate
        type: rollout        # called on every analysis cycle
        url: http://stat-svc.monitoring.svc/check
        timeout: 30s         # 30s or more recommended to allow for Prometheus query time
        metadata:
          window: "5m"
          alpha: "0.05"
          metric: "http_request_duration_seconds"
      - name: pre-load-test
        type: pre-rollout    # once, before the first traffic shift
        url: http://flagger-k6-webhook.monitoring.svc/launch
        timeout: 120s
        metadata:
          script: configmap/k6-script/test.js

Meaning of threshold: 3: A rollback is triggered when the statistics gate returns HTTP 500 three consecutive times; one or two transient failures (network latency, Prometheus overload, etc.) will not trigger it. Together with the alpha level, this value is a key knob for controlling sensitivity.
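Under the (optimistic) assumption that consecutive analysis windows are independent, the chance that a healthy canary fails the gate three times in a row is roughly alpha cubed. The back-of-envelope below illustrates this; in reality, overlapping 5m windows are correlated, so treat the number as a lower bound rather than a guarantee:

```python
# Rough probability of a spurious rollback per 3-cycle run,
# assuming independent analysis windows (an optimistic simplification).
alpha = 0.05       # per-cycle false positive rate of the statistical gate
threshold = 3      # consecutive failures Flagger requires before rolling back

p_false_rollback = alpha ** threshold
print(f"~{p_false_rollback:.6f} chance of 3 spurious failures in a row")
```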
Implementation Option C: Boundary Interval Hybrid Manual Approval Gating
When the p-value falls in the "boundary zone" between 0.05 and 0.15, this pattern notifies the team and waits for manual approval instead of deciding automatically.
# stat_gate/main.py — add a confirm-rollout endpoint
import httpx

SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL", "")

# ⚠️ Warning: this dictionary is reset whenever the process restarts.
# In production, replace it with Redis or another external store.
APPROVAL_STORE: dict[str, bool] = {}

async def _run_check(req: GateRequest) -> float:
    """Helper that returns a p-value using the same logic as the /check endpoint."""
    metric = req.metadata.get("metric", "http_request_duration_seconds")
    window = req.metadata.get("window", "5m")
    canary_selector = f'pod=~"{req.name}-[0-9]+-.*",namespace="{req.namespace}"'
    baseline_selector = f'pod=~"{req.name}-primary-[0-9]+-.*",namespace="{req.namespace}"'
    canary_samples = fetch_quantile_samples(metric, canary_selector, window)
    baseline_samples = fetch_quantile_samples(metric, baseline_selector, window)
    if len(canary_samples) < 5 or len(baseline_samples) < 5:
        return 0.5  # not enough samples → return a neutral value
    p_value, _ = run_mann_whitney(canary_samples, baseline_samples)
    return p_value

@app.post("/confirm")
async def confirm_gate(req: GateRequest):
    """Webhook endpoint for the confirm-rollout type."""
    key = f"{req.namespace}/{req.name}"
    # Already manually approved → pass immediately
    if APPROVAL_STORE.get(key):
        del APPROVAL_STORE[key]
        return {"result": "approved"}
    try:
        p_value = await _run_check(req)
    except Exception as e:
        logger.error("confirm gate p-value computation failed: %s", e)
        raise HTTPException(status_code=500, detail={"result": "check_failed"})
    if p_value > 0.15:
        # No evidence of degradation → proceed automatically
        return {"result": "auto_pass", "p_value": round(p_value, 4)}
    elif p_value < 0.05:
        # Strong evidence of degradation → block automatically (matches /check)
        raise HTTPException(
            status_code=500,
            detail={"result": "auto_fail", "p_value": round(p_value, 4)},
        )
    else:
        # Boundary zone: notify Slack, then wait for approval
        await _notify_slack(key, p_value)
        raise HTTPException(
            status_code=500,
            detail={"result": "pending_approval", "p_value": round(p_value, 4)},
        )

@app.post("/approve/{namespace}/{name}")
async def approve(namespace: str, name: str):
    """Called by a Slack bot or an admin UI on manual approval."""
    APPROVAL_STORE[f"{namespace}/{name}"] = True
    return {"result": "stored"}

async def _notify_slack(key: str, p_value: float):
    if not SLACK_WEBHOOK_URL:
        return
    message = {
        "text": (
            f":warning: *{key}* canary p-value={p_value:.3f} (boundary zone 0.05-0.15). "
            f"Manual approval required: `POST /approve/{key}`"
        )
    }
    async with httpx.AsyncClient() as client:
        await client.post(SLACK_WEBHOOK_URL, json=message)

# canary.yaml — add a confirm-rollout webhook
webhooks:
  - name: statistical-gate
    type: rollout
    url: http://stat-svc.monitoring.svc/check
    timeout: 30s
    metadata:
      window: "5m"
      alpha: "0.05"
  - name: human-in-the-loop
    type: confirm-rollout   # approval check before each step proceeds
    url: http://stat-svc.monitoring.svc/confirm
    timeout: 1h             # wait up to 1 hour for manual approval

Implementation Option D: Kayenta-style double gate (p-value + effect size)
The p-value alone is not enough. With a very large sample, even differences too small to matter in practice become statistically significant. Netflix Kayenta therefore uses a double gate: a difference counts as significant only when it passes the statistical test at a 98% confidence level and the effect size falls outside the allowed margin-of-error band.
# stat_gate/kayenta_style.py
from scipy.stats import mannwhitneyu, bootstrap  # bootstrap: non-parametric CI
import numpy as np

def kayenta_style_check(
    baseline: list[float],
    canary: list[float],
    alpha: float = 0.02,                 # Kayenta default: 98% confidence (alpha=0.02)
    allowed_increase_ratio: float = 0.1  # significant if the median rises by 10%+
) -> tuple[bool, dict]:
    """
    Double gate in the spirit of Kayenta's core statistical check:
    1. Mann-Whitney U p-value < alpha
    2. Effect size (relative median difference) > allowed_increase_ratio
    Fail only when BOTH conditions hold.
    """
    stat, p_value = mannwhitneyu(baseline, canary, alternative="less")
    baseline_median = np.median(baseline)
    canary_median = np.median(canary)
    effect_ratio = (canary_median - baseline_median) / (baseline_median + 1e-10)

    # 95% bootstrap confidence interval for the median difference (reported
    # for transparency). The statistic passed to scipy.stats.bootstrap must
    # follow the (*data, axis) signature, and the default method="BCa"
    # supports only one-sample statistics, so "percentile" is used here.
    def diff_medians(*arrays, axis):
        return np.median(arrays[1], axis=axis) - np.median(arrays[0], axis=axis)

    data = (np.array(baseline), np.array(canary))
    ci_result = bootstrap(
        data,
        diff_medians,
        confidence_level=0.95,
        n_resamples=1000,
        method="percentile",  # BCa raises an error for two-sample statistics
        random_state=42,
    )
    ci_low, ci_high = ci_result.confidence_interval

    # Double gate: statistically significant AND effect size over the threshold
    is_degradation = (p_value < alpha) and (effect_ratio > allowed_increase_ratio)
    return not is_degradation, {
        "p_value": round(p_value, 4),
        "effect_ratio": round(effect_ratio, 4),
        "ci_95": [round(ci_low, 4), round(ci_high, 4)],
        "allowed_increase_ratio": allowed_increase_ratio,
    }

Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Noise Robustness | Prevents unnecessary rollbacks by statistically eliminating false positives caused by natural variation |
| Scalability | Uses Flagger's existing webhook interface as is. Can be implemented without changing CRDs |
| Transparency | Automatically configures audit trail by returning p-values and confidence intervals as JSON |
| Flexibility | Different testing methods can be applied depending on the metric type (Latency, Error Rate, CTR) |
| Independence | Since the statistics service is a separate microservice, the testing logic can be changed independently of Flagger upgrades |
| ML Deployment Suitability | The same pattern can gate business metrics (conversion rate, CTR) for LLM and recommendation-model deployments |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Sample size dependence | Low traffic means low statistical power, producing false negatives | Pass on insufficient samples; reduce the required sample size with CUPED |
| Multiple comparison problem | Type I error inflates when several metrics are tested simultaneously | Apply Bonferroni or Holm-Bonferroni correction |
| Sequential testing pitfalls | Alpha inflation occurs under repeated testing | Applying an alpha spending function is essential (covered in the next post) |
| Webhook timeout | A short timeout fails the check before metrics are collected | Set timeout: 30s or higher; optimize Prometheus queries |
| Operational complexity | The statistics service itself now needs availability and monitoring | Autoscale with HPA; add a /health check endpoint |
| Alpha level selection | Too low blocks valid deployments; too high passes bad ones | Start at 0.05; adjust incrementally based on rollback history |
| APPROVAL_STORE volatility | The in-memory dictionary loses approval data on process restart | Replace with Redis in production |
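For the multiple-comparison problem, Holm-Bonferroni can be implemented in a few lines. The sketch below uses made-up p-values for three hypothetical metrics:

```python
def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return a reject/keep decision per p-value using Holm's step-down procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices sorted by p-value
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, keep all larger p-values
    return reject

# Hypothetical per-metric p-values: latency, error rate, CPU
decisions = holm_bonferroni([0.01, 0.04, 0.03])
print(decisions)  # only the 0.01 survives correction here
```

Without correction, both 0.01 and 0.04 would have passed the naive 0.05 cutoff.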
The Most Common Mistakes in Practice
- Judging immediately with too few samples: with canary traffic under 5%, even 30 minutes may not accumulate enough data. Running the test without a minimum-sample-size guard lets a bad distribution pass due to insufficient power. Implement a guard that passes on the `n < threshold` condition.
- Testing for degradation with a two-tailed test: when the only question is "is the canary worse than the baseline?", a two-tailed test doubles the p-value for the same data, cutting effective power at the same alpha. Use a one-tailed test with `alternative="less"`.
- Setting the webhook timeout shorter than the Prometheus query time: aggregation queries can take 10 to 15 seconds when Prometheus is under heavy load. With `timeout: 5s`, the webhook times out before data collection finishes and Flagger counts it as a failure. Set at least `timeout: 30s`.
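The two-tailed mistake is easy to verify numerically with synthetic samples: when the shift is in the tested direction, the two-sided p-value is exactly double the one-sided one, so the one-sided test reaches significance with a smaller observed effect.

```python
from scipy.stats import mannwhitneyu

baseline = [float(i) for i in range(100, 140)]        # synthetic latencies, no ties
canary = [float(i) + 0.5 for i in range(105, 145)]    # shifted upward (worse)

_, p_one = mannwhitneyu(baseline, canary, alternative="less")       # "canary worse?"
_, p_two = mannwhitneyu(baseline, canary, alternative="two-sided")

print(p_one, p_two)  # the two-sided p-value is double the one-sided one
```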
In Conclusion
By connecting a statistical significance testing service to Flagger's webhook interface, you can implement the same level of rigor as Kayenta's core statistical testing methods on your own infrastructure with just a single microservice and dozens of lines of YAML.
3 Steps to Start Right Now:
- Verify the statistics service locally: set up the environment with `pip install fastapi scipy numpy prometheus-api-client uvicorn prometheus-client httpx` and run `uvicorn main:app --reload`. Check the response with:

curl -X POST localhost:8000/check \
  -H "Content-Type: application/json" \
  -d '{"name":"test","namespace":"default","metadata":{"window":"5m","alpha":"0.05","metric":"http_request_duration_seconds"}}'

- Connect a `rollout` webhook in a staging cluster: add a `webhooks` block to the existing Canary CRD with `type: rollout` and `threshold: 3`. Start loose with `alpha: "0.10"` and tighten it while watching the rollback frequency.
- Visualize test results in a Grafana dashboard: the Option A code already exposes the `canary_gate_p_value` and `canary_gate_effect_size_ms` metrics on port 8001. Add a ServiceMonitor so Prometheus scrapes this port, and per-cycle p-value trends become visible in Grafana immediately. As this data accumulates, it becomes the basis for alpha-level tuning.
If you want to know more
CUPED (Controlled-experiment Using Pre-Existing Data): A technique that reduces variance by using metric data from the pre-deployment period as covariates. It is particularly useful for low-traffic services as it allows statistical significance to be reached more quickly with the same traffic. Optimizely and Kameleoon have adopted it as their primary statistical method.
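A minimal CUPED sketch with synthetic data (pre-period latency as covariate X, experiment-period latency as Y): the adjusted metric Y − θ(X − mean(X)), with θ = cov(X, Y)/var(X), keeps the same mean as Y while its sample variance can never exceed that of Y, shrinking by roughly a factor of (1 − ρ²):

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic per-user latency: pre-period X correlated with experiment-period Y.
n = 5_000
x = rng.normal(180, 30, size=n)           # pre-deployment covariate
y = 0.8 * x + rng.normal(40, 15, size=n)  # experiment metric, correlated with x

theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # OLS slope = cov(X,Y)/var(X)
y_cuped = y - theta * (x - x.mean())            # variance-reduced metric, same mean

reduction = 1 - np.var(y_cuped, ddof=1) / np.var(y, ddof=1)
print(f"variance reduced by {reduction:.0%}")
```

With less variance in the gated metric, the same traffic yields narrower confidence intervals and faster significance.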
Alpha Spending Function: A method for allocating the total alpha budget to be used over the entire experimental period in Sequential Testing. When applied to the Flagger gating layer, it allows the overall Type I error rate to be controlled below the target alpha, even when performing repeated tests in every analysis cycle. The implementation will be covered in the next post.
Next Part
A method to implement early stopping by applying Sequential Testing and Alpha Spending functions to the Flagger gating layer — Faster rollback, less user damage
Reference Materials
- Flagger Webhooks | Official Documentation
- Flagger Deployment Strategies (A/B Testing) | Official Documentation
- Flagger GitHub | fluxcd/flagger
- Automated Canary Analysis at Netflix with Kayenta | Netflix TechBlog
- How Canary Judgment Works (Mann-Whitney U) | Spinnaker
- Kayenta Canary Config Documentation | GitHub
- Grafana flagger-k6-webhook | GitHub
- Istio A/B Testing Tutorial | Flagger
- CUPED in A/B Testing | Optimizely
- Best Statistical Model for A/B Testing | AB Tasty
- Argo Rollouts Analysis & Progressive Delivery | Official Documentation
- A/B Testing with Linkerd and Flagger | InfraCloud