Safely Stopping Sequential A/B Testing at Any Time — The Mathematical Principles of e-Values and Optional Stopping
Anyone running an A/B test falls into this temptation at least once: "Hey, the interim results are already significant; shouldn't we stop now?" The answer from classical statistics is harsh: No. If you look at the interim results and stop before reaching a predetermined sample size, the Type I error rate (α, the allowable probability of incorrectly declaring significance) inflates rapidly. For example, if an experiment designed with α=0.05 is checked five times along the way, with a stop decision at each look, the actual error rate climbs to roughly 0.14, and with ten looks to about 0.19. This is the so-called "peeking problem."
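This inflation is easy to reproduce. The sketch below is an illustrative simulation (not tied to any particular platform): it runs a two-sided z-test at five equally spaced interim looks under a true null and counts how often at least one look appears "significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_look, n_looks, alpha = 4000, 200, 5, 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

false_positives = 0
for _ in range(n_sims):
    # Both arms come from the same distribution: H0 is true
    a = rng.normal(0, 1, n_per_look * n_looks)
    b = rng.normal(0, 1, n_per_look * n_looks)
    for k in range(1, n_looks + 1):
        m = k * n_per_look
        z = (a[:m].mean() - b[:m].mean()) / np.sqrt(2 / m)
        if abs(z) > z_crit:
            false_positives += 1  # stop at the first "significant" look
            break

rate = false_positives / n_sims
print(f"Type I error with {n_looks} peeks: {rate:.3f} (nominal alpha: {alpha})")
```

With these settings the realized error rate lands well above the nominal 0.05; the exact value depends on the number and spacing of the looks.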
However, modern statistical theory offers a completely different answer to this question. Using the e-value and SAVI (Safe Anytime-Valid Inference) frameworks, a Type I error guarantee is mathematically maintained regardless of when the analysis is stopped. Spotify has adopted a sequential testing framework based on mSPRT (mixture Sequential Probability Ratio Test, explained later) for its A/B testing platform (Spotify Engineering, 2023), and Amplitude, Netflix, and Uber are also applying similar methodologies to thousands of commercial experiments. It has become mainstream in both academia and industry to the extent that dedicated tutorial sessions were organized at ICML 2025.
This article outlines the entire landscape of SAVI, ranging from the mathematical nature of the e-value to composite null hypotheses, multiple tests, and regression adjustments that go beyond mSPRT, and introduces them along with code that can be immediately applied in R and Python. For readers new to mSPRT, it is sufficient to understand it as a "sequential test using the likelihood ratio, which indicates how well data is explained under two hypotheses."
Key Concepts
What is an E-value
The e-value is a random variable that quantifies the evidence against the null hypothesis H₀. The definition is simple.
E-value: A non-negative statistic whose expected value is 1 or less under the null hypothesis H₀. That is, E[E] ≤ 1 (under H₀).
This single simple constraint creates amazing characteristics.
| Property | Description |
|---|---|
| Rejection criterion | e ≥ 1/α (e.g., e ≥ 20 for α = 0.05) |
| Interpretation | Large e-value = strong evidence against H₀ |
| Combination | The product of independent e-values is still an e-value |
| Merging | Even under arbitrary dependence, a weighted average of e-values is still an e-value |
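The two closure properties above (product and weighted average) can be checked numerically. In this illustrative sketch, each e-value is a one-observation likelihood ratio (N(±0.5, 1) vs N(0, 1)), which has expected value exactly 1 under H₀; the product of independent copies and a weighted average of dependent copies both keep that expectation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200_000
x1, x2 = rng.normal(0, 1, n), rng.normal(0, 1, n)  # H0 is true: data is N(0, 1)

def lr_e(x, mu):
    # Likelihood ratio N(mu, 1) vs N(0, 1): an e-value, since E[e] = 1 under H0
    return stats.norm.pdf(x, loc=mu, scale=1) / stats.norm.pdf(x, loc=0, scale=1)

e1, e2 = lr_e(x1, 0.5), lr_e(x2, 0.5)
product = e1 * e2                # independent e-values -> product is an e-value
e_dep = lr_e(x1, -0.5)           # built from the SAME data as e1 (dependent)
merged = 0.3 * e1 + 0.7 * e_dep  # weighted average of dependent e-values

print(f"E[e1 * e2] = {product.mean():.3f}")  # ~1 under H0
print(f"E[merged]  = {merged.mean():.3f}")   # ~1 under H0
```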
The difference becomes clear when compared with the p-value.
| Comparison Item | p-value | e-value |
|---|---|---|
| Rejection criterion | p < α | e ≥ 1/α |
| Optional stopping | Invalid (Type I error inflates) | Valid (guarantee preserved) |
| Post-hoc change of significance level | Not possible | Possible |
| Composite null hypothesis | Difficult | Handled via universal inference |
| Combining multiple tests | Constrained by dependence structure | Arbitrary dependence allowed |
One-line explanation for non-statistical roles: When explaining e-values to PMs or data analysts, you can say this: "We are betting against the null hypothesis that the two variants perform the same, and our betting balance changes as data accumulates. The e-value is this balance, and once it exceeds 20x we declare the bet won. That decision rule stays valid no matter when you check it or whether you stop midway."
Overall Structure of the SAVI Framework
SAVI is a framework that systematizes sequential inference centered on e-values. It has three core components.
| Components | Roles | Corresponding Concepts in the P-Value World |
|---|---|---|
| E-value | Measurement of evidence at a single point in time | p-value |
| E-process | Sequentially accumulating evidence (Test Martingale) | Sequential p-value |
| Confidence Sequence | Confidence Interval Valid at Any Time Point | Fixed Sample Confidence Interval |
Understanding E-process and Martingale: Even if the word "Martingale" is unfamiliar, the intuition is simple. Unlike a standard cumulative sum (+), the e-process is updated by multiplication (×) whenever new data comes in. "The betting balance has grown 100-fold so far, and new data has come in. If this data is evidence contrary to H₀, the balance increases further; otherwise, it decreases." Thanks to this multiplicative update structure, the overall error rate is guaranteed regardless of when it stops.
Game Theoretical Interpretation: SAVI reinterprets statistical tests as gambling games. If H₀ is true, no strategy can make money in the long run. If H₀ is false, the right betting strategy increases profits exponentially. The e-value is the "current betting balance," and the e-process is the "temporal trajectory of the balance."
The basis of the mathematical guarantee is Ville's inequality.
If the e-process E_t is a test martingale under H₀, then:

P(E_t ≥ 1/α at some time t) ≤ α

This is the mathematical backbone of the claim that "it is safe to stop at any time."
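Ville's inequality can be checked by simulation: generate many e-process trajectories under a true null and measure how often the running maximum ever reaches 1/α. A sketch using a likelihood-ratio test martingale for a normal mean (the alternative mean 0.3 is an arbitrary illustrative choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_paths, n_steps, alpha = 4000, 300, 0.05
mu_alt = 0.3  # the "betting" alternative; any fixed choice keeps the martingale property

x = rng.normal(0, 1, (n_paths, n_steps))  # H0 is true: data really is N(0, 1)
# log e-process: cumulative sum of per-observation log likelihood ratios
log_e = np.cumsum(stats.norm.logpdf(x, mu_alt, 1) - stats.norm.logpdf(x, 0, 1), axis=1)
ever_crossed = (log_e.max(axis=1) >= np.log(1 / alpha)).mean()

print(f"P(E_t ever >= 20 under H0) ≈ {ever_crossed:.3f}  (Ville's bound: {alpha})")
```

The empirical crossing probability stays below α no matter how long you let the paths run; that is exactly the optional-stopping guarantee.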
Beyond mSPRT: The Four Extensions SAVI Handles
mSPRT (Mixture Sequential Probability Ratio Test) is a special case of the e-value. The likelihood ratio of mSPRT (the ratio of data probabilities under two hypotheses) satisfies the definition of the e-value because its expected value under H₀ is at most 1. SAVI solves a much broader class of problems.
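For the normal-mean case with known unit variance and a N(0, τ²) mixing distribution over the alternative, the mSPRT likelihood ratio even has a closed form, Λ_n = (1 + nτ²)^(−1/2) · exp(S_n² τ² / (2(1 + nτ²))), where S_n is the running sum. A sketch under those assumptions, checking by simulation that E[Λ_n] = 1 under H₀ (so Λ_n is indeed an e-value):

```python
import numpy as np

rng = np.random.default_rng(3)
tau2, n, n_sims = 0.01, 50, 100_000  # small tau2 keeps the Monte Carlo variance finite

x = rng.normal(0, 1, (n_sims, n))    # H0 true: each row is n observations from N(0, 1)
s = x.sum(axis=1)                    # running sum S_n at time n
# Closed-form mSPRT mixture likelihood ratio with N(0, tau2) mixing density
lam = np.exp(s**2 * tau2 / (2 * (1 + n * tau2))) / np.sqrt(1 + n * tau2)

print(f"E[Lambda_n] under H0 ≈ {lam.mean():.3f}  (theory: exactly 1)")
```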
1. Composite Null Hypothesis — When testing a set of distributions that is not a single distribution
When H₀ is a set of distributions, such as "all distributions with a mean less than or equal to 0," classical methods often get stuck. Universal Inference by Wasserman, Ramdas, and Balakrishnan (2020) solves this problem using nothing more than data splitting. No regularity conditions (mathematical assumptions such as differentiability of the likelihood) are required at all.
# Universal Inference: split-LRT-based e-value — conceptual example (pseudocode)
# Illustrates the algorithm flow only; implement each helper function for real use.
def universal_e_value_concept(data, fit_mle_fn, compute_lrt_fn, null_set):
    """
    The three steps of Universal Inference:
    1. Split the data into training/validation halves
    2. Fit the MLE (maximum likelihood estimate) on the training half
    3. Compute the split-LRT e-value on the validation half:
       E = L(theta_hat | test) / sup_{theta in H0} L(theta | test)
    """
    n = len(data)
    train, test = data[:n // 2], data[n // 2:]
    theta_hat = fit_mle_fn(train)  # implement fit_mle_fn for your model
    e_value = compute_lrt_fn(theta_hat, test, null_set)  # likewise
    return e_value  # reject H0 if e_value >= 1/alpha

# Normal-mean test — a concrete, directly runnable example
import numpy as np
from scipy import stats

def demo_normal_mean_test(n=100, true_mean=0.5, null_mean=0.0):
    """Universal Inference demo for H0: mu = null_mean"""
    np.random.seed(42)
    data = np.random.normal(true_mean, 1, n)
    train, test = data[:n // 2], data[n // 2:]
    # MLE: for a normal mean, the MLE is the sample mean
    theta_hat = np.mean(train)
    # Compute the split-LRT e-value
    log_like_alt = np.sum(stats.norm.logpdf(test, loc=theta_hat, scale=1))
    log_like_null = np.sum(stats.norm.logpdf(test, loc=null_mean, scale=1))
    e_value = np.exp(log_like_alt - log_like_null)
    print(f"e-value: {e_value:.2f}")
    # Expected output: e-value in the tens to hundreds (effect size 0.5, n=50 validation half)
    print(f"Reject (alpha=0.05): {e_value >= 20}")
    # Expected output: Reject (alpha=0.05): True
    return e_value

demo_normal_mean_test()

2. Regression Adjustment — Variance Reduction and Anytime Validity Simultaneously
CUPED (Controlled-experiment Using Pre-Experiment Data) is a technique that reduces metric variance by using pre-experiment data (e.g., the previous week's purchase history) as a covariate. Reduced variance lets the same effect be detected with a smaller sample. The key point when applying CUPED-style adjustments within the e-value framework is that the condition E[E] ≤ 1 (under H₀) must still hold after adjustment; if it breaks, the anytime-valid guarantee disappears. Statsig's CURE estimator is a representative example of covariate adjustment that preserves this condition.
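The adjustment step itself fits in a few lines. The sketch below shows the generic textbook form of CUPED on simulated data (not Statsig's implementation; the E[E] ≤ 1 condition must still be re-verified for any e-value built on the adjusted metric):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
x = rng.normal(10, 3, n)           # pre-experiment covariate (e.g., last week's spend)
y = 0.8 * x + rng.normal(0, 2, n)  # experiment-period metric, correlated with x

theta = np.cov(y, x)[0, 1] / np.var(x)  # regression slope of y on x
y_adj = y - theta * (x - x.mean())      # CUPED-adjusted metric; same mean as y

reduction = 1 - np.var(y_adj) / np.var(y)
print(f"Variance reduction: {reduction:.1%}")  # equals corr(x, y)^2 in theory
```

The adjusted metric keeps the same expectation as the raw one, so the treatment-effect estimate is unbiased while its variance shrinks by the squared correlation between metric and covariate.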
3. Multiple Tests — Controlling FDR (False Discovery Rate) and FWER (Family-Wise Error Rate)
If FDR (the expected proportion of rejected hypotheses that are actually true nulls) and FWER (the probability of at least one false rejection) are controlled with e-values, the guarantees hold regardless of the dependence structure.
| Procedure | Controls | Dependence Assumption |
|---|---|---|
| e-BH | FDR | Arbitrary dependence allowed |
| e-Bonferroni | FWER | Arbitrary dependence allowed |
| e-GAI | Online FDR (streaming) | Arbitrary dependence allowed |

The classical BH procedure assumes independence or PRDS (Positive Regression Dependency on Subsets, a specific form of positive dependence among hypotheses). e-BH guarantees FDR ≤ α even without this assumption.
4. Non-parametric and High-Dimensional Setups — Works Without Model Assumptions
E-values can be applied to metrics that follow a binomial distribution, such as conversion rates and click-through rates, and to high-dimensional setups that test hundreds of ad creatives simultaneously. A major strength is that they can be used without worrying about whether a normality assumption holds.
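As a concrete non-normal example, here is a hypothetical betting e-process on raw Bernoulli conversions, testing H₀: p = 0.10 against a single pre-specified alternative rate (an illustrative simplification; in practice you would bet with a mixture or a plug-in estimate rather than one fixed alternative):

```python
import numpy as np

rng = np.random.default_rng(11)
p0, p1 = 0.10, 0.13   # null conversion rate vs pre-specified betting alternative
n = 10_000            # visitors
clicks = rng.binomial(1, 0.13, n)  # true conversion rate is 13%: H0 is false here

# Multiplicative likelihood-ratio update: Bernoulli(p1) vs Bernoulli(p0)
log_steps = np.where(clicks == 1, np.log(p1 / p0), np.log((1 - p1) / (1 - p0)))
log_e = np.cumsum(log_steps)  # log e-process, valid at every intermediate step
e_final = np.exp(log_e[-1])

print(f"final e-value after {n} visitors: {e_final:.3g}")
print(f"reject H0 at alpha=0.05: {e_final >= 20}")
```

No distributional approximation is involved: each factor is an exact Bernoulli likelihood ratio, so the e-process is valid at every sample size.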
Practical Application
The three examples below cover the basic application of e-values (Example 1), e-process temporal monitoring (Example 2), and FDR control in multiple tests (Example 3), respectively. All assume the same context—web service A/B testing.
Example 1: Sequential A/B Test Implemented in R
The safestats package is a reference implementation maintained by SAVI researchers. You can sequentially monitor an experiment metric using the safe t-test.
library(safestats)
# Step 1: design based on the target effect size and power
# delta1: effect size to detect (Cohen's d)
# alpha: Type I error rate; beta: Type II error rate (1 - power)
design <- designSafeT(delta1 = 0.5, alpha = 0.05, beta = 0.2)
cat("Minimum recommended sample size:", design$nPlan, "\n")
# Expected output: Minimum recommended sample size: 62 (per arm)
# Step 2: sequential test; the e-value is updated as data accumulates
set.seed(42)
control <- rnorm(100, mean = 0, sd = 1)
treatment <- rnorm(100, mean = 0.5, sd = 1)
result <- safeTTest(
  x = treatment,
  y = control,
  design = design
)
cat("e-value:", result$eValue, "\n")
# Expected output: e-value in the tens to hundreds (with effect size 0.5 and n = 100, likely above 20)
cat("Reject (alpha = 0.05):", result$eValue >= 1 / 0.05, "\n")
# Rejection rule: reject H0 if e-value >= 20
# Expected output: Reject (alpha = 0.05): TRUE
# Key point: whether you check this result at n = 100 or at n = 50,
# the Type I error rate stays mathematically guaranteed at alpha = 0.05

| Code Step | Role |
|---|---|
| designSafeT() | Power analysis + design parameter calculation |
| safeTTest() | e-value calculation (can be applied sequentially) |
| result$eValue >= 1/α | Rejection decision (reject H₀ if e ≥ 20) |
Example 2: e-process visualization implemented in Python
We will directly implement and verify how e-processes accumulate through multiplication over time.
Caution when using the expectation library: the package is under active development and its API is subject to change. Before relying on class names and interfaces such as TwoSampleMeanEProcess, check the latest API in the GitHub repository. The code below implements the same mathematical principle directly and can be run as-is.
# If you use the library instead (API subject to change):
# pip install expectation

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)
# Experiment data: a scenario where the treatment has a real effect
control = np.random.normal(0, 1, 200)
treatment = np.random.normal(0.4, 1, 200)
alt_mean_diff = 0.4  # expected effect size, specified in advance

# e-process: multiply in the likelihood ratio at each observation
# This multiplicative structure is what makes optional stopping safe
e_values = []
e_current = 1.0  # initial e-process value (betting balance)
for i in range(10, len(control)):
    diff = treatment[i] - control[i]
    # Likelihood ratio for a single observation:
    # null (difference = 0) vs alternative (difference = alt_mean_diff)
    lr = (stats.norm.pdf(diff, loc=alt_mean_diff, scale=np.sqrt(2)) /
          stats.norm.pdf(diff, loc=0, scale=np.sqrt(2)))
    e_current *= lr  # multiplicative update
    e_values.append(e_current)

timesteps = range(10, len(control))
plt.figure(figsize=(10, 5))
plt.plot(timesteps, e_values, label="e-process trajectory", color="steelblue")
plt.axhline(y=1 / 0.05, color="red", linestyle="--", label="rejection threshold (1/α = 20)")
plt.axhline(y=1.0, color="gray", linestyle=":", label="null baseline (e = 1)")
plt.yscale("log")
plt.xlabel("Cumulative observations")
plt.ylabel("e-value (log scale)")
plt.title("E-process trajectory over time — optional stopping demo")
plt.legend()
plt.tight_layout()
plt.show()

rejection_point = next(
    (i + 10 for i, e in enumerate(e_values) if e >= 20),
    None
)
print(f"First rejection at observation {rejection_point}")
# Expected output: first rejection around observation 40-100 (depends on effect size and seed)
print(f"Final e-value: {e_values[-1]:.2f}")
# Expected output: final e-value in the tens to thousands (H0-false scenario)

On the log scale, the trajectory rises roughly linearly when H₀ is false (exponential growth on the original scale), and wanders around 1 when H₀ is true.
Example 3: Multiple Creative Verification Using e-BH — FDR (False Discovery Rate) Control
This is an example of controlling FDR with e-BH when simultaneously A/B testing hundreds of ad creatives on the same service.
Caution when using the savvi library: the savvi.multiple_testing.eBH path may change between versions. Check the latest API on the PyPI page before running. The version below implements the e-BH algorithm directly and can be run as-is.
# If you use the library instead (API subject to change):
# pip install savvi

import numpy as np

def eBH(e_values, alpha):
    """
    e-BH procedure: e-value-based FDR (False Discovery Rate) control.
    Guarantees FDR <= alpha even under arbitrary dependence.
    Algorithm: sort e-values in descending order, find the largest k
    satisfying e_(k) >= n / (k * alpha), and reject the top k.
    """
    n = len(e_values)
    sorted_idx = np.argsort(e_values)[::-1]  # descending order
    sorted_e = e_values[sorted_idx]
    k_star = 0
    for k in range(1, n + 1):
        if sorted_e[k - 1] >= n / (k * alpha):
            k_star = k
            # no break: the condition is not monotone, so scan for the largest k
    return sorted_idx[:k_star]

np.random.seed(0)
# Simulation: 500 ad creatives, 50 of which have a real effect
n_hypotheses = 500
n_true_effects = 50
e_null = np.random.exponential(1.0, n_hypotheses - n_true_effects)
e_alt = np.random.exponential(5.0, n_true_effects)  # creatives with a real effect
e_all = np.concatenate([e_null, e_alt])
labels = np.array([0] * (n_hypotheses - n_true_effects) + [1] * n_true_effects)

alpha = 0.05
rejected = eBH(e_all, alpha=alpha)
fdr = np.sum(labels[rejected] == 0) / max(len(rejected), 1)
power = np.sum(labels[rejected] == 1) / n_true_effects
print(f"Number of rejected creatives: {len(rejected)}")
# Expected output: 20-40 rejected (depends on effect strength)
print(f"Realized FDR: {fdr:.3f} (target: <= {alpha})")
# Expected output: realized FDR between 0.000 and 0.050 (usually below target)
print(f"Power: {power:.3f}")
# Expected output: power between 0.5 and 0.8 (depends on effect size)

e-BH vs. the classical BH procedure: the classical Benjamini-Hochberg (BH) procedure assumes independence of the p-values or PRDS (Positive Regression Dependency on Subsets, a specific form of positive dependence among hypotheses). e-BH guarantees FDR ≤ α even under arbitrary dependence. In exchange for this stronger guarantee, power can be somewhat lower; do not conclude that the "method is wrong" just because it rejects fewer hypotheses than BH.
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Optional stopping | The Type I error guarantee holds even if the experiment is stopped early or resumed, for any reason |
| Combinability | The product of independent e-values is still an e-value → pooling evidence across experiments is natural |
| Arbitrary dependence allowed | A weighted average of e-values is still an e-value → applicable in multiple testing regardless of the dependence structure |
| Post-hoc change of significance level | Still valid after adjusting α (fundamentally impossible with p-values) |
| Universality | Applicable to composite, non-parametric, and high-dimensional settings |
| Intuitive interpretation | Easy to explain to PMs and data analysts via the betting-balance analogy |
| Confidence sequences | Confidence intervals valid at any time can also be computed |
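To make the last row concrete: inverting the normal-mixture e-process for a unit-variance normal mean gives a closed-form confidence sequence, x̄_n ± sqrt(2(1 + nτ²) · log(√(1 + nτ²)/α)) / (nτ). A sketch comparing its radius with the classical fixed-n interval (τ² = 1 is an arbitrary tuning choice; the sequence is wider, but valid at every n simultaneously):

```python
import numpy as np
from scipy import stats

def cs_radius(n, alpha=0.05, tau2=1.0):
    # Anytime-valid radius from inverting the N(0, tau2)-mixture e-process
    # (unit-variance normal mean)
    return np.sqrt(2 * (1 + n * tau2) * np.log(np.sqrt(1 + n * tau2) / alpha)) / (n * np.sqrt(tau2))

def ci_radius(n, alpha=0.05):
    # Classical fixed-n confidence interval, valid only at one pre-chosen n
    return stats.norm.ppf(1 - alpha / 2) / np.sqrt(n)

for n in (100, 1000, 10000):
    print(f"n={n:>6}: CS ±{cs_radius(n):.4f}  vs  fixed CI ±{ci_radius(n):.4f}")
```

The extra width is the price of being allowed to check the interval at every sample size; both radii shrink toward zero as n grows.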
Disadvantages and Precautions
| Item | Description | Mitigation |
|---|---|---|
| Loss of power | More conservative than p-value-based tests at the same sample size | Better prior effect-size estimates plus CUPED-style regression adjustment |
| Choosing a good e-value is non-trivial | Many valid e-values exist for the same hypothesis, with very different power | Use standard constructions such as mSPRT (normal case) and safestats |
| Tooling compatibility | Most statistical software is built around p-values | Use R safestats, Python expectation/savvi |
| Learning curve | Game-theoretic thinking differs from traditional statistics training | Start with the Ramdas & Wang CMU textbook |
| Non-normal metrics | mSPRT assumes normality (or a CLT (Central Limit Theorem) approximation) | Use dedicated e-value constructions or non-parametric methods |
Caution regarding confusion with Bayes factors: some e-values are Bayes factors, but most Bayes factors are not e-values, since a Bayes factor need not have expected value at most 1 under H₀. Mixing the two immediately breaks the anytime-valid guarantee.
The Most Common Mistakes in Practice
- Using e-values without a design: if you skip power analysis just because optional stopping is allowed, the experiment may never conclude. "You can stop at any time" is not the same as "you can run forever without a goal." Go through a design phase first, as with designSafeT().
- Not re-verifying the e-value condition after regression adjustment: after adjusting for covariates with CUPED or similar, check that the adjusted estimator still satisfies E[E] ≤ 1 (under H₀). If this condition breaks, the anytime-valid guarantee disappears.
- Interpreting e-BH results as if they were BH: e-BH shows its value under arbitrary dependence and in online (streaming) settings. Do not conclude that the "method is wrong" just because it rejects fewer hypotheses than BH.
In Conclusion
e-value and SAVI are not merely generalizations of mSPRT, but a new language of statistical inference that completely reconstructs the p-value-centered paradigm on a game-theoretic basis.
3 Steps You Can Put into Practice Right Now — If you follow them in order, you can build both theory and practice.
- First, if you are an R user: after running install.packages("safestats"), redesign your current experiment as a safe t-test with designSafeT(delta1=0.5, alpha=0.05, beta=0.2), and watch for yourself when the e-value crosses 1/α (= 20).
- Next, if you are a Python user: apply the e-process visualization code from Example 2 to past A/B test data. Plotting the trajectory post hoc shows visually when the experiment could have been stopped while still reaching the correct conclusion, a direct input for designing future experiments.
- If you want a solid theoretical foundation: work through Ramdas & Wang's open CMU textbook, from Ville's inequality to the e-BH procedure. After the textbook, Alexander Ly's SAVI tutorial connects naturally to practice.
Next Article: In-depth Analysis of Confidence Sequence — Directly implement an "always valid confidence interval" paired with the e-process introduced in this article, and experimentally compare its width with that of a classical confidence interval. If you understand e-values, Confidence Sequence is the natural next step.
Reference Materials
5 Key Points — Start Here
- Game-Theoretic Statistics and Safe Anytime-Valid Inference | Statistical Science — A key paper containing the mathematical foundation of the entire e-value·SAVI theory
- Hypothesis Testing with E-values | Ramdas & Wang, CMU — An open textbook covering e-values from introductory to advanced levels
- Universal Inference | PNAS, Wasserman et al. 2020 — Original paper on the construction method of the composite null hypothesis e-value
- Choosing a Sequential Testing Framework | Spotify Engineering — Comparison of Sequential Testing Frameworks from a Practical Perspective
- safestats R package | CRAN — The official implementation used in Example 1
To delve deeper
- E-values | Wikipedia — Quick reference for concept review
- Beyond Neyman–Pearson: E-values enable hypothesis testing with a data-driven alpha | PNAS — Theory of Posterior Significance Level Changeability
- False Discovery Rate Control with E-values | JRSS-B — Original paper on e-BH procedure, formula-focused
- Family-wise Error Rate Control with E-values | arXiv 2501.09015 — e-Bonferroni complete proof
- ICML 2025 Tutorial: Game-theoretic Statistics and Sequential Anytime-Valid Inference — Get a Glance at the Latest Research Trends
- Sequential Testing on Statsig | Statsig Blog — Practical implementation-focused case including CURE
- A tiny review on e-values and e-processes | Ruodu Wang, 2023 — A short and clear review of concepts
- Regularized e-processes | arXiv 2410.01427 — Latest Theory on Improving Prior Knowledge Utilization Efficiency
- expectation Python library | GitHub — Example 2: For checking the latest API of the library
- savvi Python package | PyPI — Example 3: For checking the latest API of the library