Safely Stopping Sequential A/B Testing at Any Time — The Mathematical Principles of e-Values and Optional Stopping
Anyone running an A/B test falls into this temptation at least once: "Hey, the interim results are already significant; shouldn't we stop now?" The answer from classical statistics is harsh: No. If you look at the interim results and stop before reaching a predetermined sample size, the Type I error rate (α, the allowable probability of incorrectly declaring significance) inflates rapidly. For example, if an experiment designed with α=0.05 is checked five times along the way, with a stop decision at each look, the actual error rate climbs to roughly 0.14, and with ten looks to about 0.19. This is the so-called "peeking problem."
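This inflation is easy to reproduce. The sketch below is an illustrative simulation (not tied to any particular platform): it runs a two-sided z-test at five equally spaced interim looks under a true null and counts how often at least one look appears "significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_look, n_looks, alpha = 4000, 200, 5, 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

false_positives = 0
for _ in range(n_sims):
    # Both arms come from the same distribution: H0 is true
    a = rng.normal(0, 1, n_per_look * n_looks)
    b = rng.normal(0, 1, n_per_look * n_looks)
    for k in range(1, n_looks + 1):
        m = k * n_per_look
        z = (a[:m].mean() - b[:m].mean()) / np.sqrt(2 / m)
        if abs(z) > z_crit:
            false_positives += 1  # stop at the first "significant" look
            break

rate = false_positives / n_sims
print(f"Type I error with {n_looks} peeks: {rate:.3f} (nominal alpha: {alpha})")
```

With these settings the realized error rate lands well above the nominal 0.05; the exact value depends on the number and spacing of the looks.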
However, modern statistical theory offers a completely different answer to this question. Using the e-value and SAVI (Safe Anytime-Valid Inference) frameworks, a Type I error guarantee is mathematically maintained regardless of when the analysis is stopped. Spotify has adopted a sequential testing framework based on mSPRT (mixture Sequential Probability Ratio Test, explained later) for its A/B testing platform (Spotify Engineering, 2023), and Amplitude, Netflix, and Uber are also applying similar methodologies to thousands of commercial experiments. It has become mainstream in both academia and industry to the extent that dedicated tutorial sessions were organized at ICML 2025.
This article outlines the entire landscape of SAVI, ranging from the mathematical nature of the e-value to composite null hypotheses, multiple tests, and regression adjustments that go beyond mSPRT, and introduces them along with code that can be immediately applied in R and Python. For readers new to mSPRT, it is sufficient to understand it as a "sequential test using the likelihood ratio, which indicates how well data is explained under two hypotheses."
Key Concepts
What is an E-value
The e-value is a random variable that quantifies the evidence against the null hypothesis H₀. The definition is simple.
E-value: A non-negative statistic whose expected value is 1 or less under the null hypothesis H₀. That is, E[E] ≤ 1 (under H₀).
This single simple constraint creates amazing characteristics.
| Property | Description |
|---|---|
| Rejection criterion | e ≥ 1/α (e.g., e ≥ 20 for α = 0.05) |
| Interpretation | Large e-value = strong evidence against H₀ |
| Combination | The product of independent e-values is still an e-value |
| Merging | Even under arbitrary dependence, a weighted average of e-values is still an e-value |
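The two closure properties above (product and weighted average) can be checked numerically. In this illustrative sketch, each e-value is a one-observation likelihood ratio (N(±0.5, 1) vs N(0, 1)), which has expected value exactly 1 under H₀; the product of independent copies and a weighted average of dependent copies both keep that expectation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200_000
x1, x2 = rng.normal(0, 1, n), rng.normal(0, 1, n)  # H0 is true: data is N(0, 1)

def lr_e(x, mu):
    # Likelihood ratio N(mu, 1) vs N(0, 1): an e-value, since E[e] = 1 under H0
    return stats.norm.pdf(x, loc=mu, scale=1) / stats.norm.pdf(x, loc=0, scale=1)

e1, e2 = lr_e(x1, 0.5), lr_e(x2, 0.5)
product = e1 * e2                # independent e-values -> product is an e-value
e_dep = lr_e(x1, -0.5)           # built from the SAME data as e1 (dependent)
merged = 0.3 * e1 + 0.7 * e_dep  # weighted average of dependent e-values

print(f"E[e1 * e2] = {product.mean():.3f}")  # ~1 under H0
print(f"E[merged]  = {merged.mean():.3f}")   # ~1 under H0
```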
The difference becomes clear when compared with the p-value.
| Comparison Item | p-value | e-value |
|---|---|---|
| Rejection criterion | p < α | e ≥ 1/α |
| Optional stopping | Invalid (Type I error inflates) | Valid (guarantee preserved) |
| Post-hoc change of significance level | Not possible | Possible |
| Composite null hypothesis | Difficult | Handled via universal inference |
| Combining multiple tests | Constrained by dependence structure | Arbitrary dependence allowed |
One-line explanation for non-statistical roles: When explaining e-values to PMs or data analysts, you can say this: "We are betting against the null hypothesis that the two variants perform the same, and our betting balance changes as data accumulates. The e-value is this balance, and once it exceeds 20x we declare the bet won. That decision rule stays valid no matter when you check it or whether you stop midway."
Overall Structure of the SAVI Framework
SAVI is a framework that systematizes sequential inference centered on e-values. It has three core components.
| Components | Roles | Corresponding Concepts in the P-Value World |
|---|---|---|
| E-value | Measurement of evidence at a single point in time | p-value |
| E-process | Sequentially accumulating evidence (Test Martingale) | Sequential p-value |
| Confidence Sequence | Confidence Interval Valid at Any Time Point | Fixed Sample Confidence Interval |
Understanding E-process and Martingale: Even if the word "Martingale" is unfamiliar, the intuition is simple. Unlike a standard cumulative sum (+), the e-process is updated by multiplication (×) whenever new data comes in. "The betting balance has grown 100-fold so far, and new data has come in. If this data is evidence contrary to H₀, the balance increases further; otherwise, it decreases." Thanks to this multiplicative update structure, the overall error rate is guaranteed regardless of when it stops.
Game Theoretical Interpretation: SAVI reinterprets statistical tests as gambling games. If H₀ is true, no strategy can make money in the long run. If H₀ is false, the right betting strategy increases profits exponentially. The e-value is the "current betting balance," and the e-process is the "temporal trajectory of the balance."
The basis of the mathematical guarantee is Ville's inequality.
If the e-process E_t is a test martingale under H₀, then:

P(E_t ≥ 1/α at some time t) ≤ α

This is the mathematical backbone of the claim that "it is safe to stop at any time."
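Ville's inequality can be checked by simulation: generate many e-process trajectories under a true null and measure how often the running maximum ever reaches 1/α. A sketch using a likelihood-ratio test martingale for a normal mean (the alternative mean 0.3 is an arbitrary illustrative choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_paths, n_steps, alpha = 4000, 300, 0.05
mu_alt = 0.3  # the "betting" alternative; any fixed choice keeps the martingale property

x = rng.normal(0, 1, (n_paths, n_steps))  # H0 is true: data really is N(0, 1)
# log e-process: cumulative sum of per-observation log likelihood ratios
log_e = np.cumsum(stats.norm.logpdf(x, mu_alt, 1) - stats.norm.logpdf(x, 0, 1), axis=1)
ever_crossed = (log_e.max(axis=1) >= np.log(1 / alpha)).mean()

print(f"P(E_t ever >= 20 under H0) ≈ {ever_crossed:.3f}  (Ville's bound: {alpha})")
```

The empirical crossing probability stays below α no matter how long you let the paths run; that is exactly the optional-stopping guarantee.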
Beyond mSPRT: The Four Extensions SAVI Handles
mSPRT (Mixture Sequential Probability Ratio Test) is a special case of the e-value. The likelihood ratio of mSPRT (the ratio of data probabilities under two hypotheses) satisfies the definition of the e-value because its expected value under H₀ is at most 1. SAVI solves a much broader class of problems.
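For the normal-mean case with known unit variance and a N(0, τ²) mixing distribution over the alternative, the mSPRT likelihood ratio even has a closed form, Λ_n = (1 + nτ²)^(−1/2) · exp(S_n² τ² / (2(1 + nτ²))), where S_n is the running sum. A sketch under those assumptions, checking by simulation that E[Λ_n] = 1 under H₀ (so Λ_n is indeed an e-value):

```python
import numpy as np

rng = np.random.default_rng(3)
tau2, n, n_sims = 0.01, 50, 100_000  # small tau2 keeps the Monte Carlo variance finite

x = rng.normal(0, 1, (n_sims, n))    # H0 true: each row is n observations from N(0, 1)
s = x.sum(axis=1)                    # running sum S_n at time n
# Closed-form mSPRT mixture likelihood ratio with N(0, tau2) mixing density
lam = np.exp(s**2 * tau2 / (2 * (1 + n * tau2))) / np.sqrt(1 + n * tau2)

print(f"E[Lambda_n] under H0 ≈ {lam.mean():.3f}  (theory: exactly 1)")
```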
1. Composite Null Hypothesis — When testing a set of distributions that is not a single distribution
When H₀ is a set of distributions, such as "all distributions with a mean less than or equal to 0," classical methods often get stuck. Universal Inference by Wasserman, Ramdas, and Balakrishnan (2020) solves this problem using nothing more than data splitting. No regularity conditions (mathematical assumptions such as differentiability of the likelihood) are required at all.
# Universal Inference: split-LRT-based e-value — conceptual example (pseudocode)
# Illustrates the algorithm flow only; implement each helper function for real use.
def universal_e_value_concept(data, fit_mle_fn, compute_lrt_fn, null_set):
    """
    The three steps of Universal Inference:
    1. Split the data into training/validation halves
    2. Fit the MLE (maximum likelihood estimate) on the training half
    3. Compute the split-LRT e-value on the validation half:
       E = L(theta_hat | test) / sup_{theta in H0} L(theta | test)
    """
    n = len(data)
    train, test = data[:n // 2], data[n // 2:]
    theta_hat = fit_mle_fn(train)  # implement fit_mle_fn for your model
    e_value = compute_lrt_fn(theta_hat, test, null_set)  # likewise
    return e_value  # reject H0 if e_value >= 1/alpha

# Normal-mean test — a concrete, directly runnable example
import numpy as np
from scipy import stats

def demo_normal_mean_test(n=100, true_mean=0.5, null_mean=0.0):
    """Universal Inference demo for H0: mu = null_mean"""
    np.random.seed(42)
    data = np.random.normal(true_mean, 1, n)
    train, test = data[:n // 2], data[n // 2:]
    # MLE: for a normal mean, the MLE is the sample mean
    theta_hat = np.mean(train)
    # Compute the split-LRT e-value
    log_like_alt = np.sum(stats.norm.logpdf(test, loc=theta_hat, scale=1))
    log_like_null = np.sum(stats.norm.logpdf(test, loc=null_mean, scale=1))
    e_value = np.exp(log_like_alt - log_like_null)
    print(f"e-value: {e_value:.2f}")
    # Expected output: e-value in the tens to hundreds (effect size 0.5, n=50 validation half)
    print(f"Reject (alpha=0.05): {e_value >= 20}")
    # Expected output: Reject (alpha=0.05): True
    return e_value

demo_normal_mean_test()

2. Regression Adjustment — Variance Reduction and Anytime Validity Simultaneously
CUPED (Controlled-experiment Using Pre-Experiment Data) is a technique that reduces metric variance by using pre-experiment data (e.g., the previous week's purchase history) as a covariate. Reduced variance lets the same effect be detected with a smaller sample. The key point when applying CUPED-style adjustments within the e-value framework is that the condition E[E] ≤ 1 (under H₀) must still hold after adjustment; if it breaks, the anytime-valid guarantee disappears. Statsig's CURE estimator is a representative example of covariate adjustment that preserves this condition.
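The adjustment step itself fits in a few lines. The sketch below shows the generic textbook form of CUPED on simulated data (not Statsig's implementation; the E[E] ≤ 1 condition must still be re-verified for any e-value built on the adjusted metric):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
x = rng.normal(10, 3, n)           # pre-experiment covariate (e.g., last week's spend)
y = 0.8 * x + rng.normal(0, 2, n)  # experiment-period metric, correlated with x

theta = np.cov(y, x)[0, 1] / np.var(x)  # regression slope of y on x
y_adj = y - theta * (x - x.mean())      # CUPED-adjusted metric; same mean as y

reduction = 1 - np.var(y_adj) / np.var(y)
print(f"Variance reduction: {reduction:.1%}")  # equals corr(x, y)^2 in theory
```

The adjusted metric keeps the same expectation as the raw one, so the treatment-effect estimate is unbiased while its variance shrinks by the squared correlation between metric and covariate.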
3. Multiple Tests — Controlling FDR (False Discovery Rate) and FWER (Family-Wise Error Rate)
If FDR (the expected proportion of rejected hypotheses that are actually true nulls) and FWER (the probability of at least one false rejection) are controlled with e-values, the guarantees hold regardless of the dependence structure.
| Procedure | Controls | Dependence Assumption |
|---|---|---|
| e-BH | FDR | Arbitrary dependence allowed |
| e-Bonferroni | FWER | Arbitrary dependence allowed |
| e-GAI | Online FDR (streaming) | Arbitrary dependence allowed |

The classical BH procedure assumes independence or PRDS (Positive Regression Dependency on Subsets, a specific form of positive dependence among hypotheses). e-BH guarantees FDR ≤ α even without this assumption.
4. Non-parametric and High-Dimensional Setups — Works Without Model Assumptions
E-values can be applied to metrics that follow a binomial distribution, such as conversion rates and click-through rates, and to high-dimensional setups that test hundreds of ad creatives simultaneously. A major strength is that they can be used without worrying about whether a normality assumption holds.
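As a concrete non-normal example, here is a hypothetical betting e-process on raw Bernoulli conversions, testing H₀: p = 0.10 against a single pre-specified alternative rate (an illustrative simplification; in practice you would bet with a mixture or a plug-in estimate rather than one fixed alternative):

```python
import numpy as np

rng = np.random.default_rng(11)
p0, p1 = 0.10, 0.13   # null conversion rate vs pre-specified betting alternative
n = 10_000            # visitors
clicks = rng.binomial(1, 0.13, n)  # true conversion rate is 13%: H0 is false here

# Multiplicative likelihood-ratio update: Bernoulli(p1) vs Bernoulli(p0)
log_steps = np.where(clicks == 1, np.log(p1 / p0), np.log((1 - p1) / (1 - p0)))
log_e = np.cumsum(log_steps)  # log e-process, valid at every intermediate step
e_final = np.exp(log_e[-1])

print(f"final e-value after {n} visitors: {e_final:.3g}")
print(f"reject H0 at alpha=0.05: {e_final >= 20}")
```

No distributional approximation is involved: each factor is an exact Bernoulli likelihood ratio, so the e-process is valid at every sample size.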
Practical Application
The three examples below cover the basic application of e-values (Example 1), e-process temporal monitoring (Example 2), and FDR control in multiple tests (Example 3), respectively. All assume the same context—web service A/B testing.
Example 1: Sequential A/B Test Implemented in R
The safestats package is a reference implementation maintained by SAVI researchers. You can sequentially monitor an experiment metric using the safe t-test.
library(safestats)
# Step 1: design based on the target effect size and power
# delta1: effect size to detect (Cohen's d)
# alpha: Type I error rate; beta: Type II error rate (1 - power)
design <- designSafeT(delta1 = 0.5, alpha = 0.05, beta = 0.2)
cat("Minimum recommended sample size:", design$nPlan, "\n")
# Expected output: Minimum recommended sample size: 62 (per arm)
# Step 2: sequential test; the e-value is updated as data accumulates
set.seed(42)
control <- rnorm(100, mean = 0, sd = 1)
treatment <- rnorm(100, mean = 0.5, sd = 1)
result <- safeTTest(
  x = treatment,
  y = control,
  design = design
)
cat("e-value:", result$eValue, "\n")
# Expected output: e-value in the tens to hundreds (with effect size 0.5 and n = 100, likely above 20)
cat("Reject (alpha = 0.05):", result$eValue >= 1 / 0.05, "\n")
# Rejection rule: reject H0 if e-value >= 20
# Expected output: Reject (alpha = 0.05): TRUE
# Key point: whether you check this result at n = 100 or at n = 50,
# the Type I error rate stays mathematically guaranteed at alpha = 0.05

| Code Step | Role |
|---|---|
| designSafeT() | Power analysis + design parameter calculation |
| safeTTest() | e-value calculation (can be applied sequentially) |
| result$eValue >= 1/α | Rejection decision (reject H₀ if e ≥ 20) |
Example 2: e-process visualization implemented in Python
We will directly implement and verify how e-processes accumulate through multiplication over time.
Caution when using the expectation library: the package is under active development and its API is subject to change. Before relying on class names and interfaces such as TwoSampleMeanEProcess, check the latest API in the GitHub repository. The code below implements the same mathematical principle directly and can be run as-is.
# If you use the library instead (API subject to change):
# pip install expectation

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)
# Experiment data: a scenario where the treatment has a real effect
control = np.random.normal(0, 1, 200)
treatment = np.random.normal(0.4, 1, 200)
alt_mean_diff = 0.4  # expected effect size, specified in advance

# e-process: multiply in the likelihood ratio at each observation
# This multiplicative structure is what makes optional stopping safe
e_values = []
e_current = 1.0  # initial e-process value (betting balance)
for i in range(10, len(control)):
    diff = treatment[i] - control[i]
    # Likelihood ratio for a single observation:
    # null (difference = 0) vs alternative (difference = alt_mean_diff)
    lr = (stats.norm.pdf(diff, loc=alt_mean_diff, scale=np.sqrt(2)) /
          stats.norm.pdf(diff, loc=0, scale=np.sqrt(2)))
    e_current *= lr  # multiplicative update
    e_values.append(e_current)

timesteps = range(10, len(control))
plt.figure(figsize=(10, 5))
plt.plot(timesteps, e_values, label="e-process trajectory", color="steelblue")
plt.axhline(y=1 / 0.05, color="red", linestyle="--", label="rejection threshold (1/α = 20)")
plt.axhline(y=1.0, color="gray", linestyle=":", label="null baseline (e = 1)")
plt.yscale("log")
plt.xlabel("Cumulative observations")
plt.ylabel("e-value (log scale)")
plt.title("E-process trajectory over time — optional stopping demo")
plt.legend()
plt.tight_layout()
plt.show()

rejection_point = next(
    (i + 10 for i, e in enumerate(e_values) if e >= 20),
    None
)
print(f"First rejection at observation {rejection_point}")
# Expected output: first rejection around observation 40-100 (depends on effect size and seed)
print(f"Final e-value: {e_values[-1]:.2f}")
# Expected output: final e-value in the tens to thousands (H0-false scenario)

On the log scale, the trajectory rises roughly linearly when H₀ is false (exponential growth on the original scale), and wanders around 1 when H₀ is true.
Example 3: Multiple Creative Verification Using e-BH — FDR (False Discovery Rate) Control
This is an example of controlling FDR with e-BH when simultaneously A/B testing hundreds of ad creatives on the same service.
Caution when using the savvi library: the savvi.multiple_testing.eBH path may change between versions. Check the latest API on the PyPI page before running. The version below implements the e-BH algorithm directly and can be run as-is.
# If you use the library instead (API subject to change):
# pip install savvi

import numpy as np

def eBH(e_values, alpha):
    """
    e-BH procedure: e-value-based FDR (False Discovery Rate) control.
    Guarantees FDR <= alpha even under arbitrary dependence.
    Algorithm: sort e-values in descending order, find the largest k
    satisfying e_(k) >= n / (k * alpha), and reject the top k.
    """
    n = len(e_values)
    sorted_idx = np.argsort(e_values)[::-1]  # descending order
    sorted_e = e_values[sorted_idx]
    k_star = 0
    for k in range(1, n + 1):
        if sorted_e[k - 1] >= n / (k * alpha):
            k_star = k
            # no break: the condition is not monotone, so scan for the largest k
    return sorted_idx[:k_star]

np.random.seed(0)
# Simulation: 500 ad creatives, 50 of which have a real effect
n_hypotheses = 500
n_true_effects = 50
e_null = np.random.exponential(1.0, n_hypotheses - n_true_effects)
e_alt = np.random.exponential(5.0, n_true_effects)  # creatives with a real effect
e_all = np.concatenate([e_null, e_alt])
labels = np.array([0] * (n_hypotheses - n_true_effects) + [1] * n_true_effects)

alpha = 0.05
rejected = eBH(e_all, alpha=alpha)
fdr = np.sum(labels[rejected] == 0) / max(len(rejected), 1)
power = np.sum(labels[rejected] == 1) / n_true_effects
print(f"Number of rejected creatives: {len(rejected)}")
# Expected output: 20-40 rejected (depends on effect strength)
print(f"Realized FDR: {fdr:.3f} (target: <= {alpha})")
# Expected output: realized FDR between 0.000 and 0.050 (usually below target)
print(f"Power: {power:.3f}")
# Expected output: power between 0.5 and 0.8 (depends on effect size)

e-BH vs. the classical BH procedure: the classical Benjamini-Hochberg (BH) procedure assumes independence of the p-values or PRDS (Positive Regression Dependency on Subsets, a specific form of positive dependence among hypotheses). e-BH guarantees FDR ≤ α even under arbitrary dependence. In exchange for this stronger guarantee, power can be somewhat lower; do not conclude that the "method is wrong" just because it rejects fewer hypotheses than BH.
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Optional stopping | The Type I error guarantee holds even if the experiment is stopped early or resumed, for any reason |
| Combinability | The product of independent e-values is still an e-value → pooling evidence across experiments is natural |
| Arbitrary dependence allowed | A weighted average of e-values is still an e-value → applicable in multiple testing regardless of the dependence structure |
| Post-hoc change of significance level | Still valid after adjusting α (fundamentally impossible with p-values) |
| Universality | Applicable to composite, non-parametric, and high-dimensional settings |
| Intuitive interpretation | Easy to explain to PMs and data analysts via the betting-balance analogy |
| Confidence sequences | Confidence intervals valid at any time can also be computed |
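To make the last row concrete: inverting the normal-mixture e-process for a unit-variance normal mean gives a closed-form confidence sequence, x̄_n ± sqrt(2(1 + nτ²) · log(√(1 + nτ²)/α)) / (nτ). A sketch comparing its radius with the classical fixed-n interval (τ² = 1 is an arbitrary tuning choice; the sequence is wider, but valid at every n simultaneously):

```python
import numpy as np
from scipy import stats

def cs_radius(n, alpha=0.05, tau2=1.0):
    # Anytime-valid radius from inverting the N(0, tau2)-mixture e-process
    # (unit-variance normal mean)
    return np.sqrt(2 * (1 + n * tau2) * np.log(np.sqrt(1 + n * tau2) / alpha)) / (n * np.sqrt(tau2))

def ci_radius(n, alpha=0.05):
    # Classical fixed-n confidence interval, valid only at one pre-chosen n
    return stats.norm.ppf(1 - alpha / 2) / np.sqrt(n)

for n in (100, 1000, 10000):
    print(f"n={n:>6}: CS ±{cs_radius(n):.4f}  vs  fixed CI ±{ci_radius(n):.4f}")
```

The extra width is the price of being allowed to check the interval at every sample size; both radii shrink toward zero as n grows.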
Disadvantages and Precautions
| Item | Description | Mitigation |
|---|---|---|
| Loss of power | More conservative than p-value-based tests at the same sample size | Better prior effect-size estimates plus CUPED-style regression adjustment |
| Choosing a good e-value is non-trivial | Many valid e-values exist for the same hypothesis, with very different power | Use standard constructions such as mSPRT (normal case) and safestats |
| Tooling compatibility | Most statistical software is built around p-values | Use R safestats, Python expectation/savvi |
| Learning curve | Game-theoretic thinking differs from traditional statistics training | Start with the Ramdas & Wang CMU textbook |
| Non-normal metrics | mSPRT assumes normality (or a CLT (Central Limit Theorem) approximation) | Use dedicated e-value constructions or non-parametric methods |
Caution regarding confusion with Bayes factors: some e-values are Bayes factors, but most Bayes factors are not e-values, since a Bayes factor need not have expected value at most 1 under H₀. Mixing the two immediately breaks the anytime-valid guarantee.
The Most Common Mistakes in Practice
- Using e-values without a design: if you skip power analysis just because optional stopping is allowed, the experiment may never conclude. "You can stop at any time" is not the same as "you can run forever without a goal." Go through a design phase first, as with designSafeT().
- Not re-verifying the e-value condition after regression adjustment: after adjusting for covariates with CUPED or similar, check that the adjusted estimator still satisfies E[E] ≤ 1 (under H₀). If this condition breaks, the anytime-valid guarantee disappears.
- Interpreting e-BH results as if they were BH: e-BH shows its value under arbitrary dependence and in online (streaming) settings. Do not conclude that the "method is wrong" just because it rejects fewer hypotheses than BH.
In Conclusion
e-value and SAVI are not merely generalizations of mSPRT, but a new language of statistical inference that completely reconstructs the p-value-centered paradigm on a game-theoretic basis.
3 Steps You Can Put into Practice Right Now — If you follow them in order, you can build both theory and practice.
- First, if you are an R user: after running install.packages("safestats"), redesign your current experiment as a safe t-test with designSafeT(delta1=0.5, alpha=0.05, beta=0.2), and watch for yourself when the e-value crosses 1/α (= 20).
- Next, if you are a Python user: apply the e-process visualization code from Example 2 to past A/B test data. Plotting the trajectory post hoc shows visually when the experiment could have been stopped while still reaching the correct conclusion, a direct input for designing future experiments.
- If you want a solid theoretical foundation: work through Ramdas & Wang's open CMU textbook, from Ville's inequality to the e-BH procedure. After the textbook, Alexander Ly's SAVI tutorial connects naturally to practice.
Next Article: In-depth Analysis of Confidence Sequence — Directly implement an "always valid confidence interval" paired with the e-process introduced in this article, and experimentally compare its width with that of a classical confidence interval. If you understand e-values, Confidence Sequence is the natural next step.
Reference Materials
5 Key Points — Start Here
- Game-Theoretic Statistics and Safe Anytime-Valid Inference | Statistical Science — A key paper containing the mathematical foundation of the entire e-value·SAVI theory
- Hypothesis Testing with E-values | Ramdas & Wang, CMU — An open textbook covering e-values from introductory to advanced levels
- Universal Inference | PNAS, Wasserman et al. 2020 — Original paper on the construction method of the composite null hypothesis e-value
- Choosing a Sequential Testing Framework | Spotify Engineering — Comparison of Sequential Testing Frameworks from a Practical Perspective
- safestats R package | CRAN — The official implementation used in Example 1
To delve deeper
- E-values | Wikipedia — Quick reference for concept review
- Beyond Neyman–Pearson: E-values enable hypothesis testing with a data-driven alpha | PNAS — Theory of Posterior Significance Level Changeability
- False Discovery Rate Control with E-values | JRSS-B — Original paper on e-BH procedure, formula-focused
- Family-wise Error Rate Control with E-values | arXiv 2501.09015 — e-Bonferroni complete proof
- ICML 2025 Tutorial: Game-theoretic Statistics and Sequential Anytime-Valid Inference — Get a Glance at the Latest Research Trends
- Sequential Testing on Statsig | Statsig Blog — Practical implementation-focused case including CURE
- A tiny review on e-values and e-processes | Ruodu Wang, 2023 — A short and clear review of concepts
- Regularized e-processes | arXiv 2410.01427 — Latest Theory on Improving Prior Knowledge Utilization Efficiency
- expectation Python library | GitHub — Example 2: For checking the latest API of the library
- savvi Python package | PyPI — Example 3: For checking the latest API of the library