Safely Stopping A/B Tests — A Complete Implementation Guide to Confidence Sequences and E-process Sequential Testing
Anyone running an A/B test falls into this temptation at least once: "The interim results look promising; shouldn't we stop now?" In classical statistics, this behavior is called p-hacking—the act of repeatedly checking data at multiple points in time and stopping the analysis when a desired result is obtained. Doing so significantly amplifies type-I error (the error of rejecting the null hypothesis when it is actually true, i.e., false positives). Since the statistical validity of a classical confidence interval is guaranteed only for a single, fixed sample size, that guarantee erodes slightly every time the data is examined.
Confidence Sequence (CS) and E-process fundamentally solve this problem — because they mathematically guarantee a valid confidence interval simultaneously at every moment data is accumulated, the type-I error remains below α regardless of when the experiment is stopped. By the time you finish reading this article, you will be able to implement a tool in Python yourself that allows you to examine A/B tests at any time and stop them as desired without p-hacking. Large experiment platforms like Spotify and Eppo have already adopted this framework as a standard testing tool (Spotify Case, Eppo Case).
Key Concepts
What is a Confidence Sequence
Classical confidence intervals are valid only for a sample size n fixed in advance. For a parameter θ, P(θ ∈ C_n) ≥ 1-α is guaranteed, but intervals calculated at any point other than n do not carry the same guarantee.
Confidence Sequence removes this constraint. The formal definition:

P(∀t ≥ 1 : θ ∈ C_t) ≥ 1 − α

It is a sequence of intervals {C_t}_{t≥1} that guarantees coverage simultaneously at all time points, rather than at a single point. The act of looking at the data does not, by itself, compromise statistical validity.
Anytime-valid: the property that C_τ remains a valid confidence interval even when the experiment is terminated at an arbitrary stopping time τ — that is, at any moment one decides to "stop now" while watching the data. Classical CIs do not have this property.
Next, let's look at the engine that actually produces a CS.
E-process — Engine that creates CS
E-value is a random variable whose expected value is at most 1 under the null hypothesis H₀. Simply put, it means "when the null hypothesis is true, the mean of this value does not exceed 1."
E[E] ≤ 1 (under the null hypothesis H₀)

As a sequential-analysis-friendly alternative to p-values, the key difference is that e-values can be combined by multiplication: multiplying the e-values of several independent experiments yields a valid combined e-value.
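This multiplicative combination rule is the practical payoff of the definition. Here is a minimal sketch; the three e-values are hypothetical numbers chosen for illustration, not outputs of any real experiment.

```python
# Hypothetical e-values from three independent experiments.
# Under H0 each has expectation at most 1, and by independence the
# expectation of the product is the product of the expectations,
# so the product is again a valid e-value.
e_values = [2.5, 1.8, 3.1]

combined = 1.0
for e in e_values:
    combined *= e
print(f"combined e-value: {combined:.2f}")  # 13.95

# By Markov's inequality, rejecting H0 when an e-value reaches 1/alpha
# keeps the type-I error at alpha.
alpha = 0.05
print("reject H0:", combined >= 1 / alpha)  # False: 13.95 < 20
```

Note that each experiment's evidence compounds: no single e-value here reaches the threshold 1/α = 20, but a fourth independent experiment with e-value ≥ 1.44 would push the product over it.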
E-process {E_t} is the sequential version of the e-value: E_τ remains a valid e-value at any stopping time τ. From a game-theoretic perspective, the e-process is interpreted as the wealth process of a betting strategy: if a bettor who wagers against the null hypothesis at every time step keeps growing their wealth, that growth is grounds for rejecting the null hypothesis.
E_0 = 1  // initial capital
for t in 1..T:
    // λ_t: betting fraction, X_t: new observation, μ_0: null-hypothesis mean
    E_t = E_{t-1} × (1 + λ_t × (X_t - μ_0))

If the null hypothesis is false, the E-process grows exponentially; if it is true, it is controlled as a supermartingale (a stochastic process whose expected value never exceeds the current value, i.e., one that does not increase on average).
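To see both regimes of the pseudocode, here is a small simulation sketch (λ = 0.2, Gaussian data, and the sample sizes are illustrative choices): under H₀ the average final wealth stays near 1, while under an alternative the typical wealth grows.

```python
import numpy as np

rng = np.random.default_rng(42)
lam, mu_0, T, n_runs = 0.2, 0.0, 50, 2000

def final_wealth(mu_true: float) -> np.ndarray:
    """Final value E_T of the betting process E_t = Π(1 + λ(X_t - μ_0))
    over n_runs independent data streams drawn from N(mu_true, 1)."""
    X = rng.normal(mu_true, 1.0, size=(n_runs, T))
    # Note: with unbounded Gaussians and λ = 0.2, a betting factor only
    # goes negative for X < -5, which is vanishingly rare at these sizes.
    return np.prod(1 + lam * (X - mu_0), axis=1)

# Under H0 (μ = 0) each factor has expectation exactly 1, so E[E_T] = 1:
print("mean E_T under H0 :", round(final_wealth(0.0).mean(), 2))
# Under the alternative (μ = 0.3) the typical final wealth is large:
print("median E_T under H1:", round(np.median(final_wealth(0.3)), 2))
```

The mean is the right summary under H₀ (the martingale property constrains the mean, not the median), while the median is the honest summary under H₁, since the wealth distribution is heavily right-skewed.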
Thanks to this mathematical structure, CS can be constructed as follows.
Ville's Inequality — Theoretical Foundation
The coverage guarantee of CS comes from Ville's inequality:
P(∃t ≥ 1 : M_t ≥ 1/α) ≤ α · M_0

If {M_t} is a non-negative supermartingale, the probability that it ever exceeds the threshold 1/α is controlled by α (for M_0 = 1). Since the E-process has this supermartingale structure, a CS can be constructed by inversion:
C_t = { θ : E_t(θ) < 1/α }

That is, the CS is the set of all θ for which the e-process testing the null hypothesis "θ is the true value" stays below the critical value. When that e-process reaches 1/α, the corresponding θ is excluded from the interval, which amounts to a statistical rejection.
Classic CI vs Confidence Sequence — Convergence Speed Comparison
| Characteristic | Classical Confidence Interval | Confidence Sequence |
|---|---|---|
| Convergence rate | O(1/√n) | O(√(log log t / t)) |
| Valid at | Fixed n only | All t ≥ 1 simultaneously |
| Data peeking | Prohibited (inflates type-I error) | Allowed (no penalty) |
| Distribution assumptions | May depend on normal approximation | Non-parametric, non-asymptotic possible ※ |
| Interval width | Narrow | Wider by a log log t factor |
※ Although nonparametric, non-asymptotic guarantees are theoretically available, the code examples in this article assume a normal distribution to keep the implementation simple. For genuinely nonparametric settings, use the Hoeffding- or Bernstein-style boundaries from the confseq library.
The O(√(log log t / t)) convergence rate of a CS matches the lower bound from the Law of the Iterated Logarithm (LIL) and is therefore rate-optimal (Howard et al., 2021). The wider interval relative to classical methods is the information-theoretic cost of the stronger guarantee, and it is theoretically unavoidable.
Practical Application
Example 1: Direct Implementation of E-process-based Confidence Sequence
Let's calculate the classic CI and CS side by side and compare them visually.
Performance note: the `confidence_sequence_mean` function below recomputes the e-process for all 1,000 grid points at every time step t, which can take several minutes at T=500. To get results quickly, reduce to T=100 or evaluate only every 10th step with a `t % 10 == 0` condition.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
def eprocess_mean(data: np.ndarray, mu_0: float, lambda_val: float = 0.5) -> np.ndarray:
    """Compute the E-process for H₀: μ = mu_0 with a fixed betting fraction λ.

    Note: lambda_val must satisfy |λ| ≤ 1/max|X - μ_0| for the
    non-negativity and martingale property of the E-process to be
    mathematically guaranteed. Outside this range the E-process can
    go negative and the statistical guarantee breaks.
    """
    E = 1.0
    e_values = [E]
    for x in data:
        E *= (1 + lambda_val * (x - mu_0))
        # Clipping with max(E, ...) is not theoretically rigorous:
        # if E gets stuck near 0, the martingale property is lost.
        # The correct fix is to restrict λ to the valid range up front.
        E = max(E, 1e-10)
        e_values.append(E)
    return np.array(e_values)
def confidence_sequence_mean(
    data: np.ndarray, sigma: float, alpha: float = 0.05, lambda_val: float = 0.5
) -> tuple[np.ndarray, np.ndarray]:
    """Approximate C_t = { θ : E_t(θ) < 1/α } by grid search.

    The grid spans ±3*sigma so it adapts safely to data with a
    different variance.
    """
    cs_lower, cs_upper = [], []
    threshold = 1 / alpha
    for t in range(1, len(data) + 1):
        subset = data[:t]
        mu_hat = subset.mean()
        # Tie the grid to the data scale (±3*sigma) instead of a fixed ±3
        grid = np.linspace(mu_hat - 3 * sigma, mu_hat + 3 * sigma, 1000)
        valid = [
            mu for mu in grid
            if eprocess_mean(subset, mu, lambda_val)[-1] < threshold
        ]
        if valid:
            cs_lower.append(min(valid))
            cs_upper.append(max(valid))
        else:
            cs_lower.append(np.nan)
            cs_upper.append(np.nan)
    return np.array(cs_lower), np.array(cs_upper)
def classic_ci(data: np.ndarray, sigma: float, alpha: float = 0.05):
    """Classical confidence interval computed independently at each t.

    Its type-I error guarantee is void under data peeking.
    """
    z = stats.norm.ppf(1 - alpha / 2)
    lower, upper = [], []
    for t in range(1, len(data) + 1):
        mu_hat = data[:t].mean()
        margin = z * sigma / np.sqrt(t)
        lower.append(mu_hat - margin)
        upper.append(mu_hat + margin)
    return np.array(lower), np.array(upper)
# --- Experiment (capped at T=100 for speed) ---
np.random.seed(42)
mu_true = 0.5
sigma = 1.0
alpha = 0.05
T = 100  # 100 recommended for performance; 500+ takes several minutes
data = np.random.normal(mu_true, sigma, T)
cs_lo, cs_hi = confidence_sequence_mean(data, sigma, alpha)
ci_lo, ci_hi = classic_ci(data, sigma, alpha)
ts = np.arange(1, T + 1)
plt.figure(figsize=(12, 5))
plt.fill_between(ts, cs_lo, cs_hi, alpha=0.3, label="Confidence Sequence")
plt.fill_between(ts, ci_lo, ci_hi, alpha=0.3, label="Classic CI (independent at each t)")
plt.axhline(mu_true, color="red", linestyle="--", label=f"True μ = {mu_true}")
plt.xlabel("Sample size t")
plt.ylabel("Interval")
plt.legend()
plt.title("Confidence Sequence vs Classic CI")
plt.tight_layout()
plt.show()

| Key Code Point | Explanation |
|---|---|
| eprocess_mean | Cumulative e-process with a fixed-λ betting strategy |
| λ range limit | Must satisfy \|λ\| ≤ 1/max\|X - μ_0\| for the non-negativity guarantee |
| Grid range ±3σ | Ties the search range to the data scale so the grid stays adequate |
| classic_ci | Independent calculation at each time point; its guarantees are void under data peeking |
Example 2: Quantitative Comparison of Interval Widths
Here we compare widths quickly using an LIL-based closed-form approximation instead of grid search. The CS width formula below follows the Hoeffding-style time-uniform bound in Theorem 2 of Howard et al. (2021) (Paper Link).
import numpy as np

sigma = 1.0
alpha = 0.05
T = 1000
ts = np.arange(10, T + 1)
# Classic CI width: 2 × z × σ / √t
z = 1.96
classic_widths = 2 * z * sigma / np.sqrt(ts)
# CS width approximation (Hoeffding bound, Howard et al. 2021, Theorem 2);
# the log(log(t)/α) term reflects the slow LIL growth
cs_widths = 2 * sigma * np.sqrt(2 * np.log(np.log(ts) / alpha) / ts)
ratio = cs_widths / classic_widths
for t in [10, 50, 100, 500, 1000]:
    idx = t - 10
    print(
        f"t={t:4d} | Classic: {classic_widths[idx]:.4f} | "
        f"CS: {cs_widths[idx]:.4f} | ratio: {ratio[idx]:.3f}x"
    )

Execution result:
t=  10 | Classic: 1.2396 | CS: 1.7504 | ratio: 1.412x
t=  50 | Classic: 0.5544 | CS: 0.8352 | ratio: 1.507x
t= 100 | Classic: 0.3920 | CS: 0.6015 | ratio: 1.535x
t= 500 | Classic: 0.1753 | CS: 0.2778 | ratio: 1.585x
t=1000 | Classic: 0.1240 | CS: 0.1986 | ratio: 1.602x

Key observation: the ratio of the two widths is √(2 log(log t / α))/z, which depends on t only through log log t, so it grows extremely slowly: from about 1.41× at t=10 to about 1.60× at t=1000. For realistic sample sizes, the CS remains roughly 40–60% wider than the classic CI.
Example 3: Real-time A/B Test Monitoring
import numpy as np
def run_ab_test_with_eprocess(
    stream_a: np.ndarray,
    stream_b: np.ndarray,
    alpha: float = 0.05,
    lambda_val: float = 0.3,
) -> dict:
    """Continuous A/B-test monitoring based on an E-process.

    Null hypothesis H₀: μ_A = μ_B (no effect).

    Simplification: this implementation treats the paired differences
    (X_a - X_b) as a single stream. In a real A/B test it is
    statistically more rigorous to maintain separate e-processes for
    A and B and combine them.
    """
    threshold = 1 / alpha
    E = 1.0
    history = []
    for t, (xa, xb) in enumerate(zip(stream_a, stream_b), start=1):
        diff = xa - xb  # observed effect difference
        mu_0 = 0.0      # null hypothesis: difference = 0
        E *= (1 + lambda_val * (diff - mu_0))
        E = max(E, 1e-10)
        history.append(E)
        if E >= threshold:
            return {
                "stopped": True,
                "t": t,
                "e_value": E,
                "conclusion": f"Early stop at t={t} — effect detected (E={E:.2f})",
            }
    return {
        "stopped": False,
        "t": len(stream_a),
        "e_value": E,
        "conclusion": "No effect detected (failed to reject H₀)",
    }
# Simulation: a real effect exists
np.random.seed(7)
n = 1000
stream_a = np.random.normal(0.55, 1.0, n)  # conversion metric, mean 0.55
stream_b = np.random.normal(0.50, 1.0, n)  # conversion metric, mean 0.50
result = run_ab_test_with_eprocess(stream_a, stream_b)
print(result["conclusion"])
# Example output: Early stop at t=312 — effect detected (E=21.43)

In a classic t-test, checking the same data daily inflates the type-I error from 5% to over 30%. The E-process approach keeps the error at 5% or below no matter when you stop.
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Arbitrary-stop guarantee | Type-I error stays at or below α regardless of when the experiment ends |
| Continuous monitoring | Checking the data at every time point does not impair statistical validity |
| Non-asymptotic, non-parametric | Exact coverage with finite samples, without normality assumptions |
| Combinability | e-values combine by multiplication, so multiple experiments can be pooled sequentially |
| Game-theoretic interpretation | The betting view gives intuition and pairs naturally with online algorithms |
Disadvantages and Precautions
| Item | Description | Mitigation |
|---|---|---|
| Interval-width penalty | Up to nearly twice as wide as the classic CI at the same t | The width factor grows only like √(log log t); understand and accept the trade-off |
| Choice of betting strategy (λ) | CS quality varies significantly with the choice of λ | Use an adaptive ONS (Online Newton Step) or a mixture strategy |
| Computational cost | Grid-search CS inversion is expensive in high dimensions | Use the confseq library; prefer distributions with analytic solutions |
| Early samples | CS is extremely wide for t < 30, making practical judgment difficult | Set a minimum-observation (burn-in) period before monitoring |
| Library maturity | confseq and expectation are still pre-stable | Verify core logic yourself; unit tests are mandatory |
ONS (Online Newton Step): A type of adaptive betting strategy that optimizes the betting ratio λ at each step based on previous observations. Compared to a fixed λ, the e-process grows faster and is less sensitive to incorrect λ selection.
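As a concrete illustration of ONS, here is a sketch of an ONS-style betting update for [0, 1]-bounded observations, loosely following the scheme used in the betting-CS literature; the step-size constant 2/(2 − log 3) and the halved betting box are choices taken from that literature, not the only valid options.

```python
import numpy as np

def ons_eprocess(x: np.ndarray, mu_0: float) -> np.ndarray:
    """E-process for H0: mean = mu_0 on [0, 1]-bounded data, with the
    betting fraction adapted by an ONS-style gradient step.
    A sketch, not a substitute for a vetted library implementation."""
    e_values = np.empty(len(x))
    E, lam, a = 1.0, 0.0, 1.0
    # Keep lam inside a box where 1 + lam*(x - mu_0) stays positive
    # for every x in [0, 1]; shrink the box by 1/2 for safety.
    lo = -0.5 / (1 - mu_0 + 1e-12)
    hi = 0.5 / (mu_0 + 1e-12)
    for i, xt in enumerate(x):
        g = xt - mu_0
        E *= 1 + lam * g          # place the bet, update the wealth
        e_values[i] = E
        z = g / (1 + lam * g)     # gradient of log-wealth w.r.t. lam
        a += z * z                # ONS curvature accumulator
        lam = float(np.clip(lam + 2 / (2 - np.log(3)) * z / a, lo, hi))
    return e_values

rng = np.random.default_rng(1)
x = rng.uniform(0.3, 0.9, 500)   # true mean 0.6
e = ons_eprocess(x, mu_0=0.5)    # H0: mean = 0.5 is false
print("final e-value:", e[-1])   # grows far beyond 1/alpha = 20
```

Because λ starts at 0 and adapts from the data itself, no a-priori guess about the effect size is needed; a badly chosen fixed λ is the failure mode this removes.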
Running Intersection: A technique that secures a monotone shrinking interval by taking C_1 ∩ C_2 ∩ … ∩ C_t. It practically resolves the phenomenon where the interval widens over time.
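The running intersection is a one-liner with cumulative max/min. A minimal sketch with made-up bounds (any valid CS bounds would do):

```python
import numpy as np

def running_intersection(lower: np.ndarray, upper: np.ndarray):
    """Replace C_t by C_1 ∩ ... ∩ C_t: the lower bound can only move
    up and the upper bound only down, so the interval never widens.
    This is valid precisely because a CS covers θ at all t at once."""
    return np.maximum.accumulate(lower), np.minimum.accumulate(upper)

# Made-up CS bounds that wobble from step to step
lower = np.array([0.10, 0.30, 0.20, 0.35, 0.30])
upper = np.array([0.90, 0.70, 0.80, 0.65, 0.70])
lo, hi = running_intersection(lower, upper)
print(lo)  # lower bounds 0.1, 0.3, 0.3, 0.35, 0.35 — never decreases
print(hi)  # upper bounds 0.9, 0.7, 0.7, 0.65, 0.65 — never increases
```

Note the same trick would be invalid for classical CIs: intersecting intervals that are each only marginally valid destroys coverage, whereas the CS's simultaneous guarantee makes the intersection free.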
The Most Common Mistakes in Practice
- Setting λ arbitrarily large: if λ is too large relative to the data range, `1 + λ(X - μ_0)` becomes negative, breaking the martingale property of the E-process. The range `|λ| ≤ 1/max|X - μ_0|` must be strictly respected. `max(E, 0)`-style clipping is only a stopgap, and an E-process stuck at 0 loses all statistical power.
- Interpreting a CS like a classical CI: a CS guarantees simultaneous coverage over all time points; it does not guarantee that the marginal coverage at each individual time point is `1-α`. Simultaneous validity and marginal validity are different concepts.
- Making decisions on early samples (`t < 30`): stopping the test in the region where the CS is extremely wide means the effective power is very low. Set a minimum burn-in period and begin monitoring only afterwards.
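For the first mistake, the safe range can be computed up front whenever the observations are bounded. A sketch (the `safe_lambda` helper and its 0.9 safety margin are illustrative choices, not a library function):

```python
import numpy as np

def safe_lambda(x_min: float, x_max: float, mu_0: float,
                margin: float = 0.9) -> float:
    """Largest betting fraction (times a safety margin) that keeps
    1 + lam*(x - mu_0) strictly positive for all x in [x_min, x_max].
    Assumes the data really are bounded by [x_min, x_max]."""
    max_dev = max(abs(x_min - mu_0), abs(x_max - mu_0))
    return margin / max_dev

# Conversion-style data in [0, 1] with H0 mean 0.5: max deviation is 0.5
lam = safe_lambda(0.0, 1.0, mu_0=0.5)
print(lam)  # 1.8

# Every betting factor stays positive across the whole data range:
xs = np.linspace(0.0, 1.0, 101)
print(bool(np.all(1 + lam * (xs - 0.5) > 0)))  # True
```

For unbounded data (e.g. the Gaussian examples above) no such hard bound exists, which is exactly why those examples need either a small λ or the adaptive/mixture strategies discussed earlier.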
In Conclusion
You now have the tools — Confidence Sequences and E-processes — to peek at A/B tests at any time and stop whenever you like without p-hacking. The intervals are wider, yes, but that width is close to the information-theoretic lower bound proven via the LIL: the price of the stronger guarantee. In a real-time experimentation environment, this trade-off is quite reasonable.
3 Steps to Start Right Now:
- `pip install confseq`, then run the examples in this article to apply a CS directly to your existing A/B test data.
- Adapt the `classic_widths` vs `cs_widths` comparison code to your own data's sample-size range and check the width gap numerically.
- In your next A/B test design, apply the adaptive strategy of `confseq.betting.betting_cs` instead of a fixed λ, and compare the early stopping point side by side with the classic t-test.
Next, we will cover the mSPRT (Mixture Sequential Probability Ratio Test): the mathematics of the sequential probability ratio test and its Python implementation.
Reference Materials
- Time-uniform, nonparametric, nonasymptotic confidence sequences | Annals of Statistics
- Game-Theoretic Statistics and Safe Anytime-Valid Inference | Statistical Science
- Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance | arXiv
- A composite generalization of Ville's martingale theorem using e-processes | arXiv
- Always Valid Inference: Continuous Monitoring of A/B Tests | Operations Research
- Confidence Estimation via Sequential Likelihood Mixing | arXiv
- CMU Lecture Notes: Confidence Sequences (Ramdas, 2018)
- Hypothesis Testing with E-values (textbook) | Ramdas & Wang
- confseq GitHub | Howard & Ramdas
- expectation GitHub | Rostami, 2024
- Spotify: Choosing a Sequential Testing Framework
- Eppo: Sequential Testing explainer