Why You Can Look into A/B Tests at Any Time with mSPRT | Anytime-Valid p-value Principles and Implementation Differences Between Netflix, Optimizely, and Spotify
Anyone running an A/B test feels the urge at least once to ask, "Isn't the current data enough?" This act of looking at the results during an experiment is called peeking. In traditional fixed-sample t-tests, this is a serious problem. If 20 looks at α=0.05 were treated as independent tests, the probability of concluding "significant" at least once when the null hypothesis is true (the FWER) would be 1 − 0.95²⁰ ≈ 64%; sequential peeks on accumulating data are correlated, so the inflation is somewhat smaller, but it still far exceeds the nominal 5%. Type I error accumulates with every look at the results.
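The inflation is easy to see in simulation. The sketch below (my own illustration, not from the article; the batch size and number of peeks are arbitrary) repeatedly peeks at a t-test on data generated under a true null and counts how often at least one peek comes out "significant":

```python
# Illustrative simulation (not from the article): repeatedly peeking at a
# fixed-sample t-test inflates the Type I error far beyond the nominal α.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SIM, N_PEEKS, BATCH, ALPHA = 1000, 20, 100, 0.05

false_positives = 0
for _ in range(N_SIM):
    # Both groups drawn from the same distribution: the null is true.
    ctrl = rng.normal(0.0, 1.0, N_PEEKS * BATCH)
    trt = rng.normal(0.0, 1.0, N_PEEKS * BATCH)
    for k in range(1, N_PEEKS + 1):
        n = k * BATCH
        _, p = stats.ttest_ind(trt[:n], ctrl[:n])
        if p < ALPHA:  # any "significant" peek counts as a false positive
            false_positives += 1
            break

print(f"Empirical FWER after {N_PEEKS} peeks: {false_positives / N_SIM:.3f}")
```

The printed rate lands well above the nominal 0.05, which is exactly the problem mSPRT is designed to remove.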
mSPRT (Mixture Sequential Probability Ratio Test) is a sequential hypothesis testing methodology that mathematically solves this peeking problem entirely using Martingale theory, guaranteeing an 'Anytime-Valid' p-value that is 'safe to stop at any time.' Optimizely, Netflix, and Spotify each solved this problem in different ways. Understanding these differences is not merely a matter of curiosity. When designing or selecting a production A/B testing platform, a single choice of methodology can result in a 20% difference in experimental power.
To read this article, you must be familiar with the basic concepts of p-values and hypothesis testing. Although it includes statistical formulas, intuitive explanations are provided alongside each formula, so even non-mathematics majors can follow along. It summarizes the mathematical foundations of mSPRT, the implementation differences among the three companies, and precautions for practical application all in one go.
Key Concepts
Key Terms Used in This Article
Definition of Terms
- Type I Error: The error of rejecting the null hypothesis when it is true. It is controlled at a significance level α.
- FWER (Family-Wise Error Rate): The probability of one or more Type I errors occurring when testing multiple hypotheses simultaneously.
- MDE (Minimum Detectable Effect): The minimum effect size deemed practically meaningful in an experiment.
- Anytime-Valid p-value: A p-value for which the Type I error rate does not exceed α when calculated at any point in time.
- e-value: A non-negative random variable with an expected value of 1 or less under the null hypothesis. The modern theoretical basis of anytime-valid inference.
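To make the e-value definition concrete, here is a minimal check (my own illustration, not from the article): a simple likelihood ratio is an e-value, because its expectation under the null is exactly 1.

```python
# Illustrative check: a likelihood ratio f_θ(X)/f_0(X) is an e-value —
# its expected value under the null hypothesis is 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 200_000)  # data generated under H0: θ = 0
theta = 0.5                        # an arbitrary alternative effect size
e_values = stats.norm.pdf(x, theta, 1) / stats.norm.pdf(x, 0, 1)

print(f"Mean e-value under H0: {e_values.mean():.3f}")  # ≈ 1.0
```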
The Limitations of Classic SPRT and the Birth of mSPRT
The Sequential Probability Ratio Test (SPRT), proposed by Wald (1945), is the originator of sequential testing. It calculates the likelihood ratio for each incoming data point to determine whether to reject the null hypothesis. However, the classic SPRT has a fatal limitation: the treatment effect size θ must be known accurately in advance. Since it is impossible to know before the experiment whether "this button color change will increase the click-through rate by exactly 2.3%," it is difficult to use in practice.
mSPRT solves this problem by using the mixture prior distribution H(θ) instead of a single θ. It integrates the likelihood ratio over the entire set of possible effect sizes.
Λ_n = ∫ [ ∏_{i=1}^{n} f_θ(X_i) / f_0(X_i) ] dH(θ)

Intuitive understanding: f_θ(X_i) is the probability density of observation X_i when the effect size is θ, and f_0(X_i) is the probability density under the null hypothesis (no effect). Λ_n is the average of the ratio of "the probability of this data appearing when there is a treatment effect" to "the probability of it appearing when there is no effect," taken across all possible effect sizes.
For normal data, the most common choice is to set H = N(0, τ²), for which a closed-form solution exists, allowing real-time computation.
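As a sanity check on that closed form (a sketch for a single stream of normal observations with known σ²; the constants are arbitrary), we can compare it against direct numerical integration of the mixture integral above:

```python
# Sanity check (illustrative constants): the closed-form Λ_n for H = N(0, τ²)
# matches direct numerical integration of the mixture likelihood ratio.
import numpy as np
from scipy import stats
from scipy.integrate import quad

n, sigma_sq, tau_sq = 200, 1.0, 0.01
s = n * 0.1  # sufficient statistic: sum of observations (sample mean 0.1 assumed)

# Closed form: Λ_n = sqrt(σ² / (σ² + nτ²)) · exp(τ²·S² / (2σ²(σ² + nτ²)))
closed = np.sqrt(sigma_sq / (sigma_sq + n * tau_sq)) * np.exp(
    tau_sq * s**2 / (2 * sigma_sq * (sigma_sq + n * tau_sq))
)

# Numerical version of ∫ ∏ f_θ(X_i)/f_0(X_i) dH(θ): for normal data the
# likelihood ratio reduces to exp(θS/σ² − nθ²/(2σ²))
def integrand(theta: float) -> float:
    lr = np.exp(theta * s / sigma_sq - n * theta**2 / (2 * sigma_sq))
    return lr * stats.norm.pdf(theta, 0.0, np.sqrt(tau_sq))

numeric, _ = quad(integrand, -1.0, 1.0)
print(f"closed form = {closed:.6f}, numerical = {numeric:.6f}")
```

The two numbers agree to numerical precision, which is why Optimizely-style implementations can evaluate Λ_n with a few arithmetic operations instead of an integral.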
Anytime-Valid p-value Guarantee: Martingales are Key
The reason mSPRT is always safe is that Λ_n becomes a nonnegative martingale under the null hypothesis H₀.
What is a Martingale? In casinos, the strategy of doubling bets to recover losses is called the 'Martingale strategy,' but the term 'Martingale' here has a completely different meaning. It refers to a process in probability where the expected value after the current moment is exactly the same as the current value. The cumulative sum of coin tosses is a typical example.
Under the null hypothesis, Λ_n satisfies the following martingale condition.
E_H₀[Λ_n | Λ_1, ..., Λ_{n-1}] = Λ_{n-1}

Applying Ville's inequality to this property, the following holds.
P(∃n ≥ 1 : Λ_n ≥ 1/α) ≤ α

In other words, the probability that Λ_n ever exceeds 1/α when the null hypothesis is true is at most α over the entire experimental period. Therefore, if we define p_n = 1/Λ_n as the anytime-valid p-value, we obtain a mathematical guarantee that the Type I error rate will not exceed α no matter how many times we look.
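A quick simulation (my own illustration; a single data stream, the normal closed form, and a finite horizon) confirms the Ville-inequality guarantee empirically:

```python
# Empirical check (illustrative): under H0, Λ_n crosses 1/α in at most ~α
# of runs, no matter how many interim looks are taken.
import numpy as np

rng = np.random.default_rng(7)
N_SIM, MAX_N, ALPHA, TAU_SQ, SIGMA_SQ = 1000, 2000, 0.05, 0.01, 1.0

crossings = 0
for _ in range(N_SIM):
    x = rng.normal(0.0, 1.0, MAX_N)  # data generated under the null
    s = np.cumsum(x)                 # running sum (sufficient statistic)
    n = np.arange(1, MAX_N + 1)
    lam = np.sqrt(SIGMA_SQ / (SIGMA_SQ + n * TAU_SQ)) * np.exp(
        TAU_SQ * s**2 / (2 * SIGMA_SQ * (SIGMA_SQ + n * TAU_SQ))
    )
    crossings += lam.max() >= 1 / ALPHA  # did Λ_n ever reach 1/α?

print(f"P(Λ_n ever ≥ 1/α) ≈ {crossings / N_SIM:.3f} (guarantee: ≤ {ALPHA})")
```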
Key Insight Traditional p-values require a conditional interpretation based on the premise that the analysis was performed "exactly at this specific time with this sample size." Anytime-valid p-values do not have this condition. The interpretation is the same regardless of the time point.
This concept is a sequential version of the e-value (an e-process), which has recently attracted attention in the statistics community. Λ_n in mSPRT is a special case of an e-value, and the theory is rapidly being generalized through the work of Grünwald, Ramdas, and others.
Choosing the Mixture Variance τ²: Small Differences Determine Power
In mSPRT design, the most practically important parameter is the variance τ² of the mixture prior. Intuitively, τ² is the width of our prior belief about how spread out the effect size will be. If this value is too narrow, the actual effect can be missed; if it is too wide, the test's power is low.
τ² should be tuned using α (the significance level), σ (the data standard deviation), and M (the expected sample size); in practice it is recommended to back-calculate it from the MDE.
| τ² misspecification | Power loss | Mean run-length increase |
|---|---|---|
| Within 1–2 orders of magnitude of the actual effect | Within 5–10% | Minimal |
| More than 2 orders of magnitude from the actual effect | Up to 20% | Up to 40% longer |
mSPRT is relatively robust as long as τ² is on roughly the same scale as the MDE. Using a completely different scale, however, causes a significant loss of power.
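The robustness claim can be probed with a small sweep (my own sketch; the effect size, horizon, and τ² grid are arbitrary choices, and the exact numbers depend on them):

```python
# Illustrative τ² sweep: power of a single-stream mSPRT at a fixed horizon
# when the mixture variance is matched to, below, or above MDE².
import numpy as np

rng = np.random.default_rng(1)
MDE, SIGMA_SQ, ALPHA = 0.05, 1.0, 0.05
N_SIM, MAX_N = 500, 5000
TAU_GRID = {"MDE^2/10": MDE**2 / 10, "MDE^2": MDE**2, "MDE^2*10": MDE**2 * 10}

powers = {}
for label, tau_sq in TAU_GRID.items():
    rejections = 0
    for _ in range(N_SIM):
        x = rng.normal(MDE, 1.0, MAX_N)  # stream of per-unit differences
        s = np.cumsum(x)
        n = np.arange(1, MAX_N + 1)
        lam = np.sqrt(SIGMA_SQ / (SIGMA_SQ + n * tau_sq)) * np.exp(
            tau_sq * s**2 / (2 * SIGMA_SQ * (SIGMA_SQ + n * tau_sq))
        )
        rejections += lam.max() >= 1 / ALPHA
    powers[label] = rejections / N_SIM

for label, p in powers.items():
    print(f"tau² = {label:>9}: power = {p:.3f}")
```

In runs like this, the matched setting comes out on top and the order-of-magnitude mismatches give up a noticeable slice of power, in line with the table above.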
Comparison of Implementations by Three Companies
Example 1: Optimizely Stats Engine — The Originator of Commercialization
Since January 2015, Optimizely has applied a single-stream mSPRT to the difference between the two groups' observations (control/treatment), Z_n = Y_n − X_n ~ N(θ, 2σ²). Real-time computation is possible because the normal mixture prior H = N(0, τ²) yields a closed-form solution.
```python
# requires: numpy>=1.21
import numpy as np

def msprt_p_value(n: int, mean_diff: float, tau_sq: float, sigma_sq: float) -> float:
    """
    mSPRT anytime-valid p-value under the normal approximation.

    Args:
        n: number of samples collected so far (in each of treatment and control)
        mean_diff: mean difference between treatment and control (original units, not standardized)
        tau_sq: mixture prior variance τ² — recommended to back-calculate from the MDE
        sigma_sq: estimated data variance σ²

    Returns:
        anytime-valid p-value (Type I error rate stays ≤ α no matter when it is computed)
    """
    # Mixture likelihood ratio Λ_n (closed form for the normal mixture prior)
    variance_ratio = sigma_sq / (sigma_sq + n * tau_sq)
    exponent = (tau_sq * n**2 * mean_diff**2) / (2 * sigma_sq * (sigma_sq + n * tau_sq))
    lambda_n = np.sqrt(variance_ratio) * np.exp(exponent)
    # p_n = 1 / Λ_n, clipped so it never exceeds 1.0
    return min(1.0, 1.0 / lambda_n)

# Usage example: 10,000 samples, observed mean difference 0.05, τ²=0.01, σ²=1.0
p = msprt_p_value(n=10000, mean_diff=0.05, tau_sq=0.01, sigma_sq=1.0)
print(f"Anytime-valid p-value: {p:.4f}")
```

| Code Element | Description |
|---|---|
| `mean_diff` | Mean difference between treatment and control (original units) — note this is not a standardized Z-statistic |
| `variance_ratio` | Scaling term of Λ_n — shrinks as samples accumulate |
| `exponent` | Grows as the observed effect grows, increasing Λ_n |
| `min(1.0, 1.0 / lambda_n)` | A p-value cannot exceed 1 |
Optimizely goes a step further: through Stats Accelerator, it exploits the anytime-valid property to implement dynamic traffic allocation, directing more traffic to the most promising variants in real time while maintaining the Type I error guarantee. It also includes built-in logic to handle time-varying signals and prevent Simpson's paradox.
Example 2: Netflix — Scaling to Complex Metrics
Netflix's core problem is play-delay. When deploying new software, they must quickly detect whether play-delay has regressed; however, play-delay does not follow a normal distribution, and changes in its quantiles often matter more than changes in its mean.
Netflix expanded mSPRT with two approaches.
Part 1 — Continuous Data: Using the anytime-valid confidence sequences of Howard & Ramdas (2021), Netflix continuously monitors changes in quantiles as well as in the mean.
```python
# requires: numpy>=1.21
# Note: this is a simplified implementation for conceptual illustration.
# It runs as-is, but the numbers differ from an exact implementation of the
# empirical Bernstein bound in Howard & Ramdas (2021).
# See Algorithm 1 of the paper for a faithful implementation.
import numpy as np

def anytime_valid_confidence_sequence(
    data: np.ndarray,
    alpha: float = 0.05,
) -> tuple[float, float]:
    """
    Simplified anytime-valid confidence interval after Howard & Ramdas (2021).
    Unlike a t-test interval, coverage is guaranteed no matter when it is computed.
    """
    n = len(data)
    mean = np.mean(data)
    var = np.var(data, ddof=1)
    # Simplified time-uniform confidence-sequence radius
    # (a faithful implementation needs additional correction terms)
    radius = np.sqrt(
        2 * var * (np.log(2 / alpha) + np.log(np.log(n) + 1)) / n
    )
    return (mean - radius, mean + radius)

# Usage example
data = np.random.normal(0.1, 1.0, 500)
lower, upper = anytime_valid_confidence_sequence(data)
print(f"Anytime-valid 95% confidence interval: ({lower:.4f}, {upper:.4f})")
```

Part 2 — Count Data: Netflix implemented an anytime-valid test using only two integers: the number of events in the treatment group and in the control group. Netflix released this implementation as a GitHub Gist.
Key Insight The difference between Netflix's approach and Optimizely's is that Netflix applies a theoretically optimized anytime-valid method for each metric type (continuous, count, quantile, regression-adjusted) rather than a single mSPRT formula. This is more complex but yields higher power for each metric.
Example 3: Spotify — Choosing a Different Path with GST
Spotify officially adopted Group Sequential Test (GST) instead of mSPRT in March 2023. The reason is clear. Spotify's experiment data is aggregated in daily and weekly batches rather than as real-time streams. In this environment, the timing of intermediate analysis can be planned in advance, making the alpha spending function of GST more efficient.
The O'Brien-Fleming function used in GST is a "function that sets strict boundaries early on and lenient boundaries later on." In the early stages of an experiment, rejection is only allowed if there is overwhelming evidence, and as data accumulates, rejection is allowed even at increasingly lower standards. Lan-DeMets (1983) is an alpha spending framework that generalizes various boundary types, including this O'Brien-Fleming.
```python
# requires: numpy>=1.21, scipy>=1.7
# Note: this is a simplified implementation for conceptual illustration.
# It runs as-is, but the numbers may differ from an exact implementation
# based on the numerical integration in Lan-DeMets (1983).
import numpy as np
from scipy import stats

def obrien_fleming_alpha_spending(
    t: float,
    alpha: float = 0.05
) -> float:
    """
    O'Brien-Fleming-type alpha spending function (Lan-DeMets framework).
    O'Brien-Fleming is a specific boundary shape (strict early, lenient late);
    Lan-DeMets is the general alpha-spending framework that includes it.

    Args:
        t: current information fraction (0–1: current n / planned maximum N)
        alpha: overall significance level

    Returns:
        cumulative alpha spent up to time t
    """
    return 2 * (1 - stats.norm.cdf(
        stats.norm.ppf(1 - alpha / 2) / np.sqrt(t)
    ))

# Usage example: five planned interim analyses
analysis_points = [0.2, 0.4, 0.6, 0.8, 1.0]
prev_spent = 0.0
print(f"{'info frac':>10} {'α (this look)':>14} {'cum. spent':>12}")
for t in analysis_points:
    cumulative = obrien_fleming_alpha_spending(t, alpha=0.05)
    incremental = cumulative - prev_spent
    print(f"{t:>10.1f} {incremental:>14.5f} {cumulative:>12.5f}")
    prev_spent = cumulative
```

| Comparison Item | mSPRT (Optimizely·Netflix) | GST (Spotify) |
|---|---|---|
| Analysis Time | Anytime, no restrictions | Only at pre-planned times |
| Pre-specify sample size | Unnecessary | Required |
| Batch environment power | Lower compared to GST | Higher |
| Tolerance for Sample Size Misestimation | Robust | Vulnerable |
| Real-time Streaming Suitability | High | Low |
mSPRT vs. GST: A Direct Comparison on Identical Data
Both mSPRT and GST are applied to the same simulated data to compare power and mean sample size.
```python
# requires: numpy>=1.21, scipy>=1.7
# Note: the GST implementation is simplified conceptual code.
# The numbers indicate trends only and differ from exact published implementations.
import numpy as np
from scipy import stats

np.random.seed(42)
TRUE_EFFECT = 0.05   # true effect size
SIGMA = 1.0
N_SIM = 1_000        # number of simulation runs
MAX_N = 5_000        # maximum sample size (per group)
ALPHA = 0.05
TAU_SQ = 0.01        # mSPRT mixture variance (set from an MDE of 0.05)

def msprt_test(ctrl: np.ndarray, trt: np.ndarray) -> tuple[bool, int]:
    """mSPRT: continuous monitoring, adding data in batches of 50."""
    sigma_sq = SIGMA ** 2
    for n in range(50, len(ctrl) + 1, 50):
        mean_diff = np.mean(trt[:n]) - np.mean(ctrl[:n])
        v_ratio = sigma_sq / (sigma_sq + n * TAU_SQ)
        exp_term = (TAU_SQ * n**2 * mean_diff**2) / (2 * sigma_sq * (sigma_sq + n * TAU_SQ))
        if np.sqrt(v_ratio) * np.exp(exp_term) >= 1 / ALPHA:
            return True, n
    return False, MAX_N

def gst_test(ctrl: np.ndarray, trt: np.ndarray) -> tuple[bool, int]:
    """GST (O'Brien-Fleming): five pre-planned interim analyses."""
    analysis_fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
    prev_spent = 0.0
    for frac in analysis_fractions:
        n = int(frac * MAX_N)
        cum_spent = 2 * (1 - stats.norm.cdf(
            stats.norm.ppf(1 - ALPHA / 2) / np.sqrt(frac)
        ))
        local_alpha = cum_spent - prev_spent
        prev_spent = cum_spent
        z = (np.mean(trt[:n]) - np.mean(ctrl[:n])) / (SIGMA * np.sqrt(2 / n))
        if 2 * (1 - stats.norm.cdf(abs(z))) < local_alpha:
            return True, n
    return False, MAX_N

results = {"mSPRT": [0, 0], "GST ": [0, 0]}
for _ in range(N_SIM):
    ctrl = np.random.normal(0, SIGMA, MAX_N)
    trt = np.random.normal(TRUE_EFFECT, SIGMA, MAX_N)
    for name, fn in [("mSPRT", msprt_test), ("GST ", gst_test)]:
        rejected, n = fn(ctrl, trt)
        results[name][0] += rejected
        results[name][1] += n

print(f"true effect={TRUE_EFFECT}, σ={SIGMA}, simulations={N_SIM}\n")
print(f"{'method':<8} {'power':>8} {'avg. n':>12}")
for name, (power, total_n) in results.items():
    print(f"{name:<8} {power/N_SIM:>8.3f} {total_n/N_SIM:>12.0f}")
```

Running this code lets you verify the actual trade-off between power and average sample size for the two methods. In general, GST shows higher power at the pre-planned analysis times, while mSPRT is more flexible but has slightly lower power at the same sample size.
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Peeking Safety | Guarantees Type I error rate α even during continuous monitoring, resolves the core weakness of conventional t-tests |
| Sample size flexibility | No need to fix the experiment period in advance |
| Early Termination | Conclusions can be reached immediately upon securing sufficient evidence, preventing resource waste |
| τ² Robustness | Power loss within 5–10% even if the mixture variance is off by 1–2 orders of magnitude |
| Theoretical Completeness | Connected to the e-value framework; extendable to composite null hypotheses and regression adjustment |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Loss of power | Low power relative to GST when sample size is known | Review GST in batch environment |
| τ² Selection Difficulty | Setting strongly affects performance; domain knowledge required | Pre-define the MDE, then back-calculate τ² |
| Non-normalized metric complexity | Numerical integration required for ratios, counts, etc. due to the absence of closed-forms | Refer to Netflix's public implementation |
| Multiple tests unprotected | FWER uncontrolled when multiple metrics are applied simultaneously | Bonferroni or e-value combination applied |
| Caution regarding two-sided testing | Separate handling required when setting directional alternative hypotheses | Design clearly to distinguish between one-sided and two-sided testing |
The Most Common Mistakes in Practice
- Setting τ² by feel — Define the MDE first, then back-calculate τ² from α, σ, and the expected sample size M. Setting τ² by feel can reduce the power of the test by up to 20%.
- Not correcting α when applying mSPRT to multiple metrics simultaneously — With 10 metrics, lower each test's α to 0.05/10 = 0.005 via Bonferroni correction, or combine the per-metric e-values (multiplying e-values requires independence; averaging them is valid under arbitrary dependence).
- Always choosing mSPRT, even in batch analysis environments — If data is aggregated daily and the experiment horizon is known in advance, GST provides higher power at the same α. Spotify's choice is well-founded.
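For the multiple-metrics mistake, here is a hedged sketch of the two corrections (the e-values are invented for illustration; averaging e-values is valid under arbitrary dependence, while multiplying them requires independence):

```python
# Illustrative multiple-metric handling (invented numbers).
import numpy as np

def combine_evalues_mean(e_values: list[float]) -> float:
    """The arithmetic mean of e-values is itself an e-value for the global
    null (all metric nulls true), under arbitrary dependence between metrics."""
    return float(np.mean(e_values))

# Three metrics, each monitored with its own mSPRT e-process Λ_n:
metric_e = [35.0, 0.8, 2.1]

combined = combine_evalues_mean(metric_e)
print(f"Combined e-value: {combined:.2f} → reject the global null if ≥ 1/α")

# Alternatively, Bonferroni on the anytime-valid p-values p_i = 1/Λ_i:
# reject metric i only if p_i < α / m.
alpha, m = 0.05, len(metric_e)
rejected = [1 / e < alpha / m for e in metric_e]
print(f"Bonferroni-rejected metrics: {rejected}")
```

Note the two corrections answer different questions: the combined e-value tests whether anything moved at all, while Bonferroni identifies which individual metrics moved.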
In Conclusion
While mSPRT mathematically guarantees "safe at all times" A/B testing through the Martingale theory, the data environment determines the choice of methodology, as demonstrated by the examples of Optimizely, Netflix, and Spotify. The best implementation varies completely depending on whether the data is a real-time stream or a daily batch, and whether the metrics are based on a simple normal distribution or quantiles.
3 Steps to Start Right Now (It is effective to proceed in the order of 1→2→3):
- Change the comparison code to your own metric values and run it — By replacing `TRUE_EFFECT` and `TAU_SQ` in the simulation code above with your own domain's MDE, you can directly verify the power trade-off between mSPRT and GST. If you are more comfortable with statistical packages than Python, the R `MSPRT` package (`install.packages("MSPRT")`) supports the same kind of simulation.
- Run the Netflix public GitHub Gist — Apply the anytime-valid test code for count data to your actual experimental data and compare it with existing χ² test results.
- Diagnose your analysis environment — If the data is a real-time stream, prioritize mSPRT; if it is a daily batch and the experiment period can be fixed, prioritize GST.
Next Post: e-value and SAVI (Safe Anytime-Valid Inference) Framework — e-value, briefly mentioned in this post, actually solves a much broader range of problems than mSPRT. A summary of the latest theories extending anytime-valid inference to composite null hypotheses, regression adjustments, and multivariate testing.
Reference Materials
- Always Valid Inference: Continuous Monitoring of A/B Tests | Johari et al., arXiv 1512.04922 — The theoretical foundation of the Optimizely Stats Engine, the core paper for the mSPRT anytime-valid guarantee
- Sequential A/B Testing Keeps the World Streaming Part 1: Continuous Data | Netflix TechBlog — Details of Netflix’s implementation of anytime-valid confidence sequences for continuous data
- Sequential A/B Testing Keeps the World Streaming Part 2: Counting Processes | Netflix TechBlog — Anytime-valid test for Netflix's count data, includes GitHub Gist link
- Choosing a Sequential Testing Framework — Comparisons and Discussions | Spotify Engineering — Why Spotify Chose GST Instead of mSPRT and the Simulation Basis
- Bringing Sequential Testing to Experiments with Longitudinal Data (Part 2) | Spotify Engineering — Spotify’s Sequential Testing Extension Framework for Longitudinal Data
- Stats Accelerator Overview | Optimizely Support — Optimizely Stats Accelerator (Dynamic Traffic Allocation) Official Documentation
- Optimizely Stats Engine Whitepaper | Optimizely — Original documentation for the mathematical implementation of the Stats Engine
- Sequential Test for Practical Significance: Truncated mSPRT | arXiv 2509.07892 — A study on truncated mSPRT with explicitly integrated MDE (2025)
- Anytime-Valid Inference For Multinomial Count Data | Netflix Research — Netflix's anytime-valid inference for multinomial count data (presented at NeurIPS)
- Mastering the mSPRT for A/B Testing | HackerNoon — mSPRT Practical Application Tutorial
- Comparison of Statistical Power: SPRT, AGILE, Always Valid Inference | Analytics-Toolkit.com — Quantitative Comparative Analysis of Statistical Power Among Sequential Test Methodologies
- Sequential Testing for Statistical Inference | Amplitude Experiment Docs — Reference document for Amplitude's mSPRT-based implementation
- Sequential Tests | Spotify Confidence Documentation — Official documentation for GST-based sequential tests on the Spotify Confidence platform
- E-values | Wikipedia — Explanation of the e-value concept and its theoretical connection to mSPRT