Why You Can Look into A/B Tests at Any Time with mSPRT | Anytime-Valid p-value Principles and Implementation Differences Between Netflix, Optimizely, and Spotify
Anyone running an A/B test feels the urge at least once to ask, "Isn't the current data enough?" This act of looking at the results during an experiment is called peeking. In traditional fixed-sample t-tests, this is a serious problem. If 20 looks at α=0.05 were treated as independent tests, the probability of concluding "significant" at least once when the null hypothesis is true (the FWER) would be 1 − 0.95²⁰ ≈ 64%; sequential peeks on accumulating data are correlated, so the inflation is somewhat smaller, but it still far exceeds the nominal 5%. Type I error accumulates with every look at the results.
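The inflation is easy to see in simulation. The sketch below (my own illustration, not from the article; the batch size and number of peeks are arbitrary) repeatedly peeks at a t-test on data generated under a true null and counts how often at least one peek comes out "significant":

```python
# Illustrative simulation (not from the article): repeatedly peeking at a
# fixed-sample t-test inflates the Type I error far beyond the nominal α.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SIM, N_PEEKS, BATCH, ALPHA = 1000, 20, 100, 0.05

false_positives = 0
for _ in range(N_SIM):
    # Both groups drawn from the same distribution: the null is true.
    ctrl = rng.normal(0.0, 1.0, N_PEEKS * BATCH)
    trt = rng.normal(0.0, 1.0, N_PEEKS * BATCH)
    for k in range(1, N_PEEKS + 1):
        n = k * BATCH
        _, p = stats.ttest_ind(trt[:n], ctrl[:n])
        if p < ALPHA:  # any "significant" peek counts as a false positive
            false_positives += 1
            break

print(f"Empirical FWER after {N_PEEKS} peeks: {false_positives / N_SIM:.3f}")
```

The printed rate lands well above the nominal 0.05, which is exactly the problem mSPRT is designed to remove.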
mSPRT (Mixture Sequential Probability Ratio Test) is a sequential hypothesis testing methodology that mathematically solves this peeking problem entirely using Martingale theory, guaranteeing an 'Anytime-Valid' p-value that is 'safe to stop at any time.' Optimizely, Netflix, and Spotify each solved this problem in different ways. Understanding these differences is not merely a matter of curiosity. When designing or selecting a production A/B testing platform, a single choice of methodology can result in a 20% difference in experimental power.
To read this article, you must be familiar with the basic concepts of p-values and hypothesis testing. Although it includes statistical formulas, intuitive explanations are provided alongside each formula, so even non-mathematics majors can follow along. It summarizes the mathematical foundations of mSPRT, the implementation differences among the three companies, and precautions for practical application all in one go.
Key Concepts
Key Terms Used in This Article
Definition of Terms
- Type I Error: The error of rejecting the null hypothesis when it is true. It is controlled at a significance level α.
- FWER (Family-Wise Error Rate): The probability of one or more Type I errors occurring when testing multiple hypotheses simultaneously.
- MDE (Minimum Detectable Effect): The minimum effect size deemed practically meaningful in an experiment.
- Anytime-Valid p-value: A p-value for which the Type I error rate does not exceed α when calculated at any point in time.
- e-value: A non-negative random variable with an expected value of 1 or less under the null hypothesis. The modern theoretical basis of anytime-valid inference.
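To make the e-value definition concrete, here is a minimal check (my own illustration, not from the article): a simple likelihood ratio is an e-value, because its expectation under the null is exactly 1.

```python
# Illustrative check: a likelihood ratio f_θ(X)/f_0(X) is an e-value —
# its expected value under the null hypothesis is 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 200_000)  # data generated under H0: θ = 0
theta = 0.5                        # an arbitrary alternative effect size
e_values = stats.norm.pdf(x, theta, 1) / stats.norm.pdf(x, 0, 1)

print(f"Mean e-value under H0: {e_values.mean():.3f}")  # ≈ 1.0
```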
The Limitations of Classic SPRT and the Birth of mSPRT
The Sequential Probability Ratio Test (SPRT), proposed by Wald (1945), is the originator of sequential testing. It calculates the likelihood ratio for each incoming data point to determine whether to reject the null hypothesis. However, the classic SPRT has a fatal limitation: the treatment effect size θ must be known accurately in advance. Since it is impossible to know before the experiment whether "this button color change will increase the click-through rate by exactly 2.3%," it is difficult to use in practice.
mSPRT solves this problem by using the mixture prior distribution H(θ) instead of a single θ. It integrates the likelihood ratio over the entire set of possible effect sizes.
Λ_n = ∫ [ ∏_{i=1}^{n} f_θ(X_i) / f_0(X_i) ] dH(θ)

Intuitive understanding: f_θ(X_i) is the probability density of observation X_i when the effect size is θ, and f_0(X_i) is the probability density under the null hypothesis (no effect). Λ_n is the average of the ratio of "the probability of this data appearing when there is a treatment effect" to "the probability of it appearing when there is no effect," taken across all possible effect sizes.
For normal data, the most common choice is to set H = N(0, τ²), for which a closed-form solution exists, allowing real-time computation.
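As a sanity check on that closed form (a sketch for a single stream of normal observations with known σ²; the constants are arbitrary), we can compare it against direct numerical integration of the mixture integral above:

```python
# Sanity check (illustrative constants): the closed-form Λ_n for H = N(0, τ²)
# matches direct numerical integration of the mixture likelihood ratio.
import numpy as np
from scipy import stats
from scipy.integrate import quad

n, sigma_sq, tau_sq = 200, 1.0, 0.01
s = n * 0.1  # sufficient statistic: sum of observations (sample mean 0.1 assumed)

# Closed form: Λ_n = sqrt(σ² / (σ² + nτ²)) · exp(τ²·S² / (2σ²(σ² + nτ²)))
closed = np.sqrt(sigma_sq / (sigma_sq + n * tau_sq)) * np.exp(
    tau_sq * s**2 / (2 * sigma_sq * (sigma_sq + n * tau_sq))
)

# Numerical version of ∫ ∏ f_θ(X_i)/f_0(X_i) dH(θ): for normal data the
# likelihood ratio reduces to exp(θS/σ² − nθ²/(2σ²))
def integrand(theta: float) -> float:
    lr = np.exp(theta * s / sigma_sq - n * theta**2 / (2 * sigma_sq))
    return lr * stats.norm.pdf(theta, 0.0, np.sqrt(tau_sq))

numeric, _ = quad(integrand, -1.0, 1.0)
print(f"closed form = {closed:.6f}, numerical = {numeric:.6f}")
```

The two numbers agree to numerical precision, which is why Optimizely-style implementations can evaluate Λ_n with a few arithmetic operations instead of an integral.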
Anytime-Valid p-value Guarantee: Martingales are Key
The reason mSPRT is always safe is that Λ_n becomes a nonnegative martingale under the null hypothesis H₀.
What is a Martingale? In casinos, the strategy of doubling bets to recover losses is called the 'Martingale strategy,' but the term 'Martingale' here has a completely different meaning. It refers to a process in probability where the expected value after the current moment is exactly the same as the current value. The cumulative sum of coin tosses is a typical example.
Under the null hypothesis, Λ_n satisfies the following martingale condition.
E_H₀[Λ_n | Λ_1, ..., Λ_{n-1}] = Λ_{n-1}

Applying Ville's inequality to this property, the following holds.
P(∃n ≥ 1 : Λ_n ≥ 1/α) ≤ α

In other words, the probability that Λ_n ever exceeds 1/α when the null hypothesis is true is at most α over the entire experimental period. Therefore, if we define p_n = 1/Λ_n as the anytime-valid p-value, we obtain a mathematical guarantee that the Type I error rate will not exceed α no matter how many times we look.
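A quick simulation (my own illustration; a single data stream, the normal closed form, and a finite horizon) confirms the Ville-inequality guarantee empirically:

```python
# Empirical check (illustrative): under H0, Λ_n crosses 1/α in at most ~α
# of runs, no matter how many interim looks are taken.
import numpy as np

rng = np.random.default_rng(7)
N_SIM, MAX_N, ALPHA, TAU_SQ, SIGMA_SQ = 1000, 2000, 0.05, 0.01, 1.0

crossings = 0
for _ in range(N_SIM):
    x = rng.normal(0.0, 1.0, MAX_N)  # data generated under the null
    s = np.cumsum(x)                 # running sum (sufficient statistic)
    n = np.arange(1, MAX_N + 1)
    lam = np.sqrt(SIGMA_SQ / (SIGMA_SQ + n * TAU_SQ)) * np.exp(
        TAU_SQ * s**2 / (2 * SIGMA_SQ * (SIGMA_SQ + n * TAU_SQ))
    )
    crossings += lam.max() >= 1 / ALPHA  # did Λ_n ever reach 1/α?

print(f"P(Λ_n ever ≥ 1/α) ≈ {crossings / N_SIM:.3f} (guarantee: ≤ {ALPHA})")
```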
Key Insight Traditional p-values require a conditional interpretation based on the premise that the analysis was performed "exactly at this specific time with this sample size." Anytime-valid p-values do not have this condition. The interpretation is the same regardless of the time point.
This concept is a sequential version of the e-value (an e-process), which has recently attracted attention in the statistics community. Λ_n in mSPRT is a special case of an e-value, and the theory is rapidly being generalized through the work of Grünwald, Ramdas, and others.
Choosing the Mixture Variance τ²: Small Differences Determine Power
In mSPRT design, the most practically important parameter is the variance τ² of the mixture prior. Intuitively, τ² is the width of our prior belief about how spread out the effect size will be. If this value is too narrow, the actual effect can be missed; if it is too wide, the test's power is low.
τ² should be tuned using α (the significance level), σ (the data standard deviation), and M (the expected sample size); in practice it is recommended to back-calculate it from the MDE.
| τ² misspecification | Power loss | Mean run-length increase |
|---|---|---|
| Within 1–2 orders of magnitude of the actual effect | Within 5–10% | Minimal |
| More than 2 orders of magnitude from the actual effect | Up to 20% | Up to 40% longer |
mSPRT is relatively robust as long as τ² is on roughly the same scale as the MDE. Using a completely different scale, however, causes a significant loss of power.
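The robustness claim can be probed with a small sweep (my own sketch; the effect size, horizon, and τ² grid are arbitrary choices, and the exact numbers depend on them):

```python
# Illustrative τ² sweep: power of a single-stream mSPRT at a fixed horizon
# when the mixture variance is matched to, below, or above MDE².
import numpy as np

rng = np.random.default_rng(1)
MDE, SIGMA_SQ, ALPHA = 0.05, 1.0, 0.05
N_SIM, MAX_N = 500, 5000
TAU_GRID = {"MDE^2/10": MDE**2 / 10, "MDE^2": MDE**2, "MDE^2*10": MDE**2 * 10}

powers = {}
for label, tau_sq in TAU_GRID.items():
    rejections = 0
    for _ in range(N_SIM):
        x = rng.normal(MDE, 1.0, MAX_N)  # stream of per-unit differences
        s = np.cumsum(x)
        n = np.arange(1, MAX_N + 1)
        lam = np.sqrt(SIGMA_SQ / (SIGMA_SQ + n * tau_sq)) * np.exp(
            tau_sq * s**2 / (2 * SIGMA_SQ * (SIGMA_SQ + n * tau_sq))
        )
        rejections += lam.max() >= 1 / ALPHA
    powers[label] = rejections / N_SIM

for label, p in powers.items():
    print(f"tau² = {label:>9}: power = {p:.3f}")
```

In runs like this, the matched setting comes out on top and the order-of-magnitude mismatches give up a noticeable slice of power, in line with the table above.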
Comparison of Implementations by Three Companies
Example 1: Optimizely Stats Engine — The Originator of Commercialization
Since January 2015, Optimizely has applied a single-stream mSPRT to the difference between the two groups' observations (control/treatment), Z_n = Y_n − X_n ~ N(θ, 2σ²). Real-time computation is possible because the normal mixture prior H = N(0, τ²) yields a closed-form solution.
```python
# requires: numpy>=1.21
import numpy as np

def msprt_p_value(n: int, mean_diff: float, tau_sq: float, sigma_sq: float) -> float:
    """
    mSPRT anytime-valid p-value under the normal approximation.

    Args:
        n: number of samples collected so far (in each of treatment and control)
        mean_diff: mean difference between treatment and control (original units, not standardized)
        tau_sq: mixture prior variance τ² — recommended to back-calculate from the MDE
        sigma_sq: estimated data variance σ²

    Returns:
        anytime-valid p-value (Type I error rate stays ≤ α no matter when it is computed)
    """
    # Mixture likelihood ratio Λ_n (closed form for the normal mixture prior)
    variance_ratio = sigma_sq / (sigma_sq + n * tau_sq)
    exponent = (tau_sq * n**2 * mean_diff**2) / (2 * sigma_sq * (sigma_sq + n * tau_sq))
    lambda_n = np.sqrt(variance_ratio) * np.exp(exponent)
    # p_n = 1 / Λ_n, clipped so it never exceeds 1.0
    return min(1.0, 1.0 / lambda_n)

# Usage example: 10,000 samples, observed mean difference 0.05, τ²=0.01, σ²=1.0
p = msprt_p_value(n=10000, mean_diff=0.05, tau_sq=0.01, sigma_sq=1.0)
print(f"Anytime-valid p-value: {p:.4f}")
```

| Code Element | Description |
|---|---|
| `mean_diff` | Mean difference between treatment and control (original units) — note this is not a standardized Z-statistic |
| `variance_ratio` | Scaling term of Λ_n — shrinks as samples accumulate |
| `exponent` | Grows as the observed effect grows, increasing Λ_n |
| `min(1.0, 1.0 / lambda_n)` | A p-value cannot exceed 1 |
Optimizely goes a step further: through Stats Accelerator, it exploits the anytime-valid property to implement dynamic traffic allocation, directing more traffic to the most promising variants in real time while maintaining the Type I error guarantee. It also includes built-in logic to handle time-varying signals and prevent Simpson's paradox.
Example 2: Netflix — Scaling to Complex Metrics
Netflix's core problem is play-delay. When deploying new software, they must quickly detect whether play-delay has regressed; however, play-delay does not follow a normal distribution, and changes in its quantiles often matter more than changes in its mean.
Netflix expanded mSPRT with two approaches.
Part 1 — Continuous Data: Using the anytime-valid confidence sequences of Howard & Ramdas (2021), Netflix continuously monitors changes in quantiles as well as in the mean.
```python
# requires: numpy>=1.21
# Note: this is a simplified implementation for conceptual illustration.
# It runs as-is, but the numbers differ from an exact implementation of the
# empirical Bernstein bound in Howard & Ramdas (2021).
# See Algorithm 1 of the paper for a faithful implementation.
import numpy as np

def anytime_valid_confidence_sequence(
    data: np.ndarray,
    alpha: float = 0.05,
) -> tuple[float, float]:
    """
    Simplified anytime-valid confidence interval after Howard & Ramdas (2021).
    Unlike a t-test interval, coverage is guaranteed no matter when it is computed.
    """
    n = len(data)
    mean = np.mean(data)
    var = np.var(data, ddof=1)
    # Simplified time-uniform confidence-sequence radius
    # (a faithful implementation needs additional correction terms)
    radius = np.sqrt(
        2 * var * (np.log(2 / alpha) + np.log(np.log(n) + 1)) / n
    )
    return (mean - radius, mean + radius)

# Usage example
data = np.random.normal(0.1, 1.0, 500)
lower, upper = anytime_valid_confidence_sequence(data)
print(f"Anytime-valid 95% confidence interval: ({lower:.4f}, {upper:.4f})")
```

Part 2 — Count Data: Netflix implemented an anytime-valid test using only two integers: the number of events in the treatment group and in the control group. Netflix released this implementation as a GitHub Gist.
Key Insight The difference between Netflix's approach and Optimizely's is that Netflix applies a theoretically optimized anytime-valid method for each metric type (continuous, count, quantile, regression-adjusted) rather than a single mSPRT formula. This is more complex but yields higher power for each metric.
Example 3: Spotify — Choosing a Different Path with GST
Spotify officially adopted Group Sequential Test (GST) instead of mSPRT in March 2023. The reason is clear. Spotify's experiment data is aggregated in daily and weekly batches rather than as real-time streams. In this environment, the timing of intermediate analysis can be planned in advance, making the alpha spending function of GST more efficient.
The O'Brien-Fleming function used in GST is a "function that sets strict boundaries early on and lenient boundaries later on." In the early stages of an experiment, rejection is only allowed if there is overwhelming evidence, and as data accumulates, rejection is allowed even at increasingly lower standards. Lan-DeMets (1983) is an alpha spending framework that generalizes various boundary types, including this O'Brien-Fleming.
```python
# requires: numpy>=1.21, scipy>=1.7
# Note: this is a simplified implementation for conceptual illustration.
# It runs as-is, but the numbers may differ from an exact implementation
# based on the numerical integration in Lan-DeMets (1983).
import numpy as np
from scipy import stats

def obrien_fleming_alpha_spending(
    t: float,
    alpha: float = 0.05
) -> float:
    """
    O'Brien-Fleming-type alpha spending function (Lan-DeMets framework).
    O'Brien-Fleming is a specific boundary shape (strict early, lenient late);
    Lan-DeMets is the general alpha-spending framework that includes it.

    Args:
        t: current information fraction (0–1: current n / planned maximum N)
        alpha: overall significance level

    Returns:
        cumulative alpha spent up to time t
    """
    return 2 * (1 - stats.norm.cdf(
        stats.norm.ppf(1 - alpha / 2) / np.sqrt(t)
    ))

# Usage example: five planned interim analyses
analysis_points = [0.2, 0.4, 0.6, 0.8, 1.0]
prev_spent = 0.0
print(f"{'info frac':>10} {'α (this look)':>14} {'cum. spent':>12}")
for t in analysis_points:
    cumulative = obrien_fleming_alpha_spending(t, alpha=0.05)
    incremental = cumulative - prev_spent
    print(f"{t:>10.1f} {incremental:>14.5f} {cumulative:>12.5f}")
    prev_spent = cumulative
```

| Comparison Item | mSPRT (Optimizely·Netflix) | GST (Spotify) |
|---|---|---|
| Analysis Time | Anytime, no restrictions | Only at pre-planned times |
| Pre-specify sample size | Unnecessary | Required |
| Batch environment power | Lower compared to GST | Higher |
| Tolerance for Sample Size Misestimation | Robust | Vulnerable |
| Real-time Streaming Suitability | High | Low |
mSPRT vs. GST: A Direct Comparison on Identical Data
Both mSPRT and GST are applied to the same simulated data to compare power and mean sample size.
```python
# requires: numpy>=1.21, scipy>=1.7
# Note: the GST implementation is simplified conceptual code.
# The numbers indicate trends only and differ from exact published implementations.
import numpy as np
from scipy import stats

np.random.seed(42)
TRUE_EFFECT = 0.05   # true effect size
SIGMA = 1.0
N_SIM = 1_000        # number of simulation runs
MAX_N = 5_000        # maximum sample size (per group)
ALPHA = 0.05
TAU_SQ = 0.01        # mSPRT mixture variance (set from an MDE of 0.05)

def msprt_test(ctrl: np.ndarray, trt: np.ndarray) -> tuple[bool, int]:
    """mSPRT: continuous monitoring, adding data in batches of 50."""
    sigma_sq = SIGMA ** 2
    for n in range(50, len(ctrl) + 1, 50):
        mean_diff = np.mean(trt[:n]) - np.mean(ctrl[:n])
        v_ratio = sigma_sq / (sigma_sq + n * TAU_SQ)
        exp_term = (TAU_SQ * n**2 * mean_diff**2) / (2 * sigma_sq * (sigma_sq + n * TAU_SQ))
        if np.sqrt(v_ratio) * np.exp(exp_term) >= 1 / ALPHA:
            return True, n
    return False, MAX_N

def gst_test(ctrl: np.ndarray, trt: np.ndarray) -> tuple[bool, int]:
    """GST (O'Brien-Fleming): five pre-planned interim analyses."""
    analysis_fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
    prev_spent = 0.0
    for frac in analysis_fractions:
        n = int(frac * MAX_N)
        cum_spent = 2 * (1 - stats.norm.cdf(
            stats.norm.ppf(1 - ALPHA / 2) / np.sqrt(frac)
        ))
        local_alpha = cum_spent - prev_spent
        prev_spent = cum_spent
        z = (np.mean(trt[:n]) - np.mean(ctrl[:n])) / (SIGMA * np.sqrt(2 / n))
        if 2 * (1 - stats.norm.cdf(abs(z))) < local_alpha:
            return True, n
    return False, MAX_N

results = {"mSPRT": [0, 0], "GST ": [0, 0]}
for _ in range(N_SIM):
    ctrl = np.random.normal(0, SIGMA, MAX_N)
    trt = np.random.normal(TRUE_EFFECT, SIGMA, MAX_N)
    for name, fn in [("mSPRT", msprt_test), ("GST ", gst_test)]:
        rejected, n = fn(ctrl, trt)
        results[name][0] += rejected
        results[name][1] += n

print(f"true effect={TRUE_EFFECT}, σ={SIGMA}, simulations={N_SIM}\n")
print(f"{'method':<8} {'power':>8} {'avg. n':>12}")
for name, (power, total_n) in results.items():
    print(f"{name:<8} {power/N_SIM:>8.3f} {total_n/N_SIM:>12.0f}")
```

Running this code lets you verify the actual trade-off between power and average sample size for the two methods. In general, GST shows higher power at the pre-planned analysis times, while mSPRT is more flexible but has slightly lower power at the same sample size.
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Peeking Safety | Guarantees Type I error rate α even during continuous monitoring, resolves the core weakness of conventional t-tests |
| Sample size flexibility | No need to fix the experiment period in advance |
| Early Termination | Conclusions can be reached immediately upon securing sufficient evidence, preventing resource waste |
| τ² Robustness | Power loss within 5–10% even if the mixture variance is off by 1–2 orders of magnitude |
| Theoretical Completeness | Connected to the e-value framework; extendable to composite null hypotheses and regression adjustment |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Loss of power | Low power relative to GST when sample size is known | Review GST in batch environment |
| τ² Selection Difficulty | Setting strongly affects performance; domain knowledge required | Pre-define the MDE, then back-calculate τ² |
| Non-normalized metric complexity | Numerical integration required for ratios, counts, etc. due to the absence of closed-forms | Refer to Netflix's public implementation |
| Multiple tests unprotected | FWER uncontrolled when multiple metrics are applied simultaneously | Bonferroni or e-value combination applied |
| Caution regarding two-sided testing | Separate handling required when setting directional alternative hypotheses | Design clearly to distinguish between one-sided and two-sided testing |
The Most Common Mistakes in Practice
- Setting τ² by feel — Define the MDE first, then back-calculate τ² from α, σ, and the expected sample size M. Setting τ² by feel can reduce the power of the test by up to 20%.
- Not correcting α when applying mSPRT to multiple metrics simultaneously — With 10 metrics, lower each test's α to 0.05/10 = 0.005 via Bonferroni correction, or combine the per-metric e-values (multiplying e-values requires independence; averaging them is valid under arbitrary dependence).
- Always choosing mSPRT, even in batch analysis environments — If data is aggregated daily and the experiment horizon is known in advance, GST provides higher power at the same α. Spotify's choice is well-founded.
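For the multiple-metrics mistake, here is a hedged sketch of the two corrections (the e-values are invented for illustration; averaging e-values is valid under arbitrary dependence, while multiplying them requires independence):

```python
# Illustrative multiple-metric handling (invented numbers).
import numpy as np

def combine_evalues_mean(e_values: list[float]) -> float:
    """The arithmetic mean of e-values is itself an e-value for the global
    null (all metric nulls true), under arbitrary dependence between metrics."""
    return float(np.mean(e_values))

# Three metrics, each monitored with its own mSPRT e-process Λ_n:
metric_e = [35.0, 0.8, 2.1]

combined = combine_evalues_mean(metric_e)
print(f"Combined e-value: {combined:.2f} → reject the global null if ≥ 1/α")

# Alternatively, Bonferroni on the anytime-valid p-values p_i = 1/Λ_i:
# reject metric i only if p_i < α / m.
alpha, m = 0.05, len(metric_e)
rejected = [1 / e < alpha / m for e in metric_e]
print(f"Bonferroni-rejected metrics: {rejected}")
```

Note the two corrections answer different questions: the combined e-value tests whether anything moved at all, while Bonferroni identifies which individual metrics moved.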
In Conclusion
While mSPRT mathematically guarantees "safe at all times" A/B testing through the Martingale theory, the data environment determines the choice of methodology, as demonstrated by the examples of Optimizely, Netflix, and Spotify. The best implementation varies completely depending on whether the data is a real-time stream or a daily batch, and whether the metrics are based on a simple normal distribution or quantiles.
3 Steps to Start Right Now (It is effective to proceed in the order of 1→2→3):
- Change the comparison code to your own metric values and run it — By replacing `TRUE_EFFECT` and `TAU_SQ` in the simulation code above with your own domain's MDE, you can directly verify the power trade-off between mSPRT and GST. If you are more comfortable with statistical packages than Python, the R `MSPRT` package (`install.packages("MSPRT")`) supports the same kind of simulation.
- Run the Netflix public GitHub Gist — Apply the anytime-valid test code for count data to your actual experimental data and compare it with existing χ² test results.
- Diagnose your analysis environment — If the data is a real-time stream, prioritize mSPRT; if it is a daily batch and the experiment period can be fixed, prioritize GST.
Next Post: e-value and SAVI (Safe Anytime-Valid Inference) Framework — e-value, briefly mentioned in this post, actually solves a much broader range of problems than mSPRT. A summary of the latest theories extending anytime-valid inference to composite null hypotheses, regression adjustments, and multivariate testing.
Reference Materials
- Always Valid Inference: Continuous Monitoring of A/B Tests | Johari et al., arXiv 1512.04922 — The theoretical foundation of the Optimizely Stats Engine, the core paper for the mSPRT anytime-valid guarantee
- Sequential A/B Testing Keeps the World Streaming Part 1: Continuous Data | Netflix TechBlog — Details of Netflix’s implementation of anytime-valid confidence sequences for continuous data
- Sequential A/B Testing Keeps the World Streaming Part 2: Counting Processes | Netflix TechBlog — Anytime-valid test for Netflix's count data, includes GitHub Gist link
- Choosing a Sequential Testing Framework — Comparisons and Discussions | Spotify Engineering — Why Spotify Chose GST Instead of mSPRT and the Simulation Basis
- Bringing Sequential Testing to Experiments with Longitudinal Data (Part 2) | Spotify Engineering — Spotify’s Sequential Testing Extension Framework for Longitudinal Data
- Stats Accelerator Overview | Optimizely Support — Optimizely Stats Accelerator (Dynamic Traffic Allocation) Official Documentation
- Optimizely Stats Engine Whitepaper | Optimizely — Original documentation for the mathematical implementation of the Stats Engine
- Sequential Test for Practical Significance: Truncated mSPRT | arXiv 2509.07892 — A study on truncated mSPRT with explicitly integrated MDE (2025)
- Anytime-Valid Inference For Multinomial Count Data | Netflix Research — Netflix's anytime-valid inference for multinomial count data (presented at NeurIPS)
- Mastering the mSPRT for A/B Testing | HackerNoon — mSPRT Practical Application Tutorial
- Comparison of Statistical Power: SPRT, AGILE, Always Valid Inference | Analytics-Toolkit.com — Quantitative Comparative Analysis of Statistical Power Among Sequential Test Methodologies
- Sequential Testing for Statistical Inference | Amplitude Experiment Docs — Reference document for Amplitude's mSPRT-based implementation
- Sequential Tests | Spotify Confidence Documentation — Official documentation for GST-based sequential tests on the Spotify Confidence platform
- E-values | Wikipedia — Explanation of the e-value concept and its theoretical connection to mSPRT