Safely Stopping A/B Tests — A Complete Implementation Guide to Confidence Sequences and E-process Sequential Testing
Anyone running an A/B test falls into this temptation at least once: "The interim results look promising; shouldn't we stop now?" In classical statistics, this behavior is called p-hacking—the act of repeatedly checking data at multiple points in time and stopping the analysis when a desired result is obtained. Doing so significantly amplifies type-I error (the error of rejecting the null hypothesis when it is actually true, i.e., false positives). Since the statistical validity of a classical confidence interval is guaranteed only for a single, fixed sample size, that guarantee erodes slightly every time the data is examined.
Confidence Sequence (CS) and E-process fundamentally solve this problem — because they mathematically guarantee a valid confidence interval simultaneously at every moment data is accumulated, the type-I error remains below α regardless of when the experiment is stopped. By the time you finish reading this article, you will be able to implement a tool in Python yourself that allows you to examine A/B tests at any time and stop them as desired without p-hacking. Large experiment platforms like Spotify and Eppo have already adopted this framework as a standard testing tool (Spotify Case, Eppo Case).
Key Concepts
What is a Confidence Sequence
Classical confidence intervals are valid only for a sample size n fixed in advance. For a parameter θ, P(θ ∈ C_n) ≥ 1-α is guaranteed, but intervals calculated at any point other than n do not carry the same guarantee.
Confidence Sequence removes this constraint. The formal definition:

P(∀t ≥ 1 : θ ∈ C_t) ≥ 1 − α

It is a sequence of intervals {C_t}_{t≥1} that guarantees coverage simultaneously at all time points, rather than at a single point. The act of looking at the data does not, by itself, compromise statistical validity.
Anytime-valid: the property that C_τ remains a valid confidence interval even when the experiment is terminated at an arbitrary stopping time τ — that is, at any moment one decides to "stop now" while watching the data. Classical CIs do not have this property.
Next, let's look at the engine that actually produces a CS.
E-process — Engine that creates CS
E-value is a random variable whose expected value is at most 1 under the null hypothesis H₀. Simply put, it means "when the null hypothesis is true, the mean of this value does not exceed 1."
E[E] ≤ 1 (under the null hypothesis H₀)

As a sequential-analysis-friendly alternative to p-values, the key difference is that e-values can be combined by multiplication: multiplying the e-values of several independent experiments yields a valid combined e-value.
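This multiplicative combination rule is the practical payoff of the definition. Here is a minimal sketch; the three e-values are hypothetical numbers chosen for illustration, not outputs of any real experiment.

```python
# Hypothetical e-values from three independent experiments.
# Under H0 each has expectation at most 1, and by independence the
# expectation of the product is the product of the expectations,
# so the product is again a valid e-value.
e_values = [2.5, 1.8, 3.1]

combined = 1.0
for e in e_values:
    combined *= e
print(f"combined e-value: {combined:.2f}")  # 13.95

# By Markov's inequality, rejecting H0 when an e-value reaches 1/alpha
# keeps the type-I error at alpha.
alpha = 0.05
print("reject H0:", combined >= 1 / alpha)  # False: 13.95 < 20
```

Note that each experiment's evidence compounds: no single e-value here reaches the threshold 1/α = 20, but a fourth independent experiment with e-value ≥ 1.44 would push the product over it.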
E-process {E_t} is the sequential version of the e-value: E_τ remains a valid e-value at any stopping time τ. From a game-theoretic perspective, the e-process is interpreted as the wealth process of a betting strategy: if a bettor who wagers against the null hypothesis at every time step keeps growing their wealth, that growth is grounds for rejecting the null hypothesis.
E_0 = 1  // initial capital
for t in 1..T:
    // λ_t: betting fraction, X_t: new observation, μ_0: null-hypothesis mean
    E_t = E_{t-1} × (1 + λ_t × (X_t - μ_0))

If the null hypothesis is false, the E-process grows exponentially; if it is true, it is controlled as a supermartingale (a stochastic process whose expected value never exceeds the current value, i.e., one that does not increase on average).
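To see both regimes of the pseudocode, here is a small simulation sketch (λ = 0.2, Gaussian data, and the sample sizes are illustrative choices): under H₀ the average final wealth stays near 1, while under an alternative the typical wealth grows.

```python
import numpy as np

rng = np.random.default_rng(42)
lam, mu_0, T, n_runs = 0.2, 0.0, 50, 2000

def final_wealth(mu_true: float) -> np.ndarray:
    """Final value E_T of the betting process E_t = Π(1 + λ(X_t - μ_0))
    over n_runs independent data streams drawn from N(mu_true, 1)."""
    X = rng.normal(mu_true, 1.0, size=(n_runs, T))
    # Note: with unbounded Gaussians and λ = 0.2, a betting factor only
    # goes negative for X < -5, which is vanishingly rare at these sizes.
    return np.prod(1 + lam * (X - mu_0), axis=1)

# Under H0 (μ = 0) each factor has expectation exactly 1, so E[E_T] = 1:
print("mean E_T under H0 :", round(final_wealth(0.0).mean(), 2))
# Under the alternative (μ = 0.3) the typical final wealth is large:
print("median E_T under H1:", round(np.median(final_wealth(0.3)), 2))
```

The mean is the right summary under H₀ (the martingale property constrains the mean, not the median), while the median is the honest summary under H₁, since the wealth distribution is heavily right-skewed.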
Thanks to this mathematical structure, CS can be constructed as follows.
Ville's Inequality — Theoretical Foundation
The coverage guarantee of CS comes from Ville's inequality:
P(∃t ≥ 1 : M_t ≥ 1/α) ≤ α · M_0

If {M_t} is a non-negative supermartingale, the probability that it ever exceeds the threshold 1/α is controlled by α (for M_0 = 1). Since the E-process has this supermartingale structure, a CS can be constructed by inversion:
C_t = { θ : E_t(θ) < 1/α }

That is, the CS is the set of all θ for which the e-process testing the null hypothesis "θ is the true value" stays below the critical value. When that e-process reaches 1/α, the corresponding θ is excluded from the interval, which amounts to a statistical rejection.
Classic CI vs Confidence Sequence — Convergence Speed Comparison
| Characteristic | Classical Confidence Interval | Confidence Sequence |
|---|---|---|
| Convergence rate | O(1/√n) | O(√(log log t / t)) |
| Valid at | Fixed n only | All t ≥ 1 simultaneously |
| Data peeking | Prohibited (inflates type-I error) | Allowed (no penalty) |
| Distribution assumptions | May depend on normal approximation | Non-parametric, non-asymptotic possible ※ |
| Interval width | Narrow | Wider by a log log t factor |
※ Although nonparametric, non-asymptotic guarantees are theoretically available, the code examples in this article assume a normal distribution to keep the implementation simple. For genuinely nonparametric settings, use the Hoeffding- or Bernstein-style boundaries from the confseq library.
The O(√(log log t / t)) convergence rate of a CS matches the lower bound from the Law of the Iterated Logarithm (LIL) and is therefore rate-optimal (Howard et al., 2021). The wider interval relative to classical methods is the information-theoretic cost of the stronger guarantee, and it is theoretically unavoidable.
Practical Application
Example 1: Direct Implementation of E-process-based Confidence Sequence
Let's calculate the classic CI and CS side by side and compare them visually.
Performance note: the `confidence_sequence_mean` function below recomputes the e-process for all 1,000 grid points at every time step t, which can take several minutes at T=500. To get results quickly, reduce to T=100 or evaluate only every 10th step with a `t % 10 == 0` condition.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
def eprocess_mean(data: np.ndarray, mu_0: float, lambda_val: float = 0.5) -> np.ndarray:
    """Compute the E-process for H₀: μ = mu_0 with a fixed betting fraction λ.

    Note: lambda_val must satisfy |λ| ≤ 1/max|X - μ_0| for the
    non-negativity and martingale property of the E-process to be
    mathematically guaranteed. Outside this range the E-process can
    go negative and the statistical guarantee breaks.
    """
    E = 1.0
    e_values = [E]
    for x in data:
        E *= (1 + lambda_val * (x - mu_0))
        # Clipping with max(E, ...) is not theoretically rigorous:
        # if E gets stuck near 0, the martingale property is lost.
        # The correct fix is to restrict λ to the valid range up front.
        E = max(E, 1e-10)
        e_values.append(E)
    return np.array(e_values)
def confidence_sequence_mean(
    data: np.ndarray, sigma: float, alpha: float = 0.05, lambda_val: float = 0.5
) -> tuple[np.ndarray, np.ndarray]:
    """Approximate C_t = { θ : E_t(θ) < 1/α } by grid search.

    The grid spans ±3*sigma so it adapts safely to data with a
    different variance.
    """
    cs_lower, cs_upper = [], []
    threshold = 1 / alpha
    for t in range(1, len(data) + 1):
        subset = data[:t]
        mu_hat = subset.mean()
        # Tie the grid to the data scale (±3*sigma) instead of a fixed ±3
        grid = np.linspace(mu_hat - 3 * sigma, mu_hat + 3 * sigma, 1000)
        valid = [
            mu for mu in grid
            if eprocess_mean(subset, mu, lambda_val)[-1] < threshold
        ]
        if valid:
            cs_lower.append(min(valid))
            cs_upper.append(max(valid))
        else:
            cs_lower.append(np.nan)
            cs_upper.append(np.nan)
    return np.array(cs_lower), np.array(cs_upper)
def classic_ci(data: np.ndarray, sigma: float, alpha: float = 0.05):
    """Classical confidence interval computed independently at each t.

    Its type-I error guarantee is void under data peeking.
    """
    z = stats.norm.ppf(1 - alpha / 2)
    lower, upper = [], []
    for t in range(1, len(data) + 1):
        mu_hat = data[:t].mean()
        margin = z * sigma / np.sqrt(t)
        lower.append(mu_hat - margin)
        upper.append(mu_hat + margin)
    return np.array(lower), np.array(upper)
# --- Experiment (capped at T=100 for speed) ---
np.random.seed(42)
mu_true = 0.5
sigma = 1.0
alpha = 0.05
T = 100  # 100 recommended for performance; 500+ takes several minutes
data = np.random.normal(mu_true, sigma, T)
cs_lo, cs_hi = confidence_sequence_mean(data, sigma, alpha)
ci_lo, ci_hi = classic_ci(data, sigma, alpha)
ts = np.arange(1, T + 1)
plt.figure(figsize=(12, 5))
plt.fill_between(ts, cs_lo, cs_hi, alpha=0.3, label="Confidence Sequence")
plt.fill_between(ts, ci_lo, ci_hi, alpha=0.3, label="Classic CI (independent at each t)")
plt.axhline(mu_true, color="red", linestyle="--", label=f"True μ = {mu_true}")
plt.xlabel("Sample size t")
plt.ylabel("Interval")
plt.legend()
plt.title("Confidence Sequence vs Classic CI")
plt.tight_layout()
plt.show()

| Key Code Point | Explanation |
|---|---|
| eprocess_mean | Cumulative e-process with a fixed-λ betting strategy |
| λ range limit | Must satisfy \|λ\| ≤ 1/max\|X - μ_0\| for the non-negativity guarantee |
| Grid range ±3σ | Ties the search range to the data scale so the grid stays adequate |
| classic_ci | Independent calculation at each time point; its guarantees are void under data peeking |
Example 2: Quantitative Comparison of Interval Widths
Here we compare widths quickly using an LIL-based closed-form approximation instead of grid search. The CS width formula below follows the Hoeffding-style time-uniform bound in Theorem 2 of Howard et al. (2021) (Paper Link).
import numpy as np

sigma = 1.0
alpha = 0.05
T = 1000
ts = np.arange(10, T + 1)
# Classic CI width: 2 × z × σ / √t
z = 1.96
classic_widths = 2 * z * sigma / np.sqrt(ts)
# CS width approximation (Hoeffding bound, Howard et al. 2021, Theorem 2);
# the log(log(t)/α) term reflects the slow LIL growth
cs_widths = 2 * sigma * np.sqrt(2 * np.log(np.log(ts) / alpha) / ts)
ratio = cs_widths / classic_widths
for t in [10, 50, 100, 500, 1000]:
    idx = t - 10
    print(
        f"t={t:4d} | Classic: {classic_widths[idx]:.4f} | "
        f"CS: {cs_widths[idx]:.4f} | ratio: {ratio[idx]:.3f}x"
    )

Execution result:
t=  10 | Classic: 1.2396 | CS: 1.7504 | ratio: 1.412x
t=  50 | Classic: 0.5544 | CS: 0.8352 | ratio: 1.507x
t= 100 | Classic: 0.3920 | CS: 0.6015 | ratio: 1.535x
t= 500 | Classic: 0.1753 | CS: 0.2778 | ratio: 1.585x
t=1000 | Classic: 0.1240 | CS: 0.1986 | ratio: 1.602x

Key observation: the ratio of the two widths is √(2 log(log t / α))/z, which depends on t only through log log t, so it grows extremely slowly: from about 1.41× at t=10 to about 1.60× at t=1000. For realistic sample sizes, the CS remains roughly 40–60% wider than the classic CI.
Example 3: Real-time A/B Test Monitoring
import numpy as np
def run_ab_test_with_eprocess(
    stream_a: np.ndarray,
    stream_b: np.ndarray,
    alpha: float = 0.05,
    lambda_val: float = 0.3,
) -> dict:
    """Continuous A/B-test monitoring based on an E-process.

    Null hypothesis H₀: μ_A = μ_B (no effect).

    Simplification: this implementation treats the paired differences
    (X_a - X_b) as a single stream. In a real A/B test it is
    statistically more rigorous to maintain separate e-processes for
    A and B and combine them.
    """
    threshold = 1 / alpha
    E = 1.0
    history = []
    for t, (xa, xb) in enumerate(zip(stream_a, stream_b), start=1):
        diff = xa - xb  # observed effect difference
        mu_0 = 0.0      # null hypothesis: difference = 0
        E *= (1 + lambda_val * (diff - mu_0))
        E = max(E, 1e-10)
        history.append(E)
        if E >= threshold:
            return {
                "stopped": True,
                "t": t,
                "e_value": E,
                "conclusion": f"Early stop at t={t} — effect detected (E={E:.2f})",
            }
    return {
        "stopped": False,
        "t": len(stream_a),
        "e_value": E,
        "conclusion": "No effect detected (failed to reject H₀)",
    }
# Simulation: a real effect exists
np.random.seed(7)
n = 1000
stream_a = np.random.normal(0.55, 1.0, n)  # conversion metric, mean 0.55
stream_b = np.random.normal(0.50, 1.0, n)  # conversion metric, mean 0.50
result = run_ab_test_with_eprocess(stream_a, stream_b)
print(result["conclusion"])
# Example output: Early stop at t=312 — effect detected (E=21.43)

In a classic t-test, checking the same data daily inflates the type-I error from 5% to over 30%. The E-process approach keeps the error at 5% or below no matter when you stop.
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Arbitrary-stop guarantee | Type-I error stays at or below α regardless of when the experiment ends |
| Continuous monitoring | Checking the data at every time point does not impair statistical validity |
| Non-asymptotic, non-parametric | Exact coverage with finite samples, without normality assumptions |
| Combinability | e-values combine by multiplication, so multiple experiments can be pooled sequentially |
| Game-theoretic interpretation | The betting view gives intuition and pairs naturally with online algorithms |
Disadvantages and Precautions
| Item | Description | Mitigation |
|---|---|---|
| Interval-width penalty | Up to nearly twice as wide as the classic CI at the same t | The width factor grows only like √(log log t); understand and accept the trade-off |
| Choice of betting strategy (λ) | CS quality varies significantly with the choice of λ | Use an adaptive ONS (Online Newton Step) or a mixture strategy |
| Computational cost | Grid-search CS inversion is expensive in high dimensions | Use the confseq library; prefer distributions with analytic solutions |
| Early samples | CS is extremely wide for t < 30, making practical judgment difficult | Set a minimum-observation (burn-in) period before monitoring |
| Library maturity | confseq and expectation are still pre-stable | Verify core logic yourself; unit tests are mandatory |
ONS (Online Newton Step): A type of adaptive betting strategy that optimizes the betting ratio λ at each step based on previous observations. Compared to a fixed λ, the e-process grows faster and is less sensitive to incorrect λ selection.
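As a concrete illustration of ONS, here is a sketch of an ONS-style betting update for [0, 1]-bounded observations, loosely following the scheme used in the betting-CS literature; the step-size constant 2/(2 − log 3) and the halved betting box are choices taken from that literature, not the only valid options.

```python
import numpy as np

def ons_eprocess(x: np.ndarray, mu_0: float) -> np.ndarray:
    """E-process for H0: mean = mu_0 on [0, 1]-bounded data, with the
    betting fraction adapted by an ONS-style gradient step.
    A sketch, not a substitute for a vetted library implementation."""
    e_values = np.empty(len(x))
    E, lam, a = 1.0, 0.0, 1.0
    # Keep lam inside a box where 1 + lam*(x - mu_0) stays positive
    # for every x in [0, 1]; shrink the box by 1/2 for safety.
    lo = -0.5 / (1 - mu_0 + 1e-12)
    hi = 0.5 / (mu_0 + 1e-12)
    for i, xt in enumerate(x):
        g = xt - mu_0
        E *= 1 + lam * g          # place the bet, update the wealth
        e_values[i] = E
        z = g / (1 + lam * g)     # gradient of log-wealth w.r.t. lam
        a += z * z                # ONS curvature accumulator
        lam = float(np.clip(lam + 2 / (2 - np.log(3)) * z / a, lo, hi))
    return e_values

rng = np.random.default_rng(1)
x = rng.uniform(0.3, 0.9, 500)   # true mean 0.6
e = ons_eprocess(x, mu_0=0.5)    # H0: mean = 0.5 is false
print("final e-value:", e[-1])   # grows far beyond 1/alpha = 20
```

Because λ starts at 0 and adapts from the data itself, no a-priori guess about the effect size is needed; a badly chosen fixed λ is the failure mode this removes.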
Running Intersection: A technique that secures a monotone shrinking interval by taking C_1 ∩ C_2 ∩ … ∩ C_t. It practically resolves the phenomenon where the interval widens over time.
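The running intersection is a one-liner with cumulative max/min. A minimal sketch with made-up bounds (any valid CS bounds would do):

```python
import numpy as np

def running_intersection(lower: np.ndarray, upper: np.ndarray):
    """Replace C_t by C_1 ∩ ... ∩ C_t: the lower bound can only move
    up and the upper bound only down, so the interval never widens.
    This is valid precisely because a CS covers θ at all t at once."""
    return np.maximum.accumulate(lower), np.minimum.accumulate(upper)

# Made-up CS bounds that wobble from step to step
lower = np.array([0.10, 0.30, 0.20, 0.35, 0.30])
upper = np.array([0.90, 0.70, 0.80, 0.65, 0.70])
lo, hi = running_intersection(lower, upper)
print(lo)  # lower bounds 0.1, 0.3, 0.3, 0.35, 0.35 — never decreases
print(hi)  # upper bounds 0.9, 0.7, 0.7, 0.65, 0.65 — never increases
```

Note the same trick would be invalid for classical CIs: intersecting intervals that are each only marginally valid destroys coverage, whereas the CS's simultaneous guarantee makes the intersection free.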
The Most Common Mistakes in Practice
- Setting λ arbitrarily large: if λ is too large relative to the data range, `1 + λ(X - μ_0)` becomes negative, breaking the martingale property of the E-process. The range `|λ| ≤ 1/max|X - μ_0|` must be strictly respected. `max(E, 0)`-style clipping is only a stopgap, and an E-process stuck at 0 loses all statistical power.
- Interpreting a CS like a classical CI: a CS guarantees simultaneous coverage over all time points; it does not guarantee that the marginal coverage at each individual time point is `1-α`. Simultaneous validity and marginal validity are different concepts.
- Making decisions on early samples (`t < 30`): stopping the test in the region where the CS is extremely wide means the effective power is very low. Set a minimum burn-in period and begin monitoring only afterwards.
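For the first mistake, the safe range can be computed up front whenever the observations are bounded. A sketch (the `safe_lambda` helper and its 0.9 safety margin are illustrative choices, not a library function):

```python
import numpy as np

def safe_lambda(x_min: float, x_max: float, mu_0: float,
                margin: float = 0.9) -> float:
    """Largest betting fraction (times a safety margin) that keeps
    1 + lam*(x - mu_0) strictly positive for all x in [x_min, x_max].
    Assumes the data really are bounded by [x_min, x_max]."""
    max_dev = max(abs(x_min - mu_0), abs(x_max - mu_0))
    return margin / max_dev

# Conversion-style data in [0, 1] with H0 mean 0.5: max deviation is 0.5
lam = safe_lambda(0.0, 1.0, mu_0=0.5)
print(lam)  # 1.8

# Every betting factor stays positive across the whole data range:
xs = np.linspace(0.0, 1.0, 101)
print(bool(np.all(1 + lam * (xs - 0.5) > 0)))  # True
```

For unbounded data (e.g. the Gaussian examples above) no such hard bound exists, which is exactly why those examples need either a small λ or the adaptive/mixture strategies discussed earlier.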
In Conclusion
You now have the tools — Confidence Sequences and E-processes — to peek at A/B tests at any time and stop whenever you like without p-hacking. The intervals are wider, yes, but that width is close to the information-theoretic lower bound proven via the LIL: the price of the stronger guarantee. In a real-time experimentation environment, this trade-off is quite reasonable.
3 Steps to Start Right Now:
- `pip install confseq`, then run the examples in this article to apply a CS directly to your existing A/B test data.
- Adapt the `classic_widths` vs `cs_widths` comparison code to your own data's sample-size range and check the width gap numerically.
- In your next A/B test design, apply the adaptive strategy of `confseq.betting.betting_cs` instead of a fixed λ, and compare the early stopping point side by side with the classic t-test.
Next, we will cover the mSPRT (Mixture Sequential Probability Ratio Test): the mathematics of the sequential probability ratio test and its Python implementation.
Reference Materials
- Time-uniform, nonparametric, nonasymptotic confidence sequences | Annals of Statistics
- Game-Theoretic Statistics and Safe Anytime-Valid Inference | Statistical Science
- Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance | arXiv
- A composite generalization of Ville's martingale theorem using e-processes | arXiv
- Always Valid Inference: Continuous Monitoring of A/B Tests | Operations Research
- Confidence Estimation via Sequential Likelihood Mixing | arXiv
- CMU Lecture Notes: Confidence Sequences (Ramdas, 2018)
- Hypothesis Testing with E-values (textbook) | Ramdas & Wang
- confseq GitHub | Howard & Ramdas
- expectation GitHub | Rostami, 2024
- Spotify: Choosing a Sequential Testing Framework
- Eppo: Sequential Testing explainer