Bayes Factor vs. E-value (Safety Test): Complete Analysis of Convergence Conditions and Practical Selection Guide for Safe Testing
If you run an A/B test for the first time with safestats and find that the e-value and Bayes factor print nearly identical numbers, that's not a coincidence. The two frameworks are mathematically equivalent under certain conditions, and understanding those conditions makes it clear which one to use in a given situation.
Statistical testing has long relied on the p-value (the probability of observing a result at least as extreme as the current one, given that the null hypothesis is true) as the standard, but its limitations become apparent in environments that require checking results mid-experiment — such as running A/B testing platforms, comparing ML models, or conducting interim analyses in clinical trials. Two alternatives that fill this gap are the Bayes Factor and the E-value (e-statistic, safe test). With the publication of the Safe Testing paper by Grünwald et al. in JRSS-B (2024–2025) and the release of the E-values textbook by Ramdas and Wang, the relationship between the two frameworks has been laid out with greater precision.
Under a simple null hypothesis, the Bayes factor itself becomes an e-variable; but under a composite null hypothesis and sequential experiments, only the e-value fully guarantees Type-I error control (the error of rejecting a true null hypothesis). This article pinpoints exactly where the two frameworks converge and where they diverge, and provides a practical decision guide that data scientists and ML engineers can apply immediately.
Core Concepts
Bayes Factor: The Evidence Ratio Between Two Hypotheses
The Bayes factor (BF) is the ratio of the marginal likelihoods (average likelihood integrated over the prior) of the observed data under two competing hypotheses H₀ and H₁.
$$BF_{10} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}$$
BF > 1 supports H₁; BF < 1 supports H₀. Unlike the p-value, its strength is that it can numerically express evidence in favor of the null hypothesis. The following interpretation scale introduced by Jeffreys (1935) remains widely used today.
| Jeffreys Scale | BF Range | Interpretation |
|---|---|---|
| Anecdotal | 1 ~ 3 | Barely worth mentioning |
| Substantial | 3 ~ 10 | Substantial evidence for H₁ |
| Strong | 10 ~ 100 | Strong to very strong evidence for H₁ |
| Decisive | > 100 | Decisive evidence for H₁ |
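The scale above is easy to encode as a small lookup helper. The function name and label strings below are ours, purely for illustration:

```python
def jeffreys_label(bf10: float) -> str:
    """Map a Bayes factor BF10 onto the Jeffreys evidence scale."""
    if bf10 > 100:
        return "decisive evidence for H1"
    if bf10 > 10:
        return "strong evidence for H1"
    if bf10 > 3:
        return "substantial evidence for H1"
    if bf10 > 1:
        return "barely worth mentioning"
    # BF10 <= 1: the data favour H0; read 1/BF10 on the same scale for H0
    return "favours H0 (interpret 1/BF10 on the same scale)"

print(jeffreys_label(42))  # strong evidence for H1
```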
Note: Computing a Bayes factor requires specifying a prior distribution (the assumed distribution of the parameter before observing data). Improper priors (whose density integrates to infinity rather than 1) cannot be used, and results vary depending on the choice of prior.
E-value (Safe Test): A Statistic as Betting Capital
An e-value is a non-negative random variable whose expected value under the null hypothesis H₀ is at most 1.
$$E \geq 0, \quad E_0[E] \leq 1$$
Intuitively, it can be understood as "betting capital starting at 1 unit in a world where the null hypothesis is true." If the null hypothesis is true, the wealth cannot be expected to grow, so by Ville's inequality Type-I error is controlled at or below α under any stopping rule.
$$P_0\!\left(E \geq \frac{1}{\alpha}\right) \leq \alpha$$
Mathematical background: Ville's inequality is an extension of Markov's inequality. Substituting t = 1/α into Markov's inequality P(X ≥ t) ≤ E[X]/t immediately yields P₀(E ≥ 1/α) ≤ E₀[E] · α ≤ α. The key upgrade is that while Markov holds only at a fixed time point, Ville guarantees P₀(E_T ≥ 1/α) ≤ α at any arbitrary stopping time T. This is what mathematically makes optional stopping safe.
This property is called optional stopping safety. Vovk & Wang (2021) and Grünwald, de Heide & Koolen (2024, JRSS-B) systematized this framework.
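The Markov-bound step can be checked numerically. The simulation below is our own sketch, not part of any library: it draws the one-step likelihood-ratio e-variable E = exp(δX − δ²/2) under H₀ and verifies E₀[E] ≈ 1 and P₀(E ≥ 1/α) ≤ α at a fixed time. The full Ville guarantee at arbitrary stopping times is strictly stronger than this fixed-time check.

```python
import numpy as np

# Under H0: X ~ N(0, 1), the likelihood ratio E = exp(delta*X - delta^2/2)
# (N(delta, 1) vs N(0, 1)) satisfies E0[E] = 1, so Markov's inequality
# with t = 1/alpha gives P0(E >= 1/alpha) <= alpha.
rng = np.random.default_rng(0)
delta, alpha, n_sim = 0.5, 0.05, 200_000
x = rng.normal(0.0, 1.0, size=n_sim)            # draws under H0
e = np.exp(delta * x - 0.5 * delta ** 2)        # one-step e-variable
print(f"Monte Carlo E0[E]:      {e.mean():.3f}")            # close to 1
print(f"Monte Carlo P0(E>=20):  {(e >= 1 / alpha).mean():.5f}")  # <= alpha
```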
GRO — The Connecting Concept Between the Two Frameworks
GRO (Growth-Rate Optimality) is a practical criterion for designing e-values and the mathematical point at which Bayes factors and e-values intersect. It finds the e-variable that maximizes the log expected value under the alternative distribution Q.
$$\max_{E} \; E_Q[\log E] \quad \text{subject to} \quad E_0[E] \leq 1$$
For the t-test, the GRO-optimal e-variable exactly coincides with the Bayes factor using a Cauchy prior. Why the Cauchy distribution?
GRO optimization searches for a "betting strategy that grows capital quickly when the null hypothesis is wrong." To this end, the prior must not concentrate on a specific effect size but must assign sufficient probability mass to both large and small effect sizes. The Cauchy distribution, with much heavier tails than the normal distribution, fulfills precisely this role — the strategic judgment of not abandoning bets regardless of the effect size converges to the Cauchy prior.
Important limitation: The GRO e-variable and the Bayes factor coincide only when the prior satisfies the GRO-optimal condition. Treating a Bayes factor computed with an arbitrary prior as an e-value is an error.
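For intuition about the growth criterion, consider the case where the true effect size is known. For the betting e-variable E = exp(b·X − b²/2) under Q = N(μ_Q, 1), the growth rate is E_Q[log E] = b·μ_Q − b²/2, which is maximized exactly at b = μ_Q. The analytic sketch below (our own, not from a library) shows this; when μ_Q is unknown, GRO instead maximizes growth averaged over a prior on the effect size, which for the t-test resolves to the heavy-tailed Cauchy prior described above.

```python
import numpy as np

# Growth rate E_Q[log E] of the betting e-variable E = exp(b*X - b^2/2)
# when the data actually follow Q = N(mu_q, 1). Since log E = b*X - b^2/2,
# E_Q[log E] = b*mu_q - b^2/2 exactly, maximized at b = mu_q.
mu_q = 0.4
bets = np.linspace(0.0, 1.0, 101)           # candidate betting parameters b
growth = bets * mu_q - 0.5 * bets ** 2      # exact expected log growth
best_bet = bets[np.argmax(growth)]
print(f"growth-maximizing bet: {best_bet:.2f} (true effect: {mu_q})")
```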
Conditions That Determine Convergence and Divergence
| Condition | Bayes Factor | E-value |
|---|---|---|
| Simple null hypothesis (H₀: θ = θ₀) | E₀[BF] = 1 → e-variable condition satisfied | Same result can be derived |
| Composite null hypothesis (H₀: θ ≤ 0) | E₀[BF] ≠ 1 → Type-I error not guaranteed | Can be handled with REGROW, etc. |
| When GRO-optimal prior is used | BF = GRO e-variable → exact match | Same |
| When a general prior is used | BF ≠ e-variable | Separate |
| Sequential experiments / optional stopping | Type-I error generally not guaranteed | Always guaranteed |
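The first row of the table can be checked numerically: under a simple null, any Bayes factor built from a proper prior satisfies E₀[BF] = 1 (swap the order of integration over data and prior), so it is itself an e-variable. A minimal sketch of our own, using a normal prior so that the H₁ marginal is available in closed form:

```python
import numpy as np
from scipy import stats

# Simple null H0: X ~ N(0, 1). With the proper prior delta ~ N(0, 0.5),
# the marginal of X under H1 is N(0, 1.5) in closed form, so
# BF(x) = N(x; 0, 1.5) / N(x; 0, 1). Fubini gives E0[BF] = 1 exactly.
rng = np.random.default_rng(123)
x = rng.normal(0.0, 1.0, size=200_000)          # data drawn under H0
bf = stats.norm.pdf(x, 0.0, np.sqrt(1.5)) / stats.norm.pdf(x, 0.0, 1.0)
print(f"Monte Carlo E0[BF]: {bf.mean():.3f}")   # close to 1
```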
Practical Application
Example 1: Online A/B Testing — E-value in Sequential Experiments
A scenario where you need to check results mid-experiment and decide whether to stop early. The classical t-test requires the sample size to be fixed in advance, but an e-value-based safe test guarantees the error rate no matter when you stop.
```r
# R - using the safestats package
library(safestats)

# Generate reproducible example data (replace with real observations in practice)
set.seed(42)
obs_data <- rnorm(50, mean = 0.3, sd = 1.0)

# Design a safe test for a one-sided z-test
# delta: minimum effect size to detect (Cohen's d)
design <- designSafeZ(
  alternative = "greater",
  alpha = 0.05,
  beta = 0.20,
  delta = 0.3
)

# Update the e-process as data accumulate
e_process <- safeZTest(x = obs_data, design = design)

# Rejection is allowed at any time once the threshold is reached (alpha guaranteed)
threshold <- 1 / 0.05  # = 20
if (e_process$eValue >= threshold) {
  cat(sprintf("e-value = %.2f >= %.0f → null hypothesis can be rejected\n",
              e_process$eValue, threshold))
} else {
  cat(sprintf("e-value = %.2f < %.0f → not yet sufficient evidence\n",
              e_process$eValue, threshold))
}
```

| Code Element | Role |
|---|---|
| `designSafeZ()` | Designs the optimal e-variable based on the GRO criterion |
| `delta = 0.3` | Specifies the minimum effect size to detect |
| `safeZTest()` | Computes the e-process on accumulated data |
| `threshold = 1/alpha` | Rejection threshold per Ville's inequality |
e-process: The product of e-values accumulated over time. If the likelihood ratios at each time point are independent, the product is also an e-variable, making sequential updates naturally valid.
For cases where a direct Python implementation is needed, the normal distribution e-process can be written as follows.
```python
import numpy as np

def compute_e_process(observations: list[float], delta_alt: float) -> float:
    """
    Simple one-sided e-process (normal distribution, variance assumed known, = 1).

    E_t = ∏ exp(δ·xᵢ - δ²/2)   [likelihood ratio: N(δ, 1) vs N(0, 1)]

    Limitations (practical caveats):
    - Exact only when the variance is known; with unknown variance an
      anytime-valid t-test implementation is needed
    - Batch updates and adaptive designs require an extended implementation
    - One-sided only (δ > 0); for two-sided tests a λ-mixture approach
      is recommended

    Args:
        observations: sequentially collected observations
        delta_alt: effect size under the alternative (H₁: μ = delta_alt)
    """
    if len(observations) == 0:
        raise ValueError("The list of observations is empty.")
    e_process = 1.0
    for x in observations:
        likelihood_ratio = np.exp(delta_alt * x - 0.5 * delta_alt ** 2)
        e_process *= likelihood_ratio
    return e_process

# Usage example
data_stream = [0.4, 0.3, 0.5, 0.2, 0.6, 0.7, 0.1, 0.8]
alpha = 0.05
threshold = 1 / alpha  # = 20
e_val = compute_e_process(data_stream, delta_alt=0.3)
print(f"e-value: {e_val:.4f}")
if e_val >= threshold:
    print(f"e-value {e_val:.2f} >= {threshold:.0f} → can reject")
else:
    print(f"e-value {e_val:.2f} < {threshold:.0f} → more data needed")
```

Example 2: Bayesian Model Comparison — Scenarios Where the Bayes Factor Is Appropriate
This is the case where the sample size is determined in advance and domain knowledge can be encoded as a prior. When comparing the relative plausibility of two models, the Bayes factor provides an intuitive interpretation.
```python
import numpy as np
from scipy import integrate, stats

def bayes_factor_ttest_jzs(
    data: np.ndarray,
    mu0: float = 0.0,
    r: float = np.sqrt(2) / 2,
) -> float:
    """
    One-sample t-test Bayes factor (JZS Cauchy prior, Rouder et al. 2009).

    BF > 10: strong evidence, BF > 100: decisive evidence (Jeffreys scale)

    Limitations:
    - Assumes a normal population (unsuitable for non-normal data)
    - Limited to the univariate one-sample test
    - For high-dimensional or repeated-measures designs, R's BayesFactor
      package is recommended

    Args:
        data: observed data (1-D array)
        mu0: reference mean under the null hypothesis
        r: scale parameter of the Cauchy prior (default sqrt(2)/2 ≈ "medium")
    """
    if len(data) == 0:
        raise ValueError("The data array is empty.")
    n = len(data)
    t_stat = (np.mean(data) - mu0) / (np.std(data, ddof=1) / np.sqrt(n))

    # Marginal likelihood under H₁: numerical integration over the
    # Cauchy(0, r) prior on the standardized effect size δ
    def integrand(delta: float) -> float:
        ncp = delta * np.sqrt(n)  # noncentrality parameter
        log_lik = stats.nct.logpdf(t_stat, df=n - 1, nc=ncp)
        log_prior = stats.cauchy.logpdf(delta, loc=0, scale=r)
        return np.exp(log_lik + log_prior)

    marginal_H1, _ = integrate.quad(integrand, -np.inf, np.inf, limit=200)

    # Likelihood under H₀: central t distribution (noncentrality = 0)
    marginal_H0 = stats.t.pdf(t_stat, df=n - 1)
    if marginal_H0 == 0:
        raise ValueError("The likelihood under H₀ is 0. Check the data and the null hypothesis.")
    return marginal_H1 / marginal_H0

# Usage example
np.random.seed(42)
sample_data = np.random.normal(loc=0.5, scale=1.0, size=30)
bf = bayes_factor_ttest_jzs(sample_data)
print(f"Bayes factor BF₁₀ = {bf:.2f}")
if bf > 100:
    print("Decisive evidence: strongly supports H₁")
elif bf > 10:
    print("Strong evidence: supports H₁")
elif bf > 3:
    print("Substantial evidence: supports H₁")
elif bf > 1 / 3:
    print("Inconclusive evidence")
else:
    print("Supports H₀")
```

Note for R users: Dedicated BF libraries in Python are not yet mature. In an R environment, `BayesFactor::ttestBF(x = data, mu = 0, rscale = "medium")` is a more stable choice.
Pros and Cons
Advantages
| Item | Bayes Factor | E-value |
|---|---|---|
| Quantifying support for null | Evidence favoring H₀ can be expressed directly as a number | Difficult to quantify directly |
| Leveraging prior knowledge | Domain knowledge can be encoded as a prior | No prior needed; frequentist approach possible |
| Model comparison | Intuitive comparison of evidence ratios across multiple models | Primarily suited for binary testing |
| Type-I error control | Automatically satisfied with simple null + GRO prior | Always satisfied (Ville's inequality) |
| Combinability | Multiplicative combination valid only with the same prior | Product of independent e-values is an e-value → free combination |
| Optional stopping | Not guaranteed (sample size must be fixed in advance) | Fully guaranteed |
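The combinability row is easy to demonstrate: e-values from independent experiments multiply into a single valid e-value. The small simulation below is our own sketch, with both experiments drawn under H₀:

```python
import numpy as np

# "Free combination": e-values from two independent experiments multiply
# into a single e-value, since E0[E1 * E2] = E0[E1] * E0[E2] <= 1.
rng = np.random.default_rng(7)
n_sim, d1, d2 = 200_000, 0.3, 0.6
x1 = rng.normal(0.0, 1.0, size=n_sim)        # experiment 1 under H0
x2 = rng.normal(0.0, 1.0, size=n_sim)        # experiment 2 under H0
e1 = np.exp(d1 * x1 - 0.5 * d1 ** 2)         # e-variable, experiment 1
e2 = np.exp(d2 * x2 - 0.5 * d2 ** 2)         # e-variable, experiment 2
combined = e1 * e2
print(f"Monte Carlo E0[E1 * E2]: {combined.mean():.3f}")  # close to 1
```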
Disadvantages and Caveats
| Item | Bayes Factor | E-value | Mitigation |
|---|---|---|---|
| Optional stopping | Type-I error not guaranteed | Fully guaranteed | Fix sample size in advance when using BF |
| Composite null hypothesis | E₀[BF] ≠ 1, error rate uncontrolled | Can be handled with REGROW | Use Rescaled BF (BF/μ*) |
| Computational complexity | Requires high-dimensional marginal likelihood integration | Relatively simple | Use PyMC Bridge Sampling, JASP |
| Evidence for null hypothesis | Can be quantified | Difficult to quantify directly | Consider using BF alongside when needed |
| Statistical power | Power of the original BF is retained (when its guarantees apply) | Rescaled BF has lower power than the original BF | Use a GRO-optimal design (designSafeZ) |
| Prior distribution | Required; improper priors cannot be used | Not required | — |
Rescaled Bayes Factor: Converting to the form `BF / μ*` satisfies the e-variable condition even under a composite null hypothesis. However, if μ* > 1, the result is lower than the original BF, reducing statistical power.
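As a toy illustration of the rescaling, consider a construction of our own: a two-point composite null {−0.5, 0} and a symmetric two-point prior, chosen so that every expectation is available in closed form via E_θ[exp(bX − b²/2)] = exp(bθ) for X ~ N(θ, 1).

```python
import numpy as np

# Toy rescaled-BF sketch. BF(x) = 0.5 * (LR(+d)(x) + LR(-d)(x)) with
# LR(b)(x) = exp(b*x - b^2/2). Under theta = -0.5 (inside the null),
# E_theta[BF] = cosh(d*theta... computed below) exceeds 1, so the raw BF
# is not an e-variable there. Dividing by mu* = max over the null restores it.
d = 0.5
null_thetas = [-0.5, 0.0]
mean_bf = {t: 0.5 * (np.exp(d * t) + np.exp(-d * t)) for t in null_thetas}
mu_star = max(mean_bf.values())          # sup of E_theta[BF] over the null
rescaled_max = max(v / mu_star for v in mean_bf.values())
print(f"E_theta[BF] over the null: {mean_bf}")   # exceeds 1 at theta = -0.5
print(f"mu* = {mu_star:.4f}; max E_theta[BF/mu*] = {rescaled_max:.4f}")  # <= 1
```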
FBST e-value warning: The ev from the Full Bayesian Significance Test (FBST) has a similar name but is an entirely different concept from the e-value discussed in this article. It has no Type-I error control guarantee, and conflating the two can lead to serious errors.
The Most Common Mistakes in Practice
- Using the Bayes factor directly under a composite null hypothesis: Under a composite null such as H₀: θ ≤ 0, the ordinary Bayes factor does not control Type-I error. Switching to a Rescaled BF (`BF/μ*`) or an e-value-based test is recommended.
- Repeatedly checking the Bayes factor without fixing the sample size: The Bayes factor is generally not safe under optional stopping either. If sequential monitoring is required, using the safe test from the `safestats` package is recommended.
- Over-generalizing the GRO equivalence condition: The GRO e-variable and the Bayes factor coincide only when the prior satisfies the GRO-optimal condition. Treating a Bayes factor computed with an arbitrary prior as an e-value is an error.
Closing Thoughts
The practical decision criteria can be summarized on one page as follows.
```
Need to check results mid-experiment? (A/B test, interim analysis)
├─ Yes → E-value [safestats::designSafeZ → safeZTest]
└─ No
   └─ Is there a domain prior, and is the null simple (θ = θ₀)?
      ├─ Yes → Bayes factor [BayesFactor::ttestBF]
      └─ No (composite null, no prior)
         └─ E-value or Rescaled Bayes Factor
```

Before reading this article, it may have been unclear why the Bayes factor and the e-value sometimes produce the same number. The answer lies in GRO optimization, and the reason the GRO-optimal prior resolves to the Cauchy distribution is that its heavy tails represent a strategy of never abandoning bets regardless of the effect size. The two frameworks are not competing alternatives but tools with different guarantees, and the difference in those guarantees is the entire basis for choosing between them.
Three steps you can take right now:
- You can run the safestats package yourself: In R, after `install.packages("safestats")`, run the R code block above as-is to watch the e-process reach the threshold of 20. `vignette("safestats-vignette")` contains complete sequential examples for z-tests and t-tests that you can reference right away.
- You can run the Python code in Google Colab: Paste the `compute_e_process()` and `bayes_factor_ttest_jzs()` code above into Colab cells, generate simulation data with `np.random.normal()`, and compare the two values side by side; the GRO convergence condition will become intuitively clear.
- ★ You can start with one essential read: The two papers marked with a star (★) in the references below provide the most systematic treatment of the mathematical foundations covered in this article. Reading one of the two first is recommended.
Next article: How integrating e-values into Conformal Prediction changes uncertainty estimation in ML models — covering application methods in batch and sequential settings.
References
★ Essential
- Grünwald, de Heide, Koolen — Safe Testing | JRSS-B 2024
- Ramdas & Wang — Hypothesis Testing with E-values (2025 textbook)
Further Reading
- Rescaled Bayes Factors: A Class of E-variables | arXiv 2024
- E-values as Statistical Evidence: A Comparison to Bayes Factors, Likelihoods, and p-values | arXiv 2025
- Bayes, E-values, and Testing | arXiv 2026
- Game-Theoretic Statistics and Safe Anytime-Valid Inference | Statistical Science 2023
- Test Martingales, Bayes Factors and p-Values — Shafer, Vovk | Statistical Science
- safestats R package | CRAN
- Safe Flexible Hypothesis Tests vignette
- E-Values Expand the Scope of Conformal Prediction | arXiv 2025
- Anytime-Valid FDR Control with Stopped e-BH Procedure | ScienceDirect 2025
- A Critical Evaluation of the FBST ev for Bayesian Hypothesis Testing | Springer