Bayes Factor vs. E-value (Safety Test): Complete Analysis of Convergence Conditions and Practical Selection Guide for Safe Testing
If you run an A/B test for the first time with safestats and find that the e-value and Bayes factor print nearly identical numbers, that's not a coincidence. The two frameworks are mathematically equivalent under certain conditions, and understanding those conditions makes it clear which one to use in a given situation.
Statistical testing has long relied on the p-value (the probability of observing a result at least as extreme as the current one, given that the null hypothesis is true) as the standard, but its limitations become apparent in environments that require checking results mid-experiment — such as running A/B testing platforms, comparing ML models, or conducting interim analyses in clinical trials. Two alternatives that fill this gap are the Bayes Factor and the E-value (e-statistic, safe test). With the publication of the Safe Testing paper by Grünwald et al. in JRSS-B (2024–2025) and the release of the E-values textbook by Ramdas and Wang, the relationship between the two frameworks has been laid out with greater precision.
Under a simple null hypothesis, the Bayes factor itself becomes an e-variable; but under a composite null hypothesis and sequential experiments, only the e-value fully guarantees Type-I error control (the error of rejecting a true null hypothesis). This article pinpoints exactly where the two frameworks converge and where they diverge, and provides a practical decision guide that data scientists and ML engineers can apply immediately.
Core Concepts
Bayes Factor: The Evidence Ratio Between Two Hypotheses
The Bayes factor (BF) is the ratio of the marginal likelihoods (average likelihood integrated over the prior) of the observed data under two competing hypotheses H₀ and H₁.
$$BF_{10} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}$$
BF > 1 supports H₁; BF < 1 supports H₀. Unlike the p-value, its strength is that it can numerically express evidence in favor of the null hypothesis. The following interpretation scale introduced by Jeffreys (1935) remains widely used today.
| Jeffreys Scale | BF Range | Interpretation |
|---|---|---|
| Anecdotal | 1 ~ 3 | Barely worth mentioning |
| Substantial | 3 ~ 10 | Substantial evidence for H₁ |
| Strong | 10 ~ 100 | Strong to very strong evidence for H₁ |
| Decisive | > 100 | Decisive evidence for H₁ |
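The scale above is easy to encode as a small lookup helper. The function name and label strings below are ours, purely for illustration:

```python
def jeffreys_label(bf10: float) -> str:
    """Map a Bayes factor BF10 onto the Jeffreys evidence scale."""
    if bf10 > 100:
        return "decisive evidence for H1"
    if bf10 > 10:
        return "strong evidence for H1"
    if bf10 > 3:
        return "substantial evidence for H1"
    if bf10 > 1:
        return "barely worth mentioning"
    # BF10 <= 1: the data favour H0; read 1/BF10 on the same scale for H0
    return "favours H0 (interpret 1/BF10 on the same scale)"

print(jeffreys_label(42))  # strong evidence for H1
```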
Note: Computing a Bayes factor requires specifying a prior distribution (the assumed distribution of the parameter before observing data). Improper priors (whose density integrates to infinity rather than 1) cannot be used, and results vary depending on the choice of prior.
E-value (Safe Test): A Statistic as Betting Capital
An e-value is a non-negative random variable whose expected value under the null hypothesis H₀ is at most 1.
$$E \geq 0, \quad E_0[E] \leq 1$$
Intuitively, it can be understood as "betting capital starting at 1 unit in a world where the null hypothesis is true." If the null hypothesis is true, the wealth cannot be expected to grow, so by Ville's inequality Type-I error is controlled at or below α under any stopping rule.
$$P_0\!\left(E \geq \frac{1}{\alpha}\right) \leq \alpha$$
Mathematical background: Ville's inequality is an extension of Markov's inequality. Substituting t = 1/α into Markov's inequality P(X ≥ t) ≤ E[X]/t immediately yields P₀(E ≥ 1/α) ≤ E₀[E] · α ≤ α. The key upgrade is that while Markov holds only at a fixed time point, Ville guarantees P₀(E_T ≥ 1/α) ≤ α at any arbitrary stopping time T. This is what mathematically makes optional stopping safe.
This property is called optional stopping safety. Vovk & Wang (2021) and Grünwald, de Heide & Koolen (2024, JRSS-B) systematized this framework.
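The Markov-bound step can be checked numerically. The simulation below is our own sketch, not part of any library: it draws the one-step likelihood-ratio e-variable E = exp(δX − δ²/2) under H₀ and verifies E₀[E] ≈ 1 and P₀(E ≥ 1/α) ≤ α at a fixed time. The full Ville guarantee at arbitrary stopping times is strictly stronger than this fixed-time check.

```python
import numpy as np

# Under H0: X ~ N(0, 1), the likelihood ratio E = exp(delta*X - delta^2/2)
# (N(delta, 1) vs N(0, 1)) satisfies E0[E] = 1, so Markov's inequality
# with t = 1/alpha gives P0(E >= 1/alpha) <= alpha.
rng = np.random.default_rng(0)
delta, alpha, n_sim = 0.5, 0.05, 200_000
x = rng.normal(0.0, 1.0, size=n_sim)            # draws under H0
e = np.exp(delta * x - 0.5 * delta ** 2)        # one-step e-variable
print(f"Monte Carlo E0[E]:      {e.mean():.3f}")            # close to 1
print(f"Monte Carlo P0(E>=20):  {(e >= 1 / alpha).mean():.5f}")  # <= alpha
```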
GRO — The Connecting Concept Between the Two Frameworks
GRO (Growth-Rate Optimality) is a practical criterion for designing e-values and the mathematical point at which Bayes factors and e-values intersect. It finds the e-variable that maximizes the log expected value under the alternative distribution Q.
$$\max_{E} \; E_Q[\log E] \quad \text{subject to} \quad E_0[E] \leq 1$$
For the t-test, the GRO-optimal e-variable exactly coincides with the Bayes factor using a Cauchy prior. Why the Cauchy distribution?
GRO optimization searches for a "betting strategy that grows capital quickly when the null hypothesis is wrong." To this end, the prior must not concentrate on a specific effect size but must assign sufficient probability mass to both large and small effect sizes. The Cauchy distribution, with much heavier tails than the normal distribution, fulfills precisely this role — the strategic judgment of not abandoning bets regardless of the effect size converges to the Cauchy prior.
Important limitation: The GRO e-variable and the Bayes factor coincide only when the prior satisfies the GRO-optimal condition. Treating a Bayes factor computed with an arbitrary prior as an e-value is an error.
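For intuition about the growth criterion, consider the case where the true effect size is known. For the betting e-variable E = exp(b·X − b²/2) under Q = N(μ_Q, 1), the growth rate is E_Q[log E] = b·μ_Q − b²/2, which is maximized exactly at b = μ_Q. The analytic sketch below (our own, not from a library) shows this; when μ_Q is unknown, GRO instead maximizes growth averaged over a prior on the effect size, which for the t-test resolves to the heavy-tailed Cauchy prior described above.

```python
import numpy as np

# Growth rate E_Q[log E] of the betting e-variable E = exp(b*X - b^2/2)
# when the data actually follow Q = N(mu_q, 1). Since log E = b*X - b^2/2,
# E_Q[log E] = b*mu_q - b^2/2 exactly, maximized at b = mu_q.
mu_q = 0.4
bets = np.linspace(0.0, 1.0, 101)           # candidate betting parameters b
growth = bets * mu_q - 0.5 * bets ** 2      # exact expected log growth
best_bet = bets[np.argmax(growth)]
print(f"growth-maximizing bet: {best_bet:.2f} (true effect: {mu_q})")
```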
Conditions That Determine Convergence and Divergence
| Condition | Bayes Factor | E-value |
|---|---|---|
| Simple null hypothesis (H₀: θ = θ₀) | E₀[BF] = 1 → e-variable condition satisfied | Same result can be derived |
| Composite null hypothesis (H₀: θ ≤ 0) | E₀[BF] ≠ 1 → Type-I error not guaranteed | Can be handled with REGROW, etc. |
| When GRO-optimal prior is used | BF = GRO e-variable → exact match | Same |
| When a general prior is used | BF ≠ e-variable | Separate |
| Sequential experiments / optional stopping | Type-I error generally not guaranteed | Always guaranteed |
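The first row of the table can be checked numerically: under a simple null, any Bayes factor built from a proper prior satisfies E₀[BF] = 1 (swap the order of integration over data and prior), so it is itself an e-variable. A minimal sketch of our own, using a normal prior so that the H₁ marginal is available in closed form:

```python
import numpy as np
from scipy import stats

# Simple null H0: X ~ N(0, 1). With the proper prior delta ~ N(0, 0.5),
# the marginal of X under H1 is N(0, 1.5) in closed form, so
# BF(x) = N(x; 0, 1.5) / N(x; 0, 1). Fubini gives E0[BF] = 1 exactly.
rng = np.random.default_rng(123)
x = rng.normal(0.0, 1.0, size=200_000)          # data drawn under H0
bf = stats.norm.pdf(x, 0.0, np.sqrt(1.5)) / stats.norm.pdf(x, 0.0, 1.0)
print(f"Monte Carlo E0[BF]: {bf.mean():.3f}")   # close to 1
```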
Practical Application
Example 1: Online A/B Testing — E-value in Sequential Experiments
A scenario where you need to check results mid-experiment and decide whether to stop early. The classical t-test requires the sample size to be fixed in advance, but an e-value-based safe test guarantees the error rate no matter when you stop.
```r
# R - using the safestats package
library(safestats)

# Generate reproducible example data (replace with real observations in practice)
set.seed(42)
obs_data <- rnorm(50, mean = 0.3, sd = 1.0)

# Design a safe test for a one-sided z-test
# delta: minimum effect size to detect (Cohen's d)
design <- designSafeZ(
  alternative = "greater",
  alpha = 0.05,
  beta = 0.20,
  delta = 0.3
)

# Update the e-process as data accumulate
e_process <- safeZTest(x = obs_data, design = design)

# Rejection is allowed at any time once the threshold is reached (alpha guaranteed)
threshold <- 1 / 0.05  # = 20
if (e_process$eValue >= threshold) {
  cat(sprintf("e-value = %.2f >= %.0f → null hypothesis can be rejected\n",
              e_process$eValue, threshold))
} else {
  cat(sprintf("e-value = %.2f < %.0f → not yet sufficient evidence\n",
              e_process$eValue, threshold))
}
```

| Code Element | Role |
|---|---|
| `designSafeZ()` | Designs the optimal e-variable based on the GRO criterion |
| `delta = 0.3` | Specifies the minimum effect size to detect |
| `safeZTest()` | Computes the e-process on accumulated data |
| `threshold = 1/alpha` | Rejection threshold per Ville's inequality |
e-process: The product of e-values accumulated over time. If the likelihood ratios at each time point are independent, the product is also an e-variable, making sequential updates naturally valid.
For cases where a direct Python implementation is needed, the normal distribution e-process can be written as follows.
```python
import numpy as np

def compute_e_process(observations: list[float], delta_alt: float) -> float:
    """
    Simple one-sided e-process (normal distribution, variance assumed known, = 1).

    E_t = ∏ exp(δ·xᵢ - δ²/2)   [likelihood ratio: N(δ, 1) vs N(0, 1)]

    Limitations (practical caveats):
    - Exact only when the variance is known; with unknown variance an
      anytime-valid t-test implementation is needed
    - Batch updates and adaptive designs require an extended implementation
    - One-sided only (δ > 0); for two-sided tests a λ-mixture approach
      is recommended

    Args:
        observations: sequentially collected observations
        delta_alt: effect size under the alternative (H₁: μ = delta_alt)
    """
    if len(observations) == 0:
        raise ValueError("The list of observations is empty.")
    e_process = 1.0
    for x in observations:
        likelihood_ratio = np.exp(delta_alt * x - 0.5 * delta_alt ** 2)
        e_process *= likelihood_ratio
    return e_process

# Usage example
data_stream = [0.4, 0.3, 0.5, 0.2, 0.6, 0.7, 0.1, 0.8]
alpha = 0.05
threshold = 1 / alpha  # = 20
e_val = compute_e_process(data_stream, delta_alt=0.3)
print(f"e-value: {e_val:.4f}")
if e_val >= threshold:
    print(f"e-value {e_val:.2f} >= {threshold:.0f} → can reject")
else:
    print(f"e-value {e_val:.2f} < {threshold:.0f} → more data needed")
```

Example 2: Bayesian Model Comparison — Scenarios Where the Bayes Factor Is Appropriate
This is the case where the sample size is determined in advance and domain knowledge can be encoded as a prior. When comparing the relative plausibility of two models, the Bayes factor provides an intuitive interpretation.
```python
import numpy as np
from scipy import integrate, stats

def bayes_factor_ttest_jzs(
    data: np.ndarray,
    mu0: float = 0.0,
    r: float = np.sqrt(2) / 2,
) -> float:
    """
    One-sample t-test Bayes factor (JZS Cauchy prior, Rouder et al. 2009).

    BF > 10: strong evidence, BF > 100: decisive evidence (Jeffreys scale)

    Limitations:
    - Assumes a normal population (unsuitable for non-normal data)
    - Limited to the univariate one-sample test
    - For high-dimensional or repeated-measures designs, R's BayesFactor
      package is recommended

    Args:
        data: observed data (1-D array)
        mu0: reference mean under the null hypothesis
        r: scale parameter of the Cauchy prior (default sqrt(2)/2 ≈ "medium")
    """
    if len(data) == 0:
        raise ValueError("The data array is empty.")
    n = len(data)
    t_stat = (np.mean(data) - mu0) / (np.std(data, ddof=1) / np.sqrt(n))

    # Marginal likelihood under H₁: numerical integration over the
    # Cauchy(0, r) prior on the standardized effect size δ
    def integrand(delta: float) -> float:
        ncp = delta * np.sqrt(n)  # noncentrality parameter
        log_lik = stats.nct.logpdf(t_stat, df=n - 1, nc=ncp)
        log_prior = stats.cauchy.logpdf(delta, loc=0, scale=r)
        return np.exp(log_lik + log_prior)

    marginal_H1, _ = integrate.quad(integrand, -np.inf, np.inf, limit=200)

    # Likelihood under H₀: central t distribution (noncentrality = 0)
    marginal_H0 = stats.t.pdf(t_stat, df=n - 1)
    if marginal_H0 == 0:
        raise ValueError("The likelihood under H₀ is 0. Check the data and the null hypothesis.")
    return marginal_H1 / marginal_H0

# Usage example
np.random.seed(42)
sample_data = np.random.normal(loc=0.5, scale=1.0, size=30)
bf = bayes_factor_ttest_jzs(sample_data)
print(f"Bayes factor BF₁₀ = {bf:.2f}")
if bf > 100:
    print("Decisive evidence: strongly supports H₁")
elif bf > 10:
    print("Strong evidence: supports H₁")
elif bf > 3:
    print("Substantial evidence: supports H₁")
elif bf > 1 / 3:
    print("Inconclusive evidence")
else:
    print("Supports H₀")
```

Note for R users: Dedicated BF libraries in Python are not yet mature. In an R environment, `BayesFactor::ttestBF(x = data, mu = 0, rscale = "medium")` is a more stable choice.
Pros and Cons
Advantages
| Item | Bayes Factor | E-value |
|---|---|---|
| Quantifying support for null | Evidence favoring H₀ can be expressed directly as a number | Difficult to quantify directly |
| Leveraging prior knowledge | Domain knowledge can be encoded as a prior | No prior needed; frequentist approach possible |
| Model comparison | Intuitive comparison of evidence ratios across multiple models | Primarily suited for binary testing |
| Type-I error control | Automatically satisfied with simple null + GRO prior | Always satisfied (Ville's inequality) |
| Combinability | Multiplicative combination valid only with the same prior | Product of independent e-values is an e-value → free combination |
| Optional stopping | Not guaranteed (sample size must be fixed in advance) | Fully guaranteed |
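The combinability row is easy to demonstrate: e-values from independent experiments multiply into a single valid e-value. The small simulation below is our own sketch, with both experiments drawn under H₀:

```python
import numpy as np

# "Free combination": e-values from two independent experiments multiply
# into a single e-value, since E0[E1 * E2] = E0[E1] * E0[E2] <= 1.
rng = np.random.default_rng(7)
n_sim, d1, d2 = 200_000, 0.3, 0.6
x1 = rng.normal(0.0, 1.0, size=n_sim)        # experiment 1 under H0
x2 = rng.normal(0.0, 1.0, size=n_sim)        # experiment 2 under H0
e1 = np.exp(d1 * x1 - 0.5 * d1 ** 2)         # e-variable, experiment 1
e2 = np.exp(d2 * x2 - 0.5 * d2 ** 2)         # e-variable, experiment 2
combined = e1 * e2
print(f"Monte Carlo E0[E1 * E2]: {combined.mean():.3f}")  # close to 1
```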
Disadvantages and Caveats
| Item | Bayes Factor | E-value | Mitigation |
|---|---|---|---|
| Optional stopping | Type-I error not guaranteed | Fully guaranteed | Fix sample size in advance when using BF |
| Composite null hypothesis | E₀[BF] ≠ 1, error rate uncontrolled | Can be handled with REGROW | Use Rescaled BF (BF/μ*) |
| Computational complexity | Requires high-dimensional marginal likelihood integration | Relatively simple | Use PyMC Bridge Sampling, JASP |
| Evidence for null hypothesis | Can be quantified | Difficult to quantify directly | Consider using BF alongside when needed |
| Statistical power | Power of the original BF is retained (when its guarantees apply) | Rescaled BF has lower power than the original BF | Use a GRO-optimal design (designSafeZ) |
| Prior distribution | Required; improper priors cannot be used | Not required | — |
Rescaled Bayes Factor: Converting to the form `BF / μ*` satisfies the e-variable condition even under a composite null hypothesis. However, if μ* > 1, the result is lower than the original BF, reducing statistical power.
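As a toy illustration of the rescaling, consider a construction of our own: a two-point composite null {−0.5, 0} and a symmetric two-point prior, chosen so that every expectation is available in closed form via E_θ[exp(bX − b²/2)] = exp(bθ) for X ~ N(θ, 1).

```python
import numpy as np

# Toy rescaled-BF sketch. BF(x) = 0.5 * (LR(+d)(x) + LR(-d)(x)) with
# LR(b)(x) = exp(b*x - b^2/2). Under theta = -0.5 (inside the null),
# E_theta[BF] = cosh(d*theta... computed below) exceeds 1, so the raw BF
# is not an e-variable there. Dividing by mu* = max over the null restores it.
d = 0.5
null_thetas = [-0.5, 0.0]
mean_bf = {t: 0.5 * (np.exp(d * t) + np.exp(-d * t)) for t in null_thetas}
mu_star = max(mean_bf.values())          # sup of E_theta[BF] over the null
rescaled_max = max(v / mu_star for v in mean_bf.values())
print(f"E_theta[BF] over the null: {mean_bf}")   # exceeds 1 at theta = -0.5
print(f"mu* = {mu_star:.4f}; max E_theta[BF/mu*] = {rescaled_max:.4f}")  # <= 1
```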
FBST e-value warning: The ev from the Full Bayesian Significance Test (FBST) has a similar name but is an entirely different concept from the e-value discussed in this article. It has no Type-I error control guarantee, and conflating the two can lead to serious errors.
The Most Common Mistakes in Practice
- Using the Bayes factor directly under a composite null hypothesis: Under a composite null such as H₀: θ ≤ 0, the ordinary Bayes factor does not control Type-I error. Switching to a Rescaled BF (`BF/μ*`) or an e-value-based test is recommended.
- Repeatedly checking the Bayes factor without fixing the sample size: The Bayes factor is generally not safe under optional stopping either. If sequential monitoring is required, using the safe test from the `safestats` package is recommended.
- Over-generalizing the GRO equivalence condition: The GRO e-variable and the Bayes factor coincide only when the prior satisfies the GRO-optimal condition. Treating a Bayes factor computed with an arbitrary prior as an e-value is an error.
Closing Thoughts
The practical decision criteria can be summarized on one page as follows.
```
Need to check results mid-experiment? (A/B test, interim analysis)
├─ Yes → E-value [safestats::designSafeZ → safeZTest]
└─ No
   └─ Is there a domain prior, and is the null simple (θ = θ₀)?
      ├─ Yes → Bayes factor [BayesFactor::ttestBF]
      └─ No (composite null, no prior)
         └─ E-value or Rescaled Bayes Factor
```

Before reading this article, it may have been unclear why the Bayes factor and the e-value sometimes produce the same number. The answer lies in GRO optimization, and the reason the GRO-optimal prior resolves to the Cauchy distribution is that its heavy tails represent a strategy of never abandoning bets regardless of the effect size. The two frameworks are not competing alternatives but tools with different guarantees, and the difference in those guarantees is the entire basis for choosing between them.
Three steps you can take right now:
- You can run the safestats package yourself: In R, after `install.packages("safestats")`, run the R code block above as-is to watch the e-process reach the threshold of 20. `vignette("safestats-vignette")` contains complete sequential examples for z-tests and t-tests that you can reference right away.
- You can run the Python code in Google Colab: Paste the `compute_e_process()` and `bayes_factor_ttest_jzs()` code above into Colab cells, generate simulation data with `np.random.normal()`, and compare the two values side by side; the GRO convergence condition will become intuitively clear.
- ★ You can start with one essential read: The two papers marked with a star (★) in the references below provide the most systematic treatment of the mathematical foundations covered in this article. Reading one of the two first is recommended.
Next article: How integrating e-values into Conformal Prediction changes uncertainty estimation in ML models — covering application methods in batch and sequential settings.
References
★ Essential
- Grünwald, de Heide, Koolen — Safe Testing | JRSS-B 2024
- Ramdas & Wang — Hypothesis Testing with E-values (2025 textbook)
Further Reading
- Rescaled Bayes Factors: A Class of E-variables | arXiv 2024
- E-values as Statistical Evidence: A Comparison to Bayes Factors, Likelihoods, and p-values | arXiv 2025
- Bayes, E-values, and Testing | arXiv 2026
- Game-Theoretic Statistics and Safe Anytime-Valid Inference | Statistical Science 2023
- Test Martingales, Bayes Factors and p-Values — Shafer, Vovk | Statistical Science
- safestats R package | CRAN
- Safe Flexible Hypothesis Tests vignette
- E-Values Expand the Scope of Conformal Prediction | arXiv 2025
- Anytime-Valid FDR Control with Stopped e-BH Procedure | ScienceDirect 2025
- A Critical Evaluation of the FBST ev for Bayesian Hypothesis Testing | Springer