Objective Bayesian Hypothesis Testing: A Comprehensive Guide
Introduction
Hypothesis testing is a crucial component of statistical education and scientific inquiry. It often begins with a hypothesis or generates data leading to one. A notable example is the dataset from Cushny and Peebles, who sought an effective soporific by administering various drugs to patients at the Michigan Asylum for the Insane in Kalamazoo and recording their sleep patterns. This dataset gained fame when Student first applied the t-distribution to it, and it was later analyzed by Ronald Fisher in his seminal work, Statistical Methods for Research Workers.
From the dataset, several hypotheses arise: Is L-hyoscine HBr more effective than L-hyoscyamine HBr in inducing sleep? Does L-hyoscyamine HBr outperform having no drug at all? Ideally, statistics would help us test these hypotheses. However, the commonly taught method, significance testing with P-values, is flawed and often misinterpreted.
For instance, Peter Bernstein illustrates the misunderstanding of hypothesis testing in his book Against the Gods: The Remarkable Story of Risk, stating:
> Epidemiologists define a result as statistically significant if there is no more than a 5% chance that the outcome is due to random chance.
This interpretation reflects a common misconception: P-values do not measure the probability of the null hypothesis. Steven Goodman has termed this the "P-value fallacy," noting that many, including academic physicians, often mistakenly believe that a P-value of 0.05 indicates a 95% chance that the null hypothesis is false. Bayesian statistics offers a pathway to assess the probability of the null hypothesis, a concept Fisher himself rejected.
Fisher argued:
> The “error” of accepting the null hypothesis when it is false is always ill-defined in both magnitude and frequency.
This recurrent misinterpretation of P-values raises the question: How should we interpret them?
In 1987, James Berger and Thomas Sellke made significant strides toward answering this question. Their investigation into testing a normal mean and point null hypothesis revealed alarming insights into the understanding of P-values.
They examined symmetric priors that assign equal weight to the null and alternative hypotheses. Their findings indicated that for any such objective prior, when P < 1/e, the probability of the null hypothesis is bounded below by

P(H_0 | x) ≥ (1 + (−e · P · log P)^(−1))^(−1).

The implications are striking: for a P-value of 0.05, the null hypothesis's probability is at least 28.9%, so interpreting P < 0.05 as a mere 5% chance for the null hypothesis is fundamentally flawed.
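The bound is easy to check numerically. Here is a minimal sketch using the −e·P·log(P) calibration of Sellke, Bayarri, and Berger (the function name is ours, not from the paper):

```
import numpy as np

def null_probability_lower_bound(p):
    # Lower bound on P(H_0 | x) via the -e * p * log(p) calibration;
    # valid for p < 1/e
    bayes_factor_bound = -np.e * p * np.log(p)
    return 1 / (1 + 1 / bayes_factor_bound)

print(null_probability_lower_bound(0.05))  # prints approximately 0.289
```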
Furthermore, this lower bound is calculated under conditions unfavorable to the null hypothesis. To illustrate this, Sellke, Bayarri, and Berger conducted a simulation of a series of experimental drugs and found that a substantial proportion of the drugs yielding small P-values would in fact have negligible effect.
In this article, we will discuss how to derive objective Bayesian solutions for hypothesis testing under more reasonable assumptions for the null hypothesis. Although there are effective approaches in Bayesian literature, they are often overlooked in traditional statistical education. We will focus on expected encompassing intrinsic Bayes factors (EEIBFs) and apply them to the Cushny and Peebles dataset.
Example: The Hyoscine Trial at Kalamazoo
Let us explore whether L-hyoscine HBr is a more effective soporific compared to L-hyoscyamine HBr. We will test three hypotheses based on the assumption that the differences in average sleep durations are normally distributed with unknown variance:
- H_0: L-hyoscyamine HBr and L-hyoscine HBr have equal effectiveness.
- H_l: L-hyoscyamine HBr is less effective than L-hyoscine HBr.
- H_m: L-hyoscyamine HBr is more effective than L-hyoscine HBr.
Using the Python library bbai, we can compute the EEIBF posterior probabilities for these hypotheses.
```
# Create an array of differences in average sleep durations
import numpy as np

deltas = np.array([-1.2, -2.4, -1.3, -1.3, 0., -1., -1.8, -0.8, -4.6, -1.4])

# Test the three hypotheses using expected encompassing intrinsic Bayes factors
from bbai.stat import NormalMeanHypothesis
result = NormalMeanHypothesis().test(deltas)

# Print the posterior probabilities for the hypotheses
print('Probability mean is less than zero: ', result.left)
print('Probability mean is equal to zero: ', result.equal)
print('Probability mean is greater than zero: ', result.right)
```
The posterior probabilities of the three hypotheses can also be tabulated as the differences are observed one at a time, showing how the evidence accumulates; a sketch follows.
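Here is a minimal sketch of how such a table can be produced by rerunning the test on successive prefixes of the data. It assumes the deltas array and NormalMeanHypothesis import from the block above, and the minimum prefix length of 3 is our assumption about how many observations the test requires:

```
# Rerun the test on growing prefixes of the data to tabulate how the
# posterior probabilities evolve as each difference is observed
for n in range(3, len(deltas) + 1):
    result_n = NormalMeanHypothesis().test(deltas[:n])
    print(n, result_n.left, result_n.equal, result_n.right)
```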
For further details, you can refer to the complete example in the provided Jupyter notebook.
Comparison to Student’s Analysis
> Note: Student made some errors when entering the data from the Cushny and Peebles study, including mislabeling columns and omitting Patient 11's observation.
Student analyzed the data from a Bayesian perspective, considering only the two hypotheses H_l and H_m (a one-sided test) and computing probabilities with an improper prior. While improper priors can be justified for one-sided testing, they lead to results that are least favorable to the null hypothesis.
Here's some Python code that replicates the finding from Student's paper. Under the improper prior π(µ, σ) ∝ 1/σ, the posterior distribution of (µ − ȳ)/(s/√n) is a Student t distribution with n − 1 degrees of freedom, so the posterior probability that µ < 0 is the t-distribution's CDF evaluated at −t:
```
import numpy as np
from scipy import stats

# Note: Student considered only the first 10 entries from the dataset
deltas = np.array([-1.2, -2.4, -1.3, -1.3, 0., -1., -1.8, -0.8, -4.6, -1.4])
n = len(deltas)

# Compute the t statistic
t = np.mean(deltas) / (np.std(deltas, ddof=1) / np.sqrt(n))

# Posterior probability that the mean difference is negative, i.e. that
# L-hyoscyamine HBr is less effective than L-hyoscine HBr
P_H_l = stats.t(n - 1).cdf(-t)
print('Probability L-hyoscyamine HBr is less effective:', P_H_l)
```
The computed probability, approximately 0.9986, aligns with Student's finding of odds of about 666 to 1 that hyoscine is the better soporific; including Patient 11's omitted observation gives a comparable probability. Note that the EEIBF analysis assigns H_l a lower probability than Student's analysis does, because its prior treats the null hypothesis more reasonably rather than being least favorable to it.
Bayes Factors
The foundation of objective Bayesian hypothesis testing lies in Bayes factors and objective priors. Let M_1 and M_2 represent two competing models with parameters θ_1 and θ_2 and proper priors π_1 and π_2. Given data x, the posterior probability of model M_j is

P(M_j | x) = P(M_j) m_j(x) / (P(M_1) m_1(x) + P(M_2) m_2(x)),

where m_j(x) = ∫ p(x | θ_j) π_j(θ_j) dθ_j is the marginal likelihood of model M_j.
The Bayes factor B_ji is defined as the ratio of marginal likelihoods, B_ji = m_j(x) / m_i(x). If both models have equal prior probability, then B_ji equals the posterior odds in favor of M_j over M_i.
Example 1: The Binomial Distribution with Jeffreys Prior
Suppose we observe 5 successes and 4 failures from a binomial distribution with unknown success probability p. Consider two models corresponding to the hypotheses p < 1/2 and p > 1/2. For priors, we can take the Jeffreys prior for the binomial model, Beta(1/2, 1/2), restricted to each hypothesis's region and renormalized.
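Here is a minimal sketch of the computation using numerical integration (our code, not a library API). Since the Beta(1/2, 1/2) prior assigns probability 1/2 to each side of 1/2, the truncated priors' normalization constants are equal and cancel in the Bayes factor:

```
from scipy import integrate

def integrand(p):
    # binomial likelihood p^5 (1-p)^4 (up to a constant) times the
    # unnormalized Jeffreys density p^(-1/2) (1-p)^(-1/2)
    return p**4.5 * (1 - p)**3.5

# Marginal likelihoods under each hypothesis; the equal normalization
# constants of the truncated priors cancel in the ratio
m1, _ = integrate.quad(integrand, 0, 0.5)  # p < 1/2
m2, _ = integrate.quad(integrand, 0.5, 1)  # p > 1/2
print('B_21 =', m2 / m1)  # greater than 1, slightly favoring p > 1/2
```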
Example 2: Naive Version of the Training Sample Method
Suppose we observe values from a normal distribution with known variance σ² = 1, and we wish to test the hypotheses µ < 0 and µ > 0. The uniform prior on each region is improper, so we use the first observation as a training sample to convert it into a proper distribution, and then compute the Bayes factor from the remaining observations.
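A minimal sketch of this naive computation follows (we reuse the data from Example 3 below as an illustration). Conditioning the improper uniform prior on the first observation gives a normal distribution truncated to each hypothesis's region:

```
import numpy as np
from scipy import integrate, stats

# Reusing the data from Example 3 below (an assumption for illustration)
y = np.array([1.89, 0.52, 1.10, 2.36, 1.99, -0.85, 1.07])
y1, rest = y[0], y[1:]

def integrand(mu):
    # likelihood of the remaining observations times the N(y1, 1) kernel
    # obtained by conditioning the improper uniform prior on y1
    return np.prod(stats.norm.pdf(rest - mu)) * stats.norm.pdf(mu - y1)

# Normalize by each truncated prior's mass, P(mu < 0 | y1) and P(mu > 0 | y1)
m1 = integrate.quad(integrand, -np.inf, 0)[0] / stats.norm.cdf(-y1)  # mu < 0
m2 = integrate.quad(integrand, 0, np.inf)[0] / stats.norm.sf(-y1)    # mu > 0
print('B_21 =', m2 / m1)
```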
Example 3: EEIBF Hypothesis Testing of Normally Distributed Data
Continuing from the previous examples, assume the data is sampled from a normal distribution with unknown variance. We will test the hypotheses µ < 0 and µ > 0 while ensuring we have enough observations to form a proper posterior distribution.
Using the bbai library, the computations can be efficiently executed.
```
import numpy as np

y = np.array([1.89, 0.52, 1.10, 2.36, 1.99, -0.85, 1.07])

# Compute the Bayes factor for the one-sided hypotheses µ < 0 and µ > 0
from bbai.stat import NormalMeanHypothesis
result = NormalMeanHypothesis(with_equal=False).test(y)
print('B_12^EEI =', result.factor_left / result.factor_right)
```
The findings support the assertion that EEIBF is an effective procedure even for small sample sizes.
Discussion
Q1: What other options are available for objective Bayesian hypothesis testing?
Sellke and Berger developed equations for deriving lower bounds on the posterior probability of the null hypothesis, which have been further refined by Clare for tighter bounds. Additionally, intrinsic priors can be employed directly as default priors in computing Bayes factors.
Q2: Do multiple reasonable methods make the EEIBF approach subjective?
Not necessarily. While there are multiple reasonable default Bayes factors, and choosing among them involves some judgment, P-values hardly offer a more objective alternative: a P-value depends not only on the model and the observed data but also on experimental intent, such as the stopping rule, so the same data can produce different P-values.
Q3: Does rejecting P-value hypothesis testing indicate one is Bayesian?
No. Frequentist statistics shouldn't be equated with P-value hypothesis testing; Bayesian inference can itself be interpreted in frequentist terms.
Conclusion
Fisher's advocacy of P-value testing sparked significant debate due to its potential pitfalls. Critics such as Harold Jeffreys observed that a P-value rejects a hypothesis partly on the basis of results that were never observed: a hypothesis that may be true can be rejected because it failed to predict observable results that did not occur. The problems Fisher aimed to address with P-values, such as the arbitrary results produced by the early inverse-probability methods of his day, can instead be addressed with default Bayes factors like EEIBF, which give reliable answers under general conditions.
If you found this article insightful, you may also enjoy my piece on Introduction to Objective Bayesian Inference.
References
[1]: Cushny, A. R. and Peebles, A. R. (1905). The action of optical isomers. II: Hyoscines. Journal of Physiology 32, 501–510.
[2]: Senn, S. and Richardson, W. (1994). The first t-test. Statistics in Medicine 13(8), 785–803. doi: 10.1002/sim.4780130802.
[3]: Goodman, S. (1999). Toward evidence-based medical statistics. 1: The p value fallacy. Annals of Internal Medicine 130(12), 995–1004.
[4]: Bernstein, P. (1998). Against the Gods: The Remarkable Story of Risk. Wiley.
[5]: Fisher, R. A. (1935). The Design of Experiments (9th ed.). Oliver & Boyd.
[6]: Berger, J. and Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association 82(397), 112–122.
[7]: Sellke, T., Bayarri, M. J., and Berger, J. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician 55(1), 62–71.
[8]: Berger, J. and Mortera, J. (1999). Default Bayes factors for nonnested hypothesis testing. Journal of the American Statistical Association 94(446), 542–554.
[9]: Student (1908). The probable error of a mean. Biometrika 6(1), 1–25.
[10]: Berger, J. O. and Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association 91(433), 109–122.
[11]: Casella, G. and Berger, R. L. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem. Journal of the American Statistical Association 82(397), 106–111.
[12]: Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis. Springer.
[13]: Berger, J. O. and Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist 76(2), 159–165.
[14]: "View of Michigan Asylum for the Insane, Kalamazoo." Map of Kalamazoo Co., Michigan. Philadelphia: Geil & Harley, 1861. Library of Congress.
[15]: Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd.
[16]: Clare, R. (2024). A universal robust bound for the intrinsic Bayes factor. arXiv:2402.06112.
[17]: Berger, J., Bernardo, J., and Sun, D. (2024). Objective Bayesian Inference. World Scientific.
[18]: Berger, J. O. (2003). Could Fisher, Jeffreys, and Neyman have agreed on testing? Statistical Science 18(1), 1–32.
[19]: Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford Classic Texts in the Physical Sciences.
[20]: Fisher, R. (1930). Inverse probability. Mathematical Proceedings of the Cambridge Philosophical Society 26(4), 528–535.
[21]: Fisher, R. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 222, 309–368.