Published in: BMC Medical Research Methodology 1/2021

Open Access 01-12-2021 | Research article

Bayesian Hodges-Lehmann tests for statistical equivalence in the two-sample setting: Power analysis, type I error rates and equivalence boundary selection in biomedical research

Author: Riko Kelter

Abstract

Background

Null hypothesis significance testing (NHST) is among the most frequently employed methods in the biomedical sciences. However, the problems of NHST and p-values have been widely discussed, and various Bayesian alternatives have been proposed. Some proposals focus on equivalence testing, which tests an interval hypothesis instead of a precise hypothesis. An interval hypothesis comprises a small range of parameter values instead of a single null value, an idea that goes back to Hodges and Lehmann. Because researchers can always expect to observe some (although often negligibly small) effect size, interval hypotheses are more realistic for biomedical research. However, the selection of an equivalence region (the interval boundaries) often seems arbitrary, and several Bayesian approaches to equivalence testing coexist.
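
The contrast between a precise and an interval null hypothesis can be written compactly. The symbols are notational assumptions made here for illustration: δ denotes the effect size and c > 0 the half-width of a symmetric equivalence region.

\[
H_0:\ \delta = 0 \quad \text{vs.} \quad H_1:\ \delta \neq 0 \qquad \text{(precise null hypothesis)}
\]
\[
H_0:\ \delta \in [-c, c] \quad \text{vs.} \quad H_1:\ \delta \notin [-c, c] \qquad \text{(interval null hypothesis)}
\]

Selecting the equivalence region then amounts to choosing c.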

Methods

A new method is proposed for determining the equivalence region of Bayesian equivalence tests from objective criteria such as the type I error rate and power. Existing approaches to Bayesian equivalence testing in the two-sample setting are discussed, with a focus on the Bayes factor and the region of practical equivalence (ROPE). A simulation study provides the results needed to apply the new method in the two-sample setting, in which some of the most frequently carried out procedures in biomedical research take place.
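
As a rough illustration of how the operating characteristics of a ROPE-based rule could be explored by simulation, consider the following Python sketch. It is not the simulation design of this paper. The flat-prior normal approximation to the posterior of the standardized mean difference, the function names `rope_decision` and `decision_rates`, and all numerical settings are assumptions made for illustration. The rule itself follows the usual interval-plus-ROPE logic: declare equivalence when the 95% posterior interval lies inside the ROPE, and declare a difference when it lies entirely outside.

```python
from statistics import NormalDist

import numpy as np


def rope_decision(x, y, rope=(-0.1, 0.1), mass=0.95):
    """Classify one two-sample data set with an interval-plus-ROPE rule.

    Assumption: the posterior of the standardized mean difference is
    approximated by N(d_hat, 1/n1 + 1/n2), as under a flat prior.
    """
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * np.var(x, ddof=1)
                  + (n2 - 1) * np.var(y, ddof=1)) / (n1 + n2 - 2)
    d_hat = (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)
    se = np.sqrt(1.0 / n1 + 1.0 / n2)
    z = NormalDist().inv_cdf(0.5 + mass / 2.0)
    lo, hi = d_hat - z * se, d_hat + z * se
    if rope[0] <= lo and hi <= rope[1]:
        return "equivalent"   # interval entirely inside the ROPE
    if hi < rope[0] or lo > rope[1]:
        return "different"    # interval entirely outside the ROPE
    return "undecided"


def decision_rates(true_delta, n, rope, reps=2000, seed=1):
    """Relative frequency of each decision for a given true effect size."""
    rng = np.random.default_rng(seed)
    counts = {"equivalent": 0, "different": 0, "undecided": 0}
    for _ in range(reps):
        x = rng.normal(true_delta, 1.0, n)  # group 1, shifted by true_delta
        y = rng.normal(0.0, 1.0, n)         # group 2, reference
        counts[rope_decision(x, y, rope)] += 1
    return {k: v / reps for k, v in counts.items()}


if __name__ == "__main__":
    for delta in (0.0, 0.5):
        print(delta, decision_rates(delta, n=100, rope=(-0.3, 0.3), reps=500))
```

Repeating `decision_rates` over a grid of ROPE half-widths and sample sizes shows how the frequency of each decision changes. Which of these frequencies is read as a type I error rate and which as power depends on which hypothesis is treated as the null.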

Results

Bayesian Hodges-Lehmann tests for statistical equivalence differ in their sensitivity to prior modeling, in their power, and in their associated type I error rates. The relationship between type I error rates, power and sample sizes for existing Bayesian equivalence tests is identified in the two-sample setting. These results allow the equivalence region to be determined with the new method by incorporating such objective criteria. Importantly, the results show not only that prior selection can influence the type I error rate and power, but also that this relationship is reversed between the Bayes factor and ROPE based equivalence tests.

Conclusion

Based on the results, researchers can select between the existing Bayesian Hodges-Lehmann tests for statistical equivalence and determine the equivalence region based on objective criteria, thus improving the reproducibility of biomedical research.
Footnotes
1
See Lehmann [15] for a balanced perspective which focusses on the appropriate frame of reference of the statistical inference.
 
2
In this paper Bayesian equivalence tests are investigated. Frequentist equivalence tests, superiority tests or non-inferiority tests are not studied, although the Bayesian versions of the latter two can be identified as slight modifications of Bayesian equivalence tests, which is clarified in the main text later.
 
3
However, the Cauchy distribution has fat tails so it could also be reasonable to use distributions with lighter tails as an alternative (for example, a normal distribution).
 
4
Van Ravenzwaaij et al. [44] also present examples of how to apply the Bayes factor for non-inferiority and superiority testing; for details, see the original paper.
 
5
Results show that this is not the case.
 
6
Note that in this case, the support interval should not be given a Bayes factor interpretation: The interval includes parameter values which have been corroborated by the data, that is p(θ|x)/p(θ)>k. While the Savage-Dickey Bayes factor representation allows one to interpret values inside the support interval as yielding BF01>k, to reject the null based on Bayes factors one would logically require parameter values which yield a BF10>k. Thus, in general, the support interval draws its legitimacy from including values which have been corroborated by the data, and not from the fact that a Bayes factor interpretation can sometimes be given to them.
 
7
An exception is given by the OH model of Morey et al. [41], where only the widths r0 and r1 need to be specified. However, r0 can be interpreted as the width of the equivalence region in the OH model.
 
8
In the frequentist paradigm a widespread equivalence testing procedure is the two one-sided tests (TOST) procedure described in Lakens et al. [38], see Appendix A.
 
9
Similar proposals for default ROPEs, such as β=0.05 for regression coefficients, have been made for logistic and linear regression models. For a mathematical derivation see Kruschke ([36], p. 277).
 
10
Lakens et al. [38] underline that this is the weakest possible justification.
 
11
Alternatives to the JZS model of Rouder et al. [28], which is used to compute the NOH model of Morey and Rouder [41], are the Bayesian t-test models of Kruschke [58] or Kelter [32, 71].
 
12
The Bayes factor of Van Ravenzwaaij et al. [44] is not reported here because it is identical to the NOH model of Morey et al. [41], but it can be computed using the baymedr R package [45], see the provided replication script.
 
13
An exception is the OH model of Morey et al. [41], in which the associated Bayes factor is not consistent as discussed above.
 
14
The NOH Bayes factor \(BF_{01}^{NOH}\) for R=[−0.05,0.05] yields 9.27 under the wide C(0,1) prior in this case, also indicating moderate evidence for the interval null hypothesis; compare Jeffreys [60].
 
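
Footnote 6 characterizes the support interval as the set of parameter values with p(θ|x)/p(θ)>k. The following Python sketch evaluates that definition on a grid. It assumes a conjugate normal model (prior θ ~ N(0, prior_sd²), sample mean distributed as N(θ, se²)), which is chosen here only for illustration and is not the model used in the paper. The function names and all numbers are hypothetical.

```python
import numpy as np


def normal_logpdf(x, mean, sd):
    """Log density of N(mean, sd^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi * sd**2) - (x - mean) ** 2 / (2.0 * sd**2)


def support_interval(xbar, se, prior_sd, k=1.0, n_grid=20001):
    """Grid approximation of {theta : p(theta | x) / p(theta) > k}.

    Assumed model (illustrative only): theta ~ N(0, prior_sd^2) and
    xbar | theta ~ N(theta, se^2), so the posterior is normal as well.
    The grid spans posterior mean +/- 10 * se; a wider support set
    would be truncated to that range.
    """
    post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
    post_mean = post_var * xbar / se**2
    post_sd = np.sqrt(post_var)
    theta = np.linspace(post_mean - 10.0 * se, post_mean + 10.0 * se, n_grid)
    # Work on the log scale to avoid underflow in the density ratio.
    log_ratio = (normal_logpdf(theta, post_mean, post_sd)
                 - normal_logpdf(theta, 0.0, prior_sd))
    supported = theta[log_ratio > np.log(k)]
    if supported.size == 0:
        return None  # no value on the grid is supported at level k
    return float(supported.min()), float(supported.max())


if __name__ == "__main__":
    print(support_interval(xbar=0.3, se=0.1, prior_sd=1.0, k=1.0))
```

In this assumed model the ratio of posterior to prior density is proportional to the likelihood, so the supported values form a single interval centred at the sample mean. Raising k narrows the interval until no value remains.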
Literature
1.
8.
Berger JO, Wolpert RL. The Likelihood Principle. Hayward: Institute of Mathematical Statistics; 1988, p. 208.
15.
Lehmann EL. The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? J Am Stat Assoc. 1993; 88(424):1242–9.
17.
20.
Pratt JW. On the Foundations of Statistical Inference: Discussion. J Am Stat Assoc. 1962; 57(298):307–26.
21.
Dawid AP. Recent Developments in Statistics. In: Proceedings of the European Meeting of Statisticians. Grenoble: North-Holland Pub. Co.: 1977.
27.
Jeffreys H. Scientific Inference. Cambridge: Cambridge University Press; 1931.
34.
Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale: Routledge; 1988.
39.
Berger JO, Boukai B, Wang Y. Unified Frequentist and Bayesian Testing of a Precise Hypothesis. Stat Sci. 1997; 12(3):133–60.
43.
Lindley DV. Decision Analysis and Bioequivalence Trials. Stat Sci. 1998; 13(2):136–41.
45.
Linde M, van Ravenzwaaij D. baymedr: An R Package for the Calculation of Bayes Factors for Equivalence, Non-Inferiority, and Superiority Designs. arXiv preprint: arXiv:1910.11616v1. 2020.
51.
Stern JM. Significance tests, Belief Calculi, and Burden of Proof in legal and Scientific Discourse. Front Artif Intell Appl. 2003; 101:139–47.
60.
Jeffreys H. Theory of Probability, 3rd ed. Oxford: Oxford University Press; 1961.
61.
Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc. 1995; 90(430):773–95.
63.
Lee MD, Wagenmakers E-J. Bayesian Cognitive Modeling: A Practical Course. Amsterdam: Cambridge University Press; 2013, p. 264.
66.
Westlake WJ. Symmetrical confidence intervals for bioequivalence trials. Biometrics. 1976; 32(4):741–4.
68.
Carlin BP, Louis TA. Bayesian Methods for Data Analysis. Boca Raton: Chapman & Hall, CRC Press; 2009.
69.
Hobbs BP, Carlin BP. Practical Bayesian design and analysis for drug and device clinical trials. J Biopharm Stat. 2007; 18(1):54–80.
70.
Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm. 1987; 15(6):657–80.
77.
Cook JA, Hislop JA, Adewuyi TE, Harrild KA, Altman DG, Ramsay DG, Fraser C, Buckley B, Fayers P, Harvey I, Briggs AH, Norrie JD, Fergusson D, Ford I, Vale LD. Assessing methods to specify the target difference for a randomised controlled trial: DELTA (Difference ELicitation in TriAls) review. Health Technol Assess. 2014; 18(28):1–172. https://doi.org/10.3310/hta18280.
78.
Cook JA, Julious SA, Sones W, Hampson LV, Hewitt C, Berlin JA, Ashby D, Emsley R, Fergusson DA, Walters SJ, Wilson ECF, MacLennan G, Stallard N, Rothwell JC, Bland M, Brown L, Ramsay CR, Cook A, Armstrong D, Altman D, Vale LD. DELTA 2 guidance on choosing the target difference and undertaking and reporting the sample size calculation for a randomised controlled trial. Trials. 2018; 19(1):1–6. https://doi.org/10.1136/bmj.k3750.
86.
Kordsmeyer T, Penke L. The association of three indicators of developmental instability with mating success in humans. Evol Hum Behav. 2017; 38:704–13.
90.
Morey RD, Rouder JN. BayesFactor: Computation of Bayes Factors for Common Designs. R package version 0.9.12-4.2. 2018.
92.
93.
Schuirmann DJ. On hypothesis testing to determine if the mean of a normal distribution is contained in a known interval. Biometrics. 1981; 37(617).
97.
Berger RL, Hsu JC. Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets. Stat Sci. 1996; 11(4):283–302.
99.
Chow S-C, Liu J-P. Design and Analysis of Bioavailability and Bioequivalence Studies, 3rd ed. Boca Raton: Chapman & Hall/CRC Press; 2008.
Metadata
Title
Bayesian Hodges-Lehmann tests for statistical equivalence in the two-sample setting: Power analysis, type I error rates and equivalence boundary selection in biomedical research
Author
Riko Kelter
Publication date
01-12-2021
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2021
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-021-01341-7
