Sean P. Riley, PT, DPT, ScD

Twitter: @seanrileypt

Assistant Professor, Doctor of Physical Therapy Program, University of Hartford, West Hartford, CT 06117

Competing interests: Center of Excellence in Manual and Manipulative Therapy at Duke University.


Author bio(s): Sean Riley is an assistant professor at the University of Hartford, a clinical researcher, a physical therapist, and an advocate for the creation of ‘trustworthy’ research for the practicing clinician. He is board-certified in Orthopedics through the American Board of Physical Therapy Specialties and is fellowship-trained through the American Academy of Orthopaedic Manual Physical Therapists. He has published 21 manuscripts since 2015 and has presented his research at the national level in each of the last eight years. Sean is also an Associate Editor for the Journal of Manual and Manipulative Therapy.


Null Statistical Hypothesis Testing

Null statistical hypothesis testing aims to prevent confirmation bias [1]. The researcher creates the null hypothesis by converting the research question into a research hypothesis and then converting the research hypothesis into a null hypothesis. This should happen before data collection starts. The researcher should then use statistical significance testing (p-values) to reject, or fail to reject, the null hypothesis. Using significance testing to accept or reject the research question or hypothesis directly may lead to chasing statistically significant findings and confirmation bias. Researchers have called for the death of null statistical hypothesis testing, suggesting that it be replaced with effect sizes, confidence intervals, and Bayesian methods because researchers do not follow the intent of the methodology [2]. Each of these approaches has its strengths and weaknesses. The strength of null statistical hypothesis testing is that it is a powerful tool for ruling things out [3].

“We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.” - Richard P. Feynman
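For readers who want to see the mechanics, the following is a minimal sketch (in Python) of the workflow described above: the null hypothesis and the significance threshold are fixed before the data are seen, and the p-value is then used only to reject, or fail to reject, the null. The simulated outcome scores, group sizes, and alpha of 0.05 are illustrative assumptions, not values drawn from any study.

```python
# Minimal sketch of null hypothesis significance testing:
# state H0 and alpha before looking at the data, then test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# H0: the treatment and control groups have equal mean outcomes.
# Simulated outcome scores (illustrative only, not real trial data).
treatment = rng.normal(loc=52.0, scale=10.0, size=40)
control = rng.normal(loc=48.0, scale=10.0, size=40)

alpha = 0.05  # significance threshold, chosen in advance
t_stat, p_value = stats.ttest_ind(treatment, control)

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0; rule out and move on")
```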


Statistically Significant vs. Clinically Meaningful

Maximizing the strengths of null statistical hypothesis testing may start with recognizing that it may not decrease confirmation bias and may instead encourage looking for statistically significant differences rather than identifying whether differences are clinically meaningful [4]. Statistical significance is the lowest bar that should be used to determine whether any within- and between-group differences exist [5]. If there are no statistically significant differences, we should rule the hypothesis out and move on. If statistically significant differences are present, the next question is whether the difference is clinically meaningful. This may start with setting up the rules of the game, following those rules, and consciously committing to respect the results before the game is played. Ensuring that the rules of the game have not changed after it has been played should be managed by journal editors. Unfortunately, there are several ways that statistical significance may be manipulated after an initial analysis is performed to alter the results [6].
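The gap between the two bars can be made concrete. The sketch below, using simulated data and an entirely hypothetical minimal clinically important difference (MCID) of 5 points, shows how a large sample can produce a statistically significant difference that falls well short of the pre-specified clinically meaningful threshold.

```python
# Sketch: statistically significant is not the same as clinically meaningful.
# The MCID of 5 points and the simulated data are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

mcid = 5.0   # hypothetical MCID, fixed before the "game" is played
alpha = 0.05

# Large illustrative samples with a small true mean difference of 1 point.
group_a = rng.normal(50.0, 10.0, size=2000)
group_b = rng.normal(51.0, 10.0, size=2000)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
observed_diff = abs(group_b.mean() - group_a.mean())

print(f"p = {p_value:.4f}, observed difference = {observed_diff:.2f} points")
print(f"statistically significant: {p_value < alpha}")
print(f"clinically meaningful (>= MCID of {mcid}): {observed_diff >= mcid}")
```

With samples this large, the 1-point difference is typically flagged as significant even though it is a fraction of the clinically meaningful threshold.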


Following the Rules

Clinical trial registries were meant to verify that researchers played by their established rules. The most frequent modifications are related to the study’s primary aim, primary outcome, and power analysis [7]. A research study should have one primary research question and one primary outcome, named before data collection starts, that is used to answer that question. Using multiple primary outcomes, or promoting outcomes from secondary to primary, creates an environment in which statistical significance may be manipulated [8]. Physical therapy research findings must be clinically ‘trustworthy’ for the practicing clinician. Manipulating these variables after a preliminary analysis is likely more prevalent than we think, given that prospective intent could not be verified in 70% of randomized clinical trials of physical therapy interventions for musculoskeletal disorders [7].


Power Analysis and Type I & II Errors

The power analysis should be based on the primary outcome and should determine the number of subjects needed to detect a statistically significant difference if a true difference exists. This involves controlling for type I and type II errors [9]. A type I error occurs when a statistically significant difference is identified although no true difference exists, a false positive [9]. Statistical significance can be a chance finding: p-values can vary across identical experiments, and the larger the sample, the more likely it is that very small within- and between-group differences will be recognized as statistically significant [9]. A type II error, which is related to statistical power and therefore to sample size, occurs when a true difference exists but no statistically significant difference is found, a false negative [9]. A larger sample size increases statistical power and decreases the risk of a type II error, while simultaneously making trivially small differences more likely to reach statistical significance [9]. Collecting excess data to create statistically significant differences that are not clinically meaningful is an ethical issue contrary to the principle of beneficence in human subjects research: it wastes patients’ time without helping the individual or society.
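As an illustration, the following sketch performs an a priori power analysis for a two-arm trial using the statsmodels library. The standardized effect size, alpha, and power values are conventional illustrative choices, not recommendations from the text.

```python
# Sketch of an a priori power analysis for a two-arm, parallel-group trial.
# The effect size, alpha, and power below are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

effect_size = 0.5  # standardized (Cohen's d) difference worth detecting
alpha = 0.05       # accepted type I error rate (false positive)
power = 0.80       # 1 - beta, where beta is the type II error rate (false negative)

n_per_group = TTestIndPower().solve_power(effect_size=effect_size,
                                          alpha=alpha, power=power)
print(f"~{n_per_group:.0f} subjects per group needed")

# Halving the target effect size roughly quadruples the required sample,
# which is why chasing trivially small differences demands very large trials.
n_small_effect = TTestIndPower().solve_power(effect_size=0.25,
                                             alpha=alpha, power=power)
print(f"~{n_small_effect:.0f} subjects per group for a d of 0.25")
```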


The Solution

The simple solution to this problem is respect for the rules, respect for the results, and identification of what a clinically meaningful difference would be in a clinical trial registry before data are collected. The promotion of statistically significant differences that are not clinically meaningful abounds in our literature. I would suggest not throwing the baby out with the bathwater, but recognizing that type I and type II errors exist. Simply increasing the sample size is not the answer to determining whether the differences found are clinically meaningful. Finally, statistical significance can be a powerful tool for ruling out hypotheses, and ruling out is essential to decrease variability in clinical practice and to progress our profession.


Bibliography

  1. Pernet C. Null hypothesis significance testing: a short tutorial. F1000Res. 2015;4:621.
  2. Masson ME. A tutorial on a practical Bayesian alternative to null-hypothesis significance testing. Behav Res Methods. 2011 Sep;43(3):679-90.
  3. Cohen HW. P values: use and misuse in medical literature. Am J Hypertens. 2011 Jan;24(1):18-23.
  4. Fleischmann M, Vaughan B. Commentary: Statistical significance and clinical significance – A call to consider patient reported outcome measures, effect size, confidence interval and minimal clinically important difference (MCID). J Bodyw Mov Ther. 2019 Oct;23(4):690-694.
  5. Yaddanapudi LN. The American Statistical Association statement on P-values explained. J Anaesthesiol Clin Pharmacol. 2016 Oct-Dec;32(4):421-423.
  6. Cook C, Garcia AN. Post-randomization bias. J Man Manip Ther. 2020 May;28(2):69-71.
  7. Riley SP, Swanson BT, Shaffer SM, et al. The Unknown Prevalence of Postrandomization Bias in 15 Physical Therapy Journals: A Methods Review. J Orthop Sports Phys Ther. 2021 Nov;51(11):542-550.
  8. Vetter TR, Mascha EJ. Defining the Primary Outcomes and Justifying Secondary Outcomes of a Study: Usually, the Fewer, the Better. Anesth Analg. 2017 Aug;125(2):678-681.
  9. Shreffler J, Huecker MR. Type I and Type II Errors and Statistical Power. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2022.