Statology

Statistics Made Easy

How to Write a Null Hypothesis (5 Examples)

A hypothesis test uses sample data to determine whether or not some claim about a population parameter is true.

Whenever we perform a hypothesis test, we always write a null hypothesis and an alternative hypothesis, which take the following forms:

H 0 (Null Hypothesis): Population parameter =, ≤, or ≥ some value

H A (Alternative Hypothesis): Population parameter <, >, or ≠ some value

Note that the null hypothesis always contains the equal sign.

We interpret the hypotheses as follows:

Null hypothesis: The sample data provides no evidence to support the claim being made by an individual.

Alternative hypothesis: The sample data does provide sufficient evidence to support the claim being made by an individual.

For example, suppose it’s assumed that the average height of a certain species of plant is 20 inches. However, one botanist claims the true average height is greater than 20 inches.

To test this claim, she may go out and collect a random sample of plants. She can then use this sample data to perform a hypothesis test using the following two hypotheses:

H 0 : μ ≤ 20 (the true mean height of plants is less than or equal to 20 inches)

H A : μ > 20 (the true mean height of plants is greater than 20 inches)

If the sample data gathered by the botanist shows that the mean height of this species of plants is significantly greater than 20 inches, she can reject the null hypothesis and conclude that the mean height is greater than 20 inches.
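The botanist's test can be sketched in code. In practice one would reach for `scipy.stats.ttest_1samp`, but the sketch below uses only the Python standard library with a normal approximation to the t distribution; the plant heights are made-up numbers for illustration.

```python
from math import sqrt
from statistics import mean, stdev, NormalDist

def one_sided_p_value(sample, mu0):
    """Approximate one-sided test of H0: mu <= mu0 vs HA: mu > mu0.

    Uses a normal approximation to the t distribution, so it is
    only a rough guide for small samples.
    """
    n = len(sample)
    se = stdev(sample) / sqrt(n)       # standard error of the mean
    z = (mean(sample) - mu0) / se      # test statistic
    return 1 - NormalDist().cdf(z)     # P(Z >= z) under H0

# Hypothetical plant heights (inches) collected by the botanist
heights = [22.1, 21.5, 23.0, 20.8, 22.4, 21.9, 23.3, 20.5, 22.7, 21.2]
p = one_sided_p_value(heights, 20)
if p < 0.05:
    print(f"p = {p:.4g}: reject H0; the mean height appears greater than 20")
else:
    print(f"p = {p:.4g}: fail to reject H0")
```

With this sample the observed mean is well above 20 inches, so the P-value is tiny and the null hypothesis would be rejected.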

Read through the following examples to gain a better understanding of how to write a null hypothesis in different situations.

Example 1: Weight of Turtles

A biologist wants to test whether or not the true mean weight of a certain species of turtles is 300 pounds. To test this, he goes out and measures the weight of a random sample of 40 turtles.

Here is how to write the null and alternative hypotheses for this scenario:

H 0 : μ = 300 (the true mean weight is equal to 300 pounds)

H A : μ ≠ 300 (the true mean weight is not equal to 300 pounds)

Example 2: Height of Males

It’s assumed that the mean height of males in a certain city is 68 inches. However, an independent researcher believes the true mean height is greater than 68 inches. To test this, he goes out and collects the height of 50 males in the city.

H 0 : μ ≤ 68 (the true mean height is less than or equal to 68 inches)

H A : μ > 68 (the true mean height is greater than 68 inches)

Example 3: Graduation Rates

A university states that 80% of all students graduate on time. However, an independent researcher believes that less than 80% of all students graduate on time. To test this, she collects data on the proportion of students who graduated on time last year at the university.

H 0 : p ≥ 0.80 (the true proportion of students who graduate on time is 80% or higher)

H A : p < 0.80 (the true proportion of students who graduate on time is less than 80%)
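A test about a proportion like this one is usually run as a one-proportion z-test. The sketch below uses only the standard library, and the counts (380 on-time graduates out of 500 students) are hypothetical; in practice a library routine such as statsmodels' `proportions_ztest` would be used.

```python
from math import sqrt
from statistics import NormalDist

def lower_tail_proportion_test(successes, n, p0):
    """One-sided z-test of H0: p >= p0 vs HA: p < p0.

    The standard error uses p0, the value assumed under the null.
    """
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    return z, NormalDist().cdf(z)   # lower-tail P-value

# Hypothetical data: 380 of 500 students graduated on time
z, p = lower_tail_proportion_test(380, 500, 0.80)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A sample proportion of 0.76 against a null value of 0.80 gives a negative z statistic; if the resulting P-value falls below the chosen significance level, the researcher would conclude that fewer than 80% of students graduate on time.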

Example 4: Burger Weights

A food researcher wants to test whether or not the true mean weight of a burger at a certain restaurant is 7 ounces. To test this, he goes out and measures the weight of a random sample of 20 burgers from this restaurant.

H 0 : μ = 7 (the true mean weight is equal to 7 ounces)

H A : μ ≠ 7 (the true mean weight is not equal to 7 ounces)

Example 5: Citizen Support

A politician claims that less than 30% of citizens in a certain town support a certain law. To test this, he goes out and surveys 200 citizens on whether or not they support the law.

H 0 : p ≥ 0.30 (the true proportion of citizens who support the law is greater than or equal to 30%)

H A : p < 0.30 (the true proportion of citizens who support the law is less than 30%)

Additional Resources

  • Introduction to Hypothesis Testing
  • Introduction to Confidence Intervals
  • An Explanation of P-Values and Statistical Significance


Null Hypothesis Examples

In statistical analysis, the null hypothesis assumes there is no meaningful relationship between two variables. Testing the null hypothesis can tell you whether your results are due to the effect of manipulating the independent variable or due to chance. It's often used in conjunction with an alternative hypothesis, which assumes there is, in fact, a relationship between two variables.

The null hypothesis is among the easiest hypotheses to test using statistical analysis, making it perhaps the most valuable hypothesis for the scientific method. By evaluating a null hypothesis in addition to another hypothesis, researchers can support their conclusions with a higher level of confidence. Below are examples of how you might formulate a null hypothesis to fit certain questions.

What Is the Null Hypothesis?

The null hypothesis states there is no relationship between the measured phenomenon (the dependent variable ) and the independent variable , which is the variable an experimenter typically controls or changes. You do not need to believe that the null hypothesis is true to test it. On the contrary, you will likely suspect there is a relationship between a set of variables. One way to prove that this is the case is to reject the null hypothesis. Rejecting a hypothesis does not mean an experiment was "bad" or that it didn't produce results. In fact, it is often one of the first steps toward further inquiry.

To distinguish it from other hypotheses , the null hypothesis is written as H 0 (read as "H-nought," "H-null," or "H-zero"). A significance test is used to determine how likely it is that the observed results would occur by chance if the null hypothesis were true. A confidence level of 95% or 99% is common. Keep in mind that even if the confidence level is high, there is still a small chance the null hypothesis is not true, perhaps because the experimenter did not account for a critical factor or simply because of chance. This is one reason why it's important to repeat experiments.

Examples of the Null Hypothesis

To write a null hypothesis, first start by asking a question. Rephrase that question in a form that assumes no relationship between the variables. In other words, assume a treatment has no effect. Write your hypothesis in a way that reflects this.

Other Types of Hypotheses

In addition to the null hypothesis, the alternative hypothesis is also a staple in traditional significance tests . It's essentially the opposite of the null hypothesis because it assumes the claim in question is true. For the first item in the table above, for example, an alternative hypothesis might be "Age does have an effect on mathematical ability."

Key Takeaways

  • In hypothesis testing, the null hypothesis assumes no relationship between two variables, providing a baseline for statistical analysis.
  • Rejecting the null hypothesis suggests there is evidence of a relationship between variables.
  • By formulating a null hypothesis, researchers can systematically test assumptions and draw more reliable conclusions from their experiments.

9.1 Null and Alternative Hypotheses

The actual test begins by considering two hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

H 0 , the null hypothesis: a statement of no difference between sample means or proportions, or of no difference between a sample mean or proportion and a population mean or proportion. In other words, the difference equals 0.

H a , the alternative hypothesis: a claim about the population that is contradictory to H 0 and what we conclude when we reject H 0 .

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options: reject H 0 if the sample information favors the alternative hypothesis, or do not reject H 0 (decline to reject H 0 ) if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in H 0 and H a :

H 0 always has a symbol with an equal in it. H a never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

Example 9.1

H 0 : No more than 30 percent of the registered voters in Santa Clara County voted in the primary election. p ≤ 0.30

H a : More than 30 percent of the registered voters in Santa Clara County voted in the primary election. p > 0.30

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25 percent. State the null and alternative hypotheses.

Example 9.2

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are the following: H 0 : μ = 2.0 H a : μ ≠ 2.0

We want to test whether the mean height of eighth graders is 66 inches. State the null and alternative hypotheses. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ __ 66
  • H a : μ __ 66

Example 9.3

We want to test if college students take fewer than five years to graduate from college, on the average. The null and alternative hypotheses are the following: H 0 : μ ≥ 5 H a : μ < 5

We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ __ 45
  • H a : μ __ 45

Example 9.4

An article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third of the students pass. The same article stated that 6.6 percent of U.S. students take advanced placement exams and 4.4 percent pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6 percent. State the null and alternative hypotheses. H 0 : p ≤ 0.066 H a : p > 0.066

On a state driver’s test, about 40 percent pass the test on the first try. We want to test if more than 40 percent pass on the first try. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : p __ 0.40
  • H a : p __ 0.40

Collaborative Exercise

Bring to class a newspaper, some news magazines, and some internet articles. In groups, find articles from which your group can write null and alternative hypotheses. Discuss your hypotheses with the rest of the class.


Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute Texas Education Agency (TEA). The original material is available at: https://www.texasgateway.org/book/tea-statistics . Changes were made to the original material, including updates to art, structure, and other content updates.

Access for free at https://openstax.org/books/statistics/pages/1-introduction
  • Authors: Barbara Illowsky, Susan Dean
  • Publisher/website: OpenStax
  • Book title: Statistics
  • Publication date: Mar 27, 2020
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/statistics/pages/1-introduction
  • Section URL: https://openstax.org/books/statistics/pages/9-1-null-and-alternative-hypotheses

© Jan 23, 2024 Texas Education Agency (TEA). The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

  • Research article
  • Open access
  • Published: 19 May 2010

The null hypothesis significance test in health sciences research (1995-2006): statistical analysis and interpretation

  • Luis Carlos Silva-Ayçaguer 1 ,
  • Patricio Suárez-Gil 2 &
  • Ana Fernández-Somoano 3  

BMC Medical Research Methodology volume  10 , Article number:  44 ( 2010 ) Cite this article


The null hypothesis significance test (NHST) is the most frequently used statistical method, although its inferential validity has been widely criticized since its introduction. In 1988, the International Committee of Medical Journal Editors (ICMJE) warned against sole reliance on NHST to substantiate study conclusions and suggested the supplementary use of confidence intervals (CI). Our objective was to evaluate the extent and quality of the use of NHST and CI in both English- and Spanish-language biomedical publications between 1995 and 2006, taking the ICMJE recommendations into account, with particular focus on the accuracy of the interpretation of statistical significance and the validity of conclusions.

Original articles published in three English and three Spanish biomedical journals in three fields (General Medicine, Clinical Specialties and Epidemiology - Public Health) were considered for this study. Papers published in 1995-1996, 2000-2001, and 2005-2006 were selected through a systematic sampling method. After excluding the purely descriptive and theoretical articles, analytic studies were evaluated for their use of NHST with P-values and/or CI for interpretation of statistical "significance" and "relevance" in study conclusions.

Among 1,043 original papers, 874 were selected for detailed review. The exclusive use of P-values was less frequent in English language publications as well as in Public Health journals; overall such use decreased from 41% in 1995-1996 to 21% in 2005-2006. While the use of CI increased over time, the "significance fallacy" (to equate statistical and substantive significance) appeared very often, mainly in journals devoted to clinical specialties (81%). In papers originally written in English and Spanish, 15% and 10%, respectively, mentioned statistical significance in their conclusions.

Conclusions

Overall, the results of our review show some improvement in the statistical management of results, but further efforts by scholars and journal editors are clearly required to bring reporting practice in line with the ICMJE advice, especially in the clinical setting, and most urgently among publications in Spanish.


Null hypothesis statistical testing (NHST) has been the most widely used statistical approach in health research over the past 80 years. Its origins date back to 1279 [ 1 ], although it was in the second decade of the twentieth century that the statistician Ronald Fisher formally introduced the concept of the "null hypothesis" H 0 , which, generally speaking, establishes that certain parameters do not differ from each other. Fisher also invented the "P-value", through which the null hypothesis could be assessed [ 2 ]. Fisher's P-value is defined as a conditional probability calculated from the results of a study: the probability of obtaining a result at least as extreme as the one actually observed, assuming that the null hypothesis is true. Fisherian significance testing treated the P-value as an index measuring the strength of evidence against the null hypothesis in a single experiment. The father of NHST never endorsed, however, the inflexible application of the ultimately subjective threshold levels that were almost universally adopted later on (although the 0.05 threshold was also his creation).

A few years later, Jerzy Neyman and Egon Pearson considered the Fisherian approach inefficient, and in 1928 they published an article [ 3 ] that would provide the theoretical basis of what they called hypothesis statistical testing . The Neyman-Pearson approach is based on the notion that one of two choices has to be made: accept the null hypothesis on the basis of the information provided, or reject it in favor of an alternative one. Thus, one can incur one of two types of errors: a Type I error, if the null hypothesis is rejected when it is actually true, and a Type II error, if the null hypothesis is accepted when it is actually false. Using the P-value introduced by Fisher, they established a rule to optimize the decision process by setting the maximum frequency of errors that would be admissible.

The null hypothesis statistical testing, as applied today, is a hybrid coming from the amalgamation of the two methods [ 4 ]. As a matter of fact, some 15 years later, both procedures were combined to give rise to the nowadays widespread use of an inferential tool that would satisfy none of the statisticians involved in the original controversy. The present method essentially goes as follows: given a null hypothesis, an estimate of the parameter (or parameters) is obtained and used to create statistics whose distribution, under H 0 , is known. With these data the P-value is computed. Finally, the null hypothesis is rejected when the obtained P-value is smaller than a certain comparative threshold (usually 0.05) and it is not rejected if P is larger than the threshold.
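The hybrid procedure just described can be expressed in a few lines: compute a statistic whose null distribution is known, convert it to a P-value (Fisher's contribution), then make a binary reject/not-reject decision against a fixed threshold (the Neyman-Pearson contribution). The sketch below assumes a z statistic with a standard normal null distribution, purely for illustration.

```python
from statistics import NormalDist

def nhst_decision(z_statistic, alpha=0.05):
    """The modern hybrid NHST procedure for a two-sided test.

    Fisher's element: the P-value as a measure of evidence.
    Neyman-Pearson's element: a binary reject/not-reject rule
    at a fixed error rate alpha.
    """
    p = 2 * (1 - NormalDist().cdf(abs(z_statistic)))
    return p, ("reject H0" if p < alpha else "do not reject H0")

print(nhst_decision(2.4))   # small P-value: reject
print(nhst_decision(0.7))   # large P-value: do not reject
```

Note how the function collapses a continuous measure of evidence into a dichotomous verdict, which is precisely the feature the criticisms below take aim at.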

The first reservations about the validity of the method began to appear around 1940, when some statisticians censured the logical roots and practical convenience of Fisher's P-value [ 5 ]. Significance tests and P-values have repeatedly drawn the attention and criticism of many authors over the past 70 years, who have kept questioning their epistemological legitimacy as well as their practical value. What remains in spite of these criticisms is researchers' lasting unwillingness to eradicate or reform these methods.

Although there are very comprehensive works on the topic [ 6 ], we list below some of the criticisms most universally accepted by specialists.

The P-values are used as a tool to make decisions in favor of or against a hypothesis. What really may be relevant, however, is to get an effect size estimate (often the difference between two values) rather than rendering dichotomous true/false verdicts [ 7 – 11 ].

The P-value is a conditional probability of the data, provided that some assumptions are met, but what really interests the investigator is the inverse probability: what degree of validity can be attributed to each of several competing hypotheses, once that certain data have been observed [ 12 ].

The two elements that affect the results, namely the sample size and the magnitude of the effect, are inextricably linked in the value of p and we can always get a lower P-value by increasing the sample size. Thus, the conclusions depend on a factor completely unrelated to the reality studied (i.e. the available resources, which in turn determine the sample size) [ 13 , 14 ].
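This entanglement of effect size and sample size is easy to demonstrate numerically: for a fixed observed difference, the P-value can be driven arbitrarily low simply by increasing n. The numbers below are illustrative, using a normal approximation with known standard deviation.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(diff, sd, n):
    """Two-sided P-value for a fixed observed difference `diff`
    (normal approximation, known standard deviation `sd`)."""
    z = diff / (sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same small difference (0.1 standard deviations) ...
for n in (25, 100, 400, 10_000):
    print(f"n = {n:6d}  p = {two_sided_p(0.1, 1.0, n):.4f}")
# ... moves from "not significant" to "highly significant"
# purely because the sample grew.
```

The effect on the population has not changed at all between the four rows; only the resources invested in sampling have.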

Those who defend NHST often assert the objective nature of the test, but the process is actually far from objective. NHST does not ensure objectivity. This is reflected in the fact that we generally operate with thresholds that are ultimately no more than conventions, such as 0.01 or 0.05. What is more, many years of use have unequivocally demonstrated the inherent subjectivity that goes with the concept of P, regardless of how it is used later [ 15 – 17 ].

In practice, the NHST is limited to a binary response sorting hypotheses into "true" and "false" or declaring "rejection" or "no rejection", without demanding a reasonable interpretation of the results, as has been noted time and again for decades. This binary orthodoxy validates categorical thinking, which results in a very simplistic view of scientific activity that induces researchers not to test theories about the magnitude of effect sizes [ 18 – 20 ].

Despite these weaknesses and shortcomings, NHST is frequently taught as if it were the key inferential statistical method, the most appropriate one, or even the sole unquestioned one. Statistical textbooks, with only a few exceptions, do not even mention the NHST controversy. Instead, the myth is spread that NHST is the "natural" final step of scientific inference and the only procedure for testing hypotheses. However, relevant specialists and important regulators of the scientific world advocate avoiding it.

Taking especially into account that NHST does not offer the most important information (i.e. the magnitude of an effect of interest, and the precision of the estimate of the magnitude of that effect), many experts recommend the reporting of point estimates of effect sizes with confidence intervals as the appropriate representation of the inherent uncertainty linked to empirical studies [ 21 – 25 ]. Since 1988, the International Committee of Medical Journal Editors (ICMJE, known as the Vancouver Group ) incorporates the following recommendation to authors of manuscripts submitted to medical journals: "When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as P-values, which fail to convey important information about effect size" [ 26 ].

As will be shown, the use of confidence intervals (CI), occasionally accompanied by P-values, is recommended as a more appropriate method for reporting results. Some authors noted several shortcomings of CI long ago [ 27 ]. Although calculating CI can indeed be complicated, and their interpretation is far from simple [ 28 , 29 ], authors are urged to use them because they provide much more information than NHST and do not merit most of the criticisms directed at NHST [ 30 ]. While some have proposed different options (for instance, likelihood-based information theoretic methods [ 31 ] and the Bayesian inferential paradigm [ 32 ]), confidence interval estimation of effect sizes is clearly the most widespread alternative approach.
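A result reported as an effect estimate with a 95% CI conveys both magnitude and precision, which is what the ICMJE recommendation asks for. A minimal sketch for a proportion, using the simple Wald interval and hypothetical counts:

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(successes, n, confidence=0.95):
    """Simple Wald confidence interval for a proportion.

    Adequate as an illustration; for small n or extreme proportions,
    Wilson or exact intervals are generally preferred.
    """
    p_hat = successes / n
    z = NormalDist().inv_cdf((1 + confidence) / 2)   # about 1.96 for 95%
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical trial: 45 of 150 patients responded to treatment
lo, hi = wald_ci(45, 150)
print(f"response rate 0.30, 95% CI ({lo:.3f}, {hi:.3f})")
```

Reporting "0.30 (95% CI 0.23 to 0.37)" tells the reader both how large the effect is and how precisely it was estimated, whereas "P < 0.05" alone tells them neither.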

Although twenty years have passed since the ICMJE began to disseminate these recommendations, they remain systematically ignored by the vast majority of textbooks and are hardly incorporated into medical publications [ 33 ]. It is therefore interesting to examine the extent to which NHST has been used in articles published in medical journals in recent years, in order to identify what is still lacking in the process of eradicating the widespread ceremonial use of statistics in health research [ 34 ]. Furthermore, it is enlightening in this context to examine whether these patterns differ between the English- and Spanish-speaking worlds and, if so, whether the change in paradigms is occurring more slowly in Spanish-language publications. In such a case we would offer various suggestions.

In addition to assessing adherence to the above-cited ICMJE recommendation on the use of P-values, we consider it of particular interest to estimate the extent to which the significance fallacy is present: an inertial deficiency that consists of attributing, explicitly or not, qualitative importance or practical relevance to the differences found simply because statistical significance was obtained.

Many authors produce misleading statements such as "a significant effect was (or was not) found" when it should be said that "a statistically significant difference was (or was not) found". A detrimental consequence of this equivalence is that some authors believe that finding out whether there is "statistical significance" or not is the aim, so that this term is then mentioned in the conclusions [ 35 ]. This means virtually nothing, except that it indicates that the author is letting a computer do the thinking. Since the real research questions are never statistical ones, the answers cannot be statistical either. Accordingly, the conversion of the dichotomous outcome produced by a NHST into a conclusion is another manifestation of the mentioned fallacy.

The general objective of the present study is to evaluate the extent and quality of use of NHST and CI, both in English- and in Spanish-language biomedical publications, between 1995 and 2006 taking into account the International Committee of Medical Journal Editors recommendations, with particular focus on accuracy regarding interpretation of statistical significance and the validity of conclusions.

We reviewed the original articles from six journals, three in English and three in Spanish, over three disjoint periods sufficiently separated from each other (1995-1996, 2000-2001, 2005-2006) as to properly describe the evolution in prevalence of the target features along the selected periods.

The selection of journals was intended to obtain representation for each of three thematic areas: clinical specialties ( Obstetrics & Gynecology and Revista Española de Cardiología ); Public Health and Epidemiology ( International Journal of Epidemiology and Atención Primaria ); and general and internal medicine ( British Medical Journal and Medicina Clínica ). Five of the selected journals formally endorsed the ICMJE guidelines; the remaining one ( Revista Española de Cardiología ) suggests observing ICMJE requirements on specific issues. We attempted to capture journal diversity in the sample by selecting general and specialty journals with different degrees of influence, as measured by their 2007 impact factors, which ranged from 1.337 ( Medicina Clínica ) to 9.723 ( British Medical Journal ). No special reasons guided us to choose these specific journals, but we opted for journals with rather large paid circulations. For instance, Revista Española de Cardiología has the largest impact factor among the fourteen Spanish journals devoted to clinical specialties that have an impact factor, and Obstetrics & Gynecology has an outstanding impact factor among the huge number of journals available for selection.

It was decided to take around 60 papers for each biennium and journal, which means a total of around 1,000 papers. As recently suggested [ 36 , 37 ], this number was not established using a conventional method, but by means of a purposive and pragmatic approach in choosing the maximum sample size that was feasible.

Systematic sampling in phases [ 38 ] was used in applying a sampling fraction equal to 60/N, where N is the number of articles, in each of the 18 subgroups defined by crossing the six journals and the three time periods. Table 1 lists the population size and the sample size for each subgroup. While the sample within each subgroup was selected with equal probability, estimates based on other subsets of articles (defined across time periods, areas, or languages) are based on samples with various selection probabilities. Proper weights were used to take into account the stratified nature of the sampling in these cases.
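The sampling scheme described (a systematic sample with fraction 60/N within each journal-by-period subgroup) can be sketched generically: draw a random start in the first sampling interval, then select every k-th article thereafter. This is an illustration of the standard technique, not the authors' actual code.

```python
import random

def systematic_sample(population_size, sample_size, rng=random):
    """Systematic sampling: random start, then every k-th unit,
    where k = population_size / sample_size (possibly fractional)."""
    k = population_size / sample_size
    start = rng.uniform(0, k)
    return [int(start + i * k) for i in range(sample_size)]

random.seed(7)  # reproducible illustration
# e.g. a subgroup with 300 original articles and a target sample of 60
indices = systematic_sample(300, 60)
print(len(indices), min(indices), max(indices))
```

Using a fractional step k keeps the sample size exact even when the population size is not a multiple of it, which is why the sampling fraction 60/N works for arbitrary N.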

Forty-nine of the 1,092 selected papers were eliminated because, although the section of the journal to which they were assigned suggested they were original articles, detailed scrutiny revealed that in some cases they were not. The sample, therefore, consisted of 1,043 papers. Each was classified into one of three categories: (1) purely descriptive papers, designed to review or characterize the state of affairs as it currently exists; (2) analytical papers; or (3) articles that address theoretical, methodological, or conceptual issues. An article was regarded as analytical if it sought to explain the reasons behind a particular occurrence by discovering causal relationships or if, even though self-classified as descriptive, it was carried out to assess cause-effect associations among variables. We classified as theoretical or methodological those articles that do not handle empirical data as such and focus instead on proposing or assessing research methods. We identified 169 papers as purely descriptive or theoretical, and these were excluded from the sample. Figure 1 presents a flow chart of the process for determining eligibility for inclusion in the sample.

Figure 1. Flow chart of the selection process for eligible papers.

To estimate the adherence to ICMJE recommendations, we considered whether the papers used P-values, confidence intervals, and both simultaneously. By "the use of P-values" we mean that the article contains at least one P-value, explicitly mentioned in the text or at the bottom of a table, or that it reports that an effect was considered as statistically significant . It was deemed that an article uses CI if it explicitly contained at least one confidence interval, but not when it only provides information that could allow its computation (usually by presenting both the estimate and the standard error). Probability intervals provided in Bayesian analysis were classified as confidence intervals (although conceptually they are not the same) since what is really of interest here is whether or not the authors quantify the findings and present them with appropriate indicators of the margin of error or uncertainty.

In addition we determined whether the "Results" section of each article attributed the status of "significant" to an effect on the sole basis of the outcome of a NHST (i.e., without clarifying that it is strictly statistical significance). Similarly, we examined whether the term "significant" (applied to a test) was mistakenly used as synonymous with substantive , relevant or important . The use of the term "significant effect" when it is only appropriate as a reference to a "statistically significant difference," can be considered a direct expression of the significance fallacy [ 39 ] and, as such, constitutes one way to detect the problem in a specific paper.

We also assessed whether the "Conclusions," which sometimes appear as a separate section of the paper and otherwise in the last paragraphs of the "Discussion" section, mentioned statistical significance and, if so, whether any such mention was no more than an allusion to the test results.

To perform these analyses we considered both the abstract and the body of the article. To assess the handling of the significance issue, however, only the body of the manuscript was taken into account.

The information was collected by four trained observers. Every paper was assigned to two reviewers. Disagreements were discussed and, if no agreement was reached, a third reviewer was consulted to break the tie and so moderate the effect of subjectivity in the assessment.

In order to assess the reliability of the criteria used for the evaluation of articles, and to achieve a convergence of criteria among the reviewers, a pilot study of 20 papers from each of three journals ( Clinical Medicine , Primary Care , and International Journal of Epidemiology ) was performed. The results of this pilot study were satisfactory. Our results are reported as percentages together with their corresponding confidence intervals. For sampling error estimation, used to obtain the confidence intervals, we weighted the data by the inverse of each paper's selection probability and took into account the complex nature of the sample design. These analyses were carried out with EPIDAT [ 40 ], a specialized computer program that is readily available.
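The weighting scheme described above can be sketched in a few lines. This is a simplified illustration with invented toy data, not the authors' actual analysis (which was run in EPIDAT and accounted for the full complex design); it shows only the core idea of inverse-probability weighting with a normal-approximation CI based on an effective sample size:

```python
from statistics import NormalDist

def weighted_percentage_ci(flags, selection_probs, level=0.95):
    """Percentage estimate with inverse-probability weights and a
    normal-approximation CI (ignores strata/cluster corrections)."""
    weights = [1.0 / p for p in selection_probs]
    total = sum(weights)
    phat = sum(w for w, f in zip(weights, flags) if f) / total
    # Kish effective sample size under unequal weights
    n_eff = total ** 2 / sum(w * w for w in weights)
    se = (phat * (1 - phat) / n_eff) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return 100 * phat, 100 * (phat - z * se), 100 * (phat + z * se)

# toy data: 4 papers, flag = 1 if the paper has the feature of interest;
# papers selected with probability 0.5 or 0.25
est, lo, hi = weighted_percentage_ci([1, 0, 1, 1], [0.5, 0.5, 0.25, 0.25])
```

With so few observations the normal approximation is crude (the interval can even spill outside [0, 100]); it is shown only to make the inverse-probability logic explicit.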

A total of 1,043 articles were reviewed, of which 874 (84%) were found to be analytic, while the remainder were purely descriptive or of a theoretical and methodological nature. Five of the analytic articles employed neither P-values nor CIs. Consequently, the analysis was based on the remaining 869 articles.

Use of NHST and confidence intervals

The percentage of articles that use only P-values, without even mentioning confidence intervals, declined steadily throughout the period analyzed (Table 2 ), from approximately 41% in 1995-1996 to 21% in 2005-2006. However, it does not differ notably among journals of different languages, as shown by the estimates and confidence intervals of the respective percentages. Concerning thematic areas, it is striking that most of the clinical articles ignore the ICMJE recommendations, whereas the problem affects only one in five papers in general and internal medicine and only one in six in Public Health and Epidemiology. The use of CIs alone (without P-values) increased slightly across the periods studied (from 9% to 13%), but it is five times more prevalent in Public Health and Epidemiology journals than in clinical ones, where it reached a scanty 3%.

Ambivalent handling of significance

While the percentage of articles referring implicitly or explicitly to significance in an ambiguous or incorrect way (that is, incurring the significance fallacy) seems to decline steadily, the prevalence of the problem exceeds 69% even in the most recent period. This percentage was almost the same for articles written in Spanish and in English, but it was notably higher in the clinical journals (81%) than in the other journals, where the problem occurs in approximately 7 out of 10 papers (Table 3 ). The kappa coefficient for agreement between observers on the presence of the "significance fallacy" was 0.78 (95% CI: 0.62 to 0.93), which is considered acceptable on the scale of Landis and Koch [ 41 ].
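The kappa coefficient reported above corrects raw inter-rater agreement for the agreement expected by chance. A minimal self-contained sketch, using invented codings for ten papers rather than the study's actual data:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement from each rater's marginal category frequencies
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# hypothetical codings: 1 = reviewer judged the fallacy present, 0 = absent
a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
kappa = cohens_kappa(a, b)
```

Here raw agreement is 80%, but because both raters mark "present" often, chance agreement is 58%, so kappa is considerably lower than the raw figure, which is exactly the correction the statistic exists to make.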

Reference to numerical results or statistical significance in Conclusions

The percentage of papers presenting a numerical finding as a conclusion is similar in the three periods analyzed (Table 4 ). Concerning languages, this percentage is nearly twice as large for Spanish journals as for those published in English (approximately 21% versus 12%). And, again, the highest percentage by thematic area (16%) corresponded to clinical journals.

A similar pattern is observed, although with less pronounced differences, for references to the outcome of the NHST (significant or not) in the conclusions (Table 5 ). The percentage of articles that introduce the term in the "Conclusions" does not appreciably differ between articles written in Spanish and in English. Again, the area where this deficiency is most often present (more than 15% of articles) is the clinical one.

There are some previous studies addressing the degree to which researchers have moved beyond the ritualistic use of NHST to assess their hypotheses. This has been examined for areas such as biology [ 42 ], organizational research [ 43 ], and psychology [ 44 – 47 ]. However, to our knowledge, no recent research has explored the pattern of use of P-values and CIs in the medical literature and, in any case, no efforts have been made to study this problem in a way that takes into account different languages and specialties.

At first glance it is puzzling that, after decades of questioning and technical warnings, and twenty years after the inception of the ICMJE recommendation to avoid NHST, such tests continue to be applied ritualistically and mindlessly as the dominant doctrine. Not long ago, researchers who did not observe statistically significant effects were unlikely to write them up and report "negative" findings, since they knew the paper would most likely be rejected. This has changed somewhat: editors are now more prone to judge all findings as potentially eloquent. This is probably due to the frequent denunciations of the tendency for papers presenting a significant positive result to receive more favorable publication decisions than equally well-conducted ones that report a negative or null result, the so-called publication bias [ 48 – 50 ]. This new openness is consistent with the fact that, if the substantive question addressed is really relevant, the answer (whether positive or negative) will also be relevant.

Consequently, even though it was not an aim of our study, we found many examples in which statistical significance was not obtained. However, many of those negative results were reported with a comment of this type: " The results did not show a significant difference between groups; however, with a larger sample size, this difference would probably have proved to be significant ". The problem with this statement is that it is true; more precisely, it will always be true and it is, therefore, sterile. It is not fortuitous that one never encounters the opposite, and equally tautological, statement: " A significant difference between groups has been detected; however, with a smaller sample size, this difference would perhaps have proved to be not significant ". Such a double standard is itself an unequivocal sign of the ritual application of NHST.
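The "always true" claim above is easy to verify numerically: for any fixed non-zero effect, the P-value of a z-test falls toward zero as the sample size grows. A minimal sketch with an invented effect size:

```python
import math
from statistics import NormalDist

def two_sided_p(effect, sd, n):
    """Two-sided z-test P-value for a fixed mean difference at sample size n."""
    z = effect / (sd / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# the same tiny effect (0.1 SD units) at growing sample sizes:
# the P-value marches monotonically toward zero
pvals = [two_sided_p(effect=0.1, sd=1.0, n=n) for n in (25, 100, 400, 1600)]
```

At n = 25 the effect is "not significant"; by n = 1600 it is "highly significant", with nothing about the effect itself having changed. This is why "with a larger sample it would have been significant" carries no information.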

Although the declining rates of NHST usage show that ICMJE and similar recommendations are gradually having a positive impact, most articles in the clinical setting still treat NHST as the final arbiter of the research process. Moreover, the improvement appears to be mostly formal, and the percentage of articles that fall into the significance fallacy remains huge.

The contradiction between what has been conceptually recommended and common practice is appreciably less acute in the area of Epidemiology and Public Health, but the same mechanical way of applying significance tests was evident everywhere. The clinical journals, nevertheless, remain the most unmoved by the recommendations.

The ICMJE recommendations are not cosmetic statements but substantial ones, and the vigorous exhortations made by outstanding authorities [ 51 ] are not mere intellectual exercises of ingenious but inopportune methodologists; they are very serious epistemological warnings.

In some cases the role of CIs is less clearly suitable (e.g., when estimating multiple regression coefficients, or because effect sizes are not available for some research designs [ 43 , 52 ]). But when it comes to estimating, for example, an odds ratio or a difference in rates, the advantage of CIs over P-values is very clear, since in such cases the goal is obviously to assess what has been called the "effect size."
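The odds-ratio case mentioned above is a good illustration of the advantage: the CI delivers the effect size and its uncertainty in one object, where a P-value delivers neither. A minimal sketch using the standard Woolf (log-scale) interval and an invented 2×2 exposure table:

```python
import math
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio from a 2x2 table [[a, b], [c, d]] with a Woolf
    (log-scale normal approximation) confidence interval."""
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lo = math.exp(math.log(or_hat) - z * se_log)
    hi = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lo, hi

# hypothetical table: 30/70 exposed among cases, 15/85 among controls
or_hat, lo, hi = odds_ratio_ci(30, 70, 15, 85)
```

Reporting `OR 2.43 (95% CI 1.21 to 4.87)` tells the reader both the magnitude of the association and its precision; "P < 0.05" on its own tells them neither.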

The inherent resistance to changing old paradigms and practices that have been entrenched for decades is always high. Old habits die hard. The estimates and trends outlined are entirely consistent with Alvan Feinstein's warning 25 years ago: "Because the history of medical research also shows a long tradition of maintaining loyalty to established doctrines long after the doctrines had been discredited, or shown to be valueless, we cannot expect a sudden change in this medical policy merely because it has been denounced by leading connoisseurs of statistics" [ 53 ].

It is possible, however, that the problem has an external explanation: some editors probably prefer to "avoid trouble" with authors, and vice versa, so both resort to the most conventional procedures. Many junior researchers believe it wise to avoid long back-and-forth discussions with reviewers and editors. In general, researchers who want to appear in print and survive in a publish-or-perish environment are motivated by force, fear, and expedience in their use of NHST [ 54 ]. Furthermore, it is rather natural that rank-and-file researchers use NHST when they observe that some of its theoretical objectors have themselves used this form of statistical analysis in empirical studies published after their own critiques appeared [ 55 ].

For example, the Journal of the American Medical Association published a bibliometric study [ 56 ] discussing the impact of statisticians' co-authorship of medical papers on publication decisions by two major high-impact journals: the British Medical Journal and the Annals of Internal Medicine . The data analysis is characterized by methodological orthodoxy: the authors use only chi-square tests, without any reference to CIs, although NHST had been repeatedly criticized over the years by two of the authors: Douglas Altman, an early promoter of confidence intervals as an alternative [ 57 ], and Steve Goodman, a critic of NHST from a Bayesian perspective [ 58 ]. Individual authors, however, cannot be blamed for broader institutional problems and the systemic forces opposed to change.

The present effort is certainly partial in at least two ways: it is limited to six specific journals and to three biennia. It would therefore be highly desirable to extend it by studying the problem in more detail (especially by reviewing more journals with different profiles) and by continuing to track the prevailing patterns and trends.

Curran-Everett D: Explorations in statistics: hypothesis tests and P values. Adv Physiol Educ. 2009, 33: 81-86. 10.1152/advan.90218.2008.


Fisher RA: Statistical Methods for Research Workers. 1925, Edinburgh: Oliver & Boyd


Neyman J, Pearson E: On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika. 1928, 20: 175-240.

Silva LC: Los laberintos de la investigación biomédica. En defensa de la racionalidad para la ciencia del siglo XXI. 2009, Madrid: Díaz de Santos

Berkson J: Test of significance considered as evidence. J Am Stat Assoc. 1942, 37: 325-335. 10.2307/2279000.


Nickerson RS: Null hypothesis significance testing: A review of an old and continuing controversy. Psychol Methods. 2000, 5: 241-301. 10.1037/1082-989X.5.2.241.


Rozeboom WW: The fallacy of the null hypothesis significance test. Psychol Bull. 1960, 57: 418-428. 10.1037/h0042040.

Callahan JL, Reio TG: Making subjective judgments in quantitative studies: The importance of using effect sizes and confidence intervals. HRD Quarterly. 2006, 17: 159-173.

Nakagawa S, Cuthill IC: Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev. 2007, 82: 591-605. 10.1111/j.1469-185X.2007.00027.x.

Breaugh JA: Effect size estimation: factors to consider and mistakes to avoid. J Manage. 2003, 29: 79-97. 10.1177/014920630302900106.

Thompson B: What future quantitative social science research could look like: confidence intervals for effect sizes. Educ Res. 2002, 31: 25-32.

Matthews RA: Significance levels for the assessment of anomalous phenomena. Journal of Scientific Exploration. 1999, 13: 1-7.

Savage IR: Nonparametric statistics. J Am Stat Assoc. 1957, 52: 332-333.

Silva LC, Benavides A, Almenara J: El péndulo bayesiano: Crónica de una polémica estadística. Llull. 2002, 25: 109-128.

Goodman SN, Royall R: Evidence and scientific research. Am J Public Health. 1988, 78: 1568-1574. 10.2105/AJPH.78.12.1568.


Berger JO, Berry DA: Statistical analysis and the illusion of objectivity. Am Sci. 1988, 76: 159-165.

Hurlbert SH, Lombardi CM: Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Ann Zool Fenn. 2009, 46: 311-349.

Fidler F, Thomason N, Cumming G, Finch S, Leeman J: Editors can lead researchers to confidence intervals but they can't make them think: Statistical reform lessons from Medicine. Psychol Sci. 2004, 15: 119-126. 10.1111/j.0963-7214.2004.01502008.x.

Balluerka N, Vergara AI, Arnau J: Calculating the main alternatives to null-hypothesis-significance testing in between-subject experimental designs. Psicothema. 2009, 21: 141-151.

Cumming G, Fidler F: Confidence intervals: Better answers to better questions. J Psychol. 2009, 217: 15-26.

Jones LV, Tukey JW: A sensible formulation of the significance test. Psychol Methods. 2000, 5: 411-414. 10.1037/1082-989X.5.4.411.

Dixon P: The p-value fallacy and how to avoid it. Can J Exp Psychol. 2003, 57: 189-202.

Nakagawa S, Cuthill IC: Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev Camb Philos Soc. 2007, 82: 591-605. 10.1111/j.1469-185X.2007.00027.x.

Brandstaetter E: Confidence intervals as an alternative to significance testing. MPR-Online. 2001, 4: 33-46.

Masson ME, Loftus GR: Using confidence intervals for graphically based data interpretation. Can J Exp Psychol. 2003, 57: 203-220.

International Committee of Medical Journal Editors: Uniform requirements for manuscripts submitted to biomedical journals. Update October 2008. Accessed July 11, 2009, [ http://www.icmje.org ]

Feinstein AR: P-Values and Confidence Intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol. 1998, 51: 355-360. 10.1016/S0895-4356(97)00295-3.

Haller H, Kraus S: Misinterpretations of significance: A problem students share with their teachers?. MRP-Online. 2002, 7: 1-20.

Gigerenzer G, Krauss S, Vitouch O: The null ritual: What you always wanted to know about significance testing but were afraid to ask. The Handbook of Methodology for the Social Sciences. Edited by: Kaplan D. 2004, Thousand Oaks, CA: Sage Publications, Chapter 21: 391-408.

Curran-Everett D, Taylor S, Kafadar K: Fundamental concepts in statistics: elucidation and illustration. J Appl Physiol. 1998, 85: 775-786.


Royall RM: Statistical evidence: a likelihood paradigm. 1997, Boca Raton: Chapman & Hall/CRC

Goodman SN: Of P values and Bayes: A modest proposal. Epidemiology. 2001, 12: 295-297. 10.1097/00001648-200105000-00006.

Sarria M, Silva LC: Tests of statistical significance in three biomedical journals: a critical review. Rev Panam Salud Publica. 2004, 15: 300-306.

Silva LC: Una ceremonia estadística para identificar factores de riesgo. Salud Colectiva. 2005, 1: 322-329.

Goodman SN: Toward Evidence-Based Medical Statistics 1: The p Value Fallacy. Ann Intern Med. 1999, 130: 995-1004.

Schulz KF, Grimes DA: Sample size calculations in randomised clinical trials: mandatory and mystical. Lancet. 2005, 365: 1348-1353. 10.1016/S0140-6736(05)61034-3.

Bacchetti P: Current sample size conventions: Flaws, harms, and alternatives. BMC Med. 2010, 8: 17-10.1186/1741-7015-8-17.


Silva LC: Diseño razonado de muestras para la investigación sanitaria. 2000, Madrid: Díaz de Santos

Barnett ML, Mathisen A: Tyranny of the p-value: The conflict between statistical significance and common sense. J Dent Res. 1997, 76: 534-536. 10.1177/00220345970760010201.

Santiago MI, Hervada X, Naveira G, Silva LC, Fariñas H, Vázquez E, Bacallao J, Mújica OJ: [The Epidat program: uses and perspectives] [letter]. Pan Am J Public Health. 2010, 27: 80-82. Spanish.

Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics. 1977, 33: 159-74. 10.2307/2529310.

Fidler F, Burgman MA, Cumming G, Buttrose R, Thomason N: Impact of criticism of null-hypothesis significance testing on statistical reporting practices in conservation biology. Conserv Biol. 2005, 20: 1539-1544. 10.1111/j.1523-1739.2006.00525.x.

Kline RB: Beyond significance testing: Reforming data analysis methods in behavioral research. 2004, Washington, DC: American Psychological Association


Curran-Everett D, Benos DJ: Guidelines for reporting statistics in journals published by the American Physiological Society: the sequel. Adv Physiol Educ. 2007, 31: 295-298. 10.1152/advan.00022.2007.

Hubbard R, Parsa AR, Luthy MR: The spread of statistical significance testing: The case of the Journal of Applied Psychology. Theor Psychol. 1997, 7: 545-554. 10.1177/0959354397074006.

Vacha-Haase T, Nilsson JE, Reetz DR, Lance TS, Thompson B: Reporting practices and APA editorial policies regarding statistical significance and effect size. Theor Psychol. 2000, 10: 413-425. 10.1177/0959354300103006.

Krueger J: Null hypothesis significance testing: On the survival of a flawed method. Am Psychol. 2001, 56: 16-26. 10.1037/0003-066X.56.1.16.

Rising K, Bacchetti P, Bero L: Reporting Bias in Drug Trials Submitted to the Food and Drug Administration: Review of Publication and Presentation. PLoS Med. 2008, 5: e217-10.1371/journal.pmed.0050217. doi:10.1371/journal.pmed.0050217

Sridharan L, Greenland L: Editorial policies and publication bias the importance of negative studies. Arch Intern Med. 2009, 169: 1022-1023. 10.1001/archinternmed.2009.100.

Falagas ME, Alexiou VG: The top-ten in journal impact factor manipulation. Arch Immunol Ther Exp (Warsz). 2008, 56: 223-226. 10.1007/s00005-008-0024-5.

Rothman K: Writing for Epidemiology. Epidemiology. 1998, 9: 98-104. 10.1097/00001648-199805000-00019.

Fidler F: The fifth edition of the APA publication manual: Why its statistics recommendations are so controversial. Educ Psychol Meas. 2002, 62: 749-770. 10.1177/001316402236876.

Feinstein AR: Clinical epidemiology: The architecture of clinical research. 1985, Philadelphia: W.B. Saunders Company

Orlitzky M: Institutionalized dualism: statistical significance testing as myth and ceremony. Accessed Feb 8, 2010, [ http://ssrn.com/abstract=1415926 ]

Greenwald AG, González R, Harris RJ, Guthrie D: Effect sizes and p-value. What should be reported and what should be replicated?. Psychophysiology. 1996, 33: 175-183. 10.1111/j.1469-8986.1996.tb02121.x.

Altman DG, Goodman SN, Schroter S: How statistical expertise is used in medical research. J Am Med Assoc. 2002, 287: 2817-2820. 10.1001/jama.287.21.2817.

Gardner MJ, Altman DJ: Statistics with confidence. Confidence intervals and statistical guidelines. 1992, London: BMJ

Goodman SN: P Values, Hypothesis Tests and Likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol. 1993, 137: 485-496.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/10/44/prepub


Acknowledgements

The authors would like to thank Tania Iglesias-Cabo and Vanesa Alvarez-González for their help with the collection of empirical data and their participation in an earlier version of the paper. The manuscript has benefited greatly from thoughtful, constructive feedback by Carlos Campillo-Artero, Tom Piazza and Ann Séror.

Author information

Authors and affiliations

Centro Nacional de Investigación de Ciencias Médicas, La Habana, Cuba

Luis Carlos Silva-Ayçaguer

Unidad de Investigación. Hospital de Cabueñes, Servicio de Salud del Principado de Asturias (SESPA), Gijón, Spain

Patricio Suárez-Gil

CIBER Epidemiología y Salud Pública (CIBERESP), Spain and Departamento de Medicina, Unidad de Epidemiología Molecular del Instituto Universitario de Oncología, Universidad de Oviedo, Spain

Ana Fernández-Somoano


Corresponding author

Correspondence to Patricio Suárez-Gil .

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

LCSA designed the study, wrote the paper and supervised the whole process; PSG coordinated the data extraction and carried out statistical analysis, as well as participated in the editing process; AFS extracted the data and participated in the first stage of statistical analysis; all authors contributed to and revised the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. It is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Silva-Ayçaguer, L.C., Suárez-Gil, P. & Fernández-Somoano, A. The null hypothesis significance test in health sciences research (1995-2006): statistical analysis and interpretation. BMC Med Res Methodol 10 , 44 (2010). https://doi.org/10.1186/1471-2288-10-44


Received : 29 December 2009

Accepted : 19 May 2010

Published : 19 May 2010

DOI : https://doi.org/10.1186/1471-2288-10-44


  • Clinical Specialty
  • Significance Fallacy
  • Null Hypothesis Statistical Testing
  • Medical Journal Editor
  • Clinical Journal

BMC Medical Research Methodology

ISSN: 1471-2288



2. Common Terms and Equations

In statistical analysis, two hypotheses are used. The null hypothesis , or H 0 , states that there is no statistically significant relationship between two variables. The null is often the commonly accepted position and is what scientists seek to reject through the study. The alternative hypothesis , or H a , states that there is a statistically significant relationship between two variables and is what scientists seek to support through experimentation.

For example, suppose a student wants to see how they scored on a math test relative to their class average. They can write hypotheses comparing the class average score (µ) to the student's score. Let's say the student's score on the exam was 75. The null (H 0 ) and alternative (H a ) hypotheses could be written as:

  • H 0 : µ = 75 (general form: µ = µ 0 )
  • H a : µ ≠ 75 (general form: µ ≠ µ 0 )

Under the null hypothesis, there is no difference between the class mean (µ) and the claimed value (75). Under the alternative hypothesis, the class mean is significantly different from 75 (either less than or greater than it). Statistical tests are then used to either retain or reject the null hypothesis. When the test retains the null hypothesis, it indicates that there is no statistically significant difference between the class mean and the student's score. If the null hypothesis is rejected, the alternative hypothesis is supported, leading to the conclusion that the class mean differs significantly from the student's score.
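The comparison described above can be sketched as a one-sample t statistic against the value 75. The class scores below are invented for illustration; in practice one would compare the resulting t to a t-distribution critical value (or compute a P-value) before deciding:

```python
import math
from statistics import mean, stdev

def one_sample_t(scores, mu0):
    """t statistic for H0: mu = mu0, from sample mean, SD, and size."""
    n = len(scores)
    return (mean(scores) - mu0) / (stdev(scores) / math.sqrt(n))

# hypothetical class scores; the student scored 75
scores = [68, 74, 81, 70, 77, 72, 79, 73, 76, 71]
t = one_sample_t(scores, 75)
```

Here the class mean (74.1) is close to 75, so the t statistic is small in magnitude and the null hypothesis would be retained at any conventional significance level.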


Statistics LibreTexts

9.1: Null and Alternative Hypotheses



The actual test begins by considering two hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

\(H_0\): The null hypothesis: It is a statement of no difference between the variables—they are not related. This can often be considered the status quo; as a result, rejecting the null typically calls for some action.

\(H_a\): The alternative hypothesis: It is a claim about the population that is contradictory to \(H_0\) and what we conclude when we reject \(H_0\). This is usually what the researcher is trying to prove.

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision. They are "reject \(H_0\)" if the sample information favors the alternative hypothesis or "do not reject \(H_0\)" or "decline to reject \(H_0\)" if the sample information is insufficient to reject the null hypothesis.

\(H_{0}\) always has a symbol with an equal in it. \(H_{a}\) never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers (including one of the co-authors in research work) use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

Example \(\PageIndex{1}\)

  • \(H_{0}\): No more than 30% of the registered voters in Santa Clara County voted in the primary election. \(p \leq 0.30\)
  • \(H_{a}\): More than 30% of the registered voters in Santa Clara County voted in the primary election. \(p > 0.30\)
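A minimal sketch of how this one-sided test on a proportion would be carried out, using invented sample numbers (180 voters out of a sample of 500):

```python
import math
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """One-sided z test for H0: p <= p0 vs Ha: p > p0.
    SE uses the null value p0, as is standard for a one-proportion z test."""
    phat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (phat - p0) / se
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value

# hypothetical sample: 180 of 500 registered voters voted in the primary
z, p_value = proportion_z_test(180, 500, 0.30)
```

With a sample proportion of 0.36 the z statistic is large and the P-value small, so at the usual 5% level this sample would lead to rejecting \(H_0\) in favor of \(p > 0.30\).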

Exercise \(\PageIndex{1}\)

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25%. State the null and alternative hypotheses.

  • \(H_{0}\): The drug reduces cholesterol by 25%. \(p = 0.25\)
  • \(H_{a}\): The drug does not reduce cholesterol by 25%. \(p \neq 0.25\)

Example \(\PageIndex{2}\)

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are:

  • \(H_{0}: \mu = 2.0\)
  • \(H_{a}: \mu \neq 2.0\)

Exercise \(\PageIndex{2}\)

We want to test whether the mean height of eighth graders is 66 inches. State the null and alternative hypotheses. Fill in the correct symbol \((=, \neq, \geq, <, \leq, >)\) for the null and alternative hypotheses.

  • \(H_{0}: \mu \_ 66\)
  • \(H_{a}: \mu \_ 66\)
  • \(H_{0}: \mu = 66\)
  • \(H_{a}: \mu \neq 66\)

Example \(\PageIndex{3}\)

We want to test if college students take less than five years to graduate from college, on the average. The null and alternative hypotheses are:

  • \(H_{0}: \mu \geq 5\)
  • \(H_{a}: \mu < 5\)
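This lower-tail test can be sketched the same way as the upper-tail examples, with rejection for large negative t. The years-to-graduate figures below are invented for illustration:

```python
import math
from statistics import mean, stdev

def lower_tail_t(sample, mu0):
    """t statistic for H0: mu >= mu0 vs Ha: mu < mu0.
    Evidence against H0 is a large *negative* value of t."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))

# hypothetical years-to-graduate for 8 students
years = [4.0, 4.5, 5.0, 4.0, 5.5, 4.5, 4.0, 5.0]
t = lower_tail_t(years, 5)
```

Here the sample mean (about 4.56 years) sits below 5, and the t statistic of roughly −2.2 would be compared to the lower-tail critical value of a t-distribution with 7 degrees of freedom to decide whether to reject \(H_0\).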

Exercise \(\PageIndex{3}\)

We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • \(H_{0}: \mu \_ 45\)
  • \(H_{a}: \mu \_ 45\)
  • \(H_{0}: \mu \geq 45\)
  • \(H_{a}: \mu < 45\)

Example \(\PageIndex{4}\)

In an issue of U. S. News and World Report , an article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third pass. The same article stated that 6.6% of U.S. students take advanced placement exams and 4.4% pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6%. State the null and alternative hypotheses.

  • \(H_{0}: p \leq 0.066\)
  • \(H_{a}: p > 0.066\)

Exercise \(\PageIndex{4}\)

On a state driver’s test, about 40% pass the test on the first try. We want to test if more than 40% pass on the first try. Fill in the correct symbol (\(=, \neq, \geq, <, \leq, >\)) for the null and alternative hypotheses.

  • \(H_{0}: p \_ 0.40\)
  • \(H_{a}: p \_ 0.40\)
  • \(H_{0}: p = 0.40\)
  • \(H_{a}: p > 0.40\)

COLLABORATIVE EXERCISE

Bring to class a newspaper, some news magazines, and some Internet articles . In groups, find articles from which your group can write null and alternative hypotheses. Discuss your hypotheses with the rest of the class.

In a hypothesis test , sample data is evaluated in order to arrive at a decision about some type of claim. If certain conditions about the sample are satisfied, then the claim can be evaluated for a population. In a hypothesis test, we:

  • Evaluate the null hypothesis , typically denoted with \(H_{0}\). The null is not rejected unless the hypothesis test shows otherwise. The null statement must always contain some form of equality \((=, \leq, \text{or } \geq)\).
  • Always write the alternative hypothesis , typically denoted with \(H_{a}\) or \(H_{1}\), using less than, greater than, or not equals symbols, i.e., \((\neq, >, \text{or } <)\).
  • If we reject the null hypothesis, then we can assume there is enough evidence to support the alternative hypothesis.
  • Never state that a claim is proven true or false. Keep in mind the underlying fact that hypothesis testing is based on probability laws; therefore, we can talk only in terms of non-absolute certainties.

Formula Review

\(H_{0}\) and \(H_{a}\) are contradictory.

  • If \(\alpha \leq p\)-value, then do not reject \(H_{0}\).
  • If \(\alpha > p\)-value, then reject \(H_{0}\).
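The two bullets above amount to a single comparison, which can be sketched as a small helper function (illustrative only; the name `decide` is ours):

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Standard decision rule: reject H0 only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"

print(decide(0.02))  # p-value below alpha: reject H0
print(decide(0.20))  # p-value at or above alpha: do not reject H0
```

Note that when the p-value exactly equals \(\alpha\), the rule as stated says do not reject \(H_{0}\).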

\(\alpha\) is preconceived. Its value is set before the hypothesis test starts. The \(p\)-value is calculated from the data.

References

Data from the National Institute of Mental Health. Available online at http://www.nimh.nih.gov/publicat/depression.cfm .


Hypothesis Testing with One Sample

Null and Alternative Hypotheses

OpenStaxCollege


The actual test begins by considering two hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

H 0 : The null hypothesis: It is a statement about the population that either is believed to be true or is used to put forth an argument unless it can be shown to be incorrect beyond a reasonable doubt.

H a : The alternative hypothesis: It is a claim about the population that is contradictory to H 0 and what we conclude when we reject H 0 .

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision. They are “reject H 0 ” if the sample information favors the alternative hypothesis or “do not reject H 0 ” or “decline to reject H 0 ” if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in H 0 and H a :

H 0 always has a symbol with an equal in it. H a never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers (including one of the co-authors in research work) use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

H 0 : No more than 30% of the registered voters in Santa Clara County voted in the primary election. p ≤ 0.30

H a : More than 30% of the registered voters in Santa Clara County voted in the primary election. p > 0.30

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25%. State the null and alternative hypotheses.

H 0 : The drug reduces cholesterol by 25%. p = 0.25

H a : The drug does not reduce cholesterol by 25%. p ≠ 0.25


You are testing that the mean speed of your cable Internet connection is more than three Megabits per second. What is the random variable? Describe in words.

The random variable is the mean Internet speed in Megabits per second.

You are testing that the mean speed of your cable Internet connection is more than three Megabits per second. State the null and alternative hypotheses.

The American family has an average of two children. What is the random variable? Describe in words.

The random variable is the mean number of children an American family has.

The mean entry level salary of an employee at a company is $58,000. You believe it is higher for IT professionals in the company. State the null and alternative hypotheses.

A sociologist claims the probability that a person picked at random in Times Square in New York City is visiting the area is 0.83. You want to test to see if the proportion is actually less. What is the random variable? Describe in words.

The random variable is the proportion of people picked at random in Times Square visiting the city.

A sociologist claims the probability that a person picked at random in Times Square in New York City is visiting the area is 0.83. You want to test to see if the claim is correct. State the null and alternative hypotheses.

In a population of fish, approximately 42% are female. A test is conducted to see if, in fact, the proportion is less. State the null and alternative hypotheses.

Suppose that a recent article stated that the mean time spent in jail by a first–time convicted burglar is 2.5 years. A study was then done to see if the mean time has increased in the new century. A random sample of 26 first-time convicted burglars in a recent year was picked. The mean length of time in jail from the survey was 3 years with a standard deviation of 1.8 years. Suppose that it is somehow known that the population standard deviation is 1.5. If you were conducting a hypothesis test to determine if the mean length of jail time has increased, what would the null and alternative hypotheses be? The distribution of the population is normal.
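Because the population standard deviation is treated as known, the jail-time scenario calls for a one-sample z-test. A minimal sketch of the computation, using only the summary numbers stated in the problem (the hypotheses in the comments are our reading of the question):

```python
from math import sqrt, erf

# H0: mu <= 2.5 vs Ha: mu > 2.5 (mean jail time in years has increased)
mu0, xbar, sigma, n = 2.5, 3.0, 1.5, 26

z = (xbar - mu0) / (sigma / sqrt(n))    # z statistic with known sigma
p_value = 0.5 * (1 - erf(z / sqrt(2)))  # upper-tail probability P(Z >= z)
print(round(z, 2), round(p_value, 3))
```

Here z is about 1.70, giving an upper-tail p-value just under 0.05.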

A random survey of 75 death row inmates revealed that the mean length of time on death row is 17.4 years with a standard deviation of 6.3 years. If you were conducting a hypothesis test to determine if the population mean time on death row could likely be 15 years, what would the null and alternative hypotheses be?

  • H 0 : __________
  • H a : __________
  • H 0 : μ = 15
  • H a : μ ≠ 15

The National Institute of Mental Health published an article stating that in any one-year period, approximately 9.5 percent of American adults suffer from depression or a depressive illness. Suppose that in a survey of 100 people in a certain town, seven of them suffered from depression or a depressive illness. If you were conducting a hypothesis test to determine if the true proportion of people in that town suffering from depression or a depressive illness is lower than the percent in the general adult American population, what would the null and alternative hypotheses be?

Some of the following statements refer to the null hypothesis, some to the alternate hypothesis.

State the null hypothesis, H 0 , and the alternative hypothesis. H a , in terms of the appropriate parameter ( μ or p ).

  • The mean number of years Americans work before retiring is 34.
  • At most 60% of Americans vote in presidential elections.
  • The mean starting salary for San Jose State University graduates is at least $100,000 per year.
  • Twenty-nine percent of high school seniors get drunk each month.
  • Fewer than 5% of adults ride the bus to work in Los Angeles.
  • The mean number of cars a person owns in her lifetime is not more than ten.
  • About half of Americans prefer to live away from cities, given the choice.
  • Europeans have a mean paid vacation each year of six weeks.
  • The chance of developing breast cancer is under 11% for women.
  • Private universities’ mean tuition cost is more than $20,000 per year.
  • H 0 : μ = 34; H a : μ ≠ 34
  • H 0 : p ≤ 0.60; H a : p > 0.60
  • H 0 : μ ≥ 100,000; H a : μ < 100,000
  • H 0 : p = 0.29; H a : p ≠ 0.29
  • H 0 : p = 0.05; H a : p < 0.05
  • H 0 : μ ≤ 10; H a : μ > 10
  • H 0 : p = 0.50; H a : p ≠ 0.50
  • H 0 : μ = 6; H a : μ ≠ 6
  • H 0 : p ≥ 0.11; H a : p < 0.11
  • H 0 : μ ≤ 20,000; H a : μ > 20,000

Over the past few decades, public health officials have examined the link between weight concerns and teen girls’ smoking. Researchers surveyed a group of 273 randomly selected teen girls living in Massachusetts (between 12 and 15 years old). After four years the girls were surveyed again. Sixty-three said they smoked to stay thin. Is there good evidence that more than thirty percent of the teen girls smoke to stay thin? The alternative hypothesis is:

  • p < 0.30
  • p > 0.30

A statistics instructor believes that fewer than 20% of Evergreen Valley College (EVC) students attended the opening night midnight showing of the latest Harry Potter movie. She surveys 84 of her students and finds that 11 attended the midnight showing. An appropriate alternative hypothesis is:

  • p > 0.20
  • p < 0.20

Previously, an organization reported that teenagers spent 4.5 hours per week, on average, on the phone. The organization thinks that, currently, the mean is higher. Fifteen randomly chosen teenagers were asked how many hours per week they spend on the phone. The sample mean was 4.75 hours with a sample standard deviation of 2.0. Conduct a hypothesis test. The null and alternative hypotheses are:

  • H 0 : \(\overline{x}\) = 4.5, H a : \(\overline{x}\) > 4.5
  • H 0 : μ ≥ 4.5, H a : μ < 4.5
  • H 0 : μ = 4.75, H a : μ > 4.75
  • H 0 : μ = 4.5, H a : μ > 4.5
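With only summary statistics available, the t statistic for the phone-hours problem can be computed directly and compared to a tabulated critical value (a sketch; the critical value 1.761 is the upper 5% point of the t distribution with 14 degrees of freedom, taken from a standard t table):

```python
from math import sqrt

# H0: mu = 4.5 vs Ha: mu > 4.5 (mean weekly hours on the phone)
mu0, xbar, s, n = 4.5, 4.75, 2.0, 15

t_stat = (xbar - mu0) / (s / sqrt(n))  # one-sample t statistic, df = n - 1 = 14
t_crit = 1.761                         # upper 5% critical value, t table, 14 df
print(round(t_stat, 3), t_stat > t_crit)
```

The t statistic (about 0.48) falls well below the critical value, so these data would not lead to rejecting H 0 at the 5% level.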


Null and Alternative Hypotheses Copyright © 2013 by OpenStaxCollege is licensed under a Creative Commons Attribution 4.0 International License , except where otherwise noted.

Null Hypothesis Examples


The null hypothesis (H 0 ) is the hypothesis that states there is no statistical difference between two sample sets. In other words, it assumes the independent variable does not have an effect on the dependent variable in a scientific experiment .

The null hypothesis is the most powerful type of hypothesis in the scientific method because it’s the easiest one to test with a high confidence level using statistics. If the null hypothesis is not rejected, then any observed differences between two experiment groups are consistent with random chance. If the null hypothesis is rejected, then it’s strong evidence there is a true difference between test sets or that the independent variable affects the dependent variable.

  • The null hypothesis is a nullifiable hypothesis. A researcher seeks to reject it because this result strongly indicates observed differences are real and not just due to chance.
  • The null hypothesis may be accepted or rejected, but not proven. There is always a level of confidence in the outcome.

What Is the Null Hypothesis?

The null hypothesis is written as H 0 , which is read as H-zero, H-nought, or H-null. It is associated with another hypothesis, called the alternate or alternative hypothesis H A or H 1 . When the null hypothesis and alternate hypothesis are written mathematically, they cover all possible outcomes of an experiment.

An experimenter tests the null hypothesis with a statistical analysis called a significance test. The significance test determines the likelihood that the results of the test are not due to chance. Usually, a researcher uses a confidence level of 95% or 99% (p-value of 0.05 or 0.01). But, even if the confidence in the test is high, there is always a small chance the outcome is incorrect. This means you can’t prove a null hypothesis. It’s also a good reason why it’s important to repeat experiments.

Exact and Inexact Null Hypothesis

The most common type of null hypothesis assumes no difference between two samples or groups or no measurable effect of a treatment. This is the exact hypothesis . If you’re asked to state a null hypothesis for a science class, this is the one to write. It is the easiest type of hypothesis to test and is the only one accepted for certain types of analysis. Examples include:

There is no difference between two groups H 0 : μ 1  = μ 2 (where H 0  = the null hypothesis, μ 1  = the mean of population 1, and μ 2  = the mean of population 2)

Both groups have value of 100 (or any number or quality) H 0 : μ = 100

However, sometimes a researcher may test an inexact hypothesis . This type of hypothesis specifies ranges or intervals. Examples include:

Recovery time from a treatment is the same or worse than a placebo: H 0 : μ ≥ placebo time

There is a 5% or less difference between two groups: H 0 : 95 ≤ μ ≤ 105

An inexact hypothesis offers “directionality” about a phenomenon. For example, an exact hypothesis can indicate whether or not a treatment has an effect, while an inexact hypothesis can tell whether an effect is positive or negative. However, an inexact hypothesis may be harder to test, and some scientists and statisticians disagree about whether it’s a true null hypothesis.

How to State the Null Hypothesis

To state the null hypothesis, first state what you expect the experiment to show. Then, rephrase the statement in a form that assumes there is no relationship between the variables or that a treatment has no effect.

Example: A researcher tests whether a new drug speeds recovery time from a certain disease. The average recovery time without treatment is 3 weeks.

  • State the goal of the experiment: “I hope the average recovery time with the new drug will be less than 3 weeks.”
  • Rephrase the hypothesis to assume the treatment has no effect: “If the drug doesn’t shorten recovery time, then the average time will be 3 weeks or longer.” Mathematically: H 0 : μ ≥ 3

This null hypothesis (inexact hypothesis) covers both the scenario in which the drug has no effect and the one in which the drug makes the recovery time longer. The alternate hypothesis is that average recovery time will be less than three weeks:

H A : μ < 3

Of course, the researcher could test the no-effect hypothesis (exact null hypothesis): H 0 : μ = 3

The danger of testing this hypothesis is that rejecting it only implies the drug affected recovery time (not whether it made it better or worse). This is because the alternate hypothesis is:

H A : μ ≠ 3 (which includes both μ < 3 and μ > 3)

Even though the no-effect null hypothesis yields less information, it’s used because it’s easier to test using statistics. Basically, testing whether something is unchanged/changed is easier than trying to quantify the nature of the change.
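The contrast between the one-sided (inexact) and two-sided (exact) null can be seen numerically: for the same data, the two-sided p-value is exactly double the matching one-sided p-value. A sketch using normal-approximation tails and made-up summary numbers (mean recovery, σ, and n below are assumptions for illustration):

```python
from math import sqrt, erf

def upper_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * (1 - erf(z / sqrt(2)))

# Made-up data: sample mean recovery 2.6 weeks, sigma = 1.2, n = 36, H0 value 3 weeks
z = (2.6 - 3.0) / (1.2 / sqrt(36))   # z = -2.0: recovery looks shorter

p_one_sided = upper_tail(-z)         # Ha: mu < 3  -> lower tail
p_two_sided = 2 * upper_tail(abs(z)) # Ha: mu != 3 -> both tails
print(round(p_one_sided, 4), round(p_two_sided, 4))
```

With these numbers the one-sided test rejects at α = 0.05 while the two-sided p-value is roughly twice as large, illustrating why the choice of null matters.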

Remember, a researcher hopes to reject the null hypothesis because this supports the alternate hypothesis. Also, be sure the null and alternate hypotheses cover all outcomes. Finally, remember a simple true/false, equal/unequal, yes/no exact hypothesis is easier to test than a more complex inexact hypothesis.



PLOS ONE

Why we habitually engage in null-hypothesis significance testing: A qualitative study

Jonah Stunt

1 Department of Health Sciences, Section of Methodology and Applied Statistics, Vrije Universiteit, Amsterdam, The Netherlands

2 Department of Radiation Oncology, Erasmus Medical Center, Rotterdam, The Netherlands

Leonie van Grootel

3 Rathenau Institute, The Hague, The Netherlands

4 Department of Philosophy, Vrije Universiteit, Amsterdam, The Netherlands

5 Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands

David Trafimow

6 Psychology Department, New Mexico State University, Las Cruces, New Mexico, United States of America

Trynke Hoekstra

Michiel de Boer

7 Department of General Practice and Elderly Care, University Medical Center Groningen, Groningen, The Netherlands

Associated Data

A full study protocol, including a detailed data analysis plan, was preregistered ( https://osf.io/4qg38/ ). At the start of this study, preregistration forms for qualitative studies had not yet been developed. Therefore, preregistration for this study is based on an outdated form. Presently, there is a preregistration form available for qualitative studies. Information about data collection, data management, data sharing and data storage is described in a Data Management Plan. Sensitive data is stored in Darkstor, an offline archive for storing sensitive information or data (information that involves, for example, privacy or copyright). As the recordings and transcripts of the interviews and focus groups contain privacy-sensitive data, these files are archived in Darkstor and can be accessed only on request by authorized individuals (i.e., the original researcher or a research coordinator)1. Non-sensitive data is stored in DANS ( https://doi.org/10.17026/dans-2at-nzfs ) (Data Archiving and Networked Services; the Netherlands institute for permanent access to digital research resources). 1. Data requests can be sent to ln.uv@mdr .

Null Hypothesis Significance Testing (NHST) is the most familiar statistical procedure for making inferences about population effects. Important problems associated with this method have been addressed and various alternatives that overcome these problems have been developed. Despite its many well-documented drawbacks, NHST remains the prevailing method for drawing conclusions from data. Reasons for this have been insufficiently investigated. Therefore, the aim of our study was to explore the perceived barriers and facilitators related to the use of NHST and alternative statistical procedures among relevant stakeholders in the scientific system.

Individual semi-structured interviews and focus groups were conducted with junior and senior researchers, lecturers in statistics, editors of scientific journals and program leaders of funding agencies. During the focus groups, important themes that emerged from the interviews were discussed. Data analysis was performed using the constant comparison method, allowing emerging (sub)themes to be fully explored. A theory substantiating the prevailing use of NHST was developed based on the main themes and subthemes we identified.

Twenty-nine interviews and six focus groups were conducted. Several interrelated facilitators and barriers associated with the use of NHST and alternative statistical procedures were identified. These factors were subsumed under three main themes: the scientific climate, scientific duty, and reactivity. As a result of these factors, most participants feel dependent in their actions upon others, have become reactive, and await action and initiatives from others. This may explain why NHST is still the standard and ubiquitously used by almost everyone involved.

Our findings demonstrate how perceived barriers to shift away from NHST set a high threshold for actual behavioral change and create a circle of interdependency between stakeholders. By taking small steps it should be possible to decrease the scientific community’s strong dependence on NHST and p-values.

Introduction

Empirical studies often start from the idea that there might be an association between a specific factor and a certain outcome within a population. This idea is referred to as the alternative hypothesis (H1). Its complement, the null hypothesis (H0), typically assumes no association or effect (although it is possible to test other effect sizes than no effect with the null hypothesis). At the stage of data-analysis, the probability of obtaining the observed, or a more extreme, association is calculated under the assumption of no effect in the population (H0) and a number of inferential assumptions [ 1 ]. The probability of obtaining the observed, or more extreme, data is known as ‘the p-value’. The p-value demonstrates the compatibility between the observed data and the expected data under the null hypothesis, where 0 is complete incompatibility and 1 is perfect compatibility [ 2 ]. When the p-value is smaller than a prespecified value (labelled as alpha, usually set at 5% (0.05)), results are generally declared to be statistically significant. At this point, researchers commonly reject the null hypothesis and accept the alternative hypothesis [ 2 ]. Assessing statistical significance by means of contrasting the data with the null hypothesis is called Null Hypothesis Significance Testing (NHST). NHST is the best known and most widely used statistical procedure for making inferences about population effects. The procedure has become the prevailing paradigm in empirical science [ 3 ], and reaching and being able to report statistically significant results has become the ultimate goal for many researchers.

Despite its widespread use, NHST and the p-value have been criticized since its inception. Numerous publications have addressed problems associated with NHST and p-values. Arguably the most important drawback is the fact that NHST is a form of indirect or inverse inference: researchers usually want to know if the null or alternative hypothesis can be accepted and use NHST to conclude either way. But with NHST, the probability of a finding, or more extreme findings, given the null hypothesis is calculated [ 4 ]. Ergo, NHST doesn’t tell us what we want to know. In fact, p-values were never meant to serve as a basis to draw conclusions, but as a continuous measure of incompatibility between empirical findings and a statistical model [ 2 ]. Moreover, the procedure promotes a dichotomous way of thinking, by using the outcome of a significance test as a dichotomous indicator for an effect (p<0.05: effect, p>0.05: no effect). Reducing empirical findings to two categories also results in a great loss of information. Further, a significant outcome is often unjustly interpreted as relevant, but a p-value does not convey any information about the strength or importance of the association. Worse yet, the p-values on which NHST is based confound effect size and sample size. A trivial effect size may nevertheless result in statistical significance provided a sufficiently large sample size. Or an important effect size may fail to result in statistical significance if the sample size is too small. P-values do not validly index the size, relevance, or precision of an effect [ 5 ]. Furthermore, statistical models include not only null hypotheses, but additional assumptions, some of which are wrong, such as the ubiquitous assumption of random and independent sampling from a defined population [ 1 ]. 
Therefore, although p-values validly index the incompatibility of data with models, p-values do not validly index incompatibility of data with hypotheses that are embedded in wrong models. These are important drawbacks rendering NHST unsuitable as the default procedure for drawing conclusions from empirical data [ 2 , 3 , 5 – 13 ].
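The claim above that p-values confound effect size and sample size is easy to demonstrate numerically: holding a trivial observed mean difference fixed, the two-sided p-value shrinks toward zero as n grows. A normal-approximation sketch (the effect size, σ, and sample sizes are assumptions chosen for illustration):

```python
from math import sqrt, erf

def two_sided_p(effect, sigma, n):
    """Two-sided z-test p-value for a fixed observed mean difference."""
    z = effect / (sigma / sqrt(n))
    return 2 * 0.5 * (1 - erf(abs(z) / sqrt(2)))

# The same tiny effect (0.05 sd units) becomes "significant" once n is large enough
for n in (100, 1000, 10000):
    print(n, round(two_sided_p(0.05, 1.0, n), 4))
```

The effect is identical in every row; only the sample size changes the verdict, which is exactly why a p-value alone says nothing about the size or relevance of an effect.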

A number of alternatives have been developed that overcome these pitfalls, such as Bayesian inference methods [ 7 , 11 , 14 , 15 ], informative hypothesis testing [ 9 , 16 ] and a priori inferential statistics [ 4 , 17 ]. These alternatives build on the idea that research usually starts with a more informed research-question than one merely assuming the null hypothesis of no effect. These methods overcome the problem of inverse inference, although the first two might still lead to dichotomous thinking with the use of thresholds. Despite the availability of alternatives, statistical behavior in the research community has hardly changed. Researchers have been slow to adopt alternative methods and NHST is still the prevailing paradigm for making inferences about population effects [ 3 ].

Until now, reasons for the continuous and ubiquitous use of NHST and the p-value have scarcely been investigated. One explanation is that NHST provides a very simple means for drawing conclusions from empirical data, usually based on the 5% cut-off. Secondly, most researchers are unaware of the pitfalls of NHST; it has been shown that NHST and the p-value are often misunderstood and misinterpreted [ 2 , 3 , 8 , 11 , 18 , 19 ]. Thirdly, NHST has a central role in most methods and statistics courses in higher education. Courses on alternative methods are increasingly being offered but are usually not mandatory. To our knowledge, there is a lack of in-depth empirical research aimed at elucidating why NHST nevertheless remains the dominant approach, or what actions can be taken to shift the sciences away from NHST. Therefore, the aim of our study was to explore the perceived barriers and facilitators, as well as behavioral intentions, related to the use of NHST and alternative statistical procedures among all relevant stakeholders in the scientific system.

Theoretical framework

In designing our study, we used two theories. Firstly, we used the ‘diffusion of innovation theory’ of Rogers [ 20 ]. This theory describes the dissemination of an innovation as a process consisting of four elements: 1) an innovation is 2) communicated through certain channels 3) over time 4) among the members of a social system [ 20 ]. In the current study, the innovation consists of the idea that we should stop with the default use of NHST and instead consider using alternative methods for drawing conclusions from empirical data. The science system forms the social structure in which the innovation should take place. The most important members, and potential adopters of the innovation, that we identified are researchers, lecturers, editors of scientific journals and representatives of funding agencies. Rogers describes phases in the adoption process, which coincide with characteristics of the (potential) adopters of the idea: 1) innovators, 2) early adopters, 3) early majority adopters, 4) late majority adopters and 5) laggards. Innovators are the first to adopt an innovation. There are few innovators but these few are very important for bringing in new ideas. Early adopters form the second group to adopt an innovation. This group includes opinion leaders and role models for other stakeholders. The largest group consists of the early and late majority who follow the early adopters, and then there is a smaller group of laggards who resist the innovation until they are certain the innovation will not fail. The process of innovation adoption by individuals is described as a normal distribution ( Fig 1 ). For these five groups, the adoption of a new idea is influenced by the following five characteristics of the innovative idea: 1) its relative advantage, 2) its compatibility with current experiences, 3) its complexity, 4) its flexibility, and 5) its visibility [ 20 ]. Members of all four stakeholder groups could play an important role in the diffusion of the innovation of replacing NHST by its alternatives.

Fig 1. The innovativeness dimension, measured by the time at which an individual from an adopter category adopts an innovation. Each category is one or more standard deviations removed from the average time of adoption [ 20 ].

Another important theory for our study is the ‘theory of planned behavior’, that was developed in the 1960s [ 21 ]. This theory describes how human behavior in a certain context can be predicted and explained. The theory was updated in 2010, under the name ‘the reasoned action approach’ [ 22 ]. A central factor in this theory is the intention to perform a certain behavior, in this case, to change the default use of NHST. According to the theory, people’s intentions determine their behaviors. An intention indexes to what extent someone is motivated to perform the behavior. Intentions are determined by three independent determinants: the person’s attitudes toward the behavior—the degree to which a person sees the behavior as favorable or unfavorable, perceived subjective norms regarding the behavior—the perceived social pressure to perform the behavior or not, and perceptions of control regarding the behavior—the perceived ease or difficulty of performing the behavior. Underlying (i.e. responsible for) these three constructs are corresponding behavioral, normative, and control beliefs [ 21 , 22 ] (see Fig 2 ).

Fig 2. [Image: pone.0258330.g002.jpg]

Both theories served as a lens for data collection and analysis. Within the framework of the grounded theory approach [ 24 ], we used sensitizing concepts [ 23 ] from both theories as a starting point for this qualitative study, and more specifically for the topic list for the interviews and focus groups, providing direction and guidance for data collection and data analysis.

Many of the concepts of Rogers’ theory and of Fishbein and Ajzen’s theory can be seen as facilitators of, and barriers to, embracing and implementing innovation in the scientific system.

A qualitative study among stakeholders using semi-structured interviews and focus groups was performed. Data collection and analysis were guided by the principle of constant comparison, traditional to the grounded theory approach we followed [ 24 ]. Grounded theory is a methodology that uses inductive reasoning and aims to construct a theory through the collection and analysis of data. Constant comparison is the iterative process whereby each part of the data that emerges from the analysis is compared with other parts of the data, to thoroughly explore and validate the data. Concepts extracted from the data are tagged with codes, which are grouped into categories. These categories constitute themes, which (may) become the basis for a new theory. Data collection and analysis continued until no new information was gained and data saturation had likely occurred within the identified themes.

The target population consisted of stakeholders relevant to our topic: junior and senior researchers, lecturers in statistics, editors of scientific journals and program leaders of funding agencies (see Tables 1 and 2 ). We approached participants in the fields of medical sciences, health and life sciences, and psychology. In line with the grounded theory approach, theoretical sampling was used to identify and recruit eligible participants. Theoretical sampling is a form of purposive sampling: we aimed to purposefully select participants based on characteristics that fit the parameters of the research questions [ 25 ]. Recruitment took place by approaching persons in our professional networks and/or the networks of the approached persons.

*The numbers between brackets represent the number of participants who were also interviewed.

Data collection

We conducted individual semi-structured interviews followed by focus groups. The aim of the interviews was to gain insight into the views of participants on the use of NHST and alternative methods and to examine potential barriers and facilitators related to these methods. The aim of the focus groups was to validate and further explore interview findings and to develop a comprehensive understanding of participants’ views and beliefs.

For the semi-structured interviews, we used a topic list (see Appendix 1 in S1 Appendix ). Questions addressed participants’ knowledge and beliefs about the concept of NHST, their familiarity with NHST, perceived attractiveness and drawbacks of the use of NHST, knowledge of the current NHST debate, knowledge of and views on alternative procedures and their views on the future of NHST. The topic list was slightly adjusted based on the interviews with editors and representatives from funding agencies (compared to the topic list for interviews with researchers and lecturers). Questions particularly focused on research and education were replaced by questions focused on policy (see Appendix 1 in S1 Appendix ).

The interviews were conducted between October 2017 and June 2018 by two researchers (L.v.G. and J.S.), both trained in qualitative research methods. Interviews lasted about one hour (range 31–86 minutes) and were voice-recorded. One interview was conducted by telephone; all others were face to face and took place at a location convenient for the participants, in most cases the participants’ work location.

Focus groups

During the focus groups, important themes that emerged from the interviews were discussed and explored. These included perceptions of NHST and its alternatives, and essential conditions for shifting away from the default use of NHST.

Five focus groups included representatives from the different stakeholder groups; one focus group was homogeneous, consisting solely of lecturers. The focus groups consisted of ‘old’ as well as ‘new’ participants; that is, some of the focus group participants were also in the interview sample. We also selected persons who were open to contributing further to the NHST debate and were willing to help think about (implementing) alternatives to NHST.

The focus groups were conducted between September and December 2018 by three researchers (L.v.G., J.S. and A.d.K.), all trained in qualitative research methods. The focus groups lasted about one-and-a-half hours (range 86–100 minutes).

Data analysis

All interviews and focus groups were transcribed verbatim. Atlas.ti 8.0 software was used for data management and analysis. All transcripts were read thoroughly several times to identify meaningful and relevant text fragments and analyzed by two researchers (J.S. and L.v.G.). Deductive predefined themes and theoretical concepts were used to guide the development of the topic list for the semi-structured interviews and focus groups, and were used as sensitizing concepts [ 23 ] in data collection and data analysis. Inductive themes were identified during the interview process and analysis of the data [ 26 ].

Transcripts were open-, axial- and selectively coded by two researchers (J.S. and L.v.G.). Open coding is the first step in the data analysis, whereby phenomena found in the text are identified and named (coded). With axial coding, connections between codes are drawn. Selective coding is the process of selecting one central category and relating all other categories to it, capturing the essence of the research. The constant comparison method [ 27 ] was applied, allowing emerging (sub)themes to be fully explored. First, the two researchers independently developed a set of initial codes. Subsequently, findings were discussed until consensus was reached. Codes were then grouped into categories that were covered under subthemes, belonging to main themes. Finally, a theory substantiating the prevailing use of NHST was developed based on the main themes and subthemes.

Ethical issues

This research was conducted in accordance with the Dutch General Data Protection Regulation and the Netherlands Code of Conduct for Research Integrity. The research protocol was submitted for review and approved by the ethical review committee of the VU Faculty of Behavioral and Movement Sciences. In addition, the project was submitted to the Medical Ethics Committee (METC) of the Amsterdam University Medical Centre, which decided that the project is not subject to the Medical Research (Human Subjects) Act (WMO). At the start of data collection, all participants signed an informed consent form.

A full study protocol, including a detailed data analysis plan, was preregistered ( https://osf.io/4qg38/ ). At the start of this study, preregistration forms for qualitative studies had not yet been developed, so the preregistration for this study is based on an outdated form. Presently, a preregistration form for qualitative studies is available [ 28 ]. Information about data collection, data management, data sharing and data storage is described in a Data Management Plan. Sensitive data are stored in Darkstor, an offline archive for sensitive information or data (e.g., information involving privacy or copyright). As the recordings and transcripts of the interviews and focus groups contain privacy-sensitive data, these files are archived in Darkstor and can be accessed only on request by authorized individuals (i.e., the original researcher or a research coordinator) (data requests can be sent to ln.uv@mdr ). Non-sensitive data are stored in DANS ( https://doi.org/10.17026/dans-2at-nzfs ) (Data Archiving and Networked Services; the Netherlands institute for permanent access to digital research resources).

Participant characteristics

Twenty-nine individual interviews and six focus groups were conducted. The focus groups included four to six participants per session. A total of 47 participants were included in the study (13 researchers, 15 lecturers, 11 editors of scientific journals and 8 representatives of funding agencies). Twenty-nine participants were interviewed and twenty-seven participants took part in the focus groups; nine of these twenty-seven were both interviewed and took part in the focus groups. Some participants had multiple roles (i.e., editor and researcher, editor and lecturer, or lecturer and researcher) but were classified based on their primary role (assistant professors were classified as lecturers). The lecturers in statistics in our sample were not statisticians themselves. Although they all received training in statistics, they were primarily trained as psychologists, medical doctors, or health scientists. Some lecturers in our sample taught an applied subject, with statistics as part of it. Other lecturers taught Methodology and Statistics courses. Statistical skills and knowledge among lecturers varied from modest to quite advanced. Statistical skills and knowledge among participants from the other stakeholder groups varied from poor to quite advanced. All participants were working in the Netherlands. A general overview of the participants is presented in Table 1 . Participant characteristics split up by interviews and focus groups are presented in Table 2 .

Three main themes with sub-themes and categories emerged ( Fig 3 ): the green-colored compartments hold the three main themes: The scientific climate , The scientific duty and Reactivity . Each of these three main themes consists of subthemes, depicted by the yellow-colored compartments. In turn, some (but not all) of the 9 subthemes also have categories. These ‘lower level’ findings are not included in the figure but will be mentioned in the elaboration on the findings and are depicted in Appendix 2 in S1 Appendix . Fig 3 shows how the themes are related to each other. The blue arrows indicate that the themes are interrelated; factors influence each other. The scientific climate affects the way stakeholders perceive and fulfil their scientific duty, and the way stakeholders give substance to their scientific duty shapes and maintains the scientific climate. Together, the scientific duty and the scientific climate cause a state of reactivity. Many participants have adopted a ‘wait and see’ attitude regarding behavioral changes with respect to statistical methods. They feel dependent on someone else’s action. This leads to a reactive (instead of a proactive) attitude and a low sense of responsibility. ‘Reactivity’ is the core theme, explaining the most critical problem with respect to the continuous and ubiquitous use of NHST.

Fig 3. [Image: pone.0258330.g003.jpg]

Main themes and subthemes are numbered. Categories are mentioned in the body of the text in bold. ‘P’ stands for participant; ‘I’ stands for interviewer.

1. The scientific climate

The theme ‘the scientific climate’ represents (Dutch) researchers’ perceptions of the many written and unwritten rules they face in the research environment. This theme concerns the opportunities and challenges participants encounter when working in the science system. Dutch academics feel pressured to publish fast and regularly, and to follow the conventions and directions of those on whom they depend. They feel this comes at the expense of the quality of their work. Thus, the scientific climate in the Netherlands has a strong influence on the behavior of participants regarding how they set their priorities and control the quality of their work.

1.1 Quality control. Monitoring the quality of research is considered very important. Researchers, funding agencies and editors indicate they rely on their own knowledge, expertise, and insight, and those of their colleagues, to guarantee this quality. However, editors and funding agencies are often left with little choice when it comes to compiling an evaluation committee or a review panel; the choice is often based on like-knows-like. Given the limited choice, they are forced to trust the opinion of their consultants, but the question is whether this trust is justified.

I: “The ones who evaluate the statistics, do they have sufficient statistical knowledge?” P: “Ehhr, no, I don’t think so.” I: “Okay, interesting. So, there are manuscripts published of which you afterwards might think….” P: “Yes yes.” (Interview 18; Professor/editor, Medical Sciences)

1.2 Convention. The scientific system is built on mores and conventions, as this participant describes:

P: “There is science, and there is the sociology of science, that is, how we talk to each other, what we believe, how we connect. And at some point, it was agreed upon that we would talk to each other in this way.” (Interview 28, researcher, Medical Sciences)

And to these conventions, one (naturally) conforms. Stakeholders copy the behavior and actions of others within their discipline, thereby causing particular behaviors and values to become conventional or normative. One of those conventions is the use of NHST and p-values. Everyone is trained with NHST and is used to applying this method. Another convention is that significant results mean ‘success’, in the sense of successful research and being a successful researcher. Everyone is aware that ‘p is smaller than 0.05’ means the desired results are achieved and that publication and citation chances are increased.

P: “You want to find a significant result so badly. (…) Because people constantly think: I must find a significant result, otherwise my study is worthless.” (Focus group 4, lecturer, Medical Sciences)

Stakeholders rigidly hold on to the above-mentioned conventions and are not inclined to deviate from existing norms; they are, in other words, quite conservative. ‘We don’t know any better’ has been brought up as a valid argument by participants from various stakeholder groups to stick to current rules and conventions. Consequently, the status quo in the scientific system is being maintained.

P: “People hold on to….” I: “Everyone maintains the system?” P: “Yes, we kind of hang on to the conservative manner. This is what we know, what someone, everyone, accepts.” (Interview 17, researcher, Health Sciences)

Everyone is trained with NHST and considers it an accessible and easy-to-interpret method. The familiarity and perceived simplicity of NHST, user-friendly software such as SPSS, and the clear cut-off value for significance are important facilitators for the use of NHST and, at the same time, barriers to start using alternative methods. Applied researchers stressed the importance of the accessibility of NHST as a method to test hypotheses and draw conclusions. This accessibility also justifies the use of NHST when researchers want to communicate their study results and messages in understandable ways to their readership.
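The conventional NHST recipe that participants describe as accessible (compute a p-value, compare it with 0.05) can be sketched with invented data. The snippet below is illustrative only, not from the study; it uses a simple two-sided z-test with an assumed known standard deviation, so it stays self-contained:

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical measurements and null value (both invented for illustration).
sample = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.2, 5.5]
mu0, sigma = 5.0, 0.3  # null-hypothesis mean; sd assumed known

# The standard NHST recipe: test statistic -> p-value -> 0.05 cut-off.
z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

print(f"z = {z:.2f}, p = {p:.3f}")
print("significant" if p < 0.05 else "not significant")
```

The mechanical simplicity of the last two lines is exactly the accessibility (and, critics would say, the trap) that participants describe: the dichotomous verdict hides everything about effect size and practical relevance.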

P: “It is harder, also to explain, to use an alternative. So, I think, but maybe I’m overstepping, but if you want to go in that direction [alternative methods] it needs to be better facilitated for researchers. Because at the moment… I did some research, but, you know, there are those uncommon statistical packages.” (Interview 16, researcher/editor, Medical Sciences)

1.3 Publication pressure. Most researchers mentioned that they perceive publication pressure. This motivates them to use NHST and hope for significant results, as ‘significant p-values’ increase publication chances. They perceive a high workload and the way the scientific reward system is constructed as barriers for behavioral change pertaining to the use of statistical methods; potential negative consequences for publication and career chances prevent researchers from deviating from (un)written rules.

P: “I would like to learn it [alternative methods], but it might very well be that I will not be able to apply it, because I will not get my paper published. I find that quite tricky.” (Interview 1, Assistant Professor, Health Sciences)

2. The scientific duty

Throughout the interviews, participants reported a sense of duty in several variations. “What does it mean to be a scientific researcher?” seemed to be a question that was reflected upon during rather than prior to the interview, suggesting that many scientists had not really thought about the moral and professional obligations of being a scientist in general—let alone what that would mean for their use of NHST. Once they had given it some thought, the opinions concerning what constitutes the scientific duty varied to a large extent. Some participants attached great importance to issues such as reproducibility and transparency in scientific research and continuing education and training for researchers. For others, these topics seemed to play a less important role. A distinction was made between moral and professional obligations that participants described concerning their scientific duty.

2.1 Moral obligation. The moral obligation concerns issues such as doing research in a thorough and honest way, refraining from questionable research practices (QRPs) and investing in better research. It concerns tasks and activities that are not often rewarded or acknowledged.

Throughout the interviews and the focus groups, participants very frequently touched upon the responsibility they felt for doing ‘the right thing’ and making the right choice in doing research and using NHST, in particular. The extent to which they felt responsible varied among participants. When it comes to choices during doing research—for example, drawing conclusions from data—participants felt a strong sense of responsibility to do this correctly. However, when it comes to innovation and new practices, and feeling responsible for your own research, let alone improving scientific practice in general, opinions differed. This quotation from one of the focus groups illustrates that:

P1: “If you people [statisticians, methodologists] want me to improve the statistics I use in my research, then you have to hand it to me. I am not going to make any effort to improve that myself.” P3: “No. It is your responsibility as an academic to keep growing and learning and so, also to start familiarizing yourself when you notice that your statistics might need improvement.” (Focus group 2, participant 1 (PhD researcher, Medical Sciences) and participant 3 (Associate Professor, Health Sciences))

The sense of responsibility for improving research practices regarding the use of NHST was strongly felt and emphasized by a small group of participants. They emphasized the responsibility of the researcher to think, interpret and be critical when interpreting the p -value in NHST. It was felt that you cannot leave that up to the reader. Moreover, scrutinizing and reflecting upon research results was considered a primary responsibility of a scientist, and failing to do so, as not living up to what your job demands you to do:

P: “Yes, and if I want to be very provocative—and I often want that, because then people tend to wake up and react: then I say that hiding behind alpha.05 is just scientific laziness. Actually, it is worse: it is scientific cowardice. I would even say it is ‘relieving yourself from your duty’, but that may sound a bit harsh…” (Interview 2, Professor, Health Sciences)

These participants were convinced that scientists have a duty to keep scientific practice in general at the highest level possible.

The avoidance of questionable research practices (QRPs) was considered a means to keep scientific practices at a high level and was often touched upon during the interviews and focus groups as being part of the scientific duty. Statisticians saw NHST as directly facilitating QRPs and provided ample examples of how the use of NHST leads to QRPs, whereas most applied researchers perceived NHST as the common way of doing research and were not aware of the risks related to QRPs. Participants did mention the violation of assumptions underlying NHST as being a QRP. Participants also considered overinterpreting results a QRP, including exaggerating the degree of significance. Although participants stated they were careful about interpreting and reporting p-values, they ‘admitted’ that statistical significance was a starting point for them. Most researchers indicated they search for information that could get their study published, which usually includes a low p-value (this also relates to the theme ‘Scientific climate’).

P: “We all know that a lot of weight is given to the p-value. So, if it is not significant, then that’s the end of it. If it ís significant, it just begins.” (Interview 5, lecturer, Psychology)

The term ‘sloppy science’ was mentioned in relation to efforts by researchers to reduce the p -value (a.k.a. p-hacking, data dredging, and HARKing; HARKing, an acronym for Hypothesizing After the Results are Known, is the questionable research practice of formulating a hypothesis after the data have been collected and analyzed, but presenting it as an a priori hypothesis [ 29 ]). Preregistration and replication were mentioned as promising solutions for some of the problems caused by NHST.
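Why practices such as p-hacking inflate false-positive findings can be illustrated with a short calculation (not from the paper): if k independent tests are run on data where the null hypothesis is in fact true, the chance that at least one comes out ‘significant’ at the conventional 0.05 level grows rapidly with k:

```python
# Family-wise false-positive rate: probability that at least one of
# k independent tests on true-null data yields p < alpha by chance.
alpha = 0.05
fwer = {k: 1 - (1 - alpha) ** k for k in (1, 5, 20, 100)}

for k, p_any in fwer.items():
    print(f"{k:3d} tests -> P(at least one p < 0.05) = {p_any:.2f}")
# 1 test: 0.05; 5 tests: ~0.23; 20 tests: ~0.64; 100 tests: ~0.99
```

With 20 exploratory tests, reporting only the significant one gives roughly a 64% chance of a spurious ‘finding’, which is why preregistration (fixing the hypotheses and analyses in advance) is mentioned as a remedy.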

2.2 Professional obligation. The theme professional obligation reflects participants’ expressions about what methodological knowledge scientists should have about NHST. In contrast to moral obligations, there appeared to be some consensus about scientists’ professional obligations. Participants considered critical evaluation of research results a core professional obligation. Also, within all the stakeholder groups, participants agreed that sufficient statistical knowledge is required for using NHST, but they varied in their insight into the principles, potential and limitations of NHST. This also applied to the extent to which participants were aware of the current debate about NHST.

Participants considered critical thinking a requirement for fulfilling their professional obligation. It specifically refers to the process of interpreting outcomes and taking all relevant contextual information into consideration. Critical thinking was not only literally referred to by participants, but also emerged from interpreting text fragments on the emphasis within their research. Researchers differed quite strongly in where the emphasis of their research outcomes should be put and what kind of information is required when reporting study results. Participants mentioned the proven effectiveness of a particular treatment, giving a summary of the research results, effect sizes, clinical relevance, p-values, or whether one has made a considerable contribution to science or society.

P: “I come back to the point where I said that people find it arbitrary to state that two points difference on a particular scale is relevant. They prefer to hide behind an alpha of 0.05, as if it is a God given truth, that it counts for one and for all. But it is just as well an invented concept and an invented guideline, an invented cut-off value, that isn’t more objective than other methods?” (Interview 2, Professor, Health Sciences)

For some participants, especially those representing funding agencies, critical thinking was primarily seen as a prerequisite for the utility of the research. The focus, when formulating the research question and interpreting the results, should be on practical relevance and the contribution the research makes to society.

The term ‘ignorance’ arose in the context of participants’ concern about the level of statistical knowledge scientists and other stakeholders have, versus the knowledge they should have to adequately apply statistical analyses in their research. The more statistically competent respondents in the sample felt quite strongly about how problematic the lack of knowledge about NHST is among those who regularly use it in their research, let alone the lack of knowledge about alternative methods. They felt that regularly retraining yourself in research methods is an essential part of one’s professional obligation. Applied researchers in the sample agreed that a certain level of background knowledge on NHST was required to apply it properly to research and acknowledged their own ignorance. However, they had different opinions about what level of knowledge is required. Moreover, not all of them regarded being informed about all the ins and outs of NHST as part of their scientific duty. Some saw it as the responsibility of statisticians to actively inform them (see also the subtheme periphery). Some participants were not aware of their ignorance, or stated that some of their colleagues are not aware of their ignorance, i.e., that they are unconsciously incompetent and, without realizing it, poorly understand what the p-value and associated outcome measures actually mean.

P: “The worst, and I honestly think that this is the most common, is unconsciously incompetent, people don’t even understand that…” I: “Ignorance.” P: “Yes, but worse, ignorant and not even knowing you are ignorant.” (Interview 2, Professor, Health Sciences)

The lack of proper knowledge about statistical procedures was especially prevalent in the medical sciences. Participants working in or with the medical sciences all confirmed that there is little room for proper statistical training for medical students and that the level of knowledge is fairly low. NHST is often used because of its simplicity. It is especially attractive for medical PhD students because they need their PhD to get ahead in their medical career instead of pursuing a scientific career.

P: “I am not familiar with other ways of doing research. I would really like to learn, but I do not know where I could go. And I do not know whether there are better ways. So sometimes I do read studies of which I think: ‘this is something I could investigate with a completely different test. Apparently, this is also possible, but I don’t know how.’ Yes, there are courses, but I do not know what they are. And here in the medical center, a lot of research is done by medical doctors and these people have hardly been taught any statistics. Maybe they will get one or two statistics courses, they know how to do a t-test and that is about it. (…) And the courses have a very low level of statistics, so to say.” (Interview 1, Assistant Professor, Health Sciences)

Also, the term ‘awareness’ arose. Firstly, it refers to being conscious of the limitations of NHST. Secondly, it refers to awareness of the ongoing discussions about NHST and, more broadly, about the replication crisis. The statisticians in the sample emphasized the importance of knowing that NHST has limitations and that it cannot be considered the holy grail of data analysis. They also emphasized the importance of being aware of the debate. A certain level of awareness was considered a necessary requirement for critical thinking. There was variation in that awareness: some participants were quite informed and fairly engaged in the discussion, whereas others were very new to the discussion and to larger contextual factors, such as the replication crisis.

I: “Are you aware of the debate going on in academia on this topic [NHST]?” P: “No, I occasionally see some article sent by a colleague passing by. I have the idea that something is going on, but I do not know how the debate is conducted and how advanced it is.” (Interview 6, lecturer, Psychology)

With respect to the theme ‘the scientific duty’, participants differed in the extent to which they felt responsible for better and open science, for pioneering, for reviewing, and for growing and learning as a scientist. Participants had one commonality: although they strived for adherence to the norms of good research, the prevailing feeling is that this is very difficult, due to the scientific climate. Consequently, participants perceive an internal conflict: a discrepancy between what they want or believe and what they do. Participants often found themselves struggling with the responsibility they felt they had. Making the scientifically most solid choice was often difficult due to feasibility, time constraints, or certain expectations from supervisors (this is also directly related to the themes ‘Scientific climate’ and ‘Reactivity’). Thus, the scientific climate strongly influences the behavior of scientists regarding how they set their priorities and fulfill their scientific duties. The strong sense of scientific duty was perceived by some participants as a facilitator and by others as a barrier to the use of alternative methods.

3. Reactivity

A consequence of the foregoing factors is that most stakeholders have adopted a reactive attitude and behave accordingly. People are disinclined to take responsibility and await external signals and initiatives of others. This might explain why NHST is being continuously used and remains the default procedure to make inferences about population effects.

The core theme ‘reactivity’ can be explained by the following subthemes and categories:

3.1 Periphery. The NHST problem resides in the periphery in several ways. First, it is a subject that is not given much priority. Second, some applied researchers and editors believe that methodological knowledge, as it is not their field of expertise, should not be part of their job requirement. This also applies to the NHST debate. Third, and partly related to the second point, there is a lack of cooperation within and between disciplines.

The term ‘priority’ was mentioned often when participants were asked to what extent the topic of NHST was a subject of discussion in their working environment. Participants indicated that (too) little priority is given to statistics and the problems related to the subject. There is simply a lot going on in their research field and daily work, so there are always more important or urgent issues on the agenda.

P: “Discussions take place in the periphery; many people find it complicated. Or are just a little too busy.” (Interview 5, lecturer, Psychology)

As the NHST debate is not prioritized, initiatives with respect to this issue are not forthcoming. Moreover, researchers and lecturers claim there is neither time nor money available for training in statistics in general or acquiring more insight and skills with respect to (the use of) alternative methods. Busy working schedules were mentioned as an important barrier for improving statistical knowledge and skills.

P: “Well you can use your time once, so it is an issue low on the priority list.” (Focus group 5, researcher, Medical Sciences)

The NHST debate is perceived as the domain of statisticians and methodologists. Also, cooperation between different domains and domain-specific experts is perceived as complicated, as different perceptions and ways of thinking can clash. Therefore, some participants feel that separate worlds should be kept separate; put another way: stick to what you know!

P: “This part is not our job. The editorial staff, we have the assignment to ensure that it is properly written down. But the discussion about that [alternatives], that is outside our territory.” (Interview 26, editor, Medical Sciences)

Within disciplines, individuals tend to act on their own, not being aware that others are working on the same subject and that it would be worthwhile to join forces. The interviews and focus groups revealed that a modest number of participants actively try to change the current situation but, in doing so, feel like lone voices in the wilderness.

P1: “I mean, you become a lone voice in the wilderness.” P2: “Indeed, you don’t want that.” P1: “I get it, but no one listens. There is no audience.” (Focus Group 3, P1: MD, lecturer, Medical Sciences, P2: editor, Medical Sciences)

To succeed in bringing about positive change, participants emphasized that it is essential that people cooperate across disciplines and join forces, rather than operating individually and focusing solely on their own working environments.

The caution people show with respect to taking initiative is reinforced by the fear of encountering resistance from their working environment when one voices that change regarding the use of NHST is needed. A condition mentioned as essential to bring about change was tactical implementation, that is, taking very small steps. As everyone is still using NHST, taking big steps brings the risk of losing especially the more conservative people along the way. Also, adjusting policy, guidelines and educational programs are processes that require time and scope.

P: “Everyone still uses it, so I think we have to be more critical, and I think we have to look at some kind of culture change, which means that we are going to let go of it (NHST) more and we will also use other tests, which in the long term will overthrow NHST.” I: “and what about alternatives?” P: “I think you should never be too fanatic in those discussions, because then you will provoke resistance. (…) That is not how it works in communication. You will touch them on a sore spot, and they will think: ‘and who are you?’” I: “and what works?” P: “well, gradualness. Tell them to use NHST, do not burn it to the ground; you do not want to touch people’s work, because it is close to their hearts. Instead, you say: ‘try to do another test next to NHST’. Be a pioneer yourself.” (Interview 5, lecturer, Psychology)

3.2. Efficacy. Most participants stated they feel they are not in a position to initiate change. On the one hand, this feeling is related to their hierarchical positions within their working environments. On the other hand, it is caused by the fact that statistics is perceived as a very complex field of expertise, and people feel they lack sufficient knowledge and skills, especially regarding alternative methods.

Many participants stated they felt little sense of empowerment, or self-efficacy. The academic system is perceived as hierarchical, with an unequal balance of power. Most participants believe that it is not in their power to take the lead in innovative actions or to stand up against the establishment, and think that this responsibility lies with other stakeholders who have more status.

P: “Ideally, there would be a kind of an emergency letter from several people whose names open up doors, in which they indicate that in the medical sciences we are throwing away money because research is not being interpreted properly. Well, if these people that we listen to send such an emergency letter to the board of The Netherlands Organization for Health Research and Development [the largest Dutch funding agency for innovation and research in healthcare], I can imagine that this will initiate a discussion.” (…) I: “and with a big name you mean someone from within the science system?” P: “well, you know, ideally a chairman, or chairmen of the academic medical center. At that level. If they would put a letter together, yes, that of course would have way more impact. Or some prominent medical doctors, yes, that would have more impact than if some other person would send a letter, yes.” (Interview 19, representative from funding agency, Physical Sciences)

Some participants indicated that they did try to make a difference but encountered too much resistance and therefore gave up their efforts. PhD students feel they have insufficient power to choose their own directions and make their own choices.

P: “I am dependent on funding agencies and professors. In the end, I will write a grant application in the direction that gives me the greatest chance of eventually receiving that grant, not primarily the research that I think is the most optimal. (…) If I know that reviewers believe the p-value is very important, well, of course I write down a method in which the p-value is central.” (Focus group 2, PhD student, Medical Sciences)

With a sense of imperturbability, most participants accept that they cannot really change anything.

Lastly, the complexity of the subject is an obstacle for behavioral change. Statistics is perceived as a difficult subject. Participants indicate that they have a lack of knowledge and skills and that they are unsure about their own abilities. This applies to the ‘standard’ statistical methods (NHST), but to a greater extent to alternative methods. Many participants feel that they do not have the capacity to pursue a true understanding of (alternative) statistical methods.

P: “Statistics is just very hard. Time and again, research demonstrates that scientists, even the smartest, have a hard time with statistics.” (Focus group 3, PhD researcher, Psychology)

3.3. Interdependency. As mentioned, participants feel they are not in a sufficiently strong position to take initiative or to behave in an anti-establishment manner. Therefore, they await external signals from people within the scientific system with more status, power, or knowledge. These can be people within their own stakeholder group or from other stakeholder groups. As a consequence of this attitude, a situation arises in which people’s actions largely depend on others. That is, a complex state of interdependency evolves: scientists argue that if the reward system does not change, they are not able to alter their statistical behavior. According to researchers, editors and funding agencies are still very much focused on NHST and especially (significant) p-values, and thus, scientists wait for editors and funders to adjust their policies regarding statistics:

P: “I wrote an article and submitted it to an internal medicine journal. I only mentioned confidence intervals. Then I was asked to also write down the p-values. So, I had to do that. This is how they [editors] can use their power. They decide.” (Interview 1, Assistant Professor, Health Sciences)

Editors and funders in their turn claim they do not maintain a strict policy. Their main position is that scientists should reach consensus about the best statistical procedure, and they will then adjust their policy and guidelines.

P: “We actually believe that the research field itself should direct the quality of its research, and thus, also the discussions.” (Interview 22, representative from funding agency, Neurosciences)

Lecturers, for their part, argue that they cannot revise their educational programs because the academic system and university policies are adapted to NHST and p-values.

As most participants seem not to be aware of this process, a circle of interdependency arises that is difficult to break.

P: “Yes, the stupid thing about this perpetual circle is that you are educating people, let’s say in the department of cardiology. They must of course grow, and so they need to publish. If you want to publish you must meet the norms and values of the cardiology journals, so they will write down all those p-values. These people are trained and in twenty years they are on the editorial board of those journals, and then you never get rid of it [the p-value].” (Interview 18, Professor, editor, Medical Sciences)

3.4. Degree of eagerness. Exerting certain behavior or behavioral change is (partly) determined by the extent to which people want to employ particular behavior, their behavioral intention [ 22 ]. Some participants indicated they are willing to change their behavior regarding the use of statistical methods, but only if it is absolutely necessary or imposed, or if they think that the current conventions have too many negative consequences. Thus, true, intrinsic willpower to change behavior is lacking among these participants. Instead, they have a rather opportunistic attitude, meaning that their behavior is mostly driven by circumstances, not by principles.

P: “If tomorrow an alternative is offered by the people that make that call, then I will move along. But I am not the one calling the shots on this issue.” (Interview 26, editor, Medical Sciences)

In addition, pragmatism often outweighs the perceived urgency to change. Participants argue they ‘just want to do their jobs’ and mainly consider the practical consequences of their actions. This attitude creates a certain degree of inertia. Although participants claim they are willing to change their behavior, doing so would involve much more than ‘doing their jobs’, and thus, in the end, the NHST debate remains a matter of ‘coffee talk’. People are open to discussion, but when it comes to acting (and motivating others to do so), no one takes the first step.

P: “The endless analysis of your data to get something with a p-value less than 0.05… There are people that are more critical about that, and there are people that are less critical. But that is a subject for during the coffee break.” (Interview 18, professor, editor, Medical Sciences)

The goal of our study was to acquire in-depth insight into the reasons why so many stakeholders in the scientific system keep using NHST as the default method to draw conclusions, despite its many well-documented drawbacks. Furthermore, we wanted to gain insight into the reasons for their reluctance to apply alternative methods. Using a theoretical framework [ 20 , 21 ], several interrelated facilitators and barriers associated with the use of NHST and alternative methods were identified. The identified factors are subsumed under three main themes: the scientific climate, the scientific duty and reactivity. The scientific climate is dominated by conventions, behavioral rules, and beliefs, of which the use of NHST and p-values is part. At the same time, stakeholders feel they have a (moral or professional) duty. For many participants, these two sides of the same coin are incompatible, leading to internal conflicts. There is a discrepancy between what participants want and what they do. As a result of these factors, the majority feels dependent on others and has thereby become reactive. Most participants are not inclined to take responsibility themselves but await action and initiatives from others. This may explain why NHST is still the standard and used by almost everyone involved.

The current study is closely related to the longstanding debate regarding NHST, which has recently intensified to a level not seen before. In 2015, the editors of the journal ‘Basic and Applied Social Psychology’ (BASP) prohibited the use of NHST (and p-values and confidence intervals) [ 30 ]. Subsequently, in 2016, the American Statistical Association published the so-called ‘Statement on p-values’ in the American Statistician. This statement consists of critical standpoints regarding the use of NHST and p-values and warns against the abuse of the procedure. In 2019, the American Statistician devoted an entire edition to the implementation of reforms regarding the use of NHST; in more than forty articles, scientists debated statistical significance, advocated embracing uncertainty, and suggested alternatives such as the use of s-values, False Positive Risks, reporting results as effect sizes and confidence intervals, and more holistic approaches to p-values and outcome measures [ 31 ]. In addition, in the same year, several articles appeared in which an appeal was made to stop using statistical significance testing [ 32 , 33 ]. A number of counter-reactions were published [ 34 – 36 ], stating, among other things, that banning statistical significance and, with that, abandoning clear rules for statistical analyses may create new problems with regard to statistical interpretation, study interpretation and objectivity. Also, some methodologists expressed the view that under certain circumstances the use of NHST and p-values is not problematic and can in fact provide useful answers [ 37 ]. Until recently, the NHST debate was limited mainly to methodologists and statisticians. However, a growing number of scientists are getting involved in this lively debate and believe that a paradigm shift is desirable or even necessary.

The aforementioned publications have constructively contributed to this debate. In fact, since the publication of the special edition of the American Statistician, numerous scientific journals published editorials or revised, to a greater or lesser extent, their author guidelines [ 38 – 45 ]. Furthermore, following the American Statistical Association (ASA), the National Institute of Statistical Sciences (NISS) in the United States has also taken up the reform issue. However, real changes are still barely visible. It takes a long time before these kinds of initiatives translate into behavioral changes, and the widespread adoption by most of the scientific community is still far from accomplished. Debate alone will not lead to real changes, and therefore, our efforts to elucidate behavioral barriers and facilitators could provide a framework for potential effective initiatives that could be taken to reduce the default use of NHST. In fact, the debate could counteract behavioral change. If there is no consensus among statisticians and methodologists (the innovators), changing behavior cannot be expected from stakeholders with less statistical and methodological expertise. In other words, without agreement among innovators, early adopters might be reluctant to adopt the innovation.

Research has recently been conducted to explore the potential of behavioral change to improve open science behaviors. The adoption of open science practices has increased in recent years, but uptake has been slow, due to firm barriers such as a lack of awareness about the subject, concerns about constraints on the creative process, worries about being “scooped” and holding on to existing working practices [ 46 ]. These developments regarding open science practices, and the parallels this line of research shows with the current study, might help support behavioral change regarding the use of statistical methods.

The described obstacles to behavioral change are related to features of both the ‘innovative idea’ and its potential adopters. First, there are characteristics of ‘the innovation’ that form barriers. The first barrier is the complexity of the innovation: most participants perceive alternative methods as difficult to understand and use. A second barrier concerns the feasibility of trying the innovation: most people do not feel flexible about trying out or experimenting with the new idea. There is a lack of time and monetary resources to get acquainted with alternative methods (for example, by following a course). Also, the possible negative consequences of using alternatives (lower publication chances, the chance that the statistical method and message are too complicated for one’s readership) are holding people back from experimenting with them. And lastly, it is unclear to most participants what the visibility of the results of the new idea is. Up until now, the debate has mainly taken place among a small group of statisticians and methodologists. Many researchers are still not aware of the NHST debate and the idea of shifting away from NHST and using alternative methods instead. Therefore, the question is how easily the benefits of the innovation can be made visible to a larger part of the scientific community. Thus, our study shows that, although the innovation is largely compatible with existing values (participants are critical about (the use of) NHST and the p-value and believe that there are better alternatives), important attributes of the innovative idea negatively affect the rate of adoption and consequently the diffusion of the innovation.

Due to the barriers mentioned above, most stakeholders do not have the intention to change their behavior and adopt the innovative idea. From the theory of planned behavior [ 21 ], it is known that behavioral intentions directly relate to performances of behaviors. The strength of the intention is shaped by attitudes, subjective norms, and perceived power. If people evaluate the suggested behavior as positive (attitude), and if they think others want them to perform the behavior (subjective norm), this leads to a stronger intention to perform that behavior. When an individual also perceives they have enough control over the behavior, they are likely to perform it. Although most participants have a positive attitude towards the behavior, or the innovative idea at stake, many participants think that others in their working environment believe that they should not perform the behavior—i.e., they do not approve of the use of alternative methods (social normative pressure). This is expressed, for example, in lower publication chances, negative judgements by supervisors or failing the requirements that are imposed by funding agencies. Thus, the perception about a particular behavior—the use of alternative methods—is negatively influenced by the (perceived) judgment of others. Moreover, we found that many participants have a low self-efficacy, meaning that there is a perceived lack of behavioral control, i.e., their perceived ability to engage in the behavior at issue is low. Also, participants feel a lack of authority (in the sense of knowledge and skills, but also power) to initiate behavioral change. The existing subjective norms and perceived behavioral control, and the negative attitudes towards performing the behavior, lead to a lower behavioral intention, and, ultimately, a lower chance of the performance of the actual behavior.

Several participants mentioned there is a need for people of stature (belonging to the group of early adopters) to take the lead and break down perceived barriers. Early adopters serve as role models and have opinion leadership, and form the next group (after the innovators, in this case statisticians and methodologists) to adopt an innovative idea [ 20 ] ( Fig 2 ). If early adopters would stand up, conveying a positive attitude towards the innovation, breaking down the described perceived barriers and facilitating the use of alternatives (for example by adjusting policy, guidelines and educational programs and making available financial resources for further training), this could positively affect the perceived social norms and self-efficacy of the early and late majority and ultimately laggards, which could ultimately lead to behavioral change among all stakeholders within the scientific community.

A strength of our study is that it is the first empirical study on views on the use of NHST, its alternatives, and the reasons for the prevailing use of NHST. Another strength is the method of coding, which corresponds to the thematic approach of Braun & Clarke [ 47 ] and allows the researcher to move beyond merely categorizing and coding the data to analyzing how the codes relate to each other [ 47 ]. It provides a rich description of what is studied, linked to theory, while also generating new hypotheses. Moreover, two independent researchers coded all transcripts, which adds to the credibility of the study. All findings and the coding scheme were discussed by the two researchers until consensus was reached. Also, interview results were further explored, enriched and validated by means of (mixed) focus groups. Important themes that emanated from the interviews, such as interdependency, perceptions of the scientific duty, perceived disadvantages of alternatives and the consequences of the current scientific climate, served as starting points and main subjects of the focus groups. This set-up provided more data, more insight into the data, and validation of the data. Lastly, the use of a theoretical framework [ 20 , 21 ] to develop the topic list, guide the interviews and focus groups, and guide their analysis is a strength, as it provides structure to the analysis and substantiation of the results.

A limitation of this study is its sampling method. Because we used the network of members of the project group, and because a relatively high proportion of those invited to participate declined on the grounds that they knew too little about the subject to contribute, our sample was biased towards participants who are (somewhat) aware of the NHST debate. Our sample may also consist of people who are relatively critical of the use of NHST compared to the total population of researchers. It was not easy to include participants who were indifferent about or in favor of NHST, as those were presumably less willing to make time to participate in this study. Even in our sample, we found that the majority of participants solely used NHST and perceived it as difficult, if not impossible, to change their behavior. These perceptions are thus probably even stronger in the target population. Another limitation, inherent to qualitative research, is the risk of interviewer bias. Respondents may be unable, unwilling, or afraid to answer questions candidly, and instead provide socially desirable answers. In the context of our research, people are aware that, especially as a scientist, it does not look good to be conservative, complacent, or ignorant, or not to be open to innovation and new ideas. Therefore, some participants might have given a too favorable view of themselves. Interviewer bias can also work in the other direction, when the values and expectations of the interviewer consciously or unconsciously influence the answers of the respondents. Although we have tried to be as neutral and objective as possible in asking questions and interpreting answers, we cannot rule out the chance that our views and opinions on the use of NHST have at times steered the respondents somewhat, potentially eliciting the aforementioned socially desirable answers.

Generalizability is a topic that is often debated in qualitative research methodology. Many researchers do not consider generalizability the purpose of qualitative research, seeking instead in-depth insights and explanations. However, this is an unjustified simplification, as generalizing findings from qualitative research is possible. Three types of generalization in qualitative research have been described: representational generalization (whether what is found in a sample can be generalized to the parent population of the sample), inferential generalization (whether findings from the study can be generalized to other settings), and theoretical generalization (where one draws theoretical statements from the findings of the study for more general application) [ 48 ]. The extent to which our results are generalizable is uncertain, as we used a theoretical sampling method and our study was conducted exclusively in the Netherlands. We expect that the generic themes (reactivity, the scientific duty and the scientific climate) are applicable to academia in many countries across the world (inferential generalization). However, some elements, such as the Dutch educational system, will differ to a greater or lesser extent from other countries (and thus can only be representationally generalized). In the Netherlands there is, for example, only one educational route after secondary school that has an academic orientation (scientific education, equivalent to US university-level education). This route consists of a bachelor’s program (typically 3 years) and a master’s program (typically 1, 2 or 3 years). Not every study program contains (compulsory) statistics courses, and statistics courses differ in depth and difficulty depending on the study program. Thus, not all of the results will hold for other parts of the world, and further investigation is required.

Our findings demonstrate how perceived barriers to shifting away from NHST set a high threshold for behavioral change and create a circle of interdependency. Behavioral change is a complex process. As ‘the stronger the intention to engage in a behavior, the more likely should be its performance’ [ 21 ], further research on this subject should focus on how to influence the intention of behavior; i.e., which perceived barriers to the use of alternatives are most promising to break down in order to increase the intention for behavioral change. The present study shows that negative normative beliefs and a lack of perceived behavioral control regarding the innovation among individuals in the scientific system are a substantial problem. When social norms change in favor of the innovation, and control over the behavior increases, the behavioral intention becomes a sufficient predictor of behavior [ 49 ]. An important follow-up question will therefore be: how can people be enthused and empowered to ultimately take up the use of alternative methods instead of NHST? Answering this question can, in the long run, lead to the diffusion of the innovation through the scientific system as a whole.

NHST has been the leading paradigm for many decades and is deeply rooted in our science system, despite longstanding criticism. The aim of this study was to gain insight into why we continue to use NHST. Our findings have demonstrated how perceived barriers to shifting away from NHST set a high threshold for actual behavioral change and create a circle of interdependency between stakeholders in the scientific system. Consequently, people find themselves in a state of reactivity, which limits behavioral change with respect to the use of NHST. The next step would be to gain more insight into ways to effectively remove barriers and thereby increase the intention to take a step back from NHST. A paradigm shift within a couple of years is not realistic. However, we believe that by taking small steps, one at a time, it is possible to decrease the scientific community’s strong dependence on NHST and p-values.

Supporting information

S1 Appendix

Acknowledgments

The authors are grateful to Anja de Kruif for her contribution to the design of the study and for moderating one of the focus groups.

Funding Statement

This research was funded by the NWO (Nederlandse Organisatie voor Wetenschappelijk Onderzoek; Dutch Organization for Scientific Research) ( https://www.nwo.nl/ ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability

Enago Academy

Quick Guide to Biostatistics in Clinical Research: Hypothesis Testing


In this article series, we will look at some of the important concepts of biostatistics in clinical trials and clinical research, both of which rely heavily on statistics to analyze quantitative data. Clinical trials proceed through many phases, and Contract Research Organizations (CROs) can be hired to conduct them. Clinical trials are an important step in deciding whether a treatment can be safely and effectively used in medical practice. Once the clinical trial phases are completed, biostatistics is used to analyze the results.

Research generally proceeds in an orderly fashion as shown below.

Research Process

Once you have identified the research question you need to answer, it is time to frame a good hypothesis. The hypothesis is the starting point for biostatistics and is usually based on a theory. Experiments are then designed to test the hypothesis. What is a hypothesis? A research hypothesis is a statement describing a relationship between two or more variables that can be tested. A good hypothesis is clear, specific, objective, relevant to the research question, and free of moral judgments. Above all, a hypothesis must be testable.

A simple hypothesis would contain one predictor and one outcome variable. For instance, if your hypothesis was, “Chocolate consumption is linked to type II diabetes” the predictor would be whether or not a person eats chocolate and the outcome would be developing type II diabetes. A good hypothesis would also be specific. This means that it should be clear which subjects and research methodology will be used to test the hypothesis. An example of a specific hypothesis would be, “Adults who consume more than 20 grams of milk chocolate per day, as measured by a questionnaire over the course of 12 months, are more likely to develop type II diabetes than adults who consume less than 10 grams of milk chocolate per day.”

Null and Alternative Hypothesis

In statistics, the null hypothesis (H 0 ) states that there is no relationship between the predictor and the outcome variable in the population being studied. For instance, “There is no relationship between a family history of depression and the probability that a person will attempt suicide.” The alternative hypothesis (H 1 ) states that there is a relationship between the predictor (a history of depression) and the outcome (attempted suicide). It is impossible to prove a statement by making several observations, but it is possible to disprove a statement with a single observation. If you only ever saw red tulips, that would not prove that no other colors exist. However, seeing a single tulip that was not red would immediately prove that the statement “All tulips are red” is false. This is why statistics tests the null hypothesis, and why the alternative hypothesis cannot be tested directly.

The alternative hypothesis proposed in medical research may be one-tailed or two-tailed. A one-tailed alternative hypothesis would predict the direction of the effect. Clinical studies may have an alternative hypothesis that patients taking the study drug will have a lower cholesterol level than those taking a placebo. This is an example of a one-tailed hypothesis. A two-tailed alternative hypothesis would only state that there is an association without specifying a direction. An example would be, “Patients who take the study drug will have a significantly different cholesterol level than those patients taking a placebo”. The alternative hypothesis does not state if that level will be higher or lower in those taking the placebo.
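The one-tailed/two-tailed distinction above maps directly onto the `alternative` argument of SciPy's independent-samples t-test. Here is a minimal sketch using invented cholesterol values (hypothetical numbers, not data from any real trial):

```python
# Hypothetical final cholesterol levels (mg/dL) for two groups.
from scipy import stats

drug    = [182, 175, 190, 168, 177, 185, 172, 180]
placebo = [198, 205, 188, 210, 195, 202, 191, 199]

# Two-tailed: H1 says the group means differ, in either direction.
t_two, p_two = stats.ttest_ind(drug, placebo)

# One-tailed: H1 says the drug group's mean is LOWER than placebo's.
t_one, p_one = stats.ttest_ind(drug, placebo, alternative="less")

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```

When the test statistic falls in the predicted direction, the one-tailed p-value is exactly half the two-tailed one, which is why the direction of a one-tailed hypothesis must be specified before seeing the data.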

The P-Value Approach to Test Hypothesis

Once the hypothesis has been designed, statistical tests help you decide whether to reject or fail to reject the null hypothesis. Statistical tests determine the p-value associated with the research data. The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis (H 0 ) is true. You must reject the null hypothesis if the p-value of the data falls below the predetermined level of statistical significance. Usually, the level of statistical significance is set at 0.05. If the p-value is less than 0.05, then you would reject the null hypothesis, which states that there is no relationship between the predictor and the outcome in the sample population.

However, if the p-value is greater than the predetermined level of significance, then there is no statistically significant association between the predictor and the outcome variable. This does not mean that there is no association between the predictor and the outcome in the population. It only means that the observed association could plausibly have arisen by random chance.
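The decision rule described above can be sketched as a tiny helper function (the name `decide` and the default significance level of 0.05 are illustrative choices, not part of any standard library):

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the conclusion of a significance test at level alpha."""
    # Reject H0 only when p falls below the predetermined threshold.
    if p_value < alpha:
        return "reject H0"
    # Otherwise we "fail to reject" H0 -- we do not "accept" it,
    # because a non-significant result does not prove the null.
    return "fail to reject H0"

print(decide(0.03))  # prints "reject H0"
print(decide(0.08))  # prints "fail to reject H0"
```

Note the asymmetric wording: the function never returns "accept H0", mirroring the point that statistics can only disprove, not prove, the null hypothesis.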

For example, null hypothesis (H 0 ): patients who take the study drug after a heart attack have no better chance of avoiding a second heart attack over the next 24 months than patients who do not.

Suppose the data show that those who did not take the study drug were twice as likely to have a second heart attack, with a p-value of 0.08. This p-value would indicate that there is an 8% chance of seeing a result at least this extreme (people on the placebo being twice as likely to have a second heart attack) purely by random chance, if the drug in fact had no effect.
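To see how such a comparison could be tested in practice, here is a sketch using Fisher's exact test on an invented 2x2 table. The counts are hypothetical (the example above does not report actual numbers, so the resulting p-value will not match its 0.08):

```python
# Hypothetical counts, chosen so the placebo group is twice as likely
# (10/50 vs. 5/50) to suffer a second heart attack within 24 months.
from scipy import stats

#                 [second attack, no second attack]
drug_group    = [5, 45]
placebo_group = [10, 40]

odds_ratio, p_value = stats.fisher_exact([drug_group, placebo_group])
print(f"odds ratio = {odds_ratio:.2f}, two-sided p = {p_value:.3f}")
```

With these invented counts the result is not significant at the 0.05 level, illustrating that a doubled risk in a small sample can still be compatible with random chance.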

The hypothesis is not a trivial part of the clinical research process. It is a key element in a good biostatistics plan regardless of the clinical trial phase. There are many other concepts that are important for analyzing data from clinical trials. In our next article in the series, we will examine hypothesis testing for one or many populations, as well as error types.



Miles D. Williams

Visiting Assistant Professor | Denison University

When the Research Hypothesis Is the Null

Posted on May 13, 2024 by Miles Williams in Methods, Statistics


What should you do if your research hypothesis is the null hypothesis? In other words, how should you approach hypothesis testing if your theory predicts no effect between two variables? A coauthor and I are working on a paper where a couple of our proposed hypotheses look like this, and we got some pushback from a reviewer about it. This prompted me to go down a rabbit hole of journal articles and message boards to see how others handle this situation. I quickly found that I had waded into a contentious issue connected to a bigger philosophical debate about the merits of hypothesis testing in general, and about whether the null hypothesis in particular is even logically sound as a benchmark for hypothesis testing.

There’s too much to unpack with this debate for me to cover in a single blog post (and I’m sure I’d get some of the key points wrong anyway if I tried). The main issue I want to explore in this post is the practical problem of how to approach testing a null research hypothesis. From an applied perspective, this is a tricky problem that raises issues with how we calculate and interpret p-values. Thankfully, there is a sound solution for the null research hypothesis which I explore in greater detail below. It’s called a two one-sided test, and it’s easy to implement once you know what it is.

The usual approach

Most of the time, a scientist doing research has a research hypothesis that goes something like X has a positive effect on Y. For example, a political scientist might propose that a get-out-the-vote (GOTV) campaign (X) will increase voter turnout (Y).

The typical approach for testing this claim might be to estimate a regression model with voter turnout as the outcome and the GOTV campaign as the explanatory variable of interest:

Y = α + β X + ε

If the parameter β > 0, this would support the hypothesis that GOTV campaigns improve voter turnout. To test this hypothesis, in practice the researcher would actually test a different hypothesis that we call the null hypothesis. This is the hypothesis that says there is no true effect of GOTV campaigns on voter turnout.

By proposing and testing the null, we now have a point of reference for calculating a measure of uncertainty—that is, the probability of observing an empirical effect of a certain magnitude or greater if the null hypothesis is true. This probability is called a p-value, and by convention if it is less than 0.05 we say that we can reject the null hypothesis.

For the hypothetical regression model proposed above, to get this p-value we'd estimate β, calculate its standard error, and then take the ratio of the former to the latter, giving us what's called a t-statistic or t-value. Under the null hypothesis, the t-value has a known distribution, which makes it easy to map any t-value to a p-value. The figure below illustrates this using a hypothetical data sample of size N = 200. The t-statistic's distribution has a distinct bell shape centered around 0. The range of t-values shown in blue is where, if observed in our empirical data, we'd fail to reject the null hypothesis at the p < 0.05 level; values in gray would lead us to reject the null hypothesis at this same level.
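As a sketch of the computation just described (the post itself works in R; this is an illustrative Python version with made-up data), here is how β, its standard error, and the resulting t-value could be computed for a bivariate regression with a binary treatment:

```python
import math
import random

random.seed(42)
n = 200
# Hypothetical sample: a binary GOTV-style treatment and a turnout-like
# outcome with a modest positive treatment effect baked in.
x = [random.randint(0, 1) for _ in range(n)]
y = [0.5 + 0.1 * xi + random.gauss(0, 0.3) for xi in x]

# OLS estimates for y = a + b*x + e
mx = sum(x) / n
my = sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx

# Residual variance and the standard error of b
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in resid) / (n - 2)
se_b = math.sqrt(s2 / sxx)

t_value = b / se_b  # the ratio described in the text
print(f"beta = {b:.3f}, se = {se_b:.3f}, t = {t_value:.2f}")
```

Comparing `t_value` against the reference distribution then yields the p-value.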

[Figure: the t-distribution under the null hypothesis, with the fail-to-reject region in blue and the rejection regions in gray]

When the null is the research hypothesis we want to test

There’s nothing new or special here. If you have even a basic stats background (particularly with Frequentist statistics), the conventional approach to hypothesis testing is pretty ubiquitous. Things get more tricky when our research hypothesis is that there is no effect. Say for a certain set of theoretical reasons we think that GOTV campaigns are basically useless at increasing voter turnout. If this argument is true, then if we estimate the following regression model, we’d expect β = 0.

The problem here is that our substantive research hypothesis is also the one we want to find evidence against. We could proceed as usual and say that failing to reject the null counts as evidence in support of our theory, but failure to reject the null is not the same thing as finding support for the null hypothesis.

There are a few ideas in the literature for how to approach this instead. Many of them are Bayesian, but most of my research relies on Frequentist statistics, so those approaches were a no-go for me. However, there is one really simple approach that is consistent with the Frequentist paradigm: equivalence testing. The idea is simple: propose some absolute effect size that is of minimal interest, and then test whether the observed effect is different from it. This minimum effect is called the "smallest effect size of interest" (SESOI). I read about the approach in an article by Harms and Lakens (2018) in the Journal of Clinical and Translational Research.

Say, for example, that we deemed a t-value of +/-1.96 (the usual threshold for rejecting the null hypothesis) extreme enough to constitute good evidence of a non-zero effect. We could make the appropriate adjustments to our t-distribution to identify a new range of t-values that would allow us to reject the hypothesis that the effect is non-zero. This is illustrated in the figure below, which shows a range of t-values in the middle for which we could reject the non-zero hypothesis at the p < 0.05 level. This distribution looks inverted relative to the usual null distribution. The reason is that this approach conducts a pair of one-sided tests: we test, and hope to reject, both the hypothesis that β / se(β) ≥ 1.96 and the hypothesis that β / se(β) ≤ -1.96. Rejecting both implies the effect lies within the equivalence bounds. In the Harms and Lakens paper cited above, they call this approach two one-sided tests, or TOST (I'm guessing this is pronounced "toast").

[Figure: the inverted distribution for the two one-sided tests, with the middle range of t-values for which we can reject the non-zero hypothesis]
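A minimal sketch of the two one-sided tests, in Python rather than the post's R, using a normal approximation and the +/-1.96 bound from the example above (the function name and numbers are illustrative, not from the post):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def tost(t_value, bound=1.96, alpha=0.05):
    """Two one-sided tests (TOST) in t-statistic units, normal approximation.
    `bound` is the smallest effect size of interest expressed as a t-value.

    We test two one-sided nulls:
      H0a: the true standardized effect >= +bound
      H0b: the true standardized effect <= -bound
    Rejecting both supports equivalence (an effect inside the bounds)."""
    p_upper = normal_cdf(t_value - bound)      # P(T <= t) if effect = +bound
    p_lower = 1 - normal_cdf(t_value + bound)  # P(T >= t) if effect = -bound
    p_tost = max(p_upper, p_lower)  # reject only if BOTH one-sided tests reject
    return p_tost, p_tost < alpha

# A very small observed t-statistic lets us reject the non-zero alternative...
print(tost(0.10))
# ...while a moderate t-statistic does not.
print(tost(1.50))
```

Note that with a bound of 1.96, the observed t-value must fall roughly within +/-0.315 (that is, 1.96 minus the one-sided 1.645 cutoff) to reject the non-zero alternative, which is why the bar is so high.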

Something to pay attention to with this approach is that the observed t-statistic needs to be very small in absolute magnitude for us to reject the hypothesis of a non-zero effect. This means the bar for testing a null research hypothesis is actually quite high. This is demonstrated using the following simulation in R. Using the {seerrr} package, I had R generate 1,000 random draws (each of size 200) for a pair of variables x and y, where the former is a binary "treatment" and the latter is a random normal "outcome." By design, there is no true causal relationship between these variables. Once I simulated the data, I generated a set of estimates of the effect of x on y for each simulated dataset and collected the results in an object called sim_ests. I then visualized two metrics calculated from the simulated results: (1) the rejection rate for the usual null hypothesis test and (2) the rejection rate for the two one-sided equivalence tests. If we were to test a research null hypothesis the usual way, we'd expect to fail to reject the null about 95% of the time. Conversely, if we were to use the two one-sided equivalence tests, we'd expect to reject the non-zero alternative hypothesis only about 25% of the time. I tried a few additional simulations to see if a larger sample size would improve power (not shown), but no dice.
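A rough Python analogue of this simulation (the post uses R and {seerrr}; this sketch uses a difference-in-means t-statistic and normal cutoffs, so the rates are approximate):

```python
import math
import random

random.seed(2024)

def sim_t_stat(n=200):
    """One simulated dataset: binary treatment x, outcome y with NO true
    effect; returns the t-statistic for the difference in means."""
    x = [i % 2 for i in range(n)]          # balanced binary treatment
    y = [random.gauss(0, 1) for _ in range(n)]
    g1 = [yi for xi, yi in zip(x, y) if xi == 1]
    g0 = [yi for xi, yi in zip(x, y) if xi == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    v1 = sum((v - m1) ** 2 for v in g1) / (len(g1) - 1)
    v0 = sum((v - m0) ** 2 for v in g0) / (len(g0) - 1)
    se = math.sqrt(v1 / len(g1) + v0 / len(g0))
    return (m1 - m0) / se

t_stats = [sim_t_stat() for _ in range(1000)]

# Standard null-hypothesis test: reject when |t| > 1.96.
nhst_reject = sum(abs(t) > 1.96 for t in t_stats) / len(t_stats)

# TOST with bounds at +/-1.96 (normal approximation): reject the
# non-zero alternative when |t| < 1.96 - 1.645 = 0.315.
tost_reject = sum(abs(t) < 1.96 - 1.645 for t in t_stats) / len(t_stats)

print(f"NHST rejection rate: {nhst_reject:.3f}")   # expect roughly 0.05
print(f"TOST rejection rate: {tost_reject:.3f}")   # expect roughly 0.25
```

The gap between the two rates illustrates the point in the text: even when the null is true by construction, the equivalence test rejects the non-zero alternative only about a quarter of the time.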

[Figure: simulated rejection rates for the standard null hypothesis test versus the two one-sided equivalence tests]

The two one-sided tests approach strikes me as a nice method for dealing with a null research hypothesis, and it's pretty easy to implement. The one downside is that the test is underpowered: if the null is true, it will only reject the alternative about 25% of the time (though you could select a different non-zero alternative, which might give you more power). However, this isn't all bad. The flip side of the coin is that this is a really conservative test, so if you can reject the alternative, you're on solid rhetorical footing to argue that the data really do seem consistent with the null.
