NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.


Hypothesis Testing, P Values, Confidence Intervals, and Significance

Jacob Shreffler; Martin R. Huecker.

Last Update: March 13, 2023.

  • Definition/Introduction

Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.

  • Issues of Concern

Without a foundational understanding of hypothesis testing, p values, confidence intervals, and the difference between statistical and clinical significance, healthcare providers may struggle to make clinical decisions without relying purely on the level of significance deemed appropriate by the research investigators. Therefore, an overview of these concepts is provided to allow medical professionals to use their expertise to determine if results are reported sufficiently and if the study outcomes are clinically appropriate to be applied in healthcare practice.

Hypothesis Testing

Investigators conducting studies need research questions and hypotheses to guide analyses. Starting with broad research questions (RQs), investigators then identify a gap in current clinical practice or research. Any research problem or statement is grounded in a better understanding of relationships between two or more variables. For this article, we will use the following research question example:

Research Question: Is Drug 23 an effective treatment for Disease A?

Research questions do not directly imply specific guesses or predictions; we must formulate research hypotheses. A hypothesis is a predetermined declaration regarding the research question in which the investigator(s) makes a precise, educated guess about a study outcome. This is sometimes called the alternative hypothesis and ultimately allows the researcher to take a stance based on experience or insight from medical literature. An example of a hypothesis is below.

Research Hypothesis: Drug 23 will significantly reduce symptoms associated with Disease A compared to Drug 22.

The null hypothesis states that there is no statistical difference between groups based on the stated research hypothesis.

Researchers should be aware of journal recommendations when considering how to report p values, and manuscripts should remain internally consistent.

Regarding p values, as the number of individuals enrolled in a study (the sample size) increases, the likelihood of finding a statistically significant effect increases. With very large sample sizes, the p-value can be very low even when the actual difference in the reduction of symptoms for Disease A between Drug 23 and Drug 22 is small. The null hypothesis is deemed true until a study presents significant data to support rejecting it. Based on the results, the investigators will either reject the null hypothesis (if they found significant differences or associations) or fail to reject the null hypothesis (they could not provide proof that there were significant differences or associations).
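To make this concrete, the sketch below uses SciPy with hypothetical summary statistics (the means, standard deviation, and sample sizes are assumptions, not data from any study cited here): the same small difference between "Drug 23" and "Drug 22" yields ever-smaller p values as enrollment grows, even though the effect itself never changes.

```python
# Minimal sketch, assuming hypothetical summary statistics: a fixed 0.1-point
# difference in mean symptom score becomes "statistically significant" once the
# sample size is large enough.
from scipy.stats import ttest_ind_from_stats

mean_drug23, mean_drug22 = 4.9, 5.0   # hypothetical group means
sd = 1.5                              # hypothetical common standard deviation

for n in (50, 500, 5000, 50000):      # patients per group
    stat, p = ttest_ind_from_stats(mean_drug23, sd, n, mean_drug22, sd, n)
    print(f"n per group = {n:>6}: p = {p:.4f}")
```

Only the sample size changes across the loop; the 0.1-point difference stays the same, which is why a low p value by itself says nothing about whether an effect is clinically meaningful.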

To test a hypothesis, researchers obtain data on a representative sample to determine whether to reject or fail to reject a null hypothesis. In most research studies, it is not feasible to obtain data for an entire population. Using a sampling procedure allows for statistical inference, though this involves a certain possibility of error. [1]  When determining whether to reject or fail to reject the null hypothesis, mistakes can be made: Type I and Type II errors. Though it is impossible to ensure that these errors have not occurred, researchers should limit the possibilities of these faults. [2]
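As a rough illustration of a Type I error, the simulation below (a sketch with simulated data, not taken from the article) draws both groups from the same distribution, so the null hypothesis is true by construction; at a 0.05 threshold, roughly 5% of comparisons are nonetheless declared significant.

```python
# Minimal sketch: when the null hypothesis is true, alpha (here 0.05) is the
# long-run rate at which we falsely reject it (the Type I error rate).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, n_studies, n_per_group = 0.05, 10_000, 40

false_positives = 0
for _ in range(n_studies):
    group_a = rng.normal(0.0, 1.0, n_per_group)  # no true difference between groups
    group_b = rng.normal(0.0, 1.0, n_per_group)
    if ttest_ind(group_a, group_b).pvalue < alpha:
        false_positives += 1

print(f"Proportion of 'significant' results under a true null: {false_positives / n_studies:.3f}")
```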

Significance

Significance is a term to describe the substantive importance of medical research. Statistical significance is the likelihood of results due to chance. [3]  Healthcare providers should always delineate statistical significance from clinical significance, a common error when reviewing biomedical research. [4]  When conceptualizing findings reported as either significant or not significant, healthcare providers should not simply accept researchers' results or conclusions without considering the clinical significance. Healthcare professionals should consider the clinical importance of findings and understand both p values and confidence intervals so they do not have to rely on the researchers to determine the level of significance. [5]  One criterion often used to determine statistical significance is the utilization of p values.

P values are used in research to determine whether the sample estimate is significantly different from a hypothesized value. The p-value is the probability that the observed effect within the study would have occurred by chance if, in reality, there was no true effect. Conventionally, data yielding a p<0.05 or p<0.01 are considered statistically significant. While some have argued that the 0.05 threshold should be lowered, it is still widely practiced. [6] Hypothesis testing of this kind tells us whether an effect is likely present; it does not tell us the size of that effect.

Examples of findings reported with p values are below:

Statement: Drug 23 reduced patients' symptoms compared to Drug 22. Patients who received Drug 23 (n=100) were 2.1 times less likely than patients who received Drug 22 (n = 100) to experience symptoms of Disease A, p<0.05.

Statement: Individuals who were prescribed Drug 23 experienced fewer symptoms (M = 1.3, SD = 0.7) compared to individuals who were prescribed Drug 22 (M = 5.3, SD = 1.9). This finding was statistically significant, p = 0.02.

For either statement, if the threshold had been set at 0.05, the null hypothesis (that there was no relationship) should be rejected, and we should conclude significant differences. Noticeably, as can be seen in the two statements above, some researchers will report findings with < or > and others will provide an exact p-value (0.000001) but never zero [6] . When examining research, readers should understand how p values are reported. The best practice is to report all p values for all variables within a study design, rather than only providing p values for variables with significant findings. [7]  The inclusion of all p values provides evidence for study validity and limits suspicion for selective reporting/data mining.  
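The snippet below is a sketch of that reporting practice using fabricated data (the arrays are simulated and are not the study data described above): every outcome gets an exact p value, whether or not it clears 0.05.

```python
# Minimal sketch with simulated, hypothetical outcome data: report an exact
# p value for every outcome analyzed, not only the significant ones.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
outcomes = {
    "symptom score":    (rng.normal(2.1, 1.0, 100), rng.normal(2.6, 1.0, 100)),
    "days to recovery": (rng.normal(6.0, 2.0, 100), rng.normal(6.2, 2.0, 100)),
}

for name, (drug23, drug22) in outcomes.items():
    result = ttest_ind(drug23, drug22, equal_var=False)  # Welch's t-test
    print(f"{name}: t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```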

While researchers have historically used p values, experts who find p values problematic encourage the use of confidence intervals. [8] P-values alone do not allow us to understand the size or the extent of the differences or associations. [3] In March 2016, the American Statistical Association (ASA) released a statement on p values, noting that scientific decision-making and conclusions should not be based on a fixed p-value threshold (e.g., 0.05). They recommend focusing on the significance of results in the context of study design, quality of measurements, and validity of data. Ultimately, the ASA statement noted that in isolation, a p-value does not provide strong evidence. [9]

When conceptualizing clinical work, healthcare professionals should consider p values alongside a concurrent appraisal of study design validity. For example, a p-value from a double-blinded randomized clinical trial (designed to minimize bias) should be weighted higher than one from a retrospective observational study. [7] The p-value debate has smoldered since the 1950s, [10] and replacement with confidence intervals has been suggested since the 1980s. [11]

Confidence Intervals

A confidence interval provides a range of values, at a given level of confidence (e.g., 95%), that is expected to contain the true value of the statistical estimate in the target population. [12] Most research uses a 95% CI, but investigators can set any level (e.g., 90% CI, 99% CI). [13] A CI provides a range with the lower bound and upper bound limits of a difference or association that would be plausible for a population. [14] Therefore, a 95% CI indicates that if a study were to be carried out 100 times, the range would contain the true value in 95 of them. [15] Confidence intervals provide more evidence regarding the precision of an estimate than p values do. [6]

In consideration of the similar research example provided above, one could make the following statement with 95% CI:

Statement: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22; the mean difference between the two groups in days to recovery was 4.2 days (95% CI: 1.9 – 7.8).

It is important to note that the width of the CI is affected by the standard error and the sample size; reducing a study's sample size will result in less precision of the CI (a wider interval). [14] A larger width indicates a smaller sample size or larger variability. [16] A researcher would generally want to increase the precision of the CI. For example, a 95% CI of 1.43 – 1.47 is much more precise than the one provided in the example above. In research and clinical practice, CIs provide valuable information on whether the interval includes or excludes any clinically significant values. [14]
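The sketch below illustrates that relationship with hypothetical summary statistics (the 4.2-day mean difference echoes the example above, but the standard deviation and sample sizes are assumptions): the interval is the estimate plus or minus a t-based multiple of the standard error, so larger samples narrow the interval.

```python
# Minimal sketch, hypothetical numbers: 95% CI = estimate +/- t_crit * SE.
# Increasing the sample size shrinks the standard error and narrows the CI.
import numpy as np
from scipy import stats

mean_diff = 4.2      # hypothetical mean difference in days to recovery
sd_pooled = 6.0      # hypothetical pooled standard deviation

for n_per_group in (20, 100, 500):
    se = sd_pooled * np.sqrt(2 / n_per_group)
    df = 2 * n_per_group - 2
    t_crit = stats.t.ppf(0.975, df)
    lo, hi = mean_diff - t_crit * se, mean_diff + t_crit * se
    print(f"n = {n_per_group:>3}: 95% CI = ({lo:.1f}, {hi:.1f}), width = {hi - lo:.1f}")
```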

A null value is often used as the reference when interpreting a CI (zero for differences and 1 for ratios), but CIs provide more information than whether they simply include or exclude that value. [15] Consider this example: A hospital implements a new protocol that reduced wait time for patients in the emergency department by an average of 25 minutes (95% CI: -2.5 – 41 minutes). Because the range crosses zero, implementing this protocol in different populations could result in longer wait times; however, the range extends much further on the positive side. Thus, while the p-value used to detect statistical significance for this comparison may result in "not significant" findings, individuals should examine this range, consider the study design, and weigh whether or not it is still worth piloting in their workplace.

Similarly to p-values, 95% CIs cannot control for researchers' errors (e.g., study bias or improper data analysis). [14]  In consideration of whether to report p-values or CIs, researchers should examine journal preferences. When in doubt, reporting both may be beneficial. [13]  An example is below:

Reporting both: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22, p = 0.009. The mean difference between the two groups in days to recovery was 4.2 days (95% CI: 1.9 – 7.8).

  • Clinical Significance

Recall that clinical significance and statistical significance are two different concepts. Healthcare providers should remember that a study with statistically significant differences and large sample size may be of no interest to clinicians, whereas a study with smaller sample size and statistically non-significant results could impact clinical practice. [14]  Additionally, as previously mentioned, a non-significant finding may reflect the study design itself rather than relationships between variables.

Healthcare providers using evidence-based medicine to inform practice should use clinical judgment to determine the practical importance of studies through careful evaluation of the design, sample size, power, likelihood of type I and type II errors, data analysis, and reporting of statistical findings (p values, 95% CI, or both). [4] Interestingly, some experts have called for the terms "statistically significant" and "not significant" to be excluded from published work, as statistical significance has never been and will never be equivalent to clinical significance. [17]

The decision on what is clinically significant can be challenging, depending on the providers' experience and especially the severity of the disease. Providers should use their knowledge and experience to determine the meaningfulness of study results and make inferences based not only on whether researchers report results as significant or insignificant, but also on their own understanding of study limitations and practical implications.

  • Nursing, Allied Health, and Interprofessional Team Interventions

All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts in this chapter. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care. 


Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.

Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits others to distribute the work, provided that the article is not altered or used commercially. You are not required to obtain permission to distribute this article, provided that you credit the author and journal.

  • Cite this Page Shreffler J, Huecker MR. Hypothesis Testing, P Values, Confidence Intervals, and Significance. [Updated 2023 Mar 13]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.



The clinician’s guide to p values, confidence intervals, and magnitude of effects

  • Mark R. Phillips (ORCID: orcid.org/0000-0003-0923-261X),
  • Charles C. Wykoff,
  • Lehana Thabane (ORCID: orcid.org/0000-0003-0355-9734),
  • Mohit Bhandari (ORCID: orcid.org/0000-0001-9608-4808) &
  • Varun Chaudhary (ORCID: orcid.org/0000-0002-9988-4146)

for the Retina Evidence Trials InterNational Alliance (R.E.T.I.N.A.) Study Group

Eye, volume 36, pages 341–342 (2022)


Introduction

There are numerous statistical and methodological considerations within every published study, and the ability of clinicians to appreciate the implications and limitations associated with these key concepts is critically important. These implications often have a direct impact on the applicability of study findings – which, in turn, often determine the appropriateness for the results to lead to modification of practice patterns. Because it can be challenging and time-consuming for busy clinicians to break down the nuances of each study, herein we provide a brief summary of 3 important topics that every ophthalmologist should consider when interpreting evidence.

p-values: what they tell us and what they don’t

Perhaps the most universally recognized statistic is the p-value. Most individuals understand the notion that (usually) a p-value <0.05 signifies a statistically significant difference between the two groups being compared. While this understanding is shared amongst most, it is far more important to understand what a p-value does not tell us. Attempting to inform clinical practice patterns through interpretation of p-values alone is overly simplistic and fraught with potential for misleading conclusions. A p-value represents the probability that the observed result (difference between the groups being compared)—or one that is more extreme—would occur by random chance, assuming that the null hypothesis (the alternative scenario to the study’s hypothesis) is true and there are no differences between the groups being compared. For example, a p-value of 0.04 would indicate that the difference between the groups compared would have a 4% chance of occurring by random chance. When this probability is small, it becomes less likely that the null hypothesis is accurate—or, alternatively, that the probability of a difference between groups is high [1]. Studies use a predefined threshold to determine when a p-value is sufficiently small to support the study hypothesis. This threshold is conventionally a p-value of 0.05; however, there are reasons and justifications for studies to use a different threshold if appropriate.

What a p-value cannot tell us is the clinical relevance or importance of the observed treatment effects. [1] Specifically, a p-value does not provide details about the magnitude of effect [2, 3, 4]. Despite a significant p-value, it is quite possible for the difference between the groups to be small. This phenomenon is especially common with larger sample sizes, in which comparisons may result in statistically significant differences that are not clinically meaningful. For example, a study may find a statistically significant difference (p < 0.05) between the visual acuity outcomes of two groups, while the difference between the groups may amount to only 1 letter or less. While this may in fact be a statistically significant difference, it is likely not large enough to make a meaningful difference for patients. Thus, p-values lack vital information on the magnitude of effects for the assessed outcomes [2, 3, 4].

Overcoming the limitations of interpreting p-values: magnitude of effect

To overcome this limitation, it is important to consider both (1) whether or not the p -value of a comparison is significant according to the pre-defined statistical plan, and (2) the magnitude of the treatment effects (commonly reported as an effect estimate with 95% confidence intervals) [ 5 ]. The magnitude of effect is most often represented as the mean difference between groups for continuous outcomes, such as visual acuity on the logMAR scale, and the risk or odds ratio for dichotomous/binary outcomes, such as occurrence of adverse events. These measures indicate the observed effect that was quantified by the study comparison. As suggested in the previous section, understanding the actual magnitude of the difference in the study comparison provides an understanding of the results that an isolated p -value does not provide [ 4 , 5 ]. Understanding the results of a study should shift from a binary interpretation of significant vs not significant, and instead, focus on a more critical judgement of the clinical relevance of the observed effect [ 1 ].
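As a sketch of how a magnitude of effect might be reported for a dichotomous outcome, the code below computes an odds ratio with a 95% confidence interval from a hypothetical 2x2 table using the standard log-scale normal approximation; none of the counts come from the studies discussed here.

```python
# Minimal sketch, hypothetical counts: odds ratio and 95% CI via the usual
# log(OR) +/- 1.96 * SE approximation.
import math

events_a, no_events_a = 18, 82   # adverse events with treatment A (hypothetical)
events_b, no_events_b = 30, 70   # adverse events with treatment B (hypothetical)

or_hat = (events_a * no_events_b) / (no_events_a * events_b)
se_log_or = math.sqrt(1/events_a + 1/no_events_a + 1/events_b + 1/no_events_b)
lo = math.exp(math.log(or_hat) - 1.96 * se_log_or)
hi = math.exp(math.log(or_hat) + 1.96 * se_log_or)
print(f"Odds ratio = {or_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

The interval, rather than a p value, conveys how large (and how uncertain) the estimated effect is.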

There are a number of important metrics, such as the Minimally Important Difference (MID), which help to determine whether a difference between groups is large enough to be clinically meaningful [6, 7]. When a clinician is able to identify (1) the magnitude of effect within a study and (2) the MID (the smallest change in the outcome that a patient would deem meaningful), they are far more capable of understanding the effects of a treatment and of articulating the pros and cons of a treatment option to patients with reference to treatment effects that can be considered clinically valuable.

The role of confidence intervals

Confidence intervals are estimates that provide a lower and upper threshold for the estimate of the magnitude of effect. By convention, 95% confidence intervals are most typically reported. These intervals represent the range within which, with 95% confidence, the true treatment effect is expected to fall. For example, a mean difference in visual acuity of 8 (95% confidence interval: 6 to 10) suggests that the best estimate of the difference between the two study groups is 8 letters, and we have 95% certainty that the true value is between 6 and 10 letters. When interpreting this clinically, one can consider the different clinical scenarios at each end of the confidence interval: if the patient’s outcome were the most conservative, in this case an improvement of 6 letters, would the importance to the patient be different than if the patient’s outcome were the most optimistic, or 10 letters in this example? When the clinical value of the treatment effect does not change when considering the lower versus the upper end of the confidence interval, there is enhanced certainty that the treatment effect will be meaningful to the patient [4, 5]. In contrast, if the clinical merits of a treatment appear different when considering the lower versus the upper end of the confidence interval, one may be more cautious about the benefits to be anticipated with treatment [4, 5].
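One way to operationalize that reasoning is sketched below: compare both ends of the confidence interval against an MID. The 8-letter effect and 6-to-10 interval come from the example above, while the 5-letter MID is purely an assumption for illustration.

```python
# Minimal sketch: judge a treatment effect by checking both CI ends against a
# minimally important difference (MID); the MID of 5 letters is hypothetical.
def interpret(ci_low: float, ci_high: float, mid: float) -> str:
    if ci_low >= mid:
        return "even the conservative end of the CI exceeds the MID: likely meaningful to patients"
    if ci_high < mid:
        return "even the optimistic end of the CI is below the MID: unlikely to be meaningful"
    return "the CI straddles the MID: clinical value uncertain, interpret cautiously"

print(interpret(ci_low=6, ci_high=10, mid=5))
```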

There are a number of important details for clinicians to consider when interpreting evidence. Through this editorial, we hope to provide practical insights into fundamental methodological principles that can help guide clinical decision making. P-values are one small component to consider when interpreting study results, with much deeper appreciation of results being available when the treatment effects and associated confidence intervals are also taken into consideration.

Change history

19 January 2022

A Correction to this paper has been published: https://doi.org/10.1038/s41433-021-01914-2

Li G, Walter SD, Thabane L. Shifting the focus away from binary thinking of statistical significance and towards education for key stakeholders: revisiting the debate on whether it’s time to de-emphasize or get rid of statistical significance. J Clin Epidemiol. 2021;137:104–12. https://doi.org/10.1016/j.jclinepi.2021.03.033


Gagnier JJ, Morgenstern H. Misconceptions, misuses, and misinterpretations of p values and significance testing. J Bone Joint Surg Am. 2017;99:1598–603. https://doi.org/10.2106/JBJS.16.01314

Goodman SN. Toward evidence-based medical statistics. 1: the p value fallacy. Ann Intern Med. 1999;130:995–1004. https://doi.org/10.7326/0003-4819-130-12-199906150-00008


Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337–50. https://doi.org/10.1007/s10654-016-0149-3


Phillips M. Letter to the editor: editorial: threshold p values in orthopaedic research-we know the problem. What is the solution? Clin Orthop. 2019;477:1756–8. https://doi.org/10.1097/CORR.0000000000000827

Devji T, Carrasco-Labra A, Qasim A, Phillips MR, Johnston BC, Devasenapathy N, et al. Evaluating the credibility of anchor based estimates of minimal important differences for patient reported outcomes: instrument development and reliability study. BMJ. 2020;369:m1714. https://doi.org/10.1136/bmj.m1714

Carrasco-Labra A, Devji T, Qasim A, Phillips MR, Wang Y, Johnston BC, et al. Minimal important difference estimates for patient-reported outcomes: a systematic survey. J Clin Epidemiol. 2020;0. https://doi.org/10.1016/j.jclinepi.2020.11.024


Author information

Authors and Affiliations

Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada

Mark R. Phillips, Lehana Thabane, Mohit Bhandari & Varun Chaudhary

Retina Consultants of Texas (Retina Consultants of America), Houston, TX, USA

Charles C. Wykoff

Blanton Eye Institute, Houston Methodist Hospital, Houston, TX, USA

Biostatistics Unit, St. Joseph’s Healthcare-Hamilton, Hamilton, ON, Canada

Lehana Thabane

Department of Surgery, McMaster University, Hamilton, ON, Canada

Mohit Bhandari & Varun Chaudhary

NIHR Moorfields Biomedical Research Centre, Moorfields Eye Hospital, London, UK

Sobha Sivaprasad

Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA

Peter Kaiser

Retinal Disorders and Ophthalmic Genetics, Stein Eye Institute, University of California, Los Angeles, CA, USA

David Sarraf

Department of Ophthalmology, Mayo Clinic, Rochester, MN, USA

Sophie J. Bakri

The Retina Service at Wills Eye Hospital, Philadelphia, PA, USA

Sunir J. Garg

Center for Ophthalmic Bioinformatics, Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA

Rishi P. Singh

Cleveland Clinic Lerner College of Medicine, Cleveland, OH, USA

Department of Ophthalmology, University of Bonn, Bonn, Germany

Frank G. Holz

Singapore Eye Research Institute, Singapore, Singapore

Tien Y. Wong

Singapore National Eye Centre, Duke-NUS Medical School, Singapore, Singapore

Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, VIC, Australia

Robyn H. Guymer

Department of Surgery (Ophthalmology), The University of Melbourne, Melbourne, VIC, Australia


  • Varun Chaudhary
  • Mohit Bhandari
  • Charles C. Wykoff
  • Sobha Sivaprasad
  • Lehana Thabane
  • Peter Kaiser
  • David Sarraf
  • Sophie J. Bakri
  • Sunir J. Garg
  • Rishi P. Singh
  • Frank G. Holz
  • Tien Y. Wong
  • Robyn H. Guymer

Contributions

MRP was responsible for conception of idea, writing of manuscript and review of manuscript. VC was responsible for conception of idea, writing of manuscript and review of manuscript. MB was responsible for conception of idea, writing of manuscript and review of manuscript. CCW was responsible for critical review and feedback on manuscript. LT was responsible for critical review and feedback on manuscript.

Corresponding author

Correspondence to Varun Chaudhary .

Ethics declarations

Competing interests.

MRP: Nothing to disclose. CCW: Consultant: Acuela, Adverum Biotechnologies, Inc, Aerpio, Alimera Sciences, Allegro Ophthalmics, LLC, Allergan, Apellis Pharmaceuticals, Bayer AG, Chengdu Kanghong Pharmaceuticals Group Co, Ltd, Clearside Biomedical, DORC (Dutch Ophthalmic Research Center), EyePoint Pharmaceuticals, Gentech/Roche, GyroscopeTx, IVERIC bio, Kodiak Sciences Inc, Novartis AG, ONL Therapeutics, Oxurion NV, PolyPhotonix, Recens Medical, Regeron Pharmaceuticals, Inc, REGENXBIO Inc, Santen Pharmaceutical Co, Ltd, and Takeda Pharmaceutical Company Limited; Research funds: Adverum Biotechnologies, Inc, Aerie Pharmaceuticals, Inc, Aerpio, Alimera Sciences, Allergan, Apellis Pharmaceuticals, Chengdu Kanghong Pharmaceutical Group Co, Ltd, Clearside Biomedical, Gemini Therapeutics, Genentech/Roche, Graybug Vision, Inc, GyroscopeTx, Ionis Pharmaceuticals, IVERIC bio, Kodiak Sciences Inc, Neurotech LLC, Novartis AG, Opthea, Outlook Therapeutics, Inc, Recens Medical, Regeneron Pharmaceuticals, Inc, REGENXBIO Inc, Samsung Pharm Co, Ltd, Santen Pharmaceutical Co, Ltd, and Xbrane Biopharma AB—unrelated to this study. LT: Nothing to disclose. MB: Research funds: Pendopharm, Bioventus, Acumed – unrelated to this study. VC: Advisory Board Member: Alcon, Roche, Bayer, Novartis; Grants: Bayer, Novartis – unrelated to this study.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original version of this article was revised: In this article the middle initial in author name Sophie J. Bakri was missing.


About this article

Cite this article.

Phillips, M.R., Wykoff, C.C., Thabane, L. et al. The clinician’s guide to p values, confidence intervals, and magnitude of effects. Eye 36 , 341–342 (2022). https://doi.org/10.1038/s41433-021-01863-w


Received : 11 November 2021

Revised : 12 November 2021

Accepted : 15 November 2021

Published : 26 November 2021

Issue Date : February 2022

DOI : https://doi.org/10.1038/s41433-021-01863-w


Understanding P-values | Definition and Examples

Published on July 16, 2020 by Rebecca Bevans . Revised on June 22, 2023.

The p value is a number, calculated from a statistical test, that describes how likely you are to have found a particular set of observations if the null hypothesis were true.

P values are used in hypothesis testing to help decide whether to reject the null hypothesis. The smaller the p value, the more likely you are to reject the null hypothesis.

Table of contents

  • What is a null hypothesis?
  • What exactly is a p value?
  • How do you calculate the p value?
  • P values and statistical significance
  • Reporting p values
  • Caution when using p values
  • Other interesting articles
  • Frequently asked questions about p values

All statistical tests have a null hypothesis. For most tests, the null hypothesis is that there is no relationship between your variables of interest or that there is no difference among groups.

For example, in a two-tailed t test comparing the longevity of two groups, the null hypothesis is that the difference between the two groups is zero.

  • Null hypothesis ( H 0 ): there is no difference in longevity between the two groups.
  • Alternative hypothesis ( H A or H 1 ): there is a difference in longevity between the two groups.


The p value , or probability value, tells you how likely it is that your data could have occurred under the null hypothesis. It does this by calculating the likelihood of your test statistic , which is the number calculated by a statistical test using your data.

The p value tells you how often you would expect to see a test statistic as extreme or more extreme than the one calculated by your statistical test if the null hypothesis of that test was true. The p value gets smaller as the test statistic calculated from your data gets further away from the range of test statistics predicted by the null hypothesis.

The p value is a proportion: if your p value is 0.05, that means that 5% of the time you would see a test statistic at least as extreme as the one you found if the null hypothesis was true.

P values are usually automatically calculated by your statistical program (R, SPSS, etc.).

You can also find tables for estimating the p value of your test statistic online. These tables show, based on the test statistic and degrees of freedom (number of observations minus number of independent variables) of your test, how frequently you would expect to see that test statistic under the null hypothesis.
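In practice the table lookup can be replaced by a direct calculation; the sketch below (with an arbitrary test statistic and degrees of freedom chosen purely for illustration) computes a two-tailed p value from the t distribution.

```python
# Minimal sketch: two-tailed p value for a t statistic, given its degrees of
# freedom, using the t distribution's survival function (hypothetical values).
from scipy import stats

t_statistic = 2.3   # hypothetical test statistic
df = 28             # degrees of freedom of the test

p_two_tailed = 2 * stats.t.sf(abs(t_statistic), df)
print(f"p = {p_two_tailed:.3f}")   # roughly 0.03
```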

The calculation of the p value depends on the statistical test you are using to test your hypothesis :

  • Different statistical tests have different assumptions and generate different test statistics. You should choose the statistical test that best fits your data and matches the effect or relationship you want to test.
  • The number of independent variables you include in your test changes how large or small the test statistic needs to be to generate the same p value.

No matter what test you use, the p value always describes the same thing: how often you can expect to see a test statistic as extreme or more extreme than the one calculated from your test.

P values are most often used by researchers to say whether a certain pattern they have measured is statistically significant.

Statistical significance is another way of saying that the p value of a statistical test is small enough to reject the null hypothesis of the test.

How small is small enough? The most common threshold is p < 0.05; that is, when you would expect to find a test statistic as extreme as the one calculated by your test only 5% of the time. But the threshold depends on your field of study – some fields prefer thresholds of 0.01, or even 0.001.

The threshold value for determining statistical significance is also known as the alpha value.


P values of statistical tests are usually reported in the results section of a research paper, along with the key information needed for readers to put the p values in context – for example, the correlation coefficient in a linear regression, or the average difference between treatment groups in a t test.
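A sketch of that reporting style with simulated data is below: the regression output includes the slope and correlation coefficient alongside the p value, so the reader sees the size of the relationship rather than only whether it is "significant." The data and the 0.5 slope are assumptions for illustration only.

```python
# Minimal sketch, simulated data: report the effect estimate (slope) and the
# correlation coefficient together with the p value, not the p value alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(scale=1.0, size=50)   # hypothetical linear relationship

res = stats.linregress(x, y)
print(f"slope = {res.slope:.2f}, r = {res.rvalue:.2f}, p = {res.pvalue:.3g}")
```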

P values are often interpreted as your risk of rejecting the null hypothesis of your test when the null hypothesis is actually true.

In reality, the risk of rejecting the null hypothesis is often higher than the p value, especially when looking at a single study or when using small sample sizes. This is because the smaller your frame of reference, the greater the chance that you stumble across a statistically significant pattern completely by accident.

P values are also often interpreted as supporting or refuting the alternative hypothesis. This is not the case. The  p value can only tell you whether or not the null hypothesis is supported. It cannot tell you whether your alternative hypothesis is true, or why.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient
  • Null hypothesis

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

A p -value , or probability value, is a number describing how likely it is that your data would have occurred under the null hypothesis of your statistical test .

P -values are usually automatically calculated by the program you use to perform your statistical test. They can also be estimated using p -value tables for the relevant test statistic .

P -values are calculated from the null distribution of the test statistic. They tell you how often a test statistic is expected to occur under the null hypothesis of the statistical test, based on where it falls in the null distribution.

If the test statistic is far from the mean of the null distribution, then the p -value will be small, showing that the test statistic is not likely to have occurred under the null hypothesis.

Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test . Significance is usually denoted by a p -value , or probability value.

Statistical significance is arbitrary – it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that the data is likely to occur less than 5% of the time under the null hypothesis .

When the p -value falls below the chosen alpha value, then we say the result of the test is statistically significant.

No. The p -value only tells you how likely the data you have observed is to have occurred under the null hypothesis .

If the p -value is below your threshold of significance (typically p < 0.05), then you can reject the null hypothesis, but this does not necessarily mean that your alternative hypothesis is true.

Cite this Scribbr article


Bevans, R. (2023, June 22). Understanding P-values | Definition and Examples. Scribbr. Retrieved April 12, 2024, from https://www.scribbr.com/statistics/p-value/


Picking apart p values: common problems and points of confusion

  • Published: 03 August 2022
  • Volume 30, pages 3245–3248 (2022)


  • Sophia J. Madjarova 1 ,
  • Riley J. Williams III 1 ,
  • Benedict U. Nwachukwu 1 ,
  • R. Kyle Martin 2 ,
  • Jón Karlsson 3 ,
  • Mattheu Ollivier 4 &
  • Ayoosh Pareek 1  


Due to its frequent misuse, the p value has become a point of contention in the research community. In this editorial, we seek to clarify some of the common misconceptions about p values and the hazardous implications associated with misunderstanding this commonly used statistical concept. This article will discuss issues related to p value interpretation in addition to problems such as p-hacking and statistical fragility; we will also offer some thoughts on addressing these issues. The aim of this editorial is to provide clarity around the concept of statistical significance for those attempting to increase their statistical literacy in Orthopedic research.


Despite its ubiquity in scientific and medical literature, the p value is a commonly misunderstood and misinterpreted statistical concept. Tam et al. conducted a survey-based study examining quantitative and qualitative understanding of p values among a group of 247 general practitioners and demonstrated that the two most common misconceptions were (1) the use of p values to represent real-world probabilities, equating significance to a 95% chance of the test hypothesis being true versus a 5% chance of it being false, and (2) the use of P = 0.05 as a threshold for evidence of observable results (i.e., P ≤ 0.05 = observable effect; P > 0.05 = not observable) [1]. When combined, these two conceptualizations accounted for 83% of the survey responses. Another study of statistical literacy among 277 residents from 11 different programs demonstrated that only 58% of the participants could correctly interpret a p value, despite 88% of respondents indicating confidence in their understanding of p values, demonstrating a clear confidence-ability gap [2]. We can reasonably assume this gap also exists among practicing Orthopedic surgeons.

Beyond confusing the individual reader, poor understanding of p values has grave and far-reaching effects on the larger research community. Poor statistical literacy among researchers and peer-reviewers has led to widespread publication of biased studies, many with potentially irreproducible results. This phenomenon jeopardizes the development of a reliable scientific knowledge base. Practices such as p-hacking or cherry-picking data such that insignificant findings demonstrate significance have even further weakened the integrity of clinical research [ 3 ]. Through flooding the literature with potentially misleading information, the misuse and manipulation of p values threatens the future of evidence-based medicine, and we must course-correct.

To begin, we must properly define the p value. The p value is not a measure of the probability of a hypothesis being true or untrue. The standard significance level (alpha), arbitrarily defined as P = 0.05, represents the threshold value used to either reject (P ≤ 0.05) or fail to reject (P > 0.05) the null hypothesis, which is the assumption that two groups are the same [4]. Thus, the p value is a measure of the degree to which experimental data conform to the distribution predicted by the null hypothesis. Stated another way, the p value represents how likely one is to obtain the observed result, given that the null hypothesis is true, or the probability that we obtained the current result by chance (Fig. 1). In this sense, the lower the p value, the less likely it is that the observed result or difference is due to chance.

Figure 1. Shaded area (green) represents values in the distribution with probability greater than alpha, where alpha is traditionally equal to 0.05. Under the assumption that the null hypothesis is true, this shaded region represents results that are unlikely to have occurred by chance alone, suggesting that the null hypothesis is not the best-fitting hypothesis. Adapted from Wikimedia Commons (https://commons.wikimedia.org).

One of the most common fallacies related to p values is that there can be a complete acceptance or rejection of the null hypothesis without consideration of other factors. Obtaining a value of P ≤ 0.05 suggests that, under the assumption that the null hypothesis is true, a difference between the two variables as great as or greater than the one observed would be expected no more than 5% of the time if only chance affected the observed relationship. In reality, many factors may introduce errors, which can lead to a misleadingly small p value [5]. For example, at times the researcher may perform multiple comparisons and, though there was an established p value threshold of 0.05, not adjust this threshold for multiple comparisons (discussed below). Likewise, researchers may want to obtain more confidence in their findings, and therefore it may be better at times to adjust the p value to a lower threshold (0.01 or even 0.001), as is common practice in fields such as astrophysics and genomics, among others. With a better working understanding of p values as related to how well data fit a prediction, researchers can also challenge Type 1 errors (false positives), which result from an erroneous rejection of the null hypothesis. In other words, reporting that there is an observed difference due to the experimental conditions will be more meaningful when the potential sources of error are acknowledged (i.e., selection bias, repeated measurements, and more).

Another common misconception about statistical significance is associating the level of significance (i.e., the size of the p value) with the magnitude of the effect (or effect size). For example, some readers may interpret a p value of 0.0001 not only as indicating statistical significance in post-operative outcome improvement, but may also assume, since the p value is so small, that the effect size or the magnitude of post-operative outcome improvement is also very large—but this is not always the case.

A p value depends on both the effect size, or the strength of the relationship between the variables, and the sample size. A small p value could be due to a large effect, or it could result from a small effect and a large enough sample size, and vice versa [6]. When examining research literature, therefore, we must keep effect size, variance (spread of data), and sample size in mind, as they all play a part in determining the final p value. As mentioned above, the commonly accepted threshold value for P is 0.05; however, this may not be an appropriate threshold for all studies. Imagine two strands of DNA that are statistically different based on nucleotide sequence, yet both could code for the same functional protein or a sufficient level of functional protein so that the organism is unaffected. In this case, there would be no clinical relevance associated with the statistical significance (Fig. 2) [7]. In addition, p values are often two-sided, testing whether the effect differs from the null value (0 for the null) in either direction. However, for questions related only to improving outcomes, which is probably the most commonly posed question in clinical research, a one-sided p value may be more appropriate.

Figure 2. Potential cases of statistical significance vs clinical relevance. Figure adapted from Ganesh et al. 2017.

Comparing p values between studies can also be a common source of confusion. Foremost, p values determined when testing in two different samples cannot be compared. This means that if study A examining two treatments resulted in P  = 0.1 and study B examining similar treatments resulted in P  = 0.001, the results of both studies are not necessarily contradictory. Conversely, if P  = 0.03 for both samples, it does not mean that the study results agree. This is mainly due to the possible variations in sample sizes, as well as standard deviation of the measured variable, both of which indirectly and directly influence the p value. It should be noted that very few cross-study conclusions can be drawn from the p value and often further statistical analysis, such as the determination of confidence intervals, is required [ 6 , 8 ]. It is in cases such as this where large registry studies and prospective randomized controlled trials most benefit us, allowing researchers to examine superiority of treatment methodologies in patient populations.

Many researchers have noted the increased frequency of p values near the threshold of 0.05, with values just below it (e.g., P = 0.048) occurring more commonly than values just above it (e.g., P = 0.051); this is often an indicator of p-hacking. The practice of p-hacking, defined as the manipulation of data to fabricate significance, is a systematic problem in research. Authors who practice p-hacking are often concerned with increasing their personal number of peer-reviewed publications; these authors sacrifice the quality of their scientific investigation in preference for studies that show a statistical difference, driven by positive publication bias. Common ways to p-hack include assessing multiple outcomes and only reporting those that show significance, or increasing the sample size until the p value is within range of significance without an explicit a priori power (sample size) calculation. Bin Abd Razak et al. investigated the articles published in 2015 by 3 top orthopaedic journals [9]. Through text-mining, they identified articles that provided a single p value when reporting results for their main hypothesis. Theoretically, the frequency of reported p values should decrease as the p value increases towards 0.05, because observed real-world phenomena should presumably have a strong relationship between variables (large effect size) and a small p value. Following this logic, a higher frequency of small p values is expected; however, the study showed an upwards trend in frequency when approaching P = 0.05, suggesting the presence of p-hacking [9].

The high fragility of many orthopaedic studies only serves as further evidence of p-hacking in the field of orthopaedic research. Defined as the number of outcomes or patients that must change to affect the significance of the results, fragility provides a measure of the quantitative stability of our findings. Parisien et al. screened studies from two journals of comparative sports medicine between 2006 and 2016. From the 102 studies included, 339 outcomes were defined, of which 98 were reported as significant and 241 were non-significant [10]. The fragility index, or median number of events needed to change to affect statistical significance, was determined to be 5 (IQR 3 to 8), but the average loss of patients to follow-up was greater at 7.9, challenging study reliability. Forrester et al. identified 128 surgical clinical studies in orthopaedic trauma with 545 outcomes. Again, the reported loss to follow-up was greater than the median fragility index (5) for over half of the studies included (53.3%) [11]. As surgeons, we affect patients directly, and it is problematic that much of our decision making is based on studies in which a crossover of fewer than 5 events can make the results both clinically and statistically insignificant.
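The sketch below shows one common way such a fragility index can be computed for a single dichotomous outcome (the event counts are hypothetical, and the calculation assumes the initial comparison is statistically significant): patients in the arm with fewer events are switched from "no event" to "event" one at a time until Fisher's exact p value crosses 0.05.

```python
# Minimal sketch, hypothetical counts: fragility index = number of patients whose
# outcome must flip (no event -> event, in the arm with fewer events) before the
# result loses statistical significance at alpha = 0.05.
from scipy.stats import fisher_exact

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    # assumes the initial 2x2 comparison is significant (p < alpha) and that
    # arm A is the arm with fewer events
    flips = 0
    while events_a < n_a:
        events_a += 1
        flips += 1
        _, p = fisher_exact([[events_a, n_a - events_a],
                             [events_b, n_b - events_b]])
        if p >= alpha:          # significance has been lost
            return flips
    return flips

print(fragility_index(events_a=5, n_a=100, events_b=18, n_b=100))
```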

A potential solution proposed by critics of the p value is the Bonferroni correction, which involves dividing the predefined value for the significance level, alpha (P = 0.05), by the number of tests being performed, where the tests are most commonly t tests or tests calculating Pearson’s correlation coefficient (r) [12]. If 4 t tests were performed, the new alpha level would be 0.0125 for each test, so that the overall alpha level for all tests remains 0.05. This application of the Bonferroni correction addresses experimentwise error; however, the correction can also address familywise error, which occurs when related groups are compared through analysis of variance (ANOVA) [13, 14].
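A minimal sketch of that arithmetic, with hypothetical p values from 4 tests, is shown below; each test is judged against alpha divided by the number of tests so that the overall alpha across the tests stays at 0.05.

```python
# Minimal sketch: Bonferroni correction for 4 hypothetical t-test p values.
alpha = 0.05
p_values = [0.012, 0.030, 0.004, 0.047]        # hypothetical results of 4 tests
alpha_per_test = alpha / len(p_values)          # 0.05 / 4 = 0.0125

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < alpha_per_test else "not significant"
    print(f"test {i}: p = {p:.3f} -> {verdict} at corrected alpha = {alpha_per_test:.4f}")
```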

It should be noted that the Bonferroni correction is not without criticism. For example, decreasing the frequency of type I errors inherently increases the frequency of type II errors (false negatives) [12, 13]. Moreover, the correction addresses a “universal” null hypothesis even though the groups being compared may not be exactly identical in all comparisons [13]. On the whole, the Bonferroni correction is seen by some as statistically conservative, or in layman’s terms, overly cautious. Other alternatives offered to the p value are the inclusion of 95% confidence intervals or the Bayes factor. The most fervent p value critics suggest doing away with the p value entirely [15]. In addition, as treatment modalities become more important and the results of studies more critical to patients’ well-being, we may want to move toward more robust power analyses that select a lower p value threshold (0.01 or 0.001), following in the footsteps of others who wish to minimize the type I error.

Still, banishing the p value does not guarantee a future free from statistical misinterpretation. Researchers would be remiss to believe that the introduction of increasingly complex and arguably less-intuitive tests would help make statistical analysis more accessible throughout the research community. Moreover, the p value is deeply ingrained in the curriculum of courses spanning from high school introductory biology to graduate level biostatistics. To avoid the p value throughout the vast scientific community would be incredibly challenging and would certainly make information less accessible, both by requiring further statistical specialization and by leaving future researchers ill-equipped to analyze the incredible archive of currently published literature. We must instead promote a holistic view, in which foundational issues such as study design, clinical relevance vs statistical significance, fragility, and bias receive their due attention in addition to the p value.

Whether we like it or not, we have already reached a consensus on the use of the p value. It is our common statistical language, and rather than fumble with the implementation of an entirely new language, we need to double down on increasing statistical literacy surrounding the p value at every level of education. Bolstering the p value with corrections can help decrease the effects of p-hacking and false discovery rates, and should certainly remain a topic of discussion. However, the bottom line is that p values are not inherently evil. In fact, they have helped us identify a huge knowledge gap; many researchers lack the statistical knowledge to properly evaluate their results and communicate them. In that sense, our role is actually clear; we would do better to arm physicians and physician–scientists with comprehension and clarification, and that may be the easiest way to provide those around us with the tools to improve research quality.

Tam CWM, Khan AH, Knight A, Rhee J, Price K, McLean K (2018) How doctors conceptualise p values: a mixed methods study. Aust J Gen Pract 47:705–710


Windish DM, Huot SJ, Green ML (2007) Medicine residents’ understanding of the biostatistics and results in the medical literature. JAMA 298:1010


Elston DM (2021) Cherry picking, HARKing, and P-hacking. J Am Acad Dermatol. https://doi.org/10.1016/j.jaad.2021.06.844


Pareek A, Parkes C, Martin RK, Engebretsen L, Krych A (2019) P value: purpose, power, and potential pitfalls. AAOS Now

Lytsy P (2018) P in the right place: revisiting the evidential value of P -values. J Evid-Based Med 11:288–291

Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG (2016) Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 31:337–350

Ganesh S, Cave V (2018) P-values, p-values everywhere! NZ Vet J 66:55–56

Verhulst B (2016) In defense of p values. AANA J 84:305–308


Bin Abd Razak HR, Ang J-GE, Attal H, Howe T-S, Allen JC (2016) P-hacking in orthopaedic literature: a twist to the tail. J Bone Jt Surg 98:e91

Parisien RL, Trofa DP, Dashe J, Cronin PK, Curry EJ, Fu FH, Li X (2019) Statistical fragility and the role of p values in the sports medicine literature. J Am Acad Orthop Surg 27:e324–e329

Forrester LA, McCormick KL, Bonsignore-Opp L, Tedesco LJ, Baranek ES, Jang ES, Tyler WK (2021) Statistical fragility of surgical clinical trials in orthopaedic trauma. JAAOS Glob Res Rev. https://doi.org/10.5435/JAAOSGlobal-D-20-00197

Smith RA, Levine TR, Lachlan KA, Fediuk TA (2002) The high cost of complexity in experimental design and data analysis: type i and type ii error rates in multiway ANOVA. Hum Commun Res 28:515–530

Armstrong RA (2014) When to use the Bonferroni correction. Ophthalmic Physiol Opt 34:502–508

Ranstam J (2016) Multiple P -values and Bonferroni correction. Osteoarthritis Cartilage 24:763–764

Halsey LG (2019) The reign of the p -value is over: what alternative analyses could we employ to fill the power vacuum? Biol Lett 15:20190174


Madjarova, S.J., Williams, R.J., Nwachukwu, B.U. et al. Picking apart p values: common problems and points of confusion. Knee Surg Sports Traumatol Arthrosc 30, 3245–3248 (2022). https://doi.org/10.1007/s00167-022-07083-3



The Enduring Evolution of the P Value

Demetrios N. Kyriacou, MD, PhD (Senior Editor, JAMA; Department of Emergency Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois)

Mathematics and statistical analyses contribute to the language of science and to every scientific discipline. Clinical trials and epidemiologic studies published in biomedical journals are essentially exercises in mathematical measurement. 1 With the extensive contribution of statisticians to the methodological development of clinical trials and epidemiologic theory, it is not surprising that many statistical concepts have dominated scientific inferential processes, especially in research investigating biomedical cause-and-effect relations. 1 - 6 For example, the comparative point estimate of a risk factor (eg, a risk ratio) is used to mathematically express the strength of the association between the presumed exposure and the outcome of interest. 7 - 9 Mathematics is also used to express random variation inherent around the point estimate as a range that is termed a confidence interval . 1 However, despite the greater degree of information provided by point estimates and confidence intervals, the statistic most frequently used in biomedical research for conveying association is the P value. 10 - 14

In this issue of JAMA , Chavalarias et al 15 describe the evolution of P values reported in biomedical literature over the last 25 years. Based on automated text mining of more than 12 million MEDLINE abstracts and more than 800 000 abstracts and full-text articles in PubMed Central, the authors found that a greater percentage of scientific articles reported P values in the presentation of study findings over time, with the prevalence of P values in abstracts increasing from 7.3% in 1990 to 15.6% in 2014. Among the abstracts and full-text articles with P values, 96% reported at least 1 “statistically significant” result, with strong clustering of reported P values around .05 and .001. In addition, in an in-depth manual review of 796 abstracts and 99 full-text articles from articles reporting empirical data, the authors found that P values were reported in 15.7% and 55%, respectively, whereas confidence intervals were reported in only 2.3% of abstracts and were included for all reported effect sizes in only 4% of the full-text articles. The authors suggested that rather than reporting isolated P values, research articles should focus more on reporting effect sizes (eg, absolute and relative risks) and uncertainty metrics (eg, confidence intervals for the effect estimates).

To provide context for the increasing reporting of P values in the biomedical literature over the past 25 years, it is important to consider what a P value really is, some examples of its frequent misconceptions and inappropriate use, and the evidentiary application of P values based on the 3 main schools of statistical inference (ie, Fisherian, Neyman-Pearsonian, and Bayesian philosophies). 10 , 11

The prominence of the P value in the scientific literature is attributed to Fisher, who did not invent this probability measure but did popularize its extensive use for all forms of statistical research methods starting with his seminal 1925 book, Statistical Methods for Research Workers. 16 According to Fisher, the correct definition of the P value is “the probability of the observed result, plus more extreme results, if the null hypothesis were true.” 13,14 Fisher’s purpose was not to use the P value as a decision-making instrument but to provide researchers with a flexible measure of statistical inference within the complex process of scientific inference. In addition, there are important assumptions associated with proper use of the P value. 10,11,13,14 First, there is no relation between the causal factor being investigated and the outcome of interest (ie, the null hypothesis is true). Second, the study design and analyses providing the effect estimate, confidence intervals, and P value for the specific study project are completely free of systematic error (ie, there are no misclassification, selection, or confounding biases). Third, the appropriate statistical test is selected for the analysis (eg, the χ² test for a comparison of proportions).
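To make the third assumption concrete, here is a brief Python sketch (not taken from the editorial) of a chi-squared test comparing two proportions; the 2x2 counts are hypothetical.

```python
# Minimal sketch: chi-squared test for comparing two proportions.
# The counts in the table are hypothetical.
from scipy.stats import chi2_contingency

#                outcome  no outcome
table = [[30, 70],   # exposed group (hypothetical counts)
         [18, 82]]   # unexposed group (hypothetical counts)

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")
```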

Given these assumptions, it is not difficult to see how the concept of the P value became so frequently misunderstood and misused. 10 , 13 , 14 , 16 , 17 Goodman has provided a list of 12 misconceptions of the P value. 14 The most common and egregious of these misconceptions, for example, is that the P value is the probability of the null hypothesis being true. Another prevalent misconception is that if the P value is greater than .05, then the null hypothesis is true and there is no association between the exposure or treatment and outcome of interest.

Within the different philosophies of statistical inference, both the Fisherian and the Neyman-Pearsonian approaches are based on the “frequentist” interpretation of probability, which specifies that an experiment is theoretically considered one of an infinite number of exactly repeated experiments that yield statistically independent results. 10 - 15 , 18 Frequentist methods are the basis of almost all biomedical statistical methods taught for clinical trials and epidemiologic studies. Although both the Fisherian and Neyman-Pearsonian approaches have many similarities, they have important philosophical and practical differences.

Fisher’s approach uses a calculated P value that is interpreted as evidence against the null hypothesis of a particular research finding. 14 The smaller the P value, the stronger the evidence against the null hypothesis. There is no need for a predetermined level of statistical significance for the calculated P value. A null hypothesis can be rejected, but this is not necessarily based on a preset level of significance or probability of committing an error in the hypothesis test (eg, α < .05). In addition, there is no alternative hypothesis. Inference regarding the hypothesis is preferred over a mechanical decision to accept or reject a hypothesis based on a derived probability.

In contrast to Fisher, Neyman and Pearson in the 1930s formalized the hypothesis testing process with a priori assertions and declarations. For example, they added the concept of a formal alternative hypothesis that is mutually exclusive of the null hypothesis. 10 , 11 In addition, a value is preselected to merit the rejection of the null hypothesis known as the significance level. 13 , 14 The goal of the statistical calculations in the Neyman-Pearsonian approach is decision and not inference. By convention, the cutoff for determining statistical significance usually was selected to be a P value below .05. A calculated P value below the preselected level of significance is conclusively determined to be “statistically significant,” and the null hypothesis is rejected in favor of the alternate hypothesis. If the P value is above the level of significance, the null hypothesis is conclusively not rejected and assumed to be true.

Inevitably, this process leads to 2 potential errors. The first is rejecting the null hypothesis when it is actually true. This is known as a type I error and will occur with a frequency based on the level selected for determining significance (α). If α is selected to be .05, then a type I error will occur 5% of the time. The second potential error is accepting the null hypothesis when it is actually false. This is known as a type II error. The complement of a type II error is to reject the null hypothesis when it is truly false. This is termed the statistical power of a study and is the probability that a significance test will detect an effect that truly exists. It is also the basis for calculating sample sizes needed for clinical trials. The objective is to design an experiment to control or minimize both types of errors. 10 , 11
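As a rough illustration of that sample-size logic, the following Python sketch uses the standard normal-approximation formula for comparing two proportions. The event rates, alpha, and power chosen here are hypothetical conventions, not values from the editorial.

```python
# Minimal sketch: sample size per group for comparing two proportions,
# controlling the type I error rate (alpha) and type II error rate (1 - power).
from scipy.stats import norm

alpha, power = 0.05, 0.80          # conventional error rates (illustrative)
p1, p2 = 0.30, 0.20                # hypothetical event rates in the two arms

z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
z_beta = norm.ppf(power)

n_per_group = ((z_alpha + z_beta) ** 2 *
               (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
print(f"approximately {n_per_group:.0f} participants per group")
```

With these assumed rates the formula gives roughly 290 participants per group; demanding more power or a smaller alpha drives the required sample size up, which is the trade-off between the two error types described above.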

The main criticism of the Neyman-Pearsonian approach is the extreme rigidity of thinking and arriving at a conclusion. The researcher must either accept or reject a proposed hypothesis and make a dichotomous scientific decision accordingly based on a predetermined accepted level of statistical significance (eg, α < .05). Making decisions with such limited flexibility is usually neither realistic nor prudent. For example, it would be unreasonable to decide that a new cancer medication was ineffective because the calculated P value from a phase 2 trial was .051 and the predetermined level of statistical significance was considered to be less than .05.

Statistical and scientific inference need not be constricted by such rigid thinking. A form of inductive inference can be used to assess causal relations with degrees of certainty characterized as spectrums of probabilities. 19 This form of scientific reasoning, known as Bayesian induction , is especially useful for both statistical and scientific inferences by which effects are observed and the cause must be inferred. For example, if an investigator finds an association between a particular exposure and a specific health-related outcome, the investigator will infer the possibility of a causal relation based on the findings in conjunction with prior studies that evaluated the same possible causal effect. The degree of inference can be quantified using prior estimations of the effect estimate being evaluated.

The main advantage of Bayesian inductive reasoning is the ability to quantify the amount of certainty in terms of known or estimated conditional probabilities. Prior probabilities are transformed into posterior probabilities based on information obtained and included in Bayesian calculations. The main limitation of Bayesian methods is that prior information is often unknown or not precisely quantified, making the calculation of posterior probabilities potentially inaccurate. In addition, calculating Bayes factors (a statistical measure for quantifying evidence for a hypothesis based on Bayesian calculations) as an alternative to P values requires additional computational steps. 20 , 21 Moreover, Bayesian methods are often not taught in classical statistics courses. For these reasons, Bayesian methods are not frequently used in most biomedical research analyses. 22 However, scientific inferences based on using both P values and Bayesian methods are not necessarily mutually exclusive. Greenland and Poole 22 have suggested incorporating P values into modern Bayesian analysis frameworks.
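One minimal illustration of this prior-to-posterior updating, written as a sketch rather than taken from the editorial, is a conjugate beta-binomial model for a treatment response rate; the prior parameters and trial counts are hypothetical.

```python
# Minimal sketch: Bayesian updating of a prior belief about a response rate.
# Prior parameters and observed counts are hypothetical.
from scipy.stats import beta

# Prior: roughly centred on a 30% response rate (Beta(3, 7)).
a_prior, b_prior = 3, 7

# Observed data: 14 responders out of 40 patients.
responders, n = 14, 40

# Conjugate update: posterior is Beta(a + successes, b + failures).
a_post = a_prior + responders
b_post = b_prior + (n - responders)

posterior = beta(a_post, b_post)
lo, hi = posterior.ppf([0.025, 0.975])
print(f"posterior mean response rate = {posterior.mean():.2f}")
print(f"95% credible interval = ({lo:.2f}, {hi:.2f})")
```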

Fundamentally, statistical inference using P values involves mathematical attempts to facilitate the development of explanatory theory in the context of random error. However, P values provide only a particular mathematical description of a specific data set and not a comprehensive scientific explanation of cause-and-effect relationships in a target population. Each step in the biomedical scientific process should be guided by investigators and biostatisticians who understand and incorporate subject matter knowledge into the research process from prior epidemiologic studies, clinical research, basic science, and biological theory.

With the increasing use of P values in the biomedical literature as reported by Chavalarias et al, it becomes critically important to understand the true meaning of the P value, including its strengths, limitations, and most appropriate application for statistical inference. Despite increased teaching of research methods and statistics to clinicians and investigators, the authors' finding that such a small proportion of abstracts reported effect sizes or measures of uncertainty is disappointing. There is nothing inherently wrong when P values are correctly used and interpreted. However, the automatic application of dichotomized hypothesis testing based on prearranged levels of statistical significance should be substituted with a more complex process using effect estimates, confidence intervals, and even P values, thereby permitting scientists, statisticians, and clinicians to use their own inferential capabilities to assign scientific significance.

Corresponding Author: Demetrios N. Kyriacou, MD, PhD, JAMA, 330 N Wabash, Chicago, IL 60611.

Conflict of Interest Disclosures: The author has completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.


Kyriacou DN. The Enduring Evolution of the P Value. JAMA. 2016;315(11):1113–1115. doi:10.1001/jama.2016.2152


P-Values: Misunderstood and Misused (Mini Review)


B. Vidgen and T. Yasseri, Oxford Internet Institute, University of Oxford, Oxford, UK

P-values are widely used in both the social and natural sciences to quantify the statistical significance of observed results. The recent surge of big data research has made the p-value an even more popular tool to test the significance of a study. However, substantial literature has been produced critiquing how p-values are used and understood. In this paper we review this recent critical literature, much of which is rooted in the life sciences, and consider its implications for social scientific research. We provide a coherent picture of what the main criticisms are, and draw together and disambiguate common themes. In particular, we explain how the False Discovery Rate (FDR) is calculated, and how this differs from a p-value. We also make explicit the Bayesian nature of many recent criticisms, a dimension that is often underplayed or ignored. We conclude by identifying practical steps to help remediate some of the concerns identified. We recommend that (i) far lower significance levels are used, such as 0.01 or 0.001, and (ii) p-values are interpreted contextually, and situated within both the findings of the individual study and the broader field of inquiry (through, for example, meta-analyses).

1. Introduction

P -values are widely used in both the social and natural sciences to quantify the statistical significance of observed results. Obtaining a p -value that indicates “statistical significance” is often a requirement for publishing in a top journal. The emergence of computational social science, which relies mostly on analyzing large scale datasets, has increased the popularity of p -values even further. However, critics contend that p -values are routinely misunderstood and misused by many practitioners, and that even when understood correctly they are an ineffective metric: the standard significance level of 0.05 produces an overall FDR that is far higher, more like 30%. Others argue that p -values can be easily “hacked” to indicate statistical significance when none exists, and that they encourage the selective reporting of only positive results.

Considerable research exists into how p-values are (mis)used [e.g., 1, 2]. In this paper we review the recent critical literature on p-values, much of which is rooted in the life sciences, and consider its implications for social scientific research. We provide a coherent picture of what the main criticisms are, and draw together and disambiguate common themes. In particular, we explain how the FDR is calculated, and how this differs from a p-value. We also make explicit the Bayesian nature of many recent criticisms. In the final section we identify practical steps to help remediate some of the concerns identified.

P -values are used in Null Hypothesis Significance Testing (NHST) to decide whether to accept or reject a null hypothesis (which typically states that there is no underlying relationship between two variables). If the null hypothesis is rejected, this gives grounds for accepting the alternative hypothesis (that a relationship does exist between two variables). The p -value quantifies the probability of observing results at least as extreme as the ones observed given that the null hypothesis is true. It is then compared against a pre-determined significance level (α). If the reported p -value is smaller than α the result is considered statistically significant. Typically, in the social sciences α is set at 0.05. Other commonly used significance levels are 0.01 and 0.001.
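To make this decision rule concrete, here is a minimal Python sketch of the NHST workflow on simulated data; the group means, spread, and sample sizes are invented for illustration.

```python
# Minimal sketch: basic NHST workflow with a two-sample t test and alpha = 0.05.
# All data are simulated for illustration.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)  # hypothetical measurements
group_b = rng.normal(loc=11.0, scale=2.0, size=30)

alpha = 0.05
statistic, p_value = ttest_ind(group_a, group_b)

print(f"p = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```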

In his seminal paper, “The Earth is Round ( p < .05)” Cohen argues that NHST is highly flawed: it is relatively easy to achieve results that can be labeled significant when a “nil” hypothesis (where the effect size of H 0 is set at zero) is used rather than a true “null” hypothesis (where the direction of the effect, or even the effect size, is specified) [ 3 ]. This problem is particularly acute in the context of “big data” exploratory studies, where researchers only seek statistical associations rather than causal relationships. If a large enough number of variables are examined, effectively meaning that a large number of null/alternative hypotheses are specified, then it is highly likely that at least some “statistically significant” results will be identified, irrespective of whether the underlying relationships are truly meaningful. As big data approaches become more common this issue will become both far more pertinent and problematic, with the robustness of many “statistically significant” findings being highly limited.
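A small simulation, not taken from the paper, makes the multiple-comparisons point explicit: when 1,000 variables that are pure noise are each tested against a single outcome, roughly 50 of them will clear p < 0.05 by chance alone.

```python
# Minimal sketch: exploratory testing of many unrelated variables.
# Everything here is simulated noise, so every "hit" is a false positive.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_obs, n_vars = 200, 1000

outcome = rng.normal(size=n_obs)
predictors = rng.normal(size=(n_obs, n_vars))   # unrelated to the outcome

p_values = np.array([pearsonr(predictors[:, j], outcome)[1] for j in range(n_vars)])
false_hits = np.sum(p_values < 0.05)
print(f"{false_hits} of {n_vars} null variables reached p < 0.05")  # ~50 expected
```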

Lew argues that the central problem with NHST is reflected in its hybrid name, which is a combination of (i) hypothesis testing and (ii) significance testing [ 4 ]. In significance testing, first developed by Ronald Fisher in the 1920s, the p -value provides an index of the evidence against the null hypothesis. Originally, Fisher only intended for the p -value to establish whether further research into a phenomenon could be justified. He saw it as one bit of evidence to either support or challenge accepting the null hypothesis, rather than as conclusive evidence of significance [5; see also 6, 7]. In contrast, hypothesis tests, developed separately by Neyman and Pearson, replace Fisher's subjectivist interpretation of the p -value with a hard and fast “decision rule”: when the p -value is less than α, the null can be rejected and the alternative hypothesis accepted. Though this approach is simpler to apply and understand, a crucial stipulation of it is that a precise alternative hypothesis must be specified [ 6 ]. This means indicating what the expected effect size is (thereby setting a nil rather than a null hypothesis)—something that most researchers rarely do [ 3 ].

Though hypothesis tests and significance tests are distinct statistical procedures, and there is much disagreement about whether they can be reconciled into one coherent framework, NHST is widely used as a pragmatic amalgam for conducting research [ 8 , 9 ]. Hurlbert and Lombardi argue that one of the biggest issues with NHST is that it encourages the use of terminology such as significant/nonsignificant. This dichotomizes the p-value on an arbitrary basis, and converts a probability into a certainty. This is unhelpful when the purpose of using statistics, as is typically the case in academic studies, is to weigh up evidence incrementally rather than make an immediate decision [9, p. 315]. Hurlbert and Lombardi's analysis suggests that the real problem lies not with p-values, but with α and how this has led to p-values being interpreted dichotomously: too much importance is attached to the arbitrary cutoff α ≤ 0.05.

2. The False Discovery Rate

A p -value of 0.05 is normally interpreted to mean that there is a 1 in 20 chance that the observed results are nonsignificant, having occurred even though no underlying relationship exists. Most people then think that the overall proportion of results that are false positives is also 0.05. However, this interpretation confuses the p -value (which, in the long run, will approximately correspond to the type I error rate ) with the FDR. The FDR is what people usually mean when they refer to the error rate: it is the proportion of reported discoveries that are false positives. Though 0.05 might seem a reasonable level of inaccuracy, a type I error rate of 0.05 will likely produce an FDR that is far higher, easily 30% or more. The formula for FDR is:

Calculating the number of true positives and false positives requires knowing not just the type I error rate but also (i) the statistical power, or "sensitivity," of tests and (ii) the prevalence of effects [ 10 ]. Statistical power is the probability that each test will correctly reject the null hypothesis when the alternative hypothesis is true. As such, tests with higher power are more likely to correctly record real effects. Prevalence is the proportion of effects, out of all the effects that are tested for, that actually exist in the real world. In the FDR calculation it determines the weighting given to the power and the type I error rate. Low prevalence contributes to a higher FDR as it increases the likelihood that false positives will be recorded. The calculation for the FDR is therefore:

$$\mathrm{FDR} = \frac{\alpha\,(1 - \text{prevalence})}{\alpha\,(1 - \text{prevalence}) + \text{power} \times \text{prevalence}}$$

where α is the type I error rate.

The percentage of reported positives that are actually true is called the Positive Predictive Value (PPV). The PPV and FDR are inversely related, such that a higher PPV necessarily means a lower FDR. To calculate the FDR we subtract the PPV from 1. If there are no false positives then PPV = 1 and FDR = 0. Table 1 shows how low prevalence of effects, low power, and a high type I error rate all contribute to a high FDR.


Table 1. Greater prevalence, greater power, and a lower Type I error rate reduce the FDR .
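The pattern summarized in Table 1 can be reproduced with a few lines of Python. The sketch below is illustrative, with made-up combinations of prevalence, power, and alpha rather than the table's own entries; note that the 36% figure cited below falls out of the same arithmetic (prevalence 0.1, power 0.8, alpha 0.05).

```python
# Minimal sketch: expected FDR as a function of prevalence, power, and alpha.
# The parameter combinations below are illustrative assumptions.
def false_discovery_rate(prevalence: float, power: float, alpha: float) -> float:
    """Expected proportion of reported positives that are false."""
    true_positives = power * prevalence
    false_positives = alpha * (1 - prevalence)
    return false_positives / (false_positives + true_positives)

for prevalence, power, alpha in [(0.1, 0.8, 0.05),
                                 (0.1, 0.3, 0.05),
                                 (0.5, 0.8, 0.05),
                                 (0.1, 0.8, 0.001)]:
    fdr = false_discovery_rate(prevalence, power, alpha)
    print(f"prevalence={prevalence}, power={power}, alpha={alpha} -> FDR={fdr:.2f}")
```

Running the sketch shows the FDR falling as prevalence rises, as power rises, and as alpha is tightened, which is exactly the relationship the table describes.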

Most estimates of the FDR are surprisingly large, e.g., 50% [ 1 , 11 , 12 ] or 36% [ 10 ]. Jager and Leek more optimistically suggest that it is just 14% [ 13 ]. This lower estimate can be explained somewhat by the fact that they only use p-values reported in abstracts and apply a different algorithm from the other studies. Importantly, they highlight that whilst α is normally set to 0.05, many studies, particularly in the life sciences, achieve p-values far lower than this, meaning that the average type I error rate is less than α of 0.05 [13, p. 7]. Counterbalancing this, however, is Colquhoun's argument that because most studies are not "properly designed" (in the sense that treatments are not randomly allocated to groups and in RCTs assessments are not blinded), statistical power will often be far lower than reported, thereby driving the FDR back up again [ 10 ].

Thus, though difficult to calculate precisely, the evidence suggests that the FDR of findings overall is far higher than α of 0.05. This suggests that too much trust is placed in current research, much of which is wrong far more often than we think. It is also worth noting that this analysis assumes that researchers do not intentionally misreport or manipulate results to erroneously achieve statistical significance. These phenomena, known as “selective reporting” and “p-hacking,” are considered separately in Section 4.

3. Prevalence and Bayes

As noted above, the prevalence of effects significantly impacts the FDR, whereby lower prevalence increases the likelihood that reported effects are false positives. Yet prevalence is not controlled by the researcher and, furthermore, cannot be calculated with any reliable accuracy. There is no way of knowing objectively what the underlying prevalence of real effects is. Indeed, the tools by which we might hope to find out this information (such as NHST) are precisely what have been criticized in the literature surveyed here. Instead, to calculate the FDR, prevalence has to be estimated 1 . In this regard, FDR calculations are inherently Bayesian as they require the researcher to quantify their subjective belief about a phenomenon (in this instance, the underlying prevalence of real effects).

Bayesian theory is an alternative paradigm of statistical inference to frequentism, of which NHST is part. Whereas frequentists quantify the probability of the data given the null hypothesis, P(D | H0), Bayesians calculate the probability of the hypothesis given the data, P(H1 | D). Though frequentism is far more widely practiced than Bayesianism, Bayesian inference is more intuitive: it assigns a probability to a hypothesis based on how likely we think it to be true.

The FDR calculations outlined above in Section 2 follow a Bayesian logic. First, a probability is assigned to the prior likelihood of a result being false (1 − prevalence). Then, new information (the statistical power and type I error rate) is incorporated to calculate a posterior probability (the FDR). A common criticism of Bayesian methods such as this is that they are insufficiently objective, as the prior probability is only a guess. Whilst this is correct, the large number of "findings" produced each year, as well as the low rates of replicability [ 14 ], suggest that the prevalence of effects is, overall, fairly low. Another criticism of Bayesian inference is that it is overly conservative: assigning a low value to the prior probability makes it more likely that the posterior probability will also be low [ 15 ]. These criticisms notwithstanding, Bayesian theory offers a useful way of quantifying how likely it is that research findings are true.

Not all of the authors in the literature reviewed here explicitly state that their arguments are Bayesian. The reason for this is best articulated by Colquhoun, who writes that “the description ‘Bayesian’ is not wrong but it is not necessary” [10, p. 5]. The lack of attention paid to Bayes in Ioannidis' well-regarded early article on p -values is particularly surprising given his use of Bayesian terminology: “the probability that a research finding is true depends on the prior probability of it being true (before doing the study)” [1, p. 696]. This perhaps reflects the uncertain position that Bayesianism holds in most universities, and the acrimonious nature of its relationship with frequentism [ 16 ]. Without commenting on the broader applicability of Bayesian statistical inference, we argue that a Bayesian methodology has great utility in assessing the overall credibility of academic research, and that it has received insufficient attention in previous studies. Here, we have sought to make visible, and to rectify, this oversight.

4. Publication Bias: Selective Reporting and P-Hacking

Selective reporting and p-hacking are two types of researcher-driven publication bias. Selective reporting is where nonsignificant (but methodologically robust) results are not reported, often because top journals consider them to be less interesting or important [ 17 ]. This skews the distribution of reported results toward positive findings, and arguably further increases the pressure on researchers to achieve statistical significance. Another form of publication bias, which also skews results toward positive findings, is called p-hacking. Head et al. define p-hacking as “when researchers collect or select data or statistical analyses until nonsignificant results become significant” [ 18 ]. This is direct manipulation of results so that, whilst they may not be technically false, they are unrepresentative of the underlying phenomena. See Figure 1 for a satirical illustration.


Figure 1. “Significant”: an illustration of selective reporting and statistical significance from XKCD . Available online at http://xkcd.com/882/ (Accessed February 16, 2016).

Head et al. outline specific mechanisms by which p-values are intentionally "hacked." These include: (i) conducting analyses midway through experiments, (ii) recording many response variables and only deciding which to report postanalysis, (iii) excluding, combining, or splitting treatment groups postanalysis, (iv) including or excluding covariates postanalysis, (v) stopping data exploration if analysis yields a significant p-value. An excellent demonstration of how p-values can be hacked by manipulating the parameters of an experiment is Christie Aschwanden's interactive "Hack Your Way to Scientific Glory" [ 19 ]. This simulator, which analyses whether Republicans or Democrats being in office affects the US economy, shows how tests can be manipulated to produce statistically significant results supporting either party.
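Mechanism (v), optional stopping, is easy to demonstrate in a short simulation. The sketch below is illustrative only, with invented batch sizes and limits; because all the data are pure noise, every "significant" finding it produces is a false positive created by repeatedly checking the p-value and stopping as soon as it dips below 0.05.

```python
# Minimal sketch: p-hacking by optional stopping on pure noise.
# Batch size, maximum sample size, and the number of simulated "studies"
# are arbitrary illustrative choices.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

def hack_until_significant(max_n=200, batch=10, alpha=0.05):
    a = list(rng.normal(size=batch))
    b = list(rng.normal(size=batch))
    while len(a) < max_n:
        p = ttest_ind(a, b).pvalue
        if p < alpha:
            return True                      # stop and report "significance"
        a.extend(rng.normal(size=batch))     # otherwise collect more data
        b.extend(rng.normal(size=batch))
    return False

hits = sum(hack_until_significant() for _ in range(1000))
print(f"{hits / 10:.1f}% of null experiments ended 'significant'")  # well above 5%
```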

In separate papers, Head et al. [ 18 ] and de Winter and Dodou [20] each examine the distributions of p-values reported in scientific publications in different disciplines. Both report considerably more studies with p-values just below the 0.05 significance level than just above it (and considerably more than would be expected given the number of p-values that occur in other ranges), which suggests that p-hacking is taking place. This core finding is supported by Jager and Leek's study on "significant" publications as well [ 13 ].

5. What To Do

We argued above that a Bayesian approach is useful to estimate the FDR and assess the overall trustworthiness of academic findings. However, this does not mean that we also hold that Bayesian statistics should replace frequentist statistics more generally in empirical research [see: 21]. In this concluding section we recommend some pragmatic changes to current (frequentist) research practices that could lower the FDR and thus improve the credibility of findings.

Unfortunately, researchers cannot control how prevalent effects are. They only have direct influence over their study's α and its statistical power. Thus, one step to reduce the FDR is to make the norms for these more rigorous, such as by increasing the statistical power of studies. We strongly recommend that α of 0.05 is dropped as a convention, and replaced with a far lower α as standard, such as 0.01 or 0.001; see Table 1 . Other suggestions for improving the quality of statistical significance reporting include using confidence intervals [ 7 , p. 152]. Some have also called for researchers to focus more on effect sizes than statistical significance [ 22 , 23 ], arguing that statistically significant studies that have negligible effect sizes should be treated with greater skepticism. This is of particular importance in the context of big data studies, where many “statistically significant” studies report small effect sizes as the association between the dependent and independent variables is very weak.
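As a concrete illustration of reporting effect size and a confidence interval alongside the p-value, the following Python sketch computes Cohen's d and a 95% confidence interval for a difference in means; the data, group sizes, and scales are simulated, not drawn from any study.

```python
# Minimal sketch: report the effect size and 95% CI, not just the p value.
# All data are simulated for illustration.
import numpy as np
from scipy.stats import ttest_ind, t

rng = np.random.default_rng(3)
treatment = rng.normal(loc=52.0, scale=10.0, size=80)   # hypothetical scores
control = rng.normal(loc=50.0, scale=10.0, size=80)

diff = treatment.mean() - control.mean()
n1, n2 = len(treatment), len(control)
pooled_var = ((n1 - 1) * treatment.var(ddof=1)
              + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
pooled_sd = np.sqrt(pooled_var)

cohens_d = diff / pooled_sd                              # standardised effect size
se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
df = n1 + n2 - 2
ci = diff + np.array([-1, 1]) * t.ppf(0.975, df) * se    # 95% CI for the difference

p_value = ttest_ind(treatment, control).pvalue
print(f"difference = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), "
      f"d = {cohens_d:.2f}, p = {p_value:.3f}")
```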

Perhaps more important than any specific technical change in how data is analyzed is the growing consensus that research processes need to be implemented (and recorded) more transparently. Nuzzo, for example, argues that “one of the strongest protections for scientists is to admit everything” [ 7 , p. 152]. Head et al. also suggest that labeling research as either exploratory or confirmatory will help readers to interpret the results more faithfully [ 18 , p. 12]. Weissgerber et al. encourage researchers to provide “a more complete presentation of data,” beyond summary statistics [ 24 ]. Improving transparency is particularly important in “big” data-mining studies, given that the boundary between data exploration (a legitimate exercise) and p-hacking is often hard to identify, creating significant potential for intentional or unintentional manipulation of results. Several commentators have recommended that researchers pre-register all studies with initiatives such as the Open Science Framework [ 1 , 7 , 14 , 18 , 25 ]. Pre-registering ensures that a record is kept of the proposed method, effect size measurement, and what sort of results will be considered noteworthy. Any deviation from what is initially registered would then need to be justified, which would give the results greater credibility. Journals could also proactively assist researchers to improve transparency by providing platforms on which data and code can be shared, thus allowing external researchers to reproduce a study's findings and trace the method used [ 18 ]. This would provide academics with the practical means to corroborate or challenge previous findings.

Scientific knowledge advances through corroboration and incremental progress. In keeping with Fisher's initial view that p -values should be one part of the evidence used when deciding whether to reject the null hypothesis, our final suggestion is that the findings of any single study should always be contextualized within the broader field of research. Thus, we endorse the view offered in a recent editorial of Psychological Science that we should be extra skeptical about studies where (a) the statistical power is low, (b) the p -value is only slightly below 0.05, and (c) the result is surprising [ 14 ]. Normally, findings are only accepted once they have been corroborated through multiple studies, and even in individual studies it is common to “triangulate” a result with multiple methods and/or data sets. This offers one way of remediating the problem that even “statistically significant” results can be false; if multiple studies find an effect then it is more likely that it truly exists. We therefore, also support the collation and organization of research findings in meta-analyses as these enable researchers to quickly evaluate a large range of relevant evidence.

Author Contributions

All authors listed, have made substantial, direct and intellectual contribution to the work, and approved it for publication.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

For providing useful feedback on the original manuscript we thank Jonathan Bright, Sandra Wachter, Patricia L. Mabry, and Richard Vidgen.

1. ^ In much of the recent literature it is assumed that prevalence is very low, around 0.1 or 0.2 [ 1 , 10 , 11 , 12 ].

1. Ioannidis, J. Why most published research findings are false. PLoS Med . (2005) 2 :e124. doi: 10.1371/journal.pmed.0020124


2. Ziliak ST, McCloskey DN. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, MI: University of Michigan Press (2008).


3. Cohen J. The earth is round ( p < .05). Am Psychol. (1994) 49 :997–1003.

4. Lew MJ. To P or not to P: on the evidential nature of P-values and their place in scientific inference. arXiv preprint arXiv:13110081 (2013).


5. Fisher RA. Statistical Methods for Research Workers. Edinburgh: Genesis Publishing Pvt. Ltd. (1925).

6. Sterne JA, Smith GD. Sifting the evidence–what's wrong with significance tests? BMJ (2001) 322 :226–31. doi: 10.1136/bmj.322.7280.226

7. Nuzzo R. Statistical errors. Nature (2014) 506 :150–2. doi: 10.1038/506150a

8. Berger JO. Could fisher, jeffreys and neyman have agreed on testing? Stat Sci. (2003) 18 :1–32. doi: 10.1214/ss/1056397485


9. Hurlbert SH, Lombardi CM. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Ann Zool Fenn. (2009) 46 :311–49. doi: 10.5735/086.046.0501


10. Colquhoun D. An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci. (2014) 1 :140216. doi: 10.1098/rsos.140216

11. Biau DJ, Jolles BM, Porcher R. P value and the theory of hypothesis testing: an explanation for new researchers. Clin Orthop Relat Res. (2010) 468 :885–92. doi: 10.1007/s11999-009-1164-4

12. Freedman LP, Cockburn IM, Simcoe TS. The economics of reproducibility in preclinical research. PLoS Biol. (2015) 13 :e1002165. doi: 10.1371/journal.pbio.1002165

13. Jager LR, Leek JT. An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics (2014) 15 :1–12. doi: 10.1093/biostatistics/kxt007

14. Lindsay DS. Replication in psychological science. Psychol Sci. (2015). 26 :1827–32. doi: 10.1177/0956797615616374

15. Gelman A. Objections to Bayesian statistics. Bayesian Anal. (2008) 3 :445–9. doi: 10.1214/08-BA318

16. McGrayne SB. The Theory that Would not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, & Emerged Triumphant from Two Centuries of Controversy. New Haven, CT: Yale University Press (2011).

17. Franco A, Malhotra N, Simonovits G. Publication bias in the social sciences: unlocking the file drawer. Science (2014) 345 :1502–5. doi: 10.1126/science.1255484

18. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biol. (2015) 13:e1002106. doi: 10.1371/journal.pbio.1002106

19. Aschwanden C. Science Isn't Broken, Five Thirty Eight. Available online at: http://fivethirtyeightcom/features/science-isnt-broken/ (Accessed January 22, 2016), (2015).

20. de Winter JC, Dodou D. A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ (2015) 3:e733. doi: 10.7287/peerj.preprints.447v4

21. Simonsohn U. Posterior-Hacking: Selective Reporting Invalidates Bayesian Results Also. Philadelphia, PA: University of Pennsylvania, The Wharton School; Draft Paper (2014).

22. Coe R. It's the effect size, stupid: what effect size is and why it is important. In: Paper Presented at the British Educational Research Association Annual Conference . Exeter (2002).

23. Sullivan GM, Feinn R. Using effect size, or why the P value is not enough. J Grad Med Educ. (2012) 4:279–82. doi: 10.4300/JGME-D-12-00156.1

24. Weissgerber TL, Milic NM, Winham SJ, Garovic VD. Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol. (2015) 13 :e1002128. doi: 10.1371/journal.pbio.1002128

25. Peplow M. Social sciences suffer from severe publication bias. Nature (2014). doi: 10.1038/nature.2014.15787. Available online at: http://www.nature.com/news/social-sciences-suffer-from-severe-publication-bias-1.15787

Keywords: p -value, statistics, significance, p-hacking, prevalence, Bayes, big data

Citation: Vidgen B and Yasseri T (2016) P -Values: Misunderstood and Misused. Front. Phys. 4:6. doi: 10.3389/fphy.2016.00006

Received: 25 January 2016; Accepted: 19 February 2016; Published: 04 March 2016.


Copyright © 2016 Vidgen and Yasseri. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Taha Yasseri

This article is part of the Research Topic

At the Crossroads: Lessons and Challenges in Computational Social Science


Misleading p-values showing up more often in biomedical journal articles

A review of p-values in the biomedical literature from 1990 to 2015 shows that these widely misunderstood statistics are being used increasingly, instead of better metrics of effect size or uncertainty.

March 15, 2016 - By Jennie Dusheck


A study of millions of journal articles shows that their authors are increasingly reporting p-values but are often doing so in a misleading way, according to a study by researchers at the Stanford University School of Medicine . P-values are a measure of statistical significance intended to inform scientific conclusions.

Because p-values are so often misapplied, their increased use probably doesn’t indicate an improvement in the way biomedical research is conducted or the way data are analyzed, the researchers found.

“It’s usually a suboptimal technique, and then it’s used in a biased way, so it can become very misleading,” said John Ioannidis , MD, DSc, professor of disease prevention and of health research and policy and co-director of the Meta-Research Innovation Center at Stanford .

The study was published March 15 in JAMA . Ioannidis is the senior author. The lead author is David Chavalarias, PhD, director of the Complex Systems Institute in France.

When p-values = embarrassment

The Ioannidis team used automated text mining to search the biomedical databases MEDLINE and PubMed Central for the appearance of p-values in millions of abstracts, and also manually reviewed 1000 abstracts and 100 full papers. All the papers were published between 1990 and 2015.

The widespread misuse of p-values — often creating the illusion of credible research — has become an embarrassment to several academic fields, including psychology and biomedicine, especially since Ioannidis began publishing critiques of the way modern research is conducted.

Reports in Nature , STAT and FiveThirtyEight , for example, have covered the weaknesses of p-values. On March 7, the American Statistical Association issued a statement warning against their misuse. In one of a series of essays accompanying the statement, Boston University epidemiologist Kenneth Rothman, DMD, DrPH, wrote, “These are pernicious problems. ... It is a safe bet that people have suffered or died because scientists (and editors, regulators, journalists and others) have used significance tests to interpret results, and have consequently failed to identify the most beneficial courses of action.”

At Stanford, Ioannidis’ team found that among all the millions of biomedical abstracts in the databases, the reporting of p-values more than doubled from 7.3 percent in 1990 to 15.6 percent in 2014. In abstracts from core medical journals, 33 percent reported p-values, and in the subset of randomized controlled clinical trials, nearly 55 percent reported p-values.


The meaning of p-values

P-values are designed to illuminate a fundamental statistical conundrum. Suppose a clinical trial compares two drug treatments, and drug A appears to be 10 percent more effective than drug B. That could be because drug A is truly 10 percent more effective. Or it could be that chance just happened to make drug A appear more effective in that trial. In short, drug A could have just gotten lucky. How do you know?

A p-value estimates how likely it is that data could come out the way they did if a “null hypothesis” were true — in this case, that there is no difference between the effects of drugs A and B. So, for example, if drugs A and B are equally effective and you run a study comparing them, a p-value of 0.05 means that drug A will appear to be at least 10 percent more effective than drug B about 5 percent of the time.

In other words, assuming the drugs have the same effect, the p-value estimates how likely it is to get a result suggesting A is at least 10 percent better.

“The exact definition of p-value,” said Ioannidis, “is that if the null hypothesis is correct, the p-value is the chance of observing the research result or some more extreme result.” Unfortunately, many researchers mistakenly think that a p-value is an estimate of how likely it is that the null hypothesis is not correct or that the result is true.
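That definition can be checked with a short simulation, written here as an illustrative sketch rather than anything from the study: when the two "drugs" are given identical true effects, roughly 5% of simulated trials still return p < 0.05.

```python
# Minimal sketch: under a true null (identical drugs), about 5% of trials
# still cross the conventional 0.05 threshold. All numbers are invented.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)
n_trials, n_patients = 10_000, 100

false_positives = 0
for _ in range(n_trials):
    drug_a = rng.normal(loc=0.0, scale=1.0, size=n_patients)  # same true effect
    drug_b = rng.normal(loc=0.0, scale=1.0, size=n_patients)
    if ttest_ind(drug_a, drug_b).pvalue < 0.05:
        false_positives += 1

print(f"{100 * false_positives / n_trials:.1f}% of null trials had p < 0.05")
```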

P-values' truth

“The p-value does not tell you whether something is true. If you get a p-value of 0.01, it doesn’t mean you have a 1 percent chance of something not being true,” Ioannidis added. “A p-value of 0.01 could mean the result is 20 percent likely to be true, 80 percent likely to be true or 0.1 percent likely to be true — all with the same p-value. The p-value alone doesn’t tell you how true your result is.”


For an actual estimate of how likely a result is to be true or false, said Ioannidis, researchers should instead use false-discovery rates or Bayes factor calculations.

Despite the serious limitations of p-values, they have become a symbol of good experimental design in the current era. But unfortunately, they are little more than a symbol. Ioannidis and his team found that practically the only p-values reported in abstracts were those defined somewhat arbitrarily as “statistically significant” — a number typically set at less than 0.05. The team found that 96 percent of abstracts with p-values had at least one such “statistically significant” p-value.

“That suggests there’s selective pressure favoring more extreme results. The fact that you have so many significant results is completely unrealistic. It’s impossible that 96 percent of all the hypotheses being tested would be significant,” said Ioannidis.

But how big was the effect?

Despite increasing numbers of papers reporting that results were statistically significant, few papers reported how much of an effect a treatment had compared to controls or placebos. For example, suppose 10,000 patients showed an average improvement in symptoms that was statistically significant compared with another 10,000 who didn’t get the drug. But if patients on the drug were only 1 percent better, the statistical significance derived from the p-value would likely have no practical value.
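The scenario above is easy to reproduce in a short sketch. The score scale, means, and spread below are invented; the point is simply that with 10,000 patients per arm, a roughly 1% average improvement can yield a very small p-value while the standardized effect size remains trivially small.

```python
# Minimal sketch: a huge sample makes a negligible effect "statistically
# significant." All values (scale, means, spread) are hypothetical.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(5)
n = 10_000
control = rng.normal(loc=50.0, scale=10.0, size=n)   # hypothetical symptom scores
drug = rng.normal(loc=50.5, scale=10.0, size=n)      # ~1% higher on average

result = ttest_ind(drug, control)
diff = drug.mean() - control.mean()
cohens_d = diff / np.sqrt((drug.var(ddof=1) + control.var(ddof=1)) / 2)

print(f"p = {result.pvalue:.2g}")        # typically very small
print(f"Cohen's d = {cohens_d:.3f}")     # trivially small effect
```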

Of the 796 papers manually reviewed by the Ioannidis team that contained empirical data, only 111 reported effect sizes and only 18 reported confidence intervals (a measure of the uncertainty about the magnitude of the effect). Finally, none reported Bayes factors or false-discovery rates, which Ioannidis said are better-suited to telling us if what is observed is true. Fewer than 2 percent of abstracts the team reviewed reported both an effect size and a confidence interval.

In a manual review of 99 randomly selected full-text articles with data, 55 reported at least one p-value, but only four reported confidence intervals for all effect sizes, none used Bayesian methods and only one used false-discovery rates.

Ioannidis advocates more stringent approaches to analyzing data. “The way to move forward,” he said, “is that p-values need to be used more selectively. When used, they need to be complemented by effect sizes and uncertainty [confidence intervals]. And it would often be a good idea to use a Bayesian approach or a false-discovery rate to answer the question, ‘How likely is this result to be true?’”

Suboptimal technique

“Across the entire literature," Ioannidis said, "the statistical approaches used are often suboptimal. P-values are potentially very misleading, and they are selectively reported in favor of more significant results, especially in the abstracts. And authors underuse metrics that would be more meaningful and more useful to have — effect sizes, confidence intervals and other metrics that can add value in understanding what the results mean.”

Joshua David Wallach, a doctoral fellow at Stanford, is a co-author of the paper.

This research was supported by the Meta-Research Innovation Center at Stanford, known as METRICS, through a grant from the Laura and John Arnold Foundation ; a grant from the CNRS Mastodons program; and a grant from Sue and Bob O’Donnell to the Stanford Prevention Research Center .

Stanford’s Department of Medicine also supported the work.





  15. New Guidelines for Statistical Reporting in the Journal

    Some Journal readers may have noticed more parsimonious reporting of P values in our research articles over the past year. For example, in November 2018, we published two reports from the Vitamin ...

  16. Why P Values Are Not a Useful Measure of Evidence in Statistical

    Reporting p values from statistical significance tests is common in psychology's empirical literature. Sir Ronald Fisher saw the p value as playing a useful role in knowledge development by acting as an `objective' measure of inductive evidence against the null hypothesis. We review several reasons why the p value is an unobjective and inadequate measure of evidence when statistically testing ...

  17. The Practical Alternative to the p Value Is the Correctly Used p Value

    Abstract. Because of the strong overreliance on p values in the scientific literature, some researchers have argued that we need to move beyond p values and embrace practical alternatives. When proposing alternatives to p values statisticians often commit the "statistician's fallacy," whereby they declare which statistic researchers ...

  18. p -Values: The Insight to Modern Statistical Inference

    I introduce a p-value function that derives from the continuity inherent in a wide range of regular statistical models. This provides confidence bounds and confidence sets, tests, and estimates that all reflect model continuity. The development starts with the scalar-variable scalar-parameter exponential model and extends to the vector-parameter model with scalar interest parameter, then to ...

  19. Misleading p-values showing up more often in biomedical journal

    A review of p-values in the biomedical literature from 1990 to 2015 shows that these widely misunderstood statistics are being used increasingly, instead of better metrics of effect size or uncertainty. ... A review of p-values in the biomedical literature from 1990 to 2015 shows that these widely misunderstood statistics are being used ...