Biochem Med (Zagreb), v.31(1); 2021 Feb 15

Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies

Ceyhan Ceran Serdar

1 Medical Biology and Genetics, Faculty of Medicine, Ankara Medipol University, Ankara, Turkey

Murat Cihan

2 Ordu University Training and Research Hospital, Ordu, Turkey

Doğan Yücel

3 Department of Medical Biochemistry, Lokman Hekim University School of Medicine, Ankara, Turkey

Muhittin A Serdar

4 Department of Medical Biochemistry, Acibadem Mehmet Ali Aydinlar University, Istanbul, Turkey

Calculating the sample size in scientific studies is one of the critical issues affecting the scientific contribution of a study. The sample size depends critically on the hypothesis and the study design, and there is no straightforward way of calculating the effective sample size for reaching an accurate conclusion. Use of a statistically incorrect sample size may lead to inadequate results in both clinical and laboratory studies, as well as causing loss of time and money and raising ethical problems. This review has two main aims. The first is to explain the importance of sample size and its relationship to effect size (ES) and statistical significance. The second is to assist researchers planning to perform sample size estimations by suggesting and elucidating available alternative software, guidelines and references that will serve different scientific purposes.

Introduction

Statistical analysis is a crucial part of research. A scientific study must incorporate statistical tools from the planning stage onward. Advances in information technology over the last 20-30 years, along with evidence-based medicine, have increased the spread and applicability of statistical science. Although scientists have come to understand the importance of statistical analysis, a significant number of researchers admit that they lack adequate knowledge about statistical concepts and principles ( 1 ). In a study by West and Ficalora, more than two-thirds of the clinicians emphasized that “the level of biostatistics education that is provided to the medical students is not sufficient” ( 2 ). Accordingly, it has been suggested that statistical concepts are either poorly understood or not understood at all ( 3 , 4 ). Additionally, intentionally or not, researchers tend to draw conclusions that cannot be supported by the actual study data, often due to the misuse of statistical tools ( 5 ). As a result, a large number of statistical errors occur, affecting the research results.

Although a variety of statistical errors can occur in any kind of scientific research, the sources of error have changed in recent years with the use of dedicated software that facilitates statistical analysis. A summary of the main statistical errors frequently encountered in scientific studies is provided below ( 6 - 13 ):

  • Flawed and inadequate hypothesis;
  • Improper study design;
  • Lack of adequate control condition/group;
  • Spectrum bias;
  • Overstatement of the analysis results;
  • Spurious correlations;
  • Inadequate sample size;
  • Circular analysis (creating bias by selecting the properties of the data retrospectively);
  • Utilization of inappropriate statistical tests and fallacious bending of the analyses;
  • p-hacking ( i.e. addition of new covariates post hoc to make P values significant);
  • Excessive interpretation of limited or insignificant results (subjectivism);
  • Confusion (intentionally or not) of correlations, relationships, and causations;
  • Faulty multiple regression models;
  • Confusion between P value and clinical significance; and
  • Inappropriate presentation of the results and effects (erroneous tables, graphics, and figures).

Relationship among sample size, power, P value and effect size

In this review, we will concentrate on the problems associated with the relationships among sample size, power, P value, and effect size (ES). Practical suggestions will be provided whenever possible. In order to understand and interpret the sample size, power analysis, effect size, and P value, it is necessary to know how the hypothesis of the study was formed. It is best to evaluate a study for Type I and Type II errors ( Figure 1 ) through consideration of the study results in the context of its hypotheses ( 14 - 16 ).

Figure 1. Illustration of Type I and Type II errors.

A statistical hypothesis is the researcher’s best guess as to what the result of the experiment will show. It states, in a testable form, the proposition the researcher plans to examine in a sample, in order to find out whether the proposition is correct in the relevant population. There are two commonly used types of hypotheses in statistics: the null hypothesis (H0) and the alternative hypothesis (H1). Essentially, H1 is the researcher’s prediction of the state of the experimental group after the experimental treatment is applied. H0 expresses the notion that the experimental treatment will have no effect.

Prior to the study, in addition to stating the hypothesis, the researcher must also select the alpha (α) level at which the hypothesis will be declared “supported”. The α represents how much risk the researcher is willing to take that the study will conclude H1 is correct when (in the full population) it is not correct (and thus, the null hypothesis is really true). In other words, alpha represents the probability of rejecting H0 when it actually is true. (Thus, the researcher has made an error by reporting that the experimental treatment makes a difference, when in fact, in the full population, that treatment has no effect.)

The most common α level chosen is 0.05, meaning the researcher is willing to take a 5% chance that a result supporting the hypothesis will be untrue in the full population. However, other alpha levels may also be appropriate in some circumstances. For pilot studies, α is often set at 0.10 or 0.20. In studies where it is especially important to avoid concluding that a treatment is effective when it actually is not, the alpha may be set at a much lower value, such as 0.001 or even lower. Drug studies are examples of studies that often set the alpha at 0.001 or lower, because the consequences of releasing an ineffective drug can be extremely dangerous for patients.

Another probability value is called “the P value”. The P value is the calculated probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. The P value is compared to the alpha value to determine whether the result is “statistically significant”, meaning that with high probability the result found in the sample will also hold in the full population. If the P value is at or below alpha, H1 is accepted. If it is higher than alpha, H1 is rejected and H0 is accepted instead.

There are two types of errors. Accepting H1 when it is not true in the population is called a Type I error (a false positive); alpha defines the probability of a Type I error. Type I errors can happen for many reasons, from poor sampling that results in an experimental sample quite different from the population, to other mistakes occurring in the design stage or implementation of the research procedures. It is also possible to make an erroneous decision in the opposite direction, by incorrectly rejecting H1 and thus wrongly accepting H0. This is called a Type II error (a false negative); beta (β) defines the probability of a Type II error. The most common reason for this type of error is small sample size, especially when combined with moderately low or low effect sizes. Both small sample sizes and low effect sizes reduce the power of the study.

Power, which is the probability of rejecting a false null hypothesis, is calculated as 1-β (also expressed as “1 - Type II error probability”). For a Type II error of 0.15, the power is 0.85. Since reduction in the probability of committing a Type II error increases the risk of committing a Type I error (and vice versa ), a delicate balance should be established between the minimum allowed levels for Type I and Type II errors. The ideal power of a study is considered to be 0.8 (which can also be specified as 80%) ( 17 ). Sufficient sample size should be maintained to obtain a Type I error as low as 0.05 or 0.01 and a power as high as 0.8 or 0.9.

However, when the power falls below 0.8, one cannot immediately conclude that the study is worthless. In parallel with this, the concept of “cost-effective sample size” has gained importance in recent years ( 18 ).

Additionally, the traditionally chosen alpha and beta error limits are generally arbitrary and are being used as a convention rather than being based on any scientific validity. Another key issue for a study is the determination, presentation and discussion of the effect size of the study, as will be discussed below in detail.

Although increasing the sample size is suggested to decrease Type II errors, it increases the cost of the project and delays completion of the research within the foreseen period of time. In addition, it should not be forgotten that redundant samples may cause ethical problems ( 19 , 20 ).

Therefore, determination of the effective sample size is crucial to enable an efficient study with high significance, increasing the impact of the outcome. Unfortunately, information regarding sample size calculation is often not provided by clinical investigators in most diagnostic studies ( 21 , 22 ).

Calculation of the sample size

Different methods can be utilized before the onset of the study to calculate the most suitable sample size for the specific research. In addition to manual calculation, various nomograms or software can be used. Figure 2 illustrates one of the most commonly used nomograms for sample size estimation using effect size and power ( 23 ).

Figure 2. Nomogram for sample size and power, for comparing two groups of equal size. Gaussian distributions are assumed. The standardized difference (effect size) and the aimed power are initially selected on the nomogram. The line connecting these values crosses the significance level region of the nomogram; the intercept at the appropriate significance value gives the required sample size for the study. In the above example, for effect size = 1, power = 0.8 and alpha = 0.05, the sample size is found to be 30. (Adapted from reference 16 ).

Although manual calculation is preferred by experts of the subject, it is complicated and difficult for researchers who are not statistics experts. In addition, considering the variety of research types and characteristics, a great number of calculations, involving many variables, would be required ( Table 1 ) ( 16 , 24 - 30 ).

In recent years, numerous software packages and websites have been developed which can successfully calculate sample size in various study types. Some of the important packages and websites are listed in Table 2 and are evaluated, based both on remarks in the literature and on our own experience, with respect to content, ease of use, and cost ( 31 , 32 ). G-Power, R, and Piface stand out among the listed software in terms of being free to use. G-Power is a free tool that can be used to calculate statistical power for many different t-tests, F-tests, χ 2 tests, z-tests and some exact tests. R is an open source programming language which can be tailored to meet individual statistical needs by adding specific program modules, called packages, onto a base program. Piface is a Java application specifically designed for sample size estimation and post-hoc power analysis. The most professional software is PASS (Power Analysis and Sample Size). With PASS, it is possible to analyse sample size and power for approximately 200 different study types. In addition, many websites provide substantial aid in calculating power and sample size, basing their methodology on the scientific literature.

The sample size or the power of the study is directly related to the ES of the study. What is this important ES? The ES provides important information on how well the independent variable or variables predict the dependent variable. A low ES means that the independent variables are poor predictors, because they are only slightly related to the dependent variable. A strong ES means that the independent variables are very good predictors of the dependent variable. Thus, ES is clinically important for evaluating how efficiently clinicians can predict outcomes from the independent variables.

The scale of the ES values for different types of statistical tests conducted in different study types is presented in Table 3 .

In order to evaluate the effect of the study and indicate its clinical significance, it is very important to evaluate the effect size along with statistical significance. The P value is important in the statistical evaluation of the research. While it provides information on the presence or absence of an effect, it does not account for the size of that effect. For comprehensive presentation and interpretation of studies, both effect size and statistical significance (P value) should be provided and considered.

It is easier to understand ES through an example. Assume that an independent samples t-test is used to compare total cholesterol levels between two groups with normally distributed values, where X, SD and N stand for the mean, standard deviation and sample size, respectively:

          Mean (X), mmol/L   Standard deviation (SD)   Sample size (N)
Group 1   6.5                0.5                       30
Group 2   5.2                0.8                       30

For two groups of equal size, Cohen’s d can be calculated as d = (X1 - X2) / SDpooled, where SDpooled = √((SD1² + SD2²) / 2). Here, d = (6.5 - 5.2) / √((0.5² + 0.8²) / 2) = 1.3 / 0.67 ≈ 1.94.

Cohen’s d is conventionally interpreted as a large (0.8), medium (0.5) or small (0.2) effect. The result of approximately 1.94 indicates a very large effect: the means of the two groups are remarkably different.
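A minimal sketch of the same calculation in plain Python (assuming equal group sizes, as in the example above):

```python
import math

def cohens_d(mean1, sd1, mean2, sd2):
    # Pooled SD for two groups of equal size
    sd_pooled = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (mean1 - mean2) / sd_pooled

print(round(cohens_d(6.5, 0.5, 5.2, 0.8), 2))  # ~1.95, matching the ~1.94 above up to rounding
```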

In the example above, the means of the two groups differ largely and in a statistically significant manner. Yet the clinical importance of the effect (whether it matters for the patient, clinical condition, therapy type, outcome, etc .) needs to be evaluated specifically by experts of the topic.

Power, alpha, sample size, and ES are closely related to each other. Let us try to explain this relationship through different situations that we created using G-Power ( 33 , 34 ).

Figure 3 shows the change in sample size depending on the ES (0.2, 1 and 2.5, respectively), provided that the power remains constant at 0.8. Arguably, case 3 is particularly common in pre-clinical studies, cell culture, and animal studies (usually 5-10 samples in animal studies or 3-12 samples in cell culture studies), while case 2 is more common in clinical studies. In clinical, epidemiological or meta-analysis studies, where the sample size is very large, case 1, which emphasizes the importance of smaller effects, is more commonly observed ( 33 ).

Figure 3. Relationship between effect size and sample size. P - power, ES - effect size, SS - sample size. The required sample size increases as the effect size decreases. In all cases, the power is set to 0.8. The sample sizes when ES is 0.2, 1, or 2.5 are 788, 34 and 8, respectively. The graphs at the bottom represent the influence of a change in the sample size on the power.
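The numbers in Figure 3 can be reproduced with the statsmodels Python package; this is a hedged sketch assuming a two-sided, two-sample t-test at alpha = 0.05 (G-Power's default for this design):

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for es in (0.2, 1.0, 2.5):
    # solve_power returns the required size of ONE group (nobs1)
    n_per_group = analysis.solve_power(effect_size=es, alpha=0.05,
                                       power=0.8, alternative='two-sided')
    print(f"ES = {es}: total sample size = {2 * math.ceil(n_per_group)}")
# ES = 0.2: total sample size = 788
# ES = 1.0: total sample size = 34
# ES = 2.5: total sample size = 8
```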

In Figure 4 , case 4 exemplifies the change in power and ES values when the sample size is kept constant ( i.e. as low as 8). As can be seen here, in studies with low ES, working with few samples will mean a waste of time, redundant processing, or unnecessary use of laboratory animals.

Figure 4. Relationship between effect size and power. Two different cases are schematized where the sample size is kept constant either at 8 or at 30. When the sample size is kept constant, the power of the study decreases as the effect size decreases. When the effect size is 2.5, even 8 samples are sufficient to obtain power ≈ 0.8. When the effect size is 1, increasing the sample size from 8 to 30 significantly increases the power of the study. Yet even 30 samples are not sufficient to reach a significant power value if the effect size is as low as 0.2.

Likewise, case 5 exemplifies the situation where the sample size is kept constant at 30. In this case, it is important to note that when ES is 1, the power of the study will be around 0.8. Some statisticians arbitrarily regard 30 as a critical sample size. However, case 5 clearly demonstrates that it is essential not to underestimate the importance of ES when deciding on the sample size.

Especially in recent years, as the clinical significance or effectiveness of results has outstripped statistical significance, understanding effect size and power has gained tremendous importance ( 35 - 38 ).

Preliminary information about the hypothesis is eminently important for calculating the sample size at the intended power. Usually, this is accomplished by determining the effect size from the results of a previous study or a preliminary study. Software is available that can calculate sample size using the effect size.

We now want to focus on sample size and power analysis in some of the most common research areas.

Determination of sample size in pre-clinical studies

Animal studies are the most critical studies in terms of sample size. Especially due to ethical concerns, it is vital to keep the sample size at the lowest sufficient level. It should be noted that animal studies are radically different from human studies, because many animal studies use inbred animals with extremely similar genetic backgrounds. Thus, far fewer animals are needed in the research, because genetic differences that could affect the study results are kept to a minimum ( 39 , 40 ).

Consequently, alternative sample size estimation methodologies were suggested for each study type ( 41 - 44 ). If the effect size is to be determined using the results from previous or preliminary studies, sample size estimation may be performed using G-Power. In addition, Table 4 may also be used for easy estimation of the sample size ( 40 ).

In addition to sample size estimations that may be computed according to Table 4 , the formulas in Table 1 and the websites mentioned in Table 2 may also be utilized to estimate sample size in animal studies. Relying on previous studies poses certain limitations, since it may not always be possible to acquire reliable “pooled standard deviation” and “group mean” values.

Arifin et al. proposed simpler formulas ( Table 5 ) to calculate sample size in animal studies ( 45 ). In group comparison studies, it is possible to calculate the sample size as follows: N = (DF/k)+1 (Eq. 4), where DF is the degrees of freedom and k is the number of groups.

Based on the acceptable range of the degrees of freedom (DF), the DF in the formulas is replaced with the minimum (10) and maximum (20) acceptable values. For example, in an experimental animal study where 3 investigational drugs are tested, the minimum number of animals required is N = (10/3)+1 = 4.3, rounded up to 5 animals per group, for a total sample size of 5 x 3 = 15 animals. The maximum number of animals required is N = (20/3)+1 = 7.7, rounded down to 7 animals per group, for a total sample size of 7 x 3 = 21 animals.

In conclusion, for the recommended study, 5 to 7 animals per group will be required. In other words, a total of 15 to 21 animals will be required to keep the DF within the range of 10 to 20.
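A minimal Python sketch of this resource-equation arithmetic (assuming k equally sized groups and the DF range of 10 to 20 used above):

```python
import math

def resource_equation_range(k, df_min=10, df_max=20):
    """Per-group sample size range from N = (DF / k) + 1 (Arifin et al.)."""
    n_min = math.ceil(df_min / k + 1)    # round up at the minimum DF
    n_max = math.floor(df_max / k + 1)   # round down at the maximum DF
    return n_min, n_max

n_min, n_max = resource_equation_range(k=3)
print(n_min, n_max)            # 5 7 animals per group
print(n_min * 3, n_max * 3)    # 15 21 animals in total
```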

In a compilation where Ricci et al. reviewed 15 studies involving animal models, it was noted that the sample size used was 10 on average (range 6 to 18); however, no formal power analysis was reported by any of the groups. It was striking that all studies included in the review had used parametric analysis without prior normality testing ( i.e. Shapiro-Wilk) to justify their statistical methodology ( 46 ).

It is noteworthy that unnecessary animal use could be prevented by keeping the power at 0.8 and selecting one-tailed over two-tailed analysis with an accepted 5% risk of making a Type I error, as performed in some pharmacological studies; this reduces the number of required animals by 14% ( 47 ).

Neumann et al. proposed a group-sequential design to minimize animal use without a decrease in statistical power. In this strategy, researchers start the experiments with only 30% of the animals that were initially planned to be included in the study. After an interim analysis of the results obtained with 30% of the animals, if sufficient power is not reached, another 30% is included in the study. If results from this initial 60% of the animals provide sufficient statistical power, the rest of the animals are excused from the study; if not, the remaining animals are also included. This approach was reported to save 20% of the animals on average, without leading to a decrease in statistical power ( 48 ).

Alternative sample size estimation strategies are implemented for animal testing in different countries. As an example, a local authority in southwestern Germany recommended that, in the absence of a formal sample size estimation, fewer than 7 animals per experimental group should be included in pilot studies and the total number of experimental animals should not exceed 100 ( 48 ).

On the other hand, it should be noted that, for a sample size of 8 to 10 animals per group, statistical significance will not be reached unless a large or very large ES (> 2) is expected ( 45 , 46 ). This problem remains an important limitation for animal studies. Software like G-Power can be used for sample size estimation; in this case, results from a previous or a preliminary study will be required for the calculations. However, even when a previous study is available in the literature, using its data for a sample size estimation will still pose an uncertainty risk unless a clearly detailed study design and data are provided in the publication. Although researchers have suggested that reliability analyses could be performed by methods such as Markov Chain Monte Carlo, further research is needed in this regard ( 49 ).

The output of the joint workshop held by The National Institutes of Health (NIH), Nature Publishing Group and Science; “Principles and Guidelines for Reporting Preclinical Research” that was published in 2014, has since been acknowledged by many organizations and journals. This guide has shed significant light on studies using biological materials, involving animal studies, and handling image-based data ( 50 ).

Another important point regarding animal studies is the use of technical repetition (pseudo replication) instead of biological repetition. Technical repetition is a specific type of repetition in which the same sample is measured multiple times, aiming to probe the noise associated with the measurement method or the device. Here, no matter how many times the same sample is measured, the actual sample size remains the same. Let us assume a research group is investigating the effect of a therapeutic drug on blood glucose level. If the researchers measure the blood glucose level of 3 mice receiving the actual treatment and 3 mice receiving placebo, this is biological repetition. On the other hand, if the blood glucose level of a single mouse receiving the actual treatment and that of a single mouse receiving placebo are each measured 3 times, this is technical repetition. Both designs provide 6 data points for calculating a P value, yet the P value obtained from the second design would be meaningless, since each treatment group has only one member ( Figure 5 ). Multiple measurements on a single mouse are pseudo replication and therefore do not contribute to N.

No matter how ingenious, no statistical analysis method can fix incorrectly selected replicates at the post-experimental stage; replicate types should be selected accurately at the design stage. This problem is a critical limitation, especially in pre-clinical studies that conduct cell culture experiments, and it is very important for critical assessment and evaluation of published research results ( 51 ). The issue is mostly underestimated, concealed or ignored. It is striking that in some publications the actual sample size is found to be as low as one. Experiments comparing drug treatments in a patient-derived stem cell line are a specific example: although there may be many technical replications for such experiments, and the experiment can be repeated several times, the original patient is a single biological entity. Similarly, when six metatarsals are harvested from the front paws of a single mouse and cultured as six individual cultures, another pseudo replication is practiced, where the sample size is actually 1 instead of 6 ( 52 ). Lazic et al. found that almost half of the studies examined (46%) had mistaken pseudo replication (technical repetition) for genuine replication, while 32% did not provide sufficient information to enable evaluation of the appropriateness of the sample size ( 53 , 54 ).

Figure 5. Technical vs biological repeat.

In studies providing qualitative data (such as electrophoresis, histology, chromatography, electron microscopy), the number of replications (“number of repeats” or “sample size”) should be explicitly stated.

Especially in pre-clinical studies, the standard error of the mean (SEM) is frequently reported instead of the SD, in some situations and by certain journals. The SEM is calculated by dividing the SD by the square root of the sample size (N). The SEM indicates how variable the mean would be if the whole study were repeated many times, whereas the SD is a measure of how scattered the scores within a set of data are. Since the SD is usually larger than the SEM, researchers tend to use the SEM. While the SEM is not a measure of dispersion, there is a relation between the SEM and the 95% confidence interval (CI). For example, when N = 3, the 95% CI is almost equal to the mean ± 4 SEM, but when N ≥ 10, the 95% CI approximately equals the mean ± 2 SEM. Standard deviation and 95% CI can be used to report statistical analysis results such as variation and precision on the same plot, to demonstrate the differences between test groups ( 52 , 55 ).
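These multipliers are simply t-distribution quantiles; a minimal sketch verifying them in Python (scipy assumed; the values below are standard t quantiles, not figures from the cited references):

```python
from scipy import stats

def ci95_in_sem_units(n):
    """Half-width of the 95% CI of the mean, expressed in SEM units."""
    return stats.t.ppf(0.975, df=n - 1)

print(round(ci95_in_sem_units(3), 2))   # ~4.30 -> mean +/- ~4 SEM when N = 3
print(round(ci95_in_sem_units(10), 2))  # ~2.26 -> mean +/- ~2 SEM when N >= 10
```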

Given the risk of attrition and unexpected death of laboratory animals during the study, researchers are generally recommended to increase the sample size by 10% ( 56 ).

Sample size calculation for some genetic studies

Sample size is important for genetic studies as well. In genetic studies, calculation of allele frequencies, calculation of homozygous and heterozygous frequencies based on the Hardy-Weinberg principle, natural selection, mutation, genetic drift, association, linkage, segregation and haplotype analyses are carried out by means of probability and statistical models ( 57 - 62 ). While G-Power is useful for basic statistics, a substantial number of analyses can be conducted using the Genetic Power Calculator ( http://zzz.bwh.harvard.edu/gpc/ ) ( 61 , 62 ). This calculator, which provides automated power analysis for variance components (VC) quantitative trait locus (QTL) linkage and association tests in sibships, as well as other common tests, is especially effective for genetic studies analysing complex diseases.

Case-control association studies for single nucleotide polymorphisms (SNPs) may be facilitated using the OSSE website ( http://osse.bii.a-star.edu.sg/ ). As an example, let us assume the minor allele frequencies of a SNP in cases and controls are approximately 15% and 7%, respectively. To have a power of 0.8 at 0.05 significance, the study is required to include 239 samples each for cases and controls, adding up to 478 samples in total ( Figure 6 ).
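OSSE's internal algorithm is not reproduced here, but the classical normal-approximation formula for comparing two independent proportions yields the same per-group figure; a minimal Python sketch:

```python
import math
from scipy.stats import norm

def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Classical normal-approximation sample size for comparing two proportions."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

print(n_per_group_two_proportions(0.15, 0.07))  # 239 per group
```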

Figure 6. Interface of the Online Sample Size Estimator (OSSE) tool (available at: http://osse.bii.a-star.edu.sg/ ).

Hong and Park have proposed tables and graphics in their article to facilitate sample size estimation ( 57 ). With the assumption of 5% disease prevalence, 5% minor allele frequency and complete linkage disequilibrium (D’ = 1), the sample size in a case-control study with a single SNP marker, a 1:1 case-to-control ratio, 0.8 statistical power, and a 5% Type I error rate can be calculated according to the genetic models of inheritance (allelic, additive, dominant, recessive, and co-dominant models) and the odds ratios of heterozygotes/rare homozygotes ( Table 6 ). As demonstrated by Hong and Park, among all types of inheritance, dominant inheritance requires the lowest sample size to achieve 0.8 statistical power, whereas testing a single SNP in a recessive inheritance model requires a very large sample size even with a high homozygote ratio, which is practically challenging with a limited budget ( 57 ). Table 6 illustrates the difficulty of detecting a disease allele following a recessive mode of inheritance with a moderate sample size.

Sample size and power analyses in clinical studies

In clinical research, the sample size is calculated in line with the hypothesis and study design. Cross-over and parallel study designs require different approaches for sample size estimation. Unlike pre-clinical studies, a significant number of clinical journals require sample size estimation for clinical studies.

The basic rules for sample size estimation in clinical trials are as follows ( 63 , 64 ):

  • Error level (alpha): it is generally set at < 0.05. The sample size should be increased to compensate for a decrease in the effect size.
  • Power must be > 0.8: The sample size should be increased to increase the power of the study. The higher the power, the lower the risk of missing an actual effect.

Figure 7. The relationship among clinical significance, statistical significance, power and effect size. In the example above, in order to produce a clinically significant effect, a treatment is required to trigger a decrease of at least 0.5 mmol/L in cholesterol levels. Four different scenarios are given for a candidate treatment, each having a different mean total cholesterol change and 95% confidence interval. ES - effect size. N - number of participants. Adapted from reference 65 .

  • Similarity and equivalence: the sample size required to demonstrate similarity and equivalence is very low.

Sample size estimation can be performed manually using the formulas in Table 1 as well as the software and websites in Table 2 (especially G-Power). However, all of these calculations require preliminary results or previous study outputs regarding the hypothesis of interest. Sample size estimations are difficult in complex or mixed study designs. In addition: a) unplanned interim analyses, b) planned interim analyses, and c) adjustments for common variables may be required for sample size estimation.

In addition, post-hoc power analysis (possible with G-Power, PASS) following the study significantly facilitates the evaluation of the results in clinical studies.

A number of high-quality journals emphasize that statistical significance is not sufficient on its own. In fact, they require evaluation of the results in terms of effect size and clinical effect as well as statistical significance.

In order to fully comprehend the effect size, it would be useful to know the study design in detail and evaluate the effect size with respect to the type of the statistical tests conducted as provided in Table 3 .

Hence, sample size estimation is one of the critical steps in planning clinical trials, and any negligence or shortcoming in the estimate may lead to rejection of an effective drug, process, or marker. Since statistical concepts have crucial roles in calculating the sample size, sufficient statistical expertise is of paramount importance for these vital studies.

Sample size, effect size and power calculation in laboratory studies

In clinical laboratories, software such as G-Power, Medcalc, Minitab, and Stata can be used for group comparisons (such as t-tests, Mann-Whitney U, Wilcoxon, ANOVA, Friedman, chi-square, etc. ), correlation analyses (Pearson, Spearman, etc .) and regression analyses.

The effect size, which can be calculated according to the methods mentioned in Table 3 , is important in clinical laboratories as well. However, there are additional important criteria that must be considered while investigating differences or relationships. In particular, guidelines (such as CLSI, RiliBÄK, CLIA and ISO documents) established through many years of experience, together with results obtained from biological variation studies, provide essential information and critical values, primarily on effect size and sometimes on sample size.

Furthermore, in addition to statistical significance (P value interpretation), different evaluation criteria are also important for the assessment of the effect size. These include precision, accuracy, coefficient of variation (CV), standard deviation, total allowable error, bias, biological variation, standard deviation index, etc ., as recommended and elaborated by various guidelines and reference literature ( 66 - 70 ).

In this section, we will assess sample size, effect size, and power for some analysis types used in clinical laboratories.

Sample size in method and device comparisons

Sample size is a critical determinant for the linear, Passing-Bablok, and Deming regression analyses that are predominantly used in method comparison studies. Sample size estimations for Passing-Bablok and Deming method comparison studies are exemplified in Table 7 and Table 8 , respectively. As seen in these tables, sample size estimations are based on the slope, analytical precision (CV%), and range ratio (c) ( 66 , 67 ). These tables might seem quite complicated for researchers who are not familiar with statistics. Therefore, in order to further simplify sample size estimation, reference documents and guidelines have been prepared and published. As stated in the CLSI EP09-A3 guideline, the general recommendation for the minimum sample size for validation studies conducted by the manufacturer is 100, while the minimum sample size for user-conducted verification is 40 ( 68 ). In addition, these documents clearly explain the requirements that should be considered while collecting samples for method/device comparison studies. For instance, samples should be homogeneously dispersed, covering the whole detection range. Hence, it should be kept in mind that 40-100 randomly selected samples will not be sufficient for an impeccable method comparison ( 68 ).

Additionally, comparison studies might be carried out in clinical laboratories for other purposes, such as inter-device comparisons, for which relatively few samples are suggested to be sufficient. For method comparison studies conducted using patient samples, sample size estimation and power analysis methodologies, in addition to the required number of replicates, are defined in CLSI document EP31-A-IR. The critical point here is to know the values of the constant difference, within-run standard deviation, and total sample standard deviation ( 69 ). While a lower sample size would suffice for studies comparing devices with high analytical performance, studies comparing devices with lower analytical performance require a higher sample size.

Lu et al. used maximum allowed differences to calculate the sample sizes required in Bland-Altman comparison studies. This type of sample size estimation, which is critically important in laboratory medicine, can easily be performed using the Medcalc software ( 70 ).

Sample size in lot to lot variation studies

It is acknowledged that lot-to-lot variation may influence test results. In line with this, method comparison is also recommended to monitor the performance of the kit in use between lot changes. To aid in the sample size estimation of these studies, CLSI has prepared the EP26-A guideline, “User evaluation of between-reagent lot variation; approved guideline”, which provides a methodology similar to EP31-A-IR ( 71 , 72 ).

Table 9 presents the sample size and power values of a lot-to-lot variation study comparing glucose measurements at 3 different concentrations. In this example, if the difference in the glucose values measured by different lots is > 0.2 mmol/L, > 0.58 mmol/L or > 1.16 mmol/L at analyte concentrations of 2.77 mmol/L, 8.32 mmol/L and 16.65 mmol/L, respectively, the lots would be confirmed to be different. In a scenario where one sample is used for each concentration, if the lot-to-lot variation results obtained at each of the three different concentrations are lower than the rejection limits (meaning that the precision values for the tested lots are within the acceptance limits), then the lot variation is accepted to lie within the acceptance range. While the example for glucose measurements presented in the guideline suggests that one sample would be sufficient at each analyte concentration, it should be noted that the sample size might vary according to the number of devices to be tested, the analytical performance of the devices ( i.e. precision), total allowable error, etc. For different analytes and scenarios ( i.e. for occasions where one sample per concentration is not sufficient), researchers should refer to CLSI EP26-A ( 71 ).

Some researchers find CLSI EP26-A and CLSI EP31 rather complicated for estimating the sample size in lot-to-lot variation and method comparison studies (which are similar to a certain extent). They instead prefer to use the sample size (number of replicates) suggested by Mayo Laboratories, which decided that lot-to-lot variation studies may be conducted using 20 human samples, with the data analysed by Passing-Bablok regression and accepted according to the following criteria: a) the slope of the regression line lies between 0.9 and 1.1; b) the R2 coefficient of determination is > 0.95; c) the Y-intercept of the regression line is < 50% of the lowest reportable concentration; and d) the difference of the means between reagent lots is < 10% ( 73 ).
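As a hedged sketch (the helper below is hypothetical, and the Passing-Bablok fit itself, which needs a dedicated implementation, is assumed to have been computed already), these four acceptance criteria can be encoded directly:

```python
def passes_mayo_lot_criteria(slope, r_squared, intercept,
                             lowest_reportable, mean_new, mean_old):
    """Apply the four Mayo acceptance criteria to an already-fitted
    Passing-Bablok regression of new lot vs old lot."""
    mean_diff_ok = abs(mean_new - mean_old) / mean_old < 0.10
    return (0.9 < slope < 1.1
            and r_squared > 0.95
            and abs(intercept) < 0.5 * lowest_reportable
            and mean_diff_ok)

# Illustrative inputs: slope 0.97, R2 0.98, intercept 0.1 with lowest
# reportable concentration 1.0, lot means 5.3 vs 5.5 -> accepted
print(passes_mayo_lot_criteria(0.97, 0.98, 0.1, 1.0, 5.3, 5.5))  # True
```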

Sample size in verification studies

Acceptance limits should be defined before verification and validation studies. These can be determined according to clinical cut-off values, biological variation, CLIA criteria, RiliBÄK criteria, criteria defined by the manufacturer, or state-of-the-art criteria. In verification studies, the sample size and the minimum proportion of observed samples required to lie within the CI limits are proportional. For instance, for a 50-sample study, 90% of the samples are required to lie within the CI limits for approval of the verification, while for a 200-sample study, 93% is required ( Table 10 ). In an example study in which the total allowable error (TAE) was specified as 15%, 50 samples were measured; the results of 46 samples (92% of all samples) fell within the 15% TAE limit. Since this proportion (92%) exceeds the minimum proportion required (90%), the method is verified ( 74 ).

Especially in recent years, researchers tend to use CLSI EP15-A3, or alternative strategies relying on EP15-A3, for verification analyses. While the alternative strategies diverge from each other in many ways, most of them necessitate a sample size of at least 20 ( 75 - 78 ). Yet for bias studies, especially those involving external quality control materials, even lower sample sizes ( e.g. 10) may be observed ( 79 ). Verification remains one of the critical problems for clinical laboratories; it is not possible to find a single criterion and a single verification method that fit all test methods ( i.e. immunological, chemical, chromatographic, etc. ).

While the sample size for qualitative laboratory tests may vary according to the reference literature and the experimental context, CLSI EP12 recommends at least 50 positive and 50 negative samples, with 20% of the samples from each group required to fall within the cut-off value ± 20% ( 80 , 81 ). According to the clinical microbiology validation/verification guideline Cumitech 31A, the minimum number of samples in the positive and negative groups is 100 per group for validation studies, and 10 per group for verification studies ( 82 ).

Sample size in diagnostic and prognostic studies

ROC analysis is the most important statistical analysis in diagnostic and prognostic studies. Although sample size estimation for ROC analyses might be slightly complicated, Medcalc, PASS, and Stata may be used to facilitate the estimation process. Before the actual size estimation, it is a prerequisite for the researcher to calculate the potential area under the curve (AUC) using data from previous or preliminary studies. In addition, the size estimate may also be calculated manually according to Table 1 , or using sensitivity (TPF) and 1-specificity (FPF) values according to Table 11 , which is adapted from CLSI EP24-A2 ( 83 , 84 ).

As is known, the X-axis of the ROC curve is the FPF and the Y-axis is the TPF; TPF represents sensitivity, while FPF represents 1-specificity. Utilizing Table 11 , for a sensitivity of 0.85, a specificity of 0.90 and a maximum allowable error of 5% (L = 0.05), 196 positive and 139 negative samples are required. For scenarios not included in this table, the reader should refer to the formulas given under the “diagnostic prognostic studies” subsection of Table 1 .
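The 196 and 139 above follow from the standard normal-approximation formula n = z² · p(1 − p) / L² for estimating a proportion within a maximum allowable error L; a minimal sketch (our own illustration, not CLSI code):

```python
import math
from scipy.stats import norm

def n_for_proportion(p, max_error, alpha=0.05):
    """Sample size so that a proportion (e.g. sensitivity) is estimated
    within +/- max_error at the given confidence level."""
    z = norm.ppf(1 - alpha / 2)
    return math.ceil(z ** 2 * p * (1 - p) / max_error ** 2)

print(n_for_proportion(0.85, 0.05))  # 196 positive samples (sensitivity)
print(n_for_proportion(0.90, 0.05))  # 139 negative samples (specificity)
```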

The Standards for Reporting of Diagnostic Accuracy Studies (STARD) checklist may be followed for diagnostic studies. It is a powerful checklist whose application is explained in detail by Cohen et al. and Flahault et al. ( 85 , 86 ). This document suggests that readers need to understand the anticipated precision and power of a study and whether the authors were successful in recruiting a sufficient number of participants; it is therefore critical for authors to explain the intended sample size of their study and how it was determined. For this reason, in diagnostic and prognostic studies, sample size and power should be clearly stated.

As can be seen here, the critical parameters for sample size estimation are the AUC, specificity and sensitivity, and their 95% CI values. Table 12 demonstrates the relationship of sample size with sensitivity, specificity, negative predictive value (NPV) and positive predictive value (PPV): the smaller the sample size, the wider the 95% CI, leading to an increase in Type II errors ( 87 ). Conversely, the confidence interval narrows as the sample size increases, leading to a decrease in Type II errors.
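As a hedged illustration of this narrowing (the sample figures below are assumed, not taken from Table 12), confidence intervals for an observed sensitivity can be computed with the statsmodels Python package:

```python
from statsmodels.stats.proportion import proportion_confint

# An observed sensitivity of 85% with 40 vs 400 positive samples (Wilson CI)
for n in (40, 400):
    lo, hi = proportion_confint(count=int(0.85 * n), nobs=n, method='wilson')
    print(f"N = {n}: 95% CI {lo:.3f}-{hi:.3f}")
# The interval narrows roughly with the square root of the sample size.
```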

As with all sample size calculations, preliminary information is required for sample size estimation in diagnostic and prognostic studies. Yet, variation occurs among sample size estimates calculated according to different reference literature or guidelines. This variation is especially prominent depending on the specific requirements of different countries and local authorities.

While sample size calculations for ROC analyses may easily be performed via Medcalc, the methods explained by Hanley et al. and DeLong et al. may be utilized to calculate sample size in studies comparing different ROC curves ( 88 , 89 ).

Sample size for reference interval determination

Both IFCC working groups and the CLSI guideline C28-A3c offer suggestions regarding sample size estimation in reference interval studies ( 90 - 93 ). These references mainly suggest that at least 120 samples should be included for each study sub-group ( i.e. age group, gender, race, etc. ). In addition, the guideline also states that at least 20 samples should be studied for verification of the determined reference intervals.

Since extreme observed values may under- or over-represent the actual percentile values of a population in nonparametric studies, care should be taken not to rely solely on extreme values when determining the nonparametric 95% reference interval. Reed et al. suggested a minimum sample size of 120 for a 90% CI, 146 for a 95% CI, and 210 for a 99% CI ( 93 ). Linnet proposed that up to 700 samples should be obtained for results having highly skewed distributions ( 94 ). The IFCC Committee on Reference Intervals and Decision Limits working group recommends a minimum of 120 reference subjects for nonparametric methods, to obtain results within 90% CI limits ( 90 ).

Due to the inconvenience of the direct method, and given the challenges of using paediatric and geriatric samples as well as samples obtained from complex biological fluids ( i.e. cerebrospinal fluid), indirect estimation of reference intervals from patient results has gained significant importance in recent years. The Hoffmann method, the Bhattacharya method or their modified versions may be used for indirect determination of reference intervals ( 95 - 101 ). While a specific sample size has not been established, a sample size between 1000 and 10,000 is recommended for each sub-group. For samples that cannot be easily acquired ( i.e. paediatric and geriatric samples, and complex biological fluids), sample sizes as low as 400 may be used for each sub-group ( 92 , 100 ).

Sample size in survey studies

The formulas given in Table 1 and the websites mentioned in Table 2 will be particularly useful for sample size estimation in survey studies, which depends primarily on the population size ( 101 ).

Three critical aspects should be determined for sample size determination in survey studies:

  • Population size: the size of the population from which the sample will be drawn.
  • Confidence interval (CI): a 95% CI means that, if the study were repeated, the same results would be obtained with 95% probability. Depending on the hypothesis and the study aim, the confidence level may lie between 90% and 99%; a confidence level below 90% is not recommended.
  • Margin of error (ME): the maximum acceptable difference between the sample estimate and the true population value.

For a given confidence level, the sample size and the ME are inversely proportional: the sample size should be increased in order to obtain a narrower ME. Conversely, for a fixed ME, the confidence level and the sample size are directly proportional: in order to obtain a higher confidence level, the sample size should be increased. In addition, the sample size is related to the population size: a larger sample is needed for a larger population. A variation in ME causes a more drastic change in sample size than a variation in confidence level. As exemplified in Table 13 , for a population of 10,000 people, a survey with a 95% CI and 5% ME would require at least 370 samples. When the confidence level is changed from 95% to 90% or 99%, the sample size of 370 changes to 264 or 623, respectively. When the ME is changed from 5% to 10% or 1%, the sample size of 370 changes to 96 or 4900, respectively. For other ME and confidence levels, the researcher should refer to the equations and software provided in Table 1 and Table 2 .
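The tabulated values above are consistent with Cochran's formula combined with a finite population correction; a sketch in Python (assuming the most conservative proportion, p = 0.5):

```python
import math
from scipy.stats import norm

def survey_sample_size(population, margin_of_error, confidence=0.95, p=0.5):
    """Cochran's formula with finite population correction; p = 0.5 is
    the most conservative assumption about the surveyed proportion."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * p * (1 - p) / margin_of_error ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(survey_sample_size(10_000, 0.05))                   # 370
print(survey_sample_size(10_000, 0.05, confidence=0.90))  # 264
print(survey_sample_size(10_000, 0.05, confidence=0.99))  # 623
print(survey_sample_size(10_000, 0.10))                   # 96
print(survey_sample_size(10_000, 0.01))                   # 4900
```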

The situation is slightly different for survey studies conducted for problem detection. Here it is most appropriate to perform a preliminary survey with a small sample size, followed by a power analysis, and to complete the study using the appropriate number of samples estimated from the power analysis. While 30 is suggested as a minimum sample size for the preliminary studies, the optimal sample size can be determined using the formula suggested in Table 14 , which is based on the prevalence value ( 103 ). It is unlikely that sufficient power for revealing uncommon problems ( e.g. prevalence 0.02) can be reached at small sample sizes: as can be seen in the table, at a prevalence of 0.02, a sample size of 30 yields a power of only 0.45. In contrast, frequent problems ( e.g. prevalence 0.30) are discovered with higher power (0.83) even when the sample size is as low as 5. For situations where the power and prevalence are known, the effective sample size can easily be estimated using the formula in Table 1 .
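These powers match the probability of observing a problem at least once, 1 − (1 − prevalence)^n; a minimal sketch (this closed form is our assumption, since Table 14 itself is not reproduced here):

```python
import math

def detection_power(prevalence, n):
    """Probability of observing a problem at least once in n subjects."""
    return 1 - (1 - prevalence) ** n

def n_for_detection(prevalence, power=0.8):
    """Smallest n whose detection power reaches the target."""
    return math.ceil(math.log(1 - power) / math.log(1 - prevalence))

print(round(detection_power(0.02, 30), 2))  # 0.45, as in the table
print(round(detection_power(0.30, 5), 2))   # 0.83, as in the table
print(n_for_detection(0.02))                # ~80 subjects for 0.8 power
```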

Does big sample size always increase the impact of a study?

While a larger sample size may provide researchers with great opportunities, it may create problems in the interpretation of statistical significance and clinical impact. Especially in studies with big sample sizes, it is critically important for researchers not to rely only on the magnitude of the regression (or correlation) coefficient and the P value. The study results should be evaluated together with the effect size, the study type ( i.e. basic research, clinical laboratory, or clinical studies) and confidence interval levels. Monte Carlo simulations can be utilized for statistical evaluation of big-data results ( 18 , 104 ).

In conclusion, sample size estimation is a critical step in scientific studies and may differ significantly according to research type. Sample size estimation should be planned ahead of the study and may be performed through various routes:

  • If a similar previous study is available, or preliminary results of the current study are present, their results may be used for sample size estimation via the websites and software mentioned in Table 1 and Table 2 . Some of these software packages may also be used to calculate effect size and power.
  • If the magnitude of the measurand variation required for a substantial clinical effect is known ( i.e. a significant change of 0.51 mmol/L for cholesterol, 26.5 μmol/L for creatinine, etc. ), it may be used for sample size estimation ( Figure 7 ). The availability of total allowable error, constant and critical differences, biological variation, reference change value (RCV), etc. will further aid the sample size estimation process. The free software (especially G-Power) and websites presented in Table 2 will facilitate the calculations.
  • If the effect size can be calculated from a preliminary study, sample size estimation may be performed using the effect size ( via G-Power, Table 4 , etc. ).
  • In the absence of a previous study, if a preliminary study cannot be performed, an effect size may be estimated initially and used for sample size estimation.
  • If none of the above is available or possible, relevant literature may be used for sample size estimation.
  • For clinical laboratories, CLSI documents and guidelines in particular may prove useful for sample size estimation ( Tables 9 and 11 ).

Sample size estimation may be rather complex, requiring advanced knowledge and experience. In order to properly appreciate the concept and perform a precise size estimation, one should comprehend the properties of different study techniques and the relevant statistics to a certain extent. To assist researchers in different fields, we have aimed to compile useful guidelines, references and practical software for calculating sample size and effect size in various study types. Sample size estimation and the relationship between P value and effect size are key points for the comprehension and evaluation of biological studies. Evaluation of statistical significance together with the effect size is critical for basic science as well as clinical and laboratory studies. Therefore, effect size and confidence intervals should definitely be provided, and their impact on the laboratory/clinical results should be discussed thoroughly.

Potential conflict of interest

None declared.

Technical Note | Open access | Published: 12 February 2016

When is enough, enough? Understanding and solving your sample size problems in health services research

Victoria Pye, Natalie Taylor, Robyn Clay-Williams & Jeffrey Braithwaite

BMC Research Notes, volume 9, Article number: 90 (2016)


Health services researchers face two obstacles to sample size calculation: inaccessible, highly specialised or overly technical literature, and difficulty securing methodologists during the planning stages of research. The purpose of this article is to provide pragmatic sample size calculation guidance for researchers who are designing a health services study. We aimed to create a simplified and generalizable process for sample size calculation by (1) summarising key factors and considerations in determining a sample size, (2) developing practical steps for researchers, illustrated by a case study, and (3) providing a list of resources to steer researchers to the next stage of their calculations. Health services researchers can use this guidance to improve their understanding of sample size calculation and implement these steps in their research practice.

Sample size literature for randomized controlled trials and study designs in which there is a clear hypothesis, a single outcome measure, and simple comparison groups is available in abundance. Unfortunately, health services research does not always fit into these constraints. Rather, it is often cross-sectional and observational (i.e., with no ‘experimental group’), with multiple outcomes measured simultaneously. It can also be difficult work, with no a priori hypothesis. The aim of this paper is to guide researchers during the planning stages to adequately power their study and to avoid the situation described in Fig.  1 . By blending key pieces of methodological literature with a pragmatic approach, researchers will be equipped with valuable information to plan and conduct sufficiently powered research using appropriate methodological designs. A short case study is provided (Additional file 1 ) to illustrate how these methods can be applied in practice.

A statistician’s dilemma

The importance of an accurate sample size calculation when designing quantitative research is well documented [ 1 - 3 ]. Without a carefully considered calculation, results can be missed, biased or just plain incorrect. In addition to squandering precious research funds, the implications of a poor sample size calculation can render a study unethical, unpublishable, or both. For simple study designs undertaken in controlled settings, there is a wealth of evidence-based guidance on sample size calculations for clinical trials, experimental studies, and various types of rigorous analyses (Table  1 ), which can help make this process relatively straightforward. Although experimental trials (e.g., testing new treatment methods) are undertaken within health care settings, research to further understand and improve the health service itself is often cross-sectional, involves no intervention, and is likely to be observing multiple associations [ 4 ]. For example, testing the association between leadership on hospital wards and patient re-admission, controlling for various factors such as ward speciality, size of team, and staff turnover, would likely involve collecting a variety of data (e.g., personal information, surveys, administrative data) at one time point, with no experimental group or single hypothesis. Multi-method study designs of this type create challenges, as inputs for an adequate sample size calculation are often not readily available. These inputs are typically: defined groups for comparison, a hypothesis about the difference in outcome between the groups (an effect size), an estimate of the distribution of the outcome (variance), and desired levels of significance and power to find these differences (Fig.  2 ).

Inputs for a sample size calculation

Even in large studies there is often an absence of funding for statistical support, or the funding is inadequate for the size of the project [ 5 ]. This is particularly evident in the planning phase, which is arguably when it is required the most [ 6 ]. A study by Altman et al. [ 7 ] of statistician involvement in 704 papers submitted to the British Medical Journal and Annals of Internal Medicine indicated that only 51% of observational studies received input from trained biostatisticians and, even when accounting for contributions from epidemiologists and other methodologists, only 52% of observational studies utilized statistical advice in the study planning phase [ 7 ]. The practice of health services researchers performing their own statistical analysis without appropriate training or consultation from trained statisticians is not considered ideal [ 5 ]. In the review decisions of journal editors, manuscripts describing studies requiring statistical expertise are more likely to be rejected prior to peer review if the contribution of a statistician or methodologist has not been declared [ 7 ].

Calculating an appropriate sample size should not be considered merely a means to an end in obtaining accurate results. It is an important part of planning research, which will shape the eventual study design and data collection processes. Attacking the problem of sample size is also a good way of testing the validity of the study, confirming the research questions and clarifying the research to be undertaken and the potential outcomes. After all, it is unethical to conduct research that is knowingly either overpowered or underpowered [2, 3]. A study using more participants than necessary is a waste of resources and of the time and effort of participants. An underpowered study is of limited benefit to the scientific community and is similarly wasteful.

With this in mind, it is surprising that methodologists such as statisticians are not customarily included in the study design phase. Whilst a lack of funding is partially to blame, it might also be that because sample size calculation and study design seem relatively simple on the surface, enlisting statistical expertise is deemed unnecessary, or needed only during the analysis phase. However, the literature on sample size normally revolves around a single well defined hypothesis, an expected effect size, two groups to compare, and a known variance: an unlikely situation in practice, and one that can only arise with good planning. A well thought out study and analysis plan, formed in conjunction with a statistician, can be utilized effectively and independently by researchers with the help of available literature. However, a poorly planned study cannot be corrected by a statistician after the fact. For this reason a methodologist should be consulted early when designing the study.

Yet there is help if a statistician or methodologist is not available. The following steps provide useful information to aid researchers in designing their study and calculating sample size. Additionally, a list of resources (Table 1) that broadly frame sample size calculation is provided to guide researchers toward further literature searches.

A place to begin

Merrifield and Smith [1], and Martinez-Mesa et al. [3] discuss simple sample size calculations and explain the key concepts (e.g., power, effect size and significance) in simple terms and from a general health research perspective. These are useful references for non-statisticians and a good place to start for researchers who need a quick reminder of the basics. Lenth [2] provides an excellent and detailed exposition of effect size, including what one should avoid in sample size calculation.

Despite the guidance provided by this literature, there are additional factors to consider when determining sample size in health services research. Sample size requires deliberation from the outset of the study. Figure 3 depicts how different aspects of research are related to sample size and how each should be considered as part of an iterative planning phase. The components of this process are detailed below.

[Figure 3: Stages in sample size calculation; figure not shown]

Study design and hypothesis

The study design and hypothesis of a research project are two sides of the same coin. When there is a single unifying hypothesis, clear comparison groups and an effect size, e.g., drug A will reduce blood pressure 10% more than drug B, then the study design becomes clear and the sample size can be calculated with relative ease. In this situation all the inputs are available for the diagram in Fig. 2.
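To make this concrete, such a calculation takes one line in R. The numbers below are illustrative assumptions, not values from the paper: suppose drug A is expected to lower blood pressure 5 mmHg more than drug B, with a common standard deviation of 12 mmHg.

    # illustrative two-group sample size calculation (base R);
    # delta and sd are assumed values, not taken from the article
    power.t.test(delta = 5, sd = 12, sig.level = 0.05, power = 0.90,
                 type = "two.sample", alternative = "two.sided")
    # returns n of about 122 per group under these assumptions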

However, in large scale or complex health services research the aim is often to further our understanding about the way the system works, and to inform the design of appropriate interventions for improvement. Data collected for this purpose is cross-sectional in nature, with multiple variables within health care (e.g., processes, perceptions, outputs, outcomes, costs) collected simultaneously to build an accurate picture of a complex system. It is unlikely that there is a single hypothesis that can be used for the sample size calculation, and in many cases much of the hypothesising may not be performed until after some initial descriptive analysis. So how does one move forward?

To begin, consider your hypothesis (one or multiple). What relationships do you want to find specifically? There are three reasons why you may not find the relationships you are looking for:

1. The relationship does not exist.

2. The study was not adequately powered to find the relationship.

3. The relationship was obscured by other relationships.

There is no way to avoid the first; avoiding the second involves a good understanding of power and effect size (see Lenth [2]); and avoiding the third requires an understanding of your data and your area of research. A sample size calculation needs to be well thought out so that the research can either find the relationship or, if one is not found, be clear about why it was not found. The problem remains that before an estimate of the effect size can be made, a single hypothesis, a single outcome measure and a study design are required. If there is more than one outcome measure, then each requires an independent sample size calculation, as each outcome measure has a unique distribution. Even with an analysis approach confirmed (e.g., a multilevel model), it can be difficult to decide which effect size measure should be used if there is a lack of research evidence in the area, or a lack of consensus within the literature about which effect sizes are appropriate. For example, despite the fact that Lenth advises researchers to avoid using Cohen’s effect size measurements [2], these margins are regularly applied [8].

To overcome these challenges, the following processes are recommended:

1. Select a primary hypothesis. Although the study may aim to assess a large variety of outcomes and independent variables, it is useful to consider whether there is one relationship that is of most importance. For example, for a study attempting to assess mortality, re-admissions and length of stay as outcomes, each outcome will require its own hypothesis. It may be that for this particular study, re-admission rates are most important; therefore the study should be powered first and foremost to address that hypothesis. Walker [9] describes why having a single hypothesis is easier to communicate and how the results for primary and secondary hypotheses should be reported.

2. Consider a set of important hypotheses and the ways in which you might have to answer each one. Each hypothesis will likely require different statistical tests and methods. Take the example of a study aiming to understand more about the factors associated with hospital outcomes through multiple tests for associations between outcomes such as length of stay, mortality, and readmission rates (dependent variables) and nurse experience, nurse-patient ratio and nurse satisfaction (independent variables). Each of these investigations may use a different type of analysis, a different statistical test, and have a unique sample size requirement. It would be possible to roughly calculate the requirements and select the largest one as the overall sample size for the study. This way, the tests that require smaller samples are sure to be adequately powered. This option requires more time and understanding than the first.

3. During the study planning phase, when a literature review is normally undertaken, it is important not only to assess the findings of previous research, but also the design and the analysis. During the literature review phase, it is useful to keep a record of the study designs, outcome measures, and sample sizes that have already been reported. Consider whether those studies were adequately powered by examining the standard errors of the results, and note any reported variances of outcome variables that are likely to be measured.

4. One of the most difficult challenges is to establish an appropriate expected effect size. This is often not available in the literature and has to be a judgement call based on experience. However, previous studies may provide insight into clinically significant differences and the distribution of outcome measures, which can be used to help determine the effect size. It is recommended that experts in the research area are consulted to inform the decision about the expected effect size [2, 8].

Simulation and rules of thumb

For many study designs, simulation studies are available (Table 1). Simulation studies generally perform multiple simulated experiments on fictional data using different effect sizes, outcomes and sample sizes. From these, an estimation of the standard error and any bias can be identified for the different conditions of the experiments. These are great tools and provide ‘ball park’ figures for similar (although most likely not identical) study designs. As evident in Table 1, simulation studies often accompany discussions of sample size calculations. Simulation studies also provide ‘rules of thumb’, or heuristics, about certain study designs and the sample required for each one. For example, one rule of thumb dictates that more than five cases per variable are required for a regression analysis [10].

Before making a final decision on a hypothesis and study design, identify the range of sample sizes that will be required for your research under different conditions. Early identification of a sample size that is prohibitively large will prevent time being wasted designing a study destined to be underpowered. Importantly, heuristics should not be used as the main source of information for sample size calculation. Rules of thumb are rarely congruous with careful sample size calculation [ 10 ] and will likely lead to an underpowered study. They should only be used, along with the information gathered through the use of the other techniques recommended in this paper, as a guide to inform the hypothesis and study design.
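As an illustration of why heuristics and formal calculations can diverge, the sketch below contrasts a ‘50 plus 8 per predictor’ style rule for regression (examined by Green [10]) with a power-based calculation using the pwr package; the package choice and the assumed ‘medium’ effect size (f2 = 0.15) are our assumptions for illustration.

    # contrast a regression rule of thumb with a power-based calculation
    # (assumes the pwr package; f2 = 0.15 is Cohen's 'medium' effect)
    library(pwr)
    k <- 6                              # number of predictors
    50 + 8 * k                          # heuristic: 98 observations
    res <- pwr.f2.test(u = k, f2 = 0.15, sig.level = 0.05, power = 0.80)
    ceiling(res$v + k + 1)              # power-based N, close to the heuristic

The agreement holds only because a medium effect was assumed; rerunning the power-based calculation with a small effect (f2 = 0.02) returns several hundred observations, which is the sense in which rules of thumb are rarely congruous with a careful calculation.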

Other considerations

Be mindful of multiple comparisons.

With an alpha level of 0.05, on average one in every 20 true null hypotheses tested will give a (falsely) significant result. This should be kept in mind when running multiple tests on the collected data. The hypotheses and appropriate tests should be nominated before the data are collected, and only those tests should be performed. There are ways to correct for multiple comparisons [9]; however, many argue that this is unnecessary [11]. There is no definitive way to ‘fix’ the problem of multiple tests being performed on a single data set, and statisticians continue to argue over the best methodology [12, 13]. Despite its complexity, it is worth considering how multiple comparisons may affect the results, and whether there would be a reasonable way to adjust for this. The decision made should be noted and explained in the submitted manuscript.
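If an adjustment is judged appropriate, it is mechanically simple; a sketch in R using the base p.adjust function (the p-values below are invented for illustration):

    # adjusting a set of p-values for multiple comparisons (base R)
    p <- c(0.012, 0.030, 0.041, 0.20)    # hypothetical raw p-values
    p.adjust(p, method = "bonferroni")   # conservative correction
    p.adjust(p, method = "holm")         # uniformly more powerful than Bonferroni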

After reading some introductory literature around sample size calculation, it should be possible to derive an estimate to meet the study requirements. If this sample size is not feasible, all is not lost. If the study is novel, it may add to the literature regardless of sample size. It may be possible to use pilot data from this preliminary work to inform a sample size calculation for a future study, to incorporate a qualitative component (e.g., interviews, focus groups) for answering a research question, or to inform new research.

Post hoc power analysis

This involves calculating the power of the study retrospectively, using the observed effect size in the collected data to add interpretation to a non-significant result [2]. Hoenig and Heisey [14] detail this concept at length, including the range of limitations associated with such an approach. The well-reported criticisms of post hoc power analysis should cultivate research practice that involves appropriate methodological planning prior to embarking on a project.

Health services research can be a difficult environment for sample size calculation. However, provided that significance, power, effect size and study design have been appropriately considered, a logical, meaningful and defensible calculation can be obtained, achieving the situation described in Fig. 4.

[Figure 4: A statistician’s dream; figure not shown]

Literature summarising aspects of sample size calculation is included in Table 1, providing a mix of perspectives. The list is not exhaustive and is to be used as a starting point to allow researchers to perform a more targeted search once their sample size problems have become clear. A librarian was consulted to inform a search strategy, which was then refined by the lead author. The resulting literature was reviewed by the lead author to ascertain suitability for inclusion.

References and Table 1 resources

1. Merrifield A, Smith W. Sample size calculations for the design of health studies: a review of key concepts for non-statisticians. NSW Public Health Bull. 2012;23(8):142–7. doi:10.1071/NB11017.
2. Lenth RV. Some practical guidelines for effective sample size determination. Am Stat. 2001;55(3):187–93. doi:10.1198/000313001317098149.
3. Martinez-Mesa J, Gonzalez-Chica DA, Bastos JL, Bonamigo RR, Duquia RP. Sample size: how many participants do I need in my research? An Bras Dermatol. 2014;89(4):609–15. doi:10.1590/abd1806-4841.20143705.
4. Webb P, Bain C. Essential epidemiology: an introduction for students and health professionals. 2nd ed. Cambridge: Cambridge University Press; 2011.
5. Omar RZ, McNally N, Ambler G, Pollock AM. Quality research in healthcare: are researchers getting enough statistical support? BMC Health Serv Res. 2006;6:2. doi:10.1186/1472-6963-6-2.
6. Maxwell SE, Kelley K, Rausch JR. Sample size planning for statistical power and accuracy in parameter estimation. Annu Rev Psychol. 2008;59:537–63. doi:10.1146/annurev.psych.59.103006.093735.
7. Altman DG, Goodman SN, Schroter S. How statistical expertise is used in medical research. JAMA. 2002;287(21):2817–20. doi:10.1001/jama.287.21.2817.
8. Sullivan GM, Feinn R. Using effect size—or why the P value is not enough. J Grad Med Educ. 2012;4(3):279–82. doi:10.4300/JGME-D-12-00156.1.
9. Walker AM. Reporting the results of epidemiologic studies. Am J Public Health. 1986;76(5):556–8.
10. Green SB. How many subjects does it take to do a regression analysis. Multivariate Behav Res. 1991;26(3):499–510. doi:10.1207/s15327906mbr2603_7.
11. Feise R. Do multiple outcome measures require p-value adjustment? BMC Med Res Methodol. 2002;2(1):8. doi:10.1186/1471-2288-2-8.
12. Savitz DA, Olshan AF. Describing data requires no adjustment for multiple comparisons: a reply from Savitz and Olshan. Am J Epidemiol. 1998;147(9):813–4. doi:10.1093/oxfordjournals.aje.a009532.
13. Savitz DA, Olshan AF. Multiple comparisons and related issues in the interpretation of epidemiologic data. Am J Epidemiol. 1995;142(9):904–8.
14. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat. 2001;55(1):19–24. doi:10.1198/000313001300339897.
15. Noordzij M, Tripepi G, Dekker FW, Zoccali C, Tanck MW, Jager KJ. Sample size calculations: basic principles and common pitfalls. Nephrol Dial Transplant. 2010;25(5):1388–93. doi:10.1093/ndt/gfp732.
16. Vardeman SB, Morris MD. Statistics and ethics: some advice for young statisticians. Am Stat. 2003;57(1):21–6. doi:10.1198/0003130031072.
17. Dowd BE. Separated at birth: statisticians, social scientists, and causality in health services research. Health Serv Res. 2011;46(2):397–420. doi:10.1111/j.1475-6773.2010.01203.x.
18. Dimick JB, Welch HG, Birkmeyer JD. Surgical mortality as an indicator of hospital quality: the problem with small sample size. JAMA. 2004;292(7):847–51. doi:10.1001/jama.292.7.847.
19. Thomas DC, Siemiatycki J, Dewar R, Robins J, Goldberg M, Armstrong BG. The problem of multiple inference in studies designed to generate hypotheses. Am J Epidemiol. 1985;122(6):1080–95.
20. VanVoorhis CW, Morgan BL. Understanding power and rules of thumb for determining sample sizes. Tutor Quant Methods Psychol. 2007;3(2):43–50.
21. Van Belle G. Statistical rules of thumb. 2nd ed. New York: Wiley; 2011.
22. Serumaga-Zake PA, Arnab R, editors. A suggested statistical procedure for estimating the minimum sample size required for a complex cross-sectional study. The 7th International Multi-Conference on Society, Cybernetics and Informatics: IMSCI 2013, Orlando, Florida, USA; 2013.
23. Hsieh FY, Bloch DA, Larsen MD. A simple method of sample size calculation for linear and logistic regression. Stat Med. 1998;17(14):1623–34. doi:10.1002/(SICI)1097-0258(19980730)17:14<1623:AID-SIM871>3.0.CO;2-S.
24. Alam MK, Rao MB, Cheng F-C. Sample size determination in logistic regression. Sankhya B. 2010;72(1):58–75. doi:10.1007/s13571-010-0004-6.
25. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996;49(12):1373–9. doi:10.1016/S0895-4356(96)00236-3.
26. Dupont WD, Plummer WD Jr. Power and sample size calculations for studies involving linear regression. Control Clin Trials. 1998;19(6):589–601. doi:10.1016/S0197-2456(98)00037-3.
27. Zhong B. How to calculate sample size in randomized controlled trial? J Thorac Dis. 2009;1(1):51–4.
28. Maas CJM, Hox JJ. Sufficient sample sizes for multilevel modeling. Methodology. 2005;1(3):86–92. doi:10.1027/1614-2241.1.3.86.
29. Cohen MP. Sample size considerations for multilevel surveys. Int Stat Rev. 2005;73(3):279–87. doi:10.1111/j.1751-5823.2005.tb00149.x.
30. Paccagnella O. Sample size and accuracy of estimates in multilevel models: new simulation results. Methodology. 2011;7(3):111–20. doi:10.1027/1614-2241/a000029.
31. Maas CJM, Hox JJ. Robustness issues in multilevel regression analysis. Stat Neerl. 2004;58(2):127–37. doi:10.1046/j.0039-0402.2003.00252.x.


Source: Pye V, Taylor N, Clay-Williams R, Braithwaite J. When is enough, enough? Understanding and solving your sample size problems in health services research. BMC Res Notes. 2016;9:90. doi:10.1186/s13104-016-1893-x.


Sample Size Justification
Source: Lakens D. Sample Size Justification. Collabra: Psychology. 2022;8(1):33267. doi:10.1525/collabra.33267.


An important step when designing an empirical study is to justify the sample size that will be collected. The key aim of a sample size justification for such studies is to explain how the collected data is expected to provide valuable information given the inferential goals of the researcher. In this overview article six approaches are discussed to justify the sample size in a quantitative empirical study: 1) collecting data from (almost) the entire population, 2) choosing a sample size based on resource constraints, 3) performing an a-priori power analysis, 4) planning for a desired accuracy, 5) using heuristics, or 6) explicitly acknowledging the absence of a justification. An important question to consider when justifying sample sizes is which effect sizes are deemed interesting, and the extent to which the data that is collected informs inferences about these effect sizes. Depending on the sample size justification chosen, researchers could consider 1) what the smallest effect size of interest is, 2) which minimal effect size will be statistically significant, 3) which effect sizes they expect (and what they base these expectations on), 4) which effect sizes would be rejected based on a confidence interval around the effect size, 5) which ranges of effects a study has sufficient power to detect based on a sensitivity power analysis, and 6) which effect sizes are expected in a specific research area. Researchers can use the guidelines presented in this article, for example by using the interactive form in the accompanying online Shiny app, to improve their sample size justification, and hopefully, align the informational value of a study with their inferential goals.

Scientists perform empirical studies to collect data that helps to answer a research question. The more data that is collected, the more informative the study will be with respect to its inferential goals. A sample size justification should consider how informative the data will be given an inferential goal, such as estimating an effect size, or testing a hypothesis. Even though a sample size justification is sometimes requested in manuscript submission guidelines, when submitting a grant to a funder, or submitting a proposal to an ethical review board, the number of observations is often simply stated, but not justified. This makes it difficult to evaluate how informative a study will be. To prevent such concerns from emerging when it is too late (e.g., after a non-significant hypothesis test has been observed), researchers should carefully justify their sample size before data is collected.

Researchers often find it difficult to justify their sample size (i.e., a number of participants, observations, or any combination thereof). In this review article six possible approaches are discussed that can be used to justify the sample size in a quantitative study (see Table 1). This is not an exhaustive overview, but it includes the most common and applicable approaches for single studies. The first justification is that data from (almost) the entire population has been collected. The second justification centers on resource constraints, which are almost always present, but rarely explicitly evaluated. The third and fourth justifications are based on a desired statistical power or a desired accuracy. The fifth justification relies on heuristics, and finally, researchers can choose a sample size without any justification. Each of these justifications can be stronger or weaker depending on which conclusions researchers want to draw from the data they plan to collect.

All of these approaches to the justification of sample sizes, even the ‘no justification’ approach, give others insight into the reasons that led to the decision for a sample size in a study. It should not be surprising that the ‘heuristics’ and ‘no justification’ approaches are often unlikely to impress peers. However, it is important to note that the value of the information that is collected depends on the extent to which the final sample size allows a researcher to achieve their inferential goals, and not on the sample size justification that is chosen.

The extent to which these approaches make other researchers judge the data that is collected as informative depends on the details of the question a researcher aimed to answer and the parameters they chose when determining the sample size for their study. For example, a badly performed a-priori power analysis can quickly lead to a study with very low informational value. These six justifications are not mutually exclusive, and multiple approaches can be considered when designing a study.

The informativeness of the data that is collected depends on the inferential goals a researcher has, or in some cases, the inferential goals scientific peers will have. A shared feature of the different inferential goals considered in this review article is the question of which effect sizes a researcher considers meaningful to distinguish. This implies that researchers need to evaluate which effect sizes they consider interesting. These evaluations rely on a combination of statistical properties and domain knowledge. In Table 2 six possibly useful considerations are provided. This is not intended to be an exhaustive overview, but it presents common and useful approaches that can be applied in practice. Not all evaluations are equally relevant for all types of sample size justifications. The online Shiny app accompanying this manuscript provides researchers with an interactive form that guides them through the considerations for a sample size justification. These considerations often rely on the same information (e.g., effect sizes, the number of observations, the standard deviation, etc.), so these six considerations should be seen as a set of complementary approaches that can be used to evaluate which effect sizes are of interest.

To start, researchers should consider what their smallest effect size of interest is. Second, although only relevant when performing a hypothesis test, researchers should consider which effect sizes could be statistically significant given a choice of an alpha level and sample size. Third, it is important to consider the (range of) effect sizes that are expected. This requires a careful consideration of the source of this expectation and the presence of possible biases in these expectations. Fourth, it is useful to consider the width of the confidence interval around possible values of the effect size in the population, and whether we can expect this confidence interval to reject effects we considered a-priori plausible. Fifth, it is worth evaluating the power of the test across a wide range of possible effect sizes in a sensitivity power analysis. Sixth, a researcher can consider the effect size distribution of related studies in the literature.

Since all scientists are faced with resource limitations, they need to balance the cost of collecting each additional datapoint against the increase in information that datapoint provides. This is referred to as the value of information (Eckermann et al., 2010). Calculating the value of information is notoriously difficult (Detsky, 1990). Researchers need to specify the cost of collecting data, and weigh the costs of data collection against the increase in utility that having access to the data provides. From a value of information perspective not every data point that can be collected is equally valuable (J. Halpern et al., 2001; Wilson, 2015). Whenever additional observations do not change inferences in a meaningful way, the costs of data collection can outweigh the benefits.

The value of additional information will in most cases be a non-monotonic function, especially when it depends on multiple inferential goals. A researcher might be interested in comparing an effect against a previously observed large effect in the literature, a theoretically predicted medium effect, and the smallest effect that would be practically relevant. In such a situation the expected value of sampling information will lead to different optimal sample sizes for each inferential goal. It could be valuable to collect informative data about a large effect, with additional data having less (or even a negative) marginal utility, up to a point where the data becomes increasingly informative about a medium effect size, with the value of sampling additional information decreasing once more until the study becomes increasingly informative about the presence or absence of a smallest effect of interest.

Because of the difficulty of quantifying the value of information, scientists typically use less formal approaches to justify the amount of data they set out to collect in a study. Even though the cost-benefit analysis is not always made explicit in reported sample size justifications, the value of information perspective is almost always implicitly the underlying framework that sample size justifications are based on. Throughout the subsequent discussion of sample size justifications, the importance of considering the value of information given inferential goals will repeatedly be highlighted.

Measuring (Almost) the Entire Population

In some instances it might be possible to collect data from (almost) the entire population under investigation. For example, researchers might use census data, collect data from all employees at a firm, or study a small population of top athletes. Whenever it is possible to measure the entire population, the sample size justification becomes straightforward: the researcher used all the data that is available.

Resource Constraints

A common reason for the number of observations in a study is that resource constraints limit the amount of data that can be collected at a reasonable cost (Lenth, 2001) . In practice, sample sizes are always limited by the resources that are available. Researchers practically always have resource limitations, and therefore even when resource constraints are not the primary justification for the sample size in a study, it is always a secondary justification.

Despite the omnipresence of resource limitations, the topic often receives little attention in texts on experimental design (for an example of an exception, see Bulus and Dong (2021)). This might make it feel like acknowledging resource constraints is not appropriate, but the opposite is true: because resource limitations always play a role, a responsible scientist carefully evaluates resource constraints when designing a study. Resource constraint justifications are based on a trade-off between the costs of data collection, and the value of having access to the information the data provides. Even if researchers do not explicitly quantify this trade-off, it is revealed in their actions. For example, researchers rarely spend all the resources they have on a single study. Given resource constraints, researchers are confronted with an optimization problem of how to spend resources across multiple research questions.

Time and money are two resource limitations all scientists face. A PhD student has a certain time to complete a PhD thesis, and is typically expected to complete multiple research lines in this time. In addition to time limitations, researchers have limited financial resources that often directly influence how much data can be collected. A third limitation in some research lines is that there might simply be a very small number of individuals from whom data can be collected, such as when studying patients with a rare disease. A resource constraint justification puts limited resources at the center of the justification for the sample size that will be collected, and starts with the resources a scientist has available. These resources are translated into an expected number of observations (N) that a researcher expects they will be able to collect with an amount of money in a given time. The challenge is to evaluate whether collecting N observations is worthwhile. How do we decide if a study will be informative, and when should we conclude that data collection is not worthwhile?

When evaluating whether resource constraints make data collection uninformative, researchers need to explicitly consider which inferential goals they have when collecting data (Parker & Berman, 2003) . Having data always provides more knowledge about the research question than not having data, so in an absolute sense, all data that is collected has value. However, it is possible that the benefits of collecting the data are outweighed by the costs of data collection.

It is most straightforward to evaluate whether data collection has value when we know for certain that someone will make a decision, with or without data. In such situations any additional data will reduce the error rates of a well-calibrated decision process, even if only ever so slightly. For example, without data we will not perform better than a coin flip if we guess which of two conditions has a higher true mean score on a measure. With some data, we can perform better than a coin flip by picking the condition that has the highest mean. With a small amount of data we would still very likely make a mistake, but the error rate is smaller than without any data. In these cases, the value of information might be positive, as long as the reduction in error rates is more beneficial than the cost of data collection.

Another way in which a small dataset can be valuable is if its existence eventually makes it possible to perform a meta-analysis (Maxwell & Kelley, 2011) . This argument in favor of collecting a small dataset requires 1) that researchers share the data in a way that a future meta-analyst can find it, and 2) that there is a decent probability that someone will perform a high-quality meta-analysis that will include this data in the future (S. D. Halpern et al., 2002) . The uncertainty about whether there will ever be such a meta-analysis should be weighed against the costs of data collection.

One way to increase the probability of a future meta-analysis is if researchers commit to performing this meta-analysis themselves, by combining several studies they have performed into a small-scale meta-analysis (Cumming, 2014). For example, a researcher might plan to repeat a study for the next 12 years in a class they teach, with the expectation that after 12 years a meta-analysis of 12 studies would be sufficient to draw informative inferences (but see ter Schure and Grünwald (2019)). If it is not plausible that a researcher will collect all the required data by themselves, they can attempt to set up a collaboration where fellow researchers in their field commit to collecting similar data with identical measures. If it is not likely that sufficient data will emerge over time to reach the inferential goals, there might be no value in collecting the data.

Even if a researcher believes it is worth collecting data because a future meta-analysis will be performed, they will most likely perform a statistical test on the data. To make sure their expectations about the results of such a test are well-calibrated, it is important to consider which effect sizes are of interest, and to perform a sensitivity power analysis to evaluate the probability of a Type II error for effects of interest. Of the six ways to evaluate which effect sizes are interesting that will be discussed in the second part of this review, it is useful to consider the smallest effect size that can be statistically significant, the expected width of the confidence interval around the effect size, and effects that can be expected in a specific research area, and to evaluate the power for these effect sizes in a sensitivity power analysis. If a decision or claim is made, a compromise power analysis is worthwhile to consider when deciding upon the error rates while planning the study. When reporting a resource constraints sample size justification it is recommended to address the five considerations in Table 3. Addressing these points explicitly facilitates evaluating if the data is worthwhile to collect. To make it easier to address all relevant points explicitly, an interactive form implementing the recommendations in this manuscript can be found at https://shiny.ieis.tue.nl/sample_size_justification/.

A-priori Power Analysis

When designing a study where the goal is to test whether a statistically significant effect is present, researchers often want to make sure their sample size is large enough to prevent erroneous conclusions for a range of effect sizes they care about. In this approach to justifying a sample size, the value of information is to collect observations up to the point that the probability of an erroneous inference is, in the long run, not larger than a desired value. If a researcher performs a hypothesis test, there are four possible outcomes:

1. A false positive (or Type I error), determined by the α level. A test yields a significant result, even though the null hypothesis is true.

2. A false negative (or Type II error), determined by β, or 1 − power. A test yields a non-significant result, even though the alternative hypothesis is true.

3. A true negative, determined by 1 − α. A test yields a non-significant result when the null hypothesis is true.

4. A true positive, determined by 1 − β. A test yields a significant result when the alternative hypothesis is true.

Given a specified effect size, alpha level, and power, an a-priori power analysis can be used to calculate the number of observations required to achieve the desired error rates. Figure 1 illustrates how statistical power increases as the number of observations (per group) increases in an independent t test with a two-sided alpha level of 0.05. If we are interested in detecting an effect of d = 0.5, a sample size of 90 per condition would give us more than 90% power. Power analyses can be performed to determine the number of participants, or the number of items (Westfall et al., 2014), and can also be performed for single case studies (Ferron & Onghena, 1996; McIntosh & Rittmo, 2020).

[Figure 1: Statistical power of an independent t test (two-sided α = 0.05) as a function of sample size per group; figure not shown]
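These numbers can be reproduced with, for example, the pwr package in R (the tool choice is ours; the article does not prescribe one):

    # a-priori power for an independent two-sided t test at d = 0.5
    library(pwr)
    pwr.t.test(n = 90, d = 0.5, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")  # power ~ 0.92
    pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.90)$n       # ~ 86 per group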

Although it is common to set the Type I error rate to 5% and aim for 80% power, error rates should be justified (Lakens, Adolfi, et al., 2018) . As explained in the section on compromise power analysis, the default recommendation to aim for 80% power lacks a solid justification. In general, the lower the error rates (and thus the higher the power), the more informative a study will be, but the more resources are required. Researchers should carefully weigh the costs of increasing the sample size against the benefits of lower error rates, which would probably make studies designed to achieve a power of 90% or 95% more common for articles reporting a single study. An additional consideration is whether the researcher plans to publish an article consisting of a set of replication and extension studies, in which case the probability of observing multiple Type I errors will be very low, but the probability of observing mixed results even when there is a true effect increases (Lakens & Etz, 2017) , which would also be a reason to aim for studies with low Type II error rates, perhaps even by slightly increasing the alpha level for each individual study.

Figure 2 visualizes two distributions. The left distribution (dashed line) is centered at 0. This is a model for the null hypothesis. If the null hypothesis is true, a statistically significant result will be observed if the effect size is extreme enough (in a two-sided test either in the positive or negative direction), but any significant result would be a Type I error (the dark grey areas under the curve). If there is no true effect, statistical power for a null hypothesis significance test is formally undefined. Any significant effects observed if the null hypothesis is true are Type I errors, or false positives, which occur at the chosen alpha level. The right distribution (solid line) is centered on an effect of d = 0.5. This is the specified model for the alternative hypothesis in this study, illustrating the expectation of an effect of d = 0.5 if the alternative hypothesis is true. Even though there is a true effect, studies will not always find a statistically significant result. This happens when, due to random variation, the observed effect size is too close to 0 to be statistically significant. Such results are false negatives (the light grey area under the curve on the right). To increase power, we can collect a larger sample size. As the sample size increases, the distributions become narrower, reducing the probability of a Type II error.

[Figure 2: Null (d = 0) and alternative (d = 0.5) distributions, with Type I and Type II error regions; figure not shown]

It is important to highlight that the goal of an a-priori power analysis is not to achieve sufficient power for the true effect size. The true effect size is unknown. The goal of an a-priori power analysis is to achieve sufficient power, given a specific assumption of the effect size a researcher wants to detect. Just like a Type I error rate is the maximum probability of making a Type I error conditional on the assumption that the null hypothesis is true, an a-priori power analysis is computed under the assumption of a specific effect size. It is unknown if this assumption is correct. All a researcher can do is to make sure their assumptions are well justified. Statistical inferences based on a test where the Type II error rate is controlled are conditional on the assumption of a specific effect size. They allow the inference that, assuming the true effect size is at least as large as that used in the a-priori power analysis, the maximum Type II error rate in a study is not larger than a desired value.

This point is perhaps best illustrated if we consider a study where an a-priori power analysis is performed both for a test of the presence of an effect and for a test of the absence of an effect. When designing a study, it is essential to consider the possibility that there is no effect (e.g., a mean difference of zero). An a-priori power analysis can be performed both for a null hypothesis significance test and for a test of the absence of a meaningful effect, such as an equivalence test that can statistically provide support for the null hypothesis by rejecting the presence of effects that are large enough to matter (Lakens, 2017; Meyners, 2012; Rogers et al., 1993). When multiple primary tests will be performed based on the same sample, each analysis requires a dedicated sample size justification. If possible, a sample size is collected that guarantees that all tests are informative, which means that the collected sample size is based on the largest sample size returned by any of the a-priori power analyses.

For example, if the goal of a study is to detect or reject an effect size of d = 0.4 with 90% power, and the alpha level is set to 0.05 for a two-sided independent t test, a researcher would need to collect 133 participants in each condition for an informative null hypothesis test, and 136 participants in each condition for an informative equivalence test. Therefore, the researcher should aim to collect 272 participants in total for an informative result for both tests that are planned. This does not guarantee a study has sufficient power for the true effect size (which can never be known), but it guarantees the study has sufficient power given an assumption of the effect a researcher is interested in detecting or rejecting. Therefore, an a-priori power analysis is useful, as long as a researcher can justify the effect sizes they are interested in.
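A sketch of these two calculations in R, assuming the pwr and TOSTER packages (the article reports the resulting sample sizes but not the code, so the calls below are our assumption):

    # n per group to detect d = 0.4 (two-sided test, 90% power)
    pwr::pwr.t.test(d = 0.4, sig.level = 0.05, power = 0.90)    # ~133 per group
    # n per group to reject effects as large as d = 0.4 (equivalence test)
    TOSTER::powerTOSTtwo(alpha = 0.05, statistical_power = 0.90,
                         low_eqbound_d = -0.4, high_eqbound_d = 0.4)  # ~136 per group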

If researchers correct the alpha level when testing multiple hypotheses, the a-priori power analysis should be based on this corrected alpha level. For example, if four tests are performed, an overall Type I error rate of 5% is desired, and a Bonferroni correction is used, the a-priori power analysis should be based on a corrected alpha level of .0125.
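Continuing the sketch above, the corrected alpha level simply replaces the uncorrected one in the power analysis, at the cost of a larger sample:

    # a-priori power analysis at a Bonferroni-corrected alpha level
    pwr::pwr.t.test(d = 0.4, sig.level = 0.05 / 4, power = 0.90)
    # roughly 180 per group, versus ~133 at alpha = 0.05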

An a-priori power analysis can be performed analytically, or by performing computer simulations. Analytic solutions are faster but less flexible. A common challenge researchers face when attempting to perform power analyses for more complex or uncommon tests is that available software does not offer analytic solutions. In these cases, simulations can provide a flexible solution to perform power analyses for any test (Morris et al., 2019). The following code is an example of a power analysis in R based on 10,000 simulations for a one-sample t test against zero with a sample size of 20, assuming a true effect of d = 0.5. All simulations consist of first randomly generating data based on assumptions about the data generating mechanism (e.g., a normal distribution with a mean of 0.5 and a standard deviation of 1), followed by a test performed on the data. By computing the percentage of significant results, power can be computed for any design.

    p <- numeric(10000)               # to store p-values
    for (i in 1:10000) {              # simulate 10k tests
      x <- rnorm(n = 20, mean = 0.5, sd = 1)
      p[i] <- t.test(x)$p.value       # store the p-value
    }
    sum(p < 0.05) / 10000             # compute power

There is a wide range of tools available to perform power analyses. Whichever tool a researcher decides to use, it will take time to learn how to use the software correctly to perform a meaningful a-priori power analysis. Resources to educate psychologists about power analysis consist of book-length treatments (Aberson, 2019; Cohen, 1988; Julious, 2004; Murphy et al., 2014) , general introductions (Baguley, 2004; Brysbaert, 2019; Faul et al., 2007; Maxwell et al., 2008; Perugini et al., 2018) , and an increasing number of applied tutorials for specific tests (Brysbaert & Stevens, 2018; DeBruine & Barr, 2019; P. Green & MacLeod, 2016; Kruschke, 2013; Lakens & Caldwell, 2021; Schoemann et al., 2017; Westfall et al., 2014) . It is important to be trained in the basics of power analysis, and it can be extremely beneficial to learn how to perform simulation-based power analyses. At the same time, it is often recommended to enlist the help of an expert, especially when a researcher lacks experience with a power analysis for a specific test.

When reporting an a-priori power analysis, make sure that the power analysis is completely reproducible. If power analyses are performed in R it is possible to share the analysis script and information about the version of the package. In many software packages it is possible to export the power analysis that is performed as a PDF file. For example, in G*Power analyses can be exported under the ‘protocol of power analysis’ tab. If the software package provides no way to export the analysis, add a screenshot of the power analysis to the supplementary files.
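As a minimal illustration of such a reproducible record (the specific calls are our assumption, not a requirement of the article), report the analysis together with the software versions used:

    # reproducible record of an a-priori power analysis
    pwr::pwr.t.test(d = 0.4, sig.level = 0.05, power = 0.90)
    packageVersion("pwr")    # report the package version
    R.version.string         # report the R version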


The reproducible report needs to be accompanied by justifications for the choices that were made with respect to the values used in the power analysis. If the effect size used in the power analysis is based on previous research, the factors presented in Table 5 (if the effect size is based on a meta-analysis) or Table 6 (if the effect size is based on a single study) should be discussed. If an effect size estimate is based on the existing literature, provide a full citation, and preferably a direct quote from the article where the effect size estimate is reported. If the effect size is based on a smallest effect size of interest, this value should not just be stated, but justified (e.g., based on theoretical predictions or practical implications; see Lakens, Scheel, and Isager (2018)). For an overview of all aspects that should be reported when describing an a-priori power analysis, see Table 4.

Planning for Precision

Some researchers have suggested justifying sample sizes based on a desired level of precision of the estimate (Cumming & Calin-Jageman, 2016; Kruschke, 2018; Maxwell et al., 2008). The goal when justifying a sample size based on precision is to collect data to achieve a desired width of the confidence interval around a parameter estimate. The width of the confidence interval around the parameter estimate depends on the standard deviation and the number of observations. The only aspects a researcher needs to justify for a sample size justification based on accuracy are the desired width of the confidence interval with respect to their inferential goal, and their assumption about the population standard deviation of the measure.

If a researcher has determined the desired accuracy, and has a good estimate of the true standard deviation of the measure, it is straightforward to calculate the sample size needed for a desired level of accuracy. For example, when measuring the IQ of a group of individuals a researcher might desire to estimate the IQ score within an error range of 2 IQ points for 95% of the observed means, in the long run. The required sample size to achieve this desired level of accuracy (assuming normally distributed data) can be computed by:

N = (z × sd / error)^2

where N is the number of observations, z is the critical value related to the desired confidence interval, sd is the standard deviation of IQ scores in the population, and error is the width of the confidence interval within which the mean should fall, with the desired error rate. In this example, (1.96 × 15 / 2)^2 = 216.1 observations. If a researcher desires 95% of the means to fall within a 2 IQ point range around the true population mean, 217 observations should be collected. If a desired accuracy for a non-zero mean difference is computed, accuracy is based on a non-central t-distribution. For these calculations an expected effect size estimate needs to be provided, but it has relatively little influence on the required sample size (Maxwell et al., 2008). It is also possible to incorporate uncertainty about the observed effect size in the sample size calculation, known as assurance (Kelley & Rausch, 2006). The MBESS package in R provides functions to compute sample sizes for a wide range of tests (Kelley, 2007).
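The same calculation in R, as a sketch under the stated assumptions (sd = 15, a desired error range of 2 IQ points, 95% confidence):

    # sample size for a desired precision of the mean (base R)
    z <- qnorm(0.975)             # critical value for 95% confidence
    sd <- 15                      # assumed population SD of IQ scores
    error <- 2                    # desired error range in IQ points
    ceiling((z * sd / error)^2)   # 217 observations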

What is less straightforward is to justify how a desired level of accuracy is related to inferential goals. There is no literature that helps researchers to choose a desired width of the confidence interval. Morey (2020) convincingly argues that most practical use-cases of planning for precision involve an inferential goal of distinguishing an observed effect from other effect sizes (for a Bayesian perspective, see Kruschke (2018)). For example, a researcher might expect an effect size of r = 0.4 and would treat observed correlations that differ more than 0.2 (i.e., 0.2 < r < 0.6) differently, in that effects of r = 0.6 or larger are considered too large to be caused by the assumed underlying mechanism (Hilgard, 2021), while effects smaller than r = 0.2 are considered too small to support the theoretical prediction. If the goal is indeed to get an effect size estimate that is precise enough so that two effects can be differentiated with high probability, the inferential goal is actually a hypothesis test, which requires designing a study with sufficient power to reject effects (e.g., testing a range prediction of correlations between 0.2 and 0.6).

If researchers do not want to test a hypothesis, for example because they prefer an estimation approach over a testing approach, then in the absence of clear guidelines that help researchers to justify a desired level of precision, one solution might be to rely on a generally accepted norm of precision to aim for. This norm could be based on ideas about a certain resolution below which measurements in a research area no longer lead to noticeably different inferences. Just as researchers normatively use an alpha level of 0.05, they could plan studies to achieve a desired confidence interval width around the observed effect that is determined by a norm. Future work is needed to help researchers choose a confidence interval width when planning for accuracy.

Heuristics

When a researcher uses a heuristic, they are not able to justify their sample size themselves, but they trust a sample size recommended by some authority. When I started as a PhD student in 2005 it was common to collect 15 participants in each between-subjects condition. When asked why this was a common practice, no one was really sure, but people trusted there was a justification somewhere in the literature. Now, I realize there was no justification for the heuristics we used. As Berkeley (1735) already observed: “Men learn the elements of science from others: And every learner hath a deference more or less to authority, especially the young learners, few of that kind caring to dwell long upon principles, but inclining rather to take them upon trust: And things early admitted by repetition become familiar: And this familiarity at length passeth for evidence.”

Some papers provide researchers with simple rules of thumb about the sample size that should be collected. Such papers clearly fill a need, and are cited a lot, even when the advice in these articles is flawed. For example, Wilson VanVoorhis and Morgan (2007) translate an absolute minimum of 50+8 observations for regression analyses, suggested by a rule of thumb examined in S. B. Green (1991), into the recommendation to collect ~50 observations. Green actually concludes in his article that “In summary, no specific minimum number of subjects or minimum ratio of subjects-to-predictors was supported”. He does discuss how a general rule of thumb of N = 50 + 8 provided an accurate minimum number of observations for the ‘typical’ study in the social sciences because these have a ‘medium’ effect size, as Green claims by citing Cohen (1988). Cohen actually didn’t claim that the typical study in the social sciences has a ‘medium’ effect size, and instead said (1988, p. 13): “Many effects sought in personality, social, and clinical-psychological research are likely to be small effects as here defined”. We see how a string of mis-citations eventually leads to a misleading rule of thumb.

Rules of thumb seem to primarily emerge due to mis-citations and/or overly simplistic recommendations. Simonsohn, Nelson, and Simmons (2011) recommended that “Authors must collect at least 20 observations per cell”. A later recommendation by the same authors presented at a conference suggested to use n > 50, unless you study large effects (Simmons et al., 2013) . Regrettably, this advice is now often mis-cited as a justification to collect no more than 50 observations per condition without considering the expected effect size. If authors justify a specific sample size (e.g., n = 50) based on a general recommendation in another paper, either they are mis-citing the paper, or the paper they are citing is flawed.

Another common heuristic is to collect the same number of observations as were collected in a previous study. This strategy is not recommended in scientific disciplines with widespread publication bias, and/or where novel and surprising findings from largely exploratory single studies are published. Using the same sample size as a previous study is only a valid approach if the sample size justification in the previous study also applies to the current study. Instead of stating that you intend to collect the same sample size as an earlier study, repeat the sample size justification, and update it in light of any new information (such as the effect size in the earlier study, see Table 6 ).

Peer reviewers and editors should carefully scrutinize sample size justifications based on rules of thumb, because they can make it seem like a study has high informational value for an inferential goal even when the study will yield uninformative results. Whenever one encounters a sample size justification based on a heuristic, ask yourself: ‘Why is this heuristic used?’ It is important to know what the logic behind a heuristic is to determine whether the heuristic is valid for a specific situation. In most cases, heuristics are based on weak logic, and not widely applicable. It might be possible that fields develop valid heuristics for sample size justifications. For example, it is possible that a research area reaches widespread agreement that effects smaller than d = 0.3 are too small to be of interest, and all studies in a field use sequential designs (see below) that have 90% power to detect a d = 0.3. Alternatively, it is possible that a field agrees that data should be collected with a desired level of accuracy, irrespective of the true effect size. In these cases, valid heuristics would exist based on generally agreed goals of data collection. For example, Simonsohn (2015) suggests designing replication studies that have 2.5 times as large sample sizes as the original study, as this provides 80% power for an equivalence test against an equivalence bound set to the effect the original study had 33% power to detect, assuming the true effect size is 0. As original authors typically do not specify which effect size would falsify their hypothesis, the heuristic underlying this ‘small telescopes’ approach is a good starting point for a replication study with the inferential goal to reject the presence of an effect as large as was described in an earlier publication. It is the responsibility of researchers to gain the knowledge to distinguish valid heuristics from mindless heuristics, and to be able to evaluate whether a heuristic will yield an informative result given the inferential goal of the researchers in a specific study, or not.

No Justification

It might sound like a contradictio in terminis, but it is useful to distinguish a final category where researchers explicitly state they do not have a justification for their sample size. Perhaps the resources were available to collect more data, but they were not used. A researcher could have performed a power analysis, or planned for precision, but they did not. In those cases, instead of pretending there was a justification for the sample size, honesty requires you to state there is no sample size justification. This is not necessarily bad. It is still possible to discuss the smallest effect size of interest, the minimal statistically detectable effect, the width of the confidence interval around the effect size, and to plot a sensitivity power analysis, in relation to the sample size that was collected. If a researcher truly had no specific inferential goals when collecting the data, such an evaluation can perhaps be performed based on reasonable inferential goals peers would have when they learn about the existence of the collected data.

Do not try to spin a story where it looks like a study was highly informative when it was not. Instead, transparently evaluate how informative the study was given effect sizes that were of interest, and make sure that the conclusions follow from the data. The lack of a sample size justification might not be problematic, but it might mean that a study was not informative for most effect sizes of interest, which makes it especially difficult to interpret non-significant effects, or estimates with large uncertainty.

The inferential goal of data collection is often in some way related to the size of an effect. Therefore, to design an informative study, researchers will want to think about which effect sizes are interesting. First, it is useful to consider three effect sizes when determining the sample size. The first is the smallest effect size a researcher is interested in, the second is the smallest effect size that can be statistically significant (only in studies where a significance test will be performed), and the third is the effect size that is expected. Beyond considering these three effect sizes, it can be useful to evaluate ranges of effect sizes. This can be done by computing the width of the expected confidence interval around an effect size of interest (for example, an effect size of zero), and examining which effects could be rejected. Similarly, it can be useful to plot a sensitivity curve and evaluate the range of effect sizes the design has decent power to detect, as well as to consider the range of effects for which the design has low power. Finally, there are situations where it is useful to consider the range of effect sizes that is likely to be observed in a specific research area.

What is the Smallest Effect Size of Interest?

The strongest possible sample size justification is based on an explicit statement of the smallest effect size that is considered interesting. A smallest effect size of interest can be based on theoretical predictions or practical considerations. For a review of approaches that can be used to determine a smallest effect size of interest in randomized controlled trials, see Cook et al. (2014) and Keefe et al. (2013); for reviews of different methods to determine a smallest effect size of interest, see King (2011) and Copay, Subach, Glassman, Polly, and Schuler (2007); and for a discussion focused on psychological research, see Lakens, Scheel, et al. (2018).

It can be challenging to determine the smallest effect size of interest whenever theories are not very developed, or when the research question is far removed from practical applications, but it is still worth thinking about which effects would be too small to matter. A first step forward is to discuss which effect sizes are considered meaningful in a specific research line with your peers. Researchers will differ in the effect sizes they consider large enough to be worthwhile (Murphy et al., 2014) . Just as not every scientist will find every research question interesting enough to study, not every scientist will consider the same effect sizes interesting enough to study, and different stakeholders will differ in which effect sizes are considered meaningful (Kelley & Preacher, 2012) .

Even though it might be challenging, there are important benefits of being able to specify a smallest effect size of interest. The population effect size is always uncertain (indeed, estimating this is typically one of the goals of the study), and therefore whenever a study is powered for an expected effect size, there is considerable uncertainty about whether the statistical power is high enough to detect the true effect in the population. However, if the smallest effect size of interest can be specified and agreed upon after careful deliberation, it becomes possible to design a study that has sufficient power (given the inferential goal to detect or reject the smallest effect size of interest with a certain error rate). A smallest effect size of interest may be subjective (one researcher might find effect sizes smaller than d = 0.3 meaningless, while another researcher might still be interested in effects larger than d = 0.1), and there might be uncertainty about the parameters required to specify the smallest effect size of interest (e.g., when performing a cost-benefit analysis), but after a smallest effect size of interest has been determined, a study can be designed with a known Type II error rate to detect or reject this value. For this reason an a-priori power analysis based on a smallest effect size of interest is generally preferred, whenever researchers are able to specify one (Aberson, 2019; Albers & Lakens, 2018; Brown, 1983; Cascio & Zedeck, 1983; Dienes, 2014; Lenth, 2001).

The Minimal Statistically Detectable Effect

The minimal statistically detectable effect, or the critical effect size, provides information about the smallest effect size that, if observed, would be statistically significant given a specified alpha level and sample size (Cook et al., 2014). For any critical t value (e.g., t = 1.96 for α = 0.05, for large sample sizes) we can compute a critical mean difference (Phillips et al., 2001), or a critical standardized effect size. For a two-sided independent t test the critical mean difference is:

$M_{crit} = t_{crit}\sqrt{\frac{sd_1^2}{n_1} + \frac{sd_2^2}{n_2}}$

and the critical standardized mean difference is:

$d_{crit} = t_{crit}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

In Figure 4 the distribution of Cohen’s d is plotted for 15 participants per group when the true effect size is either d = 0 or d = 0.5. This figure is similar to Figure 2 , with the addition that the critical d is indicated. We see that with such a small number of observations in each group only observed effects larger than d = 0.75 will be statistically significant. Whether such effect sizes are interesting, and can realistically be expected, should be carefully considered and justified.
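
This critical value can be verified directly from the critical t value. A minimal sketch in base R, using the numbers from the example above:

```r
# Critical Cohen's d for a two-sided independent t test
n1 <- n2 <- 15
alpha <- 0.05
df <- n1 + n2 - 2
t_crit <- qt(1 - alpha/2, df)         # critical t value (2.048 for df = 28)
d_crit <- t_crit * sqrt(1/n1 + 1/n2)  # critical standardized mean difference
d_crit                                # ~0.75
```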

[Figure 4: Distributions of Cohen's d for two groups of 15 participants when the true effect size is d = 0 and when it is d = 0.5, with the critical d indicated.]

G*Power provides the critical test statistic (such as the critical t value) when performing a power analysis. For example, Figure 5 shows that for a correlation based on a two-sided test, with α = 0.05, and N = 30, only effects larger than r = 0.361 or smaller than r = -0.361 can be statistically significant. This reveals that when the sample size is relatively small, the observed effect needs to be quite substantial to be statistically significant.
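
The same value can be computed outside G*Power. A base R sketch, using the standard conversion between a t value and a correlation:

```r
# Critical correlation for a two-sided test with alpha = 0.05 and N = 30
N <- 30
t_crit <- qt(1 - 0.05/2, df = N - 2)
r_crit <- t_crit / sqrt(t_crit^2 + (N - 2))
r_crit  # ~0.361
```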

[Figure 5: G*Power output for a two-sided correlation test with α = 0.05 and N = 30, showing the critical r of 0.361.]

It is important to realize that due to random variation each study has a probability to yield effects larger than the critical effect size, even if the true effect size is small (or even when the true effect size is 0, in which case each significant effect is a Type I error). Computing a minimal statistically detectable effect is useful for a study where no a-priori power analysis is performed, both for studies in the published literature that do not report a sample size justification (Lakens, Scheel, et al., 2018) and for researchers who rely on heuristics for their sample size justification.

It can be informative to ask yourself whether the critical effect size for a study design is within the range of effect sizes that can realistically be expected. If not, then whenever a significant effect is observed in a published study, either the effect size is surprisingly larger than expected, or more likely, it is an upwardly biased effect size estimate. In the latter case, given publication bias, published studies will lead to biased effect size estimates. If it is still possible to increase the sample size, for example by ignoring rules of thumb and instead performing an a-priori power analysis, then do so. If it is not possible to increase the sample size, for example due to resource constraints, then reflecting on the minimal statistically detectable effect should make it clear that an analysis of the data should not focus on p values, but on the effect size and the confidence interval (see Table 3 ).

It is also useful to compute the minimal statistically detectable effect if an ‘optimistic’ power analysis is performed. For example, if you believe a best case scenario for the true effect size is d = 0.57 and use this optimistic expectation in an a-priori power analysis, effects smaller than d = 0.4 will not be statistically significant when you collect 50 observations per group in a two independent group design. If your worst case scenario for the alternative hypothesis is a true effect size of d = 0.35, your design would not allow you to declare a significant effect if effect size estimates close to the worst case scenario are observed. Taking into account the minimal statistically detectable effect size should make you reflect on whether a hypothesis test will yield an informative answer, and whether your current approach to sample size justification (e.g., the use of rules of thumb, or letting resource constraints determine the sample size) leads to an informative study, or not.
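
A base R sketch of how these two numbers are related (power.t.test with the default sd = 1 treats delta as Cohen's d):

```r
# Optimistic a-priori power analysis: d = 0.57, alpha = 0.05, 80% power
power.t.test(delta = 0.57, sig.level = 0.05, power = 0.80)  # n ~ 49.4, so 50 per group

# Critical d given 50 observations per group
n <- 50
t_crit <- qt(1 - 0.05/2, df = 2 * n - 2)
t_crit * sqrt(2/n)  # ~0.40: smaller observed effects will not be significant
```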

What is the Expected Effect Size?

Although the true population effect size is always unknown, there are situations where researchers have a reasonable expectation of the effect size in a study, and want to use this expected effect size in an a-priori power analysis. Even if expectations for the observed effect size are largely a guess, it is always useful to explicitly consider which effect sizes are expected. A researcher can justify a sample size based on the effect size they expect, even if such a study would not be very informative with respect to the smallest effect size of interest. In such cases a study is informative for one inferential goal (testing whether the expected effect size is present or absent), but not highly informative for the second goal (testing whether the smallest effect size of interest is present or absent).

There are typically three sources for expectations about the population effect size: a meta-analysis, a previous study, or a theoretical model. It is tempting for researchers to be overly optimistic about the expected effect size in an a-priori power analysis, as higher effect size estimates yield lower sample sizes, but being too optimistic increases the probability of observing a false negative result. When reviewing a sample size justification based on an a-priori power analysis, it is important to critically evaluate the justification for the expected effect size used in power analyses.

Using an Estimate from a Meta-Analysis

In a perfect world effect size estimates from a meta-analysis would provide researchers with the most accurate information about which effect size they could expect. Due to widespread publication bias in science, effect size estimates from meta-analyses are regrettably not always accurate. They can be biased, sometimes substantially so. Furthermore, meta-analyses typically have considerable heterogeneity, which means that the meta-analytic effect size estimate differs for subsets of studies that make up the meta-analysis. So, although it might seem useful to use a meta-analytic effect size estimate of the effect you are studying in your power analysis, you need to take great care before doing so.

If a researcher wants to enter a meta-analytic effect size estimate in an a-priori power analysis, they need to consider three things (see Table 5 ). First, the studies included in the meta-analysis should be similar enough to the study they are performing that it is reasonable to expect a similar effect size. In essence, this requires evaluating the generalizability of the effect size estimate to the new study. It is important to carefully consider differences between the meta-analyzed studies and the planned study, with respect to the manipulation, the measure, the population, and any other relevant variables.

Second, researchers should check whether the effect sizes reported in the meta-analysis are homogeneous. If not, and there is considerable heterogeneity in the meta-analysis, it means not all included studies can be expected to have the same true effect size. A meta-analytic estimate should be based on the subset of studies that most closely represent the planned study. Note that heterogeneity remains a possibility (even direct replication studies can show heterogeneity when unmeasured variables moderate the effect size in each sample (Kenny & Judd, 2019; Olsson-Collentine et al., 2020)), so the main goal of selecting similar studies is to use existing data to increase the probability that your expectation is accurate, without guaranteeing it will be.

Third, the meta-analytic effect size estimate should not be biased. Check if the bias detection tests that are reported in the meta-analysis are state-of-the-art, or perform multiple bias detection tests yourself (Carter et al., 2019) , and consider bias corrected effect size estimates (even though these estimates might still be biased, and do not necessarily reflect the true population effect size).

Using an Estimate from a Previous Study

If a meta-analysis is not available, researchers often rely on an effect size from a previous study in an a-priori power analysis. The first issue that requires careful attention is whether the two studies are sufficiently similar. Just as when using an effect size estimate from a meta-analysis, researchers should consider if there are differences between the studies in terms of the population, the design, the manipulations, the measures, or other factors that should lead one to expect a different effect size. For example, intra-individual reaction time variability increases with age, and therefore a study performed on older participants should expect a smaller standardized effect size than a study performed on younger participants. If an earlier study used a very strong manipulation, and you plan to use a more subtle manipulation, a smaller effect size should be expected. Finally, effect sizes do not generalize to studies with different designs. For example, the effect size for a comparison between two groups is most often not similar to the effect size for an interaction in a follow-up study where a second factor is added to the original design (Lakens & Caldwell, 2021) .

Even if a study is sufficiently similar, statisticians have warned against using effect size estimates from small pilot studies in power analyses. Leon, Davis, and Kraemer (2011) write:

Contrary to tradition, a pilot study does not provide a meaningful effect size estimate for planning subsequent studies due to the imprecision inherent in data from small samples.

The two main reasons researchers should be careful when using effect sizes from studies in the published literature in power analyses are that effect size estimates can differ from the true population effect size due to random variation, and that publication bias inflates effect sizes. Figure 6 shows the distribution of ηp² for a study with three conditions with 25 participants in each condition, when the null hypothesis is true and when there is a ‘medium’ true effect of ηp² = 0.0588 (Richardson, 2011). As in Figure 4 the critical effect size is indicated, which shows observed effects smaller than ηp² = 0.08 will not be significant with the given sample size. If the null hypothesis is true, effects larger than ηp² = 0.08 will be a Type I error (the dark grey area), and when the alternative hypothesis is true, effects smaller than ηp² = 0.08 will be a Type II error (light grey area). It is clear all significant effects are larger than the true effect size (ηp² = 0.0588), so power analyses based on a significant finding (e.g., because only significant results are published in the literature) will be based on an overestimate of the true effect size, introducing bias.

[Figure 6: Distributions of ηp² for three conditions with 25 participants each, under the null hypothesis and under a true effect of ηp² = 0.0588, with the critical effect size indicated.]

But even if we had access to all effect sizes (e.g., from pilot studies you have performed yourself), due to random variation the observed effect size will sometimes be quite small. Figure 6 shows it is quite likely to observe an effect of ηp² = 0.01 in a small pilot study, even when the true effect size is 0.0588. Entering an effect size estimate of ηp² = 0.01 in an a-priori power analysis would suggest a total sample size of 957 observations to achieve 80% power in a follow-up study. If researchers only follow up on pilot studies when they observe an effect size in the pilot study that, when entered into a power analysis, yields a sample size that is feasible to collect for the follow-up study, these effect size estimates will be upwardly biased, and power in the follow-up study will be systematically lower than desired (Albers & Lakens, 2018).
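
This calculation can be approximately reproduced with the pwr package (a sketch; it assumes the standard conversion from partial eta squared to Cohen's f, and the result may differ slightly from G*Power's):

```r
library(pwr)
# Convert a pilot estimate of partial eta squared = 0.01 to Cohen's f
f <- sqrt(0.01 / (1 - 0.01))
pwr.anova.test(k = 3, f = f, sig.level = 0.05, power = 0.80)
# n ~ 318 per condition, roughly the 957 total observations mentioned above
```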

In essence, the problem with using small studies to estimate the effect size that will be entered into an a-priori power analysis is that, due to publication bias or follow-up bias, the effect sizes researchers end up using for their power analysis do not come from a full F distribution, but from what is known as a truncated F distribution (Taylor & Muller, 1996). For example, imagine there is extreme publication bias in the situation illustrated in Figure 6. The only studies that would be accessible to researchers would come from the part of the distribution where ηp² > 0.08, and the test result would be statistically significant. It is possible to compute an effect size estimate that, based on certain assumptions, corrects for bias. For example, imagine we observe a result in the literature for a One-Way ANOVA with 3 conditions, reported as F(2, 42) = 4.5, p = 0.017, ηp² = 0.176. If we were to take this effect size at face value and enter it as our effect size estimate in an a-priori power analysis, we would need to collect 17 observations in each condition to achieve 80% power.

However, if we assume bias is present, we can use the BUCSS R package (S. F. Anderson et al., 2017) to perform a power analysis that attempts to correct for bias. A power analysis that takes bias into account (under a specific model of publication bias, based on a truncated F distribution where only significant results are published) suggests collecting 73 participants in each condition. It is possible that the bias corrected estimate of the non-centrality parameter used to compute power is zero, in which case it is not possible to correct for bias using this method. As an alternative to formally modeling a correction for publication bias whenever researchers assume an effect size estimate is biased, researchers can simply use a more conservative effect size estimate, for example by computing power based on the lower limit of a 60% two-sided confidence interval around the effect size estimate, which Perugini, Gallucci, and Costantini (2014) refer to as safeguard power. Both these approaches lead to a more conservative power analysis, but not necessarily a more accurate power analysis. It is simply not possible to perform an accurate power analysis on the basis of an effect size estimate from a study that might be biased and/or had a small sample size (Teare et al., 2014). If it is not possible to specify a smallest effect size of interest, and there is great uncertainty about which effect size to expect, it might be more efficient to perform a study with a sequential design (discussed below).

To summarize, an effect size from a previous study in an a-priori power analysis can be used if three conditions are met (see Table 6 ). First, the previous study is sufficiently similar to the planned study. Second, there was a low risk of bias (e.g., the effect size estimate comes from a Registered Report, or from an analysis for which results would not have impacted the likelihood of publication). Third, the sample size is large enough to yield a relatively accurate effect size estimate, based on the width of a 95% CI around the observed effect size estimate. There is always uncertainty around the effect size estimate, and entering the upper and lower limit of the 95% CI around the effect size estimate might be informative about the consequences of the uncertainty in the effect size estimate for an a-priori power analysis.

Using an Estimate from a Theoretical Model

When your theoretical model is sufficiently specific such that you can build a computational model, and you have knowledge about key parameters in your model that are relevant for the data you plan to collect, it is possible to estimate an effect size based on the effect size estimate derived from a computational model. For example, if one had strong ideas about the weights for each feature stimuli share and differ on, it could be possible to compute predicted similarity judgments for pairs of stimuli based on Tversky’s contrast model (Tversky, 1977) , and estimate the predicted effect size for differences between experimental conditions. Although computational models that make point predictions are relatively rare, whenever they are available, they provide a strong justification of the effect size a researcher expects.

Compute the Width of the Confidence Interval around the Effect Size

If a researcher can estimate the standard deviation of the observations that will be collected, it is possible to compute an a-priori estimate of the width of the 95% confidence interval around an effect size (Kelley, 2007). Confidence intervals represent a range around an estimate that is wide enough so that, in the long run, the true population parameter will fall inside the confidence intervals 100 × (1 − α) percent of the time. In any single study the true population effect either falls in the confidence interval or it doesn't, but in the long run one can act as if the confidence interval includes the true population effect size (while keeping the error rate in mind). Cumming (2013) calls the difference between the observed effect size and the upper bound of the 95% confidence interval (or the lower bound of the 95% confidence interval) the margin of error.

If we compute the 95% CI for an effect size of d = 0 based on the t statistic and sample size (Smithson, 2003), we see that with 15 observations in each condition of an independent t test the 95% CI ranges from d = -0.72 to d = 0.72. The margin of error is half the width of the 95% CI, 0.72. A Bayesian estimator who uses an uninformative prior would compute a credible interval with the same (or a very similar) upper and lower bound (Albers et al., 2018; Kruschke, 2011), and might conclude that after collecting the data they would be left with a range of plausible values for the population effect that is too large to be informative. Regardless of the statistical philosophy you plan to rely on when analyzing the data, the evaluation of what we can conclude based on the width of our interval tells us that with 15 observations per group we will not learn a lot.
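
This interval can be reproduced with a short base R sketch based on the noncentral t distribution (following the approach described by Smithson, 2003):

```r
# 95% CI around an observed d = 0 with 15 observations per group
n1 <- n2 <- 15
df <- n1 + n2 - 2
t_obs <- 0  # observed t value for d = 0
# noncentrality parameter that places t_obs at its 2.5th percentile
ncp_upper <- uniroot(function(ncp) pt(t_obs, df, ncp) - 0.025, c(0, 10))$root
ncp_upper * sqrt(1/n1 + 1/n2)  # upper limit ~0.72; the lower limit is -0.72 by symmetry
```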

One useful way of interpreting the width of the confidence interval is based on the effects you would be able to reject if the true effect size is 0. In other words, if there is no effect, which effect sizes would you have been able to reject given the collected data, and which would you not? Effect sizes in the range of d = 0.7 are findings such as “People become aggressive when they are provoked”, “People prefer their own group to other groups”, and “Romantic partners resemble one another in physical attractiveness” (Richard et al., 2003). The width of the confidence interval tells you that you can only reject the presence of effects so large that, if they existed, you would probably already have noticed them. If it is true that most effects that you study are realistically much smaller than d = 0.7, there is a good chance that we do not learn anything we didn't already know by performing a study with n = 15. Even without data, in most research lines we would not consider certain large effects plausible (although the effect sizes that are plausible differ between fields, as discussed below). On the other hand, in large samples where researchers can, for example, reject the presence of effects larger than d = 0.2 if the null hypothesis were true, this analysis of the width of the confidence interval would suggest that peers in many research lines would likely consider the study to be informative.

We see that the margin of error is almost, but not exactly, the same as the minimal statistically detectable effect (d = 0.75). The small difference arises because the 95% confidence interval is calculated based on the t distribution. If the true effect size is not zero, the confidence interval is calculated based on the non-central t distribution, and the 95% CI is asymmetric. Figure 7 visualizes three t distributions, one symmetric at 0, and two asymmetric distributions with a noncentrality parameter (the normalized difference between the means) of 2 and 3. The asymmetry is most clearly visible in very small samples (the distributions in the plot have 5 degrees of freedom) but remains noticeable in larger samples when calculating confidence intervals and statistical power. For example, an observed effect size of d = 0.5 with 15 observations per group would yield d_s = 0.50, 95% CI [-0.23, 1.22]. If we compute the 95% CI around the critical effect size, we would get d_s = 0.75, 95% CI [0.00, 1.48]. We see the 95% CI ranges from exactly 0.00 to 1.48, in line with the relation between a confidence interval and a p value: the 95% CI excludes zero if, and only if, the test is statistically significant. As noted before, the different approaches recommended here to evaluate how informative a study is are often based on the same information.

[Figure 7: A central t distribution and two non-central t distributions (noncentrality parameters 2 and 3) with 5 degrees of freedom.]

Plot a Sensitivity Power Analysis

A sensitivity power analysis fixes the sample size, desired power, and alpha level, and answers the question which effect size a study could detect with a desired power. A sensitivity power analysis is therefore performed when the sample size is already known. Sometimes data has already been collected to answer a different research question, or the data is retrieved from an existing database, and you want to perform a sensitivity power analysis for a new statistical analysis. Other times, you might not have carefully considered the sample size when you initially collected the data, and want to reflect on the statistical power of the study for (ranges of) effect sizes of interest when analyzing the results. Finally, it is possible that the data will be collected in the future, but you know that due to resource constraints the maximum sample size you can collect is limited, and you want to reflect on whether the study has sufficient power for effects that you consider plausible and interesting (such as the smallest effect size of interest, or the effect size that is expected).

Assume a researcher plans to perform a study where 30 observations will be collected in total, 15 in each between participant condition. Figure 8 shows how to perform a sensitivity power analysis in G*Power for a study where we have decided to use an alpha level of 5%, and desire 90% power. The sensitivity power analysis reveals the designed study has 90% power to detect effects of at least d = 1.23. Perhaps a researcher believes that a desired power of 90% is quite high, and is of the opinion that it would still be interesting to perform a study if the statistical power was lower. It can then be useful to plot a sensitivity curve across a range of smaller effect sizes.
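
The same sensitivity analysis can be sketched in base R:

```r
# Which effect size can be detected with n = 15 per group, alpha = 0.05, 90% power?
power.t.test(n = 15, sig.level = 0.05, power = 0.90)
# with the default sd = 1 the reported delta is in standard deviation units: d ~ 1.23
```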

[Figure 8: Sensitivity power analysis in G*Power for n = 15 per group, α = 0.05, and 90% desired power.]

The two dimensions of interest in a sensitivity power analysis are the effect sizes, and the power to observe a significant effect assuming a specific effect size. These two dimensions can be plotted against each other to create a sensitivity curve. For example, a sensitivity curve can be plotted in G*Power by clicking the ‘X-Y plot for a range of values’ button, as illustrated in Figure 9 . Researchers can examine which power they would have for an a-priori plausible range of effect sizes, or they can examine which effect sizes would provide reasonable levels of power. In simulation-based approaches to power analysis, sensitivity curves can be created by performing the power analysis for a range of possible effect sizes. Even if 50% power is deemed acceptable (in which case deciding to act as if the null hypothesis is true after a non-significant result is a relatively noisy decision procedure), Figure 9 shows a study design where power is extremely low for a large range of effect sizes that are reasonable to expect in most fields. Thus, a sensitivity power analysis provides an additional approach to evaluate how informative the planned study is, and can inform researchers that a specific design is unlikely to yield a significant effect for a range of effects that one might realistically expect.

[Figure 9: Sensitivity curve plotting statistical power against effect size for n = 15 per group.]
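
A sensitivity curve of this kind can also be sketched in base R:

```r
# Power across a range of effect sizes for n = 15 per group, alpha = 0.05
d <- seq(0.1, 1.5, by = 0.01)
pow <- sapply(d, function(delta) power.t.test(n = 15, delta = delta)$power)
plot(d, pow, type = "l", xlab = "Effect size (Cohen's d)", ylab = "Statistical power")
abline(h = 0.5, lty = 2)  # even 50% power requires a fairly large effect here
```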

If the number of observations per group had been larger, the evaluation might have been more positive. We might not have had any specific effect size in mind, but if we had collected 150 observations per group, a sensitivity analysis could have shown that power was sufficient for a range of effects we believe is most interesting to examine, and we would still have approximately 50% power for quite small effects. For a sensitivity analysis to be meaningful, the sensitivity curve should be compared against a smallest effect size of interest, or a range of effect sizes that are expected. A sensitivity power analysis has no clear cut-offs to examine (Bacchetti, 2010) . Instead, the idea is to make a holistic trade-off between different effect sizes one might observe or care about, and their associated statistical power.

The Distribution of Effect Sizes in a Research Area

In my personal experience the most commonly entered effect size estimate in an a-priori power analysis for an independent t test is Cohen’s benchmark for a ‘medium’ effect size, because of what is known as the default effect. When you open G*Power, a ‘medium’ effect is the default option for an a-priori power analysis. Cohen’s benchmarks for small, medium, and large effects should not be used in an a-priori power analysis (Cook et al., 2014; Correll et al., 2020), and Cohen regretted having proposed these benchmarks (Funder & Ozer, 2019). The large variety in research topics means that any ‘default’ or ‘heuristic’ that is used to compute statistical power is not just unlikely to correspond to your actual situation, but it is also likely to lead to a sample size that is substantially misaligned with the question you are trying to answer with the collected data.

Some researchers have wondered what a better default would be, if researchers have no other basis to decide upon an effect size for an a-priori power analysis. Brysbaert (2019) recommends d = 0.4 as a default in psychology, which is the average observed in replication projects and several meta-analyses. It is impossible to know if this average effect size is realistic, but it is clear there is huge heterogeneity across fields and research questions. Any average effect size will often deviate substantially from the effect size that should be expected in a planned study. Some researchers have suggested to change Cohen’s benchmarks based on the distribution of effect sizes in a specific field (Bosco et al., 2015; Funder & Ozer, 2019; Hill et al., 2008; Kraft, 2020; Lovakov & Agadullina, 2017) . As always, when effect size estimates are based on the published literature, one needs to evaluate the possibility that the effect size estimates are inflated due to publication bias. Due to the large variation in effect sizes within a specific research area, there is little use in choosing a large, medium, or small effect size benchmark based on the empirical distribution of effect sizes in a field to perform a power analysis.

Having some knowledge about the distribution of effect sizes in the literature can be useful when interpreting the confidence interval around an effect size. If in a specific research area almost no effects are larger than the value you could reject in an equivalence test (e.g., if the observed effect size is 0, the design would only reject effects larger than for example d = 0.7), then it is a-priori unlikely that collecting the data would tell you something you didn’t already know.

It is more difficult to defend the use of a specific effect size derived from an empirical distribution of effect sizes as a justification for the effect size used in an a-priori power analysis. One might argue that the use of an effect size benchmark based on the distribution of effects in the literature will outperform a wild guess, but this is not a strong enough argument to form the basis of a sample size justification. There is a point where researchers need to admit they are not ready to perform an a-priori power analysis due to a lack of clear expectations (Scheel et al., 2020) . Alternative sample size justifications, such as a justification of the sample size based on resource constraints, perhaps in combination with a sequential study design, might be more in line with the actual inferential goals of a study.

So far, the focus has been on justifying the sample size for quantitative studies. There are a number of related topics that can be useful to design an informative study. First, in addition to a-priori or prospective power analysis and sensitivity power analysis, it is important to discuss compromise power analysis (which is useful) and post-hoc or retrospective power analysis (which is not useful, e.g., Zumbo and Hubley (1998), Lenth (2007)). When sample sizes are justified based on an a-priori power analysis it can be very efficient to collect data in sequential designs where data collection is continued or terminated based on interim analyses of the data. Furthermore, it is worthwhile to consider ways to increase the power of a test without increasing the sample size. An additional point of attention is to have a good understanding of your dependent variable, especially its standard deviation. Finally, sample size justification is just as important in qualitative studies, and although there has been much less work on sample size justification in this domain, some proposals exist that researchers can use to design an informative study. Each of these topics is discussed in turn.

Compromise Power Analysis

In a compromise power analysis the sample size and the effect size are fixed, and the error rates of the test are calculated, based on a desired ratio between the Type I and Type II error rate. A compromise power analysis is useful both when a very large number of observations will be collected and when only a small number of observations can be collected.

In the first situation a researcher might be fortunate enough to be able to collect so many observations that the statistical power for a test is very high for all effect sizes that are deemed interesting. For example, imagine a researcher has access to 2000 employees who are all required to answer questions during a yearly evaluation in a company that is testing an intervention intended to reduce subjectively reported stress levels. The researcher is quite confident that an effect smaller than d = 0.2 is not large enough to be subjectively noticeable for individuals (Jaeschke et al., 1989). With an alpha level of 0.05 and 1000 employees per condition, the researcher would have a statistical power of 0.994, or a Type II error rate of 0.006. This means that for a smallest effect size of interest of d = 0.2 the researcher is about 8.3 times more likely to make a Type I error than a Type II error.
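
These numbers can be verified with a short base R sketch (assuming the 2000 employees are split into 1000 per condition):

```r
res <- power.t.test(n = 1000, delta = 0.2, sig.level = 0.05)
beta <- 1 - res$power  # Type II error rate ~0.006
0.05 / beta            # a Type I error is ~8.3 times more likely than a Type II error
```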

Although the original idea of designing studies that control Type I and Type II error rates was that researchers would need to justify their error rates (Neyman & Pearson, 1933) , a common heuristic is to set the Type I error rate to 0.05 and the Type II error rate to 0.20, meaning that a Type I error is 4 times as unlikely as a Type II error. The default use of 80% power (or a 20% Type II or β error) is based on a personal preference of Cohen (1988) , who writes:

It is proposed here as a convention that, when the investigator has no other basis for setting the desired power value, the value .80 be used. This means that β is set at .20. This arbitrary but reasonable value is offered for several reasons (Cohen, 1965, pp. 98-99). The chief among them takes into consideration the implicit convention for α of .05. The β of .20 is chosen with the idea that the general relative seriousness of these two kinds of errors is of the order of .20/.05, i.e., that Type I errors are of the order of four times as serious as Type II errors. This .80 desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value ad hoc.

We see that conventions are built on conventions: the norm to aim for 80% power is built on the norm to set the alpha level at 5%. What we should take away from Cohen is not that we should aim for 80% power, but that we should justify our error rates based on the relative seriousness of each error. This is where compromise power analysis comes in. If you share Cohen’s belief that a Type I error is 4 times as serious as a Type II error, and building on our earlier study on 2000 employees, it makes sense to adjust the Type I error rate when the Type II error rate is low for all effect sizes of interest (Cascio & Zedeck, 1983) . Indeed, Erdfelder, Faul, and Buchner (1996) created the G*Power software in part to give researchers a tool to perform compromise power analysis.

Figure 10 illustrates how a compromise power analysis is performed in G*Power when a Type I error is deemed to be as costly as a Type II error, which for a study with 1000 observations per condition would lead to a Type I error rate and a Type II error rate of 0.0179. As Faul, Erdfelder, Lang, and Buchner (2007) write:

Of course, compromise power analyses can easily result in unconventional significance levels greater than α = .05 (in the case of small samples or effect sizes) or less than α = .001 (in the case of large samples or effect sizes). However, we believe that the benefit of balanced Type I and Type II error risks often offsets the costs of violating significance level conventions.

[Figure 10: Compromise power analysis in G*Power with equally weighted error rates for 1000 observations per condition, yielding α = β = 0.0179.]

This brings us to the second situation where a compromise power analysis can be useful, which is when we know the statistical power in our study is low. Although it is highly undesirable to make decisions when error rates are high, if one finds oneself in a situation where a decision must be made based on little information, Winer (1962) writes:

The frequent use of the .05 and .01 levels of significance is a matter of convention having little scientific or logical basis. When the power of tests is likely to be low under these levels of significance, and when Type I and Type II errors are of approximately equal importance, the .30 and .20 levels of significance may be more appropriate than the .05 and .01 levels.

For example, if we plan to perform a two-sided t test, can feasibly collect at most 50 observations in each independent group, and expect a population effect size of 0.5, we would have 70% power if we set our alpha level to 0.05. We can choose to weigh both types of error equally, and set the alpha level to 0.149, ending up with a statistical power of 0.851 for an effect of d = 0.5 (given a 0.149 Type II error rate). The choice of α and β in a compromise power analysis can be extended to take prior probabilities of the null and alternative hypothesis into account (Maier & Lakens, 2022; Miller & Ulrich, 2019; Murphy et al., 2014).
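
A minimal base R sketch of such a compromise analysis, using root-finding to locate the alpha level at which α = β (it ignores the negligible probability of rejecting in the wrong tail):

```r
# Balanced error rates for a two-sided independent t test
balanced_alpha <- function(n, d) {
  df <- 2 * n - 2
  ncp <- d * sqrt(n / 2)  # noncentrality parameter
  # beta equals the probability of a t value below the critical value under H1
  uniroot(function(a) pt(qt(1 - a/2, df), df, ncp) - a, c(1e-6, 0.5))$root
}
balanced_alpha(n = 50, d = 0.5)    # ~0.149, matching the example above
balanced_alpha(n = 1000, d = 0.2)  # ~0.0179, matching the earlier employee example
```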

A compromise power analysis requires a researcher to specify the sample size. This sample size itself requires a justification, so a compromise power analysis will typically be performed together with a resource constraint justification for a sample size. It is especially important to perform a compromise power analysis if your resource constraint justification is strongly based on the need to make a decision, in which case a researcher should think carefully about the Type I and Type II error rates stakeholders are willing to accept. However, a compromise power analysis also makes sense if the sample size is very large, but a researcher did not have the freedom to set the sample size. This might happen if, for example, data collection is part of a larger international study and the sample size is based on other research questions. In designs where the Type II error rate is very small (and power is very high) some statisticians have also recommended to lower the alpha level to prevent Lindley’s paradox, a situation where a significant effect ( p < α ) is evidence for the null hypothesis (Good, 1992; Jeffreys, 1939) . Lowering the alpha level as a function of the statistical power of the test can prevent this paradox, providing another argument for a compromise power analysis when sample sizes are large (Maier & Lakens, 2022) . Finally, a compromise power analysis needs a justification for the effect size, either based on a smallest effect size of interest or an effect size that is expected. Table 7 lists three aspects that should be discussed alongside a reported compromise power analysis.

What to do if Your Editor Asks for Post-hoc Power?

Post-hoc, retrospective, or observed power is used to describe the statistical power of the test that is computed assuming the effect size that has been estimated from the collected data is the true effect size (Lenth, 2007; Zumbo & Hubley, 1998) . Post-hoc power is therefore not performed before looking at the data, based on effect sizes that are deemed interesting, as in an a-priori power analysis, and it is unlike a sensitivity power analysis where a range of interesting effect sizes is evaluated. Because a post-hoc or retrospective power analysis is based on the effect size observed in the data that has been collected, it does not add any information beyond the reported p value, but it presents the same information in a different way. Despite this fact, editors and reviewers often ask authors to perform post-hoc power analysis to interpret non-significant results. This is not a sensible request, and whenever it is made, you should not comply with it. Instead, you should perform a sensitivity power analysis, and discuss the power for the smallest effect size of interest and a realistic range of expected effect sizes.

Post-hoc power is directly related to the p value of the statistical test (Hoenig & Heisey, 2001). For a z test where the p value is exactly 0.05, post-hoc power is always 50%. The reason for this relationship is that when a p value is observed that equals the alpha level of the test (e.g., 0.05), the observed z score of the test is exactly equal to the critical value of the test (e.g., z = 1.96 in a two-sided test with a 5% alpha level). Whenever the alternative hypothesis is centered on the critical value, half the values we expect to observe if this alternative hypothesis is true fall below the critical value, and half fall above it. Therefore, a test where we observed a p value identical to the alpha level will have exactly 50% power in a post-hoc power analysis, as the analysis assumes the observed effect size is true.
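
A short base R sketch makes this concrete for a two-sided z test:

```r
alpha <- 0.05
z_crit <- qnorm(1 - alpha/2)     # 1.96
z_obs <- z_crit                  # observed z when p is exactly .05
1 - pnorm(z_crit, mean = z_obs)  # 0.5 (the lower rejection region adds a negligible amount)
```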

For other statistical tests, where the alternative distribution is not symmetric (such as for the t test, where the alternative hypothesis follows a non-central t distribution, see Figure 7 ), a p = 0.05 does not directly translate to an observed power of 50%, but by plotting post-hoc power against the observed p value we see that the two statistics are always directly related. As Figure 11 shows, if the p value is non-significant (i.e., larger than 0.05) the observed power will be less than approximately 50% in a t test. Lenth (2007) explains how observed power is also completely determined by the observed p value for F tests, although the statement that a non-significant p value implies a power less than 50% no longer holds.

[Figure 11: Post-hoc power plotted against the observed p value for a t test.]

When editors or reviewers ask researchers to report post-hoc power analyses they would like to be able to distinguish between true negatives (concluding there is no effect, when there is no effect) and false negatives (a Type II error, concluding there is no effect, when there actually is an effect). Since reporting post-hoc power is just a different way of reporting the p value, reporting the post-hoc power will not provide an answer to the question editors are asking (Hoenig & Heisey, 2001; Lenth, 2007; Schulz & Grimes, 2005; Yuan & Maxwell, 2005) . To be able to draw conclusions about the absence of a meaningful effect, one should perform an equivalence test, and design a study with high power to reject the smallest effect size of interest (Lakens, Scheel, et al., 2018) . Alternatively, if no smallest effect size of interest was specified when designing the study, researchers can report a sensitivity power analysis.

Sequential Analyses

Whenever the sample size is justified based on an a-priori power analysis it can be very efficient to collect data in a sequential design. Sequential designs control error rates across multiple looks at the data (e.g., after 50, 100, and 150 observations have been collected) and can reduce the average expected sample size that is collected compared to a fixed design where data is only analyzed after the maximum sample size is collected (Proschan et al., 2006; Wassmer & Brannath, 2016) . Sequential designs have a long history (Dodge & Romig, 1929) , and exist in many variations, such as the Sequential Probability Ratio Test (Wald, 1945) , combining independent statistical tests (Westberg, 1985) , group sequential designs (Jennison & Turnbull, 2000) , sequential Bayes factors (Schönbrodt et al., 2017) , and safe testing (Grünwald et al., 2019) . Of these approaches, the Sequential Probability Ratio Test is most efficient if data can be analyzed after every observation (Schnuerch & Erdfelder, 2020) . Group sequential designs, where data is collected in batches, provide more flexibility in data collection, error control, and corrections for effect size estimates (Wassmer & Brannath, 2016) . Safe tests provide optimal flexibility if there are dependencies between observations (ter Schure & Grünwald, 2019) .

Sequential designs are especially useful when there is considerable uncertainty about the effect size, or when it is plausible that the true effect size is larger than the smallest effect size of interest the study is designed to detect (Lakens, 2014). In such situations data collection can terminate early if the effect size is larger than the smallest effect size of interest, but data collection can continue to the maximum sample size if needed. Sequential designs can prevent waste when testing hypotheses, both by stopping early when the null hypothesis can be rejected and by stopping early if the presence of a smallest effect size of interest can be rejected (i.e., stopping for futility). Group sequential designs are currently the most widely used approach to sequential analyses, and can be planned and analyzed using rpact (Wassmer & Pahlke, 2019) or gsDesign (K. M. Anderson, 2014).
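
As an illustration of why error rates need to be controlled across looks, the following base R simulation (a sketch, not the rpact or gsDesign interface) compares naive repeated testing with a Pocock-style boundary; the per-look alpha of 0.0294 is the standard Pocock value for two looks at an overall 5% level:

```r
# Simulated Type I error rate with two looks at the data (H0 true)
set.seed(1)
sim_looks <- function(alpha_per_look, n_per_look = 50, n_looks = 2, nsim = 5000) {
  mean(replicate(nsim, {
    x <- rnorm(n_looks * n_per_look)
    y <- rnorm(n_looks * n_per_look)
    any(sapply(seq_len(n_looks), function(k) {
      idx <- seq_len(k * n_per_look)
      t.test(x[idx], y[idx])$p.value < alpha_per_look
    }))
  }))
}
sim_looks(0.05)    # naive looks inflate the Type I error rate to ~0.08
sim_looks(0.0294)  # a Pocock-style boundary keeps it near 0.05
```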

Increasing Power Without Increasing the Sample Size

The most straightforward approach to increase the informational value of studies is to increase the sample size. Because resources are often limited, it is also worthwhile to explore different approaches to increasing the power of a test without increasing the sample size. The first option is to use directional tests where relevant. Researchers often make directional predictions, such as ‘we predict X is larger than Y’. The statistical test that logically follows from this prediction is a directional (or one-sided) t test. A directional test moves the Type I error rate to one side of the tail of the distribution, which lowers the critical value, and therefore requires fewer observations to achieve the same statistical power.
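
A quick base R comparison illustrates the gain for d = 0.5 and 80% power:

```r
power.t.test(delta = 0.5, power = 0.80)$n                             # ~64 per group
power.t.test(delta = 0.5, power = 0.80, alternative = "one.sided")$n  # ~51 per group
```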

Although there is some discussion about when directional tests are appropriate, they are perfectly defensible from a Neyman-Pearson perspective on hypothesis testing (Cho & Abe, 2013), which makes a (preregistered) directional test a straightforward approach to increase both the power of a test and the riskiness of the prediction. However, there might be situations where you do not want to ask a directional question. Sometimes, especially in research with applied consequences, it might be important to examine if a null effect can be rejected, even if the effect is in the opposite direction as predicted. For example, when you are evaluating a recently introduced educational intervention, and you predict the intervention will increase the performance of students, you might want to explore the possibility that students perform worse, to be able to recommend abandoning the new intervention. In such cases it is also possible to distribute the error rate in a ‘lop-sided’ manner, for example assigning a stricter error rate to effects in the negative than in the positive direction (Rice & Gaines, 1994).

Another approach to increase the power without increasing the sample size, is to increase the alpha level of the test, as explained in the section on compromise power analysis. Obviously, this comes at an increased probability of making a Type I error. The risk of making either type of error should be carefully weighed, which typically requires taking into account the prior probability that the null-hypothesis is true (Cascio & Zedeck, 1983; Miller & Ulrich, 2019; Mudge et al., 2012; Murphy et al., 2014) . If you have to make a decision, or want to make a claim, and the data you can feasibly collect is limited, increasing the alpha level is justified, either based on a compromise power analysis, or based on a cost-benefit analysis (Baguley, 2004; Field et al., 2004) .

Another widely recommended approach to increase the power of a study is to use a within participant design where possible. In almost all cases where a researcher is interested in detecting a difference between groups, a within participant design will require collecting fewer participants than a between participant design. The reason for the decrease in the sample size is explained by the equation below from Maxwell, Delaney, and Kelley (2017). The number of participants needed in a two group within-participants design (NW) relative to the number of participants needed in a two group between-participants design (NB), assuming normal distributions, is:

$N_W = \frac{N_B (1 - \rho)}{2}$

The required number of participants is divided by two because in a within-participants design with two conditions every participant provides two data points. The extent to which this reduces the sample size compared to a between-participants design also depends on the correlation between the dependent variables (e.g., the correlation between the measure collected in a control task and an experimental task), as indicated by the (1 − ρ) part of the equation. If the correlation is 0, a within-participants design simply needs half as many participants as a between-participants design (e.g., 64 instead of 128 participants). The higher the correlation, the larger the relative benefit of within-participants designs, and whenever the correlation is negative (up to -1) the relative benefit disappears. Especially when dependent variables in within-participants designs are positively correlated, within-participants designs will greatly increase the power you can achieve given the sample size you have available. Use within-participants designs when possible, but weigh the benefits of higher power against the downsides of order effects or carryover effects that might be problematic in a within-participants design (Maxwell et al., 2017). For designs with multiple factors with multiple levels it can be difficult to specify the full correlation matrix that specifies the expected population correlation for each pair of measurements (Lakens & Caldwell, 2021). In these cases sequential analyses might provide a solution.
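
A small numerical sketch of this equation (the values for NB and the correlations are illustrative):

```r
NB <- 128              # total N for a two-group between-participants design
rho <- c(0, 0.5, 0.8)  # correlation between the two within-participant measurements
NB * (1 - rho) / 2     # 64, 32, and 12.8 participants, respectively
```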

In general, the smaller the variation, the larger the standardized effect size (because we are dividing the raw effect by a smaller standard deviation) and thus the higher the power given the same number of observations. Some additional recommendations are provided in the literature (Allison et al., 1997; Bausell & Li, 2002; Hallahan & Rosenthal, 1996) , such as:

Use better ways to screen participants for studies where participants need to be screened before participation.

Assign participants unequally to conditions (if data in the control condition is much cheaper to collect than data in the experimental condition, for example).

Use reliable measures that have low error variance (Williams et al., 1995) .

Make smart use of preregistered covariates (Meyvis & Van Osselaer, 2018).

It is important to consider whether these ways to reduce the variation in the data come at too large a cost for external validity. For example, in an intention-to-treat analysis in randomized controlled trials, participants who do not comply with the protocol are retained in the analysis so that the effect size from the study accurately represents the effect of implementing the intervention in the population, and not the effect of the intervention only on those people who perfectly follow the protocol (Gupta, 2011). Similar trade-offs between reducing the variance and external validity exist in other research areas.

Know Your Measure

Although it is convenient to talk about standardized effect sizes, it is generally preferable if researchers can interpret effects in the raw (unstandardized) scores, and have knowledge about the standard deviation of their measures (Baguley, 2009; Lenth, 2001). To make it possible for a research community to have realistic expectations about the standard deviation of measures they collect, it is beneficial if researchers within a research area use the same validated measures. This provides a reliable knowledge base that makes it easier to plan for a desired accuracy, and to use a smallest effect size of interest on the unstandardized scale in an a-priori power analysis.

In addition to knowledge about the standard deviation it is important to have knowledge about the correlations between dependent variables (for example because Cohen’s \(d_z\) for a dependent t test relies on the correlation between the paired measurements). The more complex the model, the more aspects of the data-generating process need to be known to make predictions. For example, in hierarchical models researchers need knowledge about variance components to be able to perform a power analysis (DeBruine & Barr, 2019; Westfall et al., 2014). Finally, it is important to know the reliability of your measure (Parsons et al., 2019), especially when relying on an effect size from a published study that used a measure with different reliability, or when the same measure is used in different populations, in which case it is possible that measurement reliability differs between populations. With the increasing availability of open data, it will hopefully become easier to estimate these parameters using data from earlier studies.

If we calculate a standard deviation from a sample, this value is an estimate of the true value in the population. In small samples, our estimate can be quite far off, while due to the law of large numbers, as our sample size increases, we will be measuring the standard deviation more accurately. Since the sample standard deviation is an estimate with uncertainty, we can calculate a confidence interval around the estimate (Smithson, 2003), and design pilot studies that will yield a sufficiently reliable estimate of the standard deviation. The confidence interval for the variance \(\sigma^2\) is given by the following formula, and the confidence interval for the standard deviation is the square root of these limits:

\[\frac{(N-1)s^2}{\chi^2_{N-1,\, \alpha/2}} \leq \sigma^2 \leq \frac{(N-1)s^2}{\chi^2_{N-1,\, 1-\alpha/2}}\]
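
A minimal R sketch of this interval, assuming normally distributed data (the helper name and example values are illustrative):

```r
# Confidence interval for a standard deviation s estimated from n observations,
# based on the chi-squared distribution
ci_sd <- function(s, n, conf = 0.95) {
  alpha <- 1 - conf
  df <- n - 1
  sqrt(c(lower = df * s^2 / qchisq(1 - alpha / 2, df),
         upper = df * s^2 / qchisq(alpha / 2, df)))
}

ci_sd(s = 1, n = 20)   # wide: roughly 0.76 to 1.46
ci_sd(s = 1, n = 200)  # considerably narrower with a larger sample
```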

Whenever there is uncertainty about parameters, researchers can use sequential designs to perform an internal pilot study (Wittes & Brittain, 1990). The idea behind an internal pilot study is that researchers specify a tentative sample size for the study, perform an interim analysis, use the data from the internal pilot study to update parameters such as the variance of the measure, and finally update the final sample size that will be collected. As long as interim looks at the data are blinded (e.g., information about the conditions is not taken into account) the sample size can be adjusted based on an updated estimate of the variance without any practical consequences for the Type I error rate (Friede & Kieser, 2006; Proschan, 2005). Therefore, if researchers are interested in designing an informative study where the Type I and Type II error rates are controlled, but they lack information about the standard deviation, an internal pilot study might be an attractive approach to consider (Chang, 2016).
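
As a rough sketch of the idea, assuming a two-sample t test with a raw effect of interest of 5 and an initially assumed standard deviation of 10 (all values illustrative):

```r
# Initial sample size per group, based on the assumed standard deviation
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.9)$n

# After the internal pilot, update the SD estimate from condition-blind data
set.seed(1)
pilot <- rnorm(40, mean = 0, sd = 12)  # stand-in for blinded interim data
power.t.test(delta = 5, sd = sd(pilot), sig.level = 0.05, power = 0.9)$n
```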

Conventions as meta-heuristics

Even when a researcher might not use a heuristic to directly determine the sample size in a study, there is an indirect way in which heuristics play a role in sample size justifications. Sample size justifications based on inferential goals such as a power analysis, accuracy, or a decision all require researchers to choose values for a desired Type I and Type II error rate, a desired accuracy, or a smallest effect size of interest. Although it is sometimes possible to justify these values as described above (e.g., based on a cost-benefit analysis), a solid justification of these values might require dedicated research lines. Performing such research lines will not always be possible, and these studies might themselves not be worth the costs (e.g., it might require less resources to perform a study with an alpha level that most peers would consider conservatively low, than to collect all the data that would be required to determine the alpha level based on a cost-benefit analysis). In these situations, researchers might use values based on a convention.

When it comes to a desired width of a confidence interval, a desired power, or any other input values required to perform a sample size computation, it is important to transparently report the use of a heuristic or convention (for example by using the accompanying online Shiny app). A convention such as the use of a 5% Type I error rate and 80% power practically functions as a lower threshold of the minimum informational value peers are expected to accept without any justification (whereas with a justification, higher error rates can also be deemed acceptable by peers). It is important to realize that none of these values are set in stone. Journals are free to specify that they desire a higher informational value in their author guidelines (e.g., Nature Human Behaviour requires registered reports to be designed to achieve 95% statistical power, and my own department has required staff to submit ERB proposals where, whenever possible, the study was designed to achieve 90% power). Researchers who choose to design studies with a higher informational value than a conventional minimum should receive credit for doing so.

In the past some fields have changed conventions, such as the 5 sigma threshold now used in physics to declare a discovery instead of a 5% Type I error rate. In other fields such attempts have been unsuccessful (e.g., Johnson, 2013). Improved conventions should be context dependent, and it seems sensible to establish them through consensus meetings (Mullan & Jacoby, 1985). Consensus meetings are common in medical research, and have been used to decide upon a smallest effect size of interest (for an example, see Fried, Boers, and Baker, 1993). In many research areas current conventions can be improved. For example, it seems peculiar to have a default alpha level of 5% both for single studies and for meta-analyses, and one could imagine a future where the default alpha level in meta-analyses is much lower than 5%. Hopefully, making the lack of an adequate justification for certain input values in specific situations more transparent will motivate fields to start a discussion about how to improve current conventions. The online Shiny app links to good examples of justifications where possible, and will continue to be updated as better justifications are developed in the future.

Sample Size Justification in Qualitative Research

A value of information perspective to sample size justification also applies to qualitative research. A sample size justification in qualitative research should be based on the consideration that the cost of collecting data from additional participants does not yield new information that is valuable enough given the inferential goals. One widely used application of this idea is known as saturation and is indicated by the observation that new data replicates earlier observations, without adding new information (Morse, 1995). For example, let’s imagine we ask people why they have a pet. Interviews might reveal reasons that are grouped into categories, but after interviewing 20 people, no new categories emerge, at which point saturation has been reached. Alternative philosophies to qualitative research exist, and not all value planning for saturation. Regrettably, principled approaches to justify sample sizes have not been developed for these alternative philosophies (Marshall et al., 2013).

When sampling, the goal is often not to pick a representative sample, but a sample that contains a sufficiently diverse number of subjects such that saturation is reached efficiently. Fugard and Potts (2015) show how to move towards a more informed justification for the sample size in qualitative research based on 1) the number of codes that exist in the population (e.g., the number of reasons people have pets), 2) the probability a code can be observed in a single information source (e.g., the probability that someone you interview will mention each possible reason for having a pet), and 3) the number of times you want to observe each code. They provide an R formula based on binomial probabilities to compute a required sample size to reach a desired probability of observing codes.
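
Their exact formula is not reproduced here, but a minimal sketch in the same spirit, for a single code with an assumed prevalence p, could look as follows (the function name and defaults are illustrative):

```r
# Smallest number of information sources n such that a code with prevalence p
# is observed at least r times with probability >= assurance
n_for_code <- function(p, r = 1, assurance = 0.95) {
  n <- r
  while (pbinom(r - 1, n, p) > 1 - assurance) n <- n + 1
  n
}

n_for_code(p = 0.2, r = 1)  # 14 interviews to observe a 20%-prevalence code once
n_for_code(p = 0.2, r = 3)  # more interviews to observe the same code three times
```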

A more advanced approach is used in Rijnsoever (2017), which also explores the importance of different sampling strategies. In general, purposefully sampling information from sources you expect will yield novel information is much more efficient than random sampling, but this also requires a good overview of the expected codes, and the sub-populations in which each code can be observed. Sometimes, it is possible to identify information sources that, when interviewed, would at least yield one new code (e.g., based on informal communication before an interview). A good sample size justification in qualitative research is based on 1) an identification of the populations, including any sub-populations, 2) an estimate of the number of codes in the (sub-)population, 3) the probability a code is encountered in an information source, and 4) the sampling strategy that is used.

Providing a coherent sample size justification is an essential step in designing an informative study. There are multiple approaches to justifying the sample size in a study, depending on the goal of the data collection, the resources that are available, and the statistical approach that is used to analyze the data. An overarching principle in all these approaches is that researchers consider the value of the information they collect in relation to their inferential goals.

The process of justifying a sample size when designing a study should sometimes lead to the conclusion that it is not worthwhile to collect the data, because the study does not have sufficient informational value to justify the costs. There will be cases where it is unlikely there will ever be enough data to perform a meta-analysis (for example because of a lack of general interest in the topic), the information will not be used to make a decision or claim, and the statistical tests do not allow you to test a hypothesis with reasonable error rates or to estimate an effect size with sufficient accuracy. If there is no good justification to collect the maximum number of observations that one can feasibly collect, performing the study anyway is a waste of time and/or money (Brown, 1983; Button et al., 2013; S. D. Halpern et al., 2002).

The awareness that sample sizes in past studies were often too small to meet any realistic inferential goals is growing among psychologists (Button et al., 2013; Fraley & Vazire, 2014; Lindsay, 2015; Sedlmeier & Gigerenzer, 1989). As an increasing number of journals start to require sample size justifications, some researchers will realize they need to collect larger samples than they were used to. This means researchers will need to request more money for participant payment in grant proposals, or that researchers will need to increasingly collaborate (Moshontz et al., 2018). If you believe your research question is important enough to be answered, but you are not able to answer the question with your current resources, one approach to consider is to organize a research collaboration with peers, and pursue an answer to this question collectively.

A sample size justification should not be seen as a hurdle that researchers need to pass before they can submit a grant, ethical review board proposal, or manuscript for publication. When a sample size is simply stated, instead of carefully justified, it can be difficult to evaluate whether the value of the information a researcher aims to collect outweighs the costs of data collection. Being able to report a solid sample size justification means a researcher knows what they want to learn from a study, and makes it possible to design a study that can provide an informative answer to a scientific question.

This work was funded by VIDI Grant 452-17-013 from the Netherlands Organisation for Scientific Research. I would like to thank Shilaan Alzahawi, José Biurrun, Aaron Caldwell, Gordon Feld, Yoav Kessler, Robin Kok, Maximilian Maier, Matan Mazor, Toni Saari, Andy Siddall, and Jesper Wulff for feedback on an earlier draft. A computationally reproducible version of this manuscript is available at https://github.com/Lakens/sample_size_justification. An interactive online form to complete a sample size justification implementing the recommendations in this manuscript can be found at https://shiny.ieis.tue.nl/sample_size_justification/.

I have no competing interests to declare.

The topic of power analysis for meta-analyses is outside the scope of this manuscript, but see Hedges and Pigott (2001) and Valentine, Pigott, and Rothstein (2010).

It is possible to argue we are still making an inference, even when the entire population is observed, because we have observed a metaphorical population from one of many possible worlds, see Spiegelhalter (2019).

Power analyses can be performed based on standardized effect sizes or effect sizes expressed on the original scale. It is important to know the standard deviation of the effect (see the ‘Know Your Measure’ section) but I find it slightly more convenient to talk about standardized effects in the context of sample size justifications.

These figures can be reproduced and adapted in an online Shiny app: http://shiny.ieis.tue.nl/d_p_power/

Confidence intervals around effect sizes can be computed using the MOTE Shiny app: https://www.aggieerin.com/shiny-server/

Shiny apps are available for both rpact (https://rpact.shinyapps.io/public/) and gsDesign (https://gsdesign.shinyapps.io/prod/)

You can compare within- and between-participants designs in this Shiny app: http://shiny.ieis.tue.nl/within_between

How to Determine Sample Size

Sample size can make or break your research project. Here’s how to master the delicate art of choosing the right sample size.

Author: Will Webster

Sample size is the beating heart of any research project. It’s the invisible force that gives life to your data, making your findings robust, reliable and believable.

Sample size is what determines if you see a broad view or a focus on minute details; the art and science of correctly determining it involves a careful balancing act. Finding an appropriate sample size demands a clear understanding of the level of detail you wish to see in your data and the constraints you might encounter along the way.

Remember, whether you’re studying a small group or an entire population, your findings are only ever as good as the sample you choose.

Let’s delve into the world of sampling and uncover the best practices for determining sample size for your research.

“How much sample do we need?” is one of the most commonly asked questions and stumbling points in the early stages of research design. Finding the right answer to it requires first understanding and answering two other questions:

How important is statistical significance to you and your stakeholders?

What are your real-world constraints?

At the heart of this question is the goal to confidently differentiate between groups, by describing meaningful differences as statistically significant. Statistical significance isn’t a difficult concept, but it needs to be considered within the unique context of your research and your measures.

First, you should consider when you deem a difference to be meaningful in your area of research. While the standards for statistical significance are universal, the standards for “meaningful difference” are highly contextual.

For example, a 10% difference between groups might not be enough to merit a change in a marketing campaign for a breakfast cereal, but a 10% difference in efficacy of breast cancer treatments might quite literally be the difference between life and death for hundreds of patients. The exact same magnitude of difference has very little meaning in one context, but has extraordinary meaning in another. You ultimately need to determine the level of precision that will help you make your decision.

Within sampling, the lowest amount of magnification – or smallest sample size – could make the most sense, given the level of precision needed, as well as timeline and budgetary constraints.

If you’re able to detect statistical significance at a difference of 10%, and 10% is a meaningful difference, there is no need for a larger sample size, or higher magnification. However, if the study will only be useful if a significant difference is detected for smaller differences – say, a difference of 5% — the sample size must be larger to accommodate this needed precision. Similarly, if 5% is enough, and 3% is unnecessary, there is no need for a larger statistically significant sample size.

You should also consider how much you expect your responses to vary. When there is a lot of variability in responses, it takes a much larger sample to be confident that the differences you observe between groups are statistically significant.

For instance, it will take a lot more sample to find statistically significant differences between groups if you are asking, “How many miles are there between the Earth and the moon?” than if you are asking, “What month do you think Christmas is in?”. In the former, responses will vary widely, while in the latter nearly everybody will give the exact same answer. Simply put, when your variables have a lot of variance, larger sample sizes make sense.

Statistical significance

The likelihood that the results of a study or experiment did not occur randomly or by chance, but are meaningful and indicate a genuine effect or relationship between variables.

Magnitude of difference

The size or extent of the difference between two or more groups or variables, providing a measure of the effect size or practical significance of the results.

Actionable insights

Valuable findings or conclusions drawn from data analysis that can be directly applied or implemented in decision-making processes or strategies to achieve a particular goal or outcome.

It’s crucial to understand the differences between the concepts of “statistical significance”, “magnitude of difference” and “actionable insights” – and how they can influence each other:

  • Even if there is a statistically significant difference, it doesn’t mean the magnitude of the difference is large: with a large enough sample, a 3% difference could be statistically significant
  • Even if the magnitude of the difference is large, it doesn’t guarantee that this difference is statistically significant: with a small enough sample, an 18% difference might not be statistically significant
  • Even if there is a large, statistically significant difference, it doesn’t mean there is a story, or that there are actionable insights

There is no way to guarantee statistically significant differences at the outset of a study – and that is a good thing.

Even with a sample size of a million, there simply may not be any differences – at least, any that could be described as statistically significant. And there are times when a lack of significance is positive.

Imagine if your main competitor ran a multi-million dollar ad campaign in a major city and a huge pre-post study to detect campaign effects, only to discover that there were no statistically significant differences in brand awareness. This may be terrible news for your competitor, but it would be great news for you.

As you determine your sample size, you should consider the real-world constraints to your research.

Factors revolving around timings, budget and target population are among the most common constraints, impacting virtually every study. But by understanding and acknowledging them, you can deftly navigate the practical constraints of your research when pulling together your sample.

Timeline constraints

Gathering a larger sample size naturally requires more time. This is particularly true for elusive audiences, those hard-to-reach groups that require special effort to engage. Your timeline could become an obstacle if it is particularly tight, causing you to rethink your sample size to meet your deadline.

Budgetary constraints

Every sample, whether large or small, inexpensive or costly, signifies a portion of your budget. Samples are like goods in a market: some are inexpensive, others are pricey, but all have a price tag attached.

Population constraints

Sometimes the individuals or groups you’re interested in are difficult to reach; other times, they’re a part of an extremely small population. These factors can limit your sample size even further.

What’s a good sample size?

A good sample size really depends on the context and goals of the research. In general, a good sample size is one that accurately represents the population and allows for reliable statistical analysis.

Larger sample sizes are typically better because they reduce the likelihood of sampling errors and provide a more accurate representation of the population. However, larger sample sizes often increase the impact of practical considerations, like time, budget and the availability of your audience. Ultimately, you should be aiming for a sample size that provides a balance between statistical validity and practical feasibility.

4 tips for choosing the right sample size

Choosing the right sample size is an intricate balancing act, but following these four tips can take away a lot of the complexity.

1) Start with your goal

The foundation of your research is a clearly defined goal. You need to determine what you’re trying to understand or discover, and use your goal to guide your research methods – including your sample size.

If your aim is to get a broad overview of a topic, a larger, more diverse sample may be appropriate. However, if your goal is to explore a niche aspect of your subject, a smaller, more targeted sample might serve you better. You should always align your sample size with the objectives of your research.

2) Know that you can’t predict everything

Research is a journey into the unknown. While you may have hypotheses and predictions, it’s important to remember that you can’t foresee every outcome – and this uncertainty should be considered when choosing your sample size.

A larger sample size can help to mitigate some of the risks of unpredictability, providing a more diverse range of data and potentially more accurate results. However, you shouldn’t let the fear of the unknown push you into choosing an impractically large sample size.

3) Plan for a sample that meets your needs and considers your real-life constraints

Every research project operates within certain boundaries – commonly budget, timeline and the nature of the sample itself. When deciding on your sample size, these factors need to be taken into consideration.

Be realistic about what you can achieve with your available resources and time, and always tailor your sample size to fit your constraints – not the other way around.

4) Use best practice guidelines to calculate sample size

There are many established guidelines and formulas that can help you in determining the right sample size.

The easiest way to define your sample size is to use a sample size calculator, or you can use a manual sample size calculation if you want to test your math skills. Cochran’s formula is perhaps the best-known equation for calculating sample size, and it is widely used when the population is large or unknown.

Cochran’s sample size formula:

\[n_0 = \frac{Z^2 \, p (1 - p)}{e^2}\]

where \(Z\) is the z-score for the chosen confidence level, \(p\) is the estimated proportion of the population with the attribute of interest, and \(e\) is the margin of error.

Beyond the formula, it’s vital to consider the confidence interval, which plays a significant role in determining the appropriate sample size – especially when working with a random sample – and the sample proportion. This represents the expected ratio of the target population that has the characteristic or response you’re interested in, and therefore has a big impact on your correct sample size.

If your population is small, or its variance is unknown, there are steps you can still take to determine the right sample size. Common approaches here include conducting a small pilot study to gain initial estimates of the population variance, and taking a conservative approach by assuming a larger variance to ensure a more representative sample size.
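
As a sketch, Cochran’s formula with the common finite population correction for small populations can be written as a small R function (the function name and defaults are illustrative):

```r
# Cochran's sample size: n0 = Z^2 * p * (1 - p) / e^2, where p is the expected
# proportion and e the margin of error; optionally corrected for a finite
# population of size N
cochran_n <- function(p = 0.5, e = 0.05, conf = 0.95, N = Inf) {
  z  <- qnorm(1 - (1 - conf) / 2)
  n0 <- z^2 * p * (1 - p) / e^2
  if (is.finite(N)) n0 <- n0 / (1 + (n0 - 1) / N)  # finite population correction
  ceiling(n0)
}

cochran_n()          # 385 for a large or unknown population
cochran_n(N = 1000)  # 278 when the population itself is only 1,000
```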


Understanding Concepts in Estimating Sample Size in Survey Studies

J. P. Verma and Priyam Verma | First Online: 21 July 2020

A small sample yields inaccurate findings, while an overly large sample is an unnecessary mobilization of extra resources; determining the optimum sample size is therefore a crucial exercise in research studies. Moreover, in a large sample even a small effect may be found to be significant, although it may not have any practical utility. Two concepts are used for inferring the optimal sample size: the precision of estimates and the power of the test. The concept of precision is used in determining sample size in survey studies, whereas in hypothesis-testing experiments, sample size is estimated on the basis of the power required in the study. In this chapter, we discuss the theoretical foundations used to arrive at the formula for calculating the optimal sample size based on the concept of precision. The following chapter deals with the concept of power, which is used in hypothesis testing.

Verma, J. P., & Verma, P. (2020). Understanding Concepts in Estimating Sample Size in Survey Studies. In: Determining Sample Size and Power in Research Studies. Springer, Singapore. https://doi.org/10.1007/978-981-15-5204-5_3


Understanding Precision-Based Sample Size Calculations

When designing an experiment it’s good practice to estimate the number of subjects or observations we’ll need. If we recruit or collect too few, our analysis may be too uncertain or misleading. If we collect too many, we potentially waste time and expense on diminishing returns. The optimal sample size provides enough information to allow us to analyze our research questions with confidence. The traditional approach to sample size estimation is based on hypothesis tests. In this article we present sample size estimation based on confidence intervals.

To motivate the discussion, we’ll generate fake data for an imaginary experiment. Let’s say we ask a female person and a male person to randomly ask strangers entering a grocery store to take a brief survey on their shopping habits. Of interest is whether or not the sex of the person administering the survey is associated with the proportion of people who agree to take the survey. We’ll assume females get people to take the survey 40% of the time while males get people to take the survey 30% of the time. This is our hypothesized effect size. We're not sure it's true, but if it is, we would like our experiment to detect it.

Below we use the rbinom() function to simulate data for this experiment. This generates two random sequences of zeroes and ones from a binomial distribution. A one indicates an event occurred. A zero indicates an event did not occur. The prob argument allows us to specify the probability an event occurs. We set n = 100 to generate 100 observations for each person. Finally we take the mean of each sequence to get the proportion of ones. Recall that taking the mean of 0/1 data is equivalent to calculating the proportion of ones.
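
A minimal sketch of the simulation just described (the original seed is not shown, so exact values will differ from those reported in the text):

```r
set.seed(1)
female <- rbinom(n = 100, size = 1, prob = 0.4)  # 1 = agreed to take the survey
male   <- rbinom(n = 100, size = 1, prob = 0.3)
mean(female)
mean(male)
mean(female) - mean(male)
```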

The difference in our observed proportions is 0.08.

Of course the “true” difference is 0.4 - 0.3 = 0.1.

Now let’s analyze the data using the prop.test() function. This allows us to perform a two-sample proportion test. The null hypothesis is that the proportions for the two groups are equivalent. Since we simulated the data we already know the proportions are different. What does prop.test() think? To use prop.test() we need to provide the number of “successes” (ie, number of ones) to the x argument and the number of trials to the n argument. We set correct = FALSE to suppress the continuity correction.
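
A call along these lines, reusing the simulated female and male vectors from above:

```r
# Two-sample proportion test; x = number of "successes", n = number of trials
prop.test(x = c(sum(female), sum(male)), n = c(100, 100), correct = FALSE)
```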

The result of the hypothesis test is a p-value of 0.2356. This says the probability of getting data this different, or more different, assuming there is no difference in the proportions is about 0.24. This probability is pretty high compared to traditional significance levels such as 0.01 or 0.05.

The prop.test() function also calculates a 95% confidence interval on the difference in proportions using a normal approximation, sometimes referred to as the “Wald” method. The reported confidence interval of [-0.05, 0.21] overlaps 0, which tells us we can’t be too certain that one group has a higher proportion than the other.

Obviously the p-value and confidence interval are wrong. We know the population proportion values are different. But our sample size of 100 is too small for this difference to consistently reveal itself. So how large a sample should we collect? A common approach is to determine the sample size that would reliably reject a null hypothesis test at a stated significance level, such as 0.05. The power.prop.test() function allows us to do this if we plan to compare the proportions using a two-sample proportion test. Below we plug in our hypothesized proportions, the desired significance level of 0.05, and a power of 0.9.
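
The computation might look like this:

```r
# Sample size per group for a two-sample proportion test
power.prop.test(p1 = 0.3, p2 = 0.4, sig.level = 0.05, power = 0.9)
# n is approximately 476, which is rounded up to 477 per group
```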

Assuming the stated difference in proportions is real, a sample size of 477 per group gives us a probability of 0.9 of rejecting the null at a 0.05 significance level. (Estimated sample sizes are always rounded up.)

But what if instead of simply rejecting the null hypothesis we wanted to estimate the difference in proportions with a desired precision? In our example above, the width of the 95% confidence interval on the difference in proportions was pretty wide, about 0.26. We can calculate that as follows:
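
One way to compute it, reusing the prop.test() result:

```r
# Width of the 95% Wald confidence interval on the difference in proportions
pt <- prop.test(x = c(sum(female), sum(male)), n = c(100, 100), correct = FALSE)
diff(pt$conf.int)
```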

What sample size do we need to achieve a tighter confidence interval width, such as 0.1? This is almost certainly more valuable than simply declaring two proportions different. This would allow us to draw conclusions about the direction and magnitude of the difference.

Let’s look at the formula for the confidence interval on a difference in proportions with equal sample sizes.

\[p_1 - p_2 \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n} + \frac{p_2(1-p_2)}{n}}\]

The margin of error is the second half of the formula, which we’ll call \(\epsilon\). The term \(z_{\alpha/2}\) is the normal quantile. For a 95% confidence interval this is about 1.96. The square root term is the standard error.

\[\epsilon = z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n} + \frac{p_2(1-p_2)}{n}}\]

Using some algebra we can solve for n and obtain the following:

\[n = \frac{z_{\alpha/2}^2[p_1(1-p_1) + p_2(1-p_2)]}{\epsilon^2}\]

Let’s say we wanted a confidence width of about 0.1. This implies a margin of error of 0.1/2 = 0.05. The necessary sample size for hypothesized proportions of 0.3 and 0.4 would be about 692.
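
In R, a sketch of this calculation:

```r
z  <- qnorm(0.975)  # about 1.96
e  <- 0.05          # margin of error for a CI width of 0.1
p1 <- 0.3
p2 <- 0.4
ceiling(z^2 * (p1 * (1 - p1) + p2 * (1 - p2)) / e^2)
# 692 per group
```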

Notice this is much bigger than the power-based sample size estimate. This is because we’re planning to estimate the difference in proportions with precision, not just reject the null hypothesis at an arbitrary significance level.

The presize package (Haynes et al. 2021) provides the prec_riskdiff() function to do this calculation for us. (The term “risk difference” is often used to describe a difference in proportions.) Simply provide the hypothesized proportions and the desired width of the confidence interval. We also specify method = "wald" to replicate our calculations above.
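
Assuming the presize package is installed, a call along these lines reproduces the calculation:

```r
# install.packages("presize")
library(presize)
prec_riskdiff(p1 = 0.3, p2 = 0.4, conf.width = 0.1, method = "wald")
```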

The n1 and n2 columns report the required sample size. The r column is the allocation ratio. The default is 1, which means equal group sizes.

It’s worth noting that \(p (1 - p)\) will never exceed 0.25 no matter the value for p. Therefore we could modify the formula to estimate a sample size for a given precision assuming no knowledge about the proportions:

\[n = \frac{z_{\alpha/2}^2[0.25 + 0.25]}{\epsilon^2} = \frac{z_{\alpha/2}^2}{2\epsilon^2} \]

This produces a larger sample size estimate of about 769:
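
Using the simplified formula:

```r
ceiling(qnorm(0.975)^2 / (2 * 0.05^2))
# 769
```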

We can get the same result using prec_riskdiff() by setting both p1 and p2 equal to 0.5:
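
That is, assuming maximal variance in both groups:

```r
prec_riskdiff(p1 = 0.5, p2 = 0.5, conf.width = 0.1, method = "wald")
```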

Let’s simulate data with 692 observations and see whether or not the width of the confidence interval is within 0.1.

Using the replicate() function we can do this, say, 2000 times and see how often the width of the confidence interval is less than or equal to 0.1.
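
A sketch of that simulation (the exact proportion will vary from run to run):

```r
set.seed(2)
ci_width <- replicate(2000, {
  f  <- rbinom(692, 1, 0.4)
  m  <- rbinom(692, 1, 0.3)
  pt <- prop.test(x = c(sum(f), sum(m)), n = c(692, 692), correct = FALSE)
  diff(pt$conf.int)
})
mean(ci_width <= 0.1)  # proportion of intervals no wider than 0.1
```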

It’s not a guarantee, but more often than not the width of the confidence interval is less than 0.1, and never much higher.

The presize package offers over a dozen functions for estimating precision-based sample sizes for various statistics. Each function is well documented and includes helpful examples.

For more information on the motivation behind the presize package and precision-based sample size estimates in general, see Bland (2009).

  • Bland, J. M. (2009). The tyranny of power: is there a better way to calculate sample size? BMJ, 339:b3985. https://www.bmj.com/content/339/bmj.b3985
  • Haynes et al. (2021). presize: An R-package for precision-based sample size calculation in clinical research. Journal of Open Source Software, 6(60), 3118. https://doi.org/10.21105/joss.03118
  • Hogg, R., & Tanis, E. (2006). Probability and Statistical Inference (7th ed.). Pearson. (Ch. 6)
  • R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Clay Ford, Statistical Research Consultant, University of Virginia Library. February 20, 2023.

Sample Size Determination: Definition, Formula, and Example

Are you ready to survey your research target? Research surveys help you gain insights from your target audience. The data you collect gives you insights to meet customer needs, leading to increased sales and customer loyalty. Sample size calculation and determination are imperative to the researcher to determine the right number of respondents, keeping in mind the research study’s quality.

So, how should you do the sample size determination? How do you know who should get your survey? How do you decide on the number of the target audience?

Sending out too many surveys can be expensive without giving you a definitive advantage over a smaller sample. But if you send out too few, you won’t have enough data to draw accurate conclusions. 

Knowing how to calculate and determine the appropriate sample size accurately can give you an edge over your competitors. Let’s take a look at what a good sample includes. Also, let’s look at the sample size calculation formula so you can determine the perfect sample size for your next survey.

What is Sample Size?

‘Sample size’ is a market research term used for defining the number of individuals included in conducting research. Researchers choose their sample based on demographics, such as age, gender, or physical location. It can be vague or specific.

For example, you may want to know what people within the 18-25 age range think of your product. Or, you may only require your sample to live in the United States, giving you a wide population range. The total number of individuals in a particular sample is the sample size.

What is sample size determination?

Sample size determination is the process of choosing the right number of observations or people from a larger group to use in a sample. The goal of figuring out the sample size is to ensure that the sample is big enough to give statistically valid results and accurate estimates of population parameters but small enough to be manageable and cost-effective.

In many research studies, getting information from every member of the population of interest is not possible or useful. Instead, researchers choose a sample of people or events that is representative of the whole to study. How accurate and precise the results are can depend a lot on the size of the sample.

Choosing a statistically sound sample size depends on a number of things, such as the size of the population, how precise you want your estimates to be, how confident you want to be in the results, how variable the population is likely to be, and how much money and time you have for the study. Statistical formulas are often used to figure out how big a sample should be for a certain type of study and research question.

Figuring out the sample size is important in ensuring that research findings and conclusions are valid and reliable.

Why do you need to determine the sample size?

Let’s say you are a market researcher in the US and want to send out a survey or questionnaire. The survey aims to understand your audience’s feelings toward a new cell phone you are about to launch. You want to know what people in the US think about the new product to predict the phone’s success or failure before launch.

Hypothetically, you choose the population of New York, which is 8.49 million. You use a sample size determination formula to select a sample of 500 individuals that fit into the consumer panel requirement. You can use the responses to help you determine how your audience will react to the new product.

However, determining a sample size requires more than just throwing your survey at as many people as possible. If your estimated sample sizes are too big, it could waste resources, time, and money. A sample size that’s too small doesn’t allow you to gain maximum insights, leading to inconclusive results.

What are the terms used around the sample size?

Before we jump into sample size determination, let’s take a look at the terms you should know:

1. Population size: 

Population size is how many people fit your demographic. For example, you want to get information on doctors residing in North America. Your population size is the total number of doctors in North America. 

Don’t worry! Your population size doesn’t always have to be that big. Smaller population sizes can still give you accurate results as long as you know who you’re trying to represent.

2. Confidence level: 

The confidence level tells you how sure you can be that your data reflect the population. It is expressed as a percentage and is aligned to the confidence interval. For example, a 90% confidence level means that if you repeated the survey many times, you would expect the true population value to fall within your margin of error about 90% of the time.

3. The margin of error (confidence interval): 

There’s no way to be 100% accurate when it comes to surveys. The confidence interval tells you how far from the true population value you’re willing to allow your survey results to fall.

A margin of error describes how close you can reasonably expect a survey result to fall relative to the real population value. Remember, if you need help with this information, use a margin of error calculator.

4. Standard deviation: 

Standard deviation is the measure of the dispersion of a data set from its mean. It measures the absolute variability of a distribution: the higher the dispersion or variability, the greater the standard deviation.

For example, you have already sent out your survey. How much variance do you expect in your responses? That variation in response is the standard deviation.

Sample size calculation formula – sample size determination

With all the necessary terms defined, it’s time to learn how to determine sample size using a sample calculation formula.

Your confidence level corresponds to a Z-score. This is a constant value needed for this equation. Here are the z-scores for the most common confidence levels:

90% – Z Score = 1.645

95% – Z Score = 1.96

99% – Z Score = 2.576

If you choose a different confidence level, various online tools can help you find your score.

Necessary Sample Size = (Z-score)² × StdDev × (1 − StdDev) / (margin of error)²

(Here “StdDev” is really the expected sample proportion; 0.5 is the most conservative choice when it is unknown.) Here is an example of how the math works, assuming you chose a 90% confidence level, an expected proportion of 0.6, and a margin of error (confidence interval) of +/- 4%.

((1.645)² × 0.6 × (1 − 0.6)) / (0.04)²

(2.706 × 0.24) / 0.0016

0.6494 / 0.0016

Approximately 406 respondents are needed, and that becomes your sample size.
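
The same calculation as a small R function (the function name is illustrative):

```r
# z = z-score for the confidence level, p = expected proportion,
# moe = margin of error
sample_size <- function(z, p, moe) ceiling(z^2 * p * (1 - p) / moe^2)
sample_size(z = 1.645, p = 0.6, moe = 0.04)
# 406
```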

How is a sample size determined?

Determining the right sample size for your survey is one of the most common questions researchers ask when they begin a market research study. Luckily, sample size determination isn’t as hard to calculate as you might remember from an old high school statistics class.

Before calculating your sample size, ensure you have these things in place:

Goals and objectives: 

What do you hope to do with the survey? Are you planning on projecting the results onto a whole demographic or population? Do you want to see what a specific group thinks? Are you trying to make a big decision or just setting a direction? 

Calculating sample size is critical if you’re projecting your survey results on a larger population. You’ll want to make sure that it’s balanced and reflects the community as a whole. The sample size isn’t as critical if you’re trying to get a feel for preferences. 

For example, you’re surveying homeowners across the US on the cost of cooling their homes in the summer. A homeowner in the South probably spends much more money cooling their home in the humid heat than someone in Denver, where the climate is dry and cool. 

For the most accurate results, you’ll need to get responses from people in all US areas and environments. If you only collect responses from one extreme, such as the warm South, your results will be skewed.

Precision level: 

How close do you want the survey results to mimic the true value if everyone responded? Again, if this survey determines how you’re going to spend millions of dollars, then your sample size determination should be exact. 

The more accurate you need to be, the larger the sample you want to have, and the more your sample will have to represent the overall population. If your population is small, say, 200 people, you may want to survey the entire population rather than cut it down with a sample.

Confidence level: 

Think of confidence from the perspective of risk. How much risk are you willing to take on? This is where your Confidence Interval numbers become important. How confident do you want to be — 98% confident, 95% confident? 

Understand that the confidence percentage you choose greatly impacts the number of completions you’ll need for accuracy. This can increase the survey’s length and how many responses you need, which means increased costs for your survey. 

Knowing the actual numbers and amounts behind percentages can help make more sense of your correct sample size needs vs. survey costs. 

For example, you want to be 99% confident. After using the sample size determination formula, you find you need to collect an additional 1000 respondents. 

This, in turn, means you’ll be paying for samples or keeping your survey running for an extra week or two. You have to determine if the increased accuracy is more important than the cost.

Population variability: 

What variability exists in your population? In other words, how similar or different is the population?

If you are surveying consumers on a broad topic, you may have lots of variations. You’ll need a larger sample size to get the most accurate picture of the population. 

However, if you’re surveying a population with similar characteristics, your variability will be less, and you can sample fewer people. More variability equals more samples, and less variability equals fewer samples. If you’re not sure, you can start with 50% variability.

Response rate: 

You want everyone to respond to your survey. Unfortunately, every survey comes with targeted respondents who either never open the study or drop out halfway. Your response rate will depend on your population’s engagement with your product, service organization, or brand. 

The higher the response rate, the higher your population’s engagement level. Your base sample size is the number of responses you must get for a successful survey.

Consider your audience: 

Besides the variability within your population, you need to ensure your sample doesn’t include people who won’t benefit from the results. One of the biggest mistakes you can make in sample size determination is forgetting to consider your actual audience. 

For example, you don’t want to send a survey asking about the quality of local apartment amenities to a group of homeowners.

Select your respondents

Focus on your survey’s objectives: 

You may start with general demographics and characteristics, but can you narrow those characteristics down even more? Narrowing down your audience makes getting a more accurate result from a small sample size easier. 

For example, you want to know how people will react to new automobile technology. Your current population includes anyone who owns a car in a particular market. 

However, you know your target audience is people who drive cars that are less than five years old. You can remove anyone with an older vehicle from your sample because they’re unlikely to purchase your product.

Once you know what you hope to gain from your survey and what variables exist within your population, you can decide how to calculate sample size. Using the formula for determining sample size is a great starting point to get accurate results. 

After calculating the sample size, you’ll want to find reliable customer survey software to help you accurately collect survey responses and turn them into analyzed reports.

In sample size determination, the statistical analysis plan needs careful consideration of the level of significance, the expected effect size, and the desired power.

Researchers must reconcile statistical significance with practical and ethical factors such as feasibility and cost. A well-designed study with a sufficient sample size can improve the odds of obtaining statistically significant results.

To meet the goal of your survey, you may have to try a few methods to increase the response rate, such as:

  • Increase the list of people who receive the survey.
  • To reach a wider audience, use multiple distribution channels, such as SMS, website, and email surveys.
  • Send reminders to survey participants to complete the survey.
  • Offer incentives for completing the survey, such as an entry into a prize drawing or a discount on the respondent’s next order.
  • Consider your survey structure and find ways to simplify your questions. The less work someone has to do to complete the survey, the more likely they will finish it. 
  • Longer surveys tend to have lower response rates due to the length of time it takes to complete the survey. In this case, you can reduce the number of questions in your survey to increase responses.  

Frequently Asked Questions (FAQ)

The four ways to determine sample size are: 1. Power analysis, 2. Convenience sampling, 3. Random sampling, 4. Stratified sampling.

What factors determine sample size?
The three factors that determine sample size are: 1. Effect size, 2. Level of significance, 3. Power.

What is the best way to calculate sample size?
The best way to calculate sample size is to use statistical techniques such as power analysis, the minimal detectable effect size, or the sample size formula, while taking into account the study’s goals and practical limitations.
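
To make that concrete, here is a minimal Python sketch of a power-analysis calculation using the standard normal-approximation formula for comparing two group means; the effect size, significance level, and power below are assumed example values, not recommendations:

from math import ceil
from scipy.stats import norm

def sample_size_two_groups(effect_size, alpha=0.05, power=0.80):
    # Per-group sample size for a two-sided, two-sample comparison
    # of means, where effect_size is Cohen's d.
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

print(sample_size_two_groups(0.5))  # ~63 per group (an exact t-test gives 64)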

Why is sample size important?
The sample size is important because it affects how precise and accurate the results of a study are and how well researchers can detect real effects or relationships between variables.

What is a sample size?
The sample size is the number of observations or study participants chosen to be representative of a larger group.



How to Determine Sample Size for a Research Study

Frankline Kibuacha | Apr. 06, 2021 | 3 min. read

This article will discuss considerations to put in place when determining your sample size and how to calculate the sample size.

Confidence Interval and Confidence Level

As we have noted before, when selecting a sample there are multiple factors that can impact the reliability and validity of results, including sampling and non-sampling errors. When thinking about sample size, the two measures of error that go hand in hand with sample size are the confidence interval and the confidence level.

Confidence Interval (Margin of Error)

A confidence interval measures the degree of uncertainty in a sampling method and in any particular statistic. In simple terms, the confidence interval tells you how confident you can be that the results from a study reflect what you would expect to find if it were possible to survey the entire population being studied. The confidence interval is usually expressed as a plus-or-minus (±) figure. For example, if your confidence interval is ±6 and 60 percent of your sample picks an answer, you can be confident that if you had asked the entire population, between 54% (60 − 6) and 66% (60 + 6) would have picked that answer.
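
The sketch below (Python; the 60% proportion and the sample sizes are illustrative values) computes the margin of error for a sample proportion and shows that quadrupling the sample roughly halves the margin:

from math import sqrt

def margin_of_error(p, n, z=1.96):
    # Margin of error for a sample proportion at ~95% confidence.
    return z * sqrt(p * (1 - p) / n)

p = 0.60  # share of the sample that picked the answer
for n in (100, 400, 1600):
    moe = margin_of_error(p, n)
    print(f"n={n}: {p:.0%} ± {moe:.1%}")
# n=100: ±9.6%, n=400: ±4.8%, n=1600: ±2.4%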

Confidence Level

The confidence level is the probability, expressed as a percentage, that the confidence interval would contain the true population parameter if you drew random samples many times. For example, a 99% confidence level means that if you repeated the experiment or survey over and over again, your results would fall within the confidence interval 99 percent of the time.

The larger your sample size, the more confident you can be that their answers truly reflect the population. In other words, the larger your sample for a given confidence level, the smaller your confidence interval.

Standard Deviation

Another critical measure when determining the sample size is the standard deviation, which measures how spread out a data set is around its mean. In calculating the sample size, the standard deviation is useful for estimating how much the responses you receive will vary from each other and from the mean, and the standard deviation of a sample can be used to approximate the standard deviation of the population.

The greater the variability in the data, the larger the standard deviation. For example, once you have sent out your survey, how much variance do you expect in the responses? That variation is what the standard deviation captures.
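
This is also why a proportion of 0.5 is the safe default suggested in the steps below: the variance term p(1 − p) is largest at p = 0.5, so assuming 0.5 can never understate the required sample. A quick check in Python:

# p(1 - p) peaks at p = 0.5, so 0.5 is the worst-case (safest) assumption.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p}: p(1 - p) = {p * (1 - p):.2f}")
# prints 0.09, 0.21, 0.25, 0.21, 0.09 — the maximum, 0.25, is at p = 0.5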

Population Size

As demonstrated through the calculation below, a sample of about 385 is sufficient to draw conclusions about nearly any population at the 95% confidence level with a 5% margin of error, which is why samples of 400 and 500 are often used in research. However, if you are looking to draw comparisons between different sub-groups, for example, provinces within a country, a larger sample size is required. GeoPoll typically recommends a sample size of 400 per country as the minimum viable sample for a research project, 800 per country for conducting a study with analysis by a second-level breakdown such as females versus males, and 1200+ per country for third-level breakdowns such as males aged 18-24 in Nairobi.
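
Population size matters so little for large populations because it enters only through the finite population correction, which barely changes the result once the population greatly exceeds the sample. A minimal Python sketch using the standard correction (the population sizes are arbitrary examples):

from math import ceil

def adjusted_sample(n0, population):
    # Finite population correction: n = n0 / (1 + (n0 - 1) / N)
    return ceil(n0 / (1 + (n0 - 1) / population))

n0 = 385  # base sample for 95% confidence, ±5% margin of error
for N in (1_000, 10_000, 100_000, 1_000_000):
    print(f"population {N:>9,}: sample needed = {adjusted_sample(n0, N)}")
# 1,000 -> 279; 10,000 -> 371; 100,000 -> 384; 1,000,000 -> 385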

How to Calculate Sample Size

Now that we have defined all the necessary terms, let us briefly look at how to determine the sample size using a standard sample size formula, commonly known as Cochran’s formula (sometimes called Fisher’s formula).

  • Determine the population size (if known).
  • Determine the confidence interval.
  • Determine the confidence level.
  • Determine the standard deviation (a standard deviation of 0.5 is a safe choice where the figure is unknown).
  • Convert the confidence level into a z-score. The z-scores for the most common confidence levels are: 90% → 1.645; 95% → 1.96; 99% → 2.576 (often rounded to 2.58).
  • Put these figures into the sample size formula to get your sample size.

The sample size formula: n = (z² × p(1 − p)) / e², where z is the z-score for your chosen confidence level, p is the estimated proportion (the 0.5 figure above), and e is the margin of error expressed as a decimal.

Here is an example calculation:

Say you choose to work with a 95% confidence level, a standard deviation of 0.5, and a confidence interval (margin of error) of ±5%. You just need to substitute the values into the formula:

((1.96)² × 0.5(1 − 0.5)) / (0.05)²

= (3.8416 × 0.25) / 0.0025

= 0.9604 / 0.0025

= 384.16

Rounded up to the next whole respondent, your sample size should be 385.
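
The same arithmetic takes only a few lines of Python (values taken from the example above; the result is rounded up, since you cannot survey a fraction of a person):

from math import ceil

def cochran_sample_size(z, p, e):
    # n = (z^2 * p(1 - p)) / e^2, rounded up to the next whole respondent
    return ceil(z ** 2 * p * (1 - p) / e ** 2)

print(cochran_sample_size(z=1.96, p=0.5, e=0.05))  # 385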

Fortunately, there are several online tools to help you with this calculation. Here’s an online sample size calculator from Easy Calculation. Just enter the confidence level, population size, and confidence interval, and the required sample size is calculated for you.

GeoPoll’s Sampling Techniques

With the largest mobile panel in Africa, Asia, and Latin America, and reliable mobile technologies, GeoPoll develops unique samples that accurately represent any population. See our country coverage here, or contact our team to discuss your upcoming project.



Key facts about U.S. Latinos for National Hispanic Heritage Month

National Hispanic Heritage Month, which begins in the United States each year on Sept. 15, celebrates U.S. Latinos , their culture and their history. Started in 1968 by Congress as Hispanic Heritage Week, it was expanded to a month in 1988. The celebration begins in the middle of September to coincide with independence days in several Latin American countries: Guatemala, Honduras, El Salvador, Nicaragua and Costa Rica celebrate theirs on Sept. 15, followed by Mexico on Sept. 16, Chile on Sept. 18 and Belize on Sept. 21.

Here are some key facts about the U.S. Latino population by geography and by characteristics such as language use and origin group.

As part of our ongoing research about Hispanics in the United States, we analyzed how this group has changed over time using data from the U.S. Census Bureau. The decennial census ( PL94-171 census data ) provided some historical state and national population counts, and population estimates provided the latest data on total population, births and immigration.

We also examined characteristics of the U.S. Hispanic population using the American Community Survey (ACS), which provides data for states and the U.S. on Hispanic origin, language use, country of birth and educational attainment. Data from the 2022 ACS and some from the 2010 ACS are from tabulations released by U.S. Census Bureau . Some ACS and census data is from Integrated Public Use Microdata Series (IPUMS) of the University of Minnesota.

The U.S. Hispanic population reached 63.6 million in 2022, up from 50.5 million in 2010. The 26% increase in the Hispanic population was faster than the nation’s 8% growth rate but slower than the 34% increase in the Asian population. In 2022, Hispanics made up nearly one-in-five people in the U.S. (19%), up from 16% in 2010 and just 5% in 1970.

A line chart showing that the U.S. Hispanic population reached more than 63 million in 2022.

Hispanics have played a major role in U.S. population growth over the past decade. The U.S. population grew by 24.5 million from 2010 to 2022, and Hispanics accounted for 53% of this increase – a greater share than any other racial or ethnic group. The next closest group is non-Hispanic people who identify with two or more races. Their population grew by 8.4 million during this time, accounting for 34% of the overall increase.

A bar chart showing that Hispanics made up more than half of total U.S. population growth from 2010 to 2022.

The number of Latinos who say they are multiracial has increased dramatically. More than 27 million Latinos identified with more than one race in 2022, up from 3 million in 2010. The increase could be due to several factors, including changes to the census form that make it easier for people to select multiple races and growing racial diversity.

A bar chart showing that the U.S. Hispanic multiracial population has increased sharply since 2010.

Growth in the number of multiracial Latinos comes primarily from those who identify as at least one specific race and “some other race” (i.e., those who write in a response). This population grew from 2.1 million to 24.9 million between 2010 and 2022 and now represents about 91% of multiracial Latinos. The increase was due almost entirely to growth in the number of people who identified as White and some other race, according to the 2020 census.

At the same time, the number of Latinos who identified as White and no other race declined from 26.7 million in 2010 to 10.7 million in 2022.

The roughly 37.4 million people of Mexican origin in the U.S. represented nearly 60% of the nation’s Hispanic population in 2022. Those of Puerto Rican origin are the next largest group, at 5.9 million, which does not include another roughly 3.2 million Puerto Ricans who lived on the island in 2022. The U.S. population of Puerto Rican origin has grown partly due to people moving from Puerto Rico to the 50 states and the District of Columbia.

A line chart showing that Puerto Rico’s population has declined in recent decades.

Six other Hispanic origin groups in the U.S. each have 1 million or more people: Salvadorans, Cubans, Dominicans, Guatemalans, Colombians and Hondurans. In addition, in 2022, Spaniards accounted for nearly 1 million U.S. Latinos.

Puerto Rico’s population has declined by about 500,000 since 2010, from 3.7 million to 3.2 million. Puerto Rico has experienced a net population loss since at least 2005 , driven by low fertility rates and migration to the U.S. mainland. An ongoing economic recession and devastation from hurricanes Maria and Irma in 2017 have also contributed to the decline.

Venezuelans have seen the fastest population growth among U.S. Latinos. From 2010 to 2022, the Venezuelan-origin population in the U.S. increased by 236% to 815,000. Four other groups saw growth rates exceeding 50%: Hondurans increased by 67%, followed by Guatemalans (62%), Dominicans (59%) and Colombians (51%).

By contrast, the number of people of Mexican origin in the U.S. grew by only 14%, by far the slowest rate among the most populous origin groups.

A table showing Hispanic origin groups in the U.S., 2022.

Hispanics are the largest racial or ethnic group in California and Texas. This demographic milestone in California happened in 2014 and was a first for the state with the nation’s largest Hispanic population . Latinos accounted for 40% of California’s population in 2022, among the greatest shares in the country.

Line charts showing that Hispanics became the largest racial or ethnic group in California in 2014, and in Texas in 2021.

In 2022, there were about 15.7 million Hispanics in California, up from 14.0 million in 2010. The non-Hispanic White population, the next largest group, declined from 15.0 million to 13.2 million during this time, reflecting a broader national trend .

In Texas, the state with the next largest Latino population (12.1 million), Latinos also made up 40% of the population in 2022 and became the largest racial or ethnic group in 2021. In Florida, the state with the third-largest Latino population (6.0 million), Latinos made up 27% of residents.

A map of the U.S. showing that California and Texas had the nation’s largest Hispanic populations in 2022.

Rounding out the top five states with the largest Hispanic populations were New York (3.9 million) and Arizona (2.4 million). Eight more states had 1 million or more Hispanics: Illinois, New Jersey, Colorado, Georgia, Pennsylvania, North Carolina, Washington and New Mexico.

Vermont had the nation’s smallest Latino population (15,000) in 2022, followed by Maine (29,000), West Virginia and North Dakota (34,000 each), and South Dakota (42,000).

In New Mexico, Hispanics have been a majority of the population since 2021 and the state’s largest racial or ethnic group since the early 2000s. In 2022, the state was home to 1.1 million Hispanics.

Three states’ Hispanic populations increased by more than 1 million from 2010 to 2022. Texas (2.5 million increase), Florida (1.8 million) and California (1.6 million) accounted for almost half of the growth nationwide since 2010. Arizona (480,000 increase), New Jersey (464,000) and New York (432,000) had the next-biggest increases. All 50 states and the District of Columbia have seen growth in their Hispanic populations since 2010.

A map showing that Texas, California and Florida have seen the biggest Hispanic population growth since 2010.

North and South Dakota’s Hispanic populations have grown the fastest since 2010. The number of Hispanics in North and South Dakota more than doubled (146% and 107% increases, respectively) from 2010 to 2022. But even with that growth, these states each had fewer than 45,000 Hispanics in 2022, among the smallest populations in the country.

The slowest growth was in New Mexico (10% increase), California (12%), and Illinois and New York (13% each), all states with significant Hispanic populations.

A map of the U.S. showing that North Dakota and South Dakota have seen the fastest Hispanic population growth since 2010.

The makeup of the U.S. Hispanic population varies widely across major metropolitan areas.  Most of the metro areas in the Midwest, West and South with the largest Hispanic populations are predominantly Mexican. About three-quarters of Hispanics in the Chicago (77%) and Los Angeles (75%) areas identify as Mexican, as do 67% in the Houston area.

Metro areas in the Northeast tend to have more diverse Hispanic origins. For example, no origin group makes up more than 30% of the New York and Boston metro areas’ Hispanic populations.

Metro areas in Florida and the nation’s capital have distinctive Hispanic enclaves. Puerto Ricans make up 43% of Hispanics in the Orlando area, while Cubans make up 39% of Hispanics in the Miami area. In the Washington, D.C., metro area, Salvadorans account for 30% of Hispanics.

A bar chart showing that the U.S. Latino populations are more diverse in Northeastern metro areas than in others.

Catholics remain the largest religious group among Latinos in the U.S., but they have become a smaller share of the Latino population over the past decade. In 2022, 43% of Latino adults identified as Catholic, down from 67% in 2010. Meanwhile, 30% of Latinos are religiously unaffiliated (describing themselves as atheist, agnostic or “nothing in particular”), up from 10% in 2010. The share of Latinos who identify as Protestants – including evangelical Protestants – has been relatively stable.

An area chart showing the steady decline in share of U.S. Latinos who identify as Catholic.

Newborns, not immigrants, have driven the recent growth among U.S. Hispanics. During the 2010s, an average of 1 million Hispanic babies were born each year, slightly more than during the 2000s. At the same time, about 350,000 Hispanic immigrants arrived annually, down substantially from the previous two decades.

A bar chart showing that newborns have driven U.S. Hispanic population growth in recent decades, but immigration has slowed.

The recent predominance of new births over immigration as a source of Hispanic population growth is a reversal of historical trends. In the 1980s and 1990s, immigration drove Hispanic population growth.

From 2020 to 2022, average annual births among Hispanics were slightly below the previous decade, but immigration decreased considerably, from 350,000 per year to 270,000. Some of this decline can be attributed to immigration into the U.S. stopping almost entirely during the early stages of the COVID-19 pandemic. With the removal of pandemic-related restrictions , the contribution of immigration to Hispanic growth appears to be returning to early 2010s levels.

The share of Latinos in the U.S. who speak English proficiently is growing. In 2022, 72% of Latinos ages 5 and older spoke English proficiently, up from 59% in 2000. U.S.-born Latinos are driving this growth: The share of U.S.-born Latinos who speak English proficiently increased by 9 percentage points in that span, compared with a 5-point increase among Latino immigrants. All told, 42.3 million Latinos in the U.S. spoke English proficiently in 2022.

Line charts showing that, for Latinos, English proficiency has increased and Spanish use at home has decreased, especially among those born in the U.S.

At the same time, the share of Latinos who speak Spanish at home declined from 78% in 2000 to 68% in 2022, and most of that decline was among the U.S. born.

Even though the share of Latinos who speak Spanish at home has declined, the number who do so has grown from 24.6 million in 2000 to 39.7 million in 2022 because of the overall growth in the Latino population.

The share of U.S. Hispanics with college experience has increased since 2010. About 45% of U.S. Hispanic adults ages 25 and older had at least some college experience in 2022, up from 36% in 2010. The share of Hispanics with a bachelor’s degree or more education also increased, from 13% to 20%. The share with a bachelor’s degree or higher increased more among Hispanic women (from 14% to 22%) than Hispanic men (12% to 18%).

The number of Latinos enrolled in college or postgraduate education also increased between 2010 and 2022, from 2.9 million to 4.2 million. Among all U.S. undergraduate and graduate students, the share of Latinos increased from 14% in 2010 to 20% in 2022, slightly higher than the Latino share of the total population.

Four-in-five Latinos are U.S. citizens. As of 2022, 81% of Latinos living in the country are U.S. citizens, up from 74% in 2010. This includes people born in the U.S. and its territories (including Puerto Rico), people born abroad to American parents, and immigrants who have become naturalized citizens. The Center recently published citizenship rates among Hispanic origin groups for 2021; this data is not yet available for 2022.

Note: This post has been regularly updated since it was originally published on Sept. 16, 2014.
