
When Unequal Sample Sizes Are and Are NOT a Problem in ANOVA

by Karen Grace-Martin


In your statistics class, your professor made a big deal about unequal sample sizes in one-way Analysis of Variance (ANOVA) for two reasons.

1. Because she was making you calculate everything by hand.  Sums of squares require a different formula* if sample sizes are unequal, but statistical software will automatically use the right formula. So we’re not too concerned. We’re definitely using software.

2. Nice properties in ANOVA such as the Grand Mean being the intercept in an effect-coded regression model don’t hold when data are unbalanced.  Instead of the grand mean, you need to use a weighted mean.  That’s not a big deal if you’re aware of it.

But there are a few real issues with unequal sample sizes in ANOVA. They don’t invalidate an analysis, but it’s important to be aware of them as you’re interpreting your output.

Two Practical Issues for Unequal Sample Sizes in One-Way ANOVA

1. Assumption Robustness with Unequal Samples

The main practical issue in one-way ANOVA is that unequal sample sizes affect the robustness of the equal variance assumption.

ANOVA is considered robust to moderate departures from this assumption. But that’s not true when the sample sizes are very different.  According to Keppel (1993), there is no good rule of thumb for how unequal the sample sizes need to be for heterogeneity of variance to be a problem.

So if you have equal variances in your groups and unequal sample sizes, no problem. If you have unequal variances and equal sample sizes, no problem.

The only problem is if you have unequal variances and unequal sample sizes.
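You can see this combination bite with a quick simulation. The sketch below (plain Python, made-up normal data) estimates the actual Type I error rate of a pooled-variance two-group comparison: once with equal sizes and equal variances, and once where the smaller group has the larger variance. The nominal rate is 5% in both cases.

```python
import random
from statistics import mean

random.seed(1)

def pooled_t(x, y):
    """Pooled-variance two-sample t statistic (what ANOVA's F assumes)."""
    nx, ny = len(x), len(y)
    mx, my = mean(x), mean(y)
    sp2 = (sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)) \
          / (nx + ny - 2)
    return (mx - my) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

def rejection_rate(n1, sd1, n2, sd2, t_crit, reps=5000):
    """Share of null-true simulations where |t| exceeds the critical value."""
    hits = 0
    for _ in range(reps):
        x = [random.gauss(0, sd1) for _ in range(n1)]
        y = [random.gauss(0, sd2) for _ in range(n2)]
        if abs(pooled_t(x, y)) > t_crit:
            hits += 1
    return hits / reps

# Both scenarios have df = 58; 2.002 is the two-sided 5% critical t value.
ok = rejection_rate(30, 1, 30, 1, 2.002)   # equal n, equal variances
bad = rejection_rate(10, 3, 50, 1, 2.002)  # small group has the big variance
print(ok, bad)  # ok stays near 0.05; bad is far above the nominal 5%
```

When the smaller group has the larger variance, the pooled variance understates the true standard error, so the test rejects far too often; if instead the larger group had the larger variance, the test would run too conservative.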

2. Power with Unequal Samples

The statistical power of a hypothesis test that compares groups is highest when groups have equal sample sizes.

Power is dominated by the smallest sample size: the effective sample size is the harmonic mean of the group sizes, which can never exceed twice the smaller group. So while it doesn’t hurt power to have more observations in the larger group, the extra observations quickly stop helping.

So if you have a specific number of individuals to randomly assign to groups, you’ll have the most power if you assign them equally.

If your grouping is a natural one, you’re not making decisions based on a total number of individuals. It’s very common to just happen to get a larger sample of one group compared to the others.

That doesn’t bias your test or give you incorrect results. It just means the power you have is based on the smaller sample.

So if you have 30 individuals with Treatment A, 40 individuals with Treatment B, and 300 controls, that’s fine. It’s just that once the control group is several times larger than a treatment group, each additional control barely moves the power of this particular test.
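To put rough numbers on the allocation point, here’s a sketch using a normal approximation to a two-group comparison of means (stdlib only; the standardized effect size of 0.5 is just an assumption for illustration):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
z_crit = Z.inv_cdf(0.975)  # two-sided 5% test

def approx_power(n1, n2, d=0.5):
    """Approximate power to detect a standardized mean difference d."""
    se = sqrt(1 / n1 + 1 / n2)   # std. error of the difference (sigma = 1)
    shift = d / se
    return (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)

# Same total of 60 subjects: the balanced split has the most power
print(approx_power(30, 30), approx_power(10, 50))

# Growing only one group: power barely moves once it dwarfs the other
print(approx_power(30, 300), approx_power(30, 30000))
```

With the total fixed at 60, the 30/30 split clearly beats a 10/50 split; and going from 300 controls to 30,000 barely changes anything, because the smaller group sets the ceiling.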

Yes, this all holds true for independent samples t-tests

Independent samples t-tests are essentially a simplification of a one-way ANOVA for only two groups. In fact, if you run your t-test as an ANOVA, you’ll get the same p-value. And the between-groups F statistic will be the square of the t statistic you got in your t-test.

(Really, try it…. pretty cool, huh?)

This means they work the same way. Unbalanced t-tests have the same practical issues with assumption robustness and power, but unequal sample sizes don’t otherwise affect the validity of the test or bias its results.
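If you’d like to verify the F = t² identity without statistical software, here’s a minimal sketch in plain Python with made-up data for two unequal groups:

```python
from statistics import mean

# Made-up scores for two groups of unequal size
a = [4.1, 5.3, 6.2, 5.8, 4.9]
b = [6.5, 7.1, 5.9, 7.8, 6.4, 7.0]

def t_stat(x, y):
    """Pooled-variance independent-samples t statistic."""
    nx, ny = len(x), len(y)
    mx, my = mean(x), mean(y)
    sp2 = (sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)) \
          / (nx + ny - 2)
    return (mx - my) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

def f_stat(x, y):
    """One-way ANOVA F statistic for the same two groups."""
    nx, ny = len(x), len(y)
    mx, my, grand = mean(x), mean(y), mean(x + y)
    ss_between = nx * (mx - grand) ** 2 + ny * (my - grand) ** 2  # df = 1
    ss_within = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    return ss_between / (ss_within / (nx + ny - 2))

t = t_stat(a, b)
F = f_stat(a, b)
print(F, t ** 2)  # identical: F is the square of t
```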

Problems in Factorial ANOVA

Factorial ANOVA includes all those ANOVA models with more than one crossed factor. It generally involves one or more interaction terms.

Real issues with unequal sample sizes do occur in factorial ANOVA in one situation: when the sample sizes are confounded in the two (or more) factors. Let’s unpack this.

For example, in a two-way ANOVA, let’s say that your two independent variables (factors) are Age (young vs. old) and Marital Status (married vs. not).

Let’s say there are twice as many young people as old. So unequal sample sizes.

And say the younger group has a much larger percentage of singles than the older group.  In other words, the two factors are not independent of each other.  The effect of marital status cannot be distinguished from the effect of age.

So you may get a big mean difference between the marital statuses, but it’s really being driven by age.
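A small numeric sketch (hypothetical cell counts and means) shows how this confounding works. The outcome below depends only on age, yet the marital-status means differ because most singles are young:

```python
# Cell sizes and cell means (hypothetical): the outcome depends ONLY on
# age, but most singles are young and most marrieds are old.
#   (age, marital status): (n, cell mean)
cells = {("young", "single"):  (40, 10.0),
         ("young", "married"): (20, 10.0),
         ("old",   "single"):  ( 5, 20.0),
         ("old",   "married"): (25, 20.0)}

def marginal_mean(status):
    """Sample-size-weighted mean outcome for one marital status."""
    picked = [(n, m) for (age, s), (n, m) in cells.items() if s == status]
    return sum(n * m for n, m in picked) / sum(n for n, _ in picked)

print(marginal_mean("single"), marginal_mean("married"))
# The marginal means differ, yet within each age group the single and
# married means are identical: the apparent "marital effect" is all age.
```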

What about Chi-Square Tests?

(This article is about ANOVA and t-tests, but I’ve updated it to include chi-square tests after getting a lot of questions.)

There are a number of different chi-square tests, but the two that can seem concerning in this context are the Chi-Square Test of Independence and the Chi-Square Test of Homogeneity. Both have two categorical variables. Both count the frequencies of the combinations of these categories.

They calculate the test statistic the same way. Without getting into the math, it’s basically a comparison of the actual frequencies of the combinations with the frequencies you’d expect under the null hypothesis.

And luckily, unequal sample sizes do not affect the ability to calculate that chi-square test statistic. It’s pretty rare to have equal sample sizes, in fact. The expected values take the sample sizes into account. So no problems at all here.
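For instance, here’s the chi-square statistic computed by hand for a hypothetical 2×2 table where one group is six times the size of the other; the expected counts absorb the unequal totals automatically:

```python
# Hypothetical 2x2 table: group 1 has n = 100, group 2 has n = 600
table = [[30, 70],
         [240, 360]]

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
grand = sum(row_tot)

# Expected count under the null: (row total * column total) / grand total
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / grand
        chi2 += (table[i][j] - expected) ** 2 / expected

print(round(chi2, 3))  # the unequal row totals pose no problem at all
```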

That said, when there is a third variable involved, you can have an issue with Simpson’s Paradox. You may or may not have collected that third variable, so it’s worth asking whether some other variable is creating an association in the combined data that doesn’t exist within each of its groups alone.

But that’s not really an issue with unequal sample sizes. That’s an issue of omitting an important variable from an analysis.

Updated Dec 18, 2020 to add more detail


Reader Interactions


September 29, 2022 at 10:45 am

thank you for the information


August 6, 2021 at 5:45 am

Hi there, I have two groups: one has a sample size of 41 and the other has a sample size of 59. I have one IV for which I want to see the effects on three different DVs, each DV having four levels. My distribution is non-normal, and an assessment of normality even after a log transformation doesn’t make my distribution normal. So what kind of analysis should I use, and how do I deal with unequal sample sizes? Kindly help me out here. Thank you, Nidhi


August 5, 2021 at 1:40 pm

Hi, I was wondering if you have references that I can cite in my paper. This has been extremely useful. Thank you!


June 11, 2021 at 11:06 am

This is an excellent article. Thank you.

What do you suggest doing for the marriage x age example you mentioned, when sample sizes are confounded in two factors?

Your example “For example, in a two-way ANOVA, let’s say that your two independent variables (factors) are Age (young vs. old) and Marital Status (married vs. not).

Let’s say there are twice as many young people as old. So unequal sample sizes.

And say the younger group has a much larger percentage of singles than the older group. In other words, the two factors are not independent of each other. The effect of marital status cannot be distinguished from the effect of age.”


June 22, 2021 at 9:51 am

Great question.

Well the first thing to do is to simply interpret with this in mind. Don’t just assume you can interpret the effect of one variable out of context with the other.

If you have a large enough sample, you can also take a random subsample from your larger group to stratify. So for example, you could randomly subsample within each of the four categories of Age and Marital status so that you have exactly 50 (or whatever number suits this) young singles; young marrieds; old singles; old marrieds.
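A stratified subsample like that could be sketched in Python along these lines (hypothetical data; here the subsample size is simply taken as the smallest cell count rather than a fixed 50):

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical records: (age, marital status, outcome) tuples
records = [("young", "single",  random.gauss(10, 2)) for _ in range(200)] + \
          [("young", "married", random.gauss(10, 2)) for _ in range(80)] + \
          [("old",   "single",  random.gauss(12, 2)) for _ in range(60)] + \
          [("old",   "married", random.gauss(12, 2)) for _ in range(90)]

# Group rows by Age x Marital cell, then draw the same number from each
by_cell = defaultdict(list)
for rec in records:
    by_cell[rec[:2]].append(rec)

n_per_cell = min(len(v) for v in by_cell.values())   # smallest cell: 60
balanced = [row for cell in by_cell.values()
            for row in random.sample(cell, n_per_cell)]
print(len(balanced))  # 4 cells x 60 rows each
```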


February 19, 2021 at 10:48 am

Hello, great article! I am looking to compare information between 2 groups from 2007 to 2017. Normally I’d use a t-test but for some of the groups the totals are different i.e. in 2007 n = 115 whereas in 2017 n = 84. I want to be able to see if there are any differences between the 2 years and if any improvements have been made.

Any advice would be much appreciated!


May 29, 2020 at 12:14 am

Evening, I am running a study on the differences between distance learning, hybrid learning, and face-to-face. I have three different numbers for the groups: Hybrid (N = 312), Distance (N = 1131), and Face to Face (N = 1007). The syllabi are the same for each group. I was running a Kruskal-Wallis test to determine any significance between the groups. Then it occurred to me that I had different numbers for each group. Is this a problem for this study?


November 12, 2019 at 4:29 am

Dear Sir, I have two groups with unequal sample sizes. The first has 46 samples and the second has only 13. How will I determine the significant difference between them?


October 7, 2019 at 7:06 am

hello everyone.. I have conducted a clinical trial in four groups of animals with different sample sizes: A (6 animals), B (7 animals), C (6 animals), D (10 animals). I have checked the effect of different medicines. Observations were taken repeatedly until the desired results were obtained, and each animal shows a different time to recovery, so I have different sample sizes with unequal repeated observations. Please suggest a proper statistical analysis model.


August 12, 2019 at 6:47 am

Hi, I am running a factorial ANOVA with 3 factors and 2 groups as IV. My groups are divided as 28 and 19 for one group, and 21 and 25 for the other group. Do you think this might be a problem for me when running the ANOVA?


January 25, 2019 at 9:37 am

Is it true that I can also run two-way ANOVA with unequal sample size in Minitab?

March 4, 2019 at 11:18 am

Hi Joanne, Minitab will let you run it, but be careful about the inferences you’re making.


January 15, 2019 at 1:31 pm

I am conducting a linear regression on Valence ratings (continuous response variable) with Condition (4-level factor) and Culture (x-level factor) as explanatory variable, using non-parametric bootstrapping because of non-normality and most importantly heteroskedasticity in my data. I am wondering which culture groups I can include in my analyses, given that the number of observations per culture group in each condition are very unequal, going from 8 to 217. Do you think that I should aim for a minimum of, say, 20 observations per level of Culture in each level of Condition? Is there a rule of thumb about the ratio of minimum:maximum number of observations per cell? Or of minimum number of observations in general? Thank you!


October 5, 2018 at 9:53 am

Thanks for the website! I have a bit of a different question related to this topic. I have a repeated measures design with 3 conditions. In condition A there are 150 observations for each participant, in condition B I have 20 observations for each participant and in C I have 150 observations for each participant. The total number of participants is 24 (i.e., each of the 24 subjects did all the three conditions in the experiment).

How is repeated measures ANOVA affected by this unequal numbers of observations in each condition? Would you happen to know where I can read more about this?

October 26, 2018 at 5:11 pm

I think this may help: https://www.theanalysisfactor.com/linear-mixed-models-for-missing-data-in-pre-post-studies/ https://www.theanalysisfactor.com/repeated-measures-approaches/


September 16, 2017 at 11:09 pm

I have a question. What do you suggest I do to compare the means of two groups when group 1 has a sample size of 17 persons and group 2 has a sample size of 82 persons? The sample variance values for the two groups are not that different. The sample means don’t look all that different. The only concern I have is that group 1 has n1 = 17 respondents and group 2 has n2 = 82 respondents. Any help is greatly appreciated.

Kenneth Lewis

September 21, 2017 at 4:40 pm

A t-test is fine there.


August 31, 2017 at 4:39 am

Dear Karen,

Thank you for this article, it is very interesting. I have a question linked to this problematic.

When applying a post-hoc test comparing each group of the ANOVA with only one (say a vehicle group versus all dose groups of a treatment, with a Dunnett step-down post-hoc comparison), and you choose to increase the sample size of the vehicle at the cost of the other groups’ sample sizes, are there known scenarios in which the power of the comparisons will be higher than in the balanced design (without alpha risk inflation)?

Thank you in advance


December 4, 2016 at 3:35 am

I have 3 different levels of English proficiency taking 6 different tests. However, all these different groups have different numbers of examinees. The first group has 490 participants, the second group has 1919 participants and the third group has 529 participants. Thus, I can say that I have unequal sample sizes for Mixed ANOVA. When I do the analysis by using SPSS, it calculates the sum of squares and degrees of freedom by using the minimum sample size of the first group, which is 490. Is there a way to make SPSS analyze all the data of the unequal groups?


September 28, 2016 at 7:16 am

Hi, does it matter if groups are marginally unequal, say by n 1? Thanks.


September 9, 2016 at 12:45 pm

Hello; some variables in my data set are non-normal, my data are also not independent, and the data also have unequal samples (unbalanced data). Please suggest what statistical test I should adopt.


June 15, 2016 at 5:07 pm

Hi! You talk about real issues with unequal sample sizes/variances in factorial anova, is this less of an issue when there is only one IV?

June 17, 2016 at 9:53 am

Yes, exactly. In a one-way model, it’s just not a big deal. Only when interactions come into it.


December 8, 2015 at 7:39 pm

It’s not a comment but a question. How can I compute a sample size when I have 2 groups to choose from? I mean, I have data for the population size of both male and female households in a particular site; however, they are unequal. I need respondents from each group because I am doing a comparative analysis.


September 8, 2015 at 7:55 am

Hi Karen, Great site!!

I was wondering where I can find the formulas for calculating 2-way ANOVA for non-balanced samples. Moreover, I’d be happy to know the differences between 2-way ANOVA with and without replication (and formulas for both cases would be great).

Best wishes, Yohay.


May 19, 2015 at 12:29 am

Very educative discussions. I am working on a research project which entails a 2-way design with unequal sample sizes. I am wondering if SPSS version 20 can perform such a task, because that’s what I have on my system.

thanks as i look forward to hearing from you.


April 4, 2015 at 9:59 am

I need your help regarding my project. I divided the patients according to severity of disease into three groups: Group A=55, Group B=29 and Group C=30. I want to apply one-way ANOVA, but my data is not normally distributed and I need the mean and standard deviation. Is it ok if I continue with one-way ANOVA? In my second project I have two groups, Group A=79 and Group B=35, and I want to apply an independent t-test, but again the problem is that the data is not normally distributed. Please advise me.

I will be really grateful to you

Dr.Mudassar Imran


April 20, 2018 at 8:42 am

Even if, for each population, the response variable you want to measure is not normally distributed, then if the sample sizes are large enough there is no need for normality, because the 3 sample means and 3 sample standard deviations will be close to the 3 population parameters, which is what is required if the null hypothesis is true.


March 3, 2015 at 2:57 pm

hello, great discussion! I was wondering if the repeated-measures ANOVA using STATISTICA software is adjusting the sums of squares equation for unequal samples size like SPSS does? thanks!

March 6, 2015 at 4:53 pm

I have no idea. I don’t use Statistica. Anyone else know?


November 27, 2014 at 5:58 am

I am doing a study on the prevalence and patterns of urinary tract infection amongst pregnant women attending a particular hospital in my country, comparing them to non-pregnant controls. I attained my study sample based on the prevalence. Please, how do I attain a formula to calculate the sample size now that I have been asked to stratify my pregnant patients into 1st, 2nd and 3rd trimester?

November 27, 2014 at 5:40 am

what is the formula to attain a sample size for the comparison of means of unequal groups?


April 23, 2018 at 3:04 am

google it…u will find the formula


August 14, 2014 at 3:15 am

I have a question related to unequal sample sizes. I have a 2 (language background first language speaker (L1)/second language (L2) speaker) x3 (visual status: early blind/late blind/sighted) design. I investigate whether it is an advantage to have become blind as a child when it comes to second language acquisition.

In total I have N=80: 40 L1 speakers and 40 L2 speakers (equal sizes), and each of these two groups has 11 early blind, 9 late blind and 20 sighted participants. Are these unequal sample sizes related to visual status a problem when using a 2×3 ANOVA? What do you suggest?

Many many thanks! Helena


June 10, 2014 at 3:50 pm

Hello Sir, in my research study I worked with three groups, Group A (n=50), Group B (n=50) and Group C (n=25), and I used one-way ANOVA. Is there any problem with the uneven sample size of Group C, or may it affect the statistical analysis? Please advise me. Thank you.


May 16, 2014 at 9:50 am

Hi, I’m doing a study for my Bachelor of Science thesis too. Currently, I’m having a problem with data analysis. My experimental design is a 3×3 factorial design with two independent variables (frying temperature and frying duration). However, the duration factor is a bit special in that the durations differ: samples are fried at 140C for 4, 5 and 6 minutes, while at 160 and 180C they are fried for 1, 2 and 3 minutes. Shall I use one-way ANOVA or two-way to analyse the effect on my sample?


May 2, 2014 at 5:55 am

I’m looking at the spatial variation of fish parasites for my Bachelor of Science thesis. I want to compare mean parasite abundance between male (n=71) and female (n=105) fish. I log transformed the parasite data and it has a normal distribution and equal variances, I was just wondering if I can use a One-way ANOVA to compare the mean abundance between sexes or would it be safer to just apply a non parametric Mann-Whitney U or Kruskal Wallis Test. Hope to hear from you.


May 1, 2014 at 2:08 pm

My experimental model has two factors: temperature and different time points. I performed a 2-way GLM for the unequal sample sizes I have. However, it seems that there is no effect from the interaction of the two factors or from temperature itself. My question is: will the comparisons between the two temperatures at different time points be valid if I perform them using a one-way GLM after the non-significant findings in the initial 2-way GLM?


April 30, 2014 at 2:53 pm

I hope you can help me. I’m trying to finish a paper for this term. I’ve just run two ANCOVAs. There were no problems with outliers, some problems with normality (skew and kurtosis <|3| although formal tests were significant), no problems with collinearity, correlation between covariates or homogeneity of slopes. Levene's test was significant for both ANCOVAS. The cell sizes and SDs are as follows:

ANCOVA with DV "A" N=30, SD=1.31 N=78, SD=1.16 N=55, SD=.88 N=171, SD =1.21

ANCOVA with DV "B" N=30, SD=0.91 N=78, SD=0.89 N=55, SD=.74 N=171, SD =.72

I realize that the smallest group has the highest variance in both cases. I hate to transform variables since it makes interpretation so complicated. What other options do I have?


April 24, 2014 at 12:13 pm

In my paper I am comparing the psychological well-being of orphan and non-orphan children. Sample sizes are n1=166 and n2=333. Is there a problem in computing an independent t-test?


April 23, 2014 at 10:58 am

I have data for 2 years, 2004 and 2008, and I realized the sample size is not equal for both years. How can I do data cleaning in Stata in this case?


April 22, 2014 at 6:44 am

Thanks for the information that you provided here. I have the same issue. I have a caregiver group of 96 and 42 control participants that I compare on one variable. I checked the variances and there were no significant differences, so I guess that covers it. However, do you know any published book that I can cite?


April 13, 2014 at 5:46 pm

I am working with a data set that has n~200, n~13, n~20. I would like to do an ANOVA but I am not sure how to approach this. What was the sample 20/200 you mentioned? Would a weighted mean account for these differences? The number of samples is also related to the number of interesting components for that group (not due to poor sampling).


April 11, 2014 at 10:00 am

Hi, I found your website very helpful and have a few of questions:

1) I have data of an entire population and am comparing the means of three groups. Do I still need to do significance testing since this isn’t really sampling?

2) The 3 groups have very different size (1200, 12000, 40000). I found the data not normal, so can I just use Kruskall-Wallis test?

3) I understand ANOVA is popular but I never found any data set that is normal, i.e. the Shapiro-Wilk or Kolmogorov test always has sig. < .05, so I kept on using the Kruskal-Wallis test. Is that ok?

Many thanks!


April 6, 2014 at 12:51 am

I’m doing an analysis on mechanical properties with one factor. I have 3 groups: group 1 (n=5), group 2 (n=9) and group 3 (n=8). I have read the comments people asked and the replies you have given. So am I right to say that for one-way ANOVA, it is all right to analyze different sample sizes per group?

April 7, 2014 at 5:00 pm


March 22, 2014 at 11:37 pm

Hi I was wondering what the full reference is for Keppel (1993). I’m interested in looking at that paper. Thanks

April 4, 2014 at 9:40 am

It’s a book, not a paper. “Design and Analysis: A Researcher’s Handbook.”


March 22, 2014 at 7:10 am

Hello, Karen. I’m glad I came across this site! I’m facing a challenge with my research work. I sampled 6 different land use types: 4 land use types were replicated 5 times, and the other two 4 and 2 times (due to their limited size for sampling). Now I want to test for significant differences in a parameter between the different replications and their means using ANOVA. This is unbalanced sampling, and I’ve tried to use the Gabriel test, but my variances are unequal and my data is not normally distributed. Please, how do I go about this analysis? Thanks!

April 4, 2014 at 12:59 pm

I’d have to know a lot more about your study and data to make suggestions about an analysis. I’m just not comfortable making suggestions as it’s too easy for someone to have left out crucial info. It seems you have a lot going on there. So I’d suggest a consultation .


March 13, 2014 at 11:24 am

Thanks so much Karen, that makes a bit more sense now! Will have a go at graphing them. Thanks again!


March 10, 2014 at 11:18 am

Hi Karen, I am in the process of collecting data and plan to use a 2 (gender, between subjects) x 3 (condition, between subjects) x 3 (time of testing, within subjects) ANOVA to analyse my data.

I want to run an a priori power analysis to check how many participants I should have in each cell. I am unsure if I am using Gpower correctly (particularly if an effect size of .3 is ok), but it gives me a sample of 102 overall (17 per cell?). I wonder if this seems right and if having vastly mismatched cells will matter? (some cells currently have 49 participants).

Thank you in advance!

March 6, 2014 at 1:35 pm

Hi Karen, I am hoping you might be able to offer some suggestions regarding two questions I am struggling with for my data analysis.

1) I have one study which has shown a statistically significant difference between two sample groups, using a Mann-Whitney test as the data is not normal; however, the groups are unequal in size (Group 1 = 3369, Group 2 = 1524). My supervisor has asked whether I can apply a correction factor to account for the difference in group size; however, I was under the impression that the Mann-Whitney already accounts for this? Any ideas??

2) Another study has two sample groups with almost exactly equal means (Group 1=5.67, Group 2=5.75), which to me intuitively says they are not statistically different; however, again the data are not normally distributed (and not equal in size either: Group 1=103, Group 2=221), so I am assuming I have to run a non-parametric test, which results in a statistically significant difference between the groups??

I hope that all makes sense!

Any light at all you can shed on this would be greatly appreciated, I have been struggling for days and have exhausted the textbooks and web pages!!! Thanks in advance!

March 10, 2014 at 5:04 pm

Hi Seaneen,

1) No correction necessary. M-W is fine for unequal samples. 2) It’s possible to have so-small-it’s-not-interesting but statistically significant results. But another possibility is that the nonparametric test isn’t comparing means. If you have an outlier or two, that would affect means (possibly making them closer than say, the medians) but would not affect the nonparametric test. So it’s possible those two distributions have the same mean, but aren’t generally overlapping as much as the close means would indicate. I say graph them.


March 2, 2014 at 10:15 pm

I need to run an ANOVA with two samples (n is unequal for the groups) for several measurements. I am not able to carry this out, perhaps because the sample sizes are different? I am comparing 28 different categories between two groups at 3 different ages. How do I do this? I ran student t-tests that gave good information, but am now asked to run an ANOVA. Any help would be appreciated.

March 10, 2014 at 5:10 pm

I’m not sure I understand what is your DV. Is it the 28 categories? Or you’re saying you have 28 DVs?


February 21, 2014 at 11:16 am

sorry I mean, pick randomly 7 females 🙂

March 10, 2014 at 5:26 pm

This is tricky–unequal sample sizes are definitely a problem with two-way models, but at the same time 7 is a very, very small sample. Is there any way to get more males instead?

February 21, 2014 at 11:15 am

Hi Karen, I’m running a 2×2 mixed ANOVA (between factor is gender, male and female; within is measurement at Time 1 and Time 2) with 7 males and 29 females. Is it okay to do that, or are the sample sizes too unequal? The variances in scores (using two different scales) are mostly twice as large for women as for men, for instance std. (men/women) = 0.4/0.8, 0.4/0.9, and from the other scale 5.6/4.9 and 3.7/6.3. Or should I randomly (SPSS can do it) take 7 males and then perform the 2×2 mixed ANOVA?


February 19, 2014 at 3:50 am

Hello, I’m using multiple regression for my research project. My sample sizes are unequal: students 720, parents 135 and teachers 80. I want to find the effect of parents and teachers on students. I have used SPSS to calculate it, but still want to confirm with you whether you can do multiple regression with unequal sample sizes. Please help me, as I am confused and stuck on this. Thanks.


February 17, 2014 at 8:53 am

I am a business student and I don’t have a strong statistics background, but I’m not afraid of learning, so if there are any articles that can help please let me know. I have three variables: one is independent, the second a mediator and the third is dependent. Data will be collected from managers and employees: IV and DV data from managers and mediator data from employees. Now the problem is that there are 20 managers and 100 employees. I was following the Baron and Kenny (1986) approach and Judd and Kenny’s (1981b) recommendation to run regression models to analyze the data. Now I’m looking at other techniques due to the unequal sample size. Can I analyze the data in ANOVA? If there is any article on this sort of problem please let me know; I appreciate any help I get. Thanks


February 12, 2014 at 4:20 pm

I am trying to figure out sample size of an article on socially conscious mutual funds. The article takes a look at industries/sectors that are screened out of these mutual funds in order to evaluate performance. The three independent sectors that are looked at are tobacco, alcohol, and gambling. Each sector is compared to the S&P 500 Index over an 11 year span. Tobacco has 15 stocks in the industry, alcohol-18, and gambling-22. Do you know what the number of the sample size would be for this? Would it be 3? Or 1, since they are all exclusive?

February 14, 2014 at 2:15 pm

It’s hard for me to say without seeing the paper and exactly which analysis they’re doing and how. It could either be the number of stocks or it could be, as you suggested, the number of industries.


February 9, 2014 at 1:27 am

So glad I found this site! I’m having trouble accepting my analysis and perhaps I’m doing it wrong so hopefully you can shed some light.

My master’s thesis is on female choice. I conducted three-choice experiments in which females are presented 3 different acoustic stimuli simultaneously. I record which stimulus they choose as well as the time it took them to make the choice (latency). My issue is with the latency analysis. I assumed that a one-way ANOVA was a proper test because my independent factor is categorical (choice) and my dependent factor is continuous (latency–time).

My sample sizes: Stimulus 1: 2; Stimulus 2: 10; Stimulus 3: 18.

One issue I have is that the variance for the group with two individuals is HUGE, mainly because one female took her time to choose that stimulus, whereas another female chose that same stimulus rather quickly. I found no significance across the board, but is it because of that low sample size of group 1?

Thank you so much for your help. I really appreciate it.

February 14, 2014 at 2:13 pm

Theoretically it doesn’t matter that your samples are unequal, but practically, you’re going to have a hard time if a sample is only 2.

Your choices are to run more subjects or drop that stimulus group. Unfortunately, that’s about all you can do. Since none of your groups is very large, running more subjects would be the best, if you can manage it.


February 2, 2014 at 12:32 pm

Hi. I have done an analysis on 3 groups. Group 1 has 24 subjects, group 2 has 398 and group 3 has 755 subjects. On analysing the variable vomiting: group 1 had 12 subjects with vomiting out of 24 (50%); group 2 had 169 out of 398 (42.5%) and group 3 had 270 out of 756 (35.8%). On analysis by chi square (3×3), the p-value was statistically significant (.041). To find out which groups differed from each other I did pairwise comparisons between groups 1 and 2, 1 and 3, and 2 and 3. The p-value for the group 2 vs 3 analysis was less than .05, thus statistically significant, but the group 1 vs 2 and group 1 vs 3 analyses were not statistically significant. My question is: the difference between group 2 with 42.5% of cases and group 3 with 35.8% was statistically significant, but why was the difference between group 1 with 50% (which is higher than the proportion seen in group 2) and group 3 with 35.8% not statistically significant? Is it because of the very small number of subjects in group 1, or something else?

February 3, 2014 at 5:30 pm

Yes, that’s probably it. With so few people in Group 1, you don’t have much power to find a difference.


January 30, 2014 at 5:39 pm

Could you please help me with your valuable suggestions in stats?

I have three groups (n1=16, n2=23 and n3=24) with different sample sizes. I want to see the significant difference between these groups based on a parameter in common. Please let me know the best method or tool to analyse.

February 3, 2014 at 5:24 pm

Well it depends on which parameter you want to compare. If it’s the mean of each group on some dependent variable, then you can use one way ANOVA. The different sample sizes are no problem.


January 26, 2014 at 12:30 pm

Is it compulsory to have an equal number of patients in both groups for data analysis? If not, can I exclude a single patient at the end of the study, to remove bias and make the samples equal in both groups for analysis?

February 3, 2014 at 4:16 pm

It’s not necessary at all, unless you had some sort of patient matching. It sounds like you don’t, so you’re good to go.


January 22, 2014 at 9:27 pm

I’m looking at differences in fish weight between a control group and 4 different treatment groups from experiment start to finish.

I am a Masters thesis student and have run a 2-way ANOVA on my data, but I have unequal groups (unavoidable, and I was told this wouldn’t be a problem by my supervisors). I have 3 independent variables {sample period, treatment and frequency} and 1 dependent variable {weight}.

So it turns out it is a problem – Levene’s test is 0.017. My data conform to normality and my model is significant (0.018), with my factor (sample period) significant at the .001 level.

Should I be running another stats test or is there a way to adjust for the lack of homogeneity?

Thanks for help!

January 24, 2014 at 1:22 pm

I would investigate those variances more. Levene’s test isn’t very useful for testing assumptions (see Keppel, 1993).


January 14, 2014 at 12:00 pm

Sorry for double posting, I meant to create a new reply but replied to a post instead:

Thank you for this article, both the article and the discussions below are enlightening 🙂

Can I ask your opinion on one related thing? I want to run a two-way ANOVA with unequal sample sizes. The reason for the unequal sizes is that there is a third factor that doesn’t participate in this ANOVA and requires its own data points. What would be the way to go when downsizing the larger sample groups in terms of randomization?

To give an example, let’s say we compare responses from athletes and non-athletes, which are either male or female. So the factors are Gender (Male, Female) and Athlete (Yes, No). This will be analyzed with a two-way ANOVA, let’s call it ANOVA A. So we have:

Male Athletes: n=20
Male Non-Athletes: n=20
Female Athletes: n=40, but we want to make it n=20
Female Non-Athletes: n=40, but we want to make it n=20

There are more female subjects because, in the same study but a different analysis, we will do exactly the same comparison but with an added factor, e.g. In-pregnancy (Yes, No), which doesn’t apply to males. So that one will be another two-factor ANOVA; let’s call it ANOVA B:

Female Athletes In-Pregnancy: n=20
Female Non-Athletes In-Pregnancy: n=20
Female Athletes Not-In-Pregnancy: n=20
Female Non-Athletes Not-In-Pregnancy: n=20

How do we choose which females to use in the downsized group for ANOVA A? It sounds logical to randomly select 20 Female Athletes and 20 female Non-Athletes, but should we care if they are In-Pregnancy or not? Or should we account for that as well?

Thanks a lot,

January 15, 2014 at 10:37 am

That’s a great question.

I assume that if you had not had the pregnant/non-pregnant groups selected out for the second study, you would have just randomly selected 20 female athletes and 20 female non-athletes. Unless it’s standard or relevant to find out if they’re pregnant, you wouldn’t ever know, right?

So there are two options for the study where pregnancy is not relevant.

1. Figure out what percentage of the female athlete population is usually pregnant at any given time, then draw both of your samples at that same rate.
2. Decide that the population of interest is non-pregnant female athletes and just use that sample.


December 13, 2013 at 9:16 am

Hello. Thanks for the information. I would like to ask: what is recommended as a post hoc test when running one-way ANOVA with different sample sizes? 4 groups n = 10, 1 control group n = 30. Thanks a lot 🙂

December 23, 2013 at 1:24 pm

I would usually use a Tukey. Tukey Kramer is the version for unequal sample sizes.
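To illustrate, here's a minimal sketch in Python using statsmodels, with invented data matching the group sizes in the question (four treatment groups of n=10 and a control of n=30). With unequal n, `pairwise_tukeyhsd` applies the Tukey-Kramer adjustment automatically.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Invented outcome data: four treatment groups of n=10, control of n=30.
sizes = [10, 10, 10, 10, 30]
means = [5.0, 5.2, 5.5, 6.5, 5.0]
values = np.concatenate([rng.normal(m, 1.0, n) for m, n in zip(means, sizes)])
groups = np.repeat(["t1", "t2", "t3", "t4", "control"], sizes)

# pairwise_tukeyhsd uses the Tukey-Kramer harmonic-mean adjustment
# whenever the group sizes differ.
result = pairwise_tukeyhsd(values, groups, alpha=0.05)
print(result.summary())
```

The summary table lists all ten pairwise comparisons with adjusted confidence intervals and reject/fail-to-reject decisions.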


December 11, 2013 at 2:20 pm

I’m getting confused with my data analysis. I’m about to study motivation towards grade achievement. Motivation is divided into 2 categories: intrinsic (interest and attitude) and extrinsic (family, social, teaching style, learning style). Grade is defined in terms of A, A-, B+, B, B-, C+, C, C-, D and E. When I ran the one-way ANOVA test, the result showed significant differences among the means. But when I try to run the post hoc test, it comes out like this: “Warnings: Post hoc tests are not performed for Gred because at least one group has fewer than two cases.”

Can you tell me how to solve this problem, please? I’m new to statistics.

December 23, 2013 at 1:29 pm

It’s hard to tell exactly what is going on without looking at it, but it sounds like there is one group within your motivation categories with only one person. I would start with some frequency tables.


December 11, 2013 at 6:21 am

I’m using ANOVA to compare user preference ratings R within various cities, for groups A, B, and C. Unfortunately, my group sizes are HUGELY skewed – group A will typically have 20,000 or more members per city, group B will have ~1,000, and group C can have as few as 100.

In response, I have been running ANOVA by:

1) determining the count of C members in each city, call this Cn (let’s say 130 C people in Dallas)
2) randomly picking Cn members from group A within each city, calling this a sample-A group (in contrast to population-A for the city). So in my hypothetical, this might mean picking 130 A ratings out of 25,000.
3) performing a one-sample t-test on sample-A vs population-A within each city; in the Dallas hypothetical, comparing the 130 sample-A to the 25,000 population-A.
4) repeating steps 2 and 3 until I get a sample-A selection with no significant difference from population-A for each city. This might mean I re-pick the 130 Dallas A ratings several times until I’ve picked a representative sample.
5) repeating steps 2-4 for group B.
6) performing my ANOVA test on sample-A, sample-B, and sample-C within each city.

This seems to be working quite well; indeed, I’ve clearly identified cities where the ratings of the A, B, and C groups truly seem to differ. However, I’m not an experienced statistician, and since this approach feels ad hoc, I’m curious whether the results would stand up to scrutiny.

December 23, 2013 at 2:02 pm

Your sampling seems fine. The one thing I would change, though, is to eliminate steps 3-5. Those are still based on the very large population size. As long as your sampling is truly random, there should theoretically be no difference between the mean of the population and the sample.
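A minimal sketch of that simplified procedure (one simple random draw per large group, then the ANOVA directly), with invented rating data standing in for the real values:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)

# Invented ratings for one city: group A is huge, group C is small.
ratings_a = rng.normal(3.8, 1.0, size=25000)
ratings_b = rng.normal(3.6, 1.0, size=1000)
ratings_c = rng.normal(3.2, 1.0, size=130)

# One simple random draw per large group, sized to match group C;
# no representativeness re-checking needed when the draw is truly random.
n = ratings_c.size
sample_a = rng.choice(ratings_a, size=n, replace=False)
sample_b = rng.choice(ratings_b, size=n, replace=False)

f_stat, p_value = f_oneway(sample_a, sample_b, ratings_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

(Strictly speaking, the ANOVA would also run fine on the full unequal groups; the subsampling mainly equalizes the groups' influence on the variance assumption.)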


October 23, 2013 at 10:40 pm

If I have three different sample sizes, which are 48, 46 and 44, can I use one-way ANOVA?

Thanks. : )


October 14, 2013 at 5:12 pm

I have a question: when running a one-way ANOVA with three levels (60, 62, and 63 participants per group), where one group does not meet the normality assumption (although the histogram looks like it satisfies normality) but equal variance is met, what kind of post hoc test should I be using, and why?

thanks!!! 🙂

October 16, 2013 at 9:54 am

Hi Yasmine,

There isn’t a post hoc for a situation of non-normality. If the normality is close enough for the ANOVA F test, it’s good enough for posthocs.


October 3, 2013 at 2:57 pm

I have 3 subgroups from the main group. The number of samples in each group was 6, 7, and 9. Can I use ANOVA or the Kruskal-Wallis H test for the comparison, and why?


October 2, 2013 at 5:25 am

I am analysing my data using STATISTICA. I have a problem of getting a standard error of zero for my dry matter variable, yet other variables do not have a zero standard error. What could be the problem? Thank you

October 7, 2013 at 11:27 am

I would need a lot more information, and probably to actually see the analysis to figure this one out. It sounds like you’re overspecifying the model in some way.


October 1, 2013 at 1:43 am

Hi, I’m doing a study with six groups, so I have to do ANOVA. But when I check for normality using the Shapiro-Wilk or Kolmogorov test, the data in two of the six groups are not normally distributed. Can I still continue with ANOVA, or should I use the KW test?


September 29, 2013 at 12:00 pm

Hello ma’am, my total sample is 218, divided into three different groups: group A: 65, group B: 61, group C: 92. I have to do comparisons between these three groups. For that I used ANOVA, and after finding the result (p-value) I have to use a post hoc test. Could you please suggest what type of post hoc test I can use in my study, because my sample is large? Thank you. Please reply asap.


September 15, 2013 at 2:25 am

Hi, I was wondering if you can help me find an answer to my question. I have collected 567 responses on smoking status. 11 respondents (2.5%) are smokers and 553 (97.5%) are non-smokers. I want to conduct a t-test to compare these two groups on their difference in means on other variables. Is it doable? I just ignored testing this variable due to the very unbalanced sample sizes. Is that right?

September 25, 2013 at 10:17 am

It’s doable. Just be very careful to check the equal variance assumption. The bigger issue is that 11 is very small, and you may not want to make inferences on the responses from 11 people.


September 12, 2013 at 11:39 pm

Thank you for sharing your knowledge with us. I have an ANCOVA question for you. I am trying to compare a treatment and a control group, across 8 different segments of people. My sample sizes for treatment and control groups for each of the 8 segments are not even. The worst uneven sample sizes are n(treatment)=20, n(control)=8. My results are showing significant difference between the treatment and control groups in only one of the eight segments, however the “observed power” for the test is much lower than 0.8. So, I am wondering whether these results are reliable at all? If I want to increase the power, is there any way other than increasing the sample size (because I can not)? For instance, is there any other test?

Thank you for your help, in advance, /Hector

September 25, 2013 at 10:37 am

Yes, if a test is insignificant and the true effect size is the effect size you measured, then you have insufficient power to detect that effect. You don’t need observed power to check that.

Here are pretty much the only ways to increase power: https://www.theanalysisfactor.com/5-ways-to-increase-power-in-a-study/
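For a sense of just how limited the power is with cells this small, here is a prospective power sketch using statsmodels' ANOVA power calculator. The numbers below are illustrative assumptions, not values from the commenter's study: the worst-case segment (20 + 8 = 28 subjects, 2 groups) and a medium effect size (Cohen's f = 0.25).

```python
from statsmodels.stats.power import FTestAnovaPower

# Illustrative only: N = 28 total, 2 groups, medium effect f = 0.25.
analysis = FTestAnovaPower()
power = analysis.solve_power(effect_size=0.25, nobs=28, alpha=0.05,
                             k_groups=2)
n_needed = analysis.solve_power(effect_size=0.25, alpha=0.05, power=0.80,
                                k_groups=2)
print(f"power with N=28: {power:.2f}")
print(f"total N needed for 80% power: {n_needed:.0f}")
```

This is a prospective calculation, which is meaningful, unlike post-hoc "observed power," which is just a restatement of the p-value.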


September 10, 2013 at 10:41 am

When I analyse data with ANOVA, I am able to present my p-values and means in a table, and this is acceptable. However, I have a study in which I intend to use Kruskal-Wallis, and I would want to have my results in table form. Is it in order to put in the medians, or do I use p-values only? I have not come across this situation before. Advice?

September 12, 2013 at 1:57 pm

Hi Keneth, although technically a Kruskal Wallis is not testing medians, it is pretty common to report medians as a descriptive stat, along with the K-W test statistic and p-value.
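For example, a sketch in Python with invented skewed data, reporting group medians as descriptives next to the Kruskal-Wallis H statistic and p-value:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)

# Invented right-skewed outcome in three groups.
g1 = rng.exponential(2.0, size=25)
g2 = rng.exponential(3.0, size=30)
g3 = rng.exponential(5.0, size=28)

h_stat, p_value = kruskal(g1, g2, g3)

# Medians reported as descriptive statistics alongside the H test.
for name, g in [("group 1", g1), ("group 2", g2), ("group 3", g3)]:
    print(f"{name}: n = {g.size}, median = {np.median(g):.2f}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")
```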


September 3, 2013 at 11:12 pm

Namaste Ma’am, I have a problem with my statistics. I have two samples, one of size 18 and the other 17. When I test normality with the Shapiro test (in R), the sample of 17 gives a p-value of 0.007442, i.e. p is less than 0.05, and the sample of 18 gives 0.3423, i.e. p is greater than 0.05. From the p-values it appears that one has a normal distribution but the other does not. In this situation, which test is suitable? Can I use the Wilcoxon rank sum test (a nonparametric test)? I drew these samples from one community forest which is divided into two blocks, one an unmanaged and the other a managed block of CFs.

September 12, 2013 at 2:01 pm

Namaste Ambika,

I don’t like the Shapiro-Wilk test as the final decision maker about normality. I would first investigate what distributions you do have. If the one doesn’t look normal, why not? Skew? An outlier? Uniform?

That said, the Wilcoxon is considered distribution-free, so it’s safe to use, if it answers your research question.
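A sketch of that workflow in Python (screen the distributions first, then fall back on the rank-sum test), using simulated data in place of the forest measurements:

```python
import numpy as np
from scipy.stats import shapiro, ranksums

rng = np.random.default_rng(7)

# Simulated stand-ins for the two forest blocks (n = 18 and n = 17);
# the unmanaged block is made skewed on purpose.
managed = rng.normal(10.0, 2.0, size=18)
unmanaged = rng.lognormal(mean=2.2, sigma=0.5, size=17)

# Shapiro-Wilk as a screening step only, not the final word.
for name, x in [("managed", managed), ("unmanaged", unmanaged)]:
    w_stat, p_norm = shapiro(x)
    print(f"{name}: Shapiro-Wilk p = {p_norm:.4f}")

# The Wilcoxon rank-sum test is distribution-free, so it works either way.
stat, p_value = ranksums(managed, unmanaged)
print(f"rank-sum test p = {p_value:.4f}")
```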


September 1, 2013 at 10:21 am

Hi. In my paper, males and females are compared through a MANOVA test. The number of males is 37 and females 86. Does this difference in numbers affect the results? How can I justify this difference?


August 22, 2013 at 9:12 am

I would be most grateful if you could help me as I have an ANCOVA question for you.

Two of my independent variables have unequal sample sizes, for example: the first variable (depression) was drawn from a student sample, the depression variable has 6 ordinal levels with: n=55, 16, 6, 5, 4, 1 (in each level of depression). The second variable (anxiety), also from a student sample and has 4 ordinal levels with: n=36, 28, 17, 6. As you probably assumed: when depression and anxiety increases the n for level of the respective group gets smaller (there are few subjects with higher levels of anxiety or depression in the sample).

Question: Should I run the analysis as it is (I have used Levene’s test of equality of error variances and it was non-significant), or should I merge levels, i.e. levels 3-6 in the depression variable and levels 3 & 4 in the anxiety variable? What would you do?

Thank you very much for your time, Daniel

September 4, 2013 at 2:20 pm

There isn’t one right answer to this one, since you don’t seem to have problems with unequal variance.

But I can tell you a group with n=1 (the highest depression) has no variance, so isn’t useful. It is certainly reasonable to combine those groups, as long as it makes theoretical and logical sense.

And as long as those natural groupings aren’t giving you opposite results, it should help your power as well.


August 17, 2013 at 12:35 pm

Hi Karen, I’m running ANOVA to compare means. The ANOVA sig. = .129, but the post hoc test concludes there’s a significant difference at the 0.05 level. How come?

September 4, 2013 at 2:28 pm

I was going to refer you to another article, but just realized I haven’t written anything on this. It’s so important (and common). Here’s the quick answer:

1. They’re not actually testing the exact same thing.
2. The F test always trumps the post-hoc. If it’s not significant, don’t run a post-hoc. 🙂


August 12, 2013 at 4:33 am

Please help me with my assignment. I really don’t know what to do because our prof hasn’t taught this yet; it’s some kind of advance study for us, but it’s so hard 🙁

HOMEWORK – Introduction to Analysis of Variance: A psychologist conducts research to compare learning performance for three (3) species of monkeys. The animals are tested individually on a delayed-response task. A raisin is hidden in one of three containers while the animal watches from its cage window. A shade is then pulled over the window for 1 minute to block the view. After this delay period, the monkey is allowed to respond by tipping over one container. If its response is correct, the monkey is rewarded with the raisin. The number of trials it takes before the animal makes five (5) consecutive correct responses is recorded. The researcher used all the available animals from each species, which resulted in unequal sample sizes (n). The data are summarized below. Ref. (Gravetter, Frederick J.; Walnau, Larry B., 2012)

        Vervet   Rhesus   Baboon
n       4        10       6
M       9        14       4
T       36       140      24
SS      200      500      320

Overall: N = 20, G = 200

Summary Table for One-Way ANOVA:
Source     SS   df   MS   F
Between
Within
Total

Fcrit = ? at alpha 0.05

Guide Questions:
1. Formulate the steps in hypothesis testing (10 pts)
2. Construct the summary table for the one-way ANOVA (8 pts)
3. Identify whether the problem uses a one-tailed or two-tailed alpha level, and explain why (2 pts)

September 5, 2013 at 4:50 pm

Hi Grace, while I appreciate how hard this can be, as a rule, I don’t help with homework. That’s what your TA is paid the big bucks to do. 🙂


August 1, 2013 at 1:42 pm

I was hoping to use ANCOVA to compare a battery of neuropsychological tests in carriers vs non-carriers, controlling for age, gender and education level and I have three questions about that which I was hoping you could help me with. 🙂

Firstly, do I have to demean the covariates, before feeding them into (the) SPSS (multivariate general linear model)?

Secondly, is Levene’s Test of Equality of Error Variances the test I need to do to check if the variances are sufficiently similar to perform the ANCOVA on?

Lastly, assuming this is the case, what happens if Levene’s test is significant? Does it matter a lot for ANCOVA (or is it very robust anyway)? Is there a non-parametric alternative that I could use instead?

Thank you very much!

August 7, 2013 at 3:32 pm

1. I’m not sure what you mean by demean. I assume you mean “mean center.” If so, it’s not necessary, but it can be helpful.
2. Levene’s is popular, but I don’t use it, at least not as a sole criterion.
3. It’s robust, unless sample sizes are quite different.


July 11, 2013 at 12:37 am

Hi, If my samples from two groups were slightly unbalanced (8 vs 9), but the homogeneity of variance was not violated (Levene’s test > 0.05). Does it mean that I could interpret the results as if the data were balanced? Thank you very much. Mike

July 15, 2013 at 3:49 pm

July 16, 2013 at 8:44 am


June 15, 2013 at 2:00 am

I’m new to SPSS and research analysis; I hope you can help me. I am doing an analysis of the influence of teacher characteristics (e.g. academic background) on student scores. I have 135 teachers and more than 4000 students. How should I prepare my data set so I can do a multiple regression? Thank you!!!

July 1, 2013 at 1:09 pm

Hi Kaye, if you’re looking at teacher characteristics on their students, you need to account for the fact that the students with the same teacher are not independent. You do this with a multilevel or mixed model. You can get a lot more info here: https://www.theanalysisfactor.com/category/mixed-and-multilevel-models/

' src=

June 13, 2013 at 11:11 pm

Hi Karen, I want to run a one-way between-groups ANOVA, but my group sizes are 88, 76 and 7. Do you have any suggestions or comments on whether this will provide useful information? Thanks, Mark

June 14, 2013 at 3:21 pm

It’s hard to do any sort of comparison with only 7 observations in a group. That said, in some studies that’s all you ever have. This could be useful, but pay very close attention to those assumptions. A non-parametric test, like Kruskal-Wallis, may be a safer approach.


June 12, 2013 at 8:26 pm

I’m hoping to run a one-way ANOVA with 4 independent groups. The sample sizes are 102, 100, 100 & 59. Levene’s test was significant (0.001) after an arcsin transformation (the data were percentages). The distributions are normal.

I read somewhere that if there is less than a 5-fold difference in standard deviations, the ANOVA should still be robust, even with heterogeneity of variance, but the site did not list any references. In my case, there is a 1.49-fold difference between the largest and smallest standard deviation.

I was wondering whether you think I can use an ANOVA?

Also, I’m having trouble tracking down the paper you referenced (Keppel, 1993). In what journal was it published?

Thank you very much! 🙂

June 14, 2013 at 3:10 pm

HI Chantalle,

5-fold sounds higher than I’ve seen, but 1.49 is probably fine. Keppel is a textbook, not a journal article. Design and Analysis: A Researcher’s Handbook is the title.


June 12, 2013 at 2:25 pm

I am running both t-tests and logistic regression analyses looking at income differences between two groups. One group has 980 subjects; the other 9800. In another comparison, one group has 980 and the other group has 430,000.

I have run t-tests using the lincom function in stata (with unequal variances). I have also drawn a random sample of 10% of the larger group and re-run some of the analyses. While my means change slightly with the smaller samples, the overall patterns persist and statistical significance does not change.

I have a reviewer who has asked whether I have applied any corrections to take sample size differences into account. Would you suggest any additional corrections, other than what I have already done? The reviewer in particular questioned whether I could trust my results that indicated statistical significance, given the very different sizes between the two groups. Would you agree with this concern?

I appreciate any feedback!

July 1, 2013 at 4:31 pm

I understand doing corrections in a factorial situation, but you don’t have that. It sounds like you already tried the subset of the larger group, and got the same answer. I’m not sure what other corrections you’re supposed to try.


June 2, 2013 at 5:29 pm

I have 3 sample groups I wish to compare: Sample A = 20, Sample B = 20, Sample C = 40. Do I need to adjust my ANOVA to compare them? If so, how do I calculate the weighted mean? The samples come from 3 different stakeholder groups, i.e. different populations. Does that make a difference when calculating the weighted mean?

All my data is in Excel. Is it possible to carry out an ANOVA with weighted mean in Excel?

Sorry for all the questions. Your help would be greatly appreciated.

June 6, 2013 at 5:17 pm

Hi Richard,

I suspect it is possible to do an ANOVA with weighted means in Excel, but I don’t ever use Excel for data analysis, so I have no idea how.

You would need to do adjustments to means if you’re calculating by hand, but stat software will do it for you automatically.
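For example, in Python, scipy's one-way ANOVA needs no manual adjustment for the unequal n's. The sketch below uses invented scores with the 20/20/40 sizes from the question, and also shows the weighted grand mean that the article above mentions:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(3)

# Invented scores with the 20/20/40 stakeholder group sizes.
a = rng.normal(50, 10, size=20)
b = rng.normal(55, 10, size=20)
c = rng.normal(60, 10, size=40)

# No manual adjustment needed: the correct unbalanced sums-of-squares
# formulas are applied internally.
f_stat, p_value = f_oneway(a, b, c)

# With unequal n, the grand mean is the weighted mean: the mean of all
# observations pooled, not the mean of the three group means.
weighted_grand_mean = np.concatenate([a, b, c]).mean()
mean_of_means = np.mean([a.mean(), b.mean(), c.mean()])
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
print(f"weighted = {weighted_grand_mean:.2f}, unweighted = {mean_of_means:.2f}")
```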


May 24, 2013 at 11:59 am

I have completed an independent samples t-test and because equal variances are not assumed, I go with the statistics which SPSS provides for that correction. However, my sample sizes are not similar (71/242) and therefore I have been taught to be very leery of the corrected t statistic. One solution I have been told is to select a random sample of the bigger group (so I would select 71 cases randomly out of the 242) and then run the test so that you have equal groups (71 to 71) to run your t test. Have you ever heard of this? Is this the most robust way of dealing with the issue of having both unequal variances and unequal n size?

Any help/suggestions would be much appreciated!

June 6, 2013 at 5:20 pm

I have heard of that (just read it in a book again yesterday). You’re absolutely right that when the sample sizes are that different, you have to be careful about unequal variances.

Another option, btw, would be a nonparametric test, like Wilcoxon Rank Sum.


May 1, 2013 at 4:44 am

I have a 2 x 2 x 2 mixed ANOVA design as well: a 2×2 repeated measures design followed by a between-groups factor (gender). But my sample sizes are 59 and 29; is that too big a difference?

May 1, 2013 at 4:58 am

Also, past research has said females would generally do better, so with females at 59 and males at 29, should I report a possible confound?

May 1, 2013 at 12:08 pm

It could. This is exactly the situation where the bigger sample of females could cause problems. Are the results the same within each gender?

May 2, 2013 at 9:21 am

There is a marginal significance (p = 0.058) in only one of the interactions between gender and another IV.


April 25, 2013 at 1:14 pm

I’ve run a 2 (groups) x 3 (modalities) x 3 (intervals) mixed ANOVA.

Now, in group 1 there are 17 subjects, while in group 2 there are 15 subjects.

One reviewer asked if I applied any “correction” to take into account the different sample size.

I did not think this was a problem, above all with this small difference. Do you have any advice? What should I do?

Thanks,

April 29, 2013 at 6:36 pm

Hi Marco, there is no need to do anything, particularly if at least two of those IVs are manipulated. It’s only a problem if there’s a relationship among the IVs. Even so, those n’s are very similar, even if not equal.


April 25, 2013 at 8:55 am

I am working on my master’s thesis and am confronted with a dataset with 2 unequal group sizes (n=48, n=160) at baseline (T1). I have to test whether there is a difference between the two groups at baseline, before the start of treatment, but also after 3 and 9 months (T2, T3). Besides, the second group gets smaller over time (n=132 at T3), so I am wondering what test to perform to deal with these difficulties. Hope to hear from you.

Greetz Chantal

April 29, 2013 at 6:37 pm

It’s really not a problem if the groups are unequal sizes. The bigger problem would be why one group is losing subjects over time but the other isn’t (although maybe I’m just assuming that last part)


April 3, 2013 at 2:47 pm

I have conducted an ANOVA for 3 between-subjects groups: A (n=26), B (n=19) and a neutral group (n=68). No significant effect was found, but I would like to know whether this was likely due to the neutral group. What problems would the large size of this neutral group present in this situation?

Thanks, Muj

April 3, 2013 at 4:26 pm

I’m not sure what you mean by if it was due to that group. Because it has the largest size, it should have the narrowest standard error. It would entirely depend on the order of the three means. It’s the two small groups that would potentially cause problems. That’s where your power is limited.

April 3, 2013 at 5:34 pm

Thanks for the prompt reply. I am testing the effects of schizotypy on memory performance, in particular accuracy and reaction time. It’s proposed that there would be a difference between high and low groups, with the high group performing significantly worse. However, no significant effect was found. The neutral group does have the narrowest standard error (25.95), compared to the low schizotypy group (43.58) and the high schizotypy group (50.98). The means for RT are: low = 848.58, high = 965.13, neutral = 927.29.

I was asked by my supervisor to comment on the potential problems of the large neutral group. Could it be that she means my other two samples were not as well matched and had reduced power, so there was not a significant effect?

Sorry for my essay ^ but many thanks for your help! 😀


March 18, 2013 at 9:08 pm

How can I run a nonparametric paired t-test with unequal sample sizes, using R?

April 2, 2013 at 5:48 pm

Unequal sample sizes ARE a problem if the data are paired. Do you mean that some pairs are missing one half of the pair?


March 18, 2013 at 8:45 am

I love you. Thanks! 🙂


February 22, 2013 at 4:05 pm

Good afternoon Karen,

I have a question for you. My sample size is 351 (68 male/283 female). I am comparing males and females on several continuous variables using parametric tests: t-tests, MANOVA, etc. The issue is the large difference between the groups, and the feeling that I should conduct nonparametric tests. Would this ‘satisfy’ those reading my work? The results are the same with both parametric and nonparametric tests, but I am concerned about the great difference, because this is my major hypothesis.

Thanks so much for your advice. Anne

March 4, 2013 at 11:03 am

If you’ve checked assumptions and have no problem with unequal variance, it’s fine.

That said, reviewers don’t always know that, so they may challenge you. If it would make you feel safer, and you are getting the same results anyway, there is nothing wrong with running it as a nonparametric for the t-test. You may have more trouble with the manova though–I don’t know of a nonparametric equivalent.


February 20, 2013 at 3:47 pm

There is a good discussion of what to do when the variances are unequal here: http://beheco.oxfordjournals.org/content/17/4/688.full and it presents a good solution that holds for unequal n.

I have a simulation that lets you explore the issue for the test that assumes homogeneity of variance here: http://onlinestatbook.com/2/tests_of_means/robust_sim.html

and a discussion of unequal n in multi-factor designs here: http://onlinestatbook.com/2/analysis_of_variance/unequal.html


February 4, 2013 at 9:50 am

Good morning Karen,

Great site! I have a few questions: My data: gender comparisons re knowledge, attitudes, beliefs. Male n=68, female n=263.

1) I am running multiple regression, t-test and MANOVA. I want to know if I need to run non parametrics to account for the unequal group n’s? Doesn’t the Central Limit Theorem kick in due to my large sample sizes?

2) In my MANOVA, my Levene’s test shows two variables that are significant at both the.05 and .01 levels. Should I not use MANOVA and look at other tests instead?

Thanks so much for your advice, Anne

February 6, 2013 at 2:31 pm

1) You can run nonparametrics, but it’s usually not necessary. It’s hard to say what you need to do in any specific situation without all the details.

2) I’m not sure I understand this question, and as for what you should do, see my response to #1. If you want to restate that, I can give you some info so you can decide what you should do. 🙂


January 28, 2013 at 5:33 pm

Dear Karen, I’m doing a paper on unequal sample sizes and I was wondering if you could give me the reference you’re citing… Keppel (1993) I’m looking for it, but I don’t seem to be able to locate it.

Thanks! Alex

January 29, 2013 at 5:12 pm

Hi Alex, It’s Design and Analysis: A Researcher’s Handbook, 3rd ed. There is a more recent edition, coauthored with someone I can’t remember.


January 20, 2013 at 5:50 pm

I’m hoping for help on dealing with unequal sample sizes for logistic regression analysis. Any assistance is greatly appreciated.

I’m currently working with data from two types of media user (single media user: N=182 and multiple media user: N=1963). We will be using demographics (such as age, gender, and race) to predict these two types of media user.

So basically, my DV is 0=single media user and 1= multiple media user, and my IVs are age (continuous), gender (dichotomous) and race (dichotomous). However, due to the extremely unequal size of the two categories of my DV, I doubt if I can still use SPSS to run logistic regression.

Thanks so much for answering in advance!

January 22, 2013 at 4:59 pm

You can still use logistic regression in SPSS.

The only problem may be an issue called zero cell counts. It occurs when one category of a categorical IV never co-occurs with a category of the DV. It’s common when both the IV and the DV have lopsided sample sizes. So for example, if you have no men who were single media users, you’d have a problem. Other than that, it’s fine.
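A quick way to screen for zero cells before fitting the model is a crosstab of each categorical IV against the DV. The sketch below uses simulated data with roughly the commenter's single/multiple split; the variable names are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Simulated version of the media-user data (names hypothetical): check
# each categorical IV against the DV for empty cells before modeling.
n = 2145
df = pd.DataFrame({
    "multi_user": rng.choice([0, 1], size=n, p=[0.085, 0.915]),
    "gender": rng.choice(["male", "female"], size=n),
    "race": rng.choice(["white", "nonwhite"], size=n, p=[0.8, 0.2]),
})

for iv in ["gender", "race"]:
    table = pd.crosstab(df[iv], df["multi_user"])
    print(table, "\n")
    if (table == 0).to_numpy().any():
        print(f"Warning: zero cell count for {iv}")
```

If any cell is zero, the corresponding coefficient cannot be estimated and the logistic regression will show huge standard errors or fail to converge.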


December 19, 2012 at 7:47 am

Hi, never mind my last post. I had my data wrong; now I’ve found the error and the ANOVA with Type III SS works fine with my unbalanced design.

Thank you anyway for all your previous posts

December 19, 2012 at 5:29 am

Hi Karen. I have been reading the previous posts and they have cleared up many doubts. Thank you. However, I still cannot perform a 2-way ANOVA with unequal sample sizes. I tried with Minitab and I am now using SPSS. My experiment analysed growth in 2 algae depending on community density and neighbor identity. Factor density has 2 levels (n = 13 for each). Factor neighbor identity has 4 levels (level 1, n = 6; level 2, n = 7; level 3, n = 6; level 4, n = 7). In SPSS I did a univariate GLM but the output is not complete. It doesn’t give me the interaction, nor the density F-value. Am I doing something wrong? I thought there would be no problem doing an unbalanced 2-way ANOVA. Or is it only possible to do a 1-way ANOVA? Thank you for this blog


December 9, 2012 at 11:00 pm

I am so impressed by your suggestions to researchers!!! I am also doing research for a master’s study and have little knowledge of stats, especially these tests. My problem is whether I can run a one-way ANOVA, since I have three groups of respondents with 10, 80 and 18 samples. And since my questionnaire is based on a Likert scale, can I do factor analysis for this research? If possible, can I have any references regarding factor analysis and one-way ANOVA?

Thanks for this useful blog…

Again thanking you!!! Kind regards Banira

December 12, 2012 at 12:03 pm

Thanks, Banira!

The inequality of the sample sizes is not a problem. 10 is small in any analysis, whether the other group has 11 or 80. That doesn’t change.

Here’s something about factor analysis on Likert scale data. I can’t evaluate whether it’s appropriate based on what you’ve told me; it depends on many things, particularly what you’re trying to find out from it: https://www.theanalysisfactor.com/can-likert-scale-data-ever-be-continuous/

December 6, 2012 at 5:20 pm

The small, medium, and large holding sample sizes are unequal (190, 122, 58).

December 6, 2012 at 5:17 pm

I am really surprised reading the content, questions, and the responses on this post. This led me to ask my own question here. I am working in the agriculture sector, looking at the health effects of pesticide use and its associated costs, say the negative costs of pesticide use. Now I want to see whether it varies by land size, so I categorized into three groups: small, medium, and large holdings. I know the appropriate analysis could be one-way ANOVA, but the challenge for me is the post hoc test. How do I decide whether the data violate the equal variance assumption? If they do, should I go with ANOVA or the Welch test? I have sometimes read about robust mean tests. Waiting for your response. Regards, Kancha

December 12, 2012 at 11:53 am

There are a number of ways to check the homogeneity of variance assumption, and remember none of them is a definitive test. I tend to favor graphs over tests, because the tests are problematic.

November 29, 2012 at 1:53 pm

First of all, thank you for this post. I do, however, have a data analysis problem that hopefully you can help me with:

My research question is whether or not prescribed burning as a site preparation treatment affects growth of chestnut trees. I have 10 control and 10 burned plots, each with 3 subplots; each subplot was planted with 3 seedlings of a specific variety of chestnut tree (Chinese, American, Backcross 1, Backcross 2, Backcross 3). All 20 plots contained a subplot of Backcross 3 (60 seedlings total), while half of each treatment contained Chinese and American, and the other half contained the backcrosses. So my inherent study design (when controlling for variety) results in my growth data being unbalanced. After 3 years of study, some of the seedlings died, adding to the imbalance.

I have used least square means in conjunction with an analysis of variance to analyze the first year’s data; after 3 years of data I no longer have access to program (SAS) where I can analyze the data in this way.

Any ideas for how I can analyze this dataset? Does the imbalance affect my analysis of variance? Can I use a linear mixed effects model to analyze the data, and will it account for the imbalance?

Cheers, Jenny

December 6, 2012 at 11:45 am

As a general rule, linear mixed models are better than ANOVA in this kind of design. Especially when the data are unbalanced. Linear mixed models can deal with the unbalanced data much better.

You will need some sort of statistical software to analyze it. If you don’t have access to SAS, you can always use R. It’s free.

November 15, 2012 at 5:07 pm

Thank you for all this information you have made available to everyone. I have some trouble finding a solution to a set of measurements I have to make.

I have one grouping variable with 5 levels. I also have a number of continuous variables that I want to examine. The histograms show approximately normal distributions for all but one of them, which looks a bit positively skewed. The first problem is that, unfortunately, the sizes of the groups differ a lot (33, 70, 324, 258, 245). The second problem is that for two of the continuous variables, Levene’s test shows the variances differ highly significantly across the groups.

So the test I would like to run is a one-way ANOVA with a post hoc Scheffe, but I am pretty sure that the ANOVA assumptions are being violated here.

What I’ve tried: separate t-tests, but only between the 3 biggest groups and only for the continuous variables that have similar variances across the groups. I receive significance in all cases (for “equal var. not assumed” too). The problem is: a) is a t-test appropriate for n > 200? b) can I run a t-test for n1 > 200 and n2 = 30?

For the cases where the variance was significantly unequal, I tried a Mann-Whitney test. The problem (?) with this is that I receive very high U-values (n1=254, n2=342, U=27909, p<.000). Is such a U-value normal, or have I done something wrong?

Finally, I tried a Kruskal-Wallis test, but I can’t find a way to get the information that a Scheffe test would give me…

So to conclude: 1) How should I deal with highly unequal samples like 30×254? 2) If ANOVA is inappropriate, how should I replace the “missing” Scheffe test? 3) Should I prefer separate t-tests or U-tests instead of parametric or non-parametric ANOVAs?

Thank you very much in advance!

December 6, 2012 at 1:12 pm

It’s a little hard for me to give specific advice when results seem “funny” without seeing the data, but a t-test wouldn’t be any better than an ANOVA. I would suggest the Kruskal-Wallis, followed by a Bonferroni correction. Yes, it’s more conservative than Scheffe, but you’re not doing a lot of comparisons, so it shouldn’t be too bad.
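On the question above about whether U = 27,909 is plausible: U ranges from 0 to n1 × n2, which is 254 × 342 = 86,868 here, so large U values are entirely expected with large samples. A minimal sketch of the computation (midranks for ties, the standard U1 formula; illustrative only, not what any particular package reports):

```python
def midranks(values):
    """1-based ranks, with tied values sharing the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """U1 = R1 - n1(n1+1)/2, where R1 is the rank sum of group x
    in the pooled sample. U1 ranges from 0 to len(x) * len(y)."""
    r = midranks(list(x) + list(y))
    n1 = len(x)
    return sum(r[:n1]) - n1 * (n1 + 1) / 2

print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # 0.0: every x below every y
print(mann_whitney_u([4, 5, 6], [1, 2, 3]))  # 9.0 = n1 * n2, the maximum
```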

November 14, 2012 at 6:33 pm

I am looking for a good way to test for homogeneity of variance before conducting a one-way ANOVA, but I have run into some trouble. My sample size is 233 and the group sizes are unequal (66-76-91). The result of Levene’s test was p = .028, but I read that it yields significant results rather quickly when large samples (N = 233) are used. Therefore, I was going to do Hartley’s Fmax as a double-check for homogeneity, but I believe it requires equal sample sizes. Do you know a good test for homogeneity of variance in this particular case?

Kind regards Patrick

November 16, 2012 at 12:05 pm

Hi Patrick,

Keppel (1993), “Design and Analysis,” says Hartley’s test is problematic because *it* is affected by heterogeneity of variance and non-normality. Instead of a test, he suggests just checking whether the Fmax is > 3. If it’s lower, simulation studies have shown there’s no effect on the p-values. When Fmax is > 9, the ANOVA F test becomes highly problematic.

All tests of assumptions tend to be over-sensitive in large samples.
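That rule of thumb is easy to check directly from the group variances. A minimal sketch with made-up data (Python’s standard library computes the sample variance):

```python
from statistics import variance

def fmax(*groups):
    """Hartley's Fmax: largest sample variance over smallest."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)

# Hypothetical groups
g1 = [4.1, 5.0, 5.9, 4.8, 5.2]
g2 = [3.9, 6.2, 5.1, 7.0, 4.4]
g3 = [5.0, 5.3, 4.9, 5.1, 5.2]

ratio = fmax(g1, g2, g3)
if ratio < 3:
    print("variances close enough by the Fmax < 3 rule")
elif ratio > 9:
    print("heterogeneity likely distorts the ANOVA F test")
else:
    print("gray zone: inspect the groups more closely")
```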

November 5, 2012 at 5:20 pm

Hi Karen, I am new to using ANOVA in Minitab. Please help me with a problem I am facing. I need to conclude in my document that lot size (in the production department) does not affect the critical quality attribute of a product (here the product is a pouch). The critical quality attribute is the burst test of a given pouch. I have taken a first lot size of 20,000 pouches and a second lot size of 6,000 units. From the first lot, 35 burst values were calculated, and from the second lot, 30 burst values. May I know how to do a step-by-step ANOVA and how to interpret its values? I am not sure whether I can post my values for both lot sizes; please let me know. Your suggestions and help are greatly appreciated. Thank you, Karen, for this blog.

November 5, 2012 at 8:24 pm

Hi Anushka,

I’m not entirely sure I understand your analysis. When you say there were 30 burst values, does that mean 30 of the pouches burst? Or is there some burst value that you measure whose mean you are calculating? This is important because ANOVA is not appropriate in the first situation, and only possibly in the second, if you’re trying to compare those means across the lots (or something else). You may want to sign up for a Quick Question Consultation; I’d be happy to help you once I understand the situation thoroughly.

Thanks, Karen

November 1, 2012 at 10:54 am

Thank you Karen for your assistance 🙂

Regards, Hina

October 30, 2012 at 4:19 pm

Thanks for taking the time to share your obvious expertise.

My question relates to planned contrasts in one way ANOVAs. My data has 11 groups and is unbalanced. The groups range in size from n=5 to n=11. My homogeneity of variance looks good across groups.

My uncertainty is I would like to run a planned contrast for one group and the means of the remaining 10 groups, but given that the groups are unbalanced, would a planned orthogonal contrast work in such a circumstance? I am only interested in one planned contrast rather than a post hoc analysis of groups. I can’t find any literature on this issue.

Thanks for whatever help you can provide in helping me move forward.

November 5, 2012 at 8:27 pm

It would be fine–this isn’t uncommon. When you have only one contrast, there is nothing to be orthogonal with (a set of contrasts can be orthogonal).

October 23, 2012 at 7:05 am

Dear Karen, First of all I want to thank you for your help! I really appreciate what you’re doing here! I also have a question about how to proceed statistically with my data. Here is some information about it:

-I isolated cells from 16 individual patients (so I have 16 groups) -I let cells migrate and measured the distance after 24h for each group -unequal group sizes (going from n1=20 to nx=46) -no homogeneity of variance (Levene’s test: H0 is significantly rejected)

So I started by doing a non-parametric ANOVA (the Kruskal-Wallis chi-squared test). H0 is rejected.

-Q1: Can I therefore assume that there are groups that are not from the same population?

-Q2: what would be here more convenient as a non-parametric post-hoc test?

-Q3: I’m still not sure about the distribution of my data. I tried to find out with histograms, but I don’t have enough groups to confirm a non-parametric distribution. How can I find out? (I know, I should have asked this first)

Again, thank you very much for your help. Any other advice to proceed is very welcome!!! Joel, Switzerland

October 23, 2012 at 4:39 pm

Ah, this might be a question that needs a consultation. It sounds like you’re trying to compare patients to each other. Is that right? If so, is that really what you want to do?

And to answer #3 specifically, the whole idea of nonparametric tests is the distribution doesn’t matter. They’re sometimes called distribution-free tests.

October 3, 2012 at 8:27 am

Hello, you mentioned Keppel (2003). Could you give us the whole citation of this work as well? I’ve searched for it, but unfortunately couldn’t find anything. Thank you!

October 4, 2012 at 9:01 am

Sorry about that–it’s 1993. The title is “Design and Analysis: A Researcher’s Handbook.”

October 10, 2012 at 4:53 am

Oh, 1993 – my mistake. Thanks!

October 23, 2012 at 3:31 pm

Hi Tamara, no my mistake! I just fixed it. 🙂

September 27, 2012 at 11:00 am

I am running a mixed design ANOVA: IV – participant nationality (2), poser nationality (2), configuration (3), emotion (4). DV – Reaction time, accuracy

To test my predictions, I have run two 3-way ANOVAs: participant nationality x poser nationality x configuration, and participant nationality x poser nationality x emotion.

The procedure is that all participants view the same stimuli and respond using a key press response, but when analysing the data, i’m interested in seeing how the participant nationality interacts with the other variables.

I have discovered significant Levene’s results in my analysis, and I’m not quite sure what to do. I have been trying to consult my SPSS textbook, but it is not very helpful. I know that with the one-way ANOVA I can just ensure that the Welch F ratio is included in the analysis; however, when computing using the repeated measures design, I do not have this option.

Do you know what else I can do?

Regards, Tien

October 23, 2012 at 4:16 pm

I don’t know how large your samples are, but Levene’s test alone can be inaccurate. I would start by doing some graphs to see how far off the variances are and check the rule of thumb of largest variance/smallest variance < 9.

You’re right that Welch is only available with one-way ANOVA, at least in SPSS. Your options are a rank transformation (see http://www.bio.ri.ccf.org/robrien/IntroBiostat/RankAsBridge.pdf ) or weighted least squares.

September 20, 2012 at 2:54 am

Hi, I have two problems. 1. I have two age groups, early adulthood (n = 45) and middle adulthood (n = 45). Can I run an independent t-test to compare social support? I have read that for applying a t-test the sample size must be less than 30, and my sample size is 45 in each group. 2. I want to run a one-way ANOVA to see how education affects our coping. I have divided education into 7 groups, and each group’s sample size is different. Should I apply the Gabriel or Tukey post hoc test to see which groups differ? Please reply soon. Thanks, Hina

October 23, 2012 at 3:42 pm

1. The sample size does not have to be less than 30 for a t-test. Go right ahead. 2. I don’t remember Gabriel off the top of my head so I looked it up and found this very nice response to a similar question: http://listserv.uga.edu/cgi-bin/wa?A2=ind0002&L=spssx-l&P=39625

September 18, 2012 at 5:41 am

Thank you so much Karen, I searched on internet and found you post, so glad someone knows my problem:

Could you please take a look at my example:

I measured Blood Pressure of 31 subjects.

I grouped these 31 subjects according to their Sex (Male/Female = 21/10) and Smoking Status (Yes/No = 6/25).

Thus I have 4 groups:
Male smokers: 4
Male non-smokers: 17
Female smokers: 2
Female non-smokers: 8

I wonder if I can use the “General Linear Model” to analyse the fixed effects of Sex and Smoking Status (and their interaction) on Blood Pressure?

Since I have small unequal sample size in each group.

Or if I can’t use the GLM, what other methods are good?

Thousands of thanks from South Korea~! ^____^

October 23, 2012 at 4:33 pm

Well, you can run it, but I would take the results with a grain of salt. Your sample sizes are pretty low in general. I suspect you’d be fine with only main effects, but you’ll have trouble with an interaction term.

A nonparametric test might be a better option.

September 17, 2012 at 3:15 pm

Hi Karen, I have a dataset with species richness separated by forest type. The sample sizes are (49, 61, 256). They are normally distributed, but Levene’s test shows inequality of the variances. I’ve run the Kruskal-Wallis test on the data, which showed a significant difference. However, I wanted to see exactly which forest types were different from each other. After running a Mann-Whitney test, the results showed that none of the paired types were significantly different. I’m confused because this contradicts the Kruskal-Wallis test. Should I run a Kruskal-Wallis test on each of the pairs, or is there a better post hoc analysis? Thanks! Amanda

October 23, 2012 at 3:59 pm

Did you add a Bonferroni correction to the Mann-Whitneys? It shouldn’t contradict the Kruskal-Wallis if you didn’t, but it’s always possible, especially if the original p-value wasn’t especially low and the MW p-values aren’t too high.

Here are some options for post hoc tests on nonparametrics: http://www.talkstats.com/showthread.php/4634-non-parametric-post-hoc-tests

Otherwise, you might want to try a Weighted Least Squares ANOVA, using the inverse of the variance of each group as the weight. It’s a good option when variances are unequal but normality is met.
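For readers who want to see the pieces involved, here is a minimal sketch (illustrative data, no tie handling) of the Kruskal-Wallis H statistic and the Bonferroni-adjusted alpha for the follow-up pairwise tests:

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H from raw data (assumes no tied values)."""
    pooled = sorted(x for g in groups for x in g)
    rank_of = {v: i + 1 for i, v in enumerate(pooled)}
    n = len(pooled)
    sum_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * sum_term - 3 * (n + 1)

h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h, 2))  # 7.2

# Bonferroni: divide alpha by the number of pairwise comparisons
k = 3
m = k * (k - 1) // 2
alpha_adjusted = 0.05 / m  # roughly 0.0167 for 3 groups
```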

August 28, 2012 at 6:03 am

Hi Karen, I have a sample of ticks and human samples that I will test for Babesia, Q fever, and Rickettsia. What is the appropriate statistical test? I have been reading for the last 5 days and still can’t figure out what to use. Thanks

September 11, 2012 at 4:30 pm

You haven’t given me enough information to answer. I would need to know the research question, how the variables are measured, and the study design to even get started.

Please feel free to give me more details.

August 6, 2012 at 11:34 am

Just a simple question with regard to unequal group sizes, and which statistical test to choose.

I have two samples, and I’m trying to find out whether I should use a Mann-Whitney U test or an independent samples t-test. I was wondering if you could point me in the right direction. Sample 1: n=62. Sample 2: n=38.

Are the group sizes so unequal that I should use the Mann-Whitney U test, or is the difference small enough that I could use a t-test? Answers are given on a 5-point Likert scale.

I look forward to hearing from you.

August 3, 2012 at 2:45 pm

Hmm, that’s a big question. At best, it won’t answer the research question. At worst, it will answer it, but wrongly.

Was there a specific situation you’re referring to?

August 2, 2012 at 4:28 pm

How might choosing the wrong analysis affect the results of your study?

July 10, 2012 at 12:23 pm

In your above post you mention using the Tukey Kramer as a post hoc test for unequal n’s…would the Fisher’s Protected t-test also work? I need to conduct a post hoc analysis on the interaction effects of a Two-Way ANOVA. Since SPSS can only do post hoc analyses for main effects I have to do it by hand. My stats book recommends Fisher’s for a One-Way ANOVA, but it doesn’t say whether or not it will also work for the Two-Way ANOVA. If it is suitable to use, do I still use degrees of freedom within when determining my critical value?

I appreciate your guidance!

July 13, 2012 at 11:39 am

Most post-hoc tests are indeed set up for main effects. The usual way to interpret interactions is to use simple effects tests. See https://www.theanalysisfactor.com/interpreting-interactions-when-the-f-test-and-the-simple-effects-disagree/

I don’t know if Fisher’s protected t-test is the same as Fisher’s LSD–if so, it’s not really adjusting the family wise error rate. I’d have to see your text to be sure.

Depending on which software you use, you may be able to get adjusted mean comparisons using the EMMeans (SPSS) or LSMeans (SAS) statement instead of post-hoc.

June 20, 2012 at 12:40 pm

I have a quick question regarding unequal sample sizes in ANOVA. I am running a test on a continuous outcome score in 5 different groups. The group sizes are very unbalanced (Group 1=98, Group 2=366, Group 3=180, Group 4=22, Group 5=10). I want to know if the mean outcome scores are significantly different between the groups, AND which ones are different, using SPSS. Am I wrong to assume that I can use ANOVA with a Tukey post hoc for this? Thanks!!!

June 25, 2012 at 12:04 pm

The Tukey-Kramer test is an adaptation for Tukey when sample sizes are unequal.

Here is a nice listing of the various post-hoc tests. http://www.uky.edu/ComputingCenter/SSTARS/www/documentation/MultipleComparisons_3.htm

June 8, 2012 at 10:59 am

I am using the GLM procedure in SPSS to examine the association between daily physical activity and different measures of physical fitness while adjusting for a few variables (age, gender, and school) in a sample of about 600 children from whom we have adequate physical activity data. For some of my models, Levene’s test is significant. I was wondering if this should be a concern given the relatively large sample size. If it is, I have heard it might be a good idea to compute the Welch statistic, so I was wondering how to get that statistic in IBM SPSS 19.

Many thanks,

June 19, 2012 at 7:32 pm

With a sample of 600, I’m sure Levene will find even very small differences in variances significant. I don’t know of a way to compute a Welch adjustment in GLM, but the one-way ANOVA procedure has it.
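For the curious, the Welch statistic that one-way ANOVA procedures report can be sketched from the standard formula (Welch, 1951): weight each group by n/s², then adjust the F ratio and its denominator degrees of freedom. A standard-library-only illustration, not SPSS’s implementation:

```python
from statistics import mean, variance

def welch_anova(*groups):
    """Welch's one-way ANOVA: returns (F*, df1, df2).

    Groups are weighted by n / s^2, so no pooled-variance
    (homogeneity) assumption is needed."""
    k = len(groups)
    ns = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    w = [n / variance(g) for n, g in zip(ns, groups)]
    w_total = sum(w)
    grand = sum(wi * mi for wi, mi in zip(w, means)) / w_total
    a = sum(wi * (mi - grand) ** 2 for wi, mi in zip(w, means)) / (k - 1)
    lam = sum((1 - wi / w_total) ** 2 / (n - 1) for wi, n in zip(w, ns))
    b = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    return a / b, k - 1, (k ** 2 - 1) / (3 * lam)

# Hypothetical groups; compare F* and df2 to the ordinary ANOVA output
f_star, df1, df2 = welch_anova([1, 2, 3], [4, 5, 6])
```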

May 22, 2012 at 12:29 pm

Karen, I am currently conducting research for my master’s thesis and have run into a problem as to which statistical test to run. My research compares recruiting distance before and after. I am comparing the recruiting distance of schools that have won an NCAA football national championship, from three years prior to three years after. I am trying to see if there is a significant increase in recruiting distance because a team won a national title. Obviously there are unequal sample sizes due to the difference in the number of recruits each year, both pre-championship and post-championship. I have used both the paired samples t-test and the Wilcoxon signed-rank test. It seems these tests do not account for unequal sample sizes, which I feel may be throwing off my data. Do you have any suggestions? I am comparing those distances before and after for 5 teams.

May 25, 2012 at 2:03 pm

Hi Michael,

All 5 of these teams won a championship? i.e. is there a control group, say the teams who made it to the championship, but lost?

And are you taking year into account, or is it just split into before and after?

I’m going to assume you have no control group and one total. And because this is a masters thesis, I’m going to try to keep this as simple as possible.

What it sounds like to me is a paired test won’t work. Let’s say school #1 has 80 recruits before and 120 after the championship. Despite being from the same school, the actual recruits don’t match up. In other words, recruit #1 before has nothing to do with recruit #1 after. Each recruit has their own distance. What you’ve got is what’s called a randomized block design. The recruits are blocked by school.

What it means is you need to do an ANOVA, using school as a random factor. And because of the unbalanced data, you want to use a linear mixed model procedure, rather than an ANOVA. It has better algorithms for unbalanced data.

One resource I can recommend is a webinar I did on Fixed and Random Factors. If you follow that link, the recording is free. There’s some background there.

Any good Design of Experiments book will have info as well. I like Geoffrey Keppel’s.

Actually running a Mixed model in Mixed can get tricky, but you have a very straightforward design, so this won’t be a hard one for someone with experience.

May 22, 2012 at 11:16 am

Hi Karen, I have conducted a socio-economic study where I collected information from 140 people near a main road and 50 people who are away from the main road. I have done Kruskal-Wallis tests (because of non-normality) to examine differences in, e.g., total income and education between the two categories, so as to say something about the effect of the road. I checked the homogeneity of variance assumption and everything seems OK. I have received some comments that my data are biased and am seeking your view. From reading your responses above, I would think what I have done is fine. Kindly comment. Thank you.

May 25, 2012 at 1:48 pm

It’s hard to comment on exactly what they mean without seeing the comments of why they’re biased.

However, I suspect it’s not a problem with your choice of statistical analysis method. It’s a problem with your sampling, and therefore your conclusions. Clearly, people can’t be randomly assigned to living near or far from a main road.

So yes, it’s fine to use a K-W test to show that distributions of income, education, etc. differ in the two areas. What isn’t fine is saying the differences are due to an effect of the road. There could be other differences between the two groups, or the effect could be in the reverse direction. Perhaps people who have higher educations, incomes, etc. tend to choose to live away from a main road because they can.

You just can’t tell given your research design.

June 20, 2012 at 11:28 am

Thank you Karen. This has been very useful. I really like the clarity in your responses. Makes the statistics understandable. Thank you again.

May 16, 2012 at 6:27 pm

I have a data set with one control and two treatments. Basically, three groups of cows were fed a control diet, a contaminated diet, and a contaminated diet with additive. I have the following sample sizes: control = 2 cows, trt1 = 5 cows, and trt2 = 5 cows. These are pretty small numbers to begin with, but we were limited by money (yay research!). I’m using Proc Mixed in SAS for the ANOVA, but after reading some of the comments above, I’m not sure I’ve done this correctly. Can you offer some advice on the proper way to analyze these data?

Thanks Stephanie

May 17, 2012 at 10:44 am

Hi Stephanie,

Proc Mixed might be fine–you haven’t given me enough information. Is there a reason for mixed, like repeated measures on each cow or randomized blocks?

May 5, 2012 at 12:30 am

Hey Karen, thank you for posting this article and for taking the time to respond to so many of your readers’ questions. I’m really impressed with that and will be checking out more of your website. Cheers

May 17, 2012 at 10:43 am

Thanks, Remy, for the kind words.

April 17, 2012 at 11:21 am

I noticed in one of your replies that you never use Levene’s test. So I just wondered how you test the homogeneity of variance as a statistical consultant, since Levene’s test is known to be affected by sample size.

Another question: I’m working on a project involving one-way ANOVA. Basically, we want to compare students’ outcomes under seven different instruction methods. Since we have unequal sample sizes, the way we chose to analyze the data is to test the homogeneity of variance first; if the assumption is met, we go with the normal ANOVA F and Tukey as post hoc. If the assumption is not met, we go with Welch’s F plus Games-Howell as post hoc. Is this correct? Any thoughts are greatly appreciated! Thank you.

March 31, 2012 at 8:14 am

Hello Karen,

I wonder if you can help:

I’ve conducted 8 2x2x2x2-way between-subjects ANOVAs. The sample size is 179. There are approximately similar numbers of participants in each level of the independent variables and across the 16 combinations of the IVs.

I have 8 dependent variables. For some of the ANOVAs, the Levene’s Test is significant, for others it is not. When it is significant, I have used all expected ways of transforming the DV, without success.

Because the Levene’s test is sometimes not significant on the same sample, does this mean that the comment you made above about Levene’s sometimes being significant in large samples does not apply to my data? (Would 179 participants be considered a large sample?)

There is no alternative non-parametric test for a 4-way ANOVA, so I’m unsure what to do. Any advice you could offer would be greatly appreciated.

Kind regards

April 2, 2012 at 10:16 am

I would suggest using other ways to check for non-constant variance other than Levene’s. I don’t think you have an issue of a sample being too big–you’ve got only slightly more than 10 per condition.

And transformations are really only useful for non-constant variance if you also have non-normality. Otherwise the normality will be messed up. (that’s a technical term).

I have a whole workshop on this, which you might want to look into: Assumptions of Linear Models

Best, Karen

February 15, 2012 at 9:44 am

Hi Karen, I was just wondering if you have any suggestion as to how to further interpret findings if the variance is unequal (Levene is highly significant, groups are large >300) when conducting an ANCOVA in SPSS. There seems to be no way to obtain Welch or Hochberg when a covariate is included (age)…Do you have any suggestions?

Kind regards Sindre

February 24, 2012 at 8:02 pm

You could always run a weighted least squares model. You can put the weight in either GLM or Regression procedures.
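As a sketch of how that weight variable could be constructed (hypothetical data; in SPSS the analogous computed variable would go into the WLS Weight box of the GLM or Regression dialog):

```python
from statistics import variance

def inverse_variance_weights(values, group):
    """One weight per observation: the reciprocal of its group's
    sample variance. This is the usual WLS weight when variances
    differ across groups."""
    var_by_level = {
        lev: variance([v for v, g in zip(values, group) if g == lev])
        for lev in set(group)
    }
    return [1 / var_by_level[g] for g in group]

values = [1, 2, 3, 2, 4, 6]
group = ["a", "a", "a", "b", "b", "b"]
print(inverse_variance_weights(values, group))
# [1.0, 1.0, 1.0, 0.25, 0.25, 0.25]
```

Observations from the noisier group (here "b") get smaller weights, so they pull less on the fitted means.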

February 10, 2012 at 6:08 pm

The way it works is that any means that are NOT significantly different in the post-hoc tests get the same letter superscript.

Let’s say the post-hoc results were simple, where M1 indicates the mean of group 1:

M3 < M1=M2=M4=M6 < M5=M7

They would be labelled this way in the table (sorry, I can’t get the superscripts in the comments, so just pretend the letters are up):

M1a M2a M3b M4a M5c M6a M7c

Where it gets tricky is when there’s overlap, which is very common with 7 groups. So let’s say, for example, we have this more complicated situation:

M3 < M5=M7
M3 = M1=M2=M4=M6
M1=M2=M4=M6 = M5=M7

So the highest and lowest means are significantly different from each other, but the ones in the middle don’t differ significantly from anything. The means would be labelled like this:

M1a,b M2a,b M3a M4a,b M5b M6a,b M7b

So M3 is in a different group than M5 and M7. But M1, for example, has the same subscript as both M3 and M5 because it overlaps them.

Hope that helps! Karen

February 10, 2012 at 12:55 pm

I would like to know how the different letter superscripts are used in a post hoc test. Can you suggest reading material with examples of when and how to use different letter superscripts when 7 treatments are considered and the level of significance varies in at least 4 of the paired means?

January 31, 2012 at 3:53 pm

You could run a two-way anova as is without the interaction on this. The problem subcategories are the ones with 1 and 0 people in them.

The other alternative, if the interaction seems necessary, is to collapse the experience variable into fewer categories.

I would suggest graphing the means to see if the interaction is important, and if not, leave it out. If it is, you’d be better off collapsing.

January 22, 2012 at 11:45 pm

I just want to know if I could actually use a two-way factorial ANOVA for this.

I have two groups: Device 1 (n=27) and Device 2 (n=28). In each group, I have 5 subcategories of participants (very low, low, moderate, high, and very high experience of playing games). For the Device 1 group I have 9, 8, 5, 2, 2, and 1 in each subcategory. For Device 2, I have 7, 4, 7, 8, 2, 0 in each subcategory. Can I use a two-way ANOVA for this? Or should I just provide descriptive analysis? The main objective of the experiment is to see if there is any difference in the participants’ total scores when playing games on Device 1 or 2.

December 9, 2011 at 2:00 pm

I never use Levene’s test. With large sample sizes, it’s almost always significant. With small sample sizes, it’s almost never significant.

So it’s not very helpful. Geoffrey Keppel’s book Design and Analysis of Experiments has a good section on this.

Or if you want a full explanation and demonstration about assumptions, what they mean and better ways to check them, I would actually recommend my workshop on assumptions in linear models. We have a home study version and you can get more information at: http://www.theanalysisinstitute.com/workshops/GLM-Assumptions/index.html

November 27, 2011 at 6:02 am

Hi Karen, I have run a two-way ANOVA (2 by 2 factorial design) and obtained a significant Levene’s test, p = .012. I have adjusted the critical alpha for interpretation of significance for both the main and interaction effects. However, I was wondering what practical methods can be used in future studies so that Levene’s is not violated, and are you able to give me some references?

Also, with another 2 by 2 factorial design that reveals a significant interaction, I am aware that follow-up simple effects are required. I have used the split data method in SPSS and recalculated the F statistic using the overall MSE. Is there a need to control for Type 1 error by using Bonferroni’s?

November 9, 2011 at 6:39 am

Hi again Karen, and thanks!

I see your point, her (Erika's) dependent was categorical. Mine, however, are not. That I am sure of. However, a greater concern for me is that my sample sizes vary considerably: group 1 = 464, group 2 = 444, and group 3 = 24.

My problem is that even though an ANOVA shows significant differences for the three groups on a specific dependent variable, and the largest calculated mean-difference is between the smallest group and one of the others, post hoc tests cannot tell apart the smallest group from the group where the largest mean-difference appears.

Standardized mean scores for the groups: Group 1: -.08 (a) Group 2: .27 (b) Group 3: -.11 (ab)

Currently, I use Hochberg’s GT2 post hoc test, as it, I have read, is quite robust to violations of homogeneity of variance. I also, where indicated by the Levene’s test, modify the p-values using the Welch modification.

I know that this may be a lot to ask but I wonder whether you think I could benefit from bootstrapping or if such a procedure will not help me as the ratio among my three groups will not differ?

Best Viktor

November 8, 2011 at 11:48 am

Hi Karen, I am interested in your 3rd response to Erika the 7th December 2010. You write Erika should not use ANOVA as the response is categorical, not continuous.

Do you mean that because of Erika’s design, a control: n=60; dose 1 n=114; and dose 2 n=175, it is inappropriate to use ANOVA here?

I am interested in this as I have similar conditions; a grouping variable with three categorical (depending on viewpoint) responses, a (very) unbalanced design, and for some dependents, unequal variances. Hence, I wonder, what analysis would be appropriate if I conclude my response is categorical, rather than a continuous?

November 8, 2011 at 12:18 pm

Good question. No, the control/Dose 1/Dose 2 variable is her Independent Variable. It’s totally appropriate to have grouping (ie. categorical) variables for the independent variable.

In Erika’s study, her Dependent Variable (aka Response Variable or Outcome) is ALSO categorical: Is the Presence of the plant the same after as it was before: Yes or No.

ANOVA compares means of the Dependent Variable across the categories of the Independent Variable. Since there is no way to calculate a mean of Yes/No, you can't use ANOVA.

So I’m not sure based on how you’ve described your study whether your dependent variable is indeed categorical. You mention unequal variances, which makes me think they really are numerical.

Here are a few posts that might be helpful:

When Dependent Variables Are Not Fit for GLM, Now What?

6 Types of Dependent Variables that will Never Meet the GLM Normality Assumption


September 14, 2011 at 5:42 am

Hi, in my test I am comparing a single variable among 3 groups having different sample sizes. Can I do one-way ANOVA in spite of the unequal sample sizes?

September 16, 2011 at 8:34 am

September 14, 2011 at 4:27 am

Hello again, we have unequal sample sizes. Thank you again!

September 15, 2011 at 1:30 pm

Hmm, we may be past your deadline anyway, but in any case, I’d need more information about what you need. The fact that you have unequal sample sizes in the ANOVA isn’t problematic. Just run it as you would any 2×2 ANOVA. If you need help running a 2×2 ANOVA in SPSS, I can tell you to use Univariate GLM. If you need more detail than that, I need a better idea of what you understand already and what you need help with. 🙂

September 14, 2011 at 4:08 am

Good day! This is very urgent: we have a report to submit tomorrow and our research design is a two-way (2×2) factorial ANOVA design. We don't know how to produce the results in SPSS. Thank you!


September 8, 2011 at 7:24 pm

I have some data that gives the amount of time taken by three different surgeons to undertake a specific procedure. Given that I have a varying number of data points for each surgeon (e.g. 50/40/25) and that there may be unequal variance (e.g. slower surgeons having a greater variety of recorded times), what is the best way to figure out if there are significant differences in the time taken by each surgeon?

September 15, 2011 at 1:26 pm

I would start by seeing if the unequal variances are large enough to cause problems. If they are, with a one-way analysis like that, you could easily just run a nonparametric test.
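For a one-way layout like the surgeon comparison, the Kruskal-Wallis test is a common nonparametric choice. A minimal sketch in Python using `scipy.stats.kruskal`, with made-up operation times (the commenter's actual data aren't shown):

```python
from scipy.stats import kruskal

# Hypothetical operation times in minutes for three surgeons --
# illustrative values only, not the commenter's actual data.
surgeon_a = [45, 50, 52, 48, 60, 55, 47, 49]
surgeon_b = [55, 62, 58, 70, 65, 60, 59]
surgeon_c = [75, 80, 72, 90, 85, 78]

# Kruskal-Wallis compares the three groups on ranks, so it does not
# require normality or equal variances of the raw times.
stat, p = kruskal(surgeon_a, surgeon_b, surgeon_c)
print(f"H = {stat:.2f}, p = {p:.4f}")
```

Because the test works on ranks, unequal group sizes are not a problem in themselves, though strongly unequal spreads can still complicate interpretation.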


August 6, 2011 at 5:15 am

I conducted a two-way ANOVA to test if there are differences in levels of teaching innovation (scores 0-6) between teachers based on school (1=regular school, 2=all-day school) and in-service training (1=none, 2=Basic ICT Skills, 3=Educational applications of ICT). I used unequal sample sizes (75 all-day teachers and 90 regular teachers).

The ANOVA table showed that there are no differences in either the main effects or the interaction effect (p > 0.05). However, the Model p-value was smaller than 0.05, showing that there are significant differences in the model.

I discussed only the interaction and main effects p-values. My chair told me to recheck the data analysis because it does not make sense for the Model to show significant differences when none of the effects (main and interaction) was significant. When I deleted the Model row from the table, claiming that the only important p-values to discuss were the main and interaction effects p-values, my chair said this was wrong.

The data analysis is correct–I double checked it. My question is: What does this Model p-value mean? Does it have to do with the unequal sample sizes? How should I discuss this Model p-value? Is it really this important to include it in my results?

Thanks in advance,

August 26, 2011 at 12:52 pm

Hi Adamantia,

Thanks for being patient–I’ve been out of the office and just got back.

I can’t give you a definite answer of what is going on without trying it on data, but this is what is *probably* going on.

The Model p-value evaluates the overall effect of all IVs. IF all the IVs are completely independent and sample sizes are equal, the overall model effect won’t be significant if no IVs are.

IVs are usually only independent when you have randomly assigned subjects to conditions.

The other thing that can happen is if your p-values are close to .05, different tests might be falling on one side of that cutoff or the other. They’re not really changing much, and even just rounding can be creating differences. So if that’s the case, don’t take the .05 cutoff too seriously.


December 9, 2010 at 3:41 pm

Thank you Karen! You really helped to clear these things up for me. I really appreciate it. Sorry again for all of the questions.

Thanks Again, Erika

October 23, 2010 at 3:09 pm

I apologize in advance but I have bunch of questions about unequal sample sizes and one-way ANOVAs in a particular case study.

I am conducting an experiment with very different numbers of sample sites. control: n=60; dose 1 n=114; dose 2 n=175. My main question is if the responses to dose 1 and dose 2 are significantly different. Response was measured by the difference in the plant's presence or absence before and after treatment. So if the plant was present at the sample site before treatment and absent at the same site after treatment, it was counted as 1 for response; if it was present before and present after, it was counted as 0; and if the plant was not present before and was present after, it was counted as -1. (I already know from previous research that the two doses should be significantly different from the control, but I would like to do an ANOVA test to compare the responses in the control group and the two different doses.)

Q1: Is there a test, like the levene test, for determining the equality of variances for unequal sample sizes?

Q2: Should I not use ANOVA because the sample sizes are too different?

Q3: Would it be better or worse to conduct a series of t-tests?

Q4: If I choose to use ANOVA should I use a Welch ANOVA followed by games howell pairwise comparison as suggested here in the below pdf because the sample sizes are different? http://frank.mtsu.edu/~dkfuller/notes302/anova.pdf

Q5: Should I not use ANOVA or a t-test because I pretty sure the data is not gaussian due to the fact that the data is practically boolean? And if so is there another test for comparing this kind of data?

Any help you could give me would be greatly appreciated. I feel pretty lost.

Thank you in advance, Erika

December 7, 2010 at 10:20 pm

Sorry it took me a while to respond. Hope this is still useful. You do have a lot of questions, but I’ll do my best.

1. Levene works with unequal sample sizes. Equal variance is even MORE important if sample sizes are unequal.

2. No. It's fine to use ANOVA (assuming variances are equal) with unequal sample sizes. But you should NOT use ANOVA in this study because your response is categorical, not continuous.

3. Worse. Always worse.

4. Welch's test could work in your design (if ANOVA were appropriate), but according to Keppel (1991), it's "unsatisfactory" when you're comparing more than 4 means.

5. Exactly. You could just run a Chi-square, or if you want to get really fancy, or you have covariates you want to include, a logistic regression.
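For a categorical response like Erika's, the Chi-square test suggested in point 5 runs on a contingency table of counts. Here is a hedged sketch in Python using `scipy.stats.chi2_contingency`; the counts are invented for illustration (her actual tallies aren't given), with row totals matching the stated group sizes.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts of the three response categories (-1, 0, 1)
# in each treatment group -- invented for illustration; row totals
# match the stated group sizes (60, 114, 175).
table = np.array([
    [10, 40, 10],   # control, n = 60
    [20, 34, 60],   # dose 1,  n = 114
    [30, 45, 100],  # dose 2,  n = 175
])

# Chi-square test of independence between treatment and response.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```

A small p-value would indicate that the distribution of responses differs across the three groups; the test itself makes no assumption of equal group sizes.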

May 11, 2010 at 10:34 pm

This may be a silly question, but what if you are doing a 2x2x2 and you're comparing males and females on their reaction times (2 tasks) and their anxiety (high or low), and there are more females in the study than males. Would this be a confound?

May 12, 2010 at 9:03 am

Hi Jessica,

It’s not a confound just if there are more females than males. It’s a confound only if, say, there are more females AND females are more likely to be anxious.

If your task and anxiety conditions are manipulated, so that you’re assigning people to them, then you have no problem. The example I gave could only occur if you also measured anxiety, not manipulated it.

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Privacy Overview


Statistics LibreTexts

15.6: Unequal Sample Sizes


  • Rice University

Learning Objectives

  • State why unequal \(n\) can be a problem
  • Define confounding
  • Compute weighted and unweighted means
  • Distinguish between Type I and Type III sums of squares
  • Describe why the cause of the unequal sample sizes makes a difference in the interpretation

The Problem of Confounding

Whether by design, accident, or necessity, the number of subjects in each of the conditions in an experiment may not be equal. For example, the sample sizes for the "Bias Against Associates of the Obese" case study are shown in Table \(\PageIndex{1}\). Although the sample sizes were approximately equal, the "Acquaintance Typical" condition had the most subjects. Since \(n\) is used to refer to the sample size of an individual group, designs with unequal sample sizes are sometimes referred to as designs with unequal \(n\).

We consider an absurd design to illustrate the main problem caused by unequal \(n\). Suppose an experimenter were interested in the effects of diet and exercise on cholesterol. The sample sizes are shown in Table \(\PageIndex{2}\).

What makes this example absurd is that there are no subjects in either the "Low-Fat No-Exercise" condition or the "High-Fat Moderate-Exercise" condition. The hypothetical data showing change in cholesterol are shown in Table \(\PageIndex{3}\).

The last column shows the mean change in cholesterol for the two Diet conditions, whereas the last row shows the mean change in cholesterol for the two Exercise conditions. The value of \(-15\) in the lower-right-most cell in the table is the mean of all subjects.

We see from the last column that those on the low-fat diet lowered their cholesterol an average of \(25\) units, whereas those on the high-fat diet lowered theirs by only an average of \(5\) units. However, there is no way of knowing whether the difference is due to diet or to exercise since every subject in the low-fat condition was in the moderate-exercise condition and every subject in the high-fat condition was in the no-exercise condition. Therefore, Diet and Exercise are completely confounded. The problem with unequal \(n\) is that it causes confounding.

Weighted and Unweighted Means

The difference between weighted and unweighted means is critical for understanding how to deal with the confounding resulting from unequal \(n\).

Weighted and unweighted means will be explained using the data shown in Table \(\PageIndex{4}\). Here, Diet and Exercise are confounded because \(80\%\) of the subjects in the low-fat condition exercised as compared to \(20\%\) of those in the high-fat condition. However, there is not complete confounding as there was with the data in Table \(\PageIndex{3}\).

The weighted mean for "Low Fat" is computed as the mean of the "Low-Fat Moderate-Exercise" mean and the "Low-Fat No-Exercise" mean, weighted in accordance with sample size. To compute a weighted mean, you multiply each mean by its sample size and divide by \(N\), the total number of observations. Since there are four subjects in the "Low-Fat Moderate-Exercise" condition and one subject in the "Low-Fat No-Exercise" condition, the means are weighted by factors of \(4\) and \(1\) as shown below, where \(M_W\) is the weighted mean.

\[M_W=\frac{(4)(-27.5)+(1)(-20)}{5}=-26\]

The weighted mean for the low-fat condition is also the mean of all five scores in this condition. Thus if you ignore the factor "Exercise," you are implicitly computing weighted means.

The unweighted mean for the low-fat condition (\(M_U\)) is simply the mean of the two means.

\[M_U=\frac{-27.5-20}{2}=-23.75\]

One way to evaluate the main effect of Diet is to compare the weighted mean for the low-fat diet (\(-26\)) with the weighted mean for the high-fat diet (\(-4\)). This difference of \(-22\) is called "the effect of diet ignoring exercise" and is misleading since most of the low-fat subjects exercised and most of the high-fat subjects did not. However, the difference between the unweighted means of \(-15.625\) (\((-23.750)-(-8.125)\)) is not affected by this confounding and is therefore a better measure of the main effect. In short, weighted means ignore the effects of other variables (exercise in this example) and result in confounding; unweighted means control for the effect of other variables and therefore eliminate the confounding.
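The weighted and unweighted low-fat means above are easy to reproduce. A quick Python check using the cell means and sample sizes from Table \(\PageIndex{4}\):

```python
# Cell means and sample sizes for the low-fat condition (Table 4):
# Moderate-Exercise: mean -27.5, n = 4; No-Exercise: mean -20, n = 1.
means = [-27.5, -20.0]
sizes = [4, 1]

# Weighted mean: weight each cell mean by its sample size, divide by N.
m_w = sum(m * n for m, n in zip(means, sizes)) / sum(sizes)

# Unweighted mean: the simple mean of the two cell means.
m_u = sum(means) / len(means)

print(m_w, m_u)  # -26.0 -23.75
```

The same two lines applied to the high-fat cells would give the \(-4\) and \(-8.125\) quoted above.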

Statistical analysis programs use different terms for means that are computed controlling for other effects. SPSS calls them estimated marginal means, whereas SAS and SAS JMP call them least squares means.

Types of Sums of Squares

The section on Multi-Factor ANOVA stated that when there are unequal sample sizes, the sum of squares total is not equal to the sum of the sums of squares for all the other sources of variation. This is because the confounded sums of squares are not apportioned to any source of variation. For the data in Table \(\PageIndex{4}\), the sum of squares for Diet is \(390.625\), the sum of squares for Exercise is \(180.625\), and the sum of squares confounded between these two factors is \(819.375\) (the calculation of this value is beyond the scope of this introductory text). In the ANOVA Summary Table shown in Table \(\PageIndex{5}\), this large portion of the sums of squares is not apportioned to any source of variation and represents the "missing" sums of squares. That is, if you add up the sums of squares for Diet, Exercise, \(D \times E\), and Error, you get \(902.625\). If you add the confounded sum of squares of \(819.375\) to this value, you get the total sum of squares of \(1722.000\). When confounded sums of squares are not apportioned to any source of variation, the sums of squares are called Type III sums of squares. Type III sums of squares are, by far, the most common, and if sums of squares are not otherwise labeled, it can safely be assumed that they are Type III.

When all confounded sums of squares are apportioned to sources of variation, the sums of squares are called Type I sums of squares. The order in which the confounded sums of squares are apportioned is determined by the order in which the effects are listed. The first effect gets any sums of squares confounded between it and any of the other effects. The second gets the sums of squares confounded between it and subsequent effects, but not confounded with the first effect, etc. The Type I sums of squares are shown in Table \(\PageIndex{6}\). As you can see, with Type I sums of squares, the sum of all sums of squares is the total sum of squares.

In Type II sums of squares, sums of squares confounded between main effects are not apportioned to any source of variation, whereas sums of squares confounded between main effects and interactions are apportioned to the main effects. In our example, there is no confounding between the \(D \times E\) interaction and either of the main effects. Therefore, the Type II sums of squares are equal to the Type III sums of squares.

Which Type of Sums of Squares to Use (optional)

Type I sums of squares allow the variance confounded between two main effects to be apportioned to one of the main effects. Unless there is a strong argument for how the confounded variance should be apportioned (which is rarely, if ever, the case), Type I sums of squares are not recommended.

There is not a consensus about whether Type II or Type III sums of squares are to be preferred. On the one hand, if there is no interaction, then Type II sums of squares will be more powerful for two reasons:

  • variance confounded between the main effect and interaction is properly assigned to the main effect and
  • weighting the means by sample sizes gives better estimates of the effects.

To take advantage of the greater power of Type II sums of squares, some have suggested that if the interaction is not significant, then Type II sums of squares should be used. Maxwell and Delaney (2003) caution that such an approach could result in a Type II error in the test of the interaction. That is, it could lead to the conclusion that there is no interaction in the population when there really is one. This, in turn, would increase the Type I error rate for the test of the main effect. As a result, their general recommendation is to use Type III sums of squares.

Maxwell and Delaney (2003) recognized that some researchers prefer Type II sums of squares when there are strong theoretical reasons to suspect a lack of interaction and the p value is much higher than the typical \(α\) level of \(0.05\). However, this argument for the use of Type II sums of squares is not entirely convincing. As Tukey (1991) and others have argued, it is doubtful that any effect, whether a main effect or an interaction, is exactly \(0\) in the population. Incidentally, Tukey argued that the role of significance testing is to determine whether a confident conclusion can be made about the direction of an effect, not simply to conclude that an effect is not exactly \(0\).

Finally, if one assumes that there is no interaction, then an ANOVA model with no interaction term should be used rather than Type II sums of squares in a model that includes an interaction term. (Models without interaction terms are not covered in this book).

There are situations in which Type II sums of squares are justified even if there is strong interaction. This is the case because the hypotheses tested by Type II and Type III sums of squares are different, and the choice of which to use should be guided by which hypothesis is of interest. Recall that Type II sums of squares weight cells based on their sample sizes whereas Type III sums of squares weight all cells the same. Consider Figure \(\PageIndex{1}\), which shows data from a hypothetical \(A(2) \times B(2)\) design. The sample sizes are shown numerically and are represented graphically by the areas of the endpoints.

Figure \(\PageIndex{1}\): Cell means and sample sizes for the hypothetical \(A(2) \times B(2)\) design.

First, let's consider the hypothesis for the main effect of \(B\) tested by the Type III sums of squares. Type III sums of squares weight the means equally and, for these data, the marginal means for \(b_1\) and \(b_2\) are equal:

For \(b_1:(b_1a_1 + b_1a_2)/2 = (7 + 9)/2 = 8\)

For \(b_2:(b_2a_1 + b_2a_2)/2 = (14+2)/2 = 8\)

Thus, there is no main effect of \(B\) when tested using Type III sums of squares. For Type II sums of squares, the means are weighted by sample size.

For \(b_1: (4 \times b_1a_1 + 8 \times b_1a_2)/12 = (4 \times 7 + 8 \times 9)/12 = 8.33\)

For \(b_2: (12 \times b_2a_1 + 8 \times b_2a_2)/20 = (12 \times 14 + 8 \times 2)/20 = 9.2\)

Since the weighted marginal mean for \(b_2\) is larger than the weighted marginal mean for \(b_1\), there is a main effect of \(B\) when tested using Type II sums of squares.
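These four marginal means can be checked in a few lines. A quick Python verification using the cell means and sample sizes from Figure \(\PageIndex{1}\):

```python
# Cell means and sample sizes from Figure 1 (hypothetical A(2) x B(2) design):
#   b1: mean 7 at a1 (n = 4),  mean 9 at a2 (n = 8)
#   b2: mean 14 at a1 (n = 12), mean 2 at a2 (n = 8)

# Type III weights the cell means equally.
b1_typ3 = (7 + 9) / 2    # 8.0
b2_typ3 = (14 + 2) / 2   # 8.0

# Type II weights the cell means by their sample sizes.
b1_typ2 = (4 * 7 + 8 * 9) / 12    # 8.33...
b2_typ2 = (12 * 14 + 8 * 2) / 20  # 9.2

print(b1_typ3, b2_typ3)            # 8.0 8.0
print(round(b1_typ2, 2), b2_typ2)  # 8.33 9.2
```

The equally weighted marginals agree exactly, while the sample-size-weighted marginals differ, which is why the two types of sums of squares give different answers for the main effect of \(B\).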

The Type II and Type III analysis are testing different hypotheses. First, let's consider the case in which the differences in sample sizes arise because in the sampling of intact groups, the sample cell sizes reflect the population cell sizes (at least approximately). In this case, it makes sense to weight some means more than others and conclude that there is a main effect of \(B\). This is the result obtained with Type II sums of squares. However, if the sample size differences arose from random assignment, and there just happened to be more observations in some cells than others, then one would want to estimate what the main effects would have been with equal sample sizes and, therefore, weight the means equally. With the means weighted equally, there is no main effect of \(B\), the result obtained with Type III sums of squares.

Unweighted Means Analysis

Type III sums of squares are tests of differences in unweighted means. However, there is an alternative method to testing the same hypotheses tested using Type III sums of squares. This method, unweighted means analysis, is computationally simpler than the standard method but is an approximate test rather than an exact test. It is, however, a very good approximation in all but extreme cases. Moreover, it is exactly the same as the traditional test for effects with one degree of freedom. The Analysis Lab uses unweighted means analysis and therefore may not match the results of other computer programs exactly when there is unequal n and the df are greater than one.

Causes of Unequal Sample Sizes

None of the methods for dealing with unequal sample sizes are valid if the experimental treatment is the source of the unequal sample sizes. Imagine an experiment seeking to determine whether publicly performing an embarrassing act would affect one's anxiety about public speaking. In this imaginary experiment, the experimental group is asked to reveal to a group of people the most embarrassing thing they have ever done. The control group is asked to describe what they had at their last meal. Twenty subjects are recruited for the experiment and randomly divided into two equal groups of \(10\), one for the experimental treatment and one for the control. Following their descriptions, subjects are given an attitude survey concerning public speaking. This seems like a valid experimental design. However, of the \(10\) subjects in the experimental group, four withdrew from the experiment because they did not wish to publicly describe an embarrassing situation. None of the subjects in the control group withdrew. Even if the data analysis were to show a significant effect, it would not be valid to conclude that the treatment had an effect because a likely alternative explanation cannot be ruled out; namely, subjects who were willing to describe an embarrassing situation differed from those who were not. Thus, the differential dropout rate destroyed the random assignment of subjects to conditions, a critical feature of the experimental design. No amount of statistical adjustment can compensate for this flaw.

  • Maxwell, S. E., & Delaney, H. D. (2003) Designing Experiments and Analyzing Data: A Model Comparison Perspective , Second Edition, Lawrence Erlbaum Associates, Mahwah, New Jersey.
  • Tukey, J. W. (1991) The philosophy of multiple comparisons, Statistical Science , 6 , 110-116.

How to Perform a t-test with Unequal Sample Sizes

One question students often have in statistics is:

Is it possible to perform a t-test when the sample sizes of each group are not equal?

The short answer:

Yes, you can perform a t-test when the sample sizes are not equal. Equal sample sizes is not one of the assumptions made in a t-test.

The real issues arise when the two samples do not have equal variances, which is one of the assumptions made in a t-test.

When this occurs, it’s recommended that you use Welch’s t-test instead, which does not make the assumption of equal variances.

The following examples demonstrate how to perform t-tests with unequal sample sizes when the variances are equal and when they’re not equal.

Example 1: Unequal Sample Sizes and Equal Variances

Suppose we administer two programs designed to help students score higher on some exam.

The results are as follows:

Program 1:

  • n (sample size): 500
  • x (sample mean): 80
  • s (sample standard deviation): 5

Program 2:

  • n (sample size): 20
  • x (sample mean): 85

The following code shows how to create a boxplot in R to visualize the distribution of exam scores for each program:

[Boxplot of exam scores for each program]

The mean exam score for Program 2 appears to be higher, but the variance of exam scores between the two programs is roughly equal. 

The following code shows how to perform an independent samples t-test along with a Welch’s t-test:
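The R code itself is not reproduced above, but the same pair of tests can be sketched in Python from the summary statistics using SciPy's `ttest_ind_from_stats` (assuming, for illustration, that Program 2's standard deviation is also 5, since it is not listed):

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from Example 1; Program 2's standard deviation
# is not given in the article, so 5 is assumed here for illustration.
stats = dict(mean1=80, std1=5, nobs1=500, mean2=85, std2=5, nobs2=20)

pooled = ttest_ind_from_stats(**stats, equal_var=True)   # independent samples t-test
welch = ttest_ind_from_stats(**stats, equal_var=False)   # Welch's t-test

print(f"pooled p = {pooled.pvalue:.4g}, Welch p = {welch.pvalue:.4g}")
```

The exact p-values differ from those quoted in the article, which were computed from simulated raw scores rather than summary statistics, but the conclusion is the same: both tests are significant when the variances are equal.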

The independent samples t-test returns a p-value of .0009 and Welch's t-test returns a p-value of .0029.

Since the p-value of each test is less than .05, we would reject the null hypothesis in each test and conclude that there is a statistically significant difference in mean exam scores between the two programs.

Even though the sample sizes are unequal, the independent samples t-test and Welch’s t-test both return similar results since the two samples had equal variances.

Example 2: Unequal Sample Sizes and Unequal Variances

  • s (sample standard deviation) for Program 1: 25

[Boxplot of exam scores for each program]

The mean exam score for Program 2 appears to be higher, but the variance of exam scores for Program 1 is much higher than Program 2.

The independent samples t-test returns a p-value of .5496 and Welch's t-test returns a p-value of .0361.

The independent samples t-test is not able to detect a difference in mean exam scores, but the Welch’s t-test is able to detect a statistically significant difference.

Since the two samples had unequal variances, only Welch’s t-test was able to detect the statistically significant difference in mean exam scores since this test does not make the assumption of equal variances between samples.
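The same qualitative result appears when working from summary statistics (taking Program 1's s = 25 as listed above, and assuming the sample sizes, means, and Program 2's s carry over from Example 1):

```python
from scipy.stats import ttest_ind_from_stats

# Program 1: n = 500, mean 80, s = 25 (as listed above).
# Program 2: n = 20, mean 85, s = 5 -- assumed from Example 1.
pooled = ttest_ind_from_stats(80, 25, 500, 85, 5, 20, equal_var=True)
welch = ttest_ind_from_stats(80, 25, 500, 85, 5, 20, equal_var=False)

print(f"pooled p = {pooled.pvalue:.4f}")  # not significant at .05
print(f"Welch  p = {welch.pvalue:.4f}")   # significant at .05
```

With unequal variances and very unequal sample sizes, the pooled test's standard error is dominated by the large, high-variance group, so only Welch's test detects the difference.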

Additional Resources

The following tutorials provide additional information about t-tests:

Introduction to the One Sample t-test
Introduction to the Two Sample t-test
Introduction to the Paired Samples t-test



25.3 - Calculating Sample Size

Before we learn how to calculate the sample size that is necessary to achieve a hypothesis test with a certain power, it might behoove us to understand the effect that sample size has on power. Let's investigate by returning to our IQ example.

Example 25-3

Let \(X\) denote the IQ of a randomly selected adult American. Assume, a bit unrealistically again, that \(X\) is normally distributed with unknown mean \(\mu\) and (a strangely known) standard deviation of 16. This time, instead of taking a random sample of \(n=16\) students, let's increase the sample size to \(n=64\). And, while setting the probability of committing a Type I error to \(\alpha=0.05\), test the null hypothesis \(H_0:\mu=100\) against the alternative hypothesis that \(H_A:\mu>100\).

What is the power of the hypothesis test when \(\mu=108\), \(\mu=112\), and \(\mu=116\)?

Setting \(\alpha\), the probability of committing a Type I error, to 0.05, implies that we should reject the null hypothesis when the test statistic \(Z\ge 1.645\), or equivalently, when the observed sample mean is 103.29 or greater:

\( \bar{x} = \mu + z \left(\dfrac{\sigma}{\sqrt{n}} \right) = 100 +1.645\left(\dfrac{16}{\sqrt{64}} \right) = 103.29\)

Therefore, the power function \(K(\mu)\), when \(\mu>100\) is the true value, is:

\( K(\mu) = P(\bar{X} \ge 103.29 | \mu) = P \left(Z \ge \dfrac{103.29 - \mu}{16 / \sqrt{64}} \right) = 1 - \Phi \left(\dfrac{103.29 - \mu}{2} \right)\)

Therefore, the probability of rejecting the null hypothesis at the \(\alpha=0.05\) level when \(\mu=108\) is 0.9907, as calculated here:

\(K(108) = 1 - \Phi \left( \dfrac{103.29-108}{2} \right) = 1- \Phi(-2.355) = 0.9907 \)

And, the probability of rejecting the null hypothesis at the \(\alpha=0.05\) level when \(\mu=112\) is greater than 0.9999, as calculated here:

\( K(112) = 1 - \Phi \left( \dfrac{103.29-112}{2} \right) = 1- \Phi(-4.355) = 0.9999\ldots \)

And, the probability of rejecting the null hypothesis at the \(\alpha=0.05\) level when \(\mu=116\) is greater than 0.999999, as calculated here:

\( K(116) = 1 - \Phi \left( \dfrac{103.29-116}{2} \right) = 1- \Phi(-6.355) = 0.999999... \)
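These three power values can be verified directly with the standard normal CDF. A short Python sketch of the calculation above:

```python
from scipy.stats import norm

# Rejection threshold for alpha = 0.05 with sigma = 16 and n = 64.
se = 16 / 64 ** 0.5               # standard error = 2
c = 100 + norm.ppf(0.95) * se     # ~103.29

# Power function K(mu) = 1 - Phi((c - mu) / se).
def power(mu):
    return 1 - norm.cdf((c - mu) / se)

for mu in (108, 112, 116):
    print(f"K({mu}) = {power(mu):.6f}")
```

Tiny differences from the text's values can arise because the text rounds the threshold to \(103.29\) and the z-value to \(1.645\).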

In summary, in the various examples throughout this lesson, we have calculated the power of testing \(H_0:\mu=100\) against \(H_A:\mu>100\) for two sample sizes ( \(n=16\) and \(n=64\)) and for three possible values of the mean ( \(\mu=108\), \(\mu=112\), and \(\mu=116\)). Here's a summary of our power calculations:

As you can see, our work suggests that for a given value of the mean \(\mu\) under the alternative hypothesis, the larger the sample size \(n\), the greater the power \(K(\mu)\). Perhaps there is no better way to see this than graphically by plotting the two power functions simultaneously, one when \(n=16\) and the other when \(n=64\):

As this plot suggests, if we are interested in increasing our chance of rejecting the null hypothesis when the alternative hypothesis is true, we can do so by increasing our sample size \(n\). This benefit is perhaps even greatest for values of the mean that are close to the value of the mean assumed under the null hypothesis. Let's take a look at two examples that illustrate the kind of sample size calculation we can make to ensure our hypothesis test has sufficient power.

Example 25-4


Let \(X\) denote the crop yield of corn measured in the number of bushels per acre. Assume (unrealistically) that \(X\) is normally distributed with unknown mean \(\mu\) and standard deviation \(\sigma=6\). An agricultural researcher is working to increase the current average yield from 40 bushels per acre. Therefore, he is interested in testing, at the \(\alpha=0.05\) level, the null hypothesis \(H_0:\mu=40\) against the alternative hypothesis that \(H_A:\mu>40\). Find the sample size \(n\) that is necessary to achieve 0.90 power at the alternative \(\mu=45\).

As is always the case, we need to start by finding a threshold value \(c\), such that if the sample mean is larger than \(c\), we'll reject the null hypothesis:

That is, in order for our hypothesis test to be conducted at the \(\alpha=0.05\) level, the following statement must hold (using our typical \(Z\) transformation):

\( P(\bar{X} \ge c \text{ when } \mu = 40) = P\left( Z \ge \dfrac{c-40}{6/\sqrt{n}} \right) = 0.05 \)

which, because the upper 0.05 quantile of the standard normal distribution is 1.645, gives:

\(c = 40 + 1.645 \left( \dfrac{6}{\sqrt{n}} \right) \) (**)

But, that's not the only condition that \(c\) must meet, because \(c\) also needs to be defined to ensure that our power is 0.90 or, alternatively, that the probability of a Type II error is 0.10. That would happen if there was a 10% chance that our test statistic fell short of \(c\) when \(\mu=45\), as the following drawing illustrates in blue:

This illustration suggests that in order for our hypothesis test to have 0.90 power, the following statement must hold (using our usual \(Z\) transformation):

\(c = 45 - 1.28 \left( \dfrac{6}{\sqrt{n}} \right) \) (**)

Aha! We have two (asterisked (**)) equations and two unknowns! All we need to do is equate the equations, and solve for \(n\). Doing so, we get:

\(40+1.645\left(\dfrac{6}{\sqrt{n}}\right)=45-1.28\left(\dfrac{6}{\sqrt{n}}\right) \Rightarrow 5=(1.645+1.28)\left(\dfrac{6}{\sqrt{n}}\right) \Rightarrow 5=\dfrac{17.55}{\sqrt{n}} \Rightarrow n=(3.51)^2=12.3201\approx 13\)

Now that we know we will set \(n=13\), we can solve for our threshold value \(c\):

\( c = 40 + 1.645 \left( \dfrac{6}{\sqrt{13}} \right)=42.737 \)

So, in summary, if the agricultural researcher collects data on \(n=13\) corn plots, and rejects his null hypothesis \(H_0:\mu=40\) if the average crop yield of the 13 plots is greater than 42.737 bushels per acre, he will have a 5% chance of committing a Type I error and a 10% chance of committing a Type II error if the population mean \(\mu\) were actually 45 bushels per acre.
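The two-equation solution above can be checked numerically. A small sketch (the function name `sample_size` is my own, not from the lesson):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(mu0, mu1, sigma, alpha=0.05, power=0.90):
    """Smallest n giving the stated power for the one-sided test H0: mu = mu0 vs HA: mu > mu0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # ~1.645
    z_beta = NormalDist().inv_cdf(power)        # ~1.282
    # equating c = mu0 + z_alpha*sigma/sqrt(n) and c = mu1 - z_beta*sigma/sqrt(n)
    return ceil((((z_alpha + z_beta) * sigma) / (mu1 - mu0)) ** 2)

n = sample_size(40, 45, 6)                           # number of corn plots
c = 40 + NormalDist().inv_cdf(0.95) * 6 / sqrt(n)    # rejection threshold in bushels/acre
print(n, round(c, 3))
```

Rounding up via `ceil` matters: a fractional \(n\) must be bumped to the next whole plot to preserve the desired power.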

Example 25-5


Consider \(p\), the true proportion of voters who favor a particular political candidate. A pollster is interested in testing, at the \(\alpha=0.01\) level, the null hypothesis \(H_0:p=0.5\) against the alternative hypothesis that \(H_A:p>0.5\). Find the sample size \(n\) that is necessary to achieve 0.80 power at the alternative \(p=0.55\).

In this case, because we are interested in performing a hypothesis test about a population proportion \(p\), we use the \(Z\)-statistic:

\(Z = \dfrac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \)

Again, we start by finding a threshold value \(c\), such that if the observed sample proportion is larger than \(c\), we'll reject the null hypothesis:

That is, in order for our hypothesis test to be conducted at the \(\alpha=0.01\) level, the following statement must hold:

\(c = 0.5 + 2.326 \sqrt{ \dfrac{(0.5)(0.5)}{n}} \) (**)

But, again, that's not the only condition that \(c\) must meet, because \(c\) also needs to be defined to ensure that our power is 0.80 or, alternatively, that the probability of a Type II error is 0.20. That would happen if there was a 20% chance that our test statistic fell short of \(c\) when \(p=0.55\), as the following drawing illustrates in blue:

This illustration suggests that in order for our hypothesis test to have 0.80 power, the following statement must hold:

\(c = 0.55 - 0.842 \sqrt{ \dfrac{(0.55)(0.45)}{n}} \) (**)

Again, we have two (asterisked (**)) equations and two unknowns! All we need to do is equate the equations, and solve for \(n\). Doing so, we get:

\(0.5+2.326\sqrt{\dfrac{0.5(0.5)}{n}}=0.55-0.842\sqrt{\dfrac{0.55(0.45)}{n}} \\ 2.326\dfrac{\sqrt{0.25}}{\sqrt{n}}+0.842\dfrac{\sqrt{0.2475}}{\sqrt{n}}=0.55-0.5 \\ \dfrac{1}{\sqrt{n}}(1.5818897)=0.05 \qquad \Rightarrow n\approx \left(\dfrac{1.5818897}{0.05}\right)^2 = 1000.95 \approx 1001 \)

Now that we know we will set \(n=1001\), we can solve for our threshold value \(c\):

\(c = 0.5 + 2.326 \sqrt{\dfrac{(0.5)(0.5)}{1001}}= 0.5367 \)

So, in summary, if the pollster collects data on \(n=1001\) voters, and rejects his null hypothesis \(H_0:p=0.5\) if the proportion of sampled voters who favor the political candidate is greater than 0.5367, he will have a 1% chance of committing a Type I error and a 20% chance of committing a Type II error if the population proportion \(p\) were actually 0.55.

Incidentally, we can always check our work! Conducting the survey and subsequent hypothesis test as described above, the probability of committing a Type I error is:

\(\alpha= P(\hat{p} >0.5367 \text { if } p = 0.50) = P(Z > 2.3257) = 0.01 \)

and the probability of committing a Type II error is:

\(\beta = P(\hat{p} <0.5367 \text { if } p = 0.55) = P(Z < -0.846) = 0.199 \)

just as the pollster had desired.
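The pollster's calculation, including the \(\alpha\) and \(\beta\) checks, can be reproduced in a few lines; a sketch under the same normal approximation used in the text:

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()
p0, p1, alpha, target_power = 0.50, 0.55, 0.01, 0.80
z_alpha = Z.inv_cdf(1 - alpha)        # ~2.326
z_beta = Z.inv_cdf(target_power)      # ~0.842

# equate the two (**) expressions for c and solve for n
n = ceil(((z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2)
c = p0 + z_alpha * sqrt(p0 * (1 - p0) / n)

# check the resulting error rates, as in the text
alpha_check = 1 - Z.cdf((c - p0) / sqrt(p0 * (1 - p0) / n))
beta_check = Z.cdf((c - p1) / sqrt(p1 * (1 - p1) / n))
print(n, round(c, 4), round(alpha_check, 3), round(beta_check, 3))
```

The printed values recover \(n=1001\), \(c\approx 0.5368\), and error rates of about 0.01 and 0.20, matching the hand calculation.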

We've illustrated several sample size calculations. Now, let's summarize the information that goes into a sample size calculation. In order to determine a sample size for a given hypothesis test, you need to specify:

  • The desired \(\alpha\) level, that is, your willingness to commit a Type I error.
  • The desired power or, equivalently, the desired \(\beta\) level, that is, your willingness to commit a Type II error.
  • A meaningful difference from the value of the parameter that is specified in the null hypothesis.
  • The standard deviation of the sample statistic or, at least, an estimate of the standard deviation (the "standard error") of the sample statistic.

Sample Size and its Importance in Research

Chittaranjan Andrade
Indian J Psychol Med, v.42(1), Jan-Feb 2020

Clinical Psychopharmacology Unit, Department of Clinical Psychopharmacology and Neurotoxicology, National Institute of Mental Health and Neurosciences, Bengaluru, Karnataka, India

The sample size for a study needs to be estimated at the time the study is proposed; too large a sample is unnecessary and unethical, and too small a sample is unscientific and also unethical. The necessary sample size can be calculated, using statistical software, based on certain assumptions. If no assumptions can be made, then an arbitrary sample size is set for a pilot study. This article discusses sample size and how it relates to matters such as ethics, statistical power, the primary and secondary hypotheses in a study, and findings from larger vs. smaller samples.

Studies are conducted on samples because it is usually impossible to study the entire population. Conclusions drawn from samples are intended to be generalized to the population, and sometimes to the future as well. The sample must therefore be representative of the population. This is best ensured by the use of proper methods of sampling. The sample must also be adequate in size – in fact, no more and no less.

SAMPLE SIZE AND ETHICS

A sample that is larger than necessary will be better representative of the population and will hence provide more accurate results. However, beyond a certain point, the increase in accuracy will be small and hence not worth the effort and expense involved in recruiting the extra patients. Furthermore, an overly large sample would inconvenience more patients than might be necessary for the study objectives; this is unethical. In contrast, a sample that is smaller than necessary would have insufficient statistical power to answer the primary research question, and a statistically nonsignificant result could merely be because of inadequate sample size (Type 2 or false negative error). Thus, a small sample could result in the patients in the study being inconvenienced with no benefit to future patients or to science. This is also unethical.

In this regard, inconvenience to patients refers to the time that they spend in clinical assessments and to the psychological and physical discomfort that they experience in assessments such as interviews, blood sampling, and other procedures.

ESTIMATING SAMPLE SIZE

So how large should a sample be? In hypothesis testing studies, this is mathematically calculated, conventionally, as the sample size necessary to be 80% certain of identifying a statistically significant outcome should the hypothesis be true for the population, with P for statistical significance set at 0.05. Some investigators power their studies for 90% instead of 80%, and some set the threshold for significance at 0.01 rather than 0.05. Both choices are uncommon because the necessary sample size becomes large, and the study becomes more expensive and more difficult to conduct. Many investigators increase the sample size by 10%, or by whatever proportion they can justify, to compensate for expected dropout, incomplete records, biological specimens that do not meet laboratory requirements for testing, and other study-related problems.

Sample size calculations require assumptions about expected means and standard deviations, or event risks, in different groups; or about expected effect sizes. For example, a study may be powered to detect an effect size of 0.5, or a response rate of 60% with drug vs. 40% with placebo.[ 1 ] When no guesstimates or expectations are possible, pilot studies are conducted on a sample that is arbitrary in size but what might be considered reasonable for the field.
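For the 60% vs. 40% response-rate example, the conventional calculation (80% power, two-sided \(\alpha = 0.05\)) can be sketched as follows; the function `n_per_group` and the 10% dropout inflation are illustrative, not from the article:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided test comparing two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.960
    z_b = NormalDist().inv_cdf(power)           # ~0.842
    p_bar = (p1 + p2) / 2                       # pooled proportion under H0
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

n = n_per_group(0.60, 0.40)       # per-group sample size for drug vs. placebo
n_recruit = ceil(n * 1.10)        # inflate by 10% for expected dropout
print(n, n_recruit)
```

This is the same kind of computation that G*Power or an online calculator performs from the assumed rates, significance level, and power.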

The sample size may need to be larger in multicenter studies because of statistical noise (due to variations in patient characteristics, nonspecific treatment characteristics, rating practices, environments, etc. between study centers).[ 2 ] Sample size calculations can be performed manually or using statistical software; online calculators that provide free service can easily be identified by search engines. G*Power is an example of a free, downloadable program for sample size estimation. The manual and tutorial for G*Power can also be downloaded.

PRIMARY AND SECONDARY ANALYSES

The sample size is calculated for the primary hypothesis of the study. What is the difference between the primary hypothesis, primary outcome and primary outcome measure? As an example, the primary outcome may be a reduction in the severity of depression, the primary outcome measure may be the Montgomery-Asberg Depression Rating Scale (MADRS) and the primary hypothesis may be that reduction in MADRS scores is greater with the drug than with placebo. The primary hypothesis is tested in the primary analysis.

Studies almost always have many hypotheses; for example, that the study drug will outperform placebo on measures of depression, suicidality, anxiety, disability and quality of life. The sample size necessary for adequate statistical power to test each of these hypotheses will be different. Because a study can have only one sample size, it can be powered for only one outcome, the primary outcome. Therefore, the study would be either overpowered or underpowered for the other outcomes. These outcomes are therefore called secondary outcomes, and are associated with secondary hypotheses, and are tested in secondary analyses. Secondary analyses are generally considered exploratory because when many hypotheses in a study are each tested at a P < 0.05 level for significance, some may emerge statistically significant by chance (Type 1 or false positive errors).[ 3 ]

INTERPRETING RESULTS

Here is an interesting question. A test of the primary hypothesis yielded a P value of 0.07. Might we conclude that our sample was underpowered for the study and that, had our sample been larger, we would have identified a significant result? No! The reason is that larger samples will more accurately represent the population value, whereas smaller samples could be off the mark in either direction – towards or away from the population value. In this context, readers should also note that no matter how small the P value for an estimate is, the population value of that estimate remains the same.[ 4 ]

On a parting note, it is unlikely that population values will be null. That is, for example, that the response rate to the drug will be exactly the same as that to placebo, or that the correlation between height and age at onset of schizophrenia will be zero. If the sample size is large enough, even such small differences between groups, or trivial correlations, would be detected as being statistically significant. This does not mean that the findings are clinically significant.



Consider performing a one-way ANOVA test on a dataset with heights of students from three samples with unequal sample sizes.

The null hypothesis is that the mean heights of the three samples are equal, and the alternative hypothesis is that at least one of the mean heights is different.

Compute the F statistic using the ratio of the variance between samples to the variance within samples. Here, \(\bar{\bar{x}}\) is the combined mean of all observations, \(\bar{x}_i\) is the mean of the \(i\)th sample, \(n_i\) is the size of the \(i\)th sample, \(k\) is the number of samples, and \(s_i^2\) is the variance of the \(i\)th sample.

Observe that both variance estimates are weighted since they consider sample size to compute the F statistic.

From the P-value, we infer that at least one of the mean heights from the three samples is different; hence, the null hypothesis is rejected.

Further, to determine which mean height is significantly different from the others, we may construct box plots, construct confidence intervals, or use multiple comparison tests.

10.4: One-Way ANOVA: Unequal Sample Sizes

One-way ANOVA can be performed on three or more samples of unequal sizes. However, the calculations get more complicated when the sample sizes are not all the same. So, when performing ANOVA with unequal sample sizes, the following equation is used:

\( F = \dfrac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{\bar{x}})^2 / (k-1)}{\sum_{i=1}^{k} (n_i - 1) s_i^2 / (N-k)} \)

In the equation, \(n_i\) is the size of sample \(i\), \(\bar{x}_i\) is its mean, \(s_i^2\) is its variance, \(\bar{\bar{x}}\) is the combined mean of all \(N\) observations, and \(k\) is the number of samples.

Observe that both variance estimates, the variance between samples and the variance within samples, are weighted, since they use the sample sizes to calculate the F statistic. In other words, the different sample sizes in the dataset affect the two variance estimates and, ultimately, the value of the F statistic.
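A direct implementation of this weighted F statistic; the three height samples below are illustrative, not the dataset from the video:

```python
from statistics import mean, variance

def one_way_f(samples):
    """F statistic for one-way ANOVA with possibly unequal sample sizes."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    # variance between samples, weighted by each n_i
    ms_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples) / (k - 1)
    # variance within samples, weighted by each n_i - 1
    ms_within = sum((len(s) - 1) * variance(s) for s in samples) / (n_total - k)
    return ms_between / ms_within

# three height samples (cm) of unequal sizes
f_stat = one_way_f([[170, 172, 168, 175], [160, 162, 158], [180, 178, 182, 181, 179]])
print(round(f_stat, 2))
```

Because each group's contribution is scaled by its own \(n_i\), a large group with unusual variance pulls on the F statistic more than a small one, which is exactly the sensitivity the text describes.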


Power and Sample Size Determination


Issues in Estimating Sample Size for Hypothesis Testing

Ensuring That a Test Has High Power


In the module on hypothesis testing for means and proportions, we introduced techniques for means, proportions, differences in means, and differences in proportions. While each test involved details that were specific to the outcome of interest (e.g., continuous or dichotomous) and to the number of comparison groups (one, two, more than two), there were common elements to each test. For example, in each test of hypothesis, there are two errors that can be committed. The first is called a Type I error and refers to the situation where we incorrectly reject H0 when in fact it is true. In the first step of any test of hypothesis, we select a level of significance, α, and α = P(Type I error) = P(Reject H0 | H0 is true). Because we purposely select a small value for α, we control the probability of committing a Type I error. The second type of error is called a Type II error, defined as the probability that we do not reject H0 when it is false. The probability of a Type II error is denoted β, and β = P(Type II error) = P(Do not Reject H0 | H0 is false).

In hypothesis testing, we usually focus on power, which is defined as the probability that we reject H0 when it is false, i.e., power = 1 - β = P(Reject H0 | H0 is false). Power is the probability that a test correctly rejects a false null hypothesis. A good test is one with a low probability of committing a Type I error (i.e., small α) and high power (i.e., small β).

Here we present formulas to determine the sample size required to ensure that a test has high power. The sample size computations depend on the level of significance, α, the desired power of the test (equivalent to 1-β), the variability of the outcome, and the effect size. The effect size is the difference in the parameter of interest that represents a clinically meaningful difference. Similar to the margin of error in confidence interval applications, the effect size is determined based on clinical or practical criteria and not statistical criteria.

The concept of statistical power can be difficult to grasp. Before presenting the formulas to determine the sample sizes required to ensure high power in a test, we will first discuss power from a conceptual point of view.  

Suppose we want to test the following hypotheses at α=0.05:  H0: μ = 90 versus H1: μ ≠ 90. To test the hypotheses, suppose we select a sample of size n=100. For this example, assume that the standard deviation of the outcome is σ=20. We compute the sample mean and then must decide whether the sample mean provides evidence to support the alternative hypothesis or not. This is done by computing a test statistic and comparing the test statistic to an appropriate critical value. If the null hypothesis is true (μ=90), then we are likely to select a sample whose mean is close in value to 90. However, it is also possible to select a sample whose mean is much larger or much smaller than 90. Recall from the Central Limit Theorem (see page 11 in the module on Probability ), that for large n (here n=100 is sufficiently large), the distribution of the sample means is approximately normal with mean μ = 90 and standard error σ/√n = 20/√100 = 2.

If the null hypothesis is true, it is possible to observe any sample mean shown in the figure below; all are possible under H 0 : μ = 90.  

[Figure: normal distribution of the sample mean, centered at 90.]

Rejection Region for Test H0: μ = 90 versus H1: μ ≠ 90 at α = 0.05

[Figure: standard normal distribution with rejection regions in the two tails; with α = 0.05, each tail has area 0.025.]

The areas in the two tails of the curve represent the probability of a Type I Error, α= 0.05. This concept was discussed in the module on Hypothesis Testing .  

Now, suppose that the alternative hypothesis, H1, is true (i.e., μ ≠ 90) and that the true mean is actually 94. The figure below shows the distributions of the sample mean under the null and alternative hypotheses. The values of the sample mean are shown along the horizontal axis.

[Figure: two overlapping normal distributions, one for the null hypothesis (mean 90) and one for the alternative (mean 94); a fuller explanation follows.]

If the true mean is 94, then the alternative hypothesis is true. In our test, we selected α = 0.05 and reject H0 if the observed sample mean exceeds 93.92 (focusing on the upper tail of the rejection region for now). The critical value (93.92) is indicated by the vertical line. The probability of a Type II error is denoted β, and β = P(Do not Reject H0 | H0 is false), i.e., the probability of not rejecting the null hypothesis when it is in fact false. β is shown in the figure above as the area under the rightmost curve (H1) to the left of the vertical line (where we do not reject H0). Power is defined as 1 - β = P(Reject H0 | H0 is false) and is shown in the figure as the area under the rightmost curve (H1) to the right of the vertical line (where we reject H0).

Note that β and power are related to α, the variability of the outcome, and the effect size. From the figure above we can see what happens to β and power if we increase α. Suppose, for example, we increase α to α = 0.10. The upper critical value would be 92.56 instead of 93.92. The vertical line would shift to the left, increasing α, decreasing β and increasing power. While a better test is one with higher power, it is not advisable to increase α as a means to increase power. Nonetheless, there is a direct relationship between α and power (as α increases, so does power).

β and power are also related to the variability of the outcome and to the effect size. The effect size is the difference in the parameter of interest (e.g., μ) that represents a clinically meaningful difference. The figure above graphically displays α, β, and power when the difference in the mean under the null as compared to the alternative hypothesis is 4 units (i.e., 90 versus 94). The figure below shows the same components for the situation where the mean under the alternative hypothesis is 98.

[Figure: overlapping bell-shaped distributions with means 90 and 98; the larger separation leaves far less overlap.]

Notice that there is much higher power when there is a larger difference between the mean under H 0 as compared to H 1 (i.e., 90 versus 98). A statistical test is much more likely to reject the null hypothesis in favor of the alternative if the true mean is 98 than if the true mean is 94. Notice also in this case that there is little overlap in the distributions under the null and alternative hypotheses. If a sample mean of 97 or higher is observed it is very unlikely that it came from a distribution whose mean is 90. In the previous figure for H 0 : μ = 90 and H 1 : μ = 94, if we observed a sample mean of 93, for example, it would not be as clear as to whether it came from a distribution whose mean is 90 or one whose mean is 94.
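The two situations in the figures can be quantified directly. A short sketch using the values above (σ = 20, n = 100, two-sided α = 0.05, focusing on the upper rejection region):

```python
from statistics import NormalDist

Z = NormalDist()
mu0, sigma, n, alpha = 90, 20, 100, 0.05
se = sigma / n ** 0.5                            # standard error = 2.0
upper = mu0 + Z.inv_cdf(1 - alpha / 2) * se      # upper critical value, ~93.92

for mu1 in (94, 98):
    power = 1 - Z.cdf((upper - mu1) / se)        # P(reject H0 | true mean = mu1)
    print(mu1, round(power, 3))
```

With a true mean of 94 the (upper-tail) power is only about 0.52, while at 98 it is about 0.98, matching the visual impression of little overlap in the second figure.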

In designing studies most people consider power of 80% or 90% (just as we generally use 95% as the confidence level for confidence interval estimates). The inputs for the sample size formulas include the desired power, the level of significance and the effect size. The effect size is selected to represent a clinically meaningful or practically important difference in the parameter of interest, as we will illustrate.  

The formulas we present below produce the minimum sample size to ensure that the test of hypothesis will have a specified probability of rejecting the null hypothesis when it is false (i.e., a specified power). In planning studies, investigators again must account for attrition or loss to follow-up. The formulas shown below produce the number of participants needed with complete data, and we will illustrate how attrition is addressed in planning studies.


Content ©2020. All Rights Reserved. Date last modified: March 13, 2020. Wayne W. LaMorte, MD, PhD, MPH


Two Sample T-Test Unequal Variance

Comparing two samples/populations/groups/means/values.

A two-sample T-Test with unequal variance can be applied when (1) the samples are normally distributed, (2) the standard deviations of both populations are unknown and assumed to be unequal, and (3) the sample is sufficiently large (over 30).

To compare the heights of two male populations from the United States and Sweden, a sample of 30 males from each country is randomly selected, and the measured heights are provided in the table below.

Table 6. Height data for US and Swedish male samples

As the population standard deviations are unknown, the data are assumed to be normally distributed, and the sample size is large enough, the two-sample T-Test can be applied to analyze the data. The test statistic is calculated as in Equation 5:

\( t = \dfrac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \)

[Any statistical software, including MS Excel, can perform the two-sample T-Test. Therefore, the equation is for reference only. Analysis output will be produced using software.]

MS Excel can be used for performing a two-sample T-Test.

Analysis using MS Excel is provided in Figure 9.

Figure 9. Two-Sample T-Test Unequal Variance Analysis Results Using MS Excel
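Outside of Excel, the same unequal-variance (Welch) statistic can be computed directly. A minimal sketch with illustrative data, not the values from the height study:

```python
from statistics import mean, variance

def welch_t(x, y):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4], [2, 4, 6, 8, 10])
print(round(t, 3), round(df, 2))
```

The p-value then comes from the t distribution with the (generally non-integer) `df`; statistical packages such as Excel's unequal-variance T-Test do this lookup for you.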

Statistical Interpretation of the Results

We reject the null hypothesis because the p-value (0.0127) is smaller than the level of significance (0.05). [The p-value is the probability, computed assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed; it is calculated from the sample data using an appropriate method, the two-sample T-Test for unequal variance in this case.]

Contextual Conclusion

Statistically, the US male and Swedish male populations are significantly different with respect to height. [Rewrite the accepted hypothesis for an eighth grader without using statistical jargon such as the p-value, level of significance, etc.]

The next question would then be who is taller or shorter. Both the sample and the population data show that the Swedish male population is taller than the US male population. However, the alternative hypothesis was written as "not equal," which cannot support a directional claim. To test whether the Swedish male population is taller than the US male population (equivalently, whether the US male population is shorter), the hypothesis is rewritten with a one-sided alternative.

Now the alternative hypothesis becomes one-sided. As the one-sided probability is half of the two-sided probability (p-value), we would still reject the null hypothesis. The new contextual conclusion would be "Statistically, the US male population is significantly shorter than the Swedish male population." However, making this contextual conclusion for the original "not equal" alternative hypothesis would be wrong, a common mistake.

Test Your Knowledge

Paired t-test (matched pair/repeated measure).

IMAGES

  1. Hypothesis Testing 3

    hypothesis testing unequal sample size

  2. Hypothesis testing

    hypothesis testing unequal sample size

  3. How to Perform an ANOVA with Unequal Sample Sizes

    hypothesis testing unequal sample size

  4. Sample Size Calculation: Hypothesis Testing || Randomized control trial

    hypothesis testing unequal sample size

  5. PPT

    hypothesis testing unequal sample size

  6. Hypothesis testing -- two sample t test (unequal variance)

    hypothesis testing unequal sample size

VIDEO

  1. Nonparametric Methods: Nominal-Level Hypothesis

  2. FA II STATISTICS/ Chapter no 7 / Testing of hypothesis/ Z distribution / Example 7.8

  3. Large Sample Hypothesis Tests Sample Size

  4. Independent Samples

  5. Testing Hypothesis: One Sample Test in SPSS (Urdu)

  6. Two Sample Hypothesis T Test

COMMENTS

  1. How to Perform a t-test with Unequal Sample Sizes

    The following examples demonstrate how to perform t-tests with unequal sample sizes when the variances are equal and when they're not equal. ... The results are as follows: Program 1: n (sample size): 500; x (sample mean): 80; s ... = TRUE) Two Sample t-test data: program1 and program2 t = -3.3348, df = 518, p-value = 0.0009148 alternative ...

  2. hypothesis testing

    The classical two-sample unpaired t-test for example assumes variance homongenity and is robust against violations only if both groups are similarily sized (in order of magnitude). Otherwise higher variance in the smaller group will lead to Type I errors. Now with the t-test this is not much of a problem since commonly the Welch t-test is used ...

  3. When Unequal Sample Sizes Are and Are NOT a Problem in ANOVA

    The statistical power of a hypothesis test that compares groups is highest when groups have equal sample sizes. ... It seems that these test do not equate for unequal sample size, which I fell may be throwing off my data! DO you have any suggestions? I am comparing those distances of before and after for 5 teams. Reply. Karen says. May 25, 2012 ...

  4. Sample size, power and effect size revisited: simplified and practical

    In order to understand and interpret the sample size, power analysis, effect size, and P value, it is necessary to know how the hypothesis of the study was formed. It is best to evaluate a study for Type I and Type II errors ( Figure 1 ) through consideration of the study results in the context of its hypotheses ( 14 - 16 ).

  5. 15.6: Unequal Sample Sizes

    Figure 15.6.1 15.6. 1: An interaction plot with unequal sample sizes. First, let's consider the hypothesis for the main effect of B B tested by the Type III sums of squares. Type III sums of squares weight the means equally and, for these data, the marginal means for b1 b 1 and b2 b 2 are equal:

  6. How to Perform a t-test with Unequal Sample Sizes

    The following code shows how to perform an independent samples t-test along with a Welch's t-test: #perform independent samples t-test. t.test(program1, program2, var.equal=TRUE) Two Sample t-test. data: program1 and program2. t = -3.3348, df = 518, p-value = 0.0009148. alternative hypothesis: true difference in means is not equal to 0.

  7. PDF STAT 705 Chapters 23 and 24: Two factors, unequal sample sizes; multi

    Type III tests We treat model as regression model with (a 1) + (b 1) + (a 1)(b 1) = ab 1 predictors, but we only test dropping blocks of predictors from this full model V corresponding to A, B, or AB, using general nested linear hypotheses (\big model / little model"), as in regression. Recall n T = P a i=1 P b j=1 n ij. SAS gives Type III ...

  8. PDF Testing Closeness With Unequal Sized Samples

    question in the extremely practically relevant setting of unequal sample sizes. Informally, taking "to be a small constant, we show that provided pand qare supported on at most nelements, for any 2[0;1=3];the hypothesis test can be successfully performed (with high probability over the random samples) given samples of size m 1 = ( n2=3+) from p ...

  9. 25.3

    Let's take a look at two examples that illustrate the kind of sample size calculation we can make to ensure our hypothesis test has sufficient power. Example 25-4 Section Let \(X\) denote the crop yield of corn measured in the number of bushels per acre.

  10. Sample Size and its Importance in Research

    ESTIMATING SAMPLE SIZE. So how large should a sample be? In hypothesis testing studies, this is mathematically calculated, conventionally, as the sample size necessary to be 80% certain of identifying a statistically significant outcome should the hypothesis be true for the population, with P for statistical significance set at 0.05. Some investigators power their studies for 90% instead of 80 ...

  11. PDF Tutorial 4: Power and Sample Size for Two-sample t-test with Unequal

    required to ensure a pre-specified power for a hypothesis test depends on variability, level of significance, and the null vs. alternative difference. In order to understand our terms, we will review these key components of hypothesis testing before embarking on a numeric example of power and sample size estimation for the two independent-
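
    The dependence of required sample size on allocation can be sketched with the standard normal-approximation formula for a two-sided two-sample test (effect size d = 0.5, α = 0.05, and power = 0.80 are assumed illustrative values, not the tutorial's; ratio = n2/n1):

    ```python
    from scipy.stats import norm

    def n1_required(d, ratio, alpha=0.05, power=0.80):
        # Normal-approximation sample size for group 1 in a two-sided
        # two-sample test of Cohen's d, with n2 = ratio * n1.
        z_a = norm.ppf(1 - alpha / 2)
        z_b = norm.ppf(power)
        return (1 + 1 / ratio) * (z_a + z_b) ** 2 / d ** 2

    n_equal = n1_required(0.5, ratio=1.0)    # per-group n with equal allocation
    n_unequal = n1_required(0.5, ratio=2.0)  # group 1 when group 2 is twice as big
    print(n_equal, n_unequal)
    ```

    Note that with ratio = 2 the smaller group shrinks, but the total sample (n1 + 2·n1) exceeds the balanced total — unequal allocation buys power inefficiently.
    
    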

  12. One-Way ANOVA: Unequal Sample Sizes

    10.4: One-Way ANOVA: Unequal Sample Sizes. One-way ANOVA can be performed on three or more samples of unequal sizes. However, the calculations get more complicated when the sample sizes are not all the same. So, when performing ANOVA with unequal sample sizes, the following equation is used: In the equation, n is the sample size, x̄ is the sample mean ...
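
    As the article's opening notes, software applies the unequal-n sums-of-squares formulas automatically. A minimal sketch with scipy's f_oneway on three made-up groups of unequal size:

    ```python
    from scipy import stats

    # Three hypothetical groups with unequal sizes (n = 4, 3, 5)
    g1 = [4, 5, 6, 5]
    g2 = [7, 8, 9]
    g3 = [2, 3, 4, 3, 3]

    # f_oneway handles unequal group sizes directly
    f_stat, p_value = stats.f_oneway(g1, g2, g3)
    print(f_stat, p_value)
    ```

    Here the group means (5, 8, 3) are far apart relative to the within-group spread, so the test rejects easily despite the imbalance.
    
    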

  13. PDF 2 Sample t-Test (unequal sample sizes and unequal variances)

    If the data are normally distributed (or close enough), we choose to test this hypothesis using a 2-tailed, 2-sample t-test, taking into account the inequality of variances and sample sizes. Below are the data: Sample 1: 19.7146 22.8245 26.3348 25.4338 20.8310 19.3516 29.1662 21.5908 25.0997 18.0220 20.8439 28.8265 23.8161 27.0340 23.5834 18.6316

  14. Issues in Estimating Sample Size for Hypothesis Testing

    Suppose we want to test the following hypotheses at α = 0.05: H0: μ = 90 versus H1: μ ≠ 90. To test the hypotheses, suppose we select a sample of size n = 100. For this example, assume that the standard deviation of the outcome is σ = 20. We compute the sample mean and then must decide whether the sample mean provides evidence to support the ...
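
    Working that setup through: with σ known, the decision rule is a z-test on the sample mean. The observed mean below (94) is a made-up value just to complete the calculation:

    ```python
    import math
    from scipy.stats import norm

    # Setup from the snippet: H0: mu = 90 vs H1: mu != 90, n = 100, sigma = 20
    mu0, sigma, n, alpha = 90, 20, 100, 0.05
    xbar = 94  # hypothetical observed sample mean

    se = sigma / math.sqrt(n)             # standard error = 20/10 = 2.0
    z = (xbar - mu0) / se                 # z = (94 - 90)/2 = 2.0
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value
    print(z, p_value)                     # z = 2.0, p ≈ 0.0455 < 0.05, reject H0
    ```
    
    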

  15. Comparison of variance between two samples with unequal sample size

    The secondary goal is to test for a difference in means. I do not yet have the data but I know for sample1, n=167 and for sample2, n=998. They can be assumed independent. My current plan for the data: 1) Perform a Shapiro-Wilk test to assess normality. 2) If the data is not normal, perform Levene's test of equal variance.
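
    The poster's two-step plan can be sketched in Python with scipy; the data below are simulated placeholders (the stated n = 167 and n = 998 are kept, the means and SDs are invented):

    ```python
    import numpy as np
    from scipy import stats

    # Simulated stand-ins for the two samples, with the sizes from the question
    rng = np.random.default_rng(1)
    sample1 = rng.normal(10, 2, size=167)
    sample2 = rng.normal(10, 3, size=998)

    # 1) Shapiro-Wilk test of normality on each sample
    w1, p_norm1 = stats.shapiro(sample1)
    w2, p_norm2 = stats.shapiro(sample2)

    # 2) Levene's test of equal variances (robust to mild non-normality)
    lev_stat, p_var = stats.levene(sample1, sample2)
    print(p_norm1, p_norm2, p_var)
    ```

    With these simulated SDs (2 vs 3) and sample sizes, Levene's test has ample power to detect the variance difference.
    
    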

  16. Unequal Sample Size

    Unequal Sample Size. For unequal sample sizes, let m denote the smallest sample size among the J groups. From: Introduction to Robust Estimation and Hypothesis Testing (Third Edition), 2012.

  17. hypothesis testing

    Tagged: hypothesis-testing; t-test; wilcoxon-mann-whitney-test. "Having sample sizes of 800 and 21,000 is far better than 800 and 800. Having unequal sample sizes isn't a bad thing. Having more samples is a good thing." – JimB, Jul 29, 2015 at 18:12 ... Unequal sample size ...

  18. Which Hypothesis testing method can I use for unequal sample size

    Welch's t-test is robust to unequal sample sizes and can help you determine whether two groups are significantly different from each other. See the following link for more information: http ...

  19. hypothesis testing

    This can be interpreted as saying that if the pigs grew at the same rate and we adjusted for the unequal group size, we would observe a difference like 1212.5 − 2×657.8 = −103.1 kg (or even more negative) only 2.7% of the time.

  20. The Open Educator

    Two-sample t-test with unequal variance can be applied when (1) the samples are normally distributed, (2) the standard deviations of both populations are unknown and assumed to be unequal, and (3) the sample is sufficiently large (over 30). To compare the height of two male populations from the United States and Sweden, a sample of 30 males from ...

  21. Running a two-sample t.test with unequal sample size in R

    We do this in R with a Fisher's F-test, var.test(x, y). If your p > 0.05, then you can assume that the variances of both samples are homogenous. In this case, we run a classic Student's two-sample t-test by setting the parameter var.equal = TRUE. If the F-test returns a p < 0.05, then you can assume that the variances of the two groups are ...
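
    scipy has no direct counterpart to R's var.test(x, y), but the same F-test of the variance ratio is easy to compute by hand and then feed into the choice of t-test (the data here are simulated; sizes and seed are arbitrary):

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.normal(0, 1, size=40)
    y = rng.normal(0, 1, size=25)

    # F-test of equal variances, mirroring R's var.test(x, y)
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    df1, df2 = len(x) - 1, len(y) - 1
    p = 2 * min(stats.f.cdf(f, df1, df2), stats.f.sf(f, df1, df2))  # two-sided

    # Choose the t-test accordingly, like setting var.equal in R's t.test
    equal_var = p > 0.05
    t_stat, t_p = stats.ttest_ind(x, y, equal_var=equal_var)
    print(p, equal_var, t_p)
    ```

    Worth noting: many statisticians now recommend skipping the preliminary F-test (it is sensitive to non-normality) and simply using Welch's test by default.
    
    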

  22. hypothesis testing

    The Welch test allows for unequal variances, as suggested by the different sample standard deviations, by reducing the DF to about 32. Thus the test has somewhat lower power than a pooled t-test, but for our fake data power is not an issue. Note: The R code below shows how the data above were sampled (simulated) from two normal populations. If ...
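
    The DF reduction the answer mentions comes from the Welch-Satterthwaite approximation, which can be computed directly (the sample SDs and sizes below are made-up values chosen to show the effect):

    ```python
    def welch_df(s1, n1, s2, n2):
        # Welch-Satterthwaite degrees of freedom for Welch's t-test,
        # given sample standard deviations s1, s2 and sizes n1, n2.
        v1, v2 = s1**2 / n1, s2**2 / n2
        return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

    # Hypothetical: unequal spreads and sizes pull the df well below n1 + n2 - 2 = 48
    df = welch_df(s1=4.0, n1=20, s2=12.0, n2=30)
    print(df)  # about 37.9, noticeably less than 48
    ```
    
    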
