User Preferences

Content preview.

Arcu felis bibendum ut tristique et egestas quis:

  • Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
  • Duis aute irure dolor in reprehenderit in voluptate
  • Excepteur sint occaecat cupidatat non proident

Keyboard Shortcuts

6.6 - confidence intervals & hypothesis testing.

Confidence intervals and hypothesis tests are similar in that they are both inferential methods that rely on an approximated sampling distribution. Confidence intervals use data from a sample to estimate a population parameter. Hypothesis tests use data from a sample to test a specified hypothesis. Hypothesis testing requires that we have a hypothesized parameter. 

The simulation methods used to construct bootstrap distributions and randomization distributions are similar. One primary difference is a bootstrap distribution is centered on the observed sample statistic while a randomization distribution is centered on the value in the null hypothesis. 

In Lesson 4, we learned confidence intervals contain a range of reasonable estimates of the population parameter. All of the confidence intervals we constructed in this course were two-tailed. These two-tailed confidence intervals go hand-in-hand with the two-tailed hypothesis tests we learned in Lesson 5. The conclusion drawn from a two-tailed confidence interval is usually the same as the conclusion drawn from a two-tailed hypothesis test. In other words, if the the 95% confidence interval contains the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always fail to reject the null hypothesis. If the 95% confidence interval does not contain the hypothesize parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always reject the null hypothesis.

Example: Mean Section  

This example uses the Body Temperature dataset built in to StatKey for constructing a  bootstrap confidence interval and conducting a randomization test . 

Let's start by constructing a 95% confidence interval using the percentile method in StatKey:

  

The 95% confidence interval for the mean body temperature in the population is [98.044, 98.474].

Now, what if we want to know if there is enough evidence that the mean body temperature is different from 98.6 degrees? We can conduct a hypothesis test. Because 98.6 is not contained within the 95% confidence interval, it is not a reasonable estimate of the population mean. We should expect to have a p value less than 0.05 and to reject the null hypothesis.

\(H_0: \mu=98.6\)

\(H_a: \mu \ne 98.6\)

\(p = 2*0.00080=0.00160\)

\(p \leq 0.05\), reject the null hypothesis

There is evidence that the population mean is different from 98.6 degrees. 

Selecting the Appropriate Procedure Section  

The decision of whether to use a confidence interval or a hypothesis test depends on the research question. If we want to estimate a population parameter, we use a confidence interval. If we are given a specific population parameter (i.e., hypothesized value), and want to determine the likelihood that a population with that parameter would produce a sample as different as our sample, we use a hypothesis test. Below are a few examples of selecting the appropriate procedure. 

Example: Cheese Consumption Section  

Research question: How much cheese (in pounds) does an average American adult consume annually? 

What is the appropriate inferential procedure? 

Cheese consumption, in pounds, is a quantitative variable. We have one group: American adults. We are not given a specific value to test, so the appropriate procedure here is a  confidence interval for a single mean .

Example: Age Section  

Research question:  Is the average age in the population of all STAT 200 students greater than 30 years?

There is one group: STAT 200 students. The variable of interest is age in years, which is quantitative. The research question includes a specific population parameter to test: 30 years. The appropriate procedure is a  hypothesis test for a single mean .

Try it! Section  

For each research question, identify the variables, the parameter of interest and decide on the the appropriate inferential procedure.

Research question:  How strong is the correlation between height (in inches) and weight (in pounds) in American teenagers?

There are two variables of interest: (1) height in inches and (2) weight in pounds. Both are quantitative variables. The parameter of interest is the correlation between these two variables.

We are not given a specific correlation to test. We are being asked to estimate the strength of the correlation. The appropriate procedure here is a  confidence interval for a correlation . 

Research question:  Are the majority of registered voters planning to vote in the next presidential election?

The parameter that is being tested here is a single proportion. We have one group: registered voters. "The majority" would be more than 50%, or p>0.50. This is a specific parameter that we are testing. The appropriate procedure here is a  hypothesis test for a single proportion .

Research question:  On average, are STAT 200 students younger than STAT 500 students?

We have two independent groups: STAT 200 students and STAT 500 students. We are comparing them in terms of average (i.e., mean) age.

If STAT 200 students are younger than STAT 500 students, that translates to \(\mu_{200}<\mu_{500}\) which is an alternative hypothesis. This could also be written as \(\mu_{200}-\mu_{500}<0\), where 0 is a specific population parameter that we are testing. 

The appropriate procedure here is a  hypothesis test for the difference in two means .

Research question:  On average, how much taller are adult male giraffes compared to adult female giraffes?

There are two groups: males and females. The response variable is height, which is quantitative. We are not given a specific parameter to test, instead we are asked to estimate "how much" taller males are than females. The appropriate procedure is a  confidence interval for the difference in two means .

Research question:  Are STAT 500 students more likely than STAT 200 students to be employed full-time?

There are two independent groups: STAT 500 students and STAT 200 students. The response variable is full-time employment status which is categorical with two levels: yes/no.

If STAT 500 students are more likely than STAT 200 students to be employed full-time, that translates to \(p_{500}>p_{200}\) which is an alternative hypothesis. This could also be written as \(p_{500}-p_{200}>0\), where 0 is a specific parameter that we are testing. The appropriate procedure is a  hypothesis test for the difference in two proportions.

Research question:  Is there is a relationship between outdoor temperature (in Fahrenheit) and coffee sales (in cups per day)?

There are two variables here: (1) temperature in Fahrenheit and (2) cups of coffee sold in a day. Both variables are quantitative. The parameter of interest is the correlation between these two variables.

If there is a relationship between the variables, that means that the correlation is different from zero. This is a specific parameter that we are testing. The appropriate procedure is a  hypothesis test for a correlation . 

Confidence Intervals

Hypothesis testing is the approach to statistical inference that we use when we have two competing theories that we are trying to choose between. A second approach to statistical inference is confidence intervals, which allow us to present a range of reasonable values for our unknown population parameter. The range of reasonable values allows us to understand the corresponding population better without requiring any ideas to be fully specified.

General Motivation and Framework

We have access to our sample, but we would really like to make a statement about the corresponding population. For example, we can calculate that the median price per night for a Chicago Airbnb was \$126 for a sample. What we really want to know, though, is what the median price per night for a Chicago Airbnb is for the entire population of Airbnbs, so that we can make an appropriate statement for the population.

How can we extend our knowledge from the sample to the population? We can use confidence intervals to help us generate a range of reasonable values for our unknown parameter. This will help us to make reasonable conclusions that should extend to the population appropriately.

To do so, we will combine our knowledge of sampling distributions with our specific sample value. This has many similar flavors to hypothesis testing but is approaching the problem through a different framework. Below, we'll walk through an example followed by the process to generate a confidence interval.

Confidence Interval Example

Like mentioned above, the median price per night for a Chicago Airbnb was $126 in our sample. Can we generate a sampling distribution for the possible values that the median price per night could take from repeated random samples?

We will use the resampling approach to generating a sampling distribution as described previously.

confidence interval null hypothesis example

Histogram of the sampling distribution for the median price of a Chicago Airbnb.

We've now generated our sampling distribution for sample median prices of Airbnbs in Chicago. Now, suppose that I want to create a range of reasonable values for the population median prices with 90% confidence (we'll define what 90% confidence means soon). To do so, I'll find the middle 90% of this distribution by calculating the 5th percentile and the 95th percentile.

For our simulated sampling distribution, the middle 90% are between \$120 and \$132 per night for a Chicago Airbnb.

At last, we'll make a jump from making statements about samples to making statements about populations. We could say that a range of reasonable values for the population median price per night of a Chicago Airbnb is between \$120 and \$132 per night.

Confidence Interval Steps

To generate a confidence interval, we follow the same set of steps. We do apply some steps differently depending on our specific parameter of interest.

To generate a confidence interval, we should:

  • Identify and define the parameter of interest
  • Determine the confidence level
  • Generate or use theory to specify the sampling distribution and check conditions
  • Calculate the middle region of your sampling distribution, according to your confidence level
  • Write a conclusion in the context of the problem.

Identify Parameter of Interest

We discussed identifying and defining the parameter of interest when we first described hypothesis testing. This is repeated for confidence intervals.

In this example, our population of interest is all Chicago Airbnbs. We likely would want to specify a time frame as well, and since we are using March 2023 data, we may specify that this is for all Chicago Airbnbs in March 2023.

Our parameter of interest (the summary measure) is the median. We may define the parameter of interest as $M$, the population median price per night for a Chicago Airbnb.

Determine the Confidence Level

The confidence level is analogous to the significance level. We'll provide a more exact definition and interpretation of the confidence level shortly. Confidence levels should be greater than 0% and less than 100%.

Confidence levels do not depend on the data and should be selected before observing the data. The confidence level is generally chosen based on the stakeholders and their requirements for the confidence in results. More confidence in the results are associated with higher confidence levels.

Common confidence levels include 90%, 95%, 98%, and 99%.

Determine the Sampling Distribution for the Sample Statistic

We again will use the sampling distribution of the sample statistic as the basis for our confidence interval calculation. To do so, we can follow the same process outlined for hypothesis testing. Recall, that we chose between a simulation-based resampling approach or a theory-based approach using the Central Limit Theorem to define the sampling distribution.

The biggest distinction between generating sampling distributions for confidence intervals compared to hypothesis testing is that we don't need to make any adjustments to our sampling distribution so that it is consistent with the null hypothesis. That is, recall that we wanted to adopt the skeptic's claim in hypothesis testing. When we were generating a sampling distribution, we would make any modifications necessary so that the sampling distribution fulfilled the condition of the null hypothesis. This distinction should be considered in two ways:

  • when generating the sampling distribution
  • when checking any necessary conditions

For example, if we were performing hypothesis testing with a simulation-based approach, we would need to first adjust the data so that the sample median was equal to the null value. However, without that condition for confidence intervals, we would use the data exactly as it is in the sample.

Similarly, some conditions for sampling distributions use information about the parameter of interest. For example, the theory-based approach with proportions requires that $n \times p$ and $n \times (1-p)$ are both at least 10. When we have a hypothesis, we should plug in the null value from the null hypothesis into these checks. With confidence intervals, if we don't have any requirements for the parameter, we can use our best estimate for $p$, which is often $\hat{p}$ when checking the conditions.

Again, the simulation-based approach requires the least number of assumptions. For our example, it is the only option for estimating the sampling distribution, since we haven't introduced theory that relates to the sampling distribution for a sample median.

Calculate the Confidence Interval

After we have determined the sampling distribution, we want to actually calculate the confidence interval, which is the range of reasonable values for our parameter of interest.

We want to find the central part of the sampling distribution that corresponds to our confidence level to generate the confidence interval, regardless of the approach for generating the sampling distribution. That is, if we want a 95% confidence interval, we will want to find the 2.5th percentile and the 97.5th percentile of the sampling distribution, so that the middle 95% is contained within those two values. In general, if we say that our confidence level is represented as CL%, then we want the (100-CL)/2 and (100+CL)/2 percentiles. We can find these percentiles both for a simulated sampling distribution or for a well-defined distribution, as long as we provide Python with the appropriate information.

This might seem counterintuitive, as we are using information about our sample to generate a guess about our population. To understand this, let's start by saying that this range would be a range of typical values for a sample statistic as calculated from our available data. Then, we're going to switch the order of the statement. This indicates that a sample statistic like the one we found would be reasonable if our parameter were anywhere in that range instead. Therefore, we'll say that the confidence interval that we calculated represents a range of reasonable values for the parameter.

Write a Conclusion in the Context of the Problem

Finally, we've generated our confidence interval and want to communicate our results to other stakeholders. What exactly does the confidence interval mean?

Informally, we might say something like: it is reasonable to claim that the population median price for a Chicago Airbnb is between \$120 and \$136 per night, with 90% confidence.

The formal interpretation is that we are 90% confident that the true population median price for a Chicago Airbnb falls in the range of \$120 and \$136 per night.

Confidence Interval Widths

Say that a stakeholder is not satisfied with a confidence interval. A common concern is that a confidence interval is too wide; that is, your stakeholder would like a narrower range of reasonable values. What can be changed to satisfy your stakeholder?

The two adjustable factors that affect the width of the confidence interval are the:

  • sample size
  • confidence level

Larger sample sizes result in narrower sampling distributions (recall this feature of the standard error from our sampling distribution module). This will also result in our confidence interval being narrower.

Larger confidence levels require a larger component of the sampling distribution to be included in the confidence interval. This will result in a wider confidence interval.

Therefore, if your stakeholder wants a narrower confidence interval, you could add more observations to your sample size or you could reduce your confidence level. It is also possible to estimate a desired sample size before gathering data that results in a confidence interval with limitations on the width of the confidence interval. We will skip over this calculation for our course, although you may encounter it in a future course.

Confidence Interval Misconceptions and Misinterpretations

We've discussed briefly what a confidence interval means. Equally important is what a confidence interval does not imply.

A confidence interval does not correspond to:

  • the probability that the parameter is in the confidence interval
  • a range of reasonable values for the sample data
  • a range of reasonable values for a sample statistic
  • a range of reasonable values for any future results from another sample

These last three misconceptions stem from misunderstanding that the confidence interval is about the parameter of interest and not about the sample or any of its corresponding characteristics.

For the first statement, consider that the population is already defined, and the corresponding parameter value for the population could then be calculated. It is a specific number, and it doesn't change. For example, it might be 120 or it could be 145. However, since the population is fixed, it is that exact number.

Once the confidence interval is calculated, then the confidence interval is also set and determined. It won't change. In this case, the parameter will either be contained in our confidence interval or it won't be, so the probability associated with the parameter being in the confidence interval is either 0 (the confidence interval isn't correct) or 1 (the confidence interval is correct).

Confidence Level Interpretation

We now understand how to calculate a confidence interval, what the confidence interval indicates, and what it doesn't indicate. However, we need to return to the second step where we set the confidence level for the interval. We know that this will have ramifications for the following steps of generating a confidence interval. But, what does it mean?

The confidence level means:

"If we gathered repeated random samples of the same size and calculated a CL% confidence interval for each, we would expect CL% of the resulting confidence intervals to contain the true parameter of interest."

Generally, this means that we expect CL% of our intervals to be correct. However, as we discussed above, we can't apply this reasoning to one specific interval after it's been calculated. This still does allow for variability and for different confidence intervals being generated from different samples.

Hypothesis Testing Decisions through Confidence Intervals

You may have noticed that many of the steps used for confidence intervals are shared with hypothesis testing. While there are distinctions between the two, we can also use confidence intervals to help us determine the result of a hypothesis test.

Suppose that a friend found it reported that the median price for all Chicago hotels is $160 per night. They suspect that Airbnbs are less expensive per night, and the population median price for Chicago Airbnbs is less expensive.

That is, the parameter of interest would be $M$ the population median price per night for all Chicago Airbnbs in March 2023. We can (and have) found the corresponding sample statistic, $m$ or the median price per night for the Chicago Airbnbs from our sample.

Because we don't have any data to analyze for Chicago hotels, we'll use this number as if it were true and treat this as a test for only one population. Our hypotheses would be:

$H_0: M = 160$

$H_a: M < 160$

What does the data say? If we've already generated a confidence interval, we don't need to repeat many of the steps for hypothesis testing. Instead, we can consider our calculated confidence interval as a range of reasonable values for our parameter. That is, it is reasonable that the population median price per night for all Chicago Airbnbs is between \$120 and \$136. In this case, the null value of 160 is not included in the range of reasonable values. Everything reasonable falls under the alternative hypothesis. We would want to reject the null hypothesis and adopt the alternative hypothesis as a more reasonable claim.

In this case, our confidence interval clearly supports our alternative hypothesis rather than our null hypothesis. However, in order to use confidence intervals to anticipate the decision for a hypothesis test, we need to ensure that we are using comparable confidence and significance levels:

  • for a two-sided alternative hypothesis, use a confidence level of $1-\alpha$
  • for a one-sided alternative hypothesis, use a confidence level of $1-2\times\alpha$

Save 10% on All AnalystPrep 2024 Study Packages with Coupon Code BLOG10 .

  • Payment Plans
  • Product List
  • Partnerships

AnalystPrep

  • Try Free Trial
  • Study Packages
  • Levels I, II & III Lifetime Package
  • Video Lessons
  • Study Notes
  • Practice Questions
  • Levels II & III Lifetime Package
  • About the Exam
  • About your Instructor
  • Part I Study Packages
  • Parts I & II Packages
  • Part I & Part II Lifetime Package
  • Part II Study Packages
  • Exams P & FM Lifetime Package
  • Quantitative Questions
  • Verbal Questions
  • Data Insight Questions
  • Live Tutoring
  • About your Instructors
  • EA Practice Questions
  • Data Sufficiency Questions
  • Integrated Reasoning Questions

Hypothesis Testing

Hypothesis Testing

After completing this reading, you should be able to:

  • Construct an appropriate null hypothesis and alternative hypothesis and distinguish between the two.
  • Construct and apply confidence intervals for one-sided and two-sided hypothesis tests, and interpret the results of hypothesis tests with a specific level of confidence.
  • Differentiate between a one-sided and a two-sided test and identify when to use each test.
  • Explain the difference between Type I and Type II errors and how these relate to the size and power of a test.
  • Understand how a hypothesis test and a confidence interval are related.
  • Explain what the p-value of a hypothesis test measures.
  • Interpret the results of hypothesis tests with a specific level of confidence.
  • Identify the steps to test a hypothesis about the difference between two population means.
  • Explain the problem of multiple testing and how it can bias results.

Hypothesis testing is defined as a process of determining whether a hypothesis is in line with the sample data. Hypothesis testing tries to test whether the observed data of the hypothesis is true. Hypothesis testing starts by stating the null hypothesis and the alternative hypothesis. The null hypothesis is an assumption of the population parameter. On the other hand,  the alternative hypothesis states the parameter values (critical values) at which the null hypothesis is rejected. The critical values are determined by the distribution of the test statistic (when the null hypothesis is true) and the size of the test (which gives the size at which we reject the null hypothesis).

Components of the Hypothesis Testing

The elements of the test hypothesis include:

  • The null hypothesis.
  • The alternative hypothesis.
  • The test statistic.
  • The size of the hypothesis test and errors
  • The critical value.
  • The decision rule.

The Null hypothesis

As stated earlier, the first stage of the hypothesis test is the statement of the null hypothesis. The null hypothesis is the statement concerning the population parameter values. It brings out the notion that “there is nothing about the data.”

The  null hypothesis , denoted as H 0 , represents the current state of knowledge about the population parameter that’s the subject of the test. In other words, it represents the “status quo.” For example, the U.S Food and Drug Administration may walk into a cooking oil manufacturing plant intending to confirm that each 1 kg oil package has, say, 0.15% cholesterol and not more. The inspectors will formulate a hypothesis like:

H 0 : Each 1 kg package has 0.15% cholesterol.

A test would then be carried out to confirm or reject the null hypothesis.

Other typical statements of H 0  include:

$$H_0:\mu={\mu}_0$$

$$H_0:\mu≤{\mu}_0$$

\(μ\) = true population mean and,

\(μ_0\)= the hypothesized population mean.

The Alternative Hypothesis

The  alternative hypothesis , denoted H 1 , is a contradiction of the null hypothesis. The null hypothesis determines the values of the population parameter at which the null hypothesis is rejected. Thus, rejecting the H 0  makes H 1  valid. We accept the alternative hypothesis when the “status quo” is discredited and found to be untrue.

Using our FDA example above, the alternative hypothesis would be:

H 1 : Each 1 kg package does not have 0.15% cholesterol.

The typical statements of H1   include:

$$H_1:\mu \neq {\mu}_0$$

$$H_1:\mu > {\mu}_0$$

Note that we have stated the alternative hypothesis, which contradicted the above statement of the null hypothesis.

The Test Statistic

A test statistic is a standardized value computed from sample information when testing hypotheses. It compares the given data with what we would expect under the null hypothesis. Thus, it is a major determinant when deciding whether to reject H 0 , the null hypothesis.

We use the test statistic to gauge the degree of agreement between sample data and the null hypothesis. Analysts use the following formula when calculating the test statistic.

$$ \text{Test Statistic}= \frac{(\text{Sample Statistic–Hypothesized Value})}{(\text{Standard Error of the Sample Statistic})}$$

The test statistic is a random variable that changes from one sample to another. Test statistics assume a variety of distributions. We shall focus on normally distributed test statistics because it is used hypotheses concerning the means, regression coefficients, and other econometric models.

We shall consider the hypothesis test on the mean. Consider a null hypothesis \(H_0:μ=μ_0\). Assume that the data used is iid, and asymptotic normally distributed as:

$$\sqrt{n} (\hat{\mu}-\mu) \sim N(0, {\sigma}^2)$$

Where \({\sigma}^2\) is the variance of the sequence of the iid random variable used. The asymptotic distribution leads to the test statistic:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{\hat{\sigma}^2}{n}}}\sim N(0,1)$$

Note this is consistent with our initial definition of the test statistic.

The following table  gives a brief outline of the various test statistics used regularly, based on the distribution that the data is assumed to follow:

$$\begin{array}{ll} \textbf{Hypothesis Test} & \textbf{Test Statistic}\\ \text{Z-test} & \text{z-statistic} \\ \text{Chi-Square Test} & \text{Chi-Square statistic}\\ \text{t-test} & \text{t-statistic} \\ \text{ANOVA} & \text{F-statistic}\\ \end{array}$$ We can subdivide the set of values that can be taken by the test statistic into two regions: One is called the non-rejection region, which is consistent with H 0  and the rejection region (critical region), which is inconsistent with H 0 . If the test statistic has a value found within the critical region, we reject H 0 .

Just like with any other statistic, the distribution of the test statistic must be specified entirely under H 0  when H 0  is true.

The Size of the Hypothesis Test and the Type I and Type II Errors

While using sample statistics to draw conclusions about the parameters of the population as a whole, there is always the possibility that the sample collected does not accurately represent the population. Consequently, statistical tests carried out using such sample data may yield incorrect results that may lead to erroneous rejection (or lack thereof) of the null hypothesis. We have two types of errors:

Type I Error

Type I error occurs when we reject a true null hypothesis. For example, a type I error would manifest in the form of rejecting H 0  = 0 when it is actually zero.

Type II Error

Type II error occurs when we fail to reject a false null hypothesis. In such a scenario, the test provides insufficient evidence to reject the null hypothesis when it’s false.

The level of significance denoted by α represents the probability of making a type I error, i.e., rejecting the null hypothesis when, in fact, it’s true. α is the direct opposite of β, which is taken to be the probability of making a type II error within the bounds of statistical testing. The ideal but practically impossible statistical test would be one that  simultaneously   minimizes α and β. We use α to determine critical values that subdivide the distribution into the rejection and the non-rejection regions.

The Critical Value and the Decision Rule

The decision to reject or not to reject the null hypothesis is based on the distribution assumed by the test statistic. This means if the variable involved follows a normal distribution, we use the level of significance (α) of the test to come up with critical values that lie along with the standard normal distribution.

The decision rule is a result of combining the critical value (denoted by \(C_α\)), the alternative hypothesis, and the test statistic (T). The decision rule is to whether to reject the null hypothesis in favor of the alternative hypothesis or fail to reject the null hypothesis.

For the t-test, the decision rule is dependent on the alternative hypothesis. When testing the two-side alternative, the decision is to reject the null hypothesis if \(|T|>C_α\). That is, reject the null hypothesis if the absolute value of the test statistic is greater than the critical value. When testing on the one-sided, decision rule, reject the null hypothesis if \(T<C_α\)  when using a one-sided lower alternative and if \(T>C_α\)  when using a one-sided upper alternative. When a null hypothesis is rejected at an α significance level, we say that the result is significant at α significance level.

Note that prior to decision-making, one must decide whether the test should be one-tailed or two-tailed. The following is a brief summary of the decision rules under different scenarios:

Left One-tailed Test

H 1 : parameter < X

Decision rule: Reject H 0  if the test statistic is less than the critical value. Otherwise,  do not reject  H 0.

Right One-tailed Test

H 1 : parameter > X

Decision rule: Reject H 0  if the test statistic is greater than the critical value. Otherwise,  do not reject  H 0.

Two-tailed Test

H 1 : parameter  ≠  X (not equal to X)

Decision rule: Reject H 0  if the test statistic is greater than the upper critical value or less than the lower critical value.

Two-tailed Test

 H 0 : μ < μ 0  vs. H 1 : μ > μ 0.

The second graph represents the rejection region when the alternative is a one-sided upper. The null hypothesis, in this case, is stated as:

H 0 : μ > μ 0  vs. H 1 : μ < μ 0.

Example: Hypothesis Test on the Mean

Consider the returns from a portfolio \(X=(x_1,x_2,\dots, x_n)\) from 1980 through 2020. The approximated mean of the returns is 7.50%, with a standard deviation of 17%. We wish to determine whether the expected value of the return is different from 0 at a 5% significance level.

We start by stating the two-sided hypothesis test:

H 0 : μ =0 vs. H 1 : μ ≠ 0

The test statistic is:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{\hat{\sigma}^2}{n}}} \sim N(0,1)$$

In this case, we have,

\(\hat{μ}\)=0.075

\(\hat{\sigma}^2\)=0.17 2

$$T=\frac{0.075-0}{\sqrt{\frac{0.17^2}{40}}} \approx 2.79$$

At the significance level, \(α=5\%\),the critical value is \(±1.96\). Since this is a two-sided test, the rejection regions are ( \(-\infty,-1.96\) ) and (\(1.96, \infty \) ) as shown in the diagram below:

Rejection Regions - Two-Sided Test

The example above is an example of a Z-test (which is mostly emphasized in this chapter and immediately follows from the central limit theorem (CLT)). However, we can use the Student’s t-distribution if the random variables are iid and normally distributed and that the sample size is small (n<30).

In Student’s t-distribution, we used the unbiased estimator of variance. That is:

$$s^2=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{s^2}{n}}}$$

Therefore the test statistic for \(H_0=μ_0\) is given by:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{s^2}{n}}} \sim t_{n-1}$$

The Type II Error and the Test Power

The power of a test is the direct opposite of the level of significance. While the level of relevance gives us the probability of rejecting the null hypothesis when it’s, in fact, true, the power of a test gives the probability of correctly discrediting and rejecting the null hypothesis when it is false. In other words, it gives the likelihood of rejecting H 0  when, indeed, it’s false. Denoting the probability of type II error by \(\beta\), the power test is given by:

$$ \text{Power of a Test}=1–\beta $$

The power test measures the likelihood that the false null hypothesis is rejected. It is influenced by the sample size, the length between the hypothesized parameter and the true value, and the size of the test.

Confidence Intervals

A confidence interval can be defined as the range of parameters at which the true parameter can be found at a confidence level. For instance, a 95% confidence interval constitutes the set of parameter values where the null hypothesis cannot be rejected when using a 5% test size. Therefore, a 1-α confidence interval contains values that cannot be disregarded at a test size of α.

It is important to note that the confidence interval depends on the alternative hypothesis statement in the test. Let us start with the two-sided test alternatives.

$$ H_0:μ=0$$

$$H_1:μ≠0$$

Then the \(1-α\) confidence interval is given by:

$$\left[\hat{\mu} -C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} ,\hat{\mu} + C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} \right]$$

\(C_α\) is the critical value at \(α\) test size.

Example: Calculating Two-Sided Alternative Confidence Intervals

Consider the returns from a portfolio \(X=(x_1,x_2,…, x_n)\) from 1980 through 2020. The approximated mean of the returns is 7.50%, with a standard deviation of 17%. Calculate the 95% confidence interval for the portfolio return.

The \(1-\alpha\) confidence interval is given by:

$$\begin{align*}&\left[\hat{\mu}-C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} ,\hat{\mu} + C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} \right]\\& =\left[0.0750-1.96 \times \frac{0.17}{\sqrt{40}}, 0.0750+1.96 \times \frac{0.17}{\sqrt{40}} \right]\\&=[0.02232,0.1277]\end{align*}$$

Thus, the confidence intervals imply any value of the null between 2.23% and 12.77% cannot be rejected against the alternative.

One-Sided Alternative

For the one-sided alternative, the confidence interval is given by either:

$$\left(-\infty ,\hat{\mu} +C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}} \right )$$

for the lower alternative

$$\left ( \hat{\mu} +C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}},\infty \right )$$

for the upper alternative.

Example: Calculating the One-Sided Alternative Confidence Interval

Assume that we were conducting the following one-sided test:

\(H_0:μ≤0\)

\(H_1:μ>0\)

The 95% confidence interval for the portfolio return is:

$$\begin{align*}&=\left(-\infty ,\hat{\mu} +C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}} \right )\\&=\left(-\infty ,0.0750+1.645\times \frac{0.17}{\sqrt{40}}\right)\\&=(-\infty, 0.1192)\end{align*}$$

On the other hand, if the hypothesis test was:

\(H_0:μ>0\)

\(H_1:μ≤0\)

The 95% confidence interval would be:

$$=\left(-\infty ,\hat{\mu} +C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}} \right )$$

$$=\left(-\infty ,0.0750+1.645\times \frac{0.17}{\sqrt{40}}\right)=(0.1192, \infty)$$

Note that the critical value decreased from 1.96 to 1.645 due to a change in the direction of the change.

The p-Value

When carrying out a statistical test with a fixed value of the significance level (α), we merely compare the observed test statistic with some critical value. For example, we might “reject H 0  using a 5% test” or “reject H 0 at 1% significance level”. The problem with this ‘classical’ approach is that it does not give us details about the  strength of the evidence  against the null hypothesis.

Determination of the  p-value  gives statisticians a more informative approach to hypothesis testing. The p-value is the lowest level at which we can reject H 0 . This means that the strength of the evidence against H 0  increases as the  p-value becomes smaller. The test statistic depends on the alternative.

The p-Value for One-Tailed Test Alternative

For one-tailed tests, the  p-value  is given by the probability that lies below the calculated test statistic for left-tailed tests. Similarly, the likelihood that lies above the test statistic in right-tailed tests gives the  p-value.

Denoting the test statistic by T, the p-value for \(H_1:μ>0\)  is given by:

$$P(Z>|T|)=1-P(Z≤|T|)=1- \Phi (|T|) $$

Conversely , for  \(H_1:μ≤0 \)  the p-value is given by:

$$ P(Z≤|T|)= \Phi (|T|)$$ 

Where z is a standard normal random variable, the absolute value of T (|T|) ensures that the right tail is measured whether T is negative or positive.

The p-Value for Two-Tailed Test Alternative

  If the test is two-tailed, this value is given by the sum of the probabilities in the two tails. We start by determining the probability lying below the negative value of the test statistic. Then, we add this to the probability lying above the positive value of the test statistic. That is the p-value for the two-tailed hypothesis test is given by:

$$2\left[1-\Phi [|T|\right]$$

Example 1: p-Value for One-Sided Alternative

Let θ represent the probability of obtaining a head when a coin is tossed. Suppose we toss the coin 200 times, and heads come up in 85 of the trials. Test the following hypothesis at 5% level of significance.

H 0 : θ = 0.5

H 1 : θ < 0.5

First, not that repeatedly tossing a coin follows a binomial distribution.

Our p-value will be given by P(X < 85) where X  `binomial(200,0.5)  with mean 100(np=200*0.5), assuming H 0  is true.

$$\begin{align*}P\left [ z< \frac{85.5-100}{\sqrt{50}} \right]&=P(Z<-2.05)\\&=1–0.97982=0.02018 \end{align*}$$

Recall that for a binomial distribution, the variance is given by:

$$np(1-p)=200(0.5)(1-0.5)=50$$

(We have applied the Central Limit Theorem by taking the binomial distribution as approx. normal)

Since the probability is less than 0.05, H 0  is extremely unlikely, and we actually have strong evidence against H 0  that favors H 1 . Thus, clearly expressing this result, we could say:

“There is very strong evidence against the hypothesis that the coin is fair. We, therefore, conclude that the coin is biased against heads.”

Remember, failure to reject H 0  does not mean it’s true. It means there’s insufficient evidence to justify rejecting H 0,  given a certain level of significance.

Example 2:  p-Value for Two-Sided Alternative

A CFA candidate conducts a statistical test about the mean value of a random variable X.

H 0 : μ = μ 0  vs. H 1 : μ  ≠  μ 0

She obtains a test statistic of 2.2. Given a 5% significance level, determine and interpret the  p-value

$$ \text{P-value}=2P(Z>2.2)=2[1–P(Z≤2.2)]  =1.39\%×2=2.78\%$$

(We have multiplied by two since this is a two-tailed test)

Example - Two-Sided Test

The p-value (2.78%) is less than the level of significance (5%). Therefore, we have sufficient evidence to reject H 0 . In fact, the evidence is so strong that we would also reject H 0  at significance levels of 4% and 3%. However, at significance levels of 2% or 1%, we would not reject H 0  since the  p-value  surpasses these values.

Hypothesis about the Difference between Two Population Means.

It’s common for analysts to be interested in establishing whether there exists a significant difference between the means of two different populations. For instance, they might want to know whether the average returns for two subsidiaries of a given company exhibit  significant  differences.

Now, consider a bivariate random variable:

$$W_i=[X_i,Y_i]$$

Assume that the components \(X_i\) and \(Y_i\)are both iid and are correlated. That is: \(\text{Corr} (X_i,Y_i )≠0\)

Now, suppose that we want to test the hypothesis that:

$$H_0:μ_X=μ_Y$$

$$H_1:μ_X≠μ_Y$$

In other words, we want to test whether the constituent random variables have equal means. Note that the hypothesis statement above can be written as:

$$H_0:μ_X-μ_Y=0$$

$$H_1:μ_X-μ_Y≠0$$

To execute this test, consider the variable:

$$Z_i=X_i-Y_i$$

Therefore, considering the above random variable, if the null hypothesis is correct then,

$$E(Z_i)=E(X_i)-E(Y_i)=μ_X-μ_Y=0$$

Intuitively, this can be considered as a standard hypothesis test of

H 0 : μ Z =0 vs. H 1 : μ Z  ≠ 0.

The tests statistic is given by:

$$T=\frac{\hat{\mu}_z}{\sqrt{\frac{\hat{\sigma}^2_z}{n}}} \sim N(0,1)$$

Note that the test statistic formula accounts for the correction between \(X_i \) and \(Y_i\). It is easy to see that:

$$V(Z_i)=V(X_i )+V(Y_i)-2COV(X_i, Y_i)$$

Which can be denoted as:

$$\hat{\sigma}^2_z =\hat{\sigma}^2_X +\hat{\sigma}^2_Y – 2{\sigma}_{XY}$$

$$ \hat{\mu}_z ={\mu}_X-{\mu}_Y $$

And thus the test statistic formula can be written as:

$$T=\frac{{\mu}_X -{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y – 2{\sigma}_{XY}}{n}}}$$

This formula indicates that correlation plays a crucial role in determining the magnitude of the test statistic.

Another special case of the test statistic is when \(X_i\), and \(Y_i\) are iid and independent. The test statistic is given by:

$$T=\frac{{\mu}_X -{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X}{n_X}+\frac{\hat{\sigma}^2_Y}{n_Y}}}$$

Where \(n_X\)  and \(n_Y\)  are the sample sizes of \(X_i\), and \(Y_i\) respectively.

Example: Hypothesis Test on Two Means

An investment analyst wants to test whether there is a significant difference between the means of the two portfolios at a 95% level. The first portfolio X consists of 30 government-issued bonds and has a mean of 10% and a standard deviation of 2%. The second portfolio Y consists of 30 private bonds with a mean of 14% and a standard deviation of 3%. The correlation between the two portfolios is 0.7. Calculate the null hypothesis and state whether the null hypothesis is rejected or otherwise.

The hypothesis statement is given by:

H 0 : μ X – μ Y =0 vs. H 1 : μ X – μ Y ≠ 0.

Note that this is a two-tailed test. At 95% level, the test size is α=5% and thus the critical value \(C_α=±1.96\). 

Recall that:

$$Cov(X, Y)=σ_{XY}=ρ_{XY} σ_X σ_Y$$

Where ρ_XY  is the correlation coefficient between X and Y.

Now the test statistic is given by:

$$T=\frac{{\mu}_X -{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y – 2{\sigma}_{XY}}{n}}}=\frac{{\mu}_X -{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y – 2{\rho}_{XY} {\sigma}_X {\sigma}_Y}{n}}}$$

$$=\frac{0.10-0.14}{\sqrt{\frac{0.02^2 +0.03^2-2\times 0.7 \times 0.02 \times 0.03}{30}}}=-10.215$$

The test statistic is far much less than -1.96. Therefore the null hypothesis is rejected at a 95% level.

The Problem of Multiple Testing

Multiple testing occurs when multiple multiple hypothesis tests are conducted on the same data set. The reuse of data results in spurious results and unreliable conclusions that do not hold up to scrutiny. The fundamental problem with multiple testing is that the test size (i.e., the probability that a true null is rejected) is only applicable for a single test. However, repeated testing creates test sizes that are much larger than the assumed size of alpha and therefore increases the probability of a Type I error.

Some control methods have been developed to combat multiple testing. These include Bonferroni correction, the False Discovery Rate (FDR), and Familywise Error Rate (FWER).

Practice Question An experiment was done to find out the number of hours that candidates spend preparing for the FRM part 1 exam. For a sample of 10 students , the average study time was found to be 312.7 hours, with a standard deviation of 7.2 hours. What is the 95% confidence interval for the mean study time of all candidates? A. [307.5, 317.9] B. [310, 317] C. [300, 317] D. [307.5, 312.2] The correct answer is A. To calculate the 95% confidence interval for the mean study time of all candidates, we can use the formula for the confidence interval when the population variance is unknown: \[\text{Confidence Interval} = \bar{X} \pm t_{1-\frac{\alpha}{2}} \times \frac{s}{\sqrt{n}}\] Where: \(\bar{X}\) is the sample mean \(t_{1-\frac{\alpha}{2}}\) is the t-score corresponding to the desired confidence level and degrees of freedom \(s\) is the sample standard deviation \(n\) is the sample size In this case: \(\bar{X} = 312.7\) hours (the average study time) \(s = 7.2\) hours (the standard deviation of study time) \(n = 10\) students (the sample size) To find the t-score (\(t_{1-\frac{\alpha}{2}}\)), we look at the t-table for the 95% confidence level (which corresponds to \(\alpha = 0.05\)) and 9 degrees of freedom (\(n – 1 = 10 – 1 = 9\)). The t-score is 2.262. Now, we can plug these values into the confidence interval formula: \[\text{Confidence Interval} = 312.7 \pm 2.262 \times \frac{7.2}{\sqrt{10}}\] Calculating the margin of error: \[\text{Margin of Error} = 2.262 \times \frac{7.2}{\sqrt{10}} \approx 5.2\] So the confidence interval is: \[\text{Confidence Interval} = 312.7 \pm 5.2 = [307.5, 317.9]\] Therefore, the 95% confidence interval for the mean study time of all candidates is [307.5, 317.9] hours.

Offered by AnalystPrep

confidence interval null hypothesis example

Approaches to Asset Allocation

Central clearing.

After completing this reading, you should be able to: Provide examples of the... Read More

GARP Code of Conduct

After completing this reading, you should be able to: Describe the responsibility of... Read More

Futures Markets

After completing this reading, you should be able to: Define and describe the... Read More

Common Univariate Random Variables

After completing this reading, you should be able to: Distinguish the key properties... Read More

Leave a Comment Cancel reply

You must be logged in to post a comment.

Logo for MacEwan Open Books

8.6 Relationship Between Confidence Intervals and Hypothesis Tests

Confidence intervals (CI) and hypothesis tests should give consistent results: we should not reject [latex]H_0[/latex] at the significance level [latex]\alpha[/latex] if the corresponding [latex](1 - \alpha) \times 100\%[/latex] confidence interval contains the hypothesized value [latex]\mu_0[/latex]. Two-sided confidence intervals correspond to two-tailed tests, upper-tailed confidence intervals correspond to right-tailed tests, and lower-tailed confidence intervals correspond to left-tailed tests.

A [latex](1 - \alpha) \times 100\%[/latex] two-sided [latex]t[/latex] confidence interval is given in the form [latex](\bar{x} - t_{\alpha / 2} \frac{s}{\sqrt{n}}, \bar{x} + t_{\alpha / 2} \frac{s}{\sqrt{n}})[/latex]. A [latex](1 - \alpha) \times 100\%[/latex] upper-tailed t confidence interval is given by [latex](\bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}, \infty)[/latex] and the number [latex]\bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}[/latex] is called the lower bound of the interval. A [latex](1 - \alpha) \times 100\%[/latex] lower-tailed t confidence interval is given by [latex](- \infty, \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}})[/latex] and the number [latex]\bar{x} + t_{\alpha} \frac{s}{\sqrt{n}}[/latex] is called the upper bound of the interval. We can also use confidence intervals to make conclusions about hypothesis tests: reject the null hypothesis [latex]H_0[/latex] at the significance level [latex]\alpha[/latex] if the corresponding [latex](1 - \alpha) \times 100\%[/latex] confidence interval does not contain the hypothesized value [latex]\mu_0[/latex]. The relationship is summarized in the following table.

Table 8.3 : Relationship Between Confidence Interval and Hypothesis Test

confidence interval null hypothesis example

Here is the reason we should reject [latex]H_0[/latex] if [latex]\mu_0[/latex] is outside the corresponding confidence interval.

Take the right-tailed test for example, we should reject [latex]H_0[/latex] if the observed test statistic [latex]t_o[/latex] falls in the rejection region, that is if [latex]t_o \geq t_{\alpha}[/latex]. This implies [latex]t_o = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \geq t_{\alpha} \Longrightarrow \mu_0 \leq \bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}.[/latex] Given that the upper-tailed confidence interval for a right-tailed test is [latex](\bar{x} - t_{\alpha / 2} \frac{s}{\sqrt{n}}, \infty)[/latex], [latex]\mu_0 \leq \bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}[/latex] means the value of [latex]\mu_0[/latex] is outside the confidence interval. The same rationale applies to two-tailed and left-tailed tests. Therefore, we can reject [latex]H_0[/latex] at the significance level [latex]\alpha[/latex] if [latex]\mu_0[/latex] is outside the corresponding (1– [latex]\alpha[/latex] )×100% confidence interval.

Example: Relationship Between Confidence Intervals and Hypothesis Tests

The ankle-brachial index (ABI) compares the blood pressure of a patient’s arm to the blood pressure of the patient’s leg. The ABI can be an indicator of different diseases, including arterial diseases. A healthy (or normal) ABI is 0.9 or greater. Researchers obtained the ABI of 100 women with peripheral arterial disease and obtained a mean ABI of 0.64 with a standard deviation of 0.15.

  • Set up the hypotheses: [latex]H_0: \mu \geq 0.9[/latex] versus [latex]H_a: \mu < 0.9[/latex].
  • The significance level is [latex]\alpha = 0.05[/latex].
  • Compute the value of the test statistic: [latex]t_o = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{0.64 - 0.9}{0.15 / \sqrt{100}} = \frac{-0.26}{0.015} = -17.333[/latex] with [latex]df = n-1 = 100 -1 = 99[/latex] (not given in Table IV, use 95, the closest one smaller than 99).
  • Find the P-value. For a left-tailed test, the P-value is the area to the left of the observed test statistic [latex]t_o[/latex]. [latex]\mbox{P-value} = P(t \leq t_o) = P(t \leq -17.333) = P(t \geq 17.333) 2.629(t_{0.005})[/latex].
  • Decision: Since the P- value [latex]< 0.005 < 0.05(\alpha)[/latex], we should reject the null hypothesis [latex]H_0[/latex].
  • Conclusion: At the 5% significance level, the data provide sufficient evidence that, on average, women with peripheral arterial disease have an unhealthy ABI.

[latex]\left( - \infty, \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}} \right)= \left( - \infty, 0.64 + 1.661 \times \frac{0.15}{\sqrt{100}} \right) = (- \infty , 0.665)[/latex].

  • Does the interval in part b) support the conclusion in part a)? In part a), we reject [latex]H_0[/latex] and claim that the mean ABI is below 0.9 for women with peripheral arterial disease. In part b), we are 95% confident that the mean ABI is less than 0.9 since the entire confidence interval is below 0.9. In other words, the hypothesized value 0.9 is outside the corresponding confidence interval, we should reject the null. Therefore, the results obtained in parts a) and b) are consistent.

Introduction to Applied Statistics Copyright © 2024 by Wanhua Su is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

U.S. flag

An official website of the United States government

The .gov means it's official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you're on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings
  • Browse Titles

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.

Cover of StatPearls

StatPearls [Internet].

Hypothesis testing, p values, confidence intervals, and significance.

Jacob Shreffler ; Martin R. Huecker .

Affiliations

Last Update: March 13, 2023 .

  • Definition/Introduction

Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.

  • Issues of Concern

Without a foundational understanding of hypothesis testing, p values, confidence intervals, and the difference between statistical and clinical significance, it may affect healthcare providers' ability to make clinical decisions without relying purely on the research investigators deemed level of significance. Therefore, an overview of these concepts is provided to allow medical professionals to use their expertise to determine if results are reported sufficiently and if the study outcomes are clinically appropriate to be applied in healthcare practice.

Hypothesis Testing

Investigators conducting studies need research questions and hypotheses to guide analyses. Starting with broad research questions (RQs), investigators then identify a gap in current clinical practice or research. Any research problem or statement is grounded in a better understanding of relationships between two or more variables. For this article, we will use the following research question example:

Research Question: Is Drug 23 an effective treatment for Disease A?

Research questions do not directly imply specific guesses or predictions; we must formulate research hypotheses. A hypothesis is a predetermined declaration regarding the research question in which the investigator(s) makes a precise, educated guess about a study outcome. This is sometimes called the alternative hypothesis and ultimately allows the researcher to take a stance based on experience or insight from medical literature. An example of a hypothesis is below.

Research Hypothesis: Drug 23 will significantly reduce symptoms associated with Disease A compared to Drug 22.

The null hypothesis states that there is no statistical difference between groups based on the stated research hypothesis.

Researchers should be aware of journal recommendations when considering how to report p values, and manuscripts should remain internally consistent.

Regarding p values, as the number of individuals enrolled in a study (the sample size) increases, the likelihood of finding a statistically significant effect increases. With very large sample sizes, the p-value can be very low significant differences in the reduction of symptoms for Disease A between Drug 23 and Drug 22. The null hypothesis is deemed true until a study presents significant data to support rejecting the null hypothesis. Based on the results, the investigators will either reject the null hypothesis (if they found significant differences or associations) or fail to reject the null hypothesis (they could not provide proof that there were significant differences or associations).

To test a hypothesis, researchers obtain data on a representative sample to determine whether to reject or fail to reject a null hypothesis. In most research studies, it is not feasible to obtain data for an entire population. Using a sampling procedure allows for statistical inference, though this involves a certain possibility of error. [1]  When determining whether to reject or fail to reject the null hypothesis, mistakes can be made: Type I and Type II errors. Though it is impossible to ensure that these errors have not occurred, researchers should limit the possibilities of these faults. [2]

Significance

Significance is a term to describe the substantive importance of medical research. Statistical significance is the likelihood of results due to chance. [3]  Healthcare providers should always delineate statistical significance from clinical significance, a common error when reviewing biomedical research. [4]  When conceptualizing findings reported as either significant or not significant, healthcare providers should not simply accept researchers' results or conclusions without considering the clinical significance. Healthcare professionals should consider the clinical importance of findings and understand both p values and confidence intervals so they do not have to rely on the researchers to determine the level of significance. [5]  One criterion often used to determine statistical significance is the utilization of p values.

P values are used in research to determine whether the sample estimate is significantly different from a hypothesized value. The p-value is the probability that the observed effect within the study would have occurred by chance if, in reality, there was no true effect. Conventionally, data yielding a p<0.05 or p<0.01 is considered statistically significant. While some have debated that the 0.05 level should be lowered, it is still universally practiced. [6]  Hypothesis testing allows us to determine the size of the effect.

An example of findings reported with p values are below:

Statement: Drug 23 reduced patients' symptoms compared to Drug 22. Patients who received Drug 23 (n=100) were 2.1 times less likely than patients who received Drug 22 (n = 100) to experience symptoms of Disease A, p<0.05.

Statement:Individuals who were prescribed Drug 23 experienced fewer symptoms (M = 1.3, SD = 0.7) compared to individuals who were prescribed Drug 22 (M = 5.3, SD = 1.9). This finding was statistically significant, p= 0.02.

For either statement, if the threshold had been set at 0.05, the null hypothesis (that there was no relationship) should be rejected, and we should conclude significant differences. Noticeably, as can be seen in the two statements above, some researchers will report findings with < or > and others will provide an exact p-value (0.000001) but never zero [6] . When examining research, readers should understand how p values are reported. The best practice is to report all p values for all variables within a study design, rather than only providing p values for variables with significant findings. [7]  The inclusion of all p values provides evidence for study validity and limits suspicion for selective reporting/data mining.  

While researchers have historically used p values, experts who find p values problematic encourage the use of confidence intervals. [8] . P-values alone do not allow us to understand the size or the extent of the differences or associations. [3]  In March 2016, the American Statistical Association (ASA) released a statement on p values, noting that scientific decision-making and conclusions should not be based on a fixed p-value threshold (e.g., 0.05). They recommend focusing on the significance of results in the context of study design, quality of measurements, and validity of data. Ultimately, the ASA statement noted that in isolation, a p-value does not provide strong evidence. [9]

When conceptualizing clinical work, healthcare professionals should consider p values with a concurrent appraisal study design validity. For example, a p-value from a double-blinded randomized clinical trial (designed to minimize bias) should be weighted higher than one from a retrospective observational study [7] . The p-value debate has smoldered since the 1950s [10] , and replacement with confidence intervals has been suggested since the 1980s. [11]

Confidence Intervals

A confidence interval provides a range of values within given confidence (e.g., 95%), including the accurate value of the statistical constraint within a targeted population. [12]  Most research uses a 95% CI, but investigators can set any level (e.g., 90% CI, 99% CI). [13]  A CI provides a range with the lower bound and upper bound limits of a difference or association that would be plausible for a population. [14]  Therefore, a CI of 95% indicates that if a study were to be carried out 100 times, the range would contain the true value in 95, [15]  confidence intervals provide more evidence regarding the precision of an estimate compared to p-values. [6]

In consideration of the similar research example provided above, one could make the following statement with 95% CI:

Statement: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22; there was a mean difference between the two groups of days to the recovery of 4.2 days (95% CI: 1.9 – 7.8).

It is important to note that the width of the CI is affected by the standard error and the sample size; reducing a study sample number will result in less precision of the CI (increase the width). [14]  A larger width indicates a smaller sample size or a larger variability. [16]  A researcher would want to increase the precision of the CI. For example, a 95% CI of 1.43 – 1.47 is much more precise than the one provided in the example above. In research and clinical practice, CIs provide valuable information on whether the interval includes or excludes any clinically significant values. [14]

Null values are sometimes used for differences with CI (zero for differential comparisons and 1 for ratios). However, CIs provide more information than that. [15]  Consider this example: A hospital implements a new protocol that reduced wait time for patients in the emergency department by an average of 25 minutes (95% CI: -2.5 – 41 minutes). Because the range crosses zero, implementing this protocol in different populations could result in longer wait times; however, the range is much higher on the positive side. Thus, while the p-value used to detect statistical significance for this may result in "not significant" findings, individuals should examine this range, consider the study design, and weigh whether or not it is still worth piloting in their workplace.

Similarly to p-values, 95% CIs cannot control for researchers' errors (e.g., study bias or improper data analysis). [14]  In consideration of whether to report p-values or CIs, researchers should examine journal preferences. When in doubt, reporting both may be beneficial. [13]  An example is below:

Reporting both: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22, p = 0.009. There was a mean difference between the two groups of days to the recovery of 4.2 days (95% CI: 1.9 – 7.8).

  • Clinical Significance

Recall that clinical significance and statistical significance are two different concepts. Healthcare providers should remember that a study with statistically significant differences and large sample size may be of no interest to clinicians, whereas a study with smaller sample size and statistically non-significant results could impact clinical practice. [14]  Additionally, as previously mentioned, a non-significant finding may reflect the study design itself rather than relationships between variables.

Healthcare providers using evidence-based medicine to inform practice should use clinical judgment to determine the practical importance of studies through careful evaluation of the design, sample size, power, likelihood of type I and type II errors, data analysis, and reporting of statistical findings (p values, 95% CI or both). [4]  Interestingly, some experts have called for "statistically significant" or "not significant" to be excluded from work as statistical significance never has and will never be equivalent to clinical significance. [17]

The decision on what is clinically significant can be challenging, depending on the providers' experience and especially the severity of the disease. Providers should use their knowledge and experiences to determine the meaningfulness of study results and make inferences based not only on significant or insignificant results by researchers but through their understanding of study limitations and practical implications.

  • Nursing, Allied Health, and Interprofessional Team Interventions

All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts in this chapter. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care. 

  • Review Questions
  • Access free multiple choice questions on this topic.
  • Comment on this article.

Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.

Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits others to distribute the work, provided that the article is not altered or used commercially. You are not required to obtain permission to distribute this article, provided that you credit the author and journal.

  • Cite this Page Shreffler J, Huecker MR. Hypothesis Testing, P Values, Confidence Intervals, and Significance. [Updated 2023 Mar 13]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.

In this Page

Bulk download.

  • Bulk download StatPearls data from FTP

Related information

  • PMC PubMed Central citations
  • PubMed Links to PubMed

Similar articles in PubMed

  • The reporting of p values, confidence intervals and statistical significance in Preventive Veterinary Medicine (1997-2017). [PeerJ. 2021] The reporting of p values, confidence intervals and statistical significance in Preventive Veterinary Medicine (1997-2017). Messam LLM, Weng HY, Rosenberger NWY, Tan ZH, Payet SDM, Santbakshsing M. PeerJ. 2021; 9:e12453. Epub 2021 Nov 24.
  • Review Clinical versus statistical significance: interpreting P values and confidence intervals related to measures of association to guide decision making. [J Pharm Pract. 2010] Review Clinical versus statistical significance: interpreting P values and confidence intervals related to measures of association to guide decision making. Ferrill MJ, Brown DA, Kyle JA. J Pharm Pract. 2010 Aug; 23(4):344-51. Epub 2010 Apr 13.
  • Interpreting "statistical hypothesis testing" results in clinical research. [J Ayurveda Integr Med. 2012] Interpreting "statistical hypothesis testing" results in clinical research. Sarmukaddam SB. J Ayurveda Integr Med. 2012 Apr; 3(2):65-9.
  • Confidence intervals in procedural dermatology: an intuitive approach to interpreting data. [Dermatol Surg. 2005] Confidence intervals in procedural dermatology: an intuitive approach to interpreting data. Alam M, Barzilai DA, Wrone DA. Dermatol Surg. 2005 Apr; 31(4):462-6.
  • Review Is statistical significance testing useful in interpreting data? [Reprod Toxicol. 1993] Review Is statistical significance testing useful in interpreting data? Savitz DA. Reprod Toxicol. 1993; 7(2):95-100.

Recent Activity

  • Hypothesis Testing, P Values, Confidence Intervals, and Significance - StatPearl... Hypothesis Testing, P Values, Confidence Intervals, and Significance - StatPearls

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

Connect with NLM

National Library of Medicine 8600 Rockville Pike Bethesda, MD 20894

Web Policies FOIA HHS Vulnerability Disclosure

Help Accessibility Careers

statistics

Icon Partners

  • Quality Improvement
  • Talk To Minitab

Understanding Hypothesis Tests: Confidence Intervals and Confidence Levels

Topics: Hypothesis Testing , Data Analysis , Statistics

In this series of posts, I show how hypothesis tests and confidence intervals work by focusing on concepts and graphs rather than equations and numbers.  

Previously, I used graphs to show what statistical significance really means . In this post, I’ll explain both confidence intervals and confidence levels, and how they’re closely related to P values and significance levels.

How to Correctly Interpret Confidence Intervals and Confidence Levels

A confidence interval is a range of values that is likely to contain an unknown population parameter. If you draw a random sample many times, a certain percentage of the confidence intervals will contain the population mean. This percentage is the confidence level.

Most frequently, you’ll use confidence intervals to bound the mean or standard deviation, but you can also obtain them for regression coefficients, proportions, rates of occurrence (Poisson), and for the differences between populations.

Just as there is a common misconception of how to interpret P values , there’s a common misconception of how to interpret confidence intervals. In this case, the confidence level is not the probability that a specific confidence interval contains the population parameter.

The confidence level represents the theoretical ability of the analysis to produce accurate intervals if you are able to assess many intervals and you know the value of the population parameter. For a specific confidence interval from one study, the interval either contains the population value or it does not—there’s no room for probabilities other than 0 or 1. And you can't choose between these two possibilities because you don’t know the value of the population parameter.

"The parameter is an unknown constant and no probability statement concerning its value may be made."  —Jerzy Neyman, original developer of confidence intervals.

This will be easier to understand after we discuss the graph below . . .

With this in mind, how do you interpret confidence intervals?

Confidence intervals serve as good estimates of the population parameter because the procedure tends to produce intervals that contain the parameter. Confidence intervals are comprised of the point estimate (the most likely value) and a margin of error around that point estimate. The margin of error indicates the amount of uncertainty that surrounds the sample estimate of the population parameter.

In this vein, you can use confidence intervals to assess the precision of the sample estimate. For a specific variable, a narrower confidence interval [90 110] suggests a more precise estimate of the population parameter than a wider confidence interval [50 150].

Confidence Intervals and the Margin of Error

Let’s move on to see how confidence intervals account for that margin of error. To do this, we’ll use the same tools that we’ve been using to understand hypothesis tests. I’ll create a sampling distribution using probability distribution plots , the t-distribution , and the variability in our data. We'll base our confidence interval on the energy cost data set that we've been using.

When we looked at significance levels , the graphs displayed a sampling distribution centered on the null hypothesis value, and the outer 5% of the distribution was shaded. For confidence intervals, we need to shift the sampling distribution so that it is centered on the sample mean and shade the middle 95%.

Probability distribution plot that illustrates how a confidence interval works

The shaded area shows the range of sample means that you’d obtain 95% of the time using our sample mean as the point estimate of the population mean. This range [267 394] is our 95% confidence interval.

Using the graph, it’s easier to understand how a specific confidence interval represents the margin of error, or the amount of uncertainty, around the point estimate. The sample mean is the most likely value for the population mean given the information that we have. However, the graph shows it would not be unusual at all for other random samples drawn from the same population to obtain different sample means within the shaded area. These other likely sample means all suggest different values for the population mean. Hence, the interval represents the inherent uncertainty that comes with using sample data.

You can use these graphs to calculate probabilities for specific values. However, notice that you can’t place the population mean on the graph because that value is unknown. Consequently, you can’t calculate probabilities for the population mean, just as Neyman said!

Why P Values and Confidence Intervals Always Agree About Statistical Significance

You can use either P values or confidence intervals to determine whether your results are statistically significant. If a hypothesis test produces both, these results will agree.

The confidence level is equivalent to 1 – the alpha level. So, if your significance level is 0.05, the corresponding confidence level is 95%.

  • If the P value is less than your significance (alpha) level, the hypothesis test is statistically significant.
  • If the confidence interval does not contain the null hypothesis value, the results are statistically significant.
  • If the P value is less than alpha, the confidence interval will not contain the null hypothesis value.

For our example, the P value (0.031) is less than the significance level (0.05), which indicates that our results are statistically significant. Similarly, our 95% confidence interval [267 394] does not include the null hypothesis mean of 260 and we draw the same conclusion.

To understand why the results always agree, let’s recall how both the significance level and confidence level work.

  • The significance level defines the distance the sample mean must be from the null hypothesis to be considered statistically significant.
  • The confidence level defines the distance for how close the confidence limits are to sample mean.

Both the significance level and the confidence level define a distance from a limit to a mean. Guess what? The distances in both cases are exactly the same!

The distance equals the critical t-value * standard error of the mean . For our energy cost example data, the distance works out to be $63.57.

Imagine this discussion between the null hypothesis mean and the sample mean:

Null hypothesis mean, hypothesis test representative : Hey buddy! I’ve found that you’re statistically significant because you’re more than $63.57 away from me!

Sample mean, confidence interval representative : Actually, I’m significant because you’re more than $63.57 away from me !

Very agreeable aren’t they? And, they always will agree as long as you compare the correct pairs of P values and confidence intervals. If you compare the incorrect pair, you can get conflicting results, as shown by common mistake #1 in this post .

Closing Thoughts

In statistical analyses, there tends to be a greater focus on P values and simply detecting a significant effect or difference. However, a statistically significant effect is not necessarily meaningful in the real world. For instance, the effect might be too small to be of any practical value.

It’s important to pay attention to the both the magnitude and the precision of the estimated effect. That’s why I'm rather fond of confidence intervals. They allow you to assess these important characteristics along with the statistical significance. You'd like to see a narrow confidence interval where the entire range represents an effect that is meaningful in the real world.

If you like this post, you might want to read the previous posts in this series that use the same graphical framework:

  • Part One: Why We Need to Use Hypothesis Tests
  • Part Two: Significance Levels (alpha) and P values

For more about confidence intervals, read my post where I compare them to tolerance intervals and prediction intervals .

If you'd like to see how I made the probability distribution plot, please read: How to Create a Graphical Version of the 1-sample t-Test .

minitab-on-twitter

You Might Also Like

  • Trust Center

© 2023 Minitab, LLC. All Rights Reserved.

  • Terms of Use
  • Privacy Policy
  • Cookies Settings

Library homepage

  • school Campus Bookshelves
  • menu_book Bookshelves
  • perm_media Learning Objects
  • login Login
  • how_to_reg Request Instructor Account
  • hub Instructor Commons
  • Download Page (PDF)
  • Download Full Book (PDF)
  • Periodic Table
  • Physics Constants
  • Scientific Calculator
  • Reference & Cite
  • Tools expand_more
  • Readability

selected template will load here

This action is not available.

Statistics LibreTexts

11.8: Significance Testing and Confidence Intervals

  • Last updated
  • Save as PDF
  • Page ID 2425

  • Rice University

Learning Objectives

  • Explain why a confidence interval makes clear that one should not accept the null hypothesis

There is a close relationship between confidence intervals and significance tests. Specifically, if a statistic is significantly different from \(0\) at the \(0.05\) level, then the \(95\%\) confidence interval will not contain \(0\). All values in the confidence interval are plausible values for the parameter, whereas values outside the interval are rejected as plausible values for the parameter. In the Physicians' Reactions case study, the \(95\%\) confidence interval for the difference between means extends from \(2.00\) to \(11.26\). Therefore, any value lower than \(2.00\) or higher than \(11.26\) is rejected as a plausible value for the population difference between means. Since zero is lower than \(2.00\), it is rejected as a plausible value and a test of the null hypothesis that there is no difference between means is significant. It turns out that the \(p\) value is \(0.0057\). There is a similar relationship between the \(99\%\) confidence interval and significance at the \(0.01\) level.

Whenever an effect is significant, all values in the confidence interval will be on the same side of zero (either all positive or all negative). Therefore, a significant finding allows the researcher to specify the direction of the effect. There are many situations in which it is very unlikely two conditions will have exactly the same population means. For example, it is practically impossible that aspirin and acetaminophen provide exactly the same degree of pain relief. Therefore, even before an experiment comparing their effectiveness is conducted, the researcher knows that the null hypothesis of exactly no difference is false. However, the researcher does not know which drug offers more relief. If a test of the difference is significant, then the direction of the difference is established because the values in the confidence interval are either all positive or all negative.

If the \(95\%\) confidence interval contains zero (more precisely, the parameter value specified in the null hypothesis), then the effect will not be significant at the \(0.05\) level. Looking at non-significant effects in terms of confidence intervals makes clear why the null hypothesis should not be accepted when it is not rejected: Every value in the confidence interval is a plausible value of the parameter. Since zero is in the interval, it cannot be rejected. However, there is an infinite number of other values in the interval (assuming continuous measurement), and none of them can be rejected either.

9.1 Null and Alternative Hypotheses

The actual test begins by considering two hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

H 0 , the — null hypothesis: a statement of no difference between sample means or proportions or no difference between a sample mean or proportion and a population mean or proportion. In other words, the difference equals 0.

H a —, the alternative hypothesis: a claim about the population that is contradictory to H 0 and what we conclude when we reject H 0 .

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision. They are reject H 0 if the sample information favors the alternative hypothesis or do not reject H 0 or decline to reject H 0 if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in H 0 and H a :

H 0 always has a symbol with an equal in it. H a never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

Example 9.1

H 0 : No more than 30 percent of the registered voters in Santa Clara County voted in the primary election. p ≤ 30 H a : More than 30 percent of the registered voters in Santa Clara County voted in the primary election. p > 30

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25 percent. State the null and alternative hypotheses.

Example 9.2

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are the following: H 0 : μ = 2.0 H a : μ ≠ 2.0

We want to test whether the mean height of eighth graders is 66 inches. State the null and alternative hypotheses. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ __ 66
  • H a : μ __ 66

Example 9.3

We want to test if college students take fewer than five years to graduate from college, on the average. The null and alternative hypotheses are the following: H 0 : μ ≥ 5 H a : μ < 5

We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ __ 45
  • H a : μ __ 45

Example 9.4

An article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third of the students pass. The same article stated that 6.6 percent of U.S. students take advanced placement exams and 4.4 percent pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6 percent. State the null and alternative hypotheses. H 0 : p ≤ 0.066 H a : p > 0.066

On a state driver’s test, about 40 percent pass the test on the first try. We want to test if more than 40 percent pass on the first try. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : p __ 0.40
  • H a : p __ 0.40

Collaborative Exercise

Bring to class a newspaper, some news magazines, and some internet articles. In groups, find articles from which your group can write null and alternative hypotheses. Discuss your hypotheses with the rest of the class.

As an Amazon Associate we earn from qualifying purchases.

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute Texas Education Agency (TEA). The original material is available at: https://www.texasgateway.org/book/tea-statistics . Changes were made to the original material, including updates to art, structure, and other content updates.

Access for free at https://openstax.org/books/statistics/pages/1-introduction
  • Authors: Barbara Illowsky, Susan Dean
  • Publisher/website: OpenStax
  • Book title: Statistics
  • Publication date: Mar 27, 2020
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/statistics/pages/1-introduction
  • Section URL: https://openstax.org/books/statistics/pages/9-1-null-and-alternative-hypotheses

© Jan 23, 2024 Texas Education Agency (TEA). The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

Introduction to Statistical Methodology, Second Edition

Chapter 7 two-sample hypothesis tests and confidence intervals.

There are two broad classification types for research, observational studies and designed experiments. These two types of research differ in the way that the researcher interacts with the subjects being observed. In an observational study, the researcher doesn’t force a subject into some behavior or treatment, but merely observes the subject (making measurements but not changing behaviors). In contrast, in an experiment, the researcher imposes different treatments onto the subjects and the pairing between the subject and treatment group happens at random.

Example: For many years hormone (Estrogen and Progestin) replacement therapy’s primary use for post-menopausal woman was to reduce the uncomfortable side-effects of menopause but it was thought to also reduced the rate of breast cancer in post-menopausal women. This belief was the result of many observational studies where women who chose to take hormone replacement therapy also had reduced rates of breast cancer. The lurking variable that the observational studies missed was that hormone therapy is relatively expensive and was taken by predominately women of a high socio- economic status. Those women tended to be more health conscious, lived in areas with less pollution, and were generally at a lower risk for developing breast cancer. Even when researchers realized that socio-economic status was confounded with the therapy, they couldn’t be sure which was the cause of the reduced breast cancer rates. Two variables are said to be confounded if the design of a given experiment or study cannot distinguish the effect of one variable from the other. To correctly test this, nearly 17,000 women underwent an experiment in which each women was randomly assigned to take either the treatment (E+P) or a placebo. The Women’s Health Initiative (WHI) Estrogen plus Progestin Study (E+P) was stopped on July 7, 2002 (after an average 5.6 years of follow-up) because of increased risks of cardiovascular disease and breast cancer in women taking active study pills, compared with those on placebo (inactive pills). The study showed that the overall risks exceeded the benefits, with women taking E+P at higher risk for heart disease, blood clots, stroke, and breast cancer, but at lower risk for fracture and colon cancer. Lurking variables such as income levels and education are correlated to overall health behaviors and with an increased use of hormone replacement therapy. By randomly assigning each woman to a treatment, the unidentified lurking variables were evenly spread across treatments and the dangers of hormone replacement therapy were revealed.

In the previous paragraph, we introduced the idea of a lurking variable where a lurking variable is a variable the researcher hasn’t considered but affects the response variable. In observational studies a researcher will try to measure all the variables that might affect the response but will undoubtedly miss something.

There is a fundamental difference between imposing treatments onto subjects versus taking a random sample from a population and observing relationships between variables. In general, designed experiments allow us to determine cause-and-effect relationships while observational studies can only determine if variables are correlated. This difference in how the data is generated will result in different methods for generating a sampling distribution for a statistic of interest. In this chapter we will focus on experimental designs, though the same analyses are appropriate for observational studies.

7.1 Difference in means between two groups

Often researchers will obtain a group of subjects and divide them into two groups, provide different treatments to each, and observe some response. The goal is to see if the two groups have different mean values, as this is the most common difference to be interested in.

The first thing to consider is that the group of subjects in our sample should be representative of a population of interest. Because we cannot impose an experiment on an entire population, we often are forced to examine a small sample and we hope that the sample statistics (the sample mean \(\bar{x}\) , and sample standard deviation \(s\) ) are good estimates of the population parameters (the population mean \(\mu\) , and population standard deviation \(\sigma\) ). First recognize that these are a sample and we generally think of them to be representative of some population.

Example: Finger Tapping and Caffeine

The effects of caffeine on the body have been well studied. In one experiment, a group of male college students were trained in a particular tapping movement and to tap at a rapid rate. They were randomly divided into caffeine and non-caffeine groups and given approximately two cups of coffee (with either 200 mg of caffeine or none). After a 2-hour period, the students tapping rate was measured.

The population that we are trying to learn about is male college-aged students and we the most likely question of interest is if the mean tap rate of the caffeinated group is different than the non-caffeinated group. Notice that we want to take this sample of 20 students to make inference on the population of male college-aged students. The hypotheses we are interested in are \[\begin{aligned} H_{0}: \mu_{nc} &= \mu_{c} \\ H_{a}: \mu_{nc} &\ne \mu_{c} \end{aligned}\]

where \(\mu_{c}\) is the mean tap rate of the caffeinated group and \(\mu_{nc}\) is the mean tap rate of the non-caffeinated group. We could equivalently express these hypotheses via \[\begin{aligned} H_{0}: \mu_{nc}-\mu_{c} &= 0 \\ H_{a}: \mu_{nc}-\mu_{c} &\ne 0 \end{aligned}\]

Or we could let \(\delta=\mu_{nc}-\mu_{c}\) and write the hypotheses as \[\begin{aligned} H_{0}:\,\delta &= 0 \\ H_{a}:\,\delta &\ne 0 \end{aligned}\]

The data are available in many different formats at http://www.lock5stat.com/datapage.html

The dataset contains two variables. Taps are the response of interest. Group is a factor (or categorical variable) that has 2 levels. These are the different groupings of Caffeine and NoCaffeine. The first thing we should do is, as always, graph the data.

confidence interval null hypothesis example

From this view, it looks like the caffeine group has a higher tapping rate. It will be helpful to summarize the difference between these two groups with a single statistic by calculating the mean for each group and then calculate the difference between the group means.

We can find then find the difference in the sample means.

Notationally, lets call this statistic \(d=\bar{x}_{nc}-\bar{x}_{c}=-3.5\) . We are interested in testing if this observed difference might be due to just random chance and we just happened to assigned more of the fast tappers to the caffeine group. How could we test the null hypothesis that the mean of the caffeinated group is different than the non-caffeinated?

7.1.1 Inference via resampling

The key idea is “How could the data have turned out if the null hypothesis is true?” If the null hypothesis is true, then the caffeinated/non-caffeinated group treatment had no effect on the tap rate and it was just random chance that the caffeinated group got a larger percentage of fast tappers. That is to say the group variable has no relationship to tap rate. I could have just as easily assigned the fast tappers to the non-caffeinated group purely by random chance. So our simulation technique is to shuffle the group labels and then calculate a difference between the group means !

Below we demonstrate what it would look like to shuffle the groups. This is the core concept behind the permutation methods, and how we can work to make an inference via resampling.

We can then calculate the mean difference but this time using the randomly generated groups, and now the non-caffeinated group just happens to have a slightly higher mean tap rate just by the random sorting into two groups.

We could repeat this shuffling several times and see the possible values we might have seen if the null hypothesis is correct and the treatment group doesn’t matter at all.

Of course, five times isn’t sufficient to understand the sampling distribution of the mean difference under the null hypothesis, we should do more.

We can then take the results of our 10000 permutations and view a histogram of the resulting difference in the shuffled group means ( \(d^*\) ).

confidence interval null hypothesis example

We are then interested in how often from our permutations did we observe something more extreme than the mean difference from the original groupings. Because this is a two-tailed test, we will look for how many observations are either below -3.5 or above +3.5. The original difference in the means are marked as vertical red lines in the graph above.

We have almost no cases where the randomly assigned groups produced a difference as extreme as the actual observed difference of \(d=-3.5\) . We can calculate the percentage of the sampling distribution of the difference in means that is farther from zero

We see that only 58/10,000 simulations of data produced assuming \(H_{0}\) is true produced a \(d^{*}\) value more extreme than our observed difference in sample means. This is exactly the definition we have given to a p-value; thus, we can reject the null hypothesis \(H_{0}:\mu_{nc}-\mu_{c}=0\) in favor of the alternative \(H_{a}:\mu_{nc}-\mu_{c}\ne 0\) at an \(\alpha=0.05\) or any other reasonable \(\alpha\) level.

7.1.1.1 Using coin

To make the code less cumbersome, we can incorporate the use of the coin package. This package will allow us to perform a variety of permutation tests without having to produce code such as that shown above. We will only need to ensure that our data is prepared properly. However, for those who are interested more in the R coding that can be done to produce permutation tests, please see Appendix B : Alternative Permutation Test Code.

The data in CaffeineTaps has the data separated as Taps and Group , which is exactly the form we need it in. We can run the permutation using coin simply by using the oneway_test() command and asking it to approximate the p-value. It will then run the permutation test for us. The same number of reshuffles as above (10000) is used.

We observe excellent agreement to the simulation run above, but with much less involvement on how to handle the code.

7.1.1.2 Different Alternative Hypothesis

Everything we know about the biological effects of ingesting caffeine suggests that we should have expected the caffeinated group to tap faster. We might want to set up our experiment so only faster tapping represents “extreme” data compared to the null hypothesis. In this case we want an alternative of \(H_{a}:\,\mu_{nc}-\mu_{c}<0\) We can state our null and alternative hypothesis as \[\begin{aligned} H_{0}:\,\mu_{nc}-\mu_{c} &\ge 0 \\ H_{a}:\,\mu_{nc}-\mu_{c} &< 0 \end{aligned}\]

The creation of the sampling distribution of the mean difference \(d^*\) is identical to our previous technique because if our observed difference \(d\) is so negative that it is incompatible with the hypothesis that \(\mu_{nc}-\mu_{c}=0\) then it must also be incompatible with any positive difference. We can perform the permutation test and generate the distribution of estimated differences in the same manner as above. The only difference in the analysis is at the end when we calculate the p-value and don’t consider the positive tail. That is, the p-value is the percent of simulations where \(d^*<d\) .

We can perform a left-tailed test using coin , but need to be sure we call ‘NoCaffeine’ the first group. We can do this with relevel() .

From both methods we see that the p-value is approximately cut in half by ignoring the upper tail, which makes sense considering the observed symmetry in the sampling distribution of \(d^*\) .

In general, we prefer to use a two-sided test because if the two-sided test leads us to reject the null hypothesis then so would the appropriate one-sided hypothesis (except in the case where the alternative was chosen before the data was collected and the observed data was in the other tail). Second, by using a two-sample test, it prevents us from from “tricking” ourselves when we don’t know the which group should have a higher mean going into the experiment, but after seeing the data, thinking we should have known and using the less stringent test. Some statisticians go so far as to say that using a 1-sided test is outright fraudulent. Generally, we’ll concentrate on two-sided tests as they are the most widely acceptable.

7.1.1.3 Inference via Bootstrap Confidence Interval

Just as we could use bootstrapping to evaluate a confidence interval for one-sample, we can do the same for two-samples. We need only update the function we are give the boot function.

This function works slightly different than the Chapter 3 version. We must now ensure that we give it a data.frame where the first column are the observations and the second column the factored group labels. This code will then calculate the difference in the means while bootstrapping the elements observed. We can run the bootstrap in a nearly identical fashion to Chapter 3. Notice my data is no longer a vector of values, but the data.frame we have been working with.

We can visualize the results identical to the earlier chapters, but now recognizing this sampling distribution represents the difference in the means of the NoCaffeine and Caffeine groups.

confidence interval null hypothesis example

Thus, we can state that with 95% confidence the difference between the mean NoCaffeine taps and mean Caffeine taps is between -5.4 and -1.5 taps. Notice that the null hypothesis value, \(\delta=0\) , is not a value supported by the data because 0 is not in the 95% confidence interval. A subtle point in the above bootstrap code does not re-sampled each group separately. Because the experimental protocol was to have 10 in each group, we might want to use bootstrap code that accounts for the correct design. For now, we might end up with 12 caffeinated and 8 decaffeinated subjects, which is data that our experimental design couldn’t have generated. This should have minimal consequence and our bootstraps can still be conducted relatively easy.

7.1.2 Inference via asymptotic results (unequal variance assumption)

Previously we’ve seen that the Central Limit Theorem gives us a way to estimate the distribution of the sample mean. So it should be reasonable to assume that for our two groups (1=NoCaffeine, 2=Caffeine), \[\bar{X}_{1}\stackrel{\cdot}{\sim}N\left(\mu_{1},\, \frac{\sigma_{1}^{2}}{n_1}\right)\;\;\;\textrm{and}\;\;\;\bar{X}_{2}\stackrel{\cdot}{\sim}N\left(\mu_{2},\; \frac{\sigma_{2}^{2}}{n_2}\right)\]

It turns out that because \(\bar{X}_{C}\) and \(\bar{X}_{NC}\) both have approximately normal distributions, then the difference between them also does. This shouldn’t be too surprising after looking at the permutation and bootstrap distributions of the \(d^*\) values.

So our hypothesis tests and confidence interval routine will follow a similar pattern as our one-sample tests, but we now need to figure out the correct standardization formula for the difference in means. The only difficulty will be figuring out what the appropriate standard deviation term \(\hat{\sigma}_{D}\) should be.

Recall that if two random variables, A and B, are independent then \[Var\left(A-B\right)=Var(A)+Var(B)\] and therefore \[\begin{aligned} Var\left(D\right) &= Var\left(\bar{X}_{1}-\bar{X}_{2}\right) \\ &= Var\left(\bar{X}_{1}\right)+Var\left(\bar{X}_{2}\right) \\ &= \frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}} \end{aligned}\] and finally we have \[StdErr\left(D\right)=\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}\] and therefore my standardized value for the difference will be \[\begin{aligned} t_{\Delta} &= \frac{\textrm{estimate}\,\,\,-\,\,\,\textrm{null hypothesized value}}{StdErr\left(\,\,\textrm{estimate}\,\,\right)} \\ \end{aligned}\]

The test statistic under unequal variance conditions is given by

\[\begin{aligned} t_{\Delta} &= \frac{\left(\bar{x}_{1}-\bar{x}_{2}\right)}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}} \\ \end{aligned}\]

For the data evaluated here, we thus have

\[\begin{aligned} t_{\Delta} &= \frac{\left(-3.5\right)-0}{\sqrt{\frac{2.39^{2}}{10}+\frac{2.21^{2}}{10}}} \\ &= -3.39 \end{aligned}\]

This is somewhat painful, but reasonable. The last question is what t-distribution should we compare this to? Previously we’ve used \(df=n-1\) but now we have two samples. So our degrees of freedom ought to be somewhere between \(\min\left(n_{1},n_{2}\right)-2=8\) and \(\left(n_{1}+n_{2}\right)-1=19\) .

There is no correct answer, but the best approximation to what it should be is called Satterthwaite’s Approximation. We will give this degree of freedom a special character, \(\Delta\) , to keep it clear when we are using it. \[\Delta=\frac{\left(V_{1}+V_{2}\right)^{2}}{\frac{V_{1}^{2}}{n_{1}-1}+\frac{V_{2}^{2}}{n_{2}-1}}\] where \[V_{1}=\frac{s_{1}^{2}}{n_{1}}\;\;\textrm{and }\;\;V_{2}=\frac{s_{2}^{2}}{n_{2}}\]

So for our example we have

\[V_{1}=\frac{2.39^{2}}{10}=0.5712\;\;\;\textrm{and}\;\;\;V_{2}=\frac{2.21^{2}}{10}=0.4884\] and \[\Delta=\frac{\left(0.5712+0.4884\right)^{2}}{\frac{\left(0.5712\right)^{2}}{9}+\frac{\left(0.4884\right)^{2}}{9}}=17.89\]

So now we can compute our two-tailed p-value as

\[\textrm{p-value}=2*P\left(T_{17.89}<-3.39\right)\]

7.1.2.1 Confidence Interval

Similar to the theory discussed earlier, we can calculate the asymptotic confidence interval for the difference in the means. Recall that in general the confidence interval is given by

\[\begin{aligned} \textrm{Est}\;\; &\pm\; t_{\Delta}^{1-\alpha/2}\;\textrm{StdErr}\left(\;\textrm{Est}\;\right) \\ \end{aligned}\]

For the difference of two means under unequal variance conditions, this can be written

\[\begin{aligned} \left(\bar{x}_{1}-\bar{x}_{2}\right) &\pm t_{\Delta}^{1-\alpha/2}\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}} \\ \end{aligned}\]

Working this through for the data set evaluated here, we find obtain

\[\begin{aligned} -3.5 &\pm 2.10\sqrt{\frac{2.39^{2}}{10}+\frac{2.21^{2}}{10}} \\ -3.5 &\pm 2.16 \end{aligned}\]

Where the critical t-score was found at \(\Delta\) degrees of freedom

Giving a 95% confidence interval for the difference in mean taps for NoCaffeine and Caffeine groups as

\[\left(-5.66,\;-1.34\right)\]

It is probably fair to say that this is an ugly calculation to do by hand. Fortunately it isn’t too hard to make R do these calculations for you. The function t.test() will accept two arguments, a vector of values from the first group and a vector from the second group. We can also give it a formula, which is good to start understanding. Here we use Response ~ Predictors , which will be important for understanding linear models. We want to test if the response Taps differs between the two Group levels.

7.1.3 Inference via asymptotic results (equal variance assumption)

In the CaffeineTaps example, the standard deviations of each group are quite similar. Instead of thinking of the data as \[\bar{X}_{1}\stackrel{\cdot}{\sim}N\left(\mu_{1},\,\frac{\sigma_{1}^{2}}{n_1}\right)\;\;\;\textrm{and}\;\;\;\bar{X}_{2}\stackrel{\cdot}{\sim}N\left(\mu_{2},\;\frac{\sigma_{2}^{2}}{n_2}\right)\] we could consider the model where we assume that the variance term is the same for each sample. \[\bar{X}_{1}\stackrel{\cdot}{\sim}N\left(\mu_{1},\,\frac{\sigma^{2}}{n_1}\right)\;\;\;\textrm{and}\;\;\;\bar{X}_{2}\stackrel{\cdot}{\sim}N\left(\mu_{2},\; \frac{\sigma^{2}}{n_2}\right)\]

First, we can estimate \(\mu_{1}\) and \(\mu_{2}\) with the appropriate sample means \(\bar{x}_{1}\) and \(\bar{x}_{2}\) . Next we need to calculate an estimate of \(\sigma\) using all of the data. First recall the formula for the sample variance for one group was \[s^{2}=\frac{1}{n-1}\left[\sum_{j=1}^{n}\left(x_{j}-\bar{x}\right)^{2}\right]\]

In the case with two samples, we want a similar formula but it should take into account data from both sample groups. Define the notation \(x_{1j}\) to be the \(j\) th observation of group 1, and \(x_{2j}\) to be the \(j\) th observation of group 2 and in general \(x_{ij}\) as the \(j\) th observation from group \(i\) . We want to subtract each observation from the its appropriate sample mean and then, because we had to estimate two means, we need to subtract two degrees of freedom from the denominator. \[\begin{aligned} s_{pooled}^{2} &= \frac{1}{n_{1}+n_{2}-2}\left[\sum_{j=1}^{n_{1}}\left(x_{1j}-\bar{x}_{1}\right)^{2}+\sum_{j=1}^{n_{2}}\left(x_{2j}-\bar{x}_{2}\right)^{2}\right] \\ &= \frac{1}{n_{1}+n_{2}-2}\left[\sum_{j=1}^{n_{1}}e_{1j}^{2}+\sum_{j=1}^{n_{2}}e_{2j}^{2}\right]\\ &= \frac{1}{n_{1}+n_{2}-2}\left[\sum_{i=1}^{2}\sum_{j=1}^{n_{i}}e_{ij}^{2}\right] \end{aligned}\]

where \(\bar{x}_{1}\) and \(\bar{x}_{2}\) are the sample means and \(e_{ij}=x_{ij}-\bar{x}_{i}\) is the residual error of the \(i,j\) observation. A computationally convenient formula for this same quantity is \[s_{pooled}^{2}=\frac{1}{n_{1}+n_{2}-2}\left[\left(n_{1}-1\right)s_{1}^{2}+\left(n_{2}-1\right)s_{2}^{2}\right]\]

Finally we notice that this pooled estimate of the variance term \(\sigma^{2}\) has \(n_{1}+n_{2}-2\) degrees of freedom. One benefit of the pooled procedure is that we don’t have to mess with the Satterthwaite’s approximate degrees of freedom.

Recall our test statistic in the unequal variance case was \[t_{\Delta} =\frac{\left(\bar{x}_{1}-\bar{x}_{2}\right)-0}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}\] Now in the equal variance case, we will use the pooled estimate of the variance term \(s_{pooled}^{2}\) instead of \(s_{1}^{2}\) and \(s_{2}^{2}\) , and we have known \(df = (n_1 + n_2 - 2)\) .

\[\begin{aligned} t_{n_{1}+n_{2}-2} &= \frac{\left(\bar{x}_{1}-\bar{x}_{2}\right)-0}{\sqrt{\frac{s_{pool}^{2}}{n_{1}}+\frac{s_{pool}^{2}}{n_{2}}}} \\ &= \frac{\left(\bar{x}_{1}-\bar{x}_{2}\right)-0}{s_{pool}\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}} \end{aligned}\]

where we note that \[StdErr\left(\bar{X}_{1}-\bar{X}_{2}\right)=s_{pooled}\sqrt{\left(1/n_{1}\right)+\left(1/n_{2}\right)}\]

7.1.3.1 Caffeine Example

We can now rework the analysis of the Caffeine data under equal variance assumptions, allowing us to pool our estimate of the variance.

Recall our hypothesis for the CaffeineTaps data \[\begin{aligned} H_{0}: &\mu_{nc}-\mu_{c} = 0 \\ H_{a}: &\mu_{nc}-\mu_{c} \ne 0 \end{aligned}\]

First we have to calculate the summary statistics for each group.

We can then use the descriptive statistics to determine the pooled variance estimate \(\sigma_{pooled}\) ).

Next we can calculate \[t_{18}=\frac{\left(244.8-248.3\right)-0}{2.31\sqrt{\frac{1}{10}+\frac{1}{10}}}=-3.39\]

Finally we estimate our p-value

The change in the assumption of the variance makes little difference for the Caffeine data set, and we can still conclude that there is a difference in the mean taps between the Caffeine and NoCaffeine groups.

7.1.3.2 Confidence Interval

The associated \(95\%\) confidence interval when working under the equal variance assumption is \[\left(\bar{x}_{1}-\bar{x}_{2}\right)\pm t_{n_{1}+n_{2}-2}^{1-\alpha/2}\;\left(s_{pool}\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}\right)\]

We now find the critical t-score at the known degree of freedom

Then calculate the confidence interval

\[\begin{aligned} -3.5 &\pm 2.10\left(\,2.31\sqrt{\frac{1}{10}+\frac{1}{10}}\right) \\ -3.5 &\pm 2.17 \end{aligned}\] \[\left(-5.67,\;-1.33\right)\]

This p-value and \(95\%\) confidence interval are quite similar to the values we got in the case where we assumed unequal variances.

As usual, these calculations are pretty annoying to do by hand and we wish to instead do them using R. Again the function t.test() will do the annoying calculations for us. We must only state that we want to do the test under equal variance or var.equal=TRUE .

Maybe we would like to evaluate a higher confidence level.

7.1.4 Additional Example

Example: Does drinking beer increase your attractiveness to mosquitoes?

In places in the country substantial mosquito populations, the question of whether drinking beer causes the drinker to be more attractive to the mosquitoes than drinking something else has plagued campers. To answer such a question, researchers conducted a study to determine if drinking beer attracts more mosquitoes than drinking water. Of \(n=43\) subjects, \(n_{b}=25\) drank a liter beer and \(n_{w}=18\) drank a liter of water and mosquitoes were caught in traps as they approached the different subjects. The critical part of this study is that the treatment (beer or water) was randomly assigned to each subject.

For this study, we want to test \[H_{0}:\:\delta=0\;\;\;\;\;\;\textrm{vs}\;\;\;\;\;\;H_{a}:\,\delta<0\] where we define \(\delta=\mu_{w}-\mu_{b}\) and \(\mu_{b}\) is the mean number of mosquitoes attracted to a beer drinker and \(\mu_{w}\) is the mean number attracted to a water drinker. As usual we begin our analysis by plotting the data.

confidence interval null hypothesis example

For this experiment and the summary statistic that captures the difference we are trying to understand is \(d=\bar{x}_{w}-\bar{x}_{b}\) where \(\bar{x}_{w}\) is the sample mean number of mosquitoes attracted by the water group and \(\bar{x}_{b}\) is the sample mean number of mosquitoes attracted by the beer group. Because of the order we chose for the subtraction, a negative value for d is supportive of the alternative hypothesis that mosquitoes are more attracted to beer drinkers.

Here we see that our statistic of interest is \[\begin{aligned} d &= \bar{x}_{w}-\bar{x}_{b} \\ &= 19.22-23.6 \\ &= -4.37\bar{7} \end{aligned}\]

We can use our numerical methods to evaluate statistical significance. First we perform the hypothesis test by creating the sampling distribution of \(d^*\) assuming \(H_0\) is true by repeatedly shuffling the group labels and calculating differences. We use coin to simplify the work.

From 10000 permutations, we obtain a p-value estimate of \(0.0004\) .

The associated confidence interval (lets do a \(90\%\) confidence level), is created via bootstrapping. The diff.mean.func was defined earlier in the chapter. Our data is in the form we need it, so we can run the bootstrap with the same setup as earlier.

We can visualize the distribution of the difference in means.

confidence interval null hypothesis example

We can then extract the quantile-based 90% confidence interval.

The calculated p-value is extremely small and the associated two-sided 90% confidence interval does not contain 0, so we can conclude at 10% significance that the choice of drink does cause a change in attractiveness to mosquitoes.

If we wanted to perform the same analysis using asymptotic methods we could do the calculations by hand, or just use R.

Because we releveled the groups to make Water first, this calculation matches everything demonstrated above.

7.2 Difference in means between two groups: Paired Data

If the context of study is such that we can logically pair an observation from the first population to a particular observation in the second, then we can perform what is called a Paired Test. In a paired test, we will take each set of paired observations, calculate the difference, and then perform a 1-sample regular hypothesis test on the differences.

For example, in the package Lock5Data there is a dataset that examines the age in which men and women get married. The data was obtained by taking a random sample from publicly available marriage licenses in St. Lawrence County, NY.

Unfortunately the format of this dataset is not particularly convenient for making graphs. Instead I want to turn this data into a “long” dataset where I have one row per person, not one row per marriage. We will also want to retain the groupings, so another column is created with this information ( Marriage ).

We can then visualize the ages for each type of Spouse .

confidence interval null hypothesis example

Looking at this view of the data, it doesn’t appear that the husbands tend to be older than the wives. A t-test to see if the average age of husbands is greater than the average age of wives gives an insignificant difference. We will want to test if

\[\begin{aligned} H_{0}: &\mu_{h}-\mu_{w} = 0 \\ H_{a}: &\mu_{h}-\mu_{w} > 0 \end{aligned}\]

But critically, we are ignoring that while the average ages might not be different, for a given marriage, the husband tends to be older than the wife. Instead of looking at the difference in the means (i.e \(d=\bar{h}-\bar{w}\) ) we should actually be looking at the mean of the differences \(\bar{d}=\frac{1}{n}\sum d_{i}\) where \(d_{i}=h_{i}-w_{i}\) .

confidence interval null hypothesis example

Given this set of differences, we’d like to know if this data is compatible with the null hypothesis that husbands and wives tend to be the same age versus the alternative that husbands tend to be older. (We could chose the two-sided test as well). \[\begin{aligned} H_{0}:\;\delta &= 0 \\ H_{A}:\;\delta &> 0 \end{aligned}\]

Because we have reduced our problem to a 1-sample test, we can perform the asymptotic t-test easily enough in R. Notice now we are testing against the null hypothesis that \(\delta = 0\) .

The result is highly statistically significant, and we see the mean difference in ages for the husband to be 2.8 years older.

To perform the same analysis using re-sampling methods, we need to be careful to do the re-sampling correctly. Notice that if we use coin how we set it up before, that we get something similar to when we were working under the assumption that the two groups were independent. The coin package requires objects be factors (the data.frame has it contained as a character or chr ).

The issue is that the permutations were done without taking into account that the Spouses are paired. The permutation test must be updated such that the paired nature of the marriages is taken into account. We can introduce this pairing using a conditional statement, where we want to ensure that we group the Spouse variable given the marriage they are in Marriage .

After 10000 permutations, no mean difference in age was ever observed as extreme as the original and can only state that p-value \(< 1e-4\) . This is in agreement with the t.test performed above on the difference in ages for each marriage.

Finally, we can also perform bootstrap analysis. This now only requires that we use our bootstrap method from Chapter 3, as we only need to bootstrap the differences. We have not introduced mean.function so I must provide it now. I then run the bootstrap on the difference in ages for each marriage.

I omit the visualization, but give the resulting 95% confidence interval.

We observe a similar p-value and confidence interval as we did using the asymptotic test as expected. We now have a variety of tests and conditions, and can perform the analysis under asymptotic assumptions or through numerical strategies.

7.2.1 Additional Example

Example: Traffic Flow

Engineers in Dresden, Germany were looking at ways to improve traffic flow by enabling traffic lights to communicate information about traffic flow with nearby traffic lights and modify their timing sequence appropriately. The engineers wished to compare new flexible timing system with the standard fixed timing sequence by evaluating the delay at a randomly selected \(n=24\) intersections in Dresden. The data show results of one experiment where they simulated buses moving along a street and recorded the delay time for both systems. Because each simulation is extremely intensive, they only simulated \(n=24\) intersections instead of simulating the whole city.

We change the shape of the data to make it easier to work with.

As usual, we’ll first examine the data with a graph.

confidence interval null hypothesis example

All of the differences were positive, so it is almost ridiculous to do a hypothesis test that there is no decrease in delays with the flexible timing system. We continue through the analysis. We begin with the asymptotic results, using the paired differences.

As expected, there is significant evidence that mean difference between Standard and Flexible. We can also run the permutation test under paired conditions. We are interested in the response Delay and how it is influenced by Seq the sequence time. We then also ensure that we properly pair the data, where are groupings are now the variable Light .

We observe after 10000 iterations that again, no mean difference was ever as extreme as the original data set. Thus, we have p-value \(< 1e^{-4}\) . We finish with the bootstrap, performed on the differences.

The confidence interval suggests that these data support that the mean difference between the flexible timing sequence versus the standard fixed timing sequence in Dresden is in the interval \(\left(55.5,\,67.5\right)\) seconds.

7.3 Exercises

In the 2011 article “Methane contamination of drinking water accompanying gas-well drilling and hydraulic fracturing” in the Proceedings of the National Academy of Sciences, \(n_{1}=21\) sites in proximity to a fracking well had a mean methane level of \(\bar{x}_{1}=19.2\) mg \(CH_{4} L^{-1}\) with a sample standard deviation \(s_{1}=30.3\) . The \(n_{2}=13\) sites in the same region with no fracking wells within 1 kilometer had mean methane levels of \(\bar{x}_{2}=1.1\) mg \(CH_{4} L^{-1}\) and standard deviation \(s_{2}=6.3\) . Perform a one-sided, two-sample t-test with unpooled variance and an \(\alpha=0.05\) level to investigate if the presence of fracking wells increases the methane level in drinking-water wells in this region. Notice that because I don’t give you the data, you can only analyze the data using the asymptotic method and plugging in the give quantities into the formulas presented.

  • State an appropriate null and alternative hypothesis.
  • Calculate an appropriate test statistic (making sure to denote the appropriate degrees of freedom, if necessary).
  • Calculate an appropriate p-value.
  • At an significance level of \(\alpha=0.05\) , do you reject or fail to reject the null hypothesis?
  • Restate your conclusion in terms of the problem.

All persons running for public office must report the amount of money raised and spent during their campaign. Political scientists contend that it is more difficult for female candidates to raise money. Suppose that we randomly sample \(30\) male and \(30\) female candidates for state legislature and observe the male candidates raised, on average, \(\bar{y}=\$350,000\) with a standard deviation of \(s_{y}=\$61,900\) and the females raised on average \(\bar{x}=\$245,000\) with a standard deviation of \(s_{x}=\$52,100\) . Perform a one-sided, two-sample t-test with pooled variance to test if female candidates generally raise less in their campaigns that male candidates. Notice that because I don’t give you the data, you can only analyze the data using the asymptotic method and plugging in the give quantities into the formulas presented.

  • State an appropriate null and alternative hypothesis. (Be sure to use correct notation!)

In the Lock5Data package, the dataset Smiles gives data “…from a study examining the effect of a smile on the leniency of disciplinary action for wrongdoers. Participants in the experiment took on the role of members of a college disciplinary panel judging students accused of cheating. For each suspect, along with a description of the offense, a picture was provided with either a smile or neutral facial expression. Note, that for each individual only one picture was submitted. A leniency score was calculated based on the disciplinary decisions made by the participants.”

  • Graph the leniency score for the smiling and non-smiling groups. Comment on if you can visually detect any difference in leniency score.
  • Calculate the mean and standard deviation of the leniencies for each group. Does it seem reasonable that the standard deviation of each group is the same?
  • Do a two-sided two-sample t-test using pooled variance using the asymptotic method. Report the test statistic, p-value, and a \(95\%\) CI.
  • Do a two-side two-sample t-test using re-sampling methods. Report the p-value and a \(95\%\) CI.
  • What do you conclude at an \(\alpha=0.05\) level? Do you feel we should have used a more stringent \(\alpha\) level?

In the Lock5Data package, the dataset StorySpoilers is data from an experiment where the researchers are testing if a “spoiler” at the beginning of a short story negatively affects the enjoyment of the story. A set of \(n=12\) stories were selected and a spoiler introduction was created. Each version of each story was read by at least \(30\) people and rated. Reported are the average ratings for the spoiler and non-spoiler versions of each story. The following code creates the “long” version of the data.

  • Based on the description, a 1-sided test is appropriate. Explain why.
  • Graph the ratings for the original stories and the modified spoiler version. Comment on if you detect any difference in ratings between the two.
  • Graph the difference in ratings for each story. Comment on if the distribution of the differences seems to suggest that a spoiler lowers the rating.
  • Do a paired one-sided t-test using the asymptotic method. Also calculate a \(95\%\) confidence interval.
  • Do a paired one-sided t-test using the permutation method. Also calculate a \(95\%\) confidence interval using the bootstrap.
  • Based on your results in parts (d) and (e), what do you conclude?

In the Lock5Data package, the dataset Wetsuits describes an experiment with the goal of quantifying the effect of wearing a wetsuit on the speed of swimming. (It is often debated among triathletes whether or not to wear a wetsuit when it is optional.) A set of \(n=12\) swimmers and triathletes did a 1500 m swim in both the wetsuit and again in regular swimwear. The order in which they swam (wetsuit first or non-wetsuit first) was randomized for each participant. Reported is the maximum velocity during each swim.

  • Why did the researcher randomize which suit was worn first?
  • Plot the velocities for the wetsuit and non-wetsuit for each participant. Comment on if you detect any difference in the means of these two distributions.
  • Ignore the pairing and do a two-sided two-sample t-test using the asymptotic method. What would you conclude doing the t-test this way?
  • Plot the difference in velocity for each swimmer. Comment on if the observed difference in velocity seems to indicate that which should be preferred (wetsuit or non-wetsuit).
  • Do a paired two-sided t-test using the asymptotic method. Also calculate the 95% confidence interval. What do you conclude?
  • Do a paired two-sided t-test using the permutation method. Also calculate the 95% confidence interval using the bootstrap method. What do you conclude?

Rejecting the Null Hypothesis Using Confidence Intervals

Rejecting the Null Hypothesis Using Confidence Intervals

After a discussion on the two primary methods of statistical inference, viz. hypothesis tests and confidence intervals, it is shown that these two methods are actually equivalent.

In an introductory statistics class, there are three main topics that are taught: descriptive statistics and data visualizations, probability and sampling distributions, and statistical inference. Within statistical inference, there are two key methods of statistical inference that are taught, viz. confidence intervals and hypothesis testing . While these two methods are always taught when learning data science and related fields, it is rare that the relationship between these two methods is properly elucidated.

In this article, we’ll begin by defining and describing each method of statistical inference in turn and along the way, state what statistical inference is, and perhaps more importantly, what it isn’t. Then we’ll describe the relationship between the two. While it is typically the case that confidence intervals are taught before hypothesis testing when learning statistics, we’ll begin with the latter since it will allow us to define statistical significance.

Hypothesis Tests

The purpose of a hypothesis test is to answer whether random chance might be responsible for an observed effect. Hypothesis tests use sample statistics to test a hypothesis about population parameters. The null hypothesis, H 0 , is a statement that represents the assumed status quo regarding a variable or variables and it is always about a population characteristic. Some of the ways the null hypothesis is typically glossed are: the population variable is equal to a particular value or there is no difference between the population variables . For example:

  • H 0 : μ = 61 in (The mean height of the population of American men is 69 inches)
  • H 0 : p 1 -p 2 = 0 (The difference in the population proportions of women who prefer football over baseball and the population proportion of men who prefer football over baseball is 0.)

Note that the null hypothesis always has the equal sign.

The alternative hypothesis, denoted either H 1 or H a , is the statement that is opposed to the null hypothesis (e.g., the population variable is not equal to a particular value  or there is a difference between the population variables ):

  • H 1 : μ > 61 im (The mean height of the population of American men is greater than 69 inches.)
  • H 1 : p 1 -p 2 ≠ 0 (The difference in the population proportions of women who prefer football over baseball and the population proportion of men who prefer football over baseball is not 0.)

The alternative hypothesis is typically the claim that the researcher hopes to show and it always contains the strict inequality symbols (‘<’ left-sided or left-tailed, ‘≠’ two-sided or two-tailed, and ‘>’ right-sided or right-tailed).

When carrying out a test of H 0 vs. H 1 , the null hypothesis H 0 will be rejected in favor of the alternative hypothesis only if the sample provides convincing evidence that H 0 is false. As such, a statistical hypothesis test is only capable of demonstrating strong support for the alternative hypothesis by rejecting the null hypothesis.

When the null hypothesis is not rejected, it does not mean that there is strong support for the null hypothesis (since it was assumed to be true); rather, only that there is not convincing evidence against the null hypothesis. As such, we never use the phrase “accept the null hypothesis.”

In the classical method of performing hypothesis testing, one would have to find what is called the test statistic and use a table to find the corresponding probability. Happily, due to the advancement of technology, one can use Python (as is done in the Flatiron’s Data Science Bootcamp ) and get the required value directly using a Python library like stats models . This is the p-value , which is short for the probability value.

The p-value is a measure of inconsistency between the hypothesized value for a population characteristic and the observed sample. The p -value is the probability, under the assumption the null hypothesis is true, of obtaining a test statistic value that is a measure of inconsistency between the null hypothesis and the data. If the p -value is less than or equal to the probability of the Type I error, then we can reject the null hypothesis and we have sufficient evidence to support the alternative hypothesis.

Typically the probability of a Type I error ɑ, more commonly known as the level of significance , is set to be 0.05, but it is often prudent to have it set to values less than that such as 0.01 or 0.001. Thus, if p -value ≤ ɑ, then we reject the null hypothesis and we interpret this as saying there is a statistically significant difference between the sample and the population. So if the p -value=0.03 ≤ 0.05 = ɑ, then we would reject the null hypothesis and so have statistical significance, whereas if p -value=0.08 ≥ 0.05 = ɑ, then we would fail to reject the null hypothesis and there would not be statistical significance.

Confidence Intervals

The other primary form of statistical inference are confidence intervals. While hypothesis tests are concerned with testing a claim, the purpose of a confidence interval is to estimate an unknown population characteristic. A confidence interval is an interval of plausible values for a population characteristic. They are constructed so that we have a chosen level of confidence that the actual value of the population characteristic will be between the upper and lower endpoints of the open interval.

The structure of an individual confidence interval is the sample estimate of the variable of interest margin of error. The margin of error is the product of a multiplier value and the standard error, s.e., which is based on the standard deviation and the sample size. The multiplier is where the probability, of level of confidence, is introduced into the formula.

The confidence level is the success rate of the method used to construct a confidence interval. A confidence interval estimating the proportion of American men who state they are an avid fan of the NFL could be (0.40, 0.60) with a 95% level of confidence. The level of confidence is not the probability that that population characteristic is in the confidence interval, but rather refers to the method that is used to construct the confidence interval.

For example, a 95% confidence interval would be interpreted as if one constructed 100 confidence intervals, then 95 of them would contain the true population characteristic. 

Errors and Power

A Type I error, or a false positive, is the error of finding a difference that is not there, so it is the probability of incorrectly rejecting a true null hypothesis is ɑ, where ɑ is the level of significance. It follows that the probability of correctly failing to reject a true null hypothesis is the complement of it, viz. 1 – ɑ. For a particular hypothesis test, if ɑ = 0.05, then its complement would be 0.95 or 95%.

While we are not going to expand on these ideas, we note the following two related probabilities. A Type II error, or false negative, is the probability of failing to reject a false null hypothesis where the probability of a type II error is β and the power is the probability of correctly rejecting a false null hypothesis where power = 1 – β. In common statistical practice, one typically only speaks of the level of significance and the power.

The following table summarizes these ideas , where the column headers refer to what is actually the case, but is unknown. (If the truth or falsity of the null value was truly known, we wouldn’t have to do statistics.)

confidence interval null hypothesis example

Hypothesis Tests and Confidence Intervals

Since hypothesis tests and confidence intervals are both methods of statistical inference, then it is reasonable to wonder if they are equivalent in some way. The answer is yes, which means that we can perform hypothesis testing using confidence intervals.

Returning to the example where we have an estimate of the proportion of American men that are avid fans of the NFL, we had (0.40, 0.60) at a 95% confidence level. As a hypothesis test, we could have the alternative hypothesis as H 1 ≠ 0.51. Since the null value of 0.51 lies within the confidence interval, then we would fail to reject the null hypothesis at ɑ = 0.05.

On the other hand, if H 1 ≠ 0.61, then since 0.61 is not in the confidence interval we can reject the null hypothesis at ɑ = 0.05. Note that the confidence level of 95% and the level of significance at ɑ = 0.05 = 5%  are complements, which is the “H o is True” column in the above table.

In general, one can reject the null hypothesis given a null value and a confidence interval for a two-sided test if the null value is not in the confidence interval where the confidence level and level of significance are complements. For one-sided tests, one can still perform a hypothesis test with the confidence level and null value. Not only is there an added layer of complexity for this equivalence, it is the best practice to perform two-sided hypothesis tests since one is not prejudicing the direction of the alternative.

In this discussion of hypothesis testing and confidence intervals, we not only understand when these two methods of statistical inference can be equivalent, but now have a deeper understanding of statistical significance itself and therefore, statistical inference.

Learn More About Data Science at Flatiron

The curriculum in our Data Science Bootcamp incorporates the latest technologies, including artificial intelligence (AI) tools. Download the syllabus to see what you can learn, or book a 10-minute call with Admissions to learn about full-time and part-time attendance opportunities.

Disclaimer: The information in this blog is current as of February 28, 2024. Current policies, offerings, procedures, and programs may differ.

confidence interval null hypothesis example

About Brendan Patrick Purdy

Brendan is the senior curriculum developer for data science at the Flatiron School. He holds degrees in mathematics, data science, and philosophy, and enjoys modeling neural networks with the Python library TensorFlow.

Related Posts

confidence interval null hypothesis example

Learn to Code Python: Free Lesson for Beginners

confidence interval null hypothesis example

Courtney Pimentel: Sales Associate to IT

confidence interval null hypothesis example

A Guide to Cloud Identity and Access Management (Cloud IAM)

Related resources.

confidence interval null hypothesis example

Behind JavaScript, HTML/CSS, and SQL, Python is the fourth most popular language with 44.1% of developers. Check out this article on how you can learn this popular programming language for free.

confidence interval null hypothesis example

"The most surprising thing about my time at Flatiron was discovering how much I could learn and grow in such a short amount of time."

confidence interval null hypothesis example

Discover how Cloud IAM addresses the security challenges of cloud environments, ensuring granular access control for enhanced protection and efficiency.

Privacy Overview

Statology

Statistics Made Easy

How to Write a Null Hypothesis (5 Examples)

A hypothesis test uses sample data to determine whether or not some claim about a population parameter is true.

Whenever we perform a hypothesis test, we always write a null hypothesis and an alternative hypothesis, which take the following forms:

H 0 (Null Hypothesis): Population parameter =,  ≤, ≥ some value

H A  (Alternative Hypothesis): Population parameter <, >, ≠ some value

Note that the null hypothesis always contains the equal sign .

We interpret the hypotheses as follows:

Null hypothesis: The sample data provides no evidence to support some claim being made by an individual.

Alternative hypothesis: The sample data  does provide sufficient evidence to support the claim being made by an individual.

For example, suppose it’s assumed that the average height of a certain species of plant is 20 inches tall. However, one botanist claims the true average height is greater than 20 inches.

To test this claim, she may go out and collect a random sample of plants. She can then use this sample data to perform a hypothesis test using the following two hypotheses:

H 0 : μ ≤ 20 (the true mean height of plants is equal to or even less than 20 inches)

H A : μ > 20 (the true mean height of plants is greater than 20 inches)

If the sample data gathered by the botanist shows that the mean height of this species of plants is significantly greater than 20 inches, she can reject the null hypothesis and conclude that the mean height is greater than 20 inches.

Read through the following examples to gain a better understanding of how to write a null hypothesis in different situations.

Example 1: Weight of Turtles

A biologist wants to test whether or not the true mean weight of a certain species of turtles is 300 pounds. To test this, he goes out and measures the weight of a random sample of 40 turtles.

Here is how to write the null and alternative hypotheses for this scenario:

H 0 : μ = 300 (the true mean weight is equal to 300 pounds)

H A : μ ≠ 300 (the true mean weight is not equal to 300 pounds)

Example 2: Height of Males

It’s assumed that the mean height of males in a certain city is 68 inches. However, an independent researcher believes the true mean height is greater than 68 inches. To test this, he goes out and collects the height of 50 males in the city.

H 0 : μ ≤ 68 (the true mean height is equal to or even less than 68 inches)

H A : μ > 68 (the true mean height is greater than 68 inches)

Example 3: Graduation Rates

A university states that 80% of all students graduate on time. However, an independent researcher believes that less than 80% of all students graduate on time. To test this, she collects data on the proportion of students who graduated on time last year at the university.

H 0 : p ≥ 0.80 (the true proportion of students who graduate on time is 80% or higher)

H A : μ < 0.80 (the true proportion of students who graduate on time is less than 80%)

Example 4: Burger Weights

A food researcher wants to test whether or not the true mean weight of a burger at a certain restaurant is 7 ounces. To test this, he goes out and measures the weight of a random sample of 20 burgers from this restaurant.

H 0 : μ = 7 (the true mean weight is equal to 7 ounces)

H A : μ ≠ 7 (the true mean weight is not equal to 7 ounces)

Example 5: Citizen Support

A politician claims that less than 30% of citizens in a certain town support a certain law. To test this, he goes out and surveys 200 citizens on whether or not they support the law.

H 0 : p ≥ .30 (the true proportion of citizens who support the law is greater than or equal to 30%)

H A : μ < 0.30 (the true proportion of citizens who support the law is less than 30%)

Additional Resources

Introduction to Hypothesis Testing Introduction to Confidence Intervals An Explanation of P-Values and Statistical Significance

confidence interval null hypothesis example

Hey there. My name is Zach Bobbitt. I have a Master of Science degree in Applied Statistics and I’ve worked on machine learning algorithms for professional businesses in both healthcare and retail. I’m passionate about statistics, machine learning, and data visualization and I created Statology to be a resource for both students and teachers alike.  My goal with this site is to help you learn statistics through using simple terms, plenty of real-world examples, and helpful illustrations.

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Using a confidence interval to decide whether to reject the null hypothesis

Suppose that you do a hypothesis test. Remember that the decision to reject the null hypothesis (H 0 ) or fail to reject it can be based on the p-value and your chosen significance level (also called α). If the p-value is less than or equal to α, you reject H 0 ; if it is greater than α, you fail to reject H 0 .

  • If the reference value specified in H 0 lies outside the interval (that is, is less than the lower bound or greater than the upper bound), you can reject H 0 .
  • If the reference value specified in H 0 lies within the interval (that is, is not less than the lower bound or greater than the upper bound), you fail to reject H 0 .
  • Minitab.com
  • License Portal
  • Cookie Settings

You are now leaving support.minitab.com.

Click Continue to proceed to:

IMAGES

  1. t test null hypothesis example

    confidence interval null hypothesis example

  2. 15 Null Hypothesis Examples (2024)

    confidence interval null hypothesis example

  3. [Math] One tailed confidence interval $1

    confidence interval null hypothesis example

  4. Hypothesis Testing

    confidence interval null hypothesis example

  5. PPT

    confidence interval null hypothesis example

  6. The Relationship Between Confidence Intervals and Hypothesis Tests

    confidence interval null hypothesis example

VIDEO

  1. Statitics Science 6 /Hypotheses ( Interval Estimation

  2. 1.3 Confidence intervals and two-sided tests

  3. Testing of Hypothesis

  4. Testing of Hypothesis

  5. Testing of Hypothesis

  6. Testing of Hypothesis

COMMENTS

  1. 6.6

    In other words, if the the 95% confidence interval contains the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always fail to reject the null hypothesis. If the 95% confidence interval does not contain the hypothesize parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always ...

  2. Hypothesis Testing and Confidence Intervals

    The relationship between the confidence level and the significance level for a hypothesis test is as follows: Confidence level = 1 - Significance level (alpha) For example, if your significance level is 0.05, the equivalent confidence level is 95%. Both of the following conditions represent statistically significant results: The P-value in a ...

  3. Understanding Confidence Intervals

    To calculate the 95% confidence interval, we can simply plug the values into the formula. For the USA: So for the USA, the lower and upper bounds of the 95% confidence interval are 34.02 and 35.98. For GB: So for the GB, the lower and upper bounds of the 95% confidence interval are 33.04 and 36.96.

  4. The Relationship Between Hypothesis Testing and Confidence Intervals

    An example of a typical hypothesis test (two-tailed) where "p" is some parameter. First, we state our two kinds of hypothesis:. Null hypothesis (H0): The "status quo" or "known/accepted fact".States that there is no statistical significance between two variables and is usually what we are looking to disprove.

  5. Confidence Intervals

    Below, we'll walk through an example followed by the process to generate a confidence interval. Confidence Interval Example. Like mentioned above, the median price per night for a Chicago Airbnb was $126 in our sample. ... We would want to reject the null hypothesis and adopt the alternative hypothesis as a more reasonable claim. In this case ...

  6. 8.3: Confidence Intervals

    Step 4: Make the Decision Finally, we can compare our confidence interval to our null hypothesis value. The null value of 38 is higher than our lower bound of 37.76 and lower than our upper bound of 41.94. Thus, the confidence interval brackets our null hypothesis value, and we fail to reject the null hypothesis: Fail to Reject \(H_0\).

  7. Hypothesis Testing and Confidence Intervals

    The correct answer is A. To calculate the 95% confidence interval for the mean study time of all candidates, we can use the formula for the confidence interval when the population variance is unknown: Confidence Interval = ¯X ±t1−α 2 × s √n Confidence Interval = X ¯ ± t 1 − α 2 × s n. Where: ¯X X ¯ is the sample mean.

  8. 8.6 Relationship Between Confidence Intervals and Hypothesis Tests

    In part b), we are 95% confident that the mean ABI is less than 0.9 since the entire confidence interval is below 0.9. In other words, the hypothesized value 0.9 is outside the corresponding confidence interval, we should reject the null. Therefore, the results obtained in parts a) and b) are consistent.

  9. The Ultimate Guide to Hypothesis Testing and Confidence Intervals in

    Step 1: Set up the null hypothesis: Two tails: H0: Ᾱ = μ ... one sample mean confidence interval. where Ᾱ is the sample's mean, and t can be found in the t table above based on the confidence level and degree of freedom. For example, for a sample with 10 observations, the t value for the 95% confidence interval is 2.262. ...

  10. PDF Lecture 7

    Recap: Hypothesis testing for a population mean. Set the hypotheses. H0 : = null value HA : < or > or 6= null value. Check assumptions and conditions. Independence: random sample/assignment, 10% condition when sampling without replacement Normality: nearly normal population or n 30, no extreme skew.

  11. Confidence Intervals: Interpreting, Finding & Formulas

    A confidence interval (CI) is a range of values that is likely to contain the value of an unknown population parameter. These intervals represent a plausible domain for the parameter given the characteristics of your sample data. Confidence intervals are derived from sample statistics and are calculated using a specified confidence level.

  12. Hypothesis Testing, P Values, Confidence Intervals, and Significance

    Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting ...

  13. Understanding Hypothesis Tests: Confidence Intervals and ...

    If the P value is less than alpha, the confidence interval will not contain the null hypothesis value. For our example, the P value (0.031) is less than the significance level (0.05), which indicates that our results are statistically significant. Similarly, our 95% confidence interval [267 394] does not include the null hypothesis mean of 260 ...

  14. Null Hypothesis: Definition, Rejecting & Examples

    When your sample contains sufficient evidence, you can reject the null and conclude that the effect is statistically significant. Statisticians often denote the null hypothesis as H 0 or H A.. Null Hypothesis H 0: No effect exists in the population.; Alternative Hypothesis H A: The effect exists in the population.; In every study or experiment, researchers assess an effect or relationship.

  15. 11.8: Significance Testing and Confidence Intervals

    If the \(95\%\) confidence interval contains zero (more precisely, the parameter value specified in the null hypothesis), then the effect will not be significant at the \(0.05\) level. Looking at non-significant effects in terms of confidence intervals makes clear why the null hypothesis should not be accepted when it is not rejected: Every ...

  16. 9.1 Null and Alternative Hypotheses

    The actual test begins by considering two hypotheses.They are called the null hypothesis and the alternative hypothesis.These hypotheses contain opposing viewpoints. H 0, the —null hypothesis: a statement of no difference between sample means or proportions or no difference between a sample mean or proportion and a population mean or proportion. In other words, the difference equals 0.

  17. Null & Alternative Hypotheses

    The null and alternative hypotheses offer competing answers to your research question. When the research question asks "Does the independent variable affect the dependent variable?": The null hypothesis ( H0) answers "No, there's no effect in the population.". The alternative hypothesis ( Ha) answers "Yes, there is an effect in the ...

  18. PDF Confidence Intervals and Hypothesis Tests: Two Samples

    Chapter 9. Confidence Intervals and Hypothesis Tests: Two Samples. Inferences Based on Two Samples. In the following sections, our goal is to compare two population parameters to each other. We want to know the relationship between the parameters (if they are equal or if one is larger than the other). For example, I may want to compare the ...

  19. Chapter 7 Two-Sample Hypothesis Tests and Confidence Intervals

    Thus, we can state that with 95% confidence the difference between the mean NoCaffeine taps and mean Caffeine taps is between -5.4 and -1.5 taps. Notice that the null hypothesis value, \(\delta=0\), is not a value supported by the data because 0 is not in the 95% confidence interval. A subtle point in the above bootstrap code does not re ...

  20. Rejecting the Null Hypothesis Using Confidence Intervals

    On the other hand, if H 1 ≠ 0.61, then since 0.61 is not in the confidence interval we can reject the null hypothesis at ɑ = 0.05. Note that the confidence level of 95% and the level of significance at ɑ = 0.05 = 5% are complements, which is the "H o is True" column in the above table.

  21. How to Write a Null Hypothesis (5 Examples)

    Whenever we perform a hypothesis test, we always write a null hypothesis and an alternative hypothesis, which take the following forms: H0 (Null Hypothesis): Population parameter =, ≤, ≥ some value. HA (Alternative Hypothesis): Population parameter <, >, ≠ some value. Note that the null hypothesis always contains the equal sign.

  22. Using a confidence interval to decide whether to reject the null hypothesis

    Remember that the decision to reject the null hypothesis (H 0) or fail to reject it can be based on the p-value and your chosen significance level (also called α). If the p-value is less than or equal to α, you reject H 0; if it is greater than α, you fail to reject H 0. Your decision can also be based on the confidence interval (or bound ...

  23. Student's t Table (Free Download)

    A non-directional hypothesis states that a population parameter (such as a mean or regression coefficient) is not equal to a certain value (such as 0). Two-tailed tests are appropriate for most studies. If you're calculating a confidence interval, choose two-tailed. One-tailed tests are used when the alternative hypothesis is directional.