
A Complete Guide on Hypothesis Testing in Statistics


In today’s data-driven world, decisions are based on data all the time. Hypotheses play a crucial role in that process, whether it is in making business decisions, in the health sector, in academia, or in quality improvement. Without hypotheses and hypothesis tests, you risk drawing the wrong conclusions and making bad decisions. In this tutorial, you will look at hypothesis testing in statistics.

What Is Hypothesis Testing in Statistics?

Hypothesis Testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is used to estimate the relationship between 2 statistical variables.

Let's discuss a few examples of statistical hypotheses from real life:

  • A teacher assumes that 60% of his college's students come from lower-middle-class families.
  • A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.

Now that you know what hypothesis testing is, look at its formula and the different types of hypothesis tests in statistics.

Hypothesis Testing Formula

Z = ( x̅ – μ0 ) / (σ /√n)

  • Here, x̅ is the sample mean,
  • μ0 is the population mean,
  • σ is the population standard deviation,
  • n is the sample size.
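To make the formula concrete, here is a minimal Python sketch of the calculation; the numbers are hypothetical and serve only to illustrate the arithmetic.

```python
import math

def z_statistic(sample_mean, pop_mean, pop_sd, n):
    """One-sample z statistic: (x-bar - mu0) / (sigma / sqrt(n))."""
    standard_error = pop_sd / math.sqrt(n)
    return (sample_mean - pop_mean) / standard_error

# Hypothetical values: sample mean 52, hypothesized mean 50,
# population standard deviation 6, sample size 36.
print(z_statistic(52, 50, 6, 36))  # 2.0
```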

How Does Hypothesis Testing Work?

An analyst performs hypothesis testing on a statistical sample to present evidence of the plausibility of the null hypothesis. Measurements and analyses are conducted on a random sample of the population to test a theory. Analysts use a random population sample to test two hypotheses: the null and alternative hypotheses.

The null hypothesis is typically an equality hypothesis between population parameters; for example, a null hypothesis may claim that the population mean return equals zero. The alternate hypothesis is essentially the inverse of the null hypothesis (e.g., the population mean return is not equal to zero). As a result, they are mutually exclusive, and only one can be correct. One of the two possibilities, however, will always be correct.


Null Hypothesis and Alternate Hypothesis

The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no bearing on the study's outcome unless it is rejected.

H0 is the symbol for it, and it is pronounced H-naught.

The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.

Let's understand this with an example.

A sanitizer manufacturer claims that its product kills 95 percent of germs on average. 

To put this company's claim to the test, create a null and alternate hypothesis.

H0 (Null Hypothesis): Average = 95%.

Alternative Hypothesis (H1): The average is less than 95%.

Another straightforward example to understand this concept is determining whether or not a coin is fair and balanced. The null hypothesis states that the probability of a show of heads is equal to the likelihood of a show of tails. In contrast, the alternate theory states that the probability of a show of heads and tails would be very different.


Hypothesis Testing Calculation With Examples

Let's consider a hypothesis test for the average height of women in the United States. Suppose our null hypothesis is that the average height is 5'4". We gather a sample of 100 women and determine that their average height is 5'5". The population standard deviation is 2 inches.

To calculate the z-score, we would use the following formula:

z = ( x̅ – μ0 ) / (σ /√n)

z = (5'5" - 5'4") / (2" / √100)

z = 0.5 / (0.045)

 We will reject the null hypothesis as the z-score of 11.11 is very large and conclude that there is evidence to suggest that the average height of women in the US is greater than 5'4".
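As a quick check, a short script (a sketch using SciPy for the p-value, and assuming the one-sided alternative that the mean is greater than 5'4") reproduces the calculation above:

```python
import math
from scipy.stats import norm

sample_mean = 65.0   # 5'5" in inches
null_mean = 64.0     # 5'4" in inches
sigma = 2.0          # population standard deviation, in inches
n = 100

z = (sample_mean - null_mean) / (sigma / math.sqrt(n))
p_value = norm.sf(z)  # right-tail (one-sided) p-value

print(z)        # 5.0
print(p_value)  # ~2.9e-07, far below any common significance level
```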

Steps of Hypothesis Testing

Step 1: Specify Your Null and Alternate Hypotheses

It is critical to rephrase your original research hypothesis (the prediction that you wish to study) as a null (H0) and alternative (Ha) hypothesis so that you can test it quantitatively. Your initial hypothesis, which predicts a link between variables, is generally your alternate hypothesis. The null hypothesis predicts no link between the variables of interest.

Step 2: Gather Data

For a statistical test to be legitimate, sampling and data collection must be done in a way that is meant to test your hypothesis. You cannot draw statistical conclusions about the population you are interested in if your data is not representative.

Step 3: Conduct a Statistical Test

A variety of statistical tests are available, but they all compare within-group variance (how spread out the data are within a category) against between-group variance (how different the categories are from one another). If the between-group variance is large enough that there is little or no overlap between groups, your statistical test will display a low p-value to represent this. This suggests that the differences between these groups are unlikely to have occurred by chance. Alternatively, if there is high within-group variance and low between-group variance, your statistical test will show a high p-value, and any difference you find across groups is most likely attributable to chance. The number and types of variables and the level of measurement of your data will influence your choice of statistical test.

Step 4: Determine Rejection Of Your Null Hypothesis

Your statistical test results must determine whether your null hypothesis should be rejected or not. In most circumstances, you will base your judgment on the p-value provided by the statistical test. In most circumstances, your preset level of significance for rejecting the null hypothesis will be 0.05 - that is, when there is less than a 5% likelihood that these data would be seen if the null hypothesis were true. In other circumstances, researchers use a lower level of significance, such as 0.01 (1%). This reduces the possibility of wrongly rejecting the null hypothesis.

Step 5: Present Your Results 

The findings of hypothesis testing will be discussed in the results and discussion portions of your research paper, dissertation, or thesis. You should include a concise overview of the data and a summary of the findings of your statistical test in the results section. In the discussion, you can talk about whether your results confirmed your initial hypothesis or not. Rejecting or failing to reject the null hypothesis is a formal term used in hypothesis testing. This is likely a must for your statistics assignments.

Types of Hypothesis Testing

Z Test

To determine whether a discovery or relationship is statistically significant, hypothesis testing can use a z-test. It usually checks whether two means are the same (the null hypothesis). A z-test can be applied only when the population standard deviation is known and the sample size is 30 data points or more.

T Test

A statistical test called a t-test is employed to compare the means of two groups. It is frequently used in hypothesis testing to determine whether two groups differ or whether a procedure or treatment affects the population of interest.

Chi-Square 

You utilize a Chi-square test for hypothesis testing concerning whether your data is as predicted. To determine if the expected and observed results are well-fitted, the Chi-square test analyzes the differences between categorical variables from a random sample. The test's fundamental premise is that the observed values in your data should be compared to the predicted values that would be present if the null hypothesis were true.

Hypothesis Testing and Confidence Intervals

Both confidence intervals and hypothesis tests are inferential techniques that depend on the sampling distribution. Data from a sample is used to estimate a population parameter using confidence intervals. Data from a sample is used in hypothesis testing to examine a given hypothesis. We must have a postulated parameter to conduct hypothesis testing.

Bootstrap distributions and randomization distributions are created using comparable simulation techniques. The observed sample statistic is the focal point of a bootstrap distribution, whereas the null hypothesis value is the focal point of a randomization distribution.

A confidence interval contains a range of plausible estimates of the population parameter. In this lesson, we created just two-tailed confidence intervals. There is a direct connection between these two-tailed confidence intervals and two-tailed hypothesis tests: they typically provide the same results. In other words, a hypothesis test at the 0.05 level will virtually always fail to reject the null hypothesis if the 95% confidence interval contains the hypothesized value. Likewise, a hypothesis test at the 0.05 level will nearly certainly reject the null hypothesis if the 95% confidence interval does not include the hypothesized parameter.
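To make this duality concrete, here is a small sketch with hypothetical numbers: the 95% confidence interval excludes the hypothesized mean exactly when the two-tailed z test at the 0.05 level rejects it.

```python
import math
from scipy.stats import norm

# Hypothetical values
sample_mean, hypothesized_mean = 103.0, 100.0
sigma, n = 10.0, 50
alpha = 0.05

se = sigma / math.sqrt(n)
z_crit = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05

# Two-tailed z test
z = (sample_mean - hypothesized_mean) / se
reject = abs(z) > z_crit

# 95% confidence interval for the mean
ci = (sample_mean - z_crit * se, sample_mean + z_crit * se)
excludes_null = not (ci[0] <= hypothesized_mean <= ci[1])

print(round(z, 2), reject)                        # 2.12 True
print([round(x, 2) for x in ci], excludes_null)   # [100.23, 105.77] True
```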

Simple and Composite Hypothesis Testing

Depending on the population distribution, you can classify the statistical hypothesis into two types.

Simple Hypothesis: A simple hypothesis specifies an exact value for the parameter.

Composite Hypothesis: A composite hypothesis specifies a range of values.

A company is claiming that their average sales for this quarter are 1000 units. This is an example of a simple hypothesis.

Suppose the company claims that the sales are in the range of 900 to 1000 units. Then this is a case of a composite hypothesis.

One-Tailed and Two-Tailed Hypothesis Testing

The One-Tailed test, also called a directional test, considers a critical region of data that would result in the null hypothesis being rejected if the test sample falls into it, inevitably meaning the acceptance of the alternate hypothesis.

In a one-tailed test, the critical distribution area is one-sided, meaning the test sample is either greater or lesser than a specific value.

In a Two-Tailed test, the test sample is checked for being either greater or less than a range of values, implying that the critical distribution area is two-sided.

If the sample statistic falls in either critical region, the null hypothesis will be rejected and the alternate hypothesis accepted.


Right Tailed Hypothesis Testing

If the greater than (>) sign appears in your hypothesis statement, you are using a right-tailed test, also known as an upper test. Or, to put it another way, the disparity is to the right. For instance, you can contrast the battery life before and after a change in production. Your hypothesis statements can be the following if you want to know whether the battery life is longer than the original (let's say 90 hours):

  • The null hypothesis is H0: μ ≤ 90 (battery life has not increased).
  • The alternative hypothesis is H1: μ > 90 (battery life has risen).

The crucial point in this situation is that the alternate hypothesis (H1), not the null hypothesis, decides whether you get a right-tailed test.

Left Tailed Hypothesis Testing

Alternative hypotheses that assert the true value of a parameter is lower than the null hypothesis value are tested with a left-tailed test; they are indicated by the "less than" sign (<).

Suppose H0: mean = 50 and H1: mean ≠ 50.

According to H1, the mean can be greater than or less than 50. This is an example of a two-tailed test.

Similarly, if H0: mean ≥ 50, then H1: mean < 50.

Here H1 states that the mean is less than 50, so this is a one-tailed (left-tailed) test.

Type 1 and Type 2 Error

A hypothesis test can result in two types of errors.

Type 1 Error: A Type-I error occurs when the sample results lead you to reject the null hypothesis even though it is true.

Type 2 Error: A Type-II error occurs when the null hypothesis is not rejected when it is false, unlike a Type-I error.

Suppose a teacher evaluates the examination paper to decide whether a student passes or fails.

H0: Student has passed

H1: Student has failed

Type I error will be the teacher failing the student [rejects H0] although the student scored the passing marks [H0 was true]. 

Type II error will be the case where the teacher passes the student [do not reject H0] although the student did not score the passing marks [H1 is true].

Level of Significance

The alpha value is a criterion for determining whether a test statistic is statistically significant. In a statistical test, Alpha represents an acceptable probability of a Type I error. Because alpha is a probability, it can be anywhere between 0 and 1. In practice, the most commonly used alpha values are 0.01, 0.05, and 0.1, which represent a 1%, 5%, and 10% chance of a Type I error, respectively (i.e. rejecting the null hypothesis when it is in fact correct).


A p-value is a metric that expresses the likelihood that an observed difference could have occurred by chance. As the p-value decreases, the statistical significance of the observed difference increases. If the p-value is below the chosen significance level, you reject the null hypothesis.

Consider an example in which you are testing whether a new advertising campaign has increased the product's sales. The p-value is the probability of observing sales data at least this extreme if the null hypothesis, which states that there is no change in sales due to the new advertising campaign, were true. If the p-value is 0.30, the observed change in sales would be quite plausible even if the campaign had no effect. If the p-value is 0.03, such a change would be unlikely if the campaign truly had no effect. As you can see, the lower the p-value, the stronger the evidence against the null hypothesis, and the more support there is for the claim that the new advertising campaign caused an increase or decrease in sales.

Why is Hypothesis Testing Important in Research Methodology?

Hypothesis testing is crucial in research methodology for several reasons:

  • Provides evidence-based conclusions: It allows researchers to make objective conclusions based on empirical data, providing evidence to support or refute their research hypotheses.
  • Supports decision-making: It helps make informed decisions, such as accepting or rejecting a new treatment, implementing policy changes, or adopting new practices.
  • Adds rigor and validity: It adds scientific rigor to research using statistical methods to analyze data, ensuring that conclusions are based on sound statistical evidence.
  • Contributes to the advancement of knowledge: By testing hypotheses, researchers contribute to the growth of knowledge in their respective fields by confirming existing theories or discovering new patterns and relationships.

Limitations of Hypothesis Testing

Hypothesis testing has some limitations that researchers should be aware of:

  • It cannot prove or establish the truth: Hypothesis testing provides evidence to support or reject a hypothesis, but it cannot confirm the absolute truth of the research question.
  • Results are sample-specific: Hypothesis testing is based on analyzing a sample from a population, and the conclusions drawn are specific to that particular sample.
  • Possible errors: During hypothesis testing, there is a chance of committing type I error (rejecting a true null hypothesis) or type II error (failing to reject a false null hypothesis).
  • Assumptions and requirements: Different tests have specific assumptions and requirements that must be met to accurately interpret results.

After reading this tutorial, you should have a much better understanding of hypothesis testing, one of the most important concepts in the field of Data Science. The majority of hypotheses are based on speculation about observed behavior, natural phenomena, or established theories.

If you are interested in statistics of data science and skills needed for such a career, you ought to explore Simplilearn’s Post Graduate Program in Data Science.

If you have any questions regarding this ‘Hypothesis Testing In Statistics’ tutorial, do share them in the comment section. Our subject matter expert will respond to your queries. Happy learning!

1. What is hypothesis testing in statistics with example?

Hypothesis testing is a statistical method used to determine if there is enough evidence in a sample data to draw conclusions about a population. It involves formulating two competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha), and then collecting data to assess the evidence. An example: testing if a new drug improves patient recovery (Ha) compared to the standard treatment (H0) based on collected patient data.

2. What is hypothesis testing and its types?

Hypothesis testing is a statistical method used to make inferences about a population based on sample data. It involves formulating two hypotheses: the null hypothesis (H0), which represents the default assumption, and the alternative hypothesis (Ha), which contradicts H0. The goal is to assess the evidence and determine whether there is enough statistical significance to reject the null hypothesis in favor of the alternative hypothesis.

Types of hypothesis testing:

  • One-sample test: Used to compare a sample to a known value or a hypothesized value.
  • Two-sample test: Compares two independent samples to assess if there is a significant difference between their means or distributions.
  • Paired-sample test: Compares two related samples, such as pre-test and post-test data, to evaluate changes within the same subjects over time or under different conditions.
  • Chi-square test: Used to analyze categorical data and determine if there is a significant association between variables.
  • ANOVA (Analysis of Variance): Compares means across multiple groups to check if there is a significant difference between them.

3. What are the steps of hypothesis testing?

The steps of hypothesis testing are as follows:

  • Formulate the hypotheses: State the null hypothesis (H0) and the alternative hypothesis (Ha) based on the research question.
  • Set the significance level: Determine the acceptable level of error (alpha) for making a decision.
  • Collect and analyze data: Gather and process the sample data.
  • Compute test statistic: Calculate the appropriate statistical test to assess the evidence.
  • Make a decision: Compare the test statistic with critical values or p-values and determine whether to reject H0 in favor of Ha or not.
  • Draw conclusions: Interpret the results and communicate the findings in the context of the research question.

4. What are the 2 types of hypothesis testing?

  • One-tailed (or one-sided) test: Tests for the significance of an effect in only one direction, either positive or negative.
  • Two-tailed (or two-sided) test: Tests for the significance of an effect in both directions, allowing for the possibility of a positive or negative effect.

The choice between one-tailed and two-tailed tests depends on the specific research question and the directionality of the expected effect.

5. What are the 3 major types of hypothesis?

The three major types of hypotheses are:

  • Null Hypothesis (H0): Represents the default assumption, stating that there is no significant effect or relationship in the data.
  • Alternative Hypothesis (Ha): Contradicts the null hypothesis and proposes a specific effect or relationship that researchers want to investigate.
  • Nondirectional Hypothesis: An alternative hypothesis that doesn't specify the direction of the effect, leaving it open for both positive and negative possibilities.


9.4 Full Hypothesis Test Examples

Tests on Means: Example 9.8

Jeffrey, as an eight-year-old, established a mean time of 16.43 seconds for swimming the 25-yard freestyle, with a standard deviation of 0.8 seconds. His dad, Frank, thought that Jeffrey could swim the 25-yard freestyle faster using goggles. Frank bought Jeffrey a new pair of expensive goggles and timed Jeffrey for 15 25-yard freestyle swims. For the 15 swims, Jeffrey's mean time was 16 seconds. Frank thought that the goggles helped Jeffrey to swim faster than the 16.43 seconds. Conduct a hypothesis test using a preset α = 0.05. Assume that the swim times for the 25-yard freestyle are normal.

Set up the Hypothesis Test:

Since the problem is about a mean, this is a test of a single population mean.

H0: μ = 16.43   Ha: μ < 16.43

For Jeffrey to swim faster, his time will be less than 16.43 seconds. The "<" tells you this is left-tailed.

Determine the distribution needed:

Random variable: X̄ = the mean time to swim the 25-yard freestyle.

Distribution for the test: X̄ is normal (the population standard deviation is known: σ = 0.8).

X̄ ~ N(μ, σ/√n); therefore, X̄ ~ N(16.43, 0.8/√15).

μ = 16.43 comes from H0 and not the data. σ = 0.8, and n = 15.

Calculate the p-value using the normal distribution for a mean:

p-value = P(x̄ < 16) = 0.0187, where the sample mean in the problem is given as 16.

p-value = 0.0187. (This is called the actual level of significance.) The p-value is the area to the left of the sample mean, which is given as 16.

μ = 16.43 comes from H0. Our assumption is μ = 16.43.

Interpretation of the p-value: If H0 is true, there is a 0.0187 probability (1.87%) that Jeffrey's mean time to swim the 25-yard freestyle is 16 seconds or less. Because a 1.87% chance is small, a mean time of 16 seconds or less is unlikely to have happened randomly. It is a rare event.

Compare α and the p-value:

α = 0.05, p-value = 0.0187, so α > p-value.

Make a decision: Since α > p-value, reject H0.

This indicates that you reject the null hypothesis that the mean time to swim the 25-yard freestyle is at least 16.43 seconds.

Conclusion: At the 5% significance level, there is sufficient evidence that Jeffrey's mean time to swim the 25-yard freestyle is less than 16.43 seconds. Thus, based on the sample data, we conclude that Jeffrey swims faster using the new goggles.

The Type I and Type II errors for this problem are as follows: The Type I error is to conclude that Jeffrey swims the 25-yard freestyle, on average, in less than 16.43 seconds when, in fact, he actually swims the 25-yard freestyle, on average, in at least 16.43 seconds. (Reject the null hypothesis when the null hypothesis is true.)

The Type II error is that there is not evidence to conclude that Jeffrey swims the 25-yard freestyle, on average, in less than 16.43 seconds when, in fact, he actually does swim the 25-yard free-style, on average, in less than 16.43 seconds. (Do not reject the null hypothesis when the null hypothesis is false.)
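As a quick check of Example 9.8, the sketch below recomputes the test statistic and the left-tail p-value in Python and reproduces the 0.0187 reported above.

```python
import math
from scipy.stats import norm

mu0, sigma, n = 16.43, 0.8, 15
x_bar = 16.0

z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = norm.cdf(z)  # left-tailed test: area to the left of the sample mean

print(round(z, 2))        # -2.08
print(round(p_value, 4))  # 0.0187
```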

The mean throwing distance of a football for Marco, a high school quarterback, is 40 yards, with a standard deviation of two yards. The team coach tells Marco to adjust his grip to get more distance. The coach records the distances for 20 throws. For the 20 throws, Marco’s mean distance was 45 yards. The coach thought the different grip helped Marco throw farther than 40 yards. Conduct a hypothesis test using a preset α = 0.05. Assume the throw distances for footballs are normal.

First, determine what type of test this is, set up the hypothesis test, find the p-value, sketch the graph, and state your conclusion.

Example 9.9

Jasmine has just begun her new job on the sales force of a very competitive company. In a sample of 16 sales calls it was found that she closed the contract for an average value of 108 dollars with a standard deviation of 12 dollars. Test at the 5% significance level whether the population mean is at most 100 dollars against the alternative that it is greater than 100 dollars. Company policy requires that new members of the sales force must exceed an average of $100 per contract during the trial employment period. Can we conclude that Jasmine has met this requirement at the 5% significance level?

  • H0: µ ≤ 100, Ha: µ > 100. The null and alternative hypotheses are for the parameter µ because the number of dollars of the contracts is a continuous random variable. Also, this is a one-tailed test because the company is only interested in whether the number of dollars per contract exceeds a particular value. This can be thought of as making a claim that the requirement is being met, and thus the claim is in the alternative hypothesis.
  • Test statistic: tc = (x̄ − µ0) / (s/√n) = (108 − 100) / (12/√16) = 2.67
  • Critical value: tα = 1.753 with n − 1 = 15 degrees of freedom

The test statistic is a Student's t because the sample size is below 30 and the population standard deviation is unknown; therefore, we cannot use the normal distribution. Comparing the calculated value of the test statistic with the critical value of t (tα) at a 5% significance level, we see that the calculated value is in the tail of the distribution. Thus we conclude that 108 dollars per contract is significantly larger than the hypothesized value of 100, and thus we cannot accept the null hypothesis. There is evidence that Jasmine's performance meets company standards.
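A minimal sketch of the calculation in Example 9.9, with the critical value taken from SciPy's t distribution rather than a printed table:

```python
import math
from scipy.stats import t

x_bar, mu0, s, n = 108, 100, 12, 16
alpha = 0.05

t_c = (x_bar - mu0) / (s / math.sqrt(n))
t_crit = t.ppf(1 - alpha, df=n - 1)   # right-tailed critical value

print(round(t_c, 2))     # 2.67
print(round(t_crit, 3))  # 1.753
print(t_c > t_crit)      # True: we cannot accept the null hypothesis
```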

It is believed that a stock price for a particular company will grow at a rate of $5 per week with a standard deviation of $1. An investor believes the stock won’t grow as quickly. The changes in stock price are recorded for ten weeks and are as follows: $4, $3, $2, $3, $1, $7, $2, $1, $1, $2. Perform a hypothesis test using a 5% level of significance. State the null and alternative hypotheses, state your conclusion, and identify the Type I errors.

Example 9.10

A manufacturer of salad dressings uses machines to dispense liquid ingredients into bottles that move along a filling line. The machine that dispenses salad dressings is working properly when 8 ounces are dispensed. Suppose that the average amount dispensed in a particular sample of 35 bottles is 7.91 ounces with a sample variance of 0.03 ounces squared, s². Is there evidence that the machine should be stopped and production wait for repairs? The lost production from a shutdown is potentially so great that management feels that the level of significance in the analysis should be 99%.

Again we will follow the steps in our analysis of this problem.

STEP 1: Set the Null and Alternative Hypothesis. The random variable is the quantity of fluid placed in the bottles. This is a continuous random variable, and the parameter we are interested in is the mean. Our hypothesis therefore is about the mean. In this case we are concerned that the machine is not filling properly. From what we are told it does not matter if the machine is over-filling or under-filling; both seem to be an equally bad error. This tells us that this is a two-tailed test: if the machine is malfunctioning it will be shut down regardless of whether it is over-filling or under-filling. The null and alternative hypotheses are thus: H0: μ = 8 ounces and Ha: μ ≠ 8 ounces.

STEP 2 : Decide the level of significance and draw the graph showing the critical value.

This problem has already set the level of significance at 99%. The decision seems an appropriate one and shows the thought process when setting the significance level. Management wants to be very certain, as certain as probability will allow, that they are not shutting down a machine that is not in need of repair. To draw the distribution and the critical value, we need to know which distribution to use. Because this is a continuous random variable and we are interested in the mean, and the sample size is greater than 30, the appropriate distribution is the normal distribution and the relevant critical value is 2.575 from the normal table or the t-table at 0.005 column and infinite degrees of freedom. We draw the graph and mark these points.

STEP 3 : Calculate sample parameters and the test statistic. The sample parameters are provided, the sample mean is 7.91 and the sample variance is .03 and the sample size is 35. We need to note that the sample variance was provided not the sample standard deviation, which is what we need for the formula. Remembering that the standard deviation is simply the square root of the variance, we therefore know the sample standard deviation, s, is 0.173. With this information we calculate the test statistic as -3.07, and mark it on the graph.

STEP 4: Compare the test statistic and the critical values. Now we compare the test statistic and the critical value by placing the test statistic on the graph. We see that the test statistic is in the lower tail, decidedly beyond the critical value of 2.575 in absolute value. We note that even the very small difference between the hypothesized value and the sample value is still a large number of standard deviations. The sample mean is only 0.09 ounces different from the required level of 8 ounces, but it is more than 3 standard deviations away, and thus we cannot accept the null hypothesis.

STEP 5 : Reach a Conclusion

A test statistic three standard deviations from the hypothesized mean will almost always lead to rejecting the null hypothesis. The probability that anything lies more than three standard deviations away is almost zero; actually it is about 0.0027 on the normal distribution, which is certainly almost zero in a practical sense. Our formal conclusion would be: “At a 99% level of significance we cannot accept the hypothesis that the sample mean came from a distribution with a mean of 8 ounces.” Or less formally, and getting to the point: “At a 99% level of significance we conclude that the machine is under-filling the bottles and is in need of repair.”
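The two-tailed calculation in Example 9.10 can be reproduced with a few lines of Python; the inputs are the values given in the problem.

```python
import math
from scipy.stats import norm

mu0 = 8.0            # a properly working machine dispenses 8 ounces
x_bar = 7.91
s = math.sqrt(0.03)  # sample standard deviation from the given variance
n = 35
alpha = 0.01         # 99% level of confidence

z = (x_bar - mu0) / (s / math.sqrt(n))
z_crit = norm.ppf(1 - alpha / 2)   # two-tailed critical value

print(round(z, 2))       # -3.07
print(round(z_crit, 3))  # 2.576 (rounded to 2.575 in the text above)
print(abs(z) > z_crit)   # True: we cannot accept the null hypothesis
```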

Try It 9.10

A company records the mean daily working time of its employees as 475 minutes, with a standard deviation of 45 minutes. A manager recorded the times of 20 employees. The working times (with frequencies in parentheses) were 460 (3); 465 (2); 470 (3); 475 (1); 480 (6); 485 (3); 490 (2).

Conduct a hypothesis test using a 2.5% level of significance to determine whether the mean time is more than 475 minutes.

Hypothesis Test for Proportions

Just as there were confidence intervals for proportions, or more formally, the population parameter p of the binomial distribution, there is the ability to test hypotheses concerning p.

The population parameter for the binomial is p. The estimated value (point estimate) for p is p′, where p′ = x/n, x is the number of successes in the sample, and n is the sample size.

When you perform a hypothesis test of a population proportion p, you take a simple random sample from the population. The conditions for a binomial distribution must be met: there are a certain number n of independent trials (meaning random sampling), the outcomes of any trial are binary (success or failure), and each trial has the same probability of a success p. The shape of the binomial distribution needs to be similar to the shape of the normal distribution. To ensure this, the quantities np′ and nq′ must both be greater than five (np′ > 5 and nq′ > 5). In this case the binomial distribution of the count of successes can be approximated by the normal distribution with μ = np and σ = √(npq), so the sample (estimated) proportion p′ is approximately normal with mean p and standard deviation √(pq/n). Remember that q = 1 – p. There is no distribution that can correct for this small-sample bias, and thus if these conditions are not met we simply cannot test the hypothesis with the data available at that time. We met this condition when we first estimated confidence intervals for p.

Again, we begin with the standardizing formula, modified because this is the distribution of a binomial.

Substituting p0, the hypothesized value of p, we have:

Zc = (p′ – p0) / √(p0q0 / n), where q0 = 1 – p0.

This is the test statistic for testing hypothesized values of p, where the null and alternative hypotheses take one of the following forms: H0: p = p0 vs. Ha: p ≠ p0; H0: p ≤ p0 vs. Ha: p > p0; or H0: p ≥ p0 vs. Ha: p < p0.

The decision rule stated above applies here also: if the calculated value of Zc shows that the sample proportion is "too many" standard deviations from the hypothesized proportion, the null hypothesis cannot be accepted. The decision as to what is "too many" is pre-determined by the analyst depending on the level of significance required in the test.

Example 9.11

The mortgage department of a large bank is interested in the nature of loans of first-time borrowers. This information will be used to tailor their marketing strategy. They believe that 50% of first-time borrowers take out smaller loans than other borrowers. They perform a hypothesis test to determine if the percentage is the same or different from 50%. They sample 100 first-time borrowers and find that 53 of these loans are smaller than those of the other borrowers. For the hypothesis test, they choose a 5% level of significance.

STEP 1 : Set the null and alternative hypothesis.

H0: p = 0.50   Ha: p ≠ 0.50

The words "is the same or different from" tell you this is a two-tailed test. The Type I and Type II errors are as follows: The Type I error is to conclude that the proportion of borrowers is different from 50% when, in fact, the proportion is actually 50%. (Reject the null hypothesis when the null hypothesis is true). The Type II error is there is not enough evidence to conclude that the proportion of first time borrowers differs from 50% when, in fact, the proportion does differ from 50%. (You fail to reject the null hypothesis when the null hypothesis is false.)

STEP 2 : Decide the level of significance and draw the graph showing the critical value

The level of significance has been set by the problem at the 5% level. Because this is a two-tailed test, one-half of the alpha value will be in the upper tail and one-half in the lower tail, as shown on the graph. The critical value for the normal distribution at the 95% level of confidence is 1.96. This can easily be found in the Student's t-table at the very bottom, at infinite degrees of freedom, remembering that at infinity the t-distribution is the normal distribution. Of course the value can also be found in the normal table, but you have to go looking for one-half of 95% (0.475) inside the body of the table and then read out to the sides and top for the number of standard deviations.

STEP 3 : Calculate the sample parameters and critical value of the test statistic.

The test statistic for testing proportions follows the normal distribution, Z: Zc = (p′ – p0) / √(p0q0 / n).

For this case, the sample of 100 found 53 of these loans were smaller than those of other borrowers. The sample proportion is p′ = 53/100 = 0.53. The test question, therefore, is: “Is 0.53 significantly different from 0.50?” Putting these values into the formula for the test statistic, we find that 0.53 is only 0.60 standard deviations away from 0.50. This is barely off the mean of the standard normal distribution of zero. There is virtually no difference between the sample proportion and the hypothesized proportion in terms of standard deviations.

STEP 4 : Compare the test statistic and the critical value.

The calculated value is well within the critical values of ±1.96 standard deviations, and thus we cannot reject the null hypothesis. To reject the null hypothesis we need significant evidence of a difference between the hypothesized value and the sample value. In this case the sample value is very nearly the same as the hypothesized value measured in terms of standard deviations.

STEP 5 : Reach a conclusion

The formal conclusion would be “At a 5% level of significance we cannot reject the null hypothesis that 50% of first-time borrowers take out smaller loans than other borrowers.” Notice the length to which the conclusion goes to include all of the conditions that are attached to the conclusion. Statisticians, for all the criticism they receive, are careful to be very specific even when this seems trivial. Statisticians cannot say more than they know, and the data constrain the conclusion to be within the metes and bounds of the data.
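A short sketch of Example 9.11's test statistic, using the proportion formula given earlier:

```python
import math
from scipy.stats import norm

p0 = 0.50
x, n = 53, 100
alpha = 0.05

p_prime = x / n
z = (p_prime - p0) / math.sqrt(p0 * (1 - p0) / n)
z_crit = norm.ppf(1 - alpha / 2)

print(round(z, 2))       # 0.6
print(round(z_crit, 2))  # 1.96
print(abs(z) > z_crit)   # False: we cannot reject the null hypothesis
```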

Try It 9.11

A teacher believes that 85% of students in the class will want to go on a field trip to the local zoo. The teacher performs a hypothesis test to determine if the percentage is the same or different from 85%. The teacher samples 50 students and 39 reply that they would want to go to the zoo. For the hypothesis test, use a 1% level of significance.

Example 9.12

Suppose a consumer group suspects that the proportion of households that have three or more cell phones is 30%. A cell phone company has reason to believe that the proportion is not 30%. Before they start a big advertising campaign, they conduct a hypothesis test. Their marketing people survey 150 households with the result that 43 of the households have three or more cell phones.

Here is an abbreviated version of the procedure for solving a hypothesis test on a proportion.
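Because the abbreviated solution itself is not reproduced here, the sketch below shows how the numbers work out, assuming the two-tailed setup H0: p = 0.30 versus Ha: p ≠ 0.30 and a 5% significance level.

```python
import math
from scipy.stats import norm

p0 = 0.30
x, n = 43, 150

p_prime = x / n                                    # about 0.287
z = (p_prime - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * norm.sf(abs(z))                      # two-tailed p-value

print(round(z, 2))        # -0.36
print(round(p_value, 2))  # ~0.72: fail to reject H0 at the 5% level
```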

Try It 9.12

Marketers believe that 92% of adults in the United States own a cell phone. A cell phone manufacturer believes that number is actually lower. 200 American adults are surveyed, of whom 174 report having cell phones. Use a 5% level of significance. State the null and alternative hypotheses, find the p-value, state your conclusion, and identify the Type I and Type II errors.

Example 9.13

The National Institute of Standards and Technology provides exact data on conductivity properties of materials. Following are conductivity measurements for 11 randomly selected pieces of a particular type of glass.

1.11; 1.07; 1.11; 1.07; 1.12; 1.08; .98; .98; 1.02; .95; .95 Is there convincing evidence that the average conductivity of this type of glass is greater than one? Use a significance level of 0.05.

Let’s follow a four-step process to answer this statistical question.

  • H0: μ ≤ 1
  • Ha: μ > 1
  • Plan: We are testing a sample mean without a known population standard deviation and with fewer than 30 observations. Therefore, we need to use a Student's t distribution. Assume the underlying population is normal.
  • Do the calculations and draw the graph (a short computation sketch follows this list).
  • State the Conclusions: We cannot accept the null hypothesis. It is reasonable to state that the data support the claim that the average conductivity level is greater than one.
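A minimal computation sketch for the conductivity data, computing the one-sample t statistic by hand and using SciPy only for the right-tail p-value:

```python
import math
import statistics
from scipy.stats import t

data = [1.11, 1.07, 1.11, 1.07, 1.12, 1.08, 0.98, 0.98, 1.02, 0.95, 0.95]
mu0 = 1.0
n = len(data)

x_bar = statistics.mean(data)   # 1.04
s = statistics.stdev(data)      # sample standard deviation, about 0.066

t_c = (x_bar - mu0) / (s / math.sqrt(n))
p_value = t.sf(t_c, df=n - 1)   # right-tailed p-value

# about 2.01 and 0.036: below 0.05, so we cannot accept H0
print(round(t_c, 2), round(p_value, 3))
```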

Try It 9.13

The boiling point of a specific liquid is measured for 15 samples, and the boiling points are obtained as follows:

205; 206; 206; 202; 199; 194; 197; 198; 198; 201; 201; 202; 207; 211; 205

Is there convincing evidence that the average boiling point is greater than 200? Use a significance level of 0.1. Assume the population is normal.

Example 9.14

In a study of 420,019 cell phone users, 172 of the subjects developed brain cancer. Test the claim that cell phone users developed brain cancer at a greater rate than that for non-cell phone users (the rate of brain cancer for non-cell phone users is 0.0340%). Since this is a critical issue, use a 0.005 significance level. Explain why the significance level should be so low in terms of a Type I error.

  • H0: p ≤ 0.00034
  • Ha: p > 0.00034

If we commit a Type I error, we are essentially accepting a false claim. Since the claim describes cancer-causing environments, we want to minimize the chances of incorrectly identifying causes of cancer.

  • We will be testing a sample proportion with x = 172 and n = 420,019. The sample is sufficiently large because we have np' = 420,019(0.00034) = 142.8, nq' = 420,019(0.99966) = 419,876.2, two independent outcomes, and a fixed probability of success p' = 0.00034. Thus we will be able to generalize our results to the population.
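A sketch of the calculation for these hypotheses, using the normal approximation for the sample proportion:

```python
import math
from scipy.stats import norm

p0 = 0.00034
x, n = 172, 420_019
alpha = 0.005

p_prime = x / n
z = (p_prime - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = norm.sf(z)   # right-tailed p-value

print(round(z, 2))        # about 2.44
print(round(p_value, 4))  # about 0.0073, which is larger than alpha = 0.005
```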

Try It 9.14

In a study of 390,000 moisturizer users, 138 of the subjects developed skin diseases. Test the claim that moisturizer users developed skin diseases at a greater rate than that for non-moisturizer users (the rate of skin diseases for non-moisturizer users is 0.041%). Since this is a critical issue, use a 0.005 significance level. Explain why the significance level should be so low in terms of a Type I error.


Attribution: The examples above are from OpenStax, Introductory Business Statistics 2e by Alexander Holmes, Barbara Illowsky, and Susan Dean (OpenStax, Houston, Texas; published Dec 13, 2023), licensed under a Creative Commons Attribution License. Access for free at https://openstax.org/books/introductory-business-statistics-2e/pages/1-introduction. Section URL: https://openstax.org/books/introductory-business-statistics-2e/pages/9-4-full-hypothesis-test-examples.

Hypothesis Testing

Hypothesis testing is a tool for making statistical inferences about the population data. It is an analysis tool that tests assumptions and determines how likely something is within a given standard of accuracy. Hypothesis testing provides a way to verify whether the results of an experiment are valid.

A null hypothesis and an alternative hypothesis are set up before performing the hypothesis testing. This helps to arrive at a conclusion regarding the sample obtained from the population. In this article, we will learn more about hypothesis testing, its types, steps to perform the testing, and associated examples.

What is Hypothesis Testing in Statistics?

Hypothesis testing uses sample data from the population to draw useful conclusions regarding the population probability distribution. It tests an assumption made about the data using different types of hypothesis testing methodologies. The hypothesis testing results in either rejecting or not rejecting the null hypothesis.

Hypothesis Testing Definition

Hypothesis testing can be defined as a statistical tool that is used to identify if the results of an experiment are meaningful or not. It involves setting up a null hypothesis and an alternative hypothesis. These two hypotheses will always be mutually exclusive. This means that if the null hypothesis is true then the alternative hypothesis is false and vice versa. An example of hypothesis testing is setting up a test to check if a new medicine works on a disease in a more efficient manner.

Null Hypothesis

The null hypothesis is a concise mathematical statement that is used to indicate that there is no difference between two possibilities. In other words, there is no difference between certain characteristics of data. This hypothesis assumes that the outcomes of an experiment are based on chance alone. It is denoted as \(H_{0}\). Hypothesis testing is used to conclude if the null hypothesis can be rejected or not. Suppose an experiment is conducted to check if girls are shorter than boys at the age of 5. The null hypothesis will say that they are the same height.

Alternative Hypothesis

The alternative hypothesis is an alternative to the null hypothesis. It is used to show that the observations of an experiment are due to some real effect. It indicates that there is a statistical significance between two possible outcomes and can be denoted as \(H_{1}\) or \(H_{a}\). For the above-mentioned example, the alternative hypothesis would be that girls are shorter than boys at the age of 5.

Hypothesis Testing P Value

In hypothesis testing, the p value is used to indicate whether the results obtained after conducting a test are statistically significant or not. It also indicates the probability of making an error in rejecting or not rejecting the null hypothesis. This value is always a number between 0 and 1. The p value is compared to an alpha level, \(\alpha\), or significance level. The alpha level can be defined as the acceptable risk of incorrectly rejecting the null hypothesis. The alpha level is usually chosen between 1% and 5%.

Hypothesis Testing Critical Region

All sets of values that lead to rejecting the null hypothesis lie in the critical region. Furthermore, the value that separates the critical region from the non-critical region is known as the critical value.

Hypothesis Testing Formula

Depending upon the type of data available and the size, different types of hypothesis testing are used to determine whether the null hypothesis can be rejected or not. The hypothesis testing formulas for some important test statistics are given below:

  • z = \(\frac{\overline{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\). \(\overline{x}\) is the sample mean, \(\mu\) is the population mean, \(\sigma\) is the population standard deviation and n is the size of the sample.
  • t = \(\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}\). s is the sample standard deviation.
  • \(\chi ^{2} = \sum \frac{(O_{i}-E_{i})^{2}}{E_{i}}\). \(O_{i}\) is the observed value and \(E_{i}\) is the expected value.

We will learn more about these test statistics in the upcoming section.

Types of Hypothesis Testing

Selecting the correct test for performing hypothesis testing can be confusing. These tests are used to determine a test statistic on the basis of which the null hypothesis can either be rejected or not rejected. Some of the important tests used for hypothesis testing are given below.

Hypothesis Testing Z Test

A z test is a way of hypothesis testing that is used for a large sample size (n ≥ 30). It is used to determine whether there is a difference between the population mean and the sample mean when the population standard deviation is known. It can also be used to compare the mean of two samples. It is used to compute the z test statistic. The formulas are given as follows:

  • One sample: z = \(\frac{\overline{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\).
  • Two samples: z = \(\frac{(\overline{x_{1}}-\overline{x_{2}})-(\mu_{1}-\mu_{2})}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}}\).

Hypothesis Testing t Test

The t test is another method of hypothesis testing that is used for a small sample size (n < 30). It is also used to compare the sample mean and population mean. However, the population standard deviation is not known. Instead, the sample standard deviation is known. The mean of two samples can also be compared using the t test.

  • One sample: t = \(\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}\).
  • Two samples: t = \(\frac{(\overline{x_{1}}-\overline{x_{2}})-(\mu_{1}-\mu_{2})}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}\).
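As an illustration of the two-sample formula above, here is a small sketch; the summary statistics (means, standard deviations, and sample sizes) are hypothetical.

```python
import math

def two_sample_t(x1_bar, x2_bar, s1, s2, n1, n2):
    """Two-sample t statistic from the formula above, assuming the
    hypothesized difference mu1 - mu2 is zero."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1_bar - x2_bar) / se

# Hypothetical samples: means 52 and 48, standard deviations 6 and 5, sizes 40 and 35
print(round(two_sample_t(52, 48, 6, 5, 40, 35), 2))  # 3.15
```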

Hypothesis Testing Chi Square

The Chi square test is a hypothesis testing method that is used to check whether the variables in a population are independent or not. It is used when the test statistic is chi-squared distributed.

One Tailed Hypothesis Testing

One tailed hypothesis testing is done when the rejection region is only in one direction. It can also be known as directional hypothesis testing because the effects can be tested in one direction only. This type of testing is further classified into the right tailed test and left tailed test.

Right Tailed Hypothesis Testing

The right tail test is also known as the upper tail test. This test is used to check whether the population parameter is greater than some value. The null and alternative hypotheses for this test are given as follows:

\(H_{0}\): The population parameter is ≤ some value

\(H_{1}\): The population parameter is > some value.

If the test statistic is greater than the critical value, then the null hypothesis is rejected.


Left Tailed Hypothesis Testing

The left tail test is also known as the lower tail test. It is used to check whether the population parameter is less than some value. The hypotheses for this hypothesis testing can be written as follows:

\(H_{0}\): The population parameter is ≥ some value

\(H_{1}\): The population parameter is < some value.

The null hypothesis is rejected if the test statistic has a value lesser than the critical value.


Two Tailed Hypothesis Testing

In this hypothesis testing method, the critical region lies on both sides of the sampling distribution. It is also known as a non-directional hypothesis testing method. The two-tailed test is used when it needs to be determined whether the population parameter is different from some value. The hypotheses can be set up as follows:

\(H_{0}\): the population parameter = some value

\(H_{1}\): the population parameter ≠ some value

The null hypothesis is rejected if the test statistic falls in either rejection region, that is, if its absolute value is greater than the critical value.


Hypothesis Testing Steps

Hypothesis testing can be easily performed in five simple steps. The most important step is to correctly set up the hypotheses and identify the right method for hypothesis testing. The basic steps to perform hypothesis testing are as follows:

  • Step 1: Set up the null hypothesis by correctly identifying whether it is the left-tailed, right-tailed, or two-tailed hypothesis testing.
  • Step 2: Set up the alternative hypothesis.
  • Step 3: Choose the correct significance level, \(\alpha\), and find the critical value.
  • Step 4: Calculate the correct test statistic (z, t or \(\chi\)) and p-value.
  • Step 5: Compare the test statistic with the critical value or compare the p-value with \(\alpha\) to arrive at a conclusion. In other words, decide if the null hypothesis is to be rejected or not.

Hypothesis Testing Example

The best way to solve a problem on hypothesis testing is by applying the 5 steps mentioned in the previous section. Suppose a researcher claims that the average weight of men is greater than 100 kgs, with a standard deviation of 15 kgs. 30 men are chosen, with an average weight of 112.5 kgs. Using hypothesis testing, check if there is enough evidence to support the researcher's claim. The confidence level is given as 95%.

Step 1: This is an example of a right-tailed test. Set up the null hypothesis as \(H_{0}\): \(\mu\) = 100.

Step 2: The alternative hypothesis is given by \(H_{1}\): \(\mu\) > 100.

Step 3: As this is a one-tailed test, \(\alpha\) = 100% - 95% = 5%. This can be used to determine the critical value.

1 - \(\alpha\) = 1 - 0.05 = 0.95

0.95 gives the required area under the curve. Now using a normal distribution table, the area 0.95 is at z = 1.645. A similar process can be followed for a t-test. The only additional requirement is to calculate the degrees of freedom given by n - 1.

Step 4: Calculate the z test statistic, since the sample size is 30, the population standard deviation is known, and the sample and population means are given.

z = \(\frac{\overline{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\).

\(\mu\) = 100, \(\overline{x}\) = 112.5, n = 30, \(\sigma\) = 15

z = \(\frac{112.5-100}{\frac{15}{\sqrt{30}}}\) = 4.56

Step 5: Conclusion. As 4.56 > 1.645 thus, the null hypothesis can be rejected.
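The five steps can be verified with a short script; it reproduces the critical value 1.645 and the test statistic 4.56 from the example above.

```python
import math
from scipy.stats import norm

mu0, sigma = 100, 15
x_bar, n = 112.5, 30
alpha = 0.05

z = (x_bar - mu0) / (sigma / math.sqrt(n))
z_crit = norm.ppf(1 - alpha)   # right-tailed critical value

print(round(z, 2))       # 4.56
print(round(z_crit, 3))  # 1.645
print(z > z_crit)        # True: the null hypothesis is rejected
```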

Hypothesis Testing and Confidence Intervals

Confidence intervals form an important part of hypothesis testing. This is because the alpha level can be determined from a given confidence interval. Suppose a confidence interval is given as 95%. Subtract the confidence interval from 100%. This gives 100 - 95 = 5% or 0.05. This is the alpha value of a one-tailed hypothesis testing. To obtain the alpha value for a two-tailed hypothesis testing, divide this value by 2. This gives 0.05 / 2 = 0.025.


Important Notes on Hypothesis Testing

  • Hypothesis testing is a technique that is used to verify whether the results of an experiment are statistically significant.
  • It involves the setting up of a null hypothesis and an alternate hypothesis.
  • There are three types of tests that can be conducted under hypothesis testing - z test, t test, and chi square test.
  • Hypothesis testing can be classified as right tail, left tail, and two tail tests.

Examples on Hypothesis Testing

  • Example 1: The average weight of a dumbbell in a gym is 90lbs. However, a physical trainer believes that the average weight might be higher. A random sample of 5 dumbbells has an average weight of 110lbs and a standard deviation of 18lbs. Using hypothesis testing, check if the physical trainer's claim can be supported at a 95% confidence level. Solution: As the sample size is less than 30, the t-test is used. \(H_{0}\): \(\mu\) = 90, \(H_{1}\): \(\mu\) > 90, \(\overline{x}\) = 110, \(\mu\) = 90, n = 5, s = 18, \(\alpha\) = 0.05. Using the t-distribution table, the critical value is 2.132. t = \(\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}\), so t = 2.484. As 2.484 > 2.132, the null hypothesis is rejected. Answer: The average weight of the dumbbells may be greater than 90lbs
  • Example 2: The average score on a test is 80 with a standard deviation of 10. With a new teaching curriculum introduced, it is believed that this score will change. On randomly testing the scores of 36 students, the mean was found to be 88. With a 0.05 significance level, is there any evidence to support this claim? Solution: This is an example of two-tail hypothesis testing. The z test will be used. \(H_{0}\): \(\mu\) = 80, \(H_{1}\): \(\mu\) ≠ 80, \(\overline{x}\) = 88, \(\mu\) = 80, n = 36, \(\sigma\) = 10, \(\alpha\) = 0.05 / 2 = 0.025. The critical value using the normal distribution table is 1.96. z = \(\frac{\overline{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\), so z = \(\frac{88-80}{\frac{10}{\sqrt{36}}}\) = 4.8. As 4.8 > 1.96, the null hypothesis is rejected. Answer: There is a difference in the scores after the new curriculum was introduced.
  • Example 3: The average score of a class is 90. However, a teacher believes that the average score might be lower. The scores of 6 students were randomly measured. The mean was 82 with a standard deviation of 18. With a 0.05 significance level, use hypothesis testing to check if this claim is true. Solution: The t test will be used. \(H_{0}\): \(\mu\) = 90, \(H_{1}\): \(\mu\) < 90, \(\overline{x}\) = 82, \(\mu\) = 90, n = 6, s = 18. The critical value from the t table is -2.015. t = \(\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}\) = \(\frac{82-90}{\frac{18}{\sqrt{6}}}\) = -1.088. As -1.088 > -2.015, we fail to reject the null hypothesis. Answer: There is not enough evidence to support the claim.




FAQs on Hypothesis Testing

What is Hypothesis Testing?

Hypothesis testing in statistics is a tool that is used to make inferences about the population data. It is also used to check if the results of an experiment are valid.

What is the z Test in Hypothesis Testing?

The z test in hypothesis testing is used to find the z test statistic for normally distributed data. The z test is used when the standard deviation of the population is known and the sample size is greater than or equal to 30.

What is the t Test in Hypothesis Testing?

The t test in hypothesis testing is used when the data follows a Student's t distribution. It is used when the sample size is less than 30 and the standard deviation of the population is not known.

What is the formula for z test in Hypothesis Testing?

The formula for a one sample z test in hypothesis testing is z = \(\frac{\overline{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\) and for two samples is z = \(\frac{(\overline{x_{1}}-\overline{x_{2}})-(\mu_{1}-\mu_{2})}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}}\).
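As an illustration of the two-sample formula in this answer, the R sketch below plugs in made-up summary statistics; every numeric value here is hypothetical.

```r
# Hypothetical summary statistics for two independent samples
x1bar <- 105; x2bar <- 100        # sample means
sd1   <- 15;  sd2   <- 12         # known population standard deviations
n1    <- 40;  n2    <- 50         # sample sizes

# Two-sample z statistic for H0: mu1 - mu2 = 0
z <- ((x1bar - x2bar) - 0) / sqrt(sd1^2 / n1 + sd2^2 / n2)
p_two_sided <- 2 * (1 - pnorm(abs(z)))   # two-sided p-value
```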

What is the p Value in Hypothesis Testing?

The p value helps to determine if the test results are statistically significant or not. In hypothesis testing, the null hypothesis can either be rejected or not rejected based on the comparison between the p value and the alpha level.

What is One Tail Hypothesis Testing?

When the rejection region lies on only one side of the distribution curve, the test is known as one-tailed hypothesis testing. The right-tailed test and the left-tailed test are the two types of directional hypothesis tests.

What is the Alpha Level in Two Tail Hypothesis Testing?

In a two-tailed hypothesis test the rejection region is split between the two tails of the curve, so the area in each tail is \(\alpha\) / 2. For example, with \(\alpha\) = 0.05, each tail contains 0.025.
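The relationship between \(\alpha\) and the z critical values for one-tailed and two-tailed tests can be seen directly in R; a small sketch:

```r
alpha <- 0.05
qnorm(1 - alpha)      # one-tailed (right-tail) critical value, about 1.645
qnorm(1 - alpha / 2)  # two-tailed critical value (alpha/2 in each tail), about 1.96
```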


How to Write a Great Hypothesis

Hypothesis Format, Examples, and Tips


A hypothesis is a tentative statement about the relationship between two or more  variables. It is a specific, testable prediction about what you expect to happen in a study.

For example, a study designed to look at the relationship between sleep deprivation and test performance might have a hypothesis that states: "This study is designed to assess the hypothesis that sleep-deprived people will perform worse on a test than individuals who are not sleep-deprived."

This article explores how a hypothesis is used in psychology research, how to write a good hypothesis, and the different types of hypotheses you might use.

The Hypothesis in the Scientific Method

In the scientific method , whether it involves research in psychology, biology, or some other area, a hypothesis represents what the researchers think will happen in an experiment. The scientific method involves the following steps:

  • Forming a question
  • Performing background research
  • Creating a hypothesis
  • Designing an experiment
  • Collecting data
  • Analyzing the results
  • Drawing conclusions
  • Communicating the results

The hypothesis is a prediction, but it involves more than a guess. Most of the time, the hypothesis begins with a question which is then explored through background research. It is only at this point that researchers begin to develop a testable hypothesis. Unless you are creating an exploratory study, your hypothesis should always explain what you  expect  to happen.

In a study exploring the effects of a particular drug, the hypothesis might be that researchers expect the drug to have some type of effect on the symptoms of a specific illness. In psychology, the hypothesis might focus on how a certain aspect of the environment might influence a particular behavior.

Remember, a hypothesis does not have to be correct. While the hypothesis predicts what the researchers expect to see, the goal of the research is to determine whether this guess is right or wrong. When conducting an experiment, researchers might explore a number of factors to determine which ones might contribute to the ultimate outcome.

In many cases, researchers may find that the results of an experiment  do not  support the original hypothesis. When writing up these results, the researchers might suggest other options that should be explored in future studies.

In many cases, researchers might draw a hypothesis from a specific theory or build on previous research. For example, prior research has shown that stress can impact the immune system. So a researcher might hypothesize: "People with high-stress levels will be more likely to contract a common cold after being exposed to the virus than people who have low-stress levels."

In other instances, researchers might look at commonly held beliefs or folk wisdom. "Birds of a feather flock together" is one example of folk wisdom that a psychologist might try to investigate. The researcher might pose a specific hypothesis that "People tend to select romantic partners who are similar to them in interests and educational level."

Elements of a Good Hypothesis

So how do you write a good hypothesis? When trying to come up with a hypothesis for your research or experiments, ask yourself the following questions:

  • Is your hypothesis based on your research on a topic?
  • Can your hypothesis be tested?
  • Does your hypothesis include independent and dependent variables?

Before you come up with a specific hypothesis, spend some time doing background research. Once you have completed a literature review, start thinking about potential questions you still have. Pay attention to the discussion section in the  journal articles you read . Many authors will suggest questions that still need to be explored.

To form a hypothesis, you should take these steps:

  • Collect as many observations about a topic or problem as you can.
  • Evaluate these observations and look for possible causes of the problem.
  • Create a list of possible explanations that you might want to explore.
  • After you have developed some possible hypotheses, think of ways that you could confirm or disprove each hypothesis through experimentation. This is known as falsifiability.

In the scientific method ,  falsifiability is an important part of any valid hypothesis.   In order to test a claim scientifically, it must be possible that the claim could be proven false.

Students sometimes confuse the idea of falsifiability with the idea that it means that something is false, which is not the case. What falsifiability means is that  if  something was false, then it is possible to demonstrate that it is false.

One of the hallmarks of pseudoscience is that it makes claims that cannot be refuted or proven false.

A variable is a factor or element that can be changed and manipulated in ways that are observable and measurable. However, the researcher must also define how the variable will be manipulated and measured in the study.

For example, a researcher might operationally define the variable " test anxiety " as the results of a self-report measure of anxiety experienced during an exam. A "study habits" variable might be defined by the amount of studying that actually occurs as measured by time.

These precise descriptions are important because many things can be measured in a number of different ways. One of the basic principles of any type of scientific research is that the results must be replicable.   By clearly detailing the specifics of how the variables were measured and manipulated, other researchers can better understand the results and repeat the study if needed.

Some variables are more difficult than others to define. How would you operationally define a variable such as aggression ? For obvious ethical reasons, researchers cannot create a situation in which a person behaves aggressively toward others.

In order to measure this variable, the researcher must devise a measurement that assesses aggressive behavior without harming other people. In this situation, the researcher might utilize a simulated task to measure aggressiveness.

Hypothesis Checklist

  • Does your hypothesis focus on something that you can actually test?
  • Does your hypothesis include both an independent and dependent variable?
  • Can you manipulate the variables?
  • Can your hypothesis be tested without violating ethical standards?

The hypothesis you use will depend on what you are investigating and hoping to find. Some of the main types of hypotheses that you might use include:

  • Simple hypothesis : This type of hypothesis suggests that there is a relationship between one independent variable and one dependent variable.
  • Complex hypothesis : This type of hypothesis suggests a relationship between three or more variables, such as two independent variables and a dependent variable.
  • Null hypothesis : This hypothesis suggests no relationship exists between two or more variables.
  • Alternative hypothesis : This hypothesis states the opposite of the null hypothesis.
  • Statistical hypothesis : This hypothesis uses statistical analysis to evaluate a representative sample of the population and then generalizes the findings to the larger group.
  • Logical hypothesis : This hypothesis assumes a relationship between variables without collecting data or evidence.

A hypothesis often follows a basic format of "If {this happens} then {this will happen}." One way to structure your hypothesis is to describe what will happen to the  dependent variable  if you change the  independent variable .

The basic format might be: "If {these changes are made to a certain independent variable}, then we will observe {a change in a specific dependent variable}."

A few examples of simple hypotheses:

  • "Students who eat breakfast will perform better on a math exam than students who do not eat breakfast."
  • "Students who experience test anxiety before an English exam will get lower scores than students who do not experience test anxiety."
  • "Motorists who talk on the phone while driving will be more likely to make errors on a driving course than those who do not talk on the phone."

Examples of a complex hypothesis include:

  • "People with high-sugar diets and sedentary activity levels are more likely to develop depression."
  • "Younger people who are regularly exposed to green, outdoor areas have better subjective well-being than older adults who have limited exposure to green spaces."

Examples of a null hypothesis include:

  • "Children who receive a new reading intervention will have scores different than students who do not receive the intervention."
  • "There will be no difference in scores on a memory recall task between children and adults."

Examples of an alternative hypothesis:

  • "Children who receive a new reading intervention will perform better than students who did not receive the intervention."
  • "Adults will perform better on a memory task than children." 

Collecting Data on Your Hypothesis

Once a researcher has formed a testable hypothesis, the next step is to select a research design and start collecting data. The research method depends largely on exactly what they are studying. There are two basic types of research methods: descriptive research and experimental research.

Descriptive Research Methods

Descriptive research such as  case studies ,  naturalistic observations , and surveys are often used when it would be impossible or difficult to  conduct an experiment . These methods are best used to describe different aspects of a behavior or psychological phenomenon.

Once a researcher has collected data using descriptive methods, a correlational study can then be used to look at how the variables are related. This type of research method might be used to investigate a hypothesis that is difficult to test experimentally.

Experimental Research Methods

Experimental methods  are used to demonstrate causal relationships between variables. In an experiment, the researcher systematically manipulates a variable of interest (known as the independent variable) and measures the effect on another variable (known as the dependent variable).

Unlike correlational studies, which can only be used to determine if there is a relationship between two variables, experimental methods can be used to determine the actual nature of the relationship—whether changes in one variable actually  cause  another to change.

A Word From Verywell

The hypothesis is a critical part of any scientific exploration. It represents what researchers expect to find in a study or experiment. In situations where the hypothesis is unsupported by the research, the research still has value. Such research helps us better understand how different aspects of the natural world relate to one another. It also helps us develop new hypotheses that can then be tested in the future.

Some examples of how to write a hypothesis include:

  • "Staying up late will lead to worse test performance the next day."
  • "People who consume one apple each day will visit the doctor fewer times each year."
  • "Breaking study sessions up into three 20-minute sessions will lead to better test results than a single 60-minute study session."

The four parts of a hypothesis are:

  • The research question
  • The independent variable (IV)
  • The dependent variable (DV)
  • The proposed relationship between the IV and DV


By Kendra Cherry, MSEd


4 Examples of Hypothesis Testing in Real Life

In statistics, hypothesis tests are used to test whether or not some hypothesis about a population parameter is true.

To perform a hypothesis test in the real world, researchers will obtain a random sample from the population and perform a hypothesis test on the sample data, using a null and alternative hypothesis:

  • Null Hypothesis (H 0 ): The sample data occurs purely from chance.
  • Alternative Hypothesis (H A ): The sample data is influenced by some non-random cause.

If the p-value of the hypothesis test is less than some significance level (e.g. α = .05), then we can reject the null hypothesis and conclude that we have sufficient evidence to say that the alternative hypothesis is true.

The following examples provide several situations where hypothesis tests are used in the real world.

Example 1: Biology

Hypothesis tests are often used in biology to determine whether some new treatment, fertilizer, pesticide, chemical, etc. causes increased growth, stamina, immunity, etc. in plants or animals.

For example, suppose a biologist believes that a certain fertilizer will cause plants to grow more during a one-month period than they normally do, which is currently 20 inches. To test this, she applies the fertilizer to each of the plants in her laboratory for one month.

She then performs a hypothesis test using the following hypotheses:

  • H 0 : μ = 20 inches (the fertilizer will have no effect on the mean plant growth)
  • H A : μ > 20 inches (the fertilizer will cause mean plant growth to increase)

If the p-value of the test is less than some significance level (e.g. α = .05), then she can reject the null hypothesis and conclude that the fertilizer leads to increased plant growth.
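A minimal R sketch of how such a test might be run; the growth measurements below are hypothetical values invented for illustration, not data from the example.

```r
# Hypothetical one-month growth measurements (inches) for the fertilized plants
growth <- c(21.3, 22.1, 19.8, 23.5, 20.9, 22.7, 21.8, 20.4, 23.1, 22.2)

# One-sample t test of H0: mu = 20 against HA: mu > 20
t.test(growth, mu = 20, alternative = "greater")
```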

Example 2: Clinical Trials

Hypothesis tests are often used in clinical trials to determine whether some new treatment, drug, procedure, etc. causes improved outcomes in patients.

For example, suppose a doctor believes that a new drug is able to reduce blood pressure in obese patients. To test this, he may measure the blood pressure of 40 patients before and after using the new drug for one month.

He then performs a hypothesis test using the following hypotheses:

  • H 0 : μ after = μ before (the mean blood pressure is the same before and after using the drug)
  • H A : μ after < μ before (the mean blood pressure is less after using the drug)

If the p-value of the test is less than some significance level (e.g. α = .05), then he can reject the null hypothesis and conclude that the new drug leads to reduced blood pressure.
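Because the same patients are measured before and after, a paired test is natural here. The sketch below is illustrative only and assumes hypothetical readings for eight patients.

```r
# Hypothetical systolic blood pressure (mmHg) for the same patients
before <- c(148, 152, 160, 145, 155, 150, 158, 149)
after  <- c(142, 150, 151, 140, 149, 147, 150, 146)

# Paired t test of H0: mu_after = mu_before against HA: mu_after < mu_before
t.test(after, before, paired = TRUE, alternative = "less")
```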

Example 3: Advertising Spend

Hypothesis tests are often used in business to determine whether or not some new advertising campaign, marketing technique, etc. causes increased sales.

For example, suppose a company believes that spending more money on digital advertising leads to increased sales. To test this, the company may increase money spent on digital advertising during a two-month period and collect data to see if overall sales have increased.

They may perform a hypothesis test using the following hypotheses:

  • H 0 : μ after = μ before (the mean sales is the same before and after spending more on advertising)
  • H A : μ after > μ before (the mean sales increased after spending more on advertising)

If the p-value of the test is less than some significance level (e.g. α = .05), then the company can reject the null hypothesis and conclude that increased digital advertising leads to increased sales.

Example 4: Manufacturing

Hypothesis tests are also used often in manufacturing plants to determine if some new process, technique, method, etc. causes a change in the number of defective products produced.

For example, suppose a certain manufacturing plant wants to test whether or not some new method changes the number of defective widgets produced per month, which is currently 250. To test this, they may measure the mean number of defective widgets produced before and after using the new method for one month.

They can then perform a hypothesis test using the following hypotheses:

  • H 0 : μ after = μ before (the mean number of defective widgets is the same before and after using the new method)
  • H A : μ after ≠ μ before (the mean number of defective widgets produced is different before and after using the new method)

If the p-value of the test is less than some significance level (e.g. α = .05), then the plant can reject the null hypothesis and conclude that the new method leads to a change in the number of defective widgets produced per month.

Additional Resources

  • Introduction to Hypothesis Testing
  • Introduction to the One Sample t-test
  • Introduction to the Two Sample t-test
  • Introduction to the Paired Samples t-test


Published by Zach




S.3 Hypothesis Testing

In reviewing hypothesis tests, we start first with the general idea. Then, we keep returning to the basic procedures of hypothesis testing, each time adding a little more detail.

The general idea of hypothesis testing involves:

  • Making an initial assumption.
  • Collecting evidence (data).
  • Based on the available evidence (data), deciding whether to reject or not reject the initial assumption.

Every hypothesis test — regardless of the population parameter involved — requires the above three steps.

Example S.3.1

Is Normal Body Temperature Really 98.6 Degrees F?

Consider the population of many, many adults. A researcher hypothesized that the average adult body temperature is lower than the often-advertised 98.6 degrees F. That is, the researcher wants an answer to the question: "Is the average adult body temperature 98.6 degrees? Or is it lower?" To answer his research question, the researcher starts by assuming that the average adult body temperature was 98.6 degrees F.

Then, the researcher went out and tried to find evidence that refutes his initial assumption. In doing so, he selects a random sample of 130 adults. The average body temperature of the 130 sampled adults is 98.25 degrees.

Then, the researcher uses the data he collected to make a decision about his initial assumption. It is either likely or unlikely that the researcher would collect the evidence he did given his initial assumption that the average adult body temperature is 98.6 degrees:

  • If it is likely, then the researcher does not reject his initial assumption that the average adult body temperature is 98.6 degrees. There is not enough evidence to do otherwise.
  • If it is unlikely, then either the researcher's initial assumption is correct and he experienced a very unusual event, or the researcher's initial assumption is incorrect.

In statistics, we generally don't make claims that require us to believe that a very unusual event happened. That is, in the practice of statistics, if the evidence (data) we collected is unlikely in light of the initial assumption, then we reject our initial assumption.

Example S.3.2

Criminal Trial Analogy

One place where you can consistently see the general idea of hypothesis testing in action is in criminal trials held in the United States. Our criminal justice system assumes "the defendant is innocent until proven guilty." That is, our initial assumption is that the defendant is innocent.

In the practice of statistics, we make our initial assumption when we state our two competing hypotheses -- the null hypothesis ( H 0 ) and the alternative hypothesis ( H A ). Here, our hypotheses are:

  • H 0 : Defendant is not guilty (innocent)
  • H A : Defendant is guilty

In statistics, we always assume the null hypothesis is true . That is, the null hypothesis is always our initial assumption.

The prosecution team then collects evidence — such as finger prints, blood spots, hair samples, carpet fibers, shoe prints, ransom notes, and handwriting samples — with the hopes of finding "sufficient evidence" to make the assumption of innocence refutable.

In statistics, the data are the evidence.

The jury then makes a decision based on the available evidence:

  • If the jury finds sufficient evidence — beyond a reasonable doubt — to make the assumption of innocence refutable, the jury rejects the null hypothesis and deems the defendant guilty. We behave as if the defendant is guilty.
  • If there is insufficient evidence, then the jury does not reject the null hypothesis . We behave as if the defendant is innocent.

In statistics, we always make one of two decisions. We either "reject the null hypothesis" or we "fail to reject the null hypothesis."

Errors in Hypothesis Testing

Did you notice the use of the phrase "behave as if" in the previous discussion? We "behave as if" the defendant is guilty; we do not "prove" that the defendant is guilty. And, we "behave as if" the defendant is innocent; we do not "prove" that the defendant is innocent.

This is a very important distinction! We make our decision based on evidence not on 100% guaranteed proof. Again:

  • If we reject the null hypothesis, we do not prove that the alternative hypothesis is true.
  • If we do not reject the null hypothesis, we do not prove that the null hypothesis is true.

We merely state that there is enough evidence to behave one way or the other. This is always true in statistics! Because of this, whatever the decision, there is always a chance that we made an error .

Let's review the two types of errors that can be made in criminal trials:

  • The jury convicts a defendant who is actually innocent.
  • The jury acquits a defendant who is actually guilty.

These correspond directly to the two types of errors in hypothesis testing: rejecting a null hypothesis that is true, and failing to reject a null hypothesis that is false.

Note that, in statistics, we call the two types of errors by two different names -- one is called a "Type I error," and the other is called a "Type II error." Here are the formal definitions of the two types of errors:

  • Type I error: The null hypothesis is rejected when it is, in fact, true.
  • Type II error: The null hypothesis is not rejected when it is, in fact, false.

There is always a chance of making one of these errors. But, a good scientific study will minimize the chance of doing so!

Making the Decision

Recall that it is either likely or unlikely that we would observe the evidence we did given our initial assumption. If it is likely , we do not reject the null hypothesis. If it is unlikely , then we reject the null hypothesis in favor of the alternative hypothesis. Effectively, then, making the decision reduces to determining "likely" or "unlikely."

In statistics, there are two ways to determine whether the evidence is likely or unlikely given the initial assumption:

  • We could take the " critical value approach " (favored in many of the older textbooks).
  • Or, we could take the " P -value approach " (what is used most often in research, journal articles, and statistical software).

In the next two sections, we review the procedures behind each of these two approaches. To make our review concrete, let's imagine that \(\mu\) is the average grade point average of all American students who major in mathematics. We first review the critical value approach for conducting each of the following three hypothesis tests about the population mean \(\mu\):

  • \(H_{0}: \mu = 3\) versus \(H_{A}: \mu > 3\) (right-tailed)
  • \(H_{0}: \mu = 3\) versus \(H_{A}: \mu < 3\) (left-tailed)
  • \(H_{0}: \mu = 3\) versus \(H_{A}: \mu \neq 3\) (two-tailed)

In Practice

  • We would want to conduct the first hypothesis test if we were interested in concluding that the average grade point average of the group is more than 3.
  • We would want to conduct the second hypothesis test if we were interested in concluding that the average grade point average of the group is less than 3.
  • And, we would want to conduct the third hypothesis test if we were only interested in concluding that the average grade point average of the group differs from 3 (without caring whether it is more or less than 3).

Upon completing the review of the critical value approach, we review the P -value approach for conducting each of the above three hypothesis tests about the population mean \(\mu\). The procedures that we review here for both approaches easily extend to hypothesis tests about any other population parameter.
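As a concrete illustration of the two approaches for the grade point average example, here is a minimal R sketch using hypothetical GPA values; the critical value comparison and the p-value comparison lead to the same decision.

```r
# Hypothetical GPAs of sampled mathematics majors
gpa   <- c(3.15, 2.90, 3.40, 3.05, 3.25, 2.80, 3.10, 3.35, 3.00, 3.20)
n     <- length(gpa)
alpha <- 0.05
t_stat <- (mean(gpa) - 3) / (sd(gpa) / sqrt(n))

# Critical value approach for H0: mu = 3 versus HA: mu > 3
t_stat > qt(1 - alpha, df = n - 1)   # reject H0 if TRUE

# P-value approach
p_val <- 1 - pt(t_stat, df = n - 1)
p_val < alpha                        # reject H0 if TRUE

# The built-in test reports the same statistic and p-value
t.test(gpa, mu = 3, alternative = "greater")
```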


What is Hypothesis Testing in Statistics? Types and Examples

By Varun Saharawat | January 6, 2024

Hypothesis testing in statistics involves testing an assumption about a population parameter using sample data.


What exactly is hypothesis testing, and how does it work in statistics? Can I find practical examples and understand the different types from this blog?

Hypothesis Testing : Ever wonder how researchers determine if a new medicine actually works or if a new marketing campaign effectively drives sales? They use hypothesis testing! It is at the core of how scientific studies, business experiments and surveys determine if their results are statistically significant or just due to chance.

Hypothesis testing allows us to make evidence-based decisions by quantifying uncertainty and providing a structured process to make data-driven conclusions rather than guessing. In this post, we will discuss hypothesis testing types, examples, and processes!


Hypothesis Testing

Hypothesis testing is a statistical method used to evaluate the validity of a hypothesis using sample data. It involves assessing whether observed data provide enough evidence to reject a specific hypothesis about a population parameter. 

Hypothesis Testing in Data Science

Hypothesis testing in data science is a statistical method used to evaluate two mutually exclusive population statements based on sample data. The primary goal is to determine which statement is more supported by the observed data.

Hypothesis testing helps establish confidence in the findings of research and data science projects. This form of statistical inference aids in making decisions about population parameters using sample data.


What is the Hypothesis Testing Procedure in Data Science?

The hypothesis testing procedure in data science involves a structured approach to evaluating hypotheses using statistical methods. Here's a step-by-step breakdown of the typical procedure (a worked R sketch follows these steps):

1) State the Hypotheses:

  • Null Hypothesis (H0): This is the default assumption or a statement of no effect or difference. It represents what you aim to test against.
  • Alternative Hypothesis (Ha): This is the opposite of the null hypothesis and represents what you want to prove.

2) Choose a Significance Level (α):

  • Decide on a threshold (commonly 0.05) beyond which you will reject the null hypothesis. This is your significance level.

3) Select the Appropriate Test:

  • Depending on your data type (e.g., continuous, categorical) and the nature of your research question, choose the appropriate statistical test (e.g., t-test, chi-square test, ANOVA, etc.).

4) Collect Data:

  • Gather data from your sample or population, ensuring that it’s representative and sufficiently large (or as per your experimental design).

5) Compute the Test Statistic:

  • Using your data and the chosen statistical test, compute the test statistic that summarizes the evidence against the null hypothesis.

6) Determine the Critical Value or P-value:

  • Based on your significance level and the test statistic’s distribution, determine the critical value from a statistical table or compute the p-value.

7) Make a Decision:

  • If the p-value is less than α: Reject the null hypothesis.
  • If the p-value is greater than or equal to α: Fail to reject the null hypothesis.

8) Draw Conclusions:

  • Based on your decision, draw conclusions about your research question or hypothesis. Remember, failing to reject the null hypothesis doesn’t prove it true; it merely suggests that you don’t have sufficient evidence to reject it.

9) Report Findings:

  • Document your findings, including the test statistic, p-value, conclusion, and any other relevant details. Ensure clarity so that others can understand and potentially replicate your analysis.
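As a worked illustration of these steps, the sketch below runs the whole procedure in R on a hypothetical scenario (a machine that is supposed to fill bags to a mean weight of 500 g); all values are invented for illustration.

```r
# Hypothetical sample of bag weights (grams)
weights <- c(497, 502, 499, 495, 503, 498, 500, 496, 501, 494)

# Steps 1-2: hypotheses and significance level
#   H0: mu = 500   Ha: mu != 500   alpha = 0.05
alpha <- 0.05

# Steps 3-5: a one-sample t test is appropriate (small sample, unknown sigma)
result <- t.test(weights, mu = 500, alternative = "two.sided")

# Steps 6-7: compare the p-value with alpha and decide
if (result$p.value < alpha) {
  decision <- "Reject H0"
} else {
  decision <- "Fail to reject H0"
}

# Steps 8-9: report the test statistic, p-value, and conclusion
result
decision
```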


How Hypothesis Testing Works?

Hypothesis testing is a fundamental concept in statistics that aids analysts in making informed decisions based on sample data about a larger population. The process involves setting up two contrasting hypotheses, the null hypothesis and the alternative hypothesis, and then using statistical methods to determine which hypothesis provides a more plausible explanation for the observed data.

The Core Principles:

  • The Null Hypothesis (H0): This serves as the default assumption or status quo. Typically, it posits that there is no effect or no difference, often represented by an equality statement regarding population parameters. For instance, it might state that a new drug’s effect is no different from a placebo.
  • The Alternative Hypothesis (H1 or Ha): This is the counter assumption or what researchers aim to prove. It’s the opposite of the null hypothesis, indicating that there is an effect, a change, or a difference in the population parameters. Using the drug example, the alternative hypothesis would suggest that the new drug has a different effect than the placebo.

Testing the Hypotheses:

Once these hypotheses are established, analysts gather data from a sample and conduct statistical tests. The objective is to determine whether the observed results are statistically significant enough to reject the null hypothesis in favor of the alternative.

Examples to Clarify the Concept:

  • Hand sanitizer efficacy. Null Hypothesis (H0): The sanitizer's average efficacy is 95%. If testing provides evidence that the efficacy is significantly less than 95%, we reject the null hypothesis.
  • Coin fairness. Null Hypothesis (H0): The coin is fair, meaning the probability of heads and tails is equal. If repeated trials consistently show a skewed outcome, indicating a significantly different probability for heads and tails, the null hypothesis might be rejected.

What are the 3 Types of Hypothesis Tests?

Hypothesis testing is a cornerstone in statistical analysis, providing a framework to evaluate the validity of assumptions or claims made about a population based on sample data. Within this framework, several specific tests are utilized based on the nature of the data and the question at hand. Here’s a closer look at the three fundamental types of hypothesis tests:

1. Z-Test:

The z-test is a statistical method primarily employed when comparing means from two datasets, particularly when the population standard deviation is known. Its main objective is to ascertain if the means are statistically equivalent.

A crucial prerequisite for the z-test is that the sample size should be relatively large, typically 30 data points or more. This test aids researchers and analysts in determining the significance of a relationship or discovery, especially in scenarios where the data’s characteristics align with the assumptions of the z-test.

2. T-Test:

The t-test is a versatile statistical tool used extensively in research and various fields to compare means between two groups. It's particularly valuable when the population standard deviation is unknown or when dealing with smaller sample sizes.

By evaluating the means of two groups, the t-test helps ascertain if a particular treatment, intervention, or variable significantly impacts the population under study. Its flexibility and robustness make it a go-to method in scenarios ranging from medical research to business analytics.

3. Chi-Square Test:

The Chi-Square test stands distinct from the previous tests, primarily focusing on categorical data rather than means. This statistical test is instrumental when analyzing categorical variables to determine if observed data aligns with expected outcomes as posited by the null hypothesis. 

By assessing the differences between observed and expected frequencies within categorical data, the Chi-Square test offers insights into whether discrepancies are statistically significant. Whether used in social sciences to evaluate survey responses or in quality control to assess product defects, the Chi-Square test remains pivotal for hypothesis testing in diverse scenarios.


Hypothesis Testing in Statistics

Hypothesis testing is a fundamental concept in statistics used to make decisions or inferences about a population based on a sample of data. The process involves setting up two competing hypotheses, the null hypothesis H0 and the alternative hypothesis H1.

Through various statistical tests, such as the t-test, z-test, or Chi-square test, analysts evaluate sample data to determine whether there’s enough evidence to reject the null hypothesis in favor of the alternative. The aim is to draw conclusions about population parameters or to test theories, claims, or hypotheses.

Hypothesis Testing in Research

In research, hypothesis testing serves as a structured approach to validate or refute theories or claims. Researchers formulate a clear hypothesis based on existing literature or preliminary observations. They then collect data through experiments, surveys, or observational studies. 

Using statistical methods, researchers analyze this data to determine if there’s sufficient evidence to reject the null hypothesis. By doing so, they can draw meaningful conclusions, make predictions, or recommend actions based on empirical evidence rather than mere speculation.

Hypothesis Testing in R

R, a powerful programming language and environment for statistical computing and graphics, offers a wide array of functions and packages specifically designed for hypothesis testing. Here’s how hypothesis testing is conducted in R:

  • Data Collection : Before conducting any test, you need to gather your data and ensure it’s appropriately structured in R.
  • Choose the Right Test : Depending on your research question and data type, select the appropriate hypothesis test. For instance, use the t.test() function for a t-test or chisq.test() for a Chi-square test.
  • Set Hypotheses : Define your null and alternative hypotheses. Using R’s syntax, you can specify these hypotheses and run the corresponding test.
  • Execute the Test : Utilize built-in functions in R to perform the hypothesis test on your data. For instance, if you want to compare two means, you can use the t.test() function, providing the necessary arguments like the data vectors and type of t-test (one-sample, two-sample, paired, etc.).
  • Interpret Results : Once the test is executed, R will provide output, including test statistics, p-values, and confidence intervals. Based on these results and a predetermined significance level (often 0.05), you can decide whether to reject the null hypothesis.
  • Visualization : R’s graphical capabilities allow users to visualize data distributions, confidence intervals, or test statistics, aiding in the interpretation and presentation of results.

Hypothesis testing is an integral part of statistics and research, offering a systematic approach to validate hypotheses. Leveraging R’s capabilities, researchers and analysts can efficiently conduct and interpret various hypothesis tests, ensuring robust and reliable conclusions from their data.
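A minimal sketch of this workflow, using a hypothetical two-group comparison and a hypothetical 2x2 contingency table (all values are illustrative):

```r
# Hypothetical exam scores under two teaching methods
method_a <- c(78, 85, 69, 90, 82, 75, 88, 80)
method_b <- c(72, 79, 65, 84, 70, 68, 77, 74)

# Two-sample t test of H0: the two methods have equal mean scores
t.test(method_a, method_b, alternative = "two.sided")

# Chi-square test of independence on a hypothetical 2x2 table of counts
tab <- matrix(c(30, 20, 25, 35), nrow = 2,
              dimnames = list(Group = c("A", "B"), Outcome = c("Yes", "No")))
chisq.test(tab)
```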

Do Data Scientists do Hypothesis Testing?

Yes, data scientists frequently engage in hypothesis testing as part of their analytical toolkit. Hypothesis testing is a foundational statistical technique used to make data-driven decisions, validate assumptions, and draw conclusions from data. Here’s how data scientists utilize hypothesis testing:

  • Validating Assumptions : Before diving into complex analyses or building predictive models, data scientists often need to verify certain assumptions about the data. Hypothesis testing provides a structured approach to test these assumptions, ensuring that subsequent analyses or models are valid.
  • Feature Selection : In machine learning and predictive modeling, data scientists use hypothesis tests to determine which features (or variables) are most relevant or significant in predicting a particular outcome. By testing hypotheses related to feature importance or correlation, they can streamline the modeling process and enhance prediction accuracy.
  • A/B Testing : A/B testing is a common technique in marketing, product development, and user experience design. Data scientists employ hypothesis testing to compare two versions (A and B) of a product, feature, or marketing strategy to determine which performs better in terms of a specified metric (e.g., conversion rate, user engagement).
  • Research and Exploration : In exploratory data analysis (EDA) or when investigating specific research questions, data scientists formulate hypotheses to test certain relationships or patterns within the data. By conducting hypothesis tests, they can validate these relationships, uncover insights, and drive data-driven decision-making.
  • Model Evaluation : After building machine learning or statistical models, data scientists use hypothesis testing to evaluate the model’s performance, assess its predictive power, or compare different models. For instance, hypothesis tests like the t-test or F-test can help determine if a new model significantly outperforms an existing one based on certain metrics.
  • Business Decision-making : Beyond technical analyses, data scientists employ hypothesis testing to support business decisions. Whether it’s evaluating the effectiveness of a marketing campaign, assessing customer preferences, or optimizing operational processes, hypothesis testing provides a rigorous framework to validate assumptions and guide strategic initiatives.

Hypothesis Testing Examples and Solutions

Let’s delve into some common examples of hypothesis testing and provide solutions or interpretations for each scenario.

Example: Testing the Mean

Scenario : A coffee shop owner believes that the average waiting time for customers during peak hours is 5 minutes. To test this, the owner takes a random sample of 30 customer waiting times and wants to determine if the average waiting time is indeed 5 minutes.

Hypotheses :

  • H0 (Null Hypothesis): μ = 5 minutes (The average waiting time is 5 minutes)
  • H1 (Alternative Hypothesis): μ ≠ 5 minutes (The average waiting time is not 5 minutes)

Solution : Using a t-test (assuming population variance is unknown), calculate the t-statistic based on the sample mean, sample standard deviation, and sample size. Then, determine the p-value and compare it with a significance level (e.g., 0.05) to decide whether to reject the null hypothesis.
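A minimal R sketch of this solution, assuming hypothetical waiting times for the 30 sampled customers (the values below are illustrative):

```r
# Hypothetical waiting times (minutes) for 30 customers
wait <- c(5.2, 4.8, 6.1, 5.5, 4.9, 5.7, 6.3, 4.6, 5.1, 5.9,
          5.4, 4.7, 6.0, 5.8, 5.0, 5.3, 6.2, 4.5, 5.6, 5.2,
          4.9, 5.7, 5.1, 6.4, 5.5, 4.8, 5.9, 5.0, 5.3, 5.6)

# Two-sided one-sample t test of H0: mu = 5 minutes
t.test(wait, mu = 5, alternative = "two.sided")
```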

Example: A/B Testing in Marketing

Scenario : An e-commerce company wants to determine if changing the color of a “Buy Now” button from blue to green increases the conversion rate.

  • H0: Changing the button color does not affect the conversion rate.
  • H1: Changing the button color affects the conversion rate.

Solution : Split website visitors into two groups: one sees the blue button (control group), and the other sees the green button (test group). Track the conversion rates for both groups over a specified period. Then, use a chi-square test or z-test (for large sample sizes) to determine if there’s a statistically significant difference in conversion rates between the two groups.
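One common way to analyze such an A/B test is a two-sample proportion test; the R sketch below assumes hypothetical conversion counts and visitor totals (all numbers are illustrative).

```r
# Hypothetical results: blue button converts 120 of 2400 visitors,
# green button converts 150 of 2350 visitors
conversions <- c(120, 150)
visitors    <- c(2400, 2350)

# Two-sample proportion test (internally based on a chi-square statistic)
prop.test(conversions, visitors, alternative = "two.sided")
```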

Hypothesis Testing Formula

The formula for hypothesis testing typically depends on the type of test (e.g., z-test, t-test, chi-square test) and the nature of the data (e.g., mean, proportion, variance). Below are the basic formulas for some common hypothesis tests:

Z-Test for Population Mean:

\(Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\)

  • \(\bar{x}\) = Sample mean
  • \(\mu_0\) = Population mean under the null hypothesis
  • \(\sigma\) = Population standard deviation
  • n = Sample size

T-Test for Population Mean:

\(t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\)

  • s = Sample standard deviation

Chi-Square Test for Goodness of Fit:

\(\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\)

  • \(O_i\) = Observed frequency
  • \(E_i\) = Expected frequency

A short R sketch evaluating each of these formulas follows.
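The sketch reuses, for the z and t statistics, summary figures that appear in the worked examples elsewhere in this document; the chi-square counts are hypothetical.

```r
# Z statistic: xbar = 112.5, mu0 = 100, sigma = 15, n = 30
(112.5 - 100) / (15 / sqrt(30))       # about 4.56

# t statistic: xbar = 4.75, mu0 = 4.5, s = 2.0, n = 15
(4.75 - 4.5) / (2.0 / sqrt(15))       # about 0.48

# Chi-square goodness of fit with hypothetical observed and expected counts
observed <- c(50, 30, 20)
expected <- c(40, 35, 25)
sum((observed - expected)^2 / expected)              # chi-square statistic by hand
chisq.test(observed, p = expected / sum(expected))   # equivalent built-in test
```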


Hypothesis Testing Calculator

While you can perform hypothesis testing manually using the above formulas and statistical tables, many online tools and software packages simplify this process. Here’s how you might use a calculator or software:

  • Z-Test and T-Test Calculators : These tools typically require you to input sample statistics (like sample mean, population mean, standard deviation, and sample size). Once you input these values, the calculator will provide you with the test statistic (Z or t) and a p-value.
  • Chi-Square Calculator : For chi-square tests, you’d input observed and expected frequencies for different categories or groups. The calculator then computes the chi-square statistic and provides a p-value.
  • Software Packages (e.g., R, Python with libraries like scipy, or statistical software like SPSS) : These platforms offer more comprehensive tools for hypothesis testing. You can run various tests, get detailed outputs, and even perform advanced analyses, including regression models, ANOVA, and more.

When using any calculator or software, always ensure you understand the underlying assumptions of the test, interpret the results correctly, and consider the broader context of your research or analysis.

Hypothesis Testing FAQs

What are the Key Components of a Hypothesis Test?

The key components include:

  • Null Hypothesis (H0): A statement of no effect or no difference.
  • Alternative Hypothesis (H1 or Ha): A statement that contradicts the null hypothesis.
  • Test Statistic: A value computed from the sample data to test the null hypothesis.
  • Significance Level (α): The threshold for rejecting the null hypothesis.
  • P-value: The probability of observing the given data, assuming the null hypothesis is true.

What is the significance level in hypothesis testing?

The significance level (often denoted as α) is the probability threshold used to determine whether to reject the null hypothesis. Commonly used values for α include 0.05, 0.01, and 0.10, representing a 5%, 1%, or 10% chance of rejecting the null hypothesis when it's actually true.

How do I choose between a one-tailed and two-tailed test?

The choice between one-tailed and two-tailed tests depends on your research question and hypothesis. Use a one-tailed test when you're specifically interested in one direction of an effect (e.g., greater than or less than). Use a two-tailed test when you want to determine if there's a significant difference in either direction.

What is a p-value, and how is it interpreted?

The p-value is a probability value that helps determine the strength of evidence against the null hypothesis. A low p-value (typically ≤ 0.05) suggests that the observed data is inconsistent with the null hypothesis, leading to its rejection. Conversely, a high p-value suggests that the data is consistent with the null hypothesis, leading to no rejection.

Can hypothesis testing prove a hypothesis true?

No, hypothesis testing cannot prove a hypothesis true. Instead, it helps assess the likelihood of observing a given set of data under the assumption that the null hypothesis is true. Based on this assessment, you either reject or fail to reject the null hypothesis.



Unit 12: Significance tests (hypothesis testing)

About this unit

The idea of significance tests

  • Simple hypothesis testing
  • Idea behind hypothesis testing
  • Examples of null and alternative hypotheses
  • P-values and significance tests
  • Comparing P-values to different significance levels
  • Estimating a P-value from a simulation
  • Using P-values to make conclusions
  • Writing null and alternative hypotheses
  • Estimating P-values from simulations

Error probabilities and power

  • Introduction to Type I and Type II errors
  • Type 1 errors
  • Examples identifying Type I and Type II errors
  • Introduction to power in significance tests
  • Examples thinking about power in significance tests
  • Consequences of errors and significance
  • Type I vs Type II error
  • Error probabilities and power

Tests about a population proportion

  • Constructing hypotheses for a significance test about a proportion
  • Conditions for a z test about a proportion
  • Reference: Conditions for inference on a proportion
  • Calculating a z statistic in a test about a proportion
  • Calculating a P-value given a z statistic
  • Making conclusions in a test about a proportion
  • Writing hypotheses for a test about a proportion
  • Calculating the test statistic in a z test for a proportion
  • Calculating the P-value in a z test for a proportion
  • Making conclusions in a z test for a proportion

Tests about a population mean

  • Writing hypotheses for a significance test about a mean
  • Conditions for a t test about a mean
  • Reference: Conditions for inference on a mean
  • When to use z or t statistics in significance tests
  • Example calculating t statistic for a test about a mean
  • Using TI calculator for P-value from t statistic
  • Using a table to estimate P-value from t statistic
  • Comparing P-value from t statistic to significance level
  • Free response example: Significance test for a mean
  • Writing hypotheses for a test about a mean
  • Calculating the test statistic in a t test for a mean
  • Calculating the P-value in a t test for a mean
  • Making conclusions in a t test for a mean

More significance testing videos

  • Hypothesis testing and p-values
  • One-tailed and two-tailed tests
  • Z-statistics vs. T-statistics
  • Small sample hypothesis test
  • Large sample proportion hypothesis testing


9.E: Hypothesis Testing with One Sample (Exercises)


These are homework exercises to accompany the Textmap created for "Introductory Statistics" by OpenStax.

9.1: Introduction

9.2: Null and Alternative Hypotheses

Some of the following statements refer to the null hypothesis, some to the alternate hypothesis.

State the null hypothesis, \(H_{0}\), and the alternative hypothesis, \(H_{a}\), in terms of the appropriate parameter (\(\mu\) or \(p\)).

  • The mean number of years Americans work before retiring is 34.
  • At most 60% of Americans vote in presidential elections.
  • The mean starting salary for San Jose State University graduates is at least $100,000 per year.
  • Twenty-nine percent of high school seniors get drunk each month.
  • Fewer than 5% of adults ride the bus to work in Los Angeles.
  • The mean number of cars a person owns in her lifetime is not more than ten.
  • About half of Americans prefer to live away from cities, given the choice.
  • Europeans have a mean paid vacation each year of six weeks.
  • The chance of developing breast cancer is under 11% for women.
  • Private universities' mean tuition cost is more than $20,000 per year.
Answer:

  • \(H_{0}: \mu = 34; H_{a}: \mu \neq 34\)
  • \(H_{0}: p \leq 0.60; H_{a}: p > 0.60\)
  • \(H_{0}: \mu \geq 100,000; H_{a}: \mu < 100,000\)
  • \(H_{0}: p = 0.29; H_{a}: p \neq 0.29\)
  • \(H_{0}: p = 0.05; H_{a}: p < 0.05\)
  • \(H_{0}: \mu \leq 10; H_{a}: \mu > 10\)
  • \(H_{0}: p = 0.50; H_{a}: p \neq 0.50\)
  • \(H_{0}: \mu = 6; H_{a}: \mu \neq 6\)
  • \(H_{0}: p ≥ 0.11; H_{a}: p < 0.11\)
  • \(H_{0}: \mu \leq 20,000; H_{a}: \mu > 20,000\)

Over the past few decades, public health officials have examined the link between weight concerns and teen girls' smoking. Researchers surveyed a group of 273 randomly selected teen girls living in Massachusetts (between 12 and 15 years old). After four years the girls were surveyed again. Sixty-three said they smoked to stay thin. Is there good evidence that more than thirty percent of the teen girls smoke to stay thin? The alternative hypothesis is:

  • \(p < 0.30\)
  • \(p \leq 0.30\)
  • \(p \geq 0.30\)
  • \(p > 0.30\)

A statistics instructor believes that fewer than 20% of Evergreen Valley College (EVC) students attended the opening night midnight showing of the latest Harry Potter movie. She surveys 84 of her students and finds that 11 attended the midnight showing. An appropriate alternative hypothesis is:

  • \(p = 0.20\)
  • \(p > 0.20\)
  • \(p < 0.20\)
  • \(p \leq 0.20\)

Previously, an organization reported that teenagers spent 4.5 hours per week, on average, on the phone. The organization thinks that, currently, the mean is higher. Fifteen randomly chosen teenagers were asked how many hours per week they spend on the phone. The sample mean was 4.75 hours with a sample standard deviation of 2.0. Conduct a hypothesis test. The null and alternative hypotheses are:

  • \(H_{0}: \bar{x} = 4.5, H_{a}: \bar{x} > 4.5\)
  • \(H_{0}: \mu \geq 4.5, H_{a}: \mu < 4.5\)
  • \(H_{0}: \mu = 4.75, H_{a}: \mu > 4.75\)
  • \(H_{0}: \mu = 4.5, H_{a}: \mu > 4.5\)

9.3: Outcomes and the Type I and Type II Errors

State the Type I and Type II errors in complete sentences given the following statements.

  • The mean number of cars a person owns in his or her lifetime is not more than ten.
  • Private universities mean tuition cost is more than $20,000 per year.
Answer:

  • Type I error: We conclude that the mean is not 34 years, when it really is 34 years. Type II error: We conclude that the mean is 34 years, when in fact it really is not 34 years.
  • Type I error: We conclude that more than 60% of Americans vote in presidential elections, when the actual percentage is at most 60%.Type II error: We conclude that at most 60% of Americans vote in presidential elections when, in fact, more than 60% do.
  • Type I error: We conclude that the mean starting salary is less than $100,000, when it really is at least $100,000. Type II error: We conclude that the mean starting salary is at least $100,000 when, in fact, it is less than $100,000.
  • Type I error: We conclude that the proportion of high school seniors who get drunk each month is not 29%, when it really is 29%. Type II error: We conclude that the proportion of high school seniors who get drunk each month is 29% when, in fact, it is not 29%.
  • Type I error: We conclude that fewer than 5% of adults ride the bus to work in Los Angeles, when the percentage that do is really 5% or more. Type II error: We conclude that 5% or more adults ride the bus to work in Los Angeles when, in fact, fewer that 5% do.
  • Type I error: We conclude that the mean number of cars a person owns in his or her lifetime is more than 10, when in reality it is not more than 10. Type II error: We conclude that the mean number of cars a person owns in his or her lifetime is not more than 10 when, in fact, it is more than 10.
  • Type I error: We conclude that the proportion of Americans who prefer to live away from cities is not about half, though the actual proportion is about half. Type II error: We conclude that the proportion of Americans who prefer to live away from cities is half when, in fact, it is not half.
  • Type I error: We conclude that the duration of paid vacations each year for Europeans is not six weeks, when in fact it is six weeks. Type II error: We conclude that the duration of paid vacations each year for Europeans is six weeks when, in fact, it is not.
  • Type I error: We conclude that the proportion is less than 11%, when it is really at least 11%. Type II error: We conclude that the proportion of women who develop breast cancer is at least 11%, when in fact it is less than 11%.
  • Type I error: We conclude that the average tuition cost at private universities is more than $20,000, though in reality it is at most $20,000. Type II error: We conclude that the average tuition cost at private universities is at most $20,000 when, in fact, it is more than $20,000.

For statements a-j in Exercise 9.109 , answer the following in complete sentences.

  • State a consequence of committing a Type I error.
  • State a consequence of committing a Type II error.

When a new drug is created, the pharmaceutical company must subject it to testing before receiving the necessary permission from the Food and Drug Administration (FDA) to market the drug. Suppose the null hypothesis is “the drug is unsafe.” What is the Type II Error?

  • To conclude the drug is safe when in, fact, it is unsafe.
  • Not to conclude the drug is safe when, in fact, it is safe.
  • To conclude the drug is safe when, in fact, it is safe.
  • Not to conclude the drug is unsafe when, in fact, it is unsafe.

A statistics instructor believes that fewer than 20% of Evergreen Valley College (EVC) students attended the opening midnight showing of the latest Harry Potter movie. She surveys 84 of her students and finds that 11 of them attended the midnight showing. The Type I error is to conclude that the percent of EVC students who attended is ________.

  • at least 20%, when in fact, it is less than 20%.
  • 20%, when in fact, it is 20%.
  • less than 20%, when in fact, it is at least 20%.
  • less than 20%, when in fact, it is less than 20%.

It is believed that Lake Tahoe Community College (LTCC) Intermediate Algebra students get less than seven hours of sleep per night, on average. A survey of 22 LTCC Intermediate Algebra students generated a mean of 7.24 hours with a standard deviation of 1.93 hours. At a level of significance of 5%, do LTCC Intermediate Algebra students get less than seven hours of sleep per night, on average?

The Type II error is not to reject that the mean number of hours of sleep LTCC students get per night is at least seven when, in fact, the mean number of hours

  • is more than seven hours.
  • is at most seven hours.
  • is at least seven hours.
  • is less than seven hours.

Previously, an organization reported that teenagers spent 4.5 hours per week, on average, on the phone. The organization thinks that, currently, the mean is higher. Fifteen randomly chosen teenagers were asked how many hours per week they spend on the phone. The sample mean was 4.75 hours with a sample standard deviation of 2.0. Conduct a hypothesis test, the Type I error is:

  • to conclude that the current mean hours per week is higher than 4.5, when in fact, it is higher
  • to conclude that the current mean hours per week is higher than 4.5, when in fact, it is the same
  • to conclude that the mean hours per week currently is 4.5, when in fact, it is higher
  • to conclude that the mean hours per week currently is no higher than 4.5, when in fact, it is not higher

9.4: Distribution Needed for Hypothesis Testing

It is believed that Lake Tahoe Community College (LTCC) Intermediate Algebra students get less than seven hours of sleep per night, on average. A survey of 22 LTCC Intermediate Algebra students generated a mean of 7.24 hours with a standard deviation of 1.93 hours. At a level of significance of 5%, do LTCC Intermediate Algebra students get less than seven hours of sleep per night, on average? The distribution to be used for this test is \(\bar{X} \sim\) ________________

  • \(N\left(7.24, \frac{1.93}{\sqrt{22}}\right)\)
  • \(N\left(7.24, 1.93\right)\)

9.5: Rare Events, the Sample, Decision and Conclusion

The National Institute of Mental Health published an article stating that in any one-year period, approximately 9.5 percent of American adults suffer from depression or a depressive illness. Suppose that in a survey of 100 people in a certain town, seven of them suffered from depression or a depressive illness. Conduct a hypothesis test to determine if the true proportion of people in that town suffering from depression or a depressive illness is lower than the percent in the general adult American population.

  • Is this a test of one mean or proportion?
  • State the null and alternative hypotheses. \(H_{0}\) : ____________________ \(H_{a}\) : ____________________
  • Is this a right-tailed, left-tailed, or two-tailed test?
  • What symbol represents the random variable for this test?
  • In words, define the random variable for this test.
  • \(x =\) ________________
  • \(n =\) ________________
  • \(p′ =\) _____________
  • Calculate \(\sigma_{x} =\) __________. Show the formula set-up.
  • State the distribution to use for the hypothesis test.
  • Find the \(p\text{-value}\).
  • Reason for the decision:
  • Conclusion (write out in a complete sentence):
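The fill-in steps above can be cross-checked with software. Below is a minimal Python sketch (assuming scipy is available; the variable names are illustrative and not part of the original solution sheet) of the left-tailed one-proportion z-test that these steps describe:

    # One-proportion z-test for the depression survey: H_0: p = 0.095 vs H_a: p < 0.095.
    from scipy.stats import norm

    p0 = 0.095   # proportion reported for the general adult population
    n = 100      # people surveyed in the town
    x = 7        # number who suffered from depression or a depressive illness

    p_hat = x / n                      # sample proportion p'
    se = (p0 * (1 - p0) / n) ** 0.5    # sigma_{p'} computed under the null hypothesis
    z = (p_hat - p0) / se              # test statistic
    p_value = norm.cdf(z)              # left-tailed test, so take the lower tail

    print(f"p' = {p_hat:.3f}, sigma = {se:.4f}, z = {z:.3f}, p-value = {p_value:.4f}")

Comparing the resulting p-value with the chosen significance level gives the decision and conclusion asked for in the last two steps.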

9.6: Additional Information and Full Hypothesis Test Examples

For each of the word problems, use a solution sheet to do the hypothesis test. The solution sheet is found in [link] . Please feel free to make copies of the solution sheets. For the online version of the book, it is suggested that you copy the .doc or the .pdf files.

If you are using a Student's \(t\)-distribution for one of the following homework problems, you may assume that the underlying population is normally distributed. (In general, you must first prove that assumption, however.)

A particular brand of tires claims that its deluxe tire averages at least 50,000 miles before it needs to be replaced. From past studies of this tire, the standard deviation is known to be 8,000. A survey of owners of that tire design is conducted. From the 28 tires surveyed, the mean lifespan was 46,500 miles with a standard deviation of 9,800 miles. Using \(\alpha = 0.05\), is the data highly inconsistent with the claim?

  • \(H_{0}: \mu \geq 50,000\)
  • \(H_{a}: \mu < 50,000\)
  • Let \(\bar{X} =\) the average lifespan of a brand of tires.
  • normal distribution
  • \(z = -2.315\)
  • \(p\text{-value} = 0.0103\)
  • Check student’s solution.
  • alpha: 0.05
  • Decision: Reject the null hypothesis.
  • Reason for decision: The \(p\text{-value}\) is less than 0.05.
  • Conclusion: There is sufficient evidence to conclude that the mean lifespan of the tires is less than 50,000 miles.
  • \((43,537, 49,463)\)
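The numbers in this answer can be reproduced with a short Python sketch (scipy assumed available; not part of the original answer key). Because the population standard deviation is treated as known, the test statistic is a z-value rather than a t-value:

    # Left-tailed z-test for H_0: mu >= 50,000 vs H_a: mu < 50,000 (sigma known).
    from math import sqrt
    from scipy.stats import norm

    mu0 = 50_000      # claimed mean lifespan in miles
    sigma = 8_000     # known population standard deviation
    n = 28            # tires surveyed
    xbar = 46_500     # sample mean lifespan

    z = (xbar - mu0) / (sigma / sqrt(n))
    p_value = norm.cdf(z)                      # lower tail

    # 95% confidence interval for the mean, again using the known sigma
    margin = norm.ppf(0.975) * sigma / sqrt(n)
    print(f"z = {z:.3f}, p-value = {p_value:.4f}, "
          f"95% CI = ({xbar - margin:.0f}, {xbar + margin:.0f})")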

From generation to generation, the mean age when smokers first start to smoke varies. However, the standard deviation of that age remains constant at around 2.1 years. A survey of 40 smokers of this generation was done to see if the mean starting age is at least 19. The sample mean was 18.1 with a sample standard deviation of 1.3. Do the data support the claim at the 5% level?

The cost of a daily newspaper varies from city to city. However, the variation among prices remains steady with a standard deviation of 20¢. A study was done to test the claim that the mean cost of a daily newspaper is $1.00. Twelve costs yield a mean cost of 95¢ with a standard deviation of 18¢. Do the data support the claim at the 1% level?

  • \(H_{0}: \mu = $1.00\)
  • \(H_{a}: \mu \neq $1.00\)
  • Let \(\bar{X} =\) the average cost of a daily newspaper.
  • \(z = –0.866\)
  • \(p\text{-value} = 0.3865\)
  • \(\alpha: 0.01\)
  • Decision: Do not reject the null hypothesis.
  • Reason for decision: The \(p\text{-value}\) is greater than 0.01.
  • Conclusion: There is sufficient evidence to support the claim that the mean cost of daily papers is $1. The mean cost could be $1.
  • \(($0.84, $1.06)\)

An article in the San Jose Mercury News stated that students in the California state university system take 4.5 years, on average, to finish their undergraduate degrees. Suppose you believe that the mean time is longer. You conduct a survey of 49 students and obtain a sample mean of 5.1 with a sample standard deviation of 1.2. Do the data support your claim at the 1% level?

The mean number of sick days an employee takes per year is believed to be about ten. Members of a personnel department do not believe this figure. They randomly survey eight employees. The number of sick days they took for the past year are as follows: 12; 4; 15; 3; 11; 8; 6; 8. Let \(x =\) the number of sick days they took for the past year. Should the personnel team believe that the mean number is ten?

  • \(H_{0}: \mu = 10\)
  • \(H_{a}: \mu \neq 10\)
  • Let \(\bar{X} =\) the mean number of sick days an employee takes per year.
  • Student’s t -distribution
  • \(t = –1.12\)
  • \(p\text{-value} = 0.300\)
  • \(\alpha: 0.05\)
  • Reason for decision: The \(p\text{-value}\) is greater than 0.05.
  • Conclusion: At the 5% significance level, there is insufficient evidence to conclude that the mean number of sick days is not ten.
  • \((4.9443, 11.806)\)
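Since the raw sick-day counts are given, the t statistic and p-value above can be recomputed directly. The following is a minimal Python sketch (numpy and scipy assumed available):

    # Two-sided one-sample t-test for H_0: mu = 10 vs H_a: mu != 10.
    import numpy as np
    from scipy import stats

    sick_days = np.array([12, 4, 15, 3, 11, 8, 6, 8])
    t_stat, p_value = stats.ttest_1samp(sick_days, popmean=10)

    print(f"mean = {sick_days.mean():.3f}, t = {t_stat:.2f}, p-value = {p_value:.3f}")
    # The p-value (about 0.3) exceeds alpha = 0.05, so the null hypothesis is not rejected.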

In 1955, Life Magazine reported that the 25 year-old mother of three worked, on average, an 80 hour week. Recently, many groups have been studying whether or not the women's movement has, in fact, resulted in an increase in the average work week for women (combining employment and at-home work). Suppose a study was done to determine if the mean work week has increased. 81 women were surveyed with the following results. The sample mean was 83; the sample standard deviation was ten. Does it appear that the mean work week has increased for women at the 5% level?

Your statistics instructor claims that 60 percent of the students who take her Elementary Statistics class go through life feeling more enriched. For some reason that she can't quite figure out, most people don't believe her. You decide to check this out on your own. You randomly survey 64 of her past Elementary Statistics students and find that 34 feel more enriched as a result of her class. Now, what do you think?

  • \(H_{0}: p \geq 0.6\)
  • \(H_{a}: p < 0.6\)
  • Let \(P′ =\) the proportion of students who feel more enriched as a result of taking Elementary Statistics.
  • normal for a single proportion
  • \(p\text{-value} = 0.1308\)
  • Conclusion: There is insufficient evidence to conclude that less than 60 percent of her students feel more enriched.

The “plus-4s” confidence interval is \((0.411, 0.648)\)

A Nissan Motor Corporation advertisement read, “The average man’s I.Q. is 107. The average brown trout’s I.Q. is 4. So why can’t man catch brown trout?” Suppose you believe that the brown trout’s mean I.Q. is greater than four. You catch 12 brown trout. A fish psychologist determines the I.Q.s as follows: 5; 4; 7; 3; 6; 4; 5; 3; 6; 3; 8; 5. Conduct a hypothesis test of your belief.

Refer to Exercise 9.119 . Conduct a hypothesis test to see if your decision and conclusion would change if your belief were that the brown trout’s mean I.Q. is not four.

  • \(H_{0}: \mu = 4\)
  • \(H_{a}: \mu \neq 4\)
  • Let \(\bar{X} =\) the average I.Q. of a set of brown trout.
  • two-tailed Student's t-test
  • \(t = 1.95\)
  • \(p\text{-value} = 0.076\)
  • Reason for decision: The \(p\text{-value}\) is greater than 0.05
  • Conclusion: There is insufficient evidence to conclude that the average IQ of brown trout is not four.
  • \((3.8865,5.9468)\)
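A quick way to verify this two-tailed test and the accompanying confidence interval is the sketch below (numpy/scipy assumed available; not part of the original answer):

    # Two-tailed one-sample t-test for H_0: mu = 4 vs H_a: mu != 4.
    import numpy as np
    from scipy import stats

    iq = np.array([5, 4, 7, 3, 6, 4, 5, 3, 6, 3, 8, 5])
    t_stat, p_value = stats.ttest_1samp(iq, popmean=4)

    # 95% t-based confidence interval for the mean I.Q.
    ci = stats.t.interval(0.95, df=len(iq) - 1, loc=iq.mean(), scale=stats.sem(iq))

    print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}, "
          f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")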

According to an article in Newsweek , the natural ratio of girls to boys is 100:105. In China, the birth ratio is 100: 114 (46.7% girls). Suppose you don’t believe the reported figures of the percent of girls born in China. You conduct a study. In this study, you count the number of girls and boys born in 150 randomly chosen recent births. There are 60 girls and 90 boys born of the 150. Based on your study, do you believe that the percent of girls born in China is 46.7?

A poll done for Newsweek found that 13% of Americans have seen or sensed the presence of an angel. A contingent doubts that the percent is really that high. It conducts its own survey. Out of 76 Americans surveyed, only two had seen or sensed the presence of an angel. As a result of the contingent’s survey, would you agree with the Newsweek poll? In complete sentences, also give three reasons why the two polls might give different results.

  • \(H_{0}: p = 0.13\)
  • \(H_{a}: p < 0.13\)
  • Let \(P′ =\) the proportion of Americans who have seen or sensed angels
  • –2.688
  • \(p\text{-value} = 0.0036\)
  • Reason for decision: The \(p\text{-value}\) is less than 0.05.
  • Conclusion: There is sufficient evidence to conclude that the percentage of Americans who have seen or sensed an angel is less than 13%.

The “plus-4s” confidence interval is \((0.0022, 0.0978)\)

The mean work week for engineers in a start-up company is believed to be about 60 hours. A newly hired engineer hopes that it’s shorter. She asks ten engineering friends in start-ups for the lengths of their mean work weeks. Based on the results that follow, should she count on the mean work week to be shorter than 60 hours?

Data (length of mean work week): 70; 45; 55; 60; 65; 55; 55; 60; 50; 55.

Use the “Lap time” data for Lap 4 (see [link] ) to test the claim that Terri finishes Lap 4, on average, in less than 129 seconds. Use all twenty races given.

  • \(H_{0}: \mu \geq 129\)
  • \(H_{a}: \mu < 129\)
  • Let \(\bar{X} =\) the average time in seconds that Terri finishes Lap 4.
  • Student's t -distribution
  • \(t = 1.209\)
  • Conclusion: There is insufficient evidence to conclude that Terri’s mean lap time is less than 129 seconds.
  • \((128.63, 130.37)\)

Use the “Initial Public Offering” data (see [link] ) to test the claim that the mean offer price was $18 per share. Do not use all the data. Use your random number generator to randomly survey 15 prices.

The following questions were written by past students. They are excellent problems!

"Asian Family Reunion," by Chau Nguyen

Every two years it comes around.

We all get together from different towns.

In my honest opinion,

It's not a typical family reunion.

Not forty, or fifty, or sixty,

But how about seventy companions!

The kids would play, scream, and shout

One minute they're happy, another they'll pout.

The teenagers would look, stare, and compare

From how they look to what they wear.

The men would chat about their business

That they make more, but never less.

Money is always their subject

And there's always talk of more new projects.

The women get tired from all of the chats

They head to the kitchen to set out the mats.

Some would sit and some would stand

Eating and talking with plates in their hands.

Then come the games and the songs

And suddenly, everyone gets along!

With all that laughter, it's sad to say

That it always ends in the same old way.

They hug and kiss and say "good-bye"

And then they all begin to cry!

I say that 60 percent shed their tears

But my mom counted 35 people this year.

She said that boys and men will always have their pride,

So we won't ever see them cry.

I myself don't think she's correct,

So could you please try this problem to see if you object?

  • \(H_{0}: p = 0.60\)
  • \(H_{a}: p < 0.60\)
  • Let \(P′ =\) the proportion of family members who shed tears at a reunion.
  • –1.71
  • Reason for decision: \(p\text{-value} < \alpha\)
  • Conclusion: At the 5% significance level, there is sufficient evidence to conclude that the proportion of family members who shed tears at a reunion is less than 0.60. However, the test is weak because the \(p\text{-value}\) and alpha are quite close, so other tests should be done.
  • We are 95% confident that between 38.29% and 61.71% of family members will shed tears at a family reunion. \((0.3829, 0.6171)\). The“plus-4s” confidence interval (see chapter 8) is \((0.3861, 0.6139)\)

Note that here the “large-sample” \(\text{1-PropZTest}\) provides the approximate \(p\text{-value}\) of 0.0438. Whenever a \(p\text{-value}\) based on a normal approximation is close to the level of significance, the exact \(p\text{-value}\) based on binomial probabilities should be calculated whenever possible. This is beyond the scope of this course.
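Although the exact calculation is described as beyond the scope of the course, it takes only one line with modern software. The sketch below (scipy assumed available) compares the normal-approximation p-value with the exact binomial p-value, taking the data to be 35 criers out of the roughly 70 family members mentioned in the poem:

    # Normal approximation vs. exact binomial p-value for H_a: p < 0.60.
    from scipy.stats import binom, norm

    n, x, p0 = 70, 35, 0.60
    p_hat = x / n

    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
    p_value_normal = norm.cdf(z)            # large-sample z-test p-value

    p_value_exact = binom.cdf(x, n, p0)     # P(X <= 35) when X ~ Binomial(70, 0.6)

    print(f"z = {z:.2f}, normal-approximation p-value = {p_value_normal:.4f}, "
          f"exact binomial p-value = {p_value_exact:.4f}")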

"The Problem with Angels," by Cyndy Dowling

Although this problem is wholly mine,

The catalyst came from the magazine, Time.

On the magazine cover I did find

The realm of angels tickling my mind.

Inside, 69% I found to be

In angels, Americans do believe.

Then, it was time to rise to the task,

Ninety-five high school and college students I did ask.

Viewing all as one group,

Random sampling to get the scoop.

So, I asked each to be true,

"Do you believe in angels?" Tell me, do!

Hypothesizing at the start,

Totally believing in my heart

That the proportion who said yes

Would be equal on this test.

Lo and behold, seventy-three did arrive,

Out of the sample of ninety-five.

Now your job has just begun,

Solve this problem and have some fun.

"Blowing Bubbles," by Sondra Prull

Studying stats just made me tense,

I had to find some sane defense.

Some light and lifting simple play

To float my math anxiety away.

Blowing bubbles lifts me high

Takes my troubles to the sky.

POIK! They're gone, with all my stress

Bubble therapy is the best.

The label said each time I blew

The average number of bubbles would be at least 22.

I blew and blew and this I found

From 64 blows, they all are round!

But the number of bubbles in 64 blows

Varied widely, this I know.

20 per blow became the mean

They deviated by 6, and not 16.

From counting bubbles, I sure did relax

But now I give to you your task.

Was 22 a reasonable guess?

Find the answer and pass this test!

  • \(H_{0}: \mu \geq 22\)
  • \(H_{a}: \mu < 22\)
  • Let \(\bar{X} =\) the mean number of bubbles per blow.
  • –2.667
  • \(p\text{-value} = 0.00486\)
  • Conclusion: There is sufficient evidence to conclude that the mean number of bubbles per blow is less than 22.
  • \((18.501, 21.499)\)

"Dalmatian Darnation," by Kathy Sparling

A greedy dog breeder named Spreckles

Bred puppies with numerous freckles

The Dalmatians he sought

Possessed spot upon spot

The more spots, he thought, the more shekels.

His competitors did not agree

That freckles would increase the fee.

They said, “Spots are quite nice

But they don't affect price;

One should breed for improved pedigree.”

The breeders decided to prove

This strategy was a wrong move.

Breeding only for spots

Would wreak havoc, they thought.

His theory they want to disprove.

They proposed a contest to Spreckles

Comparing dog prices to freckles.

In records they looked up

One hundred one pups:

Dalmatians that fetched the most shekels.

They asked Mr. Spreckles to name

An average spot count he'd claim

To bring in big bucks.

Said Spreckles, “Well, shucks,

It's for one hundred one that I aim.”

Said an amateur statistician

Who wanted to help with this mission.

“Twenty-one for the sample

Standard deviation's ample:

They examined one hundred and one

Dalmatians that fetched a good sum.

They counted each spot,

Mark, freckle and dot

And tallied up every one.

Instead of one hundred one spots

They averaged ninety six dots

Can they muzzle Spreckles’

Obsession with freckles

Based on all the dog data they've got?

"Macaroni and Cheese, please!!" by Nedda Misherghi and Rachelle Hall

As a poor starving student I don't have much money to spend for even the bare necessities. So my favorite and main staple food is macaroni and cheese. It's high in taste and low in cost and nutritional value.

One day, as I sat down to determine the meaning of life, I got a serious craving for this, oh, so important, food of my life. So I went down the street to Greatway to get a box of macaroni and cheese, but it was SO expensive! $2.02 !!! Can you believe it? It made me stop and think. The world is changing fast. I had thought that the mean cost of a box (the normal size, not some super-gigantic-family-value-pack) was at most $1, but now I wasn't so sure. However, I was determined to find out. I went to 53 of the closest grocery stores and surveyed the prices of macaroni and cheese. Here are the data I wrote in my notebook:

Price per box of Mac and Cheese:

  • 5 stores @ $2.02
  • 15 stores @ $0.25
  • 3 stores @ $1.29
  • 6 stores @ $0.35
  • 4 stores @ $2.27
  • 7 stores @ $1.50
  • 5 stores @ $1.89
  • 8 stores @ $0.75

I could see that the cost varied but I had to sit down to figure out whether or not I was right. If it does turn out that this mouth-watering dish is at most $1, then I'll throw a big cheesy party in our next statistics lab, with enough macaroni and cheese for just me. (After all, as a poor starving student I can't be expected to feed our class of animals!)

  • \(H_{0}: \mu \leq 1\)
  • \(H_{a}: \mu > 1\)
  • Let \(\bar{X} =\) the mean cost in dollars of macaroni and cheese in a certain town.
  • Student's \(t\)-distribution
  • \(t = 0.340\)
  • \(p\text{-value} = 0.36756\)
  • Conclusion: The mean cost could be $1, or less. At the 5% significance level, there is insufficient evidence to conclude that the mean price of a box of macaroni and cheese is more than $1.
  • \((0.8291, 1.241)\)
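The answer above can be reproduced by rebuilding the 53 observed prices from the grouped counts. A minimal Python sketch follows (numpy/scipy assumed available; the alternative keyword requires scipy 1.6 or later):

    # Right-tailed one-sample t-test for H_0: mu <= 1 vs H_a: mu > 1.
    import numpy as np
    from scipy import stats

    prices = np.repeat(
        [2.02, 0.25, 1.29, 0.35, 2.27, 1.50, 1.89, 0.75],   # price per box in dollars
        [5,    15,   3,    6,    4,    7,    5,    8],      # number of stores at each price
    )

    t_stat, p_value = stats.ttest_1samp(prices, popmean=1.0, alternative="greater")

    print(f"n = {prices.size}, mean = {prices.mean():.4f}, "
          f"t = {t_stat:.3f}, p-value = {p_value:.4f}")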

"William Shakespeare: The Tragedy of Hamlet, Prince of Denmark," by Jacqueline Ghodsi

THE CHARACTERS (in order of appearance):

  • HAMLET, Prince of Denmark and student of Statistics
  • POLONIUS, Hamlet’s tutor
  • HORATIO, friend to Hamlet and fellow student

Scene: The great library of the castle, in which Hamlet does his lessons

(The day is fair, but the face of Hamlet is clouded. He paces the large room. His tutor, Polonius, is reprimanding Hamlet regarding the latter’s recent experience. Horatio is seated at the large table at right stage.)

POLONIUS: My Lord, how cans’t thou admit that thou hast seen a ghost! It is but a figment of your imagination!

HAMLET: I beg to differ; I know of a certainty that five-and-seventy in one hundred of us, condemned to the whips and scorns of time as we are, have gazed upon a spirit of health, or goblin damn’d, be their intents wicked or charitable.

POLONIUS: If thou doest insist upon thy wretched vision then let me invest your time; be true to thy work and speak to me through the reason of the null and alternate hypotheses. (He turns to Horatio.) Did not Hamlet himself say, “What piece of work is man, how noble in reason, how infinite in faculties?” Then let not this foolishness persist. Go, Horatio, make a survey of three-and-sixty and discover what the true proportion be. For my part, I will never succumb to this fantasy, but deem man to be devoid of all reason should thy proposal of at least five-and-seventy in one hundred hold true.

HORATIO (to Hamlet): What should we do, my Lord?

HAMLET: Go to thy purpose, Horatio.

HORATIO: To what end, my Lord?

HAMLET: That you must teach me. But let me conjure you by the rights of our fellowship, by the consonance of our youth, but the obligation of our ever-preserved love, be even and direct with me, whether I am right or no.

(Horatio exits, followed by Polonius, leaving Hamlet to ponder alone.)

(The next day, Hamlet awaits anxiously the presence of his friend, Horatio. Polonius enters and places some books upon the table just a moment before Horatio enters.)

POLONIUS: So, Horatio, what is it thou didst reveal through thy deliberations?

HORATIO: In a random survey, for which purpose thou thyself sent me forth, I did discover that one-and-forty believe fervently that the spirits of the dead walk with us. Before my God, I might not this believe, without the sensible and true avouch of mine own eyes.

POLONIUS: Give thine own thoughts no tongue, Horatio. (Polonius turns to Hamlet.) But look to’t I charge you, my Lord. Come Horatio, let us go together, for this is not our test. (Horatio and Polonius leave together.)

HAMLET: To reject, or not reject, that is the question: whether ‘tis nobler in the mind to suffer the slings and arrows of outrageous statistics, or to take arms against a sea of data, and, by opposing, end them. (Hamlet resignedly attends to his task.)

(Curtain falls)

"Untitled," by Stephen Chen

I've often wondered how software is released and sold to the public. Ironically, I work for a company that sells products with known problems. Unfortunately, most of the problems are difficult to create, which makes them difficult to fix. I usually use the test program X, which tests the product, to try to create a specific problem. When the test program is run to make an error occur, the likelihood of generating an error is 1%.

So, armed with this knowledge, I wrote a new test program Y that will generate the same error that test program X creates, but more often. To find out if my test program is better than the original, so that I can convince the management that I'm right, I ran my test program to find out how often I can generate the same error. When I ran my test program 50 times, I generated the error twice. While this may not seem much better, I think that I can convince the management to use my test program instead of the original test program. Am I right?

  • \(H_{0}: p = 0.01\)
  • \(H_{a}: p > 0.01\)
  • Let \(P′ =\) the proportion of errors generated
  • Normal for a single proportion
  • Decision: Reject the null hypothesis
  • Conclusion: At the 5% significance level, there is sufficient evidence to conclude that the proportion of errors generated is more than 0.01.

The “plus-4s” confidence interval is \((0.004, 0.144)\).

"Japanese Girls’ Names"

by Kumi Furuichi

It used to be very typical for Japanese girls’ names to end with “ko.” (The trend might have started around my grandmothers’ generation and its peak might have been around my mother’s generation.) “Ko” means “child” in Chinese characters. Parents would name their daughters with “ko” attaching to other Chinese characters which have meanings that they want their daughters to become, such as Sachiko—happy child, Yoshiko—a good child, Yasuko—a healthy child, and so on.

However, I noticed recently that only two out of nine of my Japanese girlfriends at this school have names which end with “ko.” More and more, parents seem to have become creative, modernized, and, sometimes, westernized in naming their children.

I have a feeling that, while 70 percent or more of my mother’s generation would have names with “ko” at the end, the proportion has dropped among my peers. I wrote down all my Japanese friends’, ex-classmates’, co-workers’, and acquaintances’ names that I could remember. Following are the names. (Some are repeats.) Test to see if the proportion has dropped for this generation.

Ai, Akemi, Akiko, Ayumi, Chiaki, Chie, Eiko, Eri, Eriko, Fumiko, Harumi, Hitomi, Hiroko, Hiroko, Hidemi, Hisako, Hinako, Izumi, Izumi, Junko, Junko, Kana, Kanako, Kanayo, Kayo, Kayoko, Kazumi, Keiko, Keiko, Kei, Kumi, Kumiko, Kyoko, Kyoko, Madoka, Maho, Mai, Maiko, Maki, Miki, Miki, Mikiko, Mina, Minako, Miyako, Momoko, Nana, Naoko, Naoko, Naoko, Noriko, Rieko, Rika, Rika, Rumiko, Rei, Reiko, Reiko, Sachiko, Sachiko, Sachiyo, Saki, Sayaka, Sayoko, Sayuri, Seiko, Shiho, Shizuka, Sumiko, Takako, Takako, Tomoe, Tomoe, Tomoko, Touko, Yasuko, Yasuko, Yasuyo, Yoko, Yoko, Yoko, Yoshiko, Yoshiko, Yoshiko, Yuka, Yuki, Yuki, Yukiko, Yuko, Yuko.

"Phillip’s Wish," by Suzanne Osorio

My nephew likes to play

Chasing the girls makes his day.

He asked his mother

If it is okay

To get his ear pierced.

She said, “No way!”

To poke a hole through your ear,

Is not what I want for you, dear.

He argued his point quite well,

Says even my macho pal, Mel,

Has gotten this done.

It’s all just for fun.

C’mon please, mom, please, what the hell.

Again Phillip complained to his mother,

Saying half his friends (including their brothers)

Are piercing their ears

And they have no fears

He wants to be like the others.

She said, “I think it’s much less.

We must do a hypothesis test.

And if you are right,

I won’t put up a fight.

But, if not, then my case will rest.”

We proceeded to call fifty guys

To see whose prediction would fly.

Nineteen of the fifty

Said piercing was nifty

And earrings they’d occasionally buy.

Then there’s the other thirty-one,

Who said they’d never have this done.

So now this poem’s finished.

Will his hopes be diminished,

Or will my nephew have his fun?

  • \(H_{0}: p = 0.50\)
  • \(H_{a}: p < 0.50\)
  • Let \(P′ =\) the proportion of friends who have pierced ears.
  • –1.70
  • \(p\text{-value} = 0.0448\)
  • Reason for decision: The \(p\text{-value}\) is less than 0.05. (However, they are very close.)
  • Conclusion: There is sufficient evidence to support the claim that less than 50% of his friends have pierced ears.
  • Confidence Interval: \((0.245, 0.515)\): The “plus-4s” confidence interval is \((0.259, 0.519)\).
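Both the test statistic and the two confidence intervals quoted above can be reproduced with a short sketch (scipy assumed available; not part of the original answer):

    # Left-tailed one-proportion z-test for H_0: p = 0.50 vs H_a: p < 0.50,
    # plus the ordinary (Wald) and "plus-4s" 95% confidence intervals.
    from scipy.stats import norm

    n, x, p0 = 50, 19, 0.50
    p_hat = x / n

    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
    p_value = norm.cdf(z)

    zc = norm.ppf(0.975)                                # about 1.96

    se = (p_hat * (1 - p_hat) / n) ** 0.5               # Wald interval
    wald = (p_hat - zc * se, p_hat + zc * se)

    p4 = (x + 2) / (n + 4)                              # add 2 successes and 2 failures
    se4 = (p4 * (1 - p4) / (n + 4)) ** 0.5
    plus4 = (p4 - zc * se4, p4 + zc * se4)

    print(f"z = {z:.2f}, p-value = {p_value:.4f}")
    print(f"95% CI = ({wald[0]:.3f}, {wald[1]:.3f}), "
          f"plus-4s CI = ({plus4[0]:.3f}, {plus4[1]:.3f})")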

"The Craven," by Mark Salangsang

Once upon a morning dreary

In stats class I was weak and weary.

Pondering over last night’s homework

Whose answers were now on the board

This I did and nothing more.

While I nodded nearly napping

Suddenly, there came a tapping.

As someone gently rapping,

Rapping my head as I snore.

Quoth the teacher, “Sleep no more.”

“In every class you fall asleep,”

The teacher said, his voice was deep.

“So a tally I’ve begun to keep

Of every class you nap and snore.

The percentage being forty-four.”

“My dear teacher I must confess,

While sleeping is what I do best.

The percentage, I think, must be less,

A percentage less than forty-four.”

This I said and nothing more.

“We’ll see,” he said and walked away,

And fifty classes from that day

He counted till the month of May

The classes in which I napped and snored.

The number he found was twenty-four.

At a significance level of 0.05,

Please tell me am I still alive?

Or did my grade just take a dive

Plunging down beneath the floor?

Upon thee I hereby implore.

Toastmasters International cites a report by Gallup Poll that 40% of Americans fear public speaking. A student believes that less than 40% of students at her school fear public speaking. She randomly surveys 361 schoolmates and finds that 135 report they fear public speaking. Conduct a hypothesis test to determine if the percent at her school is less than 40%.

  • \(H_{0}: p = 0.40\)
  • \(H_{a}: p < 0.40\)
  • Let \(P′ =\) the proportion of schoolmates who fear public speaking.
  • –1.01
  • \(p\text{-value} = 0.1563\)
  • Conclusion: There is insufficient evidence to support the claim that less than 40% of students at the school fear public speaking.
  • Confidence Interval: \((0.3241, 0.4240)\): The “plus-4s” confidence interval is \((0.3257, 0.4250)\).

Sixty-eight percent of online courses taught at community colleges nationwide were taught by full-time faculty. To test if 68% also represents California’s percent for full-time faculty teaching the online classes, Long Beach City College (LBCC) in California was randomly selected for comparison. In the same year, 34 of the 44 online courses LBCC offered were taught by full-time faculty. Conduct a hypothesis test to determine if 68% represents California. NOTE: For more accurate results, use more California community colleges and this past year's data.

According to an article in Bloomberg Businessweek , New York City's most recent adult smoking rate is 14%. Suppose that a survey is conducted to determine this year’s rate. Nine out of 70 randomly chosen N.Y. City residents reply that they smoke. Conduct a hypothesis test to determine if the rate is still 14% or if it has decreased.

  • \(H_{0}: p = 0.14\)
  • \(H_{a}: p < 0.14\)
  • Let \(P′ =\) the proportion of NYC residents that smoke.
  • –0.2756
  • \(p\text{-value} = 0.3914\)
  • At the 5% significance level, there is insufficient evidence to conclude that the proportion of NYC residents who smoke is less than 0.14.
  • Confidence Interval: \((0.0502, 0.2070)\): The “plus-4s” confidence interval (see chapter 8) is \((0.0676, 0.2297)\).

The mean age of De Anza College students in a previous term was 26.6 years old. An instructor thinks the mean age for online students is older than 26.6. She randomly surveys 56 online students and finds that the sample mean is 29.4 with a standard deviation of 2.1. Conduct a hypothesis test.

Registered nurses earned an average annual salary of $69,110. For that same year, a survey was conducted of 41 California registered nurses to determine if the annual salary is higher than $69,110 for California nurses. The sample average was $71,121 with a sample standard deviation of $7,489. Conduct a hypothesis test.

  • \(H_{0}: \mu = 69,110\)
  • \(H_{a}: \mu > 69,110\)
  • Let \(\bar{X} =\) the mean salary in dollars for California registered nurses.
  • \(t = 1.719\)
  • \(p\text{-value}: 0.0466\)
  • Conclusion: At the 5% significance level, there is sufficient evidence to conclude that the mean salary of California registered nurses exceeds $69,110.
  • \(($68,757, $73,485)\)
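Because only summary statistics are reported, the test can be carried out directly from them, as in the hedged sketch below (scipy assumed available):

    # Right-tailed t-test for H_0: mu = 69,110 vs H_a: mu > 69,110 from summary statistics.
    from math import sqrt
    from scipy.stats import t

    n, xbar, s, mu0 = 41, 71_121, 7_489, 69_110
    df = n - 1

    se = s / sqrt(n)
    t_stat = (xbar - mu0) / se
    p_value = t.sf(t_stat, df)                 # upper tail

    margin = t.ppf(0.975, df) * se             # 95% confidence interval for the mean
    print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}, "
          f"95% CI = ({xbar - margin:.0f}, {xbar + margin:.0f})")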

La Leche League International reports that the mean age of weaning a child from breastfeeding is age four to five worldwide. In America, most nursing mothers wean their children much earlier. Suppose a random survey is conducted of 21 U.S. mothers who recently weaned their children. The mean weaning age was nine months (3/4 year) with a standard deviation of 4 months. Conduct a hypothesis test to determine if the mean weaning age in the U.S. is less than four years old.

Over the past few decades, public health officials have examined the link between weight concerns and teen girls' smoking. Researchers surveyed a group of 273 randomly selected teen girls living in Massachusetts (between 12 and 15 years old). After four years the girls were surveyed again. Sixty-three said they smoked to stay thin. Is there good evidence that more than thirty percent of the teen girls smoke to stay thin?

After conducting the test, your decision and conclusion are

  • Reject \(H_{0}\): There is sufficient evidence to conclude that more than 30% of teen girls smoke to stay thin.
  • Do not reject \(H_{0}\): There is not sufficient evidence to conclude that less than 30% of teen girls smoke to stay thin.
  • Do not reject \(H_{0}\): There is not sufficient evidence to conclude that more than 30% of teen girls smoke to stay thin.
  • Reject \(H_{0}\): There is sufficient evidence to conclude that less than 30% of teen girls smoke to stay thin.

A statistics instructor believes that fewer than 20% of Evergreen Valley College (EVC) students attended the opening night midnight showing of the latest Harry Potter movie. She surveys 84 of her students and finds that 11 of them attended the midnight showing.

At a 1% level of significance, an appropriate conclusion is:

  • There is insufficient evidence to conclude that the percent of EVC students who attended the midnight showing of Harry Potter is less than 20%.
  • There is sufficient evidence to conclude that the percent of EVC students who attended the midnight showing of Harry Potter is more than 20%.
  • There is sufficient evidence to conclude that the percent of EVC students who attended the midnight showing of Harry Potter is less than 20%.
  • There is insufficient evidence to conclude that the percent of EVC students who attended the midnight showing of Harry Potter is at least 20%.

Previously, an organization reported that teenagers spent 4.5 hours per week, on average, on the phone. The organization thinks that, currently, the mean is higher. Fifteen randomly chosen teenagers were asked how many hours per week they spend on the phone. The sample mean was 4.75 hours with a sample standard deviation of 2.0. Conduct a hypothesis test.

At a significance level of \(\alpha = 0.05\), what is the correct conclusion?

  • There is enough evidence to conclude that the mean number of hours is more than 4.75
  • There is enough evidence to conclude that the mean number of hours is more than 4.5
  • There is not enough evidence to conclude that the mean number of hours is more than 4.5
  • There is not enough evidence to conclude that the mean number of hours is more than 4.75

Instructions: For the following ten hypothesis-testing exercises, answer each question.

  • State the null and alternate hypotheses.
  • State the \(p\text{-value}\).
  • State \(\alpha\).
  • What is your decision?
  • Write a conclusion.
  • Answer any other questions asked in the problem.

According to the Center for Disease Control website, in 2011 at least 18% of high school students had smoked a cigarette. An Introduction to Statistics class in Davies County, KY conducted a hypothesis test at the local high school (a medium-sized school of approximately 1,200 students in a small-city demographic) to determine if the local high school’s percentage was lower. One hundred fifty students were chosen at random and surveyed. Of the 150 students surveyed, 82 have smoked. Using a significance level of 0.05 and appropriate statistical evidence, conduct a hypothesis test and state the conclusions.

A recent survey in the N.Y. Times Almanac indicated that 48.8% of families own stock. A broker wanted to determine if this survey could be valid. He surveyed a random sample of 250 families and found that 142 owned some type of stock. At the 0.05 significance level, can the survey be considered to be accurate?

  • \(H_{0}: p = 0.488\) \(H_{a}: p \neq 0.488\)
  • \(p\text{-value} = 0.0114\)
  • \(\alpha = 0.05\)
  • Reject the null hypothesis.
  • At the 5% level of significance, there is enough evidence to conclude that the proportion of families that own stock is not 48.8%.
  • The survey does not appear to be accurate.

Driver error can be listed as the cause of approximately 54% of all fatal auto accidents, according to the American Automobile Association. Thirty randomly selected fatal accidents are examined, and it is determined that 14 were caused by driver error. Using \(\alpha = 0.05\), is the AAA proportion accurate?

The US Department of Energy reported that 51.7% of homes were heated by natural gas. A random sample of 221 homes in Kentucky found that 115 were heated by natural gas. Does the evidence support the claim for Kentucky at the \(\alpha = 0.05\) level in Kentucky? Are the results applicable across the country? Why?

  • \(H_{0}: p = 0.517\) \(H_{a}: p \neq 0.517\)
  • \(p\text{-value} = 0.9203\).
  • \(\alpha = 0.05\).
  • Do not reject the null hypothesis.
  • At the 5% significance level, there is not enough evidence to conclude that the proportion of homes in Kentucky that are heated by natural gas differs from 0.517.
  • However, we cannot generalize this result to the entire nation. First, the sample’s population is only the state of Kentucky. Second, it is reasonable to assume that homes in the extreme north and south will have extreme high usage and low usage, respectively. We would need to expand our sample base to include these possibilities if we wanted to generalize this claim to the entire nation.

For Americans using library services, the American Library Association claims that at most 67% of patrons borrow books. The library director in Owensboro, Kentucky feels this is not true, so she asked a local college statistics class to conduct a survey. The class randomly selected 100 patrons and found that 82 borrowed books. Did the class demonstrate that the percentage was higher in Owensboro, KY? Use the \(\alpha = 0.01\) level of significance. What is the possible proportion of patrons that do borrow books from the Owensboro Library?

The Weather Underground reported that the mean amount of summer rainfall for the northeastern US is at least 11.52 inches. Ten cities in the northeast are randomly selected and the mean rainfall amount is calculated to be 7.42 inches with a standard deviation of 1.3 inches. At the \(\alpha = 0.05\) level, can it be concluded that the mean rainfall was below the reported average? What if \(\alpha = 0.01\)? Assume the amount of summer rainfall follows a normal distribution.

  • \(H_{0}: \mu \geq 11.52\) \(H_{a}: \mu < 11.52\)
  • \(p\text{-value} = 0.000002\) which is almost 0.
  • At the 5% significance level, there is enough evidence to conclude that the mean amount of summer rain in the northeastern US is less than 11.52 inches, on average.
  • We would make the same conclusion if alpha was 1% because the \(p\text{-value}\) is almost 0.

A survey in the N.Y. Times Almanac finds the mean commute time (one way) is 25.4 minutes for the 15 largest US cities. The Austin, TX chamber of commerce feels that Austin’s commute time is less and wants to publicize this fact. The mean for 25 randomly selected commuters is 22.1 minutes with a standard deviation of 5.3 minutes. At the \(\alpha = 0.10\) level, is the Austin, TX commute significantly less than the mean commute time for the 15 largest US cities?

A report by the Gallup Poll found that a woman visits her doctor, on average, at most 5.8 times each year. A random sample of 20 women results in these yearly visit totals

3; 2; 1; 3; 7; 2; 9; 4; 6; 6; 8; 0; 5; 6; 4; 2; 1; 3; 4; 1

At the \(\alpha = 0.05\) level can it be concluded that the sample mean is higher than 5.8 visits per year?

  • \(H_{0}: \mu \leq 5.8\) \(H_{a}: \mu > 5.8\)
  • \(p\text{-value} = 0.9987\)
  • At the 5% level of significance, there is not enough evidence to conclude that a woman visits her doctor, on average, more than 5.8 times a year.
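With the raw visit counts listed in the exercise, the one-sided test can be run in a few lines (numpy/scipy assumed available; the alternative keyword requires scipy 1.6 or later):

    # Right-tailed one-sample t-test for H_0: mu <= 5.8 vs H_a: mu > 5.8.
    import numpy as np
    from scipy import stats

    visits = np.array([3, 2, 1, 3, 7, 2, 9, 4, 6, 6, 8, 0, 5, 6, 4, 2, 1, 3, 4, 1])
    t_stat, p_value = stats.ttest_1samp(visits, popmean=5.8, alternative="greater")

    print(f"mean = {visits.mean():.2f}, t = {t_stat:.2f}, p-value = {p_value:.4f}")
    # The sample mean is below 5.8, so the one-sided p-value is close to 1.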

According to the N.Y. Times Almanac the mean family size in the U.S. is 3.18. A sample of a college math class resulted in the following family sizes:

5; 4; 5; 4; 4; 3; 6; 4; 3; 3; 5; 5; 6; 3; 3; 2; 7; 4; 5; 2; 2; 2; 3; 2

At \(\alpha = 0.05\) level, is the class’ mean family size greater than the national average? Does the Almanac result remain valid? Why?

The student academic group on a college campus claims that freshman students study at least 2.5 hours per day, on average. One Introduction to Statistics class was skeptical. The class took a random sample of 30 freshman students and found a mean study time of 137 minutes with a standard deviation of 45 minutes. At α = 0.01 level, is the student academic group’s claim correct?

  • \(H_{0}: \mu \geq 150\) \(H_{a}: \mu < 150\)
  • \(p\text{-value} = 0.0622\)
  • \(\alpha = 0.01\)
  • At the 1% significance level, there is not enough evidence to conclude that freshmen students study less than 2.5 hours per day, on average.
  • The student academic group’s claim appears to be correct.

9.7: Hypothesis Testing of a Single Mean and Single Proportion

Confidence distributions and hypothesis testing

  • Regular Article
  • Open access
  • Published: 29 March 2024


  • Eugenio Melilli (ORCID: orcid.org/0000-0003-2542-5286)
  • Piero Veronese (ORCID: orcid.org/0000-0002-4416-2269)


The traditional frequentist approach to hypothesis testing has recently come under extensive debate, raising several critical concerns. Additionally, practical applications often blend the decision-theoretical framework pioneered by Neyman and Pearson with the inductive inferential process relied on the p -value, as advocated by Fisher. The combination of the two methods has led to interpreting the p -value as both an observed error rate and a measure of empirical evidence for the hypothesis. Unfortunately, both interpretations pose difficulties. In this context, we propose that resorting to confidence distributions can offer a valuable solution to address many of these critical issues. Rather than suggesting an automatic procedure, we present a natural approach to tackle the problem within a broader inferential context. Through the use of confidence distributions, we show the possibility of defining two statistical measures of evidence that align with different types of hypotheses under examination. These measures, unlike the p -value, exhibit coherence, simplicity of interpretation, and ease of computation, as exemplified by various illustrative examples spanning diverse fields. Furthermore, we provide theoretical results that establish connections between our proposal, other measures of evidence given in the literature, and standard testing concepts such as size, optimality, and the p -value.


1 Introduction

In applied research, the standard frequentist approach to hypothesis testing is commonly regarded as a straightforward, coherent, and automatic method for assessing the validity of a conjecture represented by one of two hypotheses, denoted as \({{{\mathcal {H}}}_{0}}\) and \({{{\mathcal {H}}}_{1}}\) . The probabilities \(\alpha \) and \(\beta \) of committing type I and type II errors (reject \({{{\mathcal {H}}}_{0}}\) , when it is true and accept \({{{\mathcal {H}}}_{0}}\) when it is false, respectively) are controlled through a carefully designed experiment. After having fixed \(\alpha \) (usually at 0.05), the p -value is used to quantify the measure of evidence against the null hypothesis. If the p -value is less than \(\alpha \) , the conclusion is deemed significant , suggesting that it is unlikely that the null hypothesis holds. Regrettably, this methodology is not as secure as it may seem, as evidenced by a large literature, see the ASA’s Statement on p -values (Wasserstein and Lazar 2016 ) and The American Statistician (2019, vol. 73, sup1) for a discussion of various principles, misconceptions, and recommendations regarding the utilization of p -values. The standard frequentist approach is, in fact, a blend of two different views on hypothesis testing presented by Neyman-Pearson and Fisher. The first authors approach hypothesis testing within a decision-theoretic framework, viewing it as a behavioral theory. In contrast, Fisher’s perspective considers testing as a component of an inductive inferential process that does not necessarily require an alternative hypothesis or concepts from decision theory such as loss, risk or admissibility, see Hubbard and Bayarri ( 2003 ). As emphasized by Goodman ( 1993 ) “the combination of the two methods has led to a reinterpretation of the p -value simultaneously as an ‘observed error rate’ and as a ‘measure of evidence’. Both of these interpretations are problematic...”.

It is out of our scope to review the extensive debate on hypothesis testing. Here, we briefly touch upon a few general points, without delving into the Bayesian approach.

i) The long-standing caution expressed by Berger and Sellke ( 1987 ) and Berger and Delampady ( 1987 ) that a p -value of 0.05 provides only weak evidence against the null hypothesis has been further substantiated by recent investigations into experiment reproducibility, see e.g., Open Science Collaboration OSC ( 2015 ) and Johnson et al. ( 2017 ). In light of this, 72 statisticians have stated “For fields where the threshold for defining statistical significance for new discoveries is \(p<0.05\) , we propose a change to \(p<0.005\) ”, see Benjamin et al. ( 2018 ).

ii) The ongoing debate regarding the selection of a one-sided or two-sided test leaves the standard practice of doubling the p-value , when moving from the first to the second type of test, without consistent support, see e.g., Freedman ( 2008 ).

iii) There has been a longstanding argument in favor of integrating hypothesis testing with estimation, see e.g. Yates ( 1951 , pp. 32–33) or more recently, Greenland et al. ( 2016 ) who emphasize that “... statistical tests should never constitute the sole input to inferences or decisions about associations or effects ... in most scientific settings, the arbitrary classification of results into significant and non-significant is unnecessary for and often damaging to valid interpretation of data”.

iv) Finally, the p -value is incoherent when it is regarded as a statistical measure of the evidence provided by the data in support of a hypothesis \({{{\mathcal {H}}}_{0}}\) . As shown by Schervish ( 1996 ), it is possible that the p -value for testing the hypothesis \({{{\mathcal {H}}}_{0}}\) is greater than that for testing \({{{\mathcal {H}}}_{0}}^{\prime } \supset {{{\mathcal {H}}}_{0}}\) for the same observed data.

While theoretical insights into hypothesis testing are valuable for elucidating various aspects, we believe they cannot be compelled to serve as a unique, definitive practical guide for real-world applications. For example, uniformly most powerful (UMP) tests for discrete models not only rarely exist, but nobody uses them because they are randomized. On the other hand, how can a test of size 0.05 be considered really different from one of size 0.047 or 0.053? Moreover, for one-sided hypotheses, why should the first type error always be much more severe than the second type one? Alternatively, why should the test for \({{{\mathcal {H}}}_{0}}: \theta \le \theta _0\) versus \({{{\mathcal {H}}}_{1}}: \theta >\theta _0\) always be considered equivalent to the test for \({{{\mathcal {H}}}_{0}}: \theta = \theta _0\) versus \({{{\mathcal {H}}}_{1}}: \theta >\theta _0\) ? Furthermore, the decision to test \({{{\mathcal {H}}}_{0}}: \theta =\theta _0\) rather than \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _0-\epsilon , \theta _0+\epsilon ]\) , for a suitable positive \(\epsilon \) , should be driven by the specific requirements of the application and not solely by the existence of a good or simple test. In summary, we concur with Fisher ( 1973 ) that “no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas”.

Considering all these crucial aspects, we believe it is essential to seek an applied hypothesis testing approach that encourages researchers to engage more deeply with the specific problem, avoids relying on standardized procedures, and is consistently integrated into a broader framework of inference. One potential solution can be found resorting to the “confidence distribution” (CD) approach. The modern CD theory was introduced by Schweder and Hjort ( 2002 ) and Singh et al. ( 2005 ) and relies on the idea of constructing a data-depending distribution for the parameter of interest to be used for inferential purposes. A CD should not be confused with a Bayesian posterior distribution. It is not derived through the Bayes theorem, and it does not require any prior distributions. Similar to the conventional practice in point or interval estimation, where one seeks a point or interval estimator, the objective of this theory is to discover a distribution estimator . Thanks to a clarification of this concept and a formalized definition of the CD within a purely frequentist setting, a wide literature on the topic has been developed encompassing both theoretical developments and practical applications, see e.g. for a general overview Schweder and Hjort ( 2016 ), Singh et al. ( 2007 ), and Xie and Singh ( 2013 ). We also remark that when inference is required for a real parameter, it is possible to establish a relationship between CDs and fiducial distributions, originally introduced by Fisher ( 1930 ). For a modern and general presentation of the fiducial inference see Hannig ( 2009 ) and Hannig et al. ( 2016 ), while for a connection with the CDs see Schweder and Hjort ( 2016 ) and Veronese and Melilli ( 2015 , 2018a ). Some results about the connection between CDs and hypothesis testing are presented in Singh et al. ( 2007 , Sec. 3.3) and Xie & Singh ( 2013 , Sec. 4.3), but the focus is only on the formal relationships between the support that a CD can provide for a hypothesis and the p -value.

In this paper we discuss in details the application of CDs in hypothesis testing. We show how CDs can offer valuable solutions to address the aforementioned difficulties and how a test can naturally be viewed as a part of a more extensive inferential process. Once a CD has been specified, everything can be developed straightforwardly, without any particular technical difficulties. The core of our approach centers on the notion of support provided by the data to a hypothesis through a CD. We introduce two distinct but related types of support, the choice of which depends on the hypothesis under consideration. They are always coherent, easy to interpret and to compute, even in case of interval hypotheses, contrary to what happens for the p -value. The flexibility, simplicity, and effectiveness of our proposal are illustrated by several examples from various fields and a simulation study. We have postponed the presentation of theoretical results, comparisons with other proposals found in the literature, as well as the connections with standard hypothesis testing concepts such as size, significance level, optimality, and p -values to the end of the paper to enhance its readability.

The paper is structured as follows: In Sect. 2 , we provide a review of the CD’s definition and the primary methods for its construction, with a particular focus on distinctive aspects that arise when dealing with discrete models (Sect. 2.1 ). Section 3 explores the application of the CD in hypothesis testing and introduces the two notions of support. In Sect. 4 , we discuss several examples to illustrate the benefits of utilizing the CD in various scenarios, offering comparisons with traditional p -values. Theoretical results about tests based on the CD and comparisons with other measures of support or plausibility for hypotheses are presented in Sect. 5 . Finally, in Sect. 6 , we summarize the paper’s findings and provide concluding remarks. For convenience, a table of CDs for some common statistical models can be found in Appendix A, while all the proofs of the propositions are presented in Appendix B.

2 Confidence distributions

The modern definition of confidence distribution for a real parameter \(\theta \) of interest, see Schweder & Hjort ( 2002 ; 2016 , sec. 3.2) and Singh et al. ( 2005 ; 2007 ) can be formulated as follows:

Definition 1

Let \(\{P_{\theta ,\varvec{\lambda }},\theta \in \Theta \subseteq \mathbb {R}, \varvec{\lambda }\in \varvec{\Lambda }\}\) be a parametric model for data \(\textbf{X}\in {\mathcal {X}}\) ; here \(\theta \) is the parameter of interest and \(\varvec{\lambda }\) is a nuisance parameter. A function H of \(\textbf{X}\) and \(\theta \) is called a confidence distribution for \(\theta \) if: i) for each value \(\textbf{x}\) of \(\textbf{X}\) , \(H(\textbf{x},\cdot )=H_{\textbf{x}}(\cdot )\) is a continuous distribution function on \(\Theta \) ; ii) \(H(\textbf{X},\theta )\) , seen as a function of the random element \(\textbf{X}\) , has the uniform distribution on (0, 1), whatever the true parameter value \((\theta , \varvec{\lambda })\) . The function H is an asymptotic confidence distribution if the continuity requirement in i) is removed and ii) is replaced by: ii) \(^{\prime }\) \(H(\textbf{X},\theta )\) converges in law to the uniform distribution on (0, 1) for the sample size going to infinity, whatever the true parameter value \((\theta , \varvec{\lambda })\) .

The CD theory is placed in a purely frequentist context and the uniformity of the distribution ensures the correct coverage of the confidence intervals. The CD should be regarded as a distribution estimator of a parameter \(\theta \) and its mean, median or mode can serve as point estimates of \(\theta \) , see Xie and Singh ( 2013 ) for a detailed discussion. In essence, the CD can be employed in a manner similar to a Bayesian posterior distribution, but its interpretation differs and does not necessitate any prior distribution. Closely related to the CD is the confidence curve (CC) which, given an observation \(\textbf{x}\) , is defined as \( CC_{\textbf{x}}(\theta )=|1-2H_{\textbf{x}}(\theta )|\) ; see Schweder and Hjort ( 2002 ). This function provides the boundary points of equal-tailed confidence intervals for any level \(1-\alpha \) , with \(0<\alpha <1\) , and offers an immediate visualization of their length.

Various procedures can be adopted to obtain exact or asymptotic CDs starting, for example, from pivotal functions, likelihood functions and bootstrap distributions, as detailed in Singh et al. ( 2007 ), Xie and Singh ( 2013 ), Schweder and Hjort ( 2016 ). A CD (or an asymptotic CD) can also be derived directly from a real statistic T , provided that its exact or asymptotic distribution function \(F_{\theta }(t)\) is a continuously monotonic function in \(\theta \) and its limits are 0 and 1 as \(\theta \) approaches its boundaries. For example, if \(F_{\theta }(t)\) is nonincreasing, we can define

Furthermore, if \(H_t(\theta )\) is differentiable in \(\theta \) , we can obtain the CD-density \(h_t(\theta )=-({\partial }/{\partial \theta }) F_{\theta }(t)\) , which coincides with the fiducial density suggested by Fisher. In particular, when the statistical model belongs to the real regular natural exponential family (NEF) with natural parameter \(\theta \) and sufficient statistic T , there always exists an “optimal” CD for \(\theta \) which is given by ( 1 ), see Veronese and Melilli ( 2015 ).

The CDs based on a real statistic play an important role in hypothesis testing. In this setting remarkable results are obtained when the model has a monotone likelihood ratio (MLR). We recall that if \(\textbf{X}\) is a random vector distributed according to the family \(\{p_\theta , \theta \in \Theta \subseteq \mathbb {R}\}\) , this family is said to have MLR in the real statistic \(T(\textbf{X})\) if, for any \(\theta _1 <\theta _2\) , the ratio \(p_{\theta _2}(\textbf{x})/p_{\theta _1}(\textbf{x})\) is a nondecreasing function of \(T(\textbf{x})\) for values of \(\textbf{x}\) at which at least one of \(p_{\theta _1}(\textbf{x})\) and \(p_{\theta _2}(\textbf{x})\) is positive. Furthermore, for such families, it holds that \(F_{\theta _2}(t) \le F_{\theta _1}(t)\) for each t , see Shao ( 2003 , Sec. 6.1.2). Families with MLR not only allow the construction of Uniformly Most Powerful (UMP) tests in various scenarios but also identify the statistic T , which can be employed in constructing the CD for \(\theta \) . Indeed, because \(F_\theta (t)\) is nonincreasing in \(\theta \) for each t , \(H_t(\theta )\) can be defined as in ( 1 ), provided the conditions of continuity and limits of \(F_{\theta }(t)\) are met. Of course, if the likelihood ratio is nonincreasing in T , a similar result holds and the CD for \(\theta \) is \(H_t(\theta )=F_\theta (t)\) .

An interesting characteristic of the CD that validates its suitability for use in a testing problem is its consistency , meaning that it increasingly concentrates around the “true” value of \(\theta \) as the sample size grows, leading to the correct decision.

Definition 2

The sequence of CDs \(H(\textbf{X}_n, \cdot )\) is consistent at some \(\theta _0 \in \Theta \) if, for every neighborhood U of \(\theta _0\) , \(\int _U dH(\textbf{X}_n, \theta ) \rightarrow 1\) , as \(n\rightarrow \infty \) , in probability under \(\theta _0\) .

The following proposition provides some useful asymptotic properties of a CD for independent identically distributed (i.i.d.) random variables.

Proposition 1

Let \(X_1,X_2,\ldots \) be a sequence of i.i.d. random variables from a distribution function \(F_{\theta }\) , parameterized by a real parameter \(\theta \) , and let \(H_{\textbf{x}_n}\) be the CD for \(\theta \) based on \(\textbf{x}_n=(x_1, \ldots , x_n)\) . If \(\theta _0\) denotes the true value of \(\theta \) , then \(H(\textbf{X}_n, \cdot )\) is consistent at \(\theta _0\) if one of the following conditions holds:

i) \(F_{\theta }\) belongs to a NEF;

ii) \(F_{\theta }\) is a continuous distribution function and standard regularity assumptions hold;

iii) the expected value and variance of \(H(\textbf{X}_n, \cdot )\) converge, for \(n\rightarrow \infty \) , to \(\theta _0\) and 0, respectively, in probability under \(\theta _0\) .

Finally, if i) or ii) holds, the CD is asymptotically normal.

Table 8 in Appendix A provides a list of CDs for various standard models. Here, we present two basic examples, while numerous others will be covered in Sect. 4 within an inferential and testing framework.

Example 1 ( Normal model ) Let \(\textbf{X}=(X_1,\ldots ,X_n)\) be an i.i.d. sample from a normal distribution N \((\mu ,\sigma ^2)\) , with \(\sigma ^2\) known. A standard pivotal function is \(Q({\bar{X}}, \mu )=\sqrt{n}({\bar{X}}-\mu )/ \sigma \) , where \(\bar{X}=\sum X_i/n\) . Since \(Q({\bar{X}}, \mu )\) is decreasing in \(\mu \) and has the standard normal distribution \(\Phi \) , the CD for \(\mu \) is \(H_{\bar{x}}(\mu )=1-\Phi (\sqrt{n}({\bar{x}}-\mu )/ \sigma )=\Phi (\sqrt{n}(\mu -{\bar{x}})/ \sigma )\) , that is a N \(({\bar{x}},\sigma /\sqrt{n})\) . When the variance is unknown we can use the pivotal function \(Q({\bar{X}}, \mu )=\sqrt{n}({\bar{X}}-\mu )/S\) , where \(S^2=\sum (X_i-\bar{X})^2/(n-1)\) , and the CD for \(\mu \) is \(H_{{\bar{x}},s}(\mu )=1-F^{T_{n-1}}(\sqrt{n}({\bar{x}}-\mu )/ s)=F^{T_{n-1}}(\sqrt{n}(\mu -{\bar{x}})/ s)\) , where \(F^{T_{n-1}}\) is the t-distribution function with \(n-1\) degrees of freedom.
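As a quick numerical illustration of this construction, the short R sketch below (with illustrative values for \(\bar{x}\) , \(\sigma \) and n , not taken from the paper) evaluates the CD, its density and an equal-tailed 0.95 confidence interval obtained from the CD quantiles.

```r
# CD for the mean of a normal model with known sigma: N(xbar, sigma/sqrt(n))
xbar <- 2.7; sigma <- 1; n <- 25                                # illustrative values
H <- function(mu) pnorm(sqrt(n) * (mu - xbar) / sigma)          # CD evaluated at mu
h <- function(mu) dnorm(mu, mean = xbar, sd = sigma / sqrt(n))  # CD-density
qnorm(c(0.025, 0.975), mean = xbar, sd = sigma / sqrt(n))       # 0.95 interval from the CD
```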

Example 2 ( Uniform model ) Let \(\textbf{X}=(X_1,\ldots ,X_n)\) be an i.i.d. sample from the uniform distribution on \((0,\theta )\) , \(\theta >0\) . Consider the (sufficient) statistic \(T=\max (X_1, \ldots ,X_n)\) whose distribution function is \(F_\theta (t)=(t/\theta )^n\) , for \(0<t<\theta \) . Because \(F_\theta (t)\) is decreasing in \(\theta \) and the limit conditions are satisfied for \(\theta >t\) , the CD for \(\theta \) is \(H_t(\theta )=1-(t/\theta )^n\) , i.e. a Pareto distribution \(\text {Pa}(n, t)\) with parameters n (shape) and t (scale). Since the uniform distribution is not regular, the consistency of the CD follows from condition iii) of Proposition 1 . This is because \(E^{H_{t}}(\theta )=nt/(n-1)\) and \(Var^{H_{t}}(\theta )=nt^2/((n-2)(n-1)^2)\) , so that, for \(n\rightarrow \infty \) , \(E^{H_{t}}(\theta ) \rightarrow \theta _0\) (from the strong consistency of the estimator T of \(\theta \) , see e.g. Shao 2003 , p.134) and \(Var^{H_{t}}(\theta )\rightarrow 0\) trivially.

2.1 Peculiarities of confidence distributions for discrete models

When the model is discrete, clearly we can only derive asymptotic CDs. However, a crucial question arises regarding uniqueness. Since \(F_{\theta }(t)=\text{ Pr}_\theta \{T \le t\}\) does not coincide with Pr \(_\theta \{T<t\}\) for any value t within the support \({\mathcal {T}}\) of T , it is possible to define two distinct “extreme” CDs. If \(F_\theta (t)\) is nonincreasing in \(\theta \) , we refer to the right CD as \(H_{t}^r(\theta )=1-\text{ Pr}_\theta \{T\le t\}\) and to the left CD as \(H_{t}^\ell (\theta )=1-\text{ Pr}_\theta \{T<t\}\) . Note that \(H_{t}^r(\theta ) < H_{t}^\ell (\theta )\) , for every \(t \in {{\mathcal {T}}}\) and \(\theta \in \Theta \) , so that the center (i.e. the mean or the median) of \(H_{t}^r(\theta )\) is greater than that of \(H_{t}^\ell (\theta )\) . If \(F_\theta (t)\) is increasing in \(\theta \) , we define \( H_{t}^\ell (\theta )=F_\theta (t)\) and \(H^r_t(\theta )=\text{ Pr}_\theta \{T<t\}\) and once again \(H_{t}^r(\theta ) < H_{t}^\ell (\theta )\) . Veronese & Melilli ( 2018b , sec. 3.2) suggest overcoming this nonuniqueness by averaging the CD-densities \(h_t^r\) and \(h_t^\ell \) using the geometric mean \(h_t^g(\theta )\propto \sqrt{h_t^r(\theta )h_t^\ell (\theta )}\) . This typically results in a simpler CD compared to the one obtained through the arithmetic mean, with smaller confidence intervals. Note that the (asymptotic) CD defined in ( 1 ) for discrete models corresponds to the right CD, and it is more appropriately referred to as \(H_t^r(\theta )\) hereafter. Clearly, \(H_{t}^\ell (\theta )\) can be obtained from \(H_{t}^r(\theta )\) by replacing t with its preceding value in the support \({\mathcal {T}}\) . For discrete models, the table in Appendix A reports \(H_{t}^r(\theta )\) , \(H_{t}^\ell (\theta )\) and \(H_t^g(\theta )\) . Compared to \(H^{\ell }_t\) and \(H^r_t\) , \(H^g_t\) offers the advantage of closely approximating a uniform distribution when viewed as a function of the random variable T .

Proposition 2

Given a discrete statistic T with distribution indexed by a real parameter \(\theta \in \Theta \) and support \({{\mathcal {T}}}\) independent of \(\theta \) , assume that, for each \(\theta \in \Theta \) and \(t\in {\mathcal {T}}\) , \(H^r_t(\theta )< H^g_t(\theta ) < H^{\ell }_t(\theta )\) . Then, denoting by \(G^j\) the distribution function of \(H^j_T\) , with \(j=\ell ,g,r\) , we have \(G^\ell (u) \le u \le G^r(u)\) . Furthermore,

Notice that the assumption in Proposition 2 is always satisfied when the model belongs to a NEF, see Veronese and Melilli ( 2018a ).

The possibility of constructing different CDs using the same discrete statistic T plays an important role in connection with standard p -values, as we will see in Sect. 5 .

Example 3 (Binomial model) Let \(\textbf{X}=(X_1,\ldots , X_n)\) be an i.i.d. sample from a binomial distribution Bi(1,  p ) with success probability p . Then \(T=\sum _{i=1}^n X_i\) is distributed as a Bi( n ,  p ) and by ( 1 ), recalling the well-known relationship between the binomial and beta distributions, it follows that the right CD for p is a Be( \(t+1,n-t\) ) for \(t=0,1,\ldots , n-1\) . Furthermore, the left CD is a Be( \(t,n-t+1\) ) and it easily follows that \(H_t^g(p)\) is a Be( \(t+1/2,n-t+1/2\) ). Figure 1 shows the corresponding three CD-densities along with their respective CCs, emphasizing the central position of \(h_t^g(p)\) and its confidence intervals in comparison to \(h_t^\ell (p)\) and \(h^r_t(p)\) .

Figure 1: (Binomial model) CD-densities (left plot) and CCs (right plot) corresponding to \(H_t^g(p)\) (solid lines), \(H_t^{\ell }(p)\) (dashed lines) and \(H_t^r(p)\) (dotted lines) for the parameter p with \(n=15\) and \(t=5\) . In the CC plot, the horizontal dotted line is at level 0.95.
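As a complement to Example 3 and Fig. 1, the following R sketch (values n = 15 and t = 5 as in the figure) computes the three CDs, the confidence curve based on \(H_t^g\) and the corresponding equal-tailed 0.95 interval; the beta parameters are exactly those given in the text.

```r
n <- 15; t <- 5
# right, left and geometric-mean CDs for p (Example 3)
H_r <- function(p) pbeta(p, t + 1,   n - t)        # Be(t+1, n-t)
H_l <- function(p) pbeta(p, t,       n - t + 1)    # Be(t, n-t+1)
H_g <- function(p) pbeta(p, t + 0.5, n - t + 0.5)  # Be(t+1/2, n-t+1/2)
CC_g <- function(p) abs(1 - 2 * H_g(p))            # confidence curve from H_g
qbeta(c(0.025, 0.975), t + 0.5, n - t + 0.5)       # equal-tailed 0.95 interval
```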

3 Confidence distributions in testing problems

As mentioned in Sect. 1 , we believe that introducing a CD can serve as a valuable and unifying approach, compelling individuals to think more deeply about the specific problem they aim to address rather than resorting to automatic rules. In fact, the availability of a whole distribution for the parameter of interest equips statisticians and practitioners with a versatile tool for handling a wide range of inference tasks, such as point and interval estimation, hypothesis testing, and more, without the need for ad hoc procedures. Here, we will address the issue in the simplest manner, referring to Sect. 5 for connections with related ideas in the literature and additional technical details.

Given a set \(A \subseteq \Theta \subseteq \mathbb {R}\) , it seems natural to measure the “support” that the data \(\textbf{x}\) provide to A through the CD \(H_{\textbf{x}}\) , as \(CD(A)=H_{\textbf{x}}(A)= \int _{A} dH_{\textbf{x}}(\theta )\) . Notice that, with a slight abuse of notation widely used in the literature (see e.g., Singh et al. 2007 , who call \(H_{\textbf{x}}(A)\) strong-support ), we use \(H_{\textbf{x}}(\theta )\) to indicate the distribution function on \(\Theta \subseteq \mathbb {R}\) evaluated at \(\theta \) and \(H_{\textbf{x}}(A)\) to denote the mass that \(H_{\textbf{x}}\) induces on a (measurable) subset \(A\subseteq \Theta \) . It immediately follows that to compare the plausibility of k different hypotheses \({{\mathcal {H}}}_{i}: \theta \in \Theta _i\) , \(i=1,\ldots ,k\) , with \(\Theta _i \subseteq \Theta \) not being a singleton, it is enough to compute each \(H_{\textbf{x}}(\Theta _i)\) . We will call \(H_{\textbf{x}}(\Theta _i)\) the CD-support provided by \(H_{\textbf{x}}\) to the set \(\Theta _i\) . In particular, consider the usual case in which we have two hypotheses \({{{\mathcal {H}}}_{0}}: \theta \in \Theta _0\) and \({{{\mathcal {H}}}_{1}}: \theta \in \Theta _1\) , with \(\Theta _0 \cap \Theta _1= \emptyset \) , \(\Theta _0 \cup \Theta _1 = \Theta \) and assume that \({{{\mathcal {H}}}_{0}}\) is not a precise hypothesis (i.e. is not of type \(\theta =\theta _0\) ). As in the Bayesian approach one can compute the posterior odds, here we can evaluate the confidence odds \(CO_{0,1}\) of \({{{\mathcal {H}}}_{0}}\) against \({{{\mathcal {H}}}_{1}}\)

\(CO_{0,1}=H_{\textbf{x}}(\Theta _0)/H_{\textbf{x}}(\Theta _1). \qquad (2)\)

If \(CO_{0,1}\) is greater than one, the data support \({{{\mathcal {H}}}_{0}}\) more than \({{{\mathcal {H}}}_{1}}\) and this support clearly increases with \(CO_{0,1}\) . Sometimes this type of information can be sufficient to gauge the reasonableness of the hypotheses, but if we need to make a decision, we can include the confidence odds in a full decision setting. Thus, writing the decision space as \({{\mathcal {D}}}=\{0,1\}\) , where i indicates accepting \({{{\mathcal {H}}}}_i\) , for \(i=0,1\) , a penalization for the two possible errors must be specified. A simple loss function is

\(L(\theta , \delta )= a_0\,\textbf{1}\{\theta \in \Theta _0,\ \delta =1\} + a_1\,\textbf{1}\{\theta \in \Theta _1,\ \delta =0\}, \qquad (3)\)

where \(\delta \) denotes the decision taken and \(a_i >0\) , \(i=0,1\) . The optimal decision is the one that minimizes the (expected) confidence loss

\({\bar{L}}(\textbf{x}, \delta )= a_0 H_{\textbf{x}}(\Theta _0)\,\textbf{1}\{\delta =1\} + a_1 H_{\textbf{x}}(\Theta _1)\,\textbf{1}\{\delta =0\}.\)

Therefore, we will choose \({{{\mathcal {H}}}_{0}}\) if \(a_0 H_{\textbf{x}}(\Theta _0) > a_1 H_{\textbf{x}}(\Theta _1)\) , that is if \(CO_{0,1}>a_1/a_0\) or equivalently if \(H_{\textbf{x}}(\Theta _0)>a_1/(a_0+a_1)=\gamma \) . Clearly, if there is no reason to penalize the two errors differently by setting an appropriate value for the ratio \(a_1/a_0\) , we assume \(a_0=a_1\) , so that \(\gamma =0.5\) . This implies that the chosen hypothesis will be the one receiving the higher CD-support. Therefore, we state the following

Definition 3

Given the two (non precise) hypotheses \({{\mathcal {H}}}_i: \theta \in \Theta _i\) , \(i=0,1\) , the CD-support of \({{\mathcal {H}}}_i\) is defined as \(H_{\textbf{x}}(\Theta _i)\) . The hypothesis \({{{\mathcal {H}}}_{0}}\) is rejected according to the CD-test if the CD-support is less than a fixed threshold \(\gamma \) depending on the loss function ( 3 ) or, equivalently, if the confidence odds \(CO_{0,1}\) are less than \(a_1/a_0=\gamma /(1-\gamma )\) .
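In practice the CD-test of Definition 3 amounts to comparing a probability computed from the CD with a threshold. The R sketch below (the helper name and the illustrative numbers are ours) computes the CD-support, the confidence odds and the resulting decision for a one-sided \(\Theta _0\) under a generic CD H.

```r
# CD-test of Definition 3 for H0: theta in Theta_0 = (-Inf, theta0]
cd_test <- function(H, theta0, a0 = 1, a1 = 1) {
  cd0   <- H(theta0)              # CD-support of Theta_0
  odds  <- cd0 / (1 - cd0)        # confidence odds CO_{0,1}
  gamma <- a1 / (a0 + a1)         # threshold implied by the loss ratio a1/a0
  list(support = cd0, odds = odds, reject_H0 = cd0 < gamma)
}
# illustration with the normal CD of Example 1 (illustrative data)
H <- function(mu) pnorm(sqrt(25) * (mu - 2.7) / 1)
cd_test(H, theta0 = 2.5)
```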

Unfortunately, the previous notion of CD-support fails for a precise hypothesis \({{{\mathcal {H}}}_{0}}:\theta =\theta _0\) , since in this case \(H_{\textbf{x}}(\{\theta _0\})\) trivially equals zero. Notice that the problem cannot be solved by transforming \({{{\mathcal {H}}}_{0}}:\theta =\theta _0\) into the seemingly more reasonable \({{{\mathcal {H}}}_{0}}^{\prime }:\theta \in [\theta _0-\epsilon , \theta _0+\epsilon ]\) because, apart from the arbitrariness of \(\epsilon \) , the CD-support of very narrow intervals would typically remain negligible. We thus introduce an alternative way to assess the plausibility of a precise hypothesis or, more generally, of a “small” interval hypothesis.

Consider first \({{{\mathcal {H}}}_{0}}:\theta =\theta _0\) and assume, as usual, that \(H_{\textbf{x}}(\theta )\) is a CD for \(\theta \) , based on the data \(\textbf{x}\) . Looking at the confidence curve \(CC_{\textbf{x}}(\theta )=|1-2H_{\textbf{x}}(\theta )|\) in Fig. 2 , it is reasonable to assume that the closer \(\theta _0\) is to the median \(\theta _m\) of the CD, the greater the consistency of the value of \(\theta _0\) with respect to \(\textbf{x}\) . Conversely, the complement to 1 of the CC represents the unconsidered confidence relating to both tails of the distribution. We can thus define a measure of plausibility for \({{{\mathcal {H}}}_{0}}:\theta =\theta _0\) as \((1-CC_{\textbf{x}}(\theta _0))/2\) and this measure will be referred to as the CD*-support given by \(\textbf{x}\) to the hypothesis. It is immediate to see that

\((1-CC_{\textbf{x}}(\theta _0))/2=\min \{H_{\textbf{x}}(\theta _0),\, 1-H_{\textbf{x}}(\theta _0)\}. \qquad (4)\)

In other words, if \(\theta _0 < \theta _m\) \([\theta _0 > \theta _m]\) the CD*-support is \(H_{\textbf{x}}(\theta _0)\) \([1-H_{\textbf{x}}(\theta _0)]\) and corresponds to the CD-support of all \(\theta \) ’s that are less plausible than \(\theta _0\) among those located on the left [right] side of the CC . Clearly, if \(\theta _0 = \theta _m\) the CD*-support equals 1/2, its maximum value. Notice that in this case no alternative hypothesis is considered and that the CD*-support provides a measure of plausibility for \(\theta _0\) by examining “the direction of the observed departure from the null hypothesis”. This quotation is derived from Gibbons and Pratt ( 1975 ) and was originally stated to support their preference for reporting a one-tailed p -value over a two-tailed one. Here we are in a similar context and we refer to their paper for a detailed discussion of this recommendation.

Figure 2: The CD*-supports of the points \(\theta _0\) , \(\theta _1\) , \(\theta _m\) and \(\theta _2\) correspond to half of the solid vertical lines and are given by \(H_{\textbf{x}}(\theta _0)\) , \(H_{\textbf{x}}(\theta _1)\) , \(H_{\textbf{x}}(\theta _m)=1/2\) and \(1-H_{\textbf{x}}(\theta _2)\) , respectively.

An alternative way to intuitively justify formula ( 4 ) is as follows. Since \(H_{\textbf{x}}(\{\theta _0\})=0\) , we can look at the set K of values of \(\theta \) which are in some sense “more consistent” with the observed data \(\textbf{x}\) than \(\theta _0\) , and define the plausibility of \({{{\mathcal {H}}}_{0}}\) as \(1-H_{\textbf{x}}(K)\) . This procedure was followed in a Bayesian framework by Pereira et al. ( 1999 ) and Pereira et al. ( 2008 ) who, in order to identify K , rely on the posterior distribution of \(\theta \) and focus on its mode. We refer to these papers for a more detailed discussion of this idea. Here we emphasize only that the evidence \(1-H_{\textbf{x}}(K)\) supporting \({{{\mathcal {H}}}_{0}}\) cannot be considered as evidence against a possible alternative hypothesis. In our context, the set K can be identified as the set \(\{\theta \in \Theta : \theta < \theta _0\}\) if \(H_{\textbf{x}}(\theta _0)>1-H_{\textbf{x}}(\theta _0)\) or as \(\{\theta \in \Theta : \theta >\theta _0\}\) if \(H_{\textbf{x}}(\theta _0)\le 1-H_{\textbf{x}}(\theta _0)\) . It follows immediately that \(1-H_{\textbf{x}}(K)=\min \{H_{\textbf{x}}(\theta _0), 1-H_{\textbf{x}}(\theta _0)\}\) , which coincides with the CD*-support given in ( 4 ).

We can readily extend the previous definition of CD*-support to interval hypotheses \({{{\mathcal {H}}}_{0}}:\theta \in [\theta _1, \theta _2]\) . This extension becomes particularly pertinent when dealing with small intervals, where the CD-support may prove ineffective. In such cases, the set K of \(\theta \) values that are “more consistent” with the data \(\textbf{x}\) than those falling within the interval \([\theta _1, \theta _2]\) should clearly exclude this interval. Instead, it should include one of the two tails, namely, either \(\{\theta \in \Theta : \theta < \theta _1\}\) or \(\{\theta \in \Theta : \theta > \theta _2\}\) , depending on which one receives the greater mass from the CD. Then

\(H_{\textbf{x}}(K)=\max \{H_{\textbf{x}}(\theta _1),\, 1-H_{\textbf{x}}(\theta _2)\},\)

so that the CD*-support of the interval \([\theta _1,\theta _2]\) is \(\text{ CD* }([\theta _1,\theta _2])=1-H_{\textbf{x}}(K)=\min \{H_{\textbf{x}}(\theta _2), 1-H_{\textbf{x}}(\theta _1)\}\) , which reduces to ( 4 ) in the case of a degenerate interval (i.e., when \(\theta _1=\theta _2=\theta _0\) ). Therefore, we can establish the following

Definition 4

Given the hypothesis \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) , with \(\theta _1 \le \theta _2 \) , the CD*-support of \({{{\mathcal {H}}}_{0}}\) is defined as \(\min \{H_{\textbf{x}}(\theta _2), 1-H_{\textbf{x}}(\theta _1)\}\) . If \(H_{\textbf{x}}(\theta _2) <1-H_{\textbf{x}}(\theta _1)\) , it is more reasonable to consider values of \(\theta \) greater than those specified by \({{{\mathcal {H}}}_{0}}\) ; the opposite holds in the reverse situation. Furthermore, the hypothesis \({{{\mathcal {H}}}_{0}}\) is rejected according to the CD*-test if its CD*-support is less than a fixed threshold \(\gamma ^*\) .
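In code, Definition 4 is a one-line rule; the helper below (the name is ours) also covers the precise case \(\theta _1=\theta _2\) , where it reduces to ( 4 ).

```r
# CD*-support of H0: theta in [theta1, theta2] for a CD H (Definition 4)
cd_star <- function(H, theta1, theta2 = theta1) {
  min(H(theta2), 1 - H(theta1))   # equals min{H(theta0), 1 - H(theta0)} when theta1 = theta2
}
```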

The definition of CD*-support has been established for bounded interval (or precise) hypotheses. However, it can be readily extended to one-sided intervals such as \((-\infty , \theta _0]\) or \([\theta _0, +\infty )\) , but in these cases, it is evident that the CD*- and the CD-support are equivalent. For a general interval hypothesis we observe that \(H_{\textbf{x}}([\theta _1, \theta _2])\le \min \{H_{\textbf{x}}(\theta _2), 1-H_{\textbf{x}}(\theta _1)\}\) . Consequently, the CD-support can never exceed the CD*-support, even though they exhibit significant similarity when \(\theta _1\) or \(\theta _2\) resides in the extreme region of one tail of the CD or when the CD is highly concentrated (see Examples 4 , 6 and 7 ).

It is crucial to emphasize that both CD-support and CD*-support are coherent measures of the evidence provided by the data for a hypothesis. This coherence arises from the fact that if \({{{\mathcal {H}}}_{0}}\subset {{{\mathcal {H}}}_{0}}^{\prime }\) , both the supports for \({{{\mathcal {H}}}_{0}}^{\prime }\) cannot be less than those for \({{{\mathcal {H}}}_{0}}\) . This is in stark contrast to the behavior of p -values, as demonstrated in Schervish ( 1996 ), Peskun ( 2020 ), and illustrated in Examples 4 and 7 .

Finally, as seen in Sect. 2.1 , various options for CDs are available for discrete models. Unless a specific problem suggests otherwise (see Sect. 5.1 ), we recommend using the geometric mean \(H_t^g\) as it offers a more impartial treatment of \({{{\mathcal {H}}}_{0}}\) and \({{{\mathcal {H}}}_{1}}\) , as shown in Proposition 2 .

4 Examples

In this section, we illustrate the behavior, effectiveness, and simplicity of the CD- and CD*-supports in an inferential context through several examples. We examine various settings to assess the flexibility and consistency of our approach and compare it with the standard one. It is worth noting that the computation of the p -value for interval hypotheses is challenging and generally does not have a closed form.

Example 4 ( Normal model ) As seen in Example 1 , the CD for the mean \(\mu \) of a normal model is N \(({\bar{x}},\sigma /\sqrt{n})\) , for \(\sigma \) known. For simplicity, we assume this case; otherwise, the CD would be a t-distribution. Figure 3 shows the CD-density and the corresponding CC for \({\bar{x}}=2.7\) with three different values of \(\sigma /\sqrt{n}\) : \(1/\sqrt{50}=0.141\) , \(1/\sqrt{25}=0.2\) and \(1/\sqrt{10}=0.316\) .

The observed \({\bar{x}}\) specifies the center of both the CD and the CC, and values of \(\mu \) that are far from it receive less support the smaller the dispersion \(\sigma /\sqrt{n}\) of the CD. Alternatively, values of \(\mu \) within the CC, i.e., within the confidence interval of a specific level, are more reasonable than values outside it. These values become more plausible as the level of the interval decreases. Table 1 clarifies these points by providing the CD-support, confidence odds, CD*-support, and the p -value of the UMPU test for different interval hypotheses and different values of \(\sigma /\sqrt{n}\) .

Figure 3: (Normal model) CD-densities (left plot) and CCs (right plot) for \(\mu \) with \({\bar{x}}=2.7\) and three values of \(\sigma /\sqrt{n}\) : \(1/\sqrt{50}\) (solid line), \(1/\sqrt{25}\) (dashed line) and \(1/\sqrt{10}\) (dotted line). In the CC plot the dotted horizontal line is at level 0.95.

It can be observed that when the interval is sufficiently large, e.g., [2.0, 2.5], the CD- and the CD*-supports are similar. However, for smaller intervals, as in the other three cases, the difference between the CD- and the CD*-support increases with the dispersion \(\sigma /\sqrt{n}\) of the CD, regardless of whether the interval contains the observation \({\bar{x}}\) or not. These aspects are general, although the details depend on the form of the CD. Therefore, a comparison between these two measures can be useful to clarify whether an interval should be regarded as small for the problem under analysis. Regarding the p -value of the UMPU test (see Schervish 1996 , equation 2), it is similar to the CD*-support when the interval is large (first case). However, in the other cases the difference increases as the dispersion grows. Furthermore, enlarging the interval from [2.4, 2.6] to [2.3, 2.6] (not reported in Table 1 ) leaves the CD*-supports unchanged but reduces the p -values to 0.241, 0.331, and 0.479 for the three considered dispersions. This once again highlights the incoherence of the p -value as a measure of the plausibility of a hypothesis.

Now, consider a precise hypothesis, for instance, \({{{\mathcal {H}}}_{0}}:\mu =2.35\) . For the three values used for \(\sigma /\sqrt{n}\) , the CD*-supports are 0.007, 0.040, and 0.134, respectively. From Fig. 3 , it is evident that the point \(\mu =2.35\) lies to the left of the median of the CD. Consequently, the data suggest values of \(\mu \) larger than 2.35. Furthermore, looking at the CC, it becomes apparent that 2.35 is not encompassed within the confidence interval of level 0.95 when \(\sigma /\sqrt{n}=1/\sqrt{50}\) , contrary to what occurs in the other two cases. Due to the symmetry of the normal model, the UMPU test coincides with the equal-tailed test, so that the p -value is equal to 2 times the CD*-support (see Remark 4 in Sect. 5.2 ). Furthermore, the size of the CD*-test is \(2\gamma ^*\) , where \(\gamma ^*\) is the threshold fixed to decide whether to reject the hypothesis or not (see Proposition 5 ). Thus, if a test of level 0.05 is desired, it is sufficient to fix \(\gamma ^*=0.025\) , and both the CD*-support and the p -value lead to the same decision, namely, rejecting \({{{\mathcal {H}}}_{0}}\) only for the case \(\sigma /\sqrt{n}=0.141\) .
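The supports quoted in this example can be reproduced directly from the normal CD; a minimal R check (values exactly those used above):

```r
xbar <- 2.7
s <- c(1 / sqrt(50), 1 / sqrt(25), 1 / sqrt(10))   # the three values of sigma/sqrt(n)
H <- function(mu) pnorm((mu - xbar) / s)           # CD of Example 4, vectorized over s
H(2.5) - H(2.0)                                    # CD-support of mu in [2.0, 2.5]
pmin(H(2.5), 1 - H(2.0))                           # CD*-support of mu in [2.0, 2.5]
pmin(H(2.35), 1 - H(2.35))                         # CD*-support of mu = 2.35: 0.007, 0.040, 0.134
```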

To assess the effectiveness of the CD*-support, we conduct a brief simulation study. For different values of \(\mu \) , we generate 100000 values of \({\bar{x}}\) from a normal distribution with mean \(\mu \) and various standard deviations \(\sigma /\sqrt{n}\) . We obtain the corresponding CDs with the CD*-supports and also compute the p -values. In Table 2 , we consider \({{{\mathcal {H}}}_{0}}: \mu \in [2.0, 2.5]\) and the performance of the CD*-support can be evaluated by looking, for example, at the proportions of values in the intervals [0, 0.4), [0.4, 0.6) and [0.6, 1]. Values of the CD*-support in the first interval suggest a low plausibility of \({{{\mathcal {H}}}_{0}}\) in the light of the data, while values in the third one suggest a high plausibility. We highlight the proportions of incorrect evaluations in boldface. The last column of the table reports the proportion of errors resulting from the use of the standard procedure based on the p -value with a threshold of 0.05. Note how the proportion of errors related to the CD*-support is generally quite low, with a maximum value of 0.301, contrary to what happens for the automatic procedure based on the p -value, which reaches an error proportion of 0.845. Notice that the maximum error due to the CD*-support is obtained when \({{{\mathcal {H}}}_{0}}\) is true, while that due to the p -value is obtained in the opposite case, as expected.
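A condensed version of this simulation can be coded as follows (our own sketch; the bin edges approximate the intervals given in the text, and the values of \(\mu \) and \(\sigma /\sqrt{n}\) used in the calls are illustrative).

```r
set.seed(1)
sim_cdstar <- function(mu, s, B = 1e5, t1 = 2.0, t2 = 2.5) {
  xbar <- rnorm(B, mean = mu, sd = s)               # simulated sample means
  cds  <- pmin(pnorm((t2 - xbar) / s),              # CD*-support of [t1, t2]
               1 - pnorm((t1 - xbar) / s))
  prop.table(table(cut(cds, breaks = c(0, 0.4, 0.6, 1), include.lowest = TRUE)))
}
sim_cdstar(mu = 2.25, s = 1 / sqrt(25))   # a value of mu inside H0
sim_cdstar(mu = 2.75, s = 1 / sqrt(25))   # a value of mu outside H0
```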

We consider now the two hypotheses \({{{\mathcal {H}}}_{0}}:\mu =2.35\) and \({{{\mathcal {H}}}_{0}}: \mu \in [2.75,2.85]\) . Notice that the interval in the second hypothesis should be regarded as small, because it can be checked that the CD- and CD*-supports consistently differ, as can be seen for example in Table 1 for the case \({\bar{x}}=2.7\) . Thus, this hypothesis can be considered not too different from a precise one. Because for a precise hypothesis the CD*-support cannot be larger than 0.5, to evaluate the performance of the CD*-support we can consider the three intervals [0, 0.2), [0.2, 0.3) and [0.3, 0.5].

Table 3 reports the results of the simulation, including again the proportion of errors resulting from the use of the p -value with threshold 0.05. For the precise hypothesis \({{{\mathcal {H}}}_{0}}: \mu =2.35\) , the proportion of values of the CD*-support less than 0.2 when \(\mu =2.35\) is, whatever the standard deviation, approximately equal to 0.4. This depends on the fact that for a precise hypothesis, the CD*-support has a uniform distribution on the interval [0, 0.5], see Proposition 5 . This aspect must be taken into careful consideration when setting a threshold for a CD*-test. On the other hand, the proportion of values of the CD*-support in the interval [0.3, 0.5], which wrongly support \({{{\mathcal {H}}}_{0}}\) when it is false, goes from 0.159 to 0.333 for \(\mu =2.55\) and from 0.010 to 0.193 for \(\mu =2.75\) , which is surely better than the results obtained from the standard procedure based on the p -value. Take now the hypothesis \({{{\mathcal {H}}}_{0}}: \mu \in [2.75,2.85]\) . Since it can be considered not too different from a precise hypothesis, we consider the proportion of values of the CD*-support in the intervals [0, 0.2), [0.2, 0.3) and [0.3, 1]. Notice that, for simplicity, we assume 1 as the upper bound of the third interval, even though for small intervals the values of the CD*-support cannot be much larger than 0.5; in our simulation it does not exceed 0.635. For the different values of \(\mu \) considered, the behavior of the CD*-support and the p -value is not too different from the previous case of a precise hypothesis, even if the proportion of errors when \({{{\mathcal {H}}}_{0}}\) is true decreases for both, while it increases when \({{{\mathcal {H}}}_{0}}\) is false.

Example 5 (Binomial model) Suppose we are interested in assessing the chances of candidate A winning the next ballot for a certain administrative position. The latest election poll, based on a sample of size \(n=20\) , yielded \(t=9\) votes in favor of A . What can we infer? Clearly, we have a binomial model where the parameter p denotes the probability of a vote in favor of A . The standard estimate of p is \(\hat{p}=9/20=0.45\) , which might suggest that A will lose the ballot. However, the usual (Wald) confidence interval of level 0.95 based on the normal approximation, i.e. \(\hat{p} \pm 1.96 \sqrt{\hat{p}(1-\hat{p})/n}\) , is (0.232, 0.668). Given its considerable width, this interval suggests that the previous estimate is unreliable. We could perform a statistical test with a significance level \(\alpha \) , but what is \({{{\mathcal {H}}}_{0}}\) , and what value of \(\alpha \) should we consider? If \({{{\mathcal {H}}}_{0}}: p \ge 0.5\) , implying \({{{\mathcal {H}}}_{1}}: p <0.5\) , the p -value is 0.327. This suggests not rejecting \({{{\mathcal {H}}}_{0}}\) for any usual value of \(\alpha \) . However, if we choose \({{{\mathcal {H}}}_{0}}^\prime : p \le 0.5\) , the p -value is 0.673, and in this case too we would not reject \({{{\mathcal {H}}}_{0}}^\prime \) . These results provide conflicting indications. As seen in Example 3 , the CD for p , \(H_t^g(p)\) , is Be(9.5,11.5) and Fig. 4 shows its CD-density along with the corresponding CC, represented by solid lines. The dotted horizontal line at 0.95 in the CC plot highlights the (non asymptotic) equal-tailed confidence interval (0.251, 0.662), which is shorter than the Wald interval. Note that our interval can be easily obtained by computing the quantiles of order 0.025 and 0.975 of the beta distribution.

Figure 4: (Binomial model) CD-densities (left plot) and CCs (right plot) corresponding to \(H_t^g(p)\) , for the parameter p , with \(\hat{p}=t/n=0.45\) : \(n=20\) , \(t=9\) (solid lines) and \(n=60\) , \(t=27\) (dashed lines). In the CC plot the horizontal dotted line is at level 0.95.

The CD-support provided by the data for the two hypotheses \({{{\mathcal {H}}}_{0}}:p \ge 0.5\) and \({{{\mathcal {H}}}_{1}}:p < 0.5\) (the choice of which one is called \({{{\mathcal {H}}}_{0}}\) being irrelevant) is \(1-H_t^g(0.5)=0.328\) and \(H_t^g(0.5)=0.672\) , respectively. Therefore, the confidence odds are \(CO_{0,1}=0.328/0.672=0.488\) , suggesting that the empirical evidence in favor of the victory of A is half of that of its defeat. Now, consider a sample of size \(n=60\) with \(t=27\) , so that again \(\hat{p}=0.45\) . While a standard analysis leads to the same conclusions (the p -values for \({{{\mathcal {H}}}_{0}}\) and \({{{\mathcal {H}}}_{0}}^{\prime }\) are 0.219 and 0.781, respectively), the use of the CD clarifies the differences between the two cases. The corresponding CD-density and CC are also reported in Fig. 4 (dashed lines) and, as expected, they are more concentrated around \(\hat{p}\) . Thus, the accuracy of the estimate of p is greater for the larger n and the confidence intervals are shorter. Furthermore, for \(n=60\) , \(CO_{0,1}=0.281\) , reducing the chance that A wins to about 1 to 4.
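All the quantities of this example follow from the Be( \(t+1/2, n-t+1/2\) ) form of \(H_t^g\) ; a short R sketch (the helper name is ours):

```r
cd_ballot <- function(t, n) {
  a <- t + 0.5; b <- n - t + 0.5               # H_t^g is Be(t+1/2, n-t+1/2)
  support_H0 <- 1 - pbeta(0.5, a, b)           # CD-support of H0: p >= 0.5
  list(support_H0 = support_H0,
       odds = support_H0 / (1 - support_H0),   # confidence odds CO_{0,1}
       ci95 = qbeta(c(0.025, 0.975), a, b))    # equal-tailed 0.95 interval
}
cd_ballot(t = 9,  n = 20)   # support 0.328, odds 0.488, interval (0.251, 0.662)
cd_ballot(t = 27, n = 60)   # odds approximately 0.281
```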

As a second application on the binomial model, we follow Johnson and Rossell ( 2010 ) and consider a stylized phase II trial of a new drug designed to improve the overall response rate from 20% to 40% for a specific population of patients with a common disease. The hypotheses are \({{{\mathcal {H}}}_{0}}:p \le 0.2\) versus \({{{\mathcal {H}}}_{1}}: p>0.2\) . It is assumed that patients are accrued and the trial continues until one of the two events occurs: (a) data clearly support one of the two hypotheses (indicated by a CD-support greater than 0.9) or (b) 50 patients have entered the trial. Trials that are not stopped before the 51st patient accrues are assumed to be inconclusive.

Based on a simulation of 1000 trials, Table 4 reports the proportions of trials that conclude in favor of each hypothesis, along with the average number of patients observed before each trial is stopped, for \(p=0.1\) (the central value of \({{{\mathcal {H}}}_{0}}\) ) and for \(p=0.4\) . A comparison with the results reported by Johnson and Rossell ( 2010 ) reveals that our approach is clearly superior to Bayesian inferences performed with standard priors and comparable to those obtained under their carefully specified non-local prior. Although there is a slight reduction in the proportion of trials stopped for \({{\mathcal {H}}}_0\) (0.814 compared to 0.91), the average number of involved patients is lower (12.7 compared to 17.7), and the power is higher (0.941 against 0.812).
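A sketch of the stopping rule just described (our own condensed implementation, using \(H_t^g\) and the 0.9 support threshold stated above; it is not meant to reproduce Table 4 exactly):

```r
run_trial <- function(p_true, n_max = 50, thr = 0.9) {
  t <- 0
  for (n in 1:n_max) {
    t <- t + rbinom(1, 1, p_true)              # accrue one patient
    s0 <- pbeta(0.2, t + 0.5, n - t + 0.5)     # CD-support of H0: p <= 0.2
    if (s0 > thr)     return(c(decision = 0, n = n))   # stop in favor of H0
    if (1 - s0 > thr) return(c(decision = 1, n = n))   # stop in favor of H1
  }
  c(decision = NA, n = n_max)                  # inconclusive trial
}
set.seed(1)
res <- replicate(1000, run_trial(p_true = 0.4))
table(res["decision", ], useNA = "ifany") / 1000   # proportions of conclusions
mean(res["n", ])                                   # average number of patients
```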

Example 6 (Exponential distribution) Suppose an investigator aims to compare the performance of a new item, measured in terms of average lifetime, with that of the one currently in use, which is 0.375. To model the item lifetime, it is common to use the exponential distribution with rate parameter \(\lambda \) , so that the mean is \(1/\lambda \) . The typical testing problem is defined by \({{\mathcal {H}}}_0: \lambda =1/0.375=2.667\) versus \({{\mathcal {H}}}_1: \lambda \ne 2.667\) . In many cases, it would be more realistic and interesting to consider hypotheses of the form \({{\mathcal {H}}}_0: \lambda \in [\lambda _1,\lambda _2]\) versus \({{\mathcal {H}}}_1: \lambda \notin [\lambda _1,\lambda _2]\) , and if \({{{\mathcal {H}}}_{0}}\) is rejected, it becomes valuable to know whether the new item is better or worse than the old one. Note that, although a UMPU test exists for this problem, its p -value is not simple to calculate and cannot be expressed in a closed form. Here we consider two different null hypotheses: \({{\mathcal {H}}}_0: \lambda \in [2, 4]\) and \({{\mathcal {H}}}_0: \lambda \in [2.63, 2.70]\) , corresponding to a tolerance in the difference between the mean lifetimes of the new and old items equal to 0.125 and 0.005, respectively. Given a sample of n new items with mean \({\bar{x}}\) , it follows from Table 8 in Appendix A that the CD for \(\lambda \) is Ga( n ,  t ), where \(t=n\bar{x}\) . Assuming \(n=10\) , we consider two values of t , namely, 1.5 and 4.5. The corresponding CD-densities are illustrated in Fig. 5 , showing how the observed value t significantly influences the shape of the distribution, altering both its center and its dispersion, in contrast to the normal model. Specifically, for \(t=1.5\) , the potential estimates of \(\lambda \) , represented by the mean and median of the CD, are 6.67 and 6.45, respectively. For \(t=4.5\) , these values change to 2.22 and 2.15.

Table 5 provides the CD- and the CD*-supports corresponding to the two null hypotheses considered, along with the p -values of the UMPU test. Figure 5 and Table 5 together make it evident that, for \(t=1.5\) , the supports of both interval null hypotheses are very low, leading to their rejection unless the problem requires a loss function that strongly penalizes a wrong rejection. Furthermore, it is immediately apparent that the data suggest higher values of \(\lambda \) , indicating a lower average lifetime of the new item. Note that the standard criterion “ p -value \(< 0.05\) ” would imply not rejecting \({{{\mathcal {H}}}_{0}}: \lambda \in [2,4]\) . For \(t=4.5\) , when \({{{\mathcal {H}}}_{0}}: \lambda \in [2,4]\) , the median 2.15 of the CD falls within the interval [2, 4]. Consequently, both the CD- and the CD*-supports are greater than 0.5, leading to the acceptance of \({{{\mathcal {H}}}_{0}}\) , as also suggested by the p -value. When \({{{\mathcal {H}}}_{0}}: \lambda \in [2.63, 2.70]\) , the CD-support becomes meaningless, whereas the CD*-support is not negligible (0.256) and should be carefully evaluated in accordance with the problem under analysis. This contrasts with the indication provided by the p -value (0.555).

For the point null hypothesis \(\lambda =2.67\) , the analysis is similar to that for the interval [2.63, 2.70]. Note that, in this case, in addition to the UMPU test, it is also possible to consider the simpler and most frequently used equal-tailed test. The corresponding p -value is 0.016 for \(t=1.5\) and 0.484 for \(t=4.5\) ; these values are exactly two times the CD*-support, see Remark 4 .
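In R, all these quantities come from the Ga( n ,  t ) CD (shape n , rate t ); a minimal sketch for the two observed values of t considered above:

```r
n <- 10
H <- function(lambda, t) pgamma(lambda, shape = n, rate = t)    # CD for the rate lambda
for (t in c(1.5, 4.5)) {
  out <- c(t = t,
           cd_24     = H(4, t) - H(2, t),                   # CD-support of [2, 4]
           cds_24    = min(H(4, t), 1 - H(2, t)),           # CD*-support of [2, 4]
           cds_small = min(H(2.70, t), 1 - H(2.63, t)),     # CD*-support of [2.63, 2.70]
           cds_point = min(H(2.667, t), 1 - H(2.667, t)))   # CD*-support of lambda = 2.667
  print(round(out, 3))
}
```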

Figure 5: (Exponential model) CD-densities for the rate parameter \(\lambda \) , with \(n=10\) and \(t=1.5\) (dashed line) and \(t=4.5\) (solid line).

Example 7 ( Uniform model ) As seen in Example 2 , the CD for the parameter \(\theta \) of the uniform distribution \(\text {U}(0, \theta )\) is a Pareto distribution \(\text {Pa}(n, t)\) , where t is the sample maximum. Figure 6 shows the CD-density for \(n=10\) and \(t=2.1\) .

Consider now \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1, \theta _2]\) versus \({{{\mathcal {H}}}_{1}}: \theta \notin [\theta _1, \theta _2]\) . As usual, we can identify the interval \([\theta _1, \theta _2]\) on the plot of the CD-density and immediately recognize when the CD-test trivially rejects \({{{\mathcal {H}}}_{0}}\) (the interval lies on the left of t , i.e. \(\theta _2<t\) ), when the value of \(\theta _1\) is irrelevant and only the CD-support of \([t,\theta _2]\) determines the decision ( \(\theta _1<t<\theta _2\) ), or when the whole CD-support of \([\theta _1,\theta _2]\) must be considered ( \(t<\theta _1<\theta _2\) ). These facts are not as intuitive when the p -value is used. Indeed, for this problem a UMP test of level \(\alpha \) exists (see Eftekharian and Taheri 2015 ) and its p -value can be written as

(we are not aware of any previous mention of it). Table 6 reports the p -value of the UMP test, as well as the CD- and CD*-supports, for the two hypotheses \({{{\mathcal {H}}}_{0}}: \theta \in [1.5, 2.2]\) and \({{{\mathcal {H}}}_{0}}^\prime : \theta \in [2.0, 2.2]\) for a sample of size \(n=10\) and various values of t .

It can be observed that, when t belongs to the interval \([\theta _1, \theta _2]\) , the CD- and CD*-supports do not depend on \(\theta _1\) , as previously remarked, while the p -value does. This reinforces the incoherence of the p -value shown by Schervish ( 1996 ). For instance, when \(t=2.19\) , the p -value for \({{{\mathcal {H}}}_{0}}\) is 0.046, while that for \({{{\mathcal {H}}}_{0}}^{\prime }\) (included in \({{{\mathcal {H}}}_{0}}\) ) is larger, namely 0.072. Thus, assuming \(\alpha =0.05\) , the UMP test leads to the rejection of \({{{\mathcal {H}}}_{0}}\) but to the acceptance of the smaller hypothesis \({{{\mathcal {H}}}_{0}}^{\prime }\) .
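The CD- and CD*-supports in Table 6 follow directly from the Pareto CD; for instance, for the case \(t=2.19\) just discussed (a minimal R check):

```r
n <- 10; t <- 2.19
H <- function(theta) ifelse(theta > t, 1 - (t / theta)^n, 0)   # Pareto Pa(n, t) CD
cd_support <- function(a, b) H(b) - H(a)
cd_star    <- function(a, b) min(H(b), 1 - H(a))
c(cd_support(1.5, 2.2), cd_star(1.5, 2.2))   # H0:  theta in [1.5, 2.2]
c(cd_support(2.0, 2.2), cd_star(2.0, 2.2))   # H0': theta in [2.0, 2.2] -- identical values,
                                             # since theta_1 < t implies H(theta_1) = 0
```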

Figure 6: (Uniform model) CD-density for \(\theta \) with \(n=10\) and \(t=2.1\) .

Example 8 ( Sharpe ratio ) The Sharpe ratio is one of the most widely used measures of performance of stocks and funds. It is defined as the average excess return relative to the volatility, i.e. \(SR=\theta =(\mu _R-R_f)/\sigma _R\) , where \(\mu _R\) and \(\sigma _R\) are the mean and standard deviation of a return R and \(R_f\) is a risk-free rate. Under the typical assumption of constant risk-free rate, the excess returns \(X_1, X_2, \ldots , X_n\) of the fund over a period of length n are considered, leading to \(\theta =\mu /\sigma \) , where \(\mu \) and \(\sigma \) are the mean and standard deviation of each \(X_i\) . If the sample is not too small, the distribution and the dependence of the \(X_i\) ’s are not so crucial, and the inference on \(\theta \) is similar to that obtained under the basic assumption of i.i.d. normal random variables, as discussed in Opdyke ( 2007 ). Following this article, we consider the weekly returns of the mutual fund Fidelity Blue Chip Growth from 12/24/03 to 12/20/06 (these data are available for example on Yahoo! Finance, https://finance.yahoo.com/quote/FBGRX ) and assume that the excess returns are i.i.d. normal with a risk-free rate equal to 0.00052. Two different samples are analyzed: the first one includes all \(n_1=159\) observations from the entire period, while the second one is limited to the \(n_2=26\) weeks corresponding to the fourth quarter of 2005 and the first quarter of 2006. The sample mean, the standard deviation, and the corresponding sample Sharpe ratio for the first sample are \(\bar{x}_1=0.00011\) , \(s_1=0.01354\) , \(t_1=\bar{x}_1/s_1=0.00842\) . For the second sample, the values are \(\bar{x}_2=0.00280\) , \(s_2=0.01048\) , \(t_2=\bar{x}_2/s_2=0.26744\) .

We can derive the CD for \(\theta \) starting from the sampling distribution of the statistic \(W=\sqrt{n}T=\sqrt{n}\bar{X}/S\) , which has a noncentral t-distribution with \(n-1\) degrees of freedom and noncentrality parameter \(\tau =\sqrt{n}\mu /\sigma =\sqrt{n}\theta \) . This family has MLR (see Lehmann and Romano 2005 , p. 224) and the distribution function \(F^W_\tau \) of W is continuous in \(\tau \) with \(\lim _{\tau \rightarrow +\infty } F^W_\tau (w)=0\) and \(\lim _{\tau \rightarrow -\infty } F^W_\tau (w)=1\) , for each w in \(\mathbb {R}\) . Thus, from ( 1 ), the CD for \(\tau \) is \(H^\tau _w(\tau )=1-F^W_\tau (w)\) . Recalling that \(\theta =\tau /\sqrt{n}\) , the CD for \(\theta \) can be obtained using a trivial transformation which leads to \(H^\theta _w(\theta )=H^\tau _{w}(\sqrt{n}\theta )=1-F_{\sqrt{n}\theta }^W(w)\) , where \(w=\sqrt{n}t\) . In Figure 7 , the CD-densities for \(\theta \) relative to the two samples are plotted: they are symmetric and centered on the estimate t of \(\theta \) , and the dispersion is smaller for the one with the larger n .

Now, let us consider the typical hypotheses for the Sharpe ratio, \({{\mathcal {H}}}_0: \theta \le 0\) versus \({{\mathcal {H}}}_1: \theta >0\) . From Table 7 , which reports the CD-supports and the corresponding odds for the two samples, and from Fig. 7 , it appears that the first sample clearly favors neither hypothesis, while \({{{\mathcal {H}}}_{1}}\) is strongly supported by the second one. Here, the p -value coincides with the CD-support (see Proposition 3 ), but choosing the usual values 0.05 or 0.01 to decide whether to reject \({{{\mathcal {H}}}_{0}}\) or not may lead to markedly different conclusions.
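The CD-supports in Table 7 can be computed from the noncentral t distribution function; a short R sketch using the two samples above (the helper name is ours):

```r
cd_sharpe <- function(theta, n, t) {
  w <- sqrt(n) * t                                  # observed value of W = sqrt(n) * xbar / s
  1 - pt(w, df = n - 1, ncp = sqrt(n) * theta)      # H_w(theta) = 1 - F_{sqrt(n) theta}(w)
}
# CD-support of H0: theta <= 0 (equal to the p-value, see Proposition 3)
cd_sharpe(0, n = 159, t = 0.00842)
cd_sharpe(0, n = 26,  t = 0.26744)
```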

When the assumption of i.i.d. normal returns does not hold, it is possible to show (Opdyke 2007 ) that the asymptotic distribution of T is normal with mean and variance \(\theta \) and \(\sigma ^2_T=(1+\theta ^2(\gamma _4-1)/4-\theta \gamma _3)/n\) , where \(\gamma _3\) and \(\gamma _4\) are the skewness and kurtosis of the \(X_i\) ’s. Thus, the CD for \(\theta \) can be derived from the asymptotic distribution of T and is N( \(t,\hat{\sigma }^2_T)\) , where \(\hat{\sigma }^2_T\) is obtained by estimating the population moments using the sample counterparts. The last column of Table 7 shows that the asymptotic CD-supports for \({{{\mathcal {H}}}_{0}}\) are not too different from the previous ones.

Figure 7: (Sharpe ratio) CD-densities for \(\theta =\mu /\sigma \) with \(n_1=159\) , \(t_1=0.008\) (solid line) and \(n_2=26\) , \(t_2=0.267\) (dashed line).

Example 9 ( Ratio of Poisson rates ) The comparison of Poisson rates \(\mu _1\) and \(\mu _2\) is important in various contexts, as illustrated for example by Lehmann and Romano ( 2005 , sec. 4.5), who also derive the UMPU test for the ratio \(\phi =\mu _1/\mu _2\) . Given two i.i.d. samples of sizes \(n_1\) and \(n_2\) from independent Poisson distributions, we can summarize the data with the two sufficient sample sums \(S_1\) and \(S_2\) , where \(S_i \sim \) Po( \(n_i\mu _i\) ), \(i=1,2\) . Reparameterizing the joint density of \((S_1, S_2)\) with \(\phi =\mu _1/\mu _2\) and \(\lambda =n_1\mu _1+n_2\mu _2\) , it is simple to verify that the conditional distribution of \(S_1\) given \(S_1+S_2=s_1+s_2\) is Bi( \(s_1+s_2, w\phi /(1+w\phi )\) ), with \(w=n_1/n_2\) , while the marginal distribution of \(S_1+S_2\) depends only on \(\lambda \) . Thus, for making inference on \(\phi \) , it is reasonable to use the CD for \(\phi \) obtained from the previous conditional distribution. Referring to the table in Appendix A, the CD \(H^g_{s_1,s_2}\) for \(w\phi /(1+w\phi )\) is Be \((s_1+1/2, s_2+1/2)\) , enabling us to determine the CD-density for \(\phi \) through the change of variable rule:

\(h^g_{s_1,s_2}(\phi )= \dfrac{w^{s_1+1/2}\,\phi ^{s_1-1/2}}{B(s_1+1/2,\, s_2+1/2)\,(1+w\phi )^{s_1+s_2+1}}, \quad \phi >0. \qquad (5)\)

We compare our results with those derived by the standard conditional test implemented through the function poisson.test in R. We use the “eba1977” data set available in the package ISwR, ( https://CRAN.R-project.org/package=ISwR ), which contains counts of incident lung cancer cases and population size in four neighboring Danish cities by age group. Specifically, we compare the \(s_1=11\) lung cancer cases in a population of \(n_1=800\) people aged 55–59 living in Fredericia with the \(s_2=21\) cases observed in the other cities, which have a total of \(n_2=3011\) residents. For the hypothesis \({{{\mathcal {H}}}_{0}}: \phi =1\) versus \({{{\mathcal {H}}}_{1}}: \phi \ne 1\) , the R-output provides a p -value of 0.080 and a 0.95 confidence interval of (0.858, 4.277). If a significance level \(\alpha =0.05\) is chosen, \({{{\mathcal {H}}}_{0}}\) is not rejected, leading to the conclusion that there should be no reason for the inhabitants of Fredericia to worry.

Looking at the three CD-densities for \(\phi \) in Fig. 8 , it is evident that values of \(\phi \) greater than 1 are more supported than values less than 1. Thus, one should test the hypothesis \({{{\mathcal {H}}}_{0}}: \phi \le 1\) versus \({{{\mathcal {H}}}_{1}}: \phi >1\) . Using ( 5 ), it follows that the CD-support of \({{{\mathcal {H}}}_{0}}\) is \(H^g_{s_1,s_2}(1)=0.037\) , and the confidence odds are \(CO_{0,1}=0.037/(1-0.037)=0.038\) . To avoid rejecting \({{{\mathcal {H}}}_{0}}\) , a very asymmetric loss function would have to be deemed suitable. Finally, we observe that the confidence interval computed in R is the Clopper-Pearson one, which has exact coverage but, as generally recognized, is too wide. In our context, this corresponds to taking the lower bound of the interval using the CC generated by \(H^\ell _{s_1, s_2}\) and the upper bound using that generated by \(H^r_{s_1, s_2}\) (see Veronese and Melilli 2015 ). It includes the interval generated by \(H_{s_1, s_2}^g\) , namely (0.931, 4.026), as shown in the right plot of Fig. 8 .
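Both the standard analysis and the CD-based one require only a few lines of R; the poisson.test call below is the one mentioned above, while the beta-based computations follow from the form of \(H^g_{s_1,s_2}\) .

```r
s1 <- 11; n1 <- 800; s2 <- 21; n2 <- 3011; w <- n1 / n2
# standard exact conditional test for H0: phi = 1 (p-value 0.080, CI (0.858, 4.277))
poisson.test(c(s1, s2), c(n1, n2), r = 1)
# CD for psi = w*phi/(1+w*phi) is Be(s1+1/2, s2+1/2); evaluate it at a given phi
H_g <- function(phi) pbeta(w * phi / (1 + w * phi), s1 + 0.5, s2 + 0.5)
H_g(1)                              # CD-support of H0: phi <= 1 (0.037)
q <- qbeta(c(0.025, 0.975), s1 + 0.5, s2 + 0.5)
q / (w * (1 - q))                   # equal-tailed 0.95 interval for phi, (0.931, 4.026)
```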

Figure 8: (Poisson rates) CD-densities (left plot) and CCs (right plot) corresponding to \(H^g_{s_1,s_2}(\phi )\) (solid lines), \(H^\ell _{s_1,s_2}(\phi )\) (dashed lines) and \(H^r_{s_1,s_2}(\phi )\) (dotted lines) for the parameter \(\phi \) . In the CC plot the vertical lines identify the Clopper-Pearson confidence interval (dashed and dotted lines) and that based on \(H^g_{s_1,s_2}(\phi )\) (solid lines). The dotted horizontal line is at level 0.95.

5 Properties of CD-support and CD*-support

5.1 One-sided hypotheses

The CD-support of a set is the mass assigned to it by the CD, making it a fundamental component in all inferential problems based on CDs. Nevertheless, its direct utilization in hypothesis testing is rare, with the exception of Xie and Singh ( 2013 ). It can also be viewed as a specific instance of evidential support , a notion introduced by Bickel ( 2022 ) within a broader category of models known as evidential models , which encompass both posterior distributions and confidence distributions as specific cases.

Let us now consider a classical testing problem. Let \(\textbf{X}\) be an i.i.d. sample with a distribution depending on a real parameter \(\theta \) and consider the hypotheses \({{{\mathcal {H}}}_{0}}: \theta \le \theta _0\) versus \({{{\mathcal {H}}}_{1}}: \theta >\theta _0\) , where \(\theta _0\) is a fixed value (the case \({{{\mathcal {H}}}_{0}}^\prime : \theta \ge \theta _0\) versus \({{{\mathcal {H}}}_{1}}^\prime : \theta <\theta _0\) is perfectly symmetric and will not be analyzed). In order to compare our test with the standard one, we assume that the model has MLR in \(T=T(\textbf{X})\) . Suppose first that the distribution function \(F_\theta (t)\) of T is continuous and that the CD for \(\theta \) is \(H_t(\theta )=1- F_{\theta }(t)\) . From Sect. 3 , the CD-support for \({{{\mathcal {H}}}_{0}}\) (which coincides with the CD*-support) is \(H_t(\theta _0)\) . In this case, the UMP test exists, as established by the Karlin-Rubin theorem, and rejects \({{{\mathcal {H}}}_{0}}\) if \(t > t_\alpha \) , where \(t_\alpha \) depends on the chosen significance level \(\alpha \) , or alternatively, if the p -value \(\text{ Pr}_{\theta _0}(T\ge t)\) is less than \(\alpha \) . Since \(\text{ Pr}_{\theta _0}(T\ge t)=1-F_{\theta _0}(t)=H_t(\theta _0)\) , the p -value coincides with the CD-support. Thus, to define a CD-test with size \(\alpha \) , it is enough to fix its rejection region as \(\{t: H_t(\theta _0)<\alpha \}\) , and both tests lead to the same conclusion.

When the statistic T is discrete, we have seen that various choices of CDs are possible. Assuming that \(H^r_t(\theta )< H^g_t(\theta ) < H^{\ell }_t(\theta )\) , as occurs for models belonging to a real NEF, it follows immediately that \(H^{\ell }_t\) provides stronger support for \({{\mathcal {H}}}_0: \theta \le \theta _0\) than \(H^g_t\) does, while \(H^{r}_t\) provides stronger support for \({{\mathcal {H}}}_0^\prime : \theta \ge \theta _0\) than \(H^g_t\) does. In other words, \(H_t^{\ell }\) is more conservative than \(H^g_t\) for testing \({{{\mathcal {H}}}_{0}}\) and the same happens to \(H^r_t\) for \({{{\mathcal {H}}}_{0}}^{\prime }\) . Therefore, selecting the appropriate CD can lead to the standard testing result. For example, in the case of \({{{\mathcal {H}}}_{0}}:\theta \le \theta _0\) versus \({{{\mathcal {H}}}_{1}}: \theta > \theta _0\) , the p -value is \(\text{ Pr}_{\theta _0}(T\ge t)=1-\text{ Pr}_{\theta _0}(T<t)=H^{\ell }_t(\theta _0)\) , and the rejection region of the standard test and that of the CD-test based on \(H_t^{\ell }\) coincide if the threshold is the same. However, as both tests are non-randomized, their size is typically strictly less than the fixed threshold.
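For instance, with a binomial statistic the identity between the p -value for \({{{\mathcal {H}}}_{0}}: p \le p_0\) and \(H_t^{\ell }(p_0)\) can be checked directly in R (illustrative values of n , t and \(p_0\) ):

```r
n <- 20; t <- 14; p0 <- 0.5
p_value <- 1 - pbinom(t - 1, n, p0)      # Pr_{p0}(T >= t)
H_left  <- pbeta(p0, t, n - t + 1)       # H_t^l(p0): the left CD is Be(t, n-t+1)
c(p_value, H_left)                       # the two values coincide
```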

The following proposition summarizes the previous considerations.

Proposition 3

Consider a model indexed by a real parameter \(\theta \) with MLR in the statistic T and the one-sided hypotheses \({{{\mathcal {H}}}_{0}}: \theta \le \theta _0\) versus \({{{\mathcal {H}}}_{1}}: \theta >\theta _0\) , or \({{{\mathcal {H}}}_{0}}^\prime : \theta \ge \theta _0\) versus \({{{\mathcal {H}}}_{1}}^\prime : \theta <\theta _0\) . If T is continuous, then the CD-support and the p -value associated with the UMP test are equal. Thus, if a common threshold \(\alpha \) is set for both rejection regions, the two tests have size \(\alpha \) . If T is discrete, the CD-support coincides with the usual p -value if \(H^\ell _t [H^r_t]\) is chosen when \({{{\mathcal {H}}}_{0}}: \theta \le \theta _0\) \([{{{\mathcal {H}}}_{0}}^\prime : \theta \ge \theta _0]\) . For a fixed threshold \(\alpha \) , the two tests have a size not greater than \(\alpha \) .

The CD-tests with threshold \(\alpha \) mentioned in the previous proposition have significance level \(\alpha \) and are, therefore, valid , that is \(\sup _{\theta \in \Theta _0} Pr_\theta (H(T)\le \alpha ) \le \alpha \) (see Martin and Liu 2013 ). This is no longer true if, for a discrete T , we choose \(H^g_t\) . However, Proposition 2 implies that its average size is closer to \(\alpha \) compared to those of the tests obtained using \(H^\ell _t\) \([H^r_t]\) , making \(H^g_t\) more appropriate when the problem does not strongly suggest that the null hypothesis should be considered true “until proven otherwise”.

5.2 Precise and interval hypotheses

The notion of CD*-support surely demands more attention than that of CD-support. Recalling that the CD*-support only accounts for one direction of deviation from the precise or interval hypothesis, we will first briefly explore its connections with similar notions.

While the CD-support is an additive measure, meaning that for any set \(A \subseteq \Theta \) and its complement \(A^c\) , we always have \(\text{ CD }(A) +\text{ CD }(A^c)=1\) , the CD*-support is only a sub-additive measure, that is \(\text{ CD* }(A) +\text{ CD* }(A^c)\le 1\) , as can be easily checked. This suggests that the CD*-support can be related to a belief function. In essence, a belief function \(\text{ bel}_\textbf{x}(A)\) measures the evidence in \(\textbf{x}\) that supports A . However, due to its sub-additivity, it alone cannot provide sufficient information; it must be coupled with the plausibility function, defined as \(\text {pl}_\textbf{x}(A) = 1 - \text {bel}_\textbf{x}(A^c)\) . We refer to Martin and Liu ( 2013 ) for a detailed treatment of these notions within the general framework of Inferential Models , which admits a CD as a very specific case. We only mention here that they show that when \(A=\{\theta _0\}\) (i.e. a singleton), \(\text{ bel}_\textbf{x}(\{\theta _0\})=0\) , but \(\text{ bel}_\textbf{x}(\{\theta _0\}^c)\) can be different from 1. In particular, for the normal model N \((\theta ,1)\) , they found that, under some assumptions, \(\text{ bel}_\textbf{x}(\{\theta _0\}^c) =|2\Phi (x-\theta _0)-1|\) . Recalling the definition of the CC and the CD provided in Example 1 , it follows that the plausibility of \(\theta _0\) is \(\text {pl}_\textbf{x}(\{\theta _0\})=1-\text{ bel}_\textbf{x}(\{\theta _0\}^c)=1-|2\Phi (x-\theta _0)-1|= 1-CC_\textbf{x}(\theta _0)\) , and using ( 4 ), we can conclude that the CD*-support of \(\theta _0\) corresponds to half their plausibility.

The CD*-support for a precise hypothesis \({{{\mathcal {H}}}_{0}}: \theta =\theta _0\) is related to the notion of evidence, as defined in a Bayesian context by Pereira et al. ( 2008 ). Evidence is the posterior probability of the set \(\{\theta \in \Theta : p(\theta |\textbf{x})<p(\theta _0|\textbf{x})\}\) , where \(p(\theta |\textbf{x})\) is the posterior density of \(\theta \) . In particular, when a unimodal and symmetric CD is used as a posterior distribution, it is easy to check that the CD*-support coincides with half of the evidence.

The CD*-support is also related to the notion of weak-support defined by Singh et al. ( 2007 ) as \(\sup _{\theta \in [\theta _1,\theta _2]} 2 \min \{H_{\textbf{x}}(\theta ), 1-H_{\textbf{x}}(\theta )\}\) , but important differences exist. If the data give little support to \({{{\mathcal {H}}}_{0}}\) , our definition highlights better whether values of \(\theta \) on the right or on the left of \({{{\mathcal {H}}}_{0}}\) are more reasonable. Moreover, if \({{{\mathcal {H}}}_{0}}\) is highly supported, that is \(\theta _m \in [\theta _1,\theta _2]\) , the weak-support is always equal to one, while the CD*-support takes values in the interval [0.5, 1], allowing a better discrimination between different cases. Only when \({{{\mathcal {H}}}_{0}}\) is a precise hypothesis do the two definitions agree, apart from the multiplicative constant of two.

There exists a strong connection between the CD*-support and the e-value introduced by Peskun ( 2020 ). Under certain regularity assumptions, the e -value can be expressed in terms of a CD and coincides with the CD*-support, so that the properties and results originally established by Peskun for the e -value also apply to the CD*-support. More precisely, let us first consider the case of an observation x generated by the normal model \(\text {N}(\mu ,1)\) . Peskun shows that for the hypothesis \({{{\mathcal {H}}}_{0}}: \mu \in [\mu _1,\mu _2]\) , the e -value is equal to \(\min \{\Phi (x-\mu _1), \Phi (\mu _2-x)\}\) . Since, as shown in Example 1 , \(H_x(\mu )=1-\Phi (x-\mu )=\Phi (\mu -x)\) , it immediately follows that \(\min \{H_x(\mu _2),1-H_x(\mu _1)\}= \min \{\Phi (\mu _2-x), \Phi (x-\mu _1)\}\) , so that the e -value and the CD*-support coincide. For a more general case, we present the following result.

Proposition 4

Let \(\textbf{X}\) be a random vector distributed according to the family of densities \(\{p_\theta , \theta \in \Theta \subseteq \mathbb {R}\}\) with a MLR in the real continuous statistic \(T=T(\textbf{X})\) , with distribution function \(F_\theta (t)\) . If \(F_\theta (t)\) is continuous in \(\theta \) with limits 0 and 1 for \(\theta \) tending to \(\sup (\Theta )\) and \(\inf (\Theta )\) , respectively, then the CD*-support and the e -value for the hypothesis \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) , \(\theta _1 \le \theta _2\) , are equivalent.

We emphasize, however, that the advantage of the CD*-support over the e -value lies in the fact that knowledge of the entire CD allows us to naturally embed the testing problem in a more comprehensive and coherent inferential framework, in which the e -value is only one of the aspects to be taken into consideration.

Suppose now that a test of significance for \({{\mathcal {H}}}_0: \theta \in [\theta _1,\theta _2]\) , with \(\theta _1 \le \theta _2\) , is desired and that the CD for \(\theta \) is \(H_t(\theta )\) . Recall that the CD-support for \({{{\mathcal {H}}}_{0}}\) is \(H_t([\theta _1,\theta _2]) = \int _{\theta _1}^{\theta _2} dH_{t}(\theta ) = H_t(\theta _2)-H_t(\theta _1)\) , and that when \(\theta _1=\theta _2=\theta _0\) , or the interval \([\theta _1,\theta _2]\) is “small”, it becomes ineffective, and the CD*-support must be employed. The following proposition establishes some results about the CD- and the CD*-tests.
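
The following sketch (again for the normal CD of Example 1, with hypothetical data) illustrates why the CD-support becomes ineffective for small intervals while the CD*-support does not:

```python
# CD-support H_t(theta2) - H_t(theta1) versus CD*-support
# min{H_t(theta2), 1 - H_t(theta1)} as the interval shrinks around 0.
from scipy.stats import norm

t = 0.3                                     # hypothetical observed statistic
H = lambda theta: norm.cdf(theta - t)       # CD of Example 1

for delta in (1.0, 0.5, 0.1, 0.01):
    theta1, theta2 = -delta / 2, delta / 2
    cd = H(theta2) - H(theta1)               # vanishes with delta
    cd_star = min(H(theta2), 1 - H(theta1))  # stabilises around H_t(0)
    print(f"delta={delta:5.2f}  CD-support={cd:.4f}  CD*-support={cd_star:.4f}")
```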

Proposition 5

Given a statistical model parameterized by the real parameter \(\theta \) with MLR in the continuous statistic T , consider the hypothesis \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) with \( \theta _1 \le \theta _2\) . Then,

i) both the CD- and the CD*-tests reject \({{{\mathcal {H}}}_{0}}\) for all values of T that are smaller or larger than suitable values;

ii) if a threshold \(\gamma \) is fixed for the CD-test, its size is not less than \(\gamma \) ;

iii) for a precise hypothesis, i.e., \(\theta _1=\theta _2\) , the CD*-support, seen as a function of the random variable T , has the uniform distribution on (0, 0.5);

iv) if a threshold \(\gamma ^*\) is fixed for the CD*-test, its size falls within the interval \([\gamma ^*, \min (2\gamma ^*,1)]\) and equals \(\min (2\gamma ^*,1)\) when \(\theta _1=\theta _2\) (i.e. when \({{{\mathcal {H}}}_{0}}\) is a precise hypothesis);

v) the CD-support is never greater than the CD*-support, and if a common threshold is fixed for both tests, the size of the CD-test is not smaller than that of the CD*-test.

Point i) highlights that the rejection regions generated by the CD- and CD*-tests are two-sided, resembling those of standard tests for hypotheses of this kind. However, even when \(\gamma = \gamma ^*\) , the rejection regions differ: as point v) implies, the CD-test rejects \({{{\mathcal {H}}}_{0}}\) more easily. This becomes crucial for small intervals, where the CD-test tends to reject the null hypothesis almost invariably.
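
A small simulation can be used to check points iii) and iv) empirically for a precise null in the normal model of Example 1; the sample size, threshold and seed below are arbitrary choices, not prescriptions from the paper:

```python
# Under H0: theta = theta0 with T ~ N(theta0, 1), the CD*-support
# min{H_T(theta0), 1 - H_T(theta0)} should be uniform on (0, 0.5), and the
# CD*-test with threshold gamma* should have size close to 2 * gamma*.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta0, n_rep, gamma_star = 0.0, 100_000, 0.05

t = rng.normal(theta0, 1.0, size=n_rep)     # replications of T under H0
H = norm.cdf(theta0 - t)                    # CD of Example 1 at theta0
cd_star = np.minimum(H, 1 - H)

print("largest CD*-support:", cd_star.max())                  # close to 0.5
print("rejection frequency:", np.mean(cd_star <= gamma_star)) # close to 0.10
```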

Under the assumptions of Proposition 5 , the p -value corresponding to the commonly used equal-tailed test for a precise hypothesis \({{{\mathcal {H}}}_{0}}:\theta =\theta _0\) is \(2\min \{F_{\theta _0}(t), 1-F_{\theta _0}(t)\}\) , so that it coincides with twice the CD*-support.

For interval hypotheses, a UMPU test essentially exists only for models within a NEF, and an interesting relationship can be established with the CD-test.

Proposition 6

Given the CD based on the sufficient statistic of a continuous real NEF with natural parameter \(\theta \) , consider the hypothesis \({{\mathcal {H}}}_0: \theta \in [\theta _1,\theta _2]\) versus \({{\mathcal {H}}}_1: \theta \notin [\theta _1,\theta _2]\) , with \(\theta _1 < \theta _2\) . If the CD-test has size \(\alpha _{CD}\) , it is the UMPU test among all \(\alpha _{CD}\) -level tests.

For interval hypotheses, unlike one-sided hypotheses, when the statistic T is discrete there is no clear reason to prefer either \(H_t^{\ell }\) or \(H_t^r\) . Neither test is more conservative, as their respective rejection regions are shifted by just one point in the support of T . Thus, \(H^g_t\) can again be considered a reasonable compromise, due to its greater proximity to the uniform distribution. Moreover, while the results stated for continuous statistics may not hold exactly for discrete statistics, they remain approximately valid provided the sample size is not too small, thanks to the asymptotic normality of CDs stated in Proposition 1 .

6 Conclusions

In this article, we propose the use of confidence distributions to address a hypothesis testing problem concerning a real parameter of interest. Specifically, we introduce the CD- and CD*-supports, which are suitable for evaluating one-sided or large interval null hypotheses and precise or small interval null hypotheses, respectively. This approach does not necessarily require specifying the Type I and Type II errors or fixing a significance level a priori. We do not propose an automatic procedure; instead, we suggest a careful and more general inferential analysis of the problem based on CDs. The CD- and CD*-supports are two simple, coherent measures of evidence for a hypothesis with a clear meaning and interpretation. None of these features is shared by the p -value, which is more complex and generally does not exist in closed form for interval hypotheses.

It is well known that the significance level \(\alpha \) of a test, which is crucial for reaching a decision, should be adjusted according to the sample size, but this is almost never done in practice. In our approach, the support provided by the CD to a hypothesis depends on the sample size in an obvious way, through the dispersion of the CD. For example, if \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) , one can easily observe the effect of the sample size on the CD-support of \({{{\mathcal {H}}}_{0}}\) by examining the interval \([\theta _1, \theta _2]\) on the CD-density plot. The CD-support can be non-negligible even when the length \(\Delta =\theta _2-\theta _1\) is small, provided the CD is sufficiently concentrated on the interval. The relationship between \(\Delta \) and the dispersion of the CD highlights again the importance of a thoughtful choice of the threshold used for decision-making and the unreasonableness of resorting to standard values. Note that the CD- and CD*-tests behave similarly in many standard situations, as shown in the examples presented.
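
The sample-size effect can be made concrete with a short sketch; here we assume the usual CD for a normal mean based on n observations with known unit variance, namely N\((\bar{x}, 1/n)\), and the numbers are hypothetical:

```python
# CD-support of a fixed interval hypothesis as the sample size grows,
# with CD H(theta) = Phi(sqrt(n) * (theta - xbar)) for the normal mean.
import numpy as np
from scipy.stats import norm

theta1, theta2, xbar = -0.1, 0.1, 0.05   # hypothetical interval and sample mean

for n in (10, 50, 200, 1000):
    H = lambda th: norm.cdf(np.sqrt(n) * (th - xbar))
    print(f"n={n:5d}  CD-support of [{theta1}, {theta2}] = {H(theta2) - H(theta1):.3f}")
```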

Finally, we have investigated some theoretical aspects of the CD- and CD*-tests which are crucial in the standard approach. While for one-sided hypotheses an agreement with standard tests can be established, some distinctions must be made for two-sided hypotheses. If a threshold \(\gamma \) is fixed for a CD- or CD*-test, then its size can exceed \(\gamma \) , reaching \(2\gamma \) for a CD*-test of a precise hypothesis. This is because the CD*-support only considers the appropriate tail suggested by the data and does not adhere to the typical procedure of doubling the one-sided p -value, a procedure that can be criticized, as seen in Sect. 1 . Of course, if one is convinced of the need to double the p -value, in our context it is sufficient to double the CD*-support. In the case of a precise hypothesis \({{{\mathcal {H}}}_{0}}: \theta = \theta _0\) , this leads to a valid test because \(Pr_{\theta _0}\left( 2\min \{H_{\textbf{x}}(\theta _0),1-H_{\textbf{x}}(\theta _0)\}\le \alpha \right) \le \alpha \) , as can be deduced from the relationship of the CD*-support with the e -value and the results in Peskun ( 2020 , Sec. 2).

Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, Bollen KA, Brembs B, Brown L, Camerer C et al (2018) Redefine statistical significance. Nat. Hum Behav 2:6–10

Berger JO, Delampady M (1987) Testing precise hypotheses. Statist Sci 2:317–335

Berger JO, Sellke T (1987) Testing a point null hypothesis: the irreconcilability of p-values and evidence. J Amer Statist Assoc 82:112–122

Bickel DR (2022) Confidence distributions and empirical Bayes posterior distributions unified as distributions of evidential support. Comm Statist Theory Methods 51:3142–3163

Eftekharian A, Taheri SM (2015) On the GLR and UMP tests in the family with support dependent on the parameter. Stat Optim Inf Comput 3:221–228

Fisher RA (1930) Inverse probability. Proceedings of the Cambridge Philosophical Society 26:528–535

Fisher RA (1973) Statistical methods and scientific inference. Hafner Press, New York

Freedman LS (2008) An analysis of the controversy over classical one-sided tests. Clinical Trials 5:635–640

Gibbons JD, Pratt JW (1975) p-values: interpretation and methodology. Amer Statist 29:20–25

Goodman SN (1993) p-values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 137:485–496

Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG (2016) Statistical tests, p-values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 31:337–350

Hannig J (2009) On generalized fiducial inference. Statist Sinica 19:491–544

Hannig J, Iyer HK, Lai RCS, Lee TCM (2016) Generalized fiducial inference: a review and new results. J Amer Statist Assoc 44:476–483

Hubbard R, Bayarri MJ (2003) Confusion over measures of evidence (p’s) versus errors ( \(\alpha \) ’s) in Classical Statistical Testing. Amer Statist 57:171–178

Johnson VE, Rossell D (2010) On the use of non-local prior densities in Bayesian hypothesis tests. J R Stat Soc Ser B 72:143–170

Johnson VE, Payne RD, Wang T, Asher A, Mandal S (2017) On the reproducibility of psychological science. J Amer Statist Assoc 112:1–10

Lehmann EL, Romano JP (2005) Testing Statistical Hypotheses, 3rd edn. Springer, New York

Martin R, Liu C (2013) Inferential models: a framework for prior-free posterior probabilistic inference. J Amer Statist Assoc 108:301–313

Opdyke JD (2007) Comparing Sharpe ratios: so where are the p-values? J Asset Manag 8:308–336

OSC (2015) Estimating the reproducibility of psychological science. Science 349:aac4716

Pereira CADB, Stern JM (1999) Evidence and credibility: full Bayesian significance test for precise hypotheses. Entropy 1:99–110

Pereira CADB, Stern JM, Wechsler S (2008) Can a significance test be genuinely Bayesian? Bayesian Anal 3:79–100

Peskun PH (2020) Two-tailed p-values and coherent measures of evidence. Amer Statist 74:80–86

Schervish MJ (1996) p values: What they are and what they are not. Amer Statist 50:203–206

Schweder T, Hjort NL (2002) Confidence and likelihood. Scand J Stat 29:309–332

Schweder T, Hjort NL (2016) Confidence, likelihood and probability. Cambridge University Press, London

Shao J (2003) Mathematical statistics. Springer-Verlag, New York

Singh K, Xie M, Strawderman M (2005) Combining information through confidence distributions. Ann Statist 33:159–183

Singh K, Xie M, Strawderman WE (2007) Confidence distribution (CD): distribution estimator of a parameter. In: Complex datasets and inverse problems: tomography, networks and beyond. Institute of Mathematical Statistics, pp 132–150

Veronese P, Melilli E (2015) Fiducial and confidence distributions for real exponential families. Scand J Stat 42:471–484

Veronese P, Melilli E (2018a) Fiducial, confidence and objective Bayesian posterior distributions for a multidimensional parameter. J Stat Plan Inference 195:153–173

Veronese P, Melilli E (2018b) Some asymptotic results for fiducial and confidence distributions. Statist Probab Lett 134:98–105

Wasserstein RL, Lazar NA (2016) The ASA statement on p-values: context, process, and purpose. Amer Statist 70:129–133

Xie M, Singh K (2013) Confidence distribution, the frequentist distribution estimator of a parameter: a review. Int Stat Rev 81:3–39

Yates F (1951) The influence of statistical methods for research workers on the development of the science of statistics. J Amer Statist Assoc 46:19–34

Acknowledgements

Partial financial support was received from Bocconi University. The authors would like to thank the referees for their valuable comments, suggestions and references, which led to a significantly improved version of the manuscript.

Open access funding provided by Università Commerciale Luigi Bocconi within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

Bocconi University, Department of Decision Sciences, Milano, Italy

Eugenio Melilli & Piero Veronese

Corresponding author

Correspondence to Eugenio Melilli.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A. Table of confidence distributions

Appendix B. Proofs of propositions

Proof of Proposition 1

The asymptotic normality and the consistency of the CD in i) and ii) follow from Veronese & Melilli ( 2015 , Thm. 3) for models belonging to a NEF and from Veronese & Melilli ( 2018b , Thm. 1) for arbitrary continuous models. Part iii) of the proposition follows directly from Chebyshev's inequality. \(\diamond \)

Proof of Proposition 2

Denote by \(F_{\theta }(t)\) the distribution function of T , assume that its support \({{\mathcal {T}}}=\{t_1,t_2,\ldots ,t_k\}\) is finite for simplicity and let \(p_j=p_j(\theta )=\text{ Pr}_\theta (T=t_j)\) , \(j=1,2,\ldots ,k\) for a fixed \(\theta \) . Consider the case \(H_t^r(\theta )=1-F_{\theta }(t)\) (if \(H_t^r(\theta )=F_{\theta }(t)\) the proof is similar) so that, for each \(j=2,\ldots ,k\) , \(H_{t_j}^\ell (\theta )=H_{t_{j-1}}^r(\theta )\) and \(H_{t_1}^\ell (\theta )=1\) . The supports of the random variables \(H^r_T(\theta )\) , \(H^\ell _T(\theta )\) and \(H^g_T(\theta )\) are, respectively,

where ( 6 ) holds because \(H^r_{t_j}(\theta )< H^g_{t_j}(\theta ) < H^{\ell }_{t_j}(\theta )\) . The probabilities corresponding to the points included in the three supports are of course the same, that is \(p_k,p_{k-1},\ldots ,p_1\) , in this order, so that \(G^\ell (u) \le u \le G^r(u)\) .

Let \(d(Q,R)=\int |Q(x)-R(x)|dx\) be the distance between the two arbitrary distribution functions Q and R . Denoting \(G^u\) as the uniform distribution function on (0, 1), we have

where the last inequality follows from ( 6 ). Thus, the distance from uniformity of \(H_T^g(\theta )\) is less than that of \(H_T^\ell (\theta )\) and of \(H_T^r(\theta )\) and ( 2 ) is proven. \(\diamond \)

Proof of Proposition 4

Given the statistic T and the hypothesis \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) , the e -value (see Peskun 2020 , equation 12) is \(\min \bigg \{\max _{\theta \in [\theta _1,\theta _2]} F_\theta (t), \max _{\theta \in [\theta _1,\theta _2]} (1-F_\theta (t))\bigg \}\) . Under the assumptions of the proposition, it follows that \(F_t(\theta )\) is monotonically nonincreasing in \(\theta \) for each t (see Section 2 ). As a result, the e -value simplifies to:

where the last expression coincides with the CD*-support of \({{{\mathcal {H}}}_{0}}\) . Note that the same result holds if the MLR is nondecreasing in T ensuring that \(F_t(\theta )\) is monotonically nondecreasing. \(\diamond \)

Proof of Proposition 5

Point i). Consider first the CD-test and let \(g(t)=H_t([\theta _1,\theta _2])=H_t(\theta _2)-H_t(\theta _1)=F_{\theta _1}(t)-F_{\theta _2}(t)\) , which is a nonnegative, continuous function with \(\lim _{t\rightarrow \pm \infty }g(t)=0\) and with derivative \(g^\prime (t)=f_{\theta _1}(t)- f_{\theta _2}(t)\) . Let \(t_0 \in \mathbb {R}\) be a point such that g is nondecreasing for \(t<t_0\) and strictly decreasing for \(t \in (t_0,t_1)\) , for a suitable \(t_1>t_0\) ; the existence of \(t_0\) is guaranteed by the properties of g . It follows that \(g^\prime (t) \ge 0\) for \(t<t_0\) and \(g^\prime (t)<0\) in \((t_0,t_1)\) . We show that \(t_0\) is the unique point at which the function \(g^\prime \) changes sign. Indeed, if \(t_2\) were a point greater than \(t_1\) such that \(g^\prime (t)>0\) for t in a suitable interval \((t_2,t_3)\) , with \(t_3> t_2\) , we would have, in this interval, \(f_{\theta _1}(t)>f_{\theta _2}(t)\) . Since \(f_{\theta _1}(t)<f_{\theta _2}(t)\) for \(t \in (t_0,t_1)\) , this implies \(f_{\theta _2}(t)/f_{\theta _1}(t)>1\) for \(t \in (t_0,t_1)\) and \(f_{\theta _2}(t)/f_{\theta _1}(t)<1\) for \(t \in (t_2,t_3)\) , which contradicts the assumption of the (nondecreasing) MLR in T . Thus, g ( t ) is nondecreasing for \(t<t_0\) and nonincreasing for \(t>t_0\) , and the set \(\{t: H_t([\theta _1,\theta _2])< \gamma \}\) coincides with \( \{t: t<t^\prime \) or \(t>t^{\prime \prime }\}\) for suitable \(t^\prime \) and \(t^{\prime \prime }\) .

Consider now the CD*-test. The corresponding support is \(\min \{H_t(\theta _2), 1-H_t(\theta _1)\}= \min \{1-F_{\theta _2}(t), F_{\theta _1}(t)\}\) , which is a continuous function of t and approaches zero as \(t \rightarrow \pm \infty \) . Moreover, it equals \(F_{\theta _1}(t)\) for \(t\le t^*=\inf \{t: F_{\theta _1}(t)=1-F_{\theta _2}(t)\}\) and \(1-F_{\theta _2}(t)\) for \(t\ge t^*\) . Thus, the function is nondecreasing for \(t \le t^*\) and nonincreasing for \(t \ge t^*\) , and the result is proven.

Point ii). Suppose we have observed \(t^\prime = F_{\theta _1}^{-1}(\gamma )\) ; then the CD-support for \({{{\mathcal {H}}}_{0}}\) is

so that \(t^\prime \) belongs to the rejection region defined by the threshold \(\gamma \) . Due to the structure of this region specified in point i), all \(t\le t^{\prime }\) belong to it. Now,

because \(F_{\theta }(t) \le F_{\theta _1}(t)\) for each t and \(\theta \in [\theta _1,\theta _2]\) . It follows that the size of the CD-test with threshold \(\gamma \) is not smaller than \(\gamma \) .

Point iii). The result follows from the equality of the CD*-support with the e -value, as stated in Proposition 4 , and the uniformity of the e -value as proven in Peskun ( 2020 , Sec. 2).

Point iv). The size of the CD*-test with threshold \(\gamma ^*\) is the supremum on \([\theta _1,\theta _2]\) of the following probability

under the assumption that \(F_{\theta _1}^{-1}(\gamma ^*) <F_{\theta _2}^{-1}(1-\gamma ^*)\) , otherwise the probability is one. Because \(F_{\theta _2}(t) \le F_{\theta }(t) \le F_{\theta _1}(t)\) for each t and \(\theta \in [\theta _1,\theta _2]\) , it follows that \(F_{\theta }(F_{\theta _1}^{-1}(\gamma ^*)) \le F_{\theta _1}(F_{\theta _1}^{-1}(\gamma ^*))=\gamma ^*\) , and \(F_{\theta }(F_{\theta _2}^{-1}(1-\gamma ^*)) \ge F_{\theta _2}(F_{\theta _2}^{-1}(1-\gamma ^*)) = 1-\gamma ^*\) so that the size is

Finally, if \(\theta =\theta _2\) , from ( 7 ) we have

and thus the size of the CD*-test must be included in the interval \([\gamma ^*,2\gamma ^*]\) , provided that \(2\gamma ^*\) is less than 1. For the case \(\theta _1=\theta _2\) , it follows from ( 7 ) that the size of the CD*-test is \(2\gamma ^*\) .

Point v). Because \(H_t([\theta _1,\theta _2])=H_t(\theta _2)-H_t(\theta _1)\le H_t(\theta _2)\) and also \(H_t(\theta _2)-H_t(\theta _1) \le 1-H_t(\theta _1)\) , recalling Definition 4 , it immediately follows that the CD-support is not greater than the CD*-support. Thus, if the same threshold is fixed for the two tests, the rejection region of the CD-test includes that of the CD*-test, and the size of the first test is not smaller than that of the second one. \(\diamond \)

Proof of Proposition 6

Recall from point i) of Proposition 5 that the CD-test with threshold \(\gamma \) rejects \({{{\mathcal {H}}}_{0}}: \theta \in [\theta _1,\theta _2]\) for values of T less than \(t^\prime \) or greater than \(t^{\prime \prime }\) , with \(t^\prime \) and \(t^{\prime \prime }\) solutions of the equation \(F_{\theta _1}(t)-F_{\theta _2}(t)=\gamma \) . Denoting with \(\pi _{CD}\) its power function, we have

Thus the power function of the CD-test takes the same value at \(\theta _1\) and \(\theta _2\) , and this condition characterizes the UMPU test for exponential families; see Lehmann & Romano ( 2005 , p. 135). \(\diamond \)


About this article

Melilli, E., Veronese, P. Confidence distributions and hypothesis testing. Stat Papers (2024). https://doi.org/10.1007/s00362-024-01542-4

Received : 05 April 2023

Revised : 14 December 2023

Published : 29 March 2024

DOI : https://doi.org/10.1007/s00362-024-01542-4


Keywords

  • Confidence curve
  • Precise and interval hypotheses
  • Statistical measure of evidence
  • Uniformly most powerful test



Inferential Statistics | An Easy Introduction & Examples

Published on September 4, 2020 by Pritha Bhandari. Revised on June 22, 2023.

While descriptive statistics summarize the characteristics of a data set, inferential statistics help you come to conclusions and make predictions based on your data.

When you have collected data from a sample , you can use inferential statistics to understand the larger population from which the sample is taken.

Inferential statistics have two main uses:

  • making estimates about populations (for example, the mean SAT score of all 11th graders in the US).
  • testing hypotheses to draw conclusions about populations (for example, the relationship between SAT scores and family income).

Table of contents

  • Descriptive versus inferential statistics
  • Estimating population parameters from sample statistics
  • Hypothesis testing
  • Frequently asked questions about inferential statistics

Descriptive statistics allow you to describe a data set, while inferential statistics allow you to make inferences based on a data set.

Descriptive statistics

Using descriptive statistics, you can report characteristics of your data:

  • The distribution concerns the frequency of each value.
  • The central tendency concerns the averages of the values.
  • The variability concerns how spread out the values are.

In descriptive statistics, there is no uncertainty – the statistics precisely describe the data that you collected. If you collect data from an entire population, you can directly compare these descriptive statistics to those from other populations.

Inferential statistics

Most of the time, you can only acquire data from samples, because it is too difficult or expensive to collect data from the whole population that you’re interested in.

While descriptive statistics can only summarize a sample’s characteristics, inferential statistics use your sample to make reasonable guesses about the larger population.

With inferential statistics, it’s important to use random and unbiased sampling methods . If your sample isn’t representative of your population, then you can’t make valid statistical inferences or generalize .

Sampling error in inferential statistics

Since the size of a sample is always smaller than the size of the population, some of the population isn’t captured by sample data. This creates sampling error , which is the difference between the true population values (called parameters) and the measured sample values (called statistics).

Sampling error arises any time you use a sample, even if your sample is random and unbiased. For this reason, there is always some uncertainty in inferential statistics. However, using probability sampling methods reduces this uncertainty.
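
A brief simulation (illustrative only, with made-up numbers) shows sampling error in action: sample means differ from the population mean even under random sampling, and the typical error shrinks as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50, scale=10, size=100_000)   # synthetic population
true_mean = population.mean()

for n in (10, 100, 1000):
    sample = rng.choice(population, size=n, replace=False)
    error = sample.mean() - true_mean
    print(f"n={n:5d}  sample mean={sample.mean():6.2f}  sampling error={error:+.2f}")
```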

Here's why students love Scribbr's proofreading services

Discover proofreading & editing

The characteristics of samples and populations are described by numbers called statistics and parameters :

  • A statistic is a measure that describes the sample (e.g., sample mean ).
  • A parameter is a measure that describes the whole population (e.g., population mean).

Sampling error is the difference between a parameter and a corresponding statistic. Since in most cases you don’t know the real population parameter, you can use inferential statistics to estimate these parameters in a way that takes sampling error into account.

There are two important types of estimates you can make about the population: point estimates and interval estimates .

  • A point estimate is a single value estimate of a parameter. For instance, a sample mean is a point estimate of a population mean.
  • An interval estimate gives you a range of values where the parameter is expected to lie. A confidence interval is the most common type of interval estimate.

Both types of estimates are important for gathering a clear idea of where a parameter is likely to lie.

Confidence intervals

A confidence interval uses the variability around a statistic to come up with an interval estimate for a parameter. Confidence intervals are useful for estimating parameters because they take sampling error into account.

While a point estimate gives you a precise value for the parameter you are interested in, a confidence interval tells you the uncertainty of the point estimate. They are best used in combination with each other.

Each confidence interval is associated with a confidence level. A confidence level tells you the probability (as a percentage) that an interval constructed this way would contain the true parameter if you repeated the study.

A 95% confidence level means that if you repeated the study with new samples in exactly the same way 100 times, you could expect about 95 of the resulting intervals to contain the true population parameter.

Although the procedure captures the parameter a known percentage of the time across repeated samples, you cannot say for sure whether any single interval you compute contains the actual population parameter. That’s because you can’t know the true value of the population parameter without collecting data from the full population.

However, with random sampling and a suitable sample size, you can reasonably expect your confidence interval to contain the parameter a certain percentage of the time.

For example, suppose you survey a sample of employees about their paid vacation days and find a sample mean of 19 days. Your point estimate of the population mean paid vacation days is then the sample mean of 19 paid vacation days.
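
Since only part of the original example is reproduced here, the sketch below uses made-up vacation-day figures whose mean is 19 to show how the point estimate and a 95% confidence interval for the population mean would be computed:

```python
import numpy as np
from scipy import stats

days = np.array([15, 22, 19, 18, 25, 16, 20, 17, 21, 17])  # hypothetical sample
point_estimate = days.mean()                                # 19.0

# t-based 95% confidence interval: mean +/- t * standard error
ci = stats.t.interval(0.95, df=len(days) - 1,
                      loc=point_estimate, scale=stats.sem(days))
print(point_estimate, ci)
```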

Hypothesis testing is a formal process of statistical analysis using inferential statistics. The goal of hypothesis testing is to compare populations or assess relationships between variables using samples.

Hypotheses , or predictions, are tested using statistical tests . Statistical tests also estimate sampling errors so that valid inferences can be made.

Statistical tests can be parametric or non-parametric. Parametric tests are considered more statistically powerful because they are more likely to detect an effect if one exists.

Parametric tests make assumptions that include the following:

  • the population that the sample comes from follows a normal distribution of scores
  • the sample size is large enough to represent the population
  • the variances , a measure of variability , of each group being compared are similar

When your data violates any of these assumptions, non-parametric tests are more suitable. Non-parametric tests are called “distribution-free tests” because they don’t assume anything about the distribution of the population data.
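
One way to act on these assumptions in practice is sketched below: check normality in each group and fall back to a distribution-free comparison when the check fails. The data are simulated and the 0.05 cut-off is a convention, not a rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(10, 2, size=30)
group_b = rng.exponential(scale=10, size=30)      # clearly non-normal

looks_normal = (stats.shapiro(group_a).pvalue > 0.05 and
                stats.shapiro(group_b).pvalue > 0.05)

if looks_normal:
    result = stats.ttest_ind(group_a, group_b)     # parametric comparison
else:
    result = stats.mannwhitneyu(group_a, group_b)  # distribution-free comparison
print(result)
```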

Statistical tests come in three forms: tests of comparison, correlation or regression.

Comparison tests

Comparison tests assess whether there are differences in means, medians or rankings of scores of two or more groups.

To decide which test suits your aim, consider whether your data meets the conditions necessary for parametric tests, the number of samples, and the levels of measurement of your variables.

Means can only be found for interval or ratio data , while medians and rankings are more appropriate measures for ordinal data .
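
As a sketch of a comparison test on more than two groups, the snippet below runs a one-way ANOVA on simulated interval-level scores; the group means and sizes are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group1 = rng.normal(70, 8, size=25)
group2 = rng.normal(74, 8, size=25)
group3 = rng.normal(69, 8, size=25)

f_stat, p_value = stats.f_oneway(group1, group2, group3)   # one-way ANOVA
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```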

Correlation tests

Correlation tests determine the extent to which two variables are associated.

Although Pearson’s r is the most statistically powerful test, Spearman’s r is appropriate for interval and ratio variables when the data doesn’t follow a normal distribution.

The chi square test of independence is the only test that can be used with nominal variables.
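
The sketch below runs the three kinds of tests just mentioned on simulated data; the effect sizes and the contingency counts are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(scale=0.8, size=100)

print(stats.pearsonr(x, y))   # Pearson's r: linear association
print(stats.spearmanr(x, y))  # Spearman's rho: rank-based, no normality assumption

# Chi-square test of independence on a 2x2 table of counts (nominal variables)
table = np.array([[30, 10],
                  [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)
```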

Regression tests

Regression tests assess whether changes in one or more predictor variables are associated with changes in an outcome variable. You can decide which regression test to use based on the number and types of variables you have as predictors and outcomes.

Most of the commonly used regression tests are parametric. If your data is not normally distributed, you can perform data transformations.

Data transformations help you make your data normally distributed using mathematical operations, like taking the square root of each value.
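
A minimal sketch of this workflow: a right-skewed outcome is square-root transformed and then a simple linear regression is fitted. The data-generating choices are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)
y = (2 + 0.8 * x + rng.normal(scale=1.0, size=200)) ** 2   # skewed outcome

y_transformed = np.sqrt(y)                # transformation toward normality
fit = stats.linregress(x, y_transformed)  # simple linear regression test
print(f"slope={fit.slope:.3f}  p-value={fit.pvalue:.3g}")
```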


Descriptive statistics summarize the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalizable to the broader population.

A statistic refers to measures about the sample , while a parameter refers to measures about the population .

A sampling error is the difference between a population parameter and a sample statistic .

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

Cite this Scribbr article

Bhandari, P. (2023, June 22). Inferential Statistics | An Easy Introduction & Examples. Scribbr. Retrieved April 2, 2024, from https://www.scribbr.com/statistics/inferential-statistics/


