Hypothesis Testing – A Deep Dive into Hypothesis Testing, The Backbone of Statistical Inference

  • September 21, 2023

Explore the intricacies of hypothesis testing, a cornerstone of statistical analysis. Dive into methods, interpretations, and applications for making data-driven decisions.


In this blog post, we will learn:

  • What is Hypothesis Testing?
  • Steps in Hypothesis Testing
    • 2.1. Set up Hypotheses: Null and Alternative
    • 2.2. Choose a Significance Level (α)
    • 2.3. Calculate a Test Statistic and P-Value
    • 2.4. Make a Decision
  • Example: Testing a new drug
  • Example in Python

1. What is Hypothesis Testing?

In simple terms, hypothesis testing is a method used to make decisions or inferences about population parameters based on sample data. Imagine being handed a die and asked if it's biased. By rolling it a few times and analyzing the outcomes, you'd be engaging in the essence of hypothesis testing.

Think of hypothesis testing as the scientific method of the statistics world. Suppose you hear claims like “This new drug works wonders!” or “Our new website design boosts sales.” How do you know if these statements hold water? Enter hypothesis testing.

2. Steps in Hypothesis Testing

  • Set up Hypotheses : Begin with a null hypothesis (H0) and an alternative hypothesis (Ha).
  • Choose a Significance Level (α) : Typically 0.05, this is the probability of rejecting the null hypothesis when it’s actually true. Think of it as the chance of accusing an innocent person.
  • Calculate Test statistic and P-Value : Gather evidence (data) and calculate a test statistic.
  • p-value : This is the probability of observing the data, given that the null hypothesis is true. A small p-value (typically ≤ 0.05) suggests the data is inconsistent with the null hypothesis.
  • Decision Rule : If the p-value is less than or equal to α, you reject the null hypothesis in favor of the alternative.

2.1. Set up Hypotheses: Null and Alternative

Before diving into testing, we must formulate hypotheses. The null hypothesis (H0) represents the default assumption, while the alternative hypothesis (H1) challenges it.

For instance, in drug testing, H0: “The new drug is no better than the existing one,” H1: “The new drug is superior.”

2.2. Choose a Significance Level (α)

You collect and analyze data to test H0 against H1. Based on your analysis, you decide whether to reject the null hypothesis in favor of the alternative, or fail to reject the null hypothesis.

The significance level, often denoted by $α$, represents the probability of rejecting the null hypothesis when it is actually true.

In other words, it’s the risk you’re willing to take of making a Type I error (false positive).

Type I Error (False Positive) :

  • Symbolized by the Greek letter alpha (α).
  • Occurs when you incorrectly reject a true null hypothesis . In other words, you conclude that there is an effect or difference when, in reality, there isn’t.
  • The probability of making a Type I error equals the significance level of the test. Commonly, tests are conducted at the 0.05 significance level, which means there's a 5% chance of making a Type I error.
  • Commonly used significance levels are 0.01, 0.05, and 0.10, but the choice depends on the context of the study and the level of risk one is willing to accept.

Example : If a drug is not effective (truth), but a clinical trial incorrectly concludes that it is effective (based on the sample data), then a Type I error has occurred.

Type II Error (False Negative) :

  • Symbolized by the Greek letter beta (β).
  • Occurs when you fail to reject a false null hypothesis. This means you conclude there is no effect or difference when, in reality, there is.
  • The probability of making a Type II error is denoted by β. The power of a test (1 – β) represents the probability of correctly rejecting a false null hypothesis.

Example : If a drug is effective (truth), but a clinical trial incorrectly concludes that it is not effective (based on the sample data), then a Type II error has occurred.

Balancing the Errors :


In practice, there’s a trade-off between Type I and Type II errors. Reducing the risk of one typically increases the risk of the other. For example, if you want to decrease the probability of a Type I error (by setting a lower significance level), you might increase the probability of a Type II error unless you compensate by collecting more data or making other adjustments.

It’s essential to understand the consequences of both types of errors in any given context. In some situations, a Type I error might be more severe, while in others, a Type II error might be of greater concern. This understanding guides researchers in designing their experiments and choosing appropriate significance levels.

2.3. Calculate a test statistic and P-Value

Test statistic : A test statistic is a single number that helps us understand how far our sample data is from what we’d expect under a null hypothesis (a basic assumption we’re trying to test against). Generally, the larger the test statistic, the more evidence we have against our null hypothesis. It helps us decide whether the differences we observe in our data are due to random chance or if there’s an actual effect.

P-value : The P-value tells us how likely we would get our observed results (or something more extreme) if the null hypothesis were true. It's a value between 0 and 1.

  • A smaller P-value (typically below 0.05) means that the observation is rare under the null hypothesis, so we might reject the null hypothesis.
  • A larger P-value suggests that what we observed could easily happen by random chance, so we might not reject the null hypothesis.

2.4. Make a Decision

Relationship between $α$ and P-Value

When conducting a hypothesis test:

We first choose a significance level $α$, then calculate the p-value from our sample data and the test statistic.

Finally, we compare the p-value to our chosen $α$:

  • If p-value ≤ α: We reject the null hypothesis in favor of the alternative hypothesis. The result is said to be statistically significant.
  • If p-value > α: We fail to reject the null hypothesis. There isn't enough statistical evidence to support the alternative hypothesis.
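As a tiny illustration of this decision rule (the p-value here is just a made-up number, not the output of any particular test):

```python
alpha = 0.05      # chosen significance level
p_value = 0.03    # hypothetical p-value obtained from some test

if p_value <= alpha:
    print("Reject the null hypothesis: the result is statistically significant.")
else:
    print("Fail to reject the null hypothesis.")
```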

3. Example: Testing a New Drug

Imagine we are investigating whether a new drug treats headaches faster than a placebo.

Setting Up the Experiment: You gather 100 people who suffer from headaches. Half of them (50 people) are given the new drug (the 'Drug Group'), and the other half are given a sugar pill with no active medication (the 'Placebo Group').

  • Set up Hypotheses : Before starting, you make a prediction:
  • Null Hypothesis (H0): The new drug has no effect. Any difference in healing time between the two groups is just due to random chance.
  • Alternative Hypothesis (H1): The new drug does have an effect. The difference in healing time between the two groups is significant and not just by chance.

Calculate Test statistic and P-Value : After the experiment, you analyze the data. The “test statistic” is a number that helps you understand the difference between the two groups in terms of standard units.

For instance, let’s say:

  • The average healing time in the Drug Group is 2 hours.
  • The average healing time in the Placebo Group is 3 hours.

The test statistic helps you understand how significant this 1-hour difference is. If the groups are large and the spread of healing times in each group is small, then this difference might be significant. But if there’s a huge variation in healing times, the 1-hour difference might not be so special.

Imagine the P-value as answering this question: “If the new drug had NO real effect, what’s the probability that I’d see a difference as extreme (or more extreme) as the one I found, just by random chance?”

For instance:

  • P-value of 0.01 means there’s a 1% chance that the observed difference (or a more extreme difference) would occur if the drug had no effect. That’s pretty rare, so we might consider the drug effective.
  • P-value of 0.5 means there’s a 50% chance you’d see this difference just by chance. That’s pretty high, so we might not be convinced the drug is doing much.
  • If the P-value is less than ($α$) 0.05: the results are “statistically significant,” and they might reject the null hypothesis , believing the new drug has an effect.
  • If the P-value is greater than ($α$) 0.05: the results are not statistically significant, and they don’t reject the null hypothesis , remaining unsure if the drug has a genuine effect.

4. Example in Python

For simplicity, let’s say we’re using a t-test (common for comparing means). Let’s dive into Python:
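The post's original code isn't reproduced here, so below is only a minimal sketch of the two-sample t-test it describes, using simulated healing times centered on the 2-hour (drug) and 3-hour (placebo) group means from the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated healing times in hours: 50 people per group, means of roughly 2 vs 3
drug_group = rng.normal(loc=2.0, scale=0.8, size=50)
placebo_group = rng.normal(loc=3.0, scale=0.8, size=50)

# Independent two-sample t-test comparing the two group means
t_statistic, p_value = stats.ttest_ind(drug_group, placebo_group)
print(f"t = {t_statistic:.2f}, p = {p_value:.6f}")

if p_value < 0.05:
    print("The results are statistically significant! The drug seems to have an effect.")
else:
    print("Looks like the drug isn't as miraculous as we thought.")
```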

Making a Decision : If the p-value is less than 0.05, we say, “The results are statistically significant! The drug seems to have an effect!” If not, we'd say, “Looks like the drug isn't as miraculous as we thought.”

5. Conclusion

Hypothesis testing is an indispensable tool in data science, allowing us to make data-driven decisions with confidence. By understanding its principles, conducting tests properly, and considering real-world applications, you can harness the power of hypothesis testing to unlock valuable insights from your data.



Null Hypothesis


Understanding the Null Hypothesis

The null hypothesis is a fundamental concept in statistics that serves as a starting point for statistical testing. Often denoted as H0, the null hypothesis is a general statement or default position that there is no relationship between two measured phenomena or no association among groups. In other words, it assumes that any kind of difference or significance you see in a set of data is due to chance.

Role of the Null Hypothesis in Statistical Tests

Statistical hypothesis testing is a method of making decisions using data, whether from a controlled experiment or an observational study. The null hypothesis is what you attempt to disprove or nullify with evidence to the contrary. It is contrasted with the alternative hypothesis, denoted as H1 or Ha, which expresses that there is a statistically significant relationship between two variables.

The process of hypothesis testing involves choosing a null hypothesis which is tested against the alternative hypothesis. If there is enough evidence to suggest that the null hypothesis is not plausible, it is rejected in favor of the alternative hypothesis. Rejecting it does not prove the null hypothesis false; it only indicates that there is enough evidence to support the alternative hypothesis.

Examples of Null Hypotheses

Here are a few examples of null hypotheses:

  • In a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug. We would write H0: there is no difference in effectiveness between the new and current drugs.
  • In a plant growth experiment, the null hypothesis might be that the type of fertilizer does not affect the growth rate of plants. We would write H0: the mean growth rate for plants with fertilizer type A is equal to the mean growth rate for plants with fertilizer type B.
  • In a study on education techniques, the null hypothesis might be that a new teaching strategy has no effect on student performance. We would write H0: the average test score for students taught with the new strategy is the same as the average score for students taught with traditional methods.

Importance of the Null Hypothesis in Research

The null hypothesis is important in research because it can be tested and potentially rejected, which then suggests that there is a relationship between the observed variables. Rejecting or failing to reject the null hypothesis does not prove either hypothesis. Instead, statistical tests provide evidence that supports a hypothesis or quantify the probability that the observed data occurred by chance.

Decision Making in Hypothesis Testing

When conducting a hypothesis test, a significance level (alpha) must be determined. The significance level is the probability of rejecting the null hypothesis when it is actually true, commonly set at 0.05 (5%). If the p-value of the test is less than the chosen significance level, then the null hypothesis is rejected.

Keep in mind that rejecting the null hypothesis is not a proof of the truth of the alternative hypothesis; it only suggests that there is enough statistical evidence to prefer the alternative hypothesis over the null hypothesis.

Types of Errors in Hypothesis Testing

There are two types of errors that can occur in hypothesis testing:

  • Type I error : This occurs when the null hypothesis is true, but is incorrectly rejected. It is equivalent to a false positive.
  • Type II error : This occurs when the null hypothesis is false, but erroneously fails to be rejected. It is equivalent to a false negative.

Researchers aim to minimize these errors, but they can never be completely eliminated. The design of the study and the choice of significance level can help control the rate of Type I errors.

The null hypothesis is a crucial part of any statistical analysis, representing the theory that there is no effect or no difference, and serves as the assertion to be challenged and potentially rejected in favor of an alternative hypothesis. Understanding the null hypothesis and its role in research is essential for interpreting the results of statistical tests and making informed decisions based on data.


Mastering p-value and Null Hypothesis with Scikit Learn

Understanding p-value and Null Hypothesis

As data continues to play an essential role in shaping our decision-making process, it's crucial to understand the different statistical tools used to analyze data accurately. One of the most important statistical tools is the p-value.

But before we dive deeper into the p-value, let's first understand the concept of the Null Hypothesis and its importance.

Definition and Importance of Null Hypothesis

Null Hypothesis, also known as H0, is a statement that assumes there is no significant difference between two groups or factors in a study. In other words, the Null Hypothesis is a default position that researchers take when analyzing the data.

For example, let’s say the research question is: “Does eating breakfast every day impact weight loss?” The Null Hypothesis would be that “There is no significant difference in weight loss between the group that ate breakfast every day and those who did not eat breakfast every day.” The importance of Null Hypothesis is that it provides a baseline to compare the data against.

Statistical Significance and Alpha level

When conducting research, we usually have an idea of what the expected outcome should be. The aim of statistical tests is to evaluate whether or not the results are due to chance or if they are statistically significant.

Statistical significance refers to the level of confidence we have in the data, knowing that the results didn’t happen by chance. The Alpha level is the predetermined probability level that the results are due to chance, usually set at 0.05.

This means that if the p-value is less than 0.05, the results are considered statistically significant, which provides evidence against the Null Hypothesis.

Definition and Significance of p-value

Now that we've understood the importance of the Null Hypothesis and statistical significance, let's dive into the concept of the p-value. The p-value is the probability of obtaining results at least as extreme as those observed, given that the Null Hypothesis is true.

In other words, p-value is a statistical tool used to determine the statistical significance of the data analyzed. For example, let’s say the p-value obtained from the research is 0.03.

This means that the probability of obtaining such results if the Null Hypothesis were true is 3%. Since 3% is less than the Alpha level of 0.05, the results are statistically significant.

This implies that we reject the Null Hypothesis, and the outcome was not due to chance.

Finding p-value using Statsmodel Library and Scikit Learn library

There are two ways that we can calculate the p-value when analyzing data – using Statsmodel Library and Scikit Learn library.

Using Statsmodel Library

Statsmodel library is an essential library in Python used for statistical modeling, data visualization, and statistical analysis. The library has an Ordinary Least Squares (OLS) formula used to fit linear regression models.

The OLS method creates a linear regression line that fits the data, and the p-value for each variable can be computed. To use the Statsmodels library, follow these steps:

1. Import the library using `import statsmodels.api as sm`.
2. Load your dataset.
3. Add a constant term to the data using `X = sm.add_constant(X)`.
4. Fit the linear regression model using `model = sm.OLS(y, X).fit()`.
5. Get the p-values for each variable using `model.pvalues`.
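The article doesn't include the code listing itself, so here is a minimal sketch of these five steps on a small synthetic dataset (the data and variable names are made up purely for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: y depends on the first column of X, the second column is noise
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# Add a constant (intercept) term and fit the OLS model
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

# p-value for each coefficient: constant, x1, x2
print(model.pvalues)
```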

Using Scikit Learn Library

Scikit Learn library is another widely used library in Python used for machine learning. The library computes p-values for predictive modeling of continuous and categorical data.

Scikit Learn provides a function, `f_regression`, that computes an F-statistic and a p-value for each feature; the cumulative distribution function (cdf) of the test statistic can also be used to calculate p-values directly. To use the Scikit Learn library, follow these steps:

1. Import the function using `from sklearn.feature_selection import f_regression`.
2. Split the data into independent (X) and dependent (y) variables.
3. Compute the p-values using `f_regression(X, y)`.
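Again, the original code isn't shown in the article; a self-contained sketch of these steps on the same kind of synthetic data might look like this:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)

# Hypothetical data: y depends on the first feature only
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# Univariate F-test per feature: returns F-statistics and p-values
f_statistic, p_values = f_regression(X, y)
print(p_values)
```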

In conclusion, Null Hypothesis and p-value are critical concepts that should be understood when analyzing data. Understanding and utilizing the two tools accurately can help researchers make informed decisions and conclusions based on data analysis.

The two libraries discussed above can aid in computing the p-value, making reliable data analysis easier. Hence, it's crucial to embrace statistical analysis to understand the results and make informed decisions in any research, business, or study.

Calculating p-value in Scikit Learn Library

When conducting statistical analysis, one of the key aspects that researchers need to understand is how to compute the p-value. A p-value is a measure of the statistical significance of the results obtained in a study.

In this article, we will focus on how to calculate the p-value using the Scikit Learn library in Python. We will also discuss the different tests used to compute the p-value and how to calculate the test statistic for a given hypothesis test.

Calculation of Test Statistic and p-value for left-tailed test

A left-tailed test is a hypothesis test where the alternative hypothesis states that the parameter is less than the value specified in the null hypothesis. To calculate the p-value for a left-tailed test, we need to find the area under the curve to the left of the test statistic.

The test statistic for a left-tailed test is calculated as:

test statistic = (sample mean – hypothesized mean) / (sample standard deviation / sqrt(sample size))

where the hypothesized mean is the value of the mean under the null hypothesis. Once we have computed the test statistic, we can use the t-distribution to find the p-value.

To calculate the p-value for a left-tailed test in Python, we first need to import the required function (note that `ttest_1samp` actually comes from SciPy's `scipy.stats` module):

```python
from scipy.stats import ttest_1samp
```

Next, we load the dataset and perform the t-test using the `ttest_1samp()` function:

```python
t_statistic, p_value = ttest_1samp(data, hypothesized_mean)
```

The `ttest_1samp()` function returns two values: the test statistic and the p-value. In this case, we are only interested in the p-value. Note that the p-value returned by default is two-sided; for a left-tailed test, pass `alternative='less'` to get the one-sided p-value directly.

Calculation of Test Statistic and p-value for right-tailed test

A right-tailed test is a hypothesis test where the alternative hypothesis states that the parameter is greater than the value specified in the null hypothesis. To calculate the p-value for a right-tailed test, we need to find the area under the curve to the right of the test statistic.

The test statistic for a right-tailed test is calculated with the same formula:

test statistic = (sample mean – hypothesized mean) / (sample standard deviation / sqrt(sample size))

To calculate the p-value for a right-tailed test, we again use `ttest_1samp()`, this time with `alternative='greater'` (a combined code sketch for all three cases follows the two-tailed section below).

Calculation of Test Statistic and p-value for two-tailed test

A two-tailed test is a hypothesis test where the alternative hypothesis states that the parameter is different from the value specified in the null hypothesis. To calculate the p-value for a two-tailed test, we need to find the area under the curve in both tails beyond the test statistic.

The test statistic for a two-tailed test is again calculated as:

test statistic = (sample mean – hypothesized mean) / (sample standard deviation / sqrt(sample size))

To calculate the p-value for a two-tailed test, we use `ttest_1samp()` with its default two-sided alternative:
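The code for these three sections is not included in the article; a minimal combined sketch, using a made-up sample and a made-up hypothesized mean, could look like this (the `alternative=` argument selects the tail):

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical sample and hypothesized mean under the null hypothesis
data = np.array([9.1, 9.8, 10.6, 10.4, 9.9, 10.2, 9.7, 10.3])
hypothesized_mean = 10.0

# Left-tailed: H1 says the true mean is less than hypothesized_mean
t_left, p_left = ttest_1samp(data, hypothesized_mean, alternative='less')

# Right-tailed: H1 says the true mean is greater than hypothesized_mean
t_right, p_right = ttest_1samp(data, hypothesized_mean, alternative='greater')

# Two-tailed (default): H1 says the true mean differs from hypothesized_mean
t_two, p_two = ttest_1samp(data, hypothesized_mean)

print(p_left, p_right, p_two)
```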

Once we have computed the p-value for a given hypothesis test, we can use it to interpret the statistical significance of our results. If the p-value is less than our significance level, which is usually set at 0.05, we reject the null hypothesis.

On the other hand, if the p-value is greater than our significance level, we fail to reject the null hypothesis.

Conclusion and Summary

In summary, the p-value is a useful statistical tool that can be used to evaluate the statistical significance of our results. In this article, we have discussed how to calculate the p-value for different tests using the Scikit Learn library in Python.

We have also discussed how to calculate the test statistic for a given hypothesis test. By understanding these concepts, we can make informed decisions and conclusions based on the data analyzed.

The implementation of these concepts in Python libraries makes data analysis more efficient and streamlined, making scientific research and data analysis more accessible to more people. In summary, this article discusses the importance of understanding p-value and its significance in statistical analysis.

It covers the definition and importance of Null Hypothesis, statistical significance and Alpha level, and the definition and significance of p-value. Additionally, it explores how to calculate p-value in Scikit Learn Library using the different tests, such as left-tailed, right-tailed and two-tailed tests, and how to calculate the test statistic for a given hypothesis test.

Python libraries such as Scikit Learn and Statsmodel can help make data analysis more efficient and accurate. Understanding p-value accurately is crucial for informed decision-making, and its relevance in statistical analysis cannot be overstated.


Hypothesis Testing: Null Hypothesis and Alternative Hypothesis


Figuring out exactly what the null hypothesis and the alternative hypotheses are is not a walk in the park. Hypothesis testing is based on the knowledge that you can acquire by going over what we have previously covered about statistics in our blog.

So, if you don’t want to have a hard time keeping up, make sure you have read all the tutorials about confidence intervals , distributions , z-tables and t-tables .


Confidence intervals, distributions, z-tables and t-tables

Confidence intervals provide us with an estimation of where the parameters are located. You can obtain them with our confidence interval calculator and learn more about them in the related article.


However, when we are making a decision, we need a yes or no answer. The correct approach, in this case, is to use a test .

Here we will start learning about one of the fundamental tasks in statistics - hypothesis testing !

Hypothesis Testing

The Hypothesis Testing Process

  First off, let’s talk about data-driven decision-making. It consists of the following steps:

  • First, we must formulate a hypothesis .
  • After doing that, we have to find the right test for our hypothesis .
  • Then, we execute the test.
  • Finally, we make a decision based on the result.


Let’s start from the beginning.

What is a Hypothesis?

Though there are many ways to define it, the most intuitive must be:

“A hypothesis is an idea that can be tested.”


This is not the formal definition, but it explains the point very well.

So, if we say that apples in New York are expensive, this is an idea or a statement. However, it is not testable, until we have something to compare it with.


For instance, if we define expensive as any price higher than $1.75 per pound, then it immediately becomes a hypothesis.


What Cannot Be a Hypothesis?

An example may be: would the USA do better or worse under a Clinton administration, compared to a Trump administration? Statistically speaking, this is an idea , but there is no data to test it. Therefore, it cannot be a hypothesis of a statistical test.


Actually, it is more likely to be a topic of another discipline.

Conversely, in statistics, we may compare different US presidencies that have already been completed. For example, the Obama administration and the Bush administration, as we have data on both.


A Two-Sided Test

Alright, let’s get out of politics and get into hypotheses . Here’s a simple topic that CAN be tested.

According to Glassdoor (the popular salary information website), the mean data scientist salary in the US is 113,000 dollars.


So, we want to test if their estimate is correct.

The Null and Alternative Hypotheses

There are two hypotheses that are made: the null hypothesis, denoted H0, and the alternative hypothesis, denoted H1 or HA.


The null hypothesis is the one to be tested and the alternative is everything else. In our example:

The null hypothesis would be: The mean data scientist salary is 113,000 dollars.


While the alternative : The mean data scientist salary is not 113,000 dollars.



An Example of a One-Sided Test

You can also form one-sided or one-tailed tests.

Say your friend, Paul, told you that he thinks data scientists earn more than 125,000 dollars per year. You doubt him, so you design a test to see who’s right.


The null hypothesis of this test would be: The mean data scientist salary is more than 125,000 dollars.

The alternative will cover everything else, thus: The mean data scientist salary is less than or equal to 125,000 dollars.


Important: The outcomes of tests refer to the population parameter rather than the sample statistic! So, the result that we get is for the population.


Important: Another crucial consideration is that, generally, the researcher is trying to reject the null hypothesis . Think about the null hypothesis as the status quo and the alternative as the change or innovation that challenges that status quo. In our example, Paul was representing the status quo, which we were challenging.


Let’s go over it once more. In statistics, the null hypothesis is the statement we are trying to reject. Therefore, the null hypothesis is the present state of affairs, while the alternative is our personal opinion.


Why Hypothesis Testing Works

Right now, you may be feeling a little puzzled. This is normal because this whole concept is counter-intuitive at the beginning. However, there is an extremely easy way to continue your journey of exploring it. By diving into the linked tutorial, you will find out why hypothesis testing actually works.


Best Guesses: Understanding The Hypothesis in Machine Learning

Stewart Kaplan

  • February 22, 2024
  • General , Supervised Learning , Unsupervised Learning

Machine learning is a vast and complex field that has inherited many terms from other places all over the mathematical domain.

It can sometimes be challenging to get your head around all the different terminologies, never mind trying to understand how everything comes together.

In this blog post, we will focus on one particular concept: the hypothesis.

While you may think this is simple, there is a little caveat regarding machine learning: the term has both a statistics side and a learning side.

Don’t worry; we’ll do a full breakdown below.

You'll learn the following:

  • What is a hypothesis in machine learning?
  • Is this any different than the hypothesis in statistics?
  • What is the difference between the alternative hypothesis and the null?
  • Why do we restrict hypothesis space in artificial intelligence?
  • Example code performing hypothesis testing in machine learning

What Is a Hypothesis in Machine Learning?

In machine learning, the term ‘hypothesis’ can refer to two things.

First, it can refer to the hypothesis space, the set of all possible training examples that could be used to predict or answer a new instance.

Second, it can refer to the traditional null and alternative hypotheses from statistics.

Since machine learning works so closely with statistics, 90% of the time, when someone is referencing the hypothesis, they’re referencing hypothesis tests from statistics.

Is This Any Different Than The Hypothesis In Statistics?

In statistics, the hypothesis is an assumption made about a population parameter.

The statistician's goal is to gather evidence that either rejects or fails to reject this assumption.


This will take the form of two different hypotheses, one called the null, and one called the alternative.

Usually, you'll establish your null hypothesis as an assumption that the parameter equals some value.

For example, in Welch’s T-Test Of Unequal Variance, our null hypothesis is that the two means we are testing (population parameter) are equal.

This means our null hypothesis is that the two population means are the same.

We run our statistical tests, and if our p-value is significant (very low), we reject the null hypothesis.

This would mean that their population means are unequal for the two samples you are testing.

Usually, statisticians will use the significance level of .05 (a 5% risk of being wrong) when deciding what to use as the p-value cut-off.

What Is The Difference Between The Alternative Hypothesis And The Null?

The null hypothesis is our default assumption, which we take to be true until the evidence says otherwise.

The alternate hypothesis is usually the opposite of our null and is much broader in scope.

For most statistical tests, the null and alternative hypotheses are already defined.

You are then just trying to find “significant” evidence we can use to reject our null hypothesis.


These two hypotheses are easy to spot by their specific notation. The null hypothesis is usually denoted by H₀, while H₁ denotes the alternative hypothesis.

Example Code Performing Hypothesis Testing In Machine Learning

Since there are many different hypothesis tests in machine learning and data science, we will focus on one of my favorites.

This test is Welch’s T-Test Of Unequal Variance, where we are trying to determine if the population means of these two samples are different.

There are a couple of assumptions for this test, but we will ignore those for now and show the code.

You can read more about this here in our other post, Welch’s T-Test of Unequal Variance .
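The post's code and data aren't reproduced here, so this is only a sketch of Welch's test on two synthetic samples with unequal variances; passing `equal_var=False` is what turns `ttest_ind` into Welch's t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two hypothetical samples with different spreads (stand-ins for real data)
sample_a = rng.normal(loc=100, scale=10, size=40)
sample_b = rng.normal(loc=120, scale=25, size=35)

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_statistic, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(f"t = {t_statistic:.3f}, p = {p_value:.6f}")

if p_value < 0.05:
    print("Reject the null hypothesis: the population means appear to differ.")
else:
    print("Fail to reject the null hypothesis.")
```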

We see that our p-value is very low, and we reject the null hypothesis.


What Is The Difference Between The Biased And Unbiased Hypothesis Spaces?

The difference between the Biased and Unbiased hypothesis space is the number of possible training examples your algorithm has to predict.

The unbiased space has all of them, and the biased space only has the training examples you’ve supplied.

Since neither of these is optimal (one is too small, one is much too big), your algorithm creates generalized rules (inductive learning) to be able to handle examples it hasn’t seen before.

Here’s an example of each:

Example of The Biased Hypothesis Space In Machine Learning

The Biased Hypothesis space in machine learning is a biased subspace where your algorithm does not consider all training examples to make predictions.

This is easiest to see with an example.

Let’s say you have the following data:

Happy  and  Sunny  and  Stomach Full  = True

Whenever your algorithm sees those three together in the biased hypothesis space, it’ll automatically default to true.

This means when your algorithm sees:

Sad  and  Sunny  And  Stomach Full  = False

It’ll automatically default to False since it didn’t appear in our subspace.

This is a greedy approach, but it has some practical applications.


Example of the Unbiased Hypothesis Space In Machine Learning

The unbiased hypothesis space is a space where all combinations are stored.

We can re-use our example above:

This would start to break down as:

Happy  = True

Happy  and  Sunny  = True

Happy  and  Stomach Full  = True

Let’s say you have four options for each of the three choices.

This would mean our subspace would need 2^12 instances (4096) just for our little three-word problem.

This is practically impossible; the space would become huge.


So while it would be highly accurate, this has no scalability.

More reading on this idea can be found in our post, Inductive Bias In Machine Learning .

Why Do We Restrict Hypothesis Space In Artificial Intelligence?

We have to restrict the hypothesis space in machine learning. Without any restrictions, our domain becomes much too large, and we lose any form of scalability.

This is why our algorithm creates rules to handle examples that are seen in production. 

This gives our algorithms a generalized approach that will be able to handle all new examples that are in the same format.


Understanding P-value in Machine Learning


In this article we will learn about the p-value and how it is used to assess significance in hypothesis testing. The null hypothesis is present in all statistical tests. For the majority of tests, the null hypothesis is that there is no association between your variables of interest or that there is no difference between groups.

In simple terms, the p-value, or probability value, tells you how likely it is that your data could have occurred under the null hypothesis.

What is the null hypothesis?

The null hypothesis, often known as H0, is the assertion that there is no relationship or difference between the two sets of data or variables being investigated in a scientific study. Usually a significance test is intended to determine how strong the evidence is against the null hypothesis. With every null hypothesis, there is an alternate hypothesis, denoted by Ha. In hypothesis testing, we generally define them as:

To know more in detail about hypothesis testing visit https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/

  • Null Hypothesis : There is no effect or no relationship between variables.
  • Alternative Hypothesis : There is an effect or a relationship between the variables.

What is statistical significance?

When we define a hypothesis, we also define a parameter called the level of significance (alpha), which is the chance of the study rejecting the null hypothesis given that the null hypothesis is true. Given that the null hypothesis is correct, the p-value is the likelihood of getting a result that is at least as extreme as the one observed.

  • When p < alpha, the result is considered statistically significant by convention.

What is p-value and its significance?

In statistics, the p-value is a measure of the evidence against the null hypothesis. It tells us how likely it is to observe a test statistic as extreme or more extreme than the one we observed, assuming that the null hypothesis is true.

A small p-value (less than the significance level, usually 0.05) indicates that the observed result is unlikely to have occurred by chance alone, and therefore provides evidence against the null hypothesis. A large p-value, on the other hand, indicates that the observed result is likely to have occurred by chance, and we cannot reject the null hypothesis.

For example, if we obtain a p-value of 0.05, it means that if we were to repeat the experiment many times under the assumption that the null hypothesis is true, only 5% of the time would we obtain a test statistic as extreme or more extreme than the one we observed. This suggests that there is strong evidence against the null hypothesis, and we can reject it in favor of an alternative hypothesis.

How To Find P-value (significance) In Scikit-learn?

We will first import the built-in diabetes dataset, then run a linear regression model on it, calculate p-values for all the variables, and conclude whether or not they are statistically significant.

Method 1: Using statsmodel library

Importing libraries

The below code uses the statsmodels library in Python to perform an Ordinary Least Squares (OLS) regression analysis on the diabetes dataset. OLS is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the sum of the squared differences between the observed and predicted values of the dependent variable.

Importing dataset

The diabetes_df dataset is imported from the sklearn library, which contains information about 442 diabetes patients, including age, sex, body mass index (BMI), blood pressure, and six blood serum measurements. The X variable contains the predictor variables, and the y variable contains the target variable.

Fitting an Ordinary Least Squares model to the data

The OLS function is used to fit an OLS model to the data, with the fit() method used to obtain the model coefficients and other summary statistics.

Getting p-values for each variable

The summary2() method is used to display the p-values for each independent variable, which indicate which variables are statistically significant.
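The article's code listing isn't included here; a minimal sketch of the steps it describes, using the diabetes dataset mentioned above, could look like this:

```python
import statsmodels.api as sm
from sklearn.datasets import load_diabetes

# Load the diabetes dataset: X holds the predictors, y the target
X, y = load_diabetes(return_X_y=True)

# Add a constant (intercept) column and fit the OLS model
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

# summary2() prints the full regression table; model.pvalues gives just the p-values
print(model.summary2())
print(model.pvalues)
```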

Below is the output of the above code (the p-values); the original article shows them as an image.

Method 2: Directly from scikit learn library

In this approach, we will work more mathematically with matrices. To calculate the p-value, you must first know the distribution of your test statistic under the assumption that the null hypothesis is true. Then, using this distribution's cumulative distribution function (cdf), we can express the probability of the test statistic being at least as extreme as its observed value ts for the sample:

  • Left-tailed test:  p-value = cdf(ts) .
  • Right-tailed test:  p-value = 1 - cdf(ts)
  • Two-tailed test:  p-value = 2 × min{cdf(ts), 1 - cdf(ts)}

When conducting a t-test, we calculate a test statistic (TS) based on the data we have. We can then use this test statistic to calculate a p-value, which tells us how likely it is to observe such a large (or larger) value of the test statistic if the null hypothesis is true.

To calculate the p-value for a two-sided test, we assume that the distribution of the test statistic under the null hypothesis is symmetric about zero. We then use the observed value of the test statistic (ts) to calculate the p-value as follows:

p-value = 2 × P(TS ≥ |ts| | H0 is true) = 2 × (1 − cdf(|ts|))

Here, P(TS ≥ |ts| | H0 is true) is the probability of observing a test statistic as extreme or more extreme than |ts|, assuming that the null hypothesis is true. The cumulative distribution function (cdf) is a function that tells us the probability of observing a test statistic less than or equal to a certain value. We use the absolute value of the observed test statistic (|ts|) in the calculation because we assume that the distribution is symmetric about zero.
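The article's code isn't reproduced here; a sketch of this approach, fitting scikit-learn's `LinearRegression` on the diabetes data and then computing two-tailed p-values from the t-distribution's cdf, might look like this:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Fit the regression model with scikit-learn
lr = LinearRegression()
lr.fit(X, y)

# Design matrix with an intercept column, and the fitted coefficients
n, k = X.shape
X_design = np.column_stack([np.ones(n), X])
params = np.append(lr.intercept_, lr.coef_)

# Residual variance and standard errors of the coefficients
residuals = y - lr.predict(X)
dof = n - k - 1
mse = (residuals @ residuals) / dof
cov = mse * np.linalg.inv(X_design.T @ X_design)
se = np.sqrt(np.diag(cov))

# Two-tailed p-values: 2 * (1 - cdf(|ts|)) under the t-distribution
t_stats = params / se
p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), dof))
print(p_values)
```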

Also read: T-Test Hypothesis in Python

The above code generates p-values for each of the variables (displayed as an image in the original article).

Also read: Anova-test-in-python

With this, we have reached the end of this topic. We have understood the significance of the p-value and how to interpret and implement it using different methods in Python libraries.

  • https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistics/basic-statistics/supporting-topics/basics/manually-calculate-a-p-value/
  • https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression


Understanding Hypothesis Testing


Hypothesis testing involves formulating assumptions about population parameters based on sample statistics and rigorously evaluating these assumptions against empirical evidence. This article sheds light on the significance of hypothesis testing and the critical steps involved in the process.

What is Hypothesis Testing?

Hypothesis testing is a statistical method that is used to make a statistical decision using experimental data. Hypothesis testing is basically an assumption that we make about a population parameter. It evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. 

Example: You claim that the average height in the class is 30, or that a boy is taller than a girl. These are assumptions we are making, and we need some statistical way to prove or disprove them; we need a mathematical conclusion that shows whether what we are assuming is true.

Defining Hypotheses

  • Null hypothesis (H0): the default assumption about a population parameter, for example H0: μ = μ0 (the population mean equals some specified value μ0).
  • Alternative hypothesis (H1 or Ha): the claim that contradicts the null, for example H1: μ ≠ μ0.

Key Terms of Hypothesis Testing

  • Level of significance (α): the probability of rejecting the null hypothesis when it is actually true; it is commonly set at 0.05.

  • P-value: The P value , or calculated probability, is the probability of finding the observed/extreme results when the null hypothesis(H0) of a study-given problem is true. If your P-value is less than the chosen significance level then you reject the null hypothesis i.e. accept that your sample claims to support the alternative hypothesis.
  • Test Statistic: The test statistic is a numerical value calculated from sample data during a hypothesis test, used to determine whether to reject the null hypothesis. It is compared to a critical value or p-value to make decisions about the statistical significance of the observed results.
  • Critical value : The critical value in statistics is a threshold or cutoff point used to determine whether to reject the null hypothesis in a hypothesis test.
  • Degrees of freedom: Degrees of freedom are associated with the variability or freedom one has in estimating a parameter. The degrees of freedom are related to the sample size and determine the shape.

Why do we use Hypothesis Testing?

Hypothesis testing is an important procedure in statistics. It evaluates two mutually exclusive population statements to determine which statement is most supported by sample data. When we say that findings are statistically significant, it is hypothesis testing that allows us to say so.

One-Tailed and Two-Tailed Test

One-Tailed Test

A one-tailed test focuses on one direction, either greater than or less than a specified value. We use a one-tailed test when there is a clear directional expectation based on prior knowledge or theory. The critical region is located on only one side of the distribution curve. If the sample falls into this critical region, the null hypothesis is rejected in favor of the alternative hypothesis.

There are two types of one-tailed test:

  • Left-tailed test: H0: μ ≥ μ0 against H1: μ < μ0 (for example, H0: μ ≥ 50 versus H1: μ < 50).
  • Right-tailed test: H0: μ ≤ μ0 against H1: μ > μ0.

Two-Tailed Test

A two-tailed test considers both directions, greater than and less than a specified value. We use a two-tailed test when there is no specific directional expectation and we want to detect any significant difference, for example H0: μ = μ0 against H1: μ ≠ μ0.

What are Type 1 and Type 2 errors in Hypothesis Testing?

In hypothesis testing, Type I and Type II errors are two possible errors that researchers can make when drawing conclusions about a population based on a sample of data. These errors are associated with the decisions made regarding the null hypothesis and the alternative hypothesis.

  • Type I error (α): rejecting the null hypothesis when it is actually true (a false positive). The probability of a Type I error equals the significance level α.
  • Type II error (β): failing to reject the null hypothesis when it is actually false (a false negative). The probability of a Type II error is β, and 1 − β is the power of the test.

How does Hypothesis Testing work?

Step 1 – Define Null and Alternative Hypotheses

We state the null hypothesis H0 and the alternative hypothesis H1 about the population parameter.

We first identify the problem about which we want to make an assumption, keeping in mind that the two hypotheses must contradict one another, and we assume normally distributed data.

Step 2 – Choose a Significance Level (α)

The significance level α is the probability of rejecting the null hypothesis when it is true; it is typically set to 0.05.

Step 3 – Collect and Analyze data.

Gather relevant data through observation or experimentation. Analyze the data using appropriate statistical methods to obtain a test statistic.

Step 4 – Calculate the Test Statistic

The data for the tests are evaluated in this step we look for various scores based on the characteristics of data. The choice of the test statistic depends on the type of hypothesis test being conducted.

There are various hypothesis tests, each appropriate for various goal to calculate our test. This could be a Z-test , Chi-square , T-test , and so on.

  • Z-test : If population means and standard deviations are known. Z-statistic is commonly used.
  • t-test : If population standard deviations are unknown and the sample size is small, the t-test statistic is more appropriate.
  • Chi-square test : Chi-square test is used for categorical data or for testing independence in contingency tables
  • F-test : F-test is often used in analysis of variance (ANOVA) to compare variances or test the equality of means across multiple groups.

We have a smaller dataset, So, T-test is more appropriate to test our hypothesis.

T-statistic is a measure of the difference between the means of two groups relative to the variability within each group. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.

Step 5 – Compare the Test Statistic

In this stage, we decide whether to reject the null hypothesis or fail to reject it. There are two ways to make this decision.

Method A: Using Critical Values

Comparing the test statistic and tabulated critical value we have,

  • If Test Statistic>Critical Value: Reject the null hypothesis.
  • If Test Statistic≤Critical Value: Fail to reject the null hypothesis.

Note: Critical values are predetermined threshold values that are used to make a decision in hypothesis testing. To determine critical values for hypothesis testing, we typically refer to a statistical distribution table, such as the normal distribution or t-distribution tables, depending on the test being used.

Method B: Using P-values

We can also reach a conclusion using the p-value:

  • If p ≤ α: reject the null hypothesis.
  • If p > α: fail to reject the null hypothesis.

Note : The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed in the sample, assuming the null hypothesis is true. It is computed from the sampling distribution of the test statistic (for example, the normal or t-distribution), either from a table or with statistical software.
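Similarly, the p-value can be computed from the sampling distribution of the test statistic; the numbers below are illustrative.

```python
from scipy import stats

t_statistic, df, alpha = 2.8, 9, 0.05
p_value = 2 * stats.t.sf(abs(t_statistic), df)  # two-tailed p-value, ≈ 0.021

print(p_value)
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```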

Step 6: Interpret the Results

At last, we can conclude our experiment using method A or B.

Calculating the test statistic

To validate a hypothesis about a population parameter, we use statistical functions. For normally distributed data, we combine the test statistic (e.g., the z-score), the p-value, and the level of significance (alpha) to weigh the evidence for or against our hypothesis.

1. Z-statistics:

The z-statistic is used when the population standard deviation is known (or the sample size is large):

z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}

  • x̄ is the sample mean,
  • μ represents the population mean,
  • σ is the population standard deviation,
  • and n is the size of the sample.
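A minimal sketch of this formula with made-up numbers (a sample of n = 36 with mean 52, tested against μ = 50 with known σ = 6):

```python
import math
from scipy import stats

x_bar, mu, sigma, n = 52, 50, 6, 36
z = (x_bar - mu) / (sigma / math.sqrt(n))  # = 2.0
p_two_tailed = 2 * stats.norm.sf(abs(z))   # ≈ 0.0455

print(z, p_two_tailed)
```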

2. T-Statistics

The t-test is used when the population standard deviation is unknown and the sample size is small (typically n < 30).

The t-statistic is calculated as:

t = \frac{\bar{x} - \mu}{s/\sqrt{n}}

  • t = t-score,
  • x̄ = sample mean
  • μ = population mean,
  • s = standard deviation of the sample,
  • n = sample size
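A minimal sketch with a made-up sample, checking the hand-computed t-score against scipy.stats.ttest_1samp:

```python
import numpy as np
from scipy import stats

sample = np.array([50.2, 49.8, 51.1, 50.5, 49.6, 50.9, 50.3, 49.9])
mu = 50.0  # hypothesized population mean

t_manual = (sample.mean() - mu) / (sample.std(ddof=1) / np.sqrt(len(sample)))
t_scipy, p_value = stats.ttest_1samp(sample, popmean=mu)

print(t_manual, t_scipy, p_value)  # the two t-values should agree
```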

3. Chi-Square Test

The Chi-square test of independence is used for categorical data (which need not be normally distributed). The statistic is:

\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

  • i and j are the row and column indices, respectively,
  • O_{ij} is the observed frequency in cell (i, j),
  • E_{ij} is the expected frequency in cell (i, j) under the assumption of independence.
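A minimal sketch with a hypothetical 2×2 contingency table, using scipy.stats.chi2_contingency (correction=False so the statistic matches the formula above exactly); the function also returns the expected frequencies E_ij.

```python
import numpy as np
from scipy import stats

observed = np.array([[30, 20],
                     [20, 30]])  # hypothetical counts

chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)  # chi2 = 4.0, p ≈ 0.046, dof = 1
print(expected)            # E_ij under the assumption of independence
```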

Real life Hypothesis Testing example

Let’s examine hypothesis testing using two real life situations,

Case A: Does a New Drug Affect Blood Pressure?

Imagine a pharmaceutical company has developed a new drug that they believe can effectively lower blood pressure in patients with hypertension. Before bringing the drug to market, they need to conduct a study to assess its impact on blood pressure.

  • Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
  • After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114

Step 1 : Define the Hypothesis

  • Null Hypothesis (H0): The new drug has no effect on blood pressure.
  • Alternate Hypothesis (H1): The new drug has an effect on blood pressure.

Step 2: Define the Significance level

Let’s set the significance level at 0.05: we will reject the null hypothesis if the evidence suggests there is less than a 5% chance of observing results this extreme due to random variation alone.

Step 3 : Compute the test statistic

Using a paired T-test, we analyze the data to obtain a test statistic and a p-value.

The test statistic (e.g., T-statistic) is calculated based on the differences between blood pressure measurements before and after treatment.

t = \frac{m}{s/\sqrt{n}}

  • m = mean of the paired differences, d_i = X_{after,i} − X_{before,i}
  • s = standard deviation of the paired differences d_i
  • n = number of pairs (sample size)

For these data, m = −3.9, s ≈ 1.37, and n = 10.

Plugging these into the paired t-test formula gives a T-statistic of approximately −9.

Step 4: Find the p-value

With a calculated t-statistic of −9 and df = 9 degrees of freedom, the p-value can be found using statistical software or a t-distribution table.

Thus, p-value ≈ 8.54e-06 (8.538051223166285e-06).

Step 5: Result

  • If the p-value is less than or equal to 0.05, the researchers reject the null hypothesis.
  • If the p-value is greater than 0.05, they fail to reject the null hypothesis.

Conclusion: Since the p-value (8.538051223166285e-06) is less than the significance level (0.05), the researchers reject the null hypothesis. There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different.

Python Implementation of Hypothesis Testing

Let’s implement this hypothesis test in Python, again testing whether the new drug affects blood pressure. For this example, we will use a paired T-test from the scipy.stats library.

SciPy is a scientific computing library for Python that provides, among other things, statistical functions and hypothesis tests.

We will now implement our first real-life problem (Case A) in Python, as shown in the sketch below.
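The following is a minimal sketch (not the article's original listing) that runs the paired T-test with scipy.stats.ttest_rel on the before/after measurements given above; the printed T-statistic and p-value should match the values reported earlier.

```python
from scipy import stats

# Blood pressure measurements (same patients before and after treatment)
before = [120, 122, 118, 130, 125, 128, 115, 121, 123, 119]
after = [115, 120, 112, 128, 122, 125, 110, 117, 119, 114]

# Paired T-test: tests whether the mean difference (after - before) is zero
t_statistic, p_value = stats.ttest_rel(after, before)
print("T-statistic:", t_statistic)  # ≈ -9.0
print("p-value:", p_value)          # ≈ 8.54e-06

alpha = 0.05
if p_value <= alpha:
    print("Reject the null hypothesis: the drug appears to affect blood pressure.")
else:
    print("Fail to reject the null hypothesis.")
```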

In the above example, given the T-statistic of approximately -9 and an extremely small p-value, the results indicate a strong case to reject the null hypothesis at a significance level of 0.05. 

  • The results suggest that the new drug, treatment, or intervention has a significant effect on lowering blood pressure.
  • The negative T-statistic indicates that the mean blood pressure after treatment is significantly lower than the assumed population mean before treatment.

Case B : Cholesterol level in a population

Data: A sample of 25 individuals is taken, and their cholesterol levels are measured.

Cholesterol Levels (mg/dL): 205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198, 202, 208, 200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192, 205.

Hypothesized Population Mean (μ): 200 mg/dL

Population Standard Deviation (σ): 5 mg/dL (given for this problem)

Step 1: Define the Hypothesis

  • Null Hypothesis (H0): The average cholesterol level in a population is 200 mg/dL.
  • Alternate Hypothesis (H1): The average cholesterol level in a population is different from 200 mg/dL.

Step 2: Choose the significance level and critical values

As the direction of deviation is not given, we use a two-tailed test. Based on the standard normal (z) table, the critical values for a significance level of 0.05 (two-tailed) are approximately −1.96 and +1.96.

Step 3: Compute the test statistic

Since the population standard deviation is known, we use a z-test. The sample mean of the 25 measurements is 202.04 mg/dL, so

z = \frac{202.04 - 200}{5/\sqrt{25}} = 2.04
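A minimal sketch to verify this computation: SciPy has no built-in one-sample z-test, so the z-statistic and its two-tailed p-value are computed directly with scipy.stats.norm.

```python
import numpy as np
from scipy import stats

cholesterol = np.array([205, 198, 210, 190, 215, 205, 200, 192, 198, 205,
                        198, 202, 208, 200, 205, 198, 205, 210, 192, 205,
                        198, 205, 210, 192, 205])
mu_0, sigma = 200, 5  # hypothesized mean and known population std. dev. (mg/dL)

z = (cholesterol.mean() - mu_0) / (sigma / np.sqrt(len(cholesterol)))
p_two_tailed = 2 * stats.norm.sf(abs(z))

print("z:", z)                   # ≈ 2.04
print("p-value:", p_two_tailed)  # ≈ 0.041 (< 0.05)
```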

Step 4: Result

Since the absolute value of the test statistic (2.04) is greater than the critical value (1.96), we reject the null hypothesis and conclude that there is statistically significant evidence that the average cholesterol level in the population is different from 200 mg/dL.

Limitations of Hypothesis Testing

  • Although a useful technique, hypothesis testing does not offer a comprehensive grasp of the topic being studied. Without fully reflecting the intricacy or whole context of the phenomena, it concentrates on certain hypotheses and statistical significance.
  • The accuracy of hypothesis testing results is contingent on the quality of available data and the appropriateness of statistical methods used. Inaccurate data or poorly formulated hypotheses can lead to incorrect conclusions.
  • Relying solely on hypothesis testing may cause analysts to overlook significant patterns or relationships in the data that are not captured by the specific hypotheses being tested. This limitation underscores the importance of complementing hypothesis testing with other analytical approaches.

Hypothesis testing stands as a cornerstone in statistical analysis, enabling data scientists to navigate uncertainties and draw credible inferences from sample data. By systematically defining null and alternative hypotheses, choosing significance levels, and leveraging statistical tests, researchers can assess the validity of their assumptions. The article also elucidates the critical distinction between Type I and Type II errors, providing a comprehensive understanding of the nuanced decision-making process inherent in hypothesis testing. The real-life example of testing a new drug’s effect on blood pressure using a paired T-test showcases the practical application of these principles, underscoring the importance of statistical rigor in data-driven decision-making.

Frequently Asked Questions (FAQs)

1. What are the 3 types of hypothesis tests?

There are three types of hypothesis tests: right-tailed, left-tailed, and two-tailed. Right-tailed tests assess if a parameter is greater, left-tailed if lesser. Two-tailed tests check for non-directional differences, greater or lesser.

2. What are the 4 components of hypothesis testing?

Null Hypothesis (H0): No effect or difference exists. Alternative Hypothesis (H1): An effect or difference exists. Significance Level (α): The risk of rejecting the null hypothesis when it is true (Type I error). Test Statistic: A numerical value representing the observed evidence against the null hypothesis.

3. What is hypothesis testing in ML?

Hypothesis testing in machine learning is a statistical method for evaluating the performance and validity of models. It tests specific hypotheses about model behavior, such as whether certain features influence predictions or whether a model generalizes well to unseen data.

4. What is the difference between Pytest and Hypothesis in Python?

Pytest is a general-purpose testing framework for Python code, while Hypothesis is a property-based testing framework for Python that generates test cases based on specified properties of the code.


Confidence curves : an alternative to null hypothesis significance testing for the comparison of classifiers

  • Published: 30 December 2016
  • Volume 106 , pages 911–949, ( 2017 )


  • Daniel Berrar


Null hypothesis significance testing is routinely used for comparing the performance of machine learning algorithms. Here, we provide a detailed account of the major underrated problems that this common practice entails. For example, omnibus tests, such as the widely used Friedman test, are not appropriate for the comparison of multiple classifiers over diverse data sets. In contrast to the view that significance tests are essential to a sound and objective interpretation of classification results, our study suggests that no such tests are needed. Instead, greater emphasis should be placed on the magnitude of the performance difference and the investigator’s informed judgment. As an effective tool for this purpose, we propose confidence curves , which depict nested confidence intervals at all levels for the performance difference. These curves enable us to assess the compatibility of an infinite number of null hypotheses with the experimental results. We benchmarked several classifiers on multiple data sets and analyzed the results with both significance tests and confidence curves. Our conclusion is that confidence curves effectively summarize the key information needed for a meaningful interpretation of classification results while avoiding the intrinsic pitfalls of significance tests.



1 Introduction

Machine learning classifiers are frequently compared and selected based on their performance on multiple benchmark data sets. Given a set of k classifiers and N data sets, the question is whether there exists a significant performance difference, and if so, between which pairs of classifiers. Null hypothesis significance testing (NHST) is increasingly used for this task. However, NHST has been criticized for many years in other fields (Harlow et al. 1997 ), for example, biomedicine and epidemiology (Poole 1987 ; Goodman 1993 , 2008 ; Rothman et al. 2008 ; Stang et al. 2010 ), the social sciences (Cohen 1994 ; Gigerenzer et al. 2004 ), statistics (Berger and Berry 1988 ), and particularly psychology (Rozeboom 1960 ; Bakan 1966 ; Carver 1978 ; Schmidt and Hunter 1997 ; Rozeboom 1997 ; Gigerenzer 1998 ). By contrast, in machine learning, these critical voices have not been widely echoed so far. Recently, some deficiencies of the common benchmarking practice have been pointed out (Drummond and Japkowicz 2010 ), and Bayesian alternatives were proposed (Corani et al. 2015 ; Benavoli et al. 2015 ). Overall, however, there is a clear trend towards significance testing for the comparison of machine learning algorithms.

In this paper, we criticize this common evaluation practice. First, we scrutinize the key problems and common misconceptions of NHST that, in our view, have received scant attention in the machine learning literature so far. For example, it is widely assumed that NHST originates from one coherent theory (Goodman 2008 ), but actually it is an unfortunate hybrid of concepts from the Fisherian and Neyman–Pearsonian school of thought. We believe that the amalgamation of incompatible ideas from these schools and the ensuing problems are not widely recognized. For example, the p value is often considered a type of error rate, although it does not have such an interpretation. A p value is widely considered as an objective measure, but in fact, it depends on the researcher’s intentions (whether these were actually realized or not) and how the researcher thought about the experiment. The p value is therefore far less objective than is commonly assumed. Sampling intentions do matter, and they also have a bearing on other frequentist methods, such as confidence intervals.

A significant p value is widely regarded as a research desideratum, but it is probably one of the most widely misinterpreted and overrated values in the scientific literature (Goodman 2008 ; Nuzzo 2014 ). We investigate several problems of this recondite value with particular relevance to performance evaluation. A major goal of this study is to kindle a debate on the role of NHST, the p value, and alternative evaluation methods in machine learning.

One of our main criticisms concerns the use of omnibus tests in comparative classification studies. The Friedman test is now widely used when multiple classifiers are compared over multiple data sets. When such tests give a significant result, post-hoc tests are carried out to detect which pair-wise comparisons are significantly different. Here, we provide several arguments against this procedure in general and the Friedman test in particular. A key finding is that such tests are not needed, and when a study involves diverse benchmark data sets, omnibus tests (such as the Friedman test) are not even appropriate.

The underlying problem of the current evaluation practice, however, is a much deeper one. There is a common thread that weaves through the machine learning literature, suggesting that statistical testing lends scientific rigor to the analysis of empirical results. Well-meaning researchers, eager for a sound and objective interpretation of their empirical results, might consider a statistical test indispensable. Here, we wish to challenge this view. We argue that such tests often provide only a veneer of rigor, and that they are therefore not needed for the comparison of classifiers. Our criticism pertains to both the Fisherian significance testing and the Neyman–Pearsonian hypothesis testing, and particularly to the blurring of concepts from both schools of thought. We do not consider Bayesian testing in this article.

We put forward that a focus on the effect size (i.e., the magnitude of the difference in performance) and its reasonable bounds is needed, not a focus on statistical significance. As an alternative evaluation tool, we propose confidence curves , which are based on the idea of depicting an infinite number of nested confidence intervals for an effect size (Birnbaum 1961 ). The resulting “tipi”-shaped graph enables the investigator to simultaneously assess the compatibility of an infinite number of null hypotheses with the experimental results. Thereby, confidence curves solve a key problem of the common testing practice, namely the focus on a single null hypothesis (i.e., the null hypothesis of no difference) with its single p value.

In our experiments involving real-world and synthetic data sets, we use first the Friedman test with Nemenyi post-hoc test and then confidence curves. By juxtaposing both approaches, we show that the evaluation with confidence curves is more meaningful but, at the same time, also more challenging because they require an interpretation beyond the dichotomous decision of “significant” versus “non-significant”.

The novelty of this study is twofold. First, we investigate several underrated problems of NHST and the p value. To our knowledge, no detailed account of these problems has been given in the machine learning literature yet. Second, we propose confidence curves as an alternative, graphical evaluation tool. The significance of our work is that it opens a possible avenue towards a more flexible and meaningful interpretation of empirical classification results. The main contributions of our paper are as follows.

We investigate five key problems of the p value that are particularly relevant for the evaluation of classification results but have received scant attention so far. We discuss several examples to illustrate these problems.

We show that widely used omnibus tests, such as the Friedman test, are not appropriate for the comparison of multiple classifiers over multiple data sets. If the test subjects are diverse benchmark data sets, then the p value has no meaningful interpretation.

We propose an alternative evaluation method, confidence curves , which help avoid the intrinsic pitfalls of NHST. As a summary measure, we derive the area under the confidence curve (AUCC). We provide a detailed experimental comparison between the evaluation based on NHST and confidence curves.

We provide the R code to plot confidence curves and calculate the AUCC. This code is available at https://github.com/dberrar/ConfidenceCurve .

This paper is organized as follows. After a brief review of related work, we first describe the main differences between the Fisherian and the Neyman–Pearsonian school of thought, which are often amalgamated into an incoherent framework for statistical inference. Then, we scrutinize the key problems of the p value. Finally, we present several arguments against significance tests for the comparison of multiple classifiers over multiple data sets. This first part of the paper represents the rationale for our research on alternative evaluation methods. We begin the second part of the paper with an illustration of the key concepts of confidence curves and then provide their mathematical details. As a summary statistic of precision, we propose the area under the confidence curve (AUCC). Then, we provide some examples illustrating what we can do with these curves and the AUCC. In the experimental part of the paper, we compare the performance of several classifiers over both UCI benchmark and synthetic data sets. First, we analyze the results using a standard approach (Friedman test with Nemenyi post-hoc test). Then, we interpret the same results with confidence curves and compare both approaches. In Sect.  8 , we summarize our arguments against significance testing and discuss the pros and cons of the proposed alternative. Our conclusion (Sect.  9 ) is that greater emphasis should be placed on effect size estimation and informed judgment, not on significance tests and p values.

2 Related work

There exists a substantial amount of literature on the problems of significance testing. In machine learning, however, such critical voices are extremely rare. Demšar ( 2006 ), for example, concludes with a paragraph reminding us about the alternative opinion of statisticians who reject statistical testing (Cohen 1994 ; Schmidt 1996 ). These objections are further expatiated in (Demšar 2008 ). Dietterich ( 1998 ) compared several statistical tests and concluded that they should be viewed as approximate, heuristic tests, and not as rigorously correct statistical methods. Drummond and Japkowicz ( 2010 ) criticize the current practice in machine learning that puts too much emphasis on benchmarking and statistical hypothesis testing. In a similar vein, Drummond ( 2006 ) questions the value of NHST for comparing the performance of machine learning algorithms.

Yet despite decades of severe criticisms, significance tests and their p values enjoy an unbroken popularity in many scientific disciplines (Nuzzo 2014 ). How prevalent is their use in machine learning? As it is difficult to answer that question directly, we queried the ScienceDirect database Footnote 1 for articles containing the terms “ p value” and “classification”. We divided the number of articles containing these search terms by the number of articles containing only “classification” and not “ p value”. We restricted the search to computer science articles only. Figure  1 shows that p values (and hence significance testing) have been increasingly used over the last 15 years. Of course, Fig.  1 needs to be interpreted very cautiously because the results may also include articles that are critical of significance testing. Nonetheless, we believe that Fig.  1 indicates a clear trend towards the use of significance tests for the comparison of classifiers.

Fig. 1 Use of significance tests in classification studies. The rate denotes the number of computer science articles containing the words “p value” and “classification” divided by the number of computer science articles containing only “classification” and not “p value”. Results are based on ScienceDirect database queries, 19 December 2015

Several alternatives to significance testing have been proposed, for example, Bayesian analysis (Berger and Berry 1988 ). Bayesian tests were also recently proposed for the comparison of machine learning algorithms (Benavoli et al. 2015 ; Corani et al. 2015 ). Killeen ( 2004 ) recommends replacing the p value by \(p_{rep}\) , a measure of replicability of results.

As another alternative, confidence intervals are widely considered more meaningful than significance tests (Tukey 1991 ; Cohen 1994 ; Schmidt 1996 ). In fact, a confidence interval provides a measure of the effect size and a measure of its uncertainty (Cummings 2012 ), whereas the p value conflates the effect size with the precision with which this effect size has been measured. We will discuss this issue in detail in Sect.  4.2 . Many statisticians and other scientists have therefore argued that confidence intervals should replace significance tests and p values (Cox 1977 ; Cohen 1994 ; Schmidt 1996 ; Thompson 1999 ; Stang et al. 2010 ). The journal Epidemiology even advises against the use of p values: “[...] we prefer that p values be omitted altogether, provided that point and interval estimates, or some equivalent, are available.” (Rothman 1998 , p. 334). Although confidence intervals and p values are often considered as two sides of the same coin, they are different tools and have a different influence on the interpretation of empirical results (Poole 2001 ). Specifically, for the comparison of machine learning classifiers, confidence intervals were shown to be preferable to significance tests (Berrar and Lozano 2013 ).

However, Levin ( 1998 ), Savalei and Dunn ( 2015 ), and Abelson ( 1997 ), among others, are skeptical about the benefits of confidence intervals over significance testing because it is unclear how wide such intervals should be. It has been suggested that several intervals alongside the common 95% interval be reported (Cox 1958 ), but according to Levin ( 1998 ), this is “subjective nonsense” (p. 47) because it is unclear (and arbitrary) which confidence levels should be reported. Furthermore, it is perhaps too tempting to interpret a confidence interval merely as a surrogate significance test by checking whether it includes the null value or not. In that case, the advantage of the confidence interval over the p value is of course lost.

The evaluation method that we consider as an alternative to NHST is based on the confidence curve estimator developed by Birnbaum ( 1961 ). In his unified theory of estimation for one-parameter problems, Birnbaum constructed nested confidence intervals for point estimates. He did not propose confidence curve estimators as an alternative to significance testing, though. In epidemiology and medical research, these estimators were proposed as a meaningful inferential tool under their alias of p value function (Poole 1987 ; Rothman 1998 ; Rothman et al. 2008 ). Similar graphs were proposed before under the different names of consonance function (Folks 1981 ) and confidence interval function (Sullivan and Foster 1990 ). To our knowledge, however, such graphs are rarely used in epidemiology or clinical research. In reference to the paper that first described the key idea, we use Birnbaum’s term confidence curve to refer to nested confidence intervals at all levels. We consider confidence curves for cross-validated point estimates of classification performance. We also derive the area under the confidence curve (AUCC), which, similarly to the AUC of a ROC curve, is a scalar summary measure.

3 Short revision of classic statistical testing

The foundations of what has become the classic statistical testing procedure were laid in the early 20th century by two different approaches to statistical inference, the Fisherian and the Neyman–Pearsonian school of thought. These two schools are widely believed to represent one single, coherent theory of statistical inference (Hubbard 2004 ). However, their underlying philosophies and concepts are fundamentally different (Goodman 1993 ; Hubbard and Bayarri 2003 ; Hubbard and Armstrong 2006 ) and their amalgamation can entail severe problems. We will now briefly revise the essential concepts.

3.1 Fisherian significance testing

The Fisherian school of thought goes back to Ronald A. Fisher and is motivated by inductive inference , which is based on the premise that it is possible to make inferences from observations to a hypothesis. In the Fisherian paradigm, only one hypothesis exists, the null hypothesis. There is no alternative hypothesis. Following the notation by Bayarri and Berger ( 2000 ), we state the null hypothesis, H0, as follows,

\text{H0: } \mathbf{X} \sim f(\mathbf{x}, \theta),

where X denotes data, and \(f({\mathbf{x}},\theta )\) is a density with parameter \(\theta \). The word “null” in “null hypothesis” refers to the hypothesis to be nullified. It does not mean that we need to test whether some value is 0 (for example, that the difference in performance is 0). To make this distinction clear, Cohen ( 1994 ) prefers “nil hypothesis” for the null hypothesis of no difference.

In the Fisherian inductive paradigm, we are only interested in whether the null hypothesis is plausible or not. So we ask: which data cast as much doubt as (or more doubt than) our observed data, given that the null hypothesis is true? Fisher considered this conditional probability, called the p value, as a measure of evidence against the null hypothesis: the smaller this value, the greater the evidential weight against the null hypothesis, and vice versa. To investigate the compatibility of the null hypothesis with our observed data \({\mathbf{x}_{\mathrm{obs}}}\) , we choose a statistic \(T = t({\mathbf{X}})\) , for example, the mean. The p value is defined as

p = \Pr\left(T \ge t(\mathbf{x}_{\mathrm{obs}}) \mid \text{H0}\right).

In other words, the p value is the probability of a result as extreme as or more extreme than the observed result, given that the null hypothesis is true. As the null hypothesis is a statement about a hypothetical infinite population, the p value is a measure that refers to that population. The p value is therefore not a summary measure of the observed data at hand.

Under the null hypothesis, the p value is a random variable uniformly distributed over [0, 1]. If the p value is smaller than an arbitrary threshold (commonly 0.05, the Fisherian level of significance), then the result is considered “significant”, otherwise “non-significant”. For Fisher, a significant p value merely meant that it is worthwhile doing further experiments (Goodman 2008 ; Nuzzo 2014 ). Formally, the p value is defined as a probability, but it is a rather difficult-to-interpret probability—it may be best to think of the p value as a “crude indicator that something surprising is going on” (Berger and Delampaday 1987 , p. 329). As Fisher reminded us, this “something surprising” may also refer to a problem with the study design or the data collection process (Fisher 1943 ).

We are often reminded not to interpret the p value as the probability that the null hypothesis is true, given the data, that is, Pr (H0 \(|{\mathbf{x}_{\mathrm{obs}}}\) ). Still, this interpretation is perhaps one of the most pervasive misconceptions of the p value. Sometimes we are advised that although the p value cannot tell us anything about the probability that the null hypothesis is true or false, we can act as if it were true or false. However, this is not in the spirit of Fisher who regarded the p value as an evidential measure, not as a criterion for decision making or behavior. It is the Neyman–Pearsonian hypothesis testing that provides such a criterion.

3.2 Neyman–Pearsonian hypothesis testing

In contrast to the Fisherian paradigm, the procedure invented by Jerzy Neyman and Egon Pearson regards hypothesis testing as a vehicle for decision making or inductive behavior . Here, the null hypothesis H0 is pitted against an alternative hypothesis, H1 (which, we remember, does not exist in the Fisherian paradigm). The emphasis is on making a decision between two options, with the goal to minimize the errors that we make in the long run , not to find out which hypothesis is true. Thus, “accepting H” is not to be equated with “believing that H is true”. In the words of Neyman and Pearson,

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis [...] Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience , we shall not be too often wrong. (Neyman and Pearson 1933 , p. 290–291) (our italics).

The importance of the phrase “in the long run” cannot be overstated. It means that the Neyman–Pearsonian paradigm is not conceived as a procedure to assess the evidential weight provided by an individual experimental outcome.

There are now two errors that one can make, (i) deciding against H0 although it is correct (Type I error or \(\alpha \) ), and (ii) deciding in favor of H0 although it is false (Type II error or \(\beta \) ). Note that “deciding in favor of” and “deciding against” implies a dichotomization of the results; otherwise, the concept of Type I/II errors would have no meaning. If H0 is false, then the probability—over many applications of the testing procedure—that H0 is rejected is the power of that procedure, where \(\textit{power} = 1 - \beta \) . There is a trade-off between the two types of errors, and we can tweak them through our arbitrarily fixed \(\alpha \) . What should guide this tweaking? Clearly, the guide should be the costs associated with the errors. Costs, however, have no bearing on the truth of a hypothesis; they have purely pragmatic reasons and were frowned upon by Fisher ( 1955 ).

Note that in the Neyman–Pearsonian school of thought, the Fisherian p value does not exist. Nor does any other measure of evidence. By contrast, the concepts of error rates, alternative hypothesis, and power do not exist in the Fisherian school of thought. Specifically, note the crucially important difference between the p value and the Type I error rate. The p value has no interpretation as a long-run, repetitive error rate, whereas the Type I error rate does; compare (Berger and Delampaday 1987 , p. 329). This stands in stark contrast to the nearly ubiquitous misinterpretation of p values as error rates. It means that we cannot simply compare the Fisherian p value with the Neyman–Pearsonian error rate \(\alpha \) . If the question of interest is “Given these data, do I have reason to conclude that H0 is false?”, then the Neyman–Pearsonian error rate is irrelevant. Note that a Bayesian calibration can, to some extent, reconcile the Fisherian p value and the Neyman–Pearsonian \(\alpha \) . The calibration is \(\alpha (p) = [1 + (-ep\log p)^{-1} ]^{-1}\) , if \(p<e^{-1}\) (Sellke et al. 2001 ). For example, if the p value is 0.03, then it can be given the frequentist error interpretation of \(\alpha = 0.22\) . A lower bound on the Bayes factor is given by \(B(p) = -ep\log (p)\) . For example, a p value of 0.03 corresponds to an odds of 0.29 for H0 to H1 (i.e., about 1:3.5). This calibration also illustrates that a p value almost always drastically overstates the evidence against the null hypothesis.
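For concreteness, the calibration and the Bayes factor bound quoted above are easy to evaluate; the following short sketch (illustrative Python, not part of the original text) reproduces the numbers for p = 0.03.

```python
import math

def alpha_calibration(p):
    """Frequentist error interpretation alpha(p) = [1 + (-e*p*ln p)^(-1)]^(-1), valid for p < 1/e."""
    assert 0 < p < math.exp(-1)
    return 1.0 / (1.0 + 1.0 / (-math.e * p * math.log(p)))

def bayes_factor_bound(p):
    """Lower bound on the Bayes factor for H0 vs. H1: B(p) = -e*p*ln(p)."""
    return -math.e * p * math.log(p)

print(round(alpha_calibration(0.03), 2))   # 0.22
print(round(bayes_factor_bound(0.03), 2))  # 0.29, i.e. odds of about 1:3.5 for H0 to H1
```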

In the Fisherian school of thought, we can never accept H0, only fail to reject it, if we deem the p value too high. But in the Neyman–Pearsonian paradigm, we may indeed accept H0 because we are interested in decision rules: if the test statistic falls into the rejection region, then H0 is rejected and H1 is accepted. By contrast, if the test statistic does not fall into the rejection region, then H0 is accepted and H1 is rejected. The concrete numerical value of the probability associated with the test statistic is irrelevant. In fact, the Neyman–Pearsonian hypothesis test is based on the notion of a critical region that minimizes \(\beta \) for a fixed \(\alpha \) . The concept of “error rate” requires that a result can be anywhere within the tail area (Goodman 1993 ). This is not so for the p value, and it makes therefore sense to report it exactly. There is no big difference between a p value of 0.048 and 0.052, simply because they can be interpreted as indicators of about equal weight. But in the Neyman–Pearsonian school of thought, it is an all-or-nothing decision, so 0.048 and 0.052 make all the difference.

Confused? You should be. Both the Neyman–Pearsonian \(\alpha \) -level and the Fisherian p value have been called the “significance level of a test”. The \(\alpha \) -level is set before the experiment is carried out, whereas the p value is calculated from the data after the experiment. The symbol \(\alpha \) is used as both an arbitrary threshold for the p value and as a frequentist error rate. And to make matters worse, the two different concepts are commonly employed at the 5%-level, thereby blurring the differences even more. In an excellent review, Hubbard ( 2004 ) describes the widespread confusion over p values as error probabilities, which has been perpetuated even in statistics textbooks. This confusion has also percolated into the machine learning literature where we observe the misconception that replicability or power (a Neyman–Pearsonian concept) can be measured as a function of p values (Demšar 2006 ). The p value, however, tells us nothing about either replicability or power. We will come back to this issue in Sect.  4.4 .

While Neyman believed that null hypothesis testing can be “worse than useless” (Gigerenzer 1998 , p. 200) in a mathematical sense, Fisher called the Neyman–Pearsonian Type II error the result of a “mental confusion” (Fisher 1955 , p. 73). Therefore, Fisher, Neyman, and Pearson would undoubtedly all have strongly objected to the inconsistent conflation of their ideas.

While the Neyman–Pearsonian approach is certainly useful in an industrial quality-control setting—a fact that was acknowledged even by one of its sternest opponents, R. A. Fisher (1955)—it can be questioned whether it has any role to play in the scientific enterprise. Rothman et al. ( 2008 ) wonder:

Why has such an unsound practice as the Neyman-Pearson (dichotomous) hypothesis testing become so ingrained in scientific research? [...] The neatness of an apparent clear-cut result may appear more gratifying to investigators, editors, and readers than a finding that cannot be immediately pigeonholed. (Rothman et al. 2008 , p. 154)

When we are interested in evaluating the plausibility of a concrete hypothesis, we might ask: “How compatible are our data with the hypothesis?” It seems that the Fisherian significance testing with its (allegedly evidential) p value is indeed more suitable to answer this question than the Neyman–Pearsonian approach. However, the p value is not as easy to interpret as we might think.

4 Underrated problems of the p value

Goodman ( 2008 ) gives an overview of the 12 most common misconceptions of the p value. Perhaps the single most serious misconception is that the p value has a sound theoretical foundation as an inferential tool. In fact, Fisher regarded the p value as an evidential measure of the discrepancy between the data and the null hypothesis, which should be used together with other background information to draw conclusions from experiments (Goodman 1999 ). The p value, however, is not an evidential measure (Berger and Sellke 1987 ; Goodman and Royall 1988 ; Cohen 1994 ; Hubbard and Lindsay 2008 ; Schmidt and Hunter 1997 ; Schervish 1996 ). An evidential measure requires two competing explanations for an observation (Goodman and Royall 1988 ), but the theory underlying the p value does not allow any alternative hypothesis. The p value is based on only one hypothesis. But at least, the p value is an objective measure—or is it not?

4.1 The p value is not a completely objective measure

The p value includes probabilities for data that were actually not obtained, but that could have been obtained under the null hypothesis . Yet what exactly are these imaginary, more extreme data? Curiously, this question cannot be answered on the basis of the observed data alone, but we need to know how the researcher thought about the possible outcomes. This means that it is impossible to analyze any experimental outcomes for their (non-)significance, unless we understand how the experiment was planned and conducted. We illustrate this problem in two scenarios that are adapted from mathematically equivalent examples of hypothetical clinical trials (Goodman 1999 ; Berger and Berry 1988 ).

Suppose that Alice and Bob jointly developed a new classifier A . They believe that A can outperform another classifier X on a range of data sets. Both Alice and Bob formulate the null hypothesis as follows, H0: the probability that their algorithm is better than X is 0.5. Alice and Bob decide to benchmark their algorithm independently and then compare their results later. Both Alice and Bob select the same six data sets from the UCI repository.

Alice is determined to carry out all six benchmark experiments, even if her algorithm loses the competition on the very first few data sets. After all six experiments, she notes that their classifier A was better on the first five data sets, but not on the last data set. Under H0, the probability of observing these results is calculated as \(\binom{6}{5}\,0.5^5\,0.5^1\) (i.e., 5 successes out of 6 trials, where each trial has a chance of 0.5). A more extreme result would be that their algorithm performs better on all six data sets. Thus, the probability of the more extreme result is \(0.5^6\). Therefore, Alice obtains \(\binom{6}{5}\,0.5^5\,0.5^1 + 0.5^6 = 0.11\) as the one-sided p value. Consequently, she concludes that their model A is not significantly better than the competing model X .

Bob has a different plan. He decides to stop the benchmarking as soon as their algorithm fails to be superior. Coincidentally, Bob analyzed the data sets in the same order as Alice did. Consequently, Bob obtained the identical benchmark results. But Bob’s calculation of the p value is different from Alice’s. In Bob’s experiment, the failure can only happen at the end of a sequence of experiments because he planned to stop the benchmarking in case that A performs worse than X . Hence, the probability of the observed result is \(0.5^5\times 0.5^1\) (i.e., a success in the first five experiments and a failure in the last one). The probability of the more extreme result is the same as that that Alice calculated. Therefore, Bob obtains \(0.5^5\times 0.5^1 + 0.5^6 = 0.03\) as the one-sided p value. And he concludes that their classifier is significantly better than the competing model X . The conundrum is that both Bob and Alice planned their experiments well. They used the same data sets and the same models, yet their p values—and in this example, their conclusions—are quite different.

Before Bob continues with his research, he discusses his experimental design with his supervisor, Carlos. Bob plans to compare their algorithm with an established algorithm X . This time, the null hypothesis is stated as follows, H0: probability that his algorithm performs differently from X is 0.5. Thus, this time, it is a two-sided test. Bob decides to benchmark his algorithm against X on ten data sets. He observes that his algorithm outperforms X in 9 of 10 data sets. The probability of 9 successes in 10 trials is \(\binom{10}{9}\,0.5^9\,0.5^1=0.0098\). The possible outcomes that are as extreme as or more extreme than the observed outcome are 0, 1, 9, and 10. The two-sided p value is the sum of probabilities of these outcomes: \(\binom{10}{0} 0.5^0 0.5^{10} + \binom{10}{1} 0.5^1 0.5^{9} + \binom{10}{9} 0.5^9 0.5^{1} + \binom{10}{10} 0.5^{10} 0.5^{0}= 0.021\). Given that the p value is smaller than 0.05, Bob concludes that his algorithm is significantly better than X .

Alice discusses a different study design with Carlos. She also wants to investigate the same ten data sets; however, if she cannot find a significant difference after ten experiments, then she wants to investigate another ten data sets. Thus, her study has two stages, where the second stage is merely a contingency plan. Her null hypothesis is the same as Bob’s. She uses again the same data sets as Bob did, and in the same order. Therefore, she also observes that her algorithm outperforms X in 9 out of 10 data sets. Thus, she does not need her contingency plan.

However, after discussing her result with Carlos, she is disappointed: her two-sided p value is 0.34. How can that be? After all, she did exactly the same experiments as Bob did and obtained exactly the same results! This is highly counterintuitive, but it follows from the logic of the p value. After the first 10 experiments, the outcomes that are as extreme as her actually observed results are 1 and 9. The more extreme results are 0 and 10. In the two-stage design, however, the total number of possible experiments is 20. Thus, the two-sided p value is calculated as \(\binom{20}{0}0.5^0 0.5^{20} + \binom{20}{1}0.5^1 0.5^{19} + \binom{20}{9}0.5^9 0.5^{11} + \binom{20}{10}0.5^{10} 0.5^{10} = 0.34\). In the words of Berger and Berry ( 1988 ), “[...] the nature of p values demands consideration of any intentions, realized or not” (p. 165). It does not matter that Alice did not carry out her contingency plan. What matters is that she contemplated to do so before obtaining her results, and this does affect the calculation of her p value.
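For concreteness, the following short sketch (illustrative, using Python rather than the paper's R code) reproduces these four p values from the binomial probabilities under H0.

```python
from scipy.stats import binom

# Six data sets, 5 wins: fixed design (Alice) vs. stop-at-first-failure design (Bob)
p_alice_one_sided = binom.pmf(5, 6, 0.5) + binom.pmf(6, 6, 0.5)  # ≈ 0.11
p_bob_one_sided = 0.5**5 * 0.5**1 + 0.5**6                       # ≈ 0.03

# Ten data sets, 9 wins: two-sided test
p_bob_two_sided = sum(binom.pmf(k, 10, 0.5) for k in (0, 1, 9, 10))    # ≈ 0.021
# Alice's two-stage design: 20 possible experiments in total
p_alice_two_sided = sum(binom.pmf(k, 20, 0.5) for k in (0, 1, 9, 10))  # ≈ 0.34

print(p_alice_one_sided, p_bob_one_sided, p_bob_two_sided, p_alice_two_sided)
```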

Even if sufficient evidence against a hypothesis has been accrued, the experiment must adhere to the initial design; otherwise, p values have no valid interpretation. This is the reason why clinical trials cannot be stopped in mid-course for an interim (frequentist) analysis. In practice, it can happen that a drug trial is stopped in mid-term because the beneficial effects of the drug are so clear that it would be unethical to continue administering a placebo to the control group (Morgan 2003 ). But then the data can be analyzed only by Bayesian, not frequentist methods. The frequentist paradigm requires that the investigator adhere to everything that has been specified before the study (Berry 2006 ). Realized or not, the intentions of the investigator matter, which indicates a potentially serious flaw in the logic of the p value (Berger and Berry 1988 ).

4.2 The p value conflates effect size and precision

A common misinterpretation of the p value is that it reflects the probability that the experiment would show as strong an effect as the observed one (or stronger), if the null hypothesis was correct. Suppose that we observe a difference in accuracy of 0.15 between two classifiers, A and B , with an estimated standard deviation of 0.10. For simplicity, let us assume that a standard normal test statistic is appropriate. Then we obtain \(z = \frac{0.15}{0.10}=1.5\) with a two-tailed p value of \(0.134 > 0.05\) . Thus, we cannot reject the null hypothesis of equal performance between A and B . Now assume that we observe a difference of only 0.12 between A and C , with an estimated standard deviation of 0.06. The test statistic \(z = \frac{0.12}{0.06}=2\) gives a p value of 0.046. Thus, we can reject the null hypothesis of equal performance between A and C . But compared with the difference between A and B , the difference between A and C is smaller (and closer to the null value of 0). The problem is that the p value conflates the magnitude of the difference (here, 0.15 and 0.12, respectively) with its precision (measured by the standard deviations 0.10 and 0.06, respectively).
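The two comparisons above can be reproduced in a few lines (illustrative sketch; a standard normal test statistic is assumed, as in the text).

```python
from scipy import stats

def two_tailed_p(effect, std_error):
    z = effect / std_error
    return 2 * stats.norm.sf(abs(z))

print(two_tailed_p(0.15, 0.10))  # ≈ 0.134 (A vs. B): larger difference, not "significant"
print(two_tailed_p(0.12, 0.06))  # ≈ 0.046 (A vs. C): smaller difference, "significant"
```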

Fig. 2 Comparison of classification accuracy of random forest and CART on the data set Transfusion in r times repeated tenfold stratified cross-validation. p values are derived from the variance-corrected repeated k-fold cross-validation test

To illustrate the conflating effect, we compared the performance of random forest and CART on the data set Transfusion from the UCI repository in r times repeated tenfold stratified cross-validation (Fig. 2). The p values were derived from the variance-corrected t test (Nadeau and Bengio 2003 ; Bouckaert and Frank 2004 ). The test statistic,

t = \frac{\frac{1}{kr}\sum_{i=1}^{k}\sum_{j=1}^{r}\left(a_{ij}-b_{ij}\right)}{\sqrt{\left(\frac{1}{kr}+\frac{n_2}{n_1}\right)s^2}},

follows approximately Student's distribution with \(\nu =kr-1\) degrees of freedom; \(a_{ij}\) and \(b_{ij}\) denote the performances (here, accuracy) achieved by classifiers A and B , respectively, in the j th repetition of the i th cross-validation fold; s is the standard deviation; \(n_2\) is the number of cases in one validation set, and \(n_1\) is the number of cases in the corresponding training set.

For \(r=1\) , i.e., a single run of tenfold cross-validation, the p value is 0.065, so we do not see a significant difference between the two classifiers. However, by repeating the cross-validation just three times ( \(r=3\) ), the p value falls below the magical 5% hurdle. The p value decreases further with increasing r . For 10 times repeated tenfold cross-validation, we obtain a p value of 0.003.

It is a well-known fact that for a large enough sample size, we are bound to find a statistically significant result (Hays 1963 ). With an increasing number of repetitions, we can increase the sample size. Note, however, that the differences \(a_{ij}-b_{ij}\) become more and more dependent, since we are effectively analyzing the same data over and over again. This is of course a clear violation of the assumptions of the test; however, this violation is not immediately obvious. Depending on our intentions, we could now choose to present either the result of the single cross-validation (“the observed difference was not significant”) or the result of the 10 times repeated tenfold cross-validation (“the observed difference was highly significant”).

4.3 Same p value, different null hypothesis

It is not generally appreciated that for every point null hypothesis, there is another point null hypothesis, possibly with a very different null value, that has exactly the same p value. We call it the entangled null hypothesis . Consider the following example. Bob and Alice analyze independently the accuracy of two classifiers, A and X , where X is an established classifier and A is their novel classifier. Bob’s null hypothesis is that there is no difference in performance, i.e., \(\text {H0:}~\delta = \tau _A - \tau _X = 0\) , where \(\tau _A\) and \(\tau _X\) refer to the true, unknown accuracy of classifier A and X , respectively.

Alice considers another null hypothesis, \(\text {H0:}~\delta = \tau _A - \tau _X = 0.10\) . Suppose now that both Alice and Bob obtain exactly the same p value of, say, 0.15. Bob does not reject the hypothesis of equal performance. Alice, on the other hand, concludes that the data are consistent with a rather large difference (10%) in performance. Thus, neither the null hypothesis of no difference nor its entangled hypothesis of a relatively large difference can be rejected, since the data are consistent with both hypotheses. These results cannot be easily reconciled within the framework of significance testing; with confidence curves, however, they can.

4.4 p value and replicability

Let us consider the following example. Suppose that Alice carried out an experiment \(E_{a1}\) and obtained a p value of \(p_{a1}=0.001\) . Bob carried out another experiment, \(E_{b1}\) , and obtained \(p_{b1}=0.03\) . Now, Alice is going to repeat her experiment. In her new experiment, \(E_{a2}\) , everything is the same as in \(E_{a1}\) . The only difference is that Alice is going to use a different sample. The size of this sample will be the same as that of \(E_{a1}\) . For example, the sample from \(E_{a1}\) is a test set of \(n_{a1}=100\) cases, and the sample from \(E_{a2}\) is a new test set of \(n_{a2}=100\) new cases, which are randomly drawn from the same population of interest. Bob, too, repeats his experiment in this way. We invite the reader to ponder briefly over the following question: who is more likely to get a significant result in the second experiment, Alice (who obtained \(p_{a1}=0.001\) ) or Bob (who obtained \(p_{b1}=0.03\) )?

One might be tempted to answer “Alice—because her initial p value is much smaller than Bob’s. Surely, the p value must tell us something about the replicability of a finding, right?” But this is not so. Our interpretation of replicability implies that the smaller the first p value is, the more likely it is that the second p value (from an exact replication study) will be smaller than 0.05, given that the null hypothesis is false. Under the null hypothesis, the p value takes on values randomly and uniformly between 0 and 1. But if the null hypothesis is false, it is the statistical power that determines whether a result can be replicated or not. Power depends on three main factors, (i) the alpha level for the test (the higher the level, the higher the power of the test, everything else being equal); (ii) the true effect size in the target distribution (the larger this effect, the higher the power of the test, everything else being equal); and (iii) the sample size (the larger the test set, the higher the power of the test, everything else being equal) (Fraley and Marks 2007 ; Schmidt and Hunter 1997 ). Note that these factors are constants. If the true difference in performance is \(\delta \) , and the alpha level is fixed, and the size of the test set is n , then the power of our test is the same in any study, regardless of the concrete makeup of the sampled test set. By contrast, the p value does depend on the concrete makeup. The power of a test to detect a particular effect size in the population of interest can be calculated before the experiment has been carried out. By contrast, the p value can be calculated only after the experiment has been carried out. As the power does not depend on the p value, the p value is irrelevant for assessing the likelihood of replicability. We remember that power is a concept from the Neyman–Pearsonian school of thought, while the p value is a concept from the Fisherian school of thought. Carver ( 1978 ) notes:

It is a fantasy to hold that statistical significance reflects the degree of confidence in the replicability or reliability of results. (Carver 1978 , p. 384)

The misinterpretation of a significant result as an indicator of replicability is known as replication fallacy .

Greenwald et al. ( 1996 ) showed that the p value is monotonically related to the replicability of a non-null finding. However, in their study, replicability is understood differently, namely as the probability of data at least as extreme as the observed data under an alternative hypothesis H1, which is defined post-hoc and with the effect size from the first experiment, i.e., Pr (observed or more extreme data|H1). Numerical examples can be found in (Krueger 2001 ).

4.5 The p value and Jeffreys–Lindley paradox

Suppose that we compare four classifiers over 50 data sets and use a significance test for the null hypothesis of equal performance. Assume that we obtain a very small p value of, say, 0.005. Many researchers might think that it is now straightforward how to interpret this result; however, it is actually not so obvious. The reason is the Jeffreys–Lindley paradox . This paradox is a well-known conundrum in inferential statistics where the frequentist and Bayesian approach give different results. Assume that H0 is a point null hypothesis and x the result of an experiment. Then the following two statements can be true simultaneously (Lindley 1957 ):

A significance test reveals that x is significant at level \(\alpha \) .

The posterior probability for H0 given the result x , \(P(\mathrm {H0}|x)\) , can be as high as \(1-\alpha \) .

This means that a significance test can reject a point null hypothesis with a very small p value, although at the same time, there is a very high probability that the null hypothesis is true. The lesson here is that the single p value for the null hypothesis, even when it is extremely small, can be more difficult to interpret than is commonly assumed.

5 Arguments against omnibus tests for comparing classifiers

When the performance of multiple classifiers is compared on more than one data set, it is now common practice to account for multiplicity effects by means of an omnibus test. Here, the global null hypothesis is that there is no difference between any of the classifiers. If the omnibus test gives a significant result, then we may conclude that there is a significant difference between at least one pair of classifiers. A post-hoc test can then be applied to detect which pair(s) are significantly different. The Friedman test is a non-parametric omnibus test for analyzing randomized complete block designs (Friedman 1937 , 1940 ). This test is now widely used for the comparison of multiple classifiers (Demšar 2006 ), together with the Nemenyi post-hoc test (Nemenyi 1963 ). However, there are several problems with this approach.

First, in contrast to common belief, the Friedman test is not a non-parametric equivalent of the repeated-measures ANOVA, but it is a generalization of the sign test (Zimmerman and Zumbo 1993 ; Baguley 2012 ). This is because the ranks in the Friedman test depend only on the order of the scores (here, observed performance) within each subject (here, data set), but the test ignores the differences between subjects. As a sign test, the Friedman test has relatively low power (Zimmerman and Zumbo 1993 ). Baguley ( 2012 ) advises us that rank transformation followed by ANOVA is both a more powerful and robust alternative.

Second, the Friedman test sacrifices information by requiring that real values are rank-transformed. Sheskin ( 2007 ) explains that this is one reason why statisticians are reluctant to prefer this test over parametric alternatives, even if one or more of their assumptions are violated. Furthermore, note that the transformation into ranks depends on the rounding of the real values. For example, assume that three classifiers achieve the following accuracies on one of the data sets: 0.809, 0.803, and 0.801. The corresponding ranks are then 1, 2, and 3. If we round the values to two decimal places, then the ranks are 1, 2.5, and 2.5. It is possible that such ties can change the result from non-significant to significant. It is somehow disconcerting that mere rounding can have such an effect on the outcome of the test.

Third, it is widely assumed that post-hoc tests for multiple comparisons may be conducted only if an omnibus test has first given a significant result. The rationale is that we need to control the family-wise Type I error rate. But there is an alternative view among statisticians that adjustments for multiple testing are not necessarily needed (Rothman 1990 ; Poole 1991 ; Savitz and Olshan 1998 ), and that such adjustments can even create more problems than they solve (Perneger 1998 ). If the omnibus test is a one-way ANOVA, then post-hoc tests are valid, irrespective of the outcome of the omnibus test (Sheskin 2007 ). Hsu ( 1996 ) deplores that it has become an unfortunate common practice to pursue multiple comparisons only when the global null hypothesis has been rejected.

Implicit in the application of the Friedman test is the premise that the global null hypothesis is the most important one: unless we can reject it, we are not allowed to proceed with post-hoc tests. Cohen ( 1990 ) argues that we already know that the null hypothesis is false because the difference is never precisely 0; hence, “[...] what’s the big deal about rejecting it?” (Cohen 1990 , p. 1000). According to Rothman ( 1990 ), there is no empirical basis for a global null hypothesis. Following Rothman’s line of thought, let us suppose that we compare two classifiers, X and Y , on a data set D and observe that X is significantly better. Would we then not recommend X over Y for data sets that are similar to D ? But now suppose that we apply three classifiers, X , Y , and Z to the data set D . We use an omnibus test (or otherwise correct for multiple testing), and we now fail to reject the global null hypothesis of equal performance. Would we still recommend X over Y despite the lack of significance? The difference in accuracy (or whichever metric we are using) between X and Y has not changed—it has of course nothing to do with Z . A defender of omnibus tests might say that by making more comparisons, we have to pay a “penalty for peeking” (Rothman 1990 , p. 46), i.e., adopt a stricter criterion for statistical significance. But let us consider the following simplified scenario. Alice designs a study to compare the performance of a support vector machine with random forest on a particular data set. She carries out her experiments in the morning and observes that the support vector machine performs significantly better than random forest. No corrections for multiple testing are needed because there are just two classifiers. Out of curiosity, Alice then applies naive Bayes to the same data in the evening. Clearly, Alice’s new experiment has no effect on her earlier experiments, but should she make multiplicity adjustments? This question does not have an obvious answer because it is not clear where the boundaries of one experiment end and those of another one begin; compare (Rothman 1990 ; Perneger 1998 ). Our stance is that adjustments for multiple testing are necessary under some circumstances, for instance, in confirmatory studies where we pre-specify a goal (or prospective endpoints). In exploratory studies, however, we recommend reporting unadjusted p values, while clearly highlighting that they result from an exploratory analysis. Comparative classification studies are generally exploratory, as they normally do not pre-specify any prospective endpoints. Thus, omnibus tests are not needed.

Fourth, to apply the Friedman test, we first need to rank the classifiers from “best” to “worst” for each data set. However, how meaningful is it to give a different rank to classifiers whose performances differ only very slightly? For example, in Guyon et al. ( 2009 ), the top 20 models from KDD Cup 2009 were analyzed based on the Friedman test. The model with rank 1 scored \(\hbox {AUC} = 0.9092\) on the upselling test set, while the model with rank 20 scored \(\hbox {AUC} = 0.8995\) . Is it meaningful to impose such an artificial hierarchy? We believe that this ranking rather blurs the real picture, namely that all top 20 classifiers performed virtually the same.

Fifth, the Friedman test assumes that the subjects (data sets) have been randomly sampled from one superpopulation. It can be questioned whether, in practice, data sets are ever randomly sampled. Surely, purely pragmatic reasons, such as availability, at least influence (if not guide) the choice of data sets. While this violation of a basic assumption is probably widely known, it seems to be tacitly ignored. The choice of data sets may, moreover, depend on what we hope to demonstrate: do we want to show that there is no difference in performance, or do we want to show that there is? In Sect.  7.1 , we illustrate how we can easily tweak the results by considering different combinations of data sets and classifiers.

Sixth, the experimental results that led to the recommendation of the Friedman test are based on the estimation of replicability as a function of p values (Demšar 2006 ). However, as we discussed in Sect.  4.4 , this approach is questionable.

Finally, our last argument is perhaps the most compelling one. Consider the following simplified example, where \(k = 3\) conditions ( \(C_1\) , \(C_2\) , and \(C_3\) ) are applied to \(N = 5\) subjects (Table  1 ).

The numbers reflect the effect of a condition on a subject. The null hypothesis is stated as \(\text {H0:}~\theta _1 = \theta _2 = \theta _3\) , i.e., the median of the population that the numbers 3, 2, 1, 3, 2 represent equals the median of the population that the numbers 4, 5, 6, 4, 5 represent, which also equals the median of the population that the numbers 8, 9, 10, 9, 7 represent. When the null hypothesis is true, the sum of ranks of all three conditions will be equal. The p value of the Friedman test is the probability of observing sums of ranks at least as far apart as the observed ones, under the null hypothesis of no difference.
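As an illustration (ours; the layout of Table 1 is assumed to match the vectors below), the Friedman test for this example can be run directly in base R, with subjects in rows and conditions in columns.

```r
# Subjects in rows, conditions C1, C2, C3 in columns (values quoted in the text)
scores <- cbind(C1 = c(3, 2, 1, 3, 2),
                C2 = c(4, 5, 6, 4, 5),
                C3 = c(8, 9, 10, 9, 7))

# friedman.test() ranks the conditions within each row (subject)
friedman.test(scores)
# Friedman chi-squared = 10, df = 2, p-value ~ 0.0067:
# the rank sums (5, 10, 15) are as far apart as they can possibly be
```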

Let us now consider the following (simplified) drug trial. We administer three drugs, \(C_1\) , \(C_2\) , and \(C_3\) (one after another, allowing for adequate washout phase, etc.), to \(N=5\) patients and measure how well these drugs improve a certain condition. Here, each patient is a subject, and each drug is a condition (Table  1 ). From the result of the Friedman test, we can make an inference to the population of patients with similar characteristics , which means patients with the same medical condition, age, sex, etc. For example, we might reject the global null hypothesis that all drugs are equally effective, and a post-hoc test might tell us that drug \(C_3\) is significantly better than \(C_1\) and \(C_2\) . Thus, we might conclude that \(C_3\) should be given to the target patients. In this scenario, the Friedman test can be used.

However, if the conditions are classifiers and the subjects are benchmark data sets, then we have a problem. Commonly used data sets (e.g., the Transfusion and the King-rook-vs-king-pawn data sets from the UCI repository) are completely diverse entities that cannot be thought of as originating from one superpopulation. There is simply no “population of data sets.” This means that the numbers 3, 2, 1, 3, 2 for condition \(C_1\) , for example, cannot be a sample from one population. Unless the subjects originate from the same population, all inferences based on the Friedman test are elusive. In fact, this last argument applies to any omnibus test, not just the Friedman test. Therefore, such tests are inappropriate when the subjects represent diverse data sets.

6 Confidence curve

We now present an alternative method, the confidence curve , which is a variant of Birnbaum’s confidence curve estimator (Birnbaum 1961 ). Confidence curves enable us to assess simultaneously how compatible an infinite number of null hypotheses are with our experimental results. Thereby, our focus is no longer on “the” null hypothesis of no difference, its single p value, and the question whether it is significant or not. This shift of focus can help us avoid the major problems of NHST and the p value.

6.1 Illustration of key concepts

A confidence curve is a two-dimensional plot that shows the nested confidence intervals of all levels for a point estimate. We consider the effect size , which we define as follows.

Definition 1

( Effect size ) Let \(\tau _A\) and \(\tau _B\) be the true performance and \(o_A\) and \(o_B\) be the observed performance of two classifiers, A and B , on data D . The true (unstandardized) effect size is the difference \(\delta = \tau _A - \tau _B\) . The observed difference \(d = o_A - o_B\) is the point estimate of \(\delta \) .

For example, if a classifier A achieves 0.85 accuracy on a specific data set while a classifier B achieves only 0.70, then the estimated effect size is \(d = 0.15\) . Of course, the effect size could be measured based on any performance metric.

Fig. 3 Key elements of the confidence curve for the difference in performance between two models. By mentally sliding the red vertical line (“null line”) along the x -axis, we can assess how compatible the corresponding null hypothesis is with the observed data. This compatibility is maximal ( \(p\, \mathrm{value} =1.0\) ) when the red line reaches d , which corresponds to the null hypothesis H0: \(\delta =d\) . As the red line moves away from d in either direction, the corresponding null hypotheses become less and less compatible with the observed data.

Figure  3 illustrates the key features of the confidence curve. In this example, the point estimate of the effect size is 0.15. The plot has two complementary y -axes. The left y -axis shows the p value associated with each null value, which is shown on the x -axis. The right y -axis shows the confidence level. In Fig.  3 , the p value of “the” null hypothesis of no difference, \(\text {H0:}~\delta = 0\) , is 0.15. Each horizontal slice through the curve gives one confidence interval. For example, the 95%-confidence interval for \(\delta \) is the slice through the p value of 0.05. Technically, the maximum of the curve gives the zero percent confidence interval. In this example, the confidence intervals are symmetric; thus, we obtain a “tipi”-shaped symmetric confidence curve around the point estimate.

In Fig.  3 , we see that the null value of no difference lies within the 95%-confidence interval. By conventional criteria, we would therefore fail to reject the null hypothesis of no difference. However, note that a confidence curve should not be used as a surrogate significance test. A confidence curve can tell us much more. First and foremost, it disentangles the effect size from the precision of its measurement. The effect size is the magnitude of the observed difference, d , while its precision is given by the width of the curve. The wider the curve, the less precise is the measurement, and the narrower the curve, the more precise is the measurement. The area under the confidence curve (AUCC) can therefore be considered a measure of precision.

Note that every p value is associated with exactly two different null values because we are considering precise point null hypotheses (e.g., \(\text {H0:}~\delta = 0\) ). Consider the dotted horizontal line at the p value of 0.15 (Fig.  3 ). This line crosses the curve at the “entangled” null value (marked by a star), which corresponds to the null hypothesis \(\text {H0:}~\delta = 0.30\) in this example. This means that both the null hypothesis of no difference and the entangled null hypothesis are associated with exactly the same p value. Therefore, there is no reason why we should prefer one null hypothesis over the other; both hypotheses are equally compatible with the data.
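Under a normal approximation, this entanglement is easy to verify numerically. The following sketch (ours) assumes the point estimate d = 0.15 and a standard error back-calculated so that H0: \(\delta = 0\) has a p value of 0.15, as in Fig. 3; the standard error is therefore illustrative only.

```r
d  <- 0.15                      # observed difference (point estimate)
se <- d / qnorm(1 - 0.15 / 2)   # standard error such that p(delta = 0) = 0.15

# Two-sided p value for any point null hypothesis H0: delta = x
p_value <- function(x) 2 * (1 - pnorm(abs(x - d) / se))

p_value(0)        # 0.15 -- "the" null hypothesis of no difference
p_value(2 * d)    # 0.15 -- the entangled null value delta = 0.30
```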

Figure  3 shows the infinite spectrum of null hypotheses on the x -axis and how compatible they are with our data at any given level of confidence. Not surprisingly, the null hypothesis \(\text {H0:}~\delta = d\) is most compatible. The left y -axis shows all possible p values. It might perhaps seem surprising to see p values appear in the proposed alternative method, given all the arguments against the p value in Sect.  4 . However, note that these arguments pertain to the single p value from one single null hypothesis test. Poole ( 2001 ) refers to this p value more precisely as the “null p value” (p. 292). But in contrast to NHST, confidence curves do not give undue emphasis to a single p value.

To check the compatibility of other null hypotheses with the obtained data, one can easily imagine sliding the red “null line” in Fig.  3 along the x -axis and see where it intersects with the confidence curve. Thus, confidence curves allow us to check the compatibility of an infinite number of null hypotheses with our experimental results. Compatibility is a gradual, not a dichotomous characteristic; some hypotheses are more, others are less compatible.

6.2 Confidence curves for the effect size in repeated cross-validation

We consider only one data resampling scheme, r -times repeated k -fold cross-validation, because it is probably the most widely used strategy. A \((1-\alpha )100\%\) confidence interval for the true effect size can be derived from the variance-corrected resampled t test (Nadeau and Bengio 2003 ; Bouckaert and Frank 2004 ),

\[ \bar{d} \;\pm\; t \cdot s \sqrt{\frac{1}{kr} + \frac{n_2}{n_1}}, \]

where t is the critical value of Student’s distribution with \(\nu =kr-1\) degrees of freedom; k is the number of cross-validation folds; r is the number of repetitions; \(n_2\) is the number of cases in one validation set; and \(n_1\) is the number of cases in the corresponding training set (where \(n_1 \approx 5n_2\) ). The standard deviation s is calculated as

\[ s = \sqrt{\frac{1}{kr-1}\sum _{i=1}^{r}\sum _{j=1}^{k}\bigl(d_{ij}-\bar{d}\bigr)^2}, \]

where \(d_{ij}\) is the difference between the classifiers in the i th repetition of the j th cross-validation fold, and \(\bar{d}\) is the average of these differences. Note that in the case of repeated cross-validation, the sampling intention is clear. The confidence intervals, and thereby the confidence curve, assume that exactly kr samples were to be taken.

The confidence curve consists of an infinite number of nested confidence intervals for the true effect size. This nesting can be described as follows. Let \(F_{d,\sigma }(x)\) be a cumulative distribution function with density \(f_{d,\sigma }(x)\) . The confidence curve c ( x ,  d ) is then defined as shown in Eq. ( 1 ),

\[ c(x,d) = \begin{cases} 2\,F_{d,\sigma }(x) & \text{if } x \le d,\\ 2\,\bigl(1-F_{d,\sigma }(x)\bigr) & \text{if } x > d. \end{cases} \qquad (1) \]

When the degrees of freedom are sufficiently large ( \(\nu > 30\) ), the t -distribution approximates the standard normal distribution, and F can be approximated by the cumulative distribution function of the normal distribution, \(\Phi \) . The difference between Eq. ( 1 ) and Birnbaum’s estimator is the factor 2, which is needed for two-sided p values.

The basic algorithm for plotting confidence curves consists of four simple steps: (1) calculating a few dozen confidence intervals at different levels, from 99 to 0%; (2) plotting \(\alpha \) as a function of the lower bound; (3) plotting \(\alpha \) as a function of the upper bound; and (4) interpolating through all points. These four steps suffice to plot the confidence curve for a difference in performance that is measured in r times repeated k -fold cross-validation; the supplementary material at https://github.com/dberrar/ConfidenceCurve contains the corresponding R code PlotConfidenceCurve .
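The following minimal base-R sketch (ours, not the authors' PlotConfidenceCurve) illustrates these four steps for the variance-corrected interval above; a matrix diffs of the per-fold differences and the fold sizes n1 and n2 are assumed as inputs.

```r
# Minimal sketch: confidence curve for r-times repeated k-fold cross-validation.
# diffs: r x k matrix of per-fold differences d_ij between two classifiers;
# n1, n2: number of training and validation cases in one fold.
plot_confidence_curve <- function(diffs, n1, n2) {
  k     <- ncol(diffs)
  r     <- nrow(diffs)
  d_bar <- mean(diffs)                       # point estimate of the effect size
  s     <- sd(as.vector(diffs))              # standard deviation of the d_ij
  se    <- s * sqrt(1 / (k * r) + n2 / n1)   # variance-corrected standard error
  nu    <- k * r - 1                         # degrees of freedom

  # Step 1: confidence intervals at many levels (99% down to ~0%)
  alpha  <- seq(0.01, 0.99, by = 0.01)
  t_crit <- qt(1 - alpha / 2, df = nu)
  lower  <- d_bar - t_crit * se
  upper  <- d_bar + t_crit * se

  # Steps 2-4: plot alpha against the lower and upper bounds and interpolate
  plot(c(lower, rev(upper)), c(alpha, rev(alpha)), type = "l",
       xlab = "null value (difference in performance)", ylab = "p value")
  abline(v = 0, col = "red")                 # "null line" for H0: delta = 0
  invisible(list(estimate = d_bar, se = se, df = nu))
}
```

Mentally sliding the red null line along the x-axis, as described in Sect. 6.1, then amounts to reading off where a vertical line at any other null value intersects this curve.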

6.3 Area under the confidence curve

When we compare the performance of many classifiers and there are space limitations, it can be preferable to tabulate the results instead of plotting all confidence curves. Confidence curves can be summarized by two values: the point estimate and the width of the curve. The wider the confidence curve, the less precise is the measured performance difference. Thus, the area under the confidence curve (AUCC) can be used as a measure of precision. But clearly, by using a single scalar, we lose important information about the classification performance. For r -times repeated k -fold cross-validation, the area is given by Eq. ( 2 ) (see “Appendix” for details).
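The closed-form expression of Eq. (2) is derived in the Appendix; as a sanity check, the area can also be obtained numerically. The sketch below (ours) reuses the point estimate, corrected standard error, and degrees of freedom returned by the plotting sketch in Sect. 6.2.

```r
# Numerical area under the confidence curve (smaller area = higher precision).
# d_bar, se, nu: as returned by plot_confidence_curve() above.
aucc <- function(d_bar, se, nu) {
  curve_height <- function(x) 2 * (1 - pt(abs(x - d_bar) / se, df = nu))
  integrate(curve_height, lower = -Inf, upper = Inf)$value
}
```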

6.4 Further notes on confidence curves

In this section, we will illustrate how to use confidence curves.

6.4.1 Statistical significance versus effect size and precision

Fig. 4 Statistical significance versus effect size and precision. Confidence curves show nested, non-symmetric Quesenberry and Hurst confidence intervals for the difference in error rates on the test set (Berrar and Lozano 2013 ). a A narrow curve indicating the absence of any strong effect despite a significant result. The null value \(\delta =0\) ( red line ) lies outside the 95%-CI. The measurement is very precise. b A wide curve indicating that the data are readily compatible with a moderate to a strong effect despite a non-significant result. The null value \(\delta =0\) ( red line ) lies inside the 95%-CI. The measurement is not very precise.

In the example shown in Fig.  4 , we consider the error rate as performance measure. The confidence curve in Fig.  4 a is a narrow spike, which indicates a highly precise measurement. The null hypothesis of equal performance can be rejected because the null value, \(\delta = 0\) , is outside the 95%-CI of [0.00017, 0.00843] for the point estimate \(d = 0.00430\) . Note that the upper bound of the 95%-CI is quite close to the null value. Emphasizing the significance would therefore be misleading in this study. The correct interpretation is that the data are not even compatible with a moderate effect.

Figure  4 b shows a wider curve, which indicates that the measurement is less precise. The null value lies within the 95%-CI of \([-0.038, 0.231]\) for the point estimate \(d=0.10\) . Based on the conventional criterion, we would not reject the null hypothesis, but the curve in Fig.  4 b indicates at least a moderate effect. In fact, null values that are readily compatible with the data span across a relatively wide range. Emphasizing the lack of significance would be misleading in this study. The correct interpretation is that the data are compatible with a moderate to a large effect.

In which situations can we expect such confidence curves? Suppose that we compare the performance of two models, A and B , and A is only marginally better than B . Suppose that this difference is in fact truly negligible for all practical applications. By using a sufficiently large test set, however, we can “make” this difference significant. This is a well-known effect that results from increasing the sample size. Footnote 4 Figure  4 a shows the result of such an overpowered study. Here, both the training and the test set contain 10,000 cases; 5000 cases belong to the positive class and 5000 belong to the negative class. Each positive case is described by a 10-dimensional feature vector, with elements randomly sampled from \(\mathcal {N}(0,1)\) . The features of the negative cases are randomly sampled from \(\mathcal {N}(1.5,1)\) . We trained a classification and regression tree (CART) on the training set and then applied it to the test set. This is model A . Model B is also a CART, but trained on a deliberately corrupted training set: the class labels of 50 randomly selected positive and 50 randomly selected negative cases were swapped. To show that this deliberate corruption has a small but negligible effect, we repeated the experiment 1000 times. We observed that the uncorrupted model performs significantly better ( \(p < 0.05\) , McNemar’s test) than the corrupted counterpart in 171 experiments. In 139 experiments, however, the corrupted model performed significantly better. In the remaining 690 experiments, there was no significant difference between the two models. The mean and median differences in all 1000 experiments were only 0.0023 and 0, respectively. Figure  4 a shows an experiment where the uncorrupted model was slightly better.
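The following sketch (ours) outlines one run of this overpowered setup, assuming the rpart implementation of CART with default settings; because only one run is shown, the resulting p value will vary from seed to seed.

```r
library(rpart)
set.seed(1)

# 10-dimensional Gaussian data: positives from N(0,1), negatives from N(1.5,1)
make_data <- function(n_per_class) {
  X <- rbind(matrix(rnorm(n_per_class * 10, mean = 0),   ncol = 10),
             matrix(rnorm(n_per_class * 10, mean = 1.5), ncol = 10))
  data.frame(X, y = factor(rep(c("pos", "neg"), each = n_per_class)))
}
train <- make_data(5000)
test  <- make_data(5000)

# Model A: CART on the clean training set
fit_a <- rpart(y ~ ., data = train, method = "class")

# Model B: CART on a corrupted copy (50 labels per class swapped)
corrupted <- train
swap <- c(sample(which(train$y == "pos"), 50), sample(which(train$y == "neg"), 50))
corrupted$y[swap] <- ifelse(train$y[swap] == "pos", "neg", "pos")
fit_b <- rpart(y ~ ., data = corrupted, method = "class")

# Paired comparison of the two models on the same test cases
correct_a <- predict(fit_a, test, type = "class") == test$y
correct_b <- predict(fit_b, test, type = "class") == test$y
mcnemar.test(table(correct_a, correct_b))
```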

In the second experiment (Fig.  4 b), the learning set contains only 100 cases, each described by 10 numerical attributes. The first 50 cases belong to the positive class and the remaining 50 cases to the negative class. For the positive cases, the attributes take on values randomly from \(\mathcal {N}(0,1)\) . For the negative cases, the attributes take on values randomly from \(\mathcal {N}(0.2,1)\) . We trained a random forest classifier and implemented the competing model as a fair coin. We expected that random forest would perform better than the coin, but we obtained the following result: the coin made 55 errors, while random forest made 45 errors. This difference is not significant ( \(p = 0.20\) , McNemar’s test).

The lesson is that a significant result can be misleading if a strong effect is absent. On the other hand, a non-significant result can be meaningful if there is evidence of a strong effect. Confidence curves keep our focus on what really matters: the effect size and its precision.

6.4.2 Replicability versus reproducibility

It is expedient to distinguish between replicability and reproducibility. According to Drummond ( 2009 ), replicability means that the exact experimental protocol can be repeated. By contrast, reproducibility means that the same results can be obtained by other experiments. To replicate a classification study, it would be necessary to make publicly available not only the source code and protocol details, but also all resampled data subsets. Although this is possible, as demonstrated by the OpenML project, Footnote 5 we offer for debate whether replicability is really so desirable. We concur with Drummond ( 2009 ), who argues that replicability is an impoverished version of reproducibility.

Comparative classification studies of course do not always involve the same data resampling schemes. Suppose that one study investigated the difference between two algorithms on the basis of five times repeated cross-validation, while another study used tenfold stratified cross-validation. The first study failed to detect a significant difference, whereas the second one did. How can we reconcile these apparently contradictory results? Replicating both studies will not solve the problem because we would obtain the same results as before.

Fig. 5 a – e Confidence curves for the difference in accuracy between random forest and CART on the Ionosphere data set based on different cross-validation schemes; f the average of all confidence curves ( solid black curve )

Consider the following example where we compared the accuracy of random forest and CART on the Ionosphere data set (Fig.  5 ). We used five different resampling strategies, (a) tenfold cross-validation without stratification; (b) 100 times stratified repeated tenfold cross-validation; (c) tenfold stratified cross-validation; (d) five times repeated twofold cross-validation; and (e) one training set with 70% and one test set with 30% cases (split-sampling).

Let us assume that these resampling strategies represent five studies, published by five different research groups. Furthermore, let us assume that all groups use significance testing. Studies (d) and (e) indicate that there is no difference in performance, in contrast to studies (a)–(c). If all groups published their results, then the literature could be deemed inconclusive regarding the difference between the two algorithms for this particular classification problem. Assuming that there were no errors in the original studies, we would not gain any new insights by replicating them.

In contrast, confidence curves can reconcile the apparently contradictory results. The confidence curves in Fig.  5 suggest that the individual studies reproduce each other. The curves in Fig.  5 convey essentially the same message. All studies point to a moderate effect, as can be seen in the average of the confidence curves in Fig.  5 f. Random forest outperforms CART, and the true difference in accuracy is about 0.06. Thus, the five studies are in fact confirmatory, not contradictory.

Furthermore, the reliance on statistical tests can lead to a publication bias. We speculate that many researchers feel that a study should not be submitted for publication if the result is not significant, and vice versa. Suppose that only the significant results (a)–(c) were published. These studies would then indicate that the effect (i.e., the difference in performance between random forest and CART on the Ionosphere data set) is \(\frac{1}{3}(0.069+0.060+0.068)=0.066\) . This value overestimates the true difference, which, based on all experiments, is only 0.058 (Fig.  5 f).

Consider now the following scenario. Assume that the confidence curves from Fig.  5 refer to clinical trials on the effectiveness of a drug. Suppose that this drug really has a small but beneficial effect, as shown in Fig.  5 a–c. Let us further assume that a new study (Fig.  5 e) is conducted. If this new study focused only on significance, then it would erroneously refute the earlier studies (“no significant effect of the drug was observed”). But the correct interpretation is that this new study confirms the previous ones. Rothman et al. ( 2008 ) give two real-world examples of clinical trials where such erroneous conclusions were drawn because of the focus on significance. Confidence curves, on the other hand, would most likely have prevented the investigators from misinterpreting their findings.

6.4.3 Comparison with a null model

Instead of constructing confidence curves for the difference between two real classifiers, A and B , we can construct the curves for A and B with a common baseline model or null model . A natural choice for the null model is the majority voter, which predicts each case as a member of the most frequent class. Another possible choice is the empirical classifier, which uses only the class prior information. If the proportion of class c is p in the training set, and the test set contains n cases, then the empirical classifier will classify np test cases as members of class c ; these cases are selected at random. For example, assume that the training set contains only two classes, \(c_1\) and \(c_2\) , and the class ratio is \(c_1{:}c_2=30{:}70\) in the learning set. Let the test set contain 200 cases. Then \(200 \times 0.3 = 60\) randomly selected test cases will be predicted as members of class \(c_1\) and \(200 \times 0.7 = 140\) randomly selected test cases will be predicted as members of class \(c_2\) . One advantage is that the plot now shows how much a real classifier has learned beyond the information provided by the class distribution in the training set. When we compare n classifiers, we do not need to produce \(\frac{1}{2}n(n-1)\) confidence curves for all pair-wise comparisons but only n curves, i.e., A versus null model, B versus null model, etc. When multiple confidence curves are produced in the same study, it is possible to control the family-wise error rate by adjusting \(\alpha \) , which leads to a widening of the curves. However, given the arguments in Sect.  5 , we advise against such adjustments in comparative classification studies.
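A minimal sketch of the empirical classifier described above might look as follows (the function name is ours); it assigns the training-set class proportions to randomly chosen test cases.

```r
# Empirical classifier (null model): predict test labels purely from the
# class proportions in the training labels, ignoring all features.
empirical_classifier <- function(train_labels, n_test) {
  prop <- table(train_labels) / length(train_labels)
  # round() keeps the counts integer; with ratio 30:70 and 200 test cases
  # this yields 60 and 140 predictions, as in the example above
  counts <- round(prop * n_test)
  sample(rep(names(prop), counts))   # random assignment of the labels
}

set.seed(1)
train_labels <- rep(c("c1", "c2"), times = c(30, 70))
table(empirical_classifier(train_labels, n_test = 200))
#  c1  c2
#  60 140
```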

Fig. 6 Confidence curves for the difference in accuracy between a random forest and CART, and b random forest and the majority voter as null model

Figure  6 a shows the confidence curve for the difference between random forest and CART. Random forest achieved an accuracy of 0.796 whereas CART achieved 0.782. The point estimate of the difference is 0.014, which is quite close to the null value 0. The seemingly good performance of around 80% could invite us to speculate that both random forest and CART have learned something from the features, but this is not the case here. The training set contains 1000 cases, each described by a 10-dimensional feature vector of real values from \(\mathcal {N}(0,1)\) . The cases were randomly assigned either a positive or negative class label, with a ratio of 20:80. As the features do not discriminate the classes, no classifier is expected to perform better than the majority voter. Figure  6 b shows the confidence curve for the difference between random forest and the majority voter. The majority voter performs slightly better than random forest: the point estimate of the difference in accuracy is \(d = o_{RF} - o_{MV} = -0.004\) . Thus, for this data set, there is no reason why we should prefer random forest over the null model.

7 Experiments

We will now use confidence curves and significance testing in a real classification study. Let us imagine two researchers, Alice and Bob, who wish to compare the performance of four classifiers over 14 data sets. To interpret their results, Alice decides to use the Friedman test with Nemenyi post-hoc test, whereas Bob chooses confidence curves. Alice and Bob benchmark the same classifiers on the same data sets based on average accuracy in 10-times repeated tenfold stratified cross-validation (Table  2 ).

The real-world data sets (#1–#10) are from the UCI machine learning repository (Lichman 2013 ). The synthetic data sets were generated as follows. Synthetic 1 consists of 100 cases, half of which belong to the positive class and the other half to the negative class. Each case is described by a 10-dimensional numerical feature vector, \(\mathbf x =(x_1,x_2,\ldots ,x_{10})\) . The values \(x_i\) of the positive cases \(\mathbf x _+\) are randomly sampled from \(\mathcal {N}(0,1)\) . The values of the negative cases \(\mathbf x _-\) are sampled from \(\mathcal {N}(0.5,1)\) .

Synthetic 2 consists of 100 cases; 30 cases belong to the positive class, and 70 cases belong to the negative class. Each case is described by a 10-dimensional feature vector. All values \(x_i\) are randomly sampled from \(\mathcal {N}(0,1)\) ; hence, the features do not discriminate the classes.

Synthetic 3 consists of 100 cases; 20 cases belong to the positive class and 80 cases belong to the negative class. Each case is described by a 10-dimensional feature vector. The ten feature values of the negative and positive cases were randomly sampled from \(\mathcal {N}(0,1)\) and \(\mathcal {N}(0.5,1)\) , respectively.

Synthetic 4 consists of 100 cases, half of which belong to the positive class and the other half to the negative class. Each case is described by a 20-dimensional feature vector. For cases of the negative class, the first ten feature values \((x_1,x_2,\ldots ,x_{10})\) were randomly sampled from \(\mathcal {N}(0,1)\) . For cases of the positive class, the first ten feature values were randomly sampled from \(\mathcal {N}(0.5,1)\) . Irrespective of the class, the next 10 features, \((x_{11},x_{12},\ldots ,x_{20})\) , were randomly sampled from a uniform distribution \(\mathcal {U}(-1,1)\) .
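For concreteness, the following sketch (ours) shows how data sets of this kind can be generated in R, here for Synthetic 1 and Synthetic 4; Synthetic 2 and Synthetic 3 follow the same pattern with the class ratios and distributions given above.

```r
set.seed(1)

# Synthetic 1: 50 positive and 50 negative cases, 10 informative features
synthetic1 <- data.frame(
  rbind(matrix(rnorm(50 * 10, mean = 0.0), ncol = 10),   # positive class
        matrix(rnorm(50 * 10, mean = 0.5), ncol = 10)),  # negative class
  class = factor(rep(c("pos", "neg"), each = 50))
)

# Synthetic 4: as Synthetic 1, plus 10 irrelevant features from U(-1, 1)
synthetic4 <- data.frame(
  rbind(matrix(rnorm(50 * 10, mean = 0.5), ncol = 10),   # positive class
        matrix(rnorm(50 * 10, mean = 0.0), ncol = 10)),  # negative class
  matrix(runif(100 * 10, min = -1, max = 1), ncol = 10), # irrelevant features
  class = factor(rep(c("pos", "neg"), each = 50))
)
```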

All experiments in Table  2 were carried out in R (R Development Core Team 2009 ). NB is a naive Bayes classifier (implementation from R package e1071 ). RF is a random forest (Breiman 2001 ), and we used the R implementation randomForest with default settings (Liaw and Wiener 2002 ). CART is a classification and regression tree (Breiman et al. 1984 ); we used the R implementation rpart (Therneau et al. 2014 ). EC is an empirical classifier, which uses only the class prior information to make predictions and ignores any covariate information.

7.1 Analysis with Friedman test and Nemenyi post-hoc test

The Friedman test statistic is defined as

\[ \chi ^2_F = \frac{12N}{k(k+1)}\left[\sum _{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right], \]

where N is the number of data sets and k is the number of classifiers; \(R_{j}=\frac{1}{N}\sum _{i}r_{i}^{j}\) is the average rank of the j th algorithm, and \(r_{i}^{j}\) is the rank of the j th algorithm on the i th data set (Demšar 2006 ). Iman and Davenport ( 1980 ) proposed a less conservative statistic,

\[ F_F = \frac{(N-1)\,\chi ^2_F}{N(k-1) - \chi ^2_F}, \]

which is distributed according to the F -distribution with \(\nu _1 = (k-1)\) and \(\nu _2 = (k-1)(N-1)\) degrees of freedom. For the experimental data (Table  2 ), Alice obtains

For \(\nu _1=4-1\) and \(\nu _2=(4-1)(14-1)\) , Alice obtains the critical value of the F -distribution for \(\alpha =0.05\) as \(F(3,39)=2.85\) . As \(F_F > 2.85\) , Alice concludes that there is a significant difference between the classifiers. She therefore proceeds with the Nemenyi post-hoc test to find out between which pairs of classifiers a significant difference exists. Two classifiers perform significantly differently if the absolute difference between their average ranks is at least the critical difference \(CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} = 2.569 \sqrt{\frac{20}{84}} \approx 1.25\) . Alice concludes that all pair-wise comparisons involving EC are significantly different.
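Alice's critical values can be reproduced in base R as follows (a sketch, not part of the original study); the Nemenyi quantile \(q_{0.05} = 2.569\) for \(k=4\) is the Studentized range quantile divided by \(\sqrt{2}\).

```r
k <- 4    # classifiers
N <- 14   # data sets

# Critical value of the F distribution for the Iman-Davenport statistic
qf(0.95, df1 = k - 1, df2 = (k - 1) * (N - 1))           # approx. 2.85

# Nemenyi critical difference: q_alpha * sqrt(k * (k + 1) / (6 * N))
q_alpha <- qtukey(0.95, nmeans = k, df = Inf) / sqrt(2)  # 2.569
q_alpha * sqrt(k * (k + 1) / (6 * N))                    # approx. 1.25
```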

Alice wonders whether the difference between any pair of NB, RF, and CART becomes significant without the null model, EC. By discarding EC, Alice obtains \(F_F = 4.48\) , which exceeds the critical value of \(F(2,26)=3.369\) . Alice therefore applies the Nemenyi post-hoc test again and finds that the absolute difference between the average ranks of RF and CART, 0.964, exceeds the new critical difference of 0.886. Alice thinks: “Surely, it cannot be wrong to report this result, as excluding EC has no effect on the ranks of the other classifiers. The reason why I get a significant result now is that the critical difference has decreased.”

Alice wonders what would happen if she focused only on the real-world data sets and the real classifiers. Alice obtains \(F_F = 6.19\) , which exceeds the critical value of \(F(2,18) = 3.55\) ; therefore, Alice proceeds with the post-hoc test. The new critical difference is 1.05. This time, however, the significant difference between RF and CART vanishes because the absolute difference between their average ranks is only 0.85. By contrast, Alice observes now a significant difference between naive Bayes and random forest. This result puzzles Alice: “What would it actually mean for future model selection if I reported that random forest is significantly better than naive Bayes? This result would clearly be misleading: if my new, unseen data sets happen to be similar to the excluded data sets #11–#14, then naive Bayes—and not random forest—seems preferable...”

Alice then considers other combinations of data sets and classifiers. She includes again data sets #11–#14 but excludes data sets #6–#9, compares naive Bayes, random forest, and CART, and obtains \(F_F = 1.50 < 3.55\) . The Friedman test now tells her that there is no significant difference. Alice realizes her dilemma: by considering various combinations of data sets and classifiers, she can tweak her results.

What Alice did is not correct, though. As we noted in Sect.  4.1 , she should have adhered to everything that she had specified before carrying out her experiments. This means that if she planned to compare four classifiers on 14 data sets, then she cannot change her protocol later; otherwise, the frequentist paradigm is violated and the p value can no longer be interpreted.

This problem is a deep one. In good faith, a researcher might try different combinations of data sets and classifiers. In Guyon et al. ( 2009 ), for example, the performance of the classifiers in the KDD Cup 2009 competition was assessed based on the Friedman test. First, the analysis included all submitted classifiers, and then only the (arbitrarily selected?) top 20 classifiers.

All the juggling with data sets and classifiers and the mathematical details above can easily distract us from a more fundamental and far more important question: is an omnibus test actually appropriate here? Unless we are convinced that the data sets in Table  2 originate from one superpopulation, the answer should be “no” (cf. Sect.  5 ).

7.2 Analysis with confidence curves

Bob decides to compare the performance of the classifiers on each data set individually. For each data set, Bob derives a confidence curve for the performance difference between NB and EC, RF and EC, and CART and EC (Figs.  7 , 8 , 9 ). Table  3 shows the corresponding AUCC.

7.2.1 Interpreting confidence curves

Fig. 7 Confidence curves for the difference in performance between naive Bayes and the null model, random forest and the null model, and CART and the null model for data sets a Sonar, b Spect, c Heart, d Ionosphere, e Transfusion, and f Pima

Fig. 8 Confidence curves for the difference in performance between naive Bayes and the null model, random forest and the null model, and CART and the null model for data sets a Tic-tac-toe, b German, c Liver, and d KRvsKP

Let us consider first the confidence curves for the Sonar data set (Fig.  7 a). The cross-validated accuracy is 0.53 for the null model and 0.83 for random forest, so \(d=0.30\) . The confidence curve for random forest does not overlap much with the confidence curve of the next best model, CART (blue). However, there is quite some overlap between the curves for CART and naive Bayes (red). Unsurprisingly, all classifiers performed better than the empirical classifier; random forest, however, is the preferable model for this data set, as its confidence curve (green) is most to the right. Note that the null value \(\delta =0\) (vertical line) lies well outside the 95%-CI, which means that random forest performs significantly better than the null model.

On the data sets Spect, Heart, and Pima (Fig.  7 b, c, f), we see a large overlap of the confidence curves, indicating that the differences in performance are probably negligible for all practical purposes. Compared to the curves for the Sonar data set, the curves are much narrower, which reflects a higher precision. Given the large overlap of the curves, Bob concludes that the performance differences are negligible and that all classifiers are suitable for these data sets.

On the data set Ionosphere (Fig.  7 d), the confidence curves also show some overlap, but Bob considers them sufficiently far apart to argue that the performance differences matter. His conclusion is that random forest is preferable to CART, which is in turn preferable to naive Bayes.

On the data set Transfusion (Fig.  7 e), naive Bayes and random forest performed essentially the same, whereas CART performed remarkably better. Bob cannot explain why CART, a single tree, could outperform an ensemble of trees in this classification task.

In Fig.  8 a, the confidence curves paint a clear picture: random forest is preferable to CART, which in turn performs substantially better than naive Bayes on the data set Tic-tac-toe. By contrast, the performance differences are less pronounced in the data set German (Fig.  8 b).

For the data set Liver (Fig.  8 c), random forest seems preferable to CART: the point estimate for the difference between random forest and the null model is 0.23, while it is 0.17 for the difference between CART and the null model. Bob could carry out a statistical test to assess the difference between random forest and CART. Without adjustment for multiple testing, the variance-corrected resampled t test gives a significant result ( \(p \ \text {value} = 0.02\) ), but this result adds little to the interpretation. The vertical line at \(\delta =0\) crosses the confidence curve of naive Bayes at the p value of 0.26; hence, naive Bayes does not perform significantly better than the null model. However, at the 95%-confidence level, compatible values for the true difference range from about \(-0.03\) to 0.13, indicating that naive Bayes is preferable to the null model despite the lack of significance.

On the data set KRvsKP (Fig.  8 d), random forest is again the preferable model, although its performance is not substantially different from that of CART. Note the narrowness of the confidence curves, which reflects a high precision. With \(N = 3196\) , the number of cases is the largest in KRvsKP; Footnote 6 therefore, the performance could be measured most precisely for this data set.

Fig. 9 Confidence curves for the difference in performance between naive Bayes and the null model, random forest and the null model, and CART and the null model for data sets a Synthetic 1, b Synthetic 2, c Synthetic 3, and d Synthetic 4

For Synthetic 1 (Fig.  9 a), CART achieved a performance of 0.550, which is only marginally better than that of the null model with 0.508. The null value lies well within the 95%-CI, suggesting that the difference is not significant, but we remember that a confidence curve should not be used as a surrogate significance test. Instead, we should consider their shape and position. Note that the curve for CART is relatively wide. Actually, plausible values for the difference in performance range from \(-0.11\) to 0.19 (the bounds of the 95%-CI), which points to at least a moderate effect. Synthetic 1 was created by random sampling from a known distribution. Is it possible that this particular instantiation of the data set was just an “unlucky” sample for CART, and that CART could perform much better than the null model on other instantiations? As we will show later, this is indeed the case. For Synthetic 1 (Fig.  9 a), the curve for naive Bayes partly overlaps with that of random forest, but Bob considers the point estimates sufficiently far apart and therefore concludes that naive Bayes is preferable. Footnote 7

The data set Synthetic 2 (Fig.  9 b) is random, with a class ratio of 30:70. The expected accuracy of the empirical classifier (null model) is \(0.3 \times 0.3 + 0.7 \times 0.7 = 0.58\) . In the experiment, the empirical classifier achieved 0.59. The best a classifier can do is predict like the majority voter, with an expected accuracy of 0.70. Random forest comes closest to this performance with an accuracy of 0.682. The null value lies within the 95%-CI of all models; thus, by conventional reasoning, none of the models would be considered significantly better than the null model. But the width and the position of the curves for naive Bayes and random forest suggest that these models are, in fact, preferable to the null model. By contrast, the difference between CART and the null model is negligible.

On Synthetic 3 (Fig.  9 c), naive Bayes performed slightly better than random forest, but given the large overlap of the curves, the difference can be neglected. Although CART does not perform significantly better than the null model, the position and width of its confidence curve suggest that CART is preferable.

On Synthetic 4 (Fig.  9 d), naive Bayes performs again better than random forest. Note that the curves are wider than those in Fig.  9 c. The reason is that Synthetic 4 contains ten additional, irrelevant features, which contribute to the larger variance in accuracy. None of the models performs better than the null model at the conventional significance level. However, plausible null hypotheses include relatively large values (up to 0.29 for naive Bayes), suggesting a reasonably large effect. Emphasizing the lack of significance would not do justice to the actually good performance.

To further illustrate pair-wise performance differences, let us consider the confidence curves for the difference between naive Bayes and random forest on Synthetic 1 and Synthetic 4. In Fig.  10 a, the null value \(\delta =0\) is just outside the 95%-CI, suggesting that naive Bayes performs significantly better than random forest on Synthetic 1. In contrast, the null value is well within the 95%-CI in Fig.  10 b. The point estimate of the difference is 0.05. Plausible values for the difference range from \(-0.06\) to 0.16 (the bounds of the 95%-CI). The shape and position of the curve points to a moderately large effect. Both curves in Fig.  10 suggest that naive Bayes is preferable to random forest for these two data sets; significance, or lack thereof, is irrelevant for this conclusion.

Fig. 10 Confidence curves for the difference in performance between naive Bayes and random forest for a Synthetic 1 and b Synthetic 4

Next, we verified whether Bob’s conclusions based on the confidence curves can be justified empirically. As we know the generative functions of the synthetic data sets, we created 1000 new instantiations for each data set and applied the classifiers again. Then, we created box-and-whiskers plots for the resulting cross-validated accuracies (Fig.  11 ).

Fig. 11 Distribution of average accuracy in 10 times repeated tenfold stratified cross-validation for the synthetic data sets. For each data set, 1000 instantiations were created by random sampling from its underlying distribution. Each instantiation was then classified by the four models. Shown are the box-and-whiskers plots of the resulting accuracies of each model. a Synthetic 1: class ratio 50:50, 10 features, features of class 0 from \(\mathcal {N}(0,1)\) , features of class 1 from \(\mathcal {N}(0.5,1)\) ; b Synthetic 2: class ratio 70:30, 10 features, all from \(\mathcal {N}(0,1)\) ; c Synthetic 3: class ratio 80:20, 10 features, features of class 0 from \(\mathcal {N}(0,1)\) , features of class 1 from \(\mathcal {N}(0.5,1)\) ; d Synthetic 4: class ratio 50:50, 20 features, features 1..10 of class 0 from \(\mathcal {N}(0,1)\) , features 1..10 of class 1 from \(\mathcal {N}(0.5,1)\) , features 11..20 from \(\mathcal {U}(-1,1)\) irrespective of class

Figure  11 a reveals that naive Bayes is indeed preferable to random forest for Synthetic 1. In particular, the difference between CART and the null model (EC) is striking: CART does indeed perform better than the null model. Thus, Bob’s conclusions based on the confidence curve (Fig.  9 a) are confirmed.

For data set Synthetic 2, Bob concluded that the difference between CART and the null model was negligible. In contrast, he concluded that both random forest and naive Bayes are suitable for this data set despite their lack of significance. Bob concluded that random forest is slightly better than naive Bayes. Figure  11 b confirms Bob’s interpretation: random forest performs indeed slightly better than naive Bayes; however, the interquartile range for random forest lies almost entirely in that for naive Bayes, suggesting that the difference is only marginal. Based on a visual inspection of the box-and-whiskers plots, CART cannot be considered superior to the null model.

Based on the confidence curves in Fig.  9 c, Bob concluded that the difference between naive Bayes and random forest is negligible; the box-and-whiskers plots in Fig.  11 c support this conclusion. Interestingly, CART performs better than the null model, which confirms Bob’s interpretation based on Fig.  9 c. If Bob had based his interpretation on the test, then he would have concluded that CART does not perform significantly better than the null model.

The confidence curves for random forest and CART in Fig.  9 d suggest no substantial difference between these two models. However, Fig.  11 d suggests that random forest is preferable to CART because the interquartile ranges do not overlap. Here, Bob’s interpretation of the curves conflicts with the interpretation of the box-and-whiskers plots. Figure  11 d shows that naive Bayes performs slightly better than random forest on Synthetic 4. Both models perform better than CART, and each model performs substantially better than the null model. This confirms Bob’s interpretation of Fig.  9 d. If he had based his verdict on significance only, then he would have considered no model superior to the null model, as the null value lies in all 95%-confidence intervals. Footnote 8

7.2.2 Interpreting the area under the confidence curve

If a study includes many classifiers and data sets, then it may not be feasible to plot all confidence curves because of space limitations. In that case, it can be preferable to tabulate the point estimates of the performance differences with their AUCC (Table  3 ).

In Table  3 , the difference between the null model and naive Bayes is 0.15 for Sonar. We can interpret this value as follows: by using naive Bayes instead of the null model, we gain an additional 15% in accuracy. The area under the confidence curve (AUCC) can be interpreted in terms of the precision of this estimate. In other words, the AUCC is a scalar that represents a numerical estimate of the precision. The closer the value is to 0, the narrower is the confidence curve, and the more precise is the estimate; by contrast, the larger the AUCC, the wider the curve, and the less precise is the estimate.

Let us consider the performance of naive Bayes on Spect and Heart. By using naive Bayes instead of the null model, we gain 12% in accuracy on Spect but nearly three times as much (34%) on Heart. Both estimates have comparable precisions ( \(\hbox {AUCC} = 0.053\) and \(\hbox {AUCC} = 0.054\) , respectively). We can therefore conclude that naive Bayes is much more suitable to data sets that are similar to the Heart data set. Among all investigated data sets, naive Bayes is most suitable to the data set KRvsKP. The gain in accuracy is 38%, which is also the most precise measurement with \(\hbox {AUCC} = 0.017\) .

Similarly, point estimates and AUCC can be compared across classifiers. For example, consider the performance on Tic-tac-toe. Compared to the null model, naive Bayes yields a gain in accuracy of 15% ( \(\hbox {AUCC} = 0.033\) ), whereas the gain is 36% for CART ( \(\hbox {AUCC} = 0.033\) ). As the precision is the same, we can be confident that the gain achieved by CART is more than twice that of naive Bayes on this data set. However, random forest is even better, with a gain in accuracy of 44% ( \(\hbox {AUCC} = 0.027\) ), which is almost three times the gain achieved by naive Bayes on the data set Tic-tac-toe. All other comparisons can be made analogously.

8 Discussion

We investigated several problems of null hypothesis significance testing for the comparison of classifiers. We discussed that the Fisherian and Neyman–Pearsonian schools of thought are widely believed to originate from one single, coherent theory of statistical inference, although their philosophies are fundamentally different (Goodman 1993 ; Hubbard and Bayarri 2003 ; Hubbard and Armstrong 2006 ). We paid particular attention to the problems of the p value, which have received scant attention in the machine learning literature so far. First, perhaps the most persistent misconception is that the p value is a completely objective measure of statistical inference. Second, the p value is not predictive of reproducibility. Third, the p value confounds the effect size (e.g., the magnitude of the performance difference) and the precision with which that effect has been measured. And finally, the p value invites a dichotomization into significant and non-significant findings and thereby emphasizes decision making over estimation. We argue that estimation, not decision making, should be the proper role for statistical inference in machine learning.

There is certainly no shortage of papers criticizing NHST, notably in psychology and sociology (Rozeboom 1960 ; Carver 1978 ; Cohen 1994 ; Schmidt 1996 ). On the other hand, it is extraordinarily difficult to find explicit defenses of this practice. Several defenses are particularly noteworthy, though. Levin ( 1998 ) argues in favor of significance testing because it can be done intelligently and because alternatives, such as confidence intervals, also have their caveats and pitfalls. In a similar line of defense, Abelson ( 1997 ) argues that significance testing can be justified as long as we are only interested in the direction of the finding and not in the magnitude of the effect. Krueger ( 2001 ) defends the pragmatic value of NHST while clearly acknowledging its logical deficiencies. Note, however, that these are defenses of (Fisherian) significance testing, not arguments in favor of the Neyman–Pearsonian hypothesis testing or, worse, the hybrid approach. We acknowledge that there is a time and place for significance testing, for example, in feature selection tasks. For the comparison of classifiers, however, our stance is that no such tests are needed.

For several decades, significance testing has fueled heated debates, yet despite these criticisms, it seems to be a widely held view that statistical tests are essential for the interpretation of empirical results in machine learning. Why is that so? We speculate that there are two main reasons. First, many researchers may feel a need for a clear-cut decision (Schmidt and Hunter 1997 ), which the dichotomous verdict of a test can provide (“significant” versus “non-significant”). Second, well-meaning researchers might believe that scientific integrity necessitates an objective, rigorous procedure, which, supposedly, can be provided by NHST. Statistical tests are indeed often believed to give “certain reassurance about the validity and non-randomness of the published results” (Demšar 2006 , p. 27). Apparently, the p value—provided that it is sufficiently small—can give such a reassurance. In this study, we challenge these views. We agree with Berger and Berry ( 1988 ) who caution against the illusionary objectivity of statistical testing.

We then discussed the problem of multiple comparisons in classification studies. It may be widely assumed that multiple comparisons generally require that p values be adjusted; however, there is no consensus among statisticians that this is indeed so (Poole 1991 ; Rothman 1990 ). Researchers who argue in favor of adjustments are faced with the difficult choice from a wealth of procedures, ranging from the very conservative Bonferroni correction to more advanced methods; see, for example, (García and Herrera 2008 ). We provided several arguments against omnibus tests in comparative classification studies; in particular, we questioned the appropriateness of the Friedman test for this task.

As an alternative to NHST, we propose confidence curves . A key problem of classic statistical testing is the focus on one single null hypothesis, i.e., the null hypothesis of no difference, and its p value. In contrast, confidence curves do not put undue emphasis on that hypothesis as the only one of interest. Instead, confidence curves enable us to easily check how compatible other hypotheses are with our experimental results. In fact, we can mentally check the compatibility of an infinite number of null hypotheses. Compatibility should not be regarded as a dichotomous but as a gradual characteristic; some hypotheses are more, others are less compatible. Whether a difference is significant or not can be easily verified by checking where the null line intersects the confidence curve, but doing so would defeat all the advantages of confidence curves over significance tests. Significance (or lack thereof) should not be our major concern. A confidence curve brings to the forefront what matters: the effect size and its precision. We think that this is in the spirit of F. Yates when he wrote that

[...] scientific research workers [...] pay undue attention to the results of the tests of significance [...] and too little to the estimates of the magnitude of the effects they are investigating. (Yates 1951 , p. 32)

Our study has several limitations, though. First, we considered only one performance metric, accuracy, and only one data resampling strategy (repeated cross-validation). Our future work will focus on other performance measures and data resampling strategies.

Second, the use and comparison of confidence curves might be criticized as lacking objectivity. However, a confidence curve is merely a tool that summarizes the results, putting emphasis on the effect size and its reasonable bounds, and it is incumbent on the investigator to construct an argument for or against a classifier based on that curve. The investigator’s informed judgment is a crucial element in the interpretation of data (Cohen 1994 ). Clearly, this judgment may not always be convincing to everyone. For example, a researcher might consider the point estimate of the difference in performance sufficiently large and the curve sufficiently narrow to conclude that one model is superior to another one for a concrete classification task. Not all readers may agree with this particular judgment, and that is fine. They can judge by themselves whether the investigator’s reasoning is plausible or not and then draw their own conclusions. By contrast, how would a researcher make her case with a NHST? Her argument would stand or fall with a single p value, requiring no further intellectual engagement, which Cox and Hinkley ( 1974 ) criticized as follows:

It is essentially consideration of intellectual economy that makes a pure significance test of interest. (Cox and Hinkley 1974 , p. 81)

When Bob refrains from making a decision regarding performance differences, Alice might object, demanding a final, clear-cut verdict. But are categorical statements always possible—or even desirable—in the scientific analysis of data? Poole ( 1987 ) reminded us that science and decision making are two different enterprises: science focuses on learning, explaining, and understanding, whereas decision making focuses on reasons to act or refrain from acting. In a similar vein, Rothman et al. ( 2008 ) consider estimation, and not decision making, as the proper role for statistical inference in science. While there may not be a consensus among statisticians about the proper role of statistical inference in data analysis, we posit that machine learning would benefit from a stronger focus on estimation and not decision making. This does not imply, of course, that test statistics have no role to play in machine learning; for example, for feature selection from high-dimensional data sets, appropriate test statistics can be extremely useful.

Finally, like confidence intervals, confidence curves need to be interpreted within the frequentist framework. There exist alternative methods, such as likelihood ratios (Cox 1958 ; Goodman and Royall 1988 ), which directly measure the weight of evidence for and against hypotheses. But like confidence curves, these alternative methods are no silver bullet. We doubt that there is one. Evaluation tools should assist the investigator in interpreting the experimental results and support, not supplant, informed judgment. We believe that confidence curves, despite their limitations, serve this purpose well.

9 Conclusions

Null hypothesis significance testing has become deeply entrenched in machine learning for the comparison of classifiers. A significance test is widely considered necessary to underpin the sound and objective interpretation of empirical results. However, this practice provides only a veneer of rigor. A thorough interpretation of classification results does not need to rely on significance tests. We conclude that null hypothesis significance testing should be replaced by methods that support informed judgment and put greater emphasis on effect size estimation, and not on decision making. The use of the proposed confidence curves could be a step in this direction. We hope that our study will encourage a widespread debate on the role of statistical tests in machine learning and spur further research on alternative evaluation methods and visualization tools that give more room to informed judgment.

Notes

http://www.sciencedirect.com.

The standard t test is not applicable because the individual training sets overlap, so the independence assumption is violated, which leads to an underestimation of the true variance. This is what the term \(\frac{n_2}{n_1}\) corrects.
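For completeness, the corrected variance typically used in this setting (following Nadeau and Bengio 2003; the notation below is ours) combines the sample variance \(s^2\) of the \(n = rk\) observed differences with the ratio of test-set size \(n_2\) to training-set size \(n_1\):

\[
\hat{\sigma}^2_{\mathrm{corr}} \;=\; \left(\frac{1}{n} + \frac{n_2}{n_1}\right) s^2, \qquad t \;=\; \frac{\bar{d}}{\sqrt{\hat{\sigma}^2_{\mathrm{corr}}}},
\]

where \(\bar{d}\) is the mean of the observed differences.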

The two exceptions are Scheffe’s test and Fisher’s Least Significant Difference (LSD) test.

“Virtually any study can be made to show significant results if one uses enough subjects regardless of how nonsensical the content may be.” (Hays 1963 , p. 326).

http://www.openml.org/ .

For Tic-tac-toe, \(N = 958\) ; for German, \(N = 1000\) ; for Liver, \(N = 345\) .

As we will show in Fig. 10a, this difference is also significant.

And if Bob had corrected for multiple testing, then the results would have been even “less significant”.

References

Abelson, R. (1997). A retrospective on the significance test ban of 1999 (if there were no significance tests, they would need to be invented). In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 117–141). Mahwah, NJ: Psychology Press.

Baguley, T. (2012). Serious stats: A guide to advanced statistics for the behavioral sciences . New York: Palgrave Macmillan.

Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin , 66 (6), 423–437.

Bayarri, M., & Berger, J. (2000). P values for composite null models. Journal of the American Statistical Association , 95 (452), 1127–1142.

Benavoli, A., Corani, G., Mangili, F., & Zaffalon, M. (2015). A Bayesian nonparametric procedure for comparing algorithms. In Proceedings of the 32nd international conference on machine learning, JMLR.org, JMLR Proceedings (Vol. 37, pp. 1264–1272).

Berger, J., & Berry, D. (1988). Statistical analysis and the illusion of objectivity. American Scientist , 76 (2), 159–165.

Berger, J., & Delampaday, M. (1987). Testing precise hypotheses. Statistical Science , 2 (3), 317–352.

Berger, J., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of \(p\) values and evidence. Journal of the American Statistical Association , 82 , 112–122.

Berrar, D., & Lozano, J. (2013). Significance tests or confidence intervals: Which are preferable for the comparison of classifiers? Journal of Experimental and Theoretical Artificial Intelligence , 25 (2), 189–206.

Berry, D. (2006). Bayesian clinical trials. Nature Reviews Drug Discovery , 5 , 27–36.

Birnbaum, A. (1961). A unified theory of estimation. I. Annals of Mathematical Statistics , 32 , 112–135.

Bouckaert, R., & Frank, E. (2004). Evaluating the replicability of significance tests for comparing learning algorithms. In Proceedings of the 8th Asia-Pacific conference on advances in knowledge discovery and data mining , Springer Lecture Notes in Computer Science (Vol. 3056, pp. 3–12).

Breiman, L. (2001). Random forests. Machine Learning , 45 (1), 5–32.

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees . New York: Chapman and Hall.

Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review , 48 (3), 378–399.

Cohen, J. (1990). Things I have learned (so far). American Psychologist , 45 (12), 1304–1312.

Cohen, J. (1994). The earth is round (p \(<\) .05). American Psychologist , 49 (12), 997–1003.

Corani, G., Benavoli, A., Mangili, F., & Zaffalon, M. (2015). Bayesian hypothesis testing in machine learning. In Proceedings of 2015 ECML-PKDD, Part III, Springer Lecture Notes in Artificial Intelligence (pp. 199–202).

Cox, D. (1958). Some problems connected with statistical inference. Annals of Mathematical Statistics , 29 (2), 357–372.

Cox, D. (1977). The role of significance tests. Scandinavian Journal of Statistics , 4 (2), 49–70.

Cox, D., & Hinkley, D. (1974). Theoretical statistics . New York: Chapman and Hall/CR.

Cummings, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis . New York, London: Routledge, Taylor & Francis Group.

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research , 7 , 1–30.

Demšar, J. (2008). On the appropriateness of statistical tests in machine learning. In Proceedings of the 3rd workshop on evaluation methods for machine learning, in conjunction with the 25th international conference on machine learning (pp. 1–4).

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation , 10 , 1895–1923.

Drummond, C. (2006). Machine learning as an experimental science, revisited. In Proceedings of the 21st national conference on artificial intelligence: Workshop on evaluation methods for machine learning , Technical Report WS-06-06 (pp. 1–5). AAAI Press.

Drummond, C. (2009). Replicability is not reproducibility: Nor is it good science. In Proceedings of evaluation methods for machine learning workshop at the 26th international conference on machine learning, Montreal (pp. 1–6).

Drummond, C., & Japkowicz, N. (2010). Warning: Statistical benchmarking is addictive. Kicking the habit in machine learning. Journal of Experimental and Theoretical Artificial Intelligence , 2 , 67–80.

Fisher, R. (1943). Note on Dr. Berkson’s criticism of tests of significance. Journal of the American Statistical Association , 38 , 103–104.

Fisher, R. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, Series B , 17 (1), 69–78.

Folks, J. (1981). Ideas of statistics . New York: Wiley.

Fraley, R., & Marks, M. (2007). The null hypothesis significance testing debate and its implications for personality research. In R. Robins, R. Fraley, & R. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 149–169). New York: Guilford.

Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association , 32 (200), 675–701.

Friedman, M. (1940). A comparison of alternative tests of significance for the problem of \(m\) rankings. Annals of Mathematical Statistics , 11 (1), 86–92.

García, S., & Herrera, F. (2008). An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. Journal of Machine Learning Research , 9 , 2677–2694.

Gigerenzer, G. (1998). We need statistical thinking, not statistical rituals. Behavioral and Brain Sciences , 21 , 199–200.

Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual–What you always wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), The sage handbook of quantitative methodology for the social sciences (pp. 391–408). Thousand Oaks, CA: Sage.

Goodman, S. (1993). \(p\) values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology , 137 (5), 485–496.

Goodman, S. (1999). Toward evidence-based medical statistics. 1: The \(p\) value fallacy. Annals of Internal Medicine , 130 (12), 995–1004.

Goodman, S. (2008). A dirty dozen: Twelve \(p\) -value misconceptions. Seminars in Hematology , 45 (3), 135–140.

Goodman, S., & Royall, R. (1988). Evidence and scientific research. American Journal of Public Health , 78 (12), 1568–1574.

Greenwald, A., Gonzalez, R., Harris, R., & Guthrie, D. (1996). Effect sizes and \(p\) values: What should be reported and what should be replicated? Psychophysiology , 33 (2), 175–183.

Guyon, I., Lemaire, V., Boullé, M., Dror, G., & Vogel, D. (2009). Analysis of the KDD Cup 2009: Fast scoring on a large Orange customer database. In JMLR: Workshop and conference proceedings (Vol. 7, pp. 1–22).

Harlow, L., Mulaik, S., & Steiger, J. (1997). What if there were no significance tests? Multivariate applications book series . Mahwah, NJ: Lawrence Erlbaum Associates Publishers.

Hays, W. (1963). Statistics . New York: Holt, Rinehart and Winston.

Hsu, J. (1996). Multiple comparisons: Theory and methods . Boca Raton, FL: CRC Press.

Hubbard, R. (2004). Alphabet soup—blurring the distinctions between \(p\) ’s and \(\alpha \) ’s in psychological research. Theory and Psychology , 14 (3), 295–327.

Hubbard, R., & Armstrong, J. (2006). Why we don’t really know what “statistical significance” means: A major educational failure. Journal of Marketing Education , 28 (2), 114–120.

Hubbard, R., & Bayarri, M. (2003). P values are not error probabilities. Technical report, University of Valencia. Accessed 22 Sept. 2016. http://www.uv.es/sestio/TechRep/tr14-03

Hubbard, R., & Lindsay, R. (2008). Why \(p\) values are not a useful measure of evidence in statistical significance testing. Theory Psychology , 18 (1), 69–88.

Iman, R., & Davenport, J. (1980). Approximations of the critical region of the Friedman statistic. Communications in Statistics—Theory and Methods , 9 (6), 571–595.

Killeen, P. (2004). An alternative to null hypothesis significance tests. Psychological Science , 16 (5), 345–353.

Krueger, J. (2001). Null hypothesis significance testing—On the survival of a flawed method. American Psychologist , 56 (1), 16–26.

Levin, J. (1998). What if there were no more bickering about statistical significance tests? Research in the Schools , 5 (2), 43–53.

Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22. http://CRAN.R-project.org/doc/Rnews/

Lichman, M. (2013). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml

Lindley, D. (1957). A statistical paradox. Biometrika , 44 , 187–192.

Morgan, P. (2003). Null hypothesis significance testing: Philosophical and practical considerations of a statistical controversy. Exceptionality , 11 (4), 209–221.

Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning , 52 , 239–281.

Nemenyi, P. (1963). Distribution-free multiple comparisons . Ph.D. thesis, Princeton University, Princeton.

Neyman, J., & Pearson, E. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London Series A , 231 , 289–337.

Nuzzo, R. (2014). Statistical errors. Nature , 506 , 150–152.

Perneger, T. (1998). What’s wrong with Bonferroni adjustments. British Medical Journal , 316 , 1236–1238.

Poole, C. (1987). Beyond the confidence interval. American Journal of Public Health , 2 (77), 195–199.

Poole, C. (1991). Multiple comparisons? No problem!. Epidemiology , 4 (2), 241–243.

Poole, C. (2001). Low p-values or narrow confidence intervals: Which are more durable? Epidemiology , 12 (3), 291–294.

R Development Core Team. (2009). R: A language and environment for statistical computing . R Foundation for Statistical Computing, Vienna. http://www.R-project.org , ISBN 3-900051-07-0

Rothman, K. (1990). No adjustments are needed for multiple comparisons. Epidemiology , 1 (1), 43–46.

Rothman, K. (1998). Writing for Epidemiology. Epidemiology , 9 (3), 333–337.

Rothman, K., Greenland, S., & Lash, T. (2008). Modern epidemiology (3rd ed.). Philadelphia: Wolters Kluwer.

Rozeboom, W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin , 57 , 416–428.

Rozeboom, W. (1997). Good science is abductive, not hypothetico-deductive. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 132–149). Mahwah, NJ: Psychology Press.

Savalei, V., & Dunn, E. (2015). Is the call to abandon \(p\) -values the red herring of the replicability crisis? Frontiers in Psychology , 245 (6), 1–4.

Savitz, D., & Olshan, A. (1998). Describing data requires no adjustment for multiple comparisons: A reply from Savitz and Olshan. American Journal of Epidemiology , 147 (9), 813–814.

Schervish, M. (1996). \(P\) values: What they are and what they are not. The American Statistician , 50 (3), 203–206.

Schmidt, F. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods , 1 (2), 115–129.

Schmidt, F., & Hunter, J. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 37–64). Mahwah, NJ: Psychology Press.

Sellke, T., Bayarri, M., & Berger, J. (2001). Calibration of \(p\) values for testing precise null hypotheses. The American Statistician , 55 (1), 62–71.

Sheskin, D. (2007). Handbook of parametric and nonparametric statistical procedures (4th ed.). London/New York: Chapman and Hall.

Stang, A., Poole, C., & Kuss, O. (2010). The ongoing tyranny of statistical significance testing in biomedical research. European Journal of Epidemiology , 25 , 225–230.

Sullivan, K., & Foster, D. (1990). Use of the confidence interval function. Epidemiology , 1 (1), 39–42.

Therneau, T., Atkinson, B., & Ripley, B. (2014). rpart: Recursive partitioning and regression trees. R package version 4.1-5. http://CRAN.R-project.org/package=rpart

Thompson, B. (1999). If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory & Psychology , 9 (2), 165–181.

Tukey, J. (1991). The philosophy of multiple comparisons. Statistical Science , 6 (1), 100–116.

Yates, F. (1951). The influence of statistical methods for research workers on the development of the science of statistics. Journal of the American Statistical Association , 46 (253), 19–34.

Zimmerman, D., & Zumbo, B. (1993). Relative power of the Wilcoxon test, the Friedman test, and repeated-measures ANOVA on ranks. The Journal of Experimental Education , 62 (1), 75–86.

Acknowledgements

We thank the three anonymous reviewers very much for their detailed and constructive comments that have greatly helped improve this manuscript.

Author information

Authors and Affiliations

College of Engineering, Shibaura Institute of Technology, Saitama, 337-8570, Japan

Daniel Berrar

Corresponding author

Correspondence to Daniel Berrar .

Additional information

Editors: Nathalie Japkowicz and Stan Matwin.

To derive \(\mathrm{AUCC_{rCV}}\) for r-times repeated k-fold cross-validation, we proceed as follows. We assume that the cumulative distribution function F can be approximated by \(\Phi\). The point estimate d indicates only the location on the x-axis and is irrelevant for the area. Consider Fig. 12. The AUCC for an arbitrary point estimate (\(d = 0.15\) in Fig. 12a) is twice the red area in Fig. 12b. By exploiting the symmetry and choosing \(d = 0\), integration by parts splits the doubled integral into two terms (see the sketch below).

Fig. 12: a Confidence curve for \(d = 0.15\) and \(\sigma^2 = 0.01\). b The left part of the same confidence curve, centered at 0

The second term goes to \(\frac{4}{\sqrt{2\pi}}\sigma\) as u goes to infinity, and applying L’Hôpital’s rule shows that the first term goes to zero as u goes to infinity.

Using the correction term \(\frac{n_2}{n_1}\) for the variance (Nadeau and Bengio 2003; Bouckaert and Frank 2004), we obtain the area under the confidence curve for r-times repeated k-fold cross-validation as sketched below.
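A sketch of the calculation under this normal approximation (the precise form given here is our own reconstruction): writing \(\varphi\) for the standard normal density and \(\sigma\) for the standard error of the difference,

\[
\mathrm{AUCC} \;=\; \lim_{u \to \infty} 4\int_{0}^{u} \left(1 - \Phi\!\left(\frac{x}{\sigma}\right)\right)\mathrm{d}x
\;=\; \lim_{u \to \infty}\left[\, 4u\left(1 - \Phi\!\left(\frac{u}{\sigma}\right)\right) + 4\sigma\left(\varphi(0) - \varphi\!\left(\frac{u}{\sigma}\right)\right)\right]
\;=\; \frac{4}{\sqrt{2\pi}}\,\sigma .
\]

Substituting the corrected variance for r-times repeated k-fold cross-validation, with \(n = rk\) differences of sample variance \(s^2\), training-set size \(n_1\), and test-set size \(n_2\), gives

\[
\mathrm{AUCC_{rCV}} \;=\; \frac{4}{\sqrt{2\pi}}\,\sqrt{\left(\frac{1}{rk} + \frac{n_2}{n_1}\right) s^2}\, .
\]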

About this article

Berrar, D. Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers. Mach Learn 106, 911–949 (2017). https://doi.org/10.1007/s10994-016-5612-6

Received : 09 March 2016

Accepted : 18 November 2016

Published : 30 December 2016

Issue Date : June 2017

DOI : https://doi.org/10.1007/s10994-016-5612-6

Keywords

  • Confidence curve
  • Significance test
  • Multiple comparisons
  • Performance evaluation