Hypothesis Testing with Pearson's r (Jump to: Lecture | Video )

Just like with other tests such as the z-test or ANOVA, we can conduct hypothesis testing using Pearson�s r.

To test if age and income are related, researchers collected the ages and yearly incomes of 10 individuals, shown below. Using alpha = 0.05, are they related?

1. Define Null and Alternative Hypotheses

2. State Alpha

alpha = 0.05

3. Calculate Degrees of Freedom

Where n is the number of subjects you have:

df = n - 2 = 10 � 2 = 8

4. State Decision Rule

Using our alpha level and degrees of freedom, we look up a critical value in the r-Table . We find a critical r of 0.632.

If r is greater than 0.632, reject the null hypothesis.

5. Calculate Test Statistic

We calculate r using the same method as we did in the previous lecture:

6. State Results

Reject the null hypothesis.

7. State Conclusion

There is a relationship between age and yearly income, r(8) = 0.99, p < 0.05

Back to Top

Hypothesis Tests in R

This tutorial covers basic hypothesis testing in R.

  • Normality tests
  • Shapiro-Wilk normality test
  • Kolmogorov-Smirnov test
  • Comparing central tendencies: Tests with continuous / discrete data
  • One-sample t-test : Normally-distributed sample vs. expected mean
  • Two-sample t-test : Two normally-distributed samples
  • Wilcoxen rank sum : Two non-normally-distributed samples
  • Weighted two-sample t-test : Two continuous samples with weights
  • Comparing proportions: Tests with categorical data
  • Chi-squared goodness of fit test : Sampled frequencies of categorical values vs. expected frequencies
  • Chi-squared independence test : Two sampled frequencies of categorical values
  • Weighted chi-squared independence test : Two weighted sampled frequencies of categorical values
  • Comparing multiple groups: Tests with categorical and continuous / discrete data
  • Analysis of Variation (ANOVA) : Normally-distributed samples in groups defined by categorical variable(s)
  • Kruskal-Wallace One-Way Analysis of Variance : Nonparametric test of the significance of differences between two or more groups

Hypothesis Testing

Science is "knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through scientific method" (Merriam-Webster 2022) .

The idealized world of the scientific method is question-driven , with the collection and analysis of data determined by the formulation of research questions and the testing of hypotheses. Hypotheses are tentative assumptions about what the answers to your research questions may be.

  • Formulate questions: How can I understand some phenomenon?
  • Literature review: What does existing research say about my questions?
  • Formulate hypotheses: What do I think the answers to my questions will be?
  • Collect data: What data can I gather to test my hypothesis?
  • Test hypotheses: Does the data support my hypothesis?
  • Communicate results: Who else needs to know about this?
  • Formulate questions: Frame missing knowledge about a phenomenon as research question(s).
  • Literature review: A literature review is an investigation of what existing research says about the phenomenon you are studying. A thorough literature review is essential to identify gaps in existing knowledge you can fill, and to avoid unnecessarily duplicating existing research.
  • Formulate hypotheses: Develop possible answers to your research questions.
  • Collect data: Acquire data that supports or refutes the hypothesis.
  • Test hypotheses: Run tools to determine if the data corroborates the hypothesis.
  • Communicate results: Share your findings with the broader community that might find them useful.

While the process of knowledge production is, in practice, often more iterative than this waterfall model, the testing of hypotheses is usually a fundamental element of scientific endeavors involving quantitative data.

hypothesis testing pearson r

The Problem of Induction

The scientific method looks to the past or present to build a model that can be used to infer what will happen in the future. General knowledge asserts that given a particular set of conditions, a particular outcome will or is likely to occur.

The problem of induction is that we cannot be 100% certain that what we are assuming is a general principle is not, in fact, specific to the particular set of conditions when we made our empirical observations. We cannot prove that that such principles will hold true under future conditions or different locations that we have not yet experienced (Vickers 2014) .

The problem of induction is often associated with the 18th-century British philosopher David Hume . This problem is especially vexing in the study of human beings, where behaviors are a function of complex social interactions that vary over both space and time.

hypothesis testing pearson r

Falsification

One way of addressing the problem of induction was proposed by the 20th-century Viennese philosopher Karl Popper .

Rather than try to prove a hypothesis is true, which we cannot do because we cannot know all possible situations that will arise in the future, we should instead concentrate on falsification , where we try to find situations where a hypothesis is false. While you cannot prove your hypothesis will always be true, you only need to find one situation where the hypothesis is false to demonstrate that the hypothesis can be false (Popper 1962) .

If a hypothesis is not demonstrated to be false by a particular test, we have corroborated that hypothesis. While corroboration does not "prove" anything with 100% certainty, by subjecting a hypothesis to multiple tests that fail to demonstrate that it is false, we can have increasing confidence that our hypothesis reflects reality.

hypothesis testing pearson r

Null and Alternative Hypotheses

In scientific inquiry, we are often concerned with whether a factor we are considering (such as taking a specific drug) results in a specific effect (such as reduced recovery time).

To evaluate whether a factor results in an effect, we will perform an experiment and / or gather data. For example, in a clinical drug trial, half of the test subjects will be given the drug, and half will be given a placebo (something that appears to be the drug but is actually a neutral substance).

hypothesis testing pearson r

Because the data we gather will usually only be a portion (sample) of total possible people or places that could be affected (population), there is a possibility that the sample is unrepresentative of the population. We use a statistical test that considers that uncertainty when assessing whether an effect is associated with a factor.

  • Statistical testing begins with an alternative hypothesis (H 1 ) that states that the factor we are considering results in a particular effect. The alternative hypothesis is based on the research question and the type of statistical test being used.
  • Because of the problem of induction , we cannot prove our alternative hypothesis. However, under the concept of falsification , we can evaluate the data to see if there is a significant probability that our data falsifies our alternative hypothesis (Wilkinson 2012) .
  • The null hypothesis (H 0 ) states that the factor has no effect. The null hypothesis is the opposite of the alternative hypothesis. The null hypothesis is what we are testing when we perform a hypothesis test.

hypothesis testing pearson r

The output of a statistical test like the t-test is a p -value. A p -value is the probability that any effects we see in the sampled data are the result of random sampling error (chance).

  • If a p -value is greater than the significance level (0.05 for 5% significance) we fail to reject the null hypothesis since there is a significant possibility that our results falsify our alternative hypothesis.
  • If a p -value is lower than the significance level (0.05 for 5% significance) we reject the null hypothesis and have corroborated (provided evidence for) our alternative hypothesis.

The calculation and interpretation of the p -value goes back to the central limit theorem , which states that random sampling error has a normal distribution.

hypothesis testing pearson r

Using our example of a clinical drug trial, if the mean recovery times for the two groups are close enough together that there is a significant possibility ( p > 0.05) that the recovery times are the same (falsification), we fail to reject the null hypothesis.

hypothesis testing pearson r

However, if the mean recovery times for the two groups are far enough apart that the probability they are the same is under the level of significance ( p < 0.05), we reject the null hypothesis and have corroborated our alternative hypothesis.

hypothesis testing pearson r

Significance means that an effect is "probably caused by something other than mere chance" (Merriam-Webster 2022) .

  • The significance level (α) is the threshold for significance and, by convention, is usually 5%, 10%, or 1%, which corresponds to 95% confidence, 90% confidence, or 99% confidence, respectively.
  • A factor is considered statistically significant if the probability that the effect we see in the data is a result of random sampling error (the p -value) is below the chosen significance level.
  • A statistical test is used to evaluate whether a factor being considered is statistically significant (Gallo 2016) .

Type I vs. Type II Errors

Although we are making a binary choice between rejecting and failing to reject the null hypothesis, because we are using sampled data, there is always the possibility that the choice we have made is an error.

There are two types of errors that can occur in hypothesis testing.

  • Type I error (false positive) occurs when a low p -value causes us to reject the null hypothesis, but the factor does not actually result in the effect.
  • Type II error (false negative) occurs when a high p -value causes us to fail to reject the null hypothesis, but the factor does actually result in the effect.

The numbering of the errors reflects the predisposition of the scientific method to be fundamentally skeptical . Accepting a fact about the world as true when it is not true is considered worse than rejecting a fact about the world that actually is true.

hypothesis testing pearson r

Statistical Significance vs. Importance

When we fail to reject the null hypothesis, we have found information that is commonly called statistically significant . But there are multiple challenges with this terminology.

First, statistical significance is distinct from importance (NIST 2012) . For example, if sampled data reveals a statistically significant difference in cancer rates, that does not mean that the increased risk is important enough to justify expensive mitigation measures. All statistical results require critical interpretation within the context of the phenomenon being observed. People with different values and incentives can have different interpretations of whether statistically significant results are important.

Second, the use of 95% probability for defining confidence intervals is an arbitrary convention. This creates a good vs. bad binary that suggests a "finality and certitude that are rarely justified." Alternative approaches like Beyesian statistics that express results as probabilities can offer more nuanced ways of dealing with complexity and uncertainty (Clayton 2022) .

Science vs. Non-science

Not all ideas can be falsified, and Popper uses the distinction between falsifiable and non-falsifiable ideas to make a distinction between science and non-science. In order for an idea to be science it must be an idea that can be demonstrated to be false.

While Popper asserts there is still value in ideas that are not falsifiable, such ideas are not science in his conception of what science is. Such non-science ideas often involve questions of subjective values or unseen forces that are complex, amorphous, or difficult to objectively observe.

Example Data

As example data, this tutorial will use a table of anonymized individual responses from the CDC's Behavioral Risk Factor Surveillance System . The BRFSS is a "system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services" (CDC 2019) .

A CSV file with the selected variables used in this tutorial is available here and can be imported into R with read.csv() .

Guidance on how to download and process this data directly from the CDC website is available here...

Variable Types

The publicly-available BRFSS data contains a wide variety of discrete, ordinal, and categorical variables. Variables often contain special codes for non-responsiveness or missing (NA) values. Examples of how to clean these variables are given here...

The BRFSS has a codebook that gives the survey questions associated with each variable, and the way that responses are encoded in the variable values.

hypothesis testing pearson r

Normality Tests

Tests are commonly divided into two groups depending on whether they are built on the assumption that the continuous variable has a normal distribution.

  • Parametric tests presume a normal distribution.
  • Non-parametric tests can work with normal and non-normal distributions.

The distinction between parametric and non-parametric techniques is especially important when working with small numbers of samples (less than 40 or so) from a larger population.

The normality tests given below do not work with large numbers of values, but with many statistical techniques, violations of normality assumptions do not cause major problems when large sample sizes are used. (Ghasemi and Sahediasi 2012) .

The Shapiro-Wilk Normality Test

  • Data: A continuous or discrete sampled variable
  • R Function: shapiro.test()
  • Null hypothesis (H 0 ): The population distribution from which the sample is drawn is not normal
  • History: Samuel Sanford Shapiro and Martin Wilk (1965)

This is an example with random values from a normal distribution.

This is an example with random values from a uniform (non-normal) distribution.

The Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov is a more-generalized test than the Shapiro-Wilks test that can be used to test whether a sample is drawn from any type of distribution.

  • Data: A continuous or discrete sampled variable and a reference probability distribution
  • R Function: ks.test()
  • Null hypothesis (H 0 ): The population distribution from which the sample is drawn does not match the reference distribution
  • History: Andrey Kolmogorov (1933) and Nikolai Smirnov (1948)
  • pearson.test() The Pearson Chi-square Normality Test from the nortest library. Lower p-values (closer to 0) means to reject the reject the null hypothesis that the distribution IS normal.

Modality Tests of Samples

Comparing two central tendencies: tests with continuous / discrete data, one sample t-test (two-sided).

The one-sample t-test tests the significance of the difference between the mean of a sample and an expected mean.

  • Data: A continuous or discrete sampled variable and a single expected mean (μ)
  • Parametric (normal distributions)
  • R Function: t.test()
  • Null hypothesis (H 0 ): The means of the sampled distribution matches the expected mean.
  • History: William Sealy Gosset (1908)

t = ( Χ - μ) / (σ̂ / √ n )

  • t : The value of t used to find the p-value
  • Χ : The sample mean
  • μ: The population mean
  • σ̂: The estimate of the standard deviation of the population (usually the stdev of the sample
  • n : The sample size

T-tests should only be used when the population is at least 20 times larger than its respective sample. If the sample size is too large, the low p-value makes the insignificant look significant. .

For example, we test a hypothesis that the mean weight in IL in 2020 is different than the 2005 continental mean weight.

Walpole et al. (2012) estimated that the average adult weight in North America in 2005 was 178 pounds. We could presume that Illinois is a comparatively normal North American state that would follow the trend of both increased age and increased weight (CDC 2021) .

hypothesis testing pearson r

The low p-value leads us to reject the null hypothesis and corroborate our alternative hypothesis that mean weight changed between 2005 and 2020 in Illinois.

One Sample T-Test (One-Sided)

Because we were expecting an increase, we can modify our hypothesis that the mean weight in 2020 is higher than the continental weight in 2005. We can perform a one-sided t-test using the alternative="greater" parameter.

The low p-value leads us to again reject the null hypothesis and corroborate our alternative hypothesis that mean weight in 2020 is higher than the continental weight in 2005.

Note that this does not clearly evaluate whether weight increased specifically in Illinois, or, if it did, whether that was caused by an aging population or decreasingly healthy diets. Hypotheses based on such questions would require more detailed analysis of individual data.

Although we can see that the mean cancer incidence rate is higher for counties near nuclear plants, there is the possiblity that the difference in means happened by accident and the nuclear plants have nothing to do with those higher rates.

The t-test allows us to test a hypothesis. Note that a t-test does not "prove" or "disprove" anything. It only gives the probability that the differences we see between two areas happened by chance. It also does not evaluate whether there are other problems with the data, such as a third variable, or inaccurate cancer incidence rate estimates.

hypothesis testing pearson r

Note that this does not prove that nuclear power plants present a higher cancer risk to their neighbors. It simply says that the slightly higher risk is probably not due to chance alone. But there are a wide variety of other other related or unrelated social, environmental, or economic factors that could contribute to this difference.

Box-and-Whisker Chart

One visualization commonly used when comparing distributions (collections of numbers) is a box-and-whisker chart. The boxes show the range of values in the middle 25% to 50% to 75% of the distribution and the whiskers show the extreme high and low values.

hypothesis testing pearson r

Although Google Sheets does not provide the capability to create box-and-whisker charts, Google Sheets does have candlestick charts , which are similar to box-and-whisker charts, and which are normally used to display the range of stock price changes over a period of time.

This video shows how to create a candlestick chart comparing the distributions of cancer incidence rates. The QUARTILE() function gets the values that divide the distribution into four equally-sized parts. This shows that while the range of incidence rates in the non-nuclear counties are wider, the bulk of the rates are below the rates in nuclear counties, giving a visual demonstration of the numeric output of our t-test.

While categorical data can often be reduced to dichotomous data and used with proportions tests or t-tests, there are situations where you are sampling data that falls into more than two categories and you would like to make hypothesis tests about those categories. This tutorial describes a group of tests that can be used with that type of data.

Two-Sample T-Test

When comparing means of values from two different groups in your sample, a two-sample t-test is in order.

The two-sample t-test tests the significance of the difference between the means of two different samples.

  • Two normally-distributed, continuous or discrete sampled variables, OR
  • A normally-distributed continuous or sampled variable and a parallel dichotomous variable indicating what group each of the values in the first variable belong to
  • Null hypothesis (H 0 ): The means of the two sampled distributions are equal.

For example, given the low incomes and delicious foods prevalent in Mississippi, we might presume that average weight in Mississippi would be higher than in Illinois.

hypothesis testing pearson r

We test a hypothesis that the mean weight in IL in 2020 is less than the 2020 mean weight in Mississippi.

The low p-value leads us to reject the null hypothesis and corroborate our alternative hypothesis that mean weight in Illinois is less than in Mississippi.

While the difference in means is statistically significant, it is small (182 vs. 187), which should lead to caution in interpretation that you avoid using your analysis simply to reinforce unhelpful stigmatization.

Wilcoxen Rank Sum Test (Mann-Whitney U-Test)

The Wilcoxen rank sum test tests the significance of the difference between the means of two different samples. This is a non-parametric alternative to the t-test.

  • Data: Two continuous sampled variables
  • Non-parametric (normal or non-normal distributions)
  • R Function: wilcox.test()
  • Null hypothesis (H 0 ): For randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X.
  • History: Frank Wilcoxon (1945) and Henry Mann and Donald Whitney (1947)

The test is is implemented with the wilcox.test() function.

  • When the test is performed on one sample in comparison to an expected value around which the distribution is symmetrical (μ), the test is known as a Mann-Whitney U test .
  • When the test is performed to compare two samples, the test is known as a Wilcoxon rank sum test .

For this example, we will use AVEDRNK3: During the past 30 days, on the days when you drank, about how many drinks did you drink on the average?

  • 1 - 76: Number of drinks
  • 77: Don’t know/Not sure
  • 99: Refused
  • NA: Not asked or Missing

The histogram clearly shows this to be a non-normal distribution.

hypothesis testing pearson r

Continuing the comparison of Illinois and Mississippi from above, we might presume that with all that warm weather and excellent food in Mississippi, they might be inclined to drink more. The means of average number of drinks per month seem to suggest that Mississippians do drink more than Illinoians.

We can test use wilcox.test() to test a hypothesis that the average amount of drinking in Illinois is different than in Mississippi. Like the t-test, the alternative can be specified as two-sided or one-sided, and for this example we will test whether the sampled Illinois value is indeed less than the Mississippi value.

The low p-value leads us to reject the null hypothesis and corroborates our hypothesis that average drinking is lower in Illinois than in Mississippi. As before, this tells us nothing about why this is the case.

Weighted Two-Sample T-Test

The downloadable BRFSS data is raw, anonymized survey data that is biased by uneven geographic coverage of survey administration (noncoverage) and lack of responsiveness from some segments of the population (nonresponse). The X_LLCPWT field (landline, cellphone weighting) is a weighting factor added by the CDC that can be assigned to each response to compensate for these biases.

The wtd.t.test() function from the weights library has a weights parameter that can be used to include a weighting factor as part of the t-test.

Comparing Proportions: Tests with Categorical Data

Chi-squared goodness of fit.

  • Tests the significance of the difference between sampled frequencies of different values and expected frequencies of those values
  • Data: A categorical sampled variable and a table of expected frequencies for each of the categories
  • R Function: chisq.test()
  • Null hypothesis (H 0 ): The relative proportions of categories in one variable are different from the expected proportions
  • History: Karl Pearson (1900)
  • Example Question: Are the voting preferences of voters in my district significantly different from the current national polls?

For example, we test a hypothesis that smoking rates changed between 2000 and 2020.

In 2000, the estimated rate of adult smoking in Illinois was 22.3% (Illinois Department of Public Health 2004) .

The variable we will use is SMOKDAY2: Do you now smoke cigarettes every day, some days, or not at all?

  • 1: Current smoker - now smokes every day
  • 2: Current smoker - now smokes some days
  • 3: Not at all
  • 7: Don't know
  • NA: Not asked or missing - NA is used for people who have never smoked

We subset only yes/no responses in Illinois and convert into a dummy variable (yes = 1, no = 0).

The listing of the table as percentages indicates that smoking rates were halved between 2000 and 2020, but since this is sampled data, we need to run a chi-squared test to make sure the difference can't be explained by the randomness of sampling.

In this case, the very low p-value leads us to reject the null hypothesis and corroborates the alternative hypothesis that smoking rates changed between 2000 and 2020.

Chi-Squared Contingency Analysis / Test of Independence

  • Tests the significance of the difference between frequencies between two different groups
  • Data: Two categorical sampled variables
  • Null hypothesis (H 0 ): The relative proportions of one variable are independent of the second variable.

We can also compare categorical proportions between two sets of sampled categorical variables.

The chi-squared test can is used to determine if two categorical variables are independent. What is passed as the parameter is a contingency table created with the table() function that cross-classifies the number of rows that are in the categories specified by the two categorical variables.

The null hypothesis with this test is that the two categories are independent. The alternative hypothesis is that there is some dependency between the two categories.

For this example, we can compare the three categories of smokers (daily = 1, occasionally = 2, never = 3) across the two categories of states (Illinois and Mississippi).

hypothesis testing pearson r

The low p-value leads us to reject the null hypotheses that the categories are independent and corroborates our hypotheses that smoking behaviors in the two states are indeed different.

p-value = 1.516e-09

Weighted Chi-Squared Contingency Analysis

As with the weighted t-test above, the weights library contains the wtd.chi.sq() function for incorporating weighting into chi-squared contingency analysis.

As above, the even lower p-value leads us to again reject the null hypothesis that smoking behaviors are independent in the two states.

Suppose that the Macrander campaign would like to know how partisan this election is. If people are largely choosing to vote along party lines, the campaign will seek to get their base voters out to the polls. If people are splitting their ticket, the campaign may focus their efforts more broadly.

In the example below, the Macrander campaign took a small poll of 30 people asking who they wished to vote for AND what party they most strongly affiliate with.

The output of table() shows fairly strong relationship between party affiliation and candidates. Democrats tend to vote for Macrander, while Republicans tend to vote for Stewart, while independents all vote for Miller.

This is reflected in the very low p-value from the chi-squared test. This indicates that there is a very low probability that the two categories are independent. Therefore we reject the null hypothesis.

In contrast, suppose that the poll results had showed there were a number of people crossing party lines to vote for candidates outside their party. The simulated data below uses the runif() function to randomly choose 50 party names.

The contingency table() shows no clear relationship between party affiliation and candidate. This is validated quantitatively by the chi-squared test. The fairly high p-value of 0.4018 indicates a 40% chance that the two categories are independent. Therefore, we fail to reject the null hypothesis and the campaign should focus their efforts on the broader electorate.

The warning message given by the chisq.test() function indicates that the sample size is too small to make an accurate analysis. The simulate.p.value = T parameter adds Monte Carlo simulation to the test to improve the estimation and get rid of the warning message. However, the best way to get rid of this message is to get a larger sample.

Comparing Categorical and Continuous Variables

Analysis of variation (anova).

Analysis of Variance (ANOVA) is a test that you can use when you have a categorical variable and a continuous variable. It is a test that considers variability between means for different categories as well as the variability of observations within groups.

There are a wide variety of different extensions of ANOVA that deal with covariance (ANCOVA), multiple variables (MANOVA), and both of those together (MANCOVA). These techniques can become quite complicated and also assume that the values in the continuous variables have a normal distribution.

  • Data: One or more categorical (independent) variables and one continuous (dependent) sampled variable
  • R Function: aov()
  • Null hypothesis (H 0 ): There is no difference in means of the groups defined by each level of the categorical (independent) variable
  • History: Ronald Fisher (1921)
  • Example Question: Do low-, middle- and high-income people vary in the amount of time they spend watching TV?

As an example, we look at the continuous weight variable (WEIGHT2) split into groups by the eight income categories in INCOME2: Is your annual household income from all sources?

  • 1: Less than $10,000
  • 2: $10,000 to less than $15,000
  • 3: $15,000 to less than $20,000
  • 4: $20,000 to less than $25,000
  • 5: $25,000 to less than $35,000
  • 6: $35,000 to less than $50,000
  • 7: $50,000 to less than $75,000)
  • 8: $75,000 or more

The barplot() of means does show variation among groups, although there is no clear linear relationship between income and weight.

hypothesis testing pearson r

To test whether this variation could be explained by randomness in the sample, we run the ANOVA test.

The low p-value leads us to reject the null hypothesis that there is no difference in the means of the different groups, and corroborates the alternative hypothesis that mean weights differ based on income group.

However, it gives us no clear model for describing that relationship and offers no insights into why income would affect weight, especially in such a nonlinear manner.

Suppose you are performing research into obesity in your city. You take a sample of 30 people in three different neighborhoods (90 people total), collecting information on health and lifestyle. Two variables you collect are height and weight so you can calculate body mass index . Although this index can be misleading for some populations (notably very athletic people), ordinary sedentary people can be classified according to BMI:

Average BMI in the US from 2007-2010 was around 28.6 and rising, standard deviation of around 5 .

You would like to know if there is a difference in BMI between different neighborhoods so you can know whether to target specific neighborhoods or make broader city-wide efforts. Since you have more than two groups, you cannot use a t-test().

Kruskal-Wallace One-Way Analysis of Variance

A somewhat simpler test is the Kruskal-Wallace test which is a nonparametric analogue to ANOVA for testing the significance of differences between two or more groups.

  • R Function: kruskal.test()
  • Null hypothesis (H 0 ): The samples come from the same distribution.
  • History: William Kruskal and W. Allen Wallis (1952)

For this example, we will investigate whether mean weight varies between the three major US urban states: New York, Illinois, and California.

hypothesis testing pearson r

To test whether this variation could be explained by randomness in the sample, we run the Kruskal-Wallace test.

The low p-value leads us to reject the null hypothesis that the samples come from the same distribution. This corroborates the alternative hypothesis that mean weights differ based on state.

A convienent way of visualizing a comparison between continuous and categorical data is with a box plot , which shows the distribution of a continuous variable across different groups:

hypothesis testing pearson r

A percentile is the level at which a given percentage of the values in the distribution are below: the 5th percentile means that five percent of the numbers are below that value.

The quartiles divide the distribution into four parts. 25% of the numbers are below the first quartile. 75% are below the third quartile. 50% are below the second quartile, making it the median.

Box plots can be used with both sampled data and population data.

The first parameter to the box plot is a formula: the continuous variable as a function of (the tilde) the second variable. A data= parameter can be added if you are using variables in a data frame.

The chi-squared test can be used to determine if two categorical variables are independent of each other.

[banner]

Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

Search Rcompanion.org

  • Purpose of this Book
  • Author of this Book
  • Statistics Textbooks and Other Resources
  • Why Statistics?
  • Evaluation Tools and Surveys
  • Types of Variables
  • Descriptive Statistics
  • Confidence Intervals
  • Basic Plots

Hypothesis Testing and p-values

  • Reporting Results of Data and Analyses
  • Choosing a Statistical Test
  • Independent and Paired Values
  • Introduction to Likert Data
  • Descriptive Statistics for Likert Item Data
  • Descriptive Statistics with the likert Package
  • Confidence Intervals for Medians
  • Converting Numeric Data to Categories
  • Introduction to Traditional Nonparametric Tests
  • One-sample Wilcoxon Signed-rank Test
  • Sign Test for One-sample Data
  • Two-sample Mann–Whitney U Test
  • Mood’s Median Test for Two-sample Data
  • Two-sample Paired Signed-rank Test
  • Sign Test for Two-sample Paired Data
  • Kruskal–Wallis Test
  • Mood’s Median Test
  • Friedman Test
  • Scheirer–Ray–Hare Test
  • Aligned Ranks Transformation ANOVA
  • Nonparametric Regression and Local Regression
  • Nonparametric Regression for Time Series
  • Introduction to Permutation Tests
  • One-way Permutation Test for Ordinal Data
  • One-way Permutation Test for Paired Ordinal Data
  • Permutation Tests for Medians and Percentiles
  • Association Tests for Ordinal Tables
  • Measures of Association for Ordinal Tables
  • Introduction to Linear Models
  • Using Random Effects in Models
  • What are Estimated Marginal Means?
  • Estimated Marginal Means for Multiple Comparisons
  • Factorial ANOVA: Main Effects, Interaction Effects, and Interaction Plots
  • p-values and R-square Values for Models
  • Accuracy and Errors for Models
  • Introduction to Cumulative Link Models (CLM) for Ordinal Data
  • Two-sample Ordinal Test with CLM
  • Two-sample Paired Ordinal Test with CLMM
  • One-way Ordinal Regression with CLM
  • One-way Repeated Ordinal Regression with CLMM
  • Two-way Ordinal Regression with CLM
  • Two-way Repeated Ordinal Regression with CLMM
  • Introduction to Tests for Nominal Variables
  • Confidence Intervals for Proportions
  • Goodness-of-Fit Tests for Nominal Variables
  • Association Tests for Nominal Variables
  • Measures of Association for Nominal Variables
  • Tests for Paired Nominal Data
  • Cochran–Mantel–Haenszel Test for 3-Dimensional Tables
  • Cochran’s Q Test for Paired Nominal Data
  • Models for Nominal Data
  • Introduction to Parametric Tests
  • One-sample t-test
  • Two-sample t-test
  • Paired t-test
  • One-way ANOVA
  • One-way ANOVA with Blocks
  • One-way ANOVA with Random Blocks
  • Two-way ANOVA
  • Repeated Measures ANOVA
  • Correlation and Linear Regression
  • Advanced Parametric Methods
  • Transforming Data
  • Normal Scores Transformation
  • Regression for Count Data
  • Beta Regression for Percent and Proportion Data
  • An R Companion for the Handbook of Biological Statistics

Initial comments

Traditionally when students first learn about the analysis of experiments, there is a strong focus on hypothesis testing and making decisions based on p -values. Hypothesis testing is important for determining if there are statistically significant effects.  However, readers of this book should not place undo emphasis on p -values. Instead, they should realize that p -values are affected by sample size, and that a low p -value does not necessarily suggest a large effect or a practically meaningful effect.  Summary statistics, plots, effect size statistics, and practical considerations should be used. The goal is to determine: a) statistical significance, b) effect size, c) practical importance.  These are all different concepts, and they will be explored below.

Statistical inference

Most of what we’ve covered in this book so far is about producing descriptive statistics: calculating means and medians, plotting data in various ways, and producing confidence intervals.  The bulk of the rest of this book will cover statistical inference:  using statistical tests to draw some conclusion about the data.  We’ve already done this a little bit in earlier chapters by using confidence intervals to conclude if means are different or not among groups.

As Dr. Nic mentions in her article in the “References and further reading” section, this is the part where people sometimes get stumped.  It is natural for most of us to use summary statistics or plots, but jumping to statistical inference needs a little change in perspective.  The idea of using some statistical test to answer a question isn’t a difficult concept, but some of the following discussion gets a little theoretical.  The video from the Statistics Learning Center in the “References and further reading” section does a good job of explaining the basis of statistical inference.

One important thing to gain from this chapter is an understanding of how to use the p -value, alpha , and decision rule to test the null hypothesis.  But once you are comfortable with that, you will want to return to this chapter to have a better understanding of the theory behind this process.

Another important thing is to understand the limitations of relying on p -values, and why it is important to assess the size of effects and weigh practical considerations.

Packages used in this chapter

The packages used in this chapter include:

The following commands will install these packages if they are not already installed:

if(!require(lsr)){install.packages("lsr")}

Hypothesis testing

The null and alternative hypotheses.

The statistical tests in this book rely on testing a null hypothesis, which has a specific formulation for each test.  The null hypothesis always describes the case where e.g. two groups are not different or there is no correlation between two variables, etc.

The alternative hypothesis is the contrary of the null hypothesis, and so describes the cases where there is a difference among groups or a correlation between two variables, etc.

Notice that the definitions of null hypothesis and alternative hypothesis have nothing to do with what you want to find or don't want to find, or what is interesting or not interesting, or what you expect to find or what you don’t expect to find.  If you were comparing the height of men and women, the null hypothesis would be that the height of men and the height of women were not different.  Yet, you might find it surprising if you found this hypothesis to be true for some population you were studying.  Likewise, if you were studying the income of men and women, the null hypothesis would be that the income of men and women are not different, in the population you are studying.  In this case you might be hoping the null hypothesis is true, though you might be unsurprised if the alternative hypothesis were true.  In any case, the null hypothesis will take the form that there is no difference between groups, there is no correlation between two variables, or there is no effect of this variable in our model.

p -value definition

Most of the tests in this book rely on using a statistic called the p -value to evaluate if we should reject, or fail to reject, the null hypothesis.

Given the assumption that the null hypothesis is true , the p -value is defined as the probability of obtaining a result equal to or more extreme than what was actually observed in the data.

We’ll unpack this definition in a little bit.

Decision rule

The p -value for the given data will be determined by conducting the statistical test.

This p -value is then compared to a pre-determined value alpha .  Most commonly, an alpha value of 0.05 is used, but there is nothing magic about this value.

If the p -value for the test is less than alpha , we reject the null hypothesis.

If the p -value is greater than or equal to alpha , we fail to reject the null hypothesis.

Coin flipping example

For an example of using the p -value for hypothesis testing, imagine you have a coin you will toss 100 times.  The null hypothesis is that the coin is fair—that is, that it is equally likely that the coin will land on heads as land on tails.  The alternative hypothesis is that the coin is not fair.  Let’s say for this experiment you throw the coin 100 times and it lands on heads 95 times out of those hundred.  The p -value in this case would be the probability of getting 95, 96, 97, 98, 99, or 100 heads, or 0, 1, 2, 3, 4, or 5 heads, assuming that the null hypothesis is true . 

This is what we call a two-sided test, since we are testing both extremes suggested by our data:  getting 95 or greater heads or getting 95 or greater tails.  In most cases we will use two sided tests.

You can imagine that the p -value for this data will be quite small.  If the null hypothesis is true, and the coin is fair, there would be a low probability of getting 95 or more heads or 95 or more tails.

Using a binomial test, the p -value is < 0.0001.

(Actually, R reports it as < 2.2e-16, which is shorthand for the number in scientific notation, 2.2 x 10 -16 , which is 0.00000000000000022, with 15 zeros after the decimal point.)

Assuming an alpha of 0.05, since the p -value is less than alpha , we reject the null hypothesis.  That is, we conclude that the coin is not fair.

binom.test(5, 100, 0.5)

Exact binomial test number of successes = 5, number of trials = 100, p-value < 2.2e-16 alternative hypothesis: true probability of success is not equal to 0.5

Passing and failing example

As another example, imagine we are considering two classrooms, and we have counts of students who passed a certain exam.  We want to know if one classroom had statistically more passes or failures than the other.

In our example each classroom will have 10 students.  The data is arranged into a contingency table.

Classroom   Passed   Failed A          8       2 B          3       7

We will use Fisher’s exact test to test if there is an association between Classroom and the counts of passed and failed students.  The null hypothesis is that there is no association between Classroom and Passed/Failed , based on the relative counts in each cell of the contingency table.

Input =("  Classroom  Passed  Failed  A          8       2  B          3       7 ") Matrix = as.matrix(read.table(textConnection(Input),                    header=TRUE,                    row.names=1)) Matrix 

  Passed Failed A      8      2 B      3      7

fisher.test(Matrix)

Fisher's Exact Test for Count Data p-value = 0.06978

The reported p -value is 0.070.  If we use an alpha of 0.05, then the p -value is greater than alpha , so we fail to reject the null hypothesis.  That is, we did not have sufficient evidence to say that there is an association between Classroom and Passed/Failed .

More extreme data in this case would be if the counts in the upper left or lower right (or both!) were greater. 

Classroom   Passed   Failed A          9       1 B          3       7 Classroom   Passed   Failed A          10      0 B           3      7 and so on, with Classroom B...

In most cases we would want to consider as "extreme" not only the results when Classroom A has a high frequency of passing students, but also results when Classroom B has a high frequency of passing students.  This is called a two-sided or two-tailed test.  If we were only concerned with one classroom having a high frequency of passing students, relatively, we would instead perform a one-sided test.  The default for the fisher.test function is two-sided, and usually you will want to use two-sided tests.

Classroom   Passed   Failed A          2       8 B          7       3 Classroom   Passed   Failed A          1       9 B          7       3 Classroom   Passed   Failed A          0       10 B          7        3 and so on, with Classroom B...

In both cases, "extreme" means there is a stronger association between Classroom and Passed/Failed .

Theory and practice of using p -values

Wait, does this make any sense.

Recall that the definition of the p -value is:

The astute reader might be asking herself, “If I’m trying to determine if the null hypothesis is true or not, why would I start with the assumption that the null hypothesis is true?  And why am I using a probability of getting certain data given that a hypothesis is true?  Don’t I want to instead determine the probability of the hypothesis given my data?”

The answer is yes , we would like a method to determine the likelihood of our hypothesis being true given our data, but we use the Null Hypothesis Significance Test approach since it is relatively straightforward, and has wide acceptance historically and across disciplines.

In practice we do use the results of the statistical tests to reach conclusions about the null hypothesis.

Technically, the p -value says nothing about the alternative hypothesis.  But logically, if the null hypothesis is rejected, then its logical complement, the alternative hypothesis, is supported.  Practically, this is how we handle significant p -values, though this practical approach generates disapproval in some theoretical circles.

Statistics is like a jury?

Note the language used when testing the null hypothesis.  Based on the results of our statistical tests, we either reject the null hypothesis, or fail to reject the null hypothesis.

This is somewhat similar to the approach of a jury in a trial.  The jury either finds sufficient evidence to declare someone guilty, or fails to find sufficient evidence to declare someone guilty. 

Failing to convict someone isn’t necessarily the same as declaring someone innocent.  Likewise, if we fail to reject the null hypothesis, we shouldn’t assume that the null hypothesis is true.  It may be that we didn’t have sufficient samples to get a result that would have allowed us to reject the null hypothesis, or maybe there are some other factors affecting the results that we didn’t account for.  This is similar to an “innocent until proven guilty” stance.

Errors in inference

For the most part, the statistical tests we use are based on probability, and our data could always be the result of chance.  Considering the coin flipping example above, if we did flip a coin 100 times and came up with 95 heads, we would be compelled to conclude that the coin was not fair.  But 95 heads could happen with a fair coin strictly by chance.

We can, therefore, make two kinds of errors in testing the null hypothesis:

•  A Type I error occurs when the null hypothesis really is true, but based on our decision rule we reject the null hypothesis.  In this case, our result is a false positive ; we think there is an effect (unfair coin, association between variables, difference among groups) when really there isn’t.  The probability of making this kind error is alpha , the same alpha we used in our decision rule.

•  A Type II error occurs when the null hypothesis is really false, but based on our decision rule we fail to reject the null hypothesis.  In this case, our result is a false negative ; we have failed to find an effect that really does exist.  The probability of making this kind of error is called beta .

The following table summarizes these errors.

                            Reality                             ___________________________________ Decision of Test             Null is true             Null is false Reject null hypothesis      Type I error           Correctly                              (prob. = alpha)          reject null                                                      (prob. = 1 – beta) Retain null hypothesis      Correctly               Type II error                              retain null             (prob. = beta)                              (prob. = 1 – alpha)

Statistical power

The statistical power of a test is a measure of the ability of the test to detect a real effect.  It is related to the effect size, the sample size, and our chosen alpha level. 

The effect size is a measure of how unfair a coin is, how strong the association is between two variables, or how large the difference is among groups.  As the effect size increases or as the number of observations we collect increases, or as the alpha level increases, the power of the test increases.

Statistical power in the table above is indicated by 1 – beta , and power is the probability of correctly rejecting the null hypothesis.

An example should make these relationship clear.  Imagine we are sampling a large group of 7 th grade students for their height.  That is, the group is the population, and we are sampling a sub-set of these students.  In reality, for students in the population, the girls are taller than the boys, but the difference is small (that is, the effect size is small), and there is a lot of variability in students’ heights.  You can imagine that in order to detect the difference between girls and boys that we would have to measure many students.  If we fail to sample enough students, we might make a Type II error.  That is, we might fail to detect the actual difference in heights between sexes.

If we had a different experiment with a larger effect size—for example the weight difference between mature hamsters and mature hedgehogs—we might need fewer samples to detect the difference.

Note also, that our chosen alpha plays a role in the power of our test, too.  All things being equal, across many tests, if we decrease our alph a, that is, insist on a lower rate of Type I errors, we are more likely to commit a Type II error, and so have a lower power.  This is analogous to a case of a meticulous jury that has a very high standard of proof to convict someone.  In this case, the likelihood of a false conviction is low, but the likelihood of a letting a guilty person go free is relatively high.

The 0.05 alpha value is not dogma

The level of alpha is traditionally set at 0.05 in some disciplines, though there is sometimes reason to choose a different value.

One situation in which the alpha level is increased is in preliminary studies in which it is better to include potentially significant effects even if there is not strong evidence for keeping them.  In this case, the researcher is accepting an inflated chance of Type I errors in order to decrease the chance of Type II errors.

Imagine an experiment in which you wanted to see if various environmental treatments would improve student learning.  In a preliminary study, you might have many treatments, with few observations each, and you want to retain any potentially successful treatments for future study.  For example, you might try playing classical music, improved lighting, complimenting students, and so on, and see if there is any effect on student learning.  You might relax your alpha value to 0.10 or 0.15 in the preliminary study to see what treatments to include in future studies.

On the other hand, in situations where a Type I, false positive, error might be costly in terms of money or people’s health, a lower alpha can be used, perhaps, 0.01 or 0.001.  You can imagine a case in which there is an established treatment for cancer, and a new treatment is being tested.  Because the new treatment is likely to be expensive and to hold people’s lives in the balance, a researcher would want to be very sure that the new treatment is more effective than the established treatment.  In reality, the researchers would not just lower the alpha level, but also look at the effect size, submit the research for peer review, replicate the study, be sure there were no problems with the design of the study or the data collection, and weigh the practical implications.

The 0.05 alpha value is almost dogma

In theory, as a researcher, you would determine the alpha level you feel is appropriate.  That is, the probability of making a Type I error when the null hypothesis is in fact true. 

In reality, though, 0.05 is almost always used in most fields for readers of this book.  Choosing a different alpha value will rarely go without question.  It is best to keep with the 0.05 level unless you have good justification for another value, or are in a discipline where other values are routinely used.

Practical advice

One good practice is to report actual p -values from analyses.  It is fine to also simply say, e.g. “The dependent variable was significantly correlated with variable A ( p < 0.05).”  But I prefer when possible to say, “The dependent variable was significantly correlated with variable A ( p = 0.026).

It is probably best to avoid using terms like “marginally significant” or “borderline significant” for p -values less than 0.10 but greater than 0.05, though you might encounter similar phrases.  It is better to simply report the p -values of tests or effects in straight-forward manner.  If you had cause to include certain model effects or results from other tests, they can be reported as e.g., “Variables correlated with the dependent variable with p < 0.15 were A , B , and C .”

Is the p -value every really true?

Considering some of the examples presented, it may have occurred to the reader to ask if the null hypothesis is ever really true.   For example, in some population of 7 th graders, if we could measure everyone in the population to a high degree of precision, then there must be some difference in height between girls and boys.  This is an important limitation of null hypothesis significance testing.  Often, if we have many observations, even small effects will be reported as significant.  This is one reason why it is important to not rely too heavily on p -values, but to also look at the size of the effect and practical considerations.  In this example, if we sampled many students and the difference in heights was 0.5 cm, even if significant, we might decide that this effect is too small to be of practical importance, especially relative to an average height of 150 cm.  (Here, the difference would be  0.3% of the average height).

Effect sizes and practical importance

Practical importance and statistical significance.

It is important to remember to not let p -values be the only guide for drawing conclusions.  It is equally important to look at the size of the effects you are measuring, as well as take into account other practical considerations like the costs of choosing a certain path of action.

For example, imagine we want to compare the SAT scores of two SAT preparation classes with a t -test.

Class.A = c(1500, 1505, 1505, 1510, 1510, 1510, 1515, 1515, 1520, 1520) Class.B = c(1510, 1515, 1515, 1520, 1520, 1520, 1525, 1525, 1530, 1530) t.test(Class.A, Class.B)

Welch Two Sample t-test t = -3.3968, df = 18, p-value = 0.003214 mean of x mean of y      1511      1521

The p -value is reported as 0.003, so we would consider there to be a significant difference between the two classes ( p < 0.05).

But we have to ask ourselves the practical question, is a difference of 10 points on the SAT large enough for us to care about?  What if enrolling in one class costs significantly more than the other class?  Is it worth the extra money for a difference of 10 points on average?

Sizes of effects

It should be remembered that p -values do not indicate the size of the effect being studied.  It shouldn’t be assumed that a small p -value indicates a large difference between groups, or vice-versa. 

For example, in the SAT example above, the p -value is fairly small, but the size of the effect (difference between classes) in this case is relatively small (10 points, especially small relative to the range of scores students receive on the SAT).

In converse, there could be a relatively large size of the effects, but if there is a lot of variability in the data or the sample size is not large enough, the p -value could be relatively large. 

In this example, the SAT scores differ by 100 points between classes, but because the variability is greater than in the previous example, the p -value is not significant.

Class.C = c(1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500) Class.D = c(1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600) t.test(Class.C, Class.D)

Welch Two Sample t-test t = -1.4174, df = 18, p-value = 0.1735 mean of x mean of y      1290      1390

boxplot(cbind(Class.C, Class.D))

image

p -values and sample sizes

It should also be remembered that p -values are affected by sample size.   For a given effect size and variability in the data, as the sample size increases, the p -value is likely to decrease.  For large data sets, small effects can result in significant p -values.

As an example, let’s take the data from Class.C and Class.D and double the number of observations for each without changing the distribution of the values in each, and rename them Class.E and Class.F .

Class.E = c(1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500,             1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500) Class.F = c(1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600,             1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600) t.test(Class.E, Class.F)

Welch Two Sample t-test t = -2.0594, df = 38, p-value = 0.04636 mean of x mean of y      1290      1390

boxplot(cbind(Class.E, Class.F))

Notice that the p -value is lower for the t -test for Class.E and Class.F than it was for Class.C and Class.D .  Also notice that the means reported in the output are the same, and the box plots would look the same.

Effect size statistics

One way to account for the effect of sample size on our statistical tests is to consider effect size statistics.  These statistics reflect the size of the effect in a standardized way, and are unaffected by sample size.

An appropriate effect size statistic for a t -test is Cohen’s d .  It takes the difference in means between the two groups and divides by the pooled standard deviation of the groups.  Cohen’s d equals zero if the means are the same, and increases to infinity as the difference in means increases relative to the standard deviation.

In the following, note that Cohen’s d is not affected by the sample size difference in the Class.C / Class.D and the Class.E /  Class.F examples.

library(lsr) cohensD(Class.C, Class.D,         method = "raw")

cohensD(Class.E, Class.F,         method = "raw")

Effect size statistics are standardized so that they are not affected by the units of measurements of the data.  This makes them interpretable across different situations, or if the reader is not familiar with the units of measurement in the original data.  A Cohen’s d of 1 suggests that the two means differ by one pooled standard deviation.  A Cohen’s d of 0.5 suggests that the two means differ by one-half the pooled standard deviation.

For example, if we create new variables— Class.G and Class.H —that are the SAT scores from the previous example expressed as a proportion of a 1600 score, Cohen’s d will be the same as in the previous example.

Class.G = Class.E / 1600 Class.H = Class.F / 1600 Class.G Class.H cohensD(Class.G, Class.H,         method="raw")

Good practices for statistical analyses

Statistics is not like a trial.

When analyzing data, the analyst should not approach the task as would a lawyer for the prosecution.  That is, the analyst should not be searching for significant effects and tests, but should instead be like an independent investigator using lines of evidence to find out what is most likely to true given the data, graphical analysis, and statistical analysis available.

The problem of multiple p -values

One concept that will be in important in the following discussion is that when there are multiple tests producing multiple p -values, that there is an inflation of the Type I error rate.  That is, there is a higher chance of making false-positive errors.

This simply follows mathematically from the definition of alpha .  If we allow a probability of 0.05, or 5% chance, of making a Type I error for any one test, as we do more and more tests, the chances that at least one of them having a false positive becomes greater and greater.

p -value adjustment

One way we deal with the problem of multiple p -values in statistical analyses is to adjust p -values when we do a series of tests together (for example, if we are comparing the means of multiple groups).

Don’t use Bonferroni adjustments

There are various p -value adjustments available in R.  In some cases, we will use FDR, which stands for false discovery rate , and in R is an alias for the Benjamini and Hochberg method.  There are also cases in which we’ll use Tukey range adjustment to correct for the family-wise error rate. 

Unfortunately, students in analysis of experiments courses often learn to use Bonferroni adjustment for p -values.  This method is simple to do with hand calculations, but is excessively conservative in most situations, and, in my opinion, antiquated.

There are other p -value adjustment methods, and the choice of which one to use is dictated either by which are common in your field of study, or by doing enough reading to understand which are statistically most appropriate for your application.

Preplanned tests

The statistical tests covered in this book assume that tests are preplanned for their p -values to be accurate.  That is, in theory, you set out an experiment, collect the data as planned, and then say “I’m going to analyze it with kind of model and do these post-hoc tests afterwards”, and report these results, and that’s all you would do.

Some authors emphasize this idea of preplanned tests.  In contrast is an exploratory data analysis approach that relies upon examining the data with plots and using simple tests like correlation tests to suggest what statistical analysis makes sense.

If an experiment is set out in a specific design, then usually it is appropriate to use the analysis suggested by this design.

p -value hacking

It is important when approaching data from an exploratory approach, to avoid committing p -value hacking.  Imagine the case in which the researcher collects many different measurements across a range of subjects.  The researcher might be tempted to simply try different tests and models to relate one variable to another, for all the variables.  He might continue to do this until he found a test with a significant p -value.

But this would be a form of p -value hacking.

Because an alpha value of 0.05 allows us to make a false-positive error five percent of the time, finding one p -value below 0.05 after several successive tests may simply be due to chance.

Some forms of p -value hacking are more egregious.  For example, if one were to collect some data, run a test, and then continue to collect data and run tests iteratively until a significant p -value is found.

Publication bias

A related issue in science is that there is a bias to publish, or to report, only significant results.  This can also lead to an inflation of the false-positive rate.  As a hypothetical example, imagine if there are currently 20 similar studies being conducted testing a similar effect—let’s say the effect of glucosamine supplements on joint pain.  If 19 of those studies found no effect and so were discarded, but one study found an effect using an alpha of 0.05, and was published, is this really any support that glucosamine supplements decrease joint pain?

Clarification of terms and reporting on assignments

"statistically significant".

In the context of this book, the term "significant" means "statistically significant". 

Whenever the decision rule finds that p < alpha , the difference in groups, the association, or the correlation under consideration is then considered "statistically significant" or "significant". 

No effect size or practical considerations enter into determining whether an effect is “significant” or not.  The only exception is that test assumptions and requirements for appropriate data must also be met in order for the p -value to be valid.

What you need to consider :

 •  The null hypothesis

 •  p , alpha , and the decision rule,

 •  Your result.  That is, whether the difference in groups, the association, or the correlation is significant or not.

What you should report on your assignments:

•  The p -value

•  The conclusion, e.g. "There was a significant difference in the mean heights of boys and girls in the class." It is best to preface this with the "reject" or "fail to reject" language concerning your decision about the null hypothesis.

“Size of the effect” / “effect size”

In the context of this book, I use the term "size of the effect" to suggest the use of summary statistics to indicate how large an effect is.  This may be, for example the difference in two medians.  I try reserve the term “effect size” to refer to the use of effect size statistics. This distinction isn’t necessarily common.

Usually you will consider an effect in relation to the magnitude of measurements.  That is, you might look at the difference in medians as a percent of the median of one group or of the global median.  Or, you might look at the difference in medians in relation to the range of answers.  For example, a one-point difference on a 5-point Likert item.  Counts might be expressed as proportions of totals or subsets.

What you should report on assignments :

 •  The size of the effect.  That is, the difference in medians or means, the difference in counts, or the  proportions of counts among groups.

 •  Where appropriate, the size of the effect expressed as a percentage or proportion.

•  If there is an effect size statistic—such as r , epsilon -squared, phi , Cramér's V , or Cohen's d —:  report this and its interpretation (small, medium, large), and incorporate this into your conclusion.

"Practical" / "Practical importance"

If there is a significant result, the question of practical importance asks if the difference or association is large enough to matter in the real world.

If there is no significant result, the question of practical importance asks if the a difference or association is large enough to warrant another look, for example by running another test with a larger sample size or that controls variability in observations better.

•  Your conclusion as to whether this effect is large enough to be important in the real world.

•  The context, explanation, or support to justify your conclusion.

•  In some cases you might include considerations that aren't included in the data presented.  Examples might include the cost of one treatment over another, including time investment, or whether there is a large risk in selecting one treatment over another (e.g., if people's lives are on the line).

A few of xkcd comics

Significant.

xkcd.com/882/

Null hypothesis

xkcd.com/892/

xkcd.com/1478/

Experiments, sampling, and causation

Types of experimental designs, experimental designs.

A true experimental design assigns treatments in a systematic manner.  The experimenter must be able to manipulate the experimental treatments and assign them to subjects.  Since treatments are randomly assigned to subjects, a causal inference can be made for significant results.  That is, we can say that the variation in the dependent variable is caused by the variation in the independent variable.

For interval/ratio data, traditional experimental designs can be analyzed with specific parametric models, assuming other model assumptions are met.  These traditional experimental designs include:

•  Completely random design

•  Randomized complete block design

•  Factorial

•  Split-plot

•  Latin square

Quasi-experiment designs

Often a researcher cannot assign treatments to individual experimental units, but can assign treatments to groups.  For example, if students are in a specific grade or class, it would not be practical to randomly assign students to grades or classes.  But different classes could receive different treatments (such as different curricula).  Causality can be inferred cautiously if treatments are randomly assigned and there is some understanding of the factors that affect the outcome.

Observational studies

In observational studies, the independent variables are not manipulated, and no treatments are assigned.  Surveys are often like this, as are studies of natural systems without experimental manipulation.  Statistical analysis can reveal the relationships among variables, but causality cannot be inferred.  This is because there may be other unstudied variables that affect the measured variables in the study.

Good sampling practices are critical for producing good data.  In general, samples need to be collected in a random fashion so that bias is avoided.

In survey data, bias is often introduced by a self-selection bias.  For example, internet or telephone surveys include only those who respond to these requests.  Might there be some relevant difference in the variables of interest between those who respond to such requests and the general population being surveyed?  Or bias could be introduced by the researcher selecting some subset of potential subjects, for example only surveying a 4-H program with particularly cooperative students and ignoring other clubs.  This is sometimes called “convenience sampling”.

In election forecasting, good pollsters need to account for selection bias and other biases in the survey process.  For example, if a survey is done by landline telephone, those being surveyed are more likely to be older than the general population of voters, and so likely to have a bias in their voting patterns.

Plan ahead and be consistent

It is sometimes necessary to change experimental conditions during the course of an experiment.  Equipment might fail, or unusual weather may prevent making meaningful measurements.

But in general, it is much better to plan ahead and be consistent with measurements. 

Consistency

People sometimes have the tendency to change measurement frequency or experimental treatments during the course of a study.  This inevitably causes headaches in trying to analyze data, and makes writing up the results messy.  Try to avoid this.

Controls and checks

If you are testing an experimental treatment, include a check treatment that almost certainly will have an effect and a control treatment that almost certainly won’t.  A control treatment will receive no treatment and a check treatment will receive a treatment known to be successful.  In an educational setting, perhaps a control group receives no instruction on the topic but on another topic, and the check group will receive standard instruction.

Including checks and controls helps with the analysis in a practical sense, since they serve as standard treatments against which to compare the experimental treatments.  In the case where the experimental treatments have similar effects, controls and checks allow you say, for example, “Means for the all experimental treatments were similar, but were higher than the mean for control, and lower than the mean for check treatment.”

Include alternate measurements

It often happens that measuring equipment fails or that a certain measurement doesn’t produce the expected results.  It is therefore helpful to include measurements of several variables that can capture the potential effects.  Perhaps test scores of students won’t show an effect, but a self-assessment question on how much students learned will.

Include covariates

Including additional independent variables that might affect the dependent variable is often helpful in an analysis.  In an educational setting, you might assess student age, grade, school, town, background level in the subject, or how well they are feeling that day.

The effects of covariates on the dependent variable may be of interest in itself.  But also, including co-variates in an analysis can better model the data, sometimes making treatment effects more clear or making a model better meet model assumptions.

Optional discussion: Alternative methods to the Null Hypothesis Significance Test

The nhst controversy.

Particularly in the fields of psychology and education, there has been much criticism of the null hypothesis significance test approach.  From my reading, the main complaints against NHST tend to be:

•  Students and researchers don’t really understand the meaning of p -values.

•  p -values don’t include important information like confidence intervals or parameter estimates.

•  p -values have properties that may be misleading, for example that they do not represent effect size, and that they change with sample size.

•  We often treat an alpha of 0.05 as a magical cutoff value.

Personally, I don’t find these to be very convincing arguments against the NHST approach. 

The first complaint is in some sense pedantic:  Like so many things, students and researchers learn the definition of p -values at some point and then eventually forget.  This doesn’t seem to impact the usefulness of the approach.

The second point has weight only if researchers use only p -values to draw conclusions from statistical tests.  As this book points out, one should always consider the size of the effects and practical considerations of the effects, as well present finding in table or graphical form, including confidence intervals or measures of dispersion.  There is no reason why parameter estimates, goodness-of-fit statistics, and confidence intervals can’t be included when a NHST approach is followed.

The properties in the third point also don’t count much as criticism if one is using p -values correctly.  One should understand that it is possible to have a small effect size and a small p -value, and vice-versa.  This is not a problem, because p -values and effect sizes are two different concepts.  We shouldn’t expect them to be the same.  The fact that p -values change with sample size is also in no way problematic to me.  It makes sense that when there is a small effect size or a lot of variability in the data that we need many samples to conclude the effect is likely to be real.

(One case where I think the considerations in the preceding point are commonly problematic is when people use statistical tests to check for the normality or homogeneity of data or model residuals.  As sample size increases, these tests are better able to detect small deviations from normality or homoscedasticity.  Too many people use them and think their model is inappropriate because the test can detect a small effect size, that is, a small deviation from normality or homoscedasticity).

The fourth point is a good one.  It doesn’t make much sense to come to one conclusion if our p -value is 0.049 and the opposite conclusion if our p -value is 0.051.  But I think this can be ameliorated by reporting the actual p -values from analyses, and relying less on p -values to evaluate results.

Overall it seems to me that these complaints condemn poor practices that the authors observe: not reporting the size of effects in some manner; not including confidence intervals or measures of dispersion; basing conclusions solely on p -values; and not including important results like parameter estimates and goodness-of-fit statistics.

Alternatives to the NHST approach

Estimates and confidence intervals.

One approach to determining statistical significance is to use estimates and confidence intervals.  Estimates could be statistics like means, medians, proportions, or other calculated statistics.  This approach can be very straightforward, easy for readers to understand, and easy to present clearly.

Bayesian approach

The most popular competitor to the NHST approach is Bayesian inference.  Bayesian inference has the advantage of calculating the probability of the hypothesis given the data , which is what we thought we should be doing in the “Wait, does this make any sense?” section above.  Essentially it takes prior knowledge about the distribution of the parameters of interest for a population and adds the information from the measured data to reassess some hypothesis related to the parameters of interest.  If the reader will excuse the vagueness of this description, it makes intuitive sense.  We start with what we suspect to be the case, and then use new data to assess our hypothesis.

One disadvantage of the Bayesian approach is that it is not obvious in most cases what could be used for legitimate prior information.  A second disadvantage is that conducting Bayesian analysis is not as straightforward as the tests presented in this book.

References and further reading

[Video]  “Understanding statistical inference” from Statistics Learning Center (Dr. Nic). 2015. www.youtube.com/watch?v=tFRXsngz4UQ .

[Video]  “Hypothesis tests, p-value” from Statistics Learning Center (Dr. Nic). 2011. www.youtube.com/watch?v=0zZYBALbZgg .

[Video]   “Understanding the p-value” from Statistics Learning Center (Dr. Nic). 2011.

www.youtube.com/watch?v=eyknGvncKLw .

[Video]  “Important statistical concepts: significance, strength, association, causation” from Statistics Learning Center (Dr. Nic). 2012. www.youtube.com/watch?v=FG7xnWmZlPE .

“Understanding statistical inference” from Dr. Nic. 2015. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/understanding-statistical-inference/ .

“Basic concepts of hypothesis testing” in McDonald, J.H. 2014. Handbook of Biological Statistics . www.biostathandbook.com/hypothesistesting.html .

“Hypothesis testing” , section 4.3, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

“Hypothesis Testing with One Sample”, sections 9.1–9.2 in Openstax. 2013. Introductory Statistics . openstax.org/textbooks/introductory-statistics .

"Proving causation" from Dr. Nic. 2013. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/proving-causation/ .

[Video]   “Variation and Sampling Error” from Statistics Learning Center (Dr. Nic). 2014. www.youtube.com/watch?v=y3A0lUkpAko .

[Video]   “Sampling: Simple Random, Convenience, systematic, cluster, stratified” from Statistics Learning Center (Dr. Nic). 2012. www.youtube.com/watch?v=be9e-Q-jC-0 .

“Confounding variables” in McDonald, J.H. 2014. Handbook of Biological Statistics . www.biostathandbook.com/confounding.html .

“Overview of data collection principles” , section 1.3, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

“Observational studies and sampling strategies” , section 1.4, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

“Experiments” , section 1.5, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

Exercises F

1.  Which of the following pair is the null hypothesis?

A) The number of heads from the coin is not different from the number of tails.

B) The number of heads from the coin is different from the number of tails.

2.  Which of the following pair is the null hypothesis?

A) The height of boys is different than the height of girls.

B) The height of boys is not different than the height of girls.

3.  Which of the following pair is the null hypothesis?

A) There is an association between classroom and sex.  That is, there is a difference in counts of girls and boys between the classes.

B) There is no association between classroom and sex.  That is, there is no difference in counts of girls and boys between the classes.

4.  We flip a coin 10 times and it lands on heads 7 times.  We want to know if the coin is fair.

a.  What is the null hypothesis?

b.  Looking at the code below, and assuming an alpha of 0.05,

What do you decide (use the reject or fail to reject language)?

c.  In practical terms, what do you conclude?

binom.test(7, 10, 0.5)

Exact binomial test number of successes = 7, number of trials = 10, p-value = 0.3438

5.  We measure the height of 9 boys and 9 girls in a class, in centimeters.  We want to know if one group is taller than the other.

c.  In practical terms, what do you conclude?  Address the practical importance of the results.

Girls = c(152, 150, 140, 160, 145, 155, 150, 152, 147) Boys  = c(144, 142, 132, 152, 137, 147, 142, 144, 139) t.test(Girls, Boys)

Welch Two Sample t-test t = 2.9382, df = 16, p-value = 0.009645 mean of x mean of y  150.1111  142.1111

mean(Boys) sd(Boys) quantile(Boys)

mean(Girls) sd(Girls) quantile(Girls) boxplot(cbind(Girls, Boys))

6. We count the number of boys and girls in two classrooms.  We are interested to know if there is an association between the classrooms and the number of girls and boys.  That is, does the proportion of boys and girls differ statistically across the two classrooms?

Classroom   Girls   Boys A          13       7 B           5      15

Input =("  Classroom  Girls  Boys  A          13       7  B           5      15 ") Matrix = as.matrix(read.table(textConnection(Input),                    header=TRUE,                    row.names=1)) fisher.test(Matrix)

Fisher's Exact Test for Count Data p-value = 0.02484

Matrix rowSums(Matrix) colSums(Matrix) prop.table(Matrix,            margin=1)    ### Proportions for each row barplot(t(Matrix),         beside = TRUE,         legend = TRUE,         ylim   = c(0, 25),         xlab   = "Class",         ylab   = "Count")

7. Why should you not rely solely on p -values to make a decision in the real world?  (You should have at least two reasons.)

8. Create your own example to show the importance of considering the size of the effect . Describe the scenario: what the research question is, and what kind of data were collected.  You may make up data and provide real results, or report hypothetical results.

9. Create your own example to show the importance of weighing other practical considerations . Describe the scenario: what the research question is, what kind of data were collected, what statistical results were reached, and what other practical considerations were brought to bear.

10. What is 5e-4 in common decimal notation?

©2016 by Salvatore S. Mangiafico. Rutgers Cooperative Extension, New Brunswick, NJ.

Non-commercial reproduction of this content, with attribution, is permitted. For-profit reproduction without permission is prohibited.

If you use the code or information in this site in a published work, please cite it as a source.  Also, if you are an instructor and use this book in your course, please let me know.   My contact information is on the About the Author of this Book page.

Mangiafico, S.S. 2016. Summary and Analysis of Extension Program Evaluation in R, version 1.20.05, revised 2023. rcompanion.org/handbook/ . (Pdf version: rcompanion.org/documents/RHandbookProgramEvaluation.pdf .)

This means that we can state a null and alternative hypothesis for the population correlation ρ based on our predictions for a correlation. Let's look at how this works in an example.

Now let's go through our hypothesis testing steps:

Step 1: State hypotheses and choose α level

Remember we're going to state hypotheses in terms of our population correlation ρ. In this example, we expect GPA to decrease as distance from campus increases. This means that we are making a directional hypothesis and using a 1-tailed test. It also means we expect to find a negative value of ρ, because that would indicate a negative relationship between GPA and distance from campus. So here are our hypotheses:

H 0 ρ > 0

H A : ρ < 0

We're making our predictions as a comparison with 0, because 0 would indicate no relationship. Note that if we were conducting a 2-tailed test, our hypotheses would be ρ = 0 for the null hypothesis and ρ not equal to 0 for the alternative hypothesis.

We'll use our conventional α = .05.

Step 2: Collect the sample

Here are our sample data:

Step 3: Calculate test statistic

For this example, we're going to calculate a Pearson r statistic. Recall the formula for Person r:   The bottom of the formula requires us to calculate the sum of squares (SS) for each measure individually and the top of the formula requires calculation of the sum of products of the two variables (SP). We'll start with the SS terms. Remember the formula for SS is: SS = Σ(X - ) 2 We'll calculate this for both GPA and Distance. If you need a review of how to calculate SS, review Lab 9 . For our example, we get: SS GPA = .58 and SS distance = 18.39 Now we need to calculate the SP term. Remember the formula for SP is SP = Σ(X - )(Y - ) If you need to review how to calculate the SP term, go to Lab 12 . For our example, we get SP = -.63 Plugging these SS and SP values into our r equation gives us r = -.19 Now we need to find our critical value of r using a table like we did for our z and t-tests. We'll need to know our degrees of freedom, because like t, the r distribution changes depending on the sample size. For r, df = n - 2 So for our example, we have df = 5 - 2 = 3. Now, with df = 3, α = .05, and a one-tailed test, we can find r critical in the Table of Pearson r values . This table is organized and used in the same way that the t-table is used.

Our r crit = .805. We write r crit (3) = -.805 (negative because we are doing a 1-tailed test looking for a negative relationship).

Step 4: Compare observed test statistic to critical test statistic and make a decision about H 0

Our r obs (3) = -.19 and r crit (3) = -.805

Since -.19 is not in the critical region that begins at -.805, we cannot reject the null. We must retain the null hypothesis and conclude that we have no evidence of a relationship between GPA and distance from campus.

Now try a few of these types of problems on your own. Show all four steps of hypothesis testing in your answer (some questions will require more for each step than others) and be sure to state hypotheses in terms of ρ.

(1) A high school counselor would like to know if there is a relationship between mathematical skill and verbal skill. A sample of n = 25 students is selected, and the counselor records achievement test scores in mathematics and English for each student. The Pearson correlation for this sample is r = +0.50. Do these data provide sufficient evidence for a real relationship in the population? Test at the .05 α level, two tails.

(2) It is well known that similarity in attitudes, beliefs, and interests plays an important role in interpersonal attraction. Thus, correlations for attitudes between married couples should be strong and positive. Suppose a researcher developed a questionnaire that measures how liberal or conservative one's attitudes are. Low scores indicate that the person has liberal attitudes, while high scores indicate conservatism. Here are the data from the study:

Couple A: Husband - 14, Wife - 11

Couple B: Husband - 7, Wife - 6

Couple C: Husband - 15, Wife - 18

Couple D: Husband - 7, Wife - 4

Couple E: Husband - 3, Wife - 1

Couple F: Husband - 9, Wife - 10

Couple G: Husband - 9, Wife - 5

Couple H: Husband - 3, Wife - 3

Test the researcher's hypothesis with α set at .05.

(3) A researcher believes that a person's belief in supernatural events (e.g., ghosts, ESP, etc) is related to their education level. For a sample of n = 30 people, he gives them a questionnaire that measures their belief in supernatural events (where a high score means they believe in more of these events) and asks them how many years of schooling they've had. He finds that SS beliefs = 10, SS schooling = 10, and SP = -8. With α = .01, test the researcher's hypothesis.

Using SPSS for Hypothesis Testing with Pearson r

We can also use SPSS to a hypothesis test with Pearson r. We could calculate the Pearson r with SPSS and then look at the output to make our decision about H 0 . The output will give us a p value for our Pearson r (listed under Sig in the Output). We can compare this p value with alpha to determine if the p value is in the critical region.

Remember from Lab 12 , to calculate a Pearson r using SPSS:

The output that you get is a correlation matrix. It correlates each variable against each variable (including itself). You should notice that the table has redundant information on it (e.g., you'll find an r for height correlated with weight, and and r for weight correlated with height. These two statements are identical.)

Making Sense of Data with R

Chapter 7 hypothesis tests, 7.1 independent and dependent variables–recap.

Example 7.1

Research Question: Do women make less money than men?

There are two main variables in this research question–income and gender. Which is the independent variable and which is the dependent variable?

This study implies that people’s incomes may depend on their gender to some degree. In any case, it’d be ridiculous to say that people’s gender depends on their incomes.

So, the variable that depends on the other one is the dependent variable. In this case, it’s income. The variables that does NOT depend on the other one is the independent variable. In this example, it’s gender.

In this class, we mainly deal with 3 types of research questions–prediction/correlation, group comparison, and causal inference. The following table may help you identify the independent and dependent variables in each case.

7.2 Research Hypothesis and Null Hypothesis

When researchers conduct a quantitative research study, they typically already have a theory regarding their research question. This theory posits a general explanation of the causal connections between the facts. It is a story we tell to explain why there is variation. The goal of the research study is to test and update the theory.

A hypothesis is a testable prediction derived from the theory.

For example, before the battle of Midway during WWII, the US code breakers were able to decode some of Japanese radio messages. They noticed two letters, “AF,” appearing repeatedly in these messages. Based on their understanding of the Japanese strategies and the geography of the Pacific ocean, they believed “AF” stands for the Midway islands. This was their theory and they needed to prove it.

The US Navy code breakers contacted the Midway via the underwater secure cable and asked the Americans there to send a radio message to Hawaii, declaring falsely that they were running out of fresh water. They predicted/hypothesized that the Japanese would intercept this message and report it to their headquarters so the letters “AF” would appear again repeatedly in the Japanese radio message. This is a hypothesis. It is derived from the theory and it is directly testable.

And btw, the code breakers were right, and it was one of the keys of success for the US Navy during the battle of Midway.

Your ordinary educational researchers typically come up with hypotheses that are much less intricate than that. For example, my theory might be that 3rd graders need to practice reading in order to get good at it. Based on this theory, I may hypothesize that reserving a short period of class time for students to read independently and silently would raise their reading scores.

So a hypothesis is usually much more specific and actionable than a theory. It states a relationship between a specific independent variable (e.g. whether they use class time to read or not) and a dependent variable (e.g. the reading scores).

Research Hypothesis

The research hypothesis states the researcher’s belief.

Null Hypothesis

The Null hypothesis, regardless of what the researcher believes, always states that there is no relationship between the dependent and independent variables, or there is no difference between groups.

A hypothesis test always tests the null hypothesis.

Example 7.2

Research Question: is educational attainment related to income?

The Null Hypothesis is stated as

\(𝐻_0\) : there is no correlation between educational attainment and income.

\(𝐻_0: r_{Ed, Income}=0\)

Research Question: do athletes have better work ethic than non-athletes?

\(𝐻_0\) : There is no difference between athletes and non-athletes in terms of their work ethic.

\(H_0: \mu_{athlete} - \mu_{nonathlete}=0\)

7.3 Hypothesis Test–Step by Step

Remember the PPDAC cycle? The 5 letters represent the 5 steps in a hypothesis test.

Step 1 (problem): Raise the research question.

Step 2 (plan): State the Null Hypothesis (in order to come up with a testable hypothesis, we must already have a plan).

Step 3 (Data): Recruit a sample; take measurements of the sample.

Step 4 (Analysis): Analyze the data.

Step 5 (Conclusion): Draw a conclusion regarding the null hypothesis–reject or accept.

Example 7.3

Research question: are boys more likely to have ADHD than girls?

\(H_0\) : There is no difference between boys and girls in their chances of having ADHD.

Collect evidence: Get a sample of children; record their gender and their ADHD diagnosis

Analyze the data.

Conclusion: based on the analysis result, we decide if this sample supports the null hypothesis.

7.3.1 What do we do when we “analyze the data?”

We have discussed the first 3 steps of hypothesis testing to various degrees in previous chapters. I would like to roughly explain the 4th step without getting into too much math.

What do we do when we analyze, say, the ADHD data of a group of children? Don’t we just compare the percentages of ADHD diagnoses among boys and girls? That sounds pretty easy. Why do we need to take a course in statistics to do that?

Well, we certainly do compare the percentages of ADHD diagnoses among boys and girls, but that’s not all we do in our data analysis.

Recall that, in inferential statistics, we use a sample to make inference about a larger population. This means that we are not extremely concerned with any tiny difference between this group of boys and girls. Instead, our fundamental question is, is the difference significant enough to make us believe that it is systematic and pervasive, and therefore may exist among other similar groups of boys and girls as well.

Example 7.3 Continued

Suppose we collected data for the boys vs. girls in ADHD question. In each of the following 4 scenarios, there is difference between boys and girls, but in which case(s) would you call the difference “significant?”

Scenario 1:

Boy ADHD rate = 20%, Girl ADHD rate = 10%

Sample size: 10 boys and 10 girls, i.e. 2 boys and 1 girl have ADHD

Scenario 2:

Boy ADHD rate = 90%, Girl ADHD rate = 0%

Sample size: 10 boys and 10 girls, i.e. 9 boys have ADHD and none of the girls has it.

Scenario 3:

Boy ADHD rate = 7%, Girl ADHD rate = 6%

Sample size: 100 boys and 100 girls, i.e. 7 boys and 6 girls have ADHD.

Scenario 4:

Sample size: 10,000 boys and 10,000 girls, i.e. 700 boys and 600 girls have ADHD.

Compare scenarios 1 and 2. In the first one, the difference between boys and girls doesn’t seem particularly striking. It could very well have been a small fluke. If we sample another 10 boys and 10 girls, how confident are we to see the same kind of difference again? Probably not very confident. In the second scenario though, the difference really stands out. If we randomly cast a net and would see 9 out of 10 boys having ADHD and none of the girls has it, it would certainly look like a problem that’s worth our attention, right? Can it be a fluke? Yes of course. But the chance of it being a fluke is much smaller than in scenario 1. What’s the chance that you flip a coin 10 times and get 9 heads?

Compare scenarios 3 and 4. The differences in ADHD rates are identical, but sample sizes are different. While the 7 boys and 6 girls in scenario 3 could be a fluke that is easily reversed–sample another 100 boys and 100 girls, we might get 6 boys and 7 girls with ADHD, the 700 boys vs. 600 girls in scenario 4 is much less likely to be reversible.

What does it mean to say a correlation/difference is significant?

To say a correlation/difference is significant means that it is unlikely to happen by chance, i.e. non-accidental.

The significance of the correlation/difference depends on:

How big the correlation/difference is;

How big the sample is.

So, in the data analysis stage of a hypothesis test, we analyze the significance of the correlation/difference, and this is how we do it:

First, a p-value is estimated which is the likelihood that the correlation/difference happened by chance.

Next, we compare the p-value with a pre-determined significance level (aka \(\alpha\) , usually set at 0.05).

  • If \(p<\alpha\) , we conclude that the correlation/difference is significant, i.e. it is unlikely to have happened by chance.
  • If \(p>\alpha\) , we conclude that the correlation/difference is not significant, i.e. it is likely to have happened by chance.

P stands for probability, therefore it ranges from 0 to 1.

It is the likelihood that the observed difference/correlation happened by chance.

In other words, it is the probability that the null hypothesis is correct.

If you are rooting for your research hypothesis to be correct (usually we believe there is a real difference/correlation), that means that you want your p value to be small.

7.3.2 Conclusion

First of all, we don’t prove a hypothesis to be true or false in a hypothesis test. Recall that the hypothesis is usually about a general population and yet in a particular research study we only have access to a sample. It is not possible to prove something to be true or false in general with a mere sample. Even if we have access to the so-called “big data,” it is still impossible to prove something to be generally true or false for people in the past and in the future.

Instead of proving the hypothesis, we simply state, in the conclusion stage, that the null hypothesis is rejected or accepted. If the particular study stands the test of peer review and is allowed to be published, our conclusion will be counted as a piece of evidence to be reckoned in the long run.

For example, no one has ever been able to prove that smoking causes lung cancer in any single research study. It has become common knowledge after the accumulation of evidences over decades of intense research:

In many studies, the incidence of lung cancer is so much higher among smokers than non-smokers that it’s unlikely to be due to chance.

We see the same results repeated in studies after studies that it’s beyond reasonable doubt .

Conclusion of a Hypothesis Test

The level of significance ( \(\alpha\) ) is usually set at 0.05 (this is related to the fact that confidence intervals are usually built at the level of 95%).

Occasionally, researchers also use \(\alpha=0.01\) or \(\alpha=0.1\)

If \(p<\alpha\) , reject the null hypothesis;

if \(p>\alpha\) , accept the null hypothesis.

7.4 One-Sample t Test

We will now illustrate the steps and principles of hypothesis testing with a one-sample t test.

One-sample t tests are used when we compare a single group with a population and try to decide if the group belongs to the population.

In the last exercise problem of Chapter 3, we looked at the heights of 12 US presidents during the TV era. We can now add President Biden to this dataset who is 72 inches tall.

According to data published by the Centers for Disease Control and Prevention (CDC) , the average height for American men 20 years old and up is 69.1 inches. Obviously no one is able to measure the height of all adult men living in the US, but the population mean can be reasonably well estimated by a carefully drawn sample.

The research question is, are presidents taller than the general male population, in other words, do we prefer taller people to be our presidents?

The null hypothesis is that there is no diference between the US presidents and the general US male population in their heights, in other words, the US presidents belong to the general population in terms of their heights.

\[H_0: \mu_{presidents}=69.1\] We will adopt \(\alpha=0.05\) for this analysis.

Next, we enter the data and carry out the one-sample t test:

“mean of x,” 72.61538, is the mean of our sample of presidents.

“df,” degree of freedom, gives away the sample size: sample size = df + 1 = 13

“95 percent confident interval,” [71.52490, 73.70586], can be understood in this way: there is a 95% chance that the next president we have after Joe Biden will be between 71.5 and 73.7 inches tall. Of course, the next president could be a woman, and in that case, the height expectation won’t be applicable, but we know how hard it is to elect a female president–the overall chance is definitely small.

Lastly, “p-value = 1.387e-05.” Recall scientific notation we introduced in the first class?

\[1.387e-05=0.00001387\] Therefore \(p<\alpha\) , and we reject the null hypothesis.

So the difference between the heights of the group of 13 presidents and that of the general population is unlikely due to chance, that is to say, the difference is significant. The US presidents are significantly taller than the average American adult males.

7.5 Type I and Type II Errors

Just like how the jury members never saw the crime, or how the doping committee members never saw the athletes taking drugs, statisticians don’t see lung cells turning into cancer under the influence of smoking.

In a hypothesis test, we make dichotomous decisions, either rejecting or accepting the null. As said earlier, we cannot prove the hypothesis to be true or false, which means our decisions are always prone to errors. This is not only because what is true for the sample is not necessarily true for the general population, but also since our understandings of the science could be limited and even our measurement instruments could be error-prone.

7.5.1 Type I Error

If an athlete tested positive for doping, it means that he/she is believed to have doped. If someone tested positive for Covid-19, it means that he/she is believed to have Covid. If these test results are incorrect, we call this type of error a false positive. False positive is also known as type I error .

When someone tests positive for something it means that he/she differs from the normal population. Therefore, another way to understand type I error is that we are rejecting the null hypothesis by mistake .

Since we won’t reject a null hypothesis unless \(p<\alpha\) , it means we cannot possibly commit a type I error unless \(p<\alpha\) . Setting \(\alpha\) very small is something we can do to control the chance of type I error. The level of significance ( \(\alpha\) ) is the maximum acceptable probability of making a type I error .

7.5.2 Type II Error

When someone actually has Covid but tests negative, it’s called a false negative—type II error .

In other words, we make a type II error when we accept the null hypothesis by mistake .

The probability of making a type II error is denoted as \(\beta\) .

7.5.3 The relationship between type I and II errors

As mentioned earlier, \(\alpha\) is a lever we have to control type I errors. Is it possible to make sure that we never make any type I errors? Absolutely. Just set \(\alpha=0\) , which means we will never reject any null hypothesis. This way we won’t make any type I error.

The problem with doing so is that it will certainly increase our chance of making type II errors.

The following plots illustrate the relationship between type I and II errors.

hypothesis testing pearson r

The vertical dashed line is the cutoff for the test results. The horizontal solid line is the actual threshold of the true value of the underlying trait. The dots ending up in the lower right quadrant are the false positives, and those in the upper left are the false negatives.

When we move the vertical dashed line to the right, it is equivalent of maneuvering \(\alpha\) to make it smaller, and we can thus reduce the false positives. But then the false negatives will inevitably increase.

The upshot is that it is not possible to eliminate both type I and II errors. But it is possible to find an optimal balance between them.

In practice, we don’t always need to consider this issue as a novice data analyst, since we can just follow the convention of setting \(\alpha=0.05\) . Sometimes though we may come across research studies that set their \(\alpha\) at 0.1 or 0.01. So it’s important to understand the consequences of choosing a larger or smaller \(\alpha\) .

7.6 Exercises

7.6.1 identify the independent and dependent variables.

  • Is test anxiety a good predictor of test scores?
  • Does having a disability affect a child’s decision of attending public or private schools?
  • Does financial assistance improve the symptoms of mentally ill patients?
  • Does a country’s institutional corruption increase its citizen’s tendency to lie?
  • Are boys more likely to drop out of high school than girls?

7.6.2 We’d like to research the idea that praising children for their effort makes them more resilient. Please lay out the steps for the hypothesis test.

7.6.3 which ones of the following are appropriate null hypotheses.

A. \(𝐻_0\) : praising children for effort makes them more resilient. B. \(𝐻_0\) : praising children for effort does not make them more resilient. C. \(𝐻_0\) : The 400 children praised for effort would score higher on the resiliency test than the other group not praised for effort. D. \(𝐻_0\) : In general, children praised for effort would score equally high on a resiliency test than those not praised for effort.

7.6.4 True or False

  • The p-value is a number that ranges from -1 to 1.
  • P in the p-value stands for probability.
  • The p-value is the likelihood that the relationship we saw in the sample is absolutely real.
  • Large p-values, such as 0.8, indicates that the observed outcome is highly likely to be an accident.

7.6.5 Decimals and percentages

  • .0007= ____%
  • 0.745=_____
  • 0.01%=_____

7.6.6 The scientific notation

Use the scientific notation to represent these numbers: 1) 7380 = 2) 0.00246 =

7.6.7 Investigators wish to study the question, “do blondes have more fun?”

  • What is the null hypothesis in this question?
  • What would be a type I error in this case?
  • What would be a type II error?
  • If one investigator uses a .05 level of significance (i.e. alpha) in investigating this question and another one uses .001 level of significance, which would be more likely to make a type I error?

7.6.8 Type I and II Errors

Based on a random sample of teenage girls from the north and south, researchers concluded that there is no significant difference between north and south in teen pregnancy rates. But others question this finding. They are suggesting that it may be a type __ error.

An experiment shows that eating grapefruit can help you lose weight. But later a design flaw of the experiment was found–those who ate grapefruit regularly also happened to exercise more than the no-grapefruit group. The original conclusion could be a type __ error.

In the above question, one researcher uses an alpha of 0.05, another researcher uses an alpha of 0.01. The first researcher, compared to the second:

A. Is more likely to make the type I error B. Is more likely to make the type II error C. Is more likely to make errors. D. Is less likely to make errors.

7.6.9 One-sample t tests

7.6.9.1 please test the hypothesis that a middle-aged man in uk has had 7 sexual partners on average using the dataset “sexual.partners.sav.”.

  • State the Null Hypothesis (in words or in an equation).
  • Carry out the data analysis in R. Copy and paste your codes here. Note: After reading in the data, we need to first select a subset of the dataset that only includes men. Use the following codes to accomplish this:
  • Copy and paste the output from R. Interpret the p-value, the 95% confidence interval, the degrees of freedom, and the sample mean.
  • Do you reject or accept the null hypothesis?
  • Has this sample had more or less than 7 sexual partners on average?

7.6.9.2 The adult American men have an average height of 69.1 inches. We would like to find out if men in our class are taller or shorter than the national average using our class survey data.

  • Are men in our class taller or shorter than the national average?

7.6.9.3 The average American has visited 12.5 states, according to Ipsos polls. We would like to find out if our class has travelled more or less than the national average using our class survey data.

  • Carry out the data analysis in R. Copy and paste your codes here.
  • Has our class travelled more or less than the national average?

7.6.9.4 The adult American women have an average height of 63.7 inches. We would like to find out if women in our class are taller or shorter than the national average using our class survey data.

Carry out the data analysis in R. Copy and paste your codes here. Note: After reading in the data, we need to first select a subset of the dataset that only includes women. Use the code given in some of the exercise problems above and make the necessary changes.

Are the women in our class taller or shorter than the national average?

7.6.9.5 Proteins in energy bars

The labels on a particular brand of energy bars claim that each bar contains 20 grams of protein. We have collected a random sample of 31 energy bars from a number of different stores to represent the population of this brand of bars available to the general consumer. The data are listed in the following table

hypothesis testing pearson r

7.6.9.6 Many cities in the US are known for their crimes. The average violent crime rate per 100,000 in the US in 2012 was 693, including homicide, assault, rape, sexual assault, and robbery. The dataset “crimeInCity.sav” records crime rates for a group of cities. Please test the hypothesis that cities have more violent crimes than the national average using this data.

  • Do cities have more or less violent crimes than the national average?

12.2.1 - Hypothesis Testing

In testing the statistical significance of the relationship between two quantitative variables we will use the five step hypothesis testing procedure:

In order to use Pearson's \(r\) both variables must be quantitative and the relationship between \(x\) and \(y\) must be linear

Minitab will not provide the test statistic for correlation. It will provide the sample statistic, \(r\), along with the p-value (for step 3).

Optional : If you are conducting a test by hand, a \(t\) test statistic is computed in step 2 using the following formula:

  \(t=\dfrac{r- \rho_{0}}{\sqrt{\dfrac{1-r^2}{n-2}}} \)

In step 3, a \(t\) distribution with \(df=n-2\) is used to obtain the p-value.

Minitab will give you the p-value for a two-tailed test (i.e., \(H_a: \rho \neq 0\)). If you are conducting a one-tailed test you will need to divide the p-value in the output by 2.

If \(p \leq \alpha\) reject the null hypothesis, there is evidence of a relationship in the population.

If \(p>\alpha\) fail to reject the null hypothesis, there is not enough evidence of a relationship in the population.

Based on your decision in Step 4, write a conclusion in terms of the original research question.

12.2.1.1 - Example: Quiz & Exam Scores

Example: quiz and exam scores.

Is there a relationship between students' quiz averages in a course and their final exam scores in the course?

Let's use the 5 step hypothesis testing procedure to address this process research question.

In order to use Pearson's \(r\) both variables must be quantitative and the relationship between \(x\) and \(y\) must be linear. We can use Minitab to create the scatterplot using the file: Exam.mpx

Note that when creating the scatterplot it does not matter what you designate as the x or y axis. We get the following which shows a fairly linear relationship.

Scatter plot with exam scores on the x-axis and quiz scores on the y-axis

Our hypotheses:

  • Null Hypothesis, \(H_{0}\): \(\rho=0\)
  • Alternative Hypothesis, \(H_{a}\): \(\rho\ne0\)

Use Minitab to compute \(r\) and the p-value.

  • Open the file in Minitab
  • Select Stat > Basic Statistics > Correlation
  • Enter the columns Quiz_Average and Final in the Variables box
  • Select the Results button and check the Pairwise correlation table in the new window

Pairwise Pearson Correlations

Our sample statistic r = 0.609.

From our output the p-value is 0.000.

There is evidence of a relationship between students' quiz averages and their final exam scores in the population.

12.2.1.2 - Example: Age & Height

Data concerning body measurements from 507 adults retrieved from body.dat.txt for more information see body.txt. In this example, we will use the variables of age (in years) and height (in centimeters) only.

For the full data set and descriptions see the original files:

  • body.dat.txt

For this example, you can use the following Minitab file: body.dat.mpx

Research question:  Is there a relationship between age and height in adults?

Age (in years) and height (in centimeters) are both quantitative variables. From the scatterplot below we can see that the relationship is linear (or at least not non-linear).

Scatterplot of Height (cm) vs Age

\(H_0: \rho = 0\) \(H_a: \rho \neq 0\)

From Minitab:

\(r=0.068\)

\(p > \alpha\) therefore we fail to reject the null hypothesis.

There is not enough evidence of a relationship between age and height in the population from which this sample was drawn.

12.2.1.3 - Example: Temperature & Coffee Sales

Data concerning sales at student-run cafe were retrieved from cafedata.xls more information about this data set available at cafedata.txt. Let's determine if there is a statistically significant relationship between the maximum daily temperature and coffee sales.

  • CAFEDATA.XLS
  • CAFEDATA.TXT

For this example, you can use the following Minitab file: cafedata.mpx

Maximum daily temperature and coffee sales are both quantitative variables. From the scatterplot below we can see that the relationship is linear.

Scatterplot of Coffees vs Max Daily Temperature (F)

\(r=-0.741\)

\(p \leq \alpha\) therefore we reject the null hypothesis.

There is evidence of a relationship between the maximum daily temperature and coffee sales in the population.

Logo for Kwantlen Polytechnic University

Want to create or adapt books like this? Learn more about how Pressbooks supports open publishing practices.

11 Hypothesis testing

The process of induction is the process of assuming the simplest law that can be made to harmonize with our experience. This process, however, has no logical foundation but only a psychological one. It is clear that there are no grounds for believing that the simplest course of events will really happen. It is an hypothesis that the sun will rise tomorrow: and this means that we do not know whether it will rise. – Ludwig Wittgenstein 157

In the last chapter, I discussed the ideas behind estimation, which is one of the two “big ideas” in inferential statistics. It’s now time to turn out attention to the other big idea, which is hypothesis testing . In its most abstract form, hypothesis testing really a very simple idea: the researcher has some theory about the world, and wants to determine whether or not the data actually support that theory. However, the details are messy, and most people find the theory of hypothesis testing to be the most frustrating part of statistics. The structure of the chapter is as follows. Firstly, I’ll describe how hypothesis testing works, in a fair amount of detail, using a simple running example to show you how a hypothesis test is “built”. I’ll try to avoid being too dogmatic while doing so, and focus instead on the underlying logic of the testing procedure. 158 Afterwards, I’ll spend a bit of time talking about the various dogmas, rules and heresies that surround the theory of hypothesis testing.

11.1 A menagerie of hypotheses

Eventually we all succumb to madness. For me, that day will arrive once I’m finally promoted to full professor. Safely ensconced in my ivory tower, happily protected by tenure, I will finally be able to take leave of my senses (so to speak), and indulge in that most thoroughly unproductive line of psychological research: the search for extrasensory perception (ESP). 159

Let’s suppose that this glorious day has come. My first study is a simple one, in which I seek to test whether clairvoyance exists. Each participant sits down at a table, and is shown a card by an experimenter. The card is black on one side and white on the other. The experimenter takes the card away, and places it on a table in an adjacent room. The card is placed black side up or white side up completely at random, with the randomisation occurring only after the experimenter has left the room with the participant. A second experimenter comes in and asks the participant which side of the card is now facing upwards. It’s purely a one-shot experiment. Each person sees only one card, and gives only one answer; and at no stage is the participant actually in contact with someone who knows the right answer. My data set, therefore, is very simple. I have asked the question of \(N\) people, and some number \(X\) of these people have given the correct response. To make things concrete, let’s suppose that I have tested \(N = 100\) people, and \(X = 62\) of these got the answer right… a surprisingly large number, sure, but is it large enough for me to feel safe in claiming I’ve found evidence for ESP? This is the situation where hypothesis testing comes in useful. However, before we talk about how to test hypotheses, we need to be clear about what we mean by hypotheses.

11.1.1 Research hypotheses versus statistical hypotheses

The first distinction that you need to keep clear in your mind is between research hypotheses and statistical hypotheses. In my ESP study, my overall scientific goal is to demonstrate that clairvoyance exists. In this situation, I have a clear research goal: I am hoping to discover evidence for ESP. In other situations I might actually be a lot more neutral than that, so I might say that my research goal is to determine whether or not clairvoyance exists. Regardless of how I want to portray myself, the basic point that I’m trying to convey here is that a research hypothesis involves making a substantive, testable scientific claim… if you are a psychologist, then your research hypotheses are fundamentally about psychological constructs. Any of the following would count as research hypotheses :

  • Listening to music reduces your ability to pay attention to other things. This is a claim about the causal relationship between two psychologically meaningful concepts (listening to music and paying attention to things), so it’s a perfectly reasonable research hypothesis.
  • Intelligence is related to personality . Like the last one, this is a relational claim about two psychological constructs (intelligence and personality), but the claim is weaker: correlational not causal.
  • Intelligence is* speed of information processing . This hypothesis has a quite different character: it’s not actually a relational claim at all. It’s an ontological claim about the fundamental character of intelligence (and I’m pretty sure it’s wrong). It’s worth expanding on this one actually: It’s usually easier to think about how to construct experiments to test research hypotheses of the form “does X affect Y?” than it is to address claims like “what is X?” And in practice, what usually happens is that you find ways of testing relational claims that follow from your ontological ones. For instance, if I believe that intelligence is* speed of information processing in the brain, my experiments will often involve looking for relationships between measures of intelligence and measures of speed. As a consequence, most everyday research questions do tend to be relational in nature, but they’re almost always motivated by deeper ontological questions about the state of nature.

Notice that in practice, my research hypotheses could overlap a lot. My ultimate goal in the ESP experiment might be to test an ontological claim like “ESP exists”, but I might operationally restrict myself to a narrower hypothesis like “Some people can `see’ objects in a clairvoyant fashion”. That said, there are some things that really don’t count as proper research hypotheses in any meaningful sense:

  • Love is a battlefield . This is too vague to be testable. While it’s okay for a research hypothesis to have a degree of vagueness to it, it has to be possible to operationalise your theoretical ideas. Maybe I’m just not creative enough to see it, but I can’t see how this can be converted into any concrete research design. If that’s true, then this isn’t a scientific research hypothesis, it’s a pop song. That doesn’t mean it’s not interesting – a lot of deep questions that humans have fall into this category. Maybe one day science will be able to construct testable theories of love, or to test to see if God exists, and so on; but right now we can’t, and I wouldn’t bet on ever seeing a satisfying scientific approach to either.
  • The first rule of tautology club is the first rule of tautology club . This is not a substantive claim of any kind. It’s true by definition. No conceivable state of nature could possibly be inconsistent with this claim. As such, we say that this is an unfalsifiable hypothesis, and as such it is outside the domain of science. Whatever else you do in science, your claims must have the possibility of being wrong.
  • More people in my experiment will say “yes” than “no” . This one fails as a research hypothesis because it’s a claim about the data set, not about the psychology (unless of course your actual research question is whether people have some kind of “yes” bias!). As we’ll see shortly, this hypothesis is starting to sound more like a statistical hypothesis than a research hypothesis.

As you can see, research hypotheses can be somewhat messy at times; and ultimately they are scientific claims. Statistical hypotheses are neither of these two things. Statistical hypotheses must be mathematically precise, and they must correspond to specific claims about the characteristics of the data generating mechanism (i.e., the “population”). Even so, the intent is that statistical hypotheses bear a clear relationship to the substantive research hypotheses that you care about! For instance, in my ESP study my research hypothesis is that some people are able to see through walls or whatever. What I want to do is to “map” this onto a statement about how the data were generated. So let’s think about what that statement would be. The quantity that I’m interested in within the experiment is \(P(\mbox{“correct”})\) , the true-but-unknown probability with which the participants in my experiment answer the question correctly. Let’s use the Greek letter \(\theta\) (theta) to refer to this probability. Here are four different statistical hypotheses:

  • If ESP doesn’t exist and if my experiment is well designed, then my participants are just guessing. So I should expect them to get it right half of the time and so my statistical hypothesis is that the true probability of choosing correctly is \(\theta = 0.5\) .
  • Alternatively, suppose ESP does exist and participants can see the card. If that’s true, people will perform better than chance. The statistical hypotheis would be that \(\theta > 0.5\) .
  • A third possibility is that ESP does exist, but the colours are all reversed and people don’t realise it (okay, that’s wacky, but you never know…). If that’s how it works then you’d expect people’s performance to be below chance. This would correspond to a statistical hypothesis that \(\theta < 0.5\) .
  • Finally, suppose ESP exists, but I have no idea whether people are seeing the right colour or the wrong one. In that case, the only claim I could make about the data would be that the probability of making the correct answer is not equal to 50. This corresponds to the statistical hypothesis that \(\theta \neq 0.5\) .

All of these are legitimate examples of a statistical hypothesis because they are statements about a population parameter and are meaningfully related to my experiment.

What this discussion makes clear, I hope, is that when attempting to construct a statistical hypothesis test the researcher actually has two quite distinct hypotheses to consider. First, he or she has a research hypothesis (a claim about psychology), and this corresponds to a statistical hypothesis (a claim about the data generating population). In my ESP example, these might be

And the key thing to recognise is this: a statistical hypothesis test is a test of the statistical hypothesis, not the research hypothesis . If your study is badly designed, then the link between your research hypothesis and your statistical hypothesis is broken. To give a silly example, suppose that my ESP study was conducted in a situation where the participant can actually see the card reflected in a window; if that happens, I would be able to find very strong evidence that \(\theta \neq 0.5\) , but this would tell us nothing about whether “ESP exists”.

11.1.2 Null hypotheses and alternative hypotheses

So far, so good. I have a research hypothesis that corresponds to what I want to believe about the world, and I can map it onto a statistical hypothesis that corresponds to what I want to believe about how the data were generated. It’s at this point that things get somewhat counterintuitive for a lot of people. Because what I’m about to do is invent a new statistical hypothesis (the “null” hypothesis, \(H_0\) ) that corresponds to the exact opposite of what I want to believe, and then focus exclusively on that, almost to the neglect of the thing I’m actually interested in (which is now called the “alternative” hypothesis, \(H_1\) ). In our ESP example, the null hypothesis is that \(\theta = 0.5\) , since that’s what we’d expect if ESP didn’t exist. My hope, of course, is that ESP is totally real, and so the alternative to this null hypothesis is \(\theta \neq 0.5\) . In essence, what we’re doing here is dividing up the possible values of \(\theta\) into two groups: those values that I really hope aren’t true (the null), and those values that I’d be happy with if they turn out to be right (the alternative). Having done so, the important thing to recognise is that the goal of a hypothesis test is not to show that the alternative hypothesis is (probably) true; the goal is to show that the null hypothesis is (probably) false. Most people find this pretty weird.

The best way to think about it, in my experience, is to imagine that a hypothesis test is a criminal trial 160 … the trial of the null hypothesis . The null hypothesis is the defendant, the researcher is the prosecutor, and the statistical test itself is the judge. Just like a criminal trial, there is a presumption of innocence: the null hypothesis is deemed to be true unless you, the researcher, can prove beyond a reasonable doubt that it is false. You are free to design your experiment however you like (within reason, obviously!), and your goal when doing so is to maximise the chance that the data will yield a conviction… for the crime of being false. The catch is that the statistical test sets the rules of the trial, and those rules are designed to protect the null hypothesis – specifically to ensure that if the null hypothesis is actually true, the chances of a false conviction are guaranteed to be low. This is pretty important: after all, the null hypothesis doesn’t get a lawyer. And given that the researcher is trying desperately to prove it to be false, someone has to protect it.

11.2 Two types of errors

Before going into details about how a statistical test is constructed, it’s useful to understand the philosophy behind it. I hinted at it when pointing out the similarity between a null hypothesis test and a criminal trial, but I should now be explicit. Ideally, we would like to construct our test so that we never make any errors. Unfortunately, since the world is messy, this is never possible. Sometimes you’re just really unlucky: for instance, suppose you flip a coin 10 times in a row and it comes up heads all 10 times. That feels like very strong evidence that the coin is biased (and it is!), but of course there’s a 1 in 1024 chance that this would happen even if the coin was totally fair. In other words, in real life we always have to accept that there’s a chance that we did the wrong thing. As a consequence, the goal behind statistical hypothesis testing is not to eliminate errors, but to minimise them.

At this point, we need to be a bit more precise about what we mean by “errors”. Firstly, let’s state the obvious: it is either the case that the null hypothesis is true, or it is false; and our test will either reject the null hypothesis or retain it. 161 So, as the table below illustrates, after we run the test and make our choice, one of four things might have happened:

As a consequence there are actually two different types of error here. If we reject a null hypothesis that is actually true, then we have made a type I error . On the other hand, if we retain the null hypothesis when it is in fact false, then we have made a type II error .

Remember how I said that statistical testing was kind of like a criminal trial? Well, I meant it. A criminal trial requires that you establish “beyond a reasonable doubt” that the defendant did it. All of the evidentiary rules are (in theory, at least) designed to ensure that there’s (almost) no chance of wrongfully convicting an innocent defendant. The trial is designed to protect the rights of a defendant: as the English jurist William Blackstone famously said, it is “better that ten guilty persons escape than that one innocent suffer.” In other words, a criminal trial doesn’t treat the two types of error in the same way~… punishing the innocent is deemed to be much worse than letting the guilty go free. A statistical test is pretty much the same: the single most important design principle of the test is to control the probability of a type I error, to keep it below some fixed probability. This probability, which is denoted \(\alpha\) , is called the significance level of the test (or sometimes, the size of the test). And I’ll say it again, because it is so central to the whole set-up~… a hypothesis test is said to have significance level \(\alpha\) if the type I error rate is no larger than \(\alpha\) .

So, what about the type II error rate? Well, we’d also like to keep those under control too, and we denote this probability by \(\beta\) . However, it’s much more common to refer to the power of the test, which is the probability with which we reject a null hypothesis when it really is false, which is \(1-\beta\) . To help keep this straight, here’s the same table again, but with the relevant numbers added:

A “powerful” hypothesis test is one that has a small value of \(\beta\) , while still keeping \(\alpha\) fixed at some (small) desired level. By convention, scientists make use of three different \(\alpha\) levels: \(.05\) , \(.01\) and \(.001\) . Notice the asymmetry here~… the tests are designed to ensure that the \(\alpha\) level is kept small, but there’s no corresponding guarantee regarding \(\beta\) . We’d certainly like the type II error rate to be small, and we try to design tests that keep it small, but this is very much secondary to the overwhelming need to control the type I error rate. As Blackstone might have said if he were a statistician, it is “better to retain 10 false null hypotheses than to reject a single true one”. To be honest, I don’t know that I agree with this philosophy – there are situations where I think it makes sense, and situations where I think it doesn’t – but that’s neither here nor there. It’s how the tests are built.

11.3 Test statistics and sampling distributions

At this point we need to start talking specifics about how a hypothesis test is constructed. To that end, let’s return to the ESP example. Let’s ignore the actual data that we obtained, for the moment, and think about the structure of the experiment. Regardless of what the actual numbers are, the form of the data is that \(X\) out of \(N\) people correctly identified the colour of the hidden card. Moreover, let’s suppose for the moment that the null hypothesis really is true: ESP doesn’t exist, and the true probability that anyone picks the correct colour is exactly \(\theta = 0.5\) . What would we expect the data to look like? Well, obviously, we’d expect the proportion of people who make the correct response to be pretty close to 50%. Or, to phrase this in more mathematical terms, we’d say that \(X/N\) is approximately \(0.5\) . Of course, we wouldn’t expect this fraction to be exactly 0.5: if, for example we tested \(N=100\) people, and \(X = 53\) of them got the question right, we’d probably be forced to concede that the data are quite consistent with the null hypothesis. On the other hand, if \(X = 99\) of our participants got the question right, then we’d feel pretty confident that the null hypothesis is wrong. Similarly, if only \(X=3\) people got the answer right, we’d be similarly confident that the null was wrong. Let’s be a little more technical about this: we have a quantity \(X\) that we can calculate by looking at our data; after looking at the value of \(X\) , we make a decision about whether to believe that the null hypothesis is correct, or to reject the null hypothesis in favour of the alternative. The name for this thing that we calculate to guide our choices is a test statistic .

Having chosen a test statistic, the next step is to state precisely which values of the test statistic would cause is to reject the null hypothesis, and which values would cause us to keep it. In order to do so, we need to determine what the sampling distribution of the test statistic would be if the null hypothesis were actually true (we talked about sampling distributions earlier in Section 10.3.1 ). Why do we need this? Because this distribution tells us exactly what values of \(X\) our null hypothesis would lead us to expect. And therefore, we can use this distribution as a tool for assessing how closely the null hypothesis agrees with our data.

The sampling distribution for our test statistic $X$ when the null hypothesis is true. For our ESP scenario, this is a binomial distribution. Not surprisingly, since the null hypothesis says that the probability of a correct response is $\theta = .5$, the sampling distribution says that the most likely value is 50 (our of 100) correct responses. Most of the probability mass lies between 40 and 60.

Figure 11.1: The sampling distribution for our test statistic \(X\) when the null hypothesis is true. For our ESP scenario, this is a binomial distribution. Not surprisingly, since the null hypothesis says that the probability of a correct response is \(\theta = .5\) , the sampling distribution says that the most likely value is 50 (our of 100) correct responses. Most of the probability mass lies between 40 and 60.

How do we actually determine the sampling distribution of the test statistic? For a lot of hypothesis tests this step is actually quite complicated, and later on in the book you’ll see me being slightly evasive about it for some of the tests (some of them I don’t even understand myself). However, sometimes it’s very easy. And, fortunately for us, our ESP example provides us with one of the easiest cases. Our population parameter \(\theta\) is just the overall probability that people respond correctly when asked the question, and our test statistic \(X\) is the count of the number of people who did so, out of a sample size of \(N\) . We’ve seen a distribution like this before, in Section 9.4 : that’s exactly what the binomial distribution describes! So, to use the notation and terminology that I introduced in that section, we would say that the null hypothesis predicts that \(X\) is binomially distributed, which is written \[ X \sim \mbox{Binomial}(\theta,N) \] Since the null hypothesis states that \(\theta = 0.5\) and our experiment has \(N=100\) people, we have the sampling distribution we need. This sampling distribution is plotted in Figure 11.1 . No surprises really: the null hypothesis says that \(X=50\) is the most likely outcome, and it says that we’re almost certain to see somewhere between 40 and 60 correct responses.

11.4 Making decisions

Okay, we’re very close to being finished. We’ve constructed a test statistic ( \(X\) ), and we chose this test statistic in such a way that we’re pretty confident that if \(X\) is close to \(N/2\) then we should retain the null, and if not we should reject it. The question that remains is this: exactly which values of the test statistic should we associate with the null hypothesis, and which exactly values go with the alternative hypothesis? In my ESP study, for example, I’ve observed a value of \(X=62\) . What decision should I make? Should I choose to believe the null hypothesis, or the alternative hypothesis?

11.4.1 Critical regions and critical values

To answer this question, we need to introduce the concept of a critical region for the test statistic \(X\) . The critical region of the test corresponds to those values of \(X\) that would lead us to reject null hypothesis (which is why the critical region is also sometimes called the rejection region). How do we find this critical region? Well, let’s consider what we know:

  • \(X\) should be very big or very small in order to reject the null hypothesis.
  • If the null hypothesis is true, the sampling distribution of \(X\) is Binomial \((0.5, N)\) .
  • If \(\alpha =.05\) , the critical region must cover 5% of this sampling distribution.

It’s important to make sure you understand this last point: the critical region corresponds to those values of \(X\) for which we would reject the null hypothesis, and the sampling distribution in question describes the probability that we would obtain a particular value of \(X\) if the null hypothesis were actually true. Now, let’s suppose that we chose a critical region that covers 20% of the sampling distribution, and suppose that the null hypothesis is actually true. What would be the probability of incorrectly rejecting the null? The answer is of course 20%. And therefore, we would have built a test that had an \(\alpha\) level of \(0.2\) . If we want \(\alpha = .05\) , the critical region is only allowed to cover 5% of the sampling distribution of our test statistic.

hypothesis testing pearson r

Figure 11.2: The critical region associated with the hypothesis test for the ESP study, for a hypothesis test with a significance level of \(\alpha = .05\) . The plot itself shows the sampling distribution of \(X\) under the null hypothesis: the grey bars correspond to those values of \(X\) for which we would retain the null hypothesis. The black bars show the critical region: those values of \(X\) for which we would reject the null. Because the alternative hypothesis is two sided (i.e., allows both \(\theta <.5\) and \(\theta >.5\) ), the critical region covers both tails of the distribution. To ensure an \(\alpha\) level of \(.05\) , we need to ensure that each of the two regions encompasses 2.5% of the sampling distribution.

As it turns out, those three things uniquely solve the problem: our critical region consists of the most extreme values , known as the tails of the distribution. This is illustrated in Figure 11.2 . As it turns out, if we want \(\alpha = .05\) , then our critical regions correspond to \(X \leq 40\) and \(X \geq 60\) . 162 That is, if the number of people saying “true” is between 41 and 59, then we should retain the null hypothesis. If the number is between 0 to 40 or between 60 to 100, then we should reject the null hypothesis. The numbers 40 and 60 are often referred to as the critical values , since they define the edges of the critical region.

At this point, our hypothesis test is essentially complete: (1) we choose an \(\alpha\) level (e.g., \(\alpha = .05\) , (2) come up with some test statistic (e.g., \(X\) ) that does a good job (in some meaningful sense) of comparing \(H_0\) to \(H_1\) , (3) figure out the sampling distribution of the test statistic on the assumption that the null hypothesis is true (in this case, binomial) and then (4) calculate the critical region that produces an appropriate \(\alpha\) level (0-40 and 60-100). All that we have to do now is calculate the value of the test statistic for the real data (e.g., \(X = 62\) ) and then compare it to the critical values to make our decision. Since 62 is greater than the critical value of 60, we would reject the null hypothesis. Or, to phrase it slightly differently, we say that the test has produced a significant result.

11.4.2 A note on statistical “significance”

Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners. – Attributed to G. O. Ashley 163

A very brief digression is in order at this point, regarding the word “significant”. The concept of statistical significance is actually a very simple one, but has a very unfortunate name. If the data allow us to reject the null hypothesis, we say that “the result is statistically significant ”, which is often shortened to “the result is significant”. This terminology is rather old, and dates back to a time when “significant” just meant something like “indicated”, rather than its modern meaning, which is much closer to “important”. As a result, a lot of modern readers get very confused when they start learning statistics, because they think that a “significant result” must be an important one. It doesn’t mean that at all. All that “statistically significant” means is that the data allowed us to reject a null hypothesis. Whether or not the result is actually important in the real world is a very different question, and depends on all sorts of other things.

11.4.3 The difference between one sided and two sided tests

There’s one more thing I want to point out about the hypothesis test that I’ve just constructed. If we take a moment to think about the statistical hypotheses I’ve been using, \[ \begin{array}{cc} H_0 : & \theta = .5 \\ H_1 : & \theta \neq .5 \end{array} \] we notice that the alternative hypothesis covers both the possibility that \(\theta < .5\) and the possibility that \(\theta > .5\) . This makes sense if I really think that ESP could produce better-than-chance performance or worse-than-chance performance (and there are some people who think that). In statistical language, this is an example of a two-sided test . It’s called this because the alternative hypothesis covers the area on both “sides” of the null hypothesis, and as a consequence the critical region of the test covers both tails of the sampling distribution (2.5% on either side if \(\alpha =.05\) ), as illustrated earlier in Figure 11.2 .

However, that’s not the only possibility. It might be the case, for example, that I’m only willing to believe in ESP if it produces better than chance performance. If so, then my alternative hypothesis would only covers the possibility that \(\theta > .5\) , and as a consequence the null hypothesis now becomes \(\theta \leq .5\) : \[ \begin{array}{cc} H_0 : & \theta \leq .5 \\ H_1 : & \theta > .5 \end{array} \] When this happens, we have what’s called a one-sided test , and when this happens the critical region only covers one tail of the sampling distribution. This is illustrated in Figure 11.3 .

hypothesis testing pearson r

Figure 11.3: The critical region for a one sided test. In this case, the alternative hypothesis is that \(\theta > .05\) , so we would only reject the null hypothesis for large values of \(X\) . As a consequence, the critical region only covers the upper tail of the sampling distribution; specifically the upper 5% of the distribution. Contrast this to the two-sided version earlier)

11.5 The \(p\) value of a test

In one sense, our hypothesis test is complete; we’ve constructed a test statistic, figured out its sampling distribution if the null hypothesis is true, and then constructed the critical region for the test. Nevertheless, I’ve actually omitted the most important number of all: the \(p\) value . It is to this topic that we now turn. There are two somewhat different ways of interpreting a \(p\) value, one proposed by Sir Ronald Fisher and the other by Jerzy Neyman. Both versions are legitimate, though they reflect very different ways of thinking about hypothesis tests. Most introductory textbooks tend to give Fisher’s version only, but I think that’s a bit of a shame. To my mind, Neyman’s version is cleaner, and actually better reflects the logic of the null hypothesis test. You might disagree though, so I’ve included both. I’ll start with Neyman’s version…

11.5.1 A softer view of decision making

One problem with the hypothesis testing procedure that I’ve described is that it makes no distinction at all between a result this “barely significant” and those that are “highly significant”. For instance, in my ESP study the data I obtained only just fell inside the critical region – so I did get a significant effect, but was a pretty near thing. In contrast, suppose that I’d run a study in which \(X=97\) out of my \(N=100\) participants got the answer right. This would obviously be significant too, but my a much larger margin; there’s really no ambiguity about this at all. The procedure that I described makes no distinction between the two. If I adopt the standard convention of allowing \(\alpha = .05\) as my acceptable Type I error rate, then both of these are significant results.

This is where the \(p\) value comes in handy. To understand how it works, let’s suppose that we ran lots of hypothesis tests on the same data set: but with a different value of \(\alpha\) in each case. When we do that for my original ESP data, what we’d get is something like this

When we test ESP data ( \(X=62\) successes out of \(N=100\) observations) using \(\alpha\) levels of .03 and above, we’d always find ourselves rejecting the null hypothesis. For \(\alpha\) levels of .02 and below, we always end up retaining the null hypothesis. Therefore, somewhere between .02 and .03 there must be a smallest value of \(\alpha\) that would allow us to reject the null hypothesis for this data. This is the \(p\) value; as it turns out the ESP data has \(p = .021\) . In short:

\(p\) is defined to be the smallest Type I error rate ( \(\alpha\) ) that you have to be willing to tolerate if you want to reject the null hypothesis.

If it turns out that \(p\) describes an error rate that you find intolerable, then you must retain the null. If you’re comfortable with an error rate equal to \(p\) , then it’s okay to reject the null hypothesis in favour of your preferred alternative.

In effect, \(p\) is a summary of all the possible hypothesis tests that you could have run, taken across all possible \(\alpha\) values. And as a consequence it has the effect of “softening” our decision process. For those tests in which \(p \leq \alpha\) you would have rejected the null hypothesis, whereas for those tests in which \(p > \alpha\) you would have retained the null. In my ESP study I obtained \(X=62\) , and as a consequence I’ve ended up with \(p = .021\) . So the error rate I have to tolerate is 2.1%. In contrast, suppose my experiment had yielded \(X=97\) . What happens to my \(p\) value now? This time it’s shrunk to \(p = 1.36 \times 10^{-25}\) , which is a tiny, tiny 164 Type I error rate. For this second case I would be able to reject the null hypothesis with a lot more confidence, because I only have to be “willing” to tolerate a type I error rate of about 1 in 10 trillion trillion in order to justify my decision to reject.

11.5.2 The probability of extreme data

The second definition of the \(p\) -value comes from Sir Ronald Fisher, and it’s actually this one that you tend to see in most introductory statistics textbooks. Notice how, when I constructed the critical region, it corresponded to the tails (i.e., extreme values) of the sampling distribution? That’s not a coincidence: almost all “good” tests have this characteristic (good in the sense of minimising our type II error rate, \(\beta\) ). The reason for that is that a good critical region almost always corresponds to those values of the test statistic that are least likely to be observed if the null hypothesis is true. If this rule is true, then we can define the \(p\) -value as the probability that we would have observed a test statistic that is at least as extreme as the one we actually did get. In other words, if the data are extremely implausible according to the null hypothesis, then the null hypothesis is probably wrong.

11.5.3 A common mistake

Okay, so you can see that there are two rather different but legitimate ways to interpret the \(p\) value, one based on Neyman’s approach to hypothesis testing and the other based on Fisher’s. Unfortunately, there is a third explanation that people sometimes give, especially when they’re first learning statistics, and it is absolutely and completely wrong . This mistaken approach is to refer to the \(p\) value as “the probability that the null hypothesis is true”. It’s an intuitively appealing way to think, but it’s wrong in two key respects: (1) null hypothesis testing is a frequentist tool, and the frequentist approach to probability does not allow you to assign probabilities to the null hypothesis… according to this view of probability, the null hypothesis is either true or it is not; it cannot have a “5% chance” of being true. (2) even within the Bayesian approach, which does let you assign probabilities to hypotheses, the \(p\) value would not correspond to the probability that the null is true; this interpretation is entirely inconsistent with the mathematics of how the \(p\) value is calculated. Put bluntly, despite the intuitive appeal of thinking this way, there is no justification for interpreting a \(p\) value this way. Never do it.

11.6 Reporting the results of a hypothesis test

When writing up the results of a hypothesis test, there’s usually several pieces of information that you need to report, but it varies a fair bit from test to test. Throughout the rest of the book I’ll spend a little time talking about how to report the results of different tests (see Section 12.1.9 for a particularly detailed example), so that you can get a feel for how it’s usually done. However, regardless of what test you’re doing, the one thing that you always have to do is say something about the \(p\) value, and whether or not the outcome was significant.

The fact that you have to do this is unsurprising; it’s the whole point of doing the test. What might be surprising is the fact that there is some contention over exactly how you’re supposed to do it. Leaving aside those people who completely disagree with the entire framework underpinning null hypothesis testing, there’s a certain amount of tension that exists regarding whether or not to report the exact \(p\) value that you obtained, or if you should state only that \(p < \alpha\) for a significance level that you chose in advance (e.g., \(p<.05\) ).

11.6.1 The issue

To see why this is an issue, the key thing to recognise is that \(p\) values are terribly convenient. In practice, the fact that we can compute a \(p\) value means that we don’t actually have to specify any \(\alpha\) level at all in order to run the test. Instead, what you can do is calculate your \(p\) value and interpret it directly: if you get \(p = .062\) , then it means that you’d have to be willing to tolerate a Type I error rate of 6.2% to justify rejecting the null. If you personally find 6.2% intolerable, then you retain the null. Therefore, the argument goes, why don’t we just report the actual \(p\) value and let the reader make up their own minds about what an acceptable Type I error rate is? This approach has the big advantage of “softening” the decision making process – in fact, if you accept the Neyman definition of the \(p\) value, that’s the whole point of the \(p\) value. We no longer have a fixed significance level of \(\alpha = .05\) as a bright line separating “accept” from “reject” decisions; and this removes the rather pathological problem of being forced to treat \(p = .051\) in a fundamentally different way to \(p = .049\) .

This flexibility is both the advantage and the disadvantage to the \(p\) value. The reason why a lot of people don’t like the idea of reporting an exact \(p\) value is that it gives the researcher a bit too much freedom. In particular, it lets you change your mind about what error tolerance you’re willing to put up with after you look at the data. For instance, consider my ESP experiment. Suppose I ran my test, and ended up with a \(p\) value of .09. Should I accept or reject? Now, to be honest, I haven’t yet bothered to think about what level of Type I error I’m “really” willing to accept. I don’t have an opinion on that topic. But I do have an opinion about whether or not ESP exists, and I definitely have an opinion about whether my research should be published in a reputable scientific journal. And amazingly, now that I’ve looked at the data I’m starting to think that a 9% error rate isn’t so bad, especially when compared to how annoying it would be to have to admit to the world that my experiment has failed. So, to avoid looking like I just made it up after the fact, I now say that my \(\alpha\) is .1: a 10% type I error rate isn’t too bad, and at that level my test is significant! I win.

In other words, the worry here is that I might have the best of intentions, and be the most honest of people, but the temptation to just “shade” things a little bit here and there is really, really strong. As anyone who has ever run an experiment can attest, it’s a long and difficult process, and you often get very attached to your hypotheses. It’s hard to let go and admit the experiment didn’t find what you wanted it to find. And that’s the danger here. If we use the “raw” \(p\) -value, people will start interpreting the data in terms of what they want to believe, not what the data are actually saying… and if we allow that, well, why are we bothering to do science at all? Why not let everyone believe whatever they like about anything, regardless of what the facts are? Okay, that’s a bit extreme, but that’s where the worry comes from. According to this view, you really must specify your \(\alpha\) value in advance, and then only report whether the test was significant or not. It’s the only way to keep ourselves honest.

11.6.2 Two proposed solutions

In practice, it’s pretty rare for a researcher to specify a single \(\alpha\) level ahead of time. Instead, the convention is that scientists rely on three standard significance levels: .05, .01 and .001. When reporting your results, you indicate which (if any) of these significance levels allow you to reject the null hypothesis. This is summarised in Table 11.1 . This allows us to soften the decision rule a little bit, since \(p<.01\) implies that the data meet a stronger evidentiary standard than \(p<.05\) would. Nevertheless, since these levels are fixed in advance by convention, it does prevent people choosing their \(\alpha\) level after looking at the data.

Nevertheless, quite a lot of people still prefer to report exact \(p\) values. To many people, the advantage of allowing the reader to make up their own mind about how to interpret \(p = .06\) outweighs any disadvantages. In practice, however, even among those researchers who prefer exact \(p\) values it is quite common to just write \(p<.001\) instead of reporting an exact value for small \(p\) . This is in part because a lot of software doesn’t actually print out the \(p\) value when it’s that small (e.g., SPSS just writes \(p = .000\) whenever \(p<.001\) ), and in part because a very small \(p\) value can be kind of misleading. The human mind sees a number like .0000000001 and it’s hard to suppress the gut feeling that the evidence in favour of the alternative hypothesis is a near certainty. In practice however, this is usually wrong. Life is a big, messy, complicated thing: and every statistical test ever invented relies on simplifications, approximations and assumptions. As a consequence, it’s probably not reasonable to walk away from any statistical analysis with a feeling of confidence stronger than \(p<.001\) implies. In other words, \(p<.001\) is really code for “as far as this test is concerned, the evidence is overwhelming.”

In light of all this, you might be wondering exactly what you should do. There’s a fair bit of contradictory advice on the topic, with some people arguing that you should report the exact \(p\) value, and other people arguing that you should use the tiered approach illustrated in Table 11.1 . As a result, the best advice I can give is to suggest that you look at papers/reports written in your field and see what the convention seems to be. If there doesn’t seem to be any consistent pattern, then use whichever method you prefer.

11.7 Running the hypothesis test in practice

At this point some of you might be wondering if this is a “real” hypothesis test, or just a toy example that I made up. It’s real. In the previous discussion I built the test from first principles, thinking that it was the simplest possible problem that you might ever encounter in real life. However, this test already exists: it’s called the binomial test , and it’s implemented by an R function called binom.test() . To test the null hypothesis that the response probability is one-half p = .5 , 165 using data in which x = 62 of n = 100 people made the correct response, here’s how to do it in R:

Right now, this output looks pretty unfamiliar to you, but you can see that it’s telling you more or less the right things. Specifically, the \(p\) -value of 0.02 is less than the usual choice of \(\alpha = .05\) , so you can reject the null. We’ll talk a lot more about how to read this sort of output as we go along; and after a while you’ll hopefully find it quite easy to read and understand. For now, however, I just wanted to make the point that R contains a whole lot of functions corresponding to different kinds of hypothesis test. And while I’ll usually spend quite a lot of time explaining the logic behind how the tests are built, every time I discuss a hypothesis test the discussion will end with me showing you a fairly simple R command that you can use to run the test in practice.

11.8 Effect size, sample size and power

In previous sections I’ve emphasised the fact that the major design principle behind statistical hypothesis testing is that we try to control our Type I error rate. When we fix \(\alpha = .05\) we are attempting to ensure that only 5% of true null hypotheses are incorrectly rejected. However, this doesn’t mean that we don’t care about Type II errors. In fact, from the researcher’s perspective, the error of failing to reject the null when it is actually false is an extremely annoying one. With that in mind, a secondary goal of hypothesis testing is to try to minimise \(\beta\) , the Type II error rate, although we don’t usually talk in terms of minimising Type II errors. Instead, we talk about maximising the power of the test. Since power is defined as \(1-\beta\) , this is the same thing.

11.8.1 The power function

Sampling distribution under the *alternative* hypothesis, for a population parameter value of $\theta = 0.55$. A reasonable proportion of the distribution lies in the rejection region.

Figure 11.4: Sampling distribution under the alternative hypothesis, for a population parameter value of \(\theta = 0.55\) . A reasonable proportion of the distribution lies in the rejection region.

Let’s take a moment to think about what a Type II error actually is. A Type II error occurs when the alternative hypothesis is true, but we are nevertheless unable to reject the null hypothesis. Ideally, we’d be able to calculate a single number \(\beta\) that tells us the Type II error rate, in the same way that we can set \(\alpha = .05\) for the Type I error rate. Unfortunately, this is a lot trickier to do. To see this, notice that in my ESP study the alternative hypothesis actually corresponds to lots of possible values of \(\theta\) . In fact, the alternative hypothesis corresponds to every value of \(\theta\) except 0.5. Let’s suppose that the true probability of someone choosing the correct response is 55% (i.e., \(\theta = .55\) ). If so, then the true sampling distribution for \(X\) is not the same one that the null hypothesis predicts: the most likely value for \(X\) is now 55 out of 100. Not only that, the whole sampling distribution has now shifted, as shown in Figure 11.4 . The critical regions, of course, do not change: by definition, the critical regions are based on what the null hypothesis predicts. What we’re seeing in this figure is the fact that when the null hypothesis is wrong, a much larger proportion of the sampling distribution distribution falls in the critical region. And of course that’s what should happen: the probability of rejecting the null hypothesis is larger when the null hypothesis is actually false! However \(\theta = .55\) is not the only possibility consistent with the alternative hypothesis. Let’s instead suppose that the true value of \(\theta\) is actually 0.7. What happens to the sampling distribution when this occurs? The answer, shown in Figure 11.5 , is that almost the entirety of the sampling distribution has now moved into the critical region. Therefore, if \(\theta = 0.7\) the probability of us correctly rejecting the null hypothesis (i.e., the power of the test) is much larger than if \(\theta = 0.55\) . In short, while \(\theta = .55\) and \(\theta = .70\) are both part of the alternative hypothesis, the Type II error rate is different.

Sampling distribution under the *alternative* hypothesis, for a population parameter value of $\theta = 0.70$. Almost all of the distribution lies in the rejection region.

Figure 11.5: Sampling distribution under the alternative hypothesis, for a population parameter value of \(\theta = 0.70\) . Almost all of the distribution lies in the rejection region.

The probability that we will reject the null hypothesis, plotted as a function of the true value of $\theta$. Obviously, the test is more powerful (greater chance of correct rejection) if the true value of $\theta$ is very different from the value that the null hypothesis specifies (i.e., $\theta=.5$). Notice that when $\theta$ actually is equal to .5 (plotted as a black dot), the null hypothesis is in fact true: rejecting the null hypothesis in this instance would be a Type I error.

Figure 11.6: The probability that we will reject the null hypothesis, plotted as a function of the true value of \(\theta\) . Obviously, the test is more powerful (greater chance of correct rejection) if the true value of \(\theta\) is very different from the value that the null hypothesis specifies (i.e., \(\theta=.5\) ). Notice that when \(\theta\) actually is equal to .5 (plotted as a black dot), the null hypothesis is in fact true: rejecting the null hypothesis in this instance would be a Type I error.

What all this means is that the power of a test (i.e., \(1-\beta\) ) depends on the true value of \(\theta\) . To illustrate this, I’ve calculated the expected probability of rejecting the null hypothesis for all values of \(\theta\) , and plotted it in Figure 11.6 . This plot describes what is usually called the power function of the test. It’s a nice summary of how good the test is, because it actually tells you the power ( \(1-\beta\) ) for all possible values of \(\theta\) . As you can see, when the true value of \(\theta\) is very close to 0.5, the power of the test drops very sharply, but when it is further away, the power is large.

11.8.2 Effect size

Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned with mice when there are tigers abroad – George Box 1976

The plot shown in Figure 11.6 captures a fairly basic point about hypothesis testing. If the true state of the world is very different from what the null hypothesis predicts, then your power will be very high; but if the true state of the world is similar to the null (but not identical) then the power of the test is going to be very low. Therefore, it’s useful to be able to have some way of quantifying how “similar” the true state of the world is to the null hypothesis. A statistic that does this is called a measure of effect size (e.g. Cohen 1988 ; Ellis 2010 ) . Effect size is defined slightly differently in different contexts, 166 (and so this section just talks in general terms) but the qualitative idea that it tries to capture is always the same: how big is the difference between the true population parameters, and the parameter values that are assumed by the null hypothesis? In our ESP example, if we let \(\theta_0 = 0.5\) denote the value assumed by the null hypothesis, and let \(\theta\) denote the true value, then a simple measure of effect size could be something like the difference between the true value and null (i.e., \(\theta – \theta_0\) ), or possibly just the magnitude of this difference, \(\mbox{abs}(\theta – \theta_0)\) .

Why calculate effect size? Let’s assume that you’ve run your experiment, collected the data, and gotten a significant effect when you ran your hypothesis test. Isn’t it enough just to say that you’ve gotten a significant effect? Surely that’s the point of hypothesis testing? Well, sort of. Yes, the point of doing a hypothesis test is to try to demonstrate that the null hypothesis is wrong, but that’s hardly the only thing we’re interested in. If the null hypothesis claimed that \(\theta = .5\) , and we show that it’s wrong, we’ve only really told half of the story. Rejecting the null hypothesis implies that we believe that \(\theta \neq .5\) , but there’s a big difference between \(\theta = .51\) and \(\theta = .8\) . If we find that \(\theta = .8\) , then not only have we found that the null hypothesis is wrong, it appears to be very wrong. On the other hand, suppose we’ve successfully rejected the null hypothesis, but it looks like the true value of \(\theta\) is only .51 (this would only be possible with a large study). Sure, the null hypothesis is wrong, but it’s not at all clear that we actually care , because the effect size is so small. In the context of my ESP study we might still care, since any demonstration of real psychic powers would actually be pretty cool 167 , but in other contexts a 1% difference isn’t very interesting, even if it is a real difference. For instance, suppose we’re looking at differences in high school exam scores between males and females, and it turns out that the female scores are 1% higher on average than the males. If I’ve got data from thousands of students, then this difference will almost certainly be statistically significant , but regardless of how small the \(p\) value is it’s just not very interesting. You’d hardly want to go around proclaiming a crisis in boys education on the basis of such a tiny difference would you? It’s for this reason that it is becoming more standard (slowly, but surely) to report some kind of standard measure of effect size along with the the results of the hypothesis test. The hypothesis test itself tells you whether you should believe that the effect you have observed is real (i.e., not just due to chance); the effect size tells you whether or not you should care.

11.8.3 Increasing the power of your study

Not surprisingly, scientists are fairly obsessed with maximising the power of their experiments. We want our experiments to work, and so we want to maximise the chance of rejecting the null hypothesis if it is false (and of course we usually want to believe that it is false!) As we’ve seen, one factor that influences power is the effect size. So the first thing you can do to increase your power is to increase the effect size. In practice, what this means is that you want to design your study in such a way that the effect size gets magnified. For instance, in my ESP study I might believe that psychic powers work best in a quiet, darkened room; with fewer distractions to cloud the mind. Therefore I would try to conduct my experiments in just such an environment: if I can strengthen people’s ESP abilities somehow, then the true value of \(\theta\) will go up 168 and therefore my effect size will be larger. In short, clever experimental design is one way to boost power; because it can alter the effect size.

Unfortunately, it’s often the case that even with the best of experimental designs you may have only a small effect. Perhaps, for example, ESP really does exist, but even under the best of conditions it’s very very weak. Under those circumstances, your best bet for increasing power is to increase the sample size. In general, the more observations that you have available, the more likely it is that you can discriminate between two hypotheses. If I ran my ESP experiment with 10 participants, and 7 of them correctly guessed the colour of the hidden card, you wouldn’t be terribly impressed. But if I ran it with 10,000 participants and 7,000 of them got the answer right, you would be much more likely to think I had discovered something. In other words, power increases with the sample size. This is illustrated in Figure 11.7 , which shows the power of the test for a true parameter of \(\theta = 0.7\) , for all sample sizes \(N\) from 1 to 100, where I’m assuming that the null hypothesis predicts that \(\theta_0 = 0.5\) .

The power of our test, plotted as a function of the sample size $N$. In this case, the true value of $\theta$ is 0.7, but the null hypothesis is that $\theta = 0.5$. Overall, larger $N$ means greater power. (The small zig-zags in this function occur because of some odd interactions between $\theta$, $\alpha$ and the fact that the binomial distribution is discrete; it doesn't matter for any serious purpose)

Figure 11.7: The power of our test, plotted as a function of the sample size \(N\) . In this case, the true value of \(\theta\) is 0.7, but the null hypothesis is that \(\theta = 0.5\) . Overall, larger \(N\) means greater power. (The small zig-zags in this function occur because of some odd interactions between \(\theta\) , \(\alpha\) and the fact that the binomial distribution is discrete; it doesn’t matter for any serious purpose)

Because power is important, whenever you’re contemplating running an experiment it would be pretty useful to know how much power you’re likely to have. It’s never possible to know for sure, since you can’t possibly know what your effect size is. However, it’s often (well, sometimes) possible to guess how big it should be. If so, you can guess what sample size you need! This idea is called power analysis , and if it’s feasible to do it, then it’s very helpful, since it can tell you something about whether you have enough time or money to be able to run the experiment successfully. It’s increasingly common to see people arguing that power analysis should be a required part of experimental design, so it’s worth knowing about. I don’t discuss power analysis in this book, however. This is partly for a boring reason and partly for a substantive one. The boring reason is that I haven’t had time to write about power analysis yet. The substantive one is that I’m still a little suspicious of power analysis. Speaking as a researcher, I have very rarely found myself in a position to be able to do one – it’s either the case that (a) my experiment is a bit non-standard and I don’t know how to define effect size properly, (b) I literally have so little idea about what the effect size will be that I wouldn’t know how to interpret the answers. Not only that, after extensive conversations with someone who does stats consulting for a living (my wife, as it happens), I can’t help but notice that in practice the only time anyone ever asks her for a power analysis is when she’s helping someone write a grant application. In other words, the only time any scientist ever seems to want a power analysis in real life is when they’re being forced to do it by bureaucratic process. It’s not part of anyone’s day to day work. In short, I’ve always been of the view that while power is an important concept, power analysis is not as useful as people make it sound, except in the rare cases where (a) someone has figured out how to calculate power for your actual experimental design and (b) you have a pretty good idea what the effect size is likely to be. Maybe other people have had better experiences than me, but I’ve personally never been in a situation where both (a) and (b) were true. Maybe I’ll be convinced otherwise in the future, and probably a future version of this book would include a more detailed discussion of power analysis, but for now this is about as much as I’m comfortable saying about the topic.

11.9 Some issues to consider

What I’ve described to you in this chapter is the orthodox framework for null hypothesis significance testing (NHST). Understanding how NHST works is an absolute necessity, since it has been the dominant approach to inferential statistics ever since it came to prominence in the early 20th century. It’s what the vast majority of working scientists rely on for their data analysis, so even if you hate it you need to know it. However, the approach is not without problems. There are a number of quirks in the framework, historical oddities in how it came to be, theoretical disputes over whether or not the framework is right, and a lot of practical traps for the unwary. I’m not going to go into a lot of detail on this topic, but I think it’s worth briefly discussing a few of these issues.

11.9.1 Neyman versus Fisher

The first thing you should be aware of is that orthodox NHST is actually a mash-up of two rather different approaches to hypothesis testing, one proposed by Sir Ronald Fisher and the other proposed by Jerzy Neyman (for a historical summary see Lehmann 2011 ) . The history is messy because Fisher and Neyman were real people whose opinions changed over time, and at no point did either of them offer “the definitive statement” of how we should interpret their work many decades later. That said, here’s a quick summary of what I take these two approaches to be.

First, let’s talk about Fisher’s approach. As far as I can tell, Fisher assumed that you only had the one hypothesis (the null), and what you want to do is find out if the null hypothesis is inconsistent with the data. From his perspective, what you should do is check to see if the data are “sufficiently unlikely” according to the null. In fact, if you remember back to our earlier discussion, that’s how Fisher defines the \(p\) -value. According to Fisher, if the null hypothesis provided a very poor account of the data, you could safely reject it. But, since you don’t have any other hypotheses to compare it to, there’s no way of “accepting the alternative” because you don’t necessarily have an explicitly stated alternative. That’s more or less all that there was to it.

In contrast, Neyman thought that the point of hypothesis testing was as a guide to action, and his approach was somewhat more formal than Fisher’s. His view was that there are multiple things that you could do (accept the null or accept the alternative) and the point of the test was to tell you which one the data support. From this perspective, it is critical to specify your alternative hypothesis properly. If you don’t know what the alternative hypothesis is, then you don’t know how powerful the test is, or even which action makes sense. His framework genuinely requires a competition between different hypotheses. For Neyman, the \(p\) value didn’t directly measure the probability of the data (or data more extreme) under the null, it was more of an abstract description about which “possible tests” were telling you to accept the null, and which “possible tests” were telling you to accept the alternative.

As you can see, what we have today is an odd mishmash of the two. We talk about having both a null hypothesis and an alternative (Neyman), but usually 169 define the \(p\) value in terms of exreme data (Fisher), but we still have \(\alpha\) values (Neyman). Some of the statistical tests have explicitly specified alternatives (Neyman) but others are quite vague about it (Fisher). And, according to some people at least, we’re not allowed to talk about accepting the alternative (Fisher). It’s a mess: but I hope this at least explains why it’s a mess.

11.9.2 Bayesians versus frequentists

Earlier on in this chapter I was quite emphatic about the fact that you cannot interpret the \(p\) value as the probability that the null hypothesis is true. NHST is fundamentally a frequentist tool (see Chapter 9 ) and as such it does not allow you to assign probabilities to hypotheses: the null hypothesis is either true or it is not. The Bayesian approach to statistics interprets probability as a degree of belief, so it’s totally okay to say that there is a 10% chance that the null hypothesis is true: that’s just a reflection of the degree of confidence that you have in this hypothesis. You aren’t allowed to do this within the frequentist approach. Remember, if you’re a frequentist, a probability can only be defined in terms of what happens after a large number of independent replications (i.e., a long run frequency). If this is your interpretation of probability, talking about the “probability” that the null hypothesis is true is complete gibberish: a null hypothesis is either true or it is false. There’s no way you can talk about a long run frequency for this statement. To talk about “the probability of the null hypothesis” is as meaningless as “the colour of freedom”. It doesn’t have one!

Most importantly, this isn’t a purely ideological matter. If you decide that you are a Bayesian and that you’re okay with making probability statements about hypotheses, you have to follow the Bayesian rules for calculating those probabilities. I’ll talk more about this in Chapter 17 , but for now what I want to point out to you is the \(p\) value is a terrible approximation to the probability that \(H_0\) is true. If what you want to know is the probability of the null, then the \(p\) value is not what you’re looking for!

11.9.3 Traps

As you can see, the theory behind hypothesis testing is a mess, and even now there are arguments in statistics about how it “should” work. However, disagreements among statisticians are not our real concern here. Our real concern is practical data analysis. And while the “orthodox” approach to null hypothesis significance testing has many drawbacks, even an unrepentant Bayesian like myself would agree that they can be useful if used responsibly. Most of the time they give sensible answers, and you can use them to learn interesting things. Setting aside the various ideologies and historical confusions that we’ve discussed, the fact remains that the biggest danger in all of statistics is thoughtlessness . I don’t mean stupidity, here: I literally mean thoughtlessness. The rush to interpret a result without spending time thinking through what each test actually says about the data, and checking whether that’s consistent with how you’ve interpreted it. That’s where the biggest trap lies.

To give an example of this, consider the following example (see Gelman and Stern 2006 ) . Suppose I’m running my ESP study, and I’ve decided to analyse the data separately for the male participants and the female participants. Of the male participants, 33 out of 50 guessed the colour of the card correctly. This is a significant effect ( \(p = .03\) ). Of the female participants, 29 out of 50 guessed correctly. This is not a significant effect ( \(p = .32\) ). Upon observing this, it is extremely tempting for people to start wondering why there is a difference between males and females in terms of their psychic abilities. However, this is wrong. If you think about it, we haven’t actually run a test that explicitly compares males to females. All we have done is compare males to chance (binomial test was significant) and compared females to chance (binomial test was non significant). If we want to argue that there is a real difference between the males and the females, we should probably run a test of the null hypothesis that there is no difference! We can do that using a different hypothesis test, 170 but when we do that it turns out that we have no evidence that males and females are significantly different ( \(p = .54\) ). Now do you think that there’s anything fundamentally different between the two groups? Of course not. What’s happened here is that the data from both groups (male and female) are pretty borderline: by pure chance, one of them happened to end up on the magic side of the \(p = .05\) line, and the other one didn’t. That doesn’t actually imply that males and females are different. This mistake is so common that you should always be wary of it: the difference between significant and not-significant is not evidence of a real difference – if you want to say that there’s a difference between two groups, then you have to test for that difference!

The example above is just that: an example. I’ve singled it out because it’s such a common one, but the bigger picture is that data analysis can be tricky to get right. Think about what it is you want to test, why you want to test it, and whether or not the answers that your test gives could possibly make any sense in the real world.

11.10 Summary

Null hypothesis testing is one of the most ubiquitous elements to statistical theory. The vast majority of scientific papers report the results of some hypothesis test or another. As a consequence it is almost impossible to get by in science without having at least a cursory understanding of what a \(p\) -value means, making this one of the most important chapters in the book. As usual, I’ll end the chapter with a quick recap of the key ideas that we’ve talked about:

  • Research hypotheses and statistical hypotheses. Null and alternative hypotheses. (Section 11.1 ).
  • Type 1 and Type 2 errors (Section 11.2 )
  • Test statistics and sampling distributions (Section 11.3 )
  • Hypothesis testing as a decision making process (Section 11.4 )
  • \(p\) -values as “soft” decisions (Section 11.5 )
  • Writing up the results of a hypothesis test (Section 11.6 )
  • Effect size and power (Section 11.8 )
  • A few issues to consider regarding hypothesis testing (Section 11.9 )

Later in the book, in Chapter 17 , I’ll revisit the theory of null hypothesis tests from a Bayesian perspective, and introduce a number of new tools that you can use if you aren’t particularly fond of the orthodox approach. But for now, though, we’re done with the abstract statistical theory, and we can start discussing specific data analysis tools.

Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences . 2nd ed. Lawrence Erlbaum.

Ellis, P. D. 2010. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results . Cambridge, UK: Cambridge University Press.

Lehmann, Erich L. 2011. Fisher, Neyman, and the Creation of Classical Statistics . Springer.

Gelman, A., and H. Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” The American Statistician 60: 328–31.

  • The quote comes from Wittgenstein’s (1922) text, Tractatus Logico-Philosphicus . ↩
  • A technical note. The description below differs subtly from the standard description given in a lot of introductory texts. The orthodox theory of null hypothesis testing emerged from the work of Sir Ronald Fisher and Jerzy Neyman in the early 20th century; but Fisher and Neyman actually had very different views about how it should work. The standard treatment of hypothesis testing that most texts use is a hybrid of the two approaches. The treatment here is a little more Neyman-style than the orthodox view, especially as regards the meaning of the \(p\) value. ↩
  • My apologies to anyone who actually believes in this stuff, but on my reading of the literature on ESP, it’s just not reasonable to think this is real. To be fair, though, some of the studies are rigorously designed; so it’s actually an interesting area for thinking about psychological research design. And of course it’s a free country, so you can spend your own time and effort proving me wrong if you like, but I wouldn’t think that’s a terribly practical use of your intellect. ↩
  • This analogy only works if you’re from an adversarial legal system like UK/US/Australia. As I understand these things, the French inquisitorial system is quite different. ↩
  • An aside regarding the language you use to talk about hypothesis testing. Firstly, one thing you really want to avoid is the word “prove”: a statistical test really doesn’t prove that a hypothesis is true or false. Proof implies certainty, and as the saying goes, statistics means never having to say you’re certain. On that point almost everyone would agree. However, beyond that there’s a fair amount of confusion. Some people argue that you’re only allowed to make statements like “rejected the null”, “failed to reject the null”, or possibly “retained the null”. According to this line of thinking, you can’t say things like “accept the alternative” or “accept the null”. Personally I think this is too strong: in my opinion, this conflates null hypothesis testing with Karl Popper’s falsificationist view of the scientific process. While there are similarities between falsificationism and null hypothesis testing, they aren’t equivalent. However, while I personally think it’s fine to talk about accepting a hypothesis (on the proviso that “acceptance” doesn’t actually mean that it’s necessarily true, especially in the case of the null hypothesis), many people will disagree. And more to the point, you should be aware that this particular weirdness exists, so that you’re not caught unawares by it when writing up your own results. ↩
  • Strictly speaking, the test I just constructed has \(\alpha = .057\) , which is a bit too generous. However, if I’d chosen 39 and 61 to be the boundaries for the critical region, then the critical region only covers 3.5% of the distribution. I figured that it makes more sense to use 40 and 60 as my critical values, and be willing to tolerate a 5.7% type I error rate, since that’s as close as I can get to a value of \(\alpha = .05\) . ↩
  • The internet seems fairly convinced that Ashley said this, though I can’t for the life of me find anyone willing to give a source for the claim. ↩
  • That’s \(p = .000000000000000000000000136\) for folks that don’t like scientific notation! ↩
  • Note that the p here has nothing to do with a \(p\) value. The p argument in the binom.test() function corresponds to the probability of making a correct response, according to the null hypothesis. In other words, it’s the \(\theta\) value. ↩
  • There’s an R package called compute.es that can be used for calculating a very broad range of effect size measures; but for the purposes of the current book we won’t need it: all of the effect size measures that I’ll talk about here have functions in the lsr package ↩
  • Although in practice a very small effect size is worrying, because even very minor methodological flaws might be responsible for the effect; and in practice no experiment is perfect, so there are always methodological issues to worry about. ↩
  • Notice that the true population parameter \(\theta\) doesn’t necessarily correspond to an immutable fact of nature. In this context \(\theta\) is just the true probability that people would correctly guess the colour of the card in the other room. As such the population parameter can be influenced by all sorts of things. Of course, this is all on the assumption that ESP actually exists! ↩
  • Although this book describes both Neyman’s and Fisher’s definition of the \(p\) value, most don’t. Most introductory textbooks will only give you the Fisher version. ↩
  • In this case, the Pearson chi-square test of independence (Chapter 12 ; chisq.test() in R) is what we use; see also the prop.test() function. ↩

Learning Statistics with R Copyright © by Danielle Navarro is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License , except where otherwise noted.

Share This Book

Statology

Statistics Made Easy

How to Perform a Correlation Test in R (With Examples)

One way to quantify the relationship between two variables is to use the Pearson correlation coefficient , which is a measure of the linear association between two variables .

It always takes on a value between -1 and 1 where:

  • -1 indicates a perfectly negative linear correlation between two variables
  • 0 indicates no linear correlation between two variables
  • 1 indicates a perfectly positive linear correlation between two variables

To determine if a correlation coefficient is statistically significant, you can calculate the corresponding t-score and p-value.

The formula to calculate the t-score of a correlation coefficient (r) is:

t = r * √ n-2 / √ 1-r 2

The p-value is calculated as the corresponding two-sided p-value for the t-distribution with n-2 degrees of freedom.

Example: Correlation Test in R

To determine if the correlation coefficient between two variables is statistically significant, you can perform a correlation test in R using the following syntax:

cor.test(x, y, method=c(“pearson”, “kendall”, “spearman”))

  • x, y: Numeric vectors of data.
  • method:  Method used to calculate correlation between two vectors. Default is “pearson.”

For example, suppose we have the following two vectors in R:

Before we perform a correlation test between the two variables, we can create a quick scatterplot to view their relationship:

Correlation test in R

There appears to be a positive correlation between the two variables. That is, as one increases the other tends to increase as well.

To see if this correlation is statistically significant, we can perform a correlation test:

The correlation coefficient between the two vectors turns out to be 0.9279869 .

The test statistic turns out to be 7.8756 and the corresponding p-value is  1.35e-05 .

Since this value is less than .05, we have sufficient evidence to say that the correlation between the two variables is statistically significant.

Additional Resources

The following tutorials provide additional information about correlation coefficients:

An Introduction to the Pearson Correlation Coefficient What is Considered to Be a “Strong” Correlation? The Five Assumptions for Pearson Correlation

Featured Posts

5 Statistical Biases to Avoid

Hey there. My name is Zach Bobbitt. I have a Master of Science degree in Applied Statistics and I’ve worked on machine learning algorithms for professional businesses in both healthcare and retail. I’m passionate about statistics, machine learning, and data visualization and I created Statology to be a resource for both students and teachers alike.  My goal with this site is to help you learn statistics through using simple terms, plenty of real-world examples, and helpful illustrations.

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Library homepage

  • school Campus Bookshelves
  • menu_book Bookshelves
  • perm_media Learning Objects
  • login Login
  • how_to_reg Request Instructor Account
  • hub Instructor Commons
  • Download Page (PDF)
  • Download Full Book (PDF)
  • Periodic Table
  • Physics Constants
  • Scientific Calculator
  • Reference & Cite
  • Tools expand_more
  • Readability

selected template will load here

This action is not available.

Statistics LibreTexts

17: Hypothesis Testing in R

  • Last updated
  • Save as PDF
  • Page ID 7656

  • Russell A. Poldrack
  • Stanford University

In this chapter we will present several examples of using R to perform hypothesis testing.

  • 17.1: Simple Example- Coin-flipping (Section 16.3.5.1)
  • 17.2: Simulating p-values

logo

Stats and R

Correlation coefficient and correlation test in r.

  • Descriptive statistics
  • Inferential statistics
  • Hypothesis test

Introduction

Between two variables, correlation matrix: correlations for all variables, interpretation of a correlation coefficient, a scatterplot for 2 variables, scatterplots for several pairs of variables, another simple correlation matrix, for 2 variables, for several pairs of variables, correlograms, correlation does not imply causation.

hypothesis testing pearson r

Correlations between variables play an important role in a descriptive analysis . A correlation measures the relationship between two variables , that is, how they are linked to each other. In this sense, a correlation allows to know which variables evolve in the same direction, which ones evolve in the opposite direction, and which ones are independent.

In this article, I show how to compute correlation coefficients , how to perform correlation tests and how to visualize relationships between variables in R.

Correlation is usually computed on two quantitative variables, but it can also be computed on two qualitative ordinal variables. 1 See the Chi-square test of independence if you need to study the relationship between two qualitative nominal variables.

If you need to quantify the relationship between two variables, I refer you to the article about linear regression .

In this article, we use the mtcars dataset (loaded by default in R):

The variables vs and am are categorical variables, so they are removed for this article:

Correlation coefficient

The correlation between 2 variables is found with the cor() function.

Suppose we want to compute the correlation between horsepower ( hp ) and miles per gallon ( mpg ):

Note that the correlation between variables X and Y is equal to the correlation between variables Y and X so the order of the variables in the cor() function does not matter.

The Pearson correlation is computed by default with the cor() function. If you want to compute the Spearman correlation, add the argument method = "spearman" to the cor() function:

The most common correlation methods (Run ?cor for more information about the different methods available in the cor() function) are:

  • Pearson correlation is often used for quantitative continuous variables that have a linear relationship
  • Spearman correlation (which is actually similar to Pearson but based on the ranked values for each variable rather than on the raw data) is often used to evaluate relationships involving at least one qualitative ordinal variable or two quantitative variables if the link is partially linear
  • Kendall’s tau-b which is computed from the number of concordant and discordant pairs is often used for qualitative ordinal variables

Note that there exists the point-biserial correlation (which can be used to measure the association between a continuous variable and a nominal variable of two levels), but this correlation is not covered here.

Suppose now that we want to compute correlations for several pairs of variables. We can easily do so for all possible pairs of variables in the dataset, again with the cor() function:

This correlation matrix gives an overview of the correlations for all combinations of two variables.

First of all, correlation ranges from -1 to 1 . It gives us an indication on two things:

  • The direction of the relationship between the 2 variables
  • The strength of the relationship between the 2 variables

Regarding the direction of the relationship: On the one hand, a negative correlation implies that the two variables under consideration vary in opposite directions , that is, if a variable increases the other decreases and vice versa. On the other hand, a positive correlation implies that the two variables under consideration vary in the same direction , i.e., if a variable increases the other one increases and if one decreases the other one decreases as well.

Regarding the strength of the relationship: The more extreme the correlation coefficient (the closer to -1 or 1), the stronger the relationship . This also means that a correlation close to 0 indicates that the two variables are independent , that is, as one variable increases, there is no tendency in the other variable to either decrease or increase.

As an illustration, the Pearson correlation between horsepower ( hp ) and miles per gallon ( mpg ) found above is -0.78, meaning that the 2 variables vary in opposite direction. This makes sense, cars with more horsepower tend to consume more fuel (and thus have a lower millage par gallon). On the contrary, from the correlation matrix we see that the correlation between miles per gallon ( mpg ) and the time to drive 1/4 of a mile ( qsec ) is 0.42, meaning that fast cars (low qsec ) tend to have a worse millage per gallon (low mpg ). This again make sense as fast cars tend to consume more fuel.

Note that it is a good practice to visualize the type of the relationship between the two variables before interpreting the correlation coefficients. The reason is that the correlation coefficient could be biased due to an outlier or due to the type of link between the two variables.

For instance, see the two Pearson correlation coefficients (denoted by R in the following plots) when the outlier is excluded and included:

hypothesis testing pearson r

The Pearson correlation coefficient changes drastically due to a single point, and thus the interpretation. It goes from a negative correlation coefficient, indicating a negative relationship between the 2 variables, to a positive coefficient, indicating a positive relationship. We would have missed this insight if we had not visualized the data in a scatterplot (see how to draw a scatterplot in this section ).

A correlation coefficient may also miss a non-linear link between two variables:

hypothesis testing pearson r

The Pearson correlation coefficient is equal to 0, indicating no relationship between the two variables, because it measures the linear relationship and it is clear from the plot that the link is non-linear.

So to recap, it is a good practice to visualize the data via a scatterplot before interpreting a correlation coefficient (it does not tell the whole story) and see how the correlation coefficient changes when using the parametric (Pearson) or nonparametric version (Spearman or Kendall’s tau-b).

Visualizations

The correlation matrix presented above is not easily interpretable, especially when the dataset is composed of many variables. In the following sections, we present some alternatives to the correlation matrix for better readability.

A good way to visualize a correlation between 2 variables is to draw a scatterplot of the two variables of interest. Suppose we want to examine the relationship between horsepower ( hp ) and miles per gallon ( mpg ):

hypothesis testing pearson r

If you are unfamiliar with the {ggplot2} package , you can draw the scatterplot using the plot() function from R base graphics:

hypothesis testing pearson r

or use the esquisse addin to easily draw plots using the {ggplot2} package.

Suppose that instead of visualizing the relationship between only 2 variables, we want to visualize the relationship for several pairs of variables. This is possible thanks to the pair() function.

For this illustration, we focus only on miles per gallon ( mpg ), horsepower ( hp ) and weight ( wt ):

hypothesis testing pearson r

The figure indicates that weight ( wt ) and horsepower ( hp ) are positively correlated, whereas miles per gallon ( mpg ) seems to be negatively correlated with horsepower ( hp ) and weight ( wt ).

This version of the correlation matrix presents the correlation coefficients in a slightly more readable way, i.e., by coloring the coefficients based on their sign. Applied to our dataset, we have:

hypothesis testing pearson r

Correlation test

Unlike a correlation matrix which indicates the correlation coefficients between some pairs of variables in the sample , a correlation test is used to test whether the correlation (denoted \(\rho\) ) between 2 variables is significantly different from 0 or not in the population .

Actually, a correlation coefficient different from 0 in the sample does not mean that the correlation is significantly different from 0 in the population. This needs to be tested with a hypothesis test —and known as the correlation test.

The null and alternative hypothesis for the correlation test are as follows:

  • \(H_0\) : \(\rho = 0\) (meaning that there is no linear relationship between the two variables)
  • \(H_1\) : \(\rho \ne 0\) (meaning that there is a linear relationship between the two variables)

Via this correlation test, what we are actually testing is whether:

  • the sample contains sufficient evidence to reject the null hypothesis and conclude that the correlation coefficient does not equal 0, so the relationship exists in the population.
  • or on the contrary, the sample does not contain enough evidence that the correlation coefficient does not equal 0, so in this case we do not reject the null hypothesis of no relationship between the variables in the population.

Note that there are 2 assumptions for this test to be valid:

  • Independence of the data
  • For small sample sizes (usually \(n < 30\) ), the two variables should follow a normal distribution

Suppose that we want to test whether the rear axle ratio ( drat ) is correlated with the time to drive a quarter of a mile ( qsec ):

The p -value of the correlation test between these 2 variables is 0.62. At the 5% significance level, we do not reject the null hypothesis of no correlation. We therefore conclude that we do not reject the hypothesis that there is no linear relationship between the 2 variables. 2

This test proves that even if the correlation coefficient is different from 0 (the correlation is 0.09 in the sample), it is actually not significantly different from 0 in the population.

Note that the p -value of a correlation test is based on the correlation coefficient and the sample size. The larger the sample size and the more extreme the correlation (closer to -1 or 1), the more likely the null hypothesis of no correlation will be rejected.

With a small sample size, it is thus possible to obtain a relatively large correlation in the sample (based on the correlation coefficient), but still find a correlation not significantly different from 0 in the population (based on the correlation test). For this reason, it is recommended to always perform a correlation test before interpreting a correlation coefficient to avoid flawed conclusions.

Similar to the correlation matrix used to compute correlation for several pairs of variables, the rcorr() function (from the {Hmisc} package) allows to compute p -values of the correlation test for several pairs of variables at once. Applied to our dataset, we have:

Only correlations with p -values smaller than the significance level (usually \(\alpha = 0.05\) ) should be interpreted.

Combination of correlation coefficients and correlation tests

Now that we covered the concepts of correlation coefficients and correlation tests, let see if we can combine the two concepts.

If you need to do this for a few pairs of variables, I recommend using the ggscatterstats() function from the {ggstatsplot} package. Let’s see it in practice with one pair of variables— wt and mpg :

hypothesis testing pearson r

Based on the result of the test, we conclude that there is a negative correlation between the weight and the number of miles per gallon ( \(r = - 0.87\) , \(p\) -value < 0.001).

If you need to do it for many pairs of variables, I recommend using the the correlation function from the easystats {correlation} package .

This function allows to combine correlation coefficients and correlation tests for several pairs of variables, all in a single table (thanks to krzysiektr for pointing it out to me):

As you can see, it gives, among other useful information, the correlation coefficients (column r ) and the result of the correlation test (column 95% CI for the confidence interval or p for the \(p\) -value) for all pairs of variables.

The table above is very useful and informative, but let see if it is possible to combine the concepts of correlation coefficients and correlations test in one single visualization. A visualization that would be easy to read and interpret.

Ideally, we would like to have a concise overview of correlations between all possible pairs of variables present in a dataset, with a clear distinction for correlations that are significantly different from 0.

The figure below, known as a correlogram and adapted from the corrplot() function, does precisely this:

hypothesis testing pearson r

The correlogram shows correlation coefficients for all pairs of variables (with more intense colors for more extreme correlations), and correlations not significantly different from 0 are represented by a white box.

To learn more about this plot and the code used, I invite you to read the article entitled “ Correlogram in R: how to highlight the most correlated variables in a dataset ”.

For those of you who are still not completely satisfied, I recently found two alternatives—one with the ggpairs() function from the {GGally} package and one with the ggcormat() function from the {ggstatsplot} package.

The two functions are illustrated with the variables mpg , hp and wt :

hypothesis testing pearson r

The plot above combines correlation coefficients, correlation tests (via the asterisks next to the coefficients 3 ) and scatterplots for all possible pairs of variables present in a dataset.

hypothesis testing pearson r

The plot above also shows the correlation coefficients and if any, the non-significant correlations (by default at the 5% significance level with the Holm adjustment method) are shown by a big cross on the correlation coefficients.

The advantage of these two alternatives compared to the first one is that it is directly available within a package, so you do not need to run the code of the function first in order to draw the correlogram.

I am pretty sure you have already heard the statement “Correlation does not imply causation” in statistics. An article about correlation would not be complete without discussing about causation.

A non-zero correlation between two variables does not necessarily mean that there is a cause and effect relationship between these two variables!

Indeed, a significant correlation between two variables means that changes in one variable are associated (positively or negatively) with changes in the other variable. Nonetheless, a significant correlation does not indicates that variations in one variable cause the variations in the other variable.

A non-zero correlation between X and Y can appear in several cases:

  • a third variable cause X and Y
  • a combination of these three reasons

Sometimes it is quite clear that there is a causal relationship between two variables. Take for example the correlation between the price of a consumer product such as milk and its consumption. It is quite obvious that there is a causal link between the two: if the price of milk increases, it is expected that its consumption will decrease.

However, this causal link is not always present even if the correlation is significant. Maurage, Heeren, and Pesenti ( 2013 ) showed that, although there is a positive and significant correlation between chocolate consumption and the number of Nobel laureates, this correlation comes from the fact that a third variable, Gross Domestic Product (GDP), causes chocolate consumption and the number of Nobel laureates. They found that countries with higher GDP tend to have a higher level of chocolate consumption and scientific research (leading to more Nobel laureates).

This example shows that one must be very cautious when interpreting correlations and avoid over-interpreting a correlation as a causal relationship.

Thanks for reading.

I hope this article helped you to compute correlation coefficients and perform correlation tests in R. If you would like to learn how to compute the coefficients by hand, see this step-by-step tutorial .

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.

(Note that this article is available for download on my Gumroad page .)

It is true that there is the point-biserial correlation which can be used with a nominal variable (consisting of two factors). Nonetheless, this type of correlation is much less known and usually not covered in introductory statistics classes; with one continuous and one nominal variable, it is much more frequent to learn about the Student’s t-test (for a nominal variable with 2 groups) or ANOVA (for a nominal variable with 3 or more groups). More information about choosing the most appropriate measure of association depending on the type of variable can be found in this article . ↩︎

It is important to remember that we tested for a linear relationship between the two variables since we used the Pearson’s correlation. It may be the case that there is a relationship between the two variables in the population, but this relation may not be linear. ↩︎

One asterisk means that the coefficient is significant at the 5% level, 2 is at the 1% significance level, and 3 is at the 0.1% significance level. This is usually the case in R; the more asterisks, the more it is significant. ↩︎

Related articles

  • One-proportion and chi-square goodness of fit test
  • How to perform a one-sample t-test by hand and in R: test on one mean
  • What is survival analysis? Examples by hand and in R
  • One-sample Wilcoxon test in R

Liked this post?

  • Get updates every time a new article is published (no spam and unsubscribe anytime):

Yes, receive new posts by email

  • Support the blog

Consulting FAQ Contribute Sitemap

Hypothesis Testing

While exploring the relationship between an exposure and an outcome it may be useful to statistically test the strength of association. Hypothesis testing is a statistical inference technique by which one uses observed sample data to interrogate an assumption about a population parameter. Given an assumed probability distribution for the population parameter, a hypothesis test can yield a measure of how likely it would be to encounter the data observed. Two of the hypothesis tests applied to count data in two-by-two tables include Pearson’s chi-squared test and Fisher’s exact test. Both of these techniques are implemented in the twoxtwo package. The narrative that follows demonstrates how to perform these tests using example datasets created below as well as the titanic data that ships with twoxtwo 1 .

Pearson’s Chi-squared test

With data organized in a contingency table it is possible to test independence of counts in each cell. At a given cell, one can compare the observed count ( \(n_{ij}\) ) and expected count ( \(\mu_{ij}\) ) and arrive at the following test statistic:

\[\chi^2 = \sum \frac{(n_{ij}-\mu_{ij})^2}{\mu_{ij}}\] The Pearson’s \(\chi^2\) (chi-squared) statistic above is parameterized by degrees of freedom. A contingency table has degrees of freedom computed as (number or rows - 1 ) * (number of columns - 1). For a two-by-two table (2 rows, 2 columns), \(\chi^2\) will have 1 degree of freedom.

The \(\chi^2\) test statistic and its degrees of freedom can be used in a hypothesis test. If the probability of observing a statistic as large is less than the stated significance level ( \(\alpha\) ) then one can reject the null hypothesis of independence among cell counts.

Pearson’s \(\chi^2\) test is implemented in R via the chisq.test() function in the stats package. The twoxtwo package wraps this functionality in its chisq() function. One of the advantages of chisq() as opposed to chisq.test() is that the twoxtwo implementation returns a tidy hypothesis test output.

Below is an example of using the chisq() function to explore the association between crew status and survival in the titanic dataset:

Fisher’s exact test

Another method for testing independence between two-by-two cell counts is Fisher’s exact test. After fixing marginal row and column totals, one can cycle through possible combinations of cell counts that would sum to those margins. Each combination yields a unique two-by-two table. For each of these tables the binomial probability of observing the given count in cell A 2 can be expressed as:

\[p=\frac{(A+B)!\times{(C+D)!}\times{(A+C)!}\times{(B+D)!}}{A!\times{B!}\times{C!}\times{D!}}\]

To calculate a p-value, one can sum the probabilities for cell counts that are at least as extreme as the observed count. Using this method there is no need for a test statistic since the p-value can be calculated exactly as the sum of binomial probabilities.

The stated origin of Fisher’s exact test is the “tea taster” experiment. The design is detailed in the example below:

To conduct the hypothesis test with the fisher() function from the twoxtwo package:

As with chisq() the fisher() function wraps a function from the stats package. fisher() passes arguments for the proposed odds ratio for the hypothesis testing, confidence level for odds ratio estimate, and alternative hypothesis into the interal fisher.test() function. Examples of several of these parameters in practice are provided below to explore the association between crew status and survival in the titanic dataset:

Agresti, A. (2019). An Introduction to Categorical Data Analysis. Hoboken, New Jersey: Wiley and Sons.

For more details on see ?titanic ↩︎

The notation here is described in detail in the twoxtwo “Basic Usage” vignette. To read the vignette: vignette("basic-usage", package = "twoxtwo") ↩︎

Introduction

  • R installation
  • Working directory
  • Getting help
  • Install packages

Data structures

Data Wrangling

  • Sort and order
  • Merge data frames

Programming

  • Creating functions
  • If else statement
  • apply function
  • sapply function
  • tapply function

Import & export

  • Read TXT files
  • Import CSV files
  • Read Excel files
  • Read SQL databases
  • Export data
  • plot function
  • Scatter plot
  • Density plot
  • Tutorials Introduction Data wrangling Graphics Statistics See all

HYPOTHESIS TESTING IN R

Hypothesis testing is a statistical procedure used to make decisions or draw conclusions about the characteristics of a population based on information provided by a sample

NORMALITY TESTS

Normality tests are used to evaluate whether a data sample follows a normal distribution. These tests allow to verify if the data have a behavior similar to that of a Gaussian distribution, being useful to determine if the assumptions of certain parametric statistical analyses that require normality in the data are met

Shapiro Wilk test in R

Shapiro Wilk normality test

shapiro.test()

Lilliefors normality test in R

Lilliefors normality test

lillie.test()

GOODNESS OF FIT TESTS

These tests are used to verify whether a proposed theoretical distribution adequately matches the observed data. They are useful to assess whether a specific distribution fits the data well, allowing to determine whether a theoretical model accurately represents the observed data distribution

Chi-squared test in R

Pearson's Chi-squared test with chisq.test()

chisq.test()

Kolmogorov-Smirnov test in R with ks.test()

Kolmogorov-Smirnov test with ks.test()

Median tests.

Median tests are used to test whether the medians of two or more groups are statistically different, thus identifying whether there are significant differences in medians between populations or treatments

Wilcoxon signed rank test in R

Wilcoxon signed rank test

wilcox.test()

Wilcoxon rank sum test (Mann-Whitney U test) in R

Wilcoxon rank sum test (Mann-Whitney U test)

Kruskal Wallis rank sum test in R

Kruskal Wallis rank sum test (H test)

kruskal.test()

OTHER TYPES OF TESTS

There are other types of tests, such as tests for comparing means, for equality of variances or for equality of proportions

T-test in R

T-test to compare means

F test in R with var.test()

F test with var.test() to compare two variances

Test for proportions with prop.test() function

Test for proportions with prop.test()

prop.test()

Try adjusting your search query

👉 If you haven’t found what you’re looking for, consider clicking the checkbox to activate the extended search on R CHARTS for additional graphs tutorials, try searching a synonym of your query if possible (e.g., ‘bar plot’ -> ‘bar chart’), search for a more generic query or if you are searching for a specific function activate the functions search or use the functions search bar .

Hypothesis Test for Pearson Correlation Coefficient

Description.

Adjust the cor.test function so that it can define the specific H0 as per your request, that is based on Fisher's Z transformation of the correlation.

a named vector contains correlation coefficient ( cor ), confidence interval( lowerci and upperci ), Z statistic ( Z ) and p-value ( pval )

NCSS correlation document

cor.test() to see the detailed arguments.

IMAGES

  1. Hypothesis testing with Pearson's r

    hypothesis testing pearson r

  2. Introduction to Hypothesis Testing in R

    hypothesis testing pearson r

  3. How to Perform Hypothesis Testing in R using T-tests and μ-Tests

    hypothesis testing pearson r

  4. Hypothesis testing tutorial using p value method

    hypothesis testing pearson r

  5. Introduction to Hypothesis Testing in R Case Studies, Concept and Examples

    hypothesis testing pearson r

  6. Hypothesis testing for the (Pearson) correlation coefficient

    hypothesis testing pearson r

VIDEO

  1. Hypothesis Testing

  2. 19. Hypothesis Testing -Pearson Correlation Coefficient

  3. Proportion Hypothesis Testing, example 2

  4. CORRELATION TUTORIAL (TAGALOG / FILIPINO VERSION)

  5. Interpreting the Correlation Coefficient// significance of "r"

  6. Hypothesis Testing in R

COMMENTS

  1. Hypothesis Testing with Pearson's r

    4. State Decision Rule. Using our alpha level and degrees of freedom, we look up a critical value in the r-Table. We find a critical r of 0.632. If r is greater than 0.632, reject the null hypothesis. 5. Calculate Test Statistic. We calculate r using the same method as we did in the previous lecture: Figure 3.

  2. Pearson Correlation Coefficient (r)

    Revised on February 10, 2024. The Pearson correlation coefficient (r) is the most common way of measuring a linear correlation. It is a number between -1 and 1 that measures the strength and direction of the relationship between two variables. When one variable changes, the other variable changes in the same direction.

  3. The Complete Guide: Hypothesis Testing in R

    A hypothesis test is a formal statistical test we use to reject or fail to reject some statistical hypothesis. This tutorial explains how to perform the following hypothesis tests in R: One sample t-test. Two sample t-test. Paired samples t-test. We can use the t.test () function in R to perform each type of test:

  4. 1.9

    Let's perform the hypothesis test on the husband's age and wife's age data in which the sample correlation based on n = 170 couples is r = 0.939. To test H 0: ρ = 0 against the alternative H A: ρ ≠ 0, we obtain the following test statistic: t ∗ = r n − 2 1 − R 2 = 0.939 170 − 2 1 − 0.939 2 = 35.39. To obtain the P -value, we need ...

  5. Hypothesis Testing with Pearson's r

    statisticslectures.com - where you can find free lectures, videos, and exercises, as well as get your questions answered on our forums!

  6. Hypothesis Tests in R

    R Function: chisq.test() Null hypothesis (H 0): The relative proportions of categories in one variable are different from the expected proportions; History: Karl Pearson ; For example, we test a hypothesis that smoking rates changed between 2000 and 2020.

  7. R Handbook: Hypothesis Testing and p-values

    Using a binomial test, the p -value is < 0.0001. (Actually, R reports it as < 2.2e-16, which is shorthand for the number in scientific notation, 2.2 x 10 -16, which is 0.00000000000000022, with 15 zeros after the decimal point.) Assuming an alpha of 0.05, since the p -value is less than alpha, we reject the null hypothesis.

  8. Lab 20: Hypothesis testing with correlation

    Using SPSS for Hypothesis Testing with Pearson r. We can also use SPSS to a hypothesis test with Pearson r. We could calculate the Pearson r with SPSS and then look at the output to make our decision about H 0. The output will give us a p value for our Pearson r (listed under Sig in the Output). We can compare this p value with alpha to ...

  9. Hypothesis testing with Pearson's r

    Gives an example of hypothesis testing for correlations using pearson's r.

  10. Chapter 7 Hypothesis Tests

    Step 2 (plan): State the Null Hypothesis (in order to come up with a testable hypothesis, we must already have a plan). Step 3 (Data): Recruit a sample; take measurements of the sample. Step 4 (Analysis): Analyze the data. Step 5 (Conclusion): Draw a conclusion regarding the null hypothesis-reject or accept. Example 7.3.

  11. Conducting a Hypothesis Test for the Population Correlation Coefficient

    It should be noted that the three hypothesis tests we learned for testing the existence of a linear relationship — the t-test for H 0: β 1 = 0, the ANOVA F-test for H 0: β 1 = 0, and the t-test for H 0: ρ = 0 — will always yield the same results. For example, if we treat the husband's age ("HAge") as the response and the wife's age ("WAge") as the predictor, each test yields a P-value ...

  12. 12.2.1

    In testing the statistical significance of the relationship between two quantitative variables we will use the five step hypothesis testing procedure: 1. Check assumptions and write hypotheses. In order to use Pearson's \ (r\) both variables must be quantitative and the relationship between \ (x\) and \ (y\) must be linear. Research Question.

  13. Statistics

    https://www.youtube.com/watch?v=jN-X7vf04u0&list=PLKt6D8cHRaGYHahOGtFYGR4CBzI9bf9aM&index=11&t=2s

  14. 15.8: Testing the Significance of a Correlation

    The important thing to note here is the t test associated with the predictor, in which we get a result of t (98)=−20.85, p<.001. Now let's compare this to the output of a different function, which goes by the name of cor.test(). As you might expect, this function runs a hypothesis test to see if the observed correlation between two ...

  15. 11.2: Correlation Hypothesis Test

    The formula for the test statistic is t = r√n − 2 √1 − r2. The value of the test statistic, t, is shown in the computer or calculator output along with the p-value. The test statistic t has the same sign as the correlation coefficient r. The p-value is the combined area in both tails.

  16. Hypothesis testing

    The hypothesis test itself tells you whether you should believe that the effect you have observed is real (i.e., not just due to chance); the effect size tells you whether or not you should care. ... In this case, the Pearson chi-square test of independence (Chapter 12; chisq.test() in R) is what we use; see also the prop.test() function. ...

  17. How to Perform a Correlation Test in R (With Examples)

    To see if this correlation is statistically significant, we can perform a correlation test: #perform correlation test between the two vectors cor.test(x, y) Pearson's product-moment correlation data: x and y t = 7.8756, df = 10, p-value = 1.35e-05 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0. ...

  18. 17: Hypothesis Testing in R

    This page titled 17: Hypothesis Testing in R is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.

  19. hypothesis testing

    With respect to this sample, you can think of it as a measure of the precision of the estimate, in the sense that if your null hypothesis had been any value (not just $0$) that lay outside that interval, your test would still have been significant.

  20. Correlation coefficient and correlation test in R

    Based on the result of the test, we conclude that there is a negative correlation between the weight and the number of miles per gallon ( r = −0.87 r = − 0.87, p p -value < 0.001). If you need to do it for many pairs of variables, I recommend using the the correlation function from the easystats {correlation} package.

  21. Hypothesis Testing

    Given an assumed probability distribution for the population parameter, a hypothesis test can yield a measure of how likely it would be to encounter the data observed. Two of the hypothesis tests applied to count data in two-by-two tables include Pearson's chi-squared test and Fisher's exact test. Both of these techniques are implemented in ...

  22. Hypothesis testing in R

    HYPOTHESIS TESTING IN R. Hypothesis testing is a statistical procedure used to make decisions or draw conclusions about the characteristics of a population based on information provided by a sample. NORMALITY TESTS. ... Pearson's Chi-squared test with chisq.test() chisq.test()

  23. R: Hypothesis Test for Pearson Correlation Coefficient

    Hypothesis Test for Pearson Correlation Coefficient Description Adjust the cor.test function so that it can define the specific H0 as per your request, that is based on Fisher's Z transformation of the correlation.