Slope Hypothesis Testing

"What? I can't hear--" "What? I said, are you sure--" "CAN YOU PLEASE SPEAK--"

892: Null Hypothesis

Explanation

This comic (and the title text) is based on a misunderstanding. The null hypothesis is the hypothesis in a statistical analysis that indicates that the effect investigated by the analysis does not occur, i.e. 'null' as in zero effect. For example, the null hypothesis for a study about cell phones and cancer risk might be "Cell phones have no effect on cancer risk." The alternative hypothesis, by contrast, is the one under investigation - in this case, probably "Cell phones affect the risk of cancer."

After conducting a study, we can then make a judgment based on our data. There are statistical models for measuring the probability that a certain result occurred by random chance, even though in reality there is no correlation. If this probability is low enough (usually meaning it's below a certain threshold we set when we design the experiment, such as 5% or 1%), we reject the null hypothesis, in this case saying that cell phones do increase cancer risk. Otherwise, we fail to reject the null hypothesis, as we have insufficient evidence to conclusively state that cell phones increase cancer risk. This is how almost all scientific experiments, from high school biology classes to CERN, draw their conclusions.

It is very important to note that a null hypothesis is a specific statement relative to the current study. In mathematics, we often see terms such as "the Riemann hypothesis" or "the continuum hypothesis" that refer to universal statements, but a null hypothesis depends on context; there is no single "the null hypothesis." The term refers to a role in a method of statistical analysis (and falsifiability), not a specific hypothesis. Given that, Megan's response would probably be to facepalm.

Multiple Hypothesis Testing

Sven Schmit, October 15, 2015, San Francisco

In recent years, there has been a lot of attention on hypothesis testing and so-called “p-hacking”, or misusing statistical methods to obtain more “significant” results. Rightly so: for example, we spend millions of dollars on medical research, and we don’t want to waste our time and money pursuing false leads caused by flaky statistics. But even if all of our assumptions are met and our data collection is flawless, it’s not always easy to get the statistics right; there are still quite a few subtleties that we need to be aware of.

This post introduces some of the interesting phenomena that can occur when we are dealing with testing hypotheses. First, we consider an example of a single hypothesis test which gives great insight into the difference between significance and “being correct”. Next, we look at global testing, where we have many different hypotheses and we want to test whether all null hypotheses are true using a single test. We discuss two different tests, Fisher’s combination test and Bonferroni’s method, which lead to rather different results. We save the best till last, when we discuss what to do if we have many hypotheses and want to test each individually. We introduce the concepts of familywise error rate and false discovery rate, and explain the Benjamini-Hochberg procedure.

Also, this post is accompanied by an IPython notebook that demonstrates how these methods work in practice. We analyze free throw percentage data from the NBA to see whether there are players that perform better, or worse, playing at home versus away.

For this post, I assume you have some basic knowledge about hypothesis testing, such as

  • The difference between the null hypothesis and alternative hypothesis
  • Significance levels
  • Type I and Type II errors

If you want to read up on the above, the material is covered by any introductory statistics book, as well as many blog posts, e.g. [MT1] and [MT2]. On the other hand, no prior knowledge about multiple testing is necessary.

Also, we ignore all of the intricacies that come with satisfying assumptions of tests. In practice, this is often quite challenging, but our focus is on multiple testing and therefore we ignore these and pretend that we live in a perfect world.

All the material is based on the lectures and lecture notes of Stats 300C by Prof. Candes at Stanford.

Simple example: Difference between p-value and being “right”

Much has been written about what a p-value is, and what it is not, but it seems like many people still make mistakes. Sometimes, a simple example can make all the difference and for me the following example was really enlightening.

The setup is simple: we have a single, generic, two-sided hypothesis test. For example, we could be testing whether the mean of some distribution we sample from is \(0\).

We assume that our test statistic, denoted by \(Z\), follows a standard normal distribution under the null hypothesis: \(Z \sim_{H_0} \mathcal{N}(0, 1)\).

As good citizens, we gather our data after specifying our hypothesis, calculate the value of our test statistic \(Z\), and find \(Z = -2.0\). If you have ever taken a class about hypothesis testing, I’m afraid you know the critical value for such a two-sided test by heart: \(1.96\), telling you to reject the null hypothesis at the 5% significance level.

We can also compute a p-value using the normal CDF: \(p = 2 \Phi(-2.0) = 0.0455\), where we multiply by two because we are conducting a two-sided test. And indeed, it’s slightly below \(0.05\).
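
For concreteness, here is that arithmetic in Python (a small sketch, not from the original post):

```python
from scipy.stats import norm

z = -2.0
p = 2 * norm.cdf(z)  # two-sided p-value: twice the lower tail of N(0, 1)
print(p)             # 0.0455..., just under the 0.05 threshold
```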

So we reject the null hypothesis at a significance level of 5%. Does this tell us the null hypothesis is false? No, it does not. The p-value equals the probability of seeing this test statistic, or something more extreme, if the null hypothesis is true. Hence, this value \(Z = -2.0\) was pretty unlikely to be observed if the null hypothesis is true.

From a frequentist perspective we are now done: only the data is generated randomly, and the null hypothesis is either true or false, so we can’t make any more probabilistic statements. Still, we would like to know the probability of a false rejection, given this test statistic. If we put on our Bayesian hats and add some more assumptions, we can do just that.

First of all, we have to specify a prior on the probability that the null hypothesis is true. For our example, let’s be generous and say that a priori we believe our null hypothesis has only a 50% probability of being true: \(P(H_0) = 0.5\).

Furthermore, we also have to specify the distribution of our test statistic under the alternative hypothesis. We often have little idea what this could be, so let’s say \(Z\) is uniformly distributed between \(-10\) and \(10\).

Summarizing, we have: \(\begin{aligned} P(H_0) &= 0.5\\\\ Z &\sim_{H_0} \mathcal{N}(0, 1)\\\\ Z &\sim_{H_1} U[-10, 10] \end{aligned}\)

And we are interested in computing \(P(H_0 \mid Z = z)\) for \(z = -2.0\).

So before we even start, the prior probability that the null hypothesis is false equals 50%, and we observe a rather small p-value. Surely, we can be fairly certain about our rejection?

Let’s find out. Using Bayes’ Theorem, and abusing notation, we obtain \(P(H_0 \mid Z = z) = \frac{P(H_0)P(Z = z \mid H_0)}{P(Z = z)},\) where the denominator expands to \(P(H_0)P(Z = z \mid H_0) + P(H_1)P(Z = z \mid H_1)\) by the law of total probability.

Plugging in the numbers, with density \(\varphi(-2) \approx 0.054\) under the null and \(1/20 = 0.05\) under the alternative, we find \(P(H_0 \mid Z = -2) \approx 0.52\): we reject correctly only about 50% of the time!
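
The same computation in code (a minimal sketch, using the densities assumed above):

```python
from scipy.stats import norm

z = -2.0
prior_h0 = 0.5
f_null = norm.pdf(z)  # density of Z = -2.0 under H0: N(0, 1)
f_alt = 1 / 20        # density under H1: Uniform[-10, 10]

posterior_h0 = prior_h0 * f_null / (prior_h0 * f_null + (1 - prior_h0) * f_alt)
print(posterior_h0)   # ~0.52: the null is still about as likely as not
```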

Even in the simplest of examples, hypothesis testing can be subtle, and here we clearly demonstrate the big difference between significance and “correctness”.

How is this possible? In the above example, observing \(Z = -2\) under the null is quite unlikely, but it’s also unlikely under the alternative hypothesis. Note that we played this completely by the book: we didn’t do anything “illegal” to make our p-value significant.

Also, note we skipped any details about the hypothesis, the data, and the sample size we are dealing with, because for the purposes of this example they are irrelevant.

A final remark before we move on: one might think we could resolve this problem by obtaining a larger data set. However, as opposed to estimation, where we expect our estimation error to shrink with increased sample size, this is not the case for the Type I error in testing procedures. All the extra information we get from an increased sample size is used to improve power: increasing the probability of correctly rejecting the null hypothesis, i.e. reducing the probability of failing to reject the null hypothesis when we actually should.

Global testing

Hypothesis testing gets even more interesting when there are multiple hypotheses that we want to test. Now, we can “borrow” information from the other test statistics to gain power, as we will soon see.

There are two types of tests:

  • Global hypothesis testing: we want to simultaneously test all null hypotheses with a single test,
  • Multiple testing: we want to test each null hypothesis separately.

While the latter might be more relevant in practice, the former leads to great insight and many methods used for the multiple testing problem can be related back to global hypothesis tests, so let’s look at some interesting results for the global test first.

We assume we have \(n\) null hypotheses \(H_{0, i}\), with corresponding p-values \(p_i\) obtained as if we were only testing hypothesis \(i\). The global null hypothesis is then that all null hypotheses are true: \(H_0 = \bigcap_{i=1}^n H_{0, i}\).

There are two simple tests that, surprisingly, behave quite differently.

Fisher’s combination test

Fisher’s combination test is derived using the following insight: if \(p_i\) is uniformly distributed over \([0, 1]\), then its negative logarithm follows an exponential distribution: \(-\log p_i \sim \text{Exp}(1)\).

One can show that the test statistic \(T = - 2 \sum_{i=1}^n \log p_i\) follows a \(\chi^2\) distribution with \(2n\) degrees of freedom if the \(p_i\)s are independent: \(T = - 2 \sum_{i=1}^n \log p_i \sim \chi^2_{2n}.\)

Fisher’s combination test uses the above as follows:

  • Compute \(T = - 2 \sum \log p_i\) from the individual p-values, noting that under the null hypothesis, a p-value has a uniform distribution on \([0, 1]\).
  • Under the global null hypothesis, \(T\) follows a \(\chi^2_{2n}\) distribution, so we can decide whether to reject based on the value of \(T\) (see the sketch below).
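
A minimal sketch in Python (the function name is mine; it assumes the p-values are independent, as required above):

```python
import numpy as np
from scipy.stats import chi2

def fisher_combination_test(p_values):
    """Return the p-value of Fisher's combination test for the global null."""
    p_values = np.asarray(p_values)
    t = -2 * np.sum(np.log(p_values))        # T ~ chi2 with 2n df under the global null
    return chi2.sf(t, df=2 * len(p_values))  # P(chi2_{2n} >= T)
```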

Bonferroni’s method

Bonferroni’s method takes a completely different approach: we only look at the smallest p-value, and if this is at most \(\alpha / n\), we reject the global null hypothesis:

Reject \(H_0\) if \(\min_i p_i \le \alpha / n\).

Using a union bound, we can show that the Type I error rate is at most \(\alpha\). Assuming the global null hypothesis is true, we have: \(\begin{aligned} P_{H_0}(\text{Type I error}) &= P_{H_0}\left(\bigcup_{i=1}^n \{p_i \le \alpha / n\}\right)\\\\ &\le \sum_{i=1}^n P_{H_0}(p_i \le \alpha / n) \\\\ &= n \frac{\alpha}{n} = \alpha \end{aligned}\)
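
The corresponding sketch for Bonferroni's global test (again, the function name is mine):

```python
def bonferroni_global_test(p_values, alpha=0.05):
    """Reject the global null if the smallest p-value is at most alpha / n."""
    return min(p_values) <= alpha / len(p_values)
```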

Differences

Both tests are straightforward, so which one should you use? Well, as it turns out: it depends. In certain situations, Fisher’s combination test has near full power (that is, it rejects with very high probability when the global null hypothesis is false) while Bonferroni is powerless (the probability of correctly rejecting vanishes); in other situations, the opposite happens.

This surprised me at first, but if we take a closer look, the reason becomes quite clear. Bonferroni only considers the smallest p-value, which means it is very good at detecting a few large effects: the smallest p-value will most likely be from an alternative hypothesis. In fact, it can be shown that Bonferroni’s method is optimal for detecting one large effect.

On the other hand, Fisher’s combination test is not as effective in this scenario: the effect of the alternative hypotheses will be wiped out by the noise of the large number of true null hypotheses. Hence, the test won’t be able to reject the global null hypothesis in this case.

But, if there are many small effects, Fisher’s combination test really shines: all these little deviations add up and this test is able to reject the global null. While no single hypothesis provides enough evidence to reject the null hypothesis that all hypotheses are true, all the little discrepancies combined cause us to reject the null.

Bonferroni’s method, however, is useless in the latter case: because it only uses one p-value, it is unable to gather enough evidence to reject the global null hypothesis. In fact, it is quite likely that the smallest p-value actually comes from a test where the null hypothesis is true!

Multiple testing

(Significant, from xkcd: xkcd.com/comics/significant)

Rejecting the global null hypothesis is great, but anyone’s first response to such rejection would be: so tell me, which null hypothesis is false?

For this, we need to open another bag of tricks. We still have \(n\) hypotheses, but now we want to reject hypotheses at the individual level.

Could we simply test each one using the same procedure as when we are testing a single hypothesis? Suppose we are testing \(n = 1000\) hypotheses at the 5% significance level. Then, even if all the nulls are true, we expect about 50 false positives by mere chance. That’s not very appealing: if it turns out that our test rejects 100 hypotheses, you still can’t confidently say anything about any particular rejected hypothesis. We could of course decrease \(\alpha\), but this leads to the next problem.

Naturally, we are primarily interested in the smallest p-values. However, the interpretation of this smallest p-value as “the probability of observing a more extreme value given that the null hypothesis is true” is now inaccurate, and rejecting such a hypothesis at the \(\alpha\) significance level leads to a much larger Type I error rate. We should take into account that this is the smallest out of \(n\) p-values; hence this smallest p-value is not uniform on \([0, 1]\) at all, but is skewed to the left. Seeing something “extreme” among 100 observations is much more likely than in a single observation. This leaves us clueless about the actual Type I error rate.
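
A quick simulation makes this skew concrete (a sketch; the sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 100_000
# Smallest of n p-values drawn under the null, repeated over many trials
min_p = rng.uniform(size=(trials, n)).min(axis=1)
print((min_p < 0.05).mean())  # ~0.99: the minimum looks "significant" almost always
```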

We can look at this in another way: it is true that for every single hypothesis test, we use a procedure with a probability of a Type I error equal to \(\alpha\). This is a guarantee that we have before applying the method; after applying it, there is no randomness left, according to the frequentist perspective. However, when we consider the smallest p-value, we change the procedure to take additional information into account (in some sense, we combine all \(n\) hypotheses into a single one). We changed the procedure we are using, and the old guarantees don’t carry over.

Familywise error rate

So, we need a different approach, which starts by defining exactly what we want to control. This is not a new problem at all: for decades statisticians have used the Familywise Error Rate (FWER), which ensures that the probability of committing even a single false rejection is bounded by \(\alpha\).

To achieve this, we can use Bonferroni’s method again: we reject null hypothesis \(H_{0, i}\) if \(p_i \le \alpha / n\). To show that this indeed controls the familywise error rate, we can use exactly the same union bound as shown above.

However, when there are lots of hypotheses to test, this is very conservative and leads to very few rejections, that is, low power. When we test hundreds or thousands of hypotheses, do we really care about making a few mistakes, as long as most of our rejections are correct?

There are a few methods, such as Holm’s procedure, that are a bit more powerful, but even with slightly better methods the FWER criterion remains too restrictive.

False discovery rate

So we want a different criterion: one that is less restrictive than the familywise error rate, but that still gives us tight control over the number of false rejections. When testing many hypotheses, we might be fine allowing a few false rejections, or false discoveries, as long as the majority of rejections are correct.

To make this more rigorous, let \(R\) be the total number of rejections, and \(V\) the number of false rejections. Then we would like to make sure the fraction \(V / R\), known as the False Discovery Proportion (FDP), is small. There is one problem with this approach, though: we know \(R\), but \(V\) is unknown, so we cannot use this quantity directly.

Instead, in a seminal paper, Benjamini and Hochberg [BHQ] propose a procedure known as \(BH(q)\) (or the Benjamini-Hochberg procedure) that ensures that, in expectation, the above ratio is controlled: \(E\left[\frac{V}{\max(R, 1)}\right] \le q,\) where the left-hand side is known as the False Discovery Rate (FDR).

The procedure works as follows:

  • Sort the p-values from small to large, such that \(p_{(1)} \le p_{(2)} \le \ldots \le p_{(n)}\)
  • Find the largest (sorted) p-value \(p_{(i)}\) such that \(p_{(i)} \le q \frac{i}{n}\)
  • Reject exactly the hypotheses corresponding to \(p_{(1)}, \ldots, p_{(i)}\) (see the sketch below)
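
In code, the step-up procedure looks roughly like this (a sketch; the function name is mine):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.1):
    """Return a boolean mask of hypotheses rejected by BH(q)."""
    p = np.asarray(p_values)
    n = len(p)
    order = np.argsort(p)                    # indices that sort the p-values
    below = p[order] <= q * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()       # largest i with p_(i) <= q * i / n
        reject[order[: k + 1]] = True        # reject the k + 1 smallest p-values
    return reject
```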

It is important to note that this does not always work. However, if the hypotheses are independent, then the above method controls FDR. Also, if all the null hypotheses are true, then FWER and FDR are equivalent, and because FWER is conservative, any procedure that controls the FWER also controls FDR.

The visualization demonstrates the procedure graphically. Here, we generate a set of p-values; the ones with a green stroke come from a true null hypothesis, while the ones with a red stroke come from an alternative and are not uniformly distributed. The fill of each circle shows whether that hypothesis is rejected. While in practice most people use a value of \(q = 0.1\), we use \(0.2\) instead for demonstration purposes. We also report the false discovery proportion, and the “(empirical) power”: the fraction of alternative hypotheses that are correctly rejected.

While gaining popularity in scientific fields as a way to deal with multiple testing, the FDR metric is not without its flaws either. Because we cannot divide by zero, each time we do not reject any hypothesis, this counts as an FDP of \(0\). Therefore, as long as a method rejects nothing most of the time, it is free to reject whatever it wants in the remaining fraction of tests and still control FDR. Consider the following two methods for testing multiple hypotheses:

Method 1: Throw a biased coin that comes up heads with probability \(q\). If the coin comes up tails, don’t reject any of the hypotheses. On the other hand, if it comes up heads, reject all hypotheses.

Method 2: Throw a biased coin that comes up heads with probability \(q\). If the coin comes up tails, don’t reject any of the hypotheses. If it comes up heads, select one hypothesis at random and reject it.

Though both methods are completely useless, they both control FDR at level \(q\). I would argue, though, that method 2 is better than method 1. It’s better to make a single mistake than \(n\) mistakes. In general, this could lead to a false sense of “significance”. When we see a lot of rejections, we might be tempted to think: since there are so many rejections, this cannot be a coincidence, I’ve struck gold with my dataset. But recall, we can never say anything about a specific outcome.

Three comments:

  • Of course the same criticism holds for FWER as well, so going back to controlling FWER does not solve this problem.
  • There has been research into metrics related to FDR that do not have this problem, but controlling such metrics is more difficult [STO].
  • It has been shown that if the hypotheses are independent, then the FDP is very close to the FDR when applying the BH(q) procedure.

Bayesian approach to FDR

While we don’t have the space to delve into it, it is worth noting that there is also a beautiful Bayesian approach that leads to the same FDR criterion and the BH(q) procedure. If you are interested, refer to the lecture notes from Stats 300C or [LSI].

While many of these algorithms are easy to implement, both Python and R can do the hard work for you.

Multiple testing in Python

For Python, statsmodels can help out. For example, to use the \(BH(q)\) procedure, we can do the following.
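
A minimal sketch, assuming p_values holds the individual p-values (the variable name is mine):

```python
from statsmodels.stats.multitest import multipletests

# "fdr_bh" is statsmodels' name for the Benjamini-Hochberg procedure
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.1, method="fdr_bh")
```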

See the documentation for more information.

Multiple testing in R

In R it is equally simple to adjust p-values for multiple comparisons. Again focusing on \(BH(q)\), we can do the following.
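
A minimal sketch using base R's p.adjust (again assuming a vector p_values):

```r
# BH-adjusted p-values; reject where the adjusted value is at most q
p_adjusted <- p.adjust(p_values, method = "BH")
reject <- p_adjusted <= 0.1
```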

The documentation has the details and a list of other available methods.

Final remarks

Hypothesis testing is a subtle and surprisingly beautiful subject. On the one hand, testing is more prevalent than ever; on the other hand, there is a big backlash against the use of p-values, especially in academia. On top of that, we rarely have access to data that has not been analyzed by others before. By gaining a better understanding of hypothesis tests, and the multiple comparisons problem in particular, I feel like I have gained a powerful tool, even though I don’t use it every day. The real benefit is the awareness of the subtle dangers when dealing with statistics, where it is important to be skeptical. So next time you see a single significant p-value from a linear or logistic regression model, stop for a second and think about how significant “significant” really is.

Also, don’t forget about the IPython notebook that looks at these methods using data from the NBA if you are interested.

This post is based on material from the terrific course Stats 300C by Prof. Candes at Stanford, in particular lecture notes 1, 2, 3, 6, 8, 9, and 12. Definitely have a look at them if you are interested in learning more.

  • [BHQ] Y. Benjamini and Y. Hochberg - Controlling the false discovery rate: A practical and powerful approach to multiple testing (1995)
  • [LSI] B. Efron - Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction (2011)
  • [MT1] Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics
  • [MT2] How to Correctly Interpret P Values
  • [STO] J. Storey - The Positive False Discovery Rate: A Bayesian interpretation and the q-value (2003)

Statistics LibreTexts

9.7: Chapter Homework

9.1 Null and Alternative Hypotheses

  • H 0 : ________
  • H a : ________
  • H 0 : __________
  • H a : __________

9.2 Outcomes and the Type I and Type II Errors

9.3 Distribution Needed for Hypothesis Testing

9.4 Full Hypothesis Test Examples

State the null hypothesis, \(H_0\), and the alternative hypothesis, \(H_a\), in terms of the appropriate parameter (\(\mu\) or \(p\)).

  • The mean number of years Americans work before retiring is 34.
  • At most 60% of Americans vote in presidential elections.
  • The mean starting salary for San Jose State University graduates is at least $100,000 per year.
  • Twenty-nine percent of high school seniors get drunk each month.
  • Fewer than 5% of adults ride the bus to work in Los Angeles.
  • The mean number of cars a person owns in her lifetime is not more than ten.
  • About half of Americans prefer to live away from cities, given the choice.
  • Europeans have a mean paid vacation each year of six weeks.
  • The chance of developing breast cancer is under 11% for women.
  • Private universities' mean tuition cost is more than $20,000 per year.
  • p  < 0.30
  • p  ≤ 0.30
  • p  ≥ 0.30
  • p  > 0.30
  • p  = 0.20
  • p  > 0.20
  • p  < 0.20
  • p  ≤ 0.20
  • \(H_0: \bar{x} = 4.5,\ H_a: \bar{x} > 4.5\)
  • \(H_0: \mu \ge 4.5,\ H_a: \mu < 4.5\)
  • \(H_0: \mu = 4.75,\ H_a: \mu > 4.75\)
  • \(H_0: \mu = 4.5,\ H_a: \mu > 4.5\)
  • The mean number of cars a person owns in his or her lifetime is not more than ten.
  • Private universities' mean tuition cost is more than $20,000 per year.
  • State a consequence of committing a Type I error.
  • State a consequence of committing a Type II error.
  • To conclude the drug is safe when, in fact, it is unsafe.
  • Not to conclude the drug is safe when, in fact, it is safe.
  • To conclude the drug is safe when, in fact, it is safe.
  • Not to conclude the drug is unsafe when, in fact, it is unsafe.
  • at least 20%, when in fact, it is less than 20%.
  • 20%, when in fact, it is 20%.
  • less than 20%, when in fact, it is at least 20%.
  • less than 20%, when in fact, it is less than 20%.

The Type II error is not to reject that the mean number of hours of sleep LTCC students get per night is at least seven when, in fact, the mean number of hours

  • is more than seven hours.
  • is at most seven hours.
  • is at least seven hours.
  • is less than seven hours.
  • to conclude that the current mean hours per week is higher than 4.5, when in fact, it is higher
  • to conclude that the current mean hours per week is higher than 4.5, when in fact, it is the same
  • to conclude that the mean hours per week currently is 4.5, when in fact, it is higher
  • to conclude that the mean hours per week currently is no higher than 4.5, when in fact, it is not higher
  • \(N\left(7.24, \frac{1.93}{\sqrt{22}}\right)\)
  • \(N(7.24, 1.93)\)

According to an article in Newsweek, the natural ratio of girls to boys is 100:105. In China, the birth ratio is 100:114 (46.7% girls). Suppose you don’t believe the reported figures of the percent of girls born in China. You conduct a study. In this study, you count the number of girls and boys born in 150 randomly chosen recent births. There are 60 girls and 90 boys born of the 150. Based on your study, do you believe that the percent of girls born in China is 46.7%?

66. A poll done for Newsweek found that 13% of Americans have seen or sensed the presence of an angel. A contingent doubts that the percent is really that high. It conducts its own survey. Out of 76 Americans surveyed, only two had seen or sensed the presence of an angel. As a result of the contingent’s survey, would you agree with the Newsweek poll? In complete sentences, also give three reasons why the two polls might give different results.

67. The mean work week for engineers in a start-up company is believed to be about 60 hours. A newly hired engineer hopes that it’s shorter. She asks ten engineering friends in start-ups for the lengths of their mean work weeks. Based on the results that follow, should she count on the mean work week to be shorter than 60 hours?

Data (length of mean work week): 70; 45; 55; 60; 65; 55; 55; 60; 50; 55.
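
As an illustration only (not part of the original exercise set), this test could be run in Python along these lines:

```python
from scipy.stats import ttest_1samp

hours = [70, 45, 55, 60, 65, 55, 55, 60, 50, 55]
# One-sided t-test of H0: mu = 60 against Ha: mu < 60
t_stat, p_value = ttest_1samp(hours, popmean=60, alternative="less")
print(t_stat, p_value)  # p is roughly 0.11, so at alpha = 0.05 we fail to reject H0
```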

68. Sixty-eight percent of online courses taught at community colleges nationwide were taught by full-time faculty. To test if 68% also represents California’s percent for full-time faculty teaching online classes, Long Beach City College (LBCC) in California was randomly selected for comparison. In the same year, 34 of the 44 online courses LBCC offered were taught by full-time faculty. Conduct a hypothesis test to determine if 68% represents California. NOTE: For more accurate results, use more California community colleges and this past year’s data.

69. According to an article in Bloomberg Businessweek, New York City’s most recent adult smoking rate is 14%. Suppose that a survey is conducted to determine this year’s rate. Nine out of 70 randomly chosen N.Y. City residents reply that they smoke. Conduct a hypothesis test to determine if the rate is still 14% or if it has decreased.
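
As before, a sketch of one possible approach (not part of the original problem), using an exact binomial test:

```python
from scipy.stats import binomtest

# H0: p = 0.14 against Ha: p < 0.14, with 9 smokers out of 70
result = binomtest(9, n=70, p=0.14, alternative="less")
print(result.pvalue)  # well above 0.05: no evidence the rate has decreased
```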

70. The mean age of De Anza College students in a previous term was 26.6 years old. An instructor thinks the mean age for online students is older than 26.6. She randomly surveys 56 online students and finds that the sample mean is 29.4 with a standard deviation of 2.1. Conduct a hypothesis test.

71. Registered nurses earned an average annual salary of $69,110. For that same year, a survey was conducted of 41 California registered nurses to determine if the annual salary is higher than $69,110 for California nurses. The sample average was $71,121 with a sample standard deviation of $7,489. Conduct a hypothesis test.

72. La Leche League International reports that the mean age of weaning a child from breastfeeding is age four to five worldwide. In America, most nursing mothers wean their children much earlier. Suppose a random survey is conducted of 21 U.S. mothers who recently weaned their children. The mean weaning age was nine months (3/4 year) with a standard deviation of 4 months. Conduct a hypothesis test to determine if the mean weaning age in the U.S. is less than four years old.

73. Over the past few decades, public health officials have examined the link between weight concerns and teen girls’ smoking. Researchers surveyed a group of 273 randomly selected teen girls living in Massachusetts (between 12 and 15 years old). After four years the girls were surveyed again. Sixty-three said they smoked to stay thin. Is there good evidence that more than thirty percent of the teen girls smoke to stay thin? After conducting the test, your decision and conclusion are

  • Reject \(H_0\): There is sufficient evidence to conclude that more than 30% of teen girls smoke to stay thin.
  • Do not reject \(H_0\): There is not sufficient evidence to conclude that less than 30% of teen girls smoke to stay thin.
  • Do not reject \(H_0\): There is not sufficient evidence to conclude that more than 30% of teen girls smoke to stay thin.
  • Reject \(H_0\): There is sufficient evidence to conclude that less than 30% of teen girls smoke to stay thin.

74. A statistics instructor believes that fewer than 20% of Evergreen Valley College (EVC) students attended the opening night midnight showing of the latest Harry Potter movie. She surveys 84 of her students and finds that 11 of them attended the midnight showing. At a 1% level of significance, an appropriate conclusion is:

  • There is insufficient evidence to conclude that the percent of EVC students who attended the midnight showing of Harry Potter is less than 20%.
  • There is sufficient evidence to conclude that the percent of EVC students who attended the midnight showing of Harry Potter is more than 20%.
  • There is sufficient evidence to conclude that the percent of EVC students who attended the midnight showing of Harry Potter is less than 20%.
  • There is insufficient evidence to conclude that the percent of EVC students who attended the midnight showing of Harry Potter is at least 20%.

75. Previously, an organization reported that teenagers spent 4.5 hours per week, on average, on the phone. The organization thinks that, currently, the mean is higher. Fifteen randomly chosen teenagers were asked how many hours per week they spend on the phone. The sample mean was 4.75 hours with a sample standard deviation of 2.0. Conduct a hypothesis test. At a significance level of \(\alpha = 0.05\), what is the correct conclusion?

  • There is enough evidence to conclude that the mean number of hours is more than 4.75
  • There is enough evidence to conclude that the mean number of hours is more than 4.5
  • There is not enough evidence to conclude that the mean number of hours is more than 4.5
  • There is not enough evidence to conclude that the mean number of hours is more than 4.75

Instructions: For the following ten exercises, answer each question.

  • State the null and alternative hypotheses.
  • State the \(p\)-value.
  • State alpha.
  • What is your decision?
  • Write a conclusion.
  • Answer any other questions asked in the problem.

76. According to the Centers for Disease Control website, in 2011 at least 18% of high school students had smoked a cigarette. An Introduction to Statistics class in Davies County, KY conducted a hypothesis test at the local high school (a medium-sized school of approximately 1,200 students with a small-city demographic) to determine if the local high school’s percentage was lower. One hundred fifty students were chosen at random and surveyed. Of the 150 students surveyed, 82 had smoked. Use a significance level of 0.05 and, using appropriate statistical evidence, conduct a hypothesis test and state the conclusions.

77. A recent survey in the N.Y. Times Almanac indicated that 48.8% of families own stock. A broker wanted to determine if this survey could be valid. He surveyed a random sample of 250 families and found that 142 owned some type of stock. At the 0.05 significance level, can the survey be considered to be accurate?

78. Driver error can be listed as the cause of approximately 54% of all fatal auto accidents, according to the American Automobile Association. Thirty randomly selected fatal accidents are examined, and it is determined that 14 were caused by driver error. Using \(\alpha = 0.05\), is the AAA proportion accurate?

79. The US Department of Energy reported that 51.7% of homes were heated by natural gas. A random sample of 221 homes in Kentucky found that 115 were heated by natural gas. Does the evidence support the claim for Kentucky at the \(\alpha = 0.05\) level? Are the results applicable across the country? Why?

80. For Americans using library services, the American Library Association claims that at most 67% of patrons borrow books. The library director in Owensboro, Kentucky feels this is not true, so she asked a local college statistics class to conduct a survey. The class randomly selected 100 patrons and found that 82 borrowed books. Did the class demonstrate that the percentage was higher in Owensboro, KY? Use the \(\alpha = 0.01\) level of significance. What is the possible proportion of patrons that do borrow books from the Owensboro Library?

81. The Weather Underground reported that the mean amount of summer rainfall for the northeastern US is at least 11.52 inches. Ten cities in the northeast are randomly selected, and the mean rainfall amount is calculated to be 7.42 inches with a standard deviation of 1.3 inches. At the \(\alpha = 0.05\) level, can it be concluded that the mean rainfall was below the reported average? What if \(\alpha = 0.01\)? Assume the amount of summer rainfall follows a normal distribution.

82. A survey in the N.Y. Times Almanac finds the mean commute time (one way) is 25.4 minutes for the 15 largest US cities. The Austin, TX chamber of commerce feels that Austin’s commute time is less and wants to publicize this fact. The mean for 25 randomly selected commuters is 22.1 minutes with a standard deviation of 5.3 minutes. At the \(\alpha = 0.10\) level, is the Austin, TX commute significantly less than the mean commute time for the 15 largest US cities?

83. A report by the Gallup Poll found that a woman visits her doctor, on average, at most 5.8 times each year. A random sample of 20 women results in these yearly visit totals:

3; 2; 1; 3; 7; 2; 9; 4; 6; 6; 8; 0; 5; 6; 4; 2; 1; 3; 4; 1

At the \(\alpha = 0.05\) level can it be concluded that the sample mean is higher than 5.8 visits per year?

84. According to the N.Y. Times Almanac, the mean family size in the U.S. is 3.18. A sample of a college math class resulted in the following family sizes:

5; 4; 5; 4; 4; 3; 6; 4; 3; 3; 5; 5; 6; 3; 3; 2; 7; 4; 5; 2; 2; 2; 3; 2

At the \(\alpha = 0.05\) level, is the class’ mean family size greater than the national average? Does the Almanac result remain valid? Why?

85. The student academic group on a college campus claims that freshman students study at least 2.5 hours per day, on average. One Introduction to Statistics class was skeptical. The class took a random sample of 30 freshman students and found a mean study time of 137 minutes with a standard deviation of 45 minutes. At the \(\alpha = 0.01\) level, is the student academic group’s claim correct?
