
Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics

Topics: Hypothesis Testing, Statistics

What do significance levels and P values mean in hypothesis tests? What is statistical significance anyway? In this post, I’ll continue to focus on concepts and graphs to help you gain a more intuitive understanding of how hypothesis tests work in statistics.

To bring it to life, I’ll add the significance level and P value to the graph in my previous post in order to perform a graphical version of the 1 sample t-test. It’s easier to understand when you can see what statistical significance truly means!

Here’s where we left off in my last post. We want to determine whether our sample mean (330.6) indicates that this year's average energy cost is significantly different from last year’s average energy cost of $260.

Descriptive statistics for the example

The probability distribution plot above shows the distribution of sample means we’d obtain under the assumption that the null hypothesis is true (population mean = 260) and we repeatedly drew a large number of random samples.

I left you with a question: where do we draw the line for statistical significance on the graph? Now we'll add in the significance level and the P value, which are the decision-making tools we'll need.

We'll use these tools to test the following hypotheses:

  • Null hypothesis: The population mean equals the hypothesized mean (260).
  • Alternative hypothesis: The population mean differs from the hypothesized mean (260).
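The post performs this test graphically; as a numeric cross-check, here is a minimal stdlib-Python sketch of the same two-tailed test. The individual energy costs are not given in the post, so the sample below is invented (chosen to have a mean of exactly 330.6), and a normal approximation stands in for the exact t distribution, so the resulting p-value will not match the one reported later.

```python
import math
from statistics import mean, stdev

# Hypothetical sample of energy costs; the post reports only the
# sample mean (330.6), so these individual values are made up.
costs = [290, 355, 410, 265, 340, 325, 385, 300, 270, 366]
null_mean = 260

xbar = mean(costs)                          # 330.6
se = stdev(costs) / math.sqrt(len(costs))   # estimated standard error
z = (xbar - null_mean) / se                 # standardized distance from 260

# Two-tailed p-value under a normal approximation
# (a t distribution would be more exact for n = 10).
p = math.erfc(abs(z) / math.sqrt(2))
print(f"sample mean = {xbar}, z = {z:.2f}, p = {p:.6f}")
```

With this made-up sample the spread is small, so the test rejects the null hypothesis decisively; the point is the mechanics, not the numbers.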

What Is the Significance Level (Alpha)?

The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

These types of definitions can be hard to understand because of their technical nature. A picture makes the concepts much easier to comprehend!

The significance level determines how far out from the null hypothesis value we'll draw that line on the graph. To graph a significance level of 0.05, we need to shade the 5% of the distribution that is furthest away from the null hypothesis.

Probability plot that shows the critical regions for a significance level of 0.05

In the graph above, the two shaded areas are equidistant from the null hypothesis value and each area has a probability of 0.025, for a total of 0.05. In statistics, we call these shaded areas the critical region for a two-tailed test. If the population mean is 260, we’d expect to obtain a sample mean that falls in the critical region 5% of the time. The critical region defines how far away our sample statistic must be from the null hypothesis value before we can say it is unusual enough to reject the null hypothesis.
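The boundaries of that critical region can be computed directly. The post does not state the standard error of the sample mean, so the value below (about 32.8) is an assumption backed out of the p-value reported later in the post; the bisection helper is a stdlib-only stand-in for an inverse-normal function.

```python
import math

def norm_ppf(p):
    """Inverse standard normal CDF via bisection (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc(-mid / math.sqrt(2)) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

null_mean = 260
se = 32.8        # assumed standard error; not stated in the excerpt
alpha = 0.05

z_crit = norm_ppf(1 - alpha / 2)     # ~1.96 for a two-tailed 0.05 test
lower = null_mean - z_crit * se      # left edge of the critical region
upper = null_mean + z_crit * se      # right edge of the critical region
print(f"critical region: below {lower:.1f} or above {upper:.1f}")
```

Under this assumed standard error the upper boundary falls below 330.6, which matches the post's conclusion that the sample mean lands in the critical region.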

Our sample mean (330.6) falls within the critical region, which indicates it is statistically significant at the 0.05 level.

We can also see if it is statistically significant using the other common significance level of 0.01.

Probability plot that shows the critical regions for a significance level of 0.01

The two shaded areas each have a probability of 0.005, which adds up to a total probability of 0.01. This time our sample mean does not fall within the critical region and we fail to reject the null hypothesis. This comparison shows why you need to choose your significance level before you begin your study. It protects you from choosing a significance level because it conveniently gives you significant results!

Thanks to the graph, we were able to determine that our results are statistically significant at the 0.05 level without using a P value. However, when you use the numeric output produced by statistical software, you’ll need to compare the P value to your significance level to make this determination.


What Are P values?

P values are the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the null hypothesis.

This definition of P values, while technically correct, is a bit convoluted. It’s easier to understand with a graph!

To graph the P value for our example data set, we need to determine the distance between the sample mean and the null hypothesis value (330.6 - 260 = 70.6). Next, we can graph the probability of obtaining a sample mean that is at least as extreme in both tails of the distribution (260 +/- 70.6).

Probability plot that shows the p-value for our sample mean

In the graph above, the two shaded areas each have a probability of 0.01556, for a total probability of 0.03112. This probability represents the likelihood of obtaining a sample mean that is at least as extreme as our sample mean in both tails of the distribution if the population mean is 260. That’s our P value!
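The two tail areas can be reproduced numerically. As before, the standard error of roughly 32.8 is an assumption (it is not stated in the excerpt), chosen so the result lands near the post's reported p-value of 0.03112.

```python
import math

null_mean = 260
sample_mean = 330.6
se = 32.8          # assumed standard error; not stated in the excerpt

z = (sample_mean - null_mean) / se            # 70.6 in standard-error units
one_tail = 0.5 * math.erfc(z / math.sqrt(2))  # area beyond 330.6 in one tail
p_value = 2 * one_tail                        # both tails: 260 +/- 70.6
print(round(one_tail, 5), round(p_value, 5))
```

Each tail comes out near 0.0156 and the two-tailed total near 0.031, in line with the graph.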

When a P value is less than or equal to the significance level, you reject the null hypothesis. If we take the P value for our example and compare it to the common significance levels, it matches the previous graphical results. The P value of 0.03112 is statistically significant at an alpha level of 0.05, but not at the 0.01 level.

If we stick to a significance level of 0.05, we can conclude that the average energy cost for the population is greater than 260.

A common mistake is to interpret the P value as the probability that the null hypothesis is true. To understand why this interpretation is incorrect, please read my blog post How to Correctly Interpret P Values.

Discussion about Statistically Significant Results

A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. A test result is statistically significant when the sample statistic is unusual enough relative to the null hypothesis that we can reject the null hypothesis for the entire population. “Unusual enough” in a hypothesis test is defined by:

  • The assumption that the null hypothesis is true—the graphs are centered on the null hypothesis value.
  • The significance level—how far out do we draw the line for the critical region?
  • Our sample statistic—does it fall in the critical region?

Keep in mind that there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. The common alpha values of 0.05 and 0.01 are simply based on tradition. For a significance level of 0.05, expect to obtain sample means in the critical region 5% of the time when the null hypothesis is true. In these cases, you won’t know that the null hypothesis is true but you’ll reject it because the sample mean falls in the critical region. That’s why the significance level is also referred to as an error rate!

This type of error doesn’t imply that the experimenter did anything wrong or require any other unusual explanation. The graphs show that when the null hypothesis is true, it is possible to obtain these unusual sample means for no reason other than random sampling error. It’s just luck of the draw.

Significance levels and P values are important tools that help you quantify and control this type of error in a hypothesis test. Using these tools to decide when to reject the null hypothesis increases your chance of making the correct decision.

If you like this post, you might want to read the other posts in this series that use the same graphical framework:

  • Previous: Why We Need to Use Hypothesis Tests
  • Next: Confidence Intervals and Confidence Levels

If you'd like to see how I made these graphs, please read: How to Create a Graphical Version of the 1-sample t-Test.



© 2023 Minitab, LLC. All Rights Reserved.


Level of Significance & Hypothesis Testing


In hypothesis testing, the level of significance is a measure of how confident you can be about rejecting the null hypothesis. This blog post explains what hypothesis testing is and why understanding significance levels is important for your data science projects and for data science / statistics interviews. Before we look at what the level of significance is, let’s quickly review what hypothesis testing is.


What is Hypothesis testing and how is it related to significance level?

Hypothesis testing can be defined as tests performed to evaluate whether a claim or theory about something is true or otherwise. In order to perform hypothesis tests, the following steps need to be taken:

  • Hypothesis formulation: Formulate the null and alternate hypothesis
  • Data collection: Gather the sample of data
  • Statistical tests: Determine the statistical test and the test statistic. The test can be a z-test or a t-test, depending on the sample size and whether the population variance is known.
  • Set the level of significance
  • Calculate the p-value
  • Draw conclusions: Based on the p-value and the significance level, reject or fail to reject the null hypothesis.
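The steps above can be sketched as a small stdlib-Python routine for the z-test case. The function name and the illustrative numbers are my own, not from the post:

```python
import math

def z_test(sample_mean, null_mean, pop_sd, n, alpha=0.05):
    """Two-tailed one-sample z-test following the steps above.

    Returns (z, p, reject). Assumes the population SD is known."""
    se = pop_sd / math.sqrt(n)                 # standard error of the mean
    z = (sample_mean - null_mean) / se         # test statistic
    p = math.erfc(abs(z) / math.sqrt(2))       # two-tailed p-value
    return z, p, p <= alpha                    # decision at the chosen alpha

# Illustrative numbers (not from the post):
z, p, reject = z_test(sample_mean=52, null_mean=50, pop_sd=8, n=100)
print(z, round(p, 4), reject)
```

Here z = 2.5 and the two-tailed p-value is about 0.012, so at the 0.05 level we would reject the null hypothesis.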

A detailed explanation is provided in one of my related posts titled hypothesis testing explained with examples.

What is the level of significance?

The level of significance is the threshold used to decide whether to reject or fail to reject the null hypothesis, and it determines whether the outcome of hypothesis testing is statistically significant. The significance level is also called the alpha level.

Another way of looking at the level of significance is as the probability of making a Type I error. A Type I error occurs when you reject a null hypothesis that is actually true; this outcome is also termed a “false positive”. Take the example of a person accused of committing a crime. The null hypothesis is that the person is not guilty. A Type I error happens when you reject that null hypothesis by mistake: the innocent person is convicted.

The level of significance can take values such as 0.1, 0.05, and 0.01, with 0.05 being the most common. The lower the significance level, the lower the chance of a Type I error: the hypothesis testing outcome would need to be very strong before you reject the null hypothesis. However, lowering the significance level increases the chance of a Type II error, that is, of failing to reject a null hypothesis that is false. You may want to read more details in this post – Type I errors and Type II errors in hypothesis testing.

The outcome of hypothesis testing is evaluated with the help of a p-value. If the p-value is less than the level of significance, the outcome is statistically significant and we reject the null hypothesis. If the p-value is greater than the level of significance, the outcome is not statistically significant and we fail to reject the null hypothesis. The same is represented in the picture below for a right-tailed test. Details on the different types of tailed tests will come in future posts.

level of significance and hypothesis testing

The picture below represents the concept for two-tailed hypothesis test:

level of significance and two-tailed test

For example: Let’s say that a school principal wants to find out whether two hours of extra coaching after school helps students do better in their exams. The hypotheses would be as follows:

  • Null hypothesis: There is no difference in student performance with or without the two hours of extra coaching after school.
  • Alternate hypothesis: Students perform better when they get two hours of extra coaching after school. With a significance level of 0.05, the evidence would need to be strong before we conclude that there is a real difference in performance between students who do and do not take extra coaching.

Now, let’s say that we conduct this experiment with 100 students and measure their exam scores. The test statistic is computed to be z = -0.50 (p-value = 0.62). Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is not enough evidence to show a difference in the performance of students based on whether they get extra coaching.
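The reported p-value can be verified in a line of stdlib Python: for a two-tailed test, p is twice the normal tail area beyond |z|.

```python
import math

# Two-tailed p-value for the reported test statistic z = -0.50.
z = -0.50
p = math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))
print(round(p, 2))  # 0.62
```

This reproduces the p-value of 0.62 quoted in the example.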

While performing hypothesis tests or experiments, it is important to keep the level of significance in mind.

Why does one need a level of significance?

In hypothesis tests, if we did not have some threshold for deciding whether results are statistically significant, it would be hard to determine whether our findings were meaningful. This is why we set a level of significance before performing hypothesis tests and experiments.

Since hypothesis testing helps us make decisions about our data, setting a level of significance tells us how much risk of a false positive we are willing to accept. If you set your level of significance at 0.05, for example, there is only a 5% chance of rejecting the null hypothesis when it is actually true, that is, of declaring a difference between groups (assuming two groups are tested) that is really due to random sampling error.

This is why hypothesis testing and the level of significance go hand in hand: the hypothesis test tells us whether our data fall in a region that is statistically significant, while the level of significance controls how likely it is that a significant result is due to random sampling error alone.

How is the level of significance used in hypothesis testing?

The level of significance, together with the test statistic and the p-value, forms a key part of hypothesis testing. The conclusion you draw depends on whether you reject or fail to reject the null hypothesis given your findings. Before going into rejection vs non-rejection, let’s understand the terms better.

If the test statistic falls within the critical region, you reject the null hypothesis. This means that your findings are statistically significant and support the alternate hypothesis. The p-value tells you how likely such an outcome would be if the null hypothesis were in fact true. If the p-value is less than or equal to the level of significance, you reject the null hypothesis, and the outcome is statistically significant in favor of the alternate hypothesis.

If, on the other hand, the p-value is greater than the significance level (alpha), you fail to reject the null hypothesis: the findings are not statistically significant enough to reject it.
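The decision rule just described can be captured in a couple of lines of Python (the helper name `decide` is my own):

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule described above."""
    if p_value <= alpha:
        return "reject the null hypothesis (statistically significant)"
    return "fail to reject the null hypothesis"

print(decide(0.03))        # significant at the default alpha of 0.05
print(decide(0.03, 0.01))  # not significant at the stricter alpha of 0.01
```

Note how the same p-value leads to different decisions at different significance levels, which is why alpha must be fixed before the study begins.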


Hypothesis testing is an important statistical concept that helps us determine whether a claim about something is true. The test statistic, level of significance, and p-value all work together to help you make decisions about your data. If a hypothesis test shows enough evidence to reject the null hypothesis, then we have statistically significant findings. This post showed how you can use hypothesis testing in your experiments by understanding what it means to reject or fail to reject the null hypothesis.

Ajitesh Kumar


Hypothesis Testing (cont...)

Hypothesis Testing: The Null and Alternative Hypothesis

In order to undertake hypothesis testing you need to express your research hypothesis as a null and alternative hypothesis. The null hypothesis and alternative hypothesis are statements regarding the differences or effects that occur in the population. You will use your sample to test which statement (i.e., the null hypothesis or alternative hypothesis) is most likely (although technically, you test the evidence against the null hypothesis). So, with respect to our teaching example, the null and alternative hypothesis will reflect statements about all statistics students on graduate management courses.

The null hypothesis is essentially the "devil's advocate" position. That is, it assumes that whatever you are trying to prove did not happen (hint: it usually states that something equals zero). For example, the two different teaching methods did not result in different exam performances (i.e., zero difference). Another example might be that there is no relationship between anxiety and athletic performance (i.e., the slope is zero). The alternative hypothesis states the opposite and is usually the hypothesis you are trying to prove (e.g., the two different teaching methods did result in different exam performances). Initially, you can state these hypotheses in more general terms (e.g., using terms like "effect", "relationship", etc.), as shown below for the teaching methods example:

How you want to "summarize" the exam performances will determine the specific null and alternative hypotheses you write. For example, you could compare the mean exam performance of each group (i.e., the "seminar" group and the "lectures-only" group). This is what we will demonstrate here, but other options include comparing the distributions and medians, amongst other things. As such, we can state:

Now that you have identified the null and alternative hypotheses, you need to find evidence and develop a strategy for declaring your "support" for either the null or alternative hypothesis. We can do this using some statistical theory and some arbitrary cut-off points. Both these issues are dealt with next.

Significance levels

The level of statistical significance is often expressed as the so-called p-value. Depending on the statistical test you have chosen, you will calculate a probability (i.e., the p-value) of observing your sample results (or more extreme) given that the null hypothesis is true. Another way of phrasing this is to consider the probability that a difference in a mean score (or other statistic) could have arisen based on the assumption that there really is no difference. Let us consider this statement with respect to our example where we are interested in the difference in mean exam performance between two different teaching methods. If there really is no difference between the two teaching methods in the population (i.e., given that the null hypothesis is true), how likely would it be to see a difference in the mean exam performance between the two teaching methods as large as (or larger than) that which has been observed in your sample?

So, you might get a p-value such as 0.03 (i.e., p = .03). This means that there is a 3% chance of finding a difference as large as (or larger than) the one in your study given that the null hypothesis is true. However, you want to know whether this is "statistically significant". Typically, if there was a 5% or less chance (5 times in 100 or less) of seeing a difference in the mean exam performance as large as the one observed given that the null hypothesis is true, you would reject the null hypothesis and accept the alternative hypothesis. Alternatively, if the chance was greater than 5% (more than 5 times in 100), you would fail to reject the null hypothesis and would not accept the alternative hypothesis. As such, in this example where p = .03, we reject the null hypothesis and accept the alternative hypothesis. We reject it because a result this extreme would occur only 3% of the time if the null hypothesis were true, which is rare enough for us to be confident that it was the two teaching methods that had an effect on exam performance.

Whilst there is relatively little justification why a significance level of 0.05 is used rather than 0.01 or 0.10, for example, it is widely used in academic research. However, if you want to be particularly confident in your results, you can set a more stringent level of 0.01 (a 1% chance or less; 1 in 100 chance or less).


One- and two-tailed predictions

When considering whether we reject the null hypothesis and accept the alternative hypothesis, we need to consider the direction of the alternative hypothesis statement. For example, the alternative hypothesis that was stated earlier is:

The alternative hypothesis tells us two things. First, what predictions did we make about the effect of the independent variable(s) on the dependent variable(s)? Second, what was the predicted direction of this effect? Let's use our example to highlight these two points.

Sarah predicted that her teaching method (independent variable: teaching method), whereby she not only required her students to attend lectures, but also seminars, would have a positive effect (that is, increase) on students' performance (dependent variable: exam marks). If an alternative hypothesis has a direction (and this is how you want to test it), the hypothesis is one-tailed. That is, it predicts the direction of the effect. If the alternative hypothesis had stated that the effect was expected to be negative, this is also a one-tailed hypothesis.

Alternatively, a two-tailed prediction means that we do not make a choice over the direction that the effect of the experiment takes. Rather, it simply implies that the effect could be negative or positive. If Sarah had made a two-tailed prediction, the alternative hypothesis might have been:

In other words, we simply take out the word "positive", which implies the direction of our effect. In our example, making a two-tailed prediction may seem strange. After all, it would be logical to expect that "extra" tuition (going to seminar classes as well as lectures) would either have a positive effect on students' performance or no effect at all, but certainly not a negative effect. However, this is just our opinion (and hope) and certainly does not mean that we will get the effect we expect. Generally speaking, making a one-tailed prediction (i.e., and testing for it this way) is frowned upon as it usually reflects the hope of a researcher rather than any certainty that it will happen. Notable exceptions to this rule are when there is only one possible way in which a change could occur. This can happen, for example, when biological activity/presence is measured. That is, a protein might be "dormant" and the stimulus you are using can only possibly "wake it up" (i.e., it cannot possibly reduce the activity of a "dormant" protein). In addition, for some statistical tests, one-tailed tests are not possible.

Rejecting or failing to reject the null hypothesis

Let's return finally to the question of whether we reject or fail to reject the null hypothesis.

If our statistical analysis shows that the significance level is below the cut-off value we have set (e.g., either 0.05 or 0.01), we reject the null hypothesis and accept the alternative hypothesis. Alternatively, if the significance level is above the cut-off value, we fail to reject the null hypothesis and cannot accept the alternative hypothesis. You should note that you cannot accept the null hypothesis, but only find evidence against it.

Hypothesis Testing with Z-Test: Significance Level and Rejection Region



If you want to understand why hypothesis testing works, you should first have an idea about the significance level and the rejection region. We assume you already know what a hypothesis is, so let’s jump right into the action.

What Is the Significance Level?

First, we must define the term significance level .

Normally, we aim to reject the null if it is false.


However, as with any test, there is a small chance that we could get it wrong and reject a null hypothesis that is true.


How Is the Significance Level Denoted?

The significance level is denoted by α and is the probability of rejecting the null hypothesis if it is true.


In other words, it is the probability of making this error.

Typical values for α are 0.01, 0.05 and 0.1. It is a value that we select based on the certainty we need. In most cases, the choice of α is determined by the context we are operating in, but 0.05 is the most commonly used value.


A Case in Point

Say, we need to test if a machine is working properly. We would expect the test to make few or no mistakes. As we want to be very precise, we should pick a low significance level such as 0.01.

The famous Coca Cola glass bottle is 12 ounces. If the machine pours 12.1 ounces, some of the liquid would be spilled, and the label would be damaged as well. So, in certain situations, we need to be as accurate as possible.


Higher Degree of Error

However, if we are analyzing humans or companies, we would expect more random or at least uncertain behavior. Hence, a higher degree of error.


For instance, if we want to predict how much Coca Cola its consumers drink on average, the difference between 12 ounces and 12.1 ounces will not be that crucial. So, we can choose a higher significance level like 0.05 or 0.1.


Hypothesis Testing: Performing a Z-Test

Now that we have an idea about the significance level , let’s get to the mechanics of hypothesis testing.

Imagine you are consulting a university and want to carry out an analysis on how students are performing on average.


The university dean believes that on average students have a GPA of 70%. Being the data-driven researcher that you are, you can’t simply agree with his opinion, so you start testing.

The null hypothesis is: The population mean grade is 70%.

This is a hypothesized value.

The alternative hypothesis is: The population mean grade is not 70%.

Visualizing the Grades

Assuming that the population of grades is normally distributed, the grades received by students cluster symmetrically around the true population mean.

Performing a Z-test

Now, a test we would normally perform is the Z-test . The formula is:

Z = (sample mean - hypothesized mean) / standard error


The idea is the following.

We are standardizing or scaling the sample mean we got. If the sample mean is close enough to the hypothesized mean, then Z will be close to 0. Otherwise, it will be far away from it. Naturally, if the sample mean is exactly equal to the hypothesized mean, Z will be 0.


In all these cases, we would fail to reject the null hypothesis.
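As a quick sketch of the formula, here is a stdlib-Python version with hypothetical GPA numbers: a sample mean of 73%, a sample standard deviation of 12 percentage points, and n = 64 are all invented for illustration.

```python
import math

def z_statistic(sample_mean, hypothesized_mean, sample_sd, n):
    """Z = (sample mean - hypothesized mean) / standard error."""
    se = sample_sd / math.sqrt(n)   # standard error of the mean
    return (sample_mean - hypothesized_mean) / se

# Hypothetical GPA sample statistics (not from the lesson):
print(round(z_statistic(0.73, 0.70, 0.12, 64), 2))  # 2.0
```

A sample mean three points above the hypothesized 70%, with this spread and sample size, lands two standard errors away from 0.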

What Is the Rejection Region?

The question here is the following:

How big should Z be for us to reject the null hypothesis ?

Well, there is a cut-off line. Since we are conducting a two-sided or a two-tailed test, there are two cut-off lines, one on each side.


When we calculate Z , we will get a value. If this value falls into the middle part, then we cannot reject the null. If it falls outside, in the shaded region, then we reject the null hypothesis .

That is why the shaded part is called: rejection region , as you can see below.

Rejection region, significance level

What Does the Rejection Region Depend on?

The area that is cut-off actually depends on the significance level .

Say the level of significance , α , is 0.05. Then we have α divided by 2, or 0.025 on the left side and 0.025 on the right side.

The level of significance, α, is 0.05. Then we have α divided by 2, or 0.025 on the left side and 0.025 on the right side

Now these are values we can check in the z-table. A tail probability of 0.025 corresponds to a Z-score of 1.96. So, we have 1.96 on the right side and minus 1.96 on the left side.

Therefore, if the value we get for Z from the test is lower than minus 1.96, or higher than 1.96, we will reject the null hypothesis . Otherwise, we will fail to reject it.

Two-sided test: Z score is 1.96
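The ±1.96 cut-offs and the decision rule can be reproduced with Python's standard library; a small sketch using `statistics.NormalDist`, which inverts the standard normal CDF:

```python
from statistics import NormalDist

alpha = 0.05
# Two-tailed test: put alpha/2 = 0.025 in each tail.
# The right cut-off is the Z with cumulative probability 1 - alpha/2 = 0.975.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z_crit, 2))  # 1.96; the left cut-off is -1.96 by symmetry

def decide(z):
    # Reject the null only if Z falls in either tail of the rejection region
    return "reject H0" if abs(z) > z_crit else "fail to reject H0"

print(decide(0.4))   # fail to reject H0
print(decide(-2.3))  # reject H0
```

A Z near 0 lands in the middle region, so we cannot reject the null; a Z beyond the cut-offs lands in the rejection region.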

That’s more or less how hypothesis testing works.

We scale the sample mean with respect to the hypothesized value. If Z is close to 0, then we cannot reject the null. If it is far away from 0, then we reject the null hypothesis .

How does hypothesis testing work?

Example of One Tailed Test

What about one-sided tests? We have those too!

Let’s consider the following situation.

Paul says data scientists earn more than $125,000. Taking his claim as the null, H 0 is: μ is bigger than or equal to $125,000.

The alternative is that μ is lower than $125,000.

Using the same significance level , this time the whole rejection region is on the left, so it has an area of α . Looking at the z-table, an upper-tail area of 0.05 corresponds to a Z-score of 1.645. Since the region is on the left, the cut-off carries a minus sign: -1.645.

One-sided test: Z score is 1.645
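The one-tailed cut-off comes from the same standard-library lookup (a sketch; now the whole α sits in a single tail):

```python
from statistics import NormalDist

alpha = 0.05
# One-tailed test: all of alpha goes into one tail, so the cut-off is the
# Z with cumulative probability 1 - alpha = 0.95.
z_crit = NormalDist().inv_cdf(1 - alpha)
print(round(z_crit, 3))  # 1.645
# For a left-tailed rejection region, the cut-off is -z_crit = -1.645.
```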

Accept or Reject

Now, when calculating our test statistic Z , if we get a value lower than -1.645, we reject the null hypothesis , because we have statistical evidence that the data scientist salary is less than $125,000. Otherwise, we fail to reject it.

One-sided test: Z score is - 1.645 - rejecting null hypothesis

Another One-Tailed Test

To exhaust all possibilities, let’s explore another one-tailed test.

Say the university dean told you that the average GPA students get is lower than 70%. Taking that as the null hypothesis, we have:

μ 0 is lower than or equal to 70%.

While the alternative is:

μ 0 is bigger than 70%.

University Dean example: Null hypothesis lower than the population mean

In this situation, the rejection region is on the right side. So, if the test statistic is bigger than the cut-off z-score, we would reject the null, otherwise, we wouldn’t.

One-sided test: test statistic is bigger than the cut-off z-score - reject the null hypothesis

Importance of the Significance Level and the Rejection Region

To sum up, the significance level and the rejection region are crucial in the process of hypothesis testing. The significance level sets how much risk of a false rejection we are willing to accept; we (the researchers) choose it depending on how big of a difference a possible error could make. The rejection region, in turn, tells us whether the test statistic is extreme enough to reject the null hypothesis . After reading this and putting both of them into use, you will realize how much they streamline your work.

Interested in taking your skills from good to great? Try our statistics course for free!



Iliya Valchanov

Co-founder of 365 Data Science

Iliya is a finance graduate with a strong quantitative background who chose the exciting path of a startup entrepreneur. He demonstrated a formidable affinity for numbers during his childhood, winning more than 90 national and international awards and competitions through the years. Iliya started teaching at university, helping other students learn statistics and econometrics. Inspired by his first happy students, he co-founded 365 Data Science to continue spreading knowledge. He authored several of the program’s online courses in mathematics, statistics, machine learning, and deep learning.


Hypothesis Testing - Chi Squared Test

Lisa Sullivan, PhD

Professor of Biostatistics

Boston University School of Public Health


Introduction

This module will continue the discussion of hypothesis testing, where a specific statement or hypothesis is generated about a population parameter, and sample statistics are used to assess the likelihood that the hypothesis is true. The hypothesis is based on available information and the investigator's belief about the population parameters. The specific tests considered here are called chi-square tests and are appropriate when the outcome is discrete (dichotomous, ordinal or categorical). For example, in some clinical trials the outcome is a classification such as hypertensive, pre-hypertensive or normotensive. We could use the same classification in an observational study such as the Framingham Heart Study to compare men and women in terms of their blood pressure status - again using the classification of hypertensive, pre-hypertensive or normotensive status.  

The technique to analyze a discrete outcome uses what is called a chi-square test. Specifically, the test statistic follows a chi-square probability distribution. We will consider chi-square tests here with one, two and more than two independent comparison groups.

Learning Objectives

After completing this module, the student will be able to:

  • Perform chi-square tests by hand
  • Appropriately interpret results of chi-square tests
  • Identify the appropriate hypothesis testing procedure based on type of outcome variable and number of samples

Tests with One Sample, Discrete Outcome

Here we consider hypothesis testing with a discrete outcome variable in a single population. Discrete variables are variables that take on more than two distinct responses or categories and the responses can be ordered or unordered (i.e., the outcome can be ordinal or categorical). The procedure we describe here can be used for dichotomous (exactly 2 response options), ordinal or categorical discrete outcomes and the objective is to compare the distribution of responses, or the proportions of participants in each response category, to a known distribution. The known distribution is derived from another study or report and it is again important in setting up the hypotheses that the comparator distribution specified in the null hypothesis is a fair comparison. The comparator is sometimes called an external or a historical control.   

In one sample tests for a discrete outcome, we set up our hypotheses against an appropriate comparator. We select a sample and compute descriptive statistics on the sample data. Specifically, we compute the sample size (n) and the proportions of participants in each response category.

Test Statistic for Testing H 0 : p 1 = p 10 , p 2 = p 20 , ..., p k = p k0

χ 2 = Σ [(O - E) 2 / E]

We find the critical value in a table of probabilities for the chi-square distribution with degrees of freedom (df) = k-1. In the test statistic, O = observed frequency and E = expected frequency in each of the response categories. The observed frequencies are those observed in the sample and the expected frequencies are computed as described below. χ 2 (chi-square) is another probability distribution and ranges from 0 to ∞. The test statistic formula above is appropriate for large samples, defined as expected frequencies of at least 5 in each of the response categories.

When we conduct a χ 2 test, we compare the observed frequencies in each response category to the frequencies we would expect if the null hypothesis were true. These expected frequencies are determined by allocating the sample to the response categories according to the distribution specified in H 0 . This is done by multiplying the observed sample size (n) by the proportions specified in the null hypothesis (p 10 , p 20 , ..., p k0 ). To ensure that the sample size is appropriate for the use of the test statistic above, we need to ensure the following: min(np 10 , np 20 , ..., np k0 ) ≥ 5.
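Both steps (allocating n according to H 0 and checking the minimum expected frequency) can be sketched in a few lines of Python, using the exercise-survey numbers from the next example (n = 470 and the prior-year proportions 0.60, 0.25, 0.15):

```python
n = 470
null_proportions = [0.60, 0.25, 0.15]  # p_10, p_20, p_30 under H0

# Expected frequency in each category: allocate n according to H0
expected = [n * p for p in null_proportions]
print([round(e, 1) for e in expected])  # [282.0, 117.5, 70.5]

# Sample-size check: the smallest expected frequency must be at least 5
assert min(expected) >= 5
```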

The test of hypothesis with a discrete outcome measured in a single sample, where the goal is to assess whether the distribution of responses follows a known distribution, is called the χ 2 goodness-of-fit test. As the name indicates, the idea is to assess whether the pattern or distribution of responses in the sample "fits" a specified population (external or historical) distribution. In the next example we illustrate the test. As we work through the example, we provide additional details related to the use of this new test statistic.  

A University conducted a survey of its recent graduates to collect demographic and health information for future planning purposes as well as to assess students' satisfaction with their undergraduate experiences. The survey revealed that a substantial proportion of students were not engaging in regular exercise, many felt their nutrition was poor and a substantial number were smoking. In response to a question on regular exercise, 60% of all graduates reported getting no regular exercise, 25% reported exercising sporadically and 15% reported exercising regularly as undergraduates. The next year the University launched a health promotion campaign on campus in an attempt to increase health behaviors among undergraduates. The program included modules on exercise, nutrition and smoking cessation. To evaluate the impact of the program, the University again surveyed graduates and asked the same questions. The survey was completed by 470 graduates and the following data were collected on the exercise question:

Based on the data, is there evidence of a shift in the distribution of responses to the exercise question following the implementation of the health promotion campaign on campus? Run the test at a 5% level of significance.

In this example, we have one sample and a discrete (ordinal) outcome variable (with three response options). We specifically want to compare the distribution of responses in the sample to the distribution reported the previous year (i.e., 60%, 25%, 15% reporting no, sporadic and regular exercise, respectively). We now run the test using the five-step approach.  

  • Step 1. Set up hypotheses and determine level of significance.

The null hypothesis again represents the "no change" or "no difference" situation. If the health promotion campaign has no impact then we expect the distribution of responses to the exercise question to be the same as that measured prior to the implementation of the program.

H 0 : p 1 =0.60, p 2 =0.25, p 3 =0.15,  or equivalently H 0 : Distribution of responses is 0.60, 0.25, 0.15  

H 1 :   H 0 is false.          α =0.05

Notice that the research hypothesis is written in words rather than in symbols. The research hypothesis as stated captures any difference in the distribution of responses from that specified in the null hypothesis. We do not specify a specific alternative distribution, instead we are testing whether the sample data "fit" the distribution in H 0 or not. With the χ 2 goodness-of-fit test there is no upper or lower tailed version of the test.

  • Step 2. Select the appropriate test statistic.  

The test statistic is:

We must first assess whether the sample size is adequate. Specifically, we need to check min(np 10 , np 20 , ..., np k0 ) ≥ 5. The sample size here is n=470 and the proportions specified in the null hypothesis are 0.60, 0.25 and 0.15. Thus, min(470(0.60), 470(0.25), 470(0.15)) = min(282, 117.5, 70.5) = 70.5. The sample size is more than adequate so the formula can be used.

  • Step 3. Set up decision rule.  

The decision rule for the χ 2 test depends on the level of significance and the degrees of freedom, defined as degrees of freedom (df) = k-1 (where k is the number of response categories). If the null hypothesis is true, the observed and expected frequencies will be close in value and the χ 2 statistic will be close to zero. If the null hypothesis is false, then the χ 2 statistic will be large. Critical values can be found in a table of probabilities for the χ 2 distribution. Here we have df=k-1=3-1=2 and a 5% level of significance. The appropriate critical value is 5.99, and the decision rule is as follows: Reject H 0 if χ 2 > 5.99.
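As a side note, for df = 2 the chi-square distribution reduces to an exponential distribution with mean 2, so its CDF is 1 - e^(-x/2) and the critical value has a closed form; a small sketch verifying the 5.99 cut-off:

```python
import math

alpha = 0.05
df = 2  # three response categories, so df = k - 1 = 2

# For df = 2, P(chi-square <= x) = 1 - exp(-x/2), so the upper-alpha
# critical value solves exp(-x/2) = alpha, i.e. x = -2 * ln(alpha).
critical_value = -2 * math.log(alpha)
print(round(critical_value, 2))  # 5.99, matching the table value

# Sanity check: the tail area above this cut-off is exactly alpha
assert math.isclose(math.exp(-critical_value / 2), alpha)
```

For other degrees of freedom there is no closed form and the value is read from a χ 2 table (or a statistics library).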

  • Step 4. Compute the test statistic.  

We now compute the expected frequencies using the sample size and the proportions specified in the null hypothesis. We then substitute the sample data (observed frequencies) and the expected frequencies into the formula for the test statistic identified in Step 2. The computations can be organized as follows.

Notice that the expected frequencies are taken to one decimal place and that the sum of the observed frequencies is equal to the sum of the expected frequencies. The test statistic is computed as follows:
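The survey's observed-count table is not reproduced here, but the counts can be recovered from the percentages quoted at the end of this example (255 reporting no regular exercise and 90 regular, leaving 125 sporadic out of n = 470). A sketch of the full computation:

```python
# Observed counts reconstructed from the percentages quoted in the text
observed = [255, 125, 90]              # no, sporadic, regular exercise
null_proportions = [0.60, 0.25, 0.15]  # prior-year distribution under H0
n = sum(observed)                      # 470

expected = [n * p for p in null_proportions]  # [282.0, 117.5, 70.5]

# Chi-square statistic: sum of (O - E)^2 / E over the categories
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # 8.46, which exceeds the critical value 5.99
```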

  • Step 5. Conclusion.  

We reject H 0 because 8.46 > 5.99. We have statistically significant evidence at α=0.05 to show that H 0 is false, or that the distribution of responses is not 0.60, 0.25, 0.15.  The p-value is p < 0.005.  

In the χ 2 goodness-of-fit test, we conclude that either the distribution specified in H 0 is false (when we reject H 0 ) or that we do not have sufficient evidence to show that the distribution specified in H 0 is false (when we fail to reject H 0 ). Here, we rejected H 0 and concluded that the distribution of responses to the exercise question following the implementation of the health promotion campaign was not the same as the distribution prior. The test itself does not provide details of how the distribution has shifted. A comparison of the observed and expected frequencies will provide some insight into the shift (when the null hypothesis is rejected). Does it appear that the health promotion campaign was effective?

Consider the following: 

If the null hypothesis were true (i.e., no change from the prior year) we would have expected more students to fall in the "No Regular Exercise" category and fewer in the "Regular Exercise" category. In the sample, 255/470 = 54% reported no regular exercise and 90/470 = 19% reported regular exercise. Thus, there is a shift toward more regular exercise following the implementation of the health promotion campaign. There is evidence of a statistical difference, but is this a meaningful difference? Is there room for improvement?

The National Center for Health Statistics (NCHS) provided data on the distribution of weight (in categories) among Americans in 2002. The distribution was based on specific values of body mass index (BMI) computed as weight in kilograms over height in meters squared. Underweight was defined as BMI< 18.5, Normal weight as BMI between 18.5 and 24.9, overweight as BMI between 25 and 29.9 and obese as BMI of 30 or greater. Americans in 2002 were distributed as follows: 2% Underweight, 39% Normal Weight, 36% Overweight, and 23% Obese. Suppose we want to assess whether the distribution of BMI is different in the Framingham Offspring sample. Using data from the n=3,326 participants who attended the seventh examination of the Offspring in the Framingham Heart Study we created the BMI categories as defined and observed the following:

  • Step 1.  Set up hypotheses and determine level of significance.

H 0 : p 1 =0.02, p 2 =0.39, p 3 =0.36, p 4 =0.23     or equivalently

H 0 : Distribution of responses is 0.02, 0.39, 0.36, 0.23

H 1 :   H 0 is false.        α=0.05

The formula for the test statistic is:

We must assess whether the sample size is adequate. Specifically, we need to check min(np 10 , np 20 , ..., np k0 ) ≥ 5. The sample size here is n=3,326 and the proportions specified in the null hypothesis are 0.02, 0.39, 0.36 and 0.23. Thus, min(3326(0.02), 3326(0.39), 3326(0.36), 3326(0.23)) = min(66.5, 1297.1, 1197.4, 765.0) = 66.5. The sample size is more than adequate, so the formula can be used.

Here we have df=k-1=4-1=3 and a 5% level of significance. The appropriate critical value is 7.81 and the decision rule is as follows: Reject H 0 if χ 2 > 7.81.

We now compute the expected frequencies using the sample size and the proportions specified in the null hypothesis. We then substitute the sample data (observed frequencies) into the formula for the test statistic identified in Step 2. We organize the computations in the following table.

The test statistic is computed as follows:

We reject H 0 because 233.53 > 7.81. We have statistically significant evidence at α=0.05 to show that H 0 is false or that the distribution of BMI in Framingham is different from the national data reported in 2002, p < 0.005.  

Again, the χ 2 goodness-of-fit test allows us to assess whether the distribution of responses "fits" a specified distribution. Here we show that the distribution of BMI in the Framingham Offspring Study is different from the national distribution. To understand the nature of the difference we can compare observed and expected frequencies or observed and expected proportions (or percentages). The frequencies are large because of the large sample size; the observed percentages of patients in the Framingham sample are as follows: 0.6% underweight, 28% normal weight, 41% overweight and 30% obese. In the Framingham Offspring sample there are higher percentages of overweight and obese persons (41% and 30% in Framingham as compared to 36% and 23% in the national data), and lower proportions of underweight and normal weight persons (0.6% and 28% in Framingham as compared to 2% and 39% in the national data). Are these meaningful differences?

In the module on hypothesis testing for means and proportions, we discussed hypothesis testing applications with a dichotomous outcome variable in a single population. We presented a test using a test statistic Z to test whether an observed (sample) proportion differed significantly from a historical or external comparator. The chi-square goodness-of-fit test can also be used with a dichotomous outcome and the results are mathematically equivalent.  

In the prior module, we considered the following example. Here we show the equivalence to the chi-square goodness-of-fit test.

The NCHS report indicated that in 2002, 75% of children aged 2 to 17 saw a dentist in the past year. An investigator wants to assess whether use of dental services is similar in children living in the city of Boston. A sample of 125 children aged 2 to 17 living in Boston are surveyed and 64 reported seeing a dentist over the past 12 months. Is there a significant difference in use of dental services between children living in Boston and the national data?

We presented the following approach to the test using a Z statistic. 

  • Step 1. Set up hypotheses and determine level of significance

H 0 : p = 0.75

H 1 : p ≠ 0.75                               α=0.05

We must first check that the sample size is adequate. Specifically, we need to check min(np 0 , n(1-p 0 )) = min(125(0.75), 125(0.25)) = min(93.75, 31.25) = 31.25. The sample size is more than adequate, so the following formula can be used:

This is a two-tailed test, using a Z statistic and a 5% level of significance. Reject H 0 if Z < -1.960 or if Z > 1.960.

We now substitute the sample data into the formula for the test statistic identified in Step 2. The sample proportion is 64/125 = 0.512.

We reject H 0 because -6.15 < -1.960. We have statistically significant evidence at α=0.05 to show that there is a statistically significant difference in the use of dental services by children living in Boston as compared to the national data (p < 0.0001).

We now conduct the same test using the chi-square goodness-of-fit test. First, we summarize our sample data as follows:

H 0 : p 1 =0.75, p 2 =0.25     or equivalently H 0 : Distribution of responses is 0.75, 0.25 

We must assess whether the sample size is adequate. Specifically, we need to check min(np 10 , np 20 , ..., np k0 ) ≥ 5. The sample size here is n=125 and the proportions specified in the null hypothesis are 0.75, 0.25. Thus, min(125(0.75), 125(0.25)) = min(93.75, 31.25) = 31.25. The sample size is more than adequate so the formula can be used.

Here we have df=k-1=2-1=1 and a 5% level of significance. The appropriate critical value is 3.84, and the decision rule is as follows: Reject H 0 if χ 2 > 3.84. (Note that 1.96 2 = 3.84, where 1.96 was the critical value used in the Z test for proportions shown above.)

(Note that (-6.15) 2 = 37.8, where -6.15 was the value of the Z statistic in the test for proportions shown above.)

We reject H 0 because 37.8 > 3.84. We have statistically significant evidence at α=0.05 to show that there is a statistically significant difference in the use of dental service by children living in Boston as compared to the national data.  (p < 0.0001). This is the same conclusion we reached when we conducted the test using the Z test above. With a dichotomous outcome, Z 2 = χ 2 !   In statistics, there are often several approaches that can be used to test hypotheses. 
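The equivalence is easy to verify numerically with the dental data (a sketch; 64 of n = 125 children saw a dentist, tested against p 0 = 0.75):

```python
import math

n, successes = 125, 64
p0 = 0.75

# Z test for one proportion
p_hat = successes / n                        # 0.512
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
print(round(z, 2))                           # -6.15

# Chi-square goodness-of-fit on the same data (two categories: seen / not seen)
observed = [successes, n - successes]        # [64, 61]
expected = [n * p0, n * (1 - p0)]            # [93.75, 31.25]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 1))                      # 37.8

# With a dichotomous outcome the two statistics agree: Z^2 = chi-square
assert math.isclose(z ** 2, chi_sq)
```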

Tests for Two or More Independent Samples, Discrete Outcome

Here we extend that application of the chi-square test to the case with two or more independent comparison groups. Specifically, the outcome of interest is discrete with two or more responses and the responses can be ordered or unordered (i.e., the outcome can be dichotomous, ordinal or categorical). We now consider the situation where there are two or more independent comparison groups and the goal of the analysis is to compare the distribution of responses to the discrete outcome variable among several independent comparison groups.  

The test is called the χ 2 test of independence and the null hypothesis is that there is no difference in the distribution of responses to the outcome across comparison groups. This is often stated as follows: The outcome variable and the grouping variable (e.g., the comparison treatments or comparison groups) are independent (hence the name of the test). Independence here implies homogeneity in the distribution of the outcome among comparison groups.    

The null hypothesis in the χ 2 test of independence is often stated in words as: H 0 : The distribution of the outcome is independent of the groups. The alternative or research hypothesis is that there is a difference in the distribution of responses to the outcome variable among the comparison groups (i.e., that the distribution of responses "depends" on the group). In order to test the hypothesis, we measure the discrete outcome variable in each participant in each comparison group. The data of interest are the observed frequencies (or number of participants in each response category in each group). The formula for the test statistic for the χ 2 test of independence is given below.

Test Statistic for Testing H 0 : Distribution of outcome is independent of groups

and we find the critical value in a table of probabilities for the chi-square distribution with df=(r-1)*(c-1).

Here O = observed frequency, E=expected frequency in each of the response categories in each group, r = the number of rows in the two-way table and c = the number of columns in the two-way table.   r and c correspond to the number of comparison groups and the number of response options in the outcome (see below for more details). The observed frequencies are the sample data and the expected frequencies are computed as described below. The test statistic is appropriate for large samples, defined as expected frequencies of at least 5 in each of the response categories in each group.  

The data for the χ 2 test of independence are organized in a two-way table. The outcome and grouping variable are shown in the rows and columns of the table. The sample table below illustrates the data layout. The table entries (blank below) are the numbers of participants in each group responding to each response category of the outcome variable.

Table - Possible outcomes are listed in the columns; the groups being compared are listed in the rows.

In the table above, the grouping variable is shown in the rows of the table; r denotes the number of independent groups. The outcome variable is shown in the columns of the table; c denotes the number of response options in the outcome variable. Each combination of a row (group) and column (response) is called a cell of the table. The table has r*c cells and is sometimes called an r x c ("r by c") table. For example, if there are 4 groups and 5 categories in the outcome variable, the data are organized in a 4 X 5 table. The row and column totals are shown along the right-hand margin and the bottom of the table, respectively. The total sample size, N, can be computed by summing the row totals or the column totals. Similar to ANOVA, N does not refer to a population size here but rather to the total sample size in the analysis. The sample data can be organized into a table like the above. The numbers of participants within each group who select each response option are shown in the cells of the table and these are the observed frequencies used in the test statistic.

The test statistic for the χ 2 test of independence involves comparing observed (sample data) and expected frequencies in each cell of the table. The expected frequencies are computed assuming that the null hypothesis is true. The null hypothesis states that the two variables (the grouping variable and the outcome) are independent. The definition of independence is as follows:

 Two events, A and B, are independent if P(A|B) = P(A), or equivalently, if P(A and B) = P(A) P(B).

The second statement indicates that if two events, A and B, are independent then the probability of their intersection can be computed by multiplying the probability of each individual event. To conduct the χ 2 test of independence, we need to compute expected frequencies in each cell of the table. Expected frequencies are computed by assuming that the grouping variable and outcome are independent (i.e., under the null hypothesis). Thus, if the null hypothesis is true, using the definition of independence:

P(Group 1 and Response Option 1) = P(Group 1) P(Response Option 1).

The above states that the probability that an individual is in Group 1 and their outcome is Response Option 1 is computed by multiplying the probability that a person is in Group 1 by the probability that a person is in Response Option 1. To conduct the χ 2 test of independence, we need expected frequencies and not expected probabilities . To convert the above probability to a frequency, we multiply by N. Consider the following small example.

The data shown above are measured in a sample of size N=150. The frequencies in the cells of the table are the observed frequencies. If Group and Response are independent, then we can compute the probability that a person in the sample is in Group 1 and Response category 1 using:

P(Group 1 and Response 1) = P(Group 1) P(Response 1),

P(Group 1 and Response 1) = (25/150) (62/150) = 0.069.

Thus if Group and Response are independent we would expect 6.9% of the sample to be in the top left cell of the table (Group 1 and Response 1). The expected frequency is 150(0.069) = 10.4.   We could do the same for Group 2 and Response 1:

P(Group 2 and Response 1) = P(Group 2) P(Response 1),

P(Group 2 and Response 1) = (50/150) (62/150) = 0.138.

The expected frequency in Group 2 and Response 1 is 150(0.138) = 20.7.

Thus, the formula for determining the expected cell frequencies in the χ 2 test of independence is as follows:

Expected Cell Frequency = (Row Total * Column Total)/N.

The above computes the expected frequency in one step rather than computing the expected probability first and then converting to a frequency.  
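The one-step formula is easily checked against the small N = 150 example above. Note that it avoids the intermediate rounding of the probability to 0.069, so the Group 1 cell comes out as 10.3 rather than the 10.4 shown earlier:

```python
N = 150
row_totals = {"Group 1": 25, "Group 2": 50}  # from the example above
col_total_response_1 = 62

# Expected cell frequency = (row total * column total) / N
e_g1_r1 = row_totals["Group 1"] * col_total_response_1 / N
e_g2_r1 = row_totals["Group 2"] * col_total_response_1 / N
print(round(e_g1_r1, 1))  # 10.3 (vs 10.4 when the probability is rounded first)
print(round(e_g2_r1, 1))  # 20.7
```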

In a prior example we evaluated data from a survey of university graduates which assessed, among other things, how frequently they exercised. The survey was completed by 470 graduates. In the prior example we used the χ 2 goodness-of-fit test to assess whether there was a shift in the distribution of responses to the exercise question following the implementation of a health promotion campaign on campus. We specifically considered one sample (all students) and compared the observed distribution to the distribution of responses the prior year (a historical control). Suppose we now wish to assess whether there is a relationship between exercise on campus and students' living arrangements. As part of the same survey, graduates were asked where they lived their senior year. The response options were dormitory, on-campus apartment, off-campus apartment, and at home (i.e., commuted to and from the university). The data are shown below.

Based on the data, is there a relationship between exercise and student's living arrangement? Do you think where a person lives affects their exercise status? Here we have four independent comparison groups (living arrangement) and a discrete (ordinal) outcome variable with three response options. We specifically want to test whether living arrangement and exercise are independent. We will run the test using the five-step approach.

H 0 : Living arrangement and exercise are independent

H 1 : H 0 is false.                α=0.05

The null and research hypotheses are written in words rather than in symbols. The research hypothesis is that the grouping variable (living arrangement) and the outcome variable (exercise) are dependent or related.   

  • Step 2.  Select the appropriate test statistic.  

The condition for appropriate use of the above test statistic is that each expected frequency is at least 5. In Step 4 we will compute the expected frequencies and we will ensure that the condition is met.

The decision rule depends on the level of significance and the degrees of freedom, defined as df = (r-1)(c-1), where r and c are the numbers of rows and columns in the two-way data table. The row variable is the living arrangement and there are 4 arrangements considered, thus r=4. The column variable is exercise and 3 responses are considered, thus c=3. For this test, df=(4-1)(3-1)=3(2)=6. Again, with χ 2 tests there are no upper, lower or two-tailed tests. If the null hypothesis is true, the observed and expected frequencies will be close in value and the χ 2 statistic will be close to zero. If the null hypothesis is false, then the χ 2 statistic will be large. The rejection region for the χ 2 test of independence is always in the upper (right-hand) tail of the distribution. For df=6 and a 5% level of significance, the appropriate critical value is 12.59 and the decision rule is as follows: Reject H 0 if χ 2 > 12.59.

We now compute the expected frequencies using the formula,

Expected Frequency = (Row Total * Column Total)/N.

The computations can be organized in a two-way table. The top number in each cell of the table is the observed frequency and the bottom number is the expected frequency.   The expected frequencies are shown in parentheses.

Notice that the expected frequencies are taken to one decimal place and that the sums of the observed frequencies are equal to the sums of the expected frequencies in each row and column of the table.  

Recall in Step 2 a condition for the appropriate use of the test statistic was that each expected frequency is at least 5. This is true for this sample (the smallest expected frequency is 9.6) and therefore it is appropriate to use the test statistic.

We reject H 0 because 60.5 > 12.59. We have statistically significant evidence at α=0.05 to show that H 0 is false or that living arrangement and exercise are not independent (i.e., they are dependent or related), p < 0.005.

Again, the χ 2 test of independence is used to test whether the distribution of the outcome variable is similar across the comparison groups. Here we rejected H 0 and concluded that the distribution of exercise is not independent of living arrangement, or that there is a relationship between living arrangement and exercise. The test provides an overall assessment of statistical significance. When the null hypothesis is rejected, it is important to review the sample data to understand the nature of the relationship. Consider again the sample data. 

Because there are different numbers of students in each living situation, it is difficult to compare exercise patterns on the basis of the frequencies alone. The following table displays the percentages of students in each exercise category by living arrangement. The percentages sum to 100% in each row of the table. For comparison purposes, percentages are also shown for the total sample along the bottom row of the table.

From the above, it is clear that higher percentages of students living in dormitories and in on-campus apartments reported regular exercise (31% and 23%) as compared to students living in off-campus apartments and at home (10% each).  

Test Yourself

 Pancreaticoduodenectomy (PD) is a procedure that is associated with considerable morbidity. A study was recently conducted on 553 patients who had a successful PD between January 2000 and December 2010 to determine whether their Surgical Apgar Score (SAS) is related to 30-day perioperative morbidity and mortality. The table below gives the number of patients experiencing no, minor, or major morbidity by SAS category.  

Question: What would be an appropriate statistical test to examine whether there is an association between Surgical Apgar Score and patient outcome? Using 14.13 as the value of the test statistic for these data, carry out the appropriate test at a 5% level of significance. Show all parts of your test.

In the module on hypothesis testing for means and proportions, we discussed hypothesis testing applications with a dichotomous outcome variable and two independent comparison groups. We presented a test using a test statistic Z to test for equality of independent proportions. The chi-square test of independence can also be used with a dichotomous outcome and the results are mathematically equivalent.  

In the prior module, we considered the following example. Here we show the equivalence to the chi-square test of independence.

A randomized trial is designed to evaluate the effectiveness of a newly developed pain reliever designed to reduce pain in patients following joint replacement surgery. The trial compares the new pain reliever to the pain reliever currently in use (called the standard of care). A total of 100 patients undergoing joint replacement surgery agreed to participate in the trial. Patients were randomly assigned to receive either the new pain reliever or the standard pain reliever following surgery and were blind to the treatment assignment. Before receiving the assigned treatment, patients were asked to rate their pain on a scale of 0-10 with higher scores indicative of more pain. Each patient was then given the assigned treatment and after 30 minutes was again asked to rate their pain on the same scale. The primary outcome was a reduction in pain of 3 or more scale points (defined by clinicians as a clinically meaningful reduction). The following data were observed in the trial.

We tested whether there was a significant difference in the proportions of patients reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) using a Z statistic, as follows. 

H 0 : p 1 = p 2    

H 1 : p 1 ≠ p 2                             α=0.05

Here the new or experimental pain reliever is group 1 and the standard pain reliever is group 2.

We must first check that the sample size is adequate. Specifically, we need to ensure that we have at least 5 successes and 5 failures in each comparison group or that:

In this example, we have

Therefore, the sample size is adequate, so the following formula can be used:

Reject H 0 if Z < -1.960 or if Z > 1.960.

We now substitute the sample data into the formula for the test statistic identified in Step 2. We first compute the overall proportion of successes:

We now substitute to compute the test statistic.
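The substitution can be reproduced in a few lines. The counts below (23 of 50 successes in the new-treatment group, 11 of 50 in the standard group) are illustrative values consistent with the Z = 2.53 noted at the end of this comparison; the original data table is not reproduced in this text:

```python
import math

x1, n1 = 23, 50   # meaningful pain reduction, new pain reliever (illustrative counts)
x2, n2 = 11, 50   # meaningful pain reduction, standard pain reliever
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)        # overall proportion of successes

z = (p1 - p2) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
print(round(z, 2))        # 2.53
print(abs(z) > 1.960)     # True -> reject H0 at the 5% level
```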

Step 5. Conclusion.

We now conduct the same test using the chi-square test of independence.  

H 0 : Treatment and outcome (meaningful reduction in pain) are independent

H 1 :   H 0 is false.         α=0.05

The formula for the test statistic is:  

For this test, df=(2-1)(2-1)=1. At a 5% level of significance, the appropriate critical value is 3.84 and the decision rule is as follows: Reject H0 if χ 2 > 3.84. (Note that 1.96 2 = 3.84, where 1.96 was the critical value used in the Z test for proportions shown above.)

We now compute the expected frequencies using:

The computations can be organized in a two-way table. The top number in each cell of the table is the observed frequency and the bottom number is the expected frequency. The expected frequencies are shown in parentheses.

A condition for the appropriate use of the test statistic was that each expected frequency is at least 5. This is true for this sample (the smallest expected frequency is 22.0) and therefore it is appropriate to use the test statistic.

(Note that (2.53) 2 = 6.4, where 2.53 was the value of the Z statistic in the test for proportions shown above.)
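The equivalence can be verified directly: the chi-square statistic for the 2×2 table, computed without the Yates continuity correction (which the hand calculation above does not use), equals the square of the Z statistic. The same illustrative counts as in the Z-test sketch are assumed:

```python
from scipy.stats import chi2_contingency

# rows: new vs. standard pain reliever; columns: reduction of 3+ points vs. not
observed = [[23, 27],
            [11, 39]]
chi2_stat, p_value, df, expected = chi2_contingency(observed, correction=False)

print(round(chi2_stat, 1))   # 6.4  (= 2.53 squared, matching the Z test)
print(df)                    # 1
print(chi2_stat > 3.84)      # True -> reject H0, same conclusion as the Z test
```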

Chi-Squared Tests in R

The video below by Mike Marin demonstrates how to perform chi-squared tests in the R programming language.

Answer to Problem on Pancreaticoduodenectomy and Surgical Apgar Scores

We have 3 independent comparison groups (Surgical Apgar Score) and a categorical outcome variable (morbidity/mortality). We can run a Chi-Squared test of independence.

H 0 : Apgar scores and patient outcome are independent of one another.

H A : Apgar scores and patient outcome are not independent.

Chi-squared = 14.13

The degrees of freedom are df = (3-1)(3-1) = 4, and the critical value at the 5% level is 9.49.

Since 14.13 is greater than 9.49, we reject H 0.

There is an association between Apgar scores and patient outcome. The lowest Apgar score group (0 to 4) experienced the highest percentage of major morbidity or mortality (16 out of 57=28%) compared to the other Apgar score groups.



6b.1 - Steps in Conducting a Hypothesis Test for \(\mu\)

Six Steps for Conducting a One-Sample Mean Hypothesis Test

Steps 1-3 Section

Let's apply the general steps for hypothesis testing to the specific case of testing a one-sample mean.

One Mean t-test Hypotheses

Conditions : The data comes from an approximately normal distribution or the sample size is at least 30.

One Mean t-test: \( t^*=\dfrac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}} \)

  • Rejection Region Approach

Steps 4-6 Section  

  • Left-Tailed Test
  • Right-Tailed Test
  • Two-Tailed Test

Reject \(H_0\) if \(t^* \le t_\alpha\)

Reject \(H_0\) if \(t^* \ge t_{1-\alpha}\)

Reject \(H_0\) if \(|t^*| \ge |t_{\alpha/2}|\)

  • P-Value Approach
  • If \(H_a \) is right-tailed, then the p-value is the probability the sample data produces a value equal to or greater than the observed test statistic.
  • If \(H_a \) is left-tailed, then the p-value is the probability the sample data produces a value equal to or less than the observed test statistic.
  • If \(H_a \) is two-tailed, then the p-value is two times the probability the sample data produces a value equal to or greater than the absolute value of the observed test statistic.
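The three p-value rules above translate directly into code. A sketch using the t distribution from `scipy.stats` (the function name and arguments are this sketch's own; any test statistic and degrees of freedom could be substituted):

```python
from scipy.stats import t

def t_p_value(t_star, df, tail):
    """P-value for a one-mean t test, following the three rules above."""
    if tail == "right":
        return t.sf(t_star, df)            # P(T >= t*)
    if tail == "left":
        return t.cdf(t_star, df)           # P(T <= t*)
    return 2 * t.sf(abs(t_star), df)       # two-tailed: 2 * P(T >= |t*|)

# For example, with t* = 2.0 and df = 30:
print(t_p_value(2.0, 30, "right"))
print(t_p_value(2.0, 30, "two"))
```

By symmetry of the t distribution, the two-tailed p-value is exactly twice the one-tailed p-value for the same |t*|.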

Example 6-7 Length of Lumber Section  

Continuing with our lumber example, the mean length of the lumber is supposed to be 8.5 feet. A builder wants to check whether the shipment of lumber she receives has a mean length different from 8.5 feet. If the builder observes that the sample mean of 61 pieces of lumber is 8.3 feet with a sample standard deviation of 1.2 feet, what will she conclude? Conduct this test at a 1% level of significance.

Conduct the test using the Rejection Region approach and the p-value approach.

Set up the hypotheses (since the research hypothesis is to check whether the mean is different from 8.5, we set it up as a two-tailed test):

\( H_0\colon \mu=8.5 \) vs. \(H_a\colon \mu\ne 8.5 \)

Can we use the t-test? The answer is yes since the sample size of 61 is sufficiently large (greater than 30).

With df = 61 - 1 = 60 and \(\alpha=0.01\), the rejection region is \( t^*\le -2.660 \) or \(t^*\ge 2.660 \).
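The lumber example can be checked numerically. A sketch using `scipy.stats` with the values given in the problem (\(\bar{x}=8.3\), \(\mu_0=8.5\), \(s=1.2\), \(n=61\)):

```python
import math
from scipy.stats import t

x_bar, mu_0, s, n = 8.3, 8.5, 1.2, 61
t_star = (x_bar - mu_0) / (s / math.sqrt(n))   # ≈ -1.30
df = n - 1

critical = t.ppf(1 - 0.01 / 2, df)             # two-tailed critical value, ≈ 2.660
p_value = 2 * t.sf(abs(t_star), df)            # two-tailed p-value, ≈ 0.20

print(abs(t_star) >= critical)   # False -> fail to reject H0 at the 1% level
```

Since |t*| ≈ 1.30 falls well inside (-2.660, 2.660), the builder does not have evidence that the mean length differs from 8.5 feet.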

Emergency Room Wait Time Section  


The administrator at your local hospital states that on weekends the average wait time for emergency room visits is 10 minutes. Based on discussions you have had with friends who have complained on how long they waited to be seen in the ER over a weekend, you dispute the administrator's claim. You decide to test your hypothesis. Over the course of a few weekends, you record the wait time for 40 randomly selected patients. The average wait time for these 40 patients is 11 minutes with a standard deviation of 3 minutes.

Do you have enough evidence to support your hypothesis that the average ER wait time exceeds 10 minutes? You opt to conduct the test at a 5% level of significance.

At this point we want to check whether we can apply the central limit theorem. The sample size is greater than 30, so we should be okay.

This is a right-tailed test.

\( H_0\colon \mu=10 \) vs \(H_a\colon \mu>10 \)

The degrees of freedom are 40 - 1 = 39. The table from the text shows rows for 35 and 40 degrees of freedom, so we use the more conservative value, 35. With \(\alpha=0.05 \), we see a value of 1.69. The critical value is 1.69 and the rejection region is any \(t^* \) such that \(t^*\ge 1.69 \).

Note! If we use software (discussed in the next section), we find the critical value to be 1.685.

Note! If we use software, the p-value is 0.0207.
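The same kind of computation for the ER example reproduces both software notes above (critical value 1.685 and p-value 0.0207); a sketch using `scipy.stats`:

```python
import math
from scipy.stats import t

x_bar, mu_0, s, n = 11, 10, 3, 40
t_star = (x_bar - mu_0) / (s / math.sqrt(n))   # right-tailed test statistic, ≈ 2.11
df = n - 1

critical = t.ppf(1 - 0.05, df)                 # ≈ 1.685
p_value = t.sf(t_star, df)                     # ≈ 0.0207, matching the note above

print(t_star >= critical)    # True -> reject H0: mean wait time exceeds 10 minutes
```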

Choosing the Right Statistical Test | Types & Examples

Published on January 28, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Statistical tests are used in hypothesis testing . They can be used to:

  • determine whether a predictor variable has a statistically significant relationship with an outcome variable.
  • estimate the difference between two or more groups.

Statistical tests assume a null hypothesis of no relationship or no difference between groups. Then they determine whether the observed data fall outside of the range of values predicted by the null hypothesis.

If you already know what types of variables you’re dealing with, you can use the flowchart to choose the right statistical test for your data.

Statistical tests flowchart

Table of contents

  • What does a statistical test do?
  • When to perform a statistical test
  • Choosing a parametric test: regression, comparison, or correlation
  • Choosing a nonparametric test
  • Flowchart: choosing a statistical test
  • Other interesting articles
  • Frequently asked questions about statistical tests

Statistical tests work by calculating a test statistic – a number that describes how much the relationship between variables in your test differs from the null hypothesis of no relationship.

It then calculates a p value (probability value). The p -value estimates how likely it is that you would see the difference described by the test statistic if the null hypothesis of no relationship were true.

If the value of the test statistic is more extreme than the statistic calculated from the null hypothesis, then you can infer a statistically significant relationship between the predictor and outcome variables.

If the value of the test statistic is less extreme than the one calculated from the null hypothesis, then you can infer no statistically significant relationship between the predictor and outcome variables.


You can perform statistical tests on data that have been collected in a statistically valid manner – either through an experiment , or through observations made using probability sampling methods .

For a statistical test to be valid , your sample size needs to be large enough to approximate the true distribution of the population being studied.

To determine which statistical test to use, you need to know:

  • whether your data meets certain assumptions.
  • the types of variables that you’re dealing with.

Statistical assumptions

Statistical tests make some common assumptions about the data they are testing:

  • Independence of observations (a.k.a. no autocorrelation): The observations/variables you include in your test are not related (for example, multiple measurements of a single test subject are not independent, while measurements of multiple different test subjects are independent).
  • Homogeneity of variance : the variance within each group being compared is similar among all groups. If one group has much more variation than others, it will limit the test’s effectiveness.
  • Normality of data : the data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data .

If your data do not meet the assumptions of normality or homogeneity of variance, you may be able to perform a nonparametric statistical test , which allows you to make comparisons without any assumptions about the data distribution.

If your data do not meet the assumption of independence of observations, you may be able to use a test that accounts for structure in your data (repeated-measures tests or tests that include blocking variables).

Types of variables

The types of variables you have usually determine what type of statistical test you can use.

Quantitative variables represent amounts of things (e.g. the number of trees in a forest). Types of quantitative variables include:

  • Continuous (aka ratio variables): represent measures and can usually be divided into units smaller than one (e.g. 0.75 grams).
  • Discrete (aka integer variables): represent counts and usually can’t be divided into units smaller than one (e.g. 1 tree).

Categorical variables represent groupings of things (e.g. the different tree species in a forest). Types of categorical variables include:

  • Ordinal : represent data with an order (e.g. rankings).
  • Nominal : represent group names (e.g. brands or species names).
  • Binary : represent data with a yes/no or 1/0 outcome (e.g. win or lose).

Choose the test that fits the types of predictor and outcome variables you have collected (if you are doing an experiment , these are the independent and dependent variables ). Consult the tables below to see which test best matches your variables.

Parametric tests usually have stricter requirements than nonparametric tests, and are able to make stronger inferences from the data. They can only be conducted with data that adheres to the common assumptions of statistical tests.

The most common types of parametric test include regression tests, comparison tests, and correlation tests.

Regression tests

Regression tests look for cause-and-effect relationships . They can be used to estimate the effect of one or more continuous variables on another variable.

Comparison tests

Comparison tests look for differences among group means . They can be used to test the effect of a categorical variable on the mean value of some other characteristic.

T-tests are used when comparing the means of precisely two groups (e.g., the average heights of men and women). ANOVA and MANOVA tests are used when comparing the means of more than two groups (e.g., the average heights of children, teenagers, and adults).

Correlation tests

Correlation tests check whether variables are related without hypothesizing a cause-and-effect relationship.

These can be used to test whether two variables you want to use in (for example) a multiple regression test are autocorrelated.

Non-parametric tests don’t make as many assumptions about the data, and are useful when one or more of the common statistical assumptions are violated. However, the inferences they make aren’t as strong as with parametric tests.


This flowchart helps you choose among parametric tests. For nonparametric alternatives, check the table above.

Choosing the right statistical test

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient
  • Null hypothesis

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Statistical tests commonly assume that:

  • the data are normally distributed
  • the groups that are being compared have similar variance
  • the data are independent

If your data do not meet these assumptions, you might still be able to use a nonparametric statistical test, which has fewer requirements but also makes weaker inferences.

A test statistic is a number calculated by a  statistical test . It describes how far your observed data is from the  null hypothesis  of no relationship between  variables or no difference among sample groups.

The test statistic tells you how different two or more groups are from the overall population mean , or how different a linear slope is from the slope predicted by a null hypothesis . Different test statistics are used in different statistical tests.

Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test . Significance is usually denoted by a p -value , or probability value.

Statistical significance is arbitrary – it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that the data is likely to occur less than 5% of the time under the null hypothesis .

When the p -value falls below the chosen alpha value, then we say the result of the test is statistically significant.

Quantitative variables are any variables where the data represent amounts (e.g. height, weight, or age).

Categorical variables are any variables where the data represent groups. This includes rankings (e.g. finishing places in a race), classifications (e.g. brands of cereal), and binary outcomes (e.g. coin flips).

You need to know what type of variables you are working with to choose the right statistical test for your data and interpret your results .

Discrete and continuous variables are two types of quantitative variables :

  • Discrete variables represent counts (e.g. the number of objects in a collection).
  • Continuous variables represent measurable amounts (e.g. water volume or weight).


Module 8: Inference for One Proportion

Hypothesis Testing (5 of 5)

Learning Outcomes

  • Recognize type I and type II errors.

What Can Go Wrong: Two Types of Errors

Statistical investigations involve making decisions in the face of uncertainty, so there is always some chance of making a wrong decision. In hypothesis testing, two types of wrong decisions can occur.

If the null hypothesis is true, but we reject it, the error is a type I error.

If the null hypothesis is false, but we fail to reject it, the error is a type II error.

The following table summarizes type I and II errors.

  • Reject H 0 when H 0 is false: correct decision (we have correctly rejected the null hypothesis).
  • Reject H 0 when H 0 is true: type I error.
  • Fail to reject H 0 when H 0 is true: correct decision.
  • Fail to reject H 0 when H 0 is false: type II error.

Type I and type II errors are not caused by mistakes. These errors are the result of random chance. The data provide evidence for a conclusion that is false. It’s no one’s fault!

Data Use on Smart Phones


In a previous example, we looked at a hypothesis test about data usage on smart phones. The researcher investigated the claim that the mean data usage for all teens is greater than 62 MBs. The sample mean was 75 MBs. The P-value was approximately 0.023. In this situation, the P-value is the probability that we will get a sample mean of 75 MBs or higher if the true mean is 62 MBs.

Notice that the result (75 MBs) isn’t impossible, only very unusual. The result is rare enough that we question whether the null hypothesis is true, which is why we reject it. But it is possible that the null hypothesis is true and the researcher happened to get a very unusual sample mean. In this case, the result is just due to chance, and the data have led to a type I error: rejecting the null hypothesis when it is actually true.

White Male Support for Obama in 2012

In a previous example, we conducted a hypothesis test using poll results to determine if white male support for Obama in 2012 will be less than 40%. Our poll of white males showed 35% planning to vote for Obama in 2012. Based on the sampling distribution, we estimated the P-value as 0.078. In this situation, the P-value is the probability that we will get a sample proportion of 0.35 or less if 0.40 of the population of white males support Obama.

At the 5% level, the poll did not give strong enough evidence for us to conclude that less than 40% of white males will vote for Obama in 2012.

Which type of error is possible in this situation? If, in fact, it is true that less than 40% of this population support Obama, then the data led to a type II error: failing to reject a null hypothesis that is false. In other words, we failed to accept an alternative hypothesis that is true.

We definitely did not make a type I error here because a type I error requires that we reject the null hypothesis.

What Is the Probability That We Will Make a Type I Error?

If the significance level is 5% (α = 0.05), then 5% of the time we will reject the null hypothesis (when it is true!). Of course we will not know if the null is true. But if it is, the natural variability that we expect in random samples will produce rare results 5% of the time. This makes sense because we assume the null hypothesis is true when we create the sampling distribution. We look at the variability in random samples selected from the population described by the null hypothesis.

Similarly, if the significance level is 1%, then 1% of the time sample results will be rare enough for us to reject the null hypothesis. So if the null hypothesis is actually true, then by chance alone, 1% of the time we will reject a true null hypothesis. The probability of a type I error is therefore 1%.

In general, the probability of a type I error is α.
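This claim can be checked by simulation: draw many samples from a population for which the null hypothesis is exactly true and count how often a 5% level test rejects. A sketch using numpy and scipy (the sample size and number of trials are arbitrary choices for this illustration):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)
alpha, n_tests, n = 0.05, 10_000, 30
true_mean = 0.0                       # the null hypothesis H0: mu = 0 is actually true

rejections = 0
for _ in range(n_tests):
    sample = rng.normal(loc=true_mean, scale=1.0, size=n)
    _, p = ttest_1samp(sample, popmean=true_mean)
    rejections += p <= alpha

print(rejections / n_tests)   # close to 0.05 = alpha
```

The observed rejection rate hovers near α, as the text predicts: the test rejects a true null hypothesis about 5% of the time.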

What Is the Probability That We Will Make a Type II Error?

The probability of a type I error, if the null hypothesis is true, is equal to the significance level. The probability of a type II error is much more complicated to calculate. We can reduce the risk of a type I error by using a lower significance level. The best way to reduce the risk of a type II error is by increasing the sample size. In theory, we could also increase the significance level, but doing so would increase the likelihood of a type I error at the same time. We discuss these ideas further in a later module.

A Fair Coin

In the long run, a fair coin lands heads up half of the time. (For this reason, a weighted coin is not fair.) We conducted a simulation in which each sample consists of 40 flips of a fair coin. Here is a simulated sampling distribution for the proportion of heads in 2,000 samples. Results ranged from 0.25 to 0.75.

A bell-shaped bar graph of the simulated sampling distribution: the highest bar sits at the center, 0.5, with bars decreasing on either side; the slope is slightly gentler to the left of the peak.

In general, if the null hypothesis is true, the significance level gives the probability of making a type I error. If we conduct a large number of hypothesis tests using the same null hypothesis, then, a type I error will occur in a predictable percentage (α) of the hypothesis tests. This is a problem! If we run one hypothesis test and the data is significant at the 5% level, we have reasonably good evidence that the alternative hypothesis is true. If we run 20 hypothesis tests and the data in one of the tests is significant at the 5% level, it doesn’t tell us anything! We expect 5% of the tests (1 in 20) to show significant results just due to chance.
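The multiple-testing problem above can be quantified. If the 20 tests are independent and every null hypothesis is true, the chance of at least one false positive is 1 minus the chance that all 20 tests correctly fail to reject:

```python
alpha, k = 0.05, 20
# P(at least one type I error in k independent tests of true null hypotheses)
p_at_least_one = 1 - (1 - alpha) ** k
print(round(p_at_least_one, 2))   # 0.64
```

So with 20 tests at the 5% level, there is roughly a 64% chance of at least one spurious "significant" result, which is why a single significant test out of twenty tells us very little.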

Cell Phones and Brain Cancer


The following is an excerpt from a 1999 New York Times article titled “Cell phones: questions but no answers,” as referenced by David S. Moore in Basic Practice of Statistics (4th ed., New York: W. H. Freeman, 2007):

  • A hospital study that compared brain cancer patients and a similar group without brain cancer found no statistically significant association between cell phone use and a group of brain cancers known as gliomas. But when 20 types of glioma were considered separately, an association was found between cell phone use and one rare form. Puzzlingly, however, this risk appeared to decrease rather than increase with greater mobile phone use.

This is an example of a probable type I error. Suppose we conducted 20 hypotheses tests with the null hypothesis “Cell phone use is not associated with cancer” at the 5% level. We expect 1 in 20 (5%) to give significant results by chance alone when there is no association between cell phone use and cancer. So the conclusion that this one type of cancer is related to cell phone use is probably just a result of random chance and not an indication of an association.


How Many People Are Telepathic?

Telepathy is the ability to read minds. Researchers used Zener cards in the early 1900s for experimental research into telepathy.

5 Zener cards. The first has a circle, the second a +, the third three wavy lines, the fourth a square, and the fifth a star.

In a telepathy experiment, the “sender” looks at 1 of 5 Zener cards while the “receiver” guesses the symbol. This is repeated 40 times, and the proportion of correct responses is recorded. Because there are 5 cards, we expect random guesses to be right 20% of the time (1 out of 5) in the long run. So in 40 tries, 8 correct guesses, a proportion of 0.20, is common. But of course there will be variability even when someone is just guessing. Thirteen or more correct in 40 tries, a proportion of 0.325, is statistically significant at the 5% level. When people perform this well on the telepathy test, we conclude their performance is not due to chance and take it as an indication of the ability to read minds.
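The "thirteen or more correct" cutoff can be checked with the binomial distribution: under the null hypothesis of pure guessing, the number of correct guesses in 40 tries is Binomial(40, 0.2):

```python
from scipy.stats import binom

n, p = 40, 0.2                  # 40 guesses, 1-in-5 chance per card under H0
expected = n * p                # correct guesses expected by chance
p_value = binom.sf(12, n, p)    # P(X >= 13) under pure guessing

print(expected)           # 8.0
print(p_value < 0.05)     # True -> 13 or more correct is significant at the 5% level
```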

In the next section, “Hypothesis Test for a Population Proportion,” we learn the details of hypothesis testing for claims about a population proportion. Before we get into the details, we want to step back and think more generally about hypothesis testing. We close our introduction to hypothesis testing with a helpful analogy.

Courtroom Analogy for Hypothesis Tests

When a defendant stands trial for a crime, he or she is innocent until proven guilty. It is the job of the prosecution to present evidence showing that the defendant is guilty beyond a reasonable doubt . It is the job of the defense to challenge this evidence to establish a reasonable doubt. The jury weighs the evidence and makes a decision.

When a jury makes a decision, it has only two possible verdicts:

  • Guilty: The jury concludes that there is enough evidence to convict the defendant. The evidence is so strong that there is not a reasonable doubt that the defendant is guilty.
  • Not Guilty: The jury concludes that there is not enough evidence to conclude beyond a reasonable doubt that the person is guilty. Notice that they do not conclude that the person is innocent. This verdict says only that there is not enough evidence to return a guilty verdict.

How is this example like a hypothesis test?

The null hypothesis is “The person is innocent.” The alternative hypothesis is “The person is guilty.” The evidence is the data. In a courtroom, the person is assumed innocent until proven guilty. In a hypothesis test, we assume the null hypothesis is true until the data provide strong evidence otherwise.

The two possible verdicts are similar to the two conclusions that are possible in a hypothesis test.

Reject the null hypothesis: When we reject a null hypothesis, we accept the alternative hypothesis. This is like a guilty verdict. The evidence is strong enough for the jury to reject the assumption of innocence. In a hypothesis test, the data is strong enough for us to reject the assumption that the null hypothesis is true.

Fail to reject the null hypothesis: When we fail to reject the null hypothesis, we are delivering a “not guilty” verdict. The jury concludes that the evidence is not strong enough to reject the assumption of innocence, so the evidence is too weak to support a guilty verdict. We conclude the data is not strong enough to reject the null hypothesis, so the data is too weak to accept the alternative hypothesis.

How does the courtroom analogy relate to type I and type II errors?

Type I error: The jury convicts an innocent person. By analogy, we reject a true null hypothesis and accept a false alternative hypothesis.

Type II error: The jury says a person is not guilty when he or she really is. By analogy, we fail to reject a null hypothesis that is false. In other words, we do not accept an alternative hypothesis when it is really true.

Let’s Summarize

In this section, we introduced the four-step process of hypothesis testing:

Step 1: Determine the hypotheses.

  • The hypotheses are claims about the population(s).
  • The null hypothesis is a hypothesis that the parameter equals a specific value.
  • The alternative hypothesis is the competing claim that the parameter is less than, greater than, or not equal to the parameter value in the null. The claim that drives the statistical investigation is usually found in the alternative hypothesis.

Step 2: Collect the data.

Because the hypothesis test is based on probability, random selection or assignment is essential in data production.

Step 3: Assess the evidence.

  • Use the data to find a P-value.
  • The P-value is a probability statement about how unlikely the data is if the null hypothesis is true.
  • More specifically, the P-value gives the probability of sample results at least as extreme as the data if the null hypothesis is true.

Step 4: Give the conclusion.

  • A small P-value says the data would be unlikely to occur if the null hypothesis were true. We therefore reject the null hypothesis and conclude that the data support the alternative hypothesis instead.
  • We often choose a significance level as a benchmark for judging if the P-value is small enough. If the P-value is less than or equal to the significance level, we reject the null hypothesis and accept the alternative hypothesis instead.
  • If the P-value is greater than the significance level, we say we “fail to reject” the null hypothesis. We never say that we “accept” the null hypothesis. We just say that we don’t have enough evidence to reject it. This is equivalent to saying we don’t have enough evidence to support the alternative hypothesis.
  • Our conclusion will respond to the research question, so we often state the conclusion in terms of the alternative hypothesis.

Inference is based on probability, so there is always uncertainty. Although we may have strong evidence against it, the null hypothesis may still be true. If this is the case, we have a type I error. Similarly, even if we fail to reject the null hypothesis, it does not mean the alternative hypothesis is false. In this case, we have a type II error. These errors are not the result of a mistake in conducting the hypothesis test. They occur because of random chance.
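The four-step process can be sketched in code. The following is an illustrative sketch only (Python, as no language is used elsewhere in this document): the sample mean of 330.6 and hypothesized mean of 260 echo the energy-cost example earlier in this piece, while the population standard deviation (154) and sample size (25) are invented for the demonstration, and a z-test with known sigma stands in for the usual t-test.

```python
import math

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    # Step 3: assess the evidence with a test statistic and P-value
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Two-sided P-value: probability of a sample mean at least this far
    # from mu0 in either direction, assuming the null hypothesis is true
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Step 4: give the conclusion by comparing the P-value to alpha
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    return z, p_value, decision

# Steps 1-2 (state hypotheses, collect data) are summarized by the inputs;
# sigma = 154 and n = 25 are hypothetical values for illustration
z, p, decision = one_sample_z_test(330.6, 260, sigma=154, n=25)
```

With these invented inputs the z statistic is about 2.29 and the P-value about 0.02, so at alpha = 0.05 the null hypothesis would be rejected.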


  • Concepts in Statistics. Provided by : Open Learning Initiative. Located at : http://oli.cmu.edu . License : CC BY: Attribution
  • Inferential Statistics Decision Making Table. Authored by : Wikimedia Commons: Adapted by Lumen Learning. Located at : https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Inferential_Statistics_Decision_Making_Table.png/120px-Inferential_Statistics_Decision_Making_Table.png . License : CC BY: Attribution



P-Value And Statistical Significance: What It Is & Why It Matters

Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.


Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.


The p-value in statistics quantifies the evidence against a null hypothesis. A low p-value suggests data is inconsistent with the null, potentially favoring an alternative hypothesis. Common significance thresholds are 0.05 or 0.01.

P-Value Explained in Normal Distribution

Hypothesis testing

When you perform a statistical test, a p-value helps you determine the significance of your results in relation to the null hypothesis.

The null hypothesis (H0) states no relationship exists between the two variables being studied (one variable does not affect the other). It states the results are due to chance and are not significant in supporting the idea being investigated. Thus, the null hypothesis assumes that whatever you try to prove did not happen.

The alternative hypothesis (Ha or H1) is the one you would believe if the null hypothesis is concluded to be untrue.

The alternative hypothesis states that the independent variable affected the dependent variable, and the results are significant in supporting the theory being investigated (i.e., the results are not due to random chance).

What a p-value tells you

A p-value, or probability value, is a number describing how likely it is that your data would have occurred by random chance alone, assuming the null hypothesis is true.

Statistical significance is usually assessed with a p-value, a number between 0 and 1.

The smaller the p -value, the less likely the results occurred by random chance, and the stronger the evidence that you should reject the null hypothesis.

Remember, a p-value doesn’t tell you if the null hypothesis is true or false. It just tells you how likely you’d see the data you observed (or more extreme data) if the null hypothesis was true. It’s a piece of evidence, not a definitive proof.

Example: Test Statistic and p-Value

Suppose you’re conducting a study to determine whether a new drug has an effect on pain relief compared to a placebo. If the new drug has no impact, your test statistic will be close to the one predicted by the null hypothesis (no difference between the drug and placebo groups), and the resulting p-value will be close to 1. It may not be precisely 1 because real-world variations may exist. Conversely, if the new drug indeed reduces pain significantly, your test statistic will diverge further from what’s expected under the null hypothesis, and the p-value will decrease. The p-value will never reach zero because there’s always a slim possibility, though highly improbable, that the observed results occurred by random chance.
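That contrast can be made concrete with a rough numeric sketch. Everything here is an assumption for illustration: two groups of 50, a large-sample normal approximation in place of the exact t distribution, and invented summary statistics (the "drug works" numbers borrow the means and SDs from the reporting example later in this article).

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Large-sample test statistic and two-sided p-value for the
    difference of two group means (normal approximation)."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# No real effect: group means nearly equal -> z near 0, p near 1
z_null, p_null = two_sample_z(5.19, 0.8, 50, 5.2, 0.7, 50)

# Drug reduces pain: means diverge -> large |z|, tiny p
z_eff, p_eff = two_sample_z(3.5, 0.8, 50, 5.2, 0.7, 50)
```

In the first case the p-value comes out near 1; in the second it is far below any conventional threshold, though in theory it never reaches exactly zero (floating point may round it to 0).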

P-value interpretation

The significance level (alpha) is a set probability threshold (often 0.05), while the p-value is the probability you calculate based on your study or analysis.

A p-value less than or equal to your predetermined significance level (often 0.05 or 0.01) indicates a statistically significant result, meaning the observed data provide strong evidence against the null hypothesis.

This suggests the effect under study likely represents a real relationship rather than just random chance.

For instance, if you set α = 0.05, you would reject the null hypothesis if your p -value ≤ 0.05. 

It indicates strong evidence against the null hypothesis: if the null hypothesis were true, there would be less than a 5% probability of obtaining results at least as extreme as these by random chance.

Therefore, we reject the null hypothesis and accept the alternative hypothesis.

Example: Statistical Significance

Upon analyzing the pain relief effects of the new drug compared to the placebo, the computed p-value is less than 0.01, which falls well below the predetermined alpha value of 0.05. Consequently, you conclude that there is a statistically significant difference in pain relief between the new drug and the placebo.

What does a p-value of 0.001 mean?

A p-value of 0.001 is highly statistically significant beyond the commonly used 0.05 threshold. It indicates strong evidence of a real effect or difference, rather than just random variation.

Specifically, a p-value of 0.001 means there is only a 0.1% chance of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is correct.

Such a small p-value provides strong evidence against the null hypothesis, leading to rejecting the null in favor of the alternative hypothesis.

A p-value greater than the significance level (typically p > 0.05) is not statistically significant; it means the data do not provide strong enough evidence against the null hypothesis.

This means we retain the null hypothesis and cannot support the alternative hypothesis. Note that we never accept the null hypothesis; we can only reject it or fail to reject it.
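The interpretation rules above boil down to a single comparison between the computed p-value and the preset significance level; here is a minimal sketch:

```python
def interpret(p_value, alpha=0.05):
    """Compare a computed p-value to a preset significance level."""
    if p_value <= alpha:
        return "reject H0 (statistically significant)"
    # We never "accept" H0 -- we only fail to reject it
    return "fail to reject H0 (not statistically significant)"
```

By the usual "less than or equal to" convention, a p-value exactly at the threshold (e.g., 0.05 with alpha = 0.05) still counts as significant.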

Note : when the p-value is above your threshold of significance,  it does not mean that there is a 95% probability that the alternative hypothesis is true.

One-Tailed Test

[Figure: statistical significance in a one-tailed test, with the rejection region in a single tail of the distribution]

Two-Tailed Test

[Figure: statistical significance in a two-tailed test, with the rejection region split between the two tails]
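For a z statistic, the only difference between the two is which tail areas are counted as "at least as extreme." A sketch with a hypothetical z = 1.75:

```python
import math

def normal_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal variable."""
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.75  # hypothetical test statistic

one_tailed = normal_sf(z)            # only departures in one direction count
two_tailed = 2 * normal_sf(abs(z))   # departures in either direction count
```

Here the one-tailed p-value is about 0.04 (significant at 0.05) while the two-tailed p-value is about 0.08 (not significant), which is why the choice of tails must be made before looking at the data.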

How do you calculate the p-value ?

Most statistical software packages like R, SPSS, and others automatically calculate your p-value. This is the easiest and most common way.

Online resources and tables are available to estimate the p-value based on your test statistic and degrees of freedom.

These tables help you understand how often you would expect to see your test statistic under the null hypothesis.

Understanding the Statistical Test:

Different statistical tests are designed to answer specific research questions or hypotheses. Each test has its own underlying assumptions and characteristics.

For example, you might use a t-test to compare means, a chi-squared test for categorical data, or a correlation test to measure the strength of a relationship between variables.

Be aware that the number of independent variables you include in your analysis can influence the magnitude of the test statistic needed to produce the same p-value.

This factor is particularly important to consider when comparing results across different analyses.

Example: Choosing a Statistical Test

If you’re comparing the effectiveness of just two different drugs in pain relief, a two-sample t-test is a suitable choice for comparing these two groups. However, when you’re examining the impact of three or more drugs, it’s more appropriate to employ an Analysis of Variance ( ANOVA) . Utilizing multiple pairwise comparisons in such cases can lead to artificially low p-values and an overestimation of the significance of differences between the drug groups.
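The danger of many pairwise comparisons can be quantified with a simplified model. Assuming k independent tests, each run at alpha = 0.05 with every null hypothesis actually true, the chance of at least one false positive is 1 - (1 - alpha)^k:

```python
alpha = 0.05
for k in (1, 3, 10):
    # Probability that at least one of k independent tests comes out
    # "significant" purely by chance
    family_wise_error = 1 - (1 - alpha) ** k
    print(k, round(family_wise_error, 3))
```

With three comparisons the false-positive risk is already about 14%, and with ten it exceeds 40%; ANOVA (followed by corrected post-hoc tests) avoids this inflation.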

How to report

A statistically significant result cannot prove that a research hypothesis is correct, since proof would require 100% certainty.

Instead, we may state our results “provide support for” or “give evidence for” our research hypothesis (as there is still a slight probability that the results occurred by chance and the null hypothesis was correct – e.g., less than 5%).

Example: Reporting the results

In our comparison of the pain relief effects of the new drug and the placebo, we observed that participants in the drug group experienced a significant reduction in pain ( M = 3.5; SD = 0.8) compared to those in the placebo group ( M = 5.2; SD  = 0.7), resulting in an average difference of 1.7 points on the pain scale (t(98) = -9.36; p < 0.001).
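A reported result like this can be sanity-checked. With 98 degrees of freedom the t distribution is very close to the standard normal, so a normal approximation suffices to confirm that t = -9.36 indeed implies p < .001:

```python
import math

t_stat = -9.36  # reported test statistic, df = 98

# Normal approximation to the two-sided p-value; at df = 98 the
# t distribution is nearly normal, so this is accurate enough here
p_approx = 2 * (1 - 0.5 * (1 + math.erf(abs(t_stat) / math.sqrt(2))))
```

The approximate p-value is vanishingly small, consistent with the reported p < 0.001.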

The 6th edition of the APA style manual (American Psychological Association, 2010) states the following on the topic of reporting p-values:

“When reporting p values, report exact p values (e.g., p = .031) to two or three decimal places. However, report p values less than .001 as p < .001.

The tradition of reporting p values in the form p < .10, p < .05, p < .01, and so forth, was appropriate in a time when only limited tables of critical values were available.” (p. 114)

  • Do not use 0 before the decimal point for the statistical value p, as it cannot be greater than 1. In other words, write p = .001 instead of p = 0.001.
  • Please pay attention to issues of italics ( p is always italicized) and spacing (either side of the = sign).
  • p = .000 (as outputted by some statistical packages such as SPSS) is impossible and should be written as p < .001.
  • The opposite of significant is “nonsignificant,” not “insignificant.”

Why is the p -value not enough?

A lower p-value is sometimes interpreted as meaning there is a stronger relationship between two variables.

However, statistical significance only means that the observed data would be unlikely (e.g., a less than 5% chance) if the null hypothesis were true; it says nothing about the size of the effect.

To understand the strength of the difference between the two groups (control vs. experimental) a researcher needs to calculate the effect size .
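For the pain-relief example reported above, an effect size can be computed directly from the summary statistics. Cohen's d with a pooled standard deviation is one common choice; the group sizes of 50 each are an assumption inferred from the 98 degreeses of freedom in the report.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Drug group: M = 3.5, SD = 0.8; placebo group: M = 5.2, SD = 0.7
d = cohens_d(3.5, 0.8, 50, 5.2, 0.7, 50)
```

Here d is about -2.26, a very large standardized difference; by convention |d| around 0.2 is small, 0.5 medium, and 0.8 large.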

When do you reject the null hypothesis?

In statistical hypothesis testing, you reject the null hypothesis when the p-value is less than or equal to the significance level (α) you set before conducting your test. The significance level is the probability of rejecting the null hypothesis when it is true. Commonly used significance levels are 0.01, 0.05, and 0.10.

Remember, rejecting the null hypothesis doesn’t prove the alternative hypothesis; it just suggests that the alternative hypothesis may be plausible given the observed data.

The p -value is conditional upon the null hypothesis being true but is unrelated to the truth or falsity of the alternative hypothesis.

What does p-value of 0.05 mean?

If your p-value is less than or equal to 0.05 (the significance level), you would conclude that your result is statistically significant. This means the evidence is strong enough to reject the null hypothesis in favor of the alternative hypothesis.

Are all p-values below 0.05 considered statistically significant?

No, not all p-values below 0.05 are considered statistically significant. The threshold of 0.05 is commonly used, but it’s just a convention. Statistical significance depends on factors like the study design, sample size, and the magnitude of the observed effect.

A p-value below 0.05 means there is evidence against the null hypothesis, suggesting a real effect. However, it’s essential to consider the context and other factors when interpreting results.

Researchers also look at effect size and confidence intervals to determine the practical significance and reliability of findings.

How does sample size affect the interpretation of p-values?

Sample size can impact the interpretation of p-values. A larger sample size provides more reliable and precise estimates of the population, leading to narrower confidence intervals.

With a larger sample, even small differences between groups or effects can become statistically significant, yielding lower p-values. In contrast, smaller sample sizes may not have enough statistical power to detect smaller effects, resulting in higher p-values.

Therefore, a larger sample size increases the chances of finding statistically significant results when there is a genuine effect, making the findings more trustworthy and robust.
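A small sketch (hypothetical numbers, normal approximation) shows the same raw difference flipping from non-significant to highly significant purely because n grows:

```python
import math

def z_p_value(diff, sd, n):
    """Two-sided p-value for a mean difference `diff`, given spread `sd`
    and sample size `n` (large-sample normal approximation)."""
    z = diff / (sd / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Identical effect (difference of 2 units, SD of 10); only n changes
p_small = z_p_value(diff=2.0, sd=10.0, n=20)
p_large = z_p_value(diff=2.0, sd=10.0, n=500)
```

With n = 20 the p-value is around 0.37; with n = 500 it drops below 0.001, even though the underlying effect is identical, which is another reason to report effect sizes alongside p-values.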

Can a non-significant p-value indicate that there is no effect or difference in the data?

No, a non-significant p-value does not necessarily indicate that there is no effect or difference in the data. It means that the observed data do not provide strong enough evidence to reject the null hypothesis.

There could still be a real effect or difference, but it might be smaller or more variable than the study was able to detect.

Other factors like sample size, study design, and measurement precision can influence the p-value. It’s important to consider the entire body of evidence and not rely solely on p-values when interpreting research findings.

Can P values be exactly zero?

While a p-value can be extremely small, it can never be exactly zero. When a p-value is reported as p = 0.000, the actual p-value is simply too small for the software to display. This is often interpreted as strong evidence against the null hypothesis. For p-values less than 0.001, report them as p < .001.

Further Information

  • P-values and significance tests (Khan Academy)
  • Hypothesis testing and p-values (Khan Academy)
  • Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05”.
  • Criticism of using the “p < 0.05” threshold.
  • Publication manual of the American Psychological Association
  • Statistics for Psychology Book Download




Statistics LibreTexts

8.E: Testing Hypotheses (Exercises)


These are homework exercises to accompany the Textmap created for "Introductory Statistics" by Shafer and Zhang.

8.1: The Elements of Hypothesis Testing

State the null and alternative hypotheses for each of the following situations. (That is, identify the correct number \(\mu _0\) and write \(H_0:\mu =\mu _0\) and the appropriate analogous expression for \(H_a\).)

  • The average July temperature in a region historically has been \(74.5^{\circ}F\). Perhaps it is higher now.
  • The average weight of a female airline passenger with luggage was \(145\) pounds ten years ago. The FAA believes it to be higher now.
  • The average stipend for doctoral students in a particular discipline at a state university is \(\$14,756\). The department chairman believes that the national average is higher.
  • The average room rate in hotels in a certain region is \(\$82.53\). A travel agent believes that the average in a particular resort area is different.
  • The average farm size in a predominately rural state was \(69.4\) acres. The secretary of agriculture of that state asserts that it is less today.
  • The average time workers spent commuting to work in Verona five years ago was \(38.2\) minutes. The Verona Chamber of Commerce asserts that the average is less now.
  • The mean salary for all men in a certain profession is \(\$58,291\). A special interest group thinks that the mean salary for women in the same profession is different.
  • The accepted figure for the caffeine content of an \(8\)-ounce cup of coffee is \(133\) mg. A dietitian believes that the average for coffee served in local restaurants is higher.
  • The average yield per acre for all types of corn in a recent year was \(161.9\) bushels. An economist believes that the average yield per acre is different this year.
  • An industry association asserts that the average age of all self-described fly fishermen is \(42.8\) years. A sociologist suspects that it is higher.

Describe the two types of errors that can be made in a test of hypotheses.

Under what circumstance is a test of hypotheses certain to yield a correct decision?

  • \(H_0:\mu =74.5\; vs\; H_a:\mu >74.5\)
  • \(H_0:\mu =145\; vs\; H_a:\mu >145\)
  • \(H_0:\mu =14756\; vs\; H_a:\mu >14756\)
  • \(H_0:\mu =82.53\; vs\; H_a:\mu \neq 82.53\)
  • \(H_0:\mu =69.4\; vs\; H_a:\mu <69.4\)
  • A Type I error is made when a true \(H_0\) is rejected. A Type II error is made when a false \(H_0\) is not rejected.

8.2: Large Sample Tests for a Population Mean

  • \(H_0:\mu =27\; vs\; H_a:\mu <27\; @\; \alpha =0.05\)
  • \(H_0:\mu =52\; vs\; H_a:\mu \neq 52\; @\; \alpha =0.05\)
  • \(H_0:\mu =-105\; vs\; H_a:\mu >-105\; @\; \alpha =0.10\)
  • \(H_0:\mu =78.8\; vs\; H_a:\mu \neq 78.8\; @\; \alpha =0.10\)
  • \(H_0:\mu =17\; vs\; H_a:\mu <17\; @\; \alpha =0.01\)
  • \(H_0:\mu =880\; vs\; H_a:\mu \neq 880\; @\; \alpha =0.01\)
  • \(H_0:\mu =-12\; vs\; H_a:\mu >-12\; @\; \alpha =0.05\)
  • \(H_0:\mu =21.1\; vs\; H_a:\mu \neq 21.1\; @\; \alpha =0.05\)
  • \(H_0:\mu =141\; vs\; H_a:\mu <141\; @\; \alpha =0.20\)
  • \(H_0:\mu =-54\; vs\; H_a:\mu <-54\; @\; \alpha =0.05\)
  • \(H_0:\mu =98.6\; vs\; H_a:\mu \neq 98.6\; @\; \alpha =0.05\)
  • \(H_0:\mu =3.8\; vs\; H_a:\mu >3.8\; @\; \alpha =0.001\)
  • \(H_0:\mu =-62\; vs\; H_a:\mu \neq -62\; @\; \alpha =0.005\)
  • \(H_0:\mu =73\; vs\; H_a:\mu >73\; @\; \alpha =0.001\)
  • \(H_0:\mu =1124\; vs\; H_a:\mu <1124\; @\; \alpha =0.001\)
  • \(H_0:\mu =0.12\; vs\; H_a:\mu \neq 0.12\; @\; \alpha =0.001\)
  • Testing \(H_0:\mu =72.2\; vs\; H_a:\mu >72.2,\; \sigma \; \text{unknown}\; n=55,\; \bar{x}=75.1,\; s=9.25\)
  • Testing \(H_0:\mu =58\; vs\; H_a:\mu >58,\; \sigma =1.22\; n=40,\; \bar{x}=58.5,\; s=1.29\)
  • Testing \(H_0:\mu =-19.5\; vs\; H_a:\mu <-19.5,\; \sigma \; \text{unknown}\; n=30,\; \bar{x}=-23.2,\; s=9.55\)
  • Testing \(H_0:\mu =805\; vs\; H_a:\mu \neq 805,\; \sigma =37.5\; n=75,\; \bar{x}=818,\; s=36.2\)
  • Testing \(H_0:\mu =342\; vs\; H_a:\mu <342,\; \sigma =11.2\; n=40,\; \bar{x}=339,\; s=10.3\)
  • Testing \(H_0:\mu =105\; vs\; H_a:\mu >105,\; \sigma =5.3\; n=80,\; \bar{x}=107,\; s=5.1\)
  • Testing \(H_0:\mu =-13.5\; vs\; H_a:\mu \neq -13.5,\; \sigma \; \text{unknown}\; n=32,\; \bar{x}=-13.8,\; s=1.5\)
  • Testing \(H_0:\mu =28\; vs\; H_a:\mu \neq 28,\; \sigma \; \text{unknown}\; n=68,\; \bar{x}=27.8,\; s=1.3\)
  • Test \(H_0:\mu =212\; vs\; H_a:\mu <212\; @\; \alpha =0.10,\; \sigma \; \text{unknown}\; n=36,\; \bar{x}=211.2,\; s=2.2\)
  • Test \(H_0:\mu =-18\; vs\; H_a:\mu >-18\; @\; \alpha =0.05,\; \sigma =3.3\; n=44,\; \bar{x}=-17.2,\; s=3.1\)
  • Test \(H_0:\mu =24\; vs\; H_a:\mu \neq 24\; @\; \alpha =0.02,\; \sigma \; \text{unknown}\; n=50,\; \bar{x}=22.8,\; s=1.9\)
  • Test \(H_0:\mu =105\; vs\; H_a:\mu >105\; @\; \alpha =0.05,\; \sigma \; \text{unknown}\; n=30,\; \bar{x}=108,\; s=7.2\)
  • Test \(H_0:\mu =21.6\; vs\; H_a:\mu <21.6\; @\; \alpha =0.01,\; \sigma \; \text{unknown}\; n=78,\; \bar{x}=20.5,\; s=3.9\)
  • Test \(H_0:\mu =-375\; vs\; H_a:\mu \neq -375\; @\; \alpha =0.01,\; \sigma =18.5\; n=31,\; \bar{x}=-388,\; s=18.0\)

Applications

  • In the past the average length of an outgoing telephone call from a business office has been \(143\) seconds. A manager wishes to check whether that average has decreased after the introduction of policy changes. A sample of \(100\) telephone calls produced a mean of \(133\) seconds, with a standard deviation of \(35\) seconds. Perform the relevant test at the \(1\%\) level of significance.
  • The government of an impoverished country reports the mean age at death among those who have survived to adulthood as \(66.2\) years. A relief agency examines \(30\) randomly selected deaths and obtains a mean of \(62.3\) years with standard deviation \(8.1\) years. Test whether the agency’s data support the alternative hypothesis, at the \(1\%\) level of significance, that the population mean is less than \(66.2\).
  • The average household size in a certain region several years ago was \(3.14\) persons. A sociologist wishes to test, at the \(5\%\) level of significance, whether it is different now. Perform the test using the information collected by the sociologist: in a random sample of \(75\) households, the average size was \(2.98\) persons, with sample standard deviation \(0.82\) person.
  • The recommended daily calorie intake for teenage girls is \(2,200\) calories/day. A nutritionist at a state university believes the average daily caloric intake of girls in that state to be lower. Test that hypothesis, at the \(5\%\) level of significance, against the null hypothesis that the population average is \(2,200\) calories/day using the following sample data: \(n=36,\; \bar{x}=2,150,\; s=203\)
  • An automobile manufacturer recommends oil change intervals of \(3,000\) miles. To compare actual intervals to the recommendation, the company randomly samples records of \(50\) oil changes at service facilities and obtains sample mean \(3,752\) miles with sample standard deviation \(638\) miles. Determine whether the data provide sufficient evidence, at the \(5\%\) level of significance, that the population mean interval between oil changes exceeds \(3,000\) miles.
  • A medical laboratory claims that the mean turn-around time for performance of a battery of tests on blood samples is \(1.88\) business days. The manager of a large medical practice believes that the actual mean is larger. A random sample of \(45\) blood samples yielded mean \(2.09\) and sample standard deviation \(0.13\) day. Perform the relevant test at the \(10\%\) level of significance, using these data.
  • A grocery store chain has as one standard of service that the mean time customers wait in line to begin checking out not exceed \(2\) minutes. To verify the performance of a store the company measures the waiting time in \(30\) instances, obtaining mean time \(2.17\) minutes with standard deviation \(0.46\) minute. Use these data to test the null hypothesis that the mean waiting time is \(2\) minutes versus the alternative that it exceeds \(2\) minutes, at the \(10\%\) level of significance.
  • A magazine publisher tells potential advertisers that the mean household income of its regular readership is \(\$61,500\). An advertising agency wishes to test this claim against the alternative that the mean is smaller. A sample of \(40\) randomly selected regular readers yields mean income \(\$59,800\) with standard deviation \(\$5,850\). Perform the relevant test at the \(1\%\) level of significance.
  • Authors of a computer algebra system wish to compare the speed of a new computational algorithm to the currently implemented algorithm. They apply the new algorithm to \(50\) standard problems; it averages \(8.16\) seconds with standard deviation \(0.17\) second. The current algorithm averages \(8.21\) seconds on such problems. Test, at the \(1\%\) level of significance, the alternative hypothesis that the new algorithm has a lower average time than the current algorithm.
  • A random sample of the starting salaries of \(35\) randomly selected graduates with bachelor’s degrees last year gave sample mean and standard deviation \(\$41,202\) and \(\$7,621\), respectively. Test whether the data provide sufficient evidence, at the \(5\%\) level of significance, to conclude that the mean starting salary of all graduates last year is less than the mean of all graduates two years before, \(\$43,589\).
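As an illustration of the large-sample method these exercises require, here is a sketch of the first application (outgoing call length): test \(H_0:\mu =143\) vs \(H_a:\mu <143\) at the \(1\%\) level with \(n=100,\; \bar{x}=133,\; s=35\).

```python
import math

# H0: mu = 143 vs Ha: mu < 143 at alpha = 0.01 (large-sample z-test)
n, xbar, mu0, s, alpha = 100, 133, 143, 35, 0.01

z = (xbar - mu0) / (s / math.sqrt(n))        # test statistic, about -2.86
p = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # lower-tail p-value
reject = p <= alpha                          # compare with alpha = 0.01
```

The statistic is about \(-2.86\), below the critical value \(-z_{0.01}=-2.33\), and the p-value is about 0.002, so the null hypothesis is rejected: the data support the claim that the average call length has decreased.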

Additional Exercises

  • Test at the \(10\%\) level of significance the null hypothesis that the mean household income of customers of the chain is \(\$48,750\) against the alternative that it is different from \(\$48,750\).
  • The sample mean is greater than \(\$48,750\), suggesting that the actual mean of people who patronize this store is greater than \(\$48,750\). Perform this test, also at the \(10\%\) level of significance. (The computation of the test statistic done in part (a) still applies here.)
  • Test at the \(1\%\) level of significance the null hypothesis that the actual mean time for this repair differs from one hour.
  • The sample mean is less than one hour, suggesting that the mean actual time for this repair is less than one hour. Perform this test, also at the \(1\%\) level of significance. (The computation of the test statistic done in part (a) still applies here.)

Large Data Set Exercises

Large Data Set missing from the original

  • Large \(\text{Data Set 1}\) records the SAT scores of \(1,000\) students. Regarding it as a random sample of all high school students, use it to test the hypothesis that the population mean exceeds \(1,510\), at the \(1\%\) level of significance. (The null hypothesis is that \(\mu =1510\)).
  • Large \(\text{Data Set 1}\) records the GPAs of \(1,000\) college students. Regarding it as a random sample of all college students, use it to test the hypothesis that the population mean is less than \(2.50\), at the \(10\%\) level of significance. (The null hypothesis is that \(\mu =2.50\)).
  • Regard the data as arising from a census of all students at a high school, in which the SAT score of every student was measured. Compute the population mean \(\mu\).
  • Regard the first \(50\) students in the data set as a random sample drawn from the population of part (a) and use it to test the hypothesis that the population mean exceeds \(1,510\), at the \(10\%\) level of significance. (The null hypothesis is that \(\mu =1510\)).
  • Is your conclusion in part (b) in agreement with the true state of nature (which by part (a) you know), or is your decision in error? If your decision is in error, is it a Type I error or a Type II error?
  • Regard the data as arising from a census of all freshman at a small college at the end of their first academic year of college study, in which the GPA of every such person was measured. Compute the population mean \(\mu\).
  • Regard the first \(50\) students in the data set as a random sample drawn from the population of part (a) and use it to test the hypothesis that the population mean is less than \(2.50\), at the \(10\%\) level of significance. (The null hypothesis is that \(\mu =2.50\)).
  • \(Z\leq -1.645\)
  • \(Z\leq -1.645\; or\; Z\geq 1.96\)
  • \(Z\geq 1.28\)
  • \(Z\leq -1.645\; or\; Z\geq 1.645\)
  • \(Z\leq -0.84\)
  • \(Z\leq -1.96\; or\; Z\geq 1.96\)
  • \(Z\geq 3.1\)
  • \(Z = 2.235\)
  • \(Z = 2.592\)
  • \(Z = -2.122\)
  • \(Z = 3.002\)
  • \(Z = -2.18,\; -z_{0.10}=-1.28,\; \text{reject}\; H_0\)
  • \(Z = 1.61,\; z_{0.05}=1.645,\; \text{do not reject}\; H_0\)
  • \(Z = -4.47,\; -z_{0.01}=-2.33,\; \text{reject}\; H_0\)
  • \(Z = -2.86,\; -z_{0.01}=-2.33,\; \text{reject}\; H_0\)
  • \(Z = -1.69,\; -z_{0.025}=-1.96,\; \text{do not reject}\; H_0\)
  • \(Z = 8.33,\; z_{0.05}=1.645,\; \text{reject}\; H_0\)
  • \(Z = 2.02,\; z_{0.10}=1.28,\; \text{reject}\; H_0\)
  • \(Z = -2.08,\; -z_{0.01}=-2.33,\; \text{do not reject}\; H_0\)
  • \(Z =2.54,\; z_{0.05}=1.645,\; \text{reject}\; H_0\)
  • \(Z = 2.54,\; z_{0.10}=1.28,\; \text{reject}\; H_0\)
  • \(H_0:\mu =1510\; vs\; H_a:\mu >1510\). Test Statistic: \(Z = 2.7882\). Rejection Region: \([2.33,\infty )\). Decision: Reject \(H_0\).
  • \(\mu _0=1528.74\)
  • \(H_0:\mu =1510\; vs\; H_a:\mu >1510\). Test Statistic: \(Z = -1.41\). Rejection Region: \([1.28,\infty )\). Decision: Fail to reject \(H_0\).
  • No, it is a Type II error.

8.3: The Observed Significance of a Test

  • Testing \(H_0:\mu =54.7\; vs\; H_a:\mu <54.7,\; \text{test statistic}\; z=-1.72\)
  • Testing \(H_0:\mu =195\; vs\; H_a:\mu \neq 195,\; \text{test statistic}\; z=-2.07\)
  • Testing \(H_0:\mu =-45\; vs\; H_a:\mu >-45,\; \text{test statistic}\; z=2.54\)
  • Testing \(H_0:\mu =0\; vs\; H_a:\mu \neq 0,\; \text{test statistic}\; z=2.82\)
  • Testing \(H_0:\mu =18.4\; vs\; H_a:\mu <18.4,\; \text{test statistic}\; z=-1.74\)
  • Testing \(H_0:\mu =63.85\; vs\; H_a:\mu >63.85,\; \text{test statistic}\; z=1.93\)
  • Testing \(H_0:\mu =27.5\; vs\; H_a:\mu >27.5,\; n=49,\; \bar{x}=28.9,\; s=3.14,\; \text{test statistic}\; z=3.12\)
  • Testing \(H_0:\mu =581\; vs\; H_a:\mu <581,\; n=32,\; \bar{x}=560,\; s=47.8,\; \text{test statistic}\; z=-2.49\)
  • Testing \(H_0:\mu =138.5\; vs\; H_a:\mu \neq 138.5,\; n=44,\; \bar{x}=137.6,\; s=2.45,\; \text{test statistic}\; z=-2.44\)
  • Testing \(H_0:\mu =-17.9\; vs\; H_a:\mu <-17.9,\; n=34,\; \bar{x}=-18.2,\; s=0.87,\; \text{test statistic}\; z=-2.01\)
  • Testing \(H_0:\mu =5.5\; vs\; H_a:\mu \neq 5.5,\; n=56,\; \bar{x}=7.4,\; s=4.82,\; \text{test statistic}\; z=2.95\)
  • Testing \(H_0:\mu =1255\; vs\; H_a:\mu >1255,\; n=152,\; \bar{x}=1257,\; s=7.5,\; \text{test statistic}\; z=3.29\)
  • Testing \(H_0:\mu =82.9\; vs\; H_a:\mu <82.9\; @\; \alpha =0.05\), observed significance \(p=0.038\)
  • Testing \(H_0:\mu =213.5\; vs\; H_a:\mu \neq 213.5\; @\; \alpha =0.01\), observed significance \(p=0.038\)
  • Testing \(H_0:\mu =31.4\; vs\; H_a:\mu >31.4\; @\; \alpha =0.10\), observed significance \(p=0.062\)
  • Testing \(H_0:\mu =-75.5\; vs\; H_a:\mu <-75.5\; @\; \alpha =0.05\), observed significance \(p=0.062\)
  • Perform the test at the \(1\%\) level of significance using the critical value approach.
  • Compute the observed significance of the test.
  • Perform the test at the \(1\%\) level of significance using the \(p\)-value approach. You need not repeat the first three steps, already done in part (a).
  • Perform the relevant test of hypotheses at the \(20\%\) level of significance using the critical value approach.
  • Perform the test at the \(20\%\) level of significance using the \(p\)-value approach. You need not repeat the first three steps, already done in part (a).
  • Perform the test at the \(10\%\) level of significance using the critical value approach.
  • Perform the test at the \(10\%\) level of significance using the \(p\)-value approach. You need not repeat the first three steps, already done in part (a).
  • Test at the \(5\%\) level of significance whether the mean increase with the new class scheduling is different from \(576\) word families, using the critical value approach.
  • Perform the test at the \(5\%\) level of significance using the \(p\)-value approach. You need not repeat the first three steps, already done in part (a).
  • Test at the \(5\%\) level of significance whether the mean yield under the new scheme is greater than \(44.8\) bu/acre, using the critical value approach.
  • Test at the \(5\%\) level of significance whether the mean visit time for the new page is less than the former mean of \(23.6\) seconds, using the critical value approach.
  • \(p\text{-value}=0.0427\)
  • \(p\text{-value}=0.0384\)
  • \(p\text{-value}=0.0055\)
  • \(p\text{-value}=0.0009\)
  • \(p\text{-value}=0.0064\)
  • \(p\text{-value}=0.0146\)
  • reject \(H_0\)
  • do not reject \(H_0\)
  • \(Z=3.23,\; z_{0.01}=2.33\), reject \(H_0\)
  • \(p\text{-value}=0.0006\)
  • \(Z=0.68,\; z_{0.05}=1.645\), do not reject \(H_0\)
  • \(p\text{-value}=0.4966\)
  • \(Z=2.22,\; z_{0.05}=1.645\), reject \(H_0\)
  • \(p\text{-value}=0.0132\)
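These observed-significance values follow directly from the standard normal CDF. A minimal Python sketch (the function name is ours, not the text's) that reproduces the answers above using only the standard library:

```python
from statistics import NormalDist

def p_value(z, tail):
    """Observed significance of a z test statistic.
    tail is 'left', 'right', or 'two'."""
    Phi = NormalDist().cdf  # standard normal CDF
    if tail == 'left':
        return Phi(z)
    if tail == 'right':
        return 1 - Phi(z)
    return 2 * (1 - Phi(abs(z)))  # two-tailed

# Checks against answers above: z = -1.72 left-tailed, z = -2.07 two-tailed
print(round(p_value(-1.72, 'left'), 4))  # 0.0427
print(round(p_value(-2.07, 'two'), 4))   # ≈ 0.0385 (z-score tables give 0.0384)
```

Exact CDF values can differ in the last decimal place from table lookups, which round \(z\) to two places.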

8.4: Small Sample Tests for a Population Mean

  • \(H_0: \mu =27\; vs\; H_a:\mu <27\; @\; \alpha =0.05,\; n=12,\; \sigma =2.2\)
  • \(H_0: \mu =52\; vs\; H_a:\mu \neq 52\; @\; \alpha =0.05,\; n=6,\; \sigma \; \text{unknown} \)
  • \(H_0: \mu =-105\; vs\; H_a:\mu >-105\; @\; \alpha =0.10,\; n=24,\; \sigma \; \text{unknown} \)
  • \(H_0: \mu =78.8\; vs\; H_a:\mu \neq 78.8\; @\; \alpha =0.10,\; n=8,\; \sigma =1.7\)
  • \(H_0: \mu =17\; vs\; H_a:\mu <17\; @\; \alpha =0.01,\; n=26,\; \sigma =0.94\)
  • \(H_0: \mu =880\; vs\; H_a:\mu \neq 880\; @\; \alpha =0.01,\; n=4,\; \sigma \; \text{unknown} \)
  • \(H_0: \mu =-12\; vs\; H_a:\mu >-12\; @\; \alpha =0.05,\; n=18,\; \sigma =1.1\)
  • \(H_0: \mu =21.1\; vs\; H_a:\mu \neq 21.1\; @\; \alpha =0.05,\; n=23,\; \sigma \; \text{unknown} \)
  • \(H_0: \mu =141\; vs\; H_a:\mu <141\; @\; \alpha =0.20,\; n=29,\; \sigma \; \text{unknown} \)
  • \(H_0: \mu =-54\; vs\; H_a:\mu <-54\; @\; \alpha =0.05,\; n=15,\; \sigma =1.9\)
  • \(H_0: \mu =98.6\; vs\; H_a:\mu \neq 98.6\; @\; \alpha =0.05,\; n=12,\; \sigma \; \text{unknown} \)
  • \(H_0: \mu =3.8\; vs\; H_a:\mu >3.8\; @\; \alpha =0.001,\; n=27,\; \sigma \; \text{unknown} \)
  • \(H_0: \mu =-62\; vs\; H_a:\mu \neq -62\; @\; \alpha =0.005,\; n=8,\; \sigma \; \text{unknown} \)
  • \(H_0: \mu =73\; vs\; H_a:\mu >73\; @\; \alpha =0.001,\; n=22,\; \sigma \; \text{unknown} \)
  • \(H_0: \mu =1124\; vs\; H_a:\mu <1124\; @\; \alpha =0.001,\; n=21,\; \sigma \; \text{unknown} \)
  • \(H_0: \mu =0.12\; vs\; H_a:\mu \neq 0.12\; @\; \alpha =0.001,\; n=14,\; \sigma =0.026\)
  • Test \(H_0: \mu =50\; vs\; H_a:\mu \neq 50\; @\; \alpha =0.01\).
  • Estimate the observed significance of the test in part (a) and state a decision based on the \(p\)-value approach to hypothesis testing.
  • Test \(H_0: \mu =0\; vs\; H_a:\mu <0\; @\; \alpha =0.001\).
  • Test \(H_0: \mu =250\; vs\; H_a:\mu >250\; @\; \alpha =0.05\).
  • Test \(H_0: \mu =85.5\; vs\; H_a:\mu \neq 85.5\; @\; \alpha =0.01\).
  • Researchers wish to test the efficacy of a program intended to reduce the length of labor in childbirth. The accepted mean labor time in the birth of a first child is \(15.3\) hours. The mean length of the labors of \(13\) first-time mothers in a pilot program was \(8.8\) hours with standard deviation \(3.1\) hours. Assuming a normal distribution of times of labor, test at the \(10\%\) level of significance test whether the mean labor time for all women following this program is less than \(15.3\) hours.
  • A dairy farm uses the somatic cell count (SCC) report on the milk it provides to a processor as one way to monitor the health of its herd. The mean SCC from five samples of raw milk was \(250,000\) cells per milliliter with standard deviation \(37,500\) cell/ml. Test whether these data provide sufficient evidence, at the \(10\%\) level of significance, to conclude that the mean SCC of all milk produced at the dairy exceeds that in the previous report, \(210,250\) cell/ml. Assume a normal distribution of SCC.
  • Six coins of the same type are discovered at an archaeological site. If their weights on average are significantly different from \(5.25\) grams then it can be assumed that their provenance is not the site itself. The coins are weighed and have mean \(4.73\) g with sample standard deviation \(0.18\) g. Perform the relevant test at the \(0.1\%\) (\(\text{1/10th of}\; 1\%\)) level of significance, assuming a normal distribution of weights of all such coins.
  • An economist wishes to determine whether people are driving less than in the past. In one region of the country the number of miles driven per household per year in the past was \(18.59\) thousand miles. A sample of \(15\) households produced a sample mean of \(16.23\) thousand miles for the last year, with sample standard deviation \(4.06\) thousand miles. Assuming a normal distribution of household driving distances per year, perform the relevant test at the \(5\%\) level of significance.
  • Assuming that daily iron intake in women is normally distributed, perform the test that the actual mean daily intake for all women is different from \(18\) mg/day, at the \(10\%\) level of significance.
  • The sample mean is less than \(18\), suggesting that the actual population mean is less than \(18\) mg/day. Perform this test, also at the \(10\%\) level of significance. (The computation of the test statistic done in part (a) still applies here.)
  • Assuming that temperature is normally distributed, perform the test that the mean temperature of dispensed beverages is different from \(170^{\circ}F\), at the \(10\%\) level of significance.
  • The sample mean is greater than \(170\), suggesting that the actual population mean is greater than \(170^{\circ}F\). Perform this test, also at the \(10\%\) level of significance. (The computation of the test statistic done in part (a) still applies here.)
  • Assuming a normal distribution of recovery times, perform the relevant test of hypotheses at the \(10\%\) level of significance.
  • Would the decision be the same at the \(5\%\) level of significance? Answer either by constructing a new rejection region (critical value approach) or by estimating the \(p\)-value of the test in part (a) and comparing it to \(\alpha \).
  • Assuming a normal distribution of errors, test the null hypothesis that the predictions are unbiased (the mean of the population of all errors is \(0\)) versus the alternative that it is biased (the population mean is not \(0\)), at the \(1\%\) level of significance.
  • Would the decision be the same at the \(5\%\) level of significance? The \(10\%\) level of significance? Answer either by constructing new rejection regions (critical value approach) or by estimating the \(p\)-value of the test in part (a) and comparing it to \(\alpha \).
  • Pasteurized milk may not have a standardized plate count (SPC) above \(20,000\) colony-forming bacteria per milliliter (cfu/ml). The mean SPC for five samples was \(21,500\) cfu/ml with sample standard deviation \(750\) cfu/ml. Test the null hypothesis that the mean SPC for this milk is \(20,000\) versus the alternative that it is greater than \(20,000\), at the \(10\%\) level of significance. Assume that the SPC follows a normal distribution.
  • One water quality standard for water that is discharged into a particular type of stream or pond is that the average daily water temperature be at most \(18^{\circ}F\). Six samples taken throughout the day gave the data: \[\begin{matrix} 16.8 & 21.5 & 19.1 & 12.8 & 18.0 & 20.7 \end{matrix}\] The sample mean exceeds \(\bar{x}=18.15\), but perhaps this is only sampling error. Determine whether the data provide sufficient evidence, at the \(10\%\) level of significance, to conclude that the mean temperature for the entire day exceeds \(18^{\circ}F\).
  • A calculator has a built-in algorithm for generating a random number according to the standard normal distribution. Twenty-five numbers thus generated have mean \(0.15\) and sample standard deviation \(0.94\). Test the null hypothesis that the mean of all numbers so generated is \(0\) versus the alternative that it is different from \(0\), at the \(20\%\) level of significance. Assume that the numbers do follow a normal distribution.
  • At every setting a high-speed packing machine delivers a product in amounts that vary from container to container with a normal distribution of standard deviation \(0.12\) ounce. To compare the amount delivered at the current setting to the desired amount \(64.1\) ounce, a quality inspector randomly selects five containers and measures the contents of each, obtaining sample mean \(63.9\) ounces and sample standard deviation \(0.10\) ounce. Test whether the data provide sufficient evidence, at the \(5\%\) level of significance, to conclude that the mean of all containers at the current setting is less than \(64.1\) ounces.
  • Assuming a normal distribution of shear strengths, test the null hypothesis that the mean shear strength of all bolts in the shipment is \(4,350\) lb versus the alternative that it is less than \(4,350\) lb, at the \(10\%\) level of significance.
  • Estimate the \(p\)-value (observed significance) of the test of part (a).
  • Compare the \(p\)-value found in part (b) to \(\alpha = 0.10\) and make a decision based on the \(p\)-value approach. Explain fully.
  • Determine if these data provide sufficient evidence, at the \(1\%\) level of significance, to conclude that the mean average sentence length in the document is less than \(48.72\).
  • Estimate the \(p\)-value of the test.
  • Based on the answers to parts (a) and (b), state whether or not it is likely that the document was written by Oberon Theseus.
  • \(T\leq -2.571\; or\; T \geq 2.571\)
  • \(T \geq 1.319\)
  • \(Z\leq -1.645\; or\; Z \geq 1.645\)
  • \(T\leq -0.855\)
  • \(T\leq -2.201\; or\; T \geq 2.201\)
  • \(T \geq 3.435\)
  • \(T=-2.690,\; df=19,\; -t_{0.005}=-2.861,\; \text{do not reject }H_0\)
  • \(0.01<p-value<0.02,\; \alpha =0.01,\; \text{do not reject }H_0\)
  • \(T=2.398,\; df=7,\; t_{0.05}=1.895,\; \text{reject }H_0\)
  • \(0.01<p-value<0.025,\; \alpha =0.05,\; \text{reject }H_0\)
  • \(T=-7.560,\; df=12,\; -t_{0.10}=-1.356,\; \text{reject }H_0\)
  • \(T=-7.076,\; df=5,\; -t_{0.0005}=-6.869,\; \text{reject }H_0\)
  • \(T=-1.483,\; df=14,\; -t_{0.05}=-1.761,\; \text{do not reject }H_0\)
  • \(T=-1.483,\; df=14,\; -t_{0.10}=-1.345,\; \text{reject }H_0\)
  • \(T=2.069,\; df=6,\; t_{0.10}=1.44,\; \text{reject }H_0\)
  • \(T=2.069,\; df=6,\; t_{0.05}=1.943,\; \text{reject }H_0\)
  • \(T=4.472,\; df=4,\; t_{0.10}=1.533,\; \text{reject }H_0\)
  • \(T=0.798,\; df=24,\; t_{0.10}=1.318,\; \text{do not reject }H_0\)
  • \(T=-1.773,\; df=4,\; -t_{0.05}=-2.132,\; \text{do not reject }H_0\)
  • \(0.05<p-value<0.10\)
  • \(\alpha =0.05,\; \text{do not reject }H_0\)
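The \(T\) values in the answers above all come from the one-sample statistic \(T=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}\) with \(df=n-1\). A quick check against the labor-time study (answer \(T=-7.560,\; df=12\)):

```python
from math import sqrt

# One-sample t statistic for the childbirth-labor exercise:
# n = 13, x_bar = 8.8, s = 3.1, mu_0 = 15.3
n, x_bar, s, mu_0 = 13, 8.8, 3.1, 15.3
t = (x_bar - mu_0) / (s / sqrt(n))
print(round(t, 3), n - 1)  # -7.56 12
```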

8.5: Large Sample Tests for a Population Proportion

On all exercises for this section you may assume that the sample is sufficiently large for the relevant test to be validly performed.
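All of these exercises use the large-sample statistic \(Z=\frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}\). A minimal sketch (the helper function is ours) checked against the first exercise, whose answer is \(Z = 2.277\):

```python
from math import sqrt

def z_prop(p_hat, p_0, n):
    """Large-sample test statistic for a population proportion."""
    return (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)

print(round(z_prop(0.56, 0.50, 360), 3))   # 2.277
print(round(z_prop(0.35, 0.37, 1200), 3))  # -1.435
```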

  • Testing \(H_0:p=0.50\; vs\; H_a:p>0.50,\; n=360,\; \hat{p}=0.56\).
  • Testing \(H_0:p=0.50\; vs\; H_a:p\neq 0.50,\; n=360,\; \hat{p}=0.56\).
  • Testing \(H_0:p=0.37\; vs\; H_a:p<0.37,\; n=1200,\; \hat{p}=0.35\).
  • Testing \(H_0:p=0.72\; vs\; H_a:p<0.72,\; n=2100,\; \hat{p}=0.71\).
  • Testing \(H_0:p=0.83\; vs\; H_a:p\neq 0.83,\; n=500,\; \hat{p}=0.86\).
  • Testing \(H_0:p=0.22\; vs\; H_a:p<0.22,\; n=750,\; \hat{p}=0.18\).
  • For each part of Exercise 1 construct the rejection region for the test for \(\alpha = 0.05\) and make the decision based on your answer to that part of the exercise.
  • For each part of Exercise 2 construct the rejection region for the test for \(\alpha = 0.05\) and make the decision based on your answer to that part of the exercise.
  • For each part of Exercise 1 compute the observed significance (\(p\)-value) of the test and compare it to \(\alpha = 0.05\) in order to make the decision by the \(p\)-value approach to hypothesis testing.
  • For each part of Exercise 2 compute the observed significance (\(p\)-value) of the test and compare it to \(\alpha = 0.05\) in order to make the decision by the \(p\)-value approach to hypothesis testing.
  • Testing \(H_0:p=0.55\; vs\; H_a:p>0.55\; @\; \alpha =0.05,\; n=300,\; \hat{p}=0.60\).
  • Testing \(H_0:p=0.47\; vs\; H_a:p\neq 0.47\; @\; \alpha =0.01,\; n=9750,\; \hat{p}=0.46\).
  • Testing \(H_0:p=0.15\; vs\; H_a:p\neq 0.15\; @\; \alpha =0.001,\; n=1600,\; \hat{p}=0.18\).
  • Testing \(H_0:p=0.90\; vs\; H_a:p>0.90\; @\; \alpha =0.01,\; n=1100,\; \hat{p}=0.91\).
  • Testing \(H_0:p=0.37\; vs\; H_a:p\neq 0.37\; @\; \alpha =0.005,\; n=1300,\; \hat{p}=0.40\).
  • Testing \(H_0:p=0.94\; vs\; H_a:p>0.94\; @\; \alpha =0.05,\; n=1200,\; \hat{p}=0.96\).
  • Testing \(H_0:p=0.25\; vs\; H_a:p<0.25\; @\; \alpha =0.10,\; n=850,\; \hat{p}=0.23\).
  • Testing \(H_0:p=0.33\; vs\; H_a:p\neq 0.33\; @\; \alpha =0.05,\; n=1100,\; \hat{p}=0.30\).
  • Five years ago \(3.9\%\) of children in a certain region lived with someone other than a parent. A sociologist wishes to test whether the current proportion is different. Perform the relevant test at the \(5\%\) level of significance using the following data: in a random sample of \(2,759\) children, \(119\) lived with someone other than a parent.
  • The government of a particular country reports its literacy rate as \(52\%\). A nongovernmental organization believes it to be less. The organization takes a random sample of \(600\) inhabitants and obtains a literacy rate of \(42\%\). Perform the relevant test at the \(0.5\%\) (one-half of \(1\%\)) level of significance.
  • Two years ago \(72\%\) of households in a certain county regularly participated in recycling household waste. The county government wishes to investigate whether that proportion has increased after an intensive campaign promoting recycling. In a survey of \(900\) households, \(674\) regularly participate in recycling. Perform the relevant test at the \(10\%\) level of significance.
  • Prior to a special advertising campaign, \(23\%\) of all adults recognized a particular company’s logo. At the close of the campaign the marketing department commissioned a survey in which \(311\) of \(1,200\) randomly selected adults recognized the logo. Determine, at the \(1\%\) level of significance, whether the data provide sufficient evidence to conclude that more than \(23\%\) of all adults now recognize the company’s logo.
  • A report five years ago stated that \(35.5\%\) of all state-owned bridges in a particular state were “deficient.” An advocacy group took a random sample of \(100\) state-owned bridges in the state and found \(33\) to be currently rated as being “deficient.” Test whether the current proportion of bridges in such condition is \(35.5\%\) versus the alternative that it is different from \(35.5\%\), at the \(10\%\) level of significance.
  • In the previous year the proportion of deposits in checking accounts at a certain bank that were made electronically was \(45\%\). The bank wishes to determine if the proportion is higher this year. It examined \(20,000\) deposit records and found that \(9,217\) were electronic. Determine, at the \(1\%\) level of significance, whether the data provide sufficient evidence to conclude that more than \(45\%\) of all deposits to checking accounts are now being made electronically.
  • Test whether the true proportion of the state’s population that is impoverished is less than \(12\%\), at the \(5\%\) level of significance.
  • Test whether the true proportion of all life insurance claims made to this company that are settled within \(30\) days is less than \(85\%\), at the \(5\%\) level of significance.
  • Test whether the true proportion of all smokers who began smoking before age \(18\) is less than \(90\%\), at the \(1\%\) level of significance.
  • Test whether the true proportion of all current business that is with repeat customers is less than \(68\%\), at the \(1\%\) level of significance.
  • A rule of thumb is that for working individuals one-quarter of household income should be spent on housing. A financial advisor believes that the average proportion of income spent on housing is more than \(0.25\). In a sample of \(30\) households, the mean proportion of household income spent on housing was \(0.285\) with a standard deviation of \(0.063\). Perform the relevant test of hypotheses at the \(1\%\) level of significance. Hint: This exercise could have been presented in an earlier section.
  • Ice cream is legally required to contain at least \(10\%\) milk fat by weight. The manufacturer of an economy ice cream wishes to be close to the legal limit, hence produces its ice cream with a target proportion of \(0.106\) milk fat. A sample of five containers yielded a mean proportion of \(0.094\) milk fat with standard deviation \(0.002\). Test the null hypothesis that the mean proportion of milk fat in all containers is \(0.106\) against the alternative that it is less than \(0.106\), at the \(10\%\) level of significance. Assume that the proportion of milk fat in containers is normally distributed. Hint: This exercise could have been presented in an earlier section.

Large Data Set Exercises

  • Large \(\text{Data Sets 4 and 4A}\) list the results of \(500\) tosses of a die. Let \(p\) denote the proportion of all tosses of this die that would result in a five. Use the sample data to test the hypothesis that \(p\) is different from \(1/6\), at the \(20\%\) level of significance.
  • Large \(\text{Data Set 6}\) records results of a random survey of \(200\) voters in each of two regions, in which they were asked to express whether they prefer Candidate \(A\) for a U.S. Senate seat or prefer some other candidate. Use the full data set (\(400\) observations) to test the hypothesis that the proportion \(p\) of all voters who prefer Candidate \(A\) exceeds \(0.35\). Test at the \(10\%\) level of significance.
  • Lines \(2\) through \(536\) in Large \(\text{Data Set 11}\) are a sample of \(535\) real estate sales in a certain region in 2008. Those that were foreclosure sales are identified with a \(1\) in the second column. Use these data to test, at the \(10\%\) level of significance, the hypothesis that the proportion \(p\) of all real estate sales in this region in 2008 that were foreclosure sales was less than \(25\%\). (The null hypothesis is \(H_0:p=0.25\).)
  • Lines \(537\) through \(1106\) in Large \(\text{Data Set 11}\) are a sample of \(570\) real estate sales in a certain region in 2010. Those that were foreclosure sales are identified with a \(1\) in the second column. Use these data to test, at the \(5\%\) level of significance, the hypothesis that the proportion \(p\) of all real estate sales in this region in 2010 that were foreclosure sales was greater than \(23\%\). (The null hypothesis is \(H_0:p=0.23\).)
  • \(Z = 2.277\)
  • \(Z = -1.435\)
  • \(Z \geq 1.645\); reject \(H_0\)
  • \(Z\leq -1.96\; or\; Z \geq 1.96\); reject \(H_0\)
  • \(Z \leq -1.645\); do not reject \(H_0\)
  • \(p-value=0.0116,\; \alpha =0.05\); reject \(H_0\)
  • \(p-value=0.0232,\; \alpha =0.05\); reject \(H_0\)
  • \(p-value=0.0749,\; \alpha =0.05\); do not reject \(H_0\)
  • \(Z=1.74,\; z_{0.05}=1.645\); reject \(H_0\)
  • \(Z=-1.98,\; -z_{0.005}=-2.576\); do not reject \(H_0\)
  • \(Z=2.24,\; p-value=0.025,\alpha =0.005\); do not reject \(H_0\)
  • \(Z=2.92,\; p-value=0.0018,\alpha =0.05\); reject \(H_0\)
  • \(Z=1.11,\; z_{0.025}=1.96\); do not reject \(H_0\)
  • \(Z=1.93,\; z_{0.10}=1.28\); reject \(H_0\)
  • \(Z=-0.523,\; \pm z_{0.05}=\pm 1.645\); do not reject \(H_0\)
  • \(Z=-1.798,\; -z_{0.05}=-1.645\); reject \(H_0\)
  • \(p-value=0.0359\)
  • \(Z=-8.92,\; -z_{0.01}=-2.33\); reject \(H_0\)
  • \(p-value\approx 0\)
  • \(Z=3.04,\; z_{0.01}=2.33\); reject \(H_0\)
  • \(H_0:p=1/6\; vs\; H_a:p\neq 1/6\). Test Statistic: \(Z = -0.76\). Rejection Region: \((-\infty ,-1.28]\cup [1.28,\infty )\). Decision: Fail to reject \(H_0\).
  • \(H_0:p=0.25\; vs\; H_a:p<0.25\). Test Statistic: \(Z = -1.17\). Rejection Region: \((-\infty ,-1.28]\). Decision: Fail to reject \(H_0\).


K12 LibreTexts

9.6: Significance Test for a Mean


Significance Testing for Means

Evaluating hypotheses for population means using large samples.

When testing a hypothesis for the mean of a normal distribution, we follow a series of six basic steps:

  • State the null and alternative hypotheses.
  • Choose an α level
  • Set the criterion (critical values) for rejecting the null hypothesis.
  • Compute the test statistic.
  • Make a decision (reject or fail to reject the null hypothesis)
  • Interpret the result

If we reject the null hypothesis we are saying that the difference between the observed sample mean and the hypothesized population mean is too great to be attributed to chance. When we fail to reject the null hypothesis, we are saying that the difference between the observed sample mean and the hypothesized population mean is probable if the null hypothesis is true. Essentially, we are willing to attribute this difference to sampling error.
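The computational steps (3 through 5) can be sketched in code for the right-tailed case; the function name and the default α here are our illustration, not part of the text:

```python
from statistics import NormalDist

def z_test_right(x_bar, mu_0, sigma, n, alpha=0.05):
    """Right-tailed z test for a population mean (steps 3-5)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)        # step 3: set the criterion
    z = (x_bar - mu_0) / (sigma / n ** 0.5)         # step 4: compute the test statistic
    decision = "reject H0" if z > z_crit else "fail to reject H0"  # step 5
    return round(z, 3), round(z_crit, 3), decision

# The school-nurse example that follows: x_bar=147, mu_0=145, sigma=20, n=200
print(z_test_right(147, 145, 20, 200))  # (1.414, 1.645, 'fail to reject H0')
```

Steps 1, 2, and 6 (stating the hypotheses, choosing α, and interpreting the result) remain prose, not computation.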

The school nurse was wondering if the average height of 7th graders has been increasing. Over the last 5 years, the average height of a 7th grader was 145 cm with a standard deviation of 20 cm. The school nurse takes a random sample of 200 students and finds that the average height this year is 147 cm. Conduct a single-tailed hypothesis test using a .05 significance level to evaluate the null and alternative hypotheses.

First, we develop our null and alternative hypotheses:

H0: μ = 145

Ha: μ > 145

Choose α=.05. The critical value for this one-tailed test is 1.64. Any test statistic greater than 1.64 will be in the rejection region.

Next, we calculate the test statistic for the sample of 7th graders.

z = (x̄ − μ)/(σ/√n) = (147 − 145)/(20/√200) = 2/1.414 ≈ 1.414

Since the calculated z-score of 1.414 is smaller than 1.64, it does not fall in the rejection region. Our decision is to fail to reject the null hypothesis and conclude that obtaining a sample mean of 147 when the population mean is 145 is likely to have been due to chance.

Testing a Mean Hypothesis Using P-values

We can also test a mean hypothesis using p-values. The following examples show how to do this.

A sample of size 157 is taken from a normal distribution, with a standard deviation of 9. The sample mean is 65.12. Use the 0.01 significance level to test the claim that the population mean is greater than 65.

We always put equality in the null hypothesis, so our claim will be in the alternative hypothesis.

H0: μ = 65

HA: μ > 65

The test statistic is:

z = (x̄ − μ0)/(σ/√n) = (65.12 − 65)/(9/√157) ≈ 0.17

Now we will find the probability of observing a test statistic at least this extreme when assuming the null hypothesis. Since our alternative hypothesis is that the mean is greater, we want to find the probability of z-scores that are greater than our test statistic. The p-value we are looking for is:

p-value = P(z > 0.17) = 1 − P(z < 0.17)

Using a z-score table:

p-value = P(z > 0.17) = 1 − P(z < 0.17) = 1 − 0.5675 = 0.4325 > 0.01

The probability of observing a test statistic at least as big as z = 0.17 is 0.4325. Since this is greater than our significance level, 0.01, we fail to reject the null hypothesis. This means that the data does not support the claim that the mean is greater than 65.
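The same computation in Python, using only the standard library (exact CDF values can differ slightly from table lookups):

```python
from statistics import NormalDist

# n = 157, sigma = 9, x_bar = 65.12, mu_0 = 65, right-tailed test
z = (65.12 - 65) / (9 / 157 ** 0.5)
p_value = 1 - NormalDist().cdf(z)  # right-tail probability
print(round(z, 2))     # 0.17
print(p_value > 0.01)  # True, so we fail to reject H0
```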

Testing a Mean Hypothesis When the Population Standard Deviation is Known

We can also use the standard normal distribution, or z-scores, to test a mean hypothesis when the population standard deviation is known. The next examples, though they have smaller sample sizes, have a known population standard deviation.

1. A sample of size 50 is taken from a normal distribution, with a known population standard deviation of 26. The sample mean is 167.02. Use the 0.05 significance level to test the claim that the population mean is greater than 170.

H0: μ = 170

HA: μ > 170

z = (x̄ − μ0)/(σ/√n) = (167.02 − 170)/(26/√50) ≈ −0.81

p-value = P(z > −0.81) = 1 − P(z < −0.81) = 1 − 0.209 = 0.791 > 0.05

The probability of observing a test statistic at least as big as z = −0.81 is 0.791. Since this is greater than our significance level, 0.05, we fail to reject the null hypothesis. This means that the data does not support the claim that the mean is greater than 170.

2. A sample of size 20 is taken from a normal distribution, with a known population standard deviation of 0.04. The sample mean is 0.194. Use the 0.01 significance level to test the claim that the population mean is equal to 0.22.

We always put equality in the null hypothesis, so our claim will be in the null hypothesis. There is no reason to do a left- or right-tailed test, so we will do a two-tailed test:

H0: μ = 0.22

HA: μ ≠ 0.22

z = (x̄ − μ0)/(σ/√n) = (0.194 − 0.22)/(0.04/√20) ≈ −2.91

Now we will find the probability of observing a test statistic at least this extreme when assuming the null hypothesis. Since our alternative hypothesis is that the mean is not equal to 0.22, we need to find the probability of being less than -2.91, and we also need to find the probability of being greater than positive 2.91. However, since the normal distribution is symmetric, these probabilities will be the same, so we can find one and multiply it by 2:

p-value = 2⋅P(z < −2.91) = 2⋅0.0018 = 0.0036 < 0.01

The probability of observing a test statistic at least as extreme as z=−2.91 is 0.0036. Since this is less than our significance level, 0.01, we reject the null hypothesis. This means that the data does not support the claim that the mean is equal to 0.22.

A sample of size 36 is taken from a normal distribution, with a known population standard deviation of 57. The sample mean is 988.93. Use the 0.05 significance level to test the claim that the population mean is less than 1000.

We always put equality in the null hypothesis, so our claim will be in the alternative hypothesis:

H0: μ = 1000

HA: μ < 1000

z = (x̄ − μ0)/(σ/√n) = (988.93 − 1000)/(57/√36) ≈ −1.17

Now we will find the probability of observing a test statistic at least this extreme when assuming the null hypothesis. Since our alternative hypothesis is that the mean is less than 1000, we need to find the probability of z scores less than -1.17:

p-value=P(z<−1.17)=0.1210>0.05

The probability of observing a test statistic at least as extreme as z=−1.17 is 0.1210. Since this is greater than our significance level, 0.05, we fail to reject the null hypothesis. This means that the data does not support the claim that the mean is less than 1000.
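The three p-value examples above (right-, two-, and left-tailed) all follow one pattern, collected here as a sketch (the helper function is ours, not the text's):

```python
from statistics import NormalDist

def z_test(x_bar, mu_0, sigma, n, tail):
    """Return (z, p-value) for a one-mean z test; tail is 'left', 'right', or 'two'."""
    z = (x_bar - mu_0) / (sigma / n ** 0.5)
    Phi = NormalDist().cdf
    if tail == 'left':
        p = Phi(z)
    elif tail == 'right':
        p = 1 - Phi(z)
    else:
        p = 2 * (1 - Phi(abs(z)))
    return z, p

# Last example above: n = 36, sigma = 57, x_bar = 988.93, H_A: mu < 1000
z, p = z_test(988.93, 1000, 57, 36, 'left')
print(round(z, 2), round(p, 2))  # -1.17 0.12
```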

  • True or False: When we fail to reject the null hypothesis, we are saying that the difference between the observed sample mean and the hypothesized population mean is probable if the null hypothesis is true.
  • What would the null and alternative hypotheses be for this scenario?
  • What would the standard error be for this particular scenario?
  • Describe in your own words how you would set the critical regions and what they would be at an alpha level of .05.
  • Test the null hypothesis and explain your decision
  • A one-tailed or two-tailed test
  • .05 or .01 level of significance
  • A sample size of n=144 or n=444
  • A coal miner claims that the mean amount of coal mined per day is more than 30,000 pounds. A random sample of 150 days finds that the mean amount of coal mined is 20,000 pounds with a standard deviation of 1,000. Test the claim at the 5% level of significance.
  • A high school teacher claims that the average time a student spends on math homework is less than one hour. A random sample of 250 students is drawn and the mean time spent on math homework in this sample was 45 minutes with a standard deviation of 10. Test the teacher’s claim at the 1% level of significance.
  • A student claims that the average time spent studying for a statistics exam is 1.5 hours. A random sample of 200 students is drawn and the sample mean is 150 minutes with a standard deviation of 15. Test the claim at the 10% level of significance.

For problems 7-14, IQ tests are designed to have a standard deviation of 15 points and are intended to have a mean of 100 points. For the following data on scores for the new IQ tests, test the claim that their mean is equal to 100. Use a 0.05 significance level.

  • n=107,x̄=94.77
  • n=56,x̄=109.0012
  • n=17,x̄=100.13
  • n=37,x̄=78.92
  • n=72,x̄=98.73
  • n=10,x̄=103.34
  • n=80,x̄=98.38
  • n=150,x̄=108.89

For 15-16, find the p-value. Explain whether you will reject or fail to reject based on the p-value.

  • Test the claim that the mean is greater than 27, if n=101,x̄=26.99,σ=5
  • Test the claim that the mean is less than 10,000, if n=81,x̄=9941.06,σ=1000

Review (Answers)

To view the Review answers, open this PDF file and look for section 8.4.

Additional Resources

Video: Z Test for Mean

Practice: Significance Test for a Mean

Real World: Paying Attention to Heredity

Statology

Statistics Made Easy

How to Write Hypothesis Test Conclusions (With Examples)

A hypothesis test is used to test whether or not some hypothesis about a population parameter is true.

To perform a hypothesis test in the real world, researchers obtain a random sample from the population and perform a hypothesis test on the sample data, using a null and alternative hypothesis:

  • Null Hypothesis (H 0 ): The sample data occurs purely from chance.
  • Alternative Hypothesis (H A ): The sample data is influenced by some non-random cause.

If the p-value of the hypothesis test is less than some significance level (e.g. α = .05), then we reject the null hypothesis .

Otherwise, if the p-value is not less than some significance level then we fail to reject the null hypothesis .
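In code, this decision rule is a single comparison (the α and p-value here are hypothetical):

```python
alpha = 0.05
p_value = 0.002  # hypothetical result from a test

decision = ("reject the null hypothesis" if p_value < alpha
            else "fail to reject the null hypothesis")
print(decision)  # reject the null hypothesis
```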

When writing the conclusion of a hypothesis test, we typically include:

  • Whether we reject or fail to reject the null hypothesis.
  • The significance level.
  • A short explanation in the context of the hypothesis test.

For example, we would write:

We reject the null hypothesis at the 5% significance level.   There is sufficient evidence to support the claim that…

Or, we would write:

We fail to reject the null hypothesis at the 5% significance level.   There is not sufficient evidence to support the claim that…

The following examples show how to write a hypothesis test conclusion in both scenarios.

Example 1: Reject the Null Hypothesis Conclusion

Suppose a biologist believes that a certain fertilizer will cause plants to grow more during a one-month period than they normally do, which is currently 20 inches. To test this, she applies the fertilizer to each of the plants in her laboratory for one month.

She then performs a hypothesis test at a 5% significance level using the following hypotheses:

  • H0: μ = 20 inches (the fertilizer will have no effect on the mean plant growth)
  • HA: μ > 20 inches (the fertilizer will cause mean plant growth to increase)

Suppose the p-value of the test turns out to be 0.002.

Here is how she would report the results of the hypothesis test:

We reject the null hypothesis at the 5% significance level. There is sufficient evidence to support the claim that this particular fertilizer causes plants to grow more during a one-month period than they normally do.

Example 2: Fail to Reject the Null Hypothesis Conclusion

Suppose the manager of a manufacturing plant wants to test whether or not some new method changes the number of defective widgets produced per month, which is currently 250. To test this, he measures the mean number of defective widgets produced before and after using the new method for one month.

He performs a hypothesis test at a 10% significance level using the following hypotheses:

  • H0: μ_after = μ_before (the mean number of defective widgets is the same before and after using the new method)
  • HA: μ_after ≠ μ_before (the mean number of defective widgets produced is different before and after using the new method)

Suppose the p-value of the test turns out to be 0.27.

Here is how he would report the results of the hypothesis test:

We fail to reject the null hypothesis at the 10% significance level. There is not sufficient evidence to support the claim that the new method leads to a change in the number of defective widgets produced per month.

Additional Resources

The following tutorials provide additional information about hypothesis testing:

  • Introduction to Hypothesis Testing
  • 4 Examples of Hypothesis Testing in Real Life
  • How to Write a Null Hypothesis

Published by Zach
