Chapter 6 Introduction to Inference
We’re at an interesting crossroads in the course where we’ve looked at summarizing data graphically, summarizing data numerically. We’ve looked at study design. We’ve looked at probability. And we want to combine several of those ideas into what’s called inference, statistical inference.
The overarching idea is, you’re wrong. Don’t take it personally but you’re wrong, and how wrong are you? And that’s the question we’re going to try to answer when we look at statistical inference.
So what is inference? Statistical inference is the process of drawing conclusions about the entire population based on the information in a sample. A picture might help. There's a population of interest, and that population has quantities such as the true population mean mu, the true population standard deviation sigma, and the true population correlation rho. We'll discuss these symbols more on the next slide.
Now we want to know what these values are. And what we’ll emphasize and what’s important to always remember is they’re fixed. They’re unknown to us, but there’s a fixed value for each of these that we’d like to know what they are.
Now if we were able to see the entire population, we would easily be able to figure out their value. But that’s too expensive or time consuming or difficult to do, so what we do is we take a random sample. And in the study-design unit we talked about different ways of doing that. And statistical inference is the ability to go from a sample back to the population and say what we think the true value of these population parameters are. So remember, population parameters are fixed in value. They’re unknown to us, and we’d like to know what they are.
So we distinguish between what we call parameters and statistics. Parameters are population values: for example, the true average mortgage amount held by Americans, the true percentage of Americans who have Amazon Prime, or the true amount of time it takes Walmart to process a return when you go back to a store. Statistics are literally functions of the data. So statistics are calculated from sample data, and they're used as estimates for the population parameters.
Now it can get confusing, because we use the words mean, variance, and standard deviation both for the population and for the sample. So here's the distinction: a point estimate is a single value, our sample estimate, and it's our best guess for the parameter value. We also call a point estimate a sample statistic to emphasize that it's calculated from sample data. Point estimates are already familiar to us; we've seen ideas such as the sample correlation r and the sample mean y bar.
Some notation is as follows. We use the words mean and proportion all the time, but we'll distinguish the population one from the sample one. What you'll see are two columns, with Greek letters in one column; those are called population parameters. They are fixed. Their true values are unknown to us, but they're fixed; they don't vary at all. Sample statistics, and the key word is sample, are formed from data. So y bar is the sample mean, s is the sample standard deviation, p is the sample proportion, and r is the sample correlation. They actually vary because they depend on the sample you collected. Every time you collect a different sample, you'll probably get a different y bar, a different s, a different p. And we'll discuss that. That is called sampling variation, and it's a very important concept.
There are two main types of inference. First, there are what are called interval estimates. An interval estimate takes a point estimate and gives you an interval, a lower and upper bound, for where you think the parameter lies. So it answers the question, what are the most plausible values of mu or pi? Remember, we're asking questions about the population parameters: the population mean mu, the population proportion pi. There's also a type of inference called hypothesis testing, which asks a specific question about mu or pi. For example, is it true that the population mean mu is really 5, or is it true that the population proportion pi is really 0.2?
But hypothesis testing is asking a very specific question about a population parameter. Interval estimates are giving you a range where you think the true value might lie. They both have useful purposes, as we’ll soon see.
The estimation process works as follows. There’s a population of interest that we’re going to draw from. There’s a true but unknown mu, population mean. There’s a true but unknown pi, which is the population proportion. We take a sample.
From the sample we form statistics. So y bar will be our guess of mu, and p will be our guess of the population proportion pi. And then we can answer questions such as, what do you think the true value of mu is? What do you think the true value of pi is?
Remember, our questions have to do with the population parameters. We don’t have questions about sample statistics. We know what they are because they’re formed from the sample data we collect. But we want to answer questions about the population parameters.
So we might have statements of the form, I am 95% confident that the true mu is between 40 and 60. Or we might be able to answer statements such as, can I conclude that pi is greater than 0.7? Notice that these statements are about mu and pi, the true but unknown population parameters.
Some examples would be: we're 95% confident that between 52% and 61% of male consumers prefer the Gillette Sensor razor to all other brands. Can we conclude that a higher proportion of female car buyers than male car buyers will purchase a Subaru next year? Or, is spending more money on research and development associated with an increase in revenue? These are all questions about a population of interest, and we'll go through different inferential procedures to understand how to analyze these questions.
Downloads: Introduction to Inference.pptx
6.1 Variability of Sample Statistics
Variability is an important concept in statistics, and it helps us a lot when we’re performing inference. So let’s go back to this idea of parameters versus statistics for a moment. Recall that statistical inference is the process of drawing conclusions about the entire population based on the information in a sample. A parameter is a number that describes some aspect of a population. A statistic is a number that’s computed from the data in a sample and is an estimate of the parameter.
The sample statistic depends on the data collected. Stop. OK, that’s such a simple idea, but it’s such an important idea to this section– that the average of a data set depends on the data you collect. That should make a lot of sense. So every time you collect different data, you’re probably going to get a different average value. And that’s crucial for understanding and being able to do inference.
So the sample statistic depends on the data collected, while the parameter is a fixed value. The idea is that y-bar is random, mu is a fixed value. We don’t know what it is. We’d like to know what it is, but we don’t know what it is.
Although the sample statistic, our estimate, varies from sample to sample, we can use statistical theory to model the variability and understand how much our estimate might differ from the true value. It sounds like a fancy concept. But with a few images, you’ll understand what’s going on.
We actually start with a polling example. Suppose we polled 100 randomly selected voters just before an election, and only 33 said they would vote for our candidate. What can we say about the proportion of all voters who will vote for our candidate? We know 33% of the voters in our sample would. Does that mean the proportion among all voters is 33%?
Our guess, of course, is 0.33. But how wrong are we? We know we’re not correct. We know we’ve got to be wrong. We only asked 100. We didn’t ask all the voters possible. So can we quantify how wrong we are?
Well, we can get an idea of how wrong we are by doing this idea of repeated sampling. We're not prepared to say the true proportion of voters in favor of our candidate is pi equals 0.33. Notice how we went from p equals 0.33 to pi: pi is the true but unknown proportion, the value we'd like to know, while p is our guess from the data we collected.
If we polled another 100 voters, would we still obtain 33% in favor? Probably not. We may not be that far off, but we probably would not get 0.33 again. A new poll might reveal 37% in favor or 27% in favor, and so on.
Can we quantify this variability without a new sample? That is, can we quantify how different a result we'll get each time we do a new poll of 100 people? Well, suppose we do a thought experiment. Imagine the following scenario. Suppose the true support for our candidate is 28% of the population. So that's the true pi. We don't know that, of course. We're trying to figure that out. So suppose we're not just taking a survey of 100 randomly selected people. We're going to send out an army of pollsters, each one asking 100 people who they are going to vote for.
So pollster 1 asks 100 people, and they’ll get a p. Pollster 2 asks 100 people. They’ll get another p. And pollster 3 and 4 and 5, and we’ll have 100 pollsters going out. And each one of them will ask 100 people, are you going to vote for our candidate? And each of them will get a p.
Well, what we can do is, suppose we send out a bunch of these pollsters? We could form a histogram of these different p’s. So suppose we send out 1,000 pollsters. Each one of them asks 100 people, who are you going to vote for? And we get a histogram of the 1,000 p’s that result from this data collection exercise.
Well, the histogram has an interesting shape. It looks bell-shaped. It looks like the normal distribution that we learned in the probability section. It also seems to be centered around 0.28, which is interesting, because that’s the true pi.
We don’t know that, right? We wouldn’t know that in practice. But it turns out if we collected a bunch of these p’s, it looks like it’s bell-shaped and looks like it’s also centered around a true proportion, which is 0.28.
What we’re interested in, though, is can we describe this spread? This thing seems a little bit spread out. And if we could get a handle on how spread out that is, maybe we could quantify how good our guess is.
We might get really unlucky and might get a poll where it’s 0.15 in favor of a candidate, or we might be as high as maybe 0.44 or so. Now, we don’t know. In practice, we do this once. We don’t do repeated sampling. We’ll do it once. And we’ll get one p, which is an estimate of pi.
But it’d be nice to quantify how far our one p is from the true pi. Can we do that without even knowing the true pi? And that’s what we’re moving towards.
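To make this thought experiment concrete, here is a small simulation sketch in R (the course's tool). The setup, 1,000 pollsters each asking 100 voters with the true support set at pi = 0.28, matches the scenario described above; the exact histogram will vary from run to run.

```r
set.seed(1)                     # for reproducibility
pi_true <- 0.28                 # true (but normally unknown) support
n <- 100                        # voters asked by each pollster
p_hat <- rbinom(1000, size = n, prob = pi_true) / n   # 1,000 sample proportions

hist(p_hat, breaks = 20, xlab = "p",
     main = "1,000 sample proportions, n = 100")
mean(p_hat)   # close to the true pi, 0.28
sd(p_hat)     # the spread we want to quantify
```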
So what's the spread in our estimates? There is spread in our estimates because we're working with sample data. Every time we get a new sample, we might get a different p. We might see a p as low as 0.15. We might see a p as high as 0.44. The standard deviation, which we know is a measure of spread, measures the variability of these different estimates.
We can quantify that. In a survey situation like we have here, there's a relatively simple formula for the standard deviation of the estimate p. It's a function of both the unknown population proportion pi and our sample size n: the standard deviation of p equals the square root of pi times (1 minus pi) over n, that is, SD(p) = sqrt(pi(1 - pi)/n).
Now, you might say, if you’re very astute, how does this formula help us, because this formula involves the unknown pi, which in practice we know we don’t know. We know we don’t know. That sounds a little bit weird. But we know we don’t know the unknown pi.
We want to know how spread out our guesses are. And this is a formula that tells me how spread out my estimates are. But it depends on the unknown pi. Well, hang on. In the next unit, we’ll explain a bit more how you do this in practice. This is a theoretical result that we’ll then tell you how to use in an applied setting.
So how do we use the standard deviation in practice? Well, this is important, because we normally don’t take repeated samples. We normally don’t take 1,000 surveys of 100 people each. We have one estimate from one survey. We get one p. And we want to know how good our one p is.
And that’s the point in the next segment. We’ll discuss how to use the standard deviation of p to create interval estimates. And this will tell us how wrong we are. We’ll get a lower bound and an upper bound for how good our guess is and where we think the true but unknown pi does lie.
One thing we want to emphasize before we leave this unit is the importance of sample size. Sample size is important when you perform estimation. Intuitively, the more data we get, the better our estimate should be. But how can we quantify that?
Well, we can do another simulation, similar to our first polling example, to show you the effect of sample size on the standard deviation or the spread of the estimates. So suppose again the true support for our candidate is 28%. Again, that’s pi. That’s the true population proportion.
We don’t know that, of course. I want to emphasize that. Population parameters, we do not know. We want to estimate from sample data. What we want to show you is sampling variability if we survey 50, 200, or 500 people.
Now, if you remember the formula, the standard deviation of p equals the square root of pi times 1 minus pi over n. pi is 0.28. What we're doing is changing n. Now, what should happen? n is in the bottom of that fraction, so as n gets larger and larger, the standard deviation should get smaller and smaller.
So let’s check the graph and see what we get here. So this is n equals 50. And we’re looking at this spread. Again, it seems centered around 0.28. But how spread out are our estimates?
Now we increase the sample size, and these histograms are on the same scale. You can see the distribution is now narrower. If we increase the sample size even more, it gets narrower still. You can imagine that if the sample size got really, really big, this would get narrower and narrower and eventually become a stick right at the true value, 0.28. Sample size does matter when doing estimation. The histograms show that the larger the sample size, the lower the variability in the sampling distribution, so the smaller the standard deviation of the sample statistic. Theory shows that in the formula for the standard deviation, but it's nice to see that simulation gives the exact same result. And this should make sense: a larger sample allows us to collect more information and estimate a population parameter more precisely.
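Here is a sketch of that simulation for the three sample sizes mentioned above (50, 200, and 500), comparing the simulated spread to the theoretical formula. The exact numbers will vary from run to run, but the pattern is the same: bigger n, smaller spread.

```r
set.seed(1)
pi_true <- 0.28
for (n in c(50, 200, 500)) {
  p_hat <- rbinom(1000, size = n, prob = pi_true) / n   # 1,000 simulated polls
  cat("n =", n,
      " simulated SD =", round(sd(p_hat), 4),
      " theory sqrt(pi(1-pi)/n) =", round(sqrt(pi_true * (1 - pi_true) / n), 4),
      "\n")
}
```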
Downloads: Variability of Sample Statistics.pptx
6.2 Interval Estimates for Proportions and Means
We now show how to use sampling variability to create what are called interval estimates for proportions and means. Go back to our polling example because we love survey data and polling examples to motivate things.
Suppose we polled 100 randomly selected voters just before an election and 33 said they would vote for our candidate. What can we say about the proportion of all voters who will vote for our candidate? Our estimate, p, is 0.33: we estimate that 33% of all voters will vote for our candidate. But we could be wrong. How wrong are we?
And let's remember the issues. Our guess, or our estimate, is what we call p. The true but unknown value is called pi. And we'd like some idea of how close our estimate is to the true, unknown value of pi. Is there a chance pi is above 0.5? Is there a chance pi is below 0.4? We got 0.33, but how close is that to the true but unknown value of pi?
There are at least two issues at play here. Issue one is we need to think about whether our sample is really representative of the population of voters, and whether people will tell the pollsters the truth about how they will vote. Statistical calculations cannot help us grapple with these issues; they are study design issues. We'll assume that the sample is perfectly representative of the population of voters and that every person will vote as they said they would on the poll, that is, that people are being truthful with us. Even the fanciest interval estimation will not help you if you have problems drawing a good, representative sample.
Issue two is we need to think about sampling error, quantified by what's called the standard error. That is, just by chance, our sample may contain a smaller or larger fraction of people voting for our candidate than does the overall population. How do we account for that? We can use knowledge about how sample statistics vary to find a confidence interval for the population parameter. A confidence interval for a parameter is an interval computed from sample data by a method that will capture the parameter for a specified proportion of all samples. The success rate, that is, the proportion of all samples whose intervals contain the parameter, is known as the confidence level.
Now, a 100% confidence interval for a proportion is actually very easy to create. Since we only know the proportion in one sample, there's no way to be absolutely sure about the proportion in the population. The best we can do is calculate a range of values that brackets the population proportion. For a proportion, we're 100% confident the true value is in the interval 0 to 1; it has to be in that interval because it's a proportion.
However, is this interval useful? Not at all. It sounds impressive: I'm 100% confident I know where the true but unknown pi is; pi has to be somewhere in the interval. But if a consultant you've hired tells you the interval is 0 to 1, you'd probably want to fire them, because that's not a useful result. You know pi is in there, but it would be nice if the interval were a little bit narrower than that.
A wide interval, such as 0 to 1, is not that helpful. To create a narrower and more useful range, you must accept the possibility that the interval will not include the true population value. There's always a chance we might be wrong. Statisticians usually accept a 5% chance that the range will not include the true population value. This range, or interval, is called a 95% confidence interval, often abbreviated 95% CI, for confidence interval.
What does the 95% confidence interval mean? We are actually confident in the process of constructing the interval, not in any one interval. What we mean by that is a 95% confidence interval means that if we construct 100 intervals using this method, on average, 95 of them will include the true but unknown parameter. Five of them, on average, will be wrong. We don't know exactly which ones, but we know the process works in general.
A formula for the 95% confidence interval for the population proportion is relatively easy to give. Confidence intervals are always given for population values, so we're trying to get a handle on where the true but unknown population proportion pi is. The 95% confidence interval is given by the following formula: our estimate p, plus or minus 2 times sqrt(p(1 - p)/n), often abbreviated p plus or minus 2SE, where SE = sqrt(p(1 - p)/n) is called the standard error.
Everything here is calculated from sample data: n is our sample size and p is our estimate. So everything is known to us, and this is an easy formula to work with. The quantity 2SE, everything past the plus or minus sign, is called the margin of error for the survey.
As an example, suppose you have a small online business, and this month 200 users signed up on your website. And 23 of them bought your premium $800 service. The rest of them are just moochers, and they're using your free content. So you want to know, what's your conversion rate? What's a 95% confidence interval for your conversion rate from this sample?
Well, it's easy enough to do this in R, and at the end of this unit we'll show you R code to calculate all of it. Here's what the output looks like. Our sample proportion p is 23 out of 200, which is 0.115. What's circled in the output is the 95% confidence interval for the true proportion: we're pretty sure the true proportion is somewhere in the interval 0.075 to 0.169, roughly 0.17.
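As a preview of that code, here is one plausible way to get an interval like the one circled: the manual p plus or minus 2SE formula, and base R's prop.test (whose default method uses a continuity correction, so its interval differs slightly from the manual one, but both land in roughly the same place).

```r
x <- 23; n <- 200
p  <- x / n                      # sample conversion rate, 0.115
se <- sqrt(p * (1 - p) / n)      # standard error
p + c(-2, 2) * se                # manual 95% CI, roughly 0.07 to 0.16

prop.test(x, n)$conf.int         # base R's interval, roughly 0.075 to 0.17
```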
We don't know exactly what the true pi is, but we're pretty sure it's somewhere in that interval. It could be as low as 0.08. It could be as high as 0.17. Remember, 0.115 is just our guess; we can't claim that's the true pi based on one sample of data. The confidence interval gives us a range of estimates, and we're pretty sure the true pi is somewhere in that interval.
Again, understanding confidence intervals, a 95% confidence interval should work 95% of the time. This means, again, if we generate many different intervals, around 5% won’t contain the true population proportion, or the population parameter, in general. We don’t know if any one given interval does or does not, but we have faith in the process of constructing these intervals, and that’s why we use the phrase, we’re 95% confident the true proportion is in that interval.
To show you visually, we could look at many confidence intervals. We collect a sample of data, we form a guess of the population proportion, and we have an interval. Here, the green line is pi, the true proportion, and the red dot is our estimate p. Every time we collect data, we form an estimate p and an interval. As you can see, most intervals contain the true but unknown population proportion pi, but some do not: this one is too high, this one is too low. We generated 20 intervals here, and two of them missed. In general, though, we trust the process: 95 out of 100 intervals should contain the true unknown population parameter.
We have what’s called the confidence level of the interval. This tells us how often the estimation method is successful. If the method has a 95% confidence level, it works in 95% of surveys. We say the method works, if the interval captures the true value of the population parameter. The confidence level measures the success rate of the method, not of any one particular interval.
You can also change the confidence level by changing the margin of error. The greater the margin of error, the higher our confidence level. As we see from this table, if the margin of error is 1SE, you have a 68% confidence interval; 2SE is our standard one, a 95% confidence interval; and 3SE gives roughly a 99.7% confidence level. So what do we see? As our confidence goes up, the width of the interval always goes up.
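Those confidence levels come straight from the normal curve, so a one-line check in R reproduces the table (assuming the estimate is approximately normally distributed):

```r
# Probability a normal estimate falls within k standard errors of its mean
sapply(1:3, function(k) pnorm(k) - pnorm(-k))
# about 0.683, 0.954, 0.997
```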
You always want to be a little bit wary if someone comes to you with something other than a 95% confidence interval. If someone gives you a lower confidence level, there may be a reason why. They may be trying to hide a number from you: if they don't want to show that a certain value is in the interval, say that you do not have 50% market share, they may give you a narrower interval by quoting only 90% confidence, or even 80% or 70%.
If someone gives you a 99% confidence interval, you also want to be a little bit wary. They may be trying to include a number. It sounds fantastic: I'm 99% confident I know where the true unknown population proportion is. But remember, 100% confidence sounds fantastic too, and that interval is 0 to 1, which you don't want anyone to give you.
So always be a little bit wary. If they’re not giving you a 95% confidence, which is the default setting in every stat package, including R, you want to ask them, why are you giving me a 90% interval or why are you giving me a 99% interval? So is there a reason why they are trying to exclude a number? Is there a reason why they’re trying to include a number? So 95 is the default we normally use. And ask questions if they are giving you something other than that.
We know how to construct confidence intervals for a proportion, and a similar idea works for a mean. The 95% confidence interval for a population mean mu is given by y bar plus or minus 2SE.
As we'll see throughout the course from this point on, you can create confidence intervals for many quantities. They are all of the same general form: an estimate, plus or minus 2 times a standard error, where the standard error measures how variable the estimate is.
Of course, the standard error is a little bit different for means than it is for proportions. For means, the standard error takes the form s over the square root of n, where s is the standard deviation of our data set and n, of course, is the sample size.
Note this interval is valid if n is larger than about 30. R makes some adjustments if the sample size is small: with a small sample there's a little more variability in the estimate, so the interval has to be widened slightly. R does this for us automatically.
Here's an example of estimating a population mean. Recall the haircut survey data from module 2: we want to open a new haircut shop in Harvard Square, and we want to know what the market will bear for haircut prices. So we did a survey and asked, what do you usually pay for your haircut? Using R, we find a 95% confidence interval for the average price paid, as follows. We'll show you the code to create this at the end of the unit, but the interval is circled here. R is very nice: the 95% confidence interval is the default.
And remember, what is this interval for? It's for mu. Mu is the true average price paid by consumers in Harvard Square for a haircut. We don't know what the true mu is; we're trying to figure it out. We get a sample of data and a sample mean. That, again, is our y bar, our estimate of mu. But we know mu is not exactly 34.67. We're wrong. How wrong we are is what this interval tells us.
We're pretty sure, we're 95% confident, the true unknown mu is in the interval $29.27 to $40. That's roughly an $11 swing, so it's a pretty wide interval, and that has to do with the sample size and also with how much spread is in the data. This was, essentially, a skewed distribution, if you recall. So we don't know the true mu, but we know it's somewhere inside that interval.
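As a preview of the code shown at the end of the unit, here is a sketch of how that interval is produced. We don't have the raw survey responses on this page, so the vector name haircut_price is hypothetical; with the real module 2 data loaded, t.test gives the interval quoted above, and the manual large-sample version is close to it.

```r
# haircut_price is a hypothetical vector of the survey responses,
# e.g. haircut_price <- haircuts$Price once the module 2 data are loaded
t.test(haircut_price)$conf.int          # 95% CI for mu (about $29 to $40 with the real data)

# Manual large-sample version of the same interval
ybar <- mean(haircut_price); s <- sd(haircut_price); n <- length(haircut_price)
ybar + c(-2, 2) * s / sqrt(n)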
We can break this down by gender. We could form a confidence interval for men and a confidence interval for women. And one question of interest, of course, is whether the average price paid by men is different from the average price paid by women. We'll examine that in just a few slides.
Downloads: Interval Estimates for Proportions and Means.pptx
6.3 Introduction to Hypothesis Testing
As you might recall, we said there were two types of inference. There’s confidence intervals and hypothesis testing. We just showed you how to do confidence intervals for means and proportions, and now we’ll go over the logic of hypothesis testing.
We actually don’t do hypothesis testing in practice until the next unit. We need to go over the logic of it, because it’s a little bit involved. It’s not that difficult to understand, but you have to understand the logic of hypothesis testing before we can actually show you how to do it in practice.
So we've discussed how to make inference on parameters in two different ways. In plain English, how do we figure out what the heck the true mean or proportion is? We started with what are called point estimates, and then we showed you confidence intervals.
Point estimates give you a single best guess and by themselves are pretty useless, because it doesn’t make a lot of sense to know y bar is this or p is this. You want to know, well, how wrong are we? That’s a guess. How good is my guess? So that’s what confidence intervals do. Confidence intervals give you a range of guesses.
Alternatively, you might be interested in using the information in the sample to test a specific statement, called a hypothesis, about parameters in the population.
So what is a hypothesis? It’s a statement regarding a characteristic of one or more populations. The hypothesis can be about a single population parameter, such as mu or pi for example. Or in future units, we’ll extend hypothesis testing to other situations.
Some example hypotheses might be: last year 62% of American adults regularly volunteered their time for charity work, and a researcher believes that this percentage is different today. A car manufacturer claims its new hybrid car has a mean mileage of 60 miles per gallon; do the data support this claim? Do different web page designs lead to different purchase behaviors? These are all hypotheses about a population, and we're going to learn how to use sample data to answer such questions about the population.
A caution, as always: we test these types of statements using sample data, because it's usually impossible, impractical, or too expensive to gain access to the entire population. Because we're using sample data, there will always be a chance of making a mistake.
If population data were available, there of course would be no need for any inferential statistics. We would be able to calculate the population mean mu, the population proportion pi, and so forth. But whenever you work with sample data, there's always a chance of making a mistake.
Hypothesis testing steps work as follows. You make a statement regarding the nature of the population. That’s very important you do this first– no peeking at your data. You then collect evidence, sample data to test the statement. You then analyze the data to assess the plausibility of your statement. And then you state conclusions about the hypotheses.
We start with what's called the null hypothesis, denoted H sub O. It's a statement about our population that we want to test. It is always stated as an equality, and it's always written in terms of population parameters. So we would write H sub O: mu equals 60, where mu is the average miles per gallon of Ford's new hybrid car. Or H sub O: pi equals 0.4, where pi is the proportion of female voters who will vote for your candidate.
Now, you would never write a hypothesis in terms of sample data. You would never write H sub O y bar equals 60. That’s verboten. You always write hypotheses in terms of population values. It’s a question about the population. You know if y bar equals 60, because you collect data and you’ll know, is y bar 60, yes or no? Population quantities are what you write hypotheses about.
Now, what seems bizarre, but it's the logic of hypothesis testing, is that we are typically hoping to disprove H sub O. So we've set it up, but we're typically hoping to disprove H sub O. The alternative hypothesis is denoted H sub A, for alternative, and it's the assertion that's contrary to H sub O.
In these notes we’ll use what are called two-sided alternatives, which are written as follows. The null hypothesis is that mu equals 60. The alternative hypothesis is that mu does not equal 60. It’s called a two-sided hypothesis because there are two different ways you can not equal 60. You can either be bigger than 60 or less than 60. So that’s two ways you can not equal 60. So this is called a two-sided hypothesis test.
This is the same thing for proportions, a two-sided hypothesis test for proportions. H sub O pi equals 0.4 versus the alternative hypothesis, pi does not equal 0.4. Now, this is where the logic of hypothesis testing takes a little bit of effort to understand. H sub A is the statement we are hoping to conclude is correct. And so this is the logic of proving versus disproving.
Since we only have sample data, we can really only disprove a theory, not prove it, since we haven't seen all the data. So we're asking a question about a population parameter. Does mu equal 60? Is it really true that Ford's new hybrid averages 60 miles per gallon?
Well, we’re not able to test every single car that exists. So how can we say that yeah, mu equals 60? It’s actually going to be a lot easier to disprove something than to prove something. In hypothesis testing, we assume the null hypothesis is true then see if our result is so inconsistent as to make it unreasonable to be true. That is, we hope to disprove the null hypothesis and conclude the alternative is true.
As an example, and I love this example because I actually own a three-legged cat called Cassino, suppose the null hypothesis is that all cats have four legs. To prove this null hypothesis, what would you need to do? You'd need to gather up every cat in the world and show they all have four legs, a virtually impossible thing to do. We all know how hard herding cats is. But to disprove this hypothesis, what would we need to do? We'd go to my house and meet my cat Cassino, who has three legs. So it's actually easier to disprove H sub O, and hence say H sub A is true, than to prove H sub O is true.
That is, a statement about the population is hard to prove because you never see everyone in the population. What we're hoping for is to find a counterexample, an example that says, you know what? The null's not true. And then we can say the alternative is true.
So the logic of hypothesis testing is you’re hoping to reject the null hypothesis by finding a three-legged cat and then saying, aha, the alternative hypothesis is true.
In fact, as we said, you can usually never prove H sub O true. Why? Because you’re never going to see the entire population. I said usually because yeah, if you’re under a situation where you’re able to look at the entire population, you can decide whether H sub O really is true. But in general, you’re dealing with sample data so you’re never going to be able to prove the null hypothesis true, because you’re never going to see everyone in the population.
Because of this logic, because of this idea that you’re never able to say H sub naught is absolutely true, we never accept the null hypothesis– because again, without having access to the entire population, we don’t know the exact value of the parameters stated in the null hypothesis.
Rather, we do not say we accept the null. We say we do not reject the null hypothesis. So we can’t say it’s true. We just say there’s not enough evidence to reject the null hypothesis. This is like the court system. We never declare a defendant innocent, but rather say the defendant is not guilty. There was not enough evidence to find the defendant guilty. We can’t say they’re innocent. We just say they’re not guilty.
So when you run a hypothesis test, there are several outcomes. You can reject the null hypothesis when the alternative hypothesis is true; this decision would be correct. You can fail to reject the null hypothesis when the null hypothesis is true; this decision would also be correct. And there's always a chance we've come to the wrong conclusion, but we control this error through the hypothesis testing procedure.
In the next few slides, we’ll show you how to actually run a hypothesis test for proportions and means and understand the different outcomes that can happen.
Downloads: Introduction to Hypothesis Testing.pptx
6.4 Testing a Proportion or Mean
We now go through the mechanics of testing hypotheses for proportions and means. With hypothesis testing, it’s much easier to work through an example than just start with theory right away. So suppose last year, 40% of US cable users used digital video recording systems– DVRs, such as TiVo.
A cable company is wondering if this proportion has now changed, and wants to test the following hypothesis. The null hypothesis is pi equals 0.4. The alternative hypothesis is pi does not equal 0.4.
Now remember, we can’t emphasize this enough. We always put the population quantities there. It makes no sense to say, does p equal 0.4, versus p not equal to 0.4. You’ll know that as soon as you collect your data. It is or isn’t.
But we’re asking something about the population quantity. pi is fixed. P varies. It’s based on the sample. And we want to know– is there evidence that pi equals 0.4? Or is there evidence pi does not equal 0.4?
The logic of hypothesis testing works as follows. We gather data. We obtain the sample statistic, p. The hypothesis is about pi. But we don’t know the true value. It’s fixed. But we have no idea what it is. P is our estimate for pi.
So we’re asking the question– let’s look at the null hypothesis again. We’re asking the question, does pi equal 0.4? Well, it’s very natural to say, well, does p equal 0.4? Or is p away from 0.4?
So we have a question of interest about pi. We don’t know what pi is. We collect data. And p is our estimate. So any question saying, is pi near 0.4, is pi away from 0.4, has to be answered by using our estimate, p.
So naturally the question is, how big a difference is there? How far is p from 0.4? Is p near 0.4 or far from 0.4? And it's on either side, because it's a two-sided hypothesis, so really we're looking at the absolute value of p minus 0.4: we care about large differences whether p is above 0.4 or below 0.4.
If this difference were around 0, we wouldn't reject the null hypothesis that pi equals 0.4. That would seem plausible. Again, we don't observe pi; p is our guess, our estimate. And if that's near 0.4, we think, yeah, it seems plausible that pi is 0.4.
However, what's considered a big difference? How far away does p have to be from 0.4 for us to say, you know what, I think pi does not equal 0.4, pi is different from 0.4? To measure the difference between the sample statistic p and the hypothesized value 0.4, we create what's called a test statistic. The test statistic for testing a proportion is called z, and it's the estimate p, minus the hypothesized value pi sub naught, divided by what's called the standard error: z = (p minus pi sub naught)/SE. This is a very useful statistic, and our decision is to reject the null hypothesis for large values of z, positive or negative.
To understand this thing, look only at the top. The bottom is always positive; it's the standard error, and it's there to knock out the units and account for sampling variability. But if you want to understand what's going on, look at the numerator. What would it mean for z to be around 0? If z were 0, that would mean p minus pi sub naught equals 0, which means p equals pi sub naught. So essentially, if the top of that fraction is around 0, p is close to pi sub naught; here, p would be close to 0.4.
However, if the top of that fraction is large, that would mean p is far away from pi sub naught. So that gives us an idea of when you'd want to say the null is probably supported, and when you'd want to say the null is not supported, reject it, and conclude the alternative hypothesis is true.
So that test statistic is very important. But to really understand what’s going on, you want to study the top of that test statistic. And that really makes it clear how to interpret what’s going on. Well, we can determine what’s considered a small or a large value of the test statistic by calculating what’s called a p-value. And that’s just short for probability value. The p-value is the probability of observing the test statistic or a more extreme one, assuming the null is true. And it really is a measure of consistency.
How consistent is what we observed under the null hypothesis? That’s what the p value is measuring. The further away our estimate is from the hypothesized value, the closer our p-value is to 0. Because a p-value is a probability, it is always in the range 0 to 1.
The smaller the p-value, the less likely we were to have seen the data if the null hypothesis were true. Data are inconsistent with the null hypothesis. We would then reject the null hypothesis.
The larger the p-value– our data are not inconsistent with having been produced using the null hypothesis– we would fail to reject the null hypothesis. So in many ways, the p-value is a consistency measure. The higher the p-value, the more consistent our data is, having been produced under the null hypothesis. The smaller the p-value, the more consistent our data is, having been produced under the alternative hypothesis.
R computes the p-value for us. It’s not something we have to worry about computing by hand.
There is a continuum for the p-value. Many people say that if a p-value is below 0.05, then you reject the null hypothesis. We have a mantra that says, "if p is low, Ho must go." But in reality, there really is a continuum for p-values that works as follows. For a p-value above 0.10, we would say there is no evidence against the null hypothesis; in fact, we would say we fail to reject the null hypothesis.
And then there’s a continuum here. You could say between 0.05 and 0.1, weak evidence. 0.01 to 0.05, moderate evidence against the null hypothesis. P-value of below 0.01, strong evidence against the null hypothesis. Whenever we say against the null hypothesis, that means we would reject the null hypothesis and accept the alternative hypothesis.
We use the phrase “statistical significance.” We say the sample results are statistically significant if we have convincing evidence against the null hypothesis and in favor of the alternative hypothesis. That means we would reject the null and accept the alternative hypothesis.
Although many people reject the null hypothesis if the p-value is below 0.05, a modern take on hypothesis testing is to simply report the p-value and let the reader decide on the course of action. However, for this course, the mantra “if p is low, Ho must go,” thinking if p is below 0.05 you reject the null hypothesis, will serve you well when you’re trying to interpret our output.
As an example, let's go back to the DVR question. We want to test the null hypothesis that the population proportion is 0.4 against the alternative hypothesis that the population proportion is not 0.4. We survey 300 people, and 129 use a DVR on a regular basis.
The R output for testing a proportion is as follows. And again, at the end of this unit we'll go over the R code for doing all this. Here's the test statistic, which by itself is really hard to interpret. It's not 0, but is it far enough away from 0 that we would reject?
What we care about here is the p-value. If p is low, Ho must go. That p-value is not below 0.05, so we would fail to reject the null hypothesis.
There is a consistency in what we’re doing here with the confidence interval. So here’s our sample proportion, 0.43. And here is a confidence interval for where we think the true pi is. And lo and behold, what’s inside that interval? 0.4 is inside the interval, which means 0.4 is a plausible value. So if someone said, is 0.4 a plausible value for the population proportion? According to the confidence interval, yes it is.
Another way to answer that question is, we have a high p-value. So under the null hypothesis, that the true proportion is 0.4, the data we observed– a sample statistic of 0.43– is not that inconsistent. It’s a consistent idea. If the true proportion is 0.4, it’s not that unusual to get a value of 0.43.
How's that p-value of 0.2888 actually calculated? Well, the test statistic is 1.0607. The p-value is calculated to be the shaded area. So the question is, how extreme was the test statistic you saw?
So we add up these two shaded areas. And that’s what the p-value is. And so the p-value is saying, how extreme is what you saw? And you can see, if you saw a z value of around 0, the p-value would actually be 1, because it wouldn’t be an extreme value whatsoever. We’ll come back to this idea in just a few slides to cement it in your brains.
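If you want to see that shaded-area calculation concretely, here is a short sketch that reproduces the test statistic and the two-sided p-value from the survey counts. (The output on the slide may have come from a different R function, but the numbers agree.)

```r
x <- 129; n <- 300; pi0 <- 0.4
p  <- x / n                          # sample proportion, 0.43
se <- sqrt(pi0 * (1 - pi0) / n)      # standard error under the null
z  <- (p - pi0) / se                 # test statistic, about 1.0607
2 * (1 - pnorm(abs(z)))              # two-sided p-value, about 0.2888
```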
How do we state our conclusion? Since the p-value, 0.2888, is not less than 0.05, we fail to reject the null hypothesis. We would say that the sample data do not disprove the null hypothesis. We fail to reject the null hypothesis.
Note that we can’t say the null hypothesis is true since we haven’t seen everyone in the population. We just have to say we can’t reject the null hypothesis.
Now, this is a useful idea to remember: as the sample proportion p moves further from the hypothesized value, the p-value decreases. What do I mean by that? Well, suppose we have the null and alternative hypotheses as before, and suppose we observe different sample proportions, always with the same sample size: 0.4, 0.41, 0.42, and so on.
What seems to be happening to the p-value? This is what’s interesting here. As we get further and further away from the hypothesized value of 0.4– and we only happen to be going on the right-hand side of 0.4. This would also work the same way if we went in the opposite direction. But what do we see happening?
As we get further and further away from the hypothesized value of 0.4, there should be more evidence that the null is not true. And how is that seen in the table? The p-value is getting smaller and smaller. So eventually, you would reject. You would reject. You’d reject. Maybe here, also– that’s a small p-value.
But this is what the p-value is measuring. Is what you’re seeing consistent with the null or consistent with the alternative? And it should stand to reason that the further and further I get a sample proportion away from 0.4, I should be rejecting.
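To make the pattern in that table concrete, here is a quick sketch that recomputes the two-sided p-value as the sample proportion moves away from 0.4. The sample size on the slide isn't shown here, so n = 300 (as in the DVR example) is an assumption; the downward pattern is the same for any n.

```r
n <- 300; pi0 <- 0.4                 # n = 300 is assumed; the slide's n isn't shown here
for (p in seq(0.40, 0.48, by = 0.02)) {
  z <- (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
  cat("sample p =", p, "  p-value =", round(2 * (1 - pnorm(abs(z))), 4), "\n")
}
```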
And that’s what you’re seeing here. As you get further and further away, you are rejecting the null hypothesis. So that shows you the relationship with the sample statistic p and the p-value. This is testing proportion. You can do similar ideas for testing a mean.
Now, we start with an example, as always. Suppose we have a medicine that is being manufactured. And each pill is supposed to have 14 milligrams of the active ingredient.
And we care about quality control. There are important issues if our pills have too much of the active ingredient, maybe bad side effects, or too little, no relief from the medicine. So we want 14 milligrams, on average, in our pills. What are our null and alternative hypotheses?
Remember, again, and we always like to emphasize this, that the null and the alternative are written in terms of population quantities. So we're asking, does the population mean equal 14, or does the population mean not equal 14?
And we also want to emphasize that we write the null and the alternative before we look at our data. You don’t peek at the data and then determine, oh, if we write it this way, we’ll get the result we want. You’re supposed to set the hypotheses first. And then look at your data.
There is a test statistic for testing your mean. It’s slightly different looking than the test statistic for testing a proportion. For testing a mean, the test statistic is defined as follows, and called a t instead of a z.
So the test statistic is our estimate y bar, minus the hypothesized value mu sub naught, divided by the standard error: t = (y bar minus mu sub naught)/SE. The general setup is that we test whether mu equals some hypothesized value mu sub naught against the alternative that mu does not equal mu sub naught.
It's called a t since it follows what's called the t-distribution, which is very similar to a standard normal distribution, just slightly different. That's why it's called a t.
Now, if you want to examine what’s going on, like we did with the proportion, most of the action is in the numerator here. So we’re asking, does mu equal mu sub naught? We don’t get to observe mu. Y bar is our proxy for mu. So it’s natural to look at the distance between y bar and mu sub naught.
If that value is near 0– so the quantity here is near 0– you’re going to think, oh, there’s evidence that the null is plausible. However, if that quantity is far away from 0, you’re going to think, reject the null. There’s evidence the alternative is plausible.
To show testing a mean in practice, suppose we collect 30 pills and find the sample average is 14.3 and the standard deviation is 2.3 for the amount of active ingredient. The R output is as follows. Here is the test statistic, 0.71422. By itself, in a vacuum, you can't interpret that. What we care about is the p-value.
The p-value is 0.4807. The rule is, if p is low, Ho must go: if the p-value is below 0.05, we reject the null hypothesis. It's not, so we would not reject the null hypothesis.
Another way to look at this, of course, is we’re testing the null hypothesis, mu equal 14. We’re testing the alternative hypothesis, mu does not equal 14. And there is a relationship between confidence intervals and hypothesis testing.
Down here, here’s the confidence interval. We don’t know what mu is. But we’re pretty sure mu is in that confidence interval. So a logical question is, is it plausible that mu equals 14?
You look inside the interval and you say, well, yeah. That’s plausible. We don’t know exactly what mu is. But we’re pretty sure mu is somewhere in that interval. 14 is in that interval, which means it’s a plausible value for mu.
So we have to fail to reject the null hypothesis. We can't accept it, because we haven't seen the entire population. We fail to reject the null hypothesis. The p-value is the probability of getting a test statistic as extreme as what we observed, 0.71422, in either direction. And this is a picture, again, of the p-value calculation: we look at values of the test statistic that are as large or larger in absolute value.
And we calculate that shaded area. And again, as the test statistic gets closer and closer to 0, that means our hypothesized value is closer and closer to our sample value. The p-value will get larger and larger and eventually go to 1.
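Because we only have the summary statistics here (n = 30, y bar = 14.3, s = 2.3), here is a sketch that reproduces the t statistic and p-value from those summaries. With the raw measurements in a vector, say a hypothetical one called pills, you would simply call t.test(pills, mu = 14) instead.

```r
n <- 30; ybar <- 14.3; s <- 2.3; mu0 <- 14
se <- s / sqrt(n)                    # standard error of the mean
t  <- (ybar - mu0) / se              # test statistic, about 0.7144
2 * pt(-abs(t), df = n - 1)          # two-sided p-value, about 0.48
```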
As an example to, again, reinforce the relationship between the sample statistic and the p-value, suppose our sample mean was now 15.3. In the previous example it was 14.3 and we failed to reject. But now suppose that y bar is 15.3, with the same standard deviation as before. What should happen to the p-value, and why?
Well, remember, what are we testing? We’re testing the hypothesis that mu equals 14. And we’re testing the alternative hypothesis that mu does not equal 14. As we get further and further away from 14– so we saw, y bar equals 14.3 in the previous example. We failed to reject. But as we get further and further away from the hypothesized value, it would seem logical that the p-value should get lower and lower. There should be more evidence that the alternative is true.
You’re not going to believe the null. You’re going to think, oh, I’m getting a further and further away value from the hypothesized value, 14. I’m going to reject.
So what should happen to the p-value? It should go down; there's more evidence the null is not true. And if we run this in R, we do indeed see the p-value is now 0.004. We would reject the null hypothesis. Again, if p is low, Ho must go: if the p-value is below 0.05, we reject. Seeing a y bar of 15.3 is not consistent with the null hypothesis that the true population mean is 14.
Again, another way to look at this is, here’s the confidence interval for mu. We don’t know what mu is. But we’re pretty sure it’s inside that interval. Is 14 in that interval now? No. 14 is not a plausible value. And we can see that in two ways. We can see that we get a really small p-value over here. Or the confidence interval now does not span 14. 14 is not in it anymore.
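The same two-line calculation as before, with the new sample mean plugged in, reproduces that small p-value:

```r
t <- (15.3 - 14) / (2.3 / sqrt(30))  # test statistic, about 3.10
2 * pt(-abs(t), df = 29)             # two-sided p-value, about 0.004: reject the null
```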
Downloads: Testing a Proportion or Mean.pptx
6.5 Two Sample Testing
We've seen how to do confidence intervals and hypothesis testing for one-sample proportions and one-sample means. We now move to two-sample testing. This is very popular, because often you want to see if there's a difference between men and women in the response to a survey, or whether West Coast drivers get different miles per gallon than East Coast drivers. You want to compare two different groups. We call that two-sample testing, and it's very useful in a business context.
So we’ve learned how to make inferences about a single population. We’re now going to learn how to compare two populations. Such problems often arise in practice. We may wish to compare the mean retirement ages of workers in the public and private sectors, or compare the percentage of women and men who will purchase a Subaru next year.
Businesses are increasingly using data to drive decision making, and are often using hypothesis tests for experiments. In website analytics projects involving randomized experiments called A/B tests, two-sample hypothesis tests are used to analyze the results. Different ad placements on websites, or whether you get offered a pop-up for free shipping or for a 10% off coupon, are popular tests done in web analytics, and they're called A/B tests.
As an example, consider two independent samples. We assume we have two simple random samples taken from two different populations, and we wish to test the hypothesis that the two means are equal versus the alternative that the two means are not equal. Sometimes this is written in an equivalent manner as mu 1 minus mu 2 equals 0 versus mu 1 minus mu 2 does not equal 0. It's exactly the same; some places like to write it as a difference, some as an equality. So the null hypothesis is that the two population means are the same, and the alternative hypothesis is that the two population means are different.
Similar in spirit to what we did for the one sample case, we’ll examine the difference y bar 1 minus y bar 2, where the obvious notation is y bar 1 is the mean of group one and y bar 2 is the mean of group two. So again, let’s remember what’s going on here. We have a hypothesis where we’re asking does mu 1 equal mu 2. Well, we don’t get to see mu 1. We don’t get to see mu 2. y bar 1 is our proxy, is our estimate for mu 1. y bar 2 is our proxy, our fill-in, our estimate, for mu 2.
So asking if mu 1 equals mu 2, an obvious thing to look at is asking let’s look at the difference of y bar 1 minus y bar 2. If that difference is near 0, that would seem plausible that maybe the null is true. If that difference is away from 0, we would say the null’s not true. The alternative has to be true. So again, we look at the difference. If the difference is near 0, there’s no need to reject the null. If the difference is away from 0, we will reject the null hypothesis. R computes a p-value for us, which helps us decide what to do.
So as an example, a company is thinking of purchasing a new customer relationship management software suite. That's a mouthful. Essentially, CRM software helps you keep track of cold calling and of contacts with customers. Ideally, they only want to purchase the software if it reduces the amount of time to create new customer records, so employees are more efficient. 20 salespeople were randomized to test out the new system versus the current one. The mean time for the 11 users to complete the task under the current system was 37 seconds with a standard deviation of 22.4 seconds. For the 9 users on the new system, the mean time to completion was 18 seconds with a standard deviation of 13.4 seconds. Is there a difference in the task time between the two systems on average?
From R we have the following output. What can we conclude? Well, let’s remember what we’re trying to figure out here. So what are we testing? We’re testing the hypothesis mu 1 equal mu 2, and the alternative hypothesis is that mu 1 does not equal mu 2. So in English we’re asking, on average is the task completion time the same for the two different software suites, or is there evidence that on average, the task completion time is different?
So there are several things going on in this output. We could just jump to what's circled, but we also want to highlight some other things. The p-value is 0.03, and our general interpretation, as always, is: if p is low, Ho must go. The p-value is below 0.05, so there is evidence against the null hypothesis. The data are not consistent with the null and are consistent with the alternative. So here we would reject the null and accept the alternative, because the p-value is below 0.05.
Here, by the way, is the mean of group one and group two. R by default calls one group x and the other group y; we've been calling them y bar 1 and y bar 2 to denote the two different groups, but it's the same idea. And what it's giving you here is the confidence interval for y bar 1 minus y bar 2 (that's supposed to be a minus sign, y bar 1 minus y bar 2). What's interesting is what is not inside that confidence interval. So if someone says, interpret the confidence interval for the difference of two means, what we mean by that is we want to know: is the interval all negative, is the interval all positive, or does it span 0? Is 0 in that interval?
Well, what is this a confidence interval for? This is a confidence interval for the difference mu 1 minus mu 2. It's based on our estimate, y bar 1 minus y bar 2, but it's a confidence interval for the difference mu 1 minus mu 2. And this tells me I'm 95% confident the true difference is between 1.89 and 36. That's a wide range, but what's important is that it's an all-positive interval. It's not plausible at all that mu 1 minus mu 2 equals 0, which is what equality would mean. So the fact that the interval is all positive is the same evidence as the p-value telling us that we reject the null of equality. We accept the alternative that they are in fact different. They're not equal to each other; mu 1 does not equal mu 2.
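For reference, output like this can be produced from the summary statistics alone using the tsum.test function in the BSDA package, the same function used later in the R unit. Here is a minimal sketch, assuming the numbers quoted above, with x as the current system and y as the new system.

    # Two-sample (Welch) t-test from summary statistics; requires the BSDA package.
    library(BSDA)
    tsum.test(mean.x = 37, s.x = 22.4, n.x = 11,   # current system: 11 users
              mean.y = 18, s.y = 13.4, n.y = 9)    # new system: 9 users
    # Gives a p-value near 0.03 and a 95% CI for the difference that is entirely positive.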
So to do two-sample comparisons when we have two independent samples, we generate a p-value from R and decide: if p is low, Ho must go, and we reject the null hypothesis of equality. If the p-value is above 0.05, we fail to reject the null hypothesis. That's the two independent samples case.
There’s something different called paired data, and paired data is a different animal and it has to be treated differently than what we saw with two independent samples. Consider the following data from Expedia.com. And we have data on 10 cities and we look at the price of a room at a Hampton Inn and we look at the price of a La Quinta Inn in these 10 different cities across the country. We want to test the claim that Hampton Inn hotels are priced differently on average than La Quinta Inns.
So here’s our data. And what we have is Houston, Tampa Bay, 10 different cities, and we have the Hampton Inn price in that city and the La Quinta Inn price in that city. And we want to test the hypothesis on average are they the same price or are they different. What’s very important to point out is this is not independent data. The first procedure we went over assumed you have two columns of data and they’re independent of each other. We have some employees randomized to use the current software suite, some employees randomized to use the other software suite, two independent groups. In this case, we have what’s called paired data or matched data. They’re matched on each city and they have to be treated a little bit different.
These data are called paired or matched, where the cities are the strata. The two hotels within each city share many common characteristics: cost of living, cost of labor, cost of energy and food in each city. There are going to be some shared characteristics in each city. These data have to be treated differently since they are not two independent samples.
So the hypothesis is, again, are the two means equal or are they different? But the way you analyze paired data is you first calculate the difference within each pair, and then you compare the average difference to 0. So we're testing whether the average difference equals 0 versus the average difference does not equal 0. So what are we doing?
We go back to each pair and we calculate the difference. So this would be a minus 6, this would be a 16, this would be a 12, and so on. We call this the difference, and then we calculate the average difference. So if you have paired data, you can't analyze the two columns as independent samples. You have to take the difference within each pair and then analyze those paired differences. That's the way the hypothesis test is done. So the hypothesis test of interest is: is the average difference 0, or is the average difference not 0? To analyze paired data, we first calculate the difference within each pair and then compare the average difference to 0.
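As a sketch of that calculation (with made-up prices purely for illustration, not the actual Expedia data; only the first three differences, minus 6, 16, and 12, match what was just described), the paired analysis reduces to a one-sample test on the differences:

    # Hypothetical prices for illustration only -- not the real data.
    hampton  <- c(110, 145, 120)     # Hampton Inn price in three cities
    laquinta <- c(116, 129, 108)     # La Quinta price in the same three cities
    d <- hampton - laquinta          # within-city differences: -6, 16, 12
    t.test(d, mu = 0)                # test whether the average difference is 0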
Luckily, R can do this for us automatically, as we'll see in just a little bit when we do the R unit for this section. For the hotel comparison data, we do what's called a paired t-test, as opposed to an independent sample t-test, and we get a p-value. So what are we testing? It's always good to write down what the null and the alternative are. We're testing whether the average difference is 0; the alternative is that the average difference is not 0. And again, if p is low, Ho must go. Here the p-value is not below 0.05, so we fail to reject the null hypothesis. There's not enough evidence to conclude that the average price of Hampton Inns is different from the average price of La Quinta Inns.
Downloads: Two Sample Testing.pptx
6.6 Two Sample Testing
We just saw how to work with means, both for two independent groups of data and for a paired or matched group of data. We can do a similar thing for proportions from independent groups: you have two independent samples, and you want to compare the proportions.
So instead of means, we can examine whether proportions from two independent populations are equal. In this case, we test does pi 1 equal pi 2, and the alternative is that pi 1 does not equal pi 2. Again, an equivalent way of writing these is pi 1 minus pi 2 equals 0, and pi 1 minus pi 2 does not equal 0.
As an example, over a period of two weeks, visitors to a website were randomly presented two different promotions. Free shipping was presented to 455 users, and 37 purchased the product, whereas a 10% off coupon was presented to 438 users, and 22 purchased the product. We're assuming these are two different groups of users, so the 455 and the 438 were not interspersed; it's two independent groups of users. And we want to know whether the proportion who purchased was the same or different. So essentially, was there a difference between the promotions in having people purchase (I shouldn't say causing people to purchase), or was there no difference between the two promotions?
The R output looks as follows. These tests for proportions are very easy to run in R, as we'll soon see. From this data, we have the following output from R. You can see what the top looks like, and we're showing you a little bit of how you use R. That's the data in shorthand R notation: 37 yeses out of 455, 22 yeses out of 438.
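A minimal sketch of the call behind output like this; turning off the continuity correction is an assumption here, but it is what matches the z-based p-value quoted later in the R unit.

    # Two-sample test of proportions: free shipping vs. 10% off coupon.
    prop.test(x = c(37, 22),          # purchasers in each group
              n = c(455, 438),        # users shown each promotion
              correct = FALSE)        # assumption: no continuity correction, matching the notes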
Now, what we want to grab here is the p-value. And so, again, what are we testing? We always want to write this down so we can remember how to interpret the p-value. The null hypothesis is that the percentage that purchased a product was the same. And then the alternative hypothesis is that the percentage who purchased is not the same.
If p is low, Ho must go. So again, we examine the p-value, and that's a strength of evidence. If the p-value is high, our data is consistent with having been produced under the null hypothesis. If the p-value is low, our data is not consistent with having been produced under the null hypothesis. So if p is low, Ho must go. This p-value is not below 0.05, so we fail to reject the null hypothesis.
6.7 R Code for Statistical Inference
We’re now going to go over some R code to show you how to do one sample confidence intervals for means and proportions, one-sample hypothesis testing for means and proportions, and two-sample hypothesis testing for means and proportions.
So we go back to our familiar RStudio setup, and these are the Unit 5 R commands. First, a confidence interval for a proportion, and we can just use the prop.test command to do that. So if you remember this example, 200 users signed up on your website, and 23 of them bought your $800 service. The others were the free moochers. What is your proportion of users who actually buy your service?
We do prop.test. We give it the number of yeses and the total sample size. So let's run that. Here we have the data, 23 out of 200, and here's the 95% confidence interval, same as in the slide, 0.075 to 0.169. You can change the confidence level; 95% is the default for every confidence interval routine built into R. If you want a different confidence level, you say conf.level and give it a decimal for whatever level you want. So let's run this and see what happens.
And by the way, can you predict what's going to happen? Is it going to be narrower or wider? We're less confident, so the interval has to get narrower. Less confident, narrower interval. So we run this; it was 0.075 to 0.169.
It’s now 0.8 to 0.16. It got slightly narrower, because we’re less confident. And it tells us a 90% confidence interval. Now, the sample proportion is exactly the same. We haven’t changed the data. It’s still 23 yeses out of 200. But what we just did was we changed the confidence level.
We can also do confidence intervals for means. So we're going to read in the haircut data, which we saw in an earlier unit. We can make sure that was read in; let's look at the names here and make sure we got it. So it's male and haircut.
Now, let’s figure out a confidence level for the average haircut cost. And so we do what’s called the t.test command. We just give the data here, and it’s going to be a 95% confidence interval for the population mean. And here we go, 95% confidence interval. We’re 95% confident the true average cost of a haircut, the true average price people pay for a haircut, is between $29.27 and $40.0708. Again, as we did with the proportion, you can change the confidence level, make it a 90. That’ll be a more narrow interval. And you can see that goes from 30 to 39, so it’s a little bit narrower than what we saw before. Now, you can also do it for subgroups.
So we saw this by command before. And what this does is, we give it the data, we give it a grouping variable, which in this case is male, and we give it the command we want to do, which was t.test. We saw how to use this command. We could put in mean or standard deviation in the by command.
But now we want to run the t.test function for the different subgroups, male and female. So let's run this and see what the output looks like. You get a lot of output, but it says group 0, here's the confidence interval. So this says we're 95% confident that, on average, women spend somewhere in the $30s up to $52.78 on haircuts.
And then this is for the males. This is group 1. On average, we’re 95% confident, and it’s a lower interval. Men spend between $19.50 or so to $31.90 on their haircuts. So very useful, the by command. You can put almost any R function in here so it’s easy to get confidence intervals for subgroups, if, again, you have a grouping variable you can put in.
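A sketch of that by() call, using the same assumed data frame name:

    # Run t.test separately within each level of male (0 = female, 1 = male).
    by(haircuts$haircut, haircuts$male, t.test)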
So that’s confidence intervals for one-sample mean, one-sample proportion. Let’s look at hypothesis testing. This was the DVR. We want to know if 40% of people are using a DVR on a regular basis. We asked 300 people, and 129 said yes. So let’s see what we get here for our hypothesis test. This is called the prop.test command, stands for testing a proportion.
And what we find out is that the p-value is 0.288. That's exactly the p-value we saw in our slides. And we know if p is low, Ho must go. We fail to reject, because it's not a small p-value. The output tells you exactly what the alternative hypothesis is; the null hypothesis is always equality.
So we’re putting in here the hypothesized value, and we call that p equals 0.4. That’s the hypothesized value. And it says here, no probability is 0.4. That’s the null. And the alternative hypothesis at true p is not equal to a 0.4. That’s the alternative hypothesis.
Now, unfortunately, proportions are a little bit of a strange animal, and there are many ways to run a hypothesis test on proportions. What R likes to do is not give you what we call the z-test statistic; it gives you, essentially, the square of the z statistic here.
If you really want to get the z statistic, as in the notes, you have to call in the BSDA package. You say library(BSDA) and load it in, and then you call what's called zsum.test. You have to explicitly give it the proportion you saw, which is 129 out of 300, and then you have to give it the standard deviation under the null, the square root of 0.4 times 1 minus 0.4. Then you also have to separately give it n, and you have to give it the hypothesized value.
It’s a lot of work to get the exact same p-value, as we’ll see in a second. If we run this, look at the p-value here. It’s exactly the same as the prop.test p-value. It’s a little bit confusing if you’ve never seen this before, but the only reason we’re doing this is to show you you’re getting the exact z-value that we gave in the notes. So if you go through the notes, you’ll see we gave a z of 1.0607.
That was p minus pi sub0 over the standard error. And R can calculate that for you. We don’t really need that in practice. All we really need is this p-value, so we don’t really care if we’re getting this z statistic in practice, or if we’re getting the square of it.
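A sketch of the zsum.test call described above; sigma.x is the standard deviation under the null, the square root of 0.4 times 0.6.

    library(BSDA)
    zsum.test(mean.x = 129/300,            # the sample proportion, 0.43
              sigma.x = sqrt(0.4 * 0.6),   # sd under the null: sqrt(pi0 * (1 - pi0))
              n.x = 300,                   # sample size
              mu = 0.4)                    # hypothesized proportion
    # z = (0.43 - 0.40) / sqrt(0.4 * 0.6 / 300), about 1.06; same p-value as the prop.test above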
All we care about is interpreting that p-value, so be aware of that. prop.test is absolutely fine to run; the p-value it gives is the correct p-value, and we interpret it for a one-sample test of a proportion. Now, we can do a test of a mean. If we remember, we did a test to see whether, on average, the amount of drug in a pill was 14 milligrams or not. In the first example we did in the notes, we saw a sample mean of 14.3, and we want to test: does the true mean equal 14, or does it not equal 14?
That’s what this test is doing. And when we run this, we get a warning message. And in general, warning messages in R, ignore. If it’s an error, yeah, that’s a big deal, but it’s just telling you a little bit of fancy warning. We ignore that down here.
This is the p-value we got in the notes, 0.4807. And again, the output tells you what's going on: the alternative is that the true mean is not equal to 14, and the hypothesized value is 14. Remember, the null is always equality. So if you see the alternative that the true mean is not equal to 14, you can always work backwards and figure out that the null was that it does equal 14.
The p-value is what we care about here. If p is low, Ho must go. We fail to reject the null; there's not enough evidence to say the true mean is different from 14.
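A sketch of that one-sample test; pill_amount is an assumed name for the vector of measured drug amounts.

    # Test H0: mu = 14 versus Ha: mu != 14 milligrams.
    t.test(pill_amount, mu = 14)   # pill_amount is a hypothetical vector name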
Now, if you remember, we did another example on that slide, where we asked what would happen to the p-value if we saw a y-bar much farther from 14 than 14.3. That's what we're doing in this second example here. The p-value should go from 0.4807 to something much smaller, and that's in fact what we see: the p-value gets very small. We would now reject the null and accept the alternative, because we saw a value of y-bar far away from the hypothesized value of 14. That's one-sample testing of a mean and one-sample testing of a proportion. We can also do two-sample testing of means. This was the example looking at the customer relationship management software, where we had 11 people using the standard system and nine people trying out the new system, and we had the amount of time people took on each system.
So all we do is we do what’s called the tsum.test. We have to give it all the different summary statistics here. We run this, and we’ll get the p-value that we saw in the notes, p-value of 0.03159. Reject the null, the strong evidence, there is a difference between the two systems.
Next, we talked about paired data and how you analyze it. We had data from the Hampton Inn and La Quinta Inn, and in this case we're not reading it in from a CSV file; it's small enough that I'm just typing it in as two vectors here.
So I have one data set on the top and the other data set on the bottom. Let's highlight this and run it so R knows about these two variables. We want to compare: on average, are they the same price or are they different prices? Because this is paired data, we're going to use t.test, but we have to say paired equals TRUE to tell R to treat this as paired data.
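A sketch of that call, with hampton and laquinta standing in as assumed names for the two typed-in price vectors:

    t.test(hampton, laquinta, paired = TRUE)   # paired t-test on the city-matched prices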
So we run this, and we get a paired t-test; that's a reminder that we are running a paired t-test. We see a p-value of 0.3099. As we know, if p is low, Ho must go; 0.3099 is not below 0.05, so we fail to reject the null hypothesis. So we've seen how to handle independent means and paired means. Finally, we want to do proportions, and this is how we test two proportions. This was looking at whether free shipping or a 10% off coupon was better at getting people to make a purchase on a website.
So we have 37 out of 455 people offered free shipping who made a purchase, and 22 out of 438 people offered the 10% off coupon who made a purchase, and we want to run a hypothesis test on that. Again, prop.test can be used for one-sample or two-sample tests.
So we run this, and we get the p-value you saw in the notes of 0.06152. Because that’s not below 0.05, we fail to reject the null hypothesis. There’s not enough evidence to say there’s a difference in the two proportions.
6.8 Pitfalls in Statistical Inference
This is the critical assessment, or the wrap up, of the statistical inference module.
So what have we learned so far? Statistical inference allows statements to be made about populations from a sample. Remember, parameters are population values; those are the Greek letters, mu, pi. We want to know what they are. We get a sample of data, and we form estimates. And we'd like to know how good our estimates are.
Confidence intervals provide a range of plausible values for a population parameter. Decisions about population parameters can be made by using what are called hypothesis tests. Inferences can be made for population means and population proportions for both one and two sample situations.
Now that you’ve gone through different ideas in statistical inference, what are some challenges and issues you think arise when trying to create confidence intervals and interpret hypothesis tests?
So what are some words of caution when performing statistical inference? Well, Louise, we talk about bias and sampling. And what people find interesting is they say, I understand whether I'm dealing with proportions or with means, I know the right command to use in R, I know how to interpret a p-value. Can they still make mistakes?
Yeah, it’s always really important to keep in mind where your data comes from. And this is something that we’ve touched on in previous modules is sort of the experimental design stage. So there are multiple stages, and keeping all of them in mind in the analysis stage is really important.
So even though you’re doing the right thing, your data might not be right, or might be messed up in some way, the sampling was wrong, or things like that.
So as usual, think of your data. It’s not just getting the right routine in R. Think about where your data is coming from. Think about how it was collected. That’s vitally important.
Yeah, definitely.
As much as knowing how to interpret a p-value, you want to know a lot about your data, correct? OK, good to keep in mind there.
Now, this happens a lot, Reagan. Where do we worry about sample size? Could we get a p-value that's large due to small sample sizes?
Yeah, definitely. A lot of times when we’re interpreting p-values and doing hypothesis tests, we’re trying to say that there is sufficient evidence in the data to lead us to a particular conclusion. So sometimes when we don’t find a significant result or if we have a large p-value, it might not mean that there’s nothing going on there, that there’s no relationship, or that in truth the hypothesis test wouldn’t be significant.
Rather, it just means that the data we have, perhaps due to sample size, doesn't give us sufficient evidence. So we might just not have enough data available to lead us to the correct conclusion. Now, on the flip side of that, there are also things that can happen when you have a lot of data. So what happens when we have a lot of data?
Well, sometimes when you have a lot of data, you can find spurious–
Anything, right?
Yeah, random results.
Everything shows up as being significant.
Right. So you definitely have to be careful about that situation too.
So we think, oh, large data sets, fantastic, right? But you want to be cautious either way, right? Small data and large data.
Definitely. It’s good to keep in mind. Now, we’re going to put you on the spot Reagen. We don’t make fun of students on campus, but this seems to be the hardest idea for students to get how to correctly interpret a confidence interval. And people want to put a probability statement in. They want to say they know for a fact where the true population value is.
Can you put this to rest? How do we interpret a confidence interval?
Right. So the common interpretation among students– the thing that people want to say about confidence intervals is that the probability that the true population parameter is between the lower bound and the upper bound of the confidence interval is equal to 0.95 or 0.99, depending on the level of the confidence interval. That statement is not correct, because these are just numbers we’re talking about, right? So a confidence interval is two values. And the true population parameter, in truth, it’s just a number.
So the probability that it lies between these bounds that are constructed by your confidence interval, that probability is 0 or 1.
Right. So as an example, it’s like you’re saying the probability is 95% that the number 5 is between 7 and 12. And that makes no sense.
Right. Exactly.
It either is or it isn’t. It is or it isn’t.
Right.
Right. So that’s why we can’t say that probability statement. Instead, what a confidence interval is really saying is that using this procedure, using this rule for constructing a confidence interval for this particular sample of data that we have, we’re 95% confident. We’re confident that if we use that procedure over and over and over repeated samples, 95% of the time it will give us a correct confidence interval. 95% of the time, the confidence interval that we construct will contain the true population parameter.
We believe in the process. Actually, we will never know for any one interval we produce. It’s actually a very weird idea.
Right.
Because we never know what the true population value is. But we believe the process we're using is a very good process. Yeah, a confusing idea. Thank you very much; that clears it up. Now, Louise, Reagan alluded to this with large data sets, but this idea of p-hacking, what can go wrong? What is p-hacking, and what does it mean?
So p-hacking, it’s kind of a negative view on this hypothesis testing. It’s the idea that if you have enough hypotheses, eventually you can just kind of test whether this parameter is 0, or this parameter is 50, or whatever. And eventually, just due to noise, you’ll end up with a significant result. And so p-hacking is the process of looking for those significant results.
And if you have enough hypotheses, it will eventually pick up this noise and say that it’s significant. And so in general, we have to be really careful about how we’re doing hypothesis testing, because it’s really about the process of doing hypothesis testing. As well as similar to building a confidence interval, we’re confident in that process of building hypothesis test. And we just have to be really careful to not– Over-fit, to over-test, to just have a plan in place for the analysis.
Yeah, exactly.
Don’t just say, let’s just work at this until we find something.
Exactly.
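To make the p-hacking point concrete, here is a small sketch, again not from the course: run many tests on pure noise, where the null is true every time, and a handful will still come out below 0.05.

    # Sketch: many hypothesis tests on noise alone still yield some small p-values.
    set.seed(2)
    pvals <- replicate(100, t.test(rnorm(25), mu = 0)$p.value)   # 100 tests, H0 true in all
    sum(pvals < 0.05)   # typically around 5 "significant" results purely by chance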
We want to be deliberate about what we're doing, always. Now, in the same vein, this idea of practical versus statistical significance. So what does that mean?
I think this is another idea, again, Reagan, of large-sample issues. Sometimes do you find differences that statistically are there, but don't mean anything at all?
Yeah, definitely. Sometimes if you have a large data set and some pre-specified hypothesis that you’re going to test, you may find a significant result. For example, you might find that the difference between two means for two different populations is significant. But because you test it over a sample of 100,000 people, a difference in means of 0.001 may have been flagged as significant just due to sample size. So in that case, you have a significant result, but that difference of 0.001–
What does it mean in reality.
Yeah, it has no use really. I remember a study, it was for nasal spray for allergies. And they were looking at side effects. And 26% of people taking the nasal spray got headaches. And 24% of people not on the nasal spray reported headaches.
So for a 2% difference in headaches, are you going to go and not use the nasal spray? And this is the idea: the larger the data set, the more you can find differences that are statistically significant even though they're not practically significant. Now, I will put Louise on the spot here.
This is a question actually asked in an exam earlier this week. And this idea that you can come to a wrong conclusion when you do hypothesis testing. So even if I reject the null hypothesis, I might still be wrong. Is that true? This is the problem with dealing with population quantities, right?
Yeah. And again, generally the process is good. And that process is mostly good, but sometimes mistakes– not mistakes, I guess things can happen.
You can control your error rate, but there's always, always– unless you look at the entire population, as long as you're dealing with sample data, you could always make a mistake, right?
Yeah, definitely. And I think these are things really leaning on the process and saying, you know, I trust the process, but the process says I can still make mistakes. And so it’s really important to keep that in mind. And when you’re interpreting your conclusions, or when you’re making decisions based on the hypothesis test, it’s really important to sort of have that in mind always. You can’t be 100% certain unless you go out and test the entire population that you’re interested in.
Always the issue of sample data.
Yeah.
You are never 100% certain.
Yeah, there’s definitely always a trade off there.
And then finally just to wrap this up, you’ve got to be aware of the data you have. We have learned how to work with one sample means, one sample proportions, two sample means, two sample proportions. You don’t want to use the wrong routine on the wrong data, correct?
Yeah, and the thing that you’ll discover about R is that you can give it binary data where you would normally want to do a proportions test. It’s not going to yell at you.
No, it will give you an answer. It doesn’t mean it’s the right answer.
I think you can give it binary data when you want to do a means test, right?
Exactly.
And it’s not going to yell.
Yeah. And it’ll give you an answer. And so it’s really important to know what kind of test you’re doing, to know your data, to know the procedure.
Now, how do you know if it’s matched versus unmatched? I think students often have a question when we know we have two different sets of data, we want to do a comparison of means. But how do I know if it’s two independent sets of data or if we’re dealing with match data? I think some people understand, OK, I can recognize it’s quantitative data versus binary data. I can recognize I have two groups of data. But are there any hints you can tell people for recognizing match data versus independent data?
Yeah, you definitely have to look at the units in the two groups. In general, if you're comparing two groups where the individual units within those groups are different, that should suggest you're probably working with independent samples. On the other hand, if you're looking at the same group of units in two different situations, for example, you give a group of people a pretest, and then later you give them that same test again as a post-test, where the individual units didn't change, that would be matched data.
So just thinking about who are the individuals within each group that you’re trying to compare I think can give you a lot of intuition about what the correct test is.
Excellent. It is always very useful tips. Thank you.
6.9 Causal Inference
https://www.tandfonline.com/doi/full/10.1080/10691898.2020.1752859