Chapter 7 Linear Models I

Material: http://skranz.github.io//r/2020/10/08/RTutor-CompetitionPolicy.html

7.1 Introduction

Heather Whiteman Protagonist Video

So with our people analytics team, we stick to some very basic statistics, and even descriptive analytics, when we’re trying to analyze business results. That’s because half the time, just knowing the information about the organization is the most important part. So distributions are far more powerful in explaining a data story than people give them credit for. You can use basic measures of central tendency to explain to an organization what’s going on with their people. You can get a bit further and spend time on diagnostic techniques to really understand why things happened. These can be very simple methods– correlation, averages, rates. And the trends tell you more.

Too often we hear people trying to jump to predictive modeling, doing really advanced statistics just for the sake of doing them. And they are amazing. But at times, people forget how important the basis around understanding your data and telling a story with the data is. So I always recommend starting with some pretty pure descriptive statistics, correlations, metrics, and understanding it.

Once you’ve been able to advance, and you really feel that you know your population and what data you’re looking at, then it’s nice to step into some areas around getting a little bit more predictive and maybe even prescriptive with your data. We’ve been able to accomplish so much off of really basic things like statistical significance of mean differences or regression analysis, very simple linear regression, multiple regression, logistic regression. These are incredibly powerful tools. And when you’re working with a population that has a lot of unexplained variance, they can help you get a feeling for strength. But they can also help you understand how much about people data is left to variance and chance. And so I do recommend starting with more basic statistics and working your way up.

7.2 Introduction to Regression Modeling

We are now going to start a unit, which is the first of two units on linear models. The first set of slides I’m going to go through is going to give you a basic sense of what’s called regression modeling. I’ll be going through this unit– that’ll cover this particular week of material. And then I think Professor Rader is going to be covering the second set of material for linear models.

So I want you to consider the following data scenarios, which is going to really establish where we’re heading. So first of all, does increased advertising budget correspond to increasing sales? You might be interested in whether or not sales is going to change as a result of increasing the advertising budget.

The second question you might have in a given situation is, what demographic characteristics of potential clients predict the greatest repeat business? A third question one might ask in a particular business situation is, which is more important in measuring the success of a firm: the number of years of education or the number of businesses that employee has previously worked for? So all of these situations have common features.

In each of these cases, we have a variable of primary interest, which is the response, and we have one or more variables that are being used to explain or predict the response. So there’s this asymmetry that we’re introducing into the problem, which we want to be able to take advantage of. So we would like to develop a statistical procedure to model the relationship between a response and a set of predictors.

One thing that’s going to be nice about this entire setup is we’re going to make very heavy use of the material that we’ve been developing all along, in particular the last unit where we ended up discussing statistical inference. We’re going to be applying all these principles to this new setting. Just to give a very concrete example, let’s talk about one of our favorite topics: ice cream. We’re going to focus specifically on ice cream consumption. And this is from a data set that was gathered many years ago, between 1951 and 1953.

And the data that were collected in this study are the average number of pints of ice cream consumed per person and the average daily temperature. The situation here is that there were a total of 30 four-weekly observations. So these were measurements taken once every four weeks over individual regions, counting the average number of pints of ice cream per person in those regions.

So the question of interest here is, is ice cream consumption predictable from the average temperature on the day in which the data were collected? To give you a feel for this, let’s look at a scatter plot to see what the relationship looks like. So on the horizontal axis, we have the temperature in degrees Fahrenheit. And on the vertical axis, we have the consumption of ice cream per head in pints.

And you can see, just as a descriptive analysis, that the higher the temperature gets on a given day, the greater the ice cream consumption. And that seems to make some sense, because the hotter it is outside, the more likely it is that you’re probably going to consume ice cream. What we’re really interested in is being able to describe the relationship between these two, to have temperature predict the consumption of ice cream.

And just to spoil the fun, here’s going to be the best prediction. It’s this line that cuts through this set of data. We’re going to come back to this line as a way of summarizing the relationship, but also using some of the ideas that we learned in the last unit on statistical inference to be making some very concrete statistical statements about this line.

So if we wanted to specifically predict consumption from the specific temperature on a given day, that’s a question that we can answer or we’d like to be able to answer using the set up that we’re introducing today. So for a given temperature, what is the best prediction for the average ice cream consumption per head? So the idea here is to use the value on the line.

And to be very concrete, let’s try to come up with a specific example. So based on that graph and the line that I showed you, how can we estimate the average ice cream consumption per head when the temperature, say, is 50 degrees Fahrenheit? So here’s how you might do it. So here is what I claim to be the line that best describes this data.

So if you’re interested in predicting the consumption of ice cream per head in pints when it’s 50 degrees out, the first thing you could do is go to the point on the line where the temperature is 50 degrees. Now, see where that point on the line is on the vertical axis. So go over in this direction to the vertical axis, and you can see that it actually hits the vertical axis at, if you were to do the calculation, 0.363. So what this says is that, according to this line, the estimated consumption is 0.363 pints per head. And in a subsequent segment, we’re going to learn how to obtain that value exactly. The kind of procedure we’re going to end up using to get an estimate of that line is called linear least-squares regression. So this is going to be the method that we’re going to learn to determine the linear relationship between a response variable and a predictor variable.

So we’ll refer to it as linear least-squares regression or, more compactly, as linear regression, which is actually how we’ll refer to it for the rest of this unit. So linear regression will provide the formulas for determining these prediction equations, like the one that gave us the approximately 0.36 pints per head of ice cream consumed. And linear regression can be used not only for one predictor. We’ll be able to see how it’s going to be used for multiple predictors, and we’ll examine these in the upcoming segments.

Downloads: Introduction to Regression Modeling.pptx

7.3 Simple Linear Regression

All right, we are now going to start our formal introduction to linear models with something called simple linear regression. So let’s go. So we’re going to start out simple. And by simple, I don’t mean easy. What I really mean is using one predictor variable.

So what we’re going to have in our setup here is a sample of observations that are measured on two variables. We’re going to have our response variable, which is going to be the variable that we’re hoping to learn something about. And we’re going to have a predictor variable, which we’ll label x, which is going to be something that’s going to be used to make predictions about the response variable.

So we’re interested in modeling y, the variability in y and the mean of y, as a function of the predictor variable x. So we’re using simple in simple linear regression to refer to just one predictor variable. Ultimately, we’re going to be working with multiple predictors. And at that point, we’ll be analyzing multiple linear regression.

But first things first. Let’s get into linear regression modeling in the first place. So there is going to be an assumption about where the data come from. So we’re going to assume that the variable y, that’s the response variable, and the measurements that we’re measuring about the response variable are coming from a normal distribution with an unknown mean and an unknown variance sigma squared.

So we’ve been seeing this situation previously when we performed one-sample inference, when we wanted to make inferences about the unknown mean mu. And we did that through confidence intervals. And we did that through hypothesis tests. What we’re going to do– and this is the big change now– is we’re going to assume that the mean of that normal distribution isn’t just a single number, but can be written as a formula that involves the predictor variable x.

Specifically, we’re going to assume that the mean, the unknown mean mu of the response variable y, can be written as the sum of an unknown term beta 0 plus an unknown term beta 1 that’s multiplied by the predictor variable x. In other words, we’re going to assume that the mean of the normal distribution for y depends on knowing what the predictor variable is. And we’re going to be assuming for the rest of this discussion that you know what the x variable is. That’s handed to you. That’s something that’s typically known before you see the y variable.
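
Written out in symbols, the assumption just described is

$y \sim N(\mu, \sigma^2)$, with $\mu = \beta_0 + \beta_1 x$,

where $\beta_0$ and $\beta_1$ are unknown quantities that we will estimate from the data.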

So what is this linearity assumption again? We’ve been speaking about linearity assumptions in the context of measuring correlation coefficients between two variables. We’ve discussed linearity in just the relationships that we could assume between two variables. But let’s be a little more concrete here. Because now we actually have to look at the formula carefully.

What you may have seen in maybe your past lives is writing out the equations of lines. So you may remember that a common way to write the equation of a line is y equals mx plus c. And so the m here is going to be the slope of the line. It’s the change in the y value divided by the change in the x value. That’s the slope.

The c here is sometimes referred to as the y-intercept. That’s the value where the line crosses the place where x is equal to 0. So in this case, the y-intercept is equal to 2, and that’s the value of c. So we usually see the equation as y equals mx plus c. The way that we’re going to be seeing it in the context of linear regression is: mu is equal to beta 0 plus beta 1 times x.

And just to be really careful, let’s make sure that we can line up the correspondence between these two equations. We’re going to be thinking of y, the vertical value, as being mu, the value that’s going to be the mean. And we’re going to assume that the intercept, c, is basically playing the role of beta 0. And the slope, m, is going to be playing the role of beta 1.

So really, what’s going to happen here is that whenever we see a beta 0, we should be thinking of the y-intercept. Whenever we see beta 1, we should be thinking about the slope of a line. And that’s how we’re going to be using these two parameters, beta 0 and beta 1, as we proceed through these slides.

So let me just remind you about the comparison to one-sample inference that we saw in the last unit. So before, when we were just given one sample of data, we assumed that the data came from a normal distribution with an unknown mean and some variance that we just don’t know. And we then estimated the mean by the sample mean y bar.

But for least squares regression, for linear regression, what we want to do is not just estimate mu. We want to estimate beta 0, the intercept, and beta 1, the slope. So we’re in a situation now where instead of just estimating one parameter that’s mainly of interest, mu, we want to estimate two parameters, beta 0 and beta 1.

So let’s talk a little bit about estimating the intercept and the slope. Well, what we’d like to do is be able to find a line that goes through the middle of the cloud of points, of the set of data points, in order to be able to get a good estimate of the intercept and the slope. So there is a particular criterion for deciding what is a good line to put through a set of data points that would be on a scatter plot.

What we’d like to be able to do is find the line, among all the lines we could possibly put through the set of data points, that minimizes the sum of squared vertical distances of the observed y values from the candidate line. And I’m going to be specific to show you how this works. But you may be wondering at this point, well, there are a lot of different lines that one could try. And so is the process actually trial and error?

Well, let’s see. So here is that set of data points of ice cream consumption as a function of the temperature on the given day. So let’s try a candidate line. Let’s try this one. And we can ask ourselves, is that a good line to go through the set of data points?

I think intuitively we’re willing to agree that that’s probably not a good line. Because the direction of the points seems to be going in an upward direction. And this candidate line that I’m considering to be the line that I’ll be using for linear regression seems to go in a downward trend and has a negative slope.

And what I can do in order to figure out whether this makes any sense is compute all of these vertical distances from the line to each individual point. And really what I’m trying to do is I’m trying to see if the sum of these squared distances from each of the points to the candidate line, when I sum up the squares of those distances, whether that’s a large number or not. What I’m trying to do is I’m trying to figure out a line that cuts through these points so that the sum of those distances is as small as possible.

So in the answer to whether this is a good line, I would argue no. Because generally speaking, the points are, on average, far away from the line in a vertical distance, especially if you consider summing the squares of those distances between each point and the line. The ones out here should particularly capture your attention as being particularly far away from the line. Let’s try another one. Let’s put this line in. This is another guess.

And so again, it begs the question, is this a good line? Well, this one’s better because it goes in a positive and upward direction. But it looks like a lot of the points here are below the line. And so these distances, when you sum up the squares, it’s better than what it was in the last one. But it still seems there’s room for improvement.

So in this case, the points are still a little far from the line. Well, let me get to the final line, what I consider to be the best line. It’s this one, which really does cut through the points. And it turns out that this line is the one that minimizes the sum of these squared vertical distances. You can’t do any better. You cannot find a line that has a smaller sum of squared vertical distances from each of these points to the line.
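
In symbols, the criterion just described is: among all candidate lines $b_0 + b_1 x$, the least squares line is the one that minimizes the sum of squared vertical distances

$\sum_{i=1}^{n} \big( y_i - (b_0 + b_1 x_i) \big)^2$

from the observed points to the line.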

So you might be thinking to yourself, well, does that mean I have to try every single possible line and then just figure out which one is the best? Well, the good news is that, using differential calculus, we can actually derive formulas that tell us exactly what that line should be. So we don’t actually need trial and error. We can actually write down the formulas. The formulas are a little involved, so it really doesn’t make a whole lot of sense to walk through them here. You can look them up in any intro statistics textbook. But we’re just going to let R do the computing for us.
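
For reference, the textbook formulas being alluded to are

$b_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1\,\bar{x},$

although, as noted, we’ll simply let R compute them for us.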

So I want to remind you of this discussion that we had in the statistical inference unit. There is the notion of having parameters which are quantities that are single values, but we don’t know what they are because they’re connected to the population, and then estimates that we end up deriving from a set of data. So we’re going to refer to the least squares estimates of these unknown quantities, beta 0 and beta 1, these values that are connected to the population. We’ll represent them by b0 and b1, which are the Roman letters that are the equivalent of the Greek version of beta 0 and beta 1.

So the estimated model is going to be represented as the observed value y that we would end up seeing as the point on the scatter plot. And that’s going to be equal to the estimated intercept b0 that we get from linear regression plus b1 times x– b1 being the estimated slope. And then this e here you can sort of think of as the error from having the line be the estimate of what the point value is in the vertical distance. e is going to be the vertical difference between the point and the line.

So that’s the difference between the y value and the estimated point in the line. So if we were to see a scatter plot that looked like this, and then here is the line that goes through them, this vertical distance here for that point is e. That’s the difference between the point that is on the scatter plot and then the least squares regression line.
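
In symbols, the estimated model is $y = b_0 + b_1 x + e$, so for each observed point the residual is $e = y - \hat{y} = y - (b_0 + b_1 x)$, the vertical distance just described.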

Let’s see what this turns out to be for ice cream consumption. We’ll see once we do the R computing how to actually get this formula. But here’s the results when you run this in R. So you end up getting this output from running least squares regression. And these are the estimated intercept and slope of the linear regression.

And specifically, I can pull off each of these values and I can write out the equation. I’m going to use y hat, y with a little hat on top of it, to represent the estimated y value for a given x value. In other words, you give me a temperature for the given day. And what I’ll do is I’ll replace that x value here. And then I’ll calculate this formula here, which is the estimated intercept plus the estimated slope times that temperature you gave me. And that’s going to be my estimated consumption of ice cream per head, in pints. So that’s what this y hat is. So here is the estimated line. And here’s the estimated equation. So this is the equation of the linear regression. It has an intercept of 0.207. It has a slope of 0.00311.

And all of these vertical distances are the error terms, the distances of the observed values from the fitted line. And those are all the ingredients. So in order to get the actual y value, I would end up adding the value that’s on the line plus the vertical difference to the actual point, which is that e. So let’s stop there and continue to the next segment.

Downloads: Simple Linear Regression.pptx

7.4 Interpretations and Predictions

All right, we’re on to the next segment, which is on interpretations and making predictions from linear models. So let’s go. Now that we have a way to decide what line cuts through the set of data, we’d like to be able to interpret our results to get some meaningful conclusions. So the goal is that we have the least squares regression estimates, and we’d like to be able to interpret the intercept and slope estimates as meaningful quantities rather than just viewing them as numbers. We’d also like to be able to make predictions using the regression formula.

And the main ingredient for answering these questions is the estimated regression equation. There are two ways that I might use the regression equation. The first way is to write it out as: the observed y value is equal to the estimated regression equation, which is the estimated intercept plus the estimated slope times the predictor value, plus that differential between the line and the observed value. The other way is to say that the predicted y value, y with a hat on it, is simply equal to the value that’s on the line. So, one of those two ways.

Let’s start off with interpreting the intercept. So the interpretation of the intercept, in some ways, is going to be a little disappointing, because, just to get to the punch line, the intercept is not going to be terribly important for the analyses that we’re going to be continuing through this unit. The specific interpretation of the intercept is: it’s the estimated mean value of the response when the predictor variable is equal to 0, which is really the definition of an intercept in the first place.

So just as an example, for the ice cream consumption model, we ended up getting that the estimated response value is equal to 0.207 plus 0.00311 times the temperature on the given day. Now, what this means for the intercept, that 0.207, is that, when the temperature is 0 degrees Fahrenheit, the estimated ice cream consumption per head per day is 0.207 pints, the value of the intercept. But it’s really not that important to know specifically what the ice cream consumption is going to be on a day when the temperature is 0 degrees, because typically we’re not interested in a predictor value of 0.

What really matters is the prediction for temperatures in ranges that we might be interested in. And we’d like to be able to choose the values that we’re interested in, not just simply assume that 0 is a value of interest. So for that reason, the intercept term is rarely of interest, except in unusual situations where we do want to find out what the predicted value is when the predictor variable, x, is equal to 0. It’s rarely of importance because the predictor value x equals 0 is not generally of special significance. In the ice cream consumption study, a temperature of 0 degrees Fahrenheit is not likely of interest.

So let’s really focus on the slope. And that’s where really all the action is. So here’s how you might interpret the slope in a regression, in a linear regression estimate. It’s the estimated change in the predicted value of the response corresponding to a 1 unit change in the predictor x. We’re going to get into that in a little more detail, what that means, because that’s a mouthful. But unlike intercepts, slopes are usually important to interpret in simple linear regression models.
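
To see why the slope has this interpretation, compare the predicted values at $x$ and at $x + 1$:

$\hat{y}(x+1) - \hat{y}(x) = \big(b_0 + b_1 (x+1)\big) - \big(b_0 + b_1 x\big) = b_1,$

so the prediction changes by exactly $b_1$, the estimated slope, for each one-unit increase in the predictor.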

So let’s actually interpret this in the context of the ice cream consumption model. We have that the estimated response is 0.207 plus 0.00311x. What this means is that, for each additional degree in temperature Fahrenheit, we can expect a 0.00311 increase in the estimated mean ice cream consumption per head per day. So as the temperature increases by 1 degree, the ice cream consumption per head per day increases by 0.00311. That’s a meaningful quantity.

So that’s how the slope is going to be interpreted. Let’s move over to predicted values. So given the value of a predictor, we can use the least squares estimated equation to form a predicted response value. So again, here’s the equation for the estimated response that we get once we estimate beta 0 with b0 and beta 1 with b1.

So as an example, suppose that we want to predict the estimated ice cream consumption per head per day for a temperature of 50 degrees. And we did this two segments ago. But now, we can actually see it in the formula. So we have the estimated regression equation. And what I’m going to do now is plug in x is equal to 50 into that equation. And so when I do that and I carry out the arithmetic, we end up getting that the value is 0.363. So that is the estimated ice cream consumption per head per day when the temperature is 50 degrees Fahrenheit.
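
Spelled out, that calculation is

$\hat{y} = 0.207 + 0.00311 \times 50 = 0.207 + 0.1555 \approx 0.363.$

(R’s more precise answer of 0.36223, shown later in this unit, differs slightly because the coefficients displayed here are rounded.)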

I want to go through another example just to illustrate how this all works. And this is an example on college dropout rates. So this is back from a data set in 1995 when the US News & World Report was actually making some of their college data available, freely available. And so this is a data set that contains 776 colleges that are measured on many variables. And we’re only going to focus on two of them here. What we want to know is how dropout rates at various colleges tend to vary by the average room and board costs that are measured, in this case, in thousands of dollars. So it might be helpful, as usual, to visualize what the data look like before starting to think about linear regression models.

So here is a scatter plot of the data. And you can see just by eye that there’s a downward trend: the higher the room and board cost, the lower the dropout rate. Now, we’re not making any causal conclusions here. We’re not saying that, if you end up increasing room and board, that’s going to cause a change in the dropout rate for a college. We’re just measuring an association here. We’re just seeing what the distribution of dropout rates is as a function of room and board costs and, again, in 1995 dollars.

So let’s actually see what the results of running linear regression turn out to be. So here is the output that we’re going to see once we’ve run the R commands. And we will see the R commands later in this unit. So what we are going to get out of this output is the estimate of the intercept and the estimate of the slope. So here are the estimates of the intercept and the slope. The intercept is estimated to be 63.5, basically. And the slope for the predictor variable room and board is approximately negative 6.63.

So we can write out the least squares regression equation as: the estimated response, which is the dropout rate for a school, is equal to 63.5, which is 63.5%, minus 6.63 times the room and board cost in thousands of dollars. What I wanted to point out here is that this is a situation where the slope of the line is estimated to be a negative number. So it suggests that the larger the room and board cost, the lower the dropout rate, because every time you increase the room and board cost, that makes the entire contribution of this term more negative. Here’s what the least squares regression line looks like when you overlay it onto the scatter plot. And indeed, it is a negative-slope line. So it does appear that we end up having a drop in the dropout rate for larger room and board costs across the 776 colleges used in this data set.

So here are some interpretations. What it suggests is that, for each additional $1,000 spent on room and board, the dropout rate of a college is lower by 6.63 percentage points, because 6.63 is the slope and the predictor variable is measured in thousands of dollars. So for each additional $1,000 that is applied to room and board, the dropout rate is lower by 6.63 percentage points.

If we wanted, say, to come up with a prediction of the dropout rate for a college with a room and board cost of $6,000, the way we can get that is directly from the least squares regression equation. Here it is: the estimated response is 63.49 minus 6.63 times the number of thousands of dollars it costs for room and board, which in this case is 6, for $6,000. When we do that calculation, 63.49 minus 6.63 times 6 is 63.49 minus 39.78, which is 23.71, the estimated dropout rate for a college that has room and board at $6,000 in 1995 dollars. So we’ll be continuing on with this to get into statistical inference issues in the next segment.

Downloads: Interpretations and Predictions.pptx

7.5 Multiple Linear Regression

This segment is going to introduce to you the concept of multiple linear regression. So let’s go. All right. So we’re going to be investigating many predictor variables simultaneously. We’ve so far covered a bunch of segments that have focused on just using one predictor variable to predict a response variable. So what we’re going to do now is recognize that, in many data scenarios, a response variable is not accompanied by only one predictor but possibly by many predictors.

So one possibility that we could consider before we launch into this material is: why not make our lives easy and just perform a separate simple linear regression model for each individual predictor? So we have a response variable.

And we can regress it on the first predictor variable, summarize the results, and do all of our analysis that way; then take the response variable, perform regression on the second predictor, and summarize our results that way, and just do it for each one, one at a time. But maybe we can do a little bit better by including all of the predictor variables simultaneously, so that we get an aggregation of strength of all of the predictor variables working together. So let’s see how this is going to work.

So here’s an example that we’ll be walking through. This is an example on home prices. So this is a data set collected on 894 homes that were sold recently in a single locale. And the interest here is going to focus on predicting the sale price of a home from five potential predictor variables. The predictor variables are going to be the living area of a home, the year that the home was built, the number of bedrooms, the number of bathrooms, and the number of garages of the home.

The reason you might want to do this kind of analysis is that maybe you’re interested in producing a website or an app that’s going to compete against a product like Zillow.com that ends up predicting home prices based on home characteristics as a way of being able to stay in front of the market.

So let’s perform a little bit of a summary of this data set just to see what the variables look like. So the variable of interest, the response variable, is the price. Here is the five-number summary of the prices, plus the sample mean, which you get as a bonus. And you can see that the prices go from basically a $90,000 home up to a $1.5 million home in this single locale. A typical home price is $400,000 to closer to $500,000, depending on whether you use the median or the mean. It seems like this is right-skewed data, since the mean is higher.

And in addition to this response variable, we also have five different predictor variables. Here are the distributions summarized for each of these predictor variables. The living area, which is measured in square feet, is anywhere between 572 square feet and 5,000 square feet. The number of bedrooms, which is a whole number, is anywhere between 1 and 8, with a typical value of around 4. The number of bathrooms can be anywhere from 1 up to 10 (families that need to go to the bathroom a lot, I think). The year the home was constructed runs from the 1700s all the way up through 2009, which tells you this is a recent data set. And then the number of garages can be anywhere between 0 and 6 in this particular data set.

There is a fair amount of variability in the response variable values and the predictor variable values. And it might be also helpful to look at the relationship of the price of a home with each of these individual predictor variables one at a time just to get a feel for what we can be expecting.

So here are five different scatterplots. On the vertical axis of each of these scatterplots is the price of the home. And the horizontal axis of each of these scatterplots is each of the different variables. So we have the living area over here, the number of bedrooms, number of bathrooms, number of garages, and then the year of construction.

And it’s not so easy to see what the relationships are entirely. It seems like there’s some positive relationship, which frankly you would expect between living area and the price of a home. The larger the living area, you might expect the price of the home to be a little larger. For the number of bedrooms, it’s not so clear. Maybe there’s a slight positive relationship. The number of bathrooms seems to be a little bit more clearly in a positive direction.

For the year of the home, it’s not so evident whether newer homes correspond to higher prices. And then for the number of garages, maybe there’s a very ever so slight positive relationship, but it’s not clear. So really, just by eye, the price is weakly associated with each of these variables separately. But it’s hard to really tell just from looking at it.

So what we’re going to do is develop a multiple linear regression method to be able to analyze the response variable as a function of the predictor variable simultaneously. So the idea here is that we’re going to model a response simultaneously from a set of predictors. And the idea is that we might be able to explain more of the variation in the response variable by including many predictor variables in the model, not just one at a time.

And so this whole approach, including many predictor variables, acknowledges that the relationships among the variables may be actually fairly complex and that simple linear regression isn’t really a powerful enough tool. So we might as well use a tool that is powerful enough to be able to take account of the simultaneity of the impact of a bunch of variables all at once.

So the best way to really learn multiple regression, I think, is to compare what’s going on between multiple regression and simple linear regression. So again, just to remind you, in simple linear regression, we ended up starting by saying that the response variable was normally distributed with some unknown mean. And we assumed that the mean was equal to some intercept plus an unknown slope times the predictor variable. In other words, we assumed that the mean of the response variable depends on the predictor variable x.

Now we’re going to be in a situation where we have many predictor variables. This situation actually just changes ever so slightly. Fortunately, it’s not going to be that complicated, because we’re going to start off with the same assumption that the response variable is still normally distributed with some unknown mean.

The way the situation changes with multiple linear regression is that now we’re going to say that the mean of the response variable depends on a whole bunch of different predictors with different coefficients that multiply each of those predictors. So what’s going on here is that, effectively, we still have an intercept term. This is beta 0.

But now each of those predictor variables ends up getting multiplied by a different coefficient. That’s what we’re going to call these guys. So these are slopes. But in the context of having multiple predictor variables, we’ll refer to them as unknown coefficients. So these coefficients are unknown. We’re hoping to estimate them through multiple least squares regression. And each one multiplies each of the separate predictor variables one at a time.
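
Written out, the assumed model for the mean becomes

$\mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p,$

where $x_1, \ldots, x_p$ are the $p$ predictor variables and $\beta_1, \ldots, \beta_p$ are their unknown coefficients, with $\beta_0$ still playing the role of the intercept.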

The way that we’re going to end up getting an estimated mean is by performing this arithmetic that adds the intercept, plus the coefficient of the first variable times the first variable, plus the coefficient of the second variable times the second variable, and so on, all the way up to the last one. And we’re just referring to the number of predictors as p. So there are some similarities and differences to understand between simple and multiple regression.

So the good news is that the method that we ended up using to estimate coefficients, that is the betas, is going to be through minimizing the sum of squared residuals. In other words, we’re going to measure the distance between the individual y values and then what ends up getting estimated through least squares regression.

I’m actually, on purpose, avoiding the word line. Because in multiple regression, it’s no longer a line. It turns out it’s a flat surface. In multiple dimensions, it’s actually hard to even visualize. Think of it as if, instead of a line, imagine a flat plane, like a flat piece of paper in space, much like this little prop I have here. So it’s almost like the regression is a plane that cuts in an angle through space. And that’s going to be the regression equation. That’s what we’re hoping to estimate when we perform these multiple least squares regression.

The goals are going to be similar for multiple least squares regression as they were for simple least squares regression. We’re going to perform statistical inference for the betas: we want to know what the estimated values are and what the interval estimates are. We also want to be able to perform hypothesis tests for the individual betas, which we’ll be discussing shortly.

We would also like to be able to make predictions of responses based on predictive variable values in order to make predictions for future values. Finally, we would just, like in simple linear regression, be able to assess the goodness of fit using this r squared statistic that we discussed in the last segment. The slight difference here is going to be interpreting the coefficients that come out of multiple linear regression. Let’s go back to that real estate example and see what kind of sense we can make out of it.

So we’re going to show you the R commands in the next segment. But let’s look at the output when we’ve run multiple linear regression on the real estate example. This is the output that we would end up getting. It looks a lot like the output that we got from simple linear regression, but now we have more lines in the output. And the extra lines just correspond to the different predictor variables that we’ve included in the multiple regression.

We still have a row for the intercept, but now we have separate rows for each of the individual five predictor variables that we’ve incorporated into the multiple linear regression. And in fact, the estimated values of the betas can just be picked off from that first column of estimates. So I can actually read the estimates off that column.

And if I wanted to write out the multiple linear regression, I could write it out as: the estimated response value, y hat, is equal to the least squares regression equation with those betas replaced by their estimates, b0, b1, b2, up to b5 in this case, because I have 5 predictor variables.

And I can plug in all of the different values from this column of numbers into the equation. So b0 here is approximately 320,394. That’s the estimated intercept. And then the estimated coefficient of living area is 93.12. That’s this one up here. Then it’s minus 68,307 times the number of bedrooms, and so on. I could just write out the rest of the terms in this equation, and this is what it looks like.

So let’s try to go ahead and interpret one of these coefficients. The interpretation is actually quite similar to what we worked with in simple linear regression, but there’s a slight twist. The coefficient of one of the predictors has the following interpretation: it’s the estimated change in the response corresponding to a one-unit change in the predictor, holding all of the other predictors constant, holding them fixed. In other words, keeping them at their current values and only changing the predictor variable of interest, the one multiplying the coefficient that we’re interested in interpreting.

To be really concrete, let’s actually look at an example. So if you go back to the output, you’ll see that the coefficient for bathrooms was 87,743.31. And that means that, for each additional bathroom that’s included in a home, the home price increases, on average, by $87,743.31.

That’s holding all the other predictors fixed. In other words, think about a home where you’re only allowed to change the number of bathrooms, but you keep the living area constant, the number of garages constant, the number of bedrooms constant, and the age of the home constant, and only conceptually change the number of bathrooms.

There is a connection here, though, that I want to make back to study design. It’s worth pointing out that we shouldn’t be interpreting these results as causal unless we’ve performed a good experiment, a randomized experiment. Here, the data were gathered by just observing a bunch of homes and recording the information; we didn’t perform a randomized study. So it’s not true that adding a new bathroom to an existing home can be expected to add roughly $90,000, on average, to its value.

And again, that’s because the data were obtained as a survey and not as a randomized experiment. I’m partly mentioning that because there’s a sort of surprising result in this output of the multiple linear regression. The surprising result is that the coefficients for bedrooms and for garages are negative, which seems kind of counterintuitive. Because what a negative coefficient for bedrooms seems to be saying is that, if you were to conceptually add a new bedroom to a home, that’s going to decrease, on average, the price of the home. That seems very counterintuitive.

And in fact, you could actually look at the correlation between the bedrooms and the price of a home, just those two variables, not accounting for the other variables, and you end up getting a positive correlation coefficient, roughly positive 0.11. Similarly, you can look at the correlation between the number of garages at homes and their prices.

And you also get, just looking at the correlation between those two variables, a positive correlation. So what’s interesting in this example is that the direction of the relationship reverses for these two particular variables when you actually account for the other variables in the model.

Maybe one way to think about it is, if you have a home of a fixed square area, and you start adding in bedrooms, then maybe the value of the place doesn’t go up. In fact, maybe it decreases, because you’re cutting into other areas of the home that might be better used for other purposes.

So at least there’s some intuition for making sense of why the least squares coefficient for the number of bedrooms is negative, even though the simple pairwise correlation between the number of bedrooms and the price of a home is actually positive.

That said, we can still construct confidence intervals and perform hypothesis tests for these coefficients. So here is a set of confidence intervals that we can generate through R for these coefficients. This is the output that we get in R. And here’s an example: say, the coefficient for bathrooms. We’re getting a 95% confidence interval that goes from positive 61,117.64 up to 114,368.79.

And so this means that we’re 95% confident that whatever the average price increase of a home is for an extra bathroom in the population, we can estimate that it’s somewhere between 61,000 and change up to 114,000 and change holding the other variables fixed at their current values. So this gives us an interpretation for what is the extra impact of an additional bathroom with all the other variables fixed and remembering that we’re not making a causal conclusion here. It’s just an association not a causal conclusion.

In addition to computing confidence intervals for the coefficients, what I can also do is perform hypothesis tests for each of these betas. So, for example, suppose I wanted to perform a hypothesis test of the beta for living area, which I’ll refer to as beta 1: the null hypothesis is that beta 1 equals 0, and the alternative hypothesis is that beta 1 is not equal to 0. I could perform that test by simply looking at the value in the p-value column.

In this case, it’s going to be this value, which, again, is extremely low, because it’s 1.65 times 10 to the minus 11. So the interpretation is, I can look at all the different p-values here, and I can actually pick off the ones that have p-values less than some conventional value, like 0.05, and say that those are the significant predictors of the price of a home. So living area is a statistically significant predictor of home price at the 0.05 level, relative to the model with all other predictors included.

And I would end up drawing the same conclusions for the coefficient of bedroom and the coefficient of bathroom. And helpfully, R actually gives you a couple stars that are next to the different variables that are statistically significant. In this case, it’s giving you three stars when the p-value is between 0 and 0.001, which is very, very low.

So suppose I’m interested now in prediction. I want a prediction interval for a home. So suppose I’m going to consider a home that has the following features: the living area is 3,000 square feet, it has three bedrooms, two bathrooms, it was constructed in 1940, and it has one garage. I can use R commands to construct a prediction interval for that home’s price, and I get output that looks like this.

So this is going to be the best estimate of the prediction. In fact, that’s the value that you would get by writing out the equation of the multiple least squares prediction equation and plugging in all of the x values, plugging in all the predictor values. But R does the extra work of telling you what is the lower endpoint and upper endpoint of a 95% prediction interval, which is here. And what I’m going to get out of this analysis is that the point prediction is this $531,656 for this particular home with the characteristics on the top of the slide.
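
As a rough sketch, the R commands being described might look like the following, where the predictor column names are hypothetical placeholders (the actual names in the data file aren’t shown here) and RE.LM is the fitted multiple regression object introduced in the next segment:

    # Hypothetical column names; adjust to match the columns in the real data frame,
    # and RE.LM is assumed to be the fitted multiple regression model
    new.home <- data.frame(living.area = 3000, bedrooms = 3, bathrooms = 2,
                           year = 1940, garages = 1)
    predict(RE.LM, newdata = new.home, interval = "prediction")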

I’m also 95% confident that the price is going to be somewhere between roughly $17,656 and $1,045,656. So your reaction ought to be, at this point, that’s a pretty wide interval. I mean, what’s that really telling me? Well, in some ways, what it’s telling you is that it’s an indication that the model is not particularly predictive, that we’re not really learning a whole lot about the price of a home from these five variables, at least in this location where we ended up getting the home prices and using this kind of least squares regression approach. In fact, we can actually compute an R-squared statistic that comes as a result of fitting this model, which we’ll see when we actually run the R commands. It comes out to be 18.53%. The interpretation is the same as it was for simple linear regression: specifically, 18.53% of the variation in home prices is explained by the inclusion of these five different predictors. So that’s not that much. That’s not really a lot of variation being explained. So the home prices are pretty weakly explained by the recorded home characteristics.

So this is a good chance to try out a multiple least squares regression, but we’re not really getting a whole lot of predictive power out of this particular model, because the predictor variables are not terribly predictive. I want to give some final thoughts, though, about multiple linear regression.

So multiple linear regression really can be understood as a basic but still powerful tool to predict a response from a set of predictors. And the power is that it incorporates all of the predictors simultaneously in measuring their contribution to predicting the response variable. And sometimes the results can be counterintuitive, as we have seen. In that case, usually, further work is required to explore why the data are producing such results. And that’s why you have your data science team to start investigating these kinds of issues. But at least you’re now alert to why these kinds of issues might occur.

And so in the next segment, we’ll actually see some R code to actually run these analyses and interpret the output.

Downloads: Multiple Linear Regression.pptx

7.6 Fitting Least-Squares Regression in R, Example Case Study

We’re now going to look through some of the basic commands for running linear regression in R. So what I’ve done here is I’ve summarized all the commands that we’ve been using throughout the unit in this R script, which is in the upper left panel, here. And the commands are going to run once I click Run– up here– down below.

It’s probably worth, since it’s going to be relevant, clicking on the Files tab here to show you some of the files that are in the current folder. The R script for this unit is, in fact, this set of commands in the Script pane. And you’ll see that I have these two CSV files– icecream.CSV and realestatesample.CSV– which are going to be the datasets that I’ll be using. So in fact, if I wanted to actually view the file, I could click icecream.CSV and I would see the contents of this CSV file.

We’re not going to be using the file in its raw form. We’ll be reading it into R, so we can just proceed by reading it into R using this command right here. So we’re going to create this data frame, icecream, which is going to be read.csv of icecream.CSV.

So we run that, and now, we can summarize this data frame. And you’ll see that this data frame contains four different variables, and we’re only going to be focusing on two of them– this first one, cons, which is consumption. And then, the last one, temp, which is the temperature.

So for starters, let’s plot consumption versus temperature. So using this command here, for plot of ice cream dollar sign temp– which is going to be the horizontal value– and ice cream dollar sign cons for consumption, will be the vertical value. And then, here is the X label. Here’s the Y label. I’ll run that.

And then, in the Plots tab down here, it shows me the scatterplot of ice cream consumption versus temperature.
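
Putting those steps together, here’s a sketch of the commands just described; the axis label text is a placeholder, since the exact labels aren’t spelled out in the transcript:

    # Read in the ice cream data, summarize it, and plot consumption versus temperature
    icecream <- read.csv("icecream.CSV")
    summary(icecream)
    plot(icecream$temp, icecream$cons,
         xlab = "Temperature (degrees F)", ylab = "Consumption (pints per head)")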

So now, what I’m going to do is finally, run the least squares regression– the linear regression. The way to do that in R is on the right side of the Equal sign, I have the LM command, which stands for linear models.

The way that the linear models command works is the first argument of the linear models command has a formula. The formula for the linear models command is going to be the response variable is on the left side. Then, you use this twiddle– or tilde– on your keyboard. And then, on the right side is– in the case of simple linear regression– the predictor variable.

So this first argument for LM, this formula is essentially, that the response variable, cons, is modeled as a function of temperature. And then, the second argument is to give it the data frame, and I save the data frame in ice cream so I say data equals ice cream.

Finally, I’m going to assign the results of fitting this linear model to this new object, which I’m calling icecream.LM. So once I run this command– which I’m about to do in a moment– it’s going to assign all the information that comes from running a linear model of consumption as a function of temperature into this R object called icecream.LM.

So let me do that. Pretty anticlimactic because it just basically, runs that command down here, without any indication that it did anything. But what I can do now is now that I’ve created this linear model, I can summarize the results of the linear model by typing summary of that object that I created. So let me do that. Click Run.
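
In command form, that’s:

    # Fit the simple linear regression of consumption on temperature, then summarize it
    icecream.LM <- lm(cons ~ temp, data = icecream)
    summary(icecream.LM)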

And so now, what it’s giving me is a bunch more stuff than we saw on the lecture slides. The part that we’re most interested in focusing on is this: the coefficients, their estimated standard errors, and particularly, their P-values.

We also have, as part of this output, the R-squared statistic, which we can also use, as well.

If I want to get the confidence intervals for the coefficients, I can do it by hand using the formula for a confidence interval, but there’s a much easier way, which is to use the confint command. So once again, I have this icecream.LM object, which is the result of fitting a linear model, and now, I type confint of icecream.LM, and I run this command. And that’s going to give me the 95% confidence intervals in the first case for the intercept, but usually, I don’t care about that.

But it’s also giving me the 95% confidence interval for the slope of this regression line, so this tells me that, with 95% confidence, the slope is between 0.00213 and 0.00409, roughly. If I wanted to extract the coefficients from the model, I could do that. Usually, I’m not terribly interested in performing that operation by itself, but if I wanted to, that’s one of the extractor functions I could use for a linear model. So I could type coef of icecream.LM– that I have right here– and I click Run, and it’s just going to return the estimated intercept and the estimated slope.
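
In command form:

    confint(icecream.LM)   # 95% confidence intervals for the intercept and slope
    coef(icecream.LM)      # just the estimated intercept and slope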

Let me move down.

Now, suppose that I wanted to make predictions. Let’s say I wanted to make a prediction for the consumption when the temperature is 50 degrees outside. So the first thing I’ll do is create a new data frame that has only the temperature being equal to 50 as its only information. So I have one observation and one variable, which is temperature equals 50.

So I’m going to run this command, and that’s going to create this data frame which is called new obs. I’ll actually type that just so you can see what it looks like– new.obs. And I’m going to hit Enter. And so, that’s the content of new.obs.

Now, if I want to make a prediction of the consumption based on the temperature equals 50, here’s how I do it. I use the PREDICT command. The first argument to the PREDICT command is the fitted linear models object, which was that icecream.LM. The second argument is the new data that I just created. So the new data here is new.obs.

What it’s expecting is a data frame that contains at least one observation and contains the variables that are on this side of the formula– on the right side of the formula. So I have to have that temp in there in order for PREDICT to do its job.
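
A sketch of the prediction commands being described:

    new.obs <- data.frame(temp = 50)
    predict(icecream.LM, newdata = new.obs)                            # point prediction
    predict(icecream.LM, newdata = new.obs, interval = "prediction")   # adds the 95% prediction interval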

So I type predict of icecream.LM and then, new data is equal to new.obs. Hit Run. And what it gives me is that the consumption is expected to be 0.36223 pints per person, per unit time. Now, if I wanted to construct a prediction interval, it’s the exact same command, except this time, I give it a third argument, which is interval equals– and then, in quotes– prediction. And that’s going to give me the following output.

So I’m going to put the cursor at the beginning of the line and click Run. And at the bottom here, what it gives me is a vector of three values. The first value is the estimated prediction, and this is the same as what I had before when I didn’t ask for an interval.

But now, it’s also going to give me the 95% prediction interval for the consumption when the temperature is 50. So the way to understand this is that we’re 95% sure that, when the temperature is 50 degrees out, the consumption is somewhere between 0.274 and 0.45. Let’s move on to the next data set, where we end up working with a real estate example. So here’s the real estate example. I’m going to create this data frame, RE, which I’m going to get from reading in the data file, realestatesample.CSV, which is sitting in the folder. So I’ll click Run on that.

And what I can do is if I wanted to just summarize that data frame– which I usually do automatically after reading in a data frame– I’ll just type that in real quickly. And this is the summary of each individual variable in the data frame.

What I’m also going to do is show you something called a pairs plot, which is going to show you scatter plots of every variable against every other variable, in pairs. So the way to run the PAIRS command is to type pairs, and then the first argument is the data frame.

And now, I have this optional argument which is called PCH, which is the plotting character: P for plot, CH for character. And then, in quotes, that dot means: rather than plotting the points as the circles we’ve been seeing, just plot each point as a period, a small dot. This is going to be a very busy plot, so I don’t want to clutter it up with circles all over the place. I just want little dots.

So I click Run, and it plots all of the pairwise scatter plots and so we can see them plotted against each other. So this is a nice visual way to see what the relationship is among all the different variables against each other.
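As a sketch, that pairs plot command is just:

    pairs(RE, pch = ".")  # all pairwise scatter plots, drawn with small dots instead of circles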

Well, let’s jump right to multiple linear regression. So the command for running multiple linear regression is almost identical to what it was for simple linear regression, with just one small tweak. I’m going to run the command LM, the linear models function, as it was before. And now, the formula that I’m going to use is similar to before, but the only difference is that since I have multiple predictor variables, I want to put all the predictor variables on the right side of the little tilde, and I want them separated by plus signs.

So the way that I’m going to run this is have the response variable be on the left side, which is price. And then, on the right side, I’m going to have living area, plus bedrooms, plus bathrooms, plus year of construction, plus number of garages. And then, as before, the second argument is going to be what the data frame name is, and in this case, I called it RE, for real estate.

So this is the command for running a linear model. And then, as before, I’m going to set that into this new object, which I’m going to call RE.LM. So let me run this command so that this multiple linear regression is assigned to RE.LM.
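A sketch of that command, using hypothetical column names (check summary(RE) for the actual names in the data):

    RE.LM <- lm(Price ~ LivingArea + Bedrooms + Bathrooms + YearBuilt + Garages,
                data = RE)   # multiple linear regression of price on five predictors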

So once again, in an anti-climactic way, it just does the command and it returns the prompt, but now, let me summarize the fit of this linear regression model. And so I’m going to run summary of RE.LM and it gives me a whole bunch of output. It gives me a bunch of output that I’m not really going to pay all that much attention to.

The part that is most interesting for what we’ve been working on in the lecture notes is this set of output, because this is telling me what the linear regression coefficient estimates are in this column. It gives me the standard errors in this column. And then, finally, in the last column, it gives me the P-value for the inclusion of each of these variables. In other words, whether each variable is statistically significant beyond the effect of the other variables that are already in the model.

I also get, as a result of performing the linear regression model, this R-squared statistic, so I’m learning that 18.53% of the variation in the prices is explained by including these five different predictor variables in the model.
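If you want to pull the R-squared value out of the summary object directly, rather than reading it off the printed output, one possible sketch is:

    fit.summary <- summary(RE.LM)
    fit.summary$r.squared       # proportion of variation in price explained by the model
    fit.summary$adj.r.squared   # adjusted R-squared, penalized for the number of predictors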

We ended up showing earlier how to compute the correlation between two variables, so we’d seen these before in the earlier unit on descriptive statistics, but I’ll just run them anyway, because I did them in the slides.

So I can get the correlation between the number of bedrooms and the price of a home. I can find the correlation between the number of garages and the price of a home, and these are both positive correlations, which seem to fly in the face of the negative coefficients that we get with bedrooms and with the number of garages in the multiple linear regression model.
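Those two correlations, again with hypothetical column names, would be computed as:

    cor(RE$Bedrooms, RE$Price)  # correlation between bedrooms and price
    cor(RE$Garages, RE$Price)   # correlation between garages and price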

Finally, the last thing I wanted to show you is performing predictions and prediction intervals from multiple linear regression. Fortunately, it’s exactly the same process as with simple linear regression, namely, that I start off with a new observation that I’m going to create as a data frame. And since, in this case, I have several different predictor variables, I need to create a data frame of one observation that contains each of those different predictor variables set to a particular value.

So what I chose in this case, was to create a data frame where living area was 3,000 square feet. The number of bedrooms was three. The number of bathrooms was two. The year of construction was 1940. And there was one garage.

So let me run this command to assign the data frame to new.obs. So I make that assignment, and let’s see what that data frame looks like, just so we know what we’re working with. So there it is. Kind of boring. It’s just one row that contains these five different variable values.

And now, in order to come up with a prediction and a 95% prediction interval, I’m going to run the following command, which uses the PREDICT command again, the way I used it before. The first argument is the result of fitting the linear regression, which is this RE.LM. The second argument is going to be the new data frame that I’m going to be making the prediction on. Again, that’s the same as what I did for simple linear regression.

And then, finally, the third argument is just that interval is going to be, in quotes, prediction, so that will give me the prediction interval. So the command is exactly the same; it’s just that this time, I’m applying it to multiple linear regression.
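Sketching it out with the same hypothetical column names as before:

    new.obs <- data.frame(LivingArea = 3000, Bedrooms = 3, Bathrooms = 2,
                          YearBuilt = 1940, Garages = 1)        # one new home to predict for
    predict(RE.LM, newdata = new.obs, interval = "prediction")  # prediction plus 95% prediction interval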

So let’s run that command, and as before, I end up getting this vector of three values. The first value is the best estimate of the price for a home that has these characteristics that I put in. So this would say that my best guess at the price of a home with living area 3,000, et cetera, was $531,65## 60, if you want to be really precise about it.

And then, these second two numbers are the lower and upper endpoints of the 95% prediction interval for the price of a home with these characteristics. So I would say that I’m 95% certain that the price of a home with these kinds of characteristics is going to be somewhere between $17,65## 92 and $1,045,65## And as we discussed in the slides, this very wide interval is likely influenced by this model not being a particularly well-fitting linear regression model. So that’s pretty much all there is to it, so you can go ahead and start performing your own linear regression models in R.

7.7 Discussion of Pros and Cons of Using Least-Squares Regression

All right, we’re back, and we’re going to wrap up this unit, the first of the two units on linear regression. After this unit, you’re going to get another exposure to more linear regression, more fun topics. But before we do, let’s review what we learned in this unit.

So we first got some exposure to simple least-squares regression, simple linear regression, as well as multiple regression, where we incorporate more than one predictor variable. We then explored how to perform statistical inference for regression coefficients, meaning constructing confidence intervals as well as hypothesis tests for the different coefficients.

We also ended up using linear regression to make predictions based on predictor variable values, as well as constructing prediction intervals. We were also able to assess the goodness of fit of a linear regression, multiple or simple, through the R-squared statistic. And then finally, we saw how to perform least-squares linear regression ourselves, using R as the way to implement the procedure.

So we’re really at a point where we need to ask: this all seems great, being able to understand linear regression, but what are some of the challenges? What are some of the walls that we need to climb over in order to get these regression tools implemented properly? So what do you think?