Chapter 8 Linear Models II
8.1 Material
8.2 Introduction to Model Building
So in this new unit, we’re going to dive a little deeper into using linear regression as a tool for modeling quantitative data. The unit is titled “Introduction to Model Building,” and that’s essentially what we’re going to tackle.
We’re going to start off with a motivating example. We have data from glassdoor.com about the pay of their employees: a random sample of 1,000 employees that was made publicly available a couple of years ago. The following variables were measured. First are the response variables: we’re going to look at income, bonus pay, and the combination of the two, all in dollars. What else was measured? The gender of each employee (male or female), their age in years, the highest level of education attained (high school, college, master’s, or doctorate), performance on a 1 to 5 scale in whole numbers that we’re going to have to be careful in treating, seniority on a 1 to 5 scale, again whole numbers, department, which had five categories, two of which are operations and sales, and job title, which had 10 categories, data scientist and IT being two examples.
So let’s start off, like any other data exploration, by looking at some summary statistics. We just summarize the entire data set, which is easy to do in R, and we quickly get some idea of what the distributions of these variables look like. For job title, we see several of the different categories listed. We have more marketing and software engineers, and fewer and fewer as you go down the list; the “Other” row represents the remaining five categories combined. For the gender breakdown, a little more than half is male. For age, most individuals are centered around 40 years of age; the oldest individual is 65, and the youngest is nice and young at 18.
Performance, like I said, is a factor between 1 and 5, and it looks fairly symmetric around the value of 3. Education has four different categories, fairly equally split across the four groups. Department, again five groups, again fairly evenly split. Seniority is pretty symmetric around a median of 3, with a mean of 3 as well. Income in this random sample goes from $34,000 up to $180,000. The bonuses associated with those individuals don’t match up one to one, but the minimum bonus was a little less than $2,000, going up to as much as about $11,000, with a mean around $6,400. So that gives us some intuition as to what the data look like.
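As a rough sketch of that first pass in R (the data frame name glassdoor, the file name, and the column names are assumptions, not necessarily what the course files use):

```r
# A minimal sketch of the initial exploration, assuming the sample is in a
# data frame called glassdoor with columns income and bonus.
glassdoor <- read.csv("glassdoor.csv")        # hypothetical file name
summary(glassdoor)                            # quick look at every variable
glassdoor$total_pay <- glassdoor$income + glassdoor$bonus   # combined response
```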
Key in linear regression is to make sure you’re comfortable with the response variables. And in this case, we have multiple response variables we’re going to consider. All forms of pay here, so we have income and bonus. And then we’re going to add the two together and look at their sum. So let’s look at their distributions through their histograms. The histogram of income we said was centered just under $100,000. And that’s what we see in the histogram. How would you describe that distribution to me? Slightly right skewed ranging from $30,000 up to close to $200,000.
We can look at bonus alone. This distribution looks fairly symmetric around a value a little greater than $6,000. And what we see is a distribution that is essentially missing its tails. So it’s somewhat bell-shaped, but without the long tails you would expect if this were coming from a normal distribution. Add the two together and we see this distribution. Again, it’s still a little right skewed, and there are potentially some outliers here, large outliers. And it seems to be centered roughly around $100,000, maybe a little bit more.
What do we see? Well, this just summarizes what we said when we were looking at the histograms. Income is right skewed, centered just under $100,000. Bonus is symmetric, centered just under $7,000; it looks to be heavy-shouldered, again missing those tails we would expect in comparison to a normal distribution. Total pay, which is the sum of the previous two, is also a little right skewed and centered just above $100,000.
And income and total pay have a few high outliers that we might want to make sure we keep track of if we do data analysis further using this data set. But notice, none are very extreme. When we look at financial data like income, there is potential for one or two individuals to really be way out in the tail, $500,000 and $1 million, which we don’t see in our sample of data here.
Once you have a general, rough idea of what your variables look like and what your response variables’ distributions are, then you can start to consider answering some questions with a data set like this. So we’re going to consider building several different linear regression models to answer some questions you might be interested in. First off, we could just build a general model to try to figure out which of our predictors are important, that is, which ones are associated with and influence pay.
Is there potentially evidence of a gender-based pay gap, or possibly an age-based pay gap, at glassdoor.com? Is the effect of education different for various job titles? What do I mean? Well, maybe it’s really important to have a PhD in data science, but not so much in another job title, like IT, or in another department. And can a model be built to predict total pay from these predictors alone? There are many, many more questions that could be considered; these are just a few that came to my mind.
Let’s start off by visually inspecting these relationships to see if these questions hold up in the data. So let’s perform a little bit more data exploration looking at some bivariate plots, meaning the response, in this case, the total pay across different factors as the x variable. Let’s start off by looking at gender, male versus female. And we see, generally speaking, the distribution of males is slightly higher than the distribution of females, but not a lot. And the question is, is this evidence of pay gap? And we’re going to look into that based on a linear regression model. We can look at the relationship between total pay and age. And what we see here is a general linear trend. As age increases, employees tend to get paid more.
We can do some more exploration. Let’s see how the effect of a PhD, that is, the subgroup of PhDs, varies across different departments. Maybe a PhD gets paid really well in one department and less so in another. Once we subset our data to look at only the PhDs, we can quickly visualize department, since it’s categorical. It looks like across the five different departments, the distributions look fairly similar. However, it looks like management and sales tend to be slightly higher on average.
The two outliers show up in those two groups, and their general distributions are slightly higher than the others. Though, keep in mind, we noticed that the median for engineering does appear to be higher than the median for management; it just doesn’t have the upper tail. And possibly, administration and operations are lower on average. We want to attach statistical significance to what we see in these visual inspections.
One more thing to look at: we saw there was a slight difference in pay comparing men to women overall. But what if we dive a little deeper? What if we look within each of the 10 different job titles to see how men and women compare? In this graph, red represents the women within a job title and blue represents the men, and we can compare the two box plots in each pair directly.
So the first two box plots represent, for the red one, women data scientists and, for the blue one, men data scientists. And we can continue along: the next pair of box plots represents the drivers, red for women drivers, blue for men drivers, et cetera, for the 10 different job titles that we see. Looking across this, what we want to compare is each paired set of distributions: compare the red distribution directly to the blue to see if there’s any evidence of a gap between men and women. Across all of these pairs, the vast majority look fairly similar. There does appear to be possibly a difference in one category, for the sales associates. One other thing to notice, which doesn’t give us evidence of a pay gap: the pay distributions for the managers, both men and women, are much higher than for all the other job titles we’re considering. So this gives us some idea that, overall, the effect of gender appears much smaller after we control for job title, though it didn’t look all that large overall anyway.
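A sketch of how that grouped comparison might be drawn in base R, assuming columns named gender, jobTitle, and total_pay, and that the gender levels sort with women first:

```r
# Total pay split by job title, with women (red) and men (blue) side by side
# within each title; gender varies fastest in the interaction, so the colors
# alternate within each title.
boxplot(total_pay ~ gender + jobTitle, data = glassdoor,
        col = c("red", "blue"), las = 2, ylab = "Total pay ($)")
```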
So that was just a glimpse into a data set we’re going to be using for the rest of this unit. And we’re going to use that and other data to start to expand on different ideas based on expanding what we learned in the previous unit on linear regression. In this unit, we’re going to learn a few new types of predictor variables that are going to help us to model and predict the response variables from the predictors we have at hand. We’re going to consider categorical, quadratic, and interaction terms.
We’re going to be able to compare and choose between models. There are various approaches in order to do that. And we’re going to look at using linear regression to perform two major tasks within the context of model building. That is, one, to make sure we perform inference and come to sound conclusions. So making decisions is important. How do variables relate? Is there evidence of an association between a predictor and a response? And compare that and contrast that to a second task is making predictions as accurately as possible. They go hand-in-hand, but they are two very distinct approaches to using a regression model and interpreting it. We want to be able to use it to perform inferences, measure associations, and also be able to use it to make predictions on the response as close as possible.
Downloads: Introduction to Model Building.pptx
8.3 Binary and Categorical Predictors
So the first type of regression modeling consideration we’re going to look at is incorporating binary and categorical predictors into a regression model. So far, we’ve only looked at quantitative predictors, those x’s in our model. But linear regression is flexible enough to handle categorical predictors as well.
The simplest categorical predictor we’re going to use is that binary predictor we’ve seen a little bit already, the binary variable we’ve seen already, where it really just has two options. This is often called a regression with a binary predictor. Binary predictor is sometimes called an indicator variable.
So a binary predictor is an x variable, a predictor variable in a regression model, that takes on only the values 0 or 1. We can use the naming of the variable to be informative. So instead of naming a variable gender, to represent male versus female, call it female, and have it take on the value 0 for males and 1 for females.
We can then see if there is a difference in pay for men and women by fitting a linear regression model where the response variable is pay and the predictor variable is this binary predictor comparing men to women. So let’s set up the model.
Y is going to be the total pay in dollars. And X is going to represent this binary predictor, where it takes on the value 1 for women and takes on the value 0 for a man. Let’s see what it looks like in R. Here’s what the scatterplot looks like.
To start off with, the response variable, plotted on the y-axis, is total pay, both the bonus and the income combined. And the x-axis represents the two possible values for female. So over here, where female is 0, these are the men; and over here, where female is 1, these are the women.
And keep in mind, R is doing exactly what you’re asking it to do. It’s just fitting a best-fit, least-squares regression line to this scatterplot of points. In the end, what does that line do? Its height at x = 0 is the sample mean of total pay for the men, and its height at x = 1 is the sample mean for the women.
So notice, if men and women had the exact same mean, the same average, this line would be flat; we’d have slope 0. Since men are a little bit higher than women, the line shifts as that x variable changes from 0 (a man) to 1 (a woman). So that slope estimates exactly the difference in sample means, comparing women to men.
So let’s see what R spits out. Fitting this regression model in R gives us a table of coefficients. And what do we notice? It gives us our estimated regression line.
How do we interpret this? Well, the intercept here is still interpreted the same way as before. When x equals 0, that’s our estimate for the response.
Well, x being 0 is very important in this model. That is the representation of men. So it’s estimating the average total pay for men. This value for the slope is estimating the difference in average total pay, comparing women to men.
What do you notice? It’s negative. So on average, women are getting paid a little less, $8,500 less, in fact, which might be a little more extreme than “a little.” And what we notice is that the p-value associated with that coefficient is tiny, tiny, tiny.
So there is a lot of evidence to suggest that women are getting underpaid in comparison to their male counterparts. That doesn’t paint the whole picture. We’ll come back to investigating the effect of gender on pay with a multiple regression model, which will give us a more honest estimate.
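For reference, the simple binary-predictor fit described above might look like this in R (a sketch; the 0/1 coding and the column names are assumptions):

```r
# Code the binary predictor and fit the simple regression.
glassdoor$female <- ifelse(glassdoor$gender == "Female", 1, 0)
fit_gender <- lm(total_pay ~ female, data = glassdoor)
summary(fit_gender)   # intercept: mean pay for men; slope: women minus men
```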
For now, we’re going to expand this idea of using a categorical predictor to a predictor that has more than two categories. In our data set we had several; one of those variables is education.
Now, education, the way we’ve set it up in the data set, has four categories: high school, college, master’s, and PhD. And it’s ordered from the lowest level of attainment, high school, to the highest, PhD. It just represents the highest degree attained by an individual.
How can we incorporate this into our linear regression model? Well, there are two ways. We could define a variable that takes on the value 1 for employees whose highest degree is high school, 2 for employees who have a college degree and nothing higher, 3 for master’s, and 4 for PhD, and then incorporate that quantitative predictor into a regression model.
But the problem with that is it incorrectly assumes that the step, the linear effect, the linear association going for any one unit increase in that x variable is the same, whether you’re going from 1 to 2– high school to college, 2 to 3– college to masters, or 3 to 4– masters to PhD. And that might not be an assumption we want. So instead, we have to take a different approach.
That different approach is using lots of binary predictors– in this case, three binary predictors– to represent the four different groups. This will allow us to estimate the mean salaries for each of these groups separately, the mean pay. The data look like this.
Essentially, our original x variable is education. From that education variable we can create three separate binary predictors– college, masters, and PhD. Somebody is missing. Who’s left out? High school does not have a binary predictor associated with it. But that’s OK. Our model still can take care of high school individuals, employees with just a high school degree.
And what we see is that high school individuals, much like males in the binary predictor variable, have value 0 for all three of these predictors. So its value 0 for college, value 0 for masters, value 0 for PhD. And that makes sense, because somebody whose highest degree is a high school diploma doesn’t have a master’s, doesn’t have a college degree, doesn’t have a PhD.
So what we’re going to do is build a regression model in R incorporating our three x’s: college, master’s, and PhD, our three binary predictors. So we’ll define a regression model where the mean of the response variable, mu, is beta0 + beta1*x1 + beta2*x2 + beta3*x3, where the x’s are the predictors we defined before: x1 is the indicator for college, x2 for master’s, and x3 for PhD.
And high school’s left out. And that’s OK. That left-out group is what we call the reference group, because in the end our estimates for beta 1, beta 2, and beta 3 are going to be in reference to the high school group. And beta 0 is going to be estimating the mean of that reference group directly, the mean of the high school group in this model.
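Here is a sketch of building those three binary predictors by hand and fitting the model (the column name and category labels are assumptions):

```r
# Hand-built dummy variables for education; high school is the left-out
# reference group.
glassdoor$college <- ifelse(glassdoor$education == "College", 1, 0)
glassdoor$masters <- ifelse(glassdoor$education == "Masters", 1, 0)
glassdoor$phd     <- ifelse(glassdoor$education == "PhD", 1, 0)
fit_edu <- lm(total_pay ~ college + masters + phd, data = glassdoor)
summary(fit_edu)
```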
So let’s see how this works out in R. We fit the regression model in R. And we get the following output. Always, always first take a look at the estimates and their associated p values. What we notice here is we can write down the regression model, if we wanted to, to say that the intercept is about 95,000. And we see all the other effects, associations with the other predictor variables.
So let’s do some interpretations. The interpretation of the intercept: it’s the mean pay for those employees who only have a high school degree. And it’s the lowest one, 95,000, because all the other effects are positive; every other group of the four is higher in comparison to the high school group.
What’s the mean pay for an employee with a master’s degree? Well, the mean pay for an employee with a master’s degree is just going to be the sum of the intercept plus the effect of having a master’s, because a master’s individual is being estimated with the intercept plus the linear effect of having a master’s degree, which is a binary indicator. And someone with a master’s degree does not have a PhD. So that term drops out.
And somebody also doesn’t have a college degree– or has a college degree but that’s not their highest degree. That value drops out of the regression model as well. So in the end, we see an estimated mean salary of 104,000 for an individual with a master’s degree.
One thing to keep in mind is that we always want to keep track of the significance. And the significance here is just telling us, in reference to the high school group, which of these groups are significantly different in pay. We see the college group is not quite significant, while the other two are. And that makes sense, because the college group’s estimate is the closest to 0, that is, closest to the high school group. So I created those binary predictors by hand in R. But you don’t have to take so much effort in creating binary predictors; you can let R do that work for you instead.
Instead of creating manually those three predictors, we can just fit a model in R where the model statement says, let’s predict total pay based on education. R is going to create the binary predictors behind the scenes automatically for you. You just need to know how to interpret the output.
Keep in mind, when R is choosing the groups for you, it’s going to choose the reference group to be the first in alphabetical order. So in this case it’s going to be college, because college comes first alphabetically among our four different groups. You can override the default choice in R. And we’ll see how to do that when we get to the R live session.
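As a preview, here is a sketch of letting R build the dummies from a factor and of overriding the default reference group with relevel() (the category labels are assumptions):

```r
# Default: the reference group is the first factor level (alphabetical).
glassdoor$education <- factor(glassdoor$education)
fit_default <- lm(total_pay ~ education, data = glassdoor)   # reference = College

# Override: make high school the reference group instead.
glassdoor$education <- relevel(glassdoor$education, ref = "High School")
fit_hs_ref <- lm(total_pay ~ education, data = glassdoor)
```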
So, fitting the same general model in R for the same data set, we see that things change because we have a different reference group. The intercept estimate is now 98,673 because it represents the estimate for the group with a college degree. So here, R chooses college to be the reference group, and all of the beta estimates, beta 1 through beta 3, are in reference to that. So the effect of PhD, 7,965, is the difference in mean pay comparing PhD employees to college employees.
But the group means here are all estimated the same as before. Let’s go back to estimating it for the master’s degree group. The master’s degree group now is estimated by taking the intercept plus the new estimate for the effect of master’s, but in the end getting the exact same value we had for the estimated pay with an individual with a master’s degree. And that is $104,000.
Binary predictors can be combined, along with quantitative variables, to build a larger regression model. And we’ll see that later on in the chapter, and use that later on in this unit. A multiple regression model, like I just said, can combine both quantitative and categorical variables.
So let’s fit that here, where we’re going to try to predict the response, total pay, based on several variables at once: gender, education, and age. Gender is categorical and binary, coded 0/1. Education is categorical with four groups, entering the model as three binary predictors. And age is quantitative.
We can fit that regression model in R and get estimates for all of those beta coefficients. What do we notice here? The estimates change. Why? Remember, in a multiple regression model, the effect of, say, PhD is interpreted as the association with pay of having a PhD in comparison to the high school group, but now controlling for the other variables in the model: it’s controlling for age as a quantitative predictor and also controlling for gender.
Because female is in this model as well, we are controlling for its association with the response when we look at the effect of PhD. So the estimate changes slightly because of that difference in interpretation: we’re now adjusting for the other predictor variables in the model.
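A sketch of that combined fit (variable names assumed):

```r
# Multiple regression mixing a binary, a multi-category, and a quantitative
# predictor; each coefficient is now adjusted for the others.
fit_all <- lm(total_pay ~ female + education + age, data = glassdoor)
summary(fit_all)
```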
Downloads: Binary and Categorical Predictors.pptx
8.4 Quadratic Effects
So the second type of predictor variable we’ll introduce in this unit is looking at the quadratic effects of predictors on the response. The example we’re going to look at is predicting price of real estate based on the living area of that house. So this is the data we saw in the previous unit.
So first, to reiterate, let’s look at this relationship. We look at the scatterplot because price and living area are both quantitative. And we see that the association looks quite positive and fairly linear, though there is a lot of scatter around the predicted prices. If we were to fit a regression model, the line would probably look something like that: fairly linear, but definitely a positive association.
So if we were to actually fit this model and estimate the regression model in R, this is what we would get. We see that the coefficients are estimated here and the p-value’s estimated here. Of course, living area has a positive association with the response, selling price. And boy is that statistically significant.
So we interpret this as: a one square foot increase in living area, the predictor, is associated with a $114 increase in price, on average. Of course, remember the scientific notation in the output. If we suspect that this association is not linear in some fashion, we can try to incorporate that into our regression model by including extra predictor variables, by transforming the predictor variable in a very specific way.
So here, if we think the association accelerates, that is, the effect on price per square foot is higher than $114 when the living area is bigger and less than $114 when the living area is smaller, we can incorporate this accelerating trend using something called a quadratic model.
A quadratic model in linear regression is essentially a linear regression model that allows for a quadratic association of a predictor variable with the response; it allows this nonlinearity to be incorporated into the model. A quadratic effects regression model is a regression model that has one of its predictors squared.
Thus the mean of y can be modeled with predictor x1 by incorporating two forms of that predictor: mu = beta0 + beta1*x1 + beta2*x1^2. That is, we model the mean of y with an intercept, a linear association with x1, and a quadratic association with x1, which mathematically is just a linear association with the square of that predictor variable.
So here’s the model statement again. What are we measuring with these coefficients? Beta 2 measures the amount of nonlinearity in the relationship between the response and the predictor x1. If beta 2 is zero, that term drops out of the formula, leaving us with the standard linear model.
If beta 2 is non-zero, then there’s evidence that the relationship between y and x is not linear. We can perform a t-test to investigate, specifically that. We can look at a t-test to see if beta 2 is 0 or not, non-zero, to see if there is a nonlinear association between the predictor and the response.
Note, if we want to incorporate nonlinearities into a model, this idea of using a quadratic effect is a pretty basic approach. There are a lot of models that will incorporate nonlinearities much more carefully and rigorously.
So let’s go back to our example, where we’re predicting the selling price of homes based on the size of the home: the living area and the living area squared. We can incorporate those two predictors into a regression model in R, and here are the estimates for our three coefficients. The intercept and the linear term enter the model just as before, and there are two ways to incorporate the quadratic effect in R. One is to define a new variable that is the old variable squared. The other is to use the I() notation in the model formula to indicate that we want the variable living area squared. By incorporating that as an extra predictor in the model, I get all three estimates, and I can write down the regression model as such.
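Both ways of adding the quadratic term might look like this in R (a sketch; the housing data frame and the living.area column name are assumptions):

```r
# Option 1: wrap the squared term in I() inside the formula.
fit_quad <- lm(price ~ living.area + I(living.area^2), data = housing)

# Option 2: create the squared variable first, then include it directly.
housing$living.area.sq <- housing$living.area^2
fit_quad2 <- lm(price ~ living.area + living.area.sq, data = housing)

summary(fit_quad)   # same fit either way
```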
Of course, we’re going to look at the significance eventually as well. So the estimated regression model says that we’re going to predict y based on an intercept, a linear term, and a quadratic term in the predictor. How do we interpret this? The 471,000 is interpreted just like the intercept in a model with only a linear effect, because if x1 is 0, the second and third terms both drop out of the formula. We would be left with a predicted selling price of $471,000 for a home with no living area, which, of course, doesn’t make any sense, but we can interpret it as such.
What we see is that the sign of the linear term is negative. But be careful in interpreting that: that negative association doesn’t really apply on its own here, because there is a quadratic version of that predictor as well, with coefficient 408.3. So the combined contribution of x1 and x1 squared to the association between x1 and the response will end up being positive for most values of x1 we can actually observe.
OK. So let’s go back to that model and interpret the results. The quadratic effect is positive, and the trend is positive for all reasonable values of living area; look at that scatterplot, there was no negative association in there whatsoever. And mathematically, for any reasonable value of x1 that we could put into this model, any value actually in our data set, the combined effect of the negative linear coefficient (110.7 in magnitude) and the positive 408.3 on the squared variable gives us a positive trend.
We can test specifically whether this model with a quadratic term is better at predicting the response than the model with just the positive linear effect. And in fact, it is. The reason we know that is that for our estimate b2 of the quadratic term, we also get a p-value, and that p-value is quite small, statistically significant. So what do we conclude? We conclude that this model predicts price, the response, better than the model with the linear term alone.
All right. So now we are able to model quadratic associations with predictor variables. So why do we call regression linear regression? Well, the name actually appears misleading on the surface. It’s called linear regression because the model is linear in the coefficients, the betas. But the predictors can be transformed in any fashion you want. We saw here we were dealing with a quadratic effect, so we used x squared in the model. If we wanted to look at x on a log scale, we could do that as well. Thus linear regression is actually quite a powerful and flexible model that can incorporate associations that aren’t just linear, which is why it’s used so often in practice and is flexible enough for a lot of situations.
Downloads: Quadratic Effects.pptx
8.5 Model Diagnostics
So we’ve seen a lot of different potential predictors we can bring into a model. And once we have a reasonable model with the right predictors or good predictors involved, then we want to start doing some diagnostics on whether that model is a reasonable one from a different perspective. So let’s do a quick review of the linear regression model. Remember, we define the linear regression model as trying to describe how y relates to x through its mean based on some linear association with predictors, which can be in quadratic form, things of that nature, interaction terms.
We can rewrite this set of formulas as follows: y = beta0 + beta1*x1 + ... + epsilon. That is, the observations follow a linear relationship with the predictors, and around those associations there are errors, the true residuals, where the term epsilon represents the deviation from the true population regression line, which is defined by the betas.
And just like for y, we put a normal distribution on these residuals. They’re centered at zero rather than at the regression line, but are assumed to have the same variance that y has. So let’s look a little bit further. We define those residuals as being normally distributed; these are the assumptions we make.
Writing this model statement down, saying that the residuals are distributed normally, leads to basically linear regression’s four assumptions. They are, number one, the relationships with the included predictors are linear. Two, the residuals are normally distributed. Three, the residuals have constant spread. That is, the variance in this expression, in this normal distribution, is the same no matter where we are along the fitted values, the x values, anything like that.
And lastly, we have to assume that the residuals are independent. So the question is, how are we going to investigate these four assumptions about the true residuals around this theoretical population regression model? Well, you guessed it: we’re going to use the estimated residuals. Those estimated residuals are data driven. We estimate each one as the observed value of y minus what the model predicts it to be, e = y - y-hat, and that gives us our estimate of the error for every observation.
These hold the key to determine whether these four assumptions are reasonable for a specific linear regression model. So these four regression assumptions should be checked to make sure that the model we’re using is a valid one. We’re going to try to check these as much as we can, as often as we can, with actual plots.
So three of them– linearity, normally distributed residuals, and constant variance– can all be checked via plots that we’re going to definitely create and look at. Independence is a little trickier, because independence– there’s a lot of ways that independence could be violated. And there’s not one good plot, essentially, to investigate that. So we’ll look a little bit closer at that– or talk a little bit more about that in a few slides.
OK, so what graphics should we look at? For the simplest case, this is just some simulated data, what we’re looking at is here a response variable y and its association with x. And we fit a linear regression line to these data. Of course linear regression is not perfect here. You can see that linearity is most likely violated in this plot of y to x.
But remember, in the definition of the regression model we saw before, a lot of times it’s not just one simple x variable; there’s x1, x2, x3. So you can’t get a sense of nonlinearity just by looking at one variable alone. With that said, in this simple application, the two plots that we’re going to construct are built from the residuals: for example, each of these points has a vertical distance from the line, and that is its residual.
Estimate my residual, e. From that, what we’re going to do is create two different plots. One is the histogram of those estimated residuals. And that histogram we’re going to investigate for normality. And the other is this residual scatterplot. And what is this residual scatter plot? On the vertical axis are the residuals, the e’s, that are estimated from the data. And on the x-axis are the y hats, the fitted values that my model predicts for every observation.
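For any fitted model object fit in R, those two plots can be produced like this (a sketch):

```r
e <- resid(fit)                  # estimated residuals: observed y minus fitted y
hist(e, main = "Residuals")      # check for a roughly bell-shaped histogram
plot(fitted(fit), e,
     xlab = "Fitted values", ylab = "Residuals")   # residual scatterplot
abline(h = 0, lty = 2)           # curvature or fanning out signals trouble
```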
And what we clearly notice here is that we have non-linearity. So this is potentially a good plot to consider if we’re looking to investigate whether that assumption is a reasonable one. So let’s start there. Assumption number 1– linearity. What does it state? It states that the association between a response and the included predictors are linear in nature.
We’re going to look at that residual scatterplot to investigate it, where the y-axis is the residuals and the x-axis is the fitted values, the predicted values y hat. If there’s curvature in the plot, there’s a violation of linearity. How do we fix it? If there is an issue, we can incorporate more predictors or transform our predictors, for example x squared or log x, or even consider a predictor that isn’t in the model yet at all.
So let’s look at a couple examples. Here is your classic linearity is satisfied, assumption is satisfied. What we see is the general direction of the residuals. No matter where we are along the x-axis, it seems to be centered at zero, centered at that horizontal line at zero. So we say no clear curvature or irregular pattern. Linearity looks like a pretty good assumption.
Well, we can clearly violate that in the second graph where, again, we’re looking at the residuals on the y-axis and the fitted values on the x-axis. And there’s a clear nonlinear association between the residuals and the fitted values. So that’s telling us our linearity assumption is violated. What can we do? We can potentially incorporate new x variables like x squared or something like that– log x or new predictors not even considered yet– into our model to try to improve things.
So here, a classic U shape, opening either up or down, in the residuals is a clear violation of linearity. All right, let’s talk about the second assumption, normality. It states that the residuals, the estimated residuals, should be normally distributed in a regression model. How are we going to check this? We’re going to look at two things: the histogram and the box plot of those residuals. These will show clear violations of normality.
If there’s skewness or there are extreme outliers in the plots, that’s a potential problem. How do we fix it? We consider transforming the y variable if this assumption fails. So let’s check our normality assumption. Here is a prototypical case where normality is fine. Notice it’s not perfectly normal, but it’s roughly symmetric, and that’s really the key.
As long as we have symmetry and a general bell shape, then it doesn’t have to be perfectly normal. Our assumptions are still well-satisfied. Histogram of residuals is reasonably bell shaped. Normality looks fine. We can violate that moderately in the second histogram we see here. This distribution is right skewed.
Again, this is a distribution of residuals. And that right skewness will cause some issues in interpreting the p-values from a regression output. So: classic right skew, and the assumption of normality is clearly violated. We can go a little bit further. Box plots are useful beyond just a histogram, because they are helpful in flagging potential outliers, remember, based on that 1.5 times IQR rule we saw before in unit 2.
Here is the box plot for the histogram we saw on the previous slide. We have a few very slight outliers in both the high and low directions, which is not all that surprising; even if things were purely normally distributed, you would expect to see a few flagged outliers if you had enough data points. So we say, OK, it shows a couple of very mild outliers, but nothing to really cause concern.
In the second one, this is the box plot for the second histogram we saw on the previous slide. And what we see now is a pretty moderate outlier. That moderate outlier could cause issues, depending on exactly where it lies, in estimating our beta coefficients. So if you see a box plot like this for your residuals, you should look a little closer into your data to see what data point that outlier represents.
Assumption number 3, constant variance. Constant variance is basically just saying that the vertical spread around the regression line has the same amount no matter where you are along the regression line in the x direction. We’re going to use that scatter plot of residuals versus predicted in order to investigate this. And what we’re going to look for, keying in, is for fanning out in the vertical direction. And I’ll tell you what we mean by that.
So how do we fix this? Again, we’re going to consider transforming the response variable. Or we might have to fit a completely different type of model. So let’s look at those residual scatter plots. Classic– everything’s working OK. We see here that what we’re looking for is that vertical spread of those residuals, hopefully on both sides of the x-axis and in the middle of the x-axis, that vertical spread is the same. And that’s exactly what we notice.
The general vertical spread is just about equal everywhere you go; nothing varies enough to really cause concern. Well, we can clearly violate that if we get the right plot, and that’s what we see here. Over on the left-hand side of the x-axis, the vertical spread is really small, whereas over on the right-hand side, the vertical spread is much larger. And that is a clear violation of the constant variance assumption.
That is our classic fanning out, where the vertical direction– the spread in the vertical direction is much smaller on the left than it is on the right. The last assumption that we want to discuss is this assumption of independence. What does that mean? It states that one observation is completely unrelated to another one in our data set. There’s not a single plot to investigate this, but instead, you really should think about the data generating process, where the data came from.
We talked a lot about that in unit 3. If the data come from a random sample or a randomized experiment, then our assumption of independence should be pretty reasonable. If two observations are more similar when they are closer in time or closer in location, time specifically if you have time series data, like a stock price that is bouncing around from day to day, then there’s a clear violation of independence.
Another situation where that’s violated is where you have clusters of data, where you have some of the same family members collected in your data set. They’re more likely to be similar than two observations that are from two different, distinct families. How do you fix this? Well, you’re going to have to use a different type of model that can incorporate those specific types of dependencies.
So for the time series situation, there are a lot of models that can handle those time series data points, which we will not investigate in this class. All right, so let’s dig into an actual example data set to investigate these diagnostic plots. Let’s go back to the selling price of homes data set, where we’re trying to predict the selling price of these homes based on living area and its quadratic effect, the number of bedrooms, the number of bathrooms, the year built, and the garage size, in terms of the number of cars it can hold, as predictors.
What do we see from the R output? Here’s our estimated regression model. We got an estimate associated with each and every predictor. We know how to interpret these estimates. And each estimate has its own p value. And in order to trust those p values as actually the probabilities that we want to compare to the 0.05 level, to come to the right conclusion, we’re relying on those assumptions underlying this regression.
So we should assess whether those assumptions are valid. Where do we start? We look at the residual histogram and box plot; here we’re looking for normality. Our residuals are clearly a little bit right skewed: the tail on the right is longer than the tail on the left. That shows up in the box plot as well, which also shows lots of high outliers, and those are a cause for concern; they have a lot of potential leverage on the estimates of our slope coefficients, our beta coefficients. So the histogram and box plot clearly illustrate that the residuals are right skewed with potential outliers, definitely not normally distributed. So what do you do? Well, you may have to incorporate more predictor variables into the model, or consider a log transformation of the response variable, or some other transformation, to make this assumption more reasonable.
What else can we look at? The residual scatter plot, of course. Remember, y-axis is the residuals. X-axis are the predicted values. And what do we see? We see the classic fanning out, where a lot of the points are close over on the left hand side of the plot, and a lot more spread out on the right hand side of the plot. What does this indicate? It indicates that the constant variance assumption is violated.
Linearity seems OK, though. I don’t see any clear curvature in this plot. It seems to run pretty consistently through the middle of the plot no matter where you are along the x-axis. So it looks like the linearity assumption is the least of our worries.
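Putting that diagnostic workflow together for the housing model might look like this (a sketch; the column names are assumptions):

```r
# Fit the model, then inspect the residuals for normality, outliers,
# linearity, and constant variance.
fit_house <- lm(price ~ living.area + I(living.area^2) + bedrooms +
                  bathrooms + year.built + garage, data = housing)
e <- resid(fit_house)
hist(e)                                    # right skew?
boxplot(e)                                 # high outliers?
plot(fitted(fit_house), e); abline(h = 0)  # fanning out? curvature?
```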
Downloads: Model Diagnostics.pptx
8.6 Comparing Models
So we’ve looked at lots of different predictor variables that could be attempted to be included into a regression model. We’ve looked at diagnosing whether or not a regression model is worthwhile or not through its assumptions. But what if we have two reasonably good regression models that we want to compare? That’s what this segment investigates. Looking at the real estate data– remember we’re trying to predict selling price of homes based on lots of predictors– we could consider the following two models– oversimplified ones, but good for illustrative purposes.
Let’s call the first one model A. Model A is going to predict the mean of the response variable based on just the number of beds in the house alone. And we can look at model B, which incorporates three different predictors: not only the number of beds, but also the number of baths and the living area of the home.
What’s the main difference between the interpretations of beta 1 in these two models? Well, remember, when beds is in a regression by itself, a simple linear regression, beta 1 represents the associated change in the response variable for a 1 unit change in beds.
Under the model B where there’s two other predictors, the interpretation of beta 1 now is for a 1 unit change in the number of beds. It’s the predicted change in the response variable controlling for the number of baths and the living area of the space. So essentially what it’s saying is if you increase the number of beds but hold the number of baths and the floor space, the living area, in the house the same, constant, then you might see a different predicted change in the selling price of a home. It’s controlling for those other two factors in the model. What do these models result in when we fit them in R? Here are the estimated models from R’s output.
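The two competing fits would look something like this in R (a sketch; the column names are assumptions):

```r
# Model A: bedrooms alone.  Model B: bedrooms plus bathrooms and living area.
modelA <- lm(price ~ bedrooms, data = housing)
modelB <- lm(price ~ bedrooms + bathrooms + living.area, data = housing)
summary(modelA)
summary(modelB)   # watch how the bedrooms coefficient changes
```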
The simple regression model says that the association with bedrooms is positive: about an extra $40,000 in the selling price of a home is associated with increasing the number of bedrooms by 1, and that association is quite strong. We can look at the regression model with the three predictors and highlight, again, the association with bedrooms and, oh boy, what happened? We now have a negative association, which again is extremely significant.
It’s switched signs: uncontrolled for the other predictors it was positive, and now, controlled for the two other predictors in the model, it’s negative. Why did that happen? Well, there are a couple of reasons. If I’m controlling for the number of bathrooms and the living area in a house, it’s hard to change the number of bedrooms on its own.
Naturally, when we’re comparing two homes, if we increase the number of bedrooms, we’re probably also going to increase the number of bathrooms and the overall living area of the house. Big houses tend to have more bedrooms and more bathrooms. This interconnectedness between our predictors, this correlation between our predictors, is sometimes called multicollinearity, and it is an issue that makes the interpretation of these coefficients cumbersome in the context of a multiple regression model.
So we have two different models with different estimates, especially for the effect of beds on price. Which model is correct? Well, they’re both kind of correct. First off, there’s no such thing as a single correct regression model; we only have useful ones. And both of these models are definitely useful.
Model A suggests adding a bedroom to a house is generally associated with a higher selling price, while model B suggests that if you’re really not allowing the living area or the number of bathrooms to go up along with adding a bedroom to a house, you might actually end up having a lower selling price or it’s associated with a lower selling price. But both of these models provide some insight to how all of these predictor variables relate to the response.
So let’s compare two models more formally. We can talk about the measure of strength of models through the measurement R-squared. R-squared is great, but it has its limitations. Remember the definition– the proportion of variability in the response that’s accounted for by the predictors.
And its limitations, the biggest limitation, is that it’s a good measure to describe the sample at hand, but it has some issues in underestimating the error or overestimating the amount of predictive ability of a model if we’re actually looking at future observations that actually weren’t sampled. So if we’re going to try to predict selling prices of homes and I use the R-squared from the model for the data that I actually had, it will probably be lower for future homes than it is for the data at hand. This causes a problem.
Another issue is that when you bring a useless variable into a regression model, R-squared cannot go down; it can only stay the same or increase. So you can artificially inflate your R-squared by incorporating every single predictor you’ve measured, throwing in the kitchen sink.
So there are other measures that are better comparison tools for comparing two competing models to see which might be better at predicting observations in the future. One of those approaches is a formal one called the F-test, which allows us to compare what are called nested models.
It basically allows us to determine whether added variables provide extra predictive power: when variables are added to a model, it lets us test whether those extra variables are important.
But the key is that these two models must be nested. Two nested models basically means that the model with fewer variables is completely encapsulated in the model with more variables: the larger model has all the predictor variables in the smaller model, plus a few extra.
The F-test is essentially testing whether incorporating these extra variables has increased R-squared by more than we would expect by chance alone, more than if we were to throw in variables completely unassociated with the response.
Technically, the F-test is based on something called the analysis of variance, which you may have heard of from another class, sometimes abbreviated as ANOVA. It’s basically a way to decompose the variability of the response in a regression into the part that can be explained, R-squared, and the part that cannot, 1 minus R-squared.
Formally, the hypothesis being tested in F-test are the null hypothesis is that for those extra variables we’re trying to incorporate into the model, into the larger model, that none of them are important. All of the betas are equal to 0. The alternative is that at least one of those extra betas provides predictive ability. One of those betas is not 0.
We calculate an F-statistic, R will do it for us, and large F-statistics are in favor of the alternative; we’d expect to see F-statistics near 1 if the null hypothesis is true. For every F-statistic, R will calculate a p-value for us, and that’s the measure we’ll rely on to make our conclusions. Just like before, a small p-value, typically less than 0.05, is treated as a significant result.
So let’s formally compare model A and model B in R using this F-test approach. Remember, model A predicts price from bedrooms, and model B predicts price from three predictors: bedrooms, bathrooms, and living area. And what we see is that the F-statistic is calculated to be 93, very large, and the p-value is calculated to be something really, really small.
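In R, the nested-model F-test is one call to anova() on the two fitted models (a sketch):

```r
anova(modelA, modelB)   # reports the F-statistic and its p-value
```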
What can we conclude? The F-test, with a really small p-value, shows us that we can reject the null hypothesis and then conclude that at least one of those two added variables, bathroom and living area, or possibly both, do provide some sort of added predictive ability in that response variable of price.
The F-test works great when the two competing models are nested. What if you have two models that are not nested? We need a different approach. And that’s where AIC comes in, named for the statistician Akaike; his information criterion is often used to determine which of two competing models is better at predicting the response variable in the future.
We don’t provide the mathematics here, but it’s basically based on a measure of the error in a model, described by 1 minus R-squared, and it essentially inflates that value based on the number of predictors: the more predictors in the model, the bigger the inflation factor applied to this measure. So the upshot is that we want a smaller AIC. The lower the AIC, the stronger that model is.
So let’s look at an example. We need a model that is not nested with the others, so we’re going to consider a third model for the mean selling price of homes. That’s what we’re looking at here: predicting the mean selling price of homes based on four predictors, really just two, each with a linear and a quadratic term: living area and living area squared, and bathrooms and bathrooms squared.
Notice, models A and B both had bedrooms as a predictor, which is not in this model, and neither of them had the quadratic terms. So there is no nested structure here: model C is not nested within model A or model B, and neither of those is nested within model C.
So let’s go and calculate AIC using R as our tool to do the calculation– really easy to do. And automatically, once you have the models defined in R, you can calculate the AIC with just one function call. And that AIC is what we see here. And notice, we can immediately throw out one of these models. Remember, we want the smallest AIC possible. Model A has a much higher AIC than the other two, about 150 points higher, and models B and C are very similar in their AIC measure.
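A sketch of that comparison, adding the non-nested model C:

```r
# Model C: linear and quadratic terms for living area and bathrooms.
modelC <- lm(price ~ living.area + I(living.area^2) +
               bathrooms + I(bathrooms^2), data = housing)
AIC(modelA, modelB, modelC)   # smaller AIC is better
```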
Formally, what we can say is that model B has the lowest AIC, model A has the highest AIC, and model C is somewhere in between, pretty close to model B. If we were to select one of these three models as the best prediction model, I would certainly choose model B: it has the lowest AIC, and the other model that’s similar to it, model C, has more terms.
So typically, as a statistician, for prediction models, what I’m hoping to find is I’m trying to find the most parsimonious model possible, the model that has the fewest predictors that still does a really good job of prediction. AIC typically gets you there.
Note: df in this output just represents the number of predictors in the model plus 2. The number of predictors plus 1 accounts for the intercept term, and the second extra degree of freedom is there because we also have to estimate the variance around the line.
Final thoughts about AIC– keep in mind, AIC is not perfect. We use it because it’s a useful measure to depict how predictive a model will be for future observations, for observations out of sample, out of what we actually use to estimate the model. It’s not perfect and it’s an approximation of some really deep mathematical theory on the use of information.
There are a lot of other measures you could consider. There’s something called BIC; there’s something called Mallows’ Cp. This has been investigated a lot by statisticians. But AIC is the most used criterion for deciding between models, and that’s what we suggest. The others have their own merit, but we’re going to use AIC throughout this course.
Downloads: Comparing Models.pptx
8.7 Automatic Model Selection
So in the last segment, we learned how to compare different models through the F-Test and AIC. And in this segment, we’re going to expand that to use AIC particularly as an automatic tool to choose between, potentially, lots of different models.
And that’s often what happens in practice. You want to build a best predictive model. And based on a lot of potential predictors, you want to choose which model potentially could be best at predicting future observations.
But keep in mind, this can be a very time consuming approach. You could have hundreds or even thousands of predictors to predict a variable that has many, many observations.
Well, that doesn’t even include the potential for quadratic effects, interaction terms, things of that nature. So this can quickly become a very burdensome task, if you’re trying to compare every model by hand in R.
Well, there is an automatic approach to doing that. But first, we need to define a few things, namely, what it means to have a best predictive model. In linear regression, that essentially reflects the idea of trying to predict the response variable for data that haven’t been collected yet. This is what we often call out-of-sample prediction.
Models that have too many predictors will tend to perform really well with the data at hand, but often don’t do quite nearly as well for future observations, not seen, not used, to estimate the model.
This idea of being fit too closely to the data at hand, and therefore not predicting as well for future observations, is called overfitting: fitting too many predictor variables to the data at hand, giving them too much leverage, and treating them as much more predictive than they actually are for future observations.
And we want to prevent that. So how are we going to go about doing it? So we’re going to want to choose predictor variables that are strong– have strong predictive power– but remove those predictors that are poor.
To do this, we can scan lots of potential models based on various subsets of the predictors and choose the best one among all those we considered. We’re going to use AIC as our measure to make this determination. We said it’s not perfect, but it does a reasonably good job.
We’re going to call this process stepwise regression. This is the approach we suggest taking to build a best predictive model from lots of potential predictors that you want to consider.
One of these stepwise regression approaches is called backward stepwise regression. And this is where you start with the largest model possible, bringing in all the predictors that you are considering.
And from there, you slowly remove one variable at a time, one predictor at a time, until all the remaining predictors provide the explanatory power that you want. Let's talk about the algorithm a little bit more precisely.
You’re going to start off with that full regression model that has all predictors within it. You’re going to consider all models with one fewer predictor. From x1 to xn, remove one at a time. And measure AIC for each.
For each model, calculate AIC. And then you’re going to step down to the model with one fewer predictor that has the lowest AIC of all those potential one-predictor fewer models. You’re only going to step down, of course, if that improves AIC. And then once you step down to a model with one fewer predictor, that’s your starting case. And you’re going to repeat steps two through four.
Consider all models with, again, one fewer predictor. Calculate AIC, and step down again if AIC can still be improved, one variable at a time.
The algorithm sounds a little complicated and hard to follow, but an example makes this pretty clear. So let’s use that real estate data we’ve been talking about to predict selling price from various different predictors.
The predictors we’re going to consider here, for simplification, we’re only going to consider living area, square feet, number of bedrooms, number of bathrooms, year built, and the size of the garage, in terms of number of cars it holds.
We could consider lots of other predictors, but just to keep things simple, we're only going to consider these. This approach of stepwise model selection is really easy to implement in R. It's a one-line command. And we'll see how easy that is.
So this is how it works. You fit the full model and then you run the stepwise command on that full model and R will step one step at a time until it doesn’t need to remove any more variables, because the best model with the lowest AIC has already been measured.
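As a concrete illustration, here's a minimal sketch of that one-line approach, assuming the real estate data frame is called re and the column names match the predictors listed above.
[Code R]
# Fit the full model, then let step() walk backward using AIC (column names assumed)
lm.full <- lm(price ~ living.area + bedrooms + bathrooms + year + garage, data = re)
step(lm.full, direction = "backward")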
Here’s what it looks like one step at a time. So when you perform the stepwise procedure, you start off with the full model. The full model here, remember, had five predictors in it. Those five predictors were living area, number of bedrooms, number of bathrooms, the year it was built, and the size of the garage.
The starting AIC for this model with all predictors is about 22308. And what we can say is, all right, if we were to remove year from that full model, the resulting AIC would be 22306. If we were to remove garage, the resulting AIC would be 22307, et cetera. And what we see is this none row means that the model currently has an AIC of about 22308.
And then if we were to remove bedrooms, bathrooms, or living area, notice AIC would go up. So we could potentially remove either year or garage and improve AIC, and year is the variable that improves the AIC the most.
So then we step down to a model that has four predictors. It has year removed and has everything else still in it. That next step, then, has an AIC of about 22305.5. That's to predict price from four predictors.
And with these four predictors in the model, we can consider removing, again, one at a time. And garage, if we were to remove garage, we still see an improvement in AIC. So that’s exactly what we’re going to do.
We’re going to step once again, removing garage, refit the model. Now with the three remaining predictors. And we consider removing one at a time to see if AIC improves.
It looks like, though, that the model with the lowest AIC that we could potentially consider is the one we currently have. If we were to remove just bedrooms, remove just bathrooms, or remove just living area, AIC would worsen.
So the stepwise approach says, stop here. Our best predictive model, based on using AIC as a criterion, is this one with the three predictors in it.
So that is our result. Our best model, through this AIC stepwise procedure, says the best predictive model is the one with living area, bedrooms, and bathrooms in it as predictors.
This search is certainly not exhaustive. We didn’t even consider all possible combinations of predictors in our model, or consider predictors outside of those that we considered.
And there could be a model within our subset of predictors at hand that actually has a lower AIC. But this model is certainly a good one. Note that the model with the lowest AIC is not guaranteed to be the best predictive model anyway. But it's likely to be pretty close.
So we know how to perform stepwise regression, but the question is, when should we perform this stepwise approach to model selection in regression? And really, there are two situations in which you want to do it.
The one that we did just now is where you want to build a final best predictive model from a long list of potential predictors. Just like the one we performed here. Only five predictors isn’t that many, but for simplifying purposes, that’s what we started off with here.
The other situation is a little bit more subtle, in which you'll use stepwise regression. And that is where there is a single variable of interest that you care about, and you want to really measure its association.
Let's go back to the pay data set, where we're looking at, is there an association of gender with pay at Glassdoor.com? What we can do is build a regression model to control for all of the other predictors first.
Do a model selection– stepwise model selection approach– considering all other predictors. Once we have a best prediction model from those other predictors, now take gender and bring it into the model. And see if its association still holds up.
It's a good way to try to control for all potential confounders, but still have a nice parsimonious model that's not overfit to the other predictors. A couple of words of caution when doing stepwise. First, when you perform an automatic stepwise regression, by all means report your model estimates, and report an R-squared.
But you should not report the p-values from the regression output table like you would otherwise. Why? Well, the reason is a pretty subtle issue. We've considered lots of models, so these p-values lose their usual interpretation. If we look at enough models, something is going to show up as statistically significant just by chance alone.
So we should not interpret the p-values as we typically do. This idea of artificially finding variables that appear significant and interpreting their p-values anyway is related to p-hacking, where we search through potentially lots of predictors and only report those that are significant. It's a good way to find associations that don't hold up in future observations. It's a form of overfitting.
Downloads: Automatic Model Selection.pptx
8.8 Model Building Example in R
So in this segment, we’re going to go look at some R code examples for implementing everything we’ve seen in this unit. And that’s starting with just exploring the data set a little bit.
So the two data sets we’re going to use throughout, the Glassdoor pay data set, and the real estate selling price data set. We’re going to start with the Glassdoor data set. We’re just going to read it in. We’re going to define, remember, we had a total pay, the sum of income and bonus within Glassdoor.
And then what we’re going to do is we’re just going to summarize the data set in total, which gives us the summary statistics categoricals, it gives us the breakdowns for each category, for quantitative variables. It gives us the five-number summary, along with the mean.
And then we can visualize. So there’s various histograms that we calculated, we created in the notes. We’ll just look at the histogram of total pay. There it is. It’s right-skewed, like we saw.
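As a rough sketch of that setup, assuming the file is named glassdoor.csv and the pay columns are called income and bonus (names not confirmed in the notes):
[Code R]
# Read the data, build total pay, and summarize (file and column names assumed)
glassdoor <- read.csv("glassdoor.csv")
glassdoor$totalpay <- glassdoor$income + glassdoor$bonus
summary(glassdoor)        # counts for categorical variables, five-number summaries for quantitative ones
hist(glassdoor$totalpay)  # right-skewed, as described above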
And then we started investigating several relationships with different variables. The one I want to draw your attention to is this one down here, where we’re going to actually look at just the PhDs.
So we create a new data set called phd dot data, which is a subset of the Glassdoor data set, where we’re only pulling off the rows in which the education is defined as PhD. So we create this subset of data, phd dot data, and then we use that when we create our box plot.
So we create the subset of phd dot data. We can look at its dimensions. 238 observations, 238 people have a PhD. And that agrees with our summary statistics here in our output, that there should be 238 people.
So that’s great, we probably did that correctly. And then we can look at the box plot of pay split across departments, only considering those PhD individuals. And that’s what we see here.
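A minimal sketch of that subsetting step, assuming the education and department columns are named education and dept:
[Code R]
# Keep only the PhD rows, check the dimensions, and box plot pay by department
phd.data <- glassdoor[glassdoor$education == "PhD", ]
dim(phd.data)                               # expect 238 rows
boxplot(totalpay ~ dept, data = phd.data)   # pay split across departments, PhDs only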
And the last one, we can actually split a box plot. In this case, again, total pay, splitting it across the categories for gender and the categories for job title. So we’ll get 20 separate box plots. And if we color them red and then blue, what it will do is every female will be colored in red for their box plot and every male group will be colored in blue. And if we run this, we’ll get the exact same plot we got when we did it in the slides.
Noticing that every pair is comparing women to men data scientists, women to men managers for the high one, just as a reminder.
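Here's one hedged way to get that 20-group plot, assuming gender and job title columns named gender and jobtitle; with gender varying fastest, recycling the two colors paints the female boxes red and the male boxes blue:
[Code R]
# Box plots of total pay for every gender-by-job-title combination (column names assumed)
boxplot(totalpay ~ gender + jobtitle, data = glassdoor,
        col = c("red", "blue"), las = 2)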
From there, we went to categorical predictors. We first looked at whether or not pay was different for men and women overall. So what we did was create a binary variable for female, either automatically or manually.
And then we looked at the plot for this binary predictor. We fit the regression line first, and then added the regression line to that plot. And this is what we see.
We see that decreasing slope because women on average were paid less. And we can look at the summary statistics from that regression model. And we see that, yes, women on average get paid about $8,500 less. And that is an extremely significant result.
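A sketch of that binary-predictor fit, assuming gender is stored as "Female"/"Male":
[Code R]
# Build the female indicator, fit the model, and overlay the fitted line
glassdoor$female <- ifelse(glassdoor$gender == "Female", 1, 0)
fit1 <- lm(totalpay ~ female, data = glassdoor)
plot(totalpay ~ female, data = glassdoor)
abline(fit1)      # decreasing slope: women paid less on average
summary(fit1)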
Then we moved on to a categorical predictor that had more than two categories. We looked at trying to predict, in this case, total pay from various different indicators from education. Because education had four categories, we automatically created, or manually created, three different predictors, called them college, master’s and PhD.
And then manually brought those binary predictors into a regression model. And we can look at the summary of that regression model all at once. And this agrees with the output that we saw in the slides.
We then ran that exact same model, except we parameterized it differently, with a different reference group, using what R uses as the default. If I run the next model, predicting total pay from education, R sets up the binary predictors slightly differently.
It chooses college as the reference group, because it’s the first one alphabetically. But you can force R to choose the reference level, the reference variable, very easily.
So the next line of code is going to redefine the variable education, in this case re-leveled so that its reference group is forced to be high school. I can just run this line, and it saves the Glassdoor education variable with a new reference level.
And then refit the model– I'm now calling it 2C– and look at the summary. And look, it spits out the parameterization we originally wanted, with high school as the baseline. So high school is now the reference group.
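Here's a minimal sketch of that re-leveling, assuming the factor level is labeled "High School":
[Code R]
# Force high school to be the reference category, then refit (level label assumed)
glassdoor$education <- relevel(factor(glassdoor$education), ref = "High School")
lm2c <- lm(totalpay ~ education, data = glassdoor)
summary(lm2c)     # coefficients now compare each group to high school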
And the last thing we did was we fit many predictors all at once in a regression model. And that’s what lm3 is doing. From there, we went to quadratic predictors.
And in order to fit a quadratic model, we're going to go over to the real estate data set. So we load that data set into R. Then after we load it, we fit the linear model and summarize it.
Then we fit the quadratic model. And this is what we're going to look at a little bit: we're going to predict price from living area as a linear effect, and then also incorporate the quadratic effect.
We use notation in the model formula that defines a new variable on the fly, the square of living area. And we can fit that model with the quadratic effect and look at the summary.
And just, again, recreating the calculations we did in the R output. And just as a reminder, that quadratic effect was quite significant and positive. And we saw, essentially, an accelerating effect of the change in price as living area increases.
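A sketch of those two fits, assuming the data frame is called re with columns price and living.area; re2 is the name used for the quadratic model in these notes, while re1 is a hypothetical name for the linear fit:
[Code R]
# Linear fit, then a quadratic fit using I() to square living area inside the formula
re1 <- lm(price ~ living.area, data = re)
re2 <- lm(price ~ living.area + I(living.area^2), data = re)
summary(re2)      # the quadratic term was significant and positive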
To visualize this, we're going to have to plot the curve from that quadratic model. And the way we do that takes a little bit of manual manipulation. So we're going to start by building the scatterplot.
We're then going to define a dummy x variable, which takes on values from 5 to 5,000, representing essentially the whole range of values that living area can take on.
So that's just going to be sort of our new data set where we're doing our predictions. We're going to add the standard regression line as a dashed line on that plot.
And then we're going to add the predicted values, which are going to come from the quadratic model. Remember, we called the model re2. And to do a prediction, you say predict the model I want.
And if you want to do it at new observations, you say new data gets a data frame where the variable that is used for prediction– in this case, just living area– both in linear and quadratic form, is taking on the values x, which we defined to be that dummy variable.
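A sketch of those plotting steps, reusing re, re1, and re2 from above:
[Code R]
# Scatterplot, dashed linear fit, then the quadratic curve via predict() on a grid
plot(price ~ living.area, data = re)
x <- seq(5, 5000, by = 5)                 # dummy grid spanning living area
abline(re1, lty = 2)                      # standard regression line, dashed
yhat <- predict(re2, newdata = data.frame(living.area = x))
lines(x, yhat)                            # fitted quadratic relationship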
And we can plot that quadratic regression model as a curve. Exactly like we expected to see, the relationship accelerates as living area increases toward the upper tail. And notice, it actually is estimated to dip down a little bit at the very left-hand side of this graph.
And that's probably just an artifact of using a very simplistic way to show an accelerating relationship. And there are very few data points over here, way on the left, below 1,000 square feet in living area.
So I don't really trust that the quadratic relationship truly has a negative slope at very low values of x. From there, we then went to interaction terms. For interaction terms in R, we were trying to predict the bonus pay in that Glassdoor data set based on age and female.
We fit a model with those as both just main effects, ignoring the interaction. And this should be the same results we saw before. And then what we did was we fit a model with the interaction term.
And it’s pretty easy to do in R. The way I do it– there’s a lot of ways to do it, but the way I do it is I run the model to predict the response. In this case, bonus, being predicted by age and female.
And in R, instead of just adding the variables in the model statement of a multiple regression, if you want to include their interaction effect, you can say age times female.
So I can run this model. And it’s going to include all three of those predictors, age, female, and the interaction between the two. And that’s exactly what we see.
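A sketch of those two fits, reusing the glassdoor data frame and female indicator from above; fit6 is the name used for the interaction model in these notes, while fit5 is a hypothetical name for the main-effects model:
[Code R]
# Main effects only, then main effects plus the age-by-female interaction
fit5 <- lm(bonus ~ age + female, data = glassdoor)
fit6 <- lm(bonus ~ age * female, data = glassdoor)   # expands to age + female + age:female
summary(fit6)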
Another way you could fit that model and get the exact same results is you could say, all right, for fit6, I want age and female. And I want all interactions between those two variables.
And the reason you might want to do it this way is you might want to include a third variable. So there was education in here. So now we’re going to get all of the interactions between age and female, age and education, and female and education.
Then if I ran this model– this is going to be model 6b, fit6b, we’ll see that now we have age, we have female, we have the main effects for the binaries for education.
And then we have all the two-way interactions. Age with female, age with education, and female with education. And, boy, is this model a bear to try to interpret. But we could go through that work if it was really important to us.
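One hedged way to write that all-two-way-interactions model (fit6b in the discussion above), assuming the same column names:
[Code R]
# Main effects for age, female, and education, plus every two-way interaction among them
fit6b <- lm(bonus ~ (age + female + education)^2, data = glassdoor)
summary(fit6b)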
I’m just showing you how to fit these models in R in case you ever want to do this on your own. And then we plotted a few things here. This is just showing that interaction effect for men and women, plotting them as separate colors on a scatterplot.
And then fitting– pulling off the correct coefficients for men, pulling off the correct coefficients for women, define the two different regression lines, when we looked at that interaction effect.
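A rough sketch of that interaction plot, pulling the gender-specific lines out of fit6; the coefficient names shown assume female is a numeric 0/1 indicator:
[Code R]
# Color points by gender, then draw the two fitted lines implied by the interaction model
plot(bonus ~ age, data = glassdoor,
     col = ifelse(glassdoor$female == 1, "red", "blue"))
b <- coef(fit6)
abline(a = b["(Intercept)"], b = b["age"], col = "blue")                                  # men
abline(a = b["(Intercept)"] + b["female"], b = b["age"] + b["age:female"], col = "red")   # women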
The next set of units, we went into diagnostics. We’re going to go back to the real estate data set. And we’re going to fit a regression model that has predicting selling price from living area, the quadratic term for a living area, the number of bedrooms, the number of bathrooms, the year the house was built, and the garage space.
That’s the model we’re considering. Whoops– sorry about that– run the whole line, not just part of it.
And then we can do the summary of that regression model. This should hopefully match what we had in the notes. We can look at the histogram of the residuals. Just like we saw in the notes, that was right-skewed.
So the assumption of normality is probably not a valid one. We can look at the box plot of the residuals. So in order to pull off the residuals, you take your model variable, your model object which I called lm1, and I’m saying, from my lm1 object, pull off dollar sign, the residuals.
It's one of the components that is saved when you run a linear model in R. And the box plot here then illustrates, again, what we saw in the notes. There are some high outliers, which is not surprising, matching up with that right-skewed distribution.
To build the scatterplot of residuals versus fitted values, you just say my y variable is the residuals, and I'm going to plot that against my x variable, which is the fitted values coming from that linear model.
And we’re just going to plot that based on a scatterplot. And then I always, just for visual purposes, add that horizontal line at 0, because that’s essentially what I’m in reference to, both for looking for nonlinearities, which seem to be OK here.
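A sketch of those diagnostics, assuming the re column names used earlier:
[Code R]
# Fit the model, then look at the residuals three ways
lm1 <- lm(price ~ living.area + I(living.area^2) + bedrooms + bathrooms + year + garage,
          data = re)
hist(lm1$residuals)                      # right-skewed residuals
boxplot(lm1$residuals)                   # several high outliers
plot(lm1$fitted.values, lm1$residuals)   # residuals versus fitted values
abline(h = 0)                            # horizontal reference line at zero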
We seem to be fine with the linear assumption. But we do see that the spread is a little bit less to the left than it is to the right. A pretty clear moderate violation of constant variance. From diagnostics, we went into ways to compare models.
And in that subunit, what we did was we looked at predicting price for the real estate data, from just bedrooms. We looked at a model that included two more predictors, bathrooms and living area. And we wanted to compare them directly.
Since bedrooms is the smaller model, and it is incorporated completely in the larger model along with two extra new terms, we have nested models in this case. So it’s OK to run an F-test. And that F-test to determine if model B is doing a better job of predicting is just run through the anova command. Remember, the F-test is essentially analyzing R-squared, which is a measure of the variability. That’s where the term ANOVA comes from.
And it’s just a simple command, anova, smaller model comma larger model. And you get exactly what we were hoping for, the F statistic and its associated p-value.
And of course, like we saw already, we see a very significant result that bathrooms and living area as the two added terms, degrees of freedom 2, provide more predictive ability in this larger model.
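A minimal sketch of that nested comparison, with the same assumed column names:
[Code R]
# Smaller model, larger model, then the F-test via anova()
lmA <- lm(price ~ bedrooms, data = re)
lmB <- lm(price ~ bedrooms + bathrooms + living.area, data = re)
anova(lmA, lmB)   # F statistic and p-value for the two added predictors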
We then went to a third model that was not nested with either one of those, model C, which had, as predictors, living area, living area squared, bathrooms, and bathrooms squared. We can run that model and look at the summary statistics.
And then consider all three models we looked at, calculate their AICs through the AIC function. And we see that the AICs match what we had in the notes. AIC for the model A was definitely the worst. Remember, smaller AIC, the better.
And it looks like model B is slightly better than model C, and model B has fewer predictors than model C anyway. So we would prefer that to prevent overfitting.
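And a sketch of the non-nested comparison by AIC, continuing from the models above:
[Code R]
# Model C adds quadratic terms; compare all three fits by AIC (smaller is better)
lmC <- lm(price ~ living.area + I(living.area^2) + bathrooms + I(bathrooms^2), data = re)
AIC(lmA)
AIC(lmB)
AIC(lmC)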
And the last thing we looked at was through model selection. Now I give some notation here that you may not have seen before. But to fit– remember, our backwards model selection, stepwise model selection approach, we start off with a largest model that includes all possible predictors that we want to consider.
So the way we can do that in R, to save us some effort in typing out all the variables, is when we fit our linear model with the lm command, we can say, let's predict my response variable, tilde dot. And what that notation refers to is the fact that I want to use all of the other variables in the data set, re, as predictors to predict price.
So if there is one variable that you wanted to leave out, or one variable that isn't actually meant to be a predictor, you should not use this approach. You could instead define a new data set that only has the response and the predictors, and then take this approach.
But if we run this command, and then we look at the results, the summary, of lm dot full, what we see is that output of all the other predictors in that data set, re.
So let's just look at the header of re. I'm trying to predict price from living area, bedrooms, bathrooms, year, and garage. And those were the predictors used in this full regression model. And then, like I said when we saw this in the lecture, in order to perform a stepwise model selection procedure, you can simply say lm dot step– in this case, save the results of a step model where I'm starting off with the full model and going backwards.
There is a forward approach, there is a way to go forwards and backwards at the same time. But we learned just the backwards approach in the step command. And that’s what we’re going to use here.
So run that. It automatically spits out the three steps: the starting initialization step; stepping once, which, of course, removed year, the worst variable in that largest model; and stepping again, where garage could be removed to improve AIC.
And then the last step is the model with bedrooms, bathrooms, and living areas as the predictors. And I saved it as a new object called lm.step, so then I can just run the summary of lm.step to get the output that I’m used to for a regression model in R.
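Here's a minimal sketch of that shorthand, assuming the data frame re contains only price and the five predictors:
[Code R]
# The dot means "all other columns in re"; then step backward using AIC
lm.full <- lm(price ~ ., data = re)
lm.step <- step(lm.full, direction = "backward")
summary(lm.step)   # report the estimates and R-squared, not the usual p-values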
And remember, don’t interpret these p-values like you typically do, because after performing model selection, these p-values lose their interpretability like we’re used to. But you can interpret the estimates, and you can interpret R-squared.
R-squared here, 0.1842, not great, but certainly has some predictive ability, especially with the about 900 observations that we have.
8.9 Inference vs. Prediction
So we’ve gone through the lectures, investigating a lot more deeply into linear regression. So we want to summarize that real quickly first. What we’ve learned is we’ve learned how to incorporate new types of variables into the model.
So we saw how to represent categorical predictors through binary predictors. We saw how to incorporate quadratic effects by adding a quadratic predictor variable to the model. And we investigated interactive effects through incorporating interaction terms as well.
We talked about how to assess models for their assumptions by really investigating different residual plots. We looked at the histograms, the box plots of the residuals. And we looked at those residuals versus fitted plots.
From there, we started to compare models, formally and informally. We compared them if they were nested through F-tests and if they weren’t nested through the AIC. And we also decided how to select different models automatically through this wonderful approach called stepwise model selection, which used AIC, essentially, as the criterion to decide which model between lots of different models would be the best one to consider, mostly for a best predictive model, whatever that means. And we used R, then, to actually go about doing this data analysis– to estimate all of our different beta coefficients, to assess that model validity, to compare different models, and to select between lots of potential different linear regression models.
So I’d like to take a few minutes here. And I’d like you to, online, give us some thoughts on what could possibly be some places– some pitfalls or challenges you would have in trying to incorporate these ideas into a regression modeling data analysis example.
Great, so now that you’ve had time to give us your thoughts, we have Luis and Reagan here to help us think about what pitfalls we could potentially have and come against. And the first one is really an important idea when you have these complex multiple regression models. Reagan, can you help us sort of poke into the idea of how interpretations become difficult, depending on what variables you do have in your model?
Yeah, so if you have a multiple regression model where you're including quadratic terms for a particular covariate, or if you're including interaction terms between the different predictor variables, it can become trickier to make sure that you're getting the correct interpretation of all of those beta coefficients, because you have to gather all of the different coefficients that are related to a particular predictor variable to determine its overall effect and its overall influence on the outcome.
And issues can come up, too, if you have problems with multicollinearity. If two variables are very highly correlated, and they’re both predictors in your model. We want to interpret each regression coefficient as, what’s the effect of a 1-unit increase in this variable on the outcome holding all other things constant.
But if there’s multicollinearity, it can be very difficult to actually hold all other variables constant.
The example we saw is we looked at the living area in a house, and the number of bedrooms, and the number of bathrooms, as all predictors in the same model to try to predict the selling price of a home. And that was, like, impossible to do.
Impossible to really interpret well– to hold the other variables constant– since those predictors were so closely related themselves. Cool.
Luis, what are your thoughts on the idea that sometimes when you fit a regression model, it looks great and the R-squared seems great, but then when you dive a little deeper into checking whether those four assumptions we talked about are valid, you start to see some issues going on.
So you want to talk a little bit about what issues that could lead to?
Yeah, first, I’d like to say that it’s really important that you check these assumptions.
Of course, yeah, yeah, yeah. Always, always do it.
Yeah. And I’ve seen a lot of cases where people don’t. And it’s really just one of those– the assumptions really are the backbone for the method. And if you sort of brush them aside, you really can get yourself in trouble.
And so some of the things that can go wrong are if you have nonlinear trends, for example, if you fit a linear regression with a linear predictor, you’ll be able to see this in the residuals. And you might still have a really high R-squared, but it’s not the whole story.
And so in order to get the whole story, looking at residual plots, looking at box plots, looking at all these things to make sure that your assumptions hold is incredibly important.
Yeah, and it’s sort of related to– last, we talked about extrapolating. So if you think there’s a linear trend and you ignore that possible quadratic effect, or nonlinear trend. Then if you were to extrapolate, it would totally screw things up.
Yeah, definitely.
It’d definitely mix it up. Cool. And then from there, we talked a little bit about incorporating extra categorical predictors into a model. And in order to determine whether or not they were significant, we looked at the F-test.
Do you want to talk about that, Reagan, a little bit? About when is the F-test appropriate? When is it not appropriate?
Yeah, so the F-test could be really useful if you want to compare nested models. So for example, if I have a model that has three predictors, and I want to determine if a model that includes any additional predictor variable is better.
And maybe that additional predictor is categorical and it has several different levels. So it’ll show up in the model as multiple coefficients. The F-test is a really convenient tool that I can use to compare those models. And to test overall which model is better.
Overall, does the model that includes the additional predictor fit the data better? Is it a better model than the simpler one?
Yeah, so the F-test is a good way to compare those nested models formally. But one thing that naively we might do is just compare the R-squareds from one model to the next. And what are the possible pitfalls you might see in doing that, Luis?
Yeah, so I guess something that I learned early on when I was learning statistics is that if you add more coefficients, the R-squared will always go up. And it’s because you’re adding more information. Even if it’s not really that helpful, your R-squared will go up.
So if you want to inflate your R-squared, I mean, you could just–
You could add everything.
Just add everything, yeah. Measure something that’s completely not related. And it can only help, right?
Yeah. And so that’s where using selection criteria really are important. And so something like the AIC really is a systematic way of determining whether or not to include that extra variable. And so leaning on the methods that we’ve developed, or that have been developed for this kind of model selection is really important, not just R-squared.
So what does AIC actually help prevent? I mean, if I used R-squared, I would tend to overfit the model, right? And AIC helps prevent that.
So if we were going to, from first steps, use AIC, what would be a good approach of using AIC to compare models? So starting from the ground up, starting from top down.
If we just compared R-squared, it would lead us to larger models than we wanted. So how can we prevent that overfitting? What's a good approach to do that? Either of you.
Well, stepwise regression goes in multiple directions. So if you’re really after a simple, a parsimonious model that has as few predictors as you possibly need, you might do a forward stepwise regression.
Where you basically start with a model that has nothing. And you sequentially add in predictors, only if they help. And you end up with this model that has the smallest number of predictors that will maximize the overall fit of the model.
On the other hand, if you’re OK with a larger model, or even just to evaluate and determine if your forward stepwise regression is the best overall fit, you can do backward stepwise regression, which starts by putting everything in and then filters out predictors.
So I think there’s different ways, depending on what your goals are to work with AIC and those type of criterion.
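For reference, step() in R covers those other directions too; a sketch, reusing lm.full from the earlier example, where lm.null is a hypothetical intercept-only starting model:
[Code R]
# Forward selection grows from an empty model; "both" allows adds and drops at each step
lm.null <- lm(price ~ 1, data = re)
step(lm.null, scope = formula(lm.full), direction = "forward")
step(lm.full, direction = "both")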
Yeah, and especially if you’re doing that backwards selection, which is what we demonstrated. You have to be careful early on with that full model, that you could have way too many predictors.
If you’re trying to fit the kitchen sink, there could be millions of predictors you’re considering, if you’re including interactions, and quadratic effects, and stuff like that. That, very quickly, you might have an overfit model to begin with.
And it’s just tough to step through those model selection approach. But once you do get to that final model, then what are the possible pitfalls you could come into there in interpreting that model?
Well, so one of the things that we talked about in a previous module was this idea of p-hacking, which is basically looking at many, many different hypothesis tests until you find one that’s significant. And then interpreting that as if it were sort of the original intent.
And so you can run into this type of thing in multiple regression, where you have– actually, you have many, many hypothesis tests in R. They’re marked as little stars when you have your output. And interpreting each one of those hypotheses as an independent hypothesis test, it’s actually very similar to p-hacking.
Yeah, you need to be honest with your approach, with your data analysis approach. Don't just treat it, or report it, as if this was the one model we considered and these are all the p-values associated with that one model.
That if you started off throwing away a lot of different variables, you considered a lot of models. And based on that, your p-values in the end just aren’t very interpretable. So definitely pitfalls to watch out for.
All right, so that basically concludes our thoughts on words of caution with linear regression. It’s a great tool to use to investigate how a quantitative response variable relates to possible predictors.
So make sure you keep track of all the possible pitfalls you could run into when you're actually fitting a model to data that's important to you. But it works beautifully. And I highly suggest that you at least start with it when you do a real data analysis at home.
8.10 Panel Data Analysis in Stata
8.10.1 Setting up Stata
We are going to allocate 10 megabytes to the dataset. You do not want to allocate too much memory to the dataset, because the more memory you allocate to the dataset, the less memory will be available to perform the commands. You could reduce the speed of Stata or even kill it.
[Code Stata]
* assumed command for older Stata versions (newer versions manage memory automatically)
set mem 10m
we can also decide to have the “more” separation line on the screen or not when the software displays results:
[Code Stata]
set more off
8.10.2 Setting up a panel
Now, we have to instruct Stata that we have a panel dataset. We do it with the command tsset, or iis and tis
[Code Stata]
iis idcode
tis year
or
tsset idcode year
In the previous command, idcode is the variable that identifies individuals in our dataset. Year is the variable that identifies time periods. This is always the rule. The commands referring to panel data in Stata almost always start with the prefix xt. You can check for these commands by calling the help file for xt.
[Code Stata]
help xt
You should describe and summarize the dataset as usual before you perform estimations. Stata has specific commands for describing and summarizing panel datasets.
[Code Stata]
xtdes
xtsum
xtdes permits you to observe the pattern of the data, like the number of individuals with different patterns of observations across time periods. In our case, we have an unbalanced panel because not all individuals have observations to all years.
The xtsum command gives you general descriptive statistics of the variables in the dataset, considering the overall, the between and the within variations. Overall refers to the whole dataset.
Between refers to the variation of the means for each individual (computed across time periods). Within refers to the variation of the deviations from each individual's own mean.
You may be interested in applying the panel data tabulate command to a variable. For instance, to the variable south, in order to obtain a one-way table.
[Code Stata]
xttab south
As in the previous commands, Stata will report the tabulation for the overall variation, the within and the between variation.
8.10.3 How to generate variables
8.10.3.1 Generating variables
[Code Stata]
gen ttl_exp2=ttl_exp^2
gen tenure2=tenure^2
Now, let’s compute the average wage for each individual (across time periods).
[Code Stata]
* assuming the (log) wage variable is named ln_wage
egen mwage = mean(ln_wage), by(idcode)
In this case, we did not apply the sort command first and then the by prefix. We could have done it that way, but with this single command you can always abbreviate the use of the by prefix.
The command egen is an extension of the gen command to generate new variables. The general rule to apply egen is when you want to generate a new variable that is created using a function inside Stata.
In our case, we used the function mean.
You can apply the command list to list the first 10 observations of the new variable mwage.
[Code Stata]
list mwage in 1/10
And then apply the xtsum command to summarize the new variable.
[Code Stata]
xtsum mwage
You may want to obtain the average of the logarithm of wages for each year in the panel.
[Code Stata]
And then you can apply the xttab command.
[Code Stata]
8.10.3.2 Generating dates
Let’s generate dates:
[Code Stata]
And format:
[Code Stata]
8.10.4 How to generate dummies
8.10.4.1 Generating general dummies
Let’s generate the dummy variable black, which is not in our dataset.
[Code Stata]
replace black=0 if black==.
Suppose you want to generate a new variable called tenure1 that is equal to the variable tenure lagged one period. Then you would use the time-series lag operator (l.).
First, you would need to sort the dataset according to idcode and year, and then generate the new variable with the “by” prefix on the variable idcode.
[Code Stata]
sort idcode year
by idcode: gen tenure1=l.tenure
If you were interested in generating a new variable tenure3 equal to the first difference of the variable tenure, you would use the time-series difference operator (d.).
[Code Stata]
by idcode: gen tenure3=d.tenure
If you would like to generate a new variable tenure4 equal to two lags of the variable tenure, you would type:
[Code Stata]
by idcode: gen tenure4=l2.tenure
The same principle would apply to the operator d.
Let’s just save our data file with the changes that we made to it.
[Code Stata]
save, replace
Another way would be to use the xi command. It takes the items (strings of letters, for instance) of a designated variable (category, for instance) and creates a dummy variable for each item. You need to change the base category anyway:
[Code Stata]
xi: i.category
tabulate category
8.10.4.2 Generating time dummies
In order to do this, let’s first generate our time dummies. We use the “tabulate” command with the option “gen” in order to generate time dummies for each year of our dataset.
We will name the time dummies as "y", and we will get a first time dummy called "y1" which takes the value 1 if year=1980, 0 otherwise, a second time dummy "y2" which takes the value 1 if year=1982, 0 otherwise, and similarly for the remaining years. You could give any other name to your time dummies.
[Code Stata]
tabulate year, gen(y)
8.10.5 Running OLS regressions
Let’s now turn to estimation commands for panel data.
The first type of regression that you may run is a pooled OLS regression, which is simply an OLS regression applied to the whole dataset. This regression does not recognize that you observe the same individuals across time periods, and so it does not account for the panel nature of the dataset.
[Code Stata]
In the previous command, you do not need to type each of the age variables (such as age1 or age2) separately. You can just type age* – the asterisk instructs Stata to include in the regression all the variables whose names start with age.
Suppose you want to observe the internal results saved in Stata associated with the last estimation. This is valid for any regression that you perform. In order to observe them, you would type:
[Code Stata]
ereturn list
If you want to control for some categories:
[Code Stata]
Let’s perform a regression where only the variation of the means across individuals is considered.
This is the between regression.
[Code Stata]
xtreg dependentvar independentvar1 independentvar2… , be
8.10.6 Running panel regressions
In empirical work in panel data, you are always concerned in choosing between two alternative regressions. This choice is between fixed effects (or within, or least squares dummy variables - LSDV) estimation and random effects (or feasible generalized least squares - FGLS) estimation.
In panel data, in the two-way model, the error term can be the result of the sum of three components:
- The two-way model assumes the error term as having a specific individual term effect,
- a specific time effect
- and an additional idiosyncratic term.
In the one-way model, the error term is the sum of two components:
- a specific individual term effect
- and the idiosyncratic term.
It is absolutely fundamental to know whether or not the error term is correlated with the independent variables:
- If you have no correlation, then the random effects model should be used because it is a weighted average of between and within estimations.
- But, if there is correlation between the individual and/or time effects and the independent variables, then the individual and time effects (fixed effects model) must be estimated as dummy variables in order to solve for the endogeneity problem.
The fixed effects estimator works with deviations from these means. In the two-way model, the within transformation is (yit − yi. − y.t + y..) = (xit − xi. − x.t + x..)b + (vit − vi. − v.t + v..), where yi., xi. and vi. are the means of the respective variables (and the error) within the individual across time, y.t, x.t and v.t are the means of the respective variables (and the error) within each time period across individuals, and y.., x.. and v.. are the overall means of the respective variables (and the error).
8.10.6.1 Choosing between Fixed effects and Random effects? The Hausman test
The generally accepted way of choosing between fixed and random effects is running a Hausman test.
Statistically, fixed effects are always a reasonable thing to do with panel data (they always give consistent results) but they may not be the most efficient model to run. Random effects will give you better P-values as they are a more efficient estimator, so you should run random effects if it is statistically justifiable to do so.
The Hausman test checks a more efficient model against a less efficient but consistent model to make sure that the more efficient model also gives consistent results.
To run a Hausman test comparing fixed with random effects in Stata, you need to first estimate the fixed effects model, save the coefficients so that you can compare them with the results of the next model, estimate the random effects model, and then do the comparison.
[Code Stata]
xtreg dependentvar independentvar1 independentvar2… , fe
estimates store fixed
xtreg dependentvar independentvar1 independentvar2… , re
estimates store random
hausman fixed random
The hausman test tests the null hypothesis that the coefficients estimated by the efficient random effects estimator are the same as the ones estimated by the consistent fixed effects estimator. If the differences are insignificant (p-value, Prob>chi2, larger than .05), then it is safe to use random effects. If you get a significant p-value, however, you should use fixed effects.
If you want a fixed effects model with robust standard errors, you can use the following command:
[Code Stata]
xtreg dependentvar independentvar1 independentvar2… , fe robust
You may be interested in running a maximum likelihood estimation in panel data. You would type:
[Code Stata]
xtreg dependentvar independentvar1 independentvar2… , mle
If you qualify for a fixed effects model, should you include time effects?
Another important question when you are doing empirical work in panel data is whether or not to include time effects (time dummies) in your fixed effects model.
In order to perform the test for the inclusion of time dummies in our fixed effects regression,
- first we run fixed effects including the time dummies. In the next fixed effects regression, the time dummies are abbreviated to "y" (see "Generating time dummies"), but you could type them all out if you prefer.
[Code Stata]
xtreg dependentvar independentvar1 independentvar2… y*, fe
- Second, we apply the “testparm” command. It is the test for time dummies, which assumes the null hypothesis that the time dummies are not jointly significant.
[Code Stata]
testparm y*
- We reject the null hypothesis that the time dummies are not jointly significant if the p-value is smaller than 10%, and as a consequence our fixed effects regression should include time effects.
Fixed effects or random effects when time dummies are involved: a test
What about if the inclusion of time dummies in our regression would permit us to use a random effects model in the individual effects?
(This question is not usually considered in typical empirical work – the purpose here is to show you an additional test for random effects in panel data.)
- First, we will run a random effects regression including our time dummies,
[Code Stata]
xtreg dependentvar independentvar1 independentvar2… y*, re
- and then we will apply the “xttest0” command to test for random effects in this case, which assumes the null hypothesis of random effects.
[Code Stata]
xttest0
- The null hypothesis of random effects is again rejected if the p-value is smaller than 10%, and thus we should use a fixed effects model with time effects.
8.10.7 GMM estimations
Two additional commands that are very useful in empirical work are the Arellano and Bond estimator (GMM estimator) and the Arellano and Bover estimator (system GMM).
Both commands permit you to deal with dynamic panels (where you want to use lags of the dependent variable as independent variables) as well as with problems of endogeneity.
You may want to have a look at them. The commands are respectively "xtabond" and "xtabond2". "xtabond" is a built-in command in Stata, so in order to check how it works, just type:
[Code Stata]
help xtabond
"xtabond2" is not a built-in command in Stata. If you want to look at it, you must first get it from the net (this is another feature of Stata – you can always get additional commands from the net). You type the following:
[Code Stata]
* one way to locate and install the user-written command (assumed)
findit xtabond2
The next steps to install the command should be obvious.
How does it work?
The xtabond2 command allows you to estimate dynamic models with either the GMM estimator in difference or the GMM estimator in system.
[Code Stata]
* generic syntax with placeholder names; the options are explained below
xtabond2 dependentvar l.dependentvar independentvars, gmm(list1, options1) iv(list2, options2) noleveleq two robust small
When noleveleq is specified, it is the GMM estimator in difference that’s used. Otherwise, if noleveleq is not specified, it is the GMM estimator in system that’s used.
gmm(list1, options1):
- list1 is the list of the non-exogenous independent variables
- options1 may take the following values: lag(a,b), eq(diff), eq(level), eq(both) and collapse
- lag(a,b) means that for the equation in difference, the lagged levels of each variable from list1, dated from t-a to t-b, will be used as instruments; whereas for the equation in level, the first differences dated t-a+1 will be used as instruments. If b is given as ".", it means b is infinite. By default, a=1 and b=. (infinite). Example: gmm(x y, lag(2 .)) ⇒ all lags of x and y of at least two periods will be used as instruments. Example 2: gmm(x, lag(1 2)) gmm(y, lag(2 3)) ⇒ for variable x, the values lagged one and two periods will be used as instruments, whereas for variable y, the values lagged two and three periods will be used as instruments.
- Options eq(diff), eq(level) or eq(both) mean that the instruments must be used respectively for the equation in first difference, the equation in level, or for both. By default, the option is eq(both).
- Option collapse reduces the size of the instrument matrix and helps prevent the overestimation bias in small samples when the number of instruments is close to the number of observations. But it reduces the statistical efficiency of the estimator in large samples.
- iv(list2, options2):
- list2 is the list of variables that are strictly exogenous, and options2 may take the following values: eq(diff), eq(level), eq(both), pass and mz.
- Eq(diff), eq(level), and eq(both): see above.
- By default, the exogenous variables are differenced to serve as instruments in the equations in first difference, and are used undifferenced to serve as instruments in the equations in level. The pass option prevents the exogenous variables from being differenced to serve as instruments in the equations in first difference. Example: iv(z, eq(level)) iv(x, eq(diff) pass) allows variable x in level to be used as an instrument in the equation in level as well as in the equation in difference.
- Option mz replaces the missing values of the exogenous variables by zero, thus allowing the regression to include observations whose data on the exogenous variables are missing. This option impacts the coefficients only if the variables are exogenous.
- Option two:
- This option specifies the use of the GMM estimation in two steps. Although the two-step estimation is asymptotically more efficient, it leads to biased results. To fix this issue, the xtabond2 command applies a finite-sample correction to the covariance matrix. So far, there is no test to know whether the one-step or the two-step GMM estimator should be used.
- Option robust:
- This option corrects the t-tests for heteroskedasticity.
- Option small:
- This option replaces the z-statistics with t-test results.