Chapter 9 Classification Modeling and Logistic Regression

We now enter a study of what’s called logistic regression. And I know you’re thinking, regression again? We just saw regression. Well, yes. Regression actually is a whole series of different methods. There’s linear regression and there’s nonlinear regression and there’s all sorts of other regressions. And this is logistic regression– a very useful type of modeling for problems you see often in business settings. Let’s first recall linear regression. As we saw in the two previous modules, linear regression allows us to model the relationship between a quantitative response Y and a set of predictors. In this module, we extend our modeling capabilities and learn how to model the relationship between a binary response Y and a set of predictors.

As a motivating example, we have data from LendingClub.com. LendingClub.com is a peer-to-peer network of people that want to lend money and people that need to borrow money. So if you need to borrow money, you put a query up on LendingClub.com. You have particulars about your case and who you are. And a group of individual investors can then decide who they want to fund and how much they want to fund. And it’s a very popular and successful program.

Of course, it'd be interesting to know which loans are successful and which are not (they're not all mortgages, of course). So we have a data set on around 9,500 3-year loans that were funded through LendingClub.com between May 2007 and February 2010. We have a variable called "default" that indicates the loan was not paid back in full: the borrower either defaulted on the loan or the loan was charged off, meaning the borrower was deemed unlikely to ever pay it back. We also have the borrower's FICO score at the time they applied for the loan. A FICO score is a credit score; the highest you can get is 850. So we have whether the loan was defaulted on, yes or no, and the borrower's FICO score.

Well, the response variable Y we want to model is “default,” which measures whether or not the loan defaulted. It’s called a binary response variable, since it only takes on the values 0 or 1– 0 if the loan was not defaulted on, 1 if the loan was defaulted on. So default=0 indicates the loan was paid back. Default=1 indicates the loan was not paid back. And of course we’re interested in modeling whether or not the loan was paid back as a function of someone’s FICO score.

A scatterplot is not that useful when you’re working with a binary response variable. That’s what it looks like. Why does it look so weird? Because Y only takes on two values, 0 or 1. And you can’t really see an obvious separation. By that we’re saying, oh, is it clear that for the low FICO scores you always go in default? And then for the high scores, you always don’t? No, there’s 0’s and 1’s all over the place. So it’s not really clear what’s going on. We need a way to model this.

Well, let’s run the linear regression. “If the only tool we have is a hammer, we are tempted to use it for every problem.” Now, I wrote this slide, and that’s probably not the quote I intended to write. The real quote is, “If the only tool you have is a hammer, every problem looks like a nail.” Yeah, that makes a bit more sense here. So if all you know is linear regression, you try to apply it in every situation. But it’s not applicable in every situation, as we’ll soon see.

So let’s see how linear regression would address this problem. Now, what you want to keep in mind is, what do we assume about the response variable in a linear regression model? That will kind of tell you why it’s not going to work in this case.

So here's the linear regression output for our data. Here we're modeling, again, whether or not you defaulted on the loan as a function of your FICO score. And as usual, we're going to look at the p-value. One observation: because the p-value is less than 0.05, the FICO score is a significant predictor. We need it in the model.

Now, how do we interpret this? Again, that's the scientific notation that R loves to use. That's minus 1.445 times 10 to the minus 3, otherwise known as this number. And we say, for a unit increase in x, so as FICO increases by 1 point, the average of the response variable decreases by 0.001445. Decreases, of course, because the coefficient is negative. But how do we interpret this in the case that Y only takes on the values 0 or 1? As FICO increases by 1 point, the average response variable decreases by 0.001445. But wait a minute, Y only takes on 0 or 1. What exactly does that mean?

Here's the estimated line plot. And OK, that's weird too, right? Because you've got your 1's and your 0's, and what is that line modeling? This is just strange. What are we doing in this case? Well, we ran a linear regression. But what exactly are we modeling?

Recall that in the linear regression model we've studied for the past two modules, we're saying Y is normally distributed around the mean mu, and that we model mu in a linear fashion: mu is beta 0 plus beta 1 times x. That is, the average of Y is modeled as a line. So that's what linear regression knows how to do.

Now, there’s an interesting case when Y only takes on 0 or 1 values. In this case, the linear regression model has a very interesting interpretation. In the case that Y is binary– only takes on 0 or 1 values– the linear regression model is equivalent to assuming the probability Y=1 is beta 0 plus beta 1x. So it gets reduced to this very interesting form when Y is a binary response variable.

So if Y is a binary response variable, we’re modeling pi– which we’re calling the probability of Y=1– as a linear function of beta 0 plus beta 1x. Now that we know what’s being modeled when we run a linear regression with a binary response Y, we can interpret the regression output again. So here’s our regression output.

Now, how do we interpret this? It simply says that for each 1-point increase in FICO score, the probability of not paying back the loan, the probability of defaulting, goes down by 0.001445. So of course we see, as FICO score goes up, the probability of defaulting on the loan goes down.

And you might say, oh, that's a great interpretation. I can model the probability of default. There's no issue whatsoever. Unfortunately, there is an issue, and it shows up in the predicted values. Let's examine some predicted values from our regression model.

Suppose you have a FICO score of 650. The default probability turns out to be 25%, or 0.25. How did we get that? We take the estimated intercept, add the estimated slope times 650, the x-value we want a prediction for, and we get 0.25.

So nothing strange here. A FICO score of 650 is a middling score, above being a bad credit risk; it's typical. Most people have around a 650 or 700. A default probability of 0.25 doesn't seem unusual.

Now, what happens if you have a FICO score of 825? It’s a very good FICO score. And it turns out your default probability is minus 0.005. You’re not going to default and then some. I’m not actually sure how you interpret a negative probability. Where do we get that from? Let’s make sure we understand how we get the predictions. We have our estimated intercept. We’re adding our estimated slope times the 825– that’s the x-value we want to do prediction at– and we get a minus 0.005.
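
To make this concrete, here is a minimal R sketch of how these out-of-range predictions arise. It assumes the loans data frame and the 0/1 default variable that get set up in the R session at the end of this unit, with fico as the FICO score column; check the column names against your copy of the data.

```r
# Minimal sketch: linear regression of a binary response on FICO score
loan.lm <- lm(default ~ fico, data = loans)

# Predicted "probabilities" at FICO scores of 650 and 825
predict(loan.lm, newdata = data.frame(fico = c(650, 825)))
# With these data, the first prediction is about 0.25 and the second comes out
# slightly negative, which makes no sense as a probability.
```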

Well, what’s the issue? Although a FICO score of 825 represents a great credit score, probabilities can’t be negative. And that’s one of the rubs with this model. So a main issue with modeling binary response variables with regular linear regression is, the predictions may be out of range. They can actually become bigger than 1 or they become negative.

If we model a binary response variable using linear regression, we are modeling probability as a linear relationship with a predictor. Probabilities have to always be between 0 and 1. But the linear regression model doesn’t make this assumption. There’s no constraints on what the predictions can be. So there’s no guarantee predictions will make sense whatsoever. You can get probabilities larger than 1. And as we just saw, you can even get negative probabilities.

Fortunately, there’s another type of regression model better suited to model probabilities, that we will discuss in the next segment.


9.1 The Basics of Logistic Regression

We just saw, in the previous unit, an issue with using linear regression to model a binary response variable. And the problem was predictions might be outside the range probabilities can take on. You might get negative probabilities predicted, or you might get probabilities larger than 1. There is a solution, though, and that solution is called logistic regression. Logistic regression is used with a binary response variable. Recall that a binary response variable can take on only one of two values– for example, y equals 0 or y equals 1.

Conventionally, 1 represents a success and 0 a failure, however you happen to define success and failure. Examples of this type of variable in practice: Y is 1 if you defaulted on the loan, 0 otherwise; that's the example we saw. Y is 1 if you made an online purchase, 0 otherwise.

You might be interested in what variables affect people making purchases. Y is 1 if you renew the car lease, 0 otherwise. Y is 1 if an employee is promoted, 0 otherwise. And you can see how you'd be interested in finding out which x variables are related to a success or a failure.

So now, beyond linear regression. We showed, in the previous segment, that when a binary response variable is modeled using linear regression, the underlying model being estimated is simply probability of success is beta 0 plus beta 1 x. And the problem was, if you do this with regression– linear regression– you might get bad predictions. So issues can arise when making predictions and using this linear model. You might get predictions larger than 1 or less than 0.

Logistic regression is a way to model a binary response variable. The logistic regression model states that the probability of success, the probability that y equals 1, equals 1 over 1 plus e raised to minus the quantity beta 0 plus beta 1 x. Oh, my gosh, that is a big formula.

What is e? You may not remember this, but e is a constant, approximately 2.718. There are some mathematical constants: pi, about 3.14, is one, and e, about 2.718, is another. So this is saying you take e and raise it to this power: minus the quantity beta 0 plus beta 1 x.

Yeah, that’s strange-looking. I remember the first time I learned this in graduate school. I was like, what? That’s a very strange-looking formula. But this is called the logistic function. This strange-looking function is called the logistic function, and its sole purpose– just to understand why it’s there– its sole purpose is to ensure that predictions will always be between 0 and 1.

And if you actually graph the logistic function, it looks like the curves on the right. It's called an S-shaped function, because it looks like the letter S. The only purpose of these two graphs is to show you that the output stays in the interval 0 to 1, no matter what x you put into the logistic function, whether it's as small as minus 10 or as large as 10.

No matter what x you put into the logistic function, you’re always going to get an output between 0 and 1. And that’s what you want to do. If you’re modeling probabilities, you want to make sure that whenever you make predictions, you’ll get a prediction in the interval 0 to 1. And that’s what the logistic function is ensuring you get.
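
Just to see that behavior numerically, here is a minimal R sketch of the logistic function; the coefficient values are made up purely for illustration.

```r
# The logistic function: its output always lies strictly between 0 and 1
logistic <- function(x, b0, b1) 1 / (1 + exp(-(b0 + b1 * x)))

# Made-up coefficients, just to draw the S-shaped curve
curve(logistic(x, b0 = 0, b1 = 1), from = -10, to = 10,
      xlab = "x", ylab = "Probability")

# Even for extreme inputs, the output stays inside the interval (0, 1)
range(logistic(seq(-10, 10, by = 0.1), b0 = 0, b1 = 1))
```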

The logistic regression model can be estimated by what's called the method of maximum likelihood estimation. This is a common approach for estimating parameters in statistical models. The estimated probability that y equals 1 is then given by p (we write p for the estimate, reserving pi for the population quantity): p equals 1 over 1 plus e raised to minus the quantity b0 plus b1 x.

b0 is the estimated beta 0. b0 does not equal beta 0. It’s our estimate. b1 is the estimate of beta 1. So b0, b1 are the sample quantities. Beta 0 is the population intercept. Beta 1 is the population slope.

And for our loan default example, the one we did before where we were looking at whether or not people default on their loans, the R output from estimating a logistic regression model is as follows. Now, the output looks surprisingly like linear regression output, which is great, because it means we know how to interpret the p-values. We may not know how to interpret the estimates yet, but it looks very similar to linear regression output.

Now, from the output, we can actually get the probability of a loan default. What we do is plug in the estimated values, b0 and b1, from the output. With this equation, we can predict the probability of loan default for any FICO score x.

So how would you do predictions? It’s a little mathematical. You have to pull out the calculator for doing this, or use R itself. Using the estimated equation, if someone’s FICO score is 600, the probability of default is– well, we just plug in b0, and we have b1 x. And now we’re putting 600 in for x.

And the probability of a default for a FICO score of 600 is roughly 0.39. The probability of a default for someone whose FICO score is 850? Well, we run the numbers, and again it's a little bit of work to calculate by hand, but easily done in R, as we'll show you in an interactive session: it's 0.03. So you're not getting negative predictions anymore for high FICO scores. You're getting predictions that make a bit more sense.
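
As a rough sketch, those predictions can be reproduced in R with the rounded estimates reported in this unit (an intercept near 6.72 and a FICO slope near minus 0.0119); small differences from the quoted values come from the rounding of the coefficients.

```r
# Rounded coefficient estimates from the loan default logistic regression
b0 <- 6.72
b1 <- -0.0119

# Estimated default probability at a given FICO score
p.hat <- function(fico) 1 / (1 + exp(-(b0 + b1 * fico)))

p.hat(600)   # close to the value quoted above
p.hat(850)   # about 0.03
```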

You can obtain what’s called a fitted probability plot. And here we see FICO score, and this is probability of default. Not surprisingly, you see as a FICO score increases– that means someone is more and more credit-worthy– the probability of a default gets lower and lower. But it stays above 0, does not get negative.

Now, the only tricky thing when fitting these models, though, is interpretation of b1. The logistic regression model is non-linear in the parameters. So there's no easy way to interpret b1 as we did with linear regression. We will use a general guideline that says, if b1 is positive, a 1 unit increase in the predictor corresponds to an increase in the probability of success. And if b1 is negative, a 1 unit increase in the predictor corresponds to a decrease in the probability of success. For our loan default example, we see that as the FICO score increases, the probability of a loan default decreases, because the slope estimate was, indeed, negative.

9.2 Inference and Goodness of Fit

In this segment, I'm going to be covering inference and goodness of fit for logistic regression. So let's go. Let's discuss a few more details about logistic regression. Just to remind you, in our last segment we learned how to estimate the coefficients of a simple logistic regression and obtain predictions from it, recognizing that the probability predictions now stay between 0 and 1.

What we're going to focus on right now is hypothesis testing for the coefficient of a predictor in a logistic regression. We're going to examine confidence intervals for probabilities; that's our main focus for confidence intervals. And we're going to see an analogue of the r squared measure from linear regression, a version that applies to logistic regression. So this is the material we're going to be covering in this segment.

So we know how to get a point estimate, a single-number estimate, for beta1, the coefficient that multiplies the predictor variable. And just to remind you, the model for simple logistic regression looks like this: pi, the probability that y equals 1, is equal to the logistic function 1 over 1 plus e raised to minus the quantity beta0 plus beta1 times x.

We usually are interested in confidence intervals and hypothesis tests for beta1. We are not only interested in estimating that. But we’d like to be able to say something inferential through a confidence interval and a hypothesis test. And we’ll get to that in a moment.

So part of the problem is that, because the actual value of beta1 is generally difficult to interpret, we're going to focus mainly on hypothesis testing for whether beta1 is equal to 0 or not. I'll be showing you a confidence interval anyway. But the focus really is just to say whether beta1 is 0, because if it were 0, the predictor variable basically disappears from this formula. If beta1 is equal to 0, then that x variable just vanishes, and the probability no longer depends on the x value. That's why we're mainly interested in knowing whether beta1 is equal to 0 or not.

So let's revisit the loan default example. This was the output from running the logistic regression, and this is our estimated model. You'll notice that the coefficient for the FICO score, b1, is equal to negative 0.0119. And that number, we claim, is difficult to interpret exactly. It can be interpreted, but it's actually pretty difficult to come up with a very clean interpretation.

But what we can do instead is answer the question of whether the data provide evidence whether beta1 is not equal to 0. And if we can convincingly show that, then we’ve implied that FICO scores are indeed related to the probability of a loan default.

So let’s go back– before we answer the question about hypothesis testing, let’s actually remind ourselves how we can actually calculate a confidence interval for beta1. The formula I’m about to show you is going to look very familiar, because it’s a formula that applies in general across many settings.

So I could construct a confidence interval of the form B1 plus or minus 2 times the standard error of B1. So I would form the confidence interval by starting at B1 and subtracting 2 times the standard error of B1. And that would be the lower endpoint of the confidence interval. Then the upper end of the confidence interval is taking B1 and then adding 2 times the standard error of B1.

So usually it's difficult to make sense out of the exact values. But we can assess whether beta1 is positive or negative by examining the confidence interval. So let's actually try it just to see what happens. Here is the output again; we saw this a couple of slides ago, fitting the probability of a loan default from the FICO score. This is the information on the estimate b1, and we also have its standard error from the output, the second value to the right, 0.0008231. So if I wanted to construct a 95% confidence interval for beta1, the way I would do it is to start at b1 and then add and subtract 2 times the standard error of b1.

And so if I take b1 and subtract twice the standard error of b1, that's my lower endpoint. The upper endpoint of the confidence interval is b1 plus twice the standard error of b1. I end up getting this interval as my final 95% confidence interval for beta1, the unknown coefficient I'm trying to estimate.

So what’s interesting, and the reason I showed you this calculation, is that this is a confidence interval that only contains negative values, because it goes from negative 0.0135 up to negative 0.0103. So I can be 95% confident that whatever the value of beta1 happens to be, it’s somewhere in that interval. And if it’s in that interval, it must be a negative value.
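
As a quick sketch, that arithmetic can be done directly in R from the estimate and standard error read off the output:

```r
# Estimate and standard error of the FICO coefficient, read off the R output
b1 <- -0.0119
se.b1 <- 0.0008231

# Approximate 95% confidence interval: estimate plus or minus 2 standard errors
b1 + c(-2, 2) * se.b1
# Roughly -0.0135 to -0.0103, an interval containing only negative values
```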

For hypothesis testing, we now switch gears and ask whether or not beta1 is equal to 0. That's the null hypothesis, and the alternative hypothesis is that beta1 is not equal to zero. This should look very familiar, because we examined this hypothesis test in linear regression. The way we answer this question involves pretty much the same sort of calculation. Fortunately, we can just read the p-value for this hypothesis test off the R output.

So we obtain a p-value and compare it to the significance level, and we'll let R do the work for us. In this case, the hypothesis test for beta1 is: beta1 is equal to 0 versus the alternative that beta1 is different from zero. The p-value is the circled value on the output, in the p-value column.

And if you expand out 2 times 10 raised to the minus 16 power, they had to count out a lot of zeros to get this number right. But that is the right number; it's a very, very small number. And because this number is less than the usual 0.05 significance level, I can reject the null hypothesis. There really is very strong evidence that FICO score is a statistically significant predictor of the probability of a loan default.

Let's move on to estimating probabilities for given predictor variable values. We already know how to estimate the probability that y equals 1, in our case the probability of a loan default. But I'd also like to be able to construct confidence intervals for that probability. So let's say, for example, that I'm interested in the probability of a loan default for a FICO score of, say, 700, which is a pretty reasonable, creditworthy FICO score.

So as we saw in the last segment, all I need to do is plug 700 into this formula for the estimated probability based on that logistic curve. I'd plug in 700 for x, and I would pull b0 and b1 off the output of the R computation.

So that’s what I’m aiming to do. But I’d like to ask a question about confidence intervals. So the confidence interval is going to have the same flavor as the calculation that we did for the confidence interval for beta1. And again, that means that we’re going to start at p, the estimated probability. And then we’re going to add and subtract 2 times the standard error of p.

The actual calculations are pretty complex. There's nowhere to pull the standard error of p off the standard output, so we run some R commands to do it for us. As you'll see in a future segment in this unit, we will show you those R commands. If you were to carry out the appropriate commands, you would get output that looks like this text here. What I circled are the lower and upper endpoints of the 95% confidence interval for the probability of a loan default with a FICO score of 700. And so what this calculation says is that we have 95% confidence that the true probability of a loan default with a FICO score of 700 is somewhere between 16.1% and 17.7%.

So basically, that's what I just said: we're 95% confident that the probability that a borrower with a FICO score of 700 would default is somewhere between 0.161 and 0.177. It's actually kind of interesting to ask what happens if I consider not just 700 as the FICO score, but a whole bunch of different values for the FICO score, and then plot what those confidence intervals look like. How would that appear on a plot? Well, here's what it looks like.
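
For reference, here is a rough sketch of the kind of R command involved; the object name loans.glm1 and the lowercase fico column name follow the R session at the end of this unit, so check them against your copy.

```r
# Predicted default probability and its standard error at a FICO score of 700
pred <- predict(loans.glm1, newdata = data.frame(fico = 700),
                type = "response", se.fit = TRUE)

# Approximate 95% confidence interval: estimate plus or minus 2 standard errors
pred$fit + c(-2, 2) * pred$se.fit
# Roughly 0.161 to 0.177 for these data
```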

What this plot is showing is: on the horizontal axis, I have the FICO scores; on the vertical axis, I have the probability of a loan default. The calculation I just showed you on the previous slides was saying, if I knew that the FICO score was 700, what is the confidence interval for the probability of a loan default?

The probability of a loan default has a confidence interval that goes between these two dashed lines, where I'm drawing these little dots. If I go over to the vertical axis, that marks off roughly 0.161 to 0.177. That's what these values are here.

Because I drew this curve by repeatedly calculating the confidence intervals, I can read off the interval for any of these FICO scores; for a FICO score of 750, for example, the appropriate values are sandwiched between those dashed lines. It is worth noting, by the way, that the width of these intervals gets narrower the closer the probabilities get towards zero. That's something that commonly happens in logistic regression.

So the closer the estimated probability gets towards zero, like in this direction, you’ll notice that the width of these confidence intervals gets smaller. Turns out that the width of the confidence interval, which you can see gets wider and wider as we move further to the left, reaches its widest point when the probability is 0.5.

So if this were to extend a little further (the figure only goes up to 0.4, but suppose it extended up to a probability of 0.5), we would see the widest interval at that point. Then the curve starts to level off, and those confidence intervals get narrower again. So it's just something to understand about how confidence intervals for probabilities work in the context of logistic regression.

The final topic I want to cover in this segment is goodness of fit for logistic regression. Unfortunately, there is no agreed-upon method for summarizing goodness of fit in logistic regression the way there is in linear regression. The difficulty is that, in linear regression, we had the notion of the proportion of variation in the response explained by the predictor variables, and that concept doesn't translate quite as cleanly to logistic regression, unfortunately. What we can do instead is use one of a number of proposed measures that give a calculation that's sort of like r squared. The one I'm going to show you here, very quickly, is called the pseudo r squared measure, due to the economist McFadden.

This calculation actually looks a little similar to r squared in linear regression, but it doesn't quite have the interpretation of the percent variation explained. In the segment where we show you the R commands, we'll show you how to do the calculation.

But let me show you what it computes to in this example. If you were to perform a goodness of fit analysis for the loan default logistic regression, the value ends up being computed as 0.0271. And much like the regular r squared that you learn for linear regression, 0.0271 is a pretty tiny number. So you would conclude, based on this value, that this is a very low pseudo r squared value: roughly speaking, 2.7% of the variation in the loan defaults is explained by the FICO score. That's a very small number.

Generally, though, we don't expect high pseudo r squared values. In linear regression, it's not that uncommon to see r squared values of 0.8 or 0.9, really high numbers that indicate a strong fit. It's pretty rare to see that with pseudo r squared. If you start seeing values as high as 0.4 or 0.5, that indicates a pretty good fit for the logistic regression. So if you can achieve pseudo r squared values that high, logistic regression is doing a pretty good job of explaining the variation in the 1's and 0's of the response variable based on your predictor variable.
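
As a rough sketch of how McFadden's measure is computed, assuming the loans data frame, the default variable, and the fitted model object loans.glm1 from the R session at the end of this unit:

```r
# McFadden's pseudo r squared: compare the log-likelihood of the fitted model
# to that of a null model containing only an intercept
null.glm <- glm(default ~ 1, family = binomial, data = loans)
1 - as.numeric(logLik(loans.glm1)) / as.numeric(logLik(null.glm))
# Around 0.027 for the FICO-only loan default model
```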


9.3 Multiple Logistic Regression

In this segment, we’re going to discuss multiple logistic regression, essentially logistic regression where now we have several predictor variables. We’re going to see how everything that we’ve done up to this point extends to the situation where we have multiple predictor variables. So let’s go.

So the good news is that incorporating many predictor variables has a very strong analogy to what we did with multiple linear regression. We're going to benefit in the same way from using multiple predictors simultaneously. It's worth mentioning up front some of the pros and cons of multiple predictors in logistic regression. One of the positives is greater accuracy in the predictions, because you're using more information to inform the relationship between the predictors and the binary response. Another is that we're making better use of the data: rather than ignoring some of it, or performing logistic regression separately, one at a time, with each individual predictor variable, we're making more efficient use of the data by using it all at once.

One of the problems we're going to encounter (and we may not even try to tackle it, frankly) is that it's much more difficult to visualize the results of multiple logistic regression. We can't simply look at one plot, as we could with simple logistic regression, where we saw the y variable on the vertical axis and a single predictor on the horizontal axis. Now we have multiple predictors, so where do we put them all?

And then finally, with a large data set, the model can get a little computationally intensive to fit. Usually that's not such a bad problem if we're working with moderate-sized data sets. But with enormous data sets, logistic regression, as opposed to linear regression, uses an iterative algorithm to estimate the parameters of the model, and that iteration can really slow down. So it's worth being aware of some of the pros and cons of working with multiple logistic regression.

Well, we're approaching the end of the course, so let's use an example that we can kick back with and have a little bit of fun. This is an example involving the quality of red wine. The question is whether we can use methods from analytics to predict the quality of red wine based on the physio-chemical content of the wine. This is a study where data were collected on 1,599 bottles of Portuguese red wine. Each individual wine was rated by an expert, and the ratings were dichotomized into the expert's opinion of a bad wine versus a good wine. So each of the 1,599 bottles of wine was labeled either good or bad.

In addition to having these binary ratings of each of the bottles of wine, we also have a whole bunch of physio-chemical predictors. So for example, we have the fixed acidity of the wine. We have the volatile acidity of the wine, residual sugar, chlorides, total sulfur dioxide, the density of the wine, the sulfates in the wine, alcohol. And then finally, the response variable is this last one, whether or not the wine was good versus bad.

So in the example I’m going to be working through, we’re going to be considering the eight predictors listed above along with whether the wine was good– no versus yes. And I’ll let y equal 1 to represent yes and y equal 0 to represent no. As always, it’s useful to examine the data, particularly in a graph before starting to embark on any kind of exercise of modeling. And so here’s an example of what one might want to do.

This is a great example for the use of box plots, in particular taking box plots and putting them side by side to compare distributions. So for example, we can compare the fixed acidity, which is on the vertical axis in this first pair of box plots, with whether or not the wine was labeled as good on the horizontal axis. What you can see from the box plot is that the median fixed acidity for wines that were considered good seems to be a bit higher than the median fixed acidity for wines that were considered not good. And you can go through all these different box plots.
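
As a rough sketch, side-by-side box plots like these can be produced with R's formula interface; the data frame name wine and the column names below are assumptions for illustration, not taken from the course files.

```r
# Compare a predictor's distribution between good and not-good wines.
# Assumes a data frame `wine` with a 0/1 (or factor) column `good`.
boxplot(fixed.acidity ~ good, data = wine,
        xlab = "Good wine?", ylab = "Fixed acidity")
boxplot(alcohol ~ good, data = wine,
        xlab = "Good wine?", ylab = "Alcohol (%)")
```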

You can see, in some cases, there doesn't seem to be much difference when you compare the quality of the wine and the variable. For example, for residual sugar, there doesn't seem to be a huge difference in the distributions between good quality wines and not good quality wines. But then you can go over to, say, this last one, where you can really see quite a difference in the distribution of the amount of alcohol. Interestingly, wines that are considered good quality tend to have a higher percentage of alcohol than the ones that don't, which is evidenced by the entire box in the box plot being higher among the good quality wines for the alcohol measure.

So this is examining each individual predictor variable by itself against whether the wine was of good quality or bad quality. What we're going to do for logistic regression is combine all the predictor information together into one model. So here's the multiple logistic regression model, laid out from its basic specification. Just like we had with linear regression, we have p predictor variables, where p in our case for this example would be 8. So we have x1, x2, up to, in general, xp. And the model we're going to assume looks a lot like the model we assumed for simple logistic regression: the probability of observing y equals 1, the probability of success, is the logistic curve 1 over 1 plus e raised to minus a quantity that is now not just beta 0 plus beta 1 x1. It's now beta 0 plus beta 1 x1 plus beta 2 x2 plus beta 3 x3, all the way up to plus beta p xp. So we're taking all the predictor variables and combining them together, pairing beta 1 with x1, beta 2 with x2, and so on, and then summing them up.

What we’d like to be able to do in multiple logistic regression is to make inferences about the unknown betas. These are the quantities that we’re going to be estimating in the process. So unlike in simple logistic regression, where we only ended up needing to estimate beta 0 and beta 1, now we have a whole bunch of other coefficients that we’re going to be needing to estimate in this whole process. So just a comparison of simple and multiple logistic regression– we’re going to use the same method to estimate the coefficients. Back with simple logistic regression, we used this method of maximum likelihood estimation. We’re going to do the same thing for multiple logistic regression. That doesn’t change. The goals are also going to be similar. What we’re interested in is performing statistical inference for the betas, which means typically performing hypothesis tests for the betas.

We also would like to be able to make probability predictions for the response based on a set of predictor variable values. So you give me a bottle of wine and its physio-chemical contents, I'll take that information and plug it into my estimated multiple logistic regression, and what pops out is a probability prediction of whether that wine is good versus bad. There is going to be a slight difference in how I interpret the coefficients, and that difference is going to be very similar to the one we saw when we examined multiple linear regression.

So let's jump right into the example. Here is what happens when you run a multiple logistic regression; this is the output. It looks a lot like multiple linear regression output. We end up getting a set of estimated coefficients, which is what's in the oval here. If you extract those estimates and plug them back into the equation for multiple logistic regression, here's what it looks like: the estimated probability can be written as the logistic function evaluated at the estimated coefficients multiplied by whatever the predictor variable values happen to be. I'll write this out in word form, so you can see what the x's, the predictor variable values, are, and then plug in the particular estimated coefficients that I'm reading off the R output.
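
As a rough sketch, the fit behind that output would be produced with the same glm command as before; again, the data frame name wine and the column names are assumptions for illustration.

```r
# Multiple logistic regression for wine quality with eight predictors
wine.glm <- glm(good ~ fixed.acidity + volatile.acidity + residual.sugar +
                  chlorides + total.sulfur.dioxide + density + sulphates + alcohol,
                family = binomial, data = wine)
summary(wine.glm)   # estimated coefficients, standard errors, and p-values
```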

So the final estimated model turns out to be 1 over 1 plus e raised to minus the quantity 227 plus 0.281 times the fixed acidity value, minus 2.91 times the volatile acidity value, and so on, down to the last term, 0.72 times whatever the alcohol content is for the bottle of wine. So that is the formula for coming up with a predicted probability that a wine is going to be good, based on this analysis of 1,599 bottles of wine.

All right, so let's start with interpreting the coefficients. The interpretation of a positive coefficient is going to be very similar to what we saw before in simple logistic regression: a one unit increase in the predictor variable, holding the other predictors fixed (in other words, keeping all the other predictors at their current values), corresponds to an increase in the probability of success. In contrast, if a coefficient is negative, then a one unit increase in that predictor, holding the other predictors fixed (in other words, not changing the other predictors), corresponds to a decrease in the probability of success.

So let's see that interpretation in the context of the wine example. You can go through and see that fixed acidity, residual sugar, sulfates, and alcohol content all have positive estimated coefficients. What that means is that, in each of those cases (take the alcohol content, just to pick one), if I were somehow to consider the alcohol content of a bottle of wine being 1% higher, that corresponds to a higher probability of the wine being of good quality.

Similar reasoning says that if I take any of the variables that have negative coefficients in this multiple logistic regression, which happen to be volatile acidity, chlorides, total sulfur dioxide, and the density of the wine, and I increase that variable, then because the coefficient is negative, that corresponds to a lower probability of good wine quality. So again, if I consider, say, the density of the wine, and I make the wine more dense by, say, 1%, then that lowers the probability that the wine is of good quality. That's what these results imply when you're examining the coefficients.

Well, that’s interpreting the signs of the coefficients. What about the magnitude of the coefficients? Unfortunately, as with simple logistic regression, the actual values of the coefficients, they do have an interpretation, but they’re really not very straightforward. So here, we’re not going to make much of an interpretation. What’s more important is to interpret whether a coefficient is positive or negative, much like we interpreted it in the last slide.

But what we'd also like to know is whether or not a coefficient is 0 versus not 0. And of course, the reason we might want to know that is that, if the coefficient were 0, then that variable would no longer appear in the logistic function formula, which means it is not an important variable. As usual, we can formally address this question using a hypothesis test.

So let’s revisit the output. And we can see that this set of values on the output are the p-values for the test of whether a coefficient is equal to 0 as the null hypothesis versus the alternative hypothesis that the coefficient is not 0, assuming the other variables are in the model. So it’s the effect of that variable being in the model beyond the effect of all the other variables already being in there.

So if you examine all of these p-values, you can see that every single one of them is less than 0.05. And so what that essentially means is that, for every single variable, each variable individually is a significant predictor of the quality of wine relative to models where all the other predictors are included. And another way to say that is that each individual variable has a significant effect beyond all the other variables that are already in the model.

We can also construct predictions and confidence intervals for the probability that a wine is good, in other words a confidence interval for the probability of success in the context of multiple logistic regression. Starting from the estimated probability based on the fitted multiple logistic regression model, there is no simple formula for constructing a confidence interval. But the good news is that this sort of calculation is very easy to do in R.

So let me just show you what the R output would look like. I might, for example, consider a bottle of wine with a fixed acidity of 6.2, a volatile acidity of 0.36, a residual sugar of 2.2, and so on. This happens to correspond to one of the bottles of wine that I pulled at random from the 1,599 bottles. What I can determine from the estimated logistic regression equation, when I plug in those values, is that the probability this wine is of good quality is a measly 0.141. So there's only a 14.1% chance that a wine with these characteristics would be labeled as good. It turns out, by the way, that this wine was rated as not good in the actual data set. That's an aside, but at least it's quick anecdotal evidence that this probability estimate of 0.141 is in line with what the data are actually telling us, that this was not a good quality wine.

But what I can also do is calculate a confidence interval for the probability using this set of predictor variable values. And if you run this through R, you're going to get this interval. So rather than simply saying that the estimate is 0.141, what I can say is that the 95% confidence interval for the probability is between 0.089 and 0.195. Again, that's something we will see how to get using R. So the interpretation here would be: for wines with this particular set of physio-chemical variables, we are 95% confident that the probability of a good quality evaluation is between 8.9% and 19.5%. We don't know exactly what it is, but we're 95% sure it's in that interval.
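
Here is a rough sketch of the kind of R call involved, continuing with the hypothetical wine.glm object and assumed column names from the earlier sketch; the first three values come from the example above, and the remaining ones are made-up placeholders just so the code runs.

```r
# One new bottle of wine, described by its predictor values
new.wine <- data.frame(fixed.acidity = 6.2, volatile.acidity = 0.36,
                       residual.sugar = 2.2,
                       # placeholder values, not from the data set:
                       chlorides = 0.08, total.sulfur.dioxide = 40,
                       density = 0.996, sulphates = 0.6, alcohol = 10)

pred <- predict(wine.glm, newdata = new.wine, type = "response", se.fit = TRUE)
pred$fit                           # estimated probability the wine is good
pred$fit + c(-2, 2) * pred$se.fit  # approximate 95% confidence interval
```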

The last thing I wanted to describe to you was whether or not pseudo-r squared applies to the setting of multiple logistic regression rather than just simple logistic regression with just one predictor. Of course, the good news is that pseudo-r squared indeed applies to multiple logistic regression in the same way as simple logistic regression. And again, we’ll show you how to do this in R. It’s exactly the same thing as how it was done for simple logistic regression. And again, you’ll be seeing that in the next segment, or I should say two segments from now.

So it turns out, if you do the calculation for this particular situation, the pseudo-r squared in this context is computed to be 0.313. And that’s actually a pretty reasonable magnitude. And this would suggest that there is a moderate amount of variability in the binary response of good versus bad quality wines that can be explained by these eight physio-chemical predictors. So again, this gives you a rough guide for how to say whether or not your multiple logistic regression is capturing an appreciable amount of the variation in the binary responses.


9.4 Comparing Logistic Models

So when we have many competing logistic regression models, lots of models that we might want to consider, we can compare them– just like we did with linear regression– through various methods. Just as a quick review, what is the simple logistic regression model? Well, remember, we’re trying to predict the probability of a Bernoulli response variable taking on the value 1. And we’re doing it through this logistic function. We’re going to estimate that probability, call it p, based on the estimates from our regression model. So the betas in the true model represent the true associations of the predictors with the response. And the little b’s are our estimates from our data.

Well, we're going to start off by doing some model diagnostics, or attempting to do model diagnostics, in the context of logistic regression. And we'll find out that things can get quite difficult. But before we do that, we can define a residual, just like we did in linear regression. Remember, in linear regression, the residual was really the defining measure we used to determine whether our model assumptions were reasonable.

Here we’re going to calculate them the same way. We’re going to use our observed outcome variable y, which takes on value 0 or 1, and subtract off what we predicted it to be, the value p, which is going to be a probability value. Well, there are some issues with that in calculating the residuals.

Keep in mind the result is what we call e, the estimated residual. The only problem with it is that it can really only take on two values for any given predicted value. So if we predict the probability of the response taking on the value 1 to be 0.7, then the only two possible residuals are 1 minus 0.7 or 0 minus 0.7, positive 0.3 or negative 0.7. So if we look at the residual plots, they're going to look a little funky. Let me use an example to highlight that.

So let’s look at trying to predict whether or not a loan defaults, just like we saw before, from the FICO score alone. We’re going to fit that model and then look at the residual plot, the residual scatter plot, trying to predict the residual from the fitted values. And this is the plot we get.
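
As a rough sketch, a residual plot like this can be drawn from the fitted glm object, again assuming the loans.glm1 object from the R session at the end of this unit:

```r
# Response-scale residuals (y minus predicted probability) against fitted probabilities
plot(fitted(loans.glm1), residuals(loans.glm1, type = "response"),
     xlab = "Fitted probability", ylab = "Residual")
abline(h = 0, lty = 2)
```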

For any particular value of the predicted probability, which goes on a scale of 0 to 1 (here the observed predicted probabilities are small, between 0 and 0.4), we see that there are only two possible residual values, one negative and one positive, exactly one unit apart.

And notice the pattern we see here. Essentially, this is not a very useful plot for diagnostics checking, for checking whether or not this logistic assumption is reasonable. And that pattern emerges because of that discreteness, the binary nature of that response variable.

So because of that pattern that we saw in the residual plot, diagnostics for logistic regression really is not very easy at all. Performing diagnostics, that is. So if you’re using logistic regression and you want to perform inferences, like making sure that associations are significant or not– you really want to make sure your assumptions for the logistic regression are valid– then you probably will have to consult a statistician to get some more information on how to perform those model diagnostics or get the statistician to do that for you.

However, the good news is that using logistic regression as a predictive model is still perfectly valid whether or not those assumptions are reasonable. You could possibly improve the predictions if you could verify or correct the assumptions, but a model whose assumptions haven't been validated is not a problem in this sense. It is still fine to use for prediction.

So now that we have diagnostics out of the way, we can also start talking about comparing different models. Let’s assume assumptions are reasonable for two different models. And let’s compare those models, just like we did in a linear regression.

So recall in linear regression, if we had two nested models and we wanted to perform a formal significance test, we could perform the linear regression F test. Here we’re going to do something very similar to it. But it’s going to have a different test statistic name. And that is the chi-square test statistic.

If two models are not nested, just like in linear regression, we’re going to use AIC to compare them. And if we’re going to want to compare or build a best predictive model, we can use AIC to do that for us. Because we don’t have to worry about nesting or anything like that.

All right, so as an illustrative example, we’re going to use the variable purpose for a loan as a predictor for whether or not a loan defaults. We’ve seen the FICO score already. It’s predictive of whether or not a borrower defaults on a loan. And the question is, can we improve that by using another variable, the purpose for the loan, as an additional predictor? But it has some issues. Well, not some issues. It has some intricacies that we have to worry about.

The variable purpose is categorical, and it has 7 categories, so R will create 6 binary predictors to define it in a logistic regression model. To determine whether it's important, then, you can't perform one single test for a beta coefficient, because there are 6 of them. So we're going to need a different approach, and that's where this chi-squared test comes in.

So formally speaking, the chi-squared test for determining whether added variables provide extra predictive power in logistic regression has a formal pair of hypotheses. The null hypothesis is that all of the betas for the extra variables are equal to 0. The alternative is that at least one of the betas associated with those extra predictor variables is not 0.

We're going to calculate something called a chi-squared test statistic; the larger it is, the more it favors the alternative. Really, though, we're going to use it through its p-value. If the p-value for this chi-squared test is small, there is evidence suggesting that at least one of those extra variables is important in predicting the binary outcome in the response.

This chi-squared test in logistic regression is the analogue of the F test in linear regression. It has the exact same null and alternative hypotheses, but the mathematics are a little bit different. That's why we switch from an F test to a chi-squared test.

So let's look at an example. In R, we're going to compare two different models: one model that has just the FICO score, and another that has the FICO score plus this categorical variable for the purpose of the loan, which had 7 categories and therefore 6 binary predictors to define it.

When we run the chi-squared test to compare the two nested models in R, this is the output we get. It gives us a test statistic and, more importantly and easier to interpret, the p-value associated with it. And here this p-value is extremely small: certainly less than 0.05, and definitely less than even 0.0001. From this really small p-value, we can conclude that the purpose of the loan is an important predictor of whether or not a borrower defaults on the loan, beyond what the FICO score alone is able to contribute.
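
As a rough sketch, this nested-model comparison can be run with R's anova command on two fitted glm objects; the column names follow the loans data described in this unit, but check them against your copy.

```r
# Nested models: FICO alone versus FICO plus the purpose of the loan
glm.fico         <- glm(default ~ fico, family = binomial, data = loans)
glm.fico.purpose <- glm(default ~ fico + purpose, family = binomial, data = loans)

# Chi-squared test for the added variable (all 6 purpose dummies at once)
anova(glm.fico, glm.fico.purpose, test = "Chisq")
```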

Here is the key point for both the chi-squared test in logistic regression and the F test in linear regression: categorical predictor variables with many categories can be tested all at once, through either a chi-squared test in logistic regression or an F test in linear regression. That's actually where the F test and the chi-squared test come into play most often.

With that said, if we want to perform model selection in logistic regression, we might want to consider non-nested models. Because of that fact, we’re going to take a different approach. And if we want to go automatically through selecting a model in logistic regression, we’re going to take the same approach we did in linear regression.

We’re going to take this backward stepwise model selection procedure and use AIC as the criterion to decide between models. The algorithm is the same as before for this backwards selection approach. What do we do? We start off with that full model with all potential predictors we’re considering.

Consider removing one variable at a time. And of all models with one less variable, choose the one that improves the AIC, makes it smaller by the most amount. And then you continue removing variables one at a time until AIC can no longer be improved, until you’ve reached the smallest AIC of all the models you’ve considered.

So we're going to use this in an application: predicting whether a borrower in our loan default data set defaults, from various different predictor variables. We're going to start with the FICO score; we've already seen how strong it is. We're also going to consider the debt-to-income ratio, which is denoted DTI in the data set.

We’re going to look at annual income on the log scale. Because of the natural distribution that income takes on, which is often very right-skewed, the log scale often corrects for that and reduces the effect of outliers. The interest rate that the loan had, and then the purpose for the loan– these seven categories that we’ve been talking about. Credit card is one of them. Educational loan, real estate loan, things of that nature.
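
As a rough sketch, the full model can be fit as below; the column names fico, dti, log.annual.inc, int.rate, and purpose are my reading of the loans.csv described in this unit, so verify them against your copy of the data.

```r
# Full model with all candidate predictors of default
full.glm <- glm(default ~ fico + dti + log.annual.inc + int.rate + purpose,
                family = binomial, data = loans)
summary(full.glm)   # the summary output also reports the model's AIC
AIC(full.glm)       # around 8067.1 for these data
```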

So this is what the full model output looks like. This is our full model with all the predictors included. We could write out the model statement for logistic regression based on that logistic expression, where all of the b estimates associated with the predictors are circled here. And what we see is that the p-values for most of these predictors, within the context of this model, are extremely small, indicating important predictors.

Well, this model has an AIC. We can use R to calculate it. It automatically gets reported by the summary output for a logistic regression model. And here what we see is the AIC was calculated to be 8067.1.

And we see that within the context of all the p-values, there is one that really sticks out, and that is DTI. DTI's p-value is definitely not significant. So the question is, if we take this backward stepwise approach, will DTI drop out of the model?
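
A minimal sketch of the backward stepwise search, using R's step function on the hypothetical full.glm object from the previous sketch:

```r
# Backward stepwise selection using AIC as the criterion
step.glm <- step(full.glm, direction = "backward")
summary(step.glm)   # the model that remains after dropping variables
```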

So let's see the results after the stepwise procedure. After all of our steps, the only variables that remain are FICO, log annual income, interest rate, and purpose. What dropped out? DTI. Dropping DTI improved the AIC ever so slightly, from 8067.1 to 8067.0.

Removing any of the other variables still remaining in the model would only make the AIC worse, that is, increase it. So we leave everything else in the model.

We stop after that first step, after dropping DTI, because all the other models with fewer predictors have a higher AIC. And keep in mind that the variable purpose, the categorical predictor with its 7 categories, is treated as a unit: the 6 binaries associated with it are either all included in a model or all left out.

This reflects the general heuristic that purpose is a single predictor: all of its categories are related, and they should either all be incorporated in the model, providing predictive ability, or all be left out. Just a side note, but keep in mind that R treats categorical variables with many categories exactly the way we want here.


9.5 R Code and Examples

Welcome, everyone. So far in this unit, we've learned the general concepts behind a logistic regression model and how to interpret it. Now we're going to get into some interactive R coding to actually run these analyses and fit these models within R for a specific data set.

And the data set we're going to be working with here is the loans data set that we saw in the lecture slides. I'm going to start by just reading that in with a read.csv command. I have the data set saved in my working directory, so under My Files you see that loans.csv is sitting there. If I call read.csv on that file, it should load in correctly, and I'm saving it as a data set called loans. We see it there: loans has over 9,000 observations and 14 variables.
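
A minimal sketch of that first step, assuming loans.csv sits in your working directory:

```r
# Read the LendingClub loans data from the working directory
loans <- read.csv("loans.csv")
dim(loans)   # roughly 9,500 rows and 14 columns
```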

And usually what we'll do is just summarize the data set to get a general sense of what we're dealing with in terms of the variables. This is what we see here: lots of different variables. The variable of interest is whether or not a loan was fully paid off, but I don't like typing "not fully paid" every single time, so I'm going to redefine that variable. I'll create a new variable called "default" within the loans data set, which is just going to be the same thing as "not fully paid," only easier to handle. Of course, I had a line highlighted, so all it did was run the highlighted lines; let's run that again.
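
A rough sketch of those two steps; the original column name not.fully.paid is my reading of the loans.csv used here, so verify it against your copy.

```r
summary(loans)   # general sense of all 14 variables

# Shorter name for the response: 1 if the loan was not paid back in full
# (a default or charge-off), 0 otherwise
loans$default <- loans$not.fully.paid
table(loans$default)
```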

So now I have the default variable. And we’re going to start, like we did in the slides, by fitting a regression model, a logistic regression model, in R, to predict whether or not a borrower defaults on their loan, based on a model that has the FICO score as the only predictor.

And the command we have to use to run this, to fit this, to estimate the beta coefficients for logistic regression, is now glm instead of lm for linear regression. GLM stands for “generalized linear model.” Essentially, it’s a fancy way of saying there’s a whole class of models that extend the linear model: the response is not modeled as a linear function of the predictors directly, but the effects of the predictors are still combined in a linear fashion.

And the glm command is actually more flexible than just logistic regression. To get a logistic regression, we tell it to use family=binomial, because that’s the form of the response variable. Remember, our response variable is binary, which is based on a Bernoulli distribution, a special case of the binomial distribution.

So we’re telling our GLM command, here’s the model we want, predict default from FICO score. We want family equals binomial, because we want, specifically, a logistic regression model, and we want to look for these variables in the data set called loans.

We save it as an object called loans.glm1 and run that command. And it does nothing. Uneventful, so let’s look at the summary of that object. Our eye should first go to the coefficients table. Just like with linear regression, this gives us most of the information we care about, or at least what we’ve learned in this class. These are the estimates for the beta coefficients.
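
Here is a sketch of that fit, assuming the FICO column is named fico:

# Simple logistic regression: default as a function of FICO score
loans.glm1 <- glm(default ~ fico, family = binomial, data = loans)
summary(loans.glm1)   # coefficients table, p values, AIC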

So this first column holds our estimates: the intercept, the beta zero estimate, is 6.72, and our estimate for the slope is negative. What that’s telling us is that the association between the probability of defaulting and an individual’s FICO score is negative. Interpreting the magnitude of that value is not easily done in logistic regression, so we’re not going to get into that. What’s really important is the sign. There’s a negative relationship.

And then, of course, we can always look at the p value related to that estimate, to that association. We see that the p value is extremely tiny. So there is a lot of evidence to suggest this relationship is an important one, that this negative relationship between a borrower’s probability of defaulting and their FICO score is real.

After we do that, we can just do some calculations. Let’s do some predictions for a particular individual, for someone who has a FICO score of 700. Let’s look at what their predicted probability of defaulting is. So what we’re going to look at is using the predict command just like we did with linear regression.

For the predict command, we have to provide the model we’re using, here loans.glm1. And since we’re doing a prediction for a new observation, we have to do this prediction for new data. We have to define, actually, a new data frame. It’s a little clumsy, but we define a new data frame that has a variable, fico, the predictor that we want, with the value 700.

And we’re going to ask for a couple of things here. We’re going to ask for the response, because we want the probability; that’s the argument we need to provide to get probabilities out. And se.fit means that we’re going to get not only the predicted value, but also the standard error for that prediction.

So if we run this command, whoops, not just the se.fit part, if we run this entire line, we get a prediction. What does that prediction look like? Our predicted probability of defaulting for an individual with a FICO score of 700 is about 16.9%, and we have a standard error for that fit of about 0.003.
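
That prediction call would look something like this, again assuming the predictor is named fico:

# Predicted probability of default for a FICO score of 700,
# along with the standard error of that prediction
predict(loans.glm1,
        newdata = data.frame(fico = 700),
        type = "response",
        se.fit = TRUE)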

What we can then do is turn that into, essentially, a confidence interval, by combining the prediction with a lower bound, subtracting off two standard errors, and an upper bound, adding two standard errors to the prediction. So we define something called “out” and give names to the values inside it. That out object then reports the fitted predicted value and the lower and upper bounds for an individual with a FICO score of 700.
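
A sketch of that interval construction, using roughly two standard errors on either side of the prediction:

pred <- predict(loans.glm1, newdata = data.frame(fico = 700),
                type = "response", se.fit = TRUE)
out <- c(fit   = unname(pred$fit),
         lower = unname(pred$fit - 2 * pred$se.fit),
         upper = unname(pred$fit + 2 * pred$se.fit))
out   # predicted probability with approximate 95% bounds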

OK, so we know how to do predictions, we know how to interpret the model, or we’ve seen an example of that, at least. Let’s go a little bit further. Let’s scroll down here in the R script and visualize this relationship. It looks like it’s a strong relationship, potentially– there’s definitely a lot of evidence, at least. So let’s plot these observations. And of course, we’re going to plot on the x-axis FICO score and on the y-axis, whether or not somebody defaults, the binary for that.

So what is this going to look like? Of course, we’re going to have a lot of points piled up on the vertical axis at zero and a lot of points piled up at 1, across various different values of FICO. And that’s what we get: a not very informative scatterplot, with a lot of points piled up at various FICO scores for the individuals who did not default, and lots of points piled up for the individuals who did default on their loan.

So in order to get a sense of what this model is actually doing, we’re going to run the predict command not for just one individual, but for lots of individuals: every FICO score from 550 all the way up to 900, in increments of one. The colon operator here, as a reminder, defines a new vector that goes 550, then 551, then 552, et cetera. Again, we want the responses; we don’t need the standard errors this time. Then we’re going to use that to plot the estimated logistic regression curve on top of the scatterplot of observed 0s and 1s.

So let’s do the predictions at the various x’s, 550 to 900, and then draw the lines connecting the points. And there is a visualization of what the logistic regression is estimating for the probability of defaulting, for every individual, based on their FICO score. Not surprisingly, since the relationship was negative in the regression output, we see that the estimated logistic curve decreases as FICO score increases. But of course it’s not linear; it has that general shape where it’s bounded below by zero. So as FICO score goes up, the probability of defaulting gets closer and closer to zero, and as FICO score goes down, the probability of defaulting increases, heading toward 1. The curve never gets very high here, because we just don’t have that high a proportion of individuals who default.
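
A sketch of the plot and the overlaid curve; the grid of FICO scores from 550 to 900 matches the description above:

# Scatterplot of the binary outcome against FICO score
plot(loans$fico, loans$default,
     xlab = "FICO score", ylab = "Default (0 = no, 1 = yes)")
# Predicted probabilities over a grid of FICO scores, one point apart
fico.grid <- 550:900
phat <- predict(loans.glm1,
                newdata = data.frame(fico = fico.grid),
                type = "response")
# Overlay the estimated logistic curve
lines(fico.grid, phat)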

The next thing we looked at, in logistic regression just like in linear regression, is whether or not the model could be improved by incorporating new predictors. So what we’re going to do is fit a second logistic regression model, where not only FICO score is a predictor, but also purpose.

And remember, “purpose” was the purpose for the loan, and it was categorical, with seven categories. So if I look at a table of the variable purpose, which is in the data set loans, it gives us a general sense of what that predictor looks like: the seven categories and the breakdown of the number of individuals within each category. We’re going to use that as a second predictor in a multiple logistic regression model. Fit that, and call it loans.glm2.
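
Those two steps might look like this, with the column name purpose following the description above:

# Counts of loans in each of the seven purpose categories
table(loans$purpose)
# Multiple logistic regression: FICO score plus loan purpose
loans.glm2 <- glm(default ~ fico + purpose, family = binomial, data = loans)
summary(loans.glm2)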

We can summarize that. And we see that FICO score, in the context of this model, still has a negative relationship and is still quite significant, controlling for the various purposes for getting a loan. What we also see are the six binary predictors that define these seven categories. They have various effects, and these estimates for their betas are all interpreted relative to the reference group.

So of course, you’ve got to ask the question: what’s the reference group? And remember, the reference group is the category that comes first alphabetically. If we scroll up, it looks like the reference group is all other purposes for a loan. So these estimates are all in reference to the individuals that fall under the “other” category.

So, comparing two individuals with the same FICO score, somebody with a credit card loan has a lower chance of defaulting than somebody with an “other” type of loan. And the purposes that appear to have the highest chance of defaulting are the ones with the largest positive estimates; here, that’s loans for small businesses, which is probably not all that surprising if you know anything about loans.

And then if we go a little bit further, we can perform a test. Since we want to compare model two to model one, the extra term is this purpose variable, which has six coefficients associated with it.

So we can’t just do a simple t-test to determine whether purpose is an important predictor on top of what FICO was able to predict. Instead, analogous to the partial F-test in linear regression, we compare model one, with just FICO, to model two, with FICO and purpose, and the test we perform is a chi-squared test. The anova command will allow us to do that.

We click Run, and we see: great, the anova comparison between these two models gives us a chi-squared statistic that leads to a p-value that’s tiny, tiny, tiny. What that tells us is that yes, purpose provides extra explanatory power for predicting whether or not a borrower defaults on a loan, beyond what FICO score does.
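
The nested-model comparison is a one-liner:

# Chi-squared (likelihood ratio) test: does purpose add explanatory
# power beyond FICO score alone?
anova(loans.glm1, loans.glm2, test = "Chisq")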

So great, we have a fairly strong model, a better model with these two predictors. The question is, can we improve upon that? So what we’re going to do is go through our standard stepwise model selection, just like we did with a linear model, now with a GLM, a logistic regression model. And R knows how to do that with a GLM-type model, just like it did with the linear model.

So what we’re going to do is define our baseline model. I shouldn’t really call it a baseline model, but our starting model, which, when we do backwards model selection, is the largest model we’re interested in. Just to keep things simple here, we’re going to look at a model with five predictors: FICO score, which we know is important; DTI, the debt-to-income ratio; the log annual income for the borrower; the interest rate that the loan was borrowed at; and the purpose for the loan, which, again, we saw should be an important predictor.

So we can fit that overall full model with all five predictors we want to consider. We don’t want to consider the other variables in the data set. So we fit what I call the GLM full model. Then we can look at the summary table, and boy, is there a lot of information there. There are a whole lot of coefficients. It’s a pretty big model with a lot of predictors, and of course we could have incorporated more, but we decided not to. This was busy enough for illustrative purposes.
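
A sketch of that full model; the column names dti, log.annual.inc, and int.rate are assumptions about how those predictors are stored in the data set:

# Full model with the five predictors under consideration
loans.glm.full <- glm(default ~ fico + dti + log.annual.inc + int.rate + purpose,
                      family = binomial, data = loans)
summary(loans.glm.full)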

Then, to run a model selection and build the best predictive logistic regression model, just like for the linear model, the lm-type model, it’s the same thing for the GLM. We can use the step command, starting with the full model, and R will know to start with that full model and step down one step at a time.
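
Running the backwards selection is then just:

# Backwards stepwise selection by AIC, starting from the full model
step(loans.glm.full)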

So as a reminder, how do we interpret this output? Basically, it starts with the model with five predictors, considers dropping one variable at a time, and looks at how AIC changes. The AIC for the full model is 8067.13. If I remove DTI, the AIC improves to 8067.0; compared to the model with nothing removed, whose AIC is 8067.1, removing any of the other predictors would actually increase the AIC, which is a bad thing.

So since removing one of the variables does improve things, let’s remove it and refit the model with the remaining variables: DTI has been dropped, and the other four remain. After that, we see the next round of comparison models, and it looks like the model that doesn’t remove any of the remaining predictors is the best one to consider. So that’s the model we end up with: the model with four predictors, with DTI removed.

I made a little bit of a mistake here. I ran the step command, but I didn’t save the result as a new object. So let’s call it loans.glm.step and run that step command again. Now I have an object called loans.glm.step, and like we often do, let’s look at the summary of loans.glm.step.
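
The corrected version, saving the selected model this time:

# Save the selected model as its own object
loans.glm.step <- step(loans.glm.full)
summary(loans.glm.step)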

And what we see are the coefficients associated with all of our remaining variables. We could calculate a pseudo R-squared here. One thing we do not want to do, as a reminder, is interpret these p values the way we typically would. They can be compared relative to one another, to see which predictor is potentially most important within the context of this model, but they are not a good measure for deciding which variables are significant for inferential purposes, because this wasn’t the only model we fit. Those p values would be interpretable if this were the only model we considered, but it wasn’t; we considered several different models through the stepwise command.
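
If you want a rough goodness-of-fit number, one common version of pseudo R-squared (McFadden’s) can be pulled from the deviances stored in the fitted object:

# McFadden's pseudo R-squared: 1 - residual deviance / null deviance
1 - loans.glm.step$deviance / loans.glm.step$null.deviance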

So hopefully by stepping through these R commands, you’ll feel a little bit more comfortable fitting logistic regression models in R. And hopefully you get a chance to do that with some practice problems in the live sessions.

9.6 Critical Assessment

Let’s recap what we’ve seen about logistic regression. We started out by saying how logistic regression differs from linear regression. We then covered simple and multiple logistic regression models. We then covered prediction and statistical inference about the coefficients– how to do hypothesis testing and confidence intervals.

We looked at goodness of fit through what’s called pseudo R-squared, a slightly different R-squared measure– actually, a very different R-squared measure– than what we saw when we looked at linear regression models.

Model comparisons and model building for logistic regression were discussed. And then, of course, using R, as always, for running logistic regression analyses: how to do predictions, make inferences, [INAUDIBLE], and everything that needs to be done when you’re running a logistic regression.

Now that you’ve seen logistic regression and you’ve fit some models, what are some challenges you think that can occur when you build logistic regression models? And what do you want to think about when you’re analyzing data using logistic regression?

Let’s discuss the important issues that come up when using logistic regression models. And back with us again are my colleagues, Kevin Rater and Mark Wickman. So nice to have you guys back.

Oh, it’s nice to be back.

It’s been awhile.

It’s good to see you.

Yeah, thanks.

So Mark, what do you think about logistic regression, some issues to worry about?

There are always issues to worry about in any model that you’re fitting.

Sure.

One of the issues that I think is a big one is that unlike linear regression, logistic regression can be pretty sensitive to idiosyncrasies in the data. For example, I find that when I work with, say, binary predictors, or really any kind of categorical predictors, you’re sometimes at risk of having the estimates of your coefficients even go to infinity. And that’s particularly problematic when you have a small sample of data. So there are techniques to handle those sorts of things. But if you just simply run logistic regression in its plain form, you’re at risk for these kinds of funny things happening that you wouldn’t expect.

Is that related to overfitting in any way?

It can be. Some of it actually has to do with this idea called separation, where you can take the binary outcome and separate it into the 1’s and the 0’s, and then you might have predictor variables, binary predictors, that also separate into all 1’s and all 0’s, and there’s this kind of crisscrossing that you can’t exactly match up.

Gotcha.

Now Kevin, what’s interesting, I find, is that they sound very different, linear regression, logistic regression. But there is something they share in terms of this linear effect. Can you talk about that a little bit?

Sure, yeah. In the linear model, we explicitly say that the mean of y is modeled linearly in terms of the predictors. In logistic regression, that’s not quite the case, because we have that strange logistic function, where you see e to the beta zero plus beta one x1 and so on, governing the relationship between the response and the predictors. But really, when you combine the effects of all the predictors, it is in a linear fashion. When you look at how the predictors combine inside that function, it is a linear function. You just have to transform it before you actually connect it to that binary outcome.

And it’s funny. It does not look linear, right? At the end of the day–

No, by no means. By no means.

The formula is completely non-linear. But if you actually stare at how the predictor variables are combining together–

There is that linear component.

It’s just linear.

There’s a linear component.

Yeah, that part is, right there. But the rest of it isn’t.

Now, talking about that weird function: I know, Mark, you and I discussed this when we were writing the slides. In terms of interpreting these coefficients, you can go haywire. You can get some really weird math involved. So we tried to keep it a little bit simplistic, to make it easy to interpret the coefficients.

Yeah, it’s such a headache. You know, I teach a couple of different classes where I cover logistic regression, and every single time I teach, I’m deciding: how deep to get into it?

Yeah, how deep to get into interpreting the coefficients, because there are some ways to do it. There are some rules of thumb. Sometimes those rules of thumb don’t work. Sometimes they do. But the main thing is to make sure you understand whether the coefficient is positive or negative. That gets you pretty far along the way.

I like what we did. We just talked about the positive or negative effect on the probability, which I thought was a great interpretation.

Yeah, I think that helps a lot. And that’s really, at a rudimentary level, what you need to know. But part of the message here is that as somebody who’s taking the class and learning logistic regression for the first time, don’t feel like you need to know everything. You have at your disposal statisticians that you work with, or people that you can consult, and they can help make those interpretations. Just be aware that it’s hard to make those interpretations in a very clean way just from the results of running a logistic regression.

Absolutely. And finally, and we’ll turn to Kevin for this, we also had some very nice discussions about this while putting the course together. There are some interesting historical papers on diagnostics for logistic regression, but it’s a lot easier to do diagnostics for linear regression, wouldn’t you say?

Yeah, and we saw that. Linear regression, there’s some pretty clear graphics that you can look at, that residual versus fitted scatterplot and stuff like that.

They don’t work, right?

They don’t work in logistic regression.

Well, those parallel curves that you showed were very revealing.

Yeah, it just shows that binary nature, and it’s really hard to get at questions like, does this model fit the data well as far as assumptions go? There are, like you said, some fancy techniques to look at that, but they are pretty fancy, and I wouldn’t just run them blindly. That’s for sure.

Sure.

That’s for sure. And again, you might have to talk to a statistician, somebody who’s done this many, many times, to really get a sense of, is this the appropriate model to use to predict this binary outcome? It will always work fairly well, but–

And that doesn’t even get into, like in linear regression, when you were showing those curved residual plots, there’s really nothing clearly analogous to do in logistic regression.

Right.

So even detecting the curvature or outliers or anything like that, it’s a very, very hard problem.

Yeah, if I want to incorporate a quadratic term for a predictor in logistic regression–

Yeah, what do you look for?

I’m never going to be able to decide that just by looking at a graph. That’s for sure.

Let’s not turn them off, though. Let’s say logistic regression is a very powerful technique, very useful, and absolutely you want to use it if you have a binary response.

And we use it all the time.

We use it all the time.

Not to pooh-pooh the use of it, it’s something that we use all the time in our–

And trust.

–work. And trust, yeah, we know how it works.

Again, always consult a statistician if you get stuck or if you have questions about how to interpret it optimally.

Absolutely.

Great.