Panel data and other advanced techniques

Regression analysis

Quadratic effects

A linear regression model was fit to predict how much money a customer would spend at an online retailer (in CAN$) based on the amount of time they were browsing the website (ranging between 1 and 100 minutes) along with the quadratic term for time. The coefficients table from the R output is:

	Estimate	SE	t-value	Pr(>|t|)
Intercept	6.902146	0.900384	7.666	9.41e-14	***
time	1.598175	0.122716	13.023	<2e-16	***
I(time^2)	-0.008558	0.002852	-3.001	0.00282	**
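To read the table, it can help to plug the estimates into the fitted equation. The sketch below uses only the coefficients reported above. The helper names `predict_spend` and `marginal_effect`, and the example visit of 10 minutes, are illustrative assumptions and not part of the original output.

```r
# Coefficient estimates copied from the table above
b0 <- 6.902146   # Intercept
b1 <- 1.598175   # time
b2 <- -0.008558  # I(time^2)

# Hypothetical helper: predicted spending (CAN$) after `time` minutes
predict_spend <- function(time) b0 + b1 * time + b2 * time^2

# Hypothetical helper: slope of the fitted curve at `time`, b1 + 2 * b2 * time
marginal_effect <- function(time) b1 + 2 * b2 * time

# Predicted spending for an illustrative 10-minute visit
spend_10 <- predict_spend(10)

# Time at which the fitted parabola peaks: -b1 / (2 * b2)
peak_time <- -b1 / (2 * b2)

print(c(spend_10 = spend_10, peak_time = peak_time))
```

Because the quadratic coefficient is negative, each extra minute adds less to predicted spending than the one before. The fitted curve peaks at roughly 93 minutes, which is inside the observed 1 to 100 minute range.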

Binary & categorical independent variables

The mean selling price of a sample of condos in Montreal was calculated to be $740,000, while the mean selling price of single family homes was calculated to be $975,000. Suppose a regression model was fit to predict the selling price of a home from a binary predictor for whether it is a condo ( x = 1 represents the condominium group).
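With a single binary predictor, the least-squares fit reproduces the two group means, so the coefficients follow directly from the figures above. As a worked sketch (the symbols \(b_0\) and \(b_1\) for the estimated intercept and slope are introduced here for illustration), the intercept is the mean of the x = 0 group (single family homes) and the slope is the difference between the condo mean and that baseline:

\[b_0 = 975{,}000, \qquad b_1 = 740{,}000 - 975{,}000 = -235{,}000\]

\[\hat y = 975{,}000 - 235{,}000\, x\]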

Interactions

A regression model was fit to predict the selling price of condos and single family homes in Montreal from $x_1 =$ house size, $x_2 =$ a binary independent variable for whether a home is a single family home ($x_2 = 1$ for single family homes), and the interaction between the two. The estimated regression model is given below:

\[\hat y = 428 + 0.286 x_1 + 104 x_2 - 0.140 (x_1 \times x_2)\]
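One way to interpret an interaction model is to write out the fitted line for each group separately. Setting \(x_2 = 0\) gives the line for condos, and setting \(x_2 = 1\) folds the extra terms into the intercept and slope for single family homes (a worked sketch using only the coefficients above):

\[\text{Condos } (x_2 = 0): \quad \hat y = 428 + 0.286 x_1\]

\[\text{Single family } (x_2 = 1): \quad \hat y = (428 + 104) + (0.286 - 0.140) x_1 = 532 + 0.146 x_1\]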

The regression model from the previous part is repeated here. It predicts the selling price of condominiums and single family homes in Montreal from $x_1 =$ house size, $x_2 =$ a binary predictor for whether a home is a single family home ($x_2 = 1$ for single family homes), and the interaction between the two. The estimated regression model is given below:

\[\hat y = 428 + 0.286 x_1 + 104 x_2 - 0.140(x_1 \times x_2)\]
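The fitted equation can also be evaluated numerically to see how the gap between the two home types changes with size. This is a minimal sketch: the function name `predict_price` and the house sizes are illustrative assumptions, and the source does not state the units of price or size, so the numbers are in whatever units the model uses.

```r
# Hypothetical helper wrapping the fitted equation above
predict_price <- function(size, single_family) {
  428 + 0.286 * size + 104 * single_family - 0.140 * size * single_family
}

sizes <- c(500, 1000, 1500)        # illustrative house sizes (units not given)
condo <- predict_price(sizes, 0)   # x2 = 0
house <- predict_price(sizes, 1)   # x2 = 1

# Single family minus condo: equals 104 - 0.140 * size
gap <- house - condo

# Size at which the two fitted lines cross (gap = 0)
crossover <- 104 / 0.140

print(data.frame(size = sizes, condo = condo, house = house, gap = gap))
print(crossover)
```

The gap shrinks as size grows because the interaction coefficient is negative, so the coefficient 104 on its own describes the difference between the groups only at a size of zero.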

Comparing models

Two regression models are to be considered:

Model 1: $y = \beta_0 + \beta_1 x_1$

Model 2: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
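Here Model 1 is nested in Model 2 (it is Model 2 with $\beta_2 = \beta_3 = 0$), so one common way to compare them is a partial F-test. The sketch below uses simulated data, since the source provides none; the data frame `d`, its coefficients, and the seed are all assumptions made for illustration.

```r
set.seed(1)  # simulated data, purely illustrative
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x2 + rnorm(n)

m1 <- lm(y ~ x1, data = d)            # Model 1
m2 <- lm(y ~ x1 + x2 + x3, data = d)  # Model 2

# Partial F-test of H0: beta_2 = beta_3 = 0
f_test <- anova(m1, m2)
print(f_test)
```

A small p-value in the second row would suggest that $x_2$ and $x_3$ together improve the fit beyond what $x_1$ alone explains.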

Two regression models are to be considered:

Model 1: $y = \beta_0 + \beta_1 x_1$

Model 2: $y = \beta_0 + \beta_1 x_2 + \beta_2 x_3$
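In this pair neither model is a special case of the other, so the partial F-test does not apply. A possible alternative is to compare criteria such as AIC or adjusted $R^2$. The sketch below again uses simulated data; the data frame `d`, the true coefficients, and the seed are assumptions for illustration only.

```r
set.seed(2)  # simulated data, purely illustrative
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)

m1 <- lm(y ~ x1, data = d)       # Model 1
m2 <- lm(y ~ x2 + x3, data = d)  # Model 2 (not nested with Model 1)

# Lower AIC is preferred; higher adjusted R-squared is preferred
aic_table <- AIC(m1, m2)
adj_r2 <- c(m1 = summary(m1)$adj.r.squared,
            m2 = summary(m2)$adj.r.squared)

print(aic_table)
print(adj_r2)
```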

Automatic model selection
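As one illustration of automatic selection, base R's `step()` function searches over models by adding or dropping terms according to AIC. The sketch below runs backward elimination on simulated data; the data, the seed, and the choice of backward search are assumptions, not something specified in the source.

```r
set.seed(3)  # simulated data, purely illustrative
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)

# Start from the full model and drop terms while AIC improves
full <- lm(y ~ x1 + x2 + x3, data = d)
selected <- step(full, direction = "backward", trace = 0)

print(formula(selected))
```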

Model diagnostics

You’d like to determine whether the normal distribution assumption is reasonable for a simple linear regression model.

Let us fit a linear regression:

# Load the real-estate sample and regress price on year
re <- read.csv("Real_Estate_Sample.csv")
lm1 <- lm(Price ~ year, data = re)

The following histogram was produced after fitting the previous simple linear regression:

hist(residuals(lm1), col = "gray")
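A histogram can be complemented by a normal Q-Q plot and a formal test of the residuals. The sketch below uses simulated data so that it runs without the CSV file; the variables `x`, `y`, `fit`, and the seed are assumptions for illustration.

```r
set.seed(4)  # simulated data, purely illustrative
x <- runif(200, 0, 10)
y <- 3 + 2 * x + rnorm(200)
fit <- lm(y ~ x)
res <- residuals(fit)

# Points close to the reference line are consistent with normal errors
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
print(sw)
```

With the model fitted above, the same calls would be applied to `residuals(lm1)`.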

The following boxplot was produced after fitting a simple linear regression of Y on X:

boxplot(residuals(lm1), col = "gray")

The following residual-versus-predicted scatterplot was produced after fitting a simple linear regression of Y on X:

# Residuals against fitted values, with a reference line at zero
plot(residuals(lm1) ~ fitted(lm1), cex = 0.7)
abline(h = 0, lwd = 2)
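To show what a violation can look like in this kind of plot, the sketch below simulates data whose error spread grows with the predictor, which produces a fan shape. The data, the seed, and the correlation between absolute residuals and fitted values are illustrative assumptions; that correlation is only a rough numeric companion to the plot, not a formal test.

```r
set.seed(5)  # simulated data, purely illustrative
x <- runif(200, 1, 10)
y <- 3 + 2 * x + rnorm(200, sd = 0.5 * x)  # error spread grows with x
fit <- lm(y ~ x)

# Fan-shaped scatter suggests non-constant variance
plot(residuals(fit) ~ fitted(fit), cex = 0.7)
abline(h = 0, lwd = 2)

# Rough check: do absolute residuals grow with the fitted values?
spread_cor <- cor(abs(residuals(fit)), fitted(fit))
print(spread_cor)
```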