## The Stock Market Data

We will begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR package. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005.

For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date).

library(ISLR)
names(Smarket)
dim(Smarket)
summary(Smarket)
pairs(Smarket)

The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative.

cor(Smarket)
cor(Smarket[,-9])

As one would expect, the correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume. By plotting the data we see that Volume is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005.

attach(Smarket)
plot(Volume)

## Logistic Regression

Next, we will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax generalized of the glm() function is similar to that of lm(), except that we must pass in linear model the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Smarket, family=binomial)
summary(glm.fits)

The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.

We use the coef() function in order to access just the coefficients for this fitted model. We can also use the summary() function to access particular aspects of the fitted model, such as the p-values for the coefficients.

coef(glm.fits)
summary(glm.fits)$coef summary(glm.fits)$coef[,4]

The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type="response" option tells R to output probabilities of the form $$P(Y = 1|X)$$, as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model.

Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.

glm.probs = predict(glm.fits, type = "response")
glm.probs[1:10]

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down. The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5.

glm.pred = rep("Down", 1250)
glm.pred[glm.probs > .5] = "Up"

The first command creates a vector of 1,250 Down elements. The second line transforms to Up all of the elements for which the predicted probability of a market increase exceeds 0.5. Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

table(glm.pred ,Direction )
(507 + 145) /1250
mean(glm.pred==Direction )

The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. Hence our model correctly predicted that the market would go up on 507 days and that it would go down on 145 days, for a total of $$507 + 145 = 652$$ correct predictions. The mean() function can be used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market 52.2% of the time.

At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of 1,250 observations. In other words, $$100 − 52.2 = 47.8%$$ is the training error rate. As we have seen previously, the training error rate is often overly optimistic—it tends to underestimate the test error rate. In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model’s performance not on the data that we used to fit the model, but rather on days in the future for which the market’s movements are unknown.

To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.

train = (Year < 2005)
Smarket.2005 = Smarket [!train,]
dim(Smarket.2005)
Direction.2005 = Direction [!train]

The object train is a vector of 1,250 elements, corresponding to the observations in our data set. The elements of the vector that correspond to observations that occurred before 2005 are set to TRUE, whereas those that correspond to observations in 2005 are set to FALSE. The object train is a Boolean vector, since its elements are TRUE and FALSE. Boolean vectors can be used to obtain a subset of the rows or columns of a matrix. For instance, the command Smarket[train,] would pick out a submatrix of the stock market data set, corresponding only to the dates before 2005, since those are the ones for which the elements of train are TRUE. The ! symbol can be used to reverse all of the elements of a Boolean vector. That is, !train is a vector similar to train, except that the elements that are TRUE in train get swapped to FALSE in !train, and the elements that are FALSE in train get swapped to TRUE in !train. Therefore, Smarket[!train,] yields a submatrix of the stock market data containing only the observations for which train is FALSE-that is, the observations with dates in 2005. The output above indicates that there are 252 such observations.

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set—that is, for the days in 2005.

glm.fits = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Smarket, family=binomial, subset=train)
glm.probs = predict(glm.fits, Smarket.2005, type = "response")

Notice that we have trained and tested our model on two completely separate data sets: training was performed using only the dates before 2005, and testing was performed using only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period.

glm.pred=rep("Down",252)
glm.pred[glm.probs >.5]= "Up"
table(glm.pred, Direction.2005)
mean(glm.pred == Direction.2005)
mean(glm.pred != Direction.2005)

The != notation means not equal to, and so the last command computes the test set error rate. The results are rather disappointing: the test error rate is 52%, which is worse than random guessing! Of course this result is not all that surprising, given that one would not generally expect to be able to use previous days’ returns to predict future market performance. (After all, if it were possible to do so, then the authors of this book would be out striking it rich rather than writing a statistics textbook.)

We recall that the logistic regression model had very underwhelming pvalues associated with all of the predictors, and that the smallest p-value, though not very small, corresponded to Lag1. Perhaps by removing the variables that appear not to be helpful in predicting Direction, we can obtain a more effective model. After all, using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), and so removing such predictors may in turn yield an improvement. Below we have refit the logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model.

glm.fits = glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial, subset = train)
glm.probs = predict (glm.fits, Smarket.2005, type = "response")
glm.pred = rep("Down",252)
glm.pred[glm.probs >.5] = "Up"
table(glm.pred, Direction.2005)
mean(glm.pred == Direction.2005)
106/(106+76)

Now the results appear to be a little better: 56% of the daily movements have been correctly predicted. It is worth noting that in this case, a much simpler strategy of predicting that the market will increase every day will also be correct 56% of the time! Hence, in terms of overall error rate, the logistic regression method is no better than the naïve approach. However, the confusion matrix shows that on days when logistic regression predicts an increase in the market, it has a 58% accuracy rate. This suggests a possible trading strategy of buying on days when the model predicts an increasing market, and avoiding trades on days when a decrease is predicted. Of course one would need to investigate more carefully whether this small improvement was real or just due to random chance.

Suppose that we want to predict the returns associated with particular values of Lag1 and Lag2. In particular, we want to predict Direction on a day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and −0.8. We do this using the predict() function.

predict (glm.fits,newdata = data.frame(Lag1=c(1.2 ,1.5), Lag2=c(1.1,-0.8) ), type = "response")

## K-Nearest Neighbors

We will now perform KNN. We need:

1. A matrix containing the predictors associated with the training data, labeled train below.

2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test below.

3. A vector containing the class labels for the training observations, labeled Direction below.

4. A value for K, the number of nearest neighbors to be used by the classifier (will be determined automatically).

Step 1. Let us pre-process our data for the KNN classification

We have a labeled dataset here about the stock market. A data frame with 1250 observations on the following 9 variables.

Variable Description
$$Year$$ The year that the observation was recorded
$$Lag_1$$ Percentage return for previous day
$$Lag_2$$ Percentage return for 2 days previous
$$Lag_n$$ Percentage return for $$n$$ days previous
$$Direction$$ A factor with levels Down and Up indicating whether the market had a positive or negative return on a given day

Pre-processing means that we will create the train and the test datasets (they will be called train and test). We also need to consider a variable with the labeled value, here "Direction" (it is a factor with the "up" and "down" values).

We will use the caret package on top of the usual suspects.

library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)
head(train)
head(test)

Step 2. Let us proceed to the classification for the train dataset

The model we want to estimate is the importance of the Lag1 and Lag2 variables as features to be able to classify with good enough accuracy the Direction label.

$Direction ~ Lag1 + Lag2$

Of course, we could also have a more comprehensive model with other variables (e.g., all the variables in the dataset: $$Direction ~ .$$).

library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)
library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)

modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train,
method = "knn",
tuneGrid = expand.grid(k = seq(1, 101, by = 2)),
preProcess = c("center", "scale"))
modelknn

Let us plot our model:

library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)

modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train,
method = "knn",
tuneGrid = expand.grid(k = seq(1, 101, by = 2)),
preProcess = c("center", "scale"))
library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)

modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train,
method = "knn",
tuneGrid = expand.grid(k = seq(1, 101, by = 2)),
preProcess = c("center", "scale"))

plot(modelknn)

What is the best $$K$$? Well, you can find it by looking into modelknn$bestTune. library(tidyverse) library(ISLR) library(caret) # Creating a train and a test dataset train <- Smarket %>% filter(Year < 2005) test <- Smarket %>% filter(Year == 2005) modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train, method = "knn", tuneGrid = expand.grid(k = seq(1, 101, by = 2)), preProcess = c("center", "scale")) library(tidyverse) library(ISLR) library(caret) # Creating a train and a test dataset train <- Smarket %>% filter(Year < 2005) test <- Smarket %>% filter(Year == 2005) modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train, method = "knn", tuneGrid = expand.grid(k = seq(1, 101, by = 2)), preProcess = c("center", "scale")) modelknn$bestTune

Let us see the importance of you selected variables with the varImp() function:

library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)

modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train,
method = "knn",
tuneGrid = expand.grid(k = seq(1, 101, by = 2)),
preProcess = c("center", "scale"))
library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)

modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train,
method = "knn",
tuneGrid = expand.grid(k = seq(1, 101, by = 2)),
preProcess = c("center", "scale"))

varImp(modelknn)

Step 3. Model performance evaluation with the test dataset

Now that we have the best selection for our parameters ($$k$$) in terms of the "accuracy" performance measure of our model, let us predict on our test dataset. So, what it means is that the values of lag1 and lag2 through time in the test dataset will help determine whether the market is up or down.

Let us predict the Direction, and see what it does predict indeed.

library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)

modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train,
method = "knn",
tuneGrid = expand.grid(k = seq(1, 101, by = 2)),
preProcess = c("center", "scale"))
library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)

modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train,
method = "knn",
tuneGrid = expand.grid(k = seq(1, 101, by = 2)),
preProcess = c("center", "scale"))

knnPredict <- predict(modelknn, newdata = test)

df_knnPredict <- data.frame(knnPredict)
df_knnPredict <- bind_cols(test, df_knnPredict)
head(df_knnPredict)

Since we have the actual labels with in the test$Direction variable for the test dataset, we will be able to check the performance of the prediction on the test dataset. Let's produce the confusion matrix based on the optimal parameters from the knnfit with the new dataset called test_final: library(tidyverse) library(ISLR) library(caret) # Creating a train and a test dataset train <- Smarket %>% filter(Year < 2005) test <- Smarket %>% filter(Year == 2005) modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train, method = "knn", tuneGrid = expand.grid(k = seq(1, 101, by = 2)), preProcess = c("center", "scale")) knnPredict <- predict(modelknn, newdata = test) #Get the confusion matrix to see accuracy value and other parameter values library(tidyverse) library(ISLR) library(caret) # Creating a train and a test dataset train <- Smarket %>% filter(Year < 2005) test <- Smarket %>% filter(Year == 2005) modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train, method = "knn", tuneGrid = expand.grid(k = seq(1, 101, by = 2)), preProcess = c("center", "scale")) knnPredict <- predict(modelknn, newdata = test) #Get the confusion matrix to see accuracy value and other parameter values confusionMatrix(knnPredict, test$Direction)

Step 4. Point Estimate Prediction

Now, you can use your own features... Just to be sure, let us use the first observation of the data frame.

library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)

modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train,
method = "knn",
tuneGrid = expand.grid(k = seq(1, 101, by = 2)),
preProcess = c("center", "scale"))

knnPredict <- predict(modelknn, newdata = test)

my_own_set <- test %>% slice(1)
library(tidyverse)
library(ISLR)
library(caret)
# Creating a train and a test dataset
train <- Smarket %>% filter(Year < 2005)
test <- Smarket %>% filter(Year == 2005)

modelknn <- train(form = Direction ~ Lag1 + Lag2, data = train,
method = "knn",
tuneGrid = expand.grid(k = seq(1, 101, by = 2)),
preProcess = c("center", "scale"))

knnPredict <- predict(modelknn, newdata = test)

my_own_set <- test %>% slice(1)

knnPredict2 <- predict(modelknn, newdata = my_own_set)

df_knnPredict2 <- data.frame(knnPredict2)
head(df_knnPredict2)

TL;DR

Try now with other independent variables... See how the results will change.

## Principal Component Analysis

You are asked to find what are the actual "dimensions" in your sample (your dataset). It can help assess the validity of your sample if you see that the way you collected your data is suddenly more towards some information. It can help also do some dimension reduction and optimize your model. All in all, you see the references to inferential statistics, the Central Limit Theorem, the potential issues with sampling, etc.

To be able to easily do a PCA (yes, I promise this is the easiest way for what we aim at), we will need these packages on top of the usual suspects (tidiverse, etc.):

# Installing and loading the packages

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

if(!require(FactoMineR)) install.packages("FactoMineR")

Step 1. Data collection

We need to get the data from our datalake here: https://www.warin.ca/datalake/courses_data/qmibr/session9/movies_metadata.csv

library(tidyverse)
movies <- readr::read_csv("https://www.warin.ca/datalake/courses_data/qmibr/session9/movies_metadata.csv")
library(tidyverse)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
head(df)

Step 2. Finding the dimensions

Let us compute the dimensions and do our screeplot to decide which dimensions are the most relevant ones to our research question.

Computing can take a while, be patient please ;-).

library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
res.pca <- PCA(df,  graph = FALSE)

Remember these dimensions are the results of the so-called eigenvalues. Let us compute and represent these eigenvalues/variances. By the way, now you should know how to debug the code if something does not work as intended.

library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
# Extract eigenvalues/variances
get_eig(res.pca)
library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
# Visualize eigenvalues/variances
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))

Step 3. What contributes to these dimensions?

Here we want to determine which variables you have in your dataset that contribute the most to the dimensions.

library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
# Extract the results for variables
var <- get_pca_var(res.pca)
var

# Coordinates of variables
head(var$coord) # Contribution of variables head(var$contrib)

Let us now graph these results.

library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
# Control variable colors using their contributions
fviz_pca_var(res.pca, col.var="contrib",
repel = TRUE # Avoid text overlapping
)

Let us graph the contributions of variables to PC1:

library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
res.pca <- PCA(df,  graph = FALSE)
library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
res.pca <- PCA(df,  graph = FALSE)

# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)

Let us graph the contributions of variables to PC2:

library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
res.pca <- PCA(df,  graph = FALSE)
library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
res.pca <- PCA(df,  graph = FALSE)

# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

We can also be curious and extract the results for the individual observations.

library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
res.pca <- PCA(df,  graph = FALSE)

# Extract the results for individuals
ind <- get_pca_ind(res.pca)
ind
## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description
## 1 "$coord" "Coordinates for the individuals" ## 2 "$cos2"    "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals" # Coordinates of individuals head(ind$coord)
##        Dim.1      Dim.2       Dim.3      Dim.4      Dim.5
## 1 10.7468471 -1.3765037 -1.49270196 -0.1115660 -6.2270208
## 2  7.3279480 -0.7093316 -0.33807210 -0.5969450 -0.7600249
## 3  0.4544490  0.5957983 -0.35789264  1.3256214  0.1630035
## 4  1.0356193  0.5232758  0.52046056 -0.4126039  0.3728671
## 5  0.9074839  0.1562641  0.08868175  0.6146029 -0.2909026
## 6  6.3403964  1.0646766  0.62588178 -0.0882128 -0.1072013

Let us know graph the individuals on the two principal components. As it is computing intensive, do this on your own machine.

library(tidyverse)
library(FactoMineR)
library(factoextra)
df <- movies %>%
dplyr::select(budget, popularity, revenue, runtime, vote_average, vote_count)
res.pca <- PCA(df,  graph = FALSE)
fviz_pca_ind(res.pca, col.ind = "cos2",
repel = TRUE # Avoid text overlapping (slow if many points)
)

Step 4. Final output

Let us graph the principal components with the variables and the individuals contributing to these dimensions.

library(tidyverse)
library(FactoMineR)
library(factoextra)
fviz_pca_biplot(res.pca, repel = TRUE)