
Facebook World

The goal of this session is to predict the success of a branding campaign on social media using logistic regression. We will use information about posts from previous campaigns to model Facebook comment activity and help inform ad placement decisions.

An analysis of this data set was first presented in Moro, S., Rita, P., & Vala, B. (2016). Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9), 3341-3351.

A pre-print of the article is available online.

Part 1: Exploring the data

For this exercise, you'll work with the dataset facebook.csv stored here: https://www.warin.ca/datalake/courses_data/qmibr/session10/facebook.csv.

The dataset facebook.csv contains one row for each unique ad post. The following variables were recorded:

Page.total.likes: Number of people who indicated that they liked the company’s page
Category: Type of ad, one of Action (special offers and contests), Product (direct advertisement, explicit brand content), or Inspiration (non-explicit brand-related content)
Type: Type of content (Link, Photo, Status, Video)
Post.Month: Month the post was published (1 = January, 2 = February, ..., 12 = December)
Post.Hour: Hour the post was published (0, 1, 2, ..., 23)
Post.Weekday: Weekday the post was published (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
Paid: Whether the company paid Facebook for this advertisement (0 = no, 1 = yes)
comment: 0 if there were no comments on the post and 1 if there was at least one comment

Read the file into a data frame called facebook and show the first 6 rows. The main goal of this exercise is to learn which predictor variables are important for modeling the probability of at least one comment made on a post with an ad.

facebook = readr::read_csv("https://www.warin.ca/datalake/courses_data/qmibr/session10/facebook.csv")
head(facebook)

Question Set 1

  1. How many total posts are in the data set?
nrow(facebook)
  2. How many ad posts of each Category (Action, Product, Inspiration) are in the data set?
table(facebook$Category)
  3. Examine side-by-side boxplots of the number of people who liked the company’s page by whether a comment was posted. Describe the difference in the distribution of the number of likes.
boxplot(Page.total.likes ~ comment, data = facebook, col = "red", xlab = "Comment", ylab = "Total Likes")
  4. Run the command prop.table(table(facebook$comment, facebook$Category), margin=2) to see the proportion of comments within each ad category. What can you say about the proportion of comments by ad category?
prop.table(table(facebook$comment, facebook$Category), margin = 2)
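
If the margin argument is new to you, here is a minimal sketch with made-up counts: margin = 1 normalizes each row and margin = 2 normalizes each column, so margin = 2 above gives the proportion of commented posts within each category.

# Toy contingency table with made-up counts (not from facebook.csv),
# just to illustrate what the margin argument does
m <- matrix(c(10, 30, 20, 40), nrow = 2,
            dimnames = list(comment = c("0", "1"), Category = c("A", "B")))
prop.table(m, margin = 1)  # each row sums to 1
prop.table(m, margin = 2)  # each column sums to 1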

Part 2: Paying for attention

Does paying for the ad correspond to an increase in the probability of receiving at least one comment? We explore this question by examining the relationship between Paid and comment.

Question Set 2

  1. What is the distribution of the binary comment variable? What fraction of posts received at least one comment? Use the prop.table() function to answer this latter question.
table(facebook$comment)
prop.table(table(facebook$comment))
  2. What fraction of posts were paid? Again, use the table command.
table(facebook$Paid)
prop.table(table(facebook$Paid))
  3. As in Part 1, run the command prop.table(table(facebook$comment, facebook$Paid), margin = 2) to see the fraction of comments posted to ads according to whether they were paid. What fraction of paid ads received comments? What fraction of unpaid ads received comments?
table(facebook$comment, facebook$Paid)
prop.table(table(facebook$comment, facebook$Paid), margin = 2)
  4. How might you perform a hypothesis test to check whether the fractions are significantly different?
prop.test(table(facebook$Paid, facebook$comment), correct = FALSE)
# The same test with the counts entered by hand (read off the tables above)
prop.test(c(274, 120), c(360, 140), correct = FALSE)

Part 3: Simple logistic regression

The analysis in Part 2 can be cast as a logistic regression. We explore how logistic regression can be used to model the probability of a post receiving a comment based on whether the ad was paid versus unpaid.
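
Concretely, logistic regression models the log-odds of receiving a comment, log(p / (1 - p)), as a linear function of the predictors. As a minimal illustration with an arbitrary probability (not output from this data), R's qlogis() and plogis() compute the logit link and its inverse:

# logit and inverse logit: the link function used by logistic regression
qlogis(0.75)          # log-odds corresponding to a probability of 0.75
plogis(qlogis(0.75))  # mapping the log-odds back gives 0.75 again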

Question Set 3

  1. Run a simple logistic regression predicting comment from the Paid predictor. Interpret the sign of the coefficient. Is whether the ad was paid a significant predictor? How does this compare with the last question in the previous question set?
facebook1.glm <- glm(comment ~ Paid, family = binomial, data = facebook)
summary(facebook1.glm)
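
A common aid to interpretation, not required by the question, is to exponentiate the Paid coefficient to get an odds ratio, and to recover the fitted probability for a paid ad by applying the inverse logit to the linear predictor:

# Odds ratio: multiplicative change in the odds of a comment for a paid ad
exp(coef(facebook1.glm)["Paid"])
# Fitted probability for a paid ad, computed by hand from the coefficients
plogis(coef(facebook1.glm)["(Intercept)"] + coef(facebook1.glm)["Paid"])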
  2. Using the predict() function, predict the probability of a post receiving a comment for a paid ad. Do the same for an unpaid ad. How do these probability predictions compare to the fractions you computed in the previous question set? What do you make of these comparisons?
p <- predict(facebook1.glm, newdata = data.frame(Paid = 1), type = "response", se.fit = TRUE)
out <- c(p$fit, p$fit - 2 * p$se.fit, p$fit + 2 * p$se.fit)
names(out) <- c("Fit", "Lower", "Upper")
out

p2 <- predict(facebook1.glm, newdata = data.frame(Paid = 0), type = "response", se.fit = TRUE)
out2 <- c(p2$fit, p2$fit - 2 * p2$se.fit, p2$fit + 2 * p2$se.fit)
names(out2) <- c("Fit", "Lower", "Upper")
out2
  3. Run a simple logistic regression of comment on Post.Hour. Using the confint() function, obtain a 95% confidence interval for the coefficient of Post.Hour. Interpret this interval.
facebook2.glm <- glm(comment ~ Post.Hour, family = binomial, data = facebook)
summary(facebook2.glm)
c.ph <- confint(facebook2.glm, "Post.Hour")
c.ph
summary(facebook$Post.Hour)
plot(facebook$Post.Hour, facebook$comment)
  4. Why might the logistic regression analysis including Post.Hour as a quantitative predictor variable be problematic?
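
One way to probe this question, going slightly beyond the original exercise, is to let each hour have its own effect and test whether that fits significantly better than a single linear trend in Post.Hour:

# Alternative model: one coefficient per hour instead of a linear trend
facebook2b.glm <- glm(comment ~ factor(Post.Hour), family = binomial, data = facebook)
# The linear-in-hour model is nested in the factor model, so a
# likelihood-ratio test compares the two fits
anova(facebook2.glm, facebook2b.glm, test = "Chisq")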

Part 4: Multiple logistic regression

We can expect to obtain better and more precise predictions for whether a post received a comment by including more predictor variables.

Question Set 4

  1. Convert the variables Post.Month, Post.Weekday and Post.Hour into categorical variables (factors) in the data frame. Call these new variables Post.Month.factor, Post.Weekday.factor, and Post.Hour.factor. You can use the factor() function to do so.

For example, if facebook is the name of the data frame, you can run the following code to create Post.Month.factor as the factor version of Post.Month:

facebook$Post.Month.factor = factor(facebook$Post.Month, levels = 1:12, labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

# Check which hours actually occur in the data before setting factor levels
table(facebook$Post.Hour)
facebook$Post.Weekday.factor = factor(facebook$Post.Weekday, levels = 1:7, labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
facebook$Post.Hour.factor = factor(facebook$Post.Hour, levels = 0:23)
head(facebook)

Given that these variables will be used as predictors in a logistic regression, why does it make sense to include them as categorical variables?
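
To see what the conversion does inside the regression, you can inspect the design matrix that glm() builds: each factor is expanded into 0/1 indicator (dummy) columns, one for every level beyond the reference level.

# Dummy coding of the month factor: Jan is the reference level, so
# eleven indicator columns are created for Feb through Dec
head(model.matrix(~ Post.Month.factor, data = facebook))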

  2. Run a multiple logistic regression with the predictors Page.total.likes, Type, Category, Post.Month.factor, Post.Weekday.factor, Post.Hour.factor, and Paid. How does the coefficient for Paid correspond to the value in the simple logistic regression? What about the p-value for its significance?
facebook3.glm <- glm(comment ~ Page.total.likes + Type + Category + Post.Month.factor + Post.Weekday.factor + Post.Hour.factor + Paid, family = binomial, data = facebook)
summary(facebook3.glm)
  3. Perform a chi-squared test for assessing whether Post.Weekday.factor, Post.Month.factor and Post.Hour.factor are simultaneously significant predictors of the probability of a post receiving a comment.
# Null model: all predictors except the three posting-time factors
facebook4.glm <- glm(comment ~ Paid + Page.total.likes + Type + Category, data = facebook, family = binomial)
summary(facebook4.glm)

# Chi-squared (likelihood-ratio) test of the three factors, given the other predictors
anova(facebook4.glm, facebook3.glm, test = "Chisq")

# An alternative pair of nested models: Paid alone versus Paid plus the three factors
facebook5.glm <- glm(comment ~ Post.Month.factor + Post.Weekday.factor + Post.Hour.factor + Paid, family = binomial, data = facebook)
anova(facebook1.glm, facebook5.glm, test = "Chisq")

Hint: The model in question 2 is the alternative hypothesis, but what is the model in the null hypothesis? Based on your analysis, should we include these three variables?
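
For reference, the chi-squared statistic that anova() reports can be reproduced by hand from the two model deviances; this is the same computation, just spelled out:

# Likelihood-ratio statistic: drop in deviance from the null to the full model
lr_stat <- deviance(facebook4.glm) - deviance(facebook3.glm)
lr_df <- df.residual(facebook4.glm) - df.residual(facebook3.glm)
pchisq(lr_stat, df = lr_df, lower.tail = FALSE)  # p-value of the test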

Part 5: Model selection and prediction

The model fit in Part 4 may contain variables that are not predictive of the probability of leaving a comment. We examine here whether we can reduce the number of predictors prior to making predictions.

Question Set 5

  1. Perform backward stepwise model selection on the full model you fit in the previous question. Which predictor variables remain in the final model?
facebook.glm.reduced <- step(facebook3.glm, direction = "backward")
summary(facebook.glm.reduced)
  2. How does the significance of Paid in the reduced model compare to that in the model with all of the predictors? What do you make of the difference?

  3. It appears that none of the individual Post.Month.factor month variables are statistically significant, yet Post.Month.factor was retained as a predictive variable. Why might this be?
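
A helpful check here, a sketch beyond the original solution, is drop1(), which tests each term as a whole rather than level by level; a factor can be jointly significant even when no single level stands out:

# Whole-term likelihood-ratio tests for each predictor left in the reduced model
drop1(facebook.glm.reduced, test = "Chisq")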

  4. Construct a 95% confidence interval for the probability of a comment on a post where the ad was a photograph, the category was a product, the month was May, and the ad was unpaid. How do you interpret this interval? Does the width of the interval seem large?

newdata <- data.frame(Type = "Photo", Category = "Product", Post.Month.factor = "May", Paid = 0)
p <- predict(facebook.glm.reduced, newdata, type = "response", se.fit = TRUE)
out <- c(p$fit, p$fit - 2 * p$se.fit, p$fit + 2 * p$se.fit)
names(out) <- c("Fit", "Lower", "Upper")
out
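
The interval above is computed on the probability scale, where the fit plus or minus two standard errors can fall outside [0, 1]. A slightly more careful variant, sketched here with the same model and newdata, builds the interval on the log-odds scale and then transforms the endpoints with plogis():

# Confidence interval on the link (log-odds) scale, then mapped to probabilities
p.link <- predict(facebook.glm.reduced, newdata, type = "link", se.fit = TRUE)
ci.link <- p.link$fit + c(Fit = 0, Lower = -2, Upper = 2) * p.link$se.fit
plogis(ci.link)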

LinkedIn World

Follow this link: https://www.linkedin.com/psettings/member-data

There you will find a section labeled Get a copy of your data. Select the second option and, within it, the Connections option. After you click Request archive and wait a few minutes, you should receive an email from which you can download the data.

After the download, rename the file to Connections.csv.

When you open Connections.csv, you will see that LinkedIn provides the following data fields:

  • First Name,
  • Last Name,
  • Email Address,
  • Company,
  • Position,
  • Connected On

Part 1: Data Cleansing

The data cleansing consists of the following steps:

  1. Read the file Connections.csv into a data frame called li_cons. Use the read_csv() function from the readr package.
li_cons <- readr::read_csv("./data/Connections.csv", skip = 2) # skip = 2 since there is some text above the data
  2. Get rid of the data fields that we will not need, i.e. the name and email information. Use the select() function from the dplyr package.
library(dplyr)
li_cons <- li_cons %>% 
  select("Company", "Position", "Connected On")
  3. Expand the date information. First, transform the Connected On variable into a date format in a new variable called connectedOn, using the lubridate package. Then, using this new variable connectedOn, create 4 new variables:
    1. Year: connectedOnYear
    2. Year Quarter: connectedOnYearQuarter
    3. Year Month: connectedOnYearMonth
    4. Day of week (DOW): connectedOnDOW
# Fix the date (change type from <chr> to <date>)
li_cons$connectedOn <- lubridate::dmy(li_cons$`Connected On`)

# Generate columns for year, quarter, month, and day of the week
li_cons <- li_cons %>% 
  mutate(
    connectedOnYear = lubridate::year(connectedOn),
    connectedOnYearQuarter = lubridate::quarter(connectedOn, with_year = TRUE),
    connectedOnYearMonth = lubridate::format_ISO8601(connectedOn, precision = "ym"),
    connectedOnDOW = lubridate::wday(connectedOn, label = TRUE, abbr = TRUE)
  )
  4. Write the cleaned data into a new csv file called cleansed_Connections.csv. Use the write_csv() function from the readr package.
readr::write_csv(li_cons, "./cleansed_Connections.csv")

Part 2: Data Wrangling

Company information is provided by LinkedIn users themselves. Hence, it should be no surprise that this information is not necessarily consistent or ready to use for immediate analysis. One person might write Google, another Google Inc. or Google LLC.

Let's count the number of occurrences of each company by using the dplyr package.

library(dplyr)
li_cons %>% 
  count(Company) %>% 
  arrange(desc(n))

The above code groups the rows by company and tells you the number of connections working for each company. Now you can decide to clean some of the data. Have a look at your data and watch for different spellings of the same company.

# Introduce a new variable for the cleaned company names
li_cons$comp_clean <- li_cons$Company

library(stringr)

# Your magic here: 
li_cons <- li_cons %>% 
  mutate(
    comp_clean = if_else(str_detect(Company, "Deloitte"), "Deloitte", comp_clean), # Whenever the company name contains "Deloitte", shorten it to just "Deloitte"
    comp_clean = if_else(str_detect(Company, "Université"), "University", comp_clean), # Whenever the company name contains "Université", change it to "University"
    comp_clean = if_else(str_detect(Company, "University"), "University", comp_clean) # Whenever the company name contains "University", change it to "University"
  )

And so on. Make as many or few consolidations as you see fit.
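
If the list of consolidations grows, chaining if_else() calls gets unwieldy. One alternative, sketched here with the same hypothetical patterns as above, is dplyr::case_when(), which keeps all rules in a single expression evaluated top to bottom:

library(dplyr)
library(stringr)

# Same consolidations as above, expressed as one rule set; the final
# TRUE branch keeps the current value for companies that match no rule
li_cons <- li_cons %>%
  mutate(
    comp_clean = case_when(
      str_detect(Company, "Deloitte") ~ "Deloitte",
      str_detect(Company, "Universit(é|y)") ~ "University",
      TRUE ~ comp_clean
    )
  )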

The same holds true for the Position data field. Let's count the number of occurrences of each position by using the dplyr package.

library(dplyr)
li_cons %>% 
  count(Position) %>% 
  arrange(desc(n))

You can have a look at your data and see possible different spellings.

# Introduce a new variable
li_cons$pos_clean <- li_cons$Position

library(stringr)

# Your magic here: 
li_cons <- li_cons %>% 
  mutate(
    pos_clean = if_else(str_detect(Position, "Director"), "Director", pos_clean), # Look for positions that contain the word "Director" and change it to just "Director"
    pos_clean = if_else(str_detect(Position, "Analyst"), "Analyst", pos_clean), # Look for positions that contain the word "Analyst" and change it to "Analyst"
    pos_clean = if_else(str_detect(Position, "Consultant"), "Consultant", pos_clean) # Look for positions that contain the word "Consultant" and change it to "Consultant"
  )

And so on.

Part 3: Playing with the data

Let's create a color palette and set some shared theme options for our plots.

library(ggplot2)

# My color palette
c_col <- c("#264653", "#2a9d8f", "#e9c46a", "#f4a261", "#e76f51")

# Theme information for ggplot2 graphs; theme_update() tweaks the active theme
# instead of replacing it with an incomplete one, which theme_set(theme(...)) would do
theme_update(
  text = element_text(size = 10, family = "Optima"), # Optima ships with macOS; pick another font family elsewhere
  plot.title = element_text(
    size = 20,
    hjust = 0,
    margin = margin(t = 10, r = 0, b = 10, l = 0, unit = "pt"),
    color = c_col[1],
    face = "bold"
  ),
  axis.title = element_text(
    size = 12,
    margin = margin(t = 10, r = 10, b = 10, l = 10, unit = "pt"),
    color = c_col[1],
    face = "bold"
  )
)

How many new connections have I made every year?

library(ggplot2)

# Show a bar chart with frequencies by year
li_cons %>% 
  ggplot() +
  geom_bar(aes(as.factor(connectedOnYear)), fill = c_col[1], color = "Black") +
  labs(title = "Number of connections per year",
       subtitle = paste("I startet using LinkedIn in", min(li_cons$connectedOnYear), "and it was the most active year concerning new connections"),
       x = "Years active on LinkedIn",
       y = "Number of new connections"
  ) 

How many new connections have I made every quarter?

# Show a bar chart with frequencies by year and quarter
li_cons %>% 
  ggplot() +
  geom_bar(
    aes(connectedOnYearQuarter),
    fill = c_col[1], color = "Black"
  ) +
  labs(
    title = "Number of connections per year and quarter",
    subtitle = paste("In the first month I connected to", count(li_cons, connectedOnYearQuarter)[1,2], "contacts"),
    x = "Years and quarters active on LinkedIn",
    y = "Number of new connections"
  ) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 12))

How many new connections have I made every quarter, disregarding the first month after joining?

peak <- count(li_cons, connectedOnYearQuarter) %>% top_n(1, n)

# Show a bar chart with frequencies by year and quarter
li_cons %>% 
  filter(connectedOnYearMonth != min(li_cons$connectedOnYearMonth)) %>% 
  ggplot() +
  geom_bar(
    aes(connectedOnYearQuarter),
    fill = c_col[1], color = "Black"
  ) +
  labs(
    title = "Number of connections per year and quarter \n(without March 2012)",
    subtitle = paste("Disregarding the initial month, there is a connection peak in", zoo::as.yearqtr(peak$connectedOnYearQuarter)),
    x = "Years and quarters active on LinkedIn",
    y = "Number of new connections"
  ) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 12))

Which companies are my connections working for?

library(tidyr)
# Show frequencies of top-most companies
li_cons %>% 
  group_by(comp_clean) %>% 
  summarise(n = n()) %>% 
  drop_na() %>% 
  filter(n > 3) %>% 
  arrange(-n)  %>% 
  ggplot() +
  geom_col(
    aes(n, comp_clean),
    fill = c_col[1], color = "Black"
  ) +
  labs(
    title = "Number of connections per Company \n(with at least 4 connections)",
    subtitle = "Naturally, I have the most connections working for my employers, customers, and business partners",
    y = "Company name (with at least 4 connections)",
    x = "Number of connections"
  ) 

Are there any peak moments with top companies?

# List of top companies, i.e. those with at least 6 connections
li_top_companies <- li_cons %>% 
  group_by(comp_clean) %>% 
  summarise(n = n()) %>% 
  arrange(n) %>% 
  drop_na() %>% 
  filter(n > 5) %>% 
  pull(comp_clean)

li_top_companies
## [1] "University"
li_cons <- li_cons %>% 
  mutate(
    is_top_company = comp_clean %in% li_top_companies,
    top_company = ifelse(comp_clean %in% li_top_companies, comp_clean, "Other")
  )

li_cons %>% 
  #filter(is_top_company) %>% 
  ggplot() + 
  geom_jitter(
    aes(connectedOnYear, top_company, color = top_company),
    size = 3,
    height = 0.2
  ) +
  labs(
    title = "Number of connections per company \n(with at least 6 connections per company)",
    subtitle = "Company information reflect my connections' actual employer and not the employer at the time \nwhen we connected",
    y = "Company name (with at least 6 connections)",
    x = "Years active on LinkedIn"
  ) +
  theme(legend.position = "none") + 
  scale_color_manual(values=c(c_col[1], c_col[2]))

What are my connections' positions?

# Show frequencies of top-most positions

# Some reordering of position levels based on frequency
levels4sorting <- li_cons %>% 
  group_by(pos_clean) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n)) %>% 
  drop_na() %>% 
  top_n(2, n) %>% 
  pull(pos_clean)

li_cons_topPositions <- li_cons %>% 
  filter(pos_clean %in% levels4sorting) 

li_cons_topPositions$pos_clean <- factor(li_cons_topPositions$pos_clean, levels = levels4sorting)  

li_cons_topPositions %>% 
  group_by(pos_clean) %>% 
  summarise(
    n = n()
  ) %>% 
  ggplot() +
  geom_col(
    aes(n, pos_clean),
    fill = c_col[1], color = "Black"
  ) +
  labs(
    title = "Number of connections per position \n(with at least 4 positions)",
    subtitle = paste("I just realized I am the most connected with", levels4sorting,"!"),
    y = "Position name (with at least 4 connections)",
    x = "Number of connections"
  ) 

Are there any peak moments with top-positions?

# List of top positions, i.e. those with at least 10 connections
li_top_pos <- li_cons %>% 
  group_by(pos_clean) %>% 
  summarise(n = n()) %>% 
  arrange(n) %>% 
  drop_na() %>% 
  filter(n >= 10) %>% 
  pull(pos_clean)

li_top_pos
## [1] "Consultant" "Analyst"
li_cons <- li_cons %>% 
  mutate(
    is_top_pos = pos_clean %in% li_top_pos,
    top_pos = ifelse(pos_clean %in% li_top_pos, pos_clean, "Other")
  )

li_cons %>% 
  ggplot() + 
  geom_jitter(
    aes(connectedOnYear, top_pos, color = top_pos),
    height = 0.2
  ) +
  labs(
    title = "Number of connections per position \n(with at least 10 connections per position)",
    subtitle = "Position information reflect my connections' actual positions and not the positions at the time \nwhen we connected",
    y = "Position (with at least 10 connections)",
    x = "Years active on LinkedIn"
  ) +
  theme(legend.position = "none") + 
  scale_color_manual(values=c(c_col[1], c_col[2], c_col[3]))

Where are the connections in top positions working at the moment?

# Show relationship of top positions and top companies

li_cons %>% 
  ggplot() +
  geom_jitter(
    aes(top_company, top_pos, color = connectedOnYear),
    height = 0.2, width = 0.2
  ) +
  labs(
    title = "Relationship between top positions and \ntop companies",
    subtitle = paste("Many of my connections made it to", levels4sorting,". Congrats!"),
    y = "Position (with at least 10 connections)",
    x = "Company name (limited to companies with at least 6 connections)"
  ) +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.6, size = 12))
