As part of the Hacking Health Covid-19, the SKEMA Global Lab in AI provided to SKEMA’ students a fully developped data science environment to realize their project. See [here].
For this specific module, this team used these following courses:
In order to realize our project, we used the John Hopkins data to create a predictive model. Based on the ARIMA model, we selected the United States to do a time series analysis on the evolution of Covid-19 cases and deaths.
# Load packages
library(tidyverse)
library(lubridate)
library(rvest)
library(stringdist)
library(forecast)
We need to build a function to read .csv files and convert them to long format. Next, we pull official country level indicators from the UN Statistics Division to get country level identifiers. Merging by country name is messy. I start with a fuzzy matching approach by using the stringdist package.
We took the data cleaning and engneering [here]
# Function to read the raw CSV files
# Files are aggregated to the country level and then converted to long format.
clean_jhd_to_long <- function(df) {
df_str <- deparse(substitute(df))
var_str <- substr(df_str, 1, str_length(df_str) - 4)
df %>% group_by(`Country/Region`) %>%
filter(`Country/Region` != "Cruise Ship") %>%
select(-`Province/State`, -Lat, -Long) %>%
mutate_at(vars(-group_cols()), sum) %>%
distinct() %>%
ungroup() %>%
rename(country = `Country/Region`) %>%
pivot_longer(
-country,
names_to = "date_str",
values_to = var_str
) %>%
mutate(date = mdy(date_str)) %>%
select(country, date, !! sym(var_str))
}
confirmed_raw <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
deaths_raw <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
jh_covid19_data <- clean_jhd_to_long(confirmed_raw) %>%
full_join(clean_jhd_to_long(deaths_raw))
In order to take a country’s data from the DF for analysis, we need to first define a country name variable, and then use this variable, called an indicator, to create a subset and “sample” the country.
We selected the US country to do the time series analysis with ARIMA model.
#select a country to do time series analysis with ARIMA model
slctcountry = "US" #This varible defines the county
countrydata <- subset(jh_covid19_data, country == slctcountry)
We converts(ts founction) the data frame into time series by using frequency 365 (in days), and it will start with 22nd Jan. 2020. Then we plot it:
#time series analysis for confirmed cases
tscfm <- ts(countrydata$confirmed, frequency = 365.25, start=c(2020,1,22))
plot.ts(tscfm)
As we can see from the plot, take the death case of US for instance, the number of deaths began to increase significantly around the 25th day.
Then we can feed them into ARIMA (Autoregressive Moving Average Model) models:
#TS analysis with auto.arima founction
arc <- auto.arima(tscfm)
summary(arc)
Series: tscfm
ARIMA(2,2,2)
Coefficients:
ar1 ar2 ma1 ma2
1.2494 -0.9336 -1.4171 0.9170
s.e. 0.0319 0.0362 0.0740 0.0434
sigma^2 estimated as 3582037: log likelihood=-1093.01
AIC=2196.02 AICc=2196.54 BIC=2210.04
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 242.3179 1846.27 1087.874 1.831374 6.710824 NaN
ACF1
Training set -0.08393917
By summarizing the model, we learn that the error rate (MAPE: Mean absolute percentage error) is 9.489553, which is not too bad. And the ARIMA model here is ARIMA(1, 2, 1).
ARIMA (p, d, q):
So here the ARIMA (1, 2, 1) model means: The level of the process, measured as the most resent value, plus a trend measured as the most recent change in the process.
More specifically: d = 2, adaptive trend in addition on the level.
Use the current model to do prediction for the next week (7 days).
fc1 <- forecast(arc, h=7) #predict for next 7 days
summary(fc1)
Forecast method: ARIMA(2,2,2)
Model Information:
Series: tscfm
ARIMA(2,2,2)
Coefficients:
ar1 ar2 ma1 ma2
1.2494 -0.9336 -1.4171 0.9170
s.e. 0.0319 0.0362 0.0740 0.0434
sigma^2 estimated as 3582037: log likelihood=-1093.01
AIC=2196.02 AICc=2196.54 BIC=2210.04
Error measures:
ME RMSE MAE MPE MAPE MASE
Training set 242.3179 1846.27 1087.874 1.831374 6.710824 NaN
ACF1
Training set -0.08393917
Forecasts:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2020.3395 1663851 1661426 1666277 1660142 1667561
2020.3422 1685352 1680289 1690415 1677609 1693095
2020.3450 1707999 1700213 1715784 1696092 1719905
2020.3477 1731241 1720719 1741763 1715148 1747334
2020.3504 1754158 1740710 1767605 1733591 1774724
2020.3532 1776111 1759283 1792940 1750375 1801848
2020.3559 1797166 1776323 1818009 1765290 1829043
plot(fc1)
As we can see from the forecast matrix, on day one, there is a 95% probability that the predicted value will be between 761934.4 and 768048.2, and an 80% probability that the predicted value will be between 762992.5 and 766990.1…
#Residual analysis
acf(fc1$residuals)
pacf(fc1$residuals)
plot(fc1$residuals)
qqnorm(fc1$residuals)
And we repeat the exact same process but for the confirmed deaths this time.
##time series analysis for confirmed deaths
tsdeath <- ts(countrydata$deaths, frequency = 365, start=c(2020,22), end=c(2020,92))
plot(tsdeath)
ard <- auto.arima(tsdeath)
summary(ard)
Series: tsdeath
ARIMA(0,2,0)
sigma^2 estimated as 3735: log likelihood=-381.69
AIC=765.38 AICc=765.44 BIC=767.61
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 16.11268 60.25065 19.32394 4.638444 17.12991 NaN
ACF1
Training set 0.1439591
fc2 <- forecast(ard, h=7) #predict for next 7 days
summary(fc1)
Forecast method: ARIMA(2,2,2)
Model Information:
Series: tscfm
ARIMA(2,2,2)
Coefficients:
ar1 ar2 ma1 ma2
1.2494 -0.9336 -1.4171 0.9170
s.e. 0.0319 0.0362 0.0740 0.0434
sigma^2 estimated as 3582037: log likelihood=-1093.01
AIC=2196.02 AICc=2196.54 BIC=2210.04
Error measures:
ME RMSE MAE MPE MAPE MASE
Training set 242.3179 1846.27 1087.874 1.831374 6.710824 NaN
ACF1
Training set -0.08393917
Forecasts:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2020.3395 1663851 1661426 1666277 1660142 1667561
2020.3422 1685352 1680289 1690415 1677609 1693095
2020.3450 1707999 1700213 1715784 1696092 1719905
2020.3477 1731241 1720719 1741763 1715148 1747334
2020.3504 1754158 1740710 1767605 1733591 1774724
2020.3532 1776111 1759283 1792940 1750375 1801848
2020.3559 1797166 1776323 1818009 1765290 1829043
plot(fc1)
#Residual analysis
acf(fc2$residuals)
pacf(fc2$residuals)
plot(fc2$residuals)
qqnorm(fc2$residuals)
# Load packages
library(tidyverse)
library(lubridate)
library(rvest)
library(stringdist)
library(forecast)
# Function to read the raw CSV files
# Files are aggregated to the country level and then converted to long format.
clean_jhd_to_long <- function(df) {
df_str <- deparse(substitute(df))
var_str <- substr(df_str, 1, str_length(df_str) - 4)
df %>% group_by(`Country/Region`) %>%
filter(`Country/Region` != "Cruise Ship") %>%
select(-`Province/State`, -Lat, -Long) %>%
mutate_at(vars(-group_cols()), sum) %>%
distinct() %>%
ungroup() %>%
rename(country = `Country/Region`) %>%
pivot_longer(
-country,
names_to = "date_str",
values_to = var_str
) %>%
mutate(date = mdy(date_str)) %>%
select(country, date, !! sym(var_str))
}
confirmed_raw <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
deaths_raw <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
jh_covid19_data <- clean_jhd_to_long(confirmed_raw) %>%
full_join(clean_jhd_to_long(deaths_raw))
#select a country to do time series analysis with ARIMA model
slctcountry = "US" #This varible defines the county
countrydata <- subset(jh_covid19_data, country == slctcountry)
#time series analysis for confirmed cases
tscfm <- ts(countrydata$confirmed, frequency = 365.25, start=c(2020,1,22))
plot.ts(tscfm)
#TS analysis with auto.arima founction
arc <- auto.arima(tscfm)
summary(arc)
fc1 <- forecast(arc, h=7) #predict for next 7 days
summary(fc1)
plot(fc1)
#Residual analysis
acf(fc1$residuals)
pacf(fc1$residuals)
plot(fc1$residuals)
qqnorm(fc1$residuals)
##time series analysis for confirmed deaths
tsdeath <- ts(countrydata$deaths, frequency = 365, start=c(2020,22), end=c(2020,92))
plot(tsdeath)
ard <- auto.arima(tsdeath)
summary(ard)
fc2 <- forecast(ard, h=7) #predict for next 7 days
summary(fc1)
plot(fc1)
#Residual analysis
acf(fc2$residuals)
pacf(fc2$residuals)
plot(fc2$residuals)
qqnorm(fc2$residuals)
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".