COVID-19
This example uses the 'covid19' dataset to predict the number of deaths and cases one day in advance. You can download the Jupyter Notebook of the study here. The dataset contains the following columns:
- date: Date of the record
- cases: Number of people infected
- deaths: Number of deaths
- state: State
- fips: The Federal Information Processing Standards (FIPS) code for the county.
- county: County
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN". For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset. The dataset is available here.
covid19 = vp.read_csv("data/covid19_deaths.csv")
display(covid19)
Data Exploration and Preparation
Let's explore the data by displaying descriptive statistics of all the columns.
covid19.describe(method = "categorical", unique = True)
We have data from January 2020 to the beginning of May.
covid19["date"].describe()
We'll try to predict the number of future deaths by using the statistics from previous days. We can drop the columns 'county' and 'fips', since our analysis works at the state level across the United States and the county-level FIPS code isn't relevant to our predictions.
covid19.drop(["fips", "county"])
Let's sum the number of deaths and cases by state and date.
import verticapy.stats as st
covid19 = covid19.groupby(["state",
                           "date"],
                          [st.sum(covid19["deaths"])._as("deaths"),
                           st.sum(covid19["cases"])._as("cases")])
display(covid19)
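To make the aggregation concrete, here is a minimal plain-Python sketch of what summing by state and date computes. The sample rows and the helper name `sum_by_state_and_date` are invented for illustration and are not part of the dataset or of VerticaPy, which runs the equivalent aggregation inside the database.

```python
from collections import defaultdict

# Hypothetical county-level rows: (state, date, deaths, cases).
rows = [
    ("Washington", "2020-03-01", 1, 10),
    ("Washington", "2020-03-01", 2, 5),
    ("Washington", "2020-03-02", 4, 20),
    ("New York", "2020-03-01", 0, 1),
]

def sum_by_state_and_date(records):
    """Sum deaths and cases over all counties for each (state, date) pair."""
    totals = defaultdict(lambda: {"deaths": 0, "cases": 0})
    for state, date, deaths, cases in records:
        totals[(state, date)]["deaths"] += deaths
        totals[(state, date)]["cases"] += cases
    return dict(totals)

state_totals = sum_by_state_and_date(rows)
```

Each key of `state_totals` is one (state, date) pair, which corresponds to one row of the aggregated result.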
Let's look at the autocorrelation plot of the number of deaths.
%matplotlib inline
covid19.acf(column = "deaths",
            ts = "date",
            by = ["state"],
            p = 48)
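For readers who want to see what an autocorrelation plot measures, the following is a minimal sketch of the standard sample autocorrelation for a single series. The input values are made up, the function name is hypothetical, and the sketch ignores the per-state partitioning done by `by`; VerticaPy's own estimator may differ in detail.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    # Total variation around the mean (the lag-0 autocovariance times n).
    denom = sum((x - mean) ** 2 for x in series)
    # Co-variation between the series and itself shifted by `lag` steps.
    num = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return num / denom

# Hypothetical cumulative counts for one state (steadily increasing).
cumulative = [1, 2, 4, 7, 11, 16, 22, 29, 37, 46]
acf_values = [autocorrelation(cumulative, k) for k in range(4)]
```

A trending series such as a cumulative count typically shows large autocorrelations that shrink only slowly as the lag grows, which is one informal sign of non-stationarity.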
The process doesn't seem to be stationary. Let's use an augmented Dickey-Fuller test to check this hypothesis.
from verticapy.stats import adfuller
adfuller(covid19,
         ts = "date",
         column = "deaths",
         by = ["state"],
         p = 12)
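As a rough illustration of the idea behind the test, here is a sketch of the basic (non-augmented) Dickey-Fuller statistic with a constant term. It regresses the first difference on the lagged level and returns the t-statistic of the slope. It deliberately omits the lagged difference terms that an augmented test adds, it is not the VerticaPy implementation, and both example series are invented.

```python
import math

def dickey_fuller_stat(series):
    """t-statistic of gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    x = series[:-1]                                  # lagged levels y[t-1]
    d = [b - a for a, b in zip(series, series[1:])]  # first differences
    n = len(x)
    x_mean = sum(x) / n
    d_mean = sum(d) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (di - d_mean) for xi, di in zip(x, d))
    gamma = sxy / sxx                                # OLS slope
    alpha = d_mean - gamma * x_mean                  # OLS intercept
    residuals = [di - alpha - gamma * xi for xi, di in zip(x, d)]
    sigma2 = sum(r * r for r in residuals) / (n - 2) # residual variance
    std_err = math.sqrt(sigma2 / sxx)                # standard error of gamma
    return gamma / std_err

# Hypothetical series: one fluctuating around a fixed level, one trending upward.
mean_reverting = [5, 3, 6, 2, 5, 4, 6, 3, 5, 4, 6, 2]
trending = [1, 2, 4, 7, 11, 16, 22, 29, 37, 46]

stat_mean_reverting = dickey_fuller_stat(mean_reverting)
stat_trending = dickey_fuller_stat(trending)
```

The statistic is compared against Dickey-Fuller critical values rather than the usual t-distribution; for the constant-only case the large-sample 5% critical value is roughly -2.9. A strongly negative statistic speaks against a unit root, while a value near zero or above it, as for the trending series here, gives no evidence of stationarity.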
Let's plot the cumulative number of deaths by state to see how close its growth is to exponential.
covid19["deaths"].plot(ts = "date",
                       by = "state")
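One simple way to quantify how exponential a curve is: if a series grows exponentially, its logarithm is a straight line in time. The sketch below fits that line by least squares and reports the slope and R². The helper name and the doubling series are hypothetical, and the input must contain only positive values, so days with zero deaths would need to be filtered out first.

```python
import math

def log_linear_fit(values):
    """Least-squares fit of log(values) against the time index 0, 1, 2, ...

    Returns (slope, r_squared). All values must be strictly positive.
    """
    logs = [math.log(v) for v in values]
    n = len(logs)
    t_mean = (n - 1) / 2
    y_mean = sum(logs) / n
    stt = sum((t - t_mean) ** 2 for t in range(n))
    sty = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(logs))
    syy = sum((y - y_mean) ** 2 for y in logs)
    slope = sty / stt                      # daily growth rate on the log scale
    r_squared = (sty * sty) / (stt * syy)  # closeness of log(values) to a line
    return slope, r_squared

# Hypothetical cumulative series that doubles every day.
doubling = [2 ** t for t in range(1, 11)]
slope, r_squared = log_linear_fit(doubling)
doubling_time = math.log(2) / slope        # days needed for the count to double
```

An R² close to 1 on the log scale indicates near-exponential growth, and `log(2) / slope` converts the fitted slope into a doubling time in days.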