COVID-19

This example uses the 'covid19' dataset to predict the number of deaths and cases one day in advance. You can download the Jupyter Notebook of the study here.

  • date: Date of the record
  • cases: Number of people infected
  • deaths: Number of deaths
  • state: State
  • fips: The Federal Information Processing Standards (FIPS) code for the county.
  • county: County

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's create a Virtual DataFrame of the dataset. The dataset is available here.

In [3]:
covid19 = vp.read_csv("data/covid19_deaths.csv")
display(covid19)
📅
date
Date
Abc
county
Varchar(66)
Abc
state
Varchar(48)
123
fips
Int
123
cases
Int
123
deaths
Int
12020-01-21SnohomishWashington5306110
22020-01-22SnohomishWashington5306110
32020-01-23SnohomishWashington5306110
42020-01-24CookIllinois1703110
52020-01-24SnohomishWashington5306110
62020-01-25CookIllinois1703110
72020-01-25OrangeCalifornia605910
82020-01-25SnohomishWashington5306110
92020-01-26CookIllinois1703110
102020-01-26Los AngelesCalifornia603710
112020-01-26MaricopaArizona401310
122020-01-26OrangeCalifornia605910
132020-01-26SnohomishWashington5306110
142020-01-27CookIllinois1703110
152020-01-27Los AngelesCalifornia603710
162020-01-27MaricopaArizona401310
172020-01-27OrangeCalifornia605910
182020-01-27SnohomishWashington5306110
192020-01-28CookIllinois1703110
202020-01-28Los AngelesCalifornia603710
212020-01-28MaricopaArizona401310
222020-01-28OrangeCalifornia605910
232020-01-28SnohomishWashington5306110
242020-01-29CookIllinois1703110
252020-01-29Los AngelesCalifornia603710
262020-01-29MaricopaArizona401310
272020-01-29OrangeCalifornia605910
282020-01-29SnohomishWashington5306110
292020-01-30CookIllinois1703120
302020-01-30Los AngelesCalifornia603710
312020-01-30MaricopaArizona401310
322020-01-30OrangeCalifornia605910
332020-01-30SnohomishWashington5306110
342020-01-31CookIllinois1703120
352020-01-31Los AngelesCalifornia603710
362020-01-31MaricopaArizona401310
372020-01-31OrangeCalifornia605910
382020-01-31Santa ClaraCalifornia608510
392020-01-31SnohomishWashington5306110
402020-02-01CookIllinois1703120
412020-02-01Los AngelesCalifornia603710
422020-02-01MaricopaArizona401310
432020-02-01OrangeCalifornia605910
442020-02-01Santa ClaraCalifornia608510
452020-02-01SnohomishWashington5306110
462020-02-01SuffolkMassachusetts2502510
472020-02-02CookIllinois1703120
482020-02-02Los AngelesCalifornia603710
492020-02-02MaricopaArizona401310
502020-02-02OrangeCalifornia605910
512020-02-02San FranciscoCalifornia607520
522020-02-02Santa ClaraCalifornia608520
532020-02-02SnohomishWashington5306110
542020-02-02SuffolkMassachusetts2502510
552020-02-03CookIllinois1703120
562020-02-03Los AngelesCalifornia603710
572020-02-03MaricopaArizona401310
582020-02-03OrangeCalifornia605910
592020-02-03San FranciscoCalifornia607520
602020-02-03Santa ClaraCalifornia608520
612020-02-03SnohomishWashington5306110
622020-02-03SuffolkMassachusetts2502510
632020-02-04CookIllinois1703120
642020-02-04Los AngelesCalifornia603710
652020-02-04MaricopaArizona401310
662020-02-04OrangeCalifornia605910
672020-02-04San FranciscoCalifornia607520
682020-02-04Santa ClaraCalifornia608520
692020-02-04SnohomishWashington5306110
702020-02-04SuffolkMassachusetts2502510
712020-02-05CookIllinois1703120
722020-02-05DaneWisconsin5502510
732020-02-05Los AngelesCalifornia603710
742020-02-05MaricopaArizona401310
752020-02-05OrangeCalifornia605910
762020-02-05San FranciscoCalifornia607520
772020-02-05Santa ClaraCalifornia608520
782020-02-05SnohomishWashington5306110
792020-02-05SuffolkMassachusetts2502510
802020-02-06CookIllinois1703120
812020-02-06DaneWisconsin5502510
822020-02-06Los AngelesCalifornia603710
832020-02-06MaricopaArizona401310
842020-02-06OrangeCalifornia605910
852020-02-06San FranciscoCalifornia607520
862020-02-06Santa ClaraCalifornia608520
872020-02-06SnohomishWashington5306110
882020-02-06SuffolkMassachusetts2502510
892020-02-07CookIllinois1703120
902020-02-07DaneWisconsin5502510
912020-02-07Los AngelesCalifornia603710
922020-02-07MaricopaArizona401310
932020-02-07OrangeCalifornia605910
942020-02-07San FranciscoCalifornia607520
952020-02-07Santa ClaraCalifornia608520
962020-02-07SnohomishWashington5306110
972020-02-07SuffolkMassachusetts2502510
982020-02-08CookIllinois1703120
992020-02-08DaneWisconsin5502510
1002020-02-08Los AngelesCalifornia603710
Rows: 1-100 | Columns: 6

Data Exploration and Preparation

Let's explore the data by displaying descriptive statistics of all the columns.

In [5]:
covid19.describe(method = "categorical", unique = True)
Out[5]:
dtype
count
top
top_percent
unique
"date"date1282562020-05-092.247110.0
"county"varchar(66)128256Washington1.1631710.0
"state"varchar(48)128256Texas6.7151.0
"fips"int128256530610.0862882.0
"cases"int128256113.3763830.0
"deaths"int128256061.125806.0
Rows: 1-6 | Columns: 6

We have data from January 2020 to the beginning of May.

In [6]:
covid19["date"].describe()
Out[6]:
value
name"date"
dtypedate
count128256
min2020-01-21
max2020-05-09
Rows: 1-5 | Columns: 2

We'll try to predict the number of future deaths by using the statistics from previous days. We can drop the columns 'county' and 'fips,' since the scope of our analysis is focused on the United States and the FIPS code isn't relevant to our predictions.

In [7]:
covid19.drop(["fips", "county"])
Out[7]:
📅
date
Date
Abc
state
Varchar(48)
123
cases
Int
123
deaths
Int
12020-01-21Washington10
22020-01-22Washington10
32020-01-23Washington10
42020-01-24Illinois10
52020-01-24Washington10
62020-01-25Illinois10
72020-01-25California10
82020-01-25Washington10
92020-01-26Illinois10
102020-01-26California10
112020-01-26Arizona10
122020-01-26California10
132020-01-26Washington10
142020-01-27Illinois10
152020-01-27California10
162020-01-27Arizona10
172020-01-27California10
182020-01-27Washington10
192020-01-28Illinois10
202020-01-28California10
212020-01-28Arizona10
222020-01-28California10
232020-01-28Washington10
242020-01-29Illinois10
252020-01-29California10
262020-01-29Arizona10
272020-01-29California10
282020-01-29Washington10
292020-01-30Illinois20
302020-01-30California10
312020-01-30Arizona10
322020-01-30California10
332020-01-30Washington10
342020-01-31Illinois20
352020-01-31California10
362020-01-31Arizona10
372020-01-31California10
382020-01-31California10
392020-01-31Washington10
402020-02-01Illinois20
412020-02-01California10
422020-02-01Arizona10
432020-02-01California10
442020-02-01California10
452020-02-01Washington10
462020-02-01Massachusetts10
472020-02-02Illinois20
482020-02-02California10
492020-02-02Arizona10
502020-02-02California10
512020-02-02California20
522020-02-02California20
532020-02-02Washington10
542020-02-02Massachusetts10
552020-02-03Illinois20
562020-02-03California10
572020-02-03Arizona10
582020-02-03California10
592020-02-03California20
602020-02-03California20
612020-02-03Washington10
622020-02-03Massachusetts10
632020-02-04Illinois20
642020-02-04California10
652020-02-04Arizona10
662020-02-04California10
672020-02-04California20
682020-02-04California20
692020-02-04Washington10
702020-02-04Massachusetts10
712020-02-05Illinois20
722020-02-05Wisconsin10
732020-02-05California10
742020-02-05Arizona10
752020-02-05California10
762020-02-05California20
772020-02-05California20
782020-02-05Washington10
792020-02-05Massachusetts10
802020-02-06Illinois20
812020-02-06Wisconsin10
822020-02-06California10
832020-02-06Arizona10
842020-02-06California10
852020-02-06California20
862020-02-06California20
872020-02-06Washington10
882020-02-06Massachusetts10
892020-02-07Illinois20
902020-02-07Wisconsin10
912020-02-07California10
922020-02-07Arizona10
932020-02-07California10
942020-02-07California20
952020-02-07California20
962020-02-07Washington10
972020-02-07Massachusetts10
982020-02-08Illinois20
992020-02-08Wisconsin10
1002020-02-08California10
Rows: 1-100 of 128256 | Columns: 4

Let's sum the number of deaths and cases by state and date.

In [8]:
import verticapy.stats as st
covid19 = covid19.groupby(["state",
                           "date"],
                          [st.sum(covid19["deaths"])._as("deaths"),
                           st.sum(covid19["cases"])._as("cases")])
display(covid19)
Abc
state
Varchar(48)
📅
date
Date
123
deaths
Integer
123
cases
Integer
1South Carolina2020-03-203126
2South Carolina2020-04-04401917
3South Carolina2020-04-03341700
4South Carolina2020-04-02311554
5South Carolina2020-04-08632552
6South Carolina2020-04-201244439
7South Carolina2020-04-191204377
8South Carolina2020-04-181194246
9South Carolina2020-04-171164086
10South Carolina2020-04-161093931
11South Carolina2020-04-151073656
12South Carolina2020-04-14973553
13South Carolina2020-04-13873439
14District of Columbia2020-03-0701
15District of Columbia2020-03-0801
16District of Columbia2020-03-0904
17District of Columbia2020-03-1004
18District of Columbia2020-03-11010
19District of Columbia2020-03-12010
20District of Columbia2020-03-13010
21District of Columbia2020-03-14016
22District of Columbia2020-03-15017
23District of Columbia2020-03-16022
24District of Columbia2020-03-17031
25District of Columbia2020-03-18036
26District of Columbia2020-03-19039
27District of Columbia2020-03-20071
28District of Columbia2020-03-21177
29District of Columbia2020-03-22198
30Nevada2020-05-082985972
31Nevada2020-05-093016076
32North Dakota2020-04-2214679
33North Dakota2020-04-2315709
34North Dakota2020-04-2013627
35North Dakota2020-04-2113644
36North Dakota2020-04-2617867
37North Dakota2020-04-2719942
38North Dakota2020-04-2415748
39North Dakota2020-04-2516803
40North Dakota2020-04-149341
41North Dakota2020-04-159365
42North Dakota2020-04-128308
43North Dakota2020-04-139331
44North Dakota2020-04-189528
45North Dakota2020-04-1910585
46North Dakota2020-04-169393
47North Dakota2020-04-179439
48North Dakota2020-04-074237
49North Dakota2020-04-106278
50South Carolina2020-04-281925735
51South Carolina2020-04-271775613
52South Carolina2020-04-261745490
53South Carolina2020-04-251665253
54South Carolina2020-04-241575070
55South Carolina2020-04-231504917
56South Carolina2020-04-221404761
57South Carolina2020-04-211354608
58South Carolina2020-05-063056936
59South Carolina2020-05-052966841
60South Carolina2020-05-042836757
61South Carolina2020-05-032756626
62South Carolina2020-05-022676489
63South Carolina2020-05-012566258
64South Carolina2020-04-302446095
65South Carolina2020-04-292325881
66South Carolina2020-05-093307531
67South Carolina2020-05-083207367
68South Carolina2020-05-073167142
69Ohio2020-05-04105620474
70Ohio2020-05-07127122131
71Hawaii2020-05-0917620
72Hawaii2020-05-0817619
73Hawaii2020-05-0517615
74Hawaii2020-05-0417612
75Hawaii2020-05-0717619
76Hawaii2020-05-0617616
77Hawaii2020-03-22056
78Hawaii2020-03-21048
79Hawaii2020-03-24077
80Hawaii2020-03-23072
81Hawaii2020-03-18016
82Hawaii2020-03-300191
83Hawaii2020-03-290163
84Hawaii2020-04-011238
85Hawaii2020-03-311209
86Hawaii2020-03-280145
87Hawaii2020-03-270115
88Hawaii2020-03-0601
89Hawaii2020-03-0802
90Hawaii2020-03-0701
91Hawaii2020-03-1302
92Hawaii2020-04-2312590
93Hawaii2020-04-2212586
94Hawaii2020-04-2514597
95Hawaii2020-04-2413595
96Hawaii2020-04-1910574
97Hawaii2020-04-189568
98Hawaii2020-04-2112580
99Hawaii2020-04-2010578
100Hawaii2020-05-0116610
Rows: 1-100 | Columns: 4

Let's look at the autocorrelation graphic of the number of deaths.

In [9]:
%matplotlib inline
covid19.acf(column = "deaths", 
            ts = "date",
            by = ["state"],
            p = 48)
Out[9]:
value
confidence
01.00.03306808277915286
10.9990.05724557037333314
20.9980.07387480270409284
30.9970.08737730395335205
40.9950.09901827142968875
50.9920.10937378440149341
60.9890.11878064860034902
70.9850.12743226263766533
80.9810.13547360751434442
90.9750.14297793284551047
100.9690.15002630207245246
110.9630.15668069918376346
120.9550.16296406141763914
130.9470.1689185480557603
140.9380.1745667355850512
150.9290.17993837287877298
160.9180.18503660039549066
170.9070.18988474395122357
180.8950.19449265660474047
190.8830.1988785290267625
200.870.20304878272513283
210.8560.20700903997259487
220.840.21075551505216278
230.8240.21430193241170437
240.8090.21766872646402852
250.7930.22085833870057284
260.7760.22387311276697835
270.7580.2267153493691516
280.740.22939446302883856
290.7210.2319122183474391
300.7030.23428371983050017
310.6830.2365035253502923
320.6630.23857946037199607
330.6450.2405306388244739
340.6240.24234583571399215
350.5990.24401021225441466
360.5830.24557919332126724
370.5550.2469965798526741
380.5270.24827182839332118
390.4970.24940512413516747
400.4650.2503979952407609
410.4330.2512610083135219
420.40.25200086557622814
430.3620.2526121165114924
440.3230.2531053885928537
450.280.25348470463884176
460.2350.2537625031036636
470.1990.25397198430011114
480.1670.25413032503412625
Rows: 1-49 | Columns: 3

The process doesn't seem to be stationary. Let's use a Dickey-Fuller test to confirm our hypothesis.

In [10]:
from verticapy.stats import adfuller
adfuller(covid19,
         ts = "date", 
         column = "deaths", 
         by = ["state"], 
         p = 12)
Out[10]:
value
ADF Test Statistic-0.9916599727457468
p_value0.321448022269228
# Lags used12
# Observations Used3513
Critical Value (1%)-3.43
Critical Value (2.5%)-3.12
Critical Value (5%)-2.86
Critical Value (10%)-2.57
Stationarity (alpha = 1%)
Rows: 1-9 | Columns: 2

We can look at the cumulative number of deaths and its exponentiality.

In [11]:
covid19["deaths"].plot(ts = "date", 
                       by = "state")
Out[11]:
<AxesSubplot:xlabel='"date"', ylabel='"deaths"'>