Amazon

This example uses the 'Amazon' dataset to predict the number of forest fires in Brazil. You can download a copy of the Jupyter Notebook of the study here. The dataset contains the following columns:

  • date: Date of the record
  • number: Number of forest fires
  • state: State in Brazil

We'll follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem, and we'll do it without ever loading our data into memory.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")
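
If you don't have a saved connection yet, you can create one first. The following is a minimal sketch using the new_connection() helper; the host, database, user, and password shown are placeholders, not real values.

# Save a reusable connection named "VerticaDSN".
# All credentials below are placeholders.
vp.new_connection(
    {
        "host": "10.211.55.14",
        "port": "5433",
        "database": "testdb",
        "user": "dbadmin",
        "password": "XxX",
    },
    name = "VerticaDSN",
)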

Let's create a virtual DataFrame (vDataFrame) of the dataset.

In [3]:
from verticapy.datasets import load_amazon
amazon = load_amazon()
amazon.head(5)
Out[3]:
     date         state         number
     Date         Varchar(32)   Int
1    1998-01-01   ACRE          0
2    1998-01-01   ALAGOAS       0
3    1998-01-01   AMAPÁ         0
4    1998-01-01   AMAZONAS      0
5    1998-01-01   BAHIA         0
Rows: 1-5 | Columns: 3
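
As promised, none of this loads the data into memory: the vDataFrame stores a SQL relation and pushes every computation down to Vertica. As a quick check, we can print the relation it is built on (a minimal sketch, assuming the vDataFrame's current_relation() method):

# The vDataFrame holds a SQL relation, not the rows themselves;
# each method call is translated to SQL and executed inside Vertica.
print(amazon.current_relation())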

Data Exploration and Preparation

We can explore our data by displaying descriptive statistics of all the columns.

In [4]:
amazon.describe(method = "categorical", unique = True)
Out[4]:
          dtype         count   top          top_percent   unique
"date"    date          6454    2017-01-01   0.434         239.0
"state"   varchar(32)   6454    ALAGOAS      3.719         27.0
"number"  int           6454    0            9.561         480.0
Rows: 1-3 | Columns: 6

Using the describe() method, we can see that our data ranges from the beginning of 1998 to the end of 2017.

In [5]:
amazon["date"].describe()
Out[5]:
       value
name   "date"
dtype  date
count  6454
min    1998-01-01
max    2017-11-01
Rows: 1-5 | Columns: 2

Brazil has dry and rainy seasons. Knowing this, we would expect the frequency of forest fires to vary between seasons. Let's confirm our hypothesis with an autocorrelation plot using 48 lags (four years of monthly data).

In [6]:
%matplotlib inline
amazon.acf(column = "number", 
           ts = "date",
           by = ["state"],
           p = 48)
Out[6]:
lag   value    confidence
0     1.0      0.024396841824873755
1     0.673    0.036275983687006595
2     0.349    0.045111637590579914
3     0.165    0.052306855578449836
4     0.03     0.057263854711292146
5     -0.042   0.06180006129512837
6     -0.06    0.06516081663993491
7     -0.045   0.06829524194541631
8     0.026    0.07124360828375358
9     0.148    0.07407557653279681
10    0.316    0.07669571841412227
11    0.563    0.07912720149833499
12    0.767    0.08102489031074944
13    0.592    0.08192275688166938
14    0.323    0.08272773065888367
15    0.144    0.08348307665310473
16    0.023    0.0842182150459408
17    -0.046   0.08492934300188434
18    -0.061   0.08560860089613088
19    -0.043   0.08624466587899496
20    0.026    0.08672523909375807
21    0.152    0.08691922124869496
22    0.332    0.08709516883010104
23    0.587    0.0872602406741454
24    0.778    0.08741685133673856
25    0.587    0.08757323051611797
26    0.31     0.08772142268465483
27    0.148    0.0878541726492689
28    0.016    0.08794631519454699
29    -0.051   0.08798551761309605
30    -0.065   0.08802107344145058
31    -0.05    0.08805320275459808
32    0.016    0.08808450865585683
33    0.136    0.08811343002364964
34    0.301    0.08813936389787239
35    0.575    0.08816388633991983
36    0.777    0.08818772410073934
37    0.598    0.08820895683521275
38    0.32     0.08822957354705878
39    0.157    0.0882489976063002
40    0.018    0.08826784670309079
41    -0.053   0.08828083535495473
42    -0.069   0.0882923062148065
43    -0.057   0.08830377916236357
44    0.001    0.08831425667709758
45    0.112    0.08832334504593417
46    0.262    0.08833197382451202
47    0.504    0.08834060445341575
48    0.675    0.08834750611745151
Rows: 1-49 | Columns: 3

The autocorrelation peaks at lags 12, 24, 36, and 48 confirm a strong yearly seasonality, so the process is not stationary. Let's use an augmented Dickey-Fuller test to confirm our hypothesis.

In [7]:
from verticapy.stats import adfuller
adfuller(amazon,
         ts = "date", 
         column = "number", 
         by = ["state"], 
         p = 48)
Out[7]:
                           value
ADF Test Statistic         -1.4627032775079862
p_value                    0.143610439349909
# Lags used                48
# Observations Used        6454
Critical Value (1%)        -3.43
Critical Value (2.5%)      -3.12
Critical Value (5%)        -2.86
Critical Value (10%)       -2.57
Stationarity (alpha = 1%)  False
Rows: 1-9 | Columns: 2

The effects of each season seem pretty clear. We can see this graphically using the cumulative sum of the number of forest fires, partitioned by state. If our hypothesis is correct, each curve should look like a staircase function: flat during the rainy season and climbing steeply during the dry season.

In [8]:
amazon.cumsum("number", 
              by = ["state"], 
              order_by = ["date"], 
              name = "cum_sum")
amazon["cum_sum"].plot(ts = "date", 
                       by = "state")
Out[8]:
<AxesSubplot:xlabel='"date"', ylabel='"cum_sum"'>

We can clearly see the seasonality of each state, and together these contribute to a global seasonality. Let's aggregate the number of fires by date and draw the cumulative sum of the total to see this more clearly.

In [9]:
import verticapy.stats as st
amazon = amazon.groupby(["date"], 
                        [st.sum(amazon["number"])._as("number")])
amazon.cumsum("number", 
              order_by = ["date"], 
              name = "cum_sum")
amazon["cum_sum"].plot(ts = "date")
Out[9]:
<AxesSubplot:xlabel='"date"', ylabel='"cum_sum"'>

Machine Learning

Let's create a seasonal autoregressive (SAR) model to predict the number of forest fires in Brazil. Since the seasonality repeats every year of monthly data, we set s = 12, and we use P = 4 seasonal lags (four years of history).
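
In equation form, a pure seasonal AR model with P = 4 and s = 12 predicts each month from the same calendar month of the previous four years (a standard formulation; the coefficient names below are generic, not VerticaPy's):

number(t) = c + Φ₁·number(t−12) + Φ₂·number(t−24) + Φ₃·number(t−36) + Φ₄·number(t−48) + ε(t)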

In [10]:
from verticapy.learn.tsa import SARIMAX

model = SARIMAX("amazon_ar", 
                s = 12, 
                P = 4)
model.fit(amazon,
          y = "number",
          ts = "date")
model.regression_report()
Out[10]:
                         value
explained_variance       0.735475202180538
max_error                53472.4790951474829625
median_absolute_error    2345.29902142932
mean_absolute_error      6098.51854215664
mean_squared_error       118668651.839548
root_mean_squared_error  10893.514209819896
r2                       0.735475202180538
r2_adj                   0.7297864968510872
aic                      3561.3668585620294
bic                      3577.3039013779385
Rows: 1-10 | Columns: 2
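
As a sanity check on the report, the adjusted R² follows from the usual formula. With k = 4 predictors (the four seasonal lags) and n = 191 usable observations (our inference: 239 months minus the 48 consumed as lags, not stated in the report):

r2_adj = 1 − (1 − r2)·(n − 1)/(n − k − 1)
       = 1 − (1 − 0.735475)·(190/186)
       ≈ 0.729786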

With an R² of about 0.74, our model is quite good. Let's look at our predictions.

In [11]:
x = model.plot(amazon, 
               nlead=100, 
               dynamic=True)

The plot shows that our model successfully captures the seasonality in our data. Let's add the predictions to the vDataFrame.

In [12]:
amazon = model.predict(amazon, name = "prediction")
display(amazon)
     date         cum_sum   prediction           number
     Date         Integer   Numeric(34,16)       Integer
1    1998-01-01   0         [null]               0
2    1998-02-01   0         [null]               0
3    1998-03-01   0         [null]               0
4    1998-04-01   0         [null]               0
5    1998-05-01   0         [null]               0
6    1998-06-01   3551      [null]               3551
7    1998-07-01   11617     [null]               8066
8    1998-08-01   47166     [null]               35549
9    1998-09-01   89134     [null]               41968
10   1998-10-01   112629    [null]               23495
11   1998-11-01   119433    [null]               6804
12   1998-12-01   123881    [null]               4448
13   1999-01-01   124962    [null]               1081
14   1999-02-01   126246    [null]               1284
15   1999-03-01   126913    [null]               667
16   1999-04-01   127630    [null]               717
17   1999-05-01   129442    [null]               1812
18   1999-06-01   133074    [null]               3632
19   1999-07-01   141830    [null]               8756
20   1999-08-01   181316    [null]               39486
21   1999-09-01   218229    [null]               36913
22   1999-10-01   245241    [null]               27012
23   1999-11-01   254101    [null]               8860
24   1999-12-01   258477    [null]               4376
25   2000-01-01   259255    [null]               778
26   2000-02-01   259816    [null]               561
27   2000-03-01   260664    [null]               848
28   2000-04-01   261201    [null]               537
29   2000-05-01   263298    [null]               2097
30   2000-06-01   269573    [null]               6275
31   2000-07-01   274312    [null]               4739
32   2000-08-01   296514    [null]               22202
33   2000-09-01   319805    [null]               23291
34   2000-10-01   347141    [null]               27336
35   2000-11-01   355540    [null]               8399
36   2000-12-01   360005    [null]               4465
37   2001-01-01   360552    [null]               547
38   2001-02-01   361611    [null]               1059
39   2001-03-01   362879    [null]               1268
40   2001-04-01   363960    [null]               1081
41   2001-05-01   366050    [null]               2090
42   2001-06-01   374483    [null]               8433
43   2001-07-01   380973    [null]               6490
44   2001-08-01   412860    [null]               31887
45   2001-09-01   452694    [null]               39834
46   2001-10-01   483732    [null]               31038
47   2001-11-01   499371    [null]               15639
48   2001-12-01   505572    [null]               6201
49   2002-01-01   507226    2316.441403385607    1654
50   2002-02-01   508796    2432.2522086194726   1570
51   2002-03-01   510475    2372.9891920446694   1679
52   2002-04-01   512157    2210.1984530467385   1682
53   2002-05-01   515975    3560.923815872453    3818
54   2002-06-01   526814    7369.910334009876    10839
55   2002-07-01   540565    7771.160774847538    13751
56   2002-08-01   597716    31142.45470770594    57151
57   2002-09-01   653519    32092.92299971871    55803
58   2002-10-01   701241    29428.73986256616    47722
59   2002-11-01   729420    11848.871124101008   28179
60   2002-12-01   741364    6222.514063438655    11944
61   2003-01-01   746455    2288.5907656560316   5091
62   2003-02-01   748853    2386.2497467077687   2398
63   2003-03-01   751602    2669.304291959208    2749
64   2003-04-01   754279    2465.8488608210528   2677
65   2003-05-01   756026    3949.205715373259    1747
66   2003-06-01   762532    9923.07893534352     6506
67   2003-07-01   774336    8788.013997465541    11804
68   2003-08-01   818072    34822.96150714659    43736
69   2003-09-01   894397    38546.623177400485   76325
70   2003-10-01   937692    35117.18915152546    43295
71   2003-11-01   961264    17826.577704133844   23572
72   2003-12-01   976606    8485.212699766464    15342
73   2004-01-01   979311    3586.2618738419033   2705
74   2004-02-01   980566    3074.315430791295    1255
75   2004-03-01   982606    3265.9031950005697   2040
76   2004-04-01   983941    3204.759017434466    1335
77   2004-05-01   987476    4133.647584518473    3535
78   2004-06-01   1001738   10438.460402649542   14262
79   2004-07-01   1025547   12444.998071670168   23809
80   2004-08-01   1074872   47514.40748404308    49325
81   2004-09-01   1158372   58250.16797498169    83500
82   2004-10-01   1198703   42556.73396601229    40331
83   2004-11-01   1229466   24861.420765111307   30763
84   2004-12-01   1246990   12488.713081512751   17524
85   2005-01-01   1251980   4898.984236199336    4990
86   2005-02-01   1254133   3288.9640891464164   2153
87   2005-03-01   1255839   3664.8276375724126   1706
88   2005-04-01   1256850   3470.643367507149    1011
89   2005-05-01   1260060   4363.786056925852    3210
90   2005-06-01   1265871   11293.403930382246   5811
91   2005-07-01   1281534   17315.99878246324    15663
92   2005-08-01   1333515   51992.889918420115   51981
93   2005-09-01   1409772   73577.59183873676    76257
94   2005-10-01   1459648   45954.84587055446    49876
95   2005-11-01   1481400   28885.01358775507    21752
96   2005-12-01   1487754   16559.253134356753   6354
97   2006-01-01   1491009   5676.858923746963    3255
98   2006-02-01   1492675   3299.071810075103    1666
99   2006-03-01   1494449   3647.2478998427196   1774
100  2006-04-01   1495241   3137.2990214293236   792
Rows: 1-100 | Columns: 4

From here, we can use a time series plot to compare our predictions with the real values.

In [13]:
amazon.plot(ts = "date", 
            columns = ["number", "prediction"])
Out[13]:
<AxesSubplot:xlabel='"date"'>