Amazon¶
This example uses the 'Amazon' dataset to predict the number of forest fires in Brazil. You can download a copy of the Jupyter Notebook of the study here. The dataset contains the following columns:
- date: Date of the record
- number: Number of forest fires
- state: State in Brazil
We'll follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem, and we'll do it without ever loading our data into memory.
Initialization¶
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
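If the "VerticaDSN" connection doesn't exist yet, you can create and save one with new_connection. The following is a minimal sketch; all credentials below are placeholder assumptions to replace with your own:
# Create and save a connection named "VerticaDSN" (run once).
# The host, port, database, user, and password are placeholders.
vp.new_connection({"host": "localhost",
                   "port": "5433",
                   "database": "testdb",
                   "user": "dbadmin",
                   "password": ""},
                  name = "VerticaDSN")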
Let's create a vDataFrame (virtual DataFrame) of the dataset.
from verticapy.datasets import load_amazon
amazon = load_amazon()
amazon.head(5)
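Note that, as promised in the introduction, none of this loads the data into memory: a vDataFrame is just a wrapper around a SQL relation, and computations are pushed down to Vertica. As a quick sketch, you can inspect the underlying relation at any time with current_relation():
# Display the SQL relation the vDataFrame is built on;
# data is only transferred when a result must be displayed.
print(amazon.current_relation())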
Data Exploration and Preparation¶
We can explore our data by displaying descriptive statistics of all the columns.
amazon.describe(method = "categorical", unique = True)
Using the describe() method, we can see that our data ranges from the beginning of 1998 to the end of 2017.
amazon["date"].describe()
Brazil has dry and rainy seasons. Knowing this, we would expect the frequency of forest fires to vary between seasons. Let's confirm our hypothesis using an autocorrelation plot with 48 lags (4 years).
%matplotlib inline
amazon.acf(column = "number",
           ts = "date",
           by = ["state"],
           p = 48)
The autocorrelation plot suggests that the process is not stationary. Let's use an augmented Dickey-Fuller test to confirm our hypothesis.
from verticapy.stats import adfuller
adfuller(amazon,
         ts = "date",
         column = "number",
         by = ["state"],
         p = 48)
The effects of each season seem pretty clear. We can see this graphically using the cumulative sum of the number of forest fires, partitioned by state. If our hypothesis is correct, we should see staircase functions.
amazon.cumsum("number",
              by = ["state"],
              order_by = ["date"],
              name = "cum_sum")
amazon["cum_sum"].plot(ts = "date",
                       by = "state")
We can clearly see the per-state seasonality, and each state contributes to a global seasonality. Let's aggregate the data by date and draw the global cumulative sum to see this more clearly.
import verticapy.stats as st
amazon = amazon.groupby(["date"],
                        [st.sum(amazon["number"])._as("number")])
amazon.cumsum("number",
              order_by = ["date"],
              name = "cum_sum")
amazon["cum_sum"].plot(ts = "date")
Machine Learning¶
Let's create a seasonal autoregressive (SAR) model to predict the number of forest fires in Brazil. Since the seasonality is yearly and the data is monthly (s = 12), we'll consider four seasonal lags (P = 4).
from verticapy.learn.tsa import SARIMAX
model = SARIMAX("amazon_ar",
                s = 12,
                P = 4)
model.fit(amazon,
          y = "number",
          ts = "date")
model.regression_report()
Our model is quite good. Let's look at our predictions.
x = model.plot(amazon,
               nlead = 100,
               dynamic = True)
The plot shows that our model has successfully captured the seasonality implied by our data. Let's add the prediction to the vDataFrame.
amazon = model.predict(amazon, name = "prediction")
display(amazon)
From here, we can use a time series plot to compare our prediction with the real values.
amazon.plot(ts = "date",
            columns = ["number", "prediction"])