Commodities¶
This example uses the 'Commodities' dataset to predict the price of different commodities. You can download the Jupyter Notebook of the study here.
- date: Date of the record
- Gold: Price per ounce of Gold
- Oil: Price per Barrel - West Texas Intermediate (WTI)
- Spread: Interest Rate Spreads.
- Vix: The CBOE Volatility Index (VIX) is a measure of expected price fluctuations in the SP500 Index options over the next 30 days.
- Dol_Eur: How much $1 US is in euros.
- SP500: The S&P 500, or simply the S&P, is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States.
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization¶
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset.
from verticapy.datasets import load_commodities
commodities = load_commodities()
display(commodities)
Data Exploration and Preparation¶
Let's explore the data by displaying descriptive statistics of all the columns.
commodities.describe(method = "all", unique = True)
We have data from January 1986 to the beginning of August 2020. We don't have any missing values, so our data is already clean.
Let's draw the different variables.
%matplotlib inline
commodities.plot(ts = "date")
Some of the commodities have an upward monotonic trend and some others might be stationary. Let's use Augmented Dickey-Fuller tests to check our hypotheses.
from verticapy.stats import adfuller
from verticapy import *
fuller = {}
for commodity in ["Gold", "Oil", "Spread", "Vix", "Dol_Eur", "SP500"]:
result = adfuller(commodities,
column = commodity,
ts = "date",
p = 3,
with_trend = True)
fuller["index"] = result["index"]
fuller[commodity] = result["value"]
fuller = tablesample(fuller)
display(fuller)
As expected: The price of gold and the S&P 500 index are not stationary. Let's use the Mann-Kendall test to confirm the trends.
from verticapy.stats import mkt
kendall = {}
for commodity in ["Gold", "SP500"]:
result = mkt(commodities,
column = commodity,
ts = "date")
kendall["index"] = result["index"]
kendall[commodity] = result["value"]
kendall = tablesample(kendall)
display(kendall)
Our hypothesis is correct. We can also look at the correlation between the elapsed time and our variables to see the different trends.
import verticapy.stats as st
commodities["elapsed_days"] = commodities["date"] - st.min(commodities["date"])._over()
commodities.corr(focus = "elapsed_days")
In the last plot, it's a bit hard to tell if 'Spread' is stationary. Let's draw it alone.
commodities["Spread"].plot(ts = "date")
We can see some sudden changes, so let's smooth the curve.
commodities.rolling(func = "avg",
window = (-20, 0),
columns = "Spread",
order_by = ["date"],
name = "Spread_smooth",)
commodities["Spread_smooth"].plot(ts = "date")
After each local minimum, there is a local maximum. Let's look at the number of lags needed to keep most of the information. To visualize this, we can draw the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots.
from verticapy.learn.model_selection import plot_acf_pacf
plot_acf_pacf(commodities,
column = "Spread",
ts = "date",
p = 60)
We can clearly see the influence of the last two values on 'Spread,' which makes sense. When the curve slightly changes its direction, it will increase/decrease until reaching a new local maximum/minimum. Only the recent values can help the prediction in case of autoregressive periodical model. The local minimums of interest rate spreads are indicators of an economic crisis.
We saw the correlation between the price-per-barrel of Oil and the time. Let's look at the time series plot of this variable.
commodities["Oil"].plot(ts = "date")
Our chart shows effects of the 2010s oil glut and its lowest price in two decades with COVID-19.
plot_acf_pacf(commodities,
column = "Oil",
ts = "date",
p = 60)