verticapy.machine_learning.vertica.tsa.MA#
- class verticapy.machine_learning.vertica.tsa.MA(name: str = None, overwrite_model: bool = False, q: int = 1, penalty: Literal[None, 'none', 'l2'] = 'none', C: int | float | Decimal = 1.0, missing: Literal['drop', 'raise', 'zero', 'linear_interpolation'] = 'linear_interpolation')#
Creates an in-DB Moving Average model.
New in version 11.0.0.
Note
The MA model may be faster and more accurate than ARIMA(0, 0, q) or ARMA(0, q) because the underlying algorithm of MA is quite different.
Parameters#
- name: str, optional
Name of the model. The model is stored in the database.
- overwrite_model: bool, optional
If set to True, training a model with the same name as an existing model overwrites the existing model.
- q: int, optional
Integer in the range [1, 67), the number of lags to consider in the computation.
- penalty: str, optional
Method of regularization.
- none:
No regularization.
- l2:
L2 regularization.
- C: PythonNumber, optional
The regularization parameter value. The value must be non-negative.
- missing: str, optional
Method for handling missing values, one of the following strings:
- ‘drop’:
Missing values are ignored.
- ‘raise’:
Missing values raise an error.
- ‘zero’:
Missing values are set to zero.
- ‘linear_interpolation’:
Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.
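The 'linear_interpolation' strategy can be sketched in plain Python. This is an illustrative stand-in only (the actual handling happens in-database in SQL); it mirrors the behavior described above, including the error when the first or last value is missing:

```python
def linear_interpolate(values):
    """Replace None entries with linearly interpolated values based on
    the nearest valid entries before and after each gap.

    Illustrative sketch of the 'linear_interpolation' strategy; errors
    if the first or last value is missing, as described above.
    """
    if values[0] is None or values[-1] is None:
        raise ValueError("cannot interpolate a missing first or last value")
    out = list(values)
    i = 0
    while i < len(out):
        if out[i] is None:
            # Find the next valid entry to the right of the gap.
            j = i
            while out[j] is None:
                j += 1
            left, right = out[i - 1], out[j]
            gap = j - (i - 1)
            for k in range(i, j):
                out[k] = left + (right - left) * (k - (i - 1)) / gap
            i = j
        else:
            i += 1
    return out
```

For example, `linear_interpolate([1.0, None, 3.0])` fills the gap with `2.0`.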
Attributes#
Many attributes are created during the fitting phase.
- theta_: numpy.array
The theta coefficient of the Moving Average process. It signifies the impact and contribution of the lagged error terms in determining the current value within the time series model.
- mu_: float
Represents the mean or average of the series. It is a constant term that reflects the expected value of the time series in the absence of any temporal dependencies or influences from past error terms.
- mean_: float
The mean of the time series values.
- mse_: float
The mean squared error (MSE) of the model, computed using one-step forward forecasting. Because one-step-ahead errors can understate the true forecast error, utilizing a full forecasting approach is recommended to compute a more meaningful and comprehensive metric.
- n_: int
The number of rows used to fit the model.
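To see how these fitted attributes combine, here is a hypothetical sketch of an MA(q) one-step prediction, y(t) = mu + Σ theta[i]·e(t-1-i). The values for `mu_` and `theta_` below are made-up stand-ins, not fitted output:

```python
def ma_one_step(mu_, theta_, past_errors):
    """Predict the next value of an MA(q) process from the q most
    recent residuals (past_errors[0] is the most recent).

    Sketch only: mu_ plays the role of the constant term and theta_
    the lagged-error coefficients described in the attributes above.
    """
    return mu_ + sum(t * e for t, e in zip(theta_, past_errors))

# Hypothetical fitted values for an MA(2) model:
mu_ = 10.0
theta_ = [0.5, 0.25]
prediction = ma_one_step(mu_, theta_, past_errors=[2.0, 4.0])  # 12.0
```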
Note
All attributes can be accessed using the get_attributes() method.
Note
Several other attributes can be accessed by using the get_vertica_attributes() method.
Examples#
The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Machine Learning or the Examples section on the website.
Initialization#
We import verticapy:
import verticapy as vp
Hint
By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like "average" and "median", which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.
For this example, we will generate a dummy time-series dataset that has some noise variation centered around a mean value.
import random

# Initialization
N = 30  # Number of rows
temp = [23] * N
noisy_temp = [x + random.uniform(-5, 5) for x in temp]

# Building the vDataFrame
data = vp.vDataFrame(
    {
        "day": [i for i in range(1, N + 1)],
        "temp": noisy_temp,
    }
)
day (Integer)  temp (Numeric(19))
1   21.662884519150932
2   22.026766951993785
3   23.732593461073865
4   22.406725325955627
5   21.049988417476907
6   24.143901247703276
7   24.401658125360335
8   22.75079510740254
9   20.27199280977796
10  22.131577781675222
11  24.777913120303204
12  22.776659858718958
13  27.79318880685984
14  23.33301269614893
15  26.558679903927896
16  27.73744159306446
17  27.01019690491039
18  18.692709815830582
19  21.35450284681953
20  25.159758639588077
21  20.556031647991936
22  26.612993545644272
23  24.931764912759245
24  25.69685361560668
25  18.805605751532845
26  20.3730966019823
27  24.984742340037144
28  23.207967672811186
29  19.121727179549712
30  20.994745650008404
Rows: 1-30 | Columns: 2
Note
VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets, which provides detailed information on each dataset and how to use them effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.
We can plot the data to visually inspect it for the presence of any trends:
data["temp"].plot(ts = "day")
It is obvious there is no trend in our example, but we can confirm it with the mkt() (Mann-Kendall) test:
from verticapy.machine_learning.model_selection.statistical_tests import mkt

mkt(data, column = "temp", ts = "day")
value
Mann Kendall Test Statistic  0.0
S                            1.0
STDS                         56.0505724026675
p_value                      1.0
Monotonic Trend              ❌
Trend                        no trend
Rows: 1-6 | Columns: 2
The above report confirms that there is no trend in our data, so it is stationary. Note the high p-value, which is also indicative of the absence of a trend. Once we have established that the data is stationary, we can apply the Moving Average model to it.
Model Initialization#
First we import the MA model:
from verticapy.machine_learning.vertica.tsa import MA
Then we can create the model:
model = MA(q = 2)
Hint
In verticapy 1.0.x and higher, you do not need to specify the model name, as the name is automatically assigned. If you need to re-use the model, you can fetch the model name from the model's attributes.
Important
The model name is crucial for the model management system and versioning. It’s highly recommended to provide a name if you plan to reuse the model later.
Model Fitting#
We can now fit the model:
model.fit(data, "day", "temp")
Important
To train a model, you can directly use the vDataFrame or the name of the relation stored in the database. The test set is optional and is only used to compute the test metrics. In verticapy, we don't work using X matrices and y vectors. Instead, we work directly with lists of predictors and the response name.
Model Register#
In order to register the model for tracking and versioning:
model.register("model_v1")
Please refer to Model Tracking and Versioning for more details on model tracking and versioning.
An important point in time-series forecasting is that there are two types of forecasting:
One-step ahead forecasting
Full forecasting
One-step ahead#
In this type of forecasting, the algorithm utilizes the true value of the previous timestamp (t-1) to predict the immediate next timestamp (t). Subsequently, to forecast additional steps into the future (t+1), it relies on the actual value of the immediately preceding timestamp (t).
A notable drawback of this forecasting method is its tendency to exhibit exaggerated accuracy, particularly when predicting more than one step into the future.
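The one-step-ahead scheme can be sketched as follows. This is a simplified illustration, not the in-DB algorithm: residuals are approximated here as deviations from the mean, whereas a fitted MA model uses its true innovations.

```python
def one_step_forecasts(series, mu, theta):
    """One-step-ahead forecasting sketch: each prediction for time t
    uses errors derived from the *true* observed values at the q
    previous timestamps, never from earlier predictions.

    Simplification: residuals are taken as y - mu.
    """
    q = len(theta)
    errors = [y - mu for y in series]
    preds = []
    for t in range(q, len(series)):
        recent = errors[t - q:t][::-1]  # most recent error first
        preds.append(mu + sum(th * e for th, e in zip(theta, recent)))
    return preds
```

Because every prediction is re-anchored on observed data, the errors never compound, which is why this scheme tends to look more accurate than it would be in a real multi-step forecast.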
Metrics#
We can get the entire report using:
model.report(start = 3)
value
explained_variance       -1.05378265098599
max_error                10.1498911305363
median_absolute_error    2.06019249966271
mean_absolute_error      3.04025284309777
mean_squared_error       14.87762744397
root_mean_squared_error  3.85715276388815
r2                       -1.05960973472998
r2_adj                   -1.14199412411918
aic                      77.7295147428362
bic                      79.4878551415115
Rows: 1-10 | Columns: 2
Important
The value for start has to be greater than the q value selected for the MA model.
You can also choose the number of predictions and where to start the forecast. For example, the following code will allow you to generate a report with 10 predictions, starting the forecasting process at index 25.
model.report(start = 25, npredictions = 10)
value
explained_variance       -1.99397622692757
max_error                6.37627336159975
median_absolute_error    1.90169166063587
mean_absolute_error      2.87136053629103
mean_squared_error       13.6194408715668
root_mean_squared_error  3.69045266485925
r2                       -2.10062663255054
r2_adj                   -3.13416884340071
aic                      27.0574912393226
bic                      16.2763670641908
Rows: 1-10 | Columns: 2
Important
Most metrics are computed using a single SQL query, but some of them might require multiple SQL queries. Selecting only the necessary metrics in the report can help optimize performance, e.g. model.report(metrics = ["mse", "r2"]).
You can utilize the score() function to calculate various regression metrics, with the explained variance being the default.
model.score(start = 25, npredictions = 10)
Out[5]: -2.10062663255054
Important
If you do not specify a starting point and the number of predictions, the forecast will begin at one-fourth of the dataset, which can result in an inaccurate score, especially for large datasets. It’s important to choose these parameters carefully.
Prediction#
Prediction is straightforward:
model.predict()
prediction (Float(22))
1   19.3605875902352
2   18.4559994369852
3   16.8958342018877
4   14.9641150617587
5   12.3977726232792
6   9.04424416712433
7   4.6453743913806
8   -1.11961553560087
9   -8.67652277981185
10  -18.5818606246452
Rows: 1-10 | Column: prediction | Type: Float(22)
Hint
You can control the number of prediction steps by changing the npredictions parameter: model.predict(npredictions = 30).
Note
Predictions can be made automatically by using the training set, in which case you don't need to specify the predictors. Alternatively, you can pass only the vDataFrame to the predict() function, but in this case, it's essential that the column names of the vDataFrame match the predictors and response name in the model.
If you would like to have the 'time-stamps' (ts) in the output, you can set the output_estimated_ts parameter.
model.predict(output_estimated_ts = True)
day (Float(22))  prediction (Float(22))
1   31.0  19.3605875902352
2   32.0  18.4559994369852
3   33.0  16.8958342018877
4   34.0  14.9641150617587
5   35.0  12.3977726232792
6   36.0  9.04424416712433
7   37.0  4.6453743913806
8   38.0  -1.11961553560087
9   39.0  -8.67652277981185
10  40.0  -18.5818606246452
Rows: 1-10 | Columns: 2
Important
The output_estimated_ts parameter provides an estimation of 'ts', assuming that 'ts' is regularly spaced.
If you don't provide any input, the function will begin forecasting after the last known value. If you want to forecast starting from a specific value within the input dataset or another dataset, you can use the following syntax.
model.predict(
    data,
    "day",
    "temp",
    start = 25,
    npredictions = 10,
    output_estimated_ts = True,
)
day (Float(22))  prediction (Float(22))
1   26.0  19.7615572264859
2   27.0  18.6084689784374
3   28.0  23.8795480070827
4   29.0  23.9174451290013
5   30.0  19.0930539893725
6   31.0  19.3605875902352
7   32.0  18.4559994369852
8   33.0  16.8958342018877
9   34.0  14.9641150617587
10  35.0  12.3977726232792
Rows: 1-10 | Columns: 2
Plots#
We can conveniently plot the predictions on a line plot to observe the efficacy of our model:
model.plot(data, "day", "temp", npredictions = 15, start=25)
Note
You can control the number of prediction steps by changing the npredictions parameter: model.plot(npredictions = 30).
Please refer to Machine Learning - Time Series Plots for more examples.
Full forecasting#
In this forecasting approach, the algorithm relies solely on a chosen true value for initiation. Subsequently, all predictions are established based on a series of previously predicted values.
This methodology aligns the accuracy of predictions more closely with reality. In practical forecasting scenarios, the goal is to predict all future steps, and this technique ensures a progressive sequence of predictions.
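The recursive scheme can be sketched as follows. This is a textbook-style sketch, not the in-DB algorithm: once the known residuals run out, future innovations are replaced by their expectation (zero), so a pure textbook MA(q) reverts to the mean after q steps. The diverging predictions in the example outputs further below show that Vertica's underlying implementation behaves differently.

```python
def full_forecast(mu, theta, last_errors, npredictions):
    """Full-forecasting sketch for an MA(q) process: only the known
    residuals at the starting point are used, and each subsequent
    step treats the predicted innovation as zero.

    last_errors: the q most recent residuals, most recent first.
    """
    errors = list(last_errors)
    preds = []
    for _ in range(npredictions):
        preds.append(mu + sum(t * e for t, e in zip(theta, errors)))
        errors = [0.0] + errors  # future innovations are unknown -> 0
    return preds
```

For instance, with mu = 5.0, theta = [0.5] and one known residual 2.0, the sketch yields 6.0 for the first step and then reverts to the mean 5.0.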
Metrics#
We can get the report using:
model.report(start = 25, method = "forecast")
By selecting start = 25, we will measure the accuracy from the 25th time-stamp and continue the assessment until the last available time-stamp.
value
explained_variance       -0.905916751996866
max_error                9.21892717443571
median_absolute_error    6.62095146016343
mean_absolute_error      5.61945090406766
mean_squared_error       39.9499299242951
root_mean_squared_error  6.32059569378513
r2                       -8.09507356872479
r2_adj                   -11.1267647582997
aic                      32.4381345906276
bic                      21.6570104154958
Rows: 1-10 | Columns: 2
Notice that the accuracy using method = "forecast" is poorer than with one-step ahead forecasting.
You can utilize the score() function to calculate various regression metrics, with the explained variance being the default.
model.score(start = 25, npredictions = 30, method = "forecast")
Out[6]: -8.09507356872479
Prediction#
Prediction is straightforward:
model.predict(start = 25, npredictions = 15, method = "forecast")
prediction (Float(22))
1   19.7615572264859
2   17.9919312016457
3   16.5870162126478
4   14.4687018076985
5   11.7758184755727
6   8.22070244505243
7   3.56840539218381
8   -2.53204224580659
9   -10.5276701572297
10  -21.008372746107
11  -34.7461787681022
12  -52.7534068235429
13  -76.3568723556198
14  -107.295768363823
15  -147.849780106188
Rows: 1-15 | Column: prediction | Type: Float(22)
If you want to forecast starting from a specific value within the input dataset or another dataset, you can use the following syntax.
model.predict(
    data,
    "day",
    "temp",
    start = 25,
    npredictions = 20,
    output_estimated_ts = True,
    output_standard_errors = True,
    method = "forecast",
)
day (Float(22))  prediction (Float(22))
1   26.0  19.7615572264859
2   27.0  17.9919312016457
3   28.0  16.5870162126478
4   29.0  14.4687018076985
5   30.0  11.7758184755727
6   31.0  8.22070244505243
7   32.0  3.56840539218381
8   33.0  -2.53204224580659
9   34.0  -10.5276701572297
10  35.0  -21.008372746107
11  36.0  -34.7461787681022
12  37.0  -52.7534068235429
13  38.0  -76.3568723556198
14  39.0  -107.295768363823
15  40.0  -147.849780106188
16  41.0  -201.007071008859
17  42.0  -270.684457882112
18  43.0  -362.016016794863
19  44.0  -481.73137964174
20  45.0  -638.651597284325
Rows: 1-20 | Columns: 2
Plots#
We can conveniently plot the predictions on a line plot to observe the efficacy of our model:
model.plot(data, "day", "temp", npredictions = 15, start = 25, method = "forecast")
- __init__(name: str = None, overwrite_model: bool = False, q: int = 1, penalty: Literal[None, 'none', 'l2'] = 'none', C: int | float | Decimal = 1.0, missing: Literal['drop', 'raise', 'zero', 'linear_interpolation'] = 'linear_interpolation') None #
Must be overridden in the child class
Methods
__init__([name, overwrite_model, q, ...]): Must be overridden in the child class.
contour([nbins, chart]): Draws the model's contour plot.
deploySQL([ts, y, start, npredictions, ...]): Returns the SQL code needed to deploy the model.
does_model_exists(name[, raise_error, ...]): Checks whether the model is stored in the Vertica database.
drop(): Drops the model from the Vertica database.
export_models(name, path[, kind]): Exports machine learning models.
features_importance([show, chart]): Computes the model's features importance.
fit(input_relation, ts, y[, test_relation, ...]): Trains the model.
get_attributes([attr_name]): Returns the model attributes.
get_match_index(x, col_list[, str_check]): Returns the matching index.
get_params(): Returns the parameters of the model.
get_plotting_lib([class_name, chart, ...]): Returns the first available library (Plotly, Matplotlib, or Highcharts) to draw a specific graphic.
get_vertica_attributes([attr_name]): Returns the model Vertica attributes.
import_models(path[, schema, kind]): Imports machine learning models.
plot([vdf, ts, y, start, npredictions, ...]): Draws the model.
predict([vdf, ts, y, start, npredictions, ...]): Predicts using the input relation.
register(registered_name[, raise_error]): Registers the model and adds it to the in-DB Model versioning environment with a status of 'under_review'.
regression_report([metrics, start, ...]): Computes a regression report using multiple metrics to evaluate the model (r2, mse, max error...).
report([metrics, start, npredictions, method]): Computes a regression report using multiple metrics to evaluate the model (r2, mse, max error...).
score([metric, start, npredictions, method]): Computes the model score.
set_params([parameters]): Sets the parameters of the model.
summarize(): Summarizes the model.
to_binary(path): Exports the model to the Vertica Binary format.
to_pmml(path): Exports the model to PMML.
to_python([return_proba, ...]): Returns the Python function needed for in-memory scoring without using built-in Vertica functions.
to_sql([X, return_proba, ...]): Returns the SQL code needed to deploy the model without using built-in Vertica functions.
to_tf(path): Exports the model to the Frozen Graph format (TensorFlow).
Attributes