verticapy.machine_learning.vertica.tsa.AR#

class verticapy.machine_learning.vertica.tsa.AR(name: str = None, overwrite_model: bool = False, p: int = 3, method: Literal['ols', 'yule-walker'] = 'ols', penalty: Literal[None, 'none', 'l2'] = 'none', C: int | float | Decimal = 1.0, missing: Literal['drop', 'raise', 'zero', 'linear_interpolation'] = 'linear_interpolation')#

Creates an in-database (inDB) autoregressive (AR) model.

New in version 11.0.0.

Note

The AR model is much faster than ARIMA(p, 0, 0) or ARMA(p, 0) because the underlying algorithm of AR is quite different.

Parameters#

name: str, optional

Name of the model. The model is stored in the database.

overwrite_model: bool, optional

If set to True, training a model with the same name as an existing model overwrites the existing model.

p: int, optional

Integer in the range [1, 1999], the number of lags to consider in the computation. Larger values for p weaken the correlation.

method: str, optional

One of the following algorithms for training the model:

  • ols:

    Ordinary Least Squares

  • yule-walker:

    Yule-Walker
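To make the ols option more concrete, here is a minimal pure-Python sketch of fitting an AR(p) model by ordinary least squares. It is an illustration only, not the algorithm used by VerticaPy or Vertica; the helper names fit_ar_ols and _solve are hypothetical, and the series is the dummy GB data from the examples on this page.

```python
def _solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so that row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ar_ols(series, p):
    """Return (intercept, phi) minimising the squared one-step errors."""
    # Each design row is [1, y[t-1], ..., y[t-p]]; the target is y[t].
    rows = [[1.0] + [series[t - k] for k in range(1, p + 1)]
            for t in range(p, len(series))]
    targets = [series[t] for t in range(p, len(series))]
    k = p + 1
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    beta = _solve(xtx, xty)
    return beta[0], beta[1:]


gb = [5, 10, 20, 35, 55, 80, 110, 145, 185, 230]
intercept, phi = fit_ar_ols(gb, p=2)
print(intercept, phi)
```

For this series the fit recovers, up to rounding, the recurrence y[t] = 5 + 2*y[t-1] - y[t-2], because the data follow it exactly.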

penalty: str, optional

Method of regularization.

  • none:

    No regularization.

  • l2:

    L2 regularization.

C: PythonNumber, optional

The regularization parameter value. The value must be non-negative (zero or greater).

missing: str, optional

Method for handling missing values, one of the following strings:

  • ‘drop’:

    Missing values are ignored.

  • ‘raise’:

    Missing values raise an error.

  • ‘zero’:

    Missing values are set to zero.

  • ‘linear_interpolation’:

    Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. If the first or last value in the dataset is missing, the function raises an error.
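The linear_interpolation behaviour described above can be sketched in plain Python as follows. This is only an illustration of the idea, not the Vertica implementation: the function name fill_linear is hypothetical, missing values are represented by None, and the sketch interpolates by position, which assumes evenly spaced timestamps.

```python
def fill_linear(values):
    """Replace None entries by linear interpolation between valid neighbours."""
    if not values:
        return []
    if values[0] is None or values[-1] is None:
        # Mirrors the documented behaviour: a missing first or last value is an error.
        raise ValueError("first and last values must not be missing")
    filled = list(values)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            left = i - 1                  # last valid index before the gap
            right = i
            while filled[right] is None:  # first valid index after the gap
                right += 1
            step = (filled[right] - filled[left]) / (right - left)
            for j in range(left + 1, right):
                filled[j] = filled[left] + step * (j - left)
            i = right
        else:
            i += 1
    return filled


print(fill_linear([5, None, None, 35, 55]))  # [5, 15.0, 25.0, 35, 55]
```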

Attributes#

Many attributes are created during the fitting phase.

phi_: numpy.array

The coefficients of the autoregressive process, one per lag. They represent the strength and direction of the relationship between a variable and its past values.

intercept_: float

Represents the expected value of the time series when the lagged values are zero. It signifies the baseline or constant term in the model, capturing the average level of the series in the absence of any historical influence.
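For reference, the standard textbook form of an AR(p) process is shown below. The mapping of symbols to attributes is inferred from the descriptions above: c corresponds to intercept_, the coefficients \phi_1, ..., \phi_p to the entries of phi_, and \varepsilon_t is the noise term.

```latex
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t
```

When all lagged values are zero, the expected value of y_t reduces to c, which is the interpretation given for intercept_.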

features_importance_: numpy.array

Feature importance is computed from the autoregressive coefficients, which are normalized based on their range; an activation function then produces the final score. The values are computed the first time the features_importance() method is called and are reused by subsequent calls.

mse_: float

The mean squared error (MSE) of the model, based on one-step forward forecasting. This value may not always be relevant; computing the metric with full forecasting is recommended for a more meaningful and comprehensive measure.

n_: int

The number of rows used to fit the model.

Note

All attributes can be accessed using the get_attributes() method.

Note

Several other attributes can be accessed by using the get_vertica_attributes() method.

Examples#

The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Machine Learning or the Examples section on the website.

Initialization#

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will generate a dummy time-series dataset.

data = vp.vDataFrame(
    {
        "month": [i for i in range(1, 11)],
        "GB": [5, 10, 20, 35, 55, 80, 110, 145, 185, 230],
    }
)

     month      GB
     Integer    Integer
 1   1          5
 2   2          10
 3   3          20
 4   4          35
 5   5          55
 6   6          80
 7   7          110
 8   8          145
 9   9          185
10   10         230
Rows: 1-10 | Columns: 2

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets section, which provides detailed information on each dataset and how to use it effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.

We can plot the data to visually inspect it for the presence of any trends:

data["GB"].plot(ts = "month")

Though the increasing trend is obvious in our example, we can confirm it with the Mann-Kendall test, mkt():

from verticapy.machine_learning.model_selection.statistical_tests import mkt

mkt(data, column = "GB", ts = "month")

                              value
Mann Kendall Test Statistic   3.935479640399647
S                             45.0
STDS                          11.1803398874989
p_value                       8.303070332644367e-05
Monotonic Trend               True
Trend                         increasing
Rows: 1-6 | Columns: 2

The above test gives us more insight into the data: it is monotonic and increasing. Furthermore, the low p-value confirms the presence of a trend with respect to time. Now that we are sure of the trend, we can apply an appropriate time-series model to fit it.
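For a series this small, the test statistics can be reproduced by hand. The sketch below assumes the standard Mann-Kendall formulas without tie correction and a two-sided normal approximation; it is not the code behind mkt(), and the function name mann_kendall is hypothetical.

```python
import math


def mann_kendall(values):
    """Return (S, std of S, Z statistic, two-sided p-value), assuming no ties."""
    n = len(values)
    # S counts increasing pairs minus decreasing pairs.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    # Standard deviation of S under the null hypothesis (no tie correction).
    std_s = math.sqrt(n * (n - 1) * (2 * n + 5) / 18)
    # Continuity correction: shift S one unit towards zero.
    if s > 0:
        z = (s - 1) / std_s
    elif s < 0:
        z = (s + 1) / std_s
    else:
        z = 0.0
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return s, std_s, z, p_value


gb = [5, 10, 20, 35, 55, 80, 110, 145, 185, 230]
s, std_s, z, p_value = mann_kendall(gb)
print(s, std_s, z, p_value)
```

Because every later value is larger than every earlier one, all 45 pairs are increasing, which gives S = 45 as in the output above.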

Model Initialization#

First we import the AR model:

from verticapy.machine_learning.vertica.tsa import AR

Then we can create the model:

model = AR(p = 2)

Hint

In verticapy 1.0.x and higher, you do not need to specify the model name, as the name is automatically assigned. If you need to re-use the model, you can fetch the model name from the model’s attributes.

Important

The model name is crucial for the model management system and versioning. It’s highly recommended to provide a name if you plan to reuse the model later.

Model Fitting#

We can now fit the model:

model.fit(data, "month", "GB")

Important

To train a model, you can directly use the vDataFrame or the name of the relation stored in the database. The test set is optional and is only used to compute the test metrics. In verticapy, we don’t work using X matrices and y vectors. Instead, we work directly with lists of predictors and the response name.

Features Importance#

We can conveniently get the features importance:

model.features_importance()

Model Register#

In order to register the model for tracking and versioning:

model.register("model_v1")

Please refer to Model Tracking and Versioning for more details on model tracking and versioning.


Time-series forecasting can be performed in two ways:

  • One-step ahead forecasting

  • Full forecasting

Important

The default method is one-step ahead forecasting. To use full forecasting, use method = "forecast".

One-step ahead#

In this type of forecasting, the algorithm utilizes the true value of the previous timestamp (t-1) to predict the immediate next timestamp (t). Subsequently, to forecast additional steps into the future (t+1), it relies on the actual value of the immediately preceding timestamp (t).

A notable drawback of this forecasting method is its tendency to exhibit exaggerated accuracy, particularly when predicting more than one step into the future.
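The effect can be illustrated with a small pure-Python sketch. The AR(2) coefficients below are hypothetical and deliberately slightly off from the exact recurrence of the dummy series, so that both approaches make errors; nothing here calls VerticaPy. One-step ahead predictions always restart from true values, whereas a recursive forecast feeds its own predictions back in, so its error accumulates.

```python
gb = [5, 10, 20, 35, 55, 80, 110, 145, 185, 230]

# Hypothetical AR(2) coefficients, a little off from the exact recurrence
# of this series (y[t] = 5 + 2*y[t-1] - y[t-2]).
intercept, phi1, phi2 = 5.0, 1.95, -0.95
start = 4  # first index to predict


def one_step_ahead(series):
    """Each prediction uses the TRUE previous two values."""
    return [intercept + phi1 * series[t - 1] + phi2 * series[t - 2]
            for t in range(start, len(series))]


def full_forecast(series):
    """Only values before `start` are true; later inputs are earlier predictions."""
    history = list(series[:start])
    for _ in range(start, len(series)):
        history.append(intercept + phi1 * history[-1] + phi2 * history[-2])
    return history[start:]


actual = gb[start:]
err_one_step = max(abs(a - p) for a, p in zip(actual, one_step_ahead(gb)))
err_full = max(abs(a - p) for a, p in zip(actual, full_forecast(gb)))
print(err_one_step, err_full)
```

The one-step ahead error stays small even with imperfect coefficients, which is why metrics based on it can look better than the model's real multi-step performance.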

Metrics#

We can get the entire report using:

model.report(start = 4)

                           value
explained_variance         1.0
max_error                  3.66071617463604e-11
median_absolute_error      1.89857018995099e-11
mean_absolute_error        2.05015264024648e-11
mean_squared_error         5.16565676074652e-22
root_mean_squared_error    2.27280812228981e-11
r2                         1.0
r2_adj                     1.0
aic                        -283.422372105612
bic                        -290.505519833823
Rows: 1-10 | Columns: 2

Important

The value for start cannot be less than the p value selected for the AR model.

You can also choose the number of predictions and where to start the forecast. For example, the following code generates a report with 10 predictions, starting the forecasting process at index 4.

model.report(start = 4, npredictions = 10)

                           value
explained_variance         1.0
max_error                  3.66071617463604e-11
median_absolute_error      1.89857018995099e-11
mean_absolute_error        2.05015264024648e-11
mean_squared_error         5.16565676074652e-22
root_mean_squared_error    2.27280812228981e-11
r2                         1.0
r2_adj                     1.0
aic                        -283.422372105612
bic                        -290.505519833823
Rows: 1-10 | Columns: 2

Important

Most metrics are computed using a single SQL query, but some of them might require multiple SQL queries. Selecting only the necessary metrics in the report can help optimize performance. E.g. model.report(metrics = ["mse", "r2"]).
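As a rough guide to what several of the reported metrics measure, the sketch below computes them from two plain Python lists using common textbook definitions. It is not how VerticaPy computes them (the library runs SQL in the database), the function name regression_metrics is hypothetical, and the predicted values are made up for illustration.

```python
import math


def regression_metrics(actual, predicted):
    """Compute a few common regression metrics from two equally long lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    mean_error = sum(errors) / n
    mse = sum(e * e for e in errors) / n
    var_actual = sum((a - mean_actual) ** 2 for a in actual) / n
    var_error = sum((e - mean_error) ** 2 for e in errors) / n
    return {
        "explained_variance": 1 - var_error / var_actual,
        "max_error": max(abs(e) for e in errors),
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "r2": 1 - mse / var_actual,
    }


actual = [55, 80, 110, 145, 185, 230]
predicted = [54, 81, 110, 144, 186, 230]  # made-up predictions for illustration
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Explained variance and r2 coincide here because the made-up errors average to zero; they differ when the predictions are biased.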

You can utilize the score() function to calculate various regression metrics, with the explained variance being the default.

model.score(start = 3, npredictions = 30)
Out[7]: 1.0

Important

If you do not specify a starting point and the number of predictions, the forecast will begin at one-fourth of the dataset, which can result in an inaccurate score, especially for large datasets. It’s important to choose these parameters carefully.

Prediction#

Prediction is straightforward:

model.predict()

     prediction
     Float(22)
 1   279.999999999954
 2   334.999999999853
 3   394.999999999684
 4   459.999999999435
 5   529.999999999093
 6   604.999999998642
 7   684.999999998068
 8   769.999999997352
 9   859.999999996477
10   954.999999995423
Rows: 1-10 | Column: prediction | Type: Float(22)

Hint

You can control the number of prediction steps by changing the npredictions parameter: model.predict(npredictions = 30).

Note

Predictions can be made automatically by using the training set, in which case you don’t need to specify the predictors. Alternatively, you can pass only the vDataFrame to the predict() function, but in this case, it’s essential that the column names of the vDataFrame match the predictors and response name in the model.

If you would like to have the time-stamps (ts) in the output, you can set the output_estimated_ts parameter to True.

model.predict(output_estimated_ts = True)

     month        prediction
     Float(22)    Float(22)
 1   11.0         279.999999999954
 2   12.0         334.999999999853
 3   13.0         394.999999999684
 4   14.0         459.999999999435
 5   15.0         529.999999999093
 6   16.0         604.999999998642
 7   17.0         684.999999998068
 8   18.0         769.999999997352
 9   19.0         859.999999996477
10   20.0         954.999999995423
Rows: 1-10 | Columns: 2

Important

The output_estimated_ts parameter provides an estimation of ‘ts’ assuming that ‘ts’ is regularly spaced.
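One simple way to picture this estimation is to extend the time axis by the average spacing of the observed time-stamps. The sketch below does exactly that; it is an assumption about the general idea, not VerticaPy's actual estimation logic, and the function name estimate_future_ts is hypothetical.

```python
def estimate_future_ts(ts, npredictions):
    """Extrapolate time-stamps, assuming the observed ones are regularly spaced."""
    step = (ts[-1] - ts[0]) / (len(ts) - 1)  # average spacing between observations
    return [ts[-1] + step * k for k in range(1, npredictions + 1)]


months = list(range(1, 11))
print(estimate_future_ts(months, 10))
```

For the months 1 to 10 this yields 11.0 through 20.0, in line with the month column of the output above. With irregularly spaced time-stamps, such an estimate would only be approximate.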

If you don’t provide any input, the function will begin forecasting after the last known value. If you want to forecast starting from a specific value within the input dataset or another dataset, you can use the following syntax.

model.predict(
    data,
    "month",
    "GB",
    start = 7,
    npredictions = 10,
    output_estimated_ts = True,
)

     month        prediction
     Float(22)    Float(22)
 1   8.0          144.999999999978
 2   9.0          184.999999999971
 3   10.0         229.999999999963
 4   11.0         279.999999999954
 5   12.0         334.999999999853
 6   13.0         394.999999999684
 7   14.0         459.999999999435
 8   15.0         529.999999999093
 9   16.0         604.999999998642
10   17.0         684.999999998068
Rows: 1-10 | Columns: 2

Plots#

We can conveniently plot the predictions on a line plot to observe the efficacy of our model:

model.plot(data, "month", "GB", npredictions = 10, start=7)

Note

You can control the number of prediction steps by changing the npredictions parameter: model.plot(npredictions = 30).

Please refer to Machine Learning - Time Series Plots for more examples.

Full forecasting#

In this forecasting approach, the algorithm relies solely on a chosen true value for initiation. Subsequently, all predictions are established based on a series of previously predicted values.

This methodology aligns the accuracy of predictions more closely with reality. In practical forecasting scenarios, the goal is to predict all future steps, and this technique ensures a progressive sequence of predictions.
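A recursive forecast of this kind can be sketched in a few lines of plain Python. The coefficients below are assumptions chosen for illustration: the dummy series satisfies y[t] = 5 + 2*y[t-1] - y[t-2] exactly, so they stand in for intercept_ and phi_ of a fitted AR(2) model. The function name forecast is hypothetical and no VerticaPy call is involved.

```python
gb = [5, 10, 20, 35, 55, 80, 110, 145, 185, 230]

# Assumed coefficients: the dummy series follows y[t] = 5 + 2*y[t-1] - y[t-2],
# so these play the role of intercept_ and phi_ for illustration.
intercept, phi = 5.0, [2.0, -1.0]


def forecast(series, npredictions):
    """Recursive forecast: each new value is fed back in as a lag for the next."""
    history = list(series)
    for _ in range(npredictions):
        lags = history[-1:-len(phi) - 1:-1]  # most recent value first
        history.append(intercept + sum(c * y for c, y in zip(phi, lags)))
    return history[len(series):]


print(forecast(gb, 5))  # [280.0, 335.0, 395.0, 460.0, 530.0]
```

Only the last two observed values are true inputs; every later step depends on earlier predictions, so any coefficient error compounds as the horizon grows.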

Metrics#

We can get the report using:

model.report(start = 4, method = "forecast")

By selecting start = 4, we measure the accuracy from the 4th time-stamp and continue the assessment until the last available time-stamp.

                           value
explained_variance         1.0
max_error                  3.31056071445346e-10
median_absolute_error      9.27684595808387e-11
mean_absolute_error        1.26798719672176e-10
mean_squared_error         2.87684111264052e-20
root_mean_squared_error    1.69612532338873e-10
r2                         1.0
r2_adj                     1.0
aic                        -259.303387354852
bic                        -266.386535083063
Rows: 1-10 | Columns: 2

Notice that the errors obtained with method = "forecast" are larger than those obtained with one-step ahead forecasting.

You can utilize the score() function to calculate various regression metrics, with the explained variance being the default.

model.score(start = 4, npredictions = 6, method = "forecast")
Out[8]: 1.0

Prediction#

Prediction is straightforward:

model.predict(npredictions = 40, method = "forecast")

     prediction
     Float(22)
 1   279.999999999954
 2   334.999999999853
 3   394.999999999684
 4   459.999999999435
 5   529.999999999093
 6   604.999999998642
 7   684.999999998068
 8   769.999999997352
 9   859.999999996477
10   954.999999995423
11   1054.99999999417
12   1159.9999999927
13   1269.99999999098
14   1384.99999998899
15   1504.99999998672
16   1629.99999998412
17   1759.99999998118
18   1894.99999997786
19   2034.99999997414
20   2179.99999996999
21   2329.99999996537
22   2484.99999996025
23   2644.9999999546
24   2809.99999994838
25   2979.99999994155
26   3154.99999993408
27   3334.99999992593
28   3519.99999991706
29   3709.99999990743
30   3904.99999989699
31   4104.99999988571
32   4309.99999987353
33   4519.99999986042
34   4734.99999984632
35   4954.9999998312
36   5179.99999981499
37   5409.99999979765
38   5644.99999977913
39   5884.99999975938
40   6129.99999973834
Rows: 1-40 | Column: prediction | Type: Float(22)

If you want to forecast starting from a specific value within the input dataset or another dataset, you can use the following syntax.

model.predict(
    data,
    "month",
    "GB",
    start = 4,
    npredictions = 20,
    output_estimated_ts = True,
    output_standard_errors = True,
    method = "forecast"
)

     month        prediction
     Float(22)    Float(22)
 1   5.0          54.9999999999918
 2   6.0          79.999999999972
 3   7.0          109.999999999936
 4   8.0          144.999999999878
 5   9.0          184.999999999792
 6   10.0         229.999999999669
 7   11.0         279.9999999995
 8   12.0         334.999999999276
 9   13.0         394.999999998984
10   14.0         459.999999998612
11   15.0         529.999999998147
12   16.0         604.999999997573
13   17.0         684.999999996876
14   18.0         769.999999996037
15   19.0         859.999999995039
16   20.0         954.999999993862
17   21.0         1054.99999999249
18   22.0         1159.99999999089
19   23.0         1269.99999998905
20   24.0         1384.99999998694
Rows: 1-20 | Columns: 2

Plots#

We can conveniently plot the predictions on a line plot to observe the efficacy of our model:

model.plot(data, "month", "GB", npredictions = 10, start = 5, method = "forecast")

__init__(name: str = None, overwrite_model: bool = False, p: int = 3, method: Literal['ols', 'yule-walker'] = 'ols', penalty: Literal[None, 'none', 'l2'] = 'none', C: int | float | Decimal = 1.0, missing: Literal['drop', 'raise', 'zero', 'linear_interpolation'] = 'linear_interpolation') → None#

Must be overridden in the child class

Methods

__init__([name, overwrite_model, p, method, ...])

Must be overridden in the child class

contour([nbins, chart])

Draws the model's contour plot.

deploySQL([ts, y, start, npredictions, ...])

Returns the SQL code needed to deploy the model.

does_model_exists(name[, raise_error, ...])

Checks whether the model is stored in the Vertica database.

drop()

Drops the model from the Vertica database.

export_models(name, path[, kind])

Exports machine learning models.

features_importance([show, chart])

Computes the model's features importance.

fit(input_relation, ts, y[, test_relation, ...])

Trains the model.

get_attributes([attr_name])

Returns the model attributes.

get_match_index(x, col_list[, str_check])

Returns the matching index.

get_params()

Returns the parameters of the model.

get_plotting_lib([class_name, chart, ...])

Returns the first available library (Plotly, Matplotlib, or Highcharts) to draw a specific graphic.

get_vertica_attributes([attr_name])

Returns the model Vertica attributes.

import_models(path[, schema, kind])

Imports machine learning models.

plot([vdf, ts, y, start, npredictions, ...])

Draws the model.

predict([vdf, ts, y, start, npredictions, ...])

Predicts using the input relation.

register(registered_name[, raise_error])

Registers the model and adds it to in-DB Model versioning environment with a status of 'under_review'.

regression_report([metrics, start, ...])

Computes a regression report using multiple metrics to evaluate the model (r2, mse, max error...).

report([metrics, start, npredictions, method])

Computes a regression report using multiple metrics to evaluate the model (r2, mse, max error...).

score([metric, start, npredictions, method])

Computes the model score.

set_params([parameters])

Sets the parameters of the model.

summarize()

Summarizes the model.

to_binary(path)

Exports the model to the Vertica Binary format.

to_pmml(path)

Exports the model to PMML.

to_python([return_proba, ...])

Returns the Python function needed for in-memory scoring without using built-in Vertica functions.

to_sql([X, return_proba, ...])

Returns the SQL code needed to deploy the model without using built-in Vertica functions.

to_tf(path)

Exports the model to the Frozen Graph format (TensorFlow).

Attributes