
verticapy.machine_learning.vertica.tsa.MA#

class verticapy.machine_learning.vertica.tsa.MA(name: str = None, overwrite_model: bool = False, q: int = 1, penalty: Literal[None, 'none', 'l2'] = 'none', C: int | float | Decimal = 1.0, missing: Literal['drop', 'raise', 'zero', 'linear_interpolation'] = 'linear_interpolation')#

Creates an in-DB Moving Average model.

New in version 11.0.0.

Note

The MA model may be faster and more accurate than ARIMA(0, 0, q) or ARMA(0, q) because the underlying algorithm of MA is quite different.

Parameters#

name: str, optional

Name of the model. The model is stored in the database.

overwrite_model: bool, optional

If set to True, training a model with the same name as an existing model overwrites the existing model.

q: int, optional

Integer in the range [1, 67), the number of lags to consider in the computation.

penalty: str, optional

Method of regularization.

  • none:

    No regularization.

  • l2:

    L2 regularization.

C: PythonNumber, optional

The regularization parameter value. The value must be non-negative.

missing: str, optional

Method for handling missing values, one of the following strings:

  • ‘drop’:

    Missing values are ignored.

  • ‘raise’:

    Missing values raise an error.

  • ‘zero’:

    Missing values are set to zero.

  • ‘linear_interpolation’:

    Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.
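
As a quick illustration of how these parameters fit together, here is a minimal sketch; the values chosen are arbitrary and only for demonstration:

from verticapy.machine_learning.vertica.tsa import MA

# Illustrative configuration: 3 lags, L2 regularization with C = 0.5,
# and missing values replaced by linear interpolation.
model = MA(
    name = "my_ma_model",  # hypothetical model name
    overwrite_model = True,
    q = 3,
    penalty = "l2",
    C = 0.5,
    missing = "linear_interpolation",
)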

Attributes#

Many attributes are created during the fitting phase.

theta_: numpy.array

The theta coefficient of the Moving Average process. It signifies the impact and contribution of the lagged error terms in determining the current value within the time series model.

mu_: float

Represents the mean or average of the series. It is a constant term that reflects the expected value of the time series in the absence of any temporal dependencies or influences from past error terms.

mean_: float

The mean of the time series values.

mse_: float

The mean squared error (MSE) of the model. Because it is based on one-step forward forecasting, it may not always be relevant; using a full forecasting approach is recommended to compute a more meaningful and comprehensive metric.

n_: int

The number of rows used to fit the model.

Note

All attributes can be accessed using the get_attributes() method.

Note

Several other attributes can be accessed by using the get_vertica_attributes() method.
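
For example, once the model has been fitted (see the sections below), the documented coefficients can be retrieved by name; a minimal sketch:

# Retrieve specific attributes by name.
model.get_attributes("theta_")  # lagged-error coefficients
model.get_attributes("mu_")     # constant mean term

# Inspect the Vertica-side attributes as well.
model.get_vertica_attributes()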

Examples#

The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Machine Learning section or the Examples section on the website.

Initialization#

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will generate a dummy time-series dataset that has some noise variation centered around a mean value.

# Initialization
import random

N = 30 # Number of rows
temp = [23] * N
noisy_temp = [x + random.uniform(-5, 5) for x in temp]

# Building the vDataFrame
data = vp.vDataFrame(
    {
        "day": [i for i in range(1, N + 1)],
        "temp": noisy_temp,
    }
)
day (Integer)    temp (Numeric(19))
1                21.662884519150932
2                22.026766951993785
3                23.732593461073865
4                22.406725325955627
5                21.049988417476907
6                24.143901247703276
7                24.401658125360335
8                22.75079510740254
9                20.27199280977796
10               22.131577781675222
11               24.777913120303204
12               22.776659858718958
13               27.79318880685984
14               23.33301269614893
15               26.558679903927896
16               27.73744159306446
17               27.01019690491039
18               18.692709815830582
19               21.35450284681953
20               25.159758639588077
21               20.556031647991936
22               26.612993545644272
23               24.931764912759245
24               25.69685361560668
25               18.805605751532845
26               20.3730966019823
27               24.984742340037144
28               23.207967672811186
29               19.121727179549712
30               20.994745650008404
Rows: 1-30 | Columns: 2

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets section, which provides detailed information on each dataset and how to use them effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.
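
For instance, assuming the built-in Amazon forest-fires time-series dataset is available in your installation, it could be loaded as follows (a sketch; we continue with the generated data below):

# Hypothetical alternative: load a built-in sample time-series dataset.
from verticapy.datasets import load_amazon

amazon = load_amazon()  # typically contains date, state, and number columns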

We can plot the data to visually inspect it for the presence of any trends:

data["temp"].plot(ts = "day")

There is obviously no trend in our example, but we can confirm this with the mkt() (Mann Kendall) test:

from verticapy.machine_learning.model_selection.statistical_tests import mkt

mkt(data, column = "temp", ts = "day")
                               value
Mann Kendall Test Statistic    0.0
S                              1.0
STDS                           56.0505724026675
p_value                        1.0
Monotonic Trend                False
Trend                          no trend
Rows: 1-6 | Columns: 2

The above report confirms that there is no trend in our data, and hence it is stationary. Note the high p-value, which is also indicative of the absence of a trend. Once we have established that the data is stationary, we can apply the MA model to it.

Model Initialization#

First we import the MA model:

from verticapy.machine_learning.vertica.tsa import MA

Then we can create the model:

model = MA(q = 2)

Hint

In verticapy 1.0.x and higher, you do not need to specify the model name, as the name is automatically assigned. If you need to re-use the model, you can fetch the model name from the model’s attributes.

Important

The model name is crucial for the model management system and versioning. It’s highly recommended to provide a name if you plan to reuse the model later.
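
If you do plan to reuse the model, you can pass a name at creation time; a small sketch (the name is illustrative, and the model_name attribute is assumed based on the hint above):

# Hypothetical: give the model an explicit name so it can be reused later.
named_model = MA("ma_temperature_v1", overwrite_model = True, q = 2)

# The stored name can later be read back from the object's attributes.
print(named_model.model_name)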

Model Fitting#

We can now fit the model:

model.fit(data, "day", "temp")

Important

To train a model, you can directly use the vDataFrame or the name of the relation stored in the database. The test set is optional and is only used to compute the test metrics. In verticapy, we don’t work using X matrices and y vectors. Instead, we work directly with lists of predictors and the response name.
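
For example, if the same data were persisted as a table in the database, the fit could reference the relation by name instead of the vDataFrame; a sketch assuming a hypothetical table public.daily_temp:

# Hypothetical: save the vDataFrame as a table, then fit from the relation name.
data.to_db("public.daily_temp", relation_type = "table")
model.fit("public.daily_temp", "day", "temp")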

Model Register#

In order to register the model for tracking and versioning:

model.register("model_v1")

Please refer to Model Tracking and Versioning for more details on model tracking and versioning.


An important aspect of time-series forecasting is that there are two types of forecasting:

  • One-step ahead forecasting

  • Full forecasting

Important

The default method is one-step ahead forecasting. To use full forecasting, set method = "forecast".

One-step ahead#

In this type of forecasting, the algorithm utilizes the true value of the previous timestamp (t-1) to predict the immediate next timestamp (t). Subsequently, to forecast additional steps into the future (t+1), it relies on the actual value of the immediately preceding timestamp (t).

A notable drawback of this forecasting method is its tendency to exhibit exaggerated accuracy, particularly when predicting more than one step into the future.
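
Switching between the two modes is only a matter of the method parameter; for example (the start and npredictions values below are illustrative):

# One-step ahead (default): each prediction uses the true previous observations.
model.predict(data, "day", "temp", start = 25, npredictions = 5)

# Full forecasting: each prediction is chained on the previous predictions.
model.predict(data, "day", "temp", start = 25, npredictions = 5, method = "forecast")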

Metrics#

We can get the entire report using:

model.report(start = 3)
                           value
explained_variance         -1.05378265098599
max_error                  10.1498911305363
median_absolute_error      2.06019249966271
mean_absolute_error        3.04025284309777
mean_squared_error         14.87762744397
root_mean_squared_error    3.85715276388815
r2                         -1.05960973472998
r2_adj                     -1.14199412411918
aic                        77.7295147428362
bic                        79.4878551415115
Rows: 1-10 | Columns: 2

Important

The value for start has to be greater than the q value selected for the MA model.

You can also choose the number of predictions and where to start the forecast. For example, the following code will allow you to generate a report with 10 predictions, starting the forecasting process at index 25.

model.report(start = 25, npredictions = 10)
                           value
explained_variance         -1.99397622692757
max_error                  6.37627336159975
median_absolute_error      1.90169166063587
mean_absolute_error        2.87136053629103
mean_squared_error         13.6194408715668
root_mean_squared_error    3.69045266485925
r2                         -2.10062663255054
r2_adj                     -3.13416884340071
aic                        27.0574912393226
bic                        16.2763670641908
Rows: 1-10 | Columns: 2

Important

Most metrics are computed using a single SQL query, but some of them might require multiple SQL queries. Selecting only the necessary metrics in the report can help optimize performance. E.g. model.report(metrics = ["mse", "r2"]).

You can utilize the score() function to calculate various regression metrics, with the explained variance being the default.

model.score(start = 25, npredictions = 10)
Out[5]: -2.10062663255054

Important

If you do not specify a starting point and the number of predictions, the forecast will begin at one-fourth of the dataset, which can result in an inaccurate score, especially for large datasets. It’s important to choose these parameters carefully.
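
For example, to score a specific window with an explicit metric (the values below are illustrative):

# Score only the last few points, using mean squared error instead of
# the default explained variance.
model.score(metric = "mse", start = 25, npredictions = 5)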

Prediction#

Prediction is straightforward:

model.predict()
     prediction (Float(22))
1    19.3605875902352
2    18.4559994369852
3    16.8958342018877
4    14.9641150617587
5    12.3977726232792
6    9.04424416712433
7    4.6453743913806
8    -1.11961553560087
9    -8.67652277981185
10   -18.5818606246452
Rows: 1-10 | Column: prediction | Type: Float(22)

Hint

You can control the number of prediction steps by changing the npredictions parameter: model.predict(npredictions = 30).

Note

Predictions can be made automatically by using the training set, in which case you don’t need to specify the predictors. Alternatively, you can pass only the vDataFrame to the predict() function, but in this case, it’s essential that the column names of the vDataFrame match the predictors and response name in the model.
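
A short sketch of the two equivalent calls described above:

# Reuse the training relation implicitly ...
model.predict(npredictions = 5)

# ... or pass a vDataFrame whose column names match the fitted ts and response.
model.predict(data, npredictions = 5)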

If you would like to have the time-stamps (ts) in the output, you can switch on the output_estimated_ts parameter.

model.predict(output_estimated_ts = True)
     day (Float(22))    prediction (Float(22))
1    31.0               19.3605875902352
2    32.0               18.4559994369852
3    33.0               16.8958342018877
4    34.0               14.9641150617587
5    35.0               12.3977726232792
6    36.0               9.04424416712433
7    37.0               4.6453743913806
8    38.0               -1.11961553560087
9    39.0               -8.67652277981185
10   40.0               -18.5818606246452
Rows: 1-10 | Columns: 2

Important

The output_estimated_ts parameter provides an estimation of ‘ts’ assuming that ‘ts’ is regularly spaced.

If you don’t provide any input, the function will begin forecasting after the last known value. If you want to forecast starting from a specific value within the input dataset or another dataset, you can use the following syntax.

model.predict(
    data,
    "day",
    "temp",
    start = 25,
    npredictions = 10,
    output_estimated_ts = True,
)
     day (Float(22))    prediction (Float(22))
1    26.0               19.7615572264859
2    27.0               18.6084689784374
3    28.0               23.8795480070827
4    29.0               23.9174451290013
5    30.0               19.0930539893725
6    31.0               19.3605875902352
7    32.0               18.4559994369852
8    33.0               16.8958342018877
9    34.0               14.9641150617587
10   35.0               12.3977726232792
Rows: 1-10 | Columns: 2

Plots#

We can conveniently plot the predictions on a line plot to observe the efficacy of our model:

model.plot(data, "day", "temp", npredictions = 15, start=25)

Note

You can control the number of prediction steps by changing the npredictions parameter: model.plot(npredictions = 30).

Please refer to Machine Learning - Time Series Plots for more examples.


Full forecasting#

In this forecasting approach, the algorithm relies solely on a chosen true value for initiation. Subsequently, all predictions are established based on a series of previously predicted values.

This methodology aligns the accuracy of predictions more closely with reality. In practical forecasting scenarios, the goal is to predict all future steps, and this technique ensures a progressive sequence of predictions.
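
In practice, this means you can forecast past the end of the observed data using only the model itself; a minimal sketch:

# Forecast 5 steps beyond the last observed day, chaining each
# prediction on the previous ones.
model.predict(npredictions = 5, method = "forecast", output_estimated_ts = True)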

Metrics#

We can get the report using:

model.report(start = 25, method = "forecast")

By selecting start = 25, we measure the accuracy from the 25th time-stamp and continue the assessment until the last available time-stamp.

                           value
explained_variance         -0.905916751996866
max_error                  9.21892717443571
median_absolute_error      6.62095146016343
mean_absolute_error        5.61945090406766
mean_squared_error         39.9499299242951
root_mean_squared_error    6.32059569378513
r2                         -8.09507356872479
r2_adj                     -11.1267647582997
aic                        32.4381345906276
bic                        21.6570104154958
Rows: 1-10 | Columns: 2

Notice that the accuracy using method = "forecast" is poorer than that of one-step ahead forecasting.

You can utilize the score() function to calculate various regression metrics, with the explained variance being the default.

model.score(start = 25, npredictions = 30, method = "forecast")
Out[6]: -8.09507356872479

Prediction#

Prediction is straightforward:

model.predict(start = 25, npredictions = 15, method = "forecast")
     prediction (Float(22))
1    19.7615572264859
2    17.9919312016457
3    16.5870162126478
4    14.4687018076985
5    11.7758184755727
6    8.22070244505243
7    3.56840539218381
8    -2.53204224580659
9    -10.5276701572297
10   -21.008372746107
11   -34.7461787681022
12   -52.7534068235429
13   -76.3568723556198
14   -107.295768363823
15   -147.849780106188
Rows: 1-15 | Column: prediction | Type: Float(22)

If you want to forecast starting from a specific value within the input dataset or another dataset, you can use the following syntax.

model.predict(
    data,
    "day",
    "temp",
    start = 25,
    npredictions = 20,
    output_estimated_ts = True,
    output_standard_errors = True,
    method = "forecast"
)
     day (Float(22))    prediction (Float(22))
1    26.0               19.7615572264859
2    27.0               17.9919312016457
3    28.0               16.5870162126478
4    29.0               14.4687018076985
5    30.0               11.7758184755727
6    31.0               8.22070244505243
7    32.0               3.56840539218381
8    33.0               -2.53204224580659
9    34.0               -10.5276701572297
10   35.0               -21.008372746107
11   36.0               -34.7461787681022
12   37.0               -52.7534068235429
13   38.0               -76.3568723556198
14   39.0               -107.295768363823
15   40.0               -147.849780106188
16   41.0               -201.007071008859
17   42.0               -270.684457882112
18   43.0               -362.016016794863
19   44.0               -481.73137964174
20   45.0               -638.651597284325
Rows: 1-20 | Columns: 2

Plots#

We can conveniently plot the predictions on a line plot to observe the efficacy of our model:

model.plot(data, "day", "temp", npredictions = 15, start = 25, method = "forecast")
__init__(name: str = None, overwrite_model: bool = False, q: int = 1, penalty: Literal[None, 'none', 'l2'] = 'none', C: int | float | Decimal = 1.0, missing: Literal['drop', 'raise', 'zero', 'linear_interpolation'] = 'linear_interpolation') None#

Must be overridden in the child class

Methods

__init__([name, overwrite_model, q, ...])

Must be overridden in the child class

contour([nbins, chart])

Draws the model's contour plot.

deploySQL([ts, y, start, npredictions, ...])

Returns the SQL code needed to deploy the model.

does_model_exists(name[, raise_error, ...])

Checks whether the model is stored in the Vertica database.

drop()

Drops the model from the Vertica database.

export_models(name, path[, kind])

Exports machine learning models.

features_importance([show, chart])

Computes the model's features importance.

fit(input_relation, ts, y[, test_relation, ...])

Trains the model.

get_attributes([attr_name])

Returns the model attributes.

get_match_index(x, col_list[, str_check])

Returns the matching index.

get_params()

Returns the parameters of the model.

get_plotting_lib([class_name, chart, ...])

Returns the first available library (Plotly, Matplotlib, or Highcharts) to draw a specific graphic.

get_vertica_attributes([attr_name])

Returns the model Vertica attributes.

import_models(path[, schema, kind])

Imports machine learning models.

plot([vdf, ts, y, start, npredictions, ...])

Draws the model.

predict([vdf, ts, y, start, npredictions, ...])

Predicts using the input relation.

register(registered_name[, raise_error])

Registers the model and adds it to in-DB Model versioning environment with a status of 'under_review'.

regression_report([metrics, start, ...])

Computes a regression report using multiple metrics to evaluate the model (r2, mse, max error...).

report([metrics, start, npredictions, method])

Computes a regression report using multiple metrics to evaluate the model (r2, mse, max error...).

score([metric, start, npredictions, method])

Computes the model score.

set_params([parameters])

Sets the parameters of the model.

summarize()

Summarizes the model.

to_binary(path)

Exports the model to the Vertica Binary format.

to_pmml(path)

Exports the model to PMML.

to_python([return_proba, ...])

Returns the Python function needed for in-memory scoring without using built-in Vertica functions.

to_sql([X, return_proba, ...])

Returns the SQL code needed to deploy the model without using built-in Vertica functions.

to_tf(path)

Exports the model to the Frozen Graph format (TensorFlow).

Attributes