
verticapy.machine_learning.vertica.tsa.ARIMA¶
- class verticapy.machine_learning.vertica.tsa.ARIMA(name: str = None, overwrite_model: bool = False, order: tuple[int] | list[int] = (0, 0, 0), tol: float = 1e-06, max_iter: int = 100, init: Literal['zero', 'hr'] = 'zero', missing: Literal['drop', 'error', 'zero', 'linear_interpolation'] = 'linear_interpolation')¶
Creates an in-DB ARIMA model.
Added in version 23.4.0.
Note
The AR model is much faster than ARIMA(p, 0, 0) or ARMA(p, 0) because the underlying algorithm of AR is quite different.
Note
The MA model may be faster and more accurate than ARIMA(0, 0, q) or ARMA(0, q) because the underlying algorithm of MA is quite different.
Parameters¶
- name: str, optional
Name of the model. The model is stored in the database.
- overwrite_model: bool, optional
If set to True, training a model with the same name as an existing model overwrites the existing model.
- order: tuple, optional
The (p,d,q) order of the model for the autoregressive, differences, and moving average components.
- tol: float, optional
Convergence tolerance; training stops when the improvement between iterations falls below this threshold.
- max_iter: int, optional
The maximum number of iterations the algorithm performs before stopping, even if the convergence tolerance has not been reached.
- init: str, optional
Initialization method, one of the following:
- ‘zero’:
Coefficients are initialized to zero.
- ‘hr’:
Coefficients are initialized using the Hannan-Rissanen algorithm.
- missing: str, optional
Method for handling missing values, one of the following strings:
- ‘drop’:
Missing values are ignored.
- ‘error’:
Missing values raise an error.
- ‘zero’:
Missing values are set to zero.
- ‘linear_interpolation’:
Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.
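As a rough illustration of what the 'linear_interpolation' option does, the sketch below fills interior missing values of a toy series with pure NumPy (this is only an analogy for the in-DB behavior, not VerticaPy's implementation; the sample values are made up):

```python
import numpy as np

# A toy series with interior missing values (NaN). Note that, per the
# docs above, VerticaPy errors if the first or last value is missing.
series = np.array([112.0, np.nan, 132.0, 129.0, np.nan, 135.0])

# Replace each NaN with a value interpolated linearly between the
# nearest valid entries before and after it.
idx = np.arange(len(series))
valid = ~np.isnan(series)
filled = series.copy()
filled[~valid] = np.interp(idx[~valid], idx[valid], series[valid])

print(filled)  # [112. 122. 132. 129. 132. 135.]
```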
Attributes¶
Many attributes are created during the fitting phase.
- phi_: numpy.array
The coefficient of the AutoRegressive process. It represents the strength and direction of the relationship between a variable and its past values.
- theta_: numpy.array
The theta coefficient of the Moving Average process. It signifies the impact and contribution of the lagged error terms in determining the current value within the time series model.
- mean_: float
The mean of the time series values.
- features_importance_: numpy.array
The importance of features is computed from the AutoRegressive coefficients, which are normalized based on their range; an activation function then calculates the final score. You must call the features_importance() method to compute it initially; the computed values are reused for subsequent calls.
- mse_: float
The mean squared error (MSE) of the model, based on one-step-ahead forecasting. This metric may not always be representative; computing a full multi-step forecast is recommended for a more meaningful and comprehensive evaluation.
- n_: int
The number of rows used to fit the model.
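To make the roles of phi_, theta_, and mean_ concrete, here is a textbook one-step-ahead ARMA forecast computed by hand. The coefficient values, recent observations, and residuals below are hypothetical, and the in-DB implementation may differ in details (e.g. differencing when d > 0):

```python
import numpy as np

# Hypothetical fitted attributes (illustrative values only).
phi_ = np.array([0.6, 0.2])   # AR coefficients for lags 1..p
theta_ = np.array([0.3])      # MA coefficients for lags 1..q
mean_ = 100.0                 # mean of the series

# Last p observations (most recent first) and last q residuals.
recent = np.array([110.0, 105.0])
residuals = np.array([2.0])

# One-step-ahead forecast: mean + AR part on centered values + MA part.
forecast = mean_ + phi_ @ (recent - mean_) + theta_ @ residuals
print(forecast)  # 107.6
```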
Note
All attributes can be accessed using the get_attributes() method.
Note
Several other attributes can be accessed by using the get_vertica_attributes() method.
Examples¶
The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Machine Learning or the Examples section on the website.
Initialization¶
We import verticapy:
import verticapy as vp
Hint
By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like "average" and "median", which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.
For this example, we will use the airline passengers dataset.
import verticapy.datasets as vpd

data = vpd.load_airline_passengers()
     date        passengers
1    1949-01-01  112
2    1949-02-01  118
3    1949-03-01  132
4    1949-04-01  129
5    1949-05-01  121
...
Rows: 1-100 | Columns: 2
Note
VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets, which provides detailed information on each dataset and how to use them effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.
We can plot the data to visually inspect it for the presence of any trends:
data["passengers"].plot(ts = "date")
Though the increasing trend is obvious in our example, we can confirm it with the mkt() (Mann-Kendall test) function:
from verticapy.machine_learning.model_selection.statistical_tests import mkt

mkt(data, column = "passengers", ts = "date")
                             value
Mann Kendall Test Statistic  14.381116595942574
S                            8327.0
STDS                         578.953653873376
p_value                      6.798871501067664e-47
Monotonic Trend              ✅
Trend                        increasing
Rows: 1-6 | Columns: 2

The above test gives us more insight into the data: the series is monotonic and increasing. Furthermore, the low p-value confirms the presence of a trend with respect to time. Now that we are sure of the trend, we can apply an appropriate time-series model to fit it.
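For intuition on what mkt() is reporting, the Mann-Kendall S statistic counts concordant minus discordant pairs. The toy recomputation below (on a short made-up series, not the full dataset) shows why an increasing series yields a large positive S:

```python
import numpy as np

# S = sum over all pairs i < j of sign(x[j] - x[i]).
# A large positive S suggests an increasing monotonic trend.
x = np.array([112, 118, 132, 129, 121, 135, 148])
n = len(x)
S = sum(np.sign(x[j] - x[i]) for i in range(n) for j in range(i + 1, n))
print(S)  # 15.0 (out of a maximum of n*(n-1)/2 = 21 for this n)
```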
Model Initialization¶
First, we import the ARIMA model:
from verticapy.machine_learning.vertica.tsa import ARIMA
Then we can create the model:
model = ARIMA(order = (12, 1, 2))
Hint
In verticapy 1.0.x and higher, you do not need to specify the model name, as the name is automatically assigned. If you need to re-use the model, you can fetch the model name from the model's attributes.
Important
The model name is crucial for the model management system and versioning. It’s highly recommended to provide a name if you plan to reuse the model later.
Model Fitting¶
We can now fit the model:
model.fit(data, "date", "passengers")

============
coefficients
============
parameter| value
---------+--------
phi_1    |-0.02408
phi_2    |-0.03398
phi_3    |-0.02702
phi_4    |-0.12197
phi_5    |-0.01651
phi_6    |-0.21558
phi_7    |-0.00477
phi_8    |-0.15146
phi_9    | 0.04249
phi_10   |-0.16296
phi_11   | 0.04043
phi_12   | 0.86090
theta_1  | 0.06580
theta_2  |-0.06794

==============
regularization
==============
none

===============
timeseries_name
===============
passengers

==============
timestamp_name
==============
date

==============
missing_method
==============
linear_interpolation

===========
call_string
===========
ARIMA('"public"."_verticapy_tmp_arima_v_demo_58a95b8455a511ef880f0242ac120002_"', '"public"."_verticapy_tmp_view_v_demo_58b4228a55a511ef880f0242ac120002_"', 'passengers', 'date' USING PARAMETERS p=12, d=1, q=2, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);

===============
Additional Info
===============
Name              | Value
------------------+---------
p                 | 12
d                 | 1
q                 | 2
mean              | 2.23776
lambda            | 1.00000
mean_squared_error|178.86952
rejected_row_count| 0
accepted_row_count| 144
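Since we chose order = (12, 1, 2), the series is differenced once (d = 1) before the ARMA(12, 2) part is fit, which removes the linear trend we confirmed earlier; forecasts are then integrated back. A minimal NumPy sketch of that differencing round-trip, on a few made-up values:

```python
import numpy as np

# First-order differencing (d = 1) and its inverse (integration).
y = np.array([112.0, 118.0, 132.0, 129.0, 121.0])

diff = np.diff(y)  # [6. 14. -3. -8.]: the ARMA part models this series

# Undo the differencing: cumulative-sum the differences back from y[0].
restored = y[0] + np.concatenate(([0.0], np.cumsum(diff)))
print(np.allclose(restored, y))  # True
```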
Important
To train a model, you can directly use the vDataFrame or the name of the relation stored in the database. The test set is optional and is only used to compute the test metrics. In verticapy, we don't work with X matrices and y vectors. Instead, we work directly with lists of predictors and the response name.
Features Importance¶
We can conveniently get the features importance:
model.features_importance()
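To illustrate the idea behind this score, the sketch below normalizes the magnitudes of a few AR coefficients into percentages. This is only a rough analogy for the attribute description above (VerticaPy applies a range-based normalization and an activation function, whose exact formulas may differ), and the coefficients are taken from the fit output purely for illustration:

```python
import numpy as np

# A few of the fitted phi coefficients from the output above.
phi_ = np.array([-0.02408, -0.12197, 0.86090])

# Naive importance: share of total absolute coefficient magnitude.
# (Not VerticaPy's exact formula; a simplified sketch of the idea.)
importance = 100 * np.abs(phi_) / np.abs(phi_).sum()
print(importance.round(2))  # [ 2.39 12.11 85.5 ]
```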