
verticapy.machine_learning.vertica.tsa.ARIMA

class verticapy.machine_learning.vertica.tsa.ARIMA(name: str = None, overwrite_model: bool = False, order: tuple[int] | list[int] = (0, 0, 0), tol: float = 1e-06, max_iter: int = 100, init: Literal['zero', 'hr'] = 'zero', missing: Literal['drop', 'error', 'zero', 'linear_interpolation'] = 'linear_interpolation')

Creates an in-database (inDB) ARIMA model.

Added in version 23.4.0.

Note

The AR model is much faster than ARIMA(p, 0, 0) or ARMA(p, 0) because the underlying algorithm of AR is quite different.

Note

The MA model may be faster and more accurate than ARIMA(0, 0, q) or ARMA(0, q) because the underlying algorithm of MA is quite different.
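
If your use case reduces to a pure autoregressive or pure moving-average model, it can therefore be worth using the dedicated estimators instead of a degenerate ARIMA order. A minimal sketch, assuming the AR and MA classes are available in the same verticapy.machine_learning.vertica.tsa module:

# Sketch: prefer the specialized estimators when d = 0 and only one of
# p or q is non-zero (assumes AR and MA live in the tsa module).
from verticapy.machine_learning.vertica.tsa import AR, MA

ar_model = AR(p = 12)  # instead of ARIMA(order = (12, 0, 0))
ma_model = MA(q = 2)   # instead of ARIMA(order = (0, 0, 2))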

Parameters

name: str, optional

Name of the model. The model is stored in the database.

overwrite_model: bool, optional

If set to True, training a model with the same name as an existing model overwrites the existing model.

order: tuple, optional

The (p,d,q) order of the model for the autoregressive, differences, and moving average components.

tol: float, optional

Convergence threshold used to determine whether the algorithm has reached the specified accuracy result.

max_iter: int, optional

The maximum number of iterations the algorithm performs before stopping, even if the specified accuracy result has not been reached.

init: str, optional

Initialization method, one of the following:

  • ‘zero’:

    Coefficients are initialized to zero.

  • ‘hr’:

    Coefficients are initialized using the Hannan-Rissanen algorithm.

missing: str, optional

Method for handling missing values, one of the following strings:

  • ‘drop’:

    Missing values are ignored.

  • ‘error’:

    Missing values raise an error.

  • ‘zero’:

    Missing values are set to zero.

  • ‘linear_interpolation’:

    Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.
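
Putting these parameters together, a minimal construction sketch (the model name below is a hypothetical value chosen for illustration):

from verticapy.machine_learning.vertica.tsa import ARIMA

# Illustrative instantiation using the parameters documented above;
# "my_arima" is a placeholder model name.
model = ARIMA(
    name = "my_arima",
    overwrite_model = True,
    order = (2, 1, 1),                  # p = 2, d = 1, q = 1
    tol = 1e-6,
    max_iter = 200,
    init = "hr",                        # Hannan-Rissanen initialization
    missing = "linear_interpolation",
)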

Attributes

Many attributes are created during the fitting phase.

phi_: numpy.array

The coefficients of the AutoRegressive process. They represent the strength and direction of the relationship between a variable and its past values.

theta_: numpy.array

The theta coefficients of the Moving Average process. They signify the impact and contribution of the lagged error terms in determining the current value within the time series model.

mean_: float

The mean of the time series values.

features_importance_: numpy.array

The importance of features is computed from the AutoRegressive coefficients, which are normalized based on their range; an activation function then calculates the final score. You must call the features_importance() method to compute the importance the first time; the computed values are then stored and reused for subsequent calls.

mse_: float

The mean squared error (MSE) of the model, based on one-step-ahead forecasting. This metric may not always be relevant; using a full forecasting approach is recommended to compute a more meaningful and comprehensive metric.

n_: int

The number of rows used to fit the model.

Note

All attributes can be accessed using the get_attributes() method.

Note

Several other attributes can be accessed by using the get_vertica_attributes() method.
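
For example, once a model has been fitted, the coefficients can be retrieved either directly or through get_attributes() (a sketch, assuming a fitted ARIMA object named model):

# Direct access to fitted attributes (assumes "model" is a fitted ARIMA).
model.phi_    # AR coefficients
model.theta_  # MA coefficients
model.mean_   # mean of the time series

# The same information through the generic attribute getter.
model.get_attributes("phi_")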

Examples

The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Machine Learning section or the Examples section on the website.

Initialization

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will use the airline passengers dataset.

import verticapy.datasets as vpd

data = vpd.load_airline_passengers()
date (Date) | passengers (Integer)
------------+---------------------
1949-01-01  |                  112
1949-02-01  |                  118
1949-03-01  |                  132
1949-04-01  |                  129
1949-05-01  |                  121
1949-06-01  |                  135
1949-07-01  |                  148
1949-08-01  |                  148
1949-09-01  |                  136
1949-10-01  |                  119
1949-11-01  |                  104
1949-12-01  |                  118
...         |                  ...

Rows: 1-100 | Columns: 2

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets section, which provides detailed information on each dataset and how to use them effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.
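
If your time series already lives in a Vertica table, you can work with it directly instead of a sample dataset (a sketch; the table name below is a placeholder):

import verticapy as vp

# Wrap an existing Vertica relation in a vDataFrame
# ("public.my_timeseries" is a hypothetical table name).
data = vp.vDataFrame("public.my_timeseries")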

We can plot the data to visually inspect it for the presence of any trends:

data["passengers"].plot(ts = "date")

Though the increasing trend is obvious in our example, we can confirm it with the Mann-Kendall test (mkt()):

from verticapy.machine_learning.model_selection.statistical_tests import mkt

mkt(data, column = "passengers", ts = "date")
                            | value
----------------------------+-----------------------
Mann Kendall Test Statistic | 14.381116595942574
S                           | 8327.0
STDS                        | 578.953653873376
p_value                     | 6.798871501067664e-47
Monotonic Trend             |
Trend                       | increasing

Rows: 1-6 | Columns: 2

The above test gives us more insight into the data: the trend is monotonic and increasing. Furthermore, the very low p-value confirms the presence of a trend with respect to time. Now that we are sure of the trend, we can apply an appropriate time-series model to fit it.
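
Because the differencing term of ARIMA targets non-stationary series, it can also be useful to check stationarity explicitly. A sketch, assuming the Augmented Dickey-Fuller test (adfuller) is available in the same statistical_tests module as mkt:

from verticapy.machine_learning.model_selection.statistical_tests import adfuller

# Augmented Dickey-Fuller test: a high p-value suggests the series is
# non-stationary, supporting the use of differencing (d > 0) in ARIMA.
adfuller(data, column = "passengers", ts = "date")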

Model Initialization

First we import the ARIMA model:

from verticapy.machine_learning.vertica.tsa import ARIMA

Then we can create the model:

model = ARIMA(order = (12, 1, 2))

Hint

In verticapy 1.0.x and higher, you do not need to specify the model name, as the name is automatically assigned. If you need to re-use the model, you can fetch the model name from the model’s attributes.

Important

The model name is crucial for the model management system and versioning. It’s highly recommended to provide a name if you plan to reuse the model later.
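
For example, a named, re-usable model could be created as follows (the name is a hypothetical value chosen for illustration):

# A named model can be retrieved later; overwrite_model avoids
# "model already exists" errors on re-runs ("airline_arima" is a placeholder).
model = ARIMA(
    name = "airline_arima",
    overwrite_model = True,
    order = (12, 1, 2),
)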

Model Fitting

We can now fit the model:

model.fit(data, "date", "passengers")


============
coefficients
============
parameter| value  
---------+--------
  phi_1  |-0.02408
  phi_2  |-0.03398
  phi_3  |-0.02702
  phi_4  |-0.12197
  phi_5  |-0.01651
  phi_6  |-0.21558
  phi_7  |-0.00477
  phi_8  |-0.15146
  phi_9  | 0.04249
 phi_10  |-0.16296
 phi_11  | 0.04043
 phi_12  | 0.86090
 theta_1 | 0.06580
 theta_2 |-0.06794


==============
regularization
==============
none

===============
timeseries_name
===============
passengers

==============
timestamp_name
==============
date

==============
missing_method
==============
linear_interpolation

===========
call_string
===========
ARIMA('"public"."_verticapy_tmp_arima_v_demo_58a95b8455a511ef880f0242ac120002_"', '"public"."_verticapy_tmp_view_v_demo_58b4228a55a511ef880f0242ac120002_"', 'passengers', 'date' USING PARAMETERS p=12, d=1, q=2, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);

===============
Additional Info
===============
       Name       |  Value  
------------------+---------
        p         |   12    
        d         |    1    
        q         |    2    
       mean       | 2.23776 
      lambda      | 1.00000 
mean_squared_error|178.86952
rejected_row_count|    0    
accepted_row_count|   144   

Important

To train a model, you can directly use the vDataFrame or the name of the relation stored in the database. The test set is optional and is only used to compute the test metrics. In verticapy, we don't work with X matrices and y vectors; instead, we work directly with lists of predictors and the response name.
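
A sketch of the equivalent call using a relation name instead of a vDataFrame (the table name below is a placeholder for wherever the data is stored):

# Equivalent fit using the name of a relation stored in the database
# ("public.airline" is a hypothetical table name).
model.fit("public.airline", "date", "passengers")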

Features Importance

We can conveniently get the features importance:

model.features_importance()
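
The call above computes and plots the importance scores; afterwards, the cached values documented in the features_importance_ attribute can be reused without recomputation:

# Values computed by features_importance() are stored on the model
# and reused on subsequent calls.
model.features_importance_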