
verticapy.machine_learning.vertica.tsa.VAR

class verticapy.machine_learning.vertica.tsa.VAR(name: str = None, overwrite_model: bool = False, p: int = 3, method: Literal['yule-walker'] = 'yule-walker', penalty: Literal[None, 'none', 'l2'] = 'none', C: Annotated[int | float | Decimal, 'Python Numbers'] = 1.0, missing: Literal['drop', 'error', 'zero'] = 'error', subtract_mean: bool = False)

Creates an in-DB VectorAutoregressor model.

Added in version 24.2.0.

Parameters

name: str, optional

Name of the model. The model is stored in the database.

overwrite_model: bool, optional

If set to True, training a model with the same name as an existing model overwrites the existing model.

p: int, optional

Integer in the range [1, 1999], the number of lags to consider in the computation. Larger values for p weaken the correlation.

method: str, optional

One of the following algorithms for training the model (note that, per the signature above, VAR currently accepts only 'yule-walker'):

  • ols:

    Ordinary Least Squares

  • yule-walker:

    Yule-Walker

penalty: str, optional

Method of regularization.

  • none:

    No regularization.

  • l2:

    L2 regularization.

C: PythonNumber, optional

The regularization parameter value. The value must be non-negative.

missing: str, optional

Method for handling missing values, one of the following strings:

  • ‘drop’:

    Missing values are ignored.

  • ‘error’:

    Missing values raise an error.

  • ‘zero’:

    Missing values are set to zero.

  • ‘linear_interpolation’:

    Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.

subtract_mean: bool, optional

For Yule-Walker, if subtract_mean is True, the mean of the column(s) is subtracted before the coefficients are calculated. If False (the default), the calculations are performed directly on the data, which often gives a more accurate model. Note that when this parameter is False, the means saved in the model are all zeros. This parameter has no effect for OLS.
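
As an illustration of the parameters above, a model could be instantiated as follows (a minimal sketch; the specific values and the name "my_var_model" are arbitrary examples, not recommendations):

from verticapy.machine_learning.vertica.tsa import VAR

# Hypothetical configuration: 4 lags, rows with missing values dropped,
# L2 regularization with C = 0.5, and mean subtraction enabled.
model = VAR(
    name="my_var_model",    # hypothetical name; auto-generated if omitted
    overwrite_model=True,
    p=4,
    method="yule-walker",
    penalty="l2",
    C=0.5,
    missing="drop",
    subtract_mean=True,
)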

Attributes

Many attributes are created during the fitting phase.

phi_: numpy.array

The coefficients of the autoregressive process. They represent the strength and direction of the relationship between a variable and its past values.

Note

In the case of multivariate analysis, each coefficient is represented by a matrix of numbers.

intercept_: float

Represents the expected value of the time series when the lagged values are zero. It signifies the baseline or constant term in the model, capturing the average level of the series in the absence of any historical influence.

Note

In the case of multivariate analysis, the intercept is represented by a vector of numbers.

features_importance_: numpy.array

The importance of each feature, computed from the autoregressive coefficients, which are normalized based on their range; an activation function then calculates the final score. You must call the features_importance() method to compute these values initially; they are then reused for subsequent calls.

mse_: float

The mean squared error (MSE) of the model, based on one-step-ahead forecasting. This metric may not always be relevant; a full forecasting approach is recommended for a more meaningful and comprehensive measure.

n_: int

The number of rows used to fit the model.

Note

All attributes can be accessed using the get_attributes() method.

Note

Several other attributes can be accessed by using the get_vertica_attributes() method.
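
As a quick sketch, assuming a model that has already been fitted (as in the Examples below), the attributes could be inspected like this:

# List the attributes available on the fitted model.
model.get_attributes()

# Read a specific attribute directly, e.g. the AR coefficients.
model.phi_

# Inspect the attributes stored by Vertica for the underlying model.
model.get_vertica_attributes()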

Examples

The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Machine Learning or the Examples section on the website.

Initialization

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will generate a dummy time-series dataset.

data = vp.vDataFrame(
    {
        "month": [i for i in range(1, 11)],
        "GB1": [5, 10, 20, 35, 55, 80, 110, 145, 185, 230],
        "GB2": [3, 7, 12, 18, 22, 30, 37, 39, 51, 80],
    }
)

 month |  GB1  |  GB2
-------+-------+------
     1 |     5 |     3
     2 |    10 |     7
     3 |    20 |    12
     4 |    35 |    18
     5 |    55 |    22
     6 |    80 |    30
     7 |   110 |    37
     8 |   145 |    39
     9 |   185 |    51
    10 |   230 |    80
Rows: 1-10 | Columns: 3 (all columns of type Integer)

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets section, which provides detailed information on each dataset and how to use them effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.

We can plot the data to visually inspect it for the presence of any trends:

data.plot(ts = "month", columns = ["GB1", "GB2"])

Though the increasing trend is obvious in our example, we can confirm it with the Mann-Kendall test (mkt()):

from verticapy.machine_learning.model_selection.statistical_tests import mkt

mkt(data, column = "GB1", ts = "month")
                            |         value
----------------------------+----------------------
Mann Kendall Test Statistic | 3.935479640399647
S                           | 45.0
STDS                        | 11.1803398874989
p_value                     | 8.303070332644367e-05
Monotonic Trend             | True
Trend                       | increasing
Rows: 1-6 | Columns: 2

The above test gives us more insight into the data: the trend is monotonic and increasing. Furthermore, the low p-value confirms the presence of a trend with respect to time. Now that we are sure of the trend, we can apply an appropriate time-series model to fit it.

Model Initialization

First we import the VAR model:

from verticapy.machine_learning.vertica.tsa import VAR

Then we can create the model:

model = VAR(p = 2)

Hint

In verticapy 1.0.x and higher, you do not need to specify the model name, as the name is automatically assigned. If you need to re-use the model, you can fetch the model name from the model’s attributes.

Important

The model name is crucial for the model management system and versioning. It’s highly recommended to provide a name if you plan to reuse the model later.
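
If you rely on the auto-generated name, it can be retrieved from the model object afterwards; a minimal sketch:

# The model name (provided or auto-generated) is stored on the model object.
print(model.model_name)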

Model Fitting

We can now fit the model:

model.fit(data, "month", ["GB1", "GB2"])


=========
phi_(t-1)
=========
predictor|  gb1   |  gb2   
---------+--------+--------
   gb1   | 2.15113|-3.77023
   gb2   | 0.18040| 0.12940


=========
phi_(t-2)
=========
predictor|  gb1   |  gb2   
---------+--------+--------
   gb1   |-2.13660| 6.37890
   gb2   |-0.27343| 1.00159


====
mean
====
predictor| value  
---------+--------
   gb1   | 0.00000
   gb2   | 0.00000


==================
mean_squared_error
==================
predictor|  value   
---------+----------
   gb1   |1234.12309
   gb2   |280.45886 


=================
predictor_columns
=================
"gb1", "gb2"

================
timestamp_column
================
month

==============
missing_method
==============
error

===========
call_string
===========
autoregressor('"public"."_verticapy_tmp_ar_v_mldb_8d772ada55a511ef880f0242ac120002_"', '"public"."_verticapy_tmp_view_v_mldb_8d8f018c55a511ef880f0242ac120002_"', '"gb1", "gb2"', 'month'
USING PARAMETERS p=2, method=yule-walker, missing=error, regularization='none', lambda=1, compute_mse=true, subtract_mean=false);

===============
Additional Info
===============
       Name       | Value  
------------------+--------
    lag_order     |   2    
  num_predictors  |   2    
      lambda      | 1.00000
rejected_row_count|   0    
accepted_row_count|   10   

Important

To train a model, you can directly use the vDataFrame or the name of the relation stored in the database. The test set is optional and is only used to compute the test metrics. In verticapy, we don’t work using X matrices and y vectors. Instead, we work directly with lists of predictors and the response name.
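
For instance, under the assumption that the same data were stored in a hypothetical table public.ts_data, the fit call could reference the relation name directly:

# Equivalent fit against a relation stored in the database.
# "public.ts_data" is a hypothetical table containing the same columns.
model.fit("public.ts_data", "month", ["GB1", "GB2"])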

Features Importance

We can conveniently compute the feature importance for the first predictor, GB1:

model.features_importance(idx=0)
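
The idx argument selects the response column; a usage sketch for the second predictor, GB2, would be:

# Feature importance for the second response column (GB2).
model.features_importance(idx=1)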