verticapy.machine_learning.vertica.tsa.ARMA#
- class verticapy.machine_learning.vertica.tsa.ARMA(name: str = None, overwrite_model: bool = False, order: tuple[int] | list[int] = (0, 0), tol: float = 1e-06, max_iter: int = 100, init: Literal['zero', 'hr'] = 'zero', missing: Literal['drop', 'raise', 'zero', 'linear_interpolation'] = 'linear_interpolation')#
Creates an in-DB ARMA model.
New in version 12.0.3.
Note
The AR model is much faster than ARIMA(p, 0, 0) or ARMA(p, 0) because the underlying algorithm of AR is quite different.
Note
The MA model may be faster and more accurate than ARIMA(0, 0, q) or ARMA(0, q) because the underlying algorithm of MA is quite different.
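To make the ARMA structure concrete, the process can be sketched in plain Python. This is an illustrative simulation only (not VerticaPy code, and not Vertica's estimation algorithm): each value combines p lagged values, q lagged noise terms, and fresh white noise.

```python
import random

def simulate_arma(phi, theta, mean=0.0, sigma=1.0, n=200, seed=42):
    """Simulate n points of an ARMA(len(phi), len(theta)) process:
    y_t = mean + sum(phi_i * (y_{t-i} - mean)) + sum(theta_j * e_{t-j}) + e_t
    """
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    y = [mean] * p           # warm-up values for the AR part
    errors = [0.0] * q       # warm-up values for the MA part
    for _ in range(n):
        e = rng.gauss(0.0, sigma)
        ar = sum(phi[i] * (y[-1 - i] - mean) for i in range(p))
        ma = sum(theta[j] * errors[-1 - j] for j in range(q))
        y.append(mean + ar + ma + e)
        errors.append(e)
    return y[p:]             # drop the warm-up values

series = simulate_arma(phi=[0.6], theta=[0.3], mean=100.0)
```

With a stationary choice of coefficients, the simulated series fluctuates around the supplied mean.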
Parameters#
- name: str, optional
Name of the model. The model is stored in the database.
- overwrite_model: bool, optional
If set to True, training a model with the same name as an existing model overwrites the existing model.
- order: tuple, optional
The (p, q) order of the model for the autoregressive and moving-average components.
- tol: float, optional
Determines whether the algorithm has reached the specified accuracy result.
- max_iter: int, optional
Determines the maximum number of iterations the algorithm performs before achieving the specified accuracy result.
- init: str, optional
Initialization method, one of the following:
- ‘zero’:
Coefficients are initialized to zero.
- ‘hr’:
Coefficients are initialized using the Hannan-Rissanen algorithm.
- missing: str, optional
Method for handling missing values, one of the following strings:
- ‘drop’:
Missing values are ignored.
- ‘raise’:
Missing values raise an error.
- ‘zero’:
Missing values are set to zero.
- ‘linear_interpolation’:
Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.
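As a rough illustration of the 'linear_interpolation' strategy (a sketch, not the Vertica implementation), each gap can be filled with evenly spaced steps between its nearest valid neighbours, while leading or trailing gaps raise an error:

```python
def linear_interpolation(values):
    """Fill None gaps; raises if the series starts or ends with a gap."""
    if values[0] is None or values[-1] is None:
        raise ValueError("cannot interpolate a leading or trailing missing value")
    filled = list(values)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while filled[j] is None:
                j += 1                      # locate the next valid entry
            left, right = filled[i - 1], filled[j]
            span = j - (i - 1)
            for k in range(i, j):           # evenly spaced steps between neighbours
                filled[k] = left + (right - left) * (k - (i - 1)) / span
            i = j
        i += 1
    return filled
```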
Attributes#
Many attributes are created during the fitting phase.
- phi_: numpy.array
The coefficient of the AutoRegressive process. It represents the strength and direction of the relationship between a variable and its past values.
- theta_: numpy.array
The theta coefficient of the Moving Average process. It signifies the impact and contribution of the lagged error terms in determining the current value within the time series model.
- mean_: float
The mean of the time series values.
- features_importance_: numpy.array
The importance of features is computed through the AutoRegressive part coefficients, which are normalized based on their range. Subsequently, an activation function calculates the final score. It is necessary to use the features_importance() method to compute it initially; the computed values are then reused for subsequent calls.
- mse_: float
The mean squared error (MSE) of the model, based on one-step forward forecasting, may not always be relevant. Using a full forecasting approach is recommended to compute a more meaningful and comprehensive metric.
- n_: int
The number of rows used to fit the model.
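As a hypothetical sketch of how these attributes fit together (a textbook ARMA one-step forecast, not necessarily Vertica's exact internal formulation), phi_ weights the centered lagged values and theta_ weights the lagged residuals:

```python
def one_step_forecast(phi, theta, mean, history, residuals):
    """Illustrative one-step ARMA forecast:
    y_hat_t = mean + sum(phi[i] * (y_{t-1-i} - mean)) + sum(theta[j] * e_{t-1-j})
    history and residuals are ordered oldest to newest.
    """
    ar = sum(p * (y - mean) for p, y in zip(phi, reversed(history)))
    ma = sum(t * e for t, e in zip(theta, reversed(residuals)))
    return mean + ar + ma
```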
Note
All attributes can be accessed using the get_attributes() method.
Note
Several other attributes can be accessed by using the get_vertica_attributes() method.
Examples#
The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Machine Learning or the Examples section on the website.
Initialization#
We import verticapy:
import verticapy as vp
Hint
By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like "average" and "median", which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.
For this example, we will use the airline passengers dataset.
import verticapy.datasets as vpd
data = vpd.load_airline_passengers()
date        passengers
1949-01-01  112
1949-02-01  118
1949-03-01  132
1949-04-01  129
1949-05-01  121
...
Rows: 1-100 | Columns: 2
Note
VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets, which provides detailed information on each dataset and how to use them effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.
We can plot the data to visually inspect it for the presence of any trends:
data["passengers"].plot(ts = "date")
Though the increasing trend is obvious in our example, we can confirm it with the mkt() (Mann-Kendall test) function:
from verticapy.machine_learning.model_selection.statistical_tests import mkt
mkt(data, column = "passengers", ts = "date")
value
Mann Kendall Test Statistic  14.381116595942574
S                            8327.0
STDS                         578.953653873376
p_value                      6.798871501067664e-47
Monotonic Trend              ✅
Trend                        increasing
Rows: 1-6 | Columns: 2
The above test gives us some more insight into the data: it is monotonic and increasing. Furthermore, the low p-value confirms the presence of a trend with respect to time. Now that we are sure of the trend, we can apply an appropriate time-series model to fit it.
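For intuition, the S statistic reported by the test can be sketched in pure Python (an illustration of the classical Mann-Kendall definition, not the mkt() implementation): S counts concordant minus discordant pairs, so a large positive S suggests an increasing trend.

```python
def mann_kendall_s(values):
    """Mann-Kendall S: +1 for each later value above an earlier one, -1 below."""
    s = 0
    n = len(values)
    for i in range(n - 1):
        for j in range(i + 1, n):
            if values[j] > values[i]:
                s += 1
            elif values[j] < values[i]:
                s -= 1
    return s
```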
Model Initialization#
First we import the ARMA model:
from verticapy.machine_learning.vertica.tsa import ARMA
Then we can create the model:
model = ARMA(order = (12, 2))
Hint
In verticapy 1.0.x and higher, you do not need to specify the model name, as the name is automatically assigned. If you need to re-use the model, you can fetch the model name from the model's attributes.
Important
The model name is crucial for the model management system and versioning. It’s highly recommended to provide a name if you plan to reuse the model later.
Model Fitting#
We can now fit the model:
model.fit(data, "date", "passengers")
Important
To train a model, you can directly use the vDataFrame or the name of the relation stored in the database. The test set is optional and is only used to compute the test metrics. In verticapy, we don't work using X matrices and y vectors. Instead, we work directly with lists of predictors and the response name.
Features Importance#
We can conveniently get the features importance:
model.features_importance()
Important
Feature importance is determined by using the coefficients of the auto-regressive (AR) process and normalizing them. This method tends to be precise when your time series primarily consists of an auto-regressive component. However, its accuracy may be a topic of discussion if the time series contains other components as well.
Model Register#
In order to register the model for tracking and versioning:
model.register("model_v1")
Please refer to Model Tracking and Versioning for more details on model tracking and versioning.
An important point in time-series forecasting is that there are two types of forecasting:
One-step ahead forecasting
Full forecasting
One-step ahead#
In this type of forecasting, the algorithm utilizes the true value of the previous timestamp (t-1) to predict the immediate next timestamp (t). Subsequently, to forecast additional steps into the future (t+1), it relies on the actual value of the immediately preceding timestamp (t).
A notable drawback of this forecasting method is its tendency to exhibit exaggerated accuracy, particularly when predicting more than one step into the future.
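The mechanics can be sketched with a toy AR-only example (illustrative Python, not VerticaPy internals): every forecast is anchored on true past observations, so errors never compound.

```python
def one_step_errors(series, phi, mean):
    """One-step-ahead residuals for an AR(p) sketch:
    y_hat_t = mean + sum(phi[i] * (y_{t-1-i} - mean)), using TRUE past values.
    """
    p = len(phi)
    errors = []
    for t in range(p, len(series)):
        y_hat = mean + sum(phi[i] * (series[t - 1 - i] - mean) for i in range(p))
        errors.append(series[t] - y_hat)
    return errors
```

With phi = [1.0] and mean = 0, each forecast is simply the previous true value, so the residuals are the first differences of the series.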
Metrics#
We can get the entire report using:
model.report()
value
explained_variance       0.843011800385913
max_error                108.703124575763
median_absolute_error    23.5457433749146
mean_absolute_error      31.195252646127
mean_squared_error       1692.48056292341
root_mean_squared_error  41.1397686299207
r2                       0.842975867228999
r2_adj                   0.841494507485876
aic                      807.057132402344
bic                      812.230918666116
Rows: 1-10 | Columns: 2
You can also choose the number of predictions and where to start the forecast. For example, the following code generates a report with 30 predictions, starting the forecasting process at index 40.
model.report(start = 40, npredictions = 30)
value
explained_variance       0.421076426699653
max_error                52.5240603081696
median_absolute_error    13.4454792561496
mean_absolute_error      19.8817394978026
mean_squared_error       607.492387897198
root_mean_squared_error  24.6473606679741
r2                       0.420420176407039
r2_adj                   0.399720896993005
aic                      197.020930088516
bic                      199.082584111099
Rows: 1-10 | Columns: 2
Important
Most metrics are computed using a single SQL query, but some of them might require multiple SQL queries. Selecting only the necessary metrics in the report can help optimize performance, e.g. model.report(metrics = ["mse", "r2"]).
You can utilize the score() function to calculate various regression metrics, with the explained variance being the default.
model.score()
0.842975867228999
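For reference, explained variance, the default score() metric, is 1 - Var(y - y_hat) / Var(y). A minimal sketch (not the in-database implementation):

```python
def explained_variance(y_true, y_pred):
    """Explained variance score: 1 - Var(residuals) / Var(y_true)."""
    resid = [yt - yp for yt, yp in zip(y_true, y_pred)]

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    return 1.0 - var(resid) / var(y_true)
```

Note that, unlike R2, a constant offset in the predictions does not reduce the explained variance, since the residual variance stays zero.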
The same applies to the score. You can choose where to start and the number of predictions to use.
model.score(start = 40, npredictions = 30)
0.420420176407039
Important
If you do not specify a starting point and the number of predictions, the forecast will begin at one-fourth of the dataset, which can result in an inaccurate score, especially for large datasets. It’s important to choose these parameters carefully.
Prediction#
Prediction is straightforward:
model.predict()
prediction
1   436.808245506626
2   411.303769750774
3   456.591517112856
4   497.165582992911
5   523.414142302269
6   579.634194756896
7   670.753858449996
8   648.086244158784
9   558.685139438718
10  498.606577143251
Rows: 1-10 | Column: prediction | Type: Float(22)
Hint
You can control the number of prediction steps by changing the npredictions parameter: model.predict(npredictions = 30).
Note
Predictions can be made automatically by using the training set, in which case you don't need to specify the predictors. Alternatively, you can pass only the vDataFrame to the predict() function, but in this case, it's essential that the column names of the vDataFrame match the predictors and response name in the model.
If you would like to have the time stamps (ts) in the output, you can switch on the output_estimated_ts parameter. And if you would also like to see the standard error, you can switch on the output_standard_errors parameter:
model.predict(output_estimated_ts = True, output_standard_errors = True)
date        prediction        std_err
1961-01-01  436.808245506626  1.0
1961-02-01  411.303769750774  1.00174003420373
1961-03-01  456.591517112856  1.01233294298172
1961-04-01  497.165582992911  1.01300655943781
1961-05-01  523.414142302269  1.027160664119
1961-06-01  579.634194756896  1.02738774823065
1961-07-01  670.753858449996  1.06683194222182
1961-08-01  648.086244158784  1.06683209333843
1961-09-01  558.685139438718  1.07835995410046
1961-10-01  498.606577143251  1.08127865934304
Rows: 1-10 | Columns: 3
Important
The output_estimated_ts parameter provides an estimation of 'ts', assuming that 'ts' is regularly spaced.
If you don't provide any input, the function will begin forecasting after the last known value. If you want to forecast starting from a specific value within the input dataset or another dataset, you can use the following syntax.
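One way to emulate that estimation outside the database (a sketch under the same regular-spacing assumption; estimate_future_ts is a hypothetical helper, not a VerticaPy function):

```python
from datetime import date, timedelta

def estimate_future_ts(known_ts, npredictions):
    """Extend a regularly spaced timestamp series by npredictions steps,
    inferring the spacing from the last two known stamps."""
    step = known_ts[-1] - known_ts[-2]      # assumes constant spacing
    return [known_ts[-1] + step * (i + 1) for i in range(npredictions)]
```

Note that strictly monthly data is not regularly spaced in days, so this approximation drifts for calendar-month series.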
model.predict(
    data,
    "date",
    "passengers",
    start = 40,
    npredictions = 20,
    output_estimated_ts = True,
    output_standard_errors = True,
)
date        prediction        std_err
1952-05-01  171.548232235818  1.0
1952-06-01  194.512342704946  1.0
1952-07-01  222.671940336835  1.0
1952-08-01  252.982229727132  1.0
...
Rows: 1-20 | Columns: 3
Plots#
We can conveniently plot the predictions on a line plot to observe the efficacy of our model:
model.plot(data, "date", "passengers", npredictions = 20, start = 135)
Note
You can control the number of prediction steps by changing the npredictions parameter: model.plot(npredictions = 30).
Please refer to Machine Learning - Time Series Plots for more examples.
Full forecasting#
In this forecasting approach, the algorithm relies solely on a chosen true value for initiation. Subsequently, all predictions are established based on a series of previously predicted values.
This methodology aligns the accuracy of predictions more closely with reality. In practical forecasting scenarios, the goal is to predict all future steps, and this technique ensures a progressive sequence of predictions.
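A toy AR-only sketch of the recursion (illustrative Python, not VerticaPy internals): each prediction is appended to the inputs and reused for the next step, which is why errors can compound over the horizon.

```python
def full_forecast(history, phi, mean, npredictions):
    """Recursive AR(p) forecast: predicted values replace true ones."""
    values = list(history)
    out = []
    p = len(phi)
    for _ in range(npredictions):
        y_hat = mean + sum(phi[i] * (values[-1 - i] - mean) for i in range(p))
        values.append(y_hat)        # feed the prediction back in
        out.append(y_hat)
    return out
```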
Metrics#
We can get the report using:
model.report(start = 40, method = "forecast")
By selecting start = 40, we measure the accuracy from the 40th time stamp and continue the assessment until the last available time stamp.
value
explained_variance       0.856355581856155
max_error                171.905938422592
median_absolute_error    39.8392278219606
mean_absolute_error      46.4958633427347
mean_squared_error       3472.99371220737
root_mean_squared_error  58.932111044891
r2                       0.664855563897496
r2_adj                   0.661569834131785
aic                      852.086332997406
bic                      857.177094993708
Rows: 1-10 | Columns: 2
Notice that the accuracy using method = "forecast" is poorer than that of one-step-ahead forecasting.
You can utilize the score() function to calculate various regression metrics, with the explained variance being the default.
model.score(start = 40, npredictions = 30, method = "forecast")
0.285565495885585
Prediction#
Prediction is straightforward:
model.predict(start = 100, npredictions = 40, method = "forecast")
prediction
1   1011.09669062909
2   1148.32678897059
3   1090.51794877614
4   1009.87358230212
...
Rows: 1-40 | Column: prediction | Type: Float(22)
If you want to forecast starting from a specific value within the input dataset or another dataset, you can use the following syntax.
model.predict(
    data,
    "date",
    "passengers",
    start = 40,
    npredictions = 20,
    output_estimated_ts = True,
    output_standard_errors = True,
    method = "forecast",
)
date        prediction        std_err
1952-05-01  171.548232235818  1.0
1952-06-01  183.166871179642  1.00174003420373
1952-07-01  187.973337894722  1.01233294298172
1952-08-01  208.645306405147  1.01300655943781
...
Rows: 1-20 | Columns: 3
Plots#
We can conveniently plot the predictions on a line plot to observe the efficacy of our model:
model.plot(data, "date", "passengers", npredictions = 40, start = 120, method = "forecast")
- __init__(name: str = None, overwrite_model: bool = False, order: tuple[int] | list[int] = (0, 0), tol: float = 1e-06, max_iter: int = 100, init: Literal['zero', 'hr'] = 'zero', missing: Literal['drop', 'raise', 'zero', 'linear_interpolation'] = 'linear_interpolation') None #
Must be overridden in the child class
Methods
__init__([name, overwrite_model, order, ...]): Must be overridden in the child class
contour([nbins, chart]): Draws the model's contour plot.
deploySQL([ts, y, start, npredictions, ...]): Returns the SQL code needed to deploy the model.
does_model_exists(name[, raise_error, ...]): Checks whether the model is stored in the Vertica database.
drop(): Drops the model from the Vertica database.
export_models(name, path[, kind]): Exports machine learning models.
features_importance([show, chart]): Computes the model's features importance.
fit(input_relation, ts, y[, test_relation, ...]): Trains the model.
get_attributes([attr_name]): Returns the model attributes.
get_match_index(x, col_list[, str_check]): Returns the matching index.
get_params(): Returns the parameters of the model.
get_plotting_lib([class_name, chart, ...]): Returns the first available library (Plotly, Matplotlib, or Highcharts) to draw a specific graphic.
get_vertica_attributes([attr_name]): Returns the model's Vertica attributes.
import_models(path[, schema, kind]): Imports machine learning models.
plot([vdf, ts, y, start, npredictions, ...]): Draws the model.
predict([vdf, ts, y, start, npredictions, ...]): Predicts using the input relation.
register(registered_name[, raise_error]): Registers the model and adds it to the in-DB model versioning environment with a status of 'under_review'.
regression_report([metrics, start, ...]): Computes a regression report using multiple metrics to evaluate the model (r2, mse, max error...).
report([metrics, start, npredictions, method]): Computes a regression report using multiple metrics to evaluate the model (r2, mse, max error...).
score([metric, start, npredictions, method]): Computes the model score.
set_params([parameters]): Sets the parameters of the model.
summarize(): Summarizes the model.
to_binary(path): Exports the model to the Vertica binary format.
to_pmml(path): Exports the model to PMML.
to_python([return_proba, ...]): Returns the Python function needed for in-memory scoring without using built-in Vertica functions.
to_sql([X, return_proba, ...]): Returns the SQL code needed to deploy the model without using built-in Vertica functions.
to_tf(path): Exports the model to the Frozen Graph format (TensorFlow).
Attributes