
verticapy.machine_learning.vertica.tsa.ARIMA¶
- class verticapy.machine_learning.vertica.tsa.ARIMA(name: str = None, overwrite_model: bool = False, order: tuple[int] | list[int] = (0, 0, 0), tol: float = 1e-06, max_iter: int = 100, init: Literal['zero', 'hr'] = 'zero', missing: Literal['drop', 'error', 'zero', 'linear_interpolation'] = 'linear_interpolation')¶
Creates an in-DB ARIMA model.
Added in version 23.4.0.
Note
The AR model is much faster than ARIMA(p, 0, 0) or ARMA(p, 0) because the underlying algorithm of AR is quite different.
Note
The MA model may be faster and more accurate than ARIMA(0, 0, q) or ARMA(0, q) because the underlying algorithm of MA is quite different.
Parameters¶
- name: str, optional
Name of the model. The model is stored in the database.
- overwrite_model: bool, optional
If set to True, training a model with the same name as an existing model overwrites the existing model.
- order: tuple, optional
The (p,d,q) order of the model for the autoregressive, differences, and moving average components.
- tol: float, optional
Convergence tolerance; training stops when the improvement between iterations falls below this threshold.
- max_iter: int, optional
The maximum number of iterations the algorithm performs before stopping, even if the convergence tolerance has not been reached.
- init: str, optional
Initialization method, one of the following:
- ‘zero’:
Coefficients are initialized to zero.
- ‘hr’:
Coefficients are initialized using the Hannan-Rissanen algorithm.
- missing: str, optional
Method for handling missing values, one of the following strings:
- ‘drop’:
Missing values are ignored.
- ‘error’:
Missing values raise an error.
- ‘zero’:
Missing values are set to zero.
- ‘linear_interpolation’:
Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.
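As a rough illustration of what the 'linear_interpolation' option does, the sketch below fills interior missing values of a toy series with pure NumPy (this is only an analogy for the in-DB behavior, not VerticaPy's implementation; the sample values are made up):

```python
import numpy as np

# A toy series with interior missing values (NaN). Note that, per the
# docs above, VerticaPy errors if the first or last value is missing.
series = np.array([112.0, np.nan, 132.0, 129.0, np.nan, 135.0])

# Replace each NaN with a value interpolated linearly between the
# nearest valid entries before and after it.
idx = np.arange(len(series))
valid = ~np.isnan(series)
filled = series.copy()
filled[~valid] = np.interp(idx[~valid], idx[valid], series[valid])

print(filled)  # [112. 122. 132. 129. 132. 135.]
```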
Attributes¶
Many attributes are created during the fitting phase.
- phi_: numpy.array
The coefficient of the AutoRegressive process. It represents the strength and direction of the relationship between a variable and its past values.
- theta_: numpy.array
The theta coefficient of the Moving Average process. It signifies the impact and contribution of the lagged error terms in determining the current value within the time series model.
- mean_: float
The mean of the time series values.
- features_importance_: numpy.array
The importance of features is computed from the AutoRegressive coefficients, which are normalized based on their range; an activation function then calculates the final score. You must call the features_importance() method to compute it initially; the computed values are reused for subsequent calls.
- mse_: float
The mean squared error (MSE) of the model, based on one-step-ahead forecasting. This metric may not always be representative; computing a full multi-step forecast is recommended for a more meaningful and comprehensive evaluation.
- n_: int
The number of rows used to fit the model.
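To make the roles of phi_, theta_, and mean_ concrete, here is a textbook one-step-ahead ARMA forecast computed by hand. The coefficient values, recent observations, and residuals below are hypothetical, and the in-DB implementation may differ in details (e.g. differencing when d > 0):

```python
import numpy as np

# Hypothetical fitted attributes (illustrative values only).
phi_ = np.array([0.6, 0.2])   # AR coefficients for lags 1..p
theta_ = np.array([0.3])      # MA coefficients for lags 1..q
mean_ = 100.0                 # mean of the series

# Last p observations (most recent first) and last q residuals.
recent = np.array([110.0, 105.0])
residuals = np.array([2.0])

# One-step-ahead forecast: mean + AR part on centered values + MA part.
forecast = mean_ + phi_ @ (recent - mean_) + theta_ @ residuals
print(forecast)  # 107.6
```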
Note
All attributes can be accessed using the get_attributes() method.
Note
Several other attributes can be accessed by using the get_vertica_attributes() method.
Examples¶
The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Machine Learning or the Examples section on the website.
Initialization¶
We import verticapy:
import verticapy as vp
Hint
By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like "average" and "median", which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.
For this example, we will use the airline passengers dataset.
import verticapy.datasets as vpd

data = vpd.load_airline_passengers()
     date        passengers
1    1949-01-01  112
2    1949-02-01  118
3    1949-03-01  132
4    1949-04-01  129
5    1949-05-01  121
...
Rows: 1-100 | Columns: 2
Note
VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets, which provides detailed information on each dataset and how to use them effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.
We can plot the data to visually inspect it for the presence of any trends:
data["passengers"].plot(ts = "date")
Though the increasing trend is obvious in our example, we can confirm it with the mkt() (Mann-Kendall test) function:
from verticapy.machine_learning.model_selection.statistical_tests import mkt

mkt(data, column = "passengers", ts = "date")
                             value
Mann Kendall Test Statistic  14.381116595942574
S                            8327.0
STDS                         578.953653873376
p_value                      6.798871501067664e-47
Monotonic Trend              ✅
Trend                        increasing
Rows: 1-6 | Columns: 2

The above test gives us more insight into the data: the series is monotonic and increasing. Furthermore, the low p-value confirms the presence of a trend with respect to time. Now that we are sure of the trend, we can apply an appropriate time-series model to fit it.
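For intuition on what mkt() is reporting, the Mann-Kendall S statistic counts concordant minus discordant pairs. The toy recomputation below (on a short made-up series, not the full dataset) shows why an increasing series yields a large positive S:

```python
import numpy as np

# S = sum over all pairs i < j of sign(x[j] - x[i]).
# A large positive S suggests an increasing monotonic trend.
x = np.array([112, 118, 132, 129, 121, 135, 148])
n = len(x)
S = sum(np.sign(x[j] - x[i]) for i in range(n) for j in range(i + 1, n))
print(S)  # 15.0 (out of a maximum of n*(n-1)/2 = 21 for this n)
```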
Model Initialization¶
First, we import the ARIMA model:
from verticapy.machine_learning.vertica.tsa import ARIMA
Then we can create the model:
model = ARIMA(order = (12, 1, 2))
Hint
In verticapy 1.0.x and higher, you do not need to specify the model name, as the name is automatically assigned. If you need to re-use the model, you can fetch the model name from the model's attributes.
Important
The model name is crucial for the model management system and versioning. It’s highly recommended to provide a name if you plan to reuse the model later.
Model Fitting¶
We can now fit the model:
model.fit(data, "date", "passengers")

============
coefficients
============
parameter| value
---------+--------
phi_1    |-0.02408
phi_2    |-0.03398
phi_3    |-0.02702
phi_4    |-0.12197
phi_5    |-0.01651
phi_6    |-0.21558
phi_7    |-0.00477
phi_8    |-0.15146
phi_9    | 0.04249
phi_10   |-0.16296
phi_11   | 0.04043
phi_12   | 0.86090
theta_1  | 0.06580
theta_2  |-0.06794

==============
regularization
==============
none

===============
timeseries_name
===============
passengers

==============
timestamp_name
==============
date

==============
missing_method
==============
linear_interpolation

===========
call_string
===========
ARIMA('"public"."_verticapy_tmp_arima_v_demo_58a95b8455a511ef880f0242ac120002_"', '"public"."_verticapy_tmp_view_v_demo_58b4228a55a511ef880f0242ac120002_"', 'passengers', 'date' USING PARAMETERS p=12, d=1, q=2, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);

===============
Additional Info
===============
Name              | Value
------------------+---------
p                 | 12
d                 | 1
q                 | 2
mean              | 2.23776
lambda            | 1.00000
mean_squared_error|178.86952
rejected_row_count| 0
accepted_row_count| 144
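Since we chose order = (12, 1, 2), the series is differenced once (d = 1) before the ARMA(12, 2) part is fit, which removes the linear trend we confirmed earlier; forecasts are then integrated back. A minimal NumPy sketch of that differencing round-trip, on a few made-up values:

```python
import numpy as np

# First-order differencing (d = 1) and its inverse (integration).
y = np.array([112.0, 118.0, 132.0, 129.0, 121.0])

diff = np.diff(y)  # [6. 14. -3. -8.]: the ARMA part models this series

# Undo the differencing: cumulative-sum the differences back from y[0].
restored = y[0] + np.concatenate(([0.0], np.cumsum(diff)))
print(np.allclose(restored, y))  # True
```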
Important
To train a model, you can directly use the vDataFrame or the name of the relation stored in the database. The test set is optional and is only used to compute the test metrics. In verticapy, we don't work with X matrices and y vectors. Instead, we work directly with lists of predictors and the response name.
Features Importance¶
We can conveniently get the features importance:
model.features_importance()
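To illustrate the idea behind this score, the sketch below normalizes the magnitudes of a few AR coefficients into percentages. This is only a rough analogy for the attribute description above (VerticaPy applies a range-based normalization and an activation function, whose exact formulas may differ), and the coefficients are taken from the fit output purely for illustration:

```python
import numpy as np

# A few of the fitted phi coefficients from the output above.
phi_ = np.array([-0.02408, -0.12197, 0.86090])

# Naive importance: share of total absolute coefficient magnitude.
# (Not VerticaPy's exact formula; a simplified sketch of the idea.)
importance = 100 * np.abs(phi_) / np.abs(phi_).sum()
print(importance.round(2))  # [ 2.39 12.11 85.5 ]
```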