
verticapy.machine_learning.model_selection.statistical_tests.tsa.het_arch

verticapy.machine_learning.model_selection.statistical_tests.tsa.het_arch(input_relation: Annotated[str | vDataFrame, ''], eps: str, ts: str, by: Annotated[str | list[str], 'STRING representing one column or a list of columns'] | None = None, p: int = 1) tuple[float, float, float, float]

Engle’s Test for Autoregressive Conditional Heteroscedasticity (ARCH).
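In its standard formulation (shown here for orientation, not a statement about VerticaPy's internals), the test regresses the squared residuals on their own lags and checks whether the lag coefficients are jointly zero:

```latex
\epsilon_t^2 = \alpha_0 + \alpha_1 \epsilon_{t-1}^2 + \dots + \alpha_p \epsilon_{t-p}^2 + u_t,
\qquad H_0 : \alpha_1 = \dots = \alpha_p = 0 .
```

Under the null hypothesis of no ARCH effects, the Lagrange Multiplier statistic is asymptotically chi-squared distributed with p degrees of freedom.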

Parameters

input_relation: SQLRelation

Input relation.

eps: str

Input residual vDataColumn.

ts: str

vDataColumn used as the timeline to order the data. It can be a numeric or date-like (date, datetime, timestamp…) vDataColumn.

by: SQLColumns, optional

vDataColumns used in the partition.

p: int, optional

Number of lags to consider in the test.

Returns

tuple

Lagrange Multiplier statistic, LM pvalue, F statistic, F pvalue
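The tuple is positional, so a typical pattern is to unpack it and compare the p-values against a significance level (the values and the 0.05 threshold below are placeholders for illustration):

```python
# Unpack a het_arch-style result (placeholder values, for illustration only).
lm_stat, lm_pvalue, f_stat, f_pvalue = (5.79, 0.33, 1.15, 0.35)

alpha = 0.05  # significance level (illustrative choice)
if lm_pvalue < alpha:
    conclusion = "reject H0: ARCH effects are present"
else:
    conclusion = "fail to reject H0: no evidence of ARCH effects"

print(conclusion)
```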

Examples

Initialization

Let’s try this test on a dummy dataset that has the following elements:

  • A value of interest that has noise

  • Time-stamp data

Before we begin, we can import the necessary libraries:

import verticapy as vp

import numpy as np

Example 1: Random

Now we can create the dummy dataset:

# Initialization
N = 50 # Number of Rows.

days = list(range(N))

vals = [np.random.normal(5) for _ in days]  # i.i.d. Gaussian noise centered at 5

# vDataFrame
vdf = vp.vDataFrame(
    {
        "day": days,
        "eps": vals,
    }
)

Let us plot the distribution of noise with respect to time:

vdf.scatter(["day", "eps"])

Test

Now we can apply Engle's ARCH test:

from verticapy.machine_learning.model_selection.statistical_tests import het_arch

het_arch(input_relation = vdf, ts = "day", eps = "eps", p = 5)
(5.792308772293665, 0.3269556338536383, 1.1523251435923334, 0.3497782532314971)

Both p-values are large, so we fail to reject the null hypothesis: there is no relationship with any lag beyond what chance would produce.
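For intuition, the Lagrange Multiplier statistic can be sketched in plain NumPy as n·R² from an ordinary least-squares regression of the squared residuals on their own lags. This is an illustration of the idea only, not VerticaPy's implementation, and the function name is ours:

```python
import numpy as np

def arch_lm_sketch(eps, p=1):
    # Regress eps_t^2 on eps_{t-1}^2 ... eps_{t-p}^2 and return LM = n * R^2.
    e2 = np.asarray(eps, dtype=float) ** 2
    n = len(e2) - p
    # Design matrix: intercept plus the p lags of the squared residuals.
    lags = [e2[p - k - 1 : p - k - 1 + n] for k in range(p)]
    X = np.column_stack([np.ones(n)] + lags)
    y = e2[p:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - resid.var() / y.var()
    return n * r2  # approximately chi2(p)-distributed under H0
```

Under the null hypothesis, this statistic follows a chi-squared distribution with p degrees of freedom, which is how an LM p-value like the one above is obtained.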

Now let us contrast it with another example where the lags are related:

Example 2: Correlated

We can create an alternate dataset that exhibits correlation at a specific lag. Below, we interleave two separate sequences, one value after the other, to create a new series in which every value is related to the one two steps before it, but not to the one immediately before it.

# Initialization
N = 50 # Number of Rows

days = list(range(N))

x1 = [2 * -x for x in range(40, 40 + 5 * (N // 2), 5)]

x2 = [-2 * -x * x * x / 2 for x in range(4, 4 + 2 * (N // 2), 2)]

# Interleave the two sequences, one value after the other,
# so that vals has N entries to match days.
vals = []

for elem_1, elem_2 in zip(x1, x2):
    vals.extend([elem_1, elem_2])


# vDataFrame
vdf = vp.vDataFrame(
    {
        "day": days,
        "eps": vals,
    }
)

Let us plot the distribution of noise with respect to time to observe the trend:

vdf.scatter(["day", "eps"])