verticapy.machine_learning.model_selection.statistical_tests.tsa.durbin_watson

verticapy.machine_learning.model_selection.statistical_tests.tsa.durbin_watson(input_relation: Annotated[str | vDataFrame, ''], eps: str, ts: str, by: Annotated[str | list[str], 'STRING representing one column or a list of columns'] | None = None) → float

Durbin-Watson test (residuals autocorrelation).

Parameters

input_relation: SQLRelation

Input relation.

eps: str

Input residual vDataColumn.

ts: str

vDataColumn used as the timeline to order the data. It can be of a numerical or date-like type (date, datetime, timestamp, etc.).

by: SQLColumns, optional

vDataColumns used in the partition.
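
For instance, if a vDataFrame vdf stored one residual series per group, the grouping columns could be passed through by. A minimal sketch (the store_id column is hypothetical, not part of the dataset used below):

durbin_watson(
    input_relation = vdf,
    eps = "eps",
    ts = "day",
    by = ["store_id"],  # hypothetical grouping column
)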

Returns

float

Durbin-Watson statistic.
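
For reference, the returned value is the classical Durbin-Watson ratio computed over the time-ordered residuals e_1, ..., e_n:

d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}

The statistic ranges from 0 to 4, with values near 2 indicating no first-order autocorrelation.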

Examples

Initialization

Let’s try this test on a dummy dataset that has the following elements:

  • A value of interest that has noise related to time

  • Time-stamp data

Before we begin, we can import the necessary libraries:

import verticapy as vp

import numpy as np

Data

Now we can create the dummy dataset:

# Initialization
N = 50 # Number of Rows

days = list(range(N))

# Linear trend plus noise whose scale grows with time
y_val = [2 * x + np.random.normal(scale = 4 * x * x) for x in days]

# vDataFrame
vdf = vp.vDataFrame(
    {
        "day": days,
        "y1": y_val,
    }
)

Model Fitting

Next, we can fit a Linear Model. To do that, we first need to import the model and initialize it:

from verticapy.machine_learning.vertica.linear_model import LinearRegression

model = LinearRegression()

Next we can fit the model:

model.fit(vdf, X = "day", y = "y1")


=======
details
=======
predictor|coefficient| std_err |t_value |p_value 
---------+-----------+---------+--------+--------
Intercept|-1059.96780|849.05804|-1.24840| 0.21794
   day   | 107.60489 |29.86052 | 3.60358| 0.00074


==============
regularization
==============
type| lambda 
----+--------
none| 1.00000


===========
call_string
===========
linear_reg('"public"."_verticapy_tmp_linearregression_v_demo_21766c8a55a311ef880f0242ac120002_"', '"public"."_verticapy_tmp_view_v_demo_218560fa55a311ef880f0242ac120002_"', '"y1"', '"day"'
USING PARAMETERS optimizer='newton', epsilon=1e-06, max_iterations=100, regularization='none', lambda=1, alpha=0.5, fit_intercept=true)

===============
Additional Info
===============
       Name       |Value
------------------+-----
 iteration_count  |  1  
rejected_row_count|  0  
accepted_row_count| 50  

We can create a column in the vDataFrame that has the predictions:

model.predict(vdf, X = "day", name = "y_pred")
 day |         y1         |      y_pred
-----+--------------------+------------------
   0 | 0.0                | -1059.96779943289
   1 | 3.0323024307400877 | -952.36290617722
   2 | 2.6262768227060183 | -844.758012921546
   3 | 38.70522660695966  | -737.153119665873
   4 | 95.15498880502142  | -629.5482264102
 ... | ...                | ...
Rows: 1-50 | Columns: 3

Then we can calculate the residuals, i.e., eps:

vdf["eps"] = vdf["y1"] - vdf["y_pred"]

We can plot the residuals to see the trend:

vdf.scatter(["day", "eps"])

Test

Now we can apply the Durbin-Watson test:

from verticapy.machine_learning.model_selection.statistical_tests import durbin_watson

durbin_watson(input_relation = vdf, ts = "day", eps = "eps")
2.19927329804118

The Durbin-Watson statistic is greater than 2, which suggests some negative autocorrelation in the residuals.

Note

The Durbin-Watson statistic can be interpreted as follows:

  • Approximately 2: no significant autocorrelation.

  • Less than 2: positive autocorrelation (residuals are positively correlated with their lagged values).

  • Greater than 2: negative autocorrelation (residuals are negatively correlated with their lagged values).
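
As a cross-check, the statistic can also be computed directly from its definition with NumPy. A minimal sketch, assuming eps_array holds the residuals already ordered by day (exporting them from the vDataFrame is left out here):

import numpy as np

def dw_statistic(eps_array):
    # Durbin-Watson ratio: sum of squared successive differences
    # of the residuals divided by their total sum of squares.
    diff = np.diff(eps_array)  # e_t - e_{t-1}
    return float(np.sum(diff ** 2) / np.sum(eps_array ** 2))

Applied to the eps column ordered by day, this should reproduce the value returned by durbin_watson up to floating-point error.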