
verticapy.machine_learning.model_selection.statistical_tests.ols.het_white#

verticapy.machine_learning.model_selection.statistical_tests.ols.het_white(input_relation: str | vDataFrame, eps: str, X: str | list[str]) → tuple[float, float, float, float]#

White’s Lagrange Multiplier Test for Heteroscedasticity.

Parameters#

input_relation: SQLRelation

Input relation.

eps: str

Input residual vDataColumn.

X: str | list[str]

Exogenous variables on which to test for heteroscedasticity.

Returns#

tuple

Lagrange Multiplier statistic, LM p-value, F statistic, F p-value
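
Conceptually, White's test regresses the squared residuals on the exogenous variables, their squares, and (with several variables) their cross products; the LM statistic is then n · R² of that auxiliary regression, and the F statistic is the overall F of the same fit. The following numpy/scipy sketch of the single-regressor case is for intuition only and is not VerticaPy's internal implementation:

import numpy as np
from scipy import stats

def white_test_sketch(eps, x):
    # Auxiliary regression: eps^2 on [1, x, x^2].
    n = len(eps)
    Z = np.column_stack([np.ones(n), x, x ** 2])
    y = eps ** 2
    beta, *_ = np.linalg.lstsq(Z, y, rcond = None)
    u = y - Z @ beta
    r2 = 1 - (u @ u) / np.sum((y - y.mean()) ** 2)
    k = Z.shape[1] - 1                       # auxiliary regressors, constant excluded
    lm = n * r2                              # LM statistic, asymptotically chi2(k)
    lm_pvalue = stats.chi2.sf(lm, k)
    f = (r2 / k) / ((1 - r2) / (n - k - 1))  # overall F of the auxiliary regression
    f_pvalue = stats.f.sf(f, k, n - k - 1)
    return lm, lm_pvalue, f, f_pvalue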

Examples#

Initialization#

Let’s try this test on a dummy dataset that has the following elements:

  • x (a predictor)

  • y (the response)

  • Random noise

Note

This test requires eps, the residual, i.e. the difference between the true value and the predicted value. If you already have eps available, you can use it directly instead of recomputing it, as demonstrated in the example below.
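
For instance, if a relation already stores precomputed residuals, the test can be applied to it directly. The table and column names below are hypothetical, and het_white is imported as shown later on this page:

lm_statistic, lm_pvalue, f_statistic, f_pvalue = het_white(
    "my_table",           # hypothetical relation that already contains residuals
    eps = "residual",     # hypothetical precomputed residual column
    X = ["x1", "x2"],     # hypothetical exogenous columns
)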

Before we begin, we can import the necessary libraries:

import verticapy as vp

import numpy as np

from verticapy.machine_learning.vertica.linear_model import LinearRegression
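
These examples assume an active connection to a Vertica database. If you are not connected yet, a minimal sketch with placeholder credentials looks like this:

# Placeholder connection details; replace with your own.
vp.new_connection(
    {
        "host": "localhost",
        "port": "5433",
        "database": "testdb",
        "user": "dbadmin",
        "password": "",
    },
    name = "my_connection",
)

vp.connect("my_connection")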

Example 1: Homoscedasticity#

Next, we can create some values with random noise:

y_vals = [0, 2, 4, 6, 8, 10] + np.random.normal(0, 0.4, 6)

We can use those values to create the vDataFrame:

vdf = vp.vDataFrame(
    {
        "x": [0, 1, 2, 3, 4, 5],
        "y": y_vals,
    }
)

We can initialize a regression model:

model = LinearRegression()

Fit that model on the dataset:

model.fit(input_relation = vdf, X = "x", y = "y")

We can create a column in the vDataFrame that has the predictions:

model.predict(vdf, X = "x", name = "y_pred")
Out[8]:
      x                      y                y_pred
1     0   -0.11159529612786812    -0.233172689639161
2     1      2.246904311020397       1.8116606069923
3     2      3.489126653145639      3.85649390362376
4     3      5.489775535029665      5.90132720025522
5     4      7.522503795103315      7.94616049688668
6     5      10.63674831346581      9.99099379351814
Rows: 1-6 | Columns: 3

Then we can calculate the residuals, i.e. eps:

vdf["eps"] = vdf["y"] - vdf["y_pred"]

We can plot the residuals to see the trend:

vdf.scatter(["x", "eps"])

Notice the randomness of the residuals with respect to x. This suggests that the noise is homoscedastic.

To run the test, we can import the test function:

from verticapy.machine_learning.model_selection.statistical_tests import het_white

And simply apply it on the vDataFrame:

lm_statistic, lm_pvalue, f_statistic, f_pvalue = het_white(vdf, eps = "eps", X = "x")
print(lm_statistic, lm_pvalue, f_statistic, f_pvalue)
4.3422907319481485 0.03717687222806552 10.477810109733486 0.0317657498402267

Since the noise here is homoscedastic, we expect relatively high p-values and low test statistics. Keep in mind that with only six data points and random noise, the exact values can vary from run to run.

Note

A p-value in statistics represents the probability of obtaining results as extreme as, or more extreme than, the observed data, assuming the null hypothesis is true. A smaller p-value suggests stronger evidence against the null hypothesis, which in this case is that the noise is homoscedastic; a small p-value therefore points toward heteroscedasticity.

However, "small" is a relative term, and the threshold that defines it should be chosen before analyzing the data.

Generally, a p-value below 0.05 is taken as grounds to reject the null hypothesis, but this convention is not universal.
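
In code, applying such a pre-chosen threshold is a one-liner; alpha = 0.05 below is the conventional (assumed) choice:

alpha = 0.05  # significance level, fixed before looking at the data

if lm_pvalue < alpha:
    print("Reject H0: evidence of heteroscedastic noise.")
else:
    print("Fail to reject H0: no evidence of heteroscedastic noise.")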

Note

The F-statistic tests the overall significance of a model, while the LM statistic tests the validity of linear restrictions on the model parameters. In this case, high values of either statistic indicate heteroscedastic noise.
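
If statsmodels is installed, the result can be sanity-checked against its implementation of the same test. The sketch below assumes the vDataFrame can be exported client-side with to_numpy(); note that statsmodels expects an exogenous matrix that includes an explicit constant:

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white as sm_het_white

arr = vdf[["x", "eps"]].to_numpy().astype(float)
exog = sm.add_constant(arr[:, 0])  # statsmodels requires the constant term

lm, lm_pval, f, f_pval = sm_het_white(arr[:, 1], exog)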

Example 2: Heteroscedasticity#

We can contrast the above result with a dataset that has heteroscedastic noise:

# y values
y_vals = np.array([0, 2, 4, 6, 8, 10])

# Adding some heteroscedastic noise
y_vals = y_vals + [0.5, 0.3, 0.2, 0.1, 0.05, 0]
vdf = vp.vDataFrame(
    {
        "x": [0, 1, 2, 3, 4, 5],
        "y": y_vals,
    }
)
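
As an aside, the "noise" above is a small deterministic drift rather than random noise. A more typical way to simulate genuinely heteroscedastic noise is to let the noise scale grow with the predictor, as in the sketch below; the rest of this example continues with the data defined above:

x_vals = np.array([0, 1, 2, 3, 4, 5])

# The noise standard deviation grows with x, so the noise variance depends on x.
y_hetero = 2 * x_vals + np.random.normal(0, 0.1 + 0.5 * x_vals)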

We can initialize a regression model:

model = LinearRegression()

Fit that model on the dataset:

model.fit(input_relation = vdf, X = "x", y = "y")

We can create a column in the vDataFrame that has the predictions:

model.predict(vdf, X = "x", name = "y_pred")
Out[18]:
      x       y              y_pred
1     0     0.5    0.43095238095238
2     1     2.3     2.3352380952381
3     2     4.2    4.23952380952381
4     3     6.1    6.14380952380952
5     4    8.05    8.04809523809524
6     5    10.0    9.95238095238095
Rows: 1-6 | Columns: 3

Then we can calculate the residuals, i.e. eps:

vdf["eps"] = vdf["y"] - vdf["y_pred"]

We can plot the residuals to see the trend:

vdf.scatter(["x", "eps"])

Notice the pattern in the residuals with respect to x. This shows that the noise is heteroscedastic.

Now we can perform the test on this dataset:

lm_statistic, lm_pvalue, f_statistic, f_pvalue = het_white(vdf, eps = "eps", X = "x")
print(lm_statistic, lm_pvalue, f_statistic, f_pvalue)
4.0173425105493 0.04503462183546162 8.104965243719054 0.04654043801519015

Note

Contrast the two test results: in this dataset the noise is heteroscedastic, and both p-values fall below the conventional 0.05 threshold, so we reject the null hypothesis of homoscedasticity, confirming that the noise is indeed heteroscedastic.
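
To wrap up, the whole workflow can be condensed into a small helper built only from the calls used above (the helper name and its return convention are our own):

def check_heteroscedasticity(vdf, x_col, y_col, alpha = 0.05):
    # Fit a linear model, attach predictions, and form the residuals,
    # exactly as in the examples above.
    model = LinearRegression()
    model.fit(input_relation = vdf, X = x_col, y = y_col)
    model.predict(vdf, X = x_col, name = "y_pred")
    vdf["eps"] = vdf[y_col] - vdf["y_pred"]
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(vdf, eps = "eps", X = x_col)
    # True => reject the null hypothesis of homoscedasticity.
    return lm_pvalue < alpha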