verticapy.machine_learning.model_selection.statistical_tests.ols.het_breuschpagan#
- verticapy.machine_learning.model_selection.statistical_tests.ols.het_breuschpagan(input_relation: str | vDataFrame, eps: str, X: str | list[str]) → tuple[float, float, float, float]#
Uses the Breusch-Pagan test to check a model for heteroscedasticity.
Parameters#
- input_relation: SQLRelation
Input relation.
- eps: str
Input residual vDataColumn.
- X: list
The exogenous variables to test.
Returns#
- tuple
Lagrange Multiplier statistic, LM p-value, F statistic, F p-value
Examples#
Initialization#
Let’s try this test on a dummy dataset that has the following elements:
x (a predictor)
y (the response)
Random noise
Note

This test requires eps, which represents the difference between the predicted value and the true value. If you already have eps available, you can use it directly instead of recomputing it, as demonstrated in the example below.

Before we begin, we can import the necessary libraries:
import verticapy as vp
import numpy as np
from verticapy.machine_learning.vertica.linear_model import LinearRegression
Example 1: Homoscedasticity#
Next, we can create some values with random noise:
y_vals = [0, 2, 4, 6, 8, 10] + np.random.normal(0, 0.4, 6)
We can use those values to create the vDataFrame:

vdf = vp.vDataFrame(
    {
        "x": [0, 1, 2, 3, 4, 5],
        "y": y_vals,
    }
)
We can initialize a regression model:
model = LinearRegression()
Fit that model on the dataset:
model.fit(input_relation = vdf, X = "x", y = "y")
We can create a column in the vDataFrame that has the predictions:

model.predict(vdf, X = "x", name = "y_pred")

	x	y	y_pred
1	0	0.4410034003495173	0.51111304610273
2	1	2.5316250528125286	2.38509059539907
3	2	4.316927255256366	4.25906814469541
4	3	6.312677815827419	6.13304569399174
5	4	7.238592067046039	8.00702324328808
6	5	10.335515924769565	9.88100079258442

Rows: 1-6 | Columns: 3
Then we can calculate the residuals, i.e. eps:

vdf["eps"] = vdf["y"] - vdf["y_pred"]
We can plot the residuals to see the trend:
vdf.scatter(["x", "eps"])
Notice the randomness of the residuals with respect to x. This shows that the noise is homoscedastic.
To test its score, we can import the test function:
from verticapy.machine_learning.model_selection.statistical_tests import het_breuschpagan
And simply apply it on the vDataFrame:

lm_statistic, lm_pvalue, f_statistic, f_pvalue = het_breuschpagan(vdf, eps = "eps", X = "x")
print(lm_statistic, lm_pvalue, f_statistic, f_pvalue)

2.3918568365040778 0.12196867909592389 2.6516207679371715 0.17877538266140303
As the noise was not heteroscedastic, we got higher p-values and lower statistic scores.
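For a single exogenous variable, the LM statistic is compared against a chi-square distribution with one degree of freedom under the null hypothesis. As a rough sanity check, the LM p-value can be recovered from the LM statistic by hand (a minimal sketch using only the standard library; `chi2_sf_1df` is a hypothetical helper name, not part of VerticaPy):

```python
import math

def chi2_sf_1df(x):
    """Survival function of the chi-square distribution with one
    degree of freedom: P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Plugging in the LM statistic printed above recovers the reported
# LM p-value (about 0.122).
print(chi2_sf_1df(2.3918568365040778))
```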
Note

A p-value in statistics represents the probability of obtaining results as extreme as, or more extreme than, the observed data, assuming the null hypothesis is true. A smaller p-value typically suggests stronger evidence against the null hypothesis; here, the null hypothesis is that the noise is homoscedastic. However, "small" is a relative term, and the threshold that determines what counts as small should be chosen before analyzing the data.
Generally, a p-value less than 0.05 is considered the threshold to reject the null hypothesis, but this is not always the case.

Note
The F-statistic tests the overall significance of a model, while the LM statistic tests the validity of linear restrictions on the model parameters. High values indicate heteroscedastic noise in this case.
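To make the LM statistic concrete: the Breusch-Pagan test regresses the squared residuals on the exogenous variables and sets LM = n * R² of that auxiliary regression. Below is a minimal pure-Python sketch for a single predictor; the name `bp_lm_statistic` is illustrative, not part of the VerticaPy API:

```python
def bp_lm_statistic(x, eps):
    """Breusch-Pagan LM statistic for one predictor: regress the
    squared residuals on x and return n * R^2 of that regression."""
    n = len(x)
    e2 = [e * e for e in eps]          # squared residuals
    mx = sum(x) / n
    me2 = sum(e2) / n
    # OLS fit of e^2 on x (slope and intercept)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (ei - me2) for xi, ei in zip(x, e2))
    slope = sxy / sxx
    intercept = me2 - slope * mx
    # R^2 of the auxiliary regression
    fitted = [intercept + slope * xi for xi in x]
    ss_res = sum((ei - fi) ** 2 for ei, fi in zip(e2, fitted))
    ss_tot = sum((ei - me2) ** 2 for ei in e2)
    r2 = 1.0 - ss_res / ss_tot
    return n * r2

# When the squared residuals grow exactly linearly with x, R^2 is 1
# and the statistic equals n.
print(bp_lm_statistic([0, 1, 2, 3], [0.0, 1.0, 2 ** 0.5, 3 ** 0.5]))
```

The larger the statistic relative to a chi-square distribution with k degrees of freedom (k being the number of exogenous variables), the stronger the evidence of heteroscedasticity.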
Example 2: Heteroscedasticity#
We can contrast the above result with a dataset that has heteroscedastic noise:

# y values
y_vals = np.array([0, 2, 4, 6, 8, 10])

# Adding some heteroscedastic noise
y_vals = y_vals + [0.5, 0.3, 0.2, 0.1, 0.05, 0]

vdf = vp.vDataFrame(
    {
        "x": [0, 1, 2, 3, 4, 5],
        "y": y_vals,
    }
)
We can initialize a regression model:
model = LinearRegression()
Fit that model on the dataset:
model.fit(input_relation = vdf, X = "x", y = "y")
We can create a column in the vDataFrame that has the predictions:

model.predict(vdf, X = "x", name = "y_pred")

	x	y	y_pred
1	0	0.5	0.43095238095238
2	1	2.3	2.3352380952381
3	2	4.2	4.23952380952381
4	3	6.1	6.14380952380952
5	4	8.05	8.04809523809524
6	5	10.0	9.95238095238095

Rows: 1-6 | Columns: 3
Then we can calculate the residuals, i.e. eps:

vdf["eps"] = vdf["y"] - vdf["y_pred"]
We can plot the residuals to see the trend:
vdf.scatter(["x", "eps"])
Notice the relationship between the residuals and x. This shows that the noise is heteroscedastic.
Now we can perform the test on this dataset:
lm_statistic, lm_pvalue, f_statistic, f_pvalue = het_breuschpagan(vdf, eps = "eps", X = "x")
print(lm_statistic, lm_pvalue, f_statistic, f_pvalue)

1.726937533721334 0.18880247576982115 1.616580658344827 0.2724702876334907
Note

Notice the contrast between the two test results. In a dataset with heteroscedastic noise, we expect lower p-values and higher statistic scores, which is evidence against the null hypothesis of homoscedasticity. Keep in mind, however, that with only six data points the test has limited power, so the p-values may still sit above conventional thresholds.