verticapy.machine_learning.model_selection.statistical_tests.ols.het_goldfeldquandt
- verticapy.machine_learning.model_selection.statistical_tests.ols.het_goldfeldquandt(input_relation: str | vDataFrame, y: str, X: str | list[str], idx: int = 0, split: float = 0.5, alternative: Literal['increasing', 'decreasing', 'two-sided'] = 'increasing') → tuple[float, float]
Goldfeld-Quandt Homoscedasticity test.
Parameters
- input_relation: SQLRelation
Input relation.
- y: str
Response Column.
- X: SQLColumns
Exogenous Variables.
- idx: int, optional
Column index of the exogenous variable by which observations are sorted for the split.
- split: float, optional
Float to indicate where to split (Example: 0.5 to split on the median).
- alternative: str, optional
Specifies the alternative hypothesis for the p-value calculation: "increasing", "decreasing", or "two-sided".
Returns
- tuple
statistic, p_value
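Conceptually, the test sorts the observations by one of the exogenous variables, splits them into two groups, fits a separate OLS regression on each group, and compares the residual variances with an F-test. The following NumPy/SciPy sketch illustrates that idea for a single predictor; it is an illustrative toy, not VerticaPy's actual implementation:

```python
import numpy as np
from scipy import stats

def goldfeld_quandt(x, y, split=0.5, alternative="increasing"):
    """Toy Goldfeld-Quandt test for one predictor (illustration only)."""
    # Sort observations by the predictor (the role of the ``idx`` parameter).
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    cut = int(len(x) * split)  # the role of the ``split`` parameter

    def rss_df(xs, ys):
        # Residual sum of squares of an OLS fit y ~ 1 + x, and its
        # degrees of freedom (n observations minus 2 fitted parameters).
        X = np.column_stack([np.ones_like(xs), xs])
        beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
        resid = ys - X @ beta
        return resid @ resid, len(xs) - 2

    rss1, df1 = rss_df(x[:cut], y[:cut])
    rss2, df2 = rss_df(x[cut:], y[cut:])
    # Ratio of residual variances: large when the variance grows with x.
    statistic = (rss2 / df2) / (rss1 / df1)
    if alternative == "increasing":
        pvalue = stats.f.sf(statistic, df2, df1)
    elif alternative == "decreasing":
        pvalue = stats.f.cdf(statistic, df2, df1)
    else:  # "two-sided"
        pvalue = 2 * min(stats.f.sf(statistic, df2, df1),
                         stats.f.cdf(statistic, df2, df1))
    return statistic, pvalue
```

On data whose noise variance grows with the predictor, this returns a large statistic and a small p-value; on homoscedastic data the statistic stays near 1.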
Examples
Initialization
Let’s try this test on a dummy dataset that has the following elements:
x (a predictor)
y (the response)
Random noise
Before we begin, we can import the necessary libraries:

```python
import verticapy as vp
import numpy as np
```
Example 1: Homoscedasticity
Next, we can create some values with random noise:
```python
N = 50  # Number of rows
x_val = list(range(N))
y_val = [x * 2 for x in x_val] + np.random.normal(0, 0.4, N)
```
We can use those values to create the vDataFrame:

```python
vdf = vp.vDataFrame(
    {
        "x": x_val,
        "y": y_val,
    }
)
```
We can plot the values to see the trend:
```python
vdf.scatter(["x", "y"])
```
Notice the randomness of the data with respect to x. This shows that the noise is homoscedastic.
To run the test, we can import the test function:
from verticapy.machine_learning.model_selection.statistical_tests import het_goldfeldquandt
And simply apply it to the vDataFrame:

```python
statistic, pvalue = het_goldfeldquandt(vdf, y="y", X="x")
print(statistic, pvalue)
# 1.5172856550914458 0.15694752301480847
```
Since the noise is homoscedastic, we get a relatively high p-value and a low test statistic.
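As a small illustration of the usual decision rule (the 0.05 threshold below is a common convention chosen in advance, not something the test itself prescribes):

```python
alpha = 0.05  # significance threshold, fixed before the analysis
pvalue = 0.15694752301480847  # p-value reported above

# The null hypothesis here is homoscedasticity; we reject it only
# when the p-value falls below the pre-chosen threshold.
reject_homoscedasticity = pvalue < alpha
print(reject_homoscedasticity)  # False: no evidence of heteroscedasticity
```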
Note

A p_value in statistics represents the probability of obtaining results as extreme as, or more extreme than, the observed data, assuming the null hypothesis is true. A smaller p-value typically suggests stronger evidence against the null hypothesis, i.e., in this case, evidence that the noise is heteroscedastic. However, "small" is a relative term, and the threshold that defines a "small" p-value should be chosen before analyzing the data.
Generally, a p-value less than 0.05 is considered the threshold for rejecting the null hypothesis, but this is not always the case.
Note
The F-statistic tests the overall significance of a model, while the LM statistic tests the validity of linear restrictions on model parameters. In this case, high values indicate heteroscedastic noise.
Example 2: Heteroscedasticity
We can contrast the above result with a dataset that has heteroscedastic noise:
```python
# y values
x_val = list(range(N))
y_val = [x * 2 for x in x_val]
# Adding heteroscedastic noise whose scale grows with x
y_val = [y + np.random.normal(0, 0.4 * x) for x, y in zip(x_val, y_val)]
```
```python
vdf = vp.vDataFrame(
    {
        "x": x_val,
        "y": y_val,
    }
)
```
We can plot the data to see the trend:
```python
vdf.scatter(["x", "y"])
```
Notice how the spread of the points around the trend grows with x. This shows that the noise is heteroscedastic.
Now we can perform the test on this dataset:
```python
statistic, pvalue = het_goldfeldquandt(vdf, y="y", X="x")
print(statistic, pvalue)
```

Because the noise is random, the exact values vary from run to run.
Note

Notice the contrast between the two test results. In this dataset, the variance of the noise increases with x, so the test returns a much larger statistic and a much smaller p-value than in Example 1, confirming that the noise is in fact heteroscedastic.