verticapy.machine_learning.model_selection.statistical_tests.ols.het_goldfeldquandt#

verticapy.machine_learning.model_selection.statistical_tests.ols.het_goldfeldquandt(input_relation: str | vDataFrame, y: str, X: str | list[str], idx: int = 0, split: float = 0.5, alternative: Literal['increasing', 'decreasing', 'two-sided'] = 'increasing') → tuple[float, float]#

Goldfeld-Quandt Homoscedasticity test.

Parameters#

input_relation: SQLRelation

Input relation.

y: str

Response Column.

X: SQLColumns

Exogenous Variables.

idx: int, optional

Index of the column in X according to which the observations are sorted for the split.

split: float, optional

Float to indicate where to split (Example: 0.5 to split on the median).

alternative: str, optional

Specifies the alternative hypothesis for the p-value calculation; one of the following: “increasing”, “decreasing”, “two-sided”.

Returns#

tuple

statistic, p_value

Examples#

Initialization#

Let’s try this test on a dummy dataset that has the following elements:

  • x (a predictor)

  • y (the response)

  • Random noise

Before we begin, we can import the necessary libraries:

import verticapy as vp

import numpy as np
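
Since the examples draw random noise from NumPy, the exact numbers shown below will differ from run to run. If you want reproducible results, you can optionally fix the seed first:

# Optional: fix NumPy's random seed so the generated noise is reproducible
np.random.seed(0)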

Example 1: Homoscedasticity#

Next, we can create some values with random noise:

N = 50 # Number of rows

# x values: a simple increasing sequence
x_val = list(range(N))

# y values: a linear trend plus constant-variance (homoscedastic) noise
y_val = [x * 2 for x in x_val] + np.random.normal(0, 0.4, N)

We can use those values to create the vDataFrame:

vdf = vp.vDataFrame(
    {
        "x": x_val,
        "y": y_val,
    }
)

We can plot the values to see the trend:

vdf.scatter(["x", "y"])

Notice that the spread of the points stays constant with respect to x. This shows that the noise is homoscedastic.

To run the test, we can import the test function:

from verticapy.machine_learning.model_selection.statistical_tests import het_goldfeldquandt

And simply apply it to the vDataFrame:

statistic, pvalue = het_goldfeldquandt(vdf, y = "y", X = "x")
print(statistic, pvalue)
1.5172856550914458 0.15694752301480847

Because the noise is homoscedastic, we get a fairly high p-value and a statistic that is not significantly large, so the test fails to reject the null hypothesis of homoscedasticity.

Note

A p-value in statistics represents the probability of obtaining results as extreme as, or more extreme than, the observed data, assuming the null hypothesis is true. A smaller p-value typically suggests stronger evidence against the null hypothesis, i.e., evidence that the noise is heteroscedastic in the current case.

However, “small” is a relative term, and the threshold that defines a “small” p-value should be chosen before analyzing the data.

Generally, a p-value of less than 0.05 is considered the threshold to reject the null hypothesis, but this is a convention rather than a universal rule.
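
For illustration, a simple decision rule with a pre-chosen significance level could look like the following sketch; the 0.05 threshold here is a common convention, not something prescribed by the test:

# Hypothetical decision rule using a pre-chosen significance level
alpha = 0.05
if pvalue < alpha:
    print("Reject the null hypothesis: evidence of heteroscedastic noise.")
else:
    print("Fail to reject the null hypothesis: no evidence against homoscedasticity.")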

Note

The F-statistic tests the overall significance of a model, while the LM statistic tests the validity of linear restrictions on model parameters. In this case, a high value of the test statistic indicates heteroscedastic noise.
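
To make the mechanics concrete, here is a minimal NumPy/SciPy sketch of the idea behind the Goldfeld-Quandt statistic: sort the observations by the chosen column, split them, fit OLS on each part, and compare the residual variances with an F-test. This is only an illustrative sketch, not VerticaPy's implementation:

import numpy as np
from scipy.stats import f

def goldfeld_quandt_sketch(x, y, split=0.5):
    # Sort the observations by the sorting variable x.
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    cut = int(len(x) * split)

    def ssr(x_part, y_part):
        # Residual sum of squares and degrees of freedom of an OLS fit y ~ 1 + x.
        X = np.column_stack([np.ones(len(x_part)), x_part])
        beta, *_ = np.linalg.lstsq(X, y_part, rcond=None)
        resid = y_part - X @ beta
        return resid @ resid, len(x_part) - X.shape[1]

    ssr1, df1 = ssr(x[:cut], y[:cut])  # lower part of the sample
    ssr2, df2 = ssr(x[cut:], y[cut:])  # upper part of the sample

    # F-statistic: ratio of the residual variances; large values support
    # the "increasing" alternative (variance grows with x).
    statistic = (ssr2 / df2) / (ssr1 / df1)
    p_value = f.sf(statistic, df2, df1)
    return statistic, p_value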

Example 2: Heteroscedasticity#

We can contrast the above result with a dataset that has heteroscedastic noise:

# x values: the same increasing sequence as before
x_val = list(range(N))

# y values: the same linear trend
y_val = [x * 2 for x in x_val]

# Adding some heteroscedastic noise: the standard deviation grows with x
y_val = [y + np.random.normal(0, x / 5) for x, y in zip(x_val, y_val)]

vdf = vp.vDataFrame(
    {
        "x": x_val,
        "y": y_val,
    }
)

We can plot the data to see the trend:

vdf.scatter(["x", "y"])

Notice that the spread of the points grows with x. This shows that the noise is heteroscedastic.

Now we can perform the test on this dataset:

statistic, pvalue = het_goldfeldquandt(vdf, y = "y", X = "x")
print(statistic, pvalue)

Because the noise variance grows with x, the test now returns a statistic well above 1 and a small p-value; the exact values vary from run to run unless a random seed is fixed.

Note

Notice the contrast between the two test results. In this dataset the noise is heteroscedastic, so we get a low p-value and a high statistic, confirming that the noise is in fact heteroscedastic.
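
Finally, the optional parameters let you tailor the test. For example, a hypothetical call that splits the sorted data at the 40% mark and tests against a two-sided alternative would look like this (the exact output depends on your data):

statistic, pvalue = het_goldfeldquandt(
    vdf,
    y = "y",
    X = "x",
    idx = 0,                    # sort by the first (and only) column of X
    split = 0.4,                # split after 40% of the sorted observations
    alternative = "two-sided",  # test for a variance change in either direction
)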