verticapy.machine_learning.model_selection.statistical_tests.norm.jarque_bera#

verticapy.machine_learning.model_selection.statistical_tests.norm.jarque_bera(input_relation: str | vDataFrame, column: str) -> tuple[float, float]#

Jarque-Bera test (Distribution Normality). It checks whether the skewness and kurtosis of the sample match those of a normal distribution.

Parameters#

input_relation: SQLRelation

Input relation.

column: str

Input vDataColumn to test.

Returns#

tuple

statistic, p_value

Examples#

Let’s try this test on two sets of data to observe the contrast in the test results:

  • normally distributed dataset

  • uniformly distributed dataset

Normally Distributed#

Import the necessary libraries:

import verticapy as vp
import numpy as np
import random

Then we can define the basic parameters for the normal distribution:

# Distribution parameters
N = 100 # Number of rows
mean = 0
std_dev = 1

# Dataset
data = np.random.normal(mean, std_dev, N)
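Because the dataset is drawn at random, the statistic and p-value change from run to run. A minimal sketch of one way to make the sample repeatable, assuming NumPy's `default_rng` generator is acceptable here; the `SEED` value is arbitrary, and the outputs shown on this page were not produced with it:

```python
import numpy as np

SEED = 42  # arbitrary value chosen for illustration (an assumption)

# A seeded generator produces the same sample on every run.
rng = np.random.default_rng(SEED)
data = rng.normal(loc=0, scale=1, size=100)

# Re-creating the generator with the same seed yields the same sample.
data_again = np.random.default_rng(SEED).normal(loc=0, scale=1, size=100)
print(np.array_equal(data, data_again))  # True
```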

Now we can create the vDataFrame:

vdf = vp.vDataFrame({"col": data})

We can visualize the distribution:

vdf["col"].hist()

To find the test p-value, we can import the test function:

from verticapy.machine_learning.model_selection.statistical_tests import jarque_bera

And simply apply it on the vDataFrame:

jarque_bera(vdf, column = "col")
Out[3]: (2.21680972244939, 0.33008507285184063)

The p-value is high (about 0.33), so we cannot reject the null hypothesis that the data is normally distributed. The low Jarque-Bera test statistic is consistent with this: the sample shows no significant departure from normality.
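To make the link between the statistic and the p-value concrete, here is a minimal sketch of the textbook form of the test: JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis, with the p-value taken from a chi-squared distribution with 2 degrees of freedom. The helper `jarque_bera_sketch` is hypothetical and uses biased (population) moments; it is not VerticaPy's implementation, which runs in the database and may differ in detail.

```python
import math

def jarque_bera_sketch(values):
    """Hypothetical helper: return (statistic, p_value) for a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Biased (population) central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    statistic = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-squared distribution with 2 degrees of
    # freedom, which has the closed form exp(-x / 2).
    p_value = math.exp(-statistic / 2)
    return statistic, p_value

# The statistic reported above maps to the reported p-value:
print(math.exp(-2.21680972244939 / 2))  # approximately 0.3301
```

Under this form, a statistic near zero means the sample skewness is close to 0 and the kurtosis close to 3, which is what a normal sample is expected to show.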

Note

A p-value is the probability of obtaining results as extreme as, or more extreme than, the observed data, assuming the null hypothesis is true. A smaller p-value suggests stronger evidence against the null hypothesis, which here is that the data comes from a normal distribution.

However, “small” is a relative term, and the threshold that defines “small” should be chosen before analyzing the data.

Generally, a p-value below 0.05 is taken as the threshold for rejecting the null hypothesis, but this is a convention and is not always appropriate.
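The decision rule described in this note can be sketched as follows. The helper `normality_decision` and the constant `ALPHA` are hypothetical names introduced for illustration; 0.05 is only the conventional default mentioned above.

```python
ALPHA = 0.05  # significance level, fixed before looking at the data (assumed value)

def normality_decision(p_value, alpha=ALPHA):
    """Hypothetical helper: turn a p-value into a reject / fail-to-reject decision."""
    if p_value < alpha:
        return "reject normality"
    return "fail to reject normality"

# p-value reported for the normally distributed dataset above
print(normality_decision(0.33008507285184063))  # fail to reject normality
```

Note that "fail to reject" is not the same as "accept": a high p-value means the data gives no significant evidence against normality, not that normality has been proven.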

Uniform Distribution#

We can define the basic parameters for the uniform distribution:

# Distribution parameters
low = 0
high = 1

# Dataset
data = np.random.uniform(low, high, N)

# vDataFrame
vdf = vp.vDataFrame({"col": data})

We can visualize the distribution:

vdf["col"].hist()

Then apply the same jarque_bera test to the new vDataFrame:

jarque_bera(vdf, column = "col")
Out[4]: (8879.70070672812, 0.0)

Note

In this case, the p-value is effectively zero, so we reject the null hypothesis that the data is normally distributed. The very large Jarque-Bera test statistic points the same way: the sample's skewness and kurtosis deviate strongly from those of a normal distribution.