verticapy.machine_learning.model_selection.statistical_tests.norm.jarque_bera#
- verticapy.machine_learning.model_selection.statistical_tests.norm.jarque_bera(input_relation: str | vDataFrame, column: str) -> tuple[float, float]
Jarque-Bera test (Distribution Normality).
Parameters#
- input_relation: SQLRelation
Input relation.
- column: str
Input vDataColumn to test.
Returns#
- tuple
statistic, p_value
Examples#
Let’s try this test on two sets of distributions to observe the contrast in test results:
- normally distributed dataset
- uniformly distributed dataset
Normally Distributed#
Import the necessary libraries:
import verticapy as vp
import numpy as np
import random
Then we can define the basic parameters for the normal distribution:
# Distribution parameters
N = 100 # Number of rows
mean = 0
std_dev = 1

# Dataset
data = np.random.normal(mean, std_dev, N)
Now we can create the vDataFrame:
vdf = vp.vDataFrame({"col": data})
We can visualize the distribution:
vdf["col"].hist()
To find the test p-value, we can import the test function:
from verticapy.machine_learning.model_selection.statistical_tests import jarque_bera
And simply apply it on the vDataFrame:
jarque_bera(vdf, column = "col")
Out[3]: (2.21680972244939, 0.33008507285184063)
We can see that the p-value is high, meaning that we cannot reject the null hypothesis of normality. The low Jarque-Bera test statistic points the same way: the sample shows no significant departure from a normal distribution.
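To make the relationship between the statistic and the p-value concrete, here is a minimal NumPy sketch of the textbook Jarque-Bera computation, JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis. The helper name jarque_bera_numpy is hypothetical, and the assumption that VerticaPy uses the same moment estimators internally is not confirmed by this page; treat it as a cross-check, not as the library's implementation.

```python
import math

import numpy as np


def jarque_bera_numpy(x):
    """Textbook Jarque-Bera test; returns (statistic, p_value).

    Hypothetical helper for illustration only. It is not part of VerticaPy.
    The input must not be constant (zero variance would divide by zero).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    # Biased (population) central moments
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    m4 = np.mean(centered ** 4)
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under the null hypothesis the statistic is asymptotically chi-squared
    # with 2 degrees of freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return float(statistic), p_value


# Usage on locally generated samples (values vary with the seed)
rng = np.random.default_rng(0)
print(jarque_bera_numpy(rng.normal(0, 1, 100)))
print(jarque_bera_numpy(rng.uniform(0, 1, 100)))
```

The chi-squared relationship is consistent with the output shown above: exp(-2.21680972244939 / 2) is approximately 0.3301, which agrees with the reported p-value.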
Note
A p_value in statistics represents the probability of obtaining results as extreme as, or more extreme than, the observed data, assuming the null hypothesis is true. A smaller p-value typically suggests stronger evidence against the null hypothesis, i.e. that the tested data does not come from a normal distribution. However, “small” is a relative term, and the threshold that determines what counts as “small” should be chosen before analyzing the data.
Generally, a p-value of less than 0.05 is used as the threshold to reject the null hypothesis, but this is not always the case.
Uniform Distribution#
We can define the basic parameters for the uniform distribution:
# Distribution parameters
low = 0
high = 1

# Dataset
data = np.random.uniform(low, high, N)

# vDataFrame
vdf = vp.vDataFrame({"col": data})
We can visualize the distribution:
vdf["col"].hist()
And simply apply it on the vDataFrame:
jarque_bera(vdf, column = "col")
Out[4]: (8879.70070672812, 0.0)
Note
In this case, the p-value is effectively zero, so we reject the null hypothesis that the data is normally distributed. This is supported by the elevated Jarque-Bera test statistic, which indicates that the distribution deviates from normality.
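The two outcomes can be turned into an explicit decision with a significance level fixed in advance. The sketch below is only an illustration: the helper reject_normality and the choice of 0.05 are assumptions, not part of the VerticaPy API, and the (statistic, p_value) pairs are copied from the two outputs shown on this page.

```python
ALPHA = 0.05  # significance level, chosen before looking at the data (assumed)


def reject_normality(p_value, alpha=ALPHA):
    """Return True when the null hypothesis of normality is rejected.

    Hypothetical helper for illustration only.
    """
    return p_value < alpha


# (statistic, p_value) pairs copied from the outputs on this page
results = {
    "normal": (2.21680972244939, 0.33008507285184063),
    "uniform": (8879.70070672812, 0.0),
}

for name, (statistic, p_value) in results.items():
    verdict = "reject" if reject_normality(p_value) else "fail to reject"
    print(f"{name}: statistic={statistic:.3f}, p_value={p_value:.3f} -> {verdict} normality")
```

Keeping the threshold as a named constant makes it clear that it was set independently of the observed p-values.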