verticapy.machine_learning.model_selection.statistical_tests.norm.normaltest#
- verticapy.machine_learning.model_selection.statistical_tests.norm.normaltest(input_relation: str | vDataFrame, column: str) -> tuple[float, float] #
This function tests the null hypothesis that a sample comes from a normal distribution.
Parameters#
- input_relation: SQLRelation
Input relation.
- column: str
Input vDataColumn to test.
Returns#
- tuple
statistic, p_value
Examples#
Let’s try this test on two different distributions to observe the contrast in test results:
- normally distributed dataset
- uniformly distributed dataset
Normally Distributed#
Import the necessary libraries:
import verticapy as vp
import numpy as np
import random
Then we can define the basic parameters for the normal distribution:
# Distribution parameters
N = 100 # Number of rows
mean = 0
std_dev = 1

# Dataset
data = np.random.normal(mean, std_dev, N)
Now we can create the vDataFrame:
vdf = vp.vDataFrame({"col": data})
We can visualize the distribution:
vdf["col"].hist()
To find the test p-value, we can import the test function:
from verticapy.machine_learning.model_selection.statistical_tests import normaltest
And simply apply it to the vDataFrame:
normaltest(vdf, column = "col")
Out[3]: (0.9891922257619671, 0.6098171548586554)
The p-value is high (about 0.61), so we cannot reject the null hypothesis that the sample comes from a normal distribution. The low test statistic is consistent with this: the data shows no evidence of departure from normality.
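The decision described above can be sketched as a small helper. This is only an illustration: the function name `decide` is hypothetical (it is not part of VerticaPy), and the default threshold of 0.05 is a conventional choice that is assumed here, not prescribed by the test.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare a p-value with a significance level chosen in advance.

    `decide` and `alpha` are illustrative names, not VerticaPy API.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value < alpha:
        return "reject H0: evidence against normality"
    return "fail to reject H0: no evidence against normality"


# (statistic, p_value) as reported above for the normally distributed sample
statistic, p_value = (0.9891922257619671, 0.6098171548586554)
print(decide(p_value))
```

Note that failing to reject the null hypothesis does not prove that the data is normal; it only means that this sample gives no evidence against normality at the chosen level.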
Note
A p_value in statistics represents the probability of obtaining results as extreme as, or more extreme than, the observed data, assuming the null hypothesis is true. A smaller p-value typically suggests stronger evidence against the null hypothesis, i.e. that the sample does not come from a normal distribution.
However, “small” is a relative term, and the threshold that defines a “small” p-value should be chosen before analyzing the data.
Generally, a p-value less than 0.05 is considered sufficient to reject the null hypothesis, but this is not always the case.
Uniform Distribution#
We can define the basic parameters for the uniform distribution:
# Distribution parameters
low = 0
high = 1

# Dataset
data = np.random.uniform(low, high, N)

# vDataFrame
vdf = vp.vDataFrame({"col": data})
We can visualize the distribution:
vdf["col"].hist()
And simply apply it to the vDataFrame:
normaltest(vdf, column = "col")
Out[4]: (237.99651061203633, 2.0879224523930834e-52)
Note
In this case, the p-value is far below 0.05, so we reject the null hypothesis that the sample comes from a normal distribution. The high test statistic value further supports the conclusion that the distribution is not normal.
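To see how the statistic and the p-value relate, the two outputs shown in these examples can be checked against a chi-squared distribution with 2 degrees of freedom, whose upper-tail probability has the closed form exp(-x / 2). The sketch below is an observation drawn from the reported numbers, not a description of VerticaPy's internals; the helper name `chi2_sf_2dof` is hypothetical.

```python
import math


def chi2_sf_2dof(statistic: float) -> float:
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom.

    Hypothetical helper for illustration; not part of VerticaPy.
    """
    if statistic < 0:
        raise ValueError("statistic must be non-negative")
    return math.exp(-statistic / 2.0)


# (statistic, p_value) pairs copied from the two outputs in the examples
reported = {
    "normal": (0.9891922257619671, 0.6098171548586554),
    "uniform": (237.99651061203633, 2.0879224523930834e-52),
}

for name, (statistic, p_value) in reported.items():
    recomputed = chi2_sf_2dof(statistic)
    # Compare the recomputed tail probability with the reported p-value
    matches = math.isclose(recomputed, p_value, rel_tol=1e-6)
    print(name, recomputed, matches)
```

Under this reading, a larger statistic maps directly to a smaller p-value, which is why the uniform sample's statistic of about 238 corresponds to a p-value that is effectively zero.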