verticapy.machine_learning.model_selection.statistical_tests.norm.normaltest#

verticapy.machine_learning.model_selection.statistical_tests.norm.normaltest(input_relation: str | vDataFrame, column: str) → tuple[float, float]#

This function tests the null hypothesis that a sample comes from a normal distribution.

Parameters#

input_relation: SQLRelation

Input relation.

column: str

Input vDataColumn to test.

Returns#

tuple

statistic, p_value

Examples#

Let’s try this test on two datasets with different distributions to observe the contrast in the test results:

  • normally distributed dataset

  • uniformly distributed dataset
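Before running the test, the expected contrast can be previewed locally. A normal distribution has skewness 0 and excess kurtosis 0, while a uniform distribution has skewness 0 and excess kurtosis -1.2. The sketch below uses only the Python standard library and assumes the test is sensitive to these moments, as D'Agostino-Pearson style omnibus tests are; this page does not state which statistic VerticaPy computes. The helper `skew_and_excess_kurtosis` and the sample size `n_rows` are illustrative names, not part of VerticaPy.

```python
import random
import statistics


def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 10_000         # assumption: larger than the N = 100 used in the examples, to reduce noise

normal_sample = [rng.gauss(0, 1) for _ in range(n_rows)]
uniform_sample = [rng.uniform(0, 1) for _ in range(n_rows)]

normal_skew, normal_kurt = skew_and_excess_kurtosis(normal_sample)
uniform_skew, uniform_kurt = skew_and_excess_kurtosis(uniform_sample)

print(f"normal : skew={normal_skew:+.3f}, excess kurtosis={normal_kurt:+.3f}")
print(f"uniform: skew={uniform_skew:+.3f}, excess kurtosis={uniform_kurt:+.3f}")
```

Both samples are roughly symmetric, so the difference should show up mainly in the kurtosis: close to 0 for the normal sample and close to -1.2 for the uniform one.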

Normally Distributed#

Import the necessary libraries:

import verticapy as vp
import numpy as np
import random

Then we can define the basic parameters for the normal distribution:

# Distribution parameters
N = 100 # Number of rows
mean = 0
std_dev = 1

# Dataset
data = np.random.normal(mean, std_dev, N)

Now we can create the vDataFrame:

vdf = vp.vDataFrame({"col": data})

We can visualize the distribution:

vdf["col"].hist()

To find the test p-value, we can import the test function:

from verticapy.machine_learning.model_selection.statistical_tests import normaltest

And simply apply it on the vDataFrame:

normaltest(vdf, column = "col")
Out[3]: (0.9891922257619671, 0.6098171548586554)

The p-value is high, so we cannot reject the null hypothesis of normality. The low test statistic is consistent with this: the data show no significant departure from a normal distribution.

Note

A p-value is the probability of obtaining results at least as extreme as the observed data, assuming the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis, which here is that the sample comes from a normal distribution.

However, “small” is a relative term, and the threshold that defines a “small” p-value should be chosen before analyzing the data.

A p-value below 0.05 is the conventional threshold for rejecting the null hypothesis, but this convention is not appropriate in every case.
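The decision rule described in this note can be written as a small helper. This is a minimal sketch: `reject_normality` is a hypothetical function, not part of VerticaPy, and the default threshold of 0.05 is only the common convention mentioned above.

```python
def reject_normality(p_value: float, alpha: float = 0.05) -> bool:
    """Return True when the null hypothesis of normality is rejected at level alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return p_value < alpha


# p-value reported above for the normally distributed sample
p_value = 0.6098171548586554
alpha = 0.05  # assumption: fixed before looking at the data

print(reject_normality(p_value, alpha))  # False: the null hypothesis is not rejected
```

Keeping `alpha` as an explicit argument makes the threshold a visible choice rather than a value buried in the comparison.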

Uniform Distribution#

We can define the basic parameters for the uniform distribution:

# Distribution parameters
low = 0
high = 1

# Dataset
data = np.random.uniform(low, high, N)

# vDataFrame
vdf = vp.vDataFrame({"col": data})

We can visualize the distribution:

vdf["col"].hist()

Then apply the test to the vDataFrame:

normaltest(vdf, column = "col")
Out[4]: (237.99651061203633, 2.0879224523930834e-52)

Note

In this case, the p-value is very low, so we reject the null hypothesis that the data are normally distributed. The high test statistic reflects the same result: the sample departs strongly from a normal distribution.
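The link between the statistic and the p-value can be checked directly from the two outputs shown on this page. Both reported pairs are consistent with a chi-squared distribution with 2 degrees of freedom, whose survival function is exp(-x/2). This is an observation about the reported numbers rather than a statement of how VerticaPy computes the result; the helper name `chi2_sf_two_dof` is illustrative.

```python
import math


def chi2_sf_two_dof(statistic: float) -> float:
    """Survival function of a chi-squared distribution with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


# (statistic, p_value) pairs copied from the outputs on this page
reported = [
    (0.9891922257619671, 0.6098171548586554),      # normally distributed sample
    (237.99651061203633, 2.0879224523930834e-52),  # uniformly distributed sample
]

for statistic, p_value in reported:
    recomputed = chi2_sf_two_dof(statistic)
    print(f"statistic={statistic:.4f}  reported p={p_value:.6e}  recomputed p={recomputed:.6e}")
    # The recomputed value should agree with the reported one up to rounding
    assert math.isclose(recomputed, p_value, rel_tol=1e-6)
```

Because the p-value is a decreasing function of the statistic, a high statistic and a low p-value are two views of the same evidence, not independent confirmations.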