
Normalization
Normalizing data is crucial when using machine learning algorithms because most of them are sensitive to the scale of the input features. For example, neighbors-based and k-means algorithms use the p-distance in their learning phase, and normalization is an important first step before fitting a linear regression under the Gauss-Markov assumptions.
Unnormalized data can also hinder the convergence of some ML algorithms. Normalization is also a way to encode the data while retaining its global distribution: since we know the estimators used to normalize the data, we can easily reverse the transformation and recover the original values (for example, a z-scored value z maps back to x = z * sigma + mu).
There are three main normalization techniques:
- Z-Score: we center the feature values and rescale them using the mean and the standard deviation. This normalization is sensitive to outliers.
- Robust Z-Score: we center the feature values and rescale them using the median and the median absolute deviation (MAD). This normalization is robust to outliers.
- Min-Max: we rescale the feature values onto [0,1] with an affine transformation; the maximum reaches 1 and the minimum reaches 0. Because a single extreme value determines the range, this normalization is sensitive to outliers.
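To make these definitions concrete, here is a minimal NumPy sketch of the three formulas. This is an illustration of the math only, not how VerticaPy computes them (VerticaPy runs the aggregations in-database); the toy array and the 1.4826 consistency constant for the MAD are our own choices.
import numpy as np
# A small toy feature with one outlier, to show how each method reacts.
x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])
# Z-Score: center on the mean, scale by the standard deviation.
zscore = (x - x.mean()) / x.std()
# Robust Z-Score: center on the median, scale by the median absolute
# deviation (1.4826 is the usual consistency constant for normal data).
med = np.median(x)
mad = np.median(np.abs(x - med))
robust_zscore = (x - med) / (1.4826 * mad)
# Min-Max: affine map onto [0, 1]; the outlier squeezes the other values.
minmax = (x - x.min()) / (x.max() - x.min())
print(zscore, robust_zscore, minmax, sep="\n")
Running this shows the outlier's influence directly: it dominates the z-score and min-max results, while the robust z-score leaves the four regular values on a sensible scale.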
To demonstrate data normalization in VerticaPy, we will use the well-known 'Titanic' dataset.
# Load the Titanic dataset as a vDataFrame
from verticapy.datasets import load_titanic
vdf = load_titanic()
display(vdf)
Let's look at the 'fare' and 'age' of the passengers.
vdf.select(["age", "fare"])
These features lie in very different numerical ranges, so it's a good idea to normalize them. To normalize data in VerticaPy, we can use the 'normalize' method.
help(vdf["age"].normalize)
All three normalization techniques are available. Let's normalize 'age' and 'fare' using the 'minmax' method.
vdf["age"].normalize(method = "minmax")
vdf["fare"].normalize(method = "minmax")
vdf.select(["age", "fare"])
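As a quick sanity check, we can aggregate the minimum and maximum of both columns with the vDataFrame 'agg' method; after a min-max normalization, these should be 0 and 1.
vdf.agg(func = ["min", "max"], columns = ["age", "fare"])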
Both features now lie in [0,1]. It is also possible to normalize within a specific partition using the 'by' parameter.
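For example, the following sketch rescales 'fare' separately within each passenger class, using the 'pclass' column of the Titanic dataset. Note that 'fare' was already normalized above, so this second pass is for illustration only; in practice you would apply the partitioned normalization to the raw column.
vdf["fare"].normalize(method = "minmax", by = ["pclass"])
With the 'by' parameter, each partition gets its own minimum and maximum, so a first-class fare and a third-class fare are each rescaled relative to their own class.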