Normalization#

Normalizing data is crucial when using machine learning algorithms because of how sensitive most of them are to un-normalized data. For example, the neighbors-based and k-means algorithms use the p-distance in their learning phase. Normalization is the first step before using a linear regression due to Gauss-Markov assumptions.

Unnormalized data can also create complications for the convergence of some ML algorithms. Normalization is also a way to encode the data and to retain the global distribution. When we know the estimators to use to normalize the data, we can easily un-normalize the data and come back to the original distribution.

There are three main normalization techniques:

  • Z-Score : We reduce and center the feature values using the average and standard deviation. This normalization is sensitive to outliers.

  • Robust Z-Score : We reduce and center the feature values using the median and the median absolute deviation. This normalization is robust to outliers.

  • Min-Max : We reduce the feature values by using a bijection to [0,1]. The max will reach 1 and the min will reach 0. This normalization is robust to outliers.

To demonstrate data normalization in VerticaPy, we will use the well-known ‘Titanic’ dataset.

[1]:
from verticapy.datasets import load_titanic
vdf = load_titanic()
display(vdf)
123
pclass
Int
123
survived
Int
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(6,3)
123
sibsp
Int
123
parch
Int
Abc
ticket
Varchar(36)
123
fare
Numeric(10,5)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Int
Abc
home.dest
Varchar(100)
110female2.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
210male30.012113781151.55C22 C26S[null]135Montreal, PQ / Chesterville, ON
310female25.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
410male39.0001120500.0A36S[null][null]Belfast, NI
510male71.000PC 1760949.5042[null]C[null]22Montevideo, Uruguay
610male47.010PC 17757227.525C62 C64C[null]124New York, NY
710male[null]00PC 1731825.925[null]S[null][null]New York, NY
810male24.001PC 17558247.5208B58 B60C[null][null]Montreal, PQ
910male36.0001305075.2417C6CA[null]Winnipeg, MN
1010male25.0001390526.0[null]C[null]148San Francisco, CA
1110male45.00011378435.5TS[null][null]Trenton, NJ
1210male42.00011048926.55D22S[null][null]London / Winnipeg, MB
1310male41.00011305430.5A21S[null][null]Pomeroy, WA
1410male48.000PC 1759150.4958B10C[null]208Omaha, NE
1510male[null]0011237939.6[null]C[null][null]Philadelphia, PA
1610male45.00011305026.55B38S[null][null]Washington, DC
1710male[null]0011379831.0[null]S[null][null][null]
1810male33.0006955.0B51 B53 B55S[null][null]New York, NY
1910male28.00011305947.1[null]S[null][null]Montevideo, Uruguay
2010male17.00011305947.1[null]S[null][null]Montevideo, Uruguay
2110male49.0001992426.0[null]S[null][null]Ascot, Berkshire / Rochester, NY
2210male36.0101987778.85C46S[null]172Little Onn Hall, Staffs
2310male46.010W.E.P. 573461.175E31S[null][null]Amenia, ND
2410male[null]001120510.0[null]S[null][null]Liverpool, England / Belfast
2510male27.01013508136.7792C89C[null][null]Los Angeles, CA
2610male[null]0011046552.0A14S[null][null]Stoughton, MA
2710male47.000572725.5875E58S[null][null]Victoria, BC
2810male37.011PC 1775683.1583E52C[null][null]Lakewood, NJ
2910male[null]0011379126.55[null]S[null][null]Roachdale, IN
3010male70.011WE/P 573571.0B22S[null]269Milwaukee, WI
3110male39.010PC 1759971.2833C85C[null][null]New York, NY
3210male31.010F.C. 1275052.0B71S[null][null]Montreal, PQ
3310male50.010PC 17761106.425C86C[null]62Deephaven, MN / Cedar Rapids, IA
3410male39.000PC 1758029.7A18C[null]133Philadelphia, PA
3510female36.000PC 1753131.6792A29C[null][null]New York, NY
3610male[null]00PC 17483221.7792C95S[null][null][null]
3710male30.00011305127.75C111C[null][null]New York, NY
3810male19.03219950263.0C23 C25 C27S[null][null]Winnipeg, MB
3910male64.01419950263.0C23 C25 C27S[null][null]Winnipeg, MB
4010male[null]0011377826.55D34S[null][null]Westcliff-on-Sea, Essex
4110male[null]001120580.0B102S[null][null][null]
4210male37.01011380353.1C123S[null][null]Scituate, MA
4310male47.00011132038.5E63S[null]275St Anne's-on-Sea, Lancashire
4410male24.000PC 1759379.2B86C[null][null][null]
4510male71.000PC 1775434.6542A5C[null][null]New York, NY
4610male38.001PC 17582153.4625C91S[null]147Winnipeg, MB
4710male46.000PC 1759379.2B82 B84C[null][null]New York, NY
4810male[null]0011379642.4[null]S[null][null][null]
4910male45.0103697383.475C83S[null][null]New York, NY
5010male40.0001120590.0B94S[null]110[null]
5110male55.0111274993.5B69S[null]307Montreal, PQ
5210male42.00011303842.5B11S[null][null]London / Middlesex
5310male[null]001746351.8625E46S[null][null]Brighton, MA
5410male55.00068050.0C39S[null][null]London / Birmingham
5510male42.01011378952.0[null]S[null]38New York, NY
5610male[null]00PC 1760030.6958[null]C14[null]New York, NY
5710female50.000PC 1759528.7125C49C[null][null]Paris, France New York, NY
5810male46.00069426.0[null]S[null]80Bennington, VT
5910male50.00011304426.0E60S[null][null]London
6010male32.500113503211.5C132C[null]45[null]
6110male58.0001177129.7B37C[null]258Buffalo, NY
6210male41.0101746451.8625D21S[null][null]Southington / Noank, CT
6310male[null]0011302826.55C124S[null][null]Portland, OR
6410male[null]00PC 1761227.7208[null]C[null][null]Chicago, IL
6510male29.00011350130.0D6S[null]126Springfield, MA
6610male30.00011380145.5[null]S[null][null]London / New York, NY
6710male30.00011046926.0C106S[null][null]Brockton, MA
6810male19.01011377353.1D30S[null][null]New York, NY
6910male46.0001305075.2417C6C[null]292Vancouver, BC
7010male54.0001746351.8625E46S[null]175Dorchester, MA
7110male28.010PC 1760482.1708[null]C[null][null]New York, NY
7210male65.0001350926.55E38S[null]249East Bridgewater, MA
7310male44.0201992890.0C78Q[null]230Fond du Lac, WI
7410male55.00011378730.5C30S[null][null]Montreal, PQ
7510male47.00011379642.4[null]S[null][null]Washington, DC
7610male37.001PC 1759629.7C118C[null][null]Brooklyn, NY
7710male58.00235273113.275D48C[null]122Lexington, MA
7810male64.00069326.0[null]S[null]263Isle of Wight, England
7910male65.00111350961.9792B30C[null]234Providence, RI
8010male28.500PC 1756227.7208D43C[null]189?Havana, Cuba
8110male[null]001120520.0[null]S[null][null]Belfast
8210male45.50011304328.5C124S[null]166Surbiton Hill, Surrey
8310male23.0001274993.5B24S[null][null]Montreal, PQ
8410male29.01011377666.6C2S[null][null]Isleworth, England
8510male18.010PC 17758108.9C65C[null][null]Madrid, Spain
8610male47.00011046552.0C110S[null]207Worcester, MA
8710male38.000199720.0[null]S[null][null]Rotterdam, Netherlands
8810male22.000PC 17760135.6333[null]C[null]232[null]
8910male[null]00PC 17757227.525[null]C[null][null][null]
9010male31.000PC 1759050.4958A24S[null][null]Trenton, NJ
9110male[null]0011376750.0A32S[null][null]Seattle, WA
9210male36.0001304940.125A10C[null][null]Winnipeg, MB
9310male55.010PC 1760359.4[null]C[null][null]New York, NY
9410male33.00011379026.55[null]S[null]109London
9510male61.013PC 17608262.375B57 B59 B63 B66C[null][null]Haverford, PA / Cooperstown, NY
9610male50.0101350755.9E44S[null][null]Duluth, MN
9710male56.00011379226.55[null]S[null][null]New York, NY
9810male56.0001776430.6958A7C[null][null]St James, Long Island, NY
9910male24.0101369560.0C31S[null][null]Huntington, WV
10010male[null]0011305626.0A19S[null][null]Streatham, Surrey
Rows: 1-100 | Columns: 14

Let’s look at the ‘fare’ and ‘age’ of the passengers.

[2]:
vdf.select(["age", "fare"])
[2]:
123
age
Numeric(6,3)
123
fare
Numeric(10,5)
12.0151.55
230.0151.55
325.0151.55
439.00.0
571.049.5042
647.0227.525
7[null]25.925
824.0247.5208
936.075.2417
1025.026.0
1145.035.5
1242.026.55
1341.030.5
1448.050.4958
15[null]39.6
1645.026.55
17[null]31.0
1833.05.0
1928.047.1
2017.047.1
2149.026.0
2236.078.85
2346.061.175
24[null]0.0
2527.0136.7792
26[null]52.0
2747.025.5875
2837.083.1583
29[null]26.55
3070.071.0
3139.071.2833
3231.052.0
3350.0106.425
3439.029.7
3536.031.6792
36[null]221.7792
3730.027.75
3819.0263.0
3964.0263.0
40[null]26.55
41[null]0.0
4237.053.1
4347.038.5
4424.079.2
4571.034.6542
4638.0153.4625
4746.079.2
48[null]42.4
4945.083.475
5040.00.0
5155.093.5
5242.042.5
53[null]51.8625
5455.050.0
5542.052.0
56[null]30.6958
5750.028.7125
5846.026.0
5950.026.0
6032.5211.5
6158.029.7
6241.051.8625
63[null]26.55
64[null]27.7208
6529.030.0
6630.045.5
6730.026.0
6819.053.1
6946.075.2417
7054.051.8625
7128.082.1708
7265.026.55
7344.090.0
7455.030.5
7547.042.4
7637.029.7
7758.0113.275
7864.026.0
7965.061.9792
8028.527.7208
81[null]0.0
8245.528.5
8323.093.5
8429.066.6
8518.0108.9
8647.052.0
8738.00.0
8822.0135.6333
89[null]227.525
9031.050.4958
91[null]50.0
9236.040.125
9355.059.4
9433.026.55
9561.0262.375
9650.055.9
9756.026.55
9856.030.6958
9924.060.0
100[null]26.0
Rows: 1-100 | Columns: 2

These lie in different numerical intervals so it’s probably a good idea to normalize them. To normalize data in VerticaPy, we can use the ‘normalize’ method.

[3]:
help(vdf["age"].normalize)
Help on method normalize in module verticapy.vcolumn:

normalize(method:str='zscore', by:list=[], return_trans:bool=False) method of verticapy.vcolumn.vColumn instance
    ---------------------------------------------------------------------------
    Normalizes the input vColumns using the input method.

    Parameters
    ----------
    method: str, optional
            Method to use to normalize.
                    zscore        : Normalization using the Z-Score (avg and std).
                            (x - avg) / std
                    robust_zscore : Normalization using the Robust Z-Score (median and mad).
                            (x - median) / (1.4826 * mad)
                    minmax        : Normalization using the MinMax (min and max).
                            (x - min) / (max - min)
    by: list, optional
            vColumns used in the partition.
    return_trans: bool, optimal
            If set to True, the method will return the transformation used instead of
            the parent vDataFrame. This parameter is used for testing purpose.

    Returns
    -------
    vDataFrame
            self.parent

    See Also
    --------
    vDataFrame.outliers : Computes the vDataFrame Global Outliers.

The three main normalization techniques are available. Let’s normalize the ‘fare’ and the ‘age’ using the ‘MinMax’ method.

[4]:
vdf["age"].normalize(method = "minmax")
vdf["fare"].normalize(method = "minmax")
vdf.select(["age", "fare"])
[4]:
123
age
Numeric(20,15)
123
fare
Numeric(24,15)
10.0209614660474460.295805899800363
20.372411196184260.295805899800363
30.3096523158026860.295805899800363
40.4853771808710930.0
50.8870340153131670.096625763278767
60.5857913894816120.44409922370226
7[null]0.050602229972447
80.2971005397263710.483128426019833
90.4477218526421490.146862017624605
100.3096523158026860.050748620223091
110.5606878373289820.069291385304605
120.5230325091000380.05182214872781
130.5104807330237230.059532035261703
140.5983431655579260.098561237579275
15[null]0.077294052339785
160.5606878373289820.05182214872781
17[null]0.060507970265993
180.4100665244132040.009759350042902
190.347307644031630.091933077404138
200.2092381071921680.091933077404138
210.6108949416342410.050748620223091
220.4477218526421490.153904950176566
230.5732396134052970.119405647774907
24[null]0.0
250.3347558679553160.266975218277623
26[null]0.101497240446182
270.5857913894816120.049943473844552
280.4602736287184640.162314191734533
29[null]0.05182214872781
300.8744822392368520.13858277060921
310.4853771808710930.139135735382641
320.3849629722605750.101497240446182
330.6234467177105560.207727765663171
340.4853771808710930.057970539254838
350.4477218526421490.061833680375821
36[null]0.432884169006959
370.372411196184260.054164392738107
380.2343416593447970.513341812256651
390.7991715827789630.513341812256651
40[null]0.05182214872781
41[null]0.0
420.4602736287184640.10364429745562
430.5857913894816120.075146995330346
440.2971005397263710.154588104679569
450.8870340153131670.067640493651348
460.4728254047947780.299538851191773
470.5732396134052970.154588104679569
48[null]0.08275928836381
490.5606878373289820.162932348966251
500.4979289569474080.0
510.686205598092130.182499845802269
520.5230325091000380.082954475364668
53[null]0.101228858320002
540.686205598092130.097593500429021
550.5230325091000380.101497240446182
56[null]0.059914211409383
570.6234467177105560.056043067621365
580.5732396134052970.050748620223091
590.6234467177105560.050748620223091
600.4037906363750470.412820506814759
610.7238609263210740.057970539254838
620.5104807330237230.101228858320002
63[null]0.05182214872781
64[null]0.054107398133856
650.3598594201079450.058556100257413
660.372411196184260.088810085390409
670.372411196184260.050748620223091
680.2343416593447970.10364429745562
690.5732396134052970.146862017624605
700.6736538220158150.101228858320002
710.347307644031630.16038672010106
720.8117233588552780.05182214872781
730.5481360612526670.175668300772238
740.686205598092130.059532035261703
750.5857913894816120.08275928836381
760.4602736287184640.057970539254838
770.7238609263210740.221098075221947
780.7991715827789630.050748620223091
790.8117233588552780.120975341635808
800.3535835320697880.054107398133856
81[null]0.0
820.5669637253671390.055628295244542
830.2845487636500560.182499845802269
840.3598594201079450.129994542571456
850.2217898832684820.212558643934408
860.5857913894816120.101497240446182
870.4728254047947780.0
880.2719969875737420.264738570434791
89[null]0.44409922370226
900.3849629722605750.098561237579275
91[null]0.097593500429021
920.4477218526421490.078318784094289
930.686205598092130.115941078509677
940.4100665244132040.05182214872781
950.7615162545500190.512121893501288
960.6234467177105560.109109533479646
970.6987573741684450.05182214872781
980.6987573741684450.059914211409383
990.2971005397263710.117112200514825
100[null]0.050748620223091
Rows: 1-100 | Columns: 2

Both of the features now scale in [0,1]. It is also possible to normalize by a specific partition with the ‘by’ parameter.