Normalizing Data

The purpose of normalization is, primarily, to scale numeric data from different columns down to an equivalent scale. For example, suppose you execute the LINEAR_REG function on a data set with two feature columns, current_salary and years_worked. The output value you are trying to predict is a worker's future salary. The values in the current_salary column are likely to have a far wider range, and much larger values, than the values in the years_worked column. Therefore, the values in the current_salary column can overshadow the values in the years_worked column, thus skewing your model.

Vertica offers the following data preparation methods which use normalization. These methods are:

Normalizing data results in the creation of a view where the normalized data is saved. The output_view option in the NORMALIZE function determines name of the view .

Normalizing Salary Data Using MinMax

The following example shows how you can normalize the salary_data table using the MinMax normalization method.

Before you begin the example, make sure that you have loaded the Machine Learning sample data.

=> SELECT NORMALIZE('normalized_salary_data', 'salary_data', 'current_salary, years_worked', 'minmax');
        NORMALIZE
--------------------------
 Finished in 1 iteration

(1 row)
=> SELECT * FROM normalized_salary_data;
current_salary       |     years_worked
---------------------+----------------------
0.437246565765357217 | 0.350000000000000000
0.978867411144492943 | 0.100000000000000000
0.909048995710749580 | 0.250000000000000000
0.601863084103319918 | 0.100000000000000000
0.455949209228501786 | 0.050000000000000000
0.538816771536005140 | 0.400000000000000000
0.183954046444834949 | 0.900000000000000000
0.735279557092379495 | 0.100000000000000000
0.671828883472214349 | 0.800000000000000000
0.092901007123556866 | 0.350000000000000000
0.647827976494151881 | 0.150000000000000000
0.779555202841864596 | 0.600000000000000000
0.942803695935604558 | 0.050000000000000000
0.919445231751790337 | 0.900000000000000000
0.314180979914214992 | 0.800000000000000000
.
.
.
(1000 rows)

Normalizing Salary Data Using Z-score

The following example shows how you can normalize the salary_data table using the Z-score normalization method.

Before you begin the example, make sure that you have loaded the Machine Learning sample data.

=> SELECT NORMALIZE('normalized_z_salary_data', 'salary_data', 'current_salary, years_worked', 
                     'zscore');
        NORMALIZE
--------------------------
 Finished in 1 iteration

(1 row)
=> SELECT * FROM normalized_z_salary_data;
 employee_id |    current_salary    |    years_worked
-------------+----------------------+---------------------
         189 |   -0.221041249770669 |  -0.524447274157005
         518 |     1.66054215981221 |   -1.35743214416495
        1126 |     1.41799393943946 |  -0.857641222160185
        1157 |    0.350834283622416 |   -1.35743214416495
        1277 |   -0.156068522159045 |   -1.52402911816654
        3188 |    0.131812255991634 |  -0.357850300155415
        3196 |    -1.10097599783475 |    1.30811943986048
        3430 |    0.814321286168547 |   -1.35743214416495
        3522 |    0.593894513770248 |   0.974925491857304
        3892 |    -1.41729301118583 |  -0.524447274157005
        3939 |    0.510515691167414 |   -1.19083517016336
        4165 |    0.968134273983823 |   0.308537595850945
        4335 |     1.53525730638774 |   -1.52402911816654
        4534 |       1.454110321266 |    1.30811943986048
        4806 |   -0.648569411959509 |   0.974925491857304
.
.
.
(1000 rows)

Normalizing Salary Data Using Robust Z-score

The following example shows how you can normalize the salary_data table using the Z-score normalization method.

Before you begin the example, make sure that you have loaded the Machine Learning sample data.

=> SELECT NORMALIZE('normalized_robustz_salary_data', 'salary_data', 'current_salary, years_worked', 'robust_zscore');
        NORMALIZE
--------------------------
 Finished in 1 iteration

(1 row)
=> SELECT * FROM normalized_robustz_salary_data;
 employee_id |    current_salary     |    years_worked
-------------+-----------------------+--------------------
         110 | -0.568102001091691813 | -1.348981518953190
         112 | -0.459684645251655561 | -0.404694455685957
         313 |  1.192992574435873405 |  0.404694455685957
         426 | -0.339722133536566526 |  0.134898151895319
         593 |  1.006869419240553261 |  1.214083367057871
         620 |  0.488357003820921347 |  0.944287063267233
         630 | -0.381700773453350773 | -0.404694455685957
         718 | -1.218698826946249627 | -0.539592607581276
         777 | -1.184495916454679341 |  0.809388911371914
        1202 | -0.572973418334243077 |  1.214083367057871
        1464 | -0.133042215516442321 |  0.539592607581276
        1573 | -1.056139736552107686 |  0.539592607581276
        1613 | -1.101208070280235229 |  0.539592607581276
        1792 | -0.152599977342245383 | -0.944287063267233
        2433 | -0.469849737890975004 |  0.404694455685957
.
.
.
(1000 rows)

See Also