Normalizing Data
The purpose of normalization is, primarily, to scale numeric data from different columns down to an equivalent scale. For example, suppose you execute the LINEAR_REG function on a data set with two feature columns, current_salary
and years_worked
. The output value you are trying to predict is a worker's future salary. The values in the current_salary
column are likely to have a far wider range, and much larger values, than the values in the years_worked
column. Therefore, the values in the current_salary
column can overshadow the values in the years_worked
column, thus skewing your model.
Vertica offers the following data preparation methods which use normalization. These methods are:
- MinMax
Using the MinMax normalization method, you can normalize the values in both of these columns to be within a distribution of values between 0 and 1. Doing so allows you to compare values on very different scales to one another by reducing the dominance of one column over the other. - Z-score
Using the Z-score normalization method, you can normalize the values in both of these columns to be the number of standard deviations an observation is from the mean of each column. This allows you to compare your data to a normally distributed random variable. - Robust Z-score
Using the Robust Z-score normalization method, you can lessen the influence of outliers on Z-score calculations. Robust Z-score normalization uses the median value as opposed to the mean value used in Z-score. By using the median instead of the mean, it helps remove some of the influence of outliers in the data.
Normalizing data results in the creation of a view where the normalized data is saved. The output_view
option in the NORMALIZE function determines name of the view .
Normalizing Salary Data Using MinMax
The following example shows how you can normalize the salary_data
table using the MinMax normalization method.
Before you begin the example, make sure that you have loaded the Machine Learning sample data.
=> SELECT NORMALIZE('normalized_salary_data', 'salary_data', 'current_salary, years_worked', 'minmax'); NORMALIZE -------------------------- Finished in 1 iteration (1 row)
=> SELECT * FROM normalized_salary_data; current_salary | years_worked ---------------------+---------------------- 0.437246565765357217 | 0.350000000000000000 0.978867411144492943 | 0.100000000000000000 0.909048995710749580 | 0.250000000000000000 0.601863084103319918 | 0.100000000000000000 0.455949209228501786 | 0.050000000000000000 0.538816771536005140 | 0.400000000000000000 0.183954046444834949 | 0.900000000000000000 0.735279557092379495 | 0.100000000000000000 0.671828883472214349 | 0.800000000000000000 0.092901007123556866 | 0.350000000000000000 0.647827976494151881 | 0.150000000000000000 0.779555202841864596 | 0.600000000000000000 0.942803695935604558 | 0.050000000000000000 0.919445231751790337 | 0.900000000000000000 0.314180979914214992 | 0.800000000000000000 . . . (1000 rows)
Normalizing Salary Data Using Z-score
The following example shows how you can normalize the salary_data
table using the Z-score normalization method.
Before you begin the example, make sure that you have loaded the Machine Learning sample data.
=> SELECT NORMALIZE('normalized_z_salary_data', 'salary_data', 'current_salary, years_worked', 'zscore'); NORMALIZE -------------------------- Finished in 1 iteration (1 row)
=> SELECT * FROM normalized_z_salary_data; employee_id | current_salary | years_worked -------------+----------------------+--------------------- 189 | -0.221041249770669 | -0.524447274157005 518 | 1.66054215981221 | -1.35743214416495 1126 | 1.41799393943946 | -0.857641222160185 1157 | 0.350834283622416 | -1.35743214416495 1277 | -0.156068522159045 | -1.52402911816654 3188 | 0.131812255991634 | -0.357850300155415 3196 | -1.10097599783475 | 1.30811943986048 3430 | 0.814321286168547 | -1.35743214416495 3522 | 0.593894513770248 | 0.974925491857304 3892 | -1.41729301118583 | -0.524447274157005 3939 | 0.510515691167414 | -1.19083517016336 4165 | 0.968134273983823 | 0.308537595850945 4335 | 1.53525730638774 | -1.52402911816654 4534 | 1.454110321266 | 1.30811943986048 4806 | -0.648569411959509 | 0.974925491857304 . . . (1000 rows)
Normalizing Salary Data Using Robust Z-score
The following example shows how you can normalize the salary_data
table using the Z-score normalization method.
Before you begin the example, make sure that you have loaded the Machine Learning sample data.
=> SELECT NORMALIZE('normalized_robustz_salary_data', 'salary_data', 'current_salary, years_worked', 'robust_zscore'); NORMALIZE -------------------------- Finished in 1 iteration (1 row)
=> SELECT * FROM normalized_robustz_salary_data; employee_id | current_salary | years_worked -------------+-----------------------+-------------------- 110 | -0.568102001091691813 | -1.348981518953190 112 | -0.459684645251655561 | -0.404694455685957 313 | 1.192992574435873405 | 0.404694455685957 426 | -0.339722133536566526 | 0.134898151895319 593 | 1.006869419240553261 | 1.214083367057871 620 | 0.488357003820921347 | 0.944287063267233 630 | -0.381700773453350773 | -0.404694455685957 718 | -1.218698826946249627 | -0.539592607581276 777 | -1.184495916454679341 | 0.809388911371914 1202 | -0.572973418334243077 | 1.214083367057871 1464 | -0.133042215516442321 | 0.539592607581276 1573 | -1.056139736552107686 | 0.539592607581276 1613 | -1.101208070280235229 | 0.539592607581276 1792 | -0.152599977342245383 | -0.944287063267233 2433 | -0.469849737890975004 | 0.404694455685957 . . . (1000 rows)