Normalizing Data
The purpose of normalization is, primarily, to scale numeric data from different columns down to an equivalent scale. For example, suppose you execute the LINEAR_REG function on a data set with two feature columns, current_salary
and years_worked
. The output value you are trying to predict is a worker's future salary. The values in the current_salary
column are likely to have a far wider range, and much larger values, than the values in the years_worked
column. Therefore, the values in the current_salary
column can overshadow the values in the years_worked
column, thus skewing your model.
Vertica offers the following data preparation methods which use normalization. These methods are:
- MinMax
Using the MinMax normalization method, you can normalize the values in both of these columns to be within a distribution of values between 0 and 1. Doing so allows you to compare values on very different scales to one another by reducing the dominance of one column over the other. - Z-score
Using the Z-score normalization method, you can normalize the values in both of these columns to be the number of standard deviations an observation is from the mean of each column. This allows you to compare your data to a normally distributed random variable. - Robust Z-score
Using the Robust Z-score normalization method, you can lessen the influence of outliers on Z-score calculations. Robust Z-score normalization uses the median value as opposed to the mean value used in Z-score. By using the median instead of the mean, it helps remove some of the influence of outliers in the data.
Normalizing data results in the creation of a view where the normalized data is saved. The output_view
option in the NORMALIZE function determines name of the view .
Normalizing Salary Data Using MinMax
The following example shows how you can normalize the salary_data
table using the MinMax normalization method.
Before you begin the example, make sure that you have loaded the Machine Learning sample data.
=> SELECT NORMALIZE('normalized_salary_data', 'salary_data', 'current_salary, years_worked', 'minmax'); NORMALIZE -------------------------- Finished in 1 iteration (1 row)
=> SELECT * FROM normalized_salary_data; employee_id | first_name | last_name | years_worked | current_salary ------------+-------------+------------+----------------------+---------------------- 189 | Shawn | Moore | 0.350000000000000000 | 0.437246565765357217 518 | Earl | Shaw | 0.100000000000000000 | 0.978867411144492943 1126 | Susan | Alexander | 0.250000000000000000 | 0.909048995710749580 1157 | Jack | Stone | 0.100000000000000000 | 0.601863084103319918 1277 | Scott | Wagner | 0.050000000000000000 | 0.455949209228501786 3188 | Shirley | Flores | 0.400000000000000000 | 0.538816771536005140 3196 | Andrew | Holmes | 0.900000000000000000 | 0.183954046444834949 3430 | Philip | Little | 0.100000000000000000 | 0.735279557092379495 3522 | Jerry | Ross | 0.800000000000000000 | 0.671828883472214349 3892 | Barbara | Flores | 0.350000000000000000 | 0.092901007123556866
. . . (1000 rows)
Normalizing Salary Data Using Z-score
The following example shows how you can normalize the salary_data
table using the Z-score normalization method.
Before you begin the example, make sure that you have loaded the Machine Learning sample data.
=> SELECT NORMALIZE('normalized_z_salary_data', 'salary_data', 'current_salary, years_worked', 'zscore'); NORMALIZE -------------------------- Finished in 1 iteration (1 row)
=> SELECT * FROM normalized_z_salary_data; employee_id | first_name | last_name | years_worked | current_salary ------------+-------------+------------+---------------------+---------------------- 189 | Shawn | Moore | -0.524447274157005 | -0.221041249770669 518 | Earl | Shaw | -1.35743214416495 | 1.66054215981221 1126 | Susan | Alexander | -0.857641222160185 | 1.41799393943946 1157 | Jack | Stone | -1.35743214416495 | 0.350834283622416 1277 | Scott | Wagner | -1.52402911816654 | -0.156068522159045 3188 | Shirley | Flores | -0.357850300155415 | 0.131812255991634 3196 | Andrew | Holmes | 1.30811943986048 | -1.10097599783475 3430 | Philip | Little | -1.35743214416495 | 0.814321286168547 3522 | Jerry | Ross | 0.974925491857304 | 0.593894513770248 3892 | Barbara | Flores | -0.524447274157005 | -1.41729301118583
. . . (1000 rows)
Normalizing Salary Data Using Robust Z-score
The following example shows how you can normalize the salary_data
table using the robust Z-score normalization method.
Before you begin the example, make sure that you have loaded the Machine Learning sample data.
=> SELECT NORMALIZE('normalized_robustz_salary_data', 'salary_data', 'current_salary, years_worked', 'robust_zscore'); NORMALIZE -------------------------- Finished in 1 iteration (1 row)
=> SELECT * FROM normalized_robustz_salary_data; employee_id | first_name | last_name | years_worked | current_salary ------------+-------------+------------+--------------------+----------------------- 189 | Shawn | Moore | -0.404694455685957 | -0.158933849655499140 518 | Earl | Shaw | -1.079185215162552 | 1.317126172796275889 1126 | Susan | Alexander | -0.674490759476595 | 1.126852528914384584 1157 | Jack | Stone | -1.079185215162552 | 0.289689691751547422 1277 | Scott | Wagner | -1.214083367057871 | -0.107964200747705902 3188 | Shirley | Flores | -0.269796303790638 | 0.117871818902746738 3196 | Andrew | Holmes | 1.079185215162552 | -0.849222942006447161 3430 | Philip | Little | -1.079185215162552 | 0.653284859470426481 3522 | Jerry | Ross | 0.809388911371914 | 0.480364995828913355 3892 | Barbara | Flores | -0.404694455685957 | -1.097366550974798397 3939 | Anna | Walker | -0.944287063267233 | 0.414956177842775781 4165 | Martha | Reyes | 0.269796303790638 | 0.773947701782753329 4335 | Phillip | Wright | -1.214083367057871 | 1.218843012657445647 4534 | Roger | Harris | 1.079185215162552 | 1.155185021164402608 4806 | John | Robinson | 0.809388911371914 | -0.494320112876813908 4881 | Kelly | Welch | 0.134898151895319 | -0.540778808820045933 4889 | Jennifer | Arnold | 1.214083367057871 | -0.299762093576526566 5067 | Martha | Parker | 0.000000000000000 | 0.719991348857328239 5523 | John | Martin | -0.269796303790638 | -0.411248545269163826 6004 | Nicole | Sullivan | 0.269796303790638 | 1.065141044522487821 6013 | Harry | Woods | -0.944287063267233 | 1.005664438654129376 6240 | Norma | Martinez | 1.214083367057871 | 0.762412844887071691
. . . (1000 rows)