Data Preparation Tools – Technical Brief

Posted August 20, 2018 by Phil Molea, Sr. Information Developer, Vertica

This blog post was authored by Curtis Bennett.

Vertica supports a number of industry-standard data preparation tools for use in the data science life cycle. In addition to the functions described here, Vertica has a wide array of analytic capabilities that can be leveraged for additional data preparation, including time-series analytics (with missing-value imputation), analytic windowing and partitioning functions, lag/lead functionality, and Vertica-specific functions for conditional event tracking.

BALANCE allows the data scientist to minimize uneven data distribution across classes of data. A predictive model built on imbalanced data can yield high accuracy yet generalize poorly to new data in the minority class(es). This can occur when training against a dataset that has few positive cases available but many negative cases. BALANCE returns a view with a rebalanced distribution of the input data based on the supplied response column.

=> SELECT fraud, COUNT(fraud) FROM transaction_data GROUP BY fraud;
 fraud | COUNT
-------+-------
 TRUE  |    19
 FALSE |   981
(2 rows)

=> SELECT BALANCE('balance_fin_data', 'transaction_data', 'fraud', 'under_sampling'
          USING PARAMETERS sampling_ratio = 0.2);
         BALANCE
--------------------------
 Finished in 1 iteration
(1 row)

=> SELECT fraud, COUNT(fraud) FROM balance_fin_data GROUP BY fraud;
 fraud | COUNT
-------+-------
 t     |    19
 f     |   236
(2 rows)

DETECT_OUTLIERS identifies data points that differ greatly from otherwise similar data points. Detecting outliers can be a useful exercise by itself, to flag anomalous or fraudulent data points worth investigating. As a data preparation step, removing outliers is a common practice to reduce the risk of misclassifying data, introducing bias, or producing incorrect calculations. DETECT_OUTLIERS uses a robust z-score algorithm, which normalizes values to the number of standard deviations an observation lies from the median of each column. Increasing the threshold value results in fewer outliers being detected; decreasing it results in more.

=> CREATE TABLE baseball_roster (id IDENTITY, last_name VARCHAR(30), hr INT, avg FLOAT);
=> COPY baseball_roster FROM '/tmp/baseball.txt';
10 row(s) loaded.

=> SELECT * FROM baseball_roster ORDER BY avg DESC LIMIT 5;
 id | last_name |   hr   |  avg
----+-----------+--------+--------
  9 | Hank      | 999999 | 0.8888
 10 | Popup     |     35 |  0.378
  3 | Gus       |     12 |  0.345
  7 | Wicker    |     78 |  0.333
  1 | Polo      |      7 |  0.233
(5 rows)

=> SELECT DETECT_OUTLIERS('baseball_outliers', 'baseball_roster', 'id, hr, avg', 'robust_zscore'
          USING PARAMETERS outlier_threshold=3.0);
     DETECT_OUTLIERS
--------------------------
 Detected 2 outliers

=> SELECT * FROM baseball_outliers;
 id | last_name |   hr   |  avg
----+-----------+--------+--------
  7 | Wicker    |     78 |  0.333
  9 | Hank      | 999999 | 0.8888
(2 rows)

Many machine learning functions cannot work with categorical data. To accommodate these models, categorical data must be converted to numerical data before training. Directly mapping the categorical values to indices is not enough. For example, if your categorical feature has three distinct values, "red", "green", and "blue", replacing them with 1, 2, and 3 may have a negative impact on the training process, because algorithms usually rely on some kind of numerical distance between values to discriminate between them. In this case, the Euclidean distance from 1 to 3 is twice the distance from 1 to 2, so the training process would treat "red" as much more different from "blue" than from "green", even though the three categories are equally distinct.
Alternatively, one-hot encoding maps each categorical value to a binary vector to avoid this problem. For example, "red" can be mapped to [1,0,0], "green" to [0,1,0], and "blue" to [0,0,1]. Now the pairwise distances between the three categories are all the same. One-hot encoding allows you to convert categorical variables to binary values so that you can use different machine learning algorithms to evaluate your data.

ONE_HOT_ENCODER_FIT generates a sorted list of the category levels for each feature that will be encoded and stores it as a reusable model. The results can be viewed with the Vertica function GET_MODEL_ATTRIBUTES().

=> SELECT ONE_HOT_ENCODER_FIT('titanic_encoder', 'titanic_training', 'sex, embarkation_point');
 ONE_HOT_ENCODER_FIT
---------------------
 Success

APPLY_ONE_HOT_ENCODER allows you to apply the encoding to any desired data set. You can then create a view from that output, for example:

=> CREATE VIEW titanic_training_encoded AS
   SELECT passenger_id, survived, pclass, sex_1, age,
          sibling_and_spouse_count, parent_and_child_count, fare,
          embarkation_point_1, embarkation_point_2
   FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
         FROM titanic_training) AS sq;
CREATE VIEW

IMPUTE allows you to fill in missing values with either the mean or the mode of an observed set of non-null values. For example, if you were collecting data about the heights and weights of students in a class and some students returned blank forms, you could impute the missing values with the average values for the rest of the class. You can also partition the dataset so that, for example, missing values for girls are imputed using only data from other girls.

=> SELECT * FROM small_input_impute;
 pid | pclass |    x1     |    x2     |    x3     | x4 | x5 | x6
-----+--------+-----------+-----------+-----------+----+----+----
   5 |      0 | -2.590837 | -2.892819 | -2.70296  |  2 | t  | C
   7 |      1 |  3.829239 |  3.08765  | Infinity  |    | f  | C
  13 |      0 | -9.060605 | -9.390844 | -9.559848 |  6 | t  | C
  15 |      0 | -2.590837 | -2.892819 | -2.70296  |  2 | f  | A
  16 |      0 | -2.264599 | -2.615146 | -2.10729  | 11 | f  | A
  19 |      1 |           |  3.841606 |  3.754375 | 20 | t  |

=> SELECT IMPUTE('output_view', 'small_input_impute', 'pid, x1, x2, x3, x4', 'mean'
          USING PARAMETERS exclude_columns='pid');
         IMPUTE
--------------------------
 Finished in 1 iteration

=> SELECT * FROM output_view;
 pid | pclass |    x1     |    x2     |        x3        | x4 | x5 | x6
-----+--------+-----------+-----------+------------------+----+----+----
   5 |      0 | -2.590837 | -2.892819 | -2.70296         |  2 | t  | C
   7 |      1 |  3.829239 |  3.08765  | -3.1298970526315 | 11 | f  | C
  13 |      0 | -9.060605 | -9.390844 | -9.559848        |  6 | t  | C
  15 |      0 | -2.590837 | -2.892819 | -2.70296         |  2 | f  | A
  16 |      0 | -2.264599 | -2.615146 | -2.10729         | 11 | f  | A
  19 |      1 | -3.866450 |  3.841606 |  3.754375        | 20 | t  |

Note that x6 was not imputed, because it was not listed in the column list in the function call. For categorical values such as "A", "C", and so on, mode imputation can be used.
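To illustrate, here is a minimal sketch of mode imputation against the same table. It is not captured output: the view names are arbitrary, and the partitioned variant assumes a partition_columns parameter that provides the per-group behavior described above for IMPUTE.

-- Sketch only: fill missing categorical values in x5 and x6 with the most
-- frequent (mode) value observed in each column. 'mode_output_view' and
-- 'mode_by_class_view' are arbitrary names chosen for this illustration.
=> SELECT IMPUTE('mode_output_view', 'small_input_impute', 'x5, x6', 'mode');

-- Partitioned variant (assumed parameter name): compute each mode using only
-- rows that share the same pclass value.
=> SELECT IMPUTE('mode_by_class_view', 'small_input_impute', 'x5, x6', 'mode'
          USING PARAMETERS partition_columns='pclass');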
The NORMALIZE function scales numeric data from different columns down to an equivalent scale. For example, suppose you wanted to compare the average life expectancy of a country to the total number of people who live in that country. The population could be well over 1 billion, whereas life expectancy rarely exceeds 80. These are wildly disparate values, so the values in the population field can overshadow the values in the life expectancy column and skew your model.

Vertica provides three methodologies for normalizing data:

• MinMax – Using the MinMax normalization method, you can normalize the values in both of these columns to fall between 0 and 1. Doing so allows you to compare values on very different scales by reducing the dominance of one column over the other.

• Z-score – Using the Z-score normalization method, you can normalize the values in both of these columns to be the number of standard deviations an observation is from the mean of each column. This allows you to compare your data to a normally distributed random variable.

• Robust Z-score – Using the Robust Z-score normalization method, you can lessen the influence of outliers on Z-score calculations. Robust Z-score normalization uses the median instead of the mean, which removes some of the influence of outliers in the data.

NORMALIZE_FIT provides the same normalization functionality, but stores the computed statistics as a model object that can be reused with the APPLY_NORMALIZE function. REVERSE_NORMALIZE can then take normalized output and return it to its original scale. (A sketch of this model-based workflow follows the NORMALIZE example below.)

=> SELECT NORMALIZE('normalized_salary_data', 'salary_data', 'current_salary, years_worked', 'minmax');
        NORMALIZE
--------------------------
 Finished in 1 iteration

=> SELECT * FROM normalized_salary_data;
 emp_id | first_name | last_name |  years_worked   | current_salary
--------+------------+-----------+-----------------+-----------------
    189 | Shawn      | Moore     | 0.3500000000000 | 0.4372465657653
    518 | Earl       | Shaw      | 0.1000000000000 | 0.9788674111444
   1126 | Susan      | Alexander | 0.2500000000000 | 0.9090489957107
   3196 | Andrew     | Holmes    | 0.9000000000000 | 0.1839540464448
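As referenced above, here is a minimal sketch of the model-based workflow on the same salary data. The model name is arbitrary and the statements are illustrative rather than captured output.

-- Sketch only: fit a reusable min-max normalization model ('salary_norm' is an
-- arbitrary model name), apply it, and then reverse the transformation.
=> SELECT NORMALIZE_FIT('salary_norm', 'salary_data', 'current_salary, years_worked', 'minmax');

-- Apply the stored model to a data set.
=> SELECT APPLY_NORMALIZE(current_salary, years_worked
          USING PARAMETERS model_name='salary_norm')
   FROM salary_data;

-- Return normalized values to their original scale.
=> SELECT REVERSE_NORMALIZE(current_salary, years_worked
          USING PARAMETERS model_name='salary_norm')
   FROM normalized_salary_data;

Reusing the fitted model ensures that new data is scaled with the same statistics as the data the model was trained on.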
TABLESAMPLE allows you to easily take a smaller, more manageable sample of a much larger data set. It lets you specify the percentage of data you want to sample, though the result is only an approximation of that percentage; the function does not guarantee that the exact percentage is returned. For example, it would be impossible to return exactly 25% of the rows of a 99-row table.

=> CREATE TABLE baseball_sample AS SELECT * FROM baseball TABLESAMPLE(25);
CREATE TABLE
=> SELECT * FROM baseball_sample;
 id  | first_name | last_name |    dob     | hr | hits |  avg
-----+------------+-----------+------------+----+------+-------
   4 | Amanda     | Turner    | 1997-12-22 | 58 |  177 | 0.187
  20 | Jesse      | Cooper    | 1983-04-13 | 97 |   39 | 0.523
  22 | Randy      | Peterson  | 1980-05-28 | 14 |   16 | 0.141
  24 | Carol      | Harris    | 1991-04-02 | 96 |   12 | 0.456
  32 | Rose       | Morrison  | 1977-07-26 | 27 |  153 | 0.442
  50 | Helen      | Medina    | 1987-12-26 | 12 |  150 | 0.54
  70 | Richard    | Gilbert   | 1983-07-13 |  1 |  250 | 0.213
  81 | Angela     | Cole      | 1991-08-16 | 87 |  136 | 0.706
  82 | Elizabeth  | Foster    | 1994-04-30 | 46 |  163 | 0.481
  98 | Philip     | Gardner   | 1992-05-06 | 39 |  239 | 0.697
 102 | Ernest     | Freeman   | 1983-10-05 | 46 |   77 | 0.564
…

Principal Component Analysis (PCA) is a technique that reduces the dimensionality of data while retaining the variation present in the data. In essence, a new coordinate system is constructed so that data variation is strongest along the first axis, less strong along the second axis, and so on. The data points are then transformed into this new coordinate system. The directions of the axes are called principal components. If the input data is a table with p columns, there can be at most p principal components. However, it is usually the case that the data variation along the direction of some k-th principal component becomes almost negligible, which allows us to keep only the first k components. As a result, the new coordinate system has fewer axes, and the transformed data table has only k columns instead of p. It is important to remember that the k output columns are not simply a subset of the p input columns; instead, each of the k output columns is a combination of all p input columns.

The PCA function computes the principal components from the input table and saves them as a model. It uses a covariance matrix by default, but if column standardization is required, an optional parameter can be set that forces the function to use a correlation matrix instead. Once a model is generated, you can use the APPLY_PCA function to apply the transformation to any new dataset, and APPLY_INVERSE_PCA can transform the data back into its original space.

PCA uses SVD (Singular Value Decomposition), a matrix decomposition method that lets you approximate a matrix X with dimensions n-by-p as the product of three matrices:

X(n-by-p) ≈ U(n-by-k) · S(k-by-k) · V^T(k-by-p)

where k is an integer from 1 to p, V^T is the transpose of V, and S is a diagonal matrix. The diagonal of S holds non-negative values, called singular values, sorted from the largest at the top left to the smallest at the bottom right; all other elements of S are zero. Like PCA, SVD creates a model, which can be applied with the APPLY_SVD function and reversed with the APPLY_INVERSE_SVD function.
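To make these workflows concrete, here is a minimal sketch using the numeric columns of the baseball table from the TABLESAMPLE example. It is illustrative rather than captured output: the model and view names are arbitrary, and the projected output column names (col1, col2, col3) are an assumption about how APPLY_PCA and APPLY_SVD label their results.

-- Sketch only: fit a PCA model ('baseball_pca' is an arbitrary name), project
-- the data onto its principal components, then map it back to the original space.
=> SELECT PCA('baseball_pca', 'baseball', 'hr, hits, avg');

=> CREATE VIEW baseball_pca_scores AS
   SELECT APPLY_PCA(hr, hits, avg USING PARAMETERS model_name='baseball_pca')
   FROM baseball;

=> SELECT APPLY_INVERSE_PCA(col1, col2, col3 USING PARAMETERS model_name='baseball_pca')
   FROM baseball_pca_scores;

-- The SVD workflow follows the same pattern, again with arbitrary names.
=> SELECT SVD('baseball_svd', 'baseball', 'hr, hits, avg');

=> CREATE VIEW baseball_svd_scores AS
   SELECT APPLY_SVD(hr, hits, avg USING PARAMETERS model_name='baseball_svd')
   FROM baseball;

=> SELECT APPLY_INVERSE_SVD(col1, col2, col3 USING PARAMETERS model_name='baseball_svd')
   FROM baseball_svd_scores;

In both cases, the inverse function is useful for interpreting results in the original units after working in the reduced space.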