Machine Learning Mondays: Data Preparation for Machine Learning in Vertica

Posted May 8, 2017 by Soniya Shah, Information Developer

This blog post was authored by Vincent Xu.

This post is part of our Machine Learning Mondays series. Stay tuned for more!


Machine learning (ML) is an iterative process: from understanding and preparing data to building, testing, and deploying models, every step of the way requires careful examination and manipulation of the data. This is especially true at the beginning of the cycle, where raw data must be cleaned and prepared for modeling. In this blog, I will briefly walk through the data-wrangling tools available in Vertica. Anyone conducting ML in Vertica should take advantage of the following capabilities to help prepare their data.

Loading data in various formats from a variety of data sources

Most data scientists would agree that one of the trickiest steps is gathering data from various sources into one place and converting it into a common format. The table format of a relational database like Vertica is well suited to ML, since ML data is usually represented as a data frame.

Over the years, Vertica has developed a comprehensive set of parsers to load data in various formats, including delimited text, JSON, regex-defined formats, ORC, Avro, Parquet, and even Shapefile. For example, the following statement loads compressed Twitter data into Vertica:

=> COPY twitter FROM '/server1/TWITTER/tweets1.json.bz2' BZIP parser fjsonparser() direct;

Vertica also accepts data from a variety of storage systems, including local file systems, SQL-compatible data sources, HDFS, Spark, and cloud storage such as S3 and Azure Blob Storage (ABS). Vertica has an especially tight integration with HDFS and Spark; transferring data between these systems and Vertica is highly optimized.

The following statement creates an external table ‘hadoopExample’ that maps to data in HDFS. Afterwards, you can read the table just like any other table in Vertica:

=> CREATE EXTERNAL TABLE hadoopExample (A VARCHAR(10), B INTEGER, C INTEGER, D INTEGER) AS COPY SOURCE Hdfs(url='http://hadoop01:50070/webhdfs/v1/user/hadoopUser/example/output/*', username='hadoopUser');

A powerful SQL query engine

The best part of having the data in Vertica is that users can take advantage of its fast SQL engine to slice and dice the data however they want. Filtering rows and columns, joining tables, grouping, and ordering are highly scalable operations that can express sophisticated processing logic. These operations typically require a more verbose implementation in a non-SQL environment.

This quick statement computes the total vacation days accrued by employees in each region:

=> SELECT s.store_region, SUM(e.vacation_days) totalVacationDays FROM public.employee_dimension e JOIN store.store_dimension s ON s.store_region=e.employee_region GROUP BY s.store_region ORDER BY totalVacationDays;
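To illustrate the verbosity point, here is a minimal sketch of the same join-group-order logic written out by hand in plain Python. The tables and values are made up for the example; they only mirror the column names in the SQL above.

```python
from collections import defaultdict

# Made-up stand-ins for the employee and store tables
employees = [
    {"employee_region": "East", "vacation_days": 10},
    {"employee_region": "East", "vacation_days": 5},
    {"employee_region": "West", "vacation_days": 12},
]
stores = [{"store_region": "East"}, {"store_region": "West"}]

# JOIN + GROUP BY + SUM, implemented manually
totals = defaultdict(int)
store_regions = {s["store_region"] for s in stores}
for e in employees:
    if e["employee_region"] in store_regions:  # the join condition
        totals[e["employee_region"]] += e["vacation_days"]

# ORDER BY totalVacationDays
result = sorted(totals.items(), key=lambda kv: kv[1])
print(result)  # [('West', 12), ('East', 15)]
```

A single declarative SQL statement replaces all of this bookkeeping, and the Vertica engine parallelizes it automatically.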

A rich set of built-in functions

Vertica provides hundreds of functions that perform all types of data transformations, including mathematical calculations, string manipulation, date/time handling, aggregation, and regular expressions. Transforming the data or getting statistical information about it is just one function call away. The following statement gives a good picture of the distribution of a table column:

=> SELECT COUNT(salary), AVG(salary), STDDEV(salary), MIN(salary), APPROXIMATE_PERCENTILE(salary USING PARAMETERS percentile=.25) p25, APPROXIMATE_MEDIAN(salary) median, APPROXIMATE_PERCENTILE(salary USING PARAMETERS percentile=.75) p75, MAX(salary) FROM employees;

Vertica also offers an extended set of functions for advanced data analysis, such as time series analysis and sessionization of event series. This simple statement linearly interpolates bid prices into 2-second intervals for each stock ticker:

=> SELECT slice_time, TS_FIRST_VALUE(bid, 'LINEAR') bid FROM Tickstore TIMESERIES slice_time AS '2 seconds' OVER (PARTITION BY symbol ORDER BY ts);
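The idea behind TIMESERIES with TS_FIRST_VALUE(bid, 'LINEAR') can be sketched in a few lines of Python: compute the bid price at fixed 2-second slice boundaries by linearly interpolating between the surrounding ticks. The tick data below is made up, and the sketch simplifies the semantics (a single symbol, no gap-filling edge cases), so treat it as an illustration rather than a faithful reimplementation.

```python
# Made-up ticks for one symbol: (seconds since start, bid price)
ticks = [(0, 10.0), (3, 13.0), (5, 11.0)]

def interpolate(ticks, t):
    """Linearly interpolate the bid at time t between surrounding ticks."""
    for (t0, b0), (t1, b1) in zip(ticks, ticks[1:]):
        if t0 <= t <= t1:
            return b0 + (b1 - b0) * (t - t0) / (t1 - t0)
    return None  # t falls outside the observed ticks

# Evaluate at 2-second slice boundaries: 0, 2, 4
series = [(t, interpolate(ticks, t)) for t in range(0, 6, 2)]
print(series)  # [(0, 10.0), (2, 12.0), (4, 12.0)]
```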

Specialized data-preparation functions for machine learning

To make the common data transformations for ML easy, Vertica has also developed dedicated functions. Vertica 8.1 comes with functions such as normalization, missing value imputation, imbalanced data processing, outlier detection, and sampling.

Normalization is a common step in data preparation. For example, to normalize the ‘salary’ and ‘years_of_experience’ columns in the ‘employees’ table, you can run the following statement, which normalizes the columns using the ‘robust_zscore’ method and stores the results in the ‘employees_norm’ view:

=> SELECT NORMALIZE('employees_norm', 'employees', 'salary, years_of_experience', 'robust_zscore');

If a model is trained on normalized data, the prediction data must be normalized in the same way before it is fed to the prediction function. For that reason, a set of companion functions is provided: NORMALIZE_FIT, APPLY_NORMALIZE, and REVERSE_NORMALIZE. NORMALIZE_FIT computes the normalization parameters and stores them as a model. The following statement stores the median and MAD score of each column in the model ‘employees_normrz’:

=> SELECT NORMALIZE_FIT('employees_normrz', 'employees', 'salary,years_of_experience', 'robust_zscore');

APPLY_NORMALIZE applies those stored parameters to the prediction data before it is fed to the prediction function:

=> CREATE TABLE new_employees_norm AS SELECT APPLY_NORMALIZE (salary, years_of_experience USING PARAMETERS model_name = 'employees_normrz') FROM new_employees;

REVERSE_NORMALIZE transforms normalized data back to its original scale. For example, it is common practice to normalize each column before clustering with k-means, so the resulting cluster centers are expressed in normalized values. To make the centers easier to interpret, call REVERSE_NORMALIZE to convert them back to the original scale:

=> CREATE TABLE customer_segments AS SELECT REVERSE_NORMALIZE (revenue, profit, number_of_employees USING PARAMETERS model_name = 'customers_norm') FROM customers_segments_norm;
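For intuition, here is a minimal Python sketch of robust z-score normalization, the method named in the examples above: center each value on the median and scale by the median absolute deviation (MAD). Note that some implementations additionally scale the MAD by a constant (~1.4826) for consistency with the standard deviation; consult the Vertica documentation for the exact formula it applies. The salary data below is made up.

```python
from statistics import median

def robust_zscore(values):
    """(x - median) / MAD for each x; robust to extreme outliers."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [(v - med) / mad for v in values]

salaries = [40_000, 50_000, 60_000, 70_000, 1_000_000]  # one extreme outlier
print(robust_zscore(salaries))  # [-2.0, -1.0, 0.0, 1.0, 94.0]
```

Unlike a plain z-score, the outlier inflates neither the center nor the scale, so the other four values keep sensible normalized magnitudes, which is exactly why this method is a good default for messy real-world columns.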


I hope this quick walk-through gives you a taste of all the Vertica tools that data scientists can leverage to get their data ready for machine learning. If you have any comments on this blog or any other feedback on Vertica products, please send them to