This blog post was authored by Soniya Shah.
Overall, you will notice that Machine Learning for Predictive Analytics, introduced in Vertica 7.2.2, is more accessible to use in Vertica 8.1, with the addition of several important functions. There are improvements to model management with access control ability to save and re-apply normalization parameters, missing value imputation, and support for SVM (support vector machine).
Summary of Enhancements
||Imputes missing values per partition in your data, using either mean or mode.
||A classification algorithm that assigns data to one category or the other, based on the training data.
||Support for machine learning model security.
|New normalization functions
||Three new normalization functions – NORMALIZE_FIT, APPLY_NORMALIZE, and REVERSE_NORMALIZE make it easy for you to save normalization parameters.
||Machine learning functions now use the USING PARAMETERS syntax.
|Support for ALTER MODEL and DROP MODEL statements
||ALTER MODEL and DROP MODEL replace RENAME_MODEL and DELETE_MODEL.
|Support for APPROXIMATE_MEDIAN and APPROXIMATE_PERCENTILE functions
||Use the new functions to obtain estimated values on large data sets.
In Vertica 8.1, there is now support for machine learning model security, similar to that of other database objects. You can use GRANT and REVOKE statements to grant and revoke privileges on models to other users.
For example, if you have a model that you created and want to share with a co-worker, you can grant them permission on the model. For more information and a complete example, see Managing Model Security
in the Vertica documentation.
New Normalization Functions
In Vertica 8.1, the new normalization functions allow you to save and apply normalization parameters later. The purpose of normalizing your data is to scale numeric data from different columns down to an equivalent scale. Vertica offers three methods for normalization:
• MinMax: Normalize the values to be within a distribution of 0 and 1.
• Z-score: Normalize values to be the number of standard deviations that an observation is from the mean.
• Robust Z-score: You can use this method to lessen this influence of outliers on Z-score calculations. This method uses the median value rather than the mean value, as used in the Z-score method.
Previously, you could use the NORMALIZE function to apply the normalization directly on the columns of an input table. Now, with the new functions you can do the following:
• Use NORMALIZE_FIT to store normalization parameters into a model for later operation.
• Use APPLY_NORMALIZE to apply normalization to a set of specified columns in the input table using the normalization parameters that were saved in a model.
• Use REVERSE_NORMALIZE to reverse the normalization transformation on a normalized data set.
New IMPUTE Function
Imputation is the process of replacing missing data with substituted values. In Vertica 8.1, you can use the new IMPUTE function to replace missing values, by specifying either the mean or mode method.
=> SELECT IMPUTE (‘output_view’, ‘customers’, ‘name, id’, ‘mean’);
=> SELECT IMPUTE (‘output_view’, ‘customers’, ‘name, id’, ‘mode’);
SVM (Support Vector Machine)
SVM is a classification algorithm often used in image classification or text categorization. You can use the SVM_CLASSIFIER and PREDICT_SVM_CLASSIFIER for training and prediction.
For a complete example of how to use the SVM algorithm in Vertica, see Classifying Data Using SVM
in the Vertica documentation.
For More Information
To get the full story on Vertica machine learning, take a look at the documentation: Machine Learning Functions
in the SQL Reference manual and Machine Learning for Predictive Analytics
in the Analyzing Data guide. We are constantly expanding machine learning features in Vertica. You can expect to see expanded functionality in future releases.