This blog post was authored by Soniya Shah.
Vertica 8.1.1 continues the fast-paced development of machine learning features. In this release, we introduce the highly requested random forest algorithm. We extended SVM support to include SVM for regression, in addition to the existing SVM for classification algorithm. L2 regularization was added to both the linear and logistic regression algorithms. You can further manage your models using the new GET_MODEL_ATTRIBUTE function, which returns individual model attributes as a table. We’ve also made improvements to prediction and data preparation functions. These new features and enhancements make machine learning in Vertica more powerful and easier to use.
Summary of Enhancements
|Random forest for classification
||A powerful classification method that supports multi-class classification and most data types for predictors.
|SVM for regression
||A regression algorithm that has the generalization power to avoid overfitting.
|L2 regularization for regression algorithms
||Both linear regression and logistic regression support L2 regularization to avoid overfitting.
|Extract model attributes
||Provides the ability to list all attributes and query a specific attribute from a model. The result returns a table, which can be conveniently fed into other queries.
|Hybrid method for imbalanced data processing
||The BALANCE function supports three methods for data processing – under-sampling, over-sampling, and hybrid-sampling, making it easier to balance a data set.
|Outlier detection with the PARTITION BY clause
||With this enhancement, the DETECT_OUTLIERS function can detect outliers by group.
|Match-by-position for prediction parameters
||Prediction functions can match the input columns with model features by the position of the parameters, rather than by column name.
Random Forest for Classification
Like the Naive Bayes algorithm, random forest supports multi-class classification. Random forest is a robust classification algorithm that works well on many different types of data sets. A set of function parameters provides good control over how the ensemble model is built, including the number of trees, tree depth, feature size, and more.
The random forest model is a set of decision trees. The algorithm constructs the decision trees while training a model, and then uses them for prediction. The predicted class of a random forest model is the one predicted by the largest number of trees in the forest. A decision tree is a set of decision nodes in a tree structure. With the exception of leaf nodes, each decision node contains a rule for splitting its input data among its children. The input data of the decision tree enters at the root node at the top of the tree, and as it travels down the tree toward the leaf nodes, it is bucketed into smaller sets.
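The voting step described above can be sketched in a few lines of Python. This is an illustration only, not Vertica's implementation: the function name majority_vote and the class labels are hypothetical, and the tie-breaking behavior (the class seen first wins) is an assumption of this sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees.

    tree_predictions holds one predicted class label per tree.
    Assumption of this sketch: on a tie, the label seen first wins.
    """
    counts = Counter(tree_predictions)
    # most_common(1) returns a one-element list of (label, count)
    label, _count = counts.most_common(1)[0]
    return label


# Five hypothetical trees vote on one input row
votes = ["A", "B", "A", "A", "B"]
print(majority_vote(votes))  # A
```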
For example, you might want to know whether the weather is suitable for a run outside. If the weather is sunny and the wind is less than 10 MPH, you will probably want to run outside.
You can look at the outlook first and split the data into sunny or rain. From there, the tree splits further based on the wind speed to determine whether it is OK to run.
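Written out by hand, a single decision tree for this example might look like the following Python sketch. The function name ok_to_run is hypothetical, and the outcome on the rain branch is an assumption, since the example above only spells out the sunny, low-wind case.

```python
def ok_to_run(outlook, wind_mph):
    """Tiny hand-written decision tree for the running example."""
    # Root node: split on the outlook
    if outlook == "sunny":
        # Child node: split on wind speed (threshold of 10 MPH from the text)
        return wind_mph < 10
    # Assumption: the rain branch ends in a "do not run" leaf
    return False


print(ok_to_run("sunny", 5))   # True
print(ok_to_run("sunny", 15))  # False
print(ok_to_run("rain", 5))    # False
```

A trained random forest learns many such trees from data rather than having the rules written by hand.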
For a complete example, see Classifying Data Using Random Forest in the Vertica documentation.
SVM for Regression
In Vertica 8.1, we introduced SVM for classification, and we have now extended support to SVM for regression. This algorithm provides an alternative to linear regression. SVM allows the user to specify an error tolerance level, within which a difference between the predicted value and the actual value is tolerated. When the error_tolerance is small and the parameter C is large, SVM for regression behaves similarly to linear regression.
SVM regression predicts continuous numeric outcomes, rather than a binary classification outcome, for use cases such as pattern recognition or time series prediction. You can use the SVM_REGRESSOR and PREDICT_SVM_REGRESSOR functions for training and prediction.
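The idea of an error tolerance can be illustrated with the epsilon-insensitive loss commonly used in SVM regression. The Python sketch below is a minimal illustration under that assumption and is not taken from Vertica's implementation. The function name is hypothetical, and the parameter is named error_tolerance only to mirror the text above.

```python
def epsilon_insensitive_loss(actual, predicted, error_tolerance):
    """Penalty for one prediction under an epsilon-insensitive loss.

    Errors no larger than error_tolerance cost nothing; larger errors
    are penalized only by the amount that exceeds the tolerance.
    """
    return max(0.0, abs(actual - predicted) - error_tolerance)


# Error of 0.25 is inside the tolerance of 0.5, so it is not penalized
print(epsilon_insensitive_loss(10.0, 10.25, 0.5))  # 0.0
# Error of 1.0 exceeds the tolerance by 0.5
print(epsilon_insensitive_loss(10.0, 11.0, 0.5))   # 0.5
```

As the tolerance shrinks toward zero, almost every deviation is penalized, which is consistent with the model fitting the data as closely as a conventional regression would.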
For a complete example of how to use the SVM algorithm in Vertica, see Building an SVM for Regression Model in the Vertica documentation.
For More Information
To get the full story on Vertica machine learning, take a look at the documentation:
• Machine Learning Functions in the SQL Reference Manual
• Machine Learning for Predictive Analytics in the Analyzing Data guide
We are constantly expanding machine learning features in Vertica. You can expect to see expanded functionality in future releases.