This blog post was authored by Soniya Shah.
Vertica 9.0 introduces new functionality that continues to match our goals for fast-paced development of the existing machine learning functions. In this release, we introduce two new summary functions, support for cross validation, support for one hot encoding, and the ability to import and export your models to other Vertica clusters. These new features will continue to make machine learning in Vertica easier and more powerful to use.
Summary of Enhancements
• Cross validation — The new CROSS_VALIDATE function allows you to obtain more accurate measurements across your data set.
• Summary function — A new summary function, with expanded functionality from the SUMMARIZE_MODEL function. It provides a statistical summary of each numerical feature in an input data set.
• Import and export models — Use the new IMPORT_MODELS and EXPORT_MODELS functions to import and export your models to other Vertica clusters.
• One hot encoding — Support for one hot encoding using the new ONE_HOT_ENCODER_FIT and APPLY_ONE_HOT_ENCODER functions. One hot encoding converts a categorical column into multiple binary columns, each of which indicates the presence or absence of one level of that category.
Cross Validation
Cross validation is useful for obtaining more accurate performance measurements across your data set, because every sample from the original data set has the same chance of appearing in both the training set and the testing set. Without cross validation, you only have information about how the model performs on the data it was trained on. Use cross validation to estimate how a model will perform on new data.
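To illustrate the idea outside of Vertica, here is a minimal plain-Python sketch of k-fold cross validation; the fold count and toy data set are illustrative assumptions, not part of the Vertica API.

```python
# Minimal sketch of k-fold cross validation (illustrative, not Vertica code).
def k_fold_splits(samples, k):
    """Yield (train, test) partitions; each sample is tested exactly once."""
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

data = list(range(10))
tested = []
for train, test in k_fold_splits(data, k=5):
    tested.extend(test)

# Across the folds, every sample appears in a test set exactly once.
assert sorted(tested) == data
```

Because every sample is held out exactly once, the averaged test error reflects performance on unseen data rather than on the training sample alone.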
In Vertica, you can also use the new CROSS_VALIDATE function for hyperparameter selection: vary a particular parameter over a set of candidate values, apply cross validation to each value, and then choose the value with the best performance.
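The selection loop can be sketched in plain Python (this is a conceptual illustration, not the CROSS_VALIDATE function itself; the ridge-style model, fold count, and candidate values are assumptions for the example):

```python
# Hedged sketch of hyperparameter selection via cross validation:
# score each candidate value of a parameter with k-fold CV, keep the best.
def cv_score(points, lam, k=5):
    """Mean squared error of a 1-D ridge-style fit, averaged over k folds."""
    folds = [points[i::k] for i in range(k)]
    total = 0.0
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        # Closed-form ridge slope on the training fold; larger lam shrinks it.
        slope = sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + lam)
        total += sum((y - slope * x) ** 2 for x, y in test) / len(test)
    return total / k

points = [(x, 2.0 * x) for x in range(1, 21)]           # noiseless y = 2x
scores = {lam: cv_score(points, lam) for lam in (0.0, 1.0, 10.0)}
best = min(scores, key=scores.get)                      # value with lowest CV error
```

With noiseless data, the unregularized fit wins; on noisy data the same loop would favor a nonzero regularization value.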
For more information, see CROSS_VALIDATE in the Vertica documentation.
One Hot Encoding
There is a good chance your data set contains both categorical and numeric variables. For example, your data set could include a color variable with the values red, green, and yellow. Each value represents a different category. While some algorithms, such as random forest, can work with categorical data directly, others, such as linear regression, can only work with numeric data. These algorithms require that all input variables are numeric.
Directly mapping the categorical values to indexes is not enough. For example, if your categorical feature has the three distinct values “red”, “green”, and “blue”, replacing them with 1, 2, and 3 can have a negative impact on the training process, because algorithms usually rely on some kind of numerical distance between values to discriminate between them. In this case, the Euclidean distance from 1 to 3 is twice the distance from 1 to 2, so the training process will assume that “red” is much more different from “blue” than it is from “green”, even though no such ordering exists among the categories.

One hot encoding avoids this problem by mapping each categorical value to a binary vector. For example, “red” can be mapped to [1,0,0], “green” to [0,1,0], and “blue” to [0,0,1]. Now the pair-wise distances between the three categories are all the same. One hot encoding converts categorical variables to binary values so that you can use a wider range of machine learning algorithms to evaluate your data. In this release, we introduce new one hot encoding functions. For more information and a full example of how to use these functions in Vertica, see Encoding Categorical Columns in the Vertica documentation.
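The distance argument above can be checked with a few lines of plain Python (the color values are the illustrative ones from this post, not Vertica output):

```python
# Sketch: why integer indexes distort distances, and why one hot encoding doesn't.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Naive integer indexes: the distances between categories are unequal.
index = {"red": (1,), "green": (2,), "blue": (3,)}
assert euclidean(index["red"], index["blue"]) == 2 * euclidean(index["red"], index["green"])

# One hot encoding: every pair of categories is equally far apart.
one_hot = {"red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1)}
d = euclidean(one_hot["red"], one_hot["green"])
assert euclidean(one_hot["red"], one_hot["blue"]) == d
assert euclidean(one_hot["green"], one_hot["blue"]) == d
```

The equal pair-wise distances are what let distance-sensitive algorithms treat the categories symmetrically.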
For More Information
To get the full story on Vertica machine learning, take a look at the documentation:
• Machine Learning Functions in the SQL Reference Manual
• Machine Learning for Predictive Analytics in the Analyzing Data guide
We are constantly expanding machine learning features in Vertica. You can expect to see expanded functionality in future releases.