Evaluating Classifier Models in Vertica

Posted August 2, 2019 by Arash Fard, Senior Software Engineer at Vertica

Co-authored by Elizabeth Michaud

Vertica provides an out-of-the-box machine learning toolset that covers a complete data science workflow. The toolset includes several distributed functions for evaluating both classifier and regressor models.

The goal of this blog post is to demonstrate how you can use the built-in functions for evaluating the prediction performance of classifier models, and to show you some examples. In particular, we’ll show the flexibility of these built-in functions with respect to categorical labels. We’re using Vertica 9.2 SP1 to run these examples.

Loading the data into the database


We use the iris dataset for this blog post. You probably already know this dataset; it’s often used in tutorials to introduce classification problems. No worries if you don’t know it though. It’s a small dataset with 150 records of iris flowers, which you can download from the UCI Machine Learning repository. Each record has five features: four numerical attributes giving the dimensions (length and width) of the sepal and petal of a flower in centimeters, and a categorical one specifying the species of the flower. There are 3 iris species in this dataset: Setosa, Versicolour, and Virginica.

The following code snippet shows you how to load the data into a table.

-- setting a vsql variable for the file path
\set file_path '\'/home/my_folder/iris.data\''

-- creating the table
CREATE TABLE iris(sepal_length FLOAT, sepal_width FLOAT, petal_length FLOAT, petal_width FLOAT, species VARCHAR);

-- loading the dataset from the file to the table
COPY iris FROM LOCAL :file_path DELIMITER ',';

Now, let’s randomly put aside 20% of the samples as a test dataset. In other words, we don’t use these samples for training our model. Instead, we use them a little later for evaluating the trained predictive model.

-- creating the sample table where about 20% of rows are marked as test and the rest as train
CREATE TABLE iris_sample AS
    SELECT *, CASE WHEN RANDOM() < 0.2 THEN true ELSE false END AS test FROM iris;

-- creating a view for the train data
CREATE VIEW train_data AS
    SELECT sepal_length, sepal_width, petal_length, petal_width, species
    FROM iris_sample WHERE test=false;

-- creating a view for the test data
CREATE VIEW test_data AS
    SELECT sepal_length, sepal_width, petal_length, petal_width, species
    FROM iris_sample WHERE test=true;
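
Before training, it’s worth a quick sanity check on the split. A minimal query, assuming the iris_sample table was created as above (the exact counts vary from run to run because of RANDOM()):

-- counting how many rows landed in the train and test partitions
SELECT test, COUNT(*) FROM iris_sample GROUP BY test;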

Training a model


Let’s use Vertica’s built-in random forest algorithm to train a classifier. Random forest is a very powerful algorithm. Its distributed implementation in Vertica is fast and easy to use.

-- training a classifier model
SELECT RF_CLASSIFIER ('iris_model', 'train_data', 'species', '*'
    USING PARAMETERS exclude_columns='species');
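
If you would like to inspect the trained model before evaluating it, you can query its summary. A minimal sketch, assuming the GET_MODEL_SUMMARY function is available in your Vertica version:

-- displaying a summary of the trained model
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='iris_model');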

Model evaluation


Now that we have trained a predictive model, we can apply it to the test data to measure its prediction performance. There are five functions in Vertica for evaluating classifier models: ERROR_RATE, CONFUSION_MATRIX, ROC, PRC, and LIFT_TABLE. We will demonstrate their usage one by one. We encourage you to rerun these examples on your own. Your results might be slightly different from those below because of the randomness involved in picking the test dataset, as well as the intrinsic randomness of the random forest algorithm.

Predict functions


The task of an evaluation function is to compare the real values of the test samples with the values predicted by the model, and to measure the model’s prediction performance based on this comparison. Therefore, we first need to compute the predicted values.

Each classifier algorithm in Vertica has a predict function of type UDSF (User Defined Scalar Function). Being a UDSF makes its usage very flexible: you can use it alongside other columns in the SELECT list or in the WHERE clause of a SQL query. Its only limitation is that it returns a single value per input row.
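
For example, the random forest predict function (PREDICT_RF_CLASSIFIER, introduced below) can appear directly in a WHERE clause. A minimal sketch, assuming the iris_model trained above, that keeps only the test rows the model labels as Iris-setosa:

-- using the predict function in a WHERE clause
SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM test_data
WHERE PREDICT_RF_CLASSIFIER (sepal_length, sepal_width, petal_length, petal_width
    USING PARAMETERS model_name='iris_model') = 'Iris-setosa';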

The name of each of these functions is composed of “PREDICT_” and the name of the classifier; for instance, it is PREDICT_RF_CLASSIFIER for our random forest classifier. Here is how we can use it in our example.

-- predicting the species of test data
CREATE VIEW simple_prediction AS
    SELECT species AS real_species,
           PREDICT_RF_CLASSIFIER (sepal_length, sepal_width, petal_length, petal_width
               USING PARAMETERS model_name='iris_model') AS predicted_species
    FROM test_data;

If you look at the contents of the simple_prediction view, you’ll notice that both the real and predicted values are categorical. Since Vertica version 9.2.0, you don’t need to convert these categorical values to numerical indexes in order to feed them to the evaluation functions.

-- display the content of simple_prediction
=> SELECT * FROM simple_prediction LIMIT 3;
  real_species   | predicted_species
-----------------+-------------------
 Iris-versicolor | Iris-virginica
 Iris-versicolor | Iris-versicolor
 Iris-setosa     | Iris-setosa
(3 rows)

The UDSF predict functions for classifiers can optionally return the probability of predictions if you specify type='probability' in the list of their optional parameters. If you have many classes and you want to compare their prediction probabilities, you may need to call the UDSF function many times in a SQL query. To make such a use case more convenient, we have also developed a UDTF predict function for each of our multiclass classifiers. For random forest it is called PREDICT_RF_CLASSIFIER_CLASSES. Because it is a UDTF, you need to specify OVER() after the name of the function, and the function can appear only by itself in the SELECT clause. Here is how it can be used in our example.

-- predicting the species of test data
=> CREATE TABLE comprehensive_prediction AS
       SELECT PREDICT_RF_CLASSIFIER_CLASSES (*
           USING PARAMETERS model_name='iris_model', key_columns='species', exclude_columns='species',
           classes='Iris-setosa, Iris-versicolor, Iris-virginica') OVER()
       FROM test_data;
CREATE TABLE

-- display the content of comprehensive_prediction
=> SELECT * FROM comprehensive_prediction LIMIT 3;
   species   |  predicted  | probability | Iris-setosa | Iris-versicolor | Iris-virginica
-------------+-------------+-------------+-------------+-----------------+----------------
 Iris-setosa | Iris-setosa |           1 |           1 |               0 |              0
 Iris-setosa | Iris-setosa |           1 |           1 |               0 |              0
 Iris-setosa | Iris-setosa |           1 |           1 |               0 |              0
(3 rows)

The first column is the same as the species column of the test data. The second column is the predicted value, and the third column is the probability of that prediction. The remaining columns give the prediction probability of each class.
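
As a side note, the scalar predict function can also return the probability of the winning class instead of its label. A minimal sketch, assuming the optional type parameter behaves as described above:

-- returning the probability of the predicted class instead of its label
SELECT species,
       PREDICT_RF_CLASSIFIER (sepal_length, sepal_width, petal_length, petal_width
           USING PARAMETERS model_name='iris_model', type='probability') AS prediction_probability
FROM test_data;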

You may have noticed that we created simple_prediction as a view, but comprehensive_prediction as a table. Indeed, you can use either a view or a table for each one. The main difference is that views in Vertica are not materialized. In other words, views don’t take any physical storage space: their contents are created on the fly each time you use them.

Error_rate


Error rate measures misclassification. The error rate for each class is the fraction of test samples of that class that were misclassified. The total error rate is the fraction of all test samples that were misclassified.

It’s easy to measure the error rate of predictions on multiclass data using the ERROR_RATE function. You just need to specify the number of classes for the function.

-- calculating error_rate
=> SELECT ERROR_RATE(real_species, predicted_species USING PARAMETERS num_classes=3) OVER() FROM simple_prediction;
      class      |     error_rate     |                   comment
-----------------+--------------------+---------------------------------------------
 Iris-versicolor | 0.0714285714285714 |
 Iris-setosa     |                  0 |
 Iris-virginica  |  0.111111111111111 |
                 | 0.0606060606060606 | Of 33 rows, 33 were used and 0 were ignored
(4 rows)
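
To make these numbers concrete: the comment column tells us that all 33 test rows were used, so the total error rate of 0.0606 corresponds to 2 misclassified samples out of 33 (2/33 ≈ 0.0606). Likewise, 0.0714 for Iris-versicolor is 1 out of 14 (1/14 ≈ 0.0714), and 0.111 for Iris-virginica is 1 out of 9 (1/9 ≈ 0.111). Your counts will differ slightly with a different random split.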

Confusion_matrix


A confusion matrix is a table that displays the distribution of the samples in the test data based on their actual classes and predicted classes. The statement for calculating the confusion matrix is very similar to the one for the ERROR_RATE function.

-- calculating confusion matrix
=> SELECT CONFUSION_MATRIX(real_species, predicted_species USING PARAMETERS num_classes=3) OVER() FROM simple_prediction;
  actual_class   | class_index | predicted_0 | predicted_1 | predicted_2 |                   comment
-----------------+-------------+-------------+-------------+-------------+---------------------------------------------
 Iris-virginica  |           0 |           8 |           1 |           0 |
 Iris-versicolor |           1 |           1 |          13 |           0 |
 Iris-setosa     |           2 |           0 |           0 |          10 | Of 33 rows, 33 were used and 0 were ignored
(3 rows)

You may notice that the numerical index of each class is displayed in the class_index column. These indexes also appear in the column names of the confusion matrix. For example, the column predicted_0 corresponds to the Iris-virginica class.
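
To read the matrix: each row is an actual class and each column a predicted class, so the diagonal holds the correctly classified samples and every off-diagonal entry is a misclassification. In this run, one Iris-virginica sample was predicted as Iris-versicolor and one Iris-versicolor sample as Iris-virginica, which is consistent with the error rates shown above.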

Binary Evaluators


ROC, PRC, and LIFT_TABLE are, by definition, for evaluating binary classifiers. Nevertheless, you can also use them in a multiclass situation when you are interested in a single class in comparison to all others. For example, let’s say we want to evaluate the quality of our model in predicting “Iris-virginica” versus any other species. For this purpose, we should specify main_class='Iris-virginica' in the list of optional parameters. These functions take the real values and the probabilities of the main class as their input arguments; in our example, these are the species and “Iris-virginica” columns of the comprehensive_prediction table.

ROC


The curve created by the ROC (Receiver Operating Characteristic) function is one of the most important evaluation metrics for checking the performance of a binary classifier. The area under this curve, called AUC, is also an important performance indicator. The ROC function in Vertica provides the data points of the curve, and also optionally calculates AUC.

-- calculating ROC
=> SELECT ROC(species, "Iris-virginica" USING PARAMETERS num_bins=5, main_class='Iris-virginica') OVER() FROM comprehensive_prediction;
 decision_boundary | false_positive_rate | true_positive_rate |        AUC        |                   comment
-------------------+---------------------+--------------------+-------------------+---------------------------------------------
                 0 |                   1 |                  1 |                   |
               0.2 |               0.125 |                  1 |                   |
               0.4 |  0.0416666666666667 |                  1 |                   |
               0.6 |  0.0416666666666667 |  0.888888888888889 |                   |
               0.8 |  0.0416666666666667 |  0.888888888888889 |                   |
                 1 |                   0 |                  0 | 0.976851851851852 | Of 33 rows, 33 were used and 0 were ignored
(6 rows)

PRC


PRC calculates the precision versus recall curve for different decision boundaries. The PRC function in Vertica provides the data points of this curve, and also optionally calculates F1 scores.

-- calculating PRC
=> SELECT PRC(species, "Iris-virginica" USING PARAMETERS num_bins=5, f1_score=true, main_class='Iris-virginica') OVER() FROM comprehensive_prediction;
 decision_boundary |      recall       |     precision     |     f1_score      |                   comment
-------------------+-------------------+-------------------+-------------------+---------------------------------------------
                 0 |                 1 | 0.272727272727273 | 0.428571428571429 |
               0.2 |                 1 |              0.75 | 0.857142857142857 |
               0.4 |                 1 |               0.9 | 0.947368421052632 |
               0.6 | 0.888888888888889 | 0.888888888888889 | 0.888888888888889 |
               0.8 | 0.888888888888889 | 0.888888888888889 | 0.888888888888889 | Of 33 rows, 33 were used and 0 were ignored
(5 rows)
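
As a quick sanity check on the output, the f1_score column is the harmonic mean of precision and recall at each boundary. For example, at the 0.4 boundary, 2 × (0.9 × 1) / (0.9 + 1) = 1.8 / 1.9 ≈ 0.947, matching the value reported above.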

Lift_table


Lift measures how much a model improves over selecting samples at random: it is the ratio between the results obtained with the model and the results obtained without it. The lift table (also known as a lift chart) plots the lift versus the positive prediction ratio for different decision boundaries.

-- calculating Lift_table
=> SELECT LIFT_TABLE(species, "Iris-virginica" USING PARAMETERS num_bins=5, main_class='Iris-virginica') OVER() FROM comprehensive_prediction;
 decision_boundary | positive_prediction_ratio |       lift       |                   comment
-------------------+---------------------------+------------------+---------------------------------------------
                 1 |                         0 |              NaN |
               0.8 |         0.888888888888889 | 3.25925925925926 |
               0.6 |         0.888888888888889 | 3.25925925925926 |
               0.4 |                         1 |              3.3 |
               0.2 |                         1 |             2.75 |
                 0 |                         1 |                1 | Of 33 rows, 33 were used and 0 were ignored
(6 rows)
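
In this run you can see the relationship to the PRC output above: the lift at each decision boundary equals the precision at that boundary divided by the overall fraction of positive samples in the test data (9 of 33, about 0.273). For example, at the 0.4 boundary, 0.9 / 0.273 ≈ 3.3, and at a boundary of 0 the lift falls to 1 because the model is then no better than selecting everything.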

Takeaway


We have used a very small dataset in this blog post to demonstrate the Vertica functions for evaluating a classifier. Nevertheless, all ML functions in Vertica are developed for machine learning at scale. Not only the training algorithms, but also the prediction and evaluation functions, are designed to take advantage of the powerful distributed execution engine of the Vertica database. We have tested the scalability of these functions on datasets with billions of records and hundreds of features.

Moreover, we have improved our evaluation functions to work easily with categorical values. In other words, you don’t need to convert string labels to numerical indexes, as is required in many other systems, such as Spark ML.

Please go to our online documentation to learn more about the functions used here as well as other machine learning functions.


