VerticaPy

Python API for Vertica Data Science at Scale

Classification

Classification algorithms are ML algorithms used to predict categorical response columns. When the response column has more than two categories, the task is called 'multiclass classification'. Examples of classification include predicting a flower's species from its physical characteristics or predicting whether Telco customers will churn.

To understand how to create a classification model, let's predict the species of flowers with the Iris dataset.

We'll start by importing the Random Forest Classifier.

In [8]:
from verticapy.learn.ensemble import RandomForestClassifier

Next, we'll create a model object. Vertica has its own model management system, so we just need to choose a model name.

In [9]:
model = RandomForestClassifier("RF_Iris")

We can then fit the model to our dataset.

In [11]:
model.fit("iris", ["PetalLengthCm", "SepalLengthCm"], "Species")
Out[11]:

===========
call_string
===========
SELECT rf_classifier('public.RF_Iris', 'iris', '"species"', '"PetalLengthCm", "SepalLengthCm"' USING PARAMETERS exclude_columns='', ntree=10, mtry=1, sampling_size=0.632, max_depth=5, max_breadth=1000000000, min_leaf_size=1, min_info_gain=0, nbins=32);

=======
details
=======
  predictor  |      type      
-------------+----------------
petallengthcm|float or numeric
sepallengthcm|float or numeric


===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    | 10  
rejected_row_count|  0  
accepted_row_count| 150 

We have many metrics to evaluate the model.

In [12]:
model.classification_report()
Out[12]:
             | Iris-setosa        | Iris-versicolor    | Iris-virginica
-------------+--------------------+--------------------+-------------------
auc          | 1.0                | 0.9962000000000003 | 0.9960000000000001
prc_auc      | 1.0                | 0.992297410768789  | 0.9927592447454527
accuracy     | 1.0                | 0.9733333333333334 | 0.9733333333333334
log_loss     | 0.0170765586743167 | 0.0467609096927117 | 0.0465014165829424
precision    | 1.0                | 0.9423076923076923 | 0.9791666666666666
recall       | 1.0                | 0.98               | 0.94
f1_score     | 1.0                | 0.974974358974359  | 0.9643523316062176
mcc          | 1.0                | 0.9410092614535137 | 0.9398255470157904
informedness | 1.0                | 0.95               | 0.9299999999999999
markedness   | 1.0                | 0.9321036106750391 | 0.9497549019607843
csi          | 1.0                | 0.9245283018867925 | 0.9215686274509803
cutoff       | 0.78               | 0.443              | 0.506
Rows: 1-12 | Columns: 4
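Several of the per-class metrics in the report are derived from the same raw counts. As an illustrative sketch (not VerticaPy's API; the counts below are made up), here is how precision, recall, and F1 relate to true positives, false positives, and false negatives:

```python
# Illustrative only: deriving precision, recall, and F1 from raw
# true-positive (tp), false-positive (fp), and false-negative (fn) counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)          # of predicted positives, how many were right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for one class.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.75 0.818
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high, which makes it a useful single-number summary when the two are in tension.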

Our example forgoes splitting the data into training and testing, which is important for real-world work. Our main goal in this lesson is to look at the metrics used to evaluate classifications. The most famous metric is accuracy: generally speaking, the closer accuracy is to 1, the better the model is. However, taking metrics at face value can lead to incorrect interpretations.
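The idea behind a train/test split can be sketched in a few lines of plain Python (this is the concept only, not VerticaPy's API): hold out a random fraction of rows so the model is evaluated on data it never saw during training.

```python
import random

# Conceptual sketch of a train/test split (not VerticaPy's API).
def train_test_split(rows, test_size=0.2, seed=42):
    rows = list(rows)
    random.Random(seed).shuffle(rows)          # deterministic shuffle
    cut = int(len(rows) * (1 - test_size))     # index separating train from test
    return rows[:cut], rows[cut:]

# Example with 150 row indices, like the Iris dataset.
train, test = train_test_split(range(150), test_size=0.2)
print(len(train), len(test))  # 120 30
```

Evaluating on the held-out rows gives an unbiased estimate of how the model will perform on unseen data.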

For example, let's say our goal is to identify bank fraud. Fraudulent transactions are relatively rare, so let's say they represent less than 1% of the data. If we were to predict that there are no frauds at all, we'd still end up with an accuracy of over 99%. This is why ROC AUC and PRC AUC, which are less sensitive to class imbalance, are more robust metrics.
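This accuracy paradox is easy to demonstrate with a small made-up dataset (illustrative numbers only): a model that always predicts "not fraud" scores 99% accuracy while catching zero frauds.

```python
# Illustrative accuracy paradox: 1% of transactions are fraudulent (label 1).
labels = [1] * 10 + [0] * 990        # 10 frauds out of 1000 transactions
predictions = [0] * len(labels)      # trivial model: always predict "not fraud"

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
# Recall on the fraud class: fraction of actual frauds that were caught.
recall = sum(p == y == 1 for p, y in zip(predictions, labels)) / labels.count(1)
print(accuracy, recall)  # 0.99 0.0
```

The 99% accuracy hides the fact that the model is useless for its actual purpose, which is exactly what the recall of 0 exposes.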

That said, a good model is simply a model that solves the given problem. In that regard, any model that performs better than a random one is useful.