CONFUSION_MATRIX
Computes the confusion matrix of a table with observed and predicted values of a response variable. CONFUSION_MATRIX
produces a table with the following dimensions:
- Rows: Number of classes
- Columns: Number of classes + 2
Syntax
CONFUSION_MATRIX ( targets, predictions [ USING PARAMETERS num_classes=num‑classes ] ) OVER()
Arguments
targets |
An input column that contains the true values of the response variable. |
predictions |
An input column that contains the predicted class labels. |
Arguments targets and predictions must be set to input columns of the same data type, one of the following: INTEGER, BOOLEAN, or CHAR/VARCHAR. Depending on their data type, these columns identify classes as follows:
- INTEGER: Zero-based consecutive integers between 0 and (num-classes-1) inclusive, where num-classes is the number of classes. For example, given the following input column values—
{0, 1, 2, 3, 4
}—Vertica assumes five classes.If input column values are not consecutive, Vertica interpolates the missing values. Thus, given the following input values—
{0, 1, 3, 5, 6,}
— Vertica assumes seven classes. - BOOLEAN: Yes or No
- CHAR/VARCHAR: Class names. If the input columns are of type CHAR/VARCHAR columns, you must also set parameter
num_classes
to the number of classes.Vertica computes the number of classes as the union of values in both input columns. For example, given the following sets of values in the targets and predictions input columns, Vertica counts four classes:
{'milk', 'soy milk', 'cream'} {'soy milk', 'almond milk'}
Parameter Settings
Parameter name | Set to… |
---|---|
num_classes
|
An integer > 1, specifies the number of classes to pass to the function. You must set this parameter if the specified input columns are of type CHAR/VARCHAR. Otherwise, the function processes this parameter according to the column data types:
|
Examples
This example computes the confusion matrix for a logistic regression model that classifies cars in the mtcars
data set as automatic or manual transmission. Observed values are in input column obs
, while predicted values are in input column pred
. Because this is a binary classification problem, all values are either 0 or 1.
In the table returned, all 19 cars with a value of 0 in column am
are correctly predicted by PREDICT_LOGISTIC_REGRESSION
as having a value of 0. Of the 13 cars with a value of 1 in column am
, 12 are correctly predicted to have a value of 1, while 1 car is incorrectly classified as having a value of 0:
=> SELECT CONFUSION_MATRIX(obs::int, pred::int USING PARAMETERS num_classes=2) OVER() FROM (SELECT am AS obs, PREDICT_LOGISTIC_REG(mpg, cyl, disp,drat, wt, qsec, vs, gear, carb USING PARAMETERS model_name='myLogisticRegModel')AS PRED FROM mtcars) AS prediction_output; actual_class | predicted_0 | predicted_1 | comment -------------+-------------+-------------+------------------------------------------ 0 | 19 | 0 | 1 | 0 | 13 | Of 32 rows, 32 were used and 0 were ignored (2 rows)