verticapy.machine_learning.metrics.classification_report#
- verticapy.machine_learning.metrics.classification_report(y_true: str | None = None, y_score: list | None = None, input_relation: str | vDataFrame | None = None, metrics: None | str | list[str] = None, labels: list | ndarray | None = None, cutoff: int | float | Decimal | None = None, nbins: int = 10000, estimator: VerticaModel | None = None) -> float | TableSample
Computes a classification report using multiple metrics (AUC, accuracy, PRC AUC, F1…). For multiclass classification, each category is treated in turn as the positive class (one-vs-rest), and the metrics are computed for each category.
Parameters#
- y_true: str
Response column.
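- y_score: list
Prediction columns. In binary classification, a list of two column names: the probability of the true value and the predicted value.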
- input_relation: SQLRelation
Relation to use for scoring. This relation can be a view, table, or a customized relation (if an alias is used at the end of the relation). For example: (SELECT … FROM …) x
- metrics: list, optional
List of the metrics used to compute the final report.
- accuracy:
Accuracy.
\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]
- aic:
Akaike’s Information Criterion
\[AIC = 2k - 2\ln(\hat{L})\]
- auc:
Area Under the Curve (ROC).
\[AUC = \int_{0}^{1} TPR(FPR) \, dFPR\]
- ba:
Balanced Accuracy.
\[BA = \frac{TPR + TNR}{2}\]
- best_cutoff:
Cutoff that optimizes the ROC curve prediction.
- bic:
Bayesian Information Criterion
\[BIC = -2\ln(\hat{L}) + k \ln(n)\]
- bm:
Informedness
\[BM = TPR + TNR - 1\]
- csi:
Critical Success Index
\[CSI = \frac{TP}{TP + FN + FP}\]
- f1:
F1 Score
\[F_1 Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}\]
- fdr:
False Discovery Rate
\[FDR = 1 - PPV\]
- fm:
Fowlkes-Mallows index
\[FM = \sqrt{PPV * TPR}\]
- fnr:
False Negative Rate
\[FNR = \frac{FN}{FN + TP}\]
- for:
False Omission Rate
\[FOR = 1 - NPV\]
- fpr:
False Positive Rate
\[FPR = \frac{FP}{FP + TN}\]
- logloss:
Log Loss.
\[Loss = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)\]
- lr+:
Positive Likelihood Ratio.
\[LR+ = \frac{TPR}{FPR}\]
- lr-:
Negative Likelihood Ratio.
\[LR- = \frac{FNR}{TNR}\]
- dor:
Diagnostic Odds Ratio.
\[DOR = \frac{TP \times TN}{FP \times FN}\]
- mc:
Matthews Correlation Coefficient
\[MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\]
- mk:
Markedness
\[MK = PPV + NPV - 1\]
- npv:
Negative Predictive Value
\[NPV = \frac{TN}{TN + FN}\]
- prc_auc:
Area Under the Curve (PRC)
\[AUC = \int_{0}^{1} Precision(Recall) \, dRecall\]
- precision:
Precision
\[Precision = \frac{TP}{TP + FP}\]
- pt:
Prevalence Threshold.
\[PT = \frac{\sqrt{FPR}}{\sqrt{TPR} + \sqrt{FPR}}\]
- recall:
Recall.
\[Recall = \frac{TP}{TP + FN}\]
- specificity:
Specificity.
\[Specificity = \frac{TN}{TN + FP}\]
- labels: ArrayLike, optional
List of the response column categories to use.
- cutoff: PythonNumber, optional
Probability cutoff above which the tested category is accepted as the prediction.
- nbins: int, optional
[Used to compute ROC AUC, PRC AUC and the best cutoff] An integer value that determines the number of decision boundaries. Decision boundaries are set at equally spaced intervals between 0 and 1, inclusive. Greater values for nbins give more precise estimations of the AUC, but can potentially decrease performance. The maximum value is 999,999. If negative, the maximum value is used.
- estimator: object, optional
Estimator used to compute the classification report.
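To make the nbins description above concrete, here is a minimal pure-Python sketch of a threshold-based ROC AUC estimate. It is not VerticaPy's implementation, which runs in the database. The helper name roc_auc_binned, the `>=` comparison at each decision boundary, and the trapezoidal rule are assumptions made for illustration. The toy data is the same as in the binary example on this page.

```python
def roc_auc_binned(y_true, y_prob, nbins=1000):
    """Approximate the ROC AUC from nbins + 1 equally spaced thresholds in [0, 1].

    Hypothetical helper for illustration only (not part of VerticaPy).
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    # Start at the (FPR, TPR) = (0, 0) corner, then lower the threshold
    # step by step so that FPR never decreases along the curve.
    points = [(0.0, 0.0)]
    for i in range(nbins, -1, -1):
        threshold = i / nbins
        # Assumption: a row is predicted positive when its probability >= threshold.
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal rule over consecutive ROC points.
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


y_true = [1, 1, 0, 0, 1]
y_prob = [0.8, 0.2, 0.1, 0.6, 0.8]
auc = roc_auc_binned(y_true, y_prob, nbins=1000)
print(round(auc, 4))  # 0.8333
```

Each threshold requires a pass over the data, so the cost of this sketch grows linearly with nbins. This mirrors the precision versus performance trade-off described for the parameter.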
Returns#
- TableSample
The classification report.
Examples#
We should first import verticapy.
import verticapy as vp
Binary Classification#
Let’s create a small dataset that has:
- true value
- probability of the true value
- predicted value
data = vp.vDataFrame(
    {
        "y_true": [1, 1, 0, 0, 1],
        "y_prob": [0.8, 0.2, 0.1, 0.6, 0.8],
        "y_pred": [1, 0, 0, 1, 1],
    },
)
Next, we import the metric:
from verticapy.machine_learning.metrics import classification_report
Now we can conveniently calculate the score:
classification_report(
    y_true = "y_true",
    y_score = ["y_prob", "y_pred"],
    input_relation = data,
)
Out[4]:
None          value
auc           0.8333333333333335
prc_auc       0.9027777777777779
accuracy      0.6
log_loss      0.267297505916969
precision     0.6666666666666666
recall        0.6666666666666666
f1_score      0.6666666666666666
mcc           0.16666666666666666
informedness  0.16666666666666652
markedness    0.16666666666666652
csi           0.5
Rows: 1-11 | Columns: 2
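As a cross-check of the report above, the confusion-matrix metrics can be recomputed by hand from the formulas in the Parameters section. The following is a minimal pure-Python sketch, not VerticaPy's implementation. The variable names are illustrative, and the base-10 logarithm in the log loss is an assumption inferred from the printed log_loss value.

```python
import math

# Same toy data as in the binary example.
y_true = [1, 1, 0, 0, 1]
y_prob = [0.8, 0.2, 0.1, 0.6, 0.8]
y_pred = [1, 0, 0, 1, 1]

# Confusion-matrix counts, with 1 as the positive class.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

precision = tp / (tp + fp)    # PPV
recall = tp / (tp + fn)       # TPR
specificity = tn / (tn + fp)  # TNR
npv = tn / (tn + fn)

report = {
    "accuracy": (tp + tn) / len(pairs),
    "precision": precision,
    "recall": recall,
    "f1_score": 2 * precision * recall / (precision + recall),
    "mcc": (tp * tn - fp * fn)
    / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    "informedness": recall + specificity - 1,
    "markedness": precision + npv - 1,
    "csi": tp / (tp + fn + fp),
}

# Assumption: base-10 logarithm, chosen because it matches the printed value.
report["log_loss"] = -sum(
    t * math.log10(p) + (1 - t) * math.log10(1 - p)
    for t, p in zip(y_true, y_prob)
) / len(y_true)

for name, value in report.items():
    print(f"{name:<13}{value:.6f}")
```

Here TP = 2, TN = 1, FP = 1 and FN = 1, which gives an accuracy of 3/5 and an MCC of 1/6, in agreement with the report.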
Important
In binary classification, y_score should be a list of two column names:
- Probability of the true value
- Prediction value
In the case of multi-class classification, y_score is a list of two elements:
- List of column names for the class probabilities, one for each class
- Prediction value
Note
For multi-class classification, we can select the average method for averaging from the following options:
- binary
- micro
- macro
- scores
- weighted
It is also possible to directly compute the score from the vDataFrame:
data.score(
    y_true = "y_true",
    y_score = ["y_prob", "y_pred"],
    metric = "classification_report",
)
Out[5]:
None          value
auc           0.8333333333333335
prc_auc       0.9027777777777779
accuracy      0.6
log_loss      0.267297505916969
precision     0.6666666666666666
recall        0.6666666666666666
f1_score      0.6666666666666666
mcc           0.16666666666666666
informedness  0.16666666666666652
markedness    0.16666666666666652
csi           0.5
Rows: 1-11 | Columns: 2
Note
VerticaPy uses simple SQL queries to compute various metrics. You can use the set_option() function with the sql_on parameter to enable SQL generation and examine the generated queries.
Multi-class Classification#
Let’s create a small dataset that has:
- true value with more than two classes
- probability of each class
- predicted value
data = vp.vDataFrame(
    {
        "y_true": [1, 2, 0, 0, 1],
        "y_prob_0": [0.1, 0.1, 0.1, 0.1, 0.1],
        "y_prob_1": [0.8, 0.6, 0.4, 0.6, 0.2],
        "y_prob_2": [0.1, 0.3, 0.5, 0.3, 0.7],
        "y_pred": [1, 2, 0, 1, 1],
    },
)
Next, we import the metric:
from verticapy.machine_learning.metrics import classification_report
Now we can conveniently calculate the score:
classification_report(
    y_true = "y_true",
    y_score = [["y_prob_0", "y_prob_1", "y_prob_1"], "y_pred"],
    labels = [0, 1, 2],
    input_relation = data,
)
Out[8]:
None          0                    1                    2                   \\
auc           0.5                  0.5                  0.625               \\
prc_auc       0.7                  0.6625               0.1666666666666665  \\
accuracy      0.8                  0.8                  1.0                 \\
log_loss      0.427454494336405    0.362721756860901    0.327503505049765   \\
precision     1.0                  0.6666666666666666   1.0                 \\
recall        0.5                  1.0                  1.0                 \\
f1_score      0.6666666666666666   0.8                  1.0                 \\
mcc           0.6123724356957946   0.6666666666666666   1.0                 \\
informedness  0.5                  0.6666666666666665   1.0                 \\
markedness    0.75                 0.6666666666666665   1.0                 \\
csi           0.5                  0.6666666666666666   1.0                 \\
None          avg_macro            avg_weighted         avg_micro
auc           0.5416666666666666   0.525                None
prc_auc       0.5097222222222221   0.5783333333333333   None
accuracy      0.8666666666666667   0.8400000000000001   0.8666666666666667
log_loss      0.37255991874902367  0.3815712014888754   None
precision     0.8888888888888888   0.8666666666666666   0.8
recall        0.8333333333333334   0.8                  0.8
f1_score      0.8222222222222223   0.7866666666666667   0.8000000000000002
mcc           0.759679700787487    0.7116156409449845   0.7
informedness  0.7222222222222222   0.6666666666666666   0.7000000000000002
markedness    0.8055555555555555   0.7666666666666666   0.7000000000000002
csi           0.7222222222222222   0.6666666666666666   0.6666666666666666
Rows: 1-11 | Columns: 7
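The per-class columns and the avg_macro, avg_weighted and avg_micro columns of the report above can be reproduced for precision and recall with a short pure-Python sketch. It treats each label in turn as the positive class (one-vs-rest) and then averages. This is an illustration under stated assumptions, not VerticaPy's implementation. The helper names are hypothetical, and the weighted average is assumed to use the number of true rows per class as weights.

```python
# Same toy data as in the multi-class example.
y_true = [1, 2, 0, 0, 1]
y_pred = [1, 2, 0, 1, 1]
labels = [0, 1, 2]


def one_vs_rest_counts(label):
    """Return (tp, fp, fn, tn) when `label` is treated as the positive class."""
    rows = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in rows if t == label and p == label)
    fp = sum(1 for t, p in rows if t != label and p == label)
    fn = sum(1 for t, p in rows if t == label and p != label)
    tn = len(rows) - tp - fp - fn
    return tp, fp, fn, tn


counts = {label: one_vs_rest_counts(label) for label in labels}
precision = {l: tp / (tp + fp) for l, (tp, fp, fn, tn) in counts.items()}
recall = {l: tp / (tp + fn) for l, (tp, fp, fn, tn) in counts.items()}
support = {l: tp + fn for l, (tp, fp, fn, tn) in counts.items()}  # true rows per class


def macro(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores.values()) / len(scores)


def weighted(scores):
    """Mean of the per-class scores weighted by class support (assumption)."""
    return sum(scores[l] * support[l] for l in labels) / sum(support.values())


# Micro average: pool the counts over all classes before computing the ratio.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_precision = total_tp / (total_tp + total_fp)
micro_recall = total_tp / (total_tp + total_fn)

print("precision:", macro(precision), weighted(precision), micro_precision)
print("recall:   ", macro(recall), weighted(recall), micro_recall)
```

The sketch assumes every class is predicted at least once. A class with no predicted rows would make its precision undefined and needs explicit handling.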
See also
vDataFrame.score(): Computes the input ML metric.