
verticapy.machine_learning.model_selection.hp_tuning.validation_curve#

verticapy.machine_learning.model_selection.hp_tuning.validation_curve(estimator: VerticaModel, param_name: str, param_range: list, input_relation: str | vDataFrame, X: str | list[str], y: str, metric: str = 'auto', cv: int = 3, average: Literal['binary', 'micro', 'macro', 'weighted'] = 'weighted', pos_label: bool | float | str | timedelta | datetime | None = None, cutoff: float = -1, std_coeff: float = 1, chart: PlottingBase | TableSample | Axes | mFigure | Highchart | Highstock | Figure | None = None, show: bool | None = False, **style_kwargs) TableSample#

Draws the validation curve.

Parameters#

estimator: VerticaModel

Vertica estimator with a fit method.

param_name: str

Parameter name.

param_range: list

Parameter range.

input_relation: SQLRelation

Relation used to train the model.

X: SQLColumns

List of the predictor columns.

y: str

Response column.

metric: str, optional

Metric used for model evaluation.

  • auto:

    logloss for classification & RMSE for regression.

For Classification

  • accuracy:

    Accuracy.

    \[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]
  • auc:

    Area Under the Curve (ROC).

    \[AUC = \int_{0}^{1} TPR(FPR) \, dFPR\]
  • ba:

    Balanced Accuracy.

    \[BA = \frac{TPR + TNR}{2}\]
  • bm:

    Informedness

    \[BM = TPR + TNR - 1\]
  • csi:

    Critical Success Index

    \[index = \frac{TP}{TP + FN + FP}\]
  • f1:

    F1 Score.

    \[F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}\]
  • fdr:

    False Discovery Rate

    \[FDR = 1 - PPV\]
  • fm:

    Fowlkes-Mallows index

    \[FM = \sqrt{PPV * TPR}\]
  • fnr:

    False Negative Rate

    \[FNR = \frac{FN}{FN + TP}\]
  • for:

    False Omission Rate

    \[FOR = 1 - NPV\]
  • fpr:

    False Positive Rate

    \[FPR = \frac{FP}{FP + TN}\]
  • logloss:

    Log Loss

    \[Loss = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)\]
  • lr+:

    Positive Likelihood Ratio.

    \[LR+ = \frac{TPR}{FPR}\]
  • lr-:

    Negative Likelihood Ratio.

    \[LR- = \frac{FNR}{TNR}\]
  • dor:

    Diagnostic Odds Ratio.

    \[DOR = \frac{TP \times TN}{FP \times FN}\]
  • mcc:

    Matthews Correlation Coefficient.

    \[MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\]
  • mk:

    Markedness

    \[MK = PPV + NPV - 1\]
  • npv:

    Negative Predictive Value

    \[NPV = \frac{TN}{TN + FN}\]
  • prc_auc:

    Area Under the Curve (PRC)

    \[AUC = \int_{0}^{1} Precision(Recall) \, dRecall\]
  • precision:

    Precision

    \[Precision = \frac{TP}{TP + FP}\]
  • pt:

    Prevalence Threshold.

    \[PT = \frac{\sqrt{FPR}}{\sqrt{TPR} + \sqrt{FPR}}\]
  • recall:

    Recall.

    \[Recall = \frac{TP}{TP + FN}\]
  • specificity:

    Specificity.

    \[Specificity = \frac{TN}{TN + FP}\]

For Regression

  • max:

    Max Error.

    \[ME = \max_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]
  • mae:

    Mean Absolute Error.

    \[MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]
  • median:

    Median Absolute Error.

    \[MedAE = \text{median}_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]
  • mse:

    Mean Squared Error.

    \[MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\]
  • msle:

    Mean Squared Log Error.

    \[MSLE = \frac{1}{n} \sum_{i=1}^{n} (\log(1 + y_i) - \log(1 + \hat{y}_i))^2\]
  • r2:

    R squared coefficient.

    \[R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\]
  • r2a:

    R2 adjusted

    \[\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}\]
  • var:

    Explained Variance.

    \[VAR = 1 - \frac{Var(y - \hat{y})}{Var(y)}\]
  • rmse:

    Root-mean-squared error

    \[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\]
cv: int, optional

Number of folds.

average: str, optional

The method used to compute the final score for multiclass classification (the averaging conventions are sketched after this list).

  • binary:

    considers one of the classes as positive and uses the binary confusion matrix to compute the score.

  • micro:

    computes the score globally by aggregating the positive and negative counts of all classes.

  • macro:

    average of the score of each class.

  • weighted:

    weighted average of the score of each class.
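
As a point of reference, and assuming the usual one-vs-rest convention (per-class score s_c computed from the binary confusion matrix of class c, with n_c of the n observations belonging to class c and C classes in total), the macro and weighted averages reduce to:

\[\text{macro} = \frac{1}{C} \sum_{c=1}^{C} s_c \qquad \text{weighted} = \sum_{c=1}^{C} \frac{n_c}{n} \, s_c\]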

pos_label: PythonScalar, optional

The main class to be considered as positive (classification only).

cutoff: float, optional

The model cutoff (classification only).

std_coeff: float, optional

Value of the standard deviation coefficient used to compute the area plot around each score.
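
Assuming the band plotted around each score is the cross-validation mean plus or minus std_coeff standard deviations of the score over the cv folds (a sketch of the convention, not an excerpt of the implementation), the bounds are:

\[score_{lower} = \bar{s} - \text{std\_coeff} \times \sigma_{s} \qquad score_{upper} = \bar{s} + \text{std\_coeff} \times \sigma_{s}\]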

chart: PlottingObject, optional

The chart object to plot on.

show: bool, optional

Whether to return only the chart as the output.

**style_kwargs

Any optional parameter to pass to the Plotting functions.

Returns#

TableSample

training_score_lower, training_score, training_score_upper, test_score_lower, test_score, test_score_upper

Examples#

Note

The example below is a very basic one. For more detailed examples and customization options, please see the Chart Gallery (chart_gallery.learning).

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

Let’s generate a dataset using the following code.

import random

import numpy as np

N = 500 # Number of Records

k = 10 # step

# Normal Distributions
x = np.random.normal(5, 1, round(N / 2))

y = np.random.normal(3, 1, round(N / 2))

z = np.random.normal(3, 1, round(N / 2))

# Creating a vDataFrame with two clusters
data = vp.vDataFrame({
    "x": np.concatenate([x, x + k]),
    "y": np.concatenate([y, y + k]),
    "z": np.concatenate([z, z + k]),
    "c": [random.randint(0, 1) for _ in range(N)]
})

Let’s proceed by creating a RandomForestClassifier model using the complete dataset.

# Importing the Vertica ML module
import verticapy.machine_learning.vertica as vml

# Importing the model selection module
import verticapy.machine_learning.model_selection as vms

# Defining the Model
model = vml.RandomForestClassifier()

Let’s draw the validation curve.

vms.validation_curve(
  model,
  param_name = "max_depth",
  param_range = [1, 2, 3],
  input_relation = data,
  X = ["x", "y", "z"],
  y = "c",
  cv = 3,
  metric = "auc",
  show = True,
)
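
The call also returns a TableSample containing the scores described in the Returns section. Below is a minimal sketch of how you might capture and inspect it; the variable names result and df are placeholders, and to_pandas is the TableSample conversion helper.

# Capture the returned TableSample instead of only drawing the chart
result = vms.validation_curve(
  model,
  param_name = "max_depth",
  param_range = [1, 2, 3],
  input_relation = data,
  X = ["x", "y", "z"],
  y = "c",
  cv = 3,
  metric = "auc",
  show = False,
)

# Column names follow the Returns section above
print(result["test_score"])

# Convert the whole TableSample to a pandas DataFrame for further analysis
df = result.to_pandas()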

Note

VerticaPy’s Validation Curve tool is an essential asset for evaluating machine learning models. It enables users to visualize a model’s performance by plotting key metrics against varying values of a hyperparameter. By analyzing these curves, data analysts can identify issues such as overfitting or underfitting, make informed decisions about hyperparameter settings, and optimize model performance. This feature plays a crucial role in enhancing model robustness and facilitating data-driven decision-making.

See also

learning_curve() : Draws the learning curve.