verticapy.machine_learning.model_selection.learning_curve

Draws the learning curve.

Parameters

estimator: object
Vertica estimator with a fit method.

input_relation: SQLRelation
Relation used to train the model.

X: SQLColumns
List of the predictor columns.

y: str
Response column.

sizes: list, optional
Different sizes of the dataset used to train the model. Multiple models are trained using the different sizes.

method: str, optional
Method used to plot the curve.

efficiency:
Draws train/test score vs sample size.

performance:
Draws score vs time.

scalability:
Draws time vs sample size.

metric: str, optional
Metric used to do the model evaluation.

auto:
Logloss for classification and RMSE for regression.

For Classification:
accuracy:
Accuracy.

\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]

auc:
Area Under the Curve (ROC).

\[AUC = \int_{0}^{1} TPR(FPR) \, dFPR\]

ba:
Balanced Accuracy.

\[BA = \frac{TPR + TNR}{2}\]

bm:
Informedness

\[BM = TPR + TNR - 1\]

csi:
Critical Success Index

\[CSI = \frac{TP}{TP + FN + FP}\]

f1:
F1 Score

\[F_1 Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}\]

fdr:
False Discovery Rate

\[FDR = 1 - PPV\]

fm:
Fowlkes-Mallows index

\[FM = \sqrt{PPV * TPR}\]

fnr:
False Negative Rate

\[FNR = \frac{FN}{FN + TP}\]

for:
False Omission Rate

\[FOR = 1 - NPV\]

fpr:
False Positive Rate

\[FPR = \frac{FP}{FP + TN}\]

logloss:
Log Loss

\[Loss = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)\]

lr+:
Positive Likelihood Ratio.

\[LR+ = \frac{TPR}{FPR}\]

lr-:
Negative Likelihood Ratio.

\[LR- = \frac{FNR}{TNR}\]

dor:
Diagnostic Odds Ratio.

\[DOR = \frac{TP \times TN}{FP \times FN}\]

mcc:
Matthews Correlation Coefficient

\[MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\]

mk:
Markedness

\[MK = PPV + NPV - 1\]

npv:
Negative Predictive Value

\[NPV = \frac{TN}{TN + FN}\]

prc_auc:
Area Under the Curve (PRC)

\[AUC = \int_{0}^{1} Precision(Recall) \, dRecall\]

precision:
Precision

\[TP / (TP + FP)\]

pt:
Prevalence Threshold.

\[\frac{\sqrt{FPR}}{\sqrt{TPR} + \sqrt{FPR}}\]

recall:
Recall.

\[TP / (TP + FN)\]

specificity:
Specificity.

\[TN / (TN + FP)\]

For Regression:

max:
Max Error.

\[ME = \max_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]

mae:
Mean Absolute Error.

\[MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]

median:
Median Absolute Error.

\[MedAE = \text{median}_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]

mse:
Mean Squared Error.

\[MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\]

msle:
Mean Squared Log Error.

\[MSLE = \frac{1}{n} \sum_{i=1}^{n} (\log(1 + y_i) - \log(1 + \hat{y}_i))^2\]

r2:
R squared coefficient.

\[R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\]

r2a:
R2 adjusted

\[\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}\]

var:
Explained Variance.

\[VAR = 1 - \frac{Var(y - \hat{y})}{Var(y)}\]

rmse:
Root-mean-squared error

\[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\]

cv: int, optional
Number of folds.

average: str, optional
The method used to compute the final score for multiclass-classification.

binary:
Considers one of the classes as positive and uses the binary confusion matrix to compute the score.

micro:
Counts positive and negative values globally.

macro:
Average of the score of each class.

weighted:
Weighted average of the score of each class.

pos_label: PythonScalar, optional
The main class to be considered as positive (classification only).

cutoff: float, optional
The model cutoff (classification only).

std_coeff: PythonNumber, optional
Value of the standard deviation coefficient used to compute the area plot around each score.

chart: PlottingObject, optional
The chart object to plot on.

return_chart: bool, optional
Whether to return only the chart as the output.

**style_kwargs
Any optional parameter to pass to the Plotting functions.

Returns

TableSample
result of the learning curve.
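
As a rough illustration of the classification formulas listed under metric, the sketch below computes a few of them from raw confusion-matrix counts in plain Python. The function name binary_metrics and the example counts are hypothetical and not part of VerticaPy; VerticaPy computes these scores in-database, and this snippet only mirrors the formulas above.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute a few of the listed metrics from confusion-matrix counts.

    Hypothetical helper for illustration only. It assumes every
    denominator is non-zero and does no input validation.
    """
    tpr = tp / (tp + fn)  # recall (true positive rate)
    tnr = tn / (tn + fp)  # specificity (true negative rate)
    ppv = tp / (tp + fp)  # precision (positive predictive value)
    npv = tn / (tn + fn)  # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "ba": (tpr + tnr) / 2,
        "bm": tpr + tnr - 1,
        "csi": tp / (tp + fn + fp),
        "f1": 2 * ppv * tpr / (ppv + tpr),
        "fm": math.sqrt(ppv * tpr),
        "mk": ppv + npv - 1,
    }


# Made-up counts, chosen only to exercise the formulas.
scores = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The rate-based metrics (ba, bm, fm, mk) are all built from the same four rates, which is why the helper computes TPR, TNR, PPV and NPV once and reuses them.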

Examples

Note

The example below is very basic. For more detailed examples and customization options, see chart_gallery.learning.

We import verticapy:
import verticapy as vp
Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

Let’s generate a dataset using the following data.
import random

import numpy as np

N = 500 # Number of Records

k = 10 # step

# Normal Distributions
x = np.random.normal(5, 1, round(N / 2))

y = np.random.normal(3, 1, round(N / 2))

z = np.random.normal(3, 1, round(N / 2))

# Creating a vDataFrame with two clusters
data = vp.vDataFrame({
    "x": np.concatenate([x, x + k]),
    "y": np.concatenate([y, y + k]),
    "z": np.concatenate([z, z + k]),
    "c": [random.randint(0, 1) for _ in range(N)]
})
Let’s proceed by defining a RandomForestClassifier model. It is trained later, when the learning curve is drawn.
# Importing the Vertica ML module
import verticapy.machine_learning.vertica as vml

# Importing the model selection module
import verticapy.machine_learning.model_selection as vms

# Defining the Model
model = vml.RandomForestClassifier()
Let’s draw the learning curve.
vms.learning_curve(
    model,
    data,
    X = ["x", "y", "z"],
    y = "c",
    method = "efficiency",
    cv = 3,
    metric = "auc",
    return_chart = True,
)
Note

VerticaPy’s learning curve plots key metrics against varying training dataset sizes. Analyzing these curves helps identify issues such as overfitting or underfitting, decide whether a larger dataset is worthwhile, and tune model performance.
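
To make the idea concrete, here is a minimal NumPy-only sketch of what an "efficiency" learning curve measures: for each training size, fit a model on a fraction of each cross-validation training fold, then average the train and test scores over the folds. It is not VerticaPy's implementation. The function learning_curve_sketch, the least-squares model, the example sizes, and the band definition (mean plus or minus std_coeff times the standard deviation across folds) are all assumptions made for illustration.

```python
import numpy as np


def _rmse(y_true, y_pred):
    """Root-mean-squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def learning_curve_sketch(X, y, sizes=(0.1, 0.5, 1.0), cv=3, std_coeff=1.0, seed=0):
    """Hypothetical helper: train/test RMSE vs. training size with k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), cv)
    rows = []
    for size in sizes:
        train_scores, test_scores = [], []
        for i in range(cv):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(cv) if j != i])
            # Keep only a fraction of the (already shuffled) training fold.
            m = max(2, int(round(size * len(train_idx))))
            sub = train_idx[:m]
            # Ordinary least squares with an intercept column.
            A = np.column_stack([np.ones(m), X[sub]])
            coef, *_ = np.linalg.lstsq(A, y[sub], rcond=None)
            B = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
            train_scores.append(_rmse(y[sub], A @ coef))
            test_scores.append(_rmse(y[test_idx], B @ coef))
        test_mean, test_std = np.mean(test_scores), np.std(test_scores)
        rows.append({
            "n": m,
            "train_mean": float(np.mean(train_scores)),
            "test_mean": float(test_mean),
            # Assumed band: mean +/- std_coeff * std across folds.
            "test_lower": float(test_mean - std_coeff * test_std),
            "test_upper": float(test_mean + std_coeff * test_std),
        })
    return rows


# Synthetic regression data, invented for this sketch.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

curve = learning_curve_sketch(X, y)
for row in curve:
    print(row)
```

A test score that keeps improving as n grows suggests more data would help, while a train score much better than the test score at small n is the usual sign of overfitting.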

See also

validation_curve() : Draws the validation curve.