verticapy.machine_learning.model_selection.cross_validate

verticapy.machine_learning.model_selection.cross_validate(estimator: VerticaModel, input_relation: str | vDataFrame, X: str | list[str], y: str, metrics: None | str | list[str] = None, cv: int = 3, average: Literal['binary', 'micro', 'macro', 'weighted'] = 'weighted', pos_label: bool | float | str | timedelta | datetime | None = None, cutoff: int | float | Decimal = -1, show_time: bool = True, training_score: bool = False, **kwargs) -> TableSample

Computes the K-Fold cross validation of an estimator.

estimator: object

Vertica estimator with a fit method.

input_relation: SQLRelation

Relation used to train the model.

X: SQLColumns

List of the predictor columns.

y: str

Response column.

metrics: str | list, optional

Metric or list of metrics used to evaluate the model. If empty, most of the estimator metrics are computed.

For Classification:

  • accuracy:

    Accuracy.

    \[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]
  • auc:

    Area Under the Curve (ROC).

    \[AUC = \int_{0}^{1} TPR(FPR) \, dFPR\]
  • ba:

    Balanced Accuracy.

    \[BA = \frac{TPR + TNR}{2}\]
  • bm:

    Informedness

    \[BM = TPR + TNR - 1\]
  • csi:

    Critical Success Index

    \[CSI = \frac{TP}{TP + FN + FP}\]
  • f1:

    F1 Score.

    \[F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}\]
  • fdr:

    False Discovery Rate

    \[FDR = 1 - PPV\]
  • fm:

    Fowlkes-Mallows index

    \[FM = \sqrt{PPV * TPR}\]
  • fnr:

    False Negative Rate

    \[FNR = \frac{FN}{FN + TP}\]
  • for:

    False Omission Rate

    \[FOR = 1 - NPV\]
  • fpr:

    False Positive Rate

    \[FPR = \frac{FP}{FP + TN}\]
  • logloss:

    Log Loss

    \[Loss = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)\]
  • lr+:

    Positive Likelihood Ratio.

    \[LR+ = \frac{TPR}{FPR}\]
  • lr-:

    Negative Likelihood Ratio.

    \[LR- = \frac{FNR}{TNR}\]
  • dor:

    Diagnostic Odds Ratio.

    \[DOR = \frac{TP \times TN}{FP \times FN}\]
  • mcc:

    Matthews Correlation Coefficient

    \[MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\]
  • mk:

    Markedness

    \[MK = PPV + NPV - 1\]
  • npv:

    Negative Predictive Value

    \[NPV = \frac{TN}{TN + FN}\]
  • prc_auc:

    Area Under the Curve (PRC)

    \[AUC = \int_{0}^{1} Precision(Recall) \, dRecall\]
  • precision:

    Precision

    \[Precision = \frac{TP}{TP + FP}\]
  • pt:

    Prevalence Threshold.

    \[\frac{\sqrt{FPR}}{\sqrt{TPR} + \sqrt{FPR}}\]
  • recall:

    Recall.

    \[Recall = \frac{TP}{TP + FN}\]
  • specificity:

    Specificity.

    \[Specificity = \frac{TN}{TN + FP}\]
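
Several of the classification formulas above share the same four confusion-matrix counts. The following is a minimal pure-Python sketch that evaluates a handful of them; it is not VerticaPy code. The function name `binary_metrics` and the counts are hypothetical, and all denominators are assumed to be non-zero.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive a few of the listed scores from binary confusion-matrix counts.

    Hypothetical helper for illustration; assumes every denominator is non-zero.
    """
    tpr = tp / (tp + fn)  # recall (true positive rate)
    tnr = tn / (tn + fp)  # specificity (true negative rate)
    ppv = tp / (tp + fp)  # precision (positive predictive value)
    npv = tn / (tn + fn)  # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "ba": (tpr + tnr) / 2,
        "bm": tpr + tnr - 1,
        "mk": ppv + npv - 1,
        "csi": tp / (tp + fn + fp),
        "f1": 2 * ppv * tpr / (ppv + tpr),
        "fm": math.sqrt(ppv * tpr),
        "precision": ppv,
        "recall": tpr,
        "specificity": tnr,
    }


# Made-up counts, chosen only to keep the arithmetic easy to follow.
scores = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print(scores["accuracy"], scores["recall"], scores["specificity"])
```

With these counts, accuracy is 85/100, recall is 40/50 and specificity is 45/50, which can be checked by hand against the formulas.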

For Regression:

  • max:

    Max Error.

    \[ME = \max_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]
  • mae:

    Mean Absolute Error.

    \[MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]
  • median:

    Median Absolute Error.

    \[MedAE = \text{median}_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]
  • mse:

    Mean Squared Error.

    \[MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\]
  • msle:

    Mean Squared Log Error.

    \[MSLE = \frac{1}{n} \sum_{i=1}^{n} (\log(1 + y_i) - \log(1 + \hat{y}_i))^2\]
  • r2:

    R squared coefficient.

    \[R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\]
  • r2a:

    R2 adjusted

    \[\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}\]
  • var:

    Explained Variance.

    \[VAR = 1 - \frac{Var(y - \hat{y})}{Var(y)}\]
  • rmse:

    Root-mean-squared error

    \[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\]
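
As a rough illustration of the regression formulas above, the sketch below computes five of them on a tiny made-up sample in plain Python. The helper `regression_metrics` and the data are assumptions for this example only; VerticaPy computes these scores in-database.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute max error, MAE, MSE, RMSE and R2 for two equal-length lists.

    Hypothetical helper for illustration only.
    """
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    y_bar = sum(y_true) / n                          # mean of the observed values
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "max": max(abs(e) for e in errors),
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1 - ss_res / ss_tot,
    }


# Made-up observations and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
reg_scores = regression_metrics(y_true, y_pred)
print(reg_scores)
```

The residuals here are 1, 0, -1 and -2, so the max error is 2, MAE is 1, MSE is 1.5 and R2 is 1 - 6/20.
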
cv: int, optional

Number of folds.
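
To make the role of cv concrete, here is a small sketch of how rows can be partitioned into cv folds, with each fold serving once as the test set. It uses a simple round-robin assignment, which is an assumption made for readability; how VerticaPy actually assigns rows to folds inside the database is not described here. The name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_rows: int, cv: int = 3):
    """Yield (train_idx, test_idx) pairs, one per fold.

    Rows are assigned to folds round-robin (row i goes to fold i % cv).
    This is an illustrative scheme, not necessarily the one VerticaPy uses.
    """
    fold_of_row = [i % cv for i in range(n_rows)]
    for k in range(cv):
        test_idx = [i for i in range(n_rows) if fold_of_row[i] == k]
        train_idx = [i for i in range(n_rows) if fold_of_row[i] != k]
        yield train_idx, test_idx


folds = list(k_fold_indices(7, cv=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row lands in exactly one test fold, so each observation is used for validation once and for training cv - 1 times.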

average: str, optional

The method used to compute the final score for multiclass-classification.

  • binary:

    considers one of the classes as positive and uses the binary confusion matrix to compute the score.

  • micro:

    computes the score from the positive and negative counts aggregated globally across all classes.

  • macro:

    average of the score of each class.

  • weighted:

    weighted average of the score of each class.
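
The difference between the averaging modes is easiest to see on numbers. The sketch below applies micro, macro and weighted averaging to per-class precision. Two things are assumptions: the helper `averaged_precision` is hypothetical, and "weighted" is taken to mean weighting each class by its number of true rows (its support), which the description above does not spell out. The binary mode is omitted because it reduces to the ordinary two-class formula.

```python
def averaged_precision(per_class: dict, average: str = "weighted") -> float:
    """Combine per-class precision into one score.

    per_class maps a label to (tp, fp, support), where support is the
    number of rows whose true class is that label (assumed weighting).
    """
    if average == "micro":
        # Pool the counts of all classes, then apply the formula once.
        tp = sum(tp for tp, _, _ in per_class.values())
        fp = sum(fp for _, fp, _ in per_class.values())
        return tp / (tp + fp)
    scores = {label: tp / (tp + fp) for label, (tp, fp, _) in per_class.items()}
    if average == "macro":
        # Plain mean of the per-class scores.
        return sum(scores.values()) / len(scores)
    if average == "weighted":
        # Mean of the per-class scores, weighted by class support.
        total = sum(support for _, _, support in per_class.values())
        return sum(scores[label] * per_class[label][2] for label in per_class) / total
    raise ValueError(f"unsupported average: {average!r}")


# Made-up three-class counts: label -> (tp, fp, support).
counts = {"a": (9, 3, 12), "b": (2, 2, 5), "c": (3, 1, 3)}
micro = averaged_precision(counts, "micro")
macro = averaged_precision(counts, "macro")
weighted = averaged_precision(counts, "weighted")
print(micro, macro, weighted)
```

The per-class precisions are 0.75, 0.5 and 0.75, so the three modes give different answers: 14/20 for micro, 2/3 for macro, and 13.75/20 for weighted.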

pos_label: PythonScalar, optional

The main class to be considered as positive (classification only).

cutoff: PythonNumber, optional

The model cutoff (classification only).

show_time: bool, optional

If set to True, the time and the average time are added to the report.

training_score: bool, optional

If set to True, the training score is computed in addition to the validation score.

TableSample

Result of the cross validation.

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will use the Wine Quality dataset.

import verticapy.datasets as vpd

data = vpd.load_winequality()
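row | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | good | color
type | Numeric(8) | Numeric(9) | Numeric(8) | Numeric(9) | Float(22) | Numeric(9) | Numeric(9) | Float(22) | Numeric(8) | Numeric(8) | Float(22) | Integer | Integer | Varchar(20)
1 | 3.8 | 0.31 | 0.02 | 11.1 | 0.036 | 20.0 | 114.0 | 0.99248 | 3.75 | 0.44 | 12.4 | 6 | 0 | white
2 | 3.9 | 0.225 | 0.4 | 4.2 | 0.03 | 29.0 | 118.0 | 0.989 | 3.57 | 0.36 | 12.8 | 8 | 1 | white
3 | 4.2 | 0.17 | 0.36 | 1.8 | 0.029 | 93.0 | 161.0 | 0.98999 | 3.65 | 0.89 | 12.0 | 7 | 1 | white
4 | 4.2 | 0.215 | 0.23 | 5.1 | 0.041 | 64.0 | 157.0 | 0.99688 | 3.42 | 0.44 | 8.0 | 3 | 0 | white
5 | 4.4 | 0.32 | 0.39 | 4.3 | 0.03 | 31.0 | 127.0 | 0.98904 | 3.46 | 0.36 | 12.8 | 8 | 1 | white
6 | 4.4 | 0.46 | 0.1 | 2.8 | 0.024 | 31.0 | 111.0 | 0.98816 | 3.48 | 0.34 | 13.1 | 6 | 0 | white
7 | 4.4 | 0.54 | 0.09 | 5.1 | 0.038 | 52.0 | 97.0 | 0.99022 | 3.41 | 0.4 | 12.2 | 7 | 1 | white
8 | 4.5 | 0.19 | 0.21 | 0.95 | 0.033 | 89.0 | 159.0 | 0.99332 | 3.34 | 0.42 | 8.0 | 5 | 0 | white
9 | 4.6 | 0.445 | 0.0 | 1.4 | 0.053 | 11.0 | 178.0 | 0.99426 | 3.79 | 0.55 | 10.2 | 5 | 0 | white
10 | 4.6 | 0.52 | 0.15 | 2.1 | 0.054 | 8.0 | 65.0 | 0.9934 | 3.9 | 0.56 | 13.1 | 4 | 0 | red
11 | 4.7 | 0.145 | 0.29 | 1.0 | 0.042 | 35.0 | 90.0 | 0.9908 | 3.76 | 0.49 | 11.3 | 6 | 0 | white
12 | 4.7 | 0.335 | 0.14 | 1.3 | 0.036 | 69.0 | 168.0 | 0.99212 | 3.47 | 0.46 | 10.5 | 5 | 0 | white
13 | 4.7 | 0.455 | 0.18 | 1.9 | 0.036 | 33.0 | 106.0 | 0.98746 | 3.21 | 0.83 | 14.0 | 7 | 1 | white
14 | 4.7 | 0.6 | 0.17 | 2.3 | 0.058 | 17.0 | 106.0 | 0.9932 | 3.85 | 0.6 | 12.9 | 6 | 0 | red
15 | 4.7 | 0.67 | 0.09 | 1.0 | 0.02 | 5.0 | 9.0 | 0.98722 | 3.3 | 0.34 | 13.6 | 5 | 0 | white
16 | 4.7 | 0.785 | 0.0 | 3.4 | 0.036 | 23.0 | 134.0 | 0.98981 | 3.53 | 0.92 | 13.8 | 6 | 0 | white
17 | 4.8 | 0.13 | 0.32 | 1.2 | 0.042 | 40.0 | 98.0 | 0.9898 | 3.42 | 0.64 | 11.8 | 7 | 1 | white
18 | 4.8 | 0.17 | 0.28 | 2.9 | 0.03 | 22.0 | 111.0 | 0.9902 | 3.38 | 0.34 | 11.3 | 7 | 1 | white
19 | 4.8 | 0.21 | 0.21 | 10.2 | 0.037 | 17.0 | 112.0 | 0.99324 | 3.66 | 0.48 | 12.2 | 7 | 1 | white
20 | 4.8 | 0.225 | 0.38 | 1.2 | 0.074 | 47.0 | 130.0 | 0.99132 | 3.31 | 0.4 | 10.3 | 6 | 0 | white
21 | 4.8 | 0.26 | 0.23 | 10.6 | 0.034 | 23.0 | 111.0 | 0.99274 | 3.46 | 0.28 | 11.5 | 7 | 1 | white
22 | 4.8 | 0.29 | 0.23 | 1.1 | 0.044 | 38.0 | 180.0 | 0.98924 | 3.28 | 0.34 | 11.9 | 6 | 0 | white
23 | 4.8 | 0.33 | 0.0 | 6.5 | 0.028 | 34.0 | 163.0 | 0.9937 | 3.35 | 0.61 | 9.9 | 5 | 0 | white
24 | 4.8 | 0.34 | 0.0 | 6.5 | 0.028 | 33.0 | 163.0 | 0.9939 | 3.36 | 0.61 | 9.9 | 6 | 0 | white
25 | 4.8 | 0.65 | 0.12 | 1.1 | 0.013 | 4.0 | 10.0 | 0.99246 | 3.32 | 0.36 | 13.5 | 4 | 0 | white
26 | 4.9 | 0.235 | 0.27 | 11.75 | 0.03 | 34.0 | 118.0 | 0.9954 | 3.07 | 0.5 | 9.4 | 6 | 0 | white
27 | 4.9 | 0.33 | 0.31 | 1.2 | 0.016 | 39.0 | 150.0 | 0.98713 | 3.33 | 0.59 | 14.0 | 8 | 1 | white
28 | 4.9 | 0.335 | 0.14 | 1.3 | 0.036 | 69.0 | 168.0 | 0.99212 | 3.47 | 0.46 | 10.4666666666667 | 5 | 0 | white
29 | 4.9 | 0.335 | 0.14 | 1.3 | 0.036 | 69.0 | 168.0 | 0.99212 | 3.47 | 0.46 | 10.4666666666667 | 5 | 0 | white
30 | 4.9 | 0.345 | 0.34 | 1.0 | 0.068 | 32.0 | 143.0 | 0.99138 | 3.24 | 0.4 | 10.1 | 5 | 0 | white
31 | 4.9 | 0.345 | 0.34 | 1.0 | 0.068 | 32.0 | 143.0 | 0.99138 | 3.24 | 0.4 | 10.1 | 5 | 0 | white
32 | 4.9 | 0.42 | 0.0 | 2.1 | 0.048 | 16.0 | 42.0 | 0.99154 | 3.71 | 0.74 | 14.0 | 7 | 1 | red
33 | 4.9 | 0.47 | 0.17 | 1.9 | 0.035 | 60.0 | 148.0 | 0.98964 | 3.27 | 0.35 | 11.5 | 6 | 0 | white
34 | 5.0 | 0.17 | 0.56 | 1.5 | 0.026 | 24.0 | 115.0 | 0.9906 | 3.48 | 0.39 | 10.8 | 7 | 1 | white
35 | 5.0 | 0.2 | 0.4 | 1.9 | 0.015 | 20.0 | 98.0 | 0.9897 | 3.37 | 0.55 | 12.05 | 6 | 0 | white
36 | 5.0 | 0.235 | 0.27 | 11.75 | 0.03 | 34.0 | 118.0 | 0.9954 | 3.07 | 0.5 | 9.4 | 6 | 0 | white
37 | 5.0 | 0.24 | 0.19 | 5.0 | 0.043 | 17.0 | 101.0 | 0.99438 | 3.67 | 0.57 | 10.0 | 5 | 0 | white
38 | 5.0 | 0.24 | 0.21 | 2.2 | 0.039 | 31.0 | 100.0 | 0.99098 | 3.69 | 0.62 | 11.7 | 6 | 0 | white
39 | 5.0 | 0.24 | 0.34 | 1.1 | 0.034 | 49.0 | 158.0 | 0.98774 | 3.32 | 0.32 | 13.1 | 7 | 1 | white
40 | 5.0 | 0.255 | 0.22 | 2.7 | 0.043 | 46.0 | 153.0 | 0.99238 | 3.75 | 0.76 | 11.3 | 6 | 0 | white
41 | 5.0 | 0.27 | 0.32 | 4.5 | 0.032 | 58.0 | 178.0 | 0.98956 | 3.45 | 0.31 | 12.6 | 7 | 1 | white
42 | 5.0 | 0.27 | 0.32 | 4.5 | 0.032 | 58.0 | 178.0 | 0.98956 | 3.45 | 0.31 | 12.6 | 7 | 1 | white
43 | 5.0 | 0.27 | 0.4 | 1.2 | 0.076 | 42.0 | 124.0 | 0.99204 | 3.32 | 0.47 | 10.1 | 6 | 0 | white
44 | 5.0 | 0.29 | 0.54 | 5.7 | 0.035 | 54.0 | 155.0 | 0.98976 | 3.27 | 0.34 | 12.9 | 8 | 1 | white
45 | 5.0 | 0.3 | 0.33 | 3.7 | 0.03 | 54.0 | 173.0 | 0.9887 | 3.36 | 0.3 | 13.0 | 7 | 1 | white
46 | 5.0 | 0.31 | 0.0 | 6.4 | 0.046 | 43.0 | 166.0 | 0.994 | 3.3 | 0.63 | 9.9 | 6 | 0 | white
47 | 5.0 | 0.33 | 0.16 | 1.5 | 0.049 | 10.0 | 97.0 | 0.9917 | 3.48 | 0.44 | 10.7 | 6 | 0 | white
48 | 5.0 | 0.33 | 0.16 | 1.5 | 0.049 | 10.0 | 97.0 | 0.9917 | 3.48 | 0.44 | 10.7 | 6 | 0 | white
49 | 5.0 | 0.33 | 0.16 | 1.5 | 0.049 | 10.0 | 97.0 | 0.9917 | 3.48 | 0.44 | 10.7 | 6 | 0 | white
50 | 5.0 | 0.33 | 0.18 | 4.6 | 0.032 | 40.0 | 124.0 | 0.99114 | 3.18 | 0.4 | 11.0 | 6 | 0 | white
51 | 5.0 | 0.33 | 0.23 | 11.8 | 0.03 | 23.0 | 158.0 | 0.99322 | 3.41 | 0.64 | 11.8 | 6 | 0 | white
52 | 5.0 | 0.35 | 0.25 | 7.8 | 0.031 | 24.0 | 116.0 | 0.99241 | 3.39 | 0.4 | 11.3 | 6 | 0 | white
53 | 5.0 | 0.35 | 0.25 | 7.8 | 0.031 | 24.0 | 116.0 | 0.99241 | 3.39 | 0.4 | 11.3 | 6 | 0 | white
54 | 5.0 | 0.38 | 0.01 | 1.6 | 0.048 | 26.0 | 60.0 | 0.99084 | 3.7 | 0.75 | 14.0 | 6 | 0 | red
55 | 5.0 | 0.4 | 0.5 | 4.3 | 0.046 | 29.0 | 80.0 | 0.9902 | 3.49 | 0.66 | 13.6 | 6 | 0 | red
56 | 5.0 | 0.42 | 0.24 | 2.0 | 0.06 | 19.0 | 50.0 | 0.9917 | 3.72 | 0.74 | 14.0 | 8 | 1 | red
57 | 5.0 | 0.44 | 0.04 | 18.6 | 0.039 | 38.0 | 128.0 | 0.9985 | 3.37 | 0.57 | 10.2 | 6 | 0 | white
58 | 5.0 | 0.455 | 0.18 | 1.9 | 0.036 | 33.0 | 106.0 | 0.98746 | 3.21 | 0.83 | 14.0 | 7 | 1 | white
59 | 5.0 | 0.55 | 0.14 | 8.3 | 0.032 | 35.0 | 164.0 | 0.9918 | 3.53 | 0.51 | 12.5 | 8 | 1 | white
60 | 5.0 | 0.61 | 0.12 | 1.3 | 0.009 | 65.0 | 100.0 | 0.9874 | 3.26 | 0.37 | 13.5 | 5 | 0 | white
61 | 5.0 | 0.74 | 0.0 | 1.2 | 0.041 | 16.0 | 46.0 | 0.99258 | 4.01 | 0.59 | 12.5 | 6 | 0 | red
62 | 5.0 | 1.02 | 0.04 | 1.4 | 0.045 | 41.0 | 85.0 | 0.9938 | 3.75 | 0.48 | 10.5 | 4 | 0 | red
63 | 5.0 | 1.04 | 0.24 | 1.6 | 0.05 | 32.0 | 96.0 | 0.9934 | 3.74 | 0.62 | 11.5 | 5 | 0 | red
64 | 5.1 | 0.11 | 0.32 | 1.6 | 0.028 | 12.0 | 90.0 | 0.99008 | 3.57 | 0.52 | 12.2 | 6 | 0 | white
65 | 5.1 | 0.14 | 0.25 | 0.7 | 0.039 | 15.0 | 89.0 | 0.9919 | 3.22 | 0.43 | 9.2 | 6 | 0 | white
66 | 5.1 | 0.165 | 0.22 | 5.7 | 0.047 | 42.0 | 146.0 | 0.9934 | 3.18 | 0.55 | 9.9 | 6 | 0 | white
67 | 5.1 | 0.21 | 0.28 | 1.4 | 0.047 | 48.0 | 148.0 | 0.99168 | 3.5 | 0.49 | 10.4 | 5 | 0 | white
68 | 5.1 | 0.23 | 0.18 | 1.0 | 0.053 | 13.0 | 99.0 | 0.98956 | 3.22 | 0.39 | 11.5 | 5 | 0 | white
69 | 5.1 | 0.25 | 0.36 | 1.3 | 0.035 | 40.0 | 78.0 | 0.9891 | 3.23 | 0.64 | 12.1 | 7 | 1 | white
70 | 5.1 | 0.26 | 0.33 | 1.1 | 0.027 | 46.0 | 113.0 | 0.98946 | 3.35 | 0.43 | 11.4 | 7 | 1 | white
71 | 5.1 | 0.26 | 0.34 | 6.4 | 0.034 | 26.0 | 99.0 | 0.99449 | 3.23 | 0.41 | 9.2 | 6 | 0 | white
72 | 5.1 | 0.29 | 0.28 | 8.3 | 0.026 | 27.0 | 107.0 | 0.99308 | 3.36 | 0.37 | 11.0 | 6 | 0 | white
73 | 5.1 | 0.29 | 0.28 | 8.3 | 0.026 | 27.0 | 107.0 | 0.99308 | 3.36 | 0.37 | 11.0 | 6 | 0 | white
74 | 5.1 | 0.3 | 0.3 | 2.3 | 0.048 | 40.0 | 150.0 | 0.98944 | 3.29 | 0.46 | 12.2 | 6 | 0 | white
75 | 5.1 | 0.305 | 0.13 | 1.75 | 0.036 | 17.0 | 73.0 | 0.99 | 3.4 | 0.51 | 12.3333333333333 | 5 | 0 | white
76 | 5.1 | 0.31 | 0.3 | 0.9 | 0.037 | 28.0 | 152.0 | 0.992 | 3.54 | 0.56 | 10.1 | 6 | 0 | white
77 | 5.1 | 0.33 | 0.22 | 1.6 | 0.027 | 18.0 | 89.0 | 0.9893 | 3.51 | 0.38 | 12.5 | 7 | 1 | white
78 | 5.1 | 0.33 | 0.22 | 1.6 | 0.027 | 18.0 | 89.0 | 0.9893 | 3.51 | 0.38 | 12.5 | 7 | 1 | white
79 | 5.1 | 0.33 | 0.22 | 1.6 | 0.027 | 18.0 | 89.0 | 0.9893 | 3.51 | 0.38 | 12.5 | 7 | 1 | white
80 | 5.1 | 0.33 | 0.27 | 6.7 | 0.022 | 44.0 | 129.0 | 0.99221 | 3.36 | 0.39 | 11.0 | 7 | 1 | white
81 | 5.1 | 0.35 | 0.26 | 6.8 | 0.034 | 36.0 | 120.0 | 0.99188 | 3.38 | 0.4 | 11.5 | 6 | 0 | white
82 | 5.1 | 0.35 | 0.26 | 6.8 | 0.034 | 36.0 | 120.0 | 0.99188 | 3.38 | 0.4 | 11.5 | 6 | 0 | white
83 | 5.1 | 0.35 | 0.26 | 6.8 | 0.034 | 36.0 | 120.0 | 0.99188 | 3.38 | 0.4 | 11.5 | 6 | 0 | white
84 | 5.1 | 0.39 | 0.21 | 1.7 | 0.027 | 15.0 | 72.0 | 0.9894 | 3.5 | 0.45 | 12.5 | 6 | 0 | white
85 | 5.1 | 0.42 | 0.0 | 1.8 | 0.044 | 18.0 | 88.0 | 0.99157 | 3.68 | 0.73 | 13.6 | 7 | 1 | red
86 | 5.1 | 0.42 | 0.01 | 1.5 | 0.017 | 25.0 | 102.0 | 0.9894 | 3.38 | 0.36 | 12.3 | 7 | 1 | white
87 | 5.1 | 0.47 | 0.02 | 1.3 | 0.034 | 18.0 | 44.0 | 0.9921 | 3.9 | 0.62 | 12.8 | 6 | 0 | red
88 | 5.1 | 0.51 | 0.18 | 2.1 | 0.042 | 16.0 | 101.0 | 0.9924 | 3.46 | 0.87 | 12.9 | 7 | 1 | red
89 | 5.1 | 0.52 | 0.06 | 2.7 | 0.052 | 30.0 | 79.0 | 0.9932 | 3.32 | 0.43 | 9.3 | 5 | 0 | white
90 | 5.1 | 0.585 | 0.0 | 1.7 | 0.044 | 14.0 | 86.0 | 0.99264 | 3.56 | 0.94 | 12.9 | 7 | 1 | red
91 | 5.2 | 0.155 | 0.33 | 1.6 | 0.028 | 13.0 | 59.0 | 0.98975 | 3.3 | 0.84 | 11.9 | 8 | 1 | white
92 | 5.2 | 0.155 | 0.33 | 1.6 | 0.028 | 13.0 | 59.0 | 0.98975 | 3.3 | 0.84 | 11.9 | 8 | 1 | white
93 | 5.2 | 0.16 | 0.34 | 0.8 | 0.029 | 26.0 | 77.0 | 0.99155 | 3.25 | 0.51 | 10.1 | 6 | 0 | white
94 | 5.2 | 0.17 | 0.27 | 0.7 | 0.03 | 11.0 | 68.0 | 0.99218 | 3.3 | 0.41 | 9.8 | 5 | 0 | white
95 | 5.2 | 0.185 | 0.22 | 1.0 | 0.03 | 47.0 | 123.0 | 0.99218 | 3.55 | 0.44 | 10.15 | 6 | 0 | white
96 | 5.2 | 0.2 | 0.27 | 3.2 | 0.047 | 16.0 | 93.0 | 0.99235 | 3.44 | 0.53 | 10.1 | 7 | 1 | white
97 | 5.2 | 0.21 | 0.31 | 1.7 | 0.048 | 17.0 | 61.0 | 0.98953 | 3.24 | 0.37 | 12.0 | 7 | 1 | white
98 | 5.2 | 0.22 | 0.46 | 6.2 | 0.066 | 41.0 | 187.0 | 0.99362 | 3.19 | 0.42 | 9.73333333333333 | 5 | 0 | white
99 | 5.2 | 0.24 | 0.15 | 7.1 | 0.043 | 32.0 | 134.0 | 0.99378 | 3.24 | 0.48 | 9.9 | 6 | 0 | white
100 | 5.2 | 0.24 | 0.45 | 3.8 | 0.027 | 21.0 | 128.0 | 0.992 | 3.55 | 0.49 | 11.2 | 8 | 1 | white
Rows: 1-100 | Columns: 14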

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets section, which provides detailed information on each dataset and how to use it effectively. These datasets are useful for practicing data analysis and machine learning within the VerticaPy environment.

Next, we can initialize a LogisticRegression model:

from verticapy.machine_learning.vertica import LogisticRegression

model = LogisticRegression()

Now we can conveniently use the cross_validate() function to evaluate our model.

from verticapy.machine_learning.model_selection import cross_validate

cross_validate(
    model,
    input_relation = data,
    X = [
        "fixed_acidity",
        "volatile_acidity",
        "citric_acid",
        "residual_sugar",
        "chlorides",
        "density",
    ],
    y = "good",
    cv = 3,
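    # The `metrics` parameter is left at its default (None), so the
    # standard classification metrics are all reported, as shown below.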
)
fold | auc | prc_auc | accuracy | log_loss | precision | recall | f1_score | mcc | informedness | markedness | csi | time
1-fold | 0.7460738219425275 | 0.38845411268065855 | 0.8066451315182279 | 0.187106965673493 | 0.5161290322580645 | 0.07600950118764846 | 0.13250517598343686 | 0.1396113446619113 | 0.058827370603455886 | 0.3313309324955942 | 0.07095343680709534 | 0.27753233909606934
2-fold | 0.7530065210901536 | 0.3937780115991304 | 0.8106235565819861 | 0.185217243999051 | 0.5606060606060606 | 0.08851674641148326 | 0.1528925619834711 | 0.16511522812151755 | 0.0719168608934524 | 0.37909105345980043 | 0.08277404921700224 | 0.2949562072753906
3-fold | 0.748312016176155 | 0.3955376965141281 | 0.7936288088642659 | 0.191542403943617 | 0.4090909090909091 | 0.04100227790432802 | 0.07453416149068323 | 0.0739385720262932 | 0.025947269218746127 | 0.21069317110787433 | 0.03870967741935484 | 0.28208184242248535
avg | 0.7491307864029454 | 0.39258994026463906 | 0.80363249898816 | 0.18795553787205366 | 0.4952753339850114 | 0.06850950850115324 | 0.11997729981919707 | 0.126221714936574 | 0.05223050023855147 | 0.3070383856877563 | 0.06414572114781747 | 0.28485679626464844
std | 0.002888871507594039 | 0.0030114144078077294 | 0.007257746600157978 | 0.002651033529677691 | 0.0635891386606889 | 0.020109593019294266 | 0.03319357862435575 | 0.03840796178149291 | 0.019338043052376 | 0.07086163780194722 | 0.01862213664528516 | 0.007378937240650716
Rows: 1-5 | Columns: 13
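
The avg and std rows summarize the per-fold scores. As a quick sanity check, the sketch below recomputes them for the auc column using only the standard library. The std row appears to be the population standard deviation (dividing by the number of folds); that is inferred from the reported numbers, not taken from the library source.

```python
import statistics

# Per-fold AUC values copied from the report above.
fold_auc = [0.7460738219425275, 0.7530065210901536, 0.748312016176155]

avg_auc = statistics.fmean(fold_auc)   # arithmetic mean over the folds
std_auc = statistics.pstdev(fold_auc)  # population standard deviation (assumed)

print(avg_auc, std_auc)
```

Using the sample standard deviation (`statistics.stdev`) instead gives a noticeably larger value with only three folds, so it is worth knowing which one a report uses before comparing runs.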

Note

Cross-validation splits the dataset into several folds. The model is trained on all folds but one and evaluated on the held-out fold; this is repeated for each fold, and the scores are averaged across folds. The result shows how well a model generalizes to new, unseen data and is more robust than a score from a single train/test split. In VerticaPy, cross-validation is used for both model evaluation and parameter tuning.

For example, grid_search_cv(), randomized_search_cv(), and some other model validation functions rely on cross-validation.

See also

grid_search_cv() : Computes the k-fold grid search of an estimator.
randomized_search_cv() : Computes the K-Fold randomized search of an estimator.