bayesian_search_cv

In [ ]:
bayesian_search_cv(estimator,
                   input_relation: (str, vDataFrame),
                   X: list,
                   y: str,
                   metric: str = "auto",
                   cv: int = 3,
                   pos_label: (int, float, str) = None,
                   cutoff: float = -1,
                   param_grid: (dict, list) = {},
                   random_nbins: int = 16,
                   bayesian_nbins: int = None,
                   random_grid: bool = False,
                   lmax: int = 15,
                   nrows: int = 100000,
                   k_tops: int = 10,
                   RFmodel_params: dict = {},
                   print_info: bool = True)

Computes the k-fold Bayesian search of an estimator. A random grid search first scores a set of parameter combinations, a random forest model is then trained on these (parameters, score) pairs to estimate a probable optimal set of parameters, and the most promising combinations are retrained to select the final model.

Parameters

estimator (object)
    Vertica estimator with a fit method and a DB cursor.
input_relation (str / vDataFrame)
    Input relation.
X (list)
    List of the predictor columns.
y (str)
    Response column.
metric (str / list)
    Metric used to evaluate the model.
      • auto : logloss for classification & rmse for regression.

    For Classification:
      • accuracy : Accuracy
      • auc : Area Under the Curve (ROC)
      • bm : Informedness = tpr + tnr - 1
      • csi : Critical Success Index = tp / (tp + fn + fp)
      • f1 : F1 Score
      • logloss : Log Loss
      • mcc : Matthews Correlation Coefficient
      • mk : Markedness = ppv + npv - 1
      • npv : Negative Predictive Value = tn / (tn + fn)
      • prc_auc : Area Under the Curve (PRC)
      • precision : Precision = tp / (tp + fp)
      • recall : Recall = tp / (tp + fn)
      • specificity : Specificity = tn / (tn + fp)

    For Regression:
      • max : Max Error
      • mae : Mean Absolute Error
      • median : Median Absolute Error
      • mse : Mean Squared Error
      • msle : Mean Squared Log Error
      • r2 : R-squared coefficient
      • r2a : Adjusted R-squared
      • rmse : Root Mean Squared Error
      • var : Explained Variance
cv (int)
    Number of folds.
pos_label (int / float / str)
    The main class to be considered as positive (classification only).
cutoff (float)
    The model cutoff (classification only).
param_grid (dict / list)
    Dictionary of the parameters to test. It can also be a list of the different parameter combinations to test. If empty, a parameter grid is generated automatically (see the sketch after this list).
random_nbins (int)
    Number of bins used to compute the parameter categories during the random parameter generation.
bayesian_nbins (int)
    Number of bins used to compute the parameter categories when building the Bayesian table.
random_grid (bool)
    If True, the rows used to find the optimal function are picked randomly. Otherwise, they are regularly spaced.
lmax (int)
    Maximum length of each parameter list.
nrows (int)
    Number of rows used when performing the Bayesian search.
k_tops (int)
    In the final stage of the Bayesian search, the top candidate combinations are retrained; 'k_tops' is the number of models trained at this stage to find the most efficient one.
RFmodel_params (dict)
    Dictionary of the parameters of the random forest model used to estimate a probable optimal set of parameters.
print_info (bool)
    If True, prints the model information at each step.
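
For instance, a user-supplied param_grid might look like the following sketch. The parameter names and values are purely illustrative assumptions; any parameter accepted by the estimator can appear as a key. The same goes for the RFmodel_params override at the end.

In [ ]:
# Hypothetical grid: each key is an estimator parameter, each value the
# list of candidate values to test.
param_grid = {"solver": ["newton", "bfgs"],
              "max_iter": [100, 500],
              "C": [0.5, 1.0, 2.0]}

# Equivalent list form: one dictionary per explicit combination.
param_grid = [{"solver": "newton", "max_iter": 100, "C": 0.5},
              {"solver": "bfgs", "max_iter": 500, "C": 1.0}]

# Hypothetical override of the random forest model used to estimate the
# optimal parameter region.
RFmodel_params = {"n_estimators": 50, "max_depth": 8}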

Returns

tablesample : An object containing the result. For more information, see utilities.tablesample.
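
The result can also be consumed programmatically. A minimal sketch, assuming the usual tablesample behavior (a values dictionary of columns and a to_pandas conversion) and reusing the model defined in the example below:

In [ ]:
result = bayesian_search_cv(model,
                            input_relation = "public.titanic",
                            X = ["age", "fare", "pclass"],
                            y = "survived",
                            cv = 3)
# Cross-validated scores of every tested combination, best first.
scores = result.values["avg_score"]
# Conversion to a pandas DataFrame for further analysis.
df = result.to_pandas()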

Example

In [62]:
from verticapy.learn.linear_model import LogisticRegression
model = LogisticRegression(name = "public.LR_titanic",
                           tol = 1e-4,
                           max_iter = 100, 
                           solver = 'Newton')

from verticapy.learn.model_selection import bayesian_search_cv
bayesian_search_cv(model,
                   input_relation = "public.titanic", 
                   X = ["age", "fare", "pclass"], 
                   y = "survived", 
                   cv = 3)
Starting Bayesian Search

Step 1 - Computing Random Models using Grid Search

Model: LogisticRegression; Parameters: {'tol': 1e-08, 'max_iter': 500, 'penalty': 'enet', 'solver': 'cgd', 'C': 4.07, 'l1_ratio': 0.379}; Test_score: 0.28812685574361635; Train_score: 0.283396406051618; Time: 0.3235960801442464;
Model: LogisticRegression; Parameters: {'tol': 1e-06, 'max_iter': 500, 'penalty': 'enet', 'solver': 'cgd', 'C': 0.314, 'l1_ratio': 0.001}; Test_score: 0.2854526060241123; Train_score: 0.274555291668678; Time: 0.34723663330078125;
Model: LogisticRegression; Parameters: {'tol': 0.0001, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'bfgs', 'C': 3.757}; Test_score: 0.27555990545745834; Train_score: 0.26094440519552037; Time: 1.0109228293100994;
Model: LogisticRegression; Parameters: {'tol': 1e-06, 'max_iter': 500, 'penalty': 'enet', 'solver': 'cgd', 'C': 4.383, 'l1_ratio': 0.631}; Test_score: 0.2873805433696933; Train_score: 0.289291526361243; Time: 0.2942519187927246;
Model: LogisticRegression; Parameters: {'tol': 0.0001, 'max_iter': 500, 'penalty': 'enet', 'solver': 'cgd', 'C': 1.566, 'l1_ratio': 0.631}; Test_score: 0.289012465616229; Train_score: 0.281823755471626; Time: 0.31007663408915204;
Model: LogisticRegression; Parameters: {'tol': 1e-08, 'max_iter': 100, 'penalty': 'none', 'solver': 'bfgs'}; Test_score: 0.259712143263803; Train_score: 0.2535097073776583; Time: 1.86483629544576;
Model: LogisticRegression; Parameters: {'tol': 1e-08, 'max_iter': 500, 'penalty': 'l2', 'solver': 'newton', 'C': 2.818}; Test_score: 0.2587137453451607; Train_score: 0.2542751699158963; Time: 0.35260597864786786;
Model: LogisticRegression; Parameters: {'tol': 0.0001, 'max_iter': 100, 'penalty': 'enet', 'solver': 'cgd', 'C': 0.314, 'l1_ratio': 0.946}; Test_score: 0.281828181068283; Train_score: 0.28333415454824334; Time: 0.38917016983032227;
Model: LogisticRegression; Parameters: {'tol': 0.0001, 'max_iter': 500, 'penalty': 'enet', 'solver': 'cgd', 'C': 4.383, 'l1_ratio': 0.19}; Test_score: 0.29364390979279203; Train_score: 0.27945022251259266; Time: 0.3257162570953369;
Model: LogisticRegression; Parameters: {'tol': 0.0001, 'max_iter': 100, 'penalty': 'enet', 'solver': 'cgd', 'C': 4.696, 'l1_ratio': 0.064}; Test_score: 0.28472563153231933; Train_score: 0.2824044088018057; Time: 0.2801942825317383;
Model: LogisticRegression; Parameters: {'tol': 0.0001, 'max_iter': 1000, 'penalty': 'enet', 'solver': 'cgd', 'C': 2.192, 'l1_ratio': 0.883}; Test_score: 0.285237668959101; Train_score: 0.2880575188854733; Time: 0.29958049456278485;
Model: LogisticRegression; Parameters: {'tol': 1e-06, 'max_iter': 500, 'penalty': 'l1', 'solver': 'cgd', 'C': 3.444}; Test_score: 0.2910605587685393; Train_score: 0.2901470939729513; Time: 0.33542760213216144;
Model: LogisticRegression; Parameters: {'tol': 0.0001, 'max_iter': 500, 'penalty': 'enet', 'solver': 'cgd', 'C': 0.314, 'l1_ratio': 0.19}; Test_score: 0.288035838374703; Train_score: 0.2796454186725533; Time: 0.3477969964345296;
Model: LogisticRegression; Parameters: {'tol': 0.0001, 'max_iter': 100, 'penalty': 'enet', 'solver': 'cgd', 'C': 4.383, 'l1_ratio': 0.694}; Test_score: 0.288927038909389; Train_score: 0.2886361571551873; Time: 0.33753132820129395;
Model: LogisticRegression; Parameters: {'tol': 1e-06, 'max_iter': 100, 'penalty': 'enet', 'solver': 'cgd', 'C': 4.696, 'l1_ratio': 0.064}; Test_score: 0.28879972640237733; Train_score: 0.2788891068695907; Time: 0.2597200075785319;

Step 2 - Fitting the RF model with the hyperparameters data

Step 3 - Computing Most Probable Good Models using Grid Search

Model: LogisticRegression; Parameters: {'tol': 0.01, 'max_iter': 1, 'penalty': 'none', 'solver': 'bfgs', 'C': 0.0, 'l1_ratio': 1.0}; Test_score: 0.34843847377293835; Train_score: 0.34510245831672964; Time: 0.24924961725870767;
Model: LogisticRegression; Parameters: {'tol': 0.01, 'max_iter': 1, 'penalty': 'l2', 'solver': 'bfgs', 'C': 0.0, 'l1_ratio': 1.0}; Test_score: 0.295674244636197; Train_score: 0.308816399060115; Time: 0.24345088005065918;
Model: LogisticRegression; Parameters: {'tol': 0.008333335, 'max_iter': 1, 'penalty': 'none', 'solver': 'bfgs', 'C': 0.0, 'l1_ratio': 1.0}; Test_score: 0.37826629014006735; Train_score: 0.35666606291173; Time: 0.38054712613423664;

Bayesian Search Selected Model
Parameters: {'solver': 'newton', 'penalty': 'l2', 'max_iter': 500, 'C': 2.818, 'tol': 1e-08}; Test_score: 0.2587137453451607; Train_score: 0.2542751699158963; Time: 0.35260597864786786;
Out[62]:
    | avg_score           | avg_train_score     | avg_time            | score_std             | score_train_std
1   | 0.2587137453451607  | 0.2542751699158963  | 0.35260597864786786 | 0.0033874016778072317 | 0.0017396247442451331
2   | 0.259712143263803   | 0.2535097073776583  | 1.86483629544576    | 0.002953793667648934  | 0.001294410402914851
3   | 0.27555990545745834 | 0.26094440519552037 | 1.0109228293100994  | 0.010054055203878058  | 0.004624357220719312
4   | 0.281828181068283   | 0.28333415454824334 | 0.38917016983032227 | 0.0021758816328656142 | 0.0012487368376270758
5   | 0.28472563153231933 | 0.2824044088018057  | 0.2801942825317383  | 0.003531280795540625  | 0.0018375333222061185
6   | 0.285237668959101   | 0.2880575188854733  | 0.29958049456278485 | 0.0017357453412245323 | 0.003377342058046735
7   | 0.2854526060241123  | 0.274555291668678   | 0.34723663330078125 | 0.008056032927571063  | 0.0044270465267294366
8   | 0.2873805433696933  | 0.289291526361243   | 0.2942519187927246  | 0.0003598274316354341 | 0.0005968201416831517
9   | 0.288035838374703   | 0.2796454186725533  | 0.3477969964345296  | 0.017313039154404283  | 0.005837568570358881
10  | 0.28812685574361635 | 0.283396406051618   | 0.3235960801442464  | 0.0018976494072556068 | 0.0008438504112709961
11  | 0.28879972640237733 | 0.2788891068695907  | 0.2597200075785319  | 0.005531731924407357  | 0.0019121558268743672
12  | 0.288927038909389   | 0.2886361571551873  | 0.33753132820129395 | 0.0005176549219616071 | 0.001318886647861579
13  | 0.289012465616229   | 0.281823755471626   | 0.31007663408915204 | 0.007180005666435044  | 0.004503788114531203
14  | 0.2910605587685393  | 0.2901470939729513  | 0.33542760213216144 | 0.0012662957018762318 | 0.005020063881394776
15  | 0.29364390979279203 | 0.27945022251259266 | 0.3257162570953369  | 0.0027632202835655164 | 0.0009210107427996371
16  | 0.295674244636197   | 0.308816399060115   | 0.24345088005065918 | 0.023373372135913246  | 0.015424213445165022
17  | 0.34843847377293835 | 0.34510245831672964 | 0.24924961725870767 | 0.08579308260474884   | 0.04705673094172616
18  | 0.37826629014006735 | 0.35666606291173    | 0.38054712613423664 | 0.07768111689257563   | 0.03999130056492263
Rows: 1-18 | Columns: 6
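
A natural follow-up, sketched below under the assumption of the standard VerticaPy fit signature, is to retrain a final model with the parameters selected by the search:

In [ ]:
# Sketch: retrain a final model with the parameters selected above.
# The model name 'public.LR_titanic_best' is a hypothetical choice.
best_model = LogisticRegression(name = "public.LR_titanic_best",
                                solver = "newton",
                                penalty = "l2",
                                max_iter = 500,
                                C = 2.818,
                                tol = 1e-8)
best_model.fit("public.titanic", ["age", "fare", "pclass"], "survived")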