Example: XGBoost.to_json#

Starting from VerticaPy 0.7.1, you can export any native Vertica XGBoost model to the Python XGBoost JSON file format. This page demonstrates the exporting process and the nuances involved.

Connect to Vertica#

For a demonstration of how to create a new connection to Vertica, see connection. In this example, we will use an existing connection named ‘VerticaDSN’.

[16]:
import verticapy as vp
vp.connect("VerticaDSN")

Create a Schema (Optional)#

Schemas allow you to organize database objects in a collection, similar to a namespace. If you create a database object without specifying a schema, Vertica uses the ‘public’ schema. For example, to reference ‘example_table’ in ‘example_schema’, you would use ‘example_schema.example_table’.

To keep things organized, this example creates the ‘xgb_to_json’ schema and drops it (and its associated tables, views, etc.) at the end:

[17]:
vp.drop("xgb_to_json", method = "schema")
vp.create_schema("xgb_to_json")
[17]:
True

Load Data#

VerticaPy lets you load many well-known datasets, such as Iris, Titanic, and Amazon.

This example loads the Titanic dataset with the load_titanic function into a table called ‘titanic’ in the ‘xgb_to_json’ schema:

[18]:
from verticapy.datasets import load_titanic
vdf = load_titanic(name = "titanic",
                   schema = "xgb_to_json",)

You can also load your own data. To ingest data from a CSV file, use the read_csv() function.

The read_csv() function parses the dataset and uses flex tables to identify data types.

If read_csv() runs for too long, you can use the ‘parse_nrows’ parameter to limit the number of lines read_csv() parses before guessing the data types, at the possible expense of data type identification accuracy.

For example, to load the ‘iris.csv’ file with the read_csv() function:

[19]:
vdf = vp.read_csv("data/iris.csv",
                  table_name = "iris",
                  schema = "xgb_to_json",)
The table "xgb_to_json"."iris" has been successfully created.

Create a vDataFrame#

vDataFrames allow you to prepare and explore your data without modifying its representation in your Vertica database. Any changes you make are applied to the vDataFrame as modifications to the SQL query for the table underneath.

To create a vDataFrame out of a table in your Vertica database, specify its schema and table name with the standard SQL syntax. For example, to create a vDataFrame out of the ‘titanic’ table in the ‘xgb_to_json’ schema:

[21]:
vdf = vp.vDataFrame("xgb_to_json.titanic")

Create an XGB model#

Create an XGBoostClassifier model.

Unlike a vDataFrame object, which simply queries the table it was created with, the VerticaPy XGBoostClassifier object creates and then references a model in Vertica, so it must be stored in a schema like any other database object.

This example creates the ‘my_model’ XGBoostClassifier model in the ‘xgb_to_json’ schema:

[18]:
from verticapy.learn.ensemble import XGBoostClassifier
model = XGBoostClassifier("xgb_to_json.my_model",
                          max_ntree = 4,
                          max_depth = 3,)

Prepare the Data#

While Vertica XGBoost supports columns of type VARCHAR, Python XGBoost does not, so you must encode the categorical columns you want to use. You must also drop or impute missing values.

This example keeps only the ‘age’, ‘fare’, ‘sex’, ‘embarked’, and ‘survived’ columns in the vDataFrame, drops rows with missing values, and then encodes the ‘sex’ and ‘embarked’ columns. These changes are applied to the vDataFrame’s query and do not affect the underlying ‘xgb_to_json.titanic’ table stored in Vertica:

[22]:
vdf = vdf[["age", "fare", "sex", "embarked", "survived"]]
vdf.dropna()
vdf["sex"].label_encode()
vdf["embarked"].label_encode()
Nothing was filtered.
[22]:
     age             fare            sex    embarked    survived
     Numeric(6,3)    Numeric(10,5)   Int    Int         Int
1    2.0             151.55          0      2           0
2    30.0            151.55          1      2           0
3    25.0            151.55          0      2           0
...
Rows: 1-100 of 994 | Columns: 5

Train the Model#

Define the predictor and the response columns:

[23]:
relation = "xgb_to_json.titanic"
X = ["age", "fare", "sex", "embarked"]
y = "survived"

Train the model with fit():

[24]:
model.fit(relation, X, y)
[24]:


===========
call_string
===========
xgb_classifier('xgb_to_json.my_model', 'xgb_to_json.titanic', '"survived"', '"age", "fare", "sex", "embarked"' USING PARAMETERS exclude_columns='', max_ntree=4, max_depth=3, learning_rate=0.1, min_split_loss=0, weight_reg=0, nbins=32, objective=crossentropy, sampling_size=1, col_sample_by_tree=1, col_sample_by_node=1)

=======
details
=======
predictor|      type
---------+----------------
   age   |float or numeric
  fare   |float or numeric
   sex   |char or varchar
embarked |char or varchar


===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    |  4
rejected_row_count| 240
accepted_row_count| 994

Evaluate the Model#

Evaluate the model with .report():

[25]:
model.report()
[25]:
                    value
auc                 0.807342412203361
prc_auc             0.8090882088048466
accuracy            0.789738430583501
log_loss            0.253652070598684
precision           0.7261306532663316
recall              0.7429305912596401
f1_score            0.7794904143408312
mcc                 0.5605510039144916
informedness        0.5627653020034418
markedness          0.5583454183670029
csi                 0.5803212851405622
cutoff              0.4804
Rows: 1-12 | Columns: 2

Export the Model as JSON#

Use to_json() to export the model to a JSON file. If you omit a filename, VerticaPy prints the model:

[26]:
model.to_json()
[26]:
'{"learner": {"attributes": {"scikit_learn": "{\\"use_label_encoder\\": true, \\"n_estimators\\": 4, \\"objective\\": \\"binary:logistic\\", \\"max_depth\\": 3, \\"learning_rate\\": 0.1, \\"verbosity\\": null, \\"booster\\": null, \\"tree_method\\": null, \\"gamma\\": null, \\"min_child_weight\\": null, \\"max_delta_step\\": null, \\"subsample\\": null, \\"colsample_bytree\\": 1.0, \\"colsample_bylevel\\": null, \\"colsample_bynode\\": 1.0, \\"reg_alpha\\": null, \\"reg_lambda\\": null, \\"scale_pos_weight\\": null, \\"base_score\\": null, \\"missing\\": NaN, \\"num_parallel_tree\\": null, \\"kwargs\\": {}, \\"random_state\\": null, \\"n_jobs\\": null, \\"monotone_constraints\\": null, \\"interaction_constraints\\": null, \\"importance_type\\": \\"gain\\", \\"gpu_id\\": null, \\"validate_parameters\\": null, \\"classes_\\": [0, 1], \\"n_classes_\\": 2, \\"_le\\": {\\"classes_\\": [0, 1]}, \\"_estimator_type\\": \\"classifier\\"}"}, "feature_names": [], "feature_types": [], "gradient_booster": {"model": {"trees": [{"base_weights": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "categories": [], "categories_nodes": [], "categories_segments": [], "categories_sizes": [], "default_left": [true, true, true, true, true, true, true], "id": 0, "left_children": [1, 3, 5, -1, -1, -1, -1], "loss_changes": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "parents": [588197202, 0, 0, 1, 1, 2, 2], "right_children": [2, 4, 6, -1, -1, -1, -1], "split_conditions": ["male", 12.778438, 48.030862, 0.00425532, -0.13231800000000002, 0.055144000000000006, 0.18938100000000002], "split_indices": [2, 0, 1, 0, 0, 0, 0], "split_type": [1, 0, 0, 1, 1, 1, 1], "sum_hessian": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "tree_param": {"num_deleted": "0", "num_feature": "4", "num_nodes": "7", "size_leaf_vector": "0"}}, {"base_weights": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "categories": [], "categories_nodes": [], "categories_segments": [], "categories_sizes": [], "default_left": [true, true, true, true, true, true, true], "id": 1, "left_children": [1, 3, 5, -1, -1, -1, -1], "loss_changes": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "parents": [997849600, 0, 0, 1, 1, 2, 2], "right_children": [2, 4, 6, -1, -1, -1, -1], "split_conditions": ["male", 12.778438, 48.030862, 0.0038298100000000003, -0.11962800000000001, 0.049668800000000006, 0.17203200000000002], "split_indices": [2, 0, 1, 0, 0, 0, 0], "split_type": [1, 0, 0, 1, 1, 1, 1], "sum_hessian": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "tree_param": {"num_deleted": "0", "num_feature": "4", "num_nodes": "7", "size_leaf_vector": "0"}}, {"base_weights": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "categories": [], "categories_nodes": [], "categories_segments": [], "categories_sizes": [], "default_left": [true, true, true, true, true, true, true], "id": 2, "left_children": [1, 3, 5, -1, -1, -1, -1], "loss_changes": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "parents": [865841526, 0, 0, 1, 1, 2, 2], "right_children": [2, 4, 6, -1, -1, -1, -1], "split_conditions": ["male", 12.778438, 48.030862, 0.00344687, -0.108967, 0.044795100000000004, 0.158699], "split_indices": [2, 0, 1, 0, 0, 0, 0], "split_type": [1, 0, 0, 1, 1, 1, 1], "sum_hessian": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "tree_param": {"num_deleted": "0", "num_feature": "4", "num_nodes": "7", "size_leaf_vector": "0"}}, {"base_weights": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "categories": [], "categories_nodes": [], "categories_segments": [], "categories_sizes": [], "default_left": [true, true, true, true, true, true, true], "id": 3, "left_children": [1, 3, 5, -1, -1, -1, -1], 
"loss_changes": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "parents": [200216671, 0, 0, 1, 1, 2, 2], "right_children": [2, 4, 6, -1, -1, -1, -1], "split_conditions": ["male", "Q", 48.030862, -0.14433, -0.08975060000000001, 0.0404365, 0.148091], "split_indices": [2, 3, 1, 0, 0, 0, 0], "split_type": [1, 1, 0, 1, 1, 1, 1], "sum_hessian": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "tree_param": {"num_deleted": "0", "num_feature": "4", "num_nodes": "7", "size_leaf_vector": "0"}}], "tree_info": [0, 0, 0, 0], "gbtree_model_param": {"num_trees": "4", "size_leaf_vector": "0"}}, "name": "gbtree"}, "learner_model_param": {"base_score": "3.9134809E-01", "num_class": "0", "num_feature": "4"}, "objective": {"name": "binary:logistic", "reg_loss_param": {"scale_pos_weight": "1"}}}, "version": [1, 4, 2]}'

To export and save the model as a JSON file, specify a filename:

[27]:
model.to_json("exported_xgb_model.json")

Unlike Python XGBoost, Vertica does not store certain information, such as ‘sum_hessian’ or ‘loss_changes’; in the file exported by to_json(), these fields are replaced with lists filled with zeros.
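
For example, you can open the exported file and confirm that these fields are zero-filled. This is a minimal sketch using Python’s standard json module and the ‘exported_xgb_model.json’ file saved above:

[ ]:
import json

# Inspect the exported file: fields that Vertica does not store are zero-filled
with open("exported_xgb_model.json") as f:
    dump = json.load(f)

tree = dump["learner"]["gradient_booster"]["model"]["trees"][0]
print(tree["sum_hessian"])   # e.g. [0.0, 0.0, ...]
print(tree["loss_changes"])  # e.g. [0.0, 0.0, ...]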

Make Predictions with an Exported Model#

This exported model can be used with the Python XGBoost API right away, and exported models make identical predictions in Vertica and Python. The comparison below assumes a NumPy test matrix, X_test, built from the encoded predictor columns; here it simply reuses the training data:

[ ]:
import pytest
import xgboost as xgb

# Build a test matrix from the encoded predictor columns (here, the training data)
X_test = vdf[X].to_numpy()

model_python = xgb.XGBClassifier()
model_python.load_model("exported_xgb_model.json")

# Compare the probabilities returned by the Vertica model and the Python model
y_test_vertica = model.to_python(return_proba = True)(X_test)
y_test_python = model_python.predict_proba(X_test)
result = (y_test_vertica - y_test_python) ** 2
result = result.sum() / len(result)
assert result == pytest.approx(0.0, abs = 1.0E-14)

For multiclass classifiers, the probabilities returned by the VerticaPy model and the exported model may differ slightly because of normalization: Vertica uses multinomial logistic regression, while Python XGBoost uses softmax. This difference does not affect the model’s final predictions (see the sketch below). Note also that categorical predictors must be encoded before training.
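
For instance, for a hypothetical multiclass model, a quick sanity check could compare the predicted classes rather than the raw probabilities, reusing arrays shaped like y_test_vertica and y_test_python above:

[ ]:
import numpy as np

# Probabilities may differ slightly after normalization, but the predicted
# class (the argmax of each row of probabilities) should be identical
pred_vertica = np.argmax(y_test_vertica, axis = 1)
pred_python = np.argmax(y_test_python, axis = 1)
assert (pred_vertica == pred_python).all()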

Clean the Example Environment#

Drop the ‘xgb_to_json’ schema, using CASCADE to drop any database objects stored inside (the ‘titanic’ table, the XGBoostClassifier model, etc.), then delete the ‘exported_xgb_model.json’ file:

[29]:
import os
os.remove("exported_xgb_model.json")
vp.drop("xgb_to_json", method = "schema")
DROP
Execution: 0.015s

Conclusion#

VerticaPy lets you create, train, evaluate, and export Vertica machine learning models. There are some notable nuances when importing a Vertica XGBoost model into Python XGBoost, but these do not affect the accuracy of the model or its predictions:

  • Some information computed during the training phase may not be stored (e.g. ‘sum_hessian’ and ‘loss_changes’).

  • The exact probabilities of multiclass classifiers in a Vertica model may differ from those in Python, but both will make the same predictions.

  • Python XGBoost does not support categorical predictors, so you must encode them before training the model in VerticaPy.