Example: XGBoost.to_json#

Starting from VerticaPy 0.7.1, you can export any native Vertica XGBoost model to the Python XGBoost JSON file format. This page demonstrates the exporting process and the nuances involved.

Connect to Vertica#

For a demonstration of how to create a new connection to Vertica, see connection. In this example, we will use an existing connection named ‘VerticaDSN’.

[16]:
import verticapy as vp
vp.connect("VerticaDSN")

Create a Schema (Optional)#

Schemas allow you to organize database objects in a collection, similar to a namespace. If you create a database object without specifying a schema, Vertica uses the ‘public’ schema. For example, to reference ‘example_table’ in ‘example_schema’, you would use ‘example_schema.example_table’.

To keep things organized, this example creates the ‘xgb_to_json’ schema and drops it (and its associated tables, views, etc.) at the end:

[17]:
vp.drop("xgb_to_json", method = "schema")
vp.create_schema("xgb_to_json")
[17]:
True

Load Data#

VerticaPy lets you load many well-known datasets, such as Iris, Titanic, and Amazon.

This example loads the Titanic dataset with the load_titanic function into a table called ‘titanic’ in the ‘xgb_to_json’ schema:

[18]:
from verticapy.datasets import load_titanic
vdf = load_titanic(name = "titanic",
                   schema = "xgb_to_json",)

You can also load your own data. To ingest data from a CSV file, use the read_csv() function.

The read_csv() function parses the dataset and uses flex tables to identify data types.

If read_csv() runs for too long, you can use the ‘parse_nrows’ parameter to limit the number of lines read_csv() parses before guessing the data types, at the possible expense of data type identification accuracy.

For example, to load the ‘iris.csv’ file with the read_csv() function:

[19]:
vdf = vp.read_csv("data/iris.csv",
                  table_name = "iris",
                  schema = "xgb_to_json",)
The table "xgb_to_json"."iris" has been successfully created.

Create a vDataFrame#

vDataFrames allow you to prepare and explore your data without modifying its representation in your Vertica database. Any changes you make are applied to the vDataFrame as modifications to the SQL query for the table underneath.

To create a vDataFrame out of a table in your Vertica database, specify its schema and table name with the standard SQL syntax. For example, to create a vDataFrame out of the ‘titanic’ table in the ‘xgb_to_json’ schema:

[21]:
vdf = vp.vDataFrame("xgb_to_json.titanic")

Create an XGB model#

Create an XGBoostClassifier model.

Unlike a vDataFrame object, which simply queries the table it was created with, the VerticaPy XGBoostClassifier object creates and then references a model in Vertica, so it must be stored in a schema like any other database object.

This example creates the ‘my_model’ XGBoostClassifier model in the ‘xgb_to_json’ schema:

[18]:
from verticapy.learn.ensemble import XGBoostClassifier
model = XGBoostClassifier("xgb_to_json.my_model",
                          max_ntree = 4,
                          max_depth = 3,)

Prepare the Data#

While Vertica XGBoost supports columns of type VARCHAR, Python XGBoost does not, so you must encode the categorical columns you want to use. You must also drop or impute missing values.

This example keeps only the ‘age’, ‘fare’, ‘sex’, ‘embarked’, and ‘survived’ columns in the vDataFrame, drops rows with missing values, and then encodes the ‘sex’ and ‘embarked’ columns. These changes are applied to the vDataFrame’s query and do not affect the underlying ‘xgb_to_json.titanic’ table stored in Vertica:

[22]:
vdf = vdf[["age", "fare", "sex", "embarked", "survived"]]
vdf.dropna()
vdf["sex"].label_encode()
vdf["embarked"].label_encode()
Nothing was filtered.
[22]:
     age             fare            sex    embarked    survived
     Numeric(6,3)    Numeric(10,5)   Int    Int         Int
1    2.0             151.55          0      2           0
2    30.0            151.55          1      2           0
3    25.0            151.55          0      2           0
...
Rows: 1-100 of 994 | Columns: 5

Train the Model#

Define the predictor and the response columns:

[23]:
relation = "xgb_to_json.titanic"
X = ["age", "fare", "sex", "embarked"]
y = "survived"

Train the model with fit():

[24]:
model.fit(relation, X, y)
[24]:


===========
call_string
===========
xgb_classifier('xgb_to_json.my_model', 'xgb_to_json.titanic', '"survived"', '"age", "fare", "sex", "embarked"' USING PARAMETERS exclude_columns='', max_ntree=4, max_depth=3, learning_rate=0.1, min_split_loss=0, weight_reg=0, nbins=32, objective=crossentropy, sampling_size=1, col_sample_by_tree=1, col_sample_by_node=1)

=======
details
=======
predictor|      type
---------+----------------
   age   |float or numeric
  fare   |float or numeric
   sex   |char or varchar
embarked |char or varchar


===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    |  4
rejected_row_count| 240
accepted_row_count| 994

Evaluate the Model#

Evaluate the model with .report():

[25]:
model.report()
[25]:
                    value
auc                 0.807342412203361
prc_auc             0.8090882088048466
accuracy            0.789738430583501
log_loss            0.253652070598684
precision           0.7261306532663316
recall              0.7429305912596401
f1_score            0.7794904143408312
mcc                 0.5605510039144916
informedness        0.5627653020034418
markedness          0.5583454183670029
csi                 0.5803212851405622
cutoff              0.4804
Rows: 1-12 | Columns: 2

Export the Model as JSON#

Use to_json() to export the model to a JSON file. If you omit a filename, VerticaPy prints the model:

[26]:
model.to_json()
[26]:
'{"learner": {"attributes": {"scikit_learn": "{\\"use_label_encoder\\": true, \\"n_estimators\\": 4, \\"objective\\": \\"binary:logistic\\", \\"max_depth\\": 3, \\"learning_rate\\": 0.1, \\"verbosity\\": null, \\"booster\\": null, \\"tree_method\\": null, \\"gamma\\": null, \\"min_child_weight\\": null, \\"max_delta_step\\": null, \\"subsample\\": null, \\"colsample_bytree\\": 1.0, \\"colsample_bylevel\\": null, \\"colsample_bynode\\": 1.0, \\"reg_alpha\\": null, \\"reg_lambda\\": null, \\"scale_pos_weight\\": null, \\"base_score\\": null, \\"missing\\": NaN, \\"num_parallel_tree\\": null, \\"kwargs\\": {}, \\"random_state\\": null, \\"n_jobs\\": null, \\"monotone_constraints\\": null, \\"interaction_constraints\\": null, \\"importance_type\\": \\"gain\\", \\"gpu_id\\": null, \\"validate_parameters\\": null, \\"classes_\\": [0, 1], \\"n_classes_\\": 2, \\"_le\\": {\\"classes_\\": [0, 1]}, \\"_estimator_type\\": \\"classifier\\"}"}, "feature_names": [], "feature_types": [], "gradient_booster": {"model": {"trees": [{"base_weights": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "categories": [], "categories_nodes": [], "categories_segments": [], "categories_sizes": [], "default_left": [true, true, true, true, true, true, true], "id": 0, "left_children": [1, 3, 5, -1, -1, -1, -1], "loss_changes": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "parents": [588197202, 0, 0, 1, 1, 2, 2], "right_children": [2, 4, 6, -1, -1, -1, -1], "split_conditions": ["male", 12.778438, 48.030862, 0.00425532, -0.13231800000000002, 0.055144000000000006, 0.18938100000000002], "split_indices": [2, 0, 1, 0, 0, 0, 0], "split_type": [1, 0, 0, 1, 1, 1, 1], "sum_hessian": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "tree_param": {"num_deleted": "0", "num_feature": "4", "num_nodes": "7", "size_leaf_vector": "0"}}, {"base_weights": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "categories": [], "categories_nodes": [], "categories_segments": [], "categories_sizes": [], "default_left": [true, true, true, true, true, true, true], "id": 1, "left_children": [1, 3, 5, -1, -1, -1, -1], "loss_changes": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "parents": [997849600, 0, 0, 1, 1, 2, 2], "right_children": [2, 4, 6, -1, -1, -1, -1], "split_conditions": ["male", 12.778438, 48.030862, 0.0038298100000000003, -0.11962800000000001, 0.049668800000000006, 0.17203200000000002], "split_indices": [2, 0, 1, 0, 0, 0, 0], "split_type": [1, 0, 0, 1, 1, 1, 1], "sum_hessian": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "tree_param": {"num_deleted": "0", "num_feature": "4", "num_nodes": "7", "size_leaf_vector": "0"}}, {"base_weights": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "categories": [], "categories_nodes": [], "categories_segments": [], "categories_sizes": [], "default_left": [true, true, true, true, true, true, true], "id": 2, "left_children": [1, 3, 5, -1, -1, -1, -1], "loss_changes": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "parents": [865841526, 0, 0, 1, 1, 2, 2], "right_children": [2, 4, 6, -1, -1, -1, -1], "split_conditions": ["male", 12.778438, 48.030862, 0.00344687, -0.108967, 0.044795100000000004, 0.158699], "split_indices": [2, 0, 1, 0, 0, 0, 0], "split_type": [1, 0, 0, 1, 1, 1, 1], "sum_hessian": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "tree_param": {"num_deleted": "0", "num_feature": "4", "num_nodes": "7", "size_leaf_vector": "0"}}, {"base_weights": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "categories": [], "categories_nodes": [], "categories_segments": [], "categories_sizes": [], "default_left": [true, true, true, true, true, true, true], "id": 3, "left_children": [1, 3, 5, -1, -1, -1, -1], 
"loss_changes": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "parents": [200216671, 0, 0, 1, 1, 2, 2], "right_children": [2, 4, 6, -1, -1, -1, -1], "split_conditions": ["male", "Q", 48.030862, -0.14433, -0.08975060000000001, 0.0404365, 0.148091], "split_indices": [2, 3, 1, 0, 0, 0, 0], "split_type": [1, 1, 0, 1, 1, 1, 1], "sum_hessian": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "tree_param": {"num_deleted": "0", "num_feature": "4", "num_nodes": "7", "size_leaf_vector": "0"}}], "tree_info": [0, 0, 0, 0], "gbtree_model_param": {"num_trees": "4", "size_leaf_vector": "0"}}, "name": "gbtree"}, "learner_model_param": {"base_score": "3.9134809E-01", "num_class": "0", "num_feature": "4"}, "objective": {"name": "binary:logistic", "reg_loss_param": {"scale_pos_weight": "1"}}}, "version": [1, 4, 2]}'

To export and save the model as a JSON file, specify a filename:

[27]:
model.to_json("exported_xgb_model.json")

Unlike Python XGBoost, Vertica does not store certain information, such as ‘sum_hessian’ or ‘loss_changes’; in the file exported by to_json(), these fields are replaced with lists filled with zeros.
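
For example, you can open the exported file and confirm that these fields are zero-filled. This is a minimal sketch using Python’s standard json module and the ‘exported_xgb_model.json’ file saved above:

[ ]:
import json

# Inspect the exported file: fields that Vertica does not store are zero-filled
with open("exported_xgb_model.json") as f:
    dump = json.load(f)

tree = dump["learner"]["gradient_booster"]["model"]["trees"][0]
print(tree["sum_hessian"])   # e.g. [0.0, 0.0, ...]
print(tree["loss_changes"])  # e.g. [0.0, 0.0, ...]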

Make Predictions with an Exported Model#

This exported model can be used with the Python XGBoost API right away, and exported models make identical predictions in Vertica and Python. The comparison below assumes a NumPy test matrix, X_test, built from the encoded predictor columns; here it simply reuses the training data:

[ ]:
import pytest
import xgboost as xgb

# Build a test matrix from the encoded predictor columns (here, the training data)
X_test = vdf[X].to_numpy()

model_python = xgb.XGBClassifier()
model_python.load_model("exported_xgb_model.json")

# Compare the probabilities returned by the Vertica model and the Python model
y_test_vertica = model.to_python(return_proba = True)(X_test)
y_test_python = model_python.predict_proba(X_test)
result = (y_test_vertica - y_test_python) ** 2
result = result.sum() / len(result)
assert result == pytest.approx(0.0, abs = 1.0E-14)

For multiclass classifiers, the probabilities returned by the VerticaPy model and the exported model may differ slightly because of normalization: Vertica uses multinomial logistic regression, while Python XGBoost uses softmax. This difference does not affect the model’s final predictions (see the sketch below). Note also that categorical predictors must be encoded before training.
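
For instance, for a hypothetical multiclass model, a quick sanity check could compare the predicted classes rather than the raw probabilities, reusing arrays shaped like y_test_vertica and y_test_python above:

[ ]:
import numpy as np

# Probabilities may differ slightly after normalization, but the predicted
# class (the argmax of each row of probabilities) should be identical
pred_vertica = np.argmax(y_test_vertica, axis = 1)
pred_python = np.argmax(y_test_python, axis = 1)
assert (pred_vertica == pred_python).all()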

Clean the Example Environment#

Drop the ‘xgb_to_json’ schema, using CASCADE to drop any database objects stored inside (the ‘titanic’ table, the XGBoostClassifier model, etc.), then delete the ‘exported_xgb_model.json’ file:

[29]:
import os
os.remove("exported_xgb_model.json")
vp.drop("xgb_to_json", method = "schema")
DROP
Execution: 0.015s

Conclusion#

VerticaPy lets you create, train, evaluate, and export Vertica machine learning models. There are some notable nuances when importing a Vertica XGBoost model into Python XGBoost, but these do not affect the accuracy of the model or its predictions:

  • Some information computed during the training phase may not be stored (e.g. ‘sum_hessian’ and ‘loss_changes’).

  • The exact probabilities of multiclass classifiers in a Vertica model may differ from those in Python, but both will make the same predictions.

  • Python XGBoost does not support categorical predictors, so you must encode them before training the model in VerticaPy.