Model Tracking and Versioning

Introduction

VerticaPy is an open-source Python package built on top of the Vertica database that supports pandas-like virtual dataframes over database relations. It provides scikit-learn-style machine learning functionality on these virtual dataframes. Data is not moved out of the database while performing machine learning or statistical analysis on virtual dataframes; instead, the computations run at scale, in a distributed fashion, inside the Vertica cluster. VerticaPy also takes advantage of multiple Python libraries to create a variety of charts, providing a quick and easy way to visualize your data.

In this article, we will introduce two new MLOps tools recently added to VerticaPy: Model Tracking and Model Versioning. For comprehensive documentation of all VerticaPy functionality, visit https://www.vertica.com/python.

Model Tracking

Data scientists usually train many ML models for a project. To help choose the best model, they need a way to keep track of all candidate models and compare them using various metrics. VerticaPy provides a model tracking system to facilitate this process for a given experiment. The data scientist first creates an experiment object and then adds candidate models to that experiment. The information related to each experiment can be automatically backed up in the database, so if the Python environment is closed for any reason, the experiment can be easily retrieved later. The experiment object also provides methods to compare the prediction performance of its associated models and to pick the model with the best performance on a specific test dataset.
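
The idea of backing up an experiment so that it survives the end of a Python session can be sketched in plain Python. This is only a conceptual illustration and not how VerticaPy implements it: the class `ToyExperiment` is a hypothetical name, a JSON file stands in for the database table, and the metric values are made up.

```python
import json
import os
import tempfile


class ToyExperiment:
    """Hypothetical experiment: a named set of models with their metrics."""

    def __init__(self, name, backup_path=None):
        self.name = name
        self.backup_path = backup_path
        self.models = {}  # model name -> {metric name: value}
        # Reload previously stored models if a backup already exists.
        if backup_path and os.path.exists(backup_path):
            with open(backup_path) as f:
                self.models = json.load(f)

    def add_model(self, model_name, metrics):
        self.models[model_name] = dict(metrics)
        # Persist after every change so nothing is lost if the session ends.
        if self.backup_path:
            with open(self.backup_path, "w") as f:
                json.dump(self.models, f)


# A temporary JSON file plays the role of the experiment table (assumption).
backup = os.path.join(tempfile.mkdtemp(), "toy_exp.json")

exp = ToyExperiment("toy_exp", backup_path=backup)
exp.add_model("model_a", {"auc": 0.71})
exp.add_model("model_b", {"auc": 0.74})
del exp  # simulate closing the Python session

restored = ToyExperiment("toy_exp", backup_path=backup)
print(sorted(restored.models))  # ['model_a', 'model_b']
```

Writing the backup on every `add_model` call keeps the stored copy in step with the in-memory object, at the cost of one write per model.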

The following example demonstrates how the model tracking feature can be used for an experiment that trains a few binary-classifier models on the Titanic dataset. First, we load the Titanic data into our database and store it as a virtual dataframe (vDF):

[40]:

from verticapy.datasets import load_titanic

titanic_vDF = load_titanic()
predictors = ["age", "fare", "pclass"]
response = "survived"

We then define a vExperiment object to track the candidate models. To define the experiment object, specify the following parameters:

experiment_name: The name of the experiment.
test_relation: Relation or vDF to use to test the model.
X: List of the predictors.
y: Response column.

Note: If experiment_type is set to clustering, test_relation, X, and y must be set to None.

The following parameters are optional:

experiment_type: By default ‘auto’, meaning VerticaPy tries to detect the experiment type from the response value. However, it might be cleaner to explicitly specify the experiment type. The other valid values for this parameter are ‘regressor’ (for regression models), ‘binary’ (for binary classification models), ‘multi’ (for multiclass classification models), and ‘clustering’ (for clustering models).
experiment_table: The name of the table ([schema_name.]table_name) in the database to archive the experiment. The experiment information won’t be backed up in the database without specifying this parameter. If the table already exists, its previously stored experiments are loaded to the object. In this case, the user must have SELECT, INSERT, and DELETE privileges on the table. If the table doesn’t exist and the user has the necessary privileges for creating such a table, the table is created.

[41]:

import verticapy.mlops.model_tracking as mt

my_experiment_1 = mt.vExperiment(
    experiment_name="my_exp_1",
    test_relation=titanic_vDF,
    X=predictors,
    y=response,
    experiment_type="binary",
    experiment_table="my_exp_table_1",
)

After creating the experiment object, we can train different models and add them to the experiment:

[42]:

# training a LogisticRegression model
from verticapy.learn.linear_model import LogisticRegression
model_1 = LogisticRegression("logistic_reg_m", overwrite_model=True)
model_1.fit(titanic_vDF, predictors, response)
my_experiment_1.add_model(model_1)

# training a LinearSVC model
from verticapy.learn.svm import LinearSVC
model_2 = LinearSVC("svc_m", overwrite_model=True)
model_2.fit(titanic_vDF, predictors, response)
my_experiment_1.add_model(model_2)

# training a DecisionTreeClassifier model
from verticapy.learn.tree import DecisionTreeClassifier
model_3 = DecisionTreeClassifier("tree_m", overwrite_model=True, max_depth=3)
model_3.fit(titanic_vDF, predictors, response)
my_experiment_1.add_model(model_3)

So far we have only added three models to the experiment, but we could add many more in a real scenario. Using the experiment object, we can easily list the models in the experiment and pick the one with the best prediction performance based on a specified metric.

[43]:

my_experiment_1.list_models()

[43]:

	model_name	model_type	auc	prc_auc	accuracy	log_loss	precision	recall	f1_score	mcc	informedness	markedness	csi	user_defined_metrics
1	logistic_reg_m	LogisticRegression	0.7260510240747399	0.6425502893997783	0.6957831325301205	0.255281396006693	0.658273381294964	0.4680306905370844	0.547085201793722	nan	0.3110058971486547	0.36857978798020086	0.3765432098765432	[null]
2	svc_m	LinearSVC	0.7262327999830908	0.6422018358059188	0.6997991967871486	0.268972113802438	0.6678832116788321	0.4680306905370844	0.5503759398496241	nan	0.3176174673965886	0.37979456901955233	0.3796680497925311	[null]
3	tree_m	RandomForestClassifier	0.7211726659761998	0.6890311481722489	0.7068273092369478	0.251082469122067	0.7239819004524887	0.4092071611253197	0.522875816993464	1.6908954624726606	0.30838071484432805	0.42591738432345627	0.35398230088495575	[null]

Rows: 1-3 | Columns: 15

[44]:

top_model = my_experiment_1.load_best_model(metric="auc")
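
To make the selection step concrete, the sketch below re-ranks the three models using metric values copied from the list_models() output above (rounded to six decimals). The helper `best_model` and the `LOWER_IS_BETTER` set are hypothetical names used only for this illustration; they are not part of VerticaPy.

```python
# Metric values copied from the list_models() output (rounded to 6 decimals).
metrics = {
    "logistic_reg_m": {"auc": 0.726051, "prc_auc": 0.642550, "log_loss": 0.255281},
    "svc_m": {"auc": 0.726233, "prc_auc": 0.642202, "log_loss": 0.268972},
    "tree_m": {"auc": 0.721173, "prc_auc": 0.689031, "log_loss": 0.251082},
}

# Metrics for which a smaller value means a better model.
LOWER_IS_BETTER = {"log_loss"}


def best_model(metrics, metric):
    """Return the name of the model that ranks first on the given metric."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(metrics, key=lambda name: metrics[name][metric])


print(best_model(metrics, "auc"))       # svc_m
print(best_model(metrics, "prc_auc"))   # tree_m
print(best_model(metrics, "log_loss"))  # tree_m
```

On AUC, svc_m leads logistic_reg_m by only about 0.0002, while tree_m ranks first on both prc_auc and log_loss, so the chosen metric decides which model is picked.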

The experiment object not only facilitates model tracking but also simplifies cleanup, especially in real-world scenarios where a large number of leftover models often accumulates. The ‘drop’ method removes from the database the experiment’s information and all associated models other than those specified in the keeping_models list.

[45]:

my_experiment_1.drop(keeping_models=[top_model.model_name])
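
The effect of keeping_models amounts to a set difference over the experiment's models. The following sketch shows only that logic; `models_to_drop` is a hypothetical helper and nothing is actually removed from a database.

```python
def models_to_drop(experiment_models, keeping_models=()):
    """Return the models a cleanup would remove, preserving their order."""
    keep = set(keeping_models)
    return [name for name in experiment_models if name not in keep]


all_models = ["logistic_reg_m", "svc_m", "tree_m"]

# Keep only the selected model; the other two would be dropped.
print(models_to_drop(all_models, keeping_models=["svc_m"]))
# ['logistic_reg_m', 'tree_m']

# With nothing to keep, every model of the experiment would be dropped.
print(models_to_drop(all_models))
# ['logistic_reg_m', 'svc_m', 'tree_m']
```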

Experiments are also helpful for performing grid search on hyper-parameters. The following example shows how they can be used to study the impact of the max_iter parameter on the prediction performance of LogisticRegression models.

[46]:

# creating an experiment
my_experiment_2 = mt.vExperiment(
    experiment_name="my_exp_2",
    test_relation=titanic_vDF,
    X=predictors,
    y=response,
    experiment_type="binary",
)

# training LogisticRegression with different values of max_iter
for i in range(1, 5):
    model = LogisticRegression(max_iter=i)
    model.fit(titanic_vDF, predictors, response)
    my_experiment_2.add_model(model)

# plotting prc_auc vs max_iter
my_experiment_2.plot("max_iter", "prc_auc")

# dropping all the models associated with the experiment from the database
my_experiment_2.drop()

[Figure: plot of prc_auc versus max_iter for the models in my_experiment_2]
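
The same kind of sweep can be reproduced without a database to show why the iteration limit matters. The sketch below is a stand-in under stated assumptions: it fits a one-feature logistic regression with plain gradient descent (not the optimizer Vertica uses), on a tiny made-up dataset, and records the training log loss for each max_iter value. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, max_iter, lr=1.0):
    """Run max_iter gradient-descent steps; return (weight, bias)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_iter):
        # Prediction errors under the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Step against the gradient of the mean log loss.
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def log_loss(xs, ys, w, b):
    """Mean negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


# Made-up data: one feature scaled to [-1, 1] and a binary label.
xs = [-1.0, -0.5, -0.2, 0.1, 0.4, 0.9]
ys = [0, 0, 1, 0, 1, 1]

# Sweep max_iter over the same range as the experiment above.
results = {}
for max_iter in range(1, 5):
    w, b = fit_logistic(xs, ys, max_iter)
    results[max_iter] = log_loss(xs, ys, w, b)

for max_iter, loss in results.items():
    print(f"max_iter={max_iter}: log_loss={loss:.4f}")
```

In this toy setting the loss shrinks with every additional iteration, which is the kind of trend a metric-versus-max_iter plot is meant to reveal.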

Model Versioning

In Vertica version 12.0.4, we added support for in-database ML model versioning. We have now integrated it into VerticaPy so that users can combine its capabilities with the other tools in VerticaPy. In VerticaPy, model versioning is a wrapper around a SQL API already built into Vertica. For more information about the concepts of model versioning in Vertica, see the Vertica documentation.

To showcase model versioning, we will begin by registering the top_model picked from the above experiment.

[47]:

top_model.register("top_model_demo")

[47]:

True

When the model owner registers the model, its ownership changes to DBADMIN, and the previous owner receives USAGE privileges. Registered models are referred to by their registered_name and version. Only DBADMIN or a user with the MLSUPERVISOR role can change the status of a registered model. We have provided the RegisteredModel class in VerticaPy for working with registered models.
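
Addressing a model by registered_name and version can be pictured as a small registry. The sketch below is purely illustrative: `ToyRegistry` is a hypothetical class, the rule that each new registration under the same name receives the next integer version is an assumption suggested by the versions 1 to 3 in the output further down, and the first two model names are invented.

```python
class ToyRegistry:
    """Hypothetical registry: models are addressed by (registered_name, version)."""

    def __init__(self):
        # registered_name -> list of model names; list index + 1 is the version
        self._versions = {}

    def register(self, registered_name, model_name):
        versions = self._versions.setdefault(registered_name, [])
        versions.append(model_name)
        return len(versions)  # version assigned to this registration (assumed rule)

    def lookup(self, registered_name, version):
        return self._versions[registered_name][version - 1]


registry = ToyRegistry()
print(registry.register("top_model_demo", "earlier_model_1"))  # 1
print(registry.register("top_model_demo", "earlier_model_2"))  # 2
print(registry.register("top_model_demo", "svc_m"))            # 3
print(registry.lookup("top_model_demo", 3))                    # svc_m
```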

We will now make a RegisteredModel object for our recently registered model and change its status to “production”. We can then use the registered model for scoring.

[48]:

import verticapy.mlops.model_versioning as mv

rm = mv.RegisteredModel("top_model_demo")

To see the list of all models registered as “top_model_demo”, use the list_models() method.

[49]:

rm.list_models()

[49]:

	registered_name (Varchar(128))	registered_version (Integer)	status (Varchar(128))	registered_time (Timestamptz(35))	model_id (Integer)	schema_name (Varchar(128))	model_name (Varchar(128))	model_type (Varchar(128))	category (Varchar(128))
1	top_model_demo	3	UNDER_REVIEW	2023-10-19 15:56:21.141198-04:00	45035996273853304	public	svc_m	SVM_CLASSIFIER	VERTICA_MODELS

Rows: 1-1 | Columns: 9

The model we just registered has a status of “under_review”. The next step is to change the status of the model to “staging”, which is meant for A/B testing the model. Assuming the model performs well, we will promote it to the “production” status. Please note that we should specify the right version of the registered model from the above table.

[51]:

# changing the status of the model to staging
rm.change_status(version=3, new_status="staging")

# changing the status of the model to production
rm.change_status(version=3, new_status="production")

There can only be one version of the registered model in “production” at any time. The following predict function applies to the model with “production” status by default. If you want to run the predict function on a model with a status other than “production”, you must also specify the model version.

[52]:

rm.predict(titanic_vDF, X=predictors, name="predicted_value")

[52]:

	pclass (Integer)	survived (Integer)	(Varchar(164))	sex (Varchar(20))	age (Numeric(8))	sibsp (Integer)	parch (Integer)	ticket (Varchar(36))	fare (Numeric(12))	cabin (Varchar(30))	embarked (Varchar(20))	boat (Varchar(100))	body (Integer)	(Varchar(100))	predicted_value (Integer)
1	1	0		male	30.0	1	2	113781	151.55	C22 C26	S	[null]	135		1
2	1	0		male	45.0	0	0	113784	35.5	T	S	[null]	[null]		1
3	1	0		male	[null]	0	0	113798	31.0	[null]	S	[null]	[null]		[null]
4	1	0		male	28.0	0	0	113059	47.1	[null]	S	[null]	[null]		1
5	1	0		male	50.0	1	0	PC 17761	106.425	C86	C	[null]	62		1
6	1	0		female	36.0	0	0	PC 17531	31.6792	A29	C	[null]	[null]		1
7	1	0		male	30.0	0	0	113051	27.75	C111	C	[null]	[null]		1
8	1	0		male	46.0	0	0	PC 17593	79.2	B82 B84	C	[null]	[null]		1
9	1	0		male	40.0	0	0	112059	0.0	B94	S	[null]	110		1
10	1	0		male	[null]	0	0	17463	51.8625	E46	S	[null]	[null]		[null]
11	1	0		male	42.0	1	0	113789	52.0	[null]	S	[null]	38		1
12	1	0		male	46.0	0	0	694	26.0	[null]	S	[null]	80		1
13	1	0		male	[null]	0	0	PC 17612	27.7208	[null]	C	[null]	[null]		[null]
14	1	0		male	46.0	0	0	13050	75.2417	C6	C	[null]	292		1
15	1	0		male	54.0	0	0	17463	51.8625	E46	S	[null]	175		0
16	1	0		male	65.0	0	1	113509	61.9792	B30	C	[null]	234		0
17	1	0		male	45.5	0	0	113043	28.5	C124	S	[null]	166		1
18	1	0		male	23.0	0	0	12749	93.5	B24	S	[null]	[null]		1
19	1	0		male	29.0	1	0	113776	66.6	C2	S	[null]	[null]		1
20	1	0		male	47.0	0	0	110465	52.0	C110	S	[null]	207		1
21	1	0		male	38.0	0	0	19972	0.0	[null]	S	[null]	[null]		1
22	1	0		male	22.0	0	0	PC 17760	135.6333	[null]	C	[null]	232		1
23	1	0		male	31.0	0	0	PC 17590	50.4958	A24	S	[null]	[null]		1
24	1	0		male	36.0	0	0	13049	40.125	A10	C	[null]	[null]		1
25	1	0		male	33.0	0	0	113790	26.55	[null]	S	[null]	109		1
26	1	0		male	56.0	0	0	17764	30.6958	A7	C	[null]	[null]		0
27	1	0		male	62.0	0	0	113514	26.55	C87	S	[null]	[null]		0
28	1	0		male	[null]	0	0	PC 17605	27.7208	[null]	C	[null]	[null]		[null]
29	1	0		female	63.0	1	0	PC 17483	221.7792	C55 C57	S	[null]	[null]		1
30	1	0		male	61.0	0	0	36963	32.3208	D50	S	[null]	46		0
31	1	0		male	40.0	0	0	PC 17601	27.7208	[null]	C	[null]	[null]		1
32	1	0		male	21.0	0	1	35281	77.2875	D26	S	[null]	169		1
33	1	0		male	27.0	0	2	113503	211.5	C82	C	[null]	[null]		1
34	1	0		male	62.0	0	0	113807	26.55	[null]	S	[null]	[null]		0
35	1	1		female	63.0	1	0	13502	77.9583	D7	S	10	[null]		0
36	1	1		female	32.0	0	0	11813	76.2917	D15	C	8	[null]		1
37	1	1		female	47.0	1	1	11751	52.5542	D35	S	5	[null]		1
38	1	1		female	29.0	0	0	PC 17483	221.7792	C97	S	8	[null]		1
39	1	1		female	19.0	1	0	11967	91.0792	B49	C	7	[null]		1
40	1	1		female	58.0	0	0	113783	26.55	C103	S	8	[null]		0
41	1	1		female	44.0	0	0	PC 17610	27.7208	B4	C	6	[null]		1
42	1	1		female	59.0	2	0	11769	51.4792	C101	S	D	[null]		0
43	1	1		female	41.0	0	0	16966	134.5	E40	C	3	[null]		1
44	1	1		male	42.0	0	0	PC 17476	26.2875	E24	S	5	[null]		1
45	1	1		female	53.0	0	0	PC 17606	27.4458	[null]	C	6	[null]		0
46	1	1		female	58.0	0	1	PC 17755	512.3292	B51 B53 B55	C	3	[null]		1
47	1	1		male	11.0	1	2	113760	120.0	B96 B98	S	4	[null]		1
48	1	1		female	36.0	1	2	113760	120.0	B96 B98	S	4	[null]		1
49	1	1		female	76.0	1	0	19877	78.85	C46	S	6	[null]		0
50	1	1		female	36.0	0	0	PC 17608	262.375	B61	C	4	[null]		1
51	1	1		female	39.0	1	1	PC 17756	83.1583	E49	C	14	[null]		1
52	1	1		female	38.0	1	0	PC 17599	71.2833	C85	C	4	[null]		1
53	1	1		female	33.0	0	0	113781	151.55	[null]	S	8	[null]		1
54	1	1		female	27.0	1	2	F.C. 12750	52.0	B71	S	3	[null]		1
55	1	1		male	4.0	0	2	33638	81.8583	A34	S	5	[null]		1
56	1	1		female	24.0	3	2	19950	263.0	C23 C25 C27	S	10	[null]		1
57	1	1		female	60.0	1	4	19950	263.0	C23 C25 C27	S	10	[null]		1
58	1	1		male	60.0	1	1	13567	79.2	B41	C	5	[null]		0
59	1	1		female	45.0	0	1	112378	59.4	[null]	C	7	[null]		1
60	1	1		male	49.0	1	0	17453	89.1042	C92	C	5	[null]		1
61	1	1		male	48.0	1	0	PC 17572	76.7292	D33	C	3	[null]		1
62	1	1		male	27.0	0	0	PC 17572	76.7292	D49	C	3	[null]		1
63	1	1		female	24.0	0	0	11767	83.1583	C54	C	7	[null]		1
64	1	1		female	52.0	1	1	12749	93.5	B69	S	3	[null]		1
65	1	1		female	16.0	0	1	111361	57.9792	B18	C	4	[null]		1
66	1	1		female	44.0	0	1	111361	57.9792	B18	C	4	[null]		1
67	1	1		female	30.0	0	0	PC 17761	106.425	[null]	C	2	[null]		1
68	1	1		female	49.0	0	0	17465	25.9292	D17	S	8	[null]		1
69	1	1		male	35.0	0	0	PC 17755	512.3292	B101	C	3	[null]		1
70	1	1		female	55.0	0	0	112377	27.7208	[null]	C	6	[null]		0
71	1	1		female	58.0	0	0	PC 17569	146.5208	B80	C	[null]	[null]		1
72	1	1		female	15.0	0	1	24160	211.3375	B5	S	2	[null]		1
73	1	1		female	[null]	1	0	PC 17604	82.1708	[null]	C	6	[null]		[null]
74	1	1		female	39.0	0	0	PC 17758	108.9	C105	C	8	[null]		1
75	1	1		female	22.0	0	1	113509	61.9792	B36	C	5	[null]		1
76	1	1		female	17.0	1	0	PC 17758	108.9	C65	C	8	[null]		1
77	1	1		male	52.0	0	0	113786	30.5	C104	S	6	[null]		0
78	1	1		female	56.0	0	1	11767	83.1583	C50	C	7	[null]		1
79	1	1		male	[null]	0	0	111163	26.0	[null]	S	1	[null]		[null]
80	1	1		female	35.0	1	0	13236	57.75	C28	C	11	[null]		1
81	1	1		male	56.0	0	0	13213	35.5	A26	C	3	[null]		0
82	1	1		male	45.0	1	1	16966	134.5	E34	C	3	[null]		1
83	1	1		female	40.0	1	1	16966	134.5	E34	C	3	[null]		1
84	1	1		female	[null]	0	0	PC 17585	79.2	[null]	C	D	[null]		[null]
85	1	1		female	35.0	0	0	PC 17755	512.3292	[null]	C	3	[null]		1
86	1	1		female	21.0	0	0	113795	26.55	[null]	S	8 10	[null]		1
87	1	1		male	21.0	0	1	PC 17597	61.3792	[null]	C	A	[null]		1
88	1	1		male	[null]	0	0	19947	35.5	C52	S	D	[null]		[null]
89	2	0		male	23.0	0	0	C.A. 31030	10.5	[null]	S	[null]	[null]		0
90	2	0		male	28.0	0	0	C.A./SOTON 34068	10.5	[null]	S	[null]	[null]		0
91	2	0		male	28.0	0	0	244358	26.0	[null]	S	[null]	[null]		0
92	2	0		male	42.0	0	0	211535	13.0	[null]	S	[null]	[null]		0
93	2	0		male	27.0	0	0	220367	13.0	[null]	S	[null]	[null]		0
94	2	0		male	60.0	1	1	29750	39.0	[null]	S	[null]	[null]		0
95	2	0		male	25.0	1	0	236853	26.0	[null]	S	[null]	[null]		0
96	2	0		male	25.0	0	0	234686	13.0	[null]	S	[null]	97		0
97	2	0		male	42.0	0	0	244310	13.0	[null]	S	[null]	[null]		0
98	2	0		female	[null]	0	0	F.C.C. 13534	21.0	[null]	S	[null]	[null]		[null]
99	2	0		male	18.0	0	0	S.O.C. 14879	73.5	[null]	S	[null]	[null]		1
100	2	0		male	25.0	0	0	C.A. 31029	31.5	[null]	S	[null]	[null]		0

Rows: 1-100 of 1234 | Columns: 15
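
The status rules described in this section (a newly registered version starts as under_review, only one version is in production at a time, and scoring defaults to the production version) can be modeled in a few lines. This is a hedged sketch and not the Vertica or VerticaPy implementation: `ToyRegisteredModel` and `resolve_version` are hypothetical names, and moving the displaced production version to an "archived" status is an assumption made for this example.

```python
class ToyRegisteredModel:
    """Hypothetical model of version statuses under one registered name."""

    def __init__(self, registered_name):
        self.registered_name = registered_name
        self.status = {}   # version -> current status
        self.history = []  # (version, old_status, new_status)

    def _set(self, version, new_status):
        old_status = self.status.get(version, "unregistered")
        self.status[version] = new_status
        self.history.append((version, old_status, new_status))

    def register(self):
        version = len(self.status) + 1
        self._set(version, "under_review")
        return version

    def change_status(self, version, new_status):
        if version not in self.status:
            raise KeyError(f"unknown version {version}")
        if new_status == "production":
            # Enforce the single-production rule (archiving is an assumption).
            for other, status in list(self.status.items()):
                if status == "production" and other != version:
                    self._set(other, "archived")
        self._set(version, new_status)

    def resolve_version(self, version=None):
        """Pick the version to score with: explicit, else the production one."""
        if version is not None:
            return version
        for candidate, status in self.status.items():
            if status == "production":
                return candidate
        raise LookupError("no version is in production")


toy_rm = ToyRegisteredModel("top_model_demo")
for _ in range(2):
    v = toy_rm.register()
    toy_rm.change_status(v, "staging")
    toy_rm.change_status(v, "production")

print(toy_rm.status)                      # {1: 'archived', 2: 'production'}
print(toy_rm.resolve_version())           # 2
print(toy_rm.resolve_version(version=1))  # 1
```

Keeping a history list alongside the current status mirrors the idea of a status history table: the current state answers "which version serves predictions now", while the history answers "how did it get there".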

DBADMIN and users who are granted SELECT privileges on the v_monitor.model_status_history table are able to monitor the status history of registered models.

[53]:

rm.list_status_history()

[53]:

	registered_name (Varchar(128))	registered_version (Integer)	new_status (Varchar(128))	old_status (Varchar(128))	status_change_time (Timestamptz(35))	operator_id (Integer)	operator_name (Varchar(128))	model_id (Integer)	schema_name (Varchar(128))	model_name (Varchar(128))
1	top_model_demo	3	UNDER_REVIEW	UNREGISTERED	2023-10-19 15:56:21.150875-04:00	45035996273704962	afard	45035996273853304	public	svc_m
2	top_model_demo	3	STAGING	UNDER_REVIEW	2023-10-19 15:56:38.961325-04:00	45035996273704962	afard	45035996273853304	public	svc_m
3	top_model_demo	3	PRODUCTION	STAGING	2023-10-19 15:56:39.088134-04:00	45035996273704962	afard	45035996273853304	public	svc_m
4	top_model_demo	1	UNDER_REVIEW	UNREGISTERED	2023-10-19 15:46:38.893000-04:00	45035996273704962	afard	45035996273851606	[null]	[null]
5	top_model_demo	1	STAGING	UNDER_REVIEW	2023-10-19 15:47:31.840529-04:00	45035996273704962	afard	45035996273851606	[null]	[null]
6	top_model_demo	1	PRODUCTION	STAGING	2023-10-19 15:47:31.960033-04:00	45035996273704962	afard	45035996273851606	[null]	[null]
7	top_model_demo	2	UNDER_REVIEW	UNREGISTERED	2023-10-19 15:53:28.648945-04:00	45035996273704962	afard	45035996273852750	[null]	[null]
8	top_model_demo	2	STAGING	UNDER_REVIEW	2023-10-19 15:54:02.120317-04:00	45035996273704962	afard	45035996273852750	[null]	[null]
9	top_model_demo	2	PRODUCTION	STAGING	2023-10-19 15:54:02.238576-04:00	45035996273704962	afard	45035996273852750	[null]	[null]

Rows: 1-9 | Columns: 10
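
A status history like this one lends itself to simple auditing. As a small illustration, the sketch below takes the three rows for version 3 from the output above and computes how long the model stayed in each status before the next change. It uses only the Python standard library and hard-codes the timestamps rather than querying the database.

```python
from datetime import datetime

# (new_status, status_change_time) for version 3, copied from the output above.
changes = [
    ("UNDER_REVIEW", "2023-10-19 15:56:21.150875-04:00"),
    ("STAGING", "2023-10-19 15:56:38.961325-04:00"),
    ("PRODUCTION", "2023-10-19 15:56:39.088134-04:00"),
]

# Parse the timestamps, including their UTC offset.
parsed = [(status, datetime.fromisoformat(ts)) for status, ts in changes]

# Time spent in a status = time of the next change minus time of this change.
durations = {}
for (status, start), (_, end) in zip(parsed, parsed[1:]):
    durations[status] = (end - start).total_seconds()

for status, seconds in durations.items():
    print(f"{status}: {seconds:.3f} s")
# UNDER_REVIEW: 17.810 s
# STAGING: 0.127 s
```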

Conclusion

The addition of model tracking and model versioning greatly improves VerticaPy’s MLOps capabilities. We are constantly working to improve VerticaPy and address the needs of data scientists who want to use the power of the Vertica database in their data analyses. If you have any comments or questions, don’t hesitate to reach out in the VerticaPy GitHub community.