Model Tracking and Versioning#

Introduction#

VerticaPy is an open-source Python package on top of Vertica database that supports pandas-like virtual dataframes over database relations. VerticaPy provides scikit-type machine learning functionality on these virtual dataframes. Data is not moved out of the database while performing machine learning or statistical analysis on virtual dataframes. Instead, the computations are done at scale in a distributed fashion inside the Vertica cluster. VerticaPy also takes advantage of multiple Python libraries to create a variety of charts, providing a quick and easy method to illustrate your statistical data.

In this article, we will introduce two new MLOps tools recently added to VerticaPy: Model Tracking and Model Versioning. For comprehensive documentation of all VerticaPy functionality, visit https://www.vertica.com/python.

Model Tracking#

Data scientists usually train many ML models for a project. To help choose the best model, data scientists need a way to keep track of all candidate models and compare them using various metrics. VerticaPy provides a model tracking system to facilitate this process for a given experiment. The data scientist first creates an experiment object and then adds candidate models to that experiment. The information related to each experiment can be automatically backed up in the database, so if the Python environment is closed for any reason, like a holiday, the data scientist has peace of mind that the experiment can be easily retrieved. The experiment object also provides methods to easily compare the prediction performance of its associated models and to pick the model with the best performance on a specific test dataset.

The following example demonstrates how the model tracking feature can be used for an experiment that trains a few binary-classifier models on the Titanic dataset. First, we must load the titanic data into our database and store it as a virtual dataframe (vDF):

[40]:
from verticapy.datasets import load_titanic

titanic_vDF = load_titanic()
predictors = ["age", "fare", "pclass"]
response = "survived"

We then define a vExperiment object to track the candidate models. To define the experiment object, specify the following parameters:

  • experiment_name: The name of the experiment.

  • test_relation: Relation or vDF to use to test the model.

  • X: List of the predictors.

  • y: Response column.

Note: If experiments_type is set to clustering, test_relation, X, and Y must be set to None.

The following parameters are optional:

  • experiment_type: By default ‘auto’, meaning VerticaPy tries to detect the experiment type from the response value. However, it might be cleaner to explicitly specify the experiment type. The other valid values for this parameter are ‘regressor’ (for regression models), ‘binary’ (for binary classification models), ‘multi’ (for multiclass classification models), and ‘clustering’ (for clustering models).

  • experiment_table: The name of the table ([schema_name.]table_name) in the database to archive the experiment. The experiment information won’t be backed up in the database without specifying this parameter. If the table already exists, its previously stored experiments are loaded to the object. In this case, the user must have SELECT, INSERT, and DELETE privileges on the table. If the table doesn’t exist and the user has the necessary privileges for creating such a table, the table is created.

[41]:
import verticapy.mlops.model_tracking as mt

my_experiment_1 = mt.vExperiment(experiment_name = "my_exp_1",
                              test_relation = titanic_vDF,
                              X=predictors,
                              y=response,
                              experiment_type="binary",
                              experiment_table="my_exp_table_1")

After creating the experiment object, we can train different models and add them to the experiment:

[42]:
# training a LogisticRegression model
from verticapy.learn.linear_model import LogisticRegression
model_1 = LogisticRegression("logistic_reg_m", overwrite_model=True)
model_1.fit(titanic_vDF, predictors, response)
my_experiment_1.add_model(model_1)

# training a LinearSVC model
from verticapy.learn.svm import LinearSVC
model_2 = LinearSVC("svc_m", overwrite_model=True)
model_2.fit(titanic_vDF, predictors, response)
my_experiment_1.add_model(model_2)

# training a DecisionTreeClassifier model
from verticapy.learn.tree import DecisionTreeClassifier
model_3 = DecisionTreeClassifier("tree_m", overwrite_model=True, max_depth=3)
model_3.fit(titanic_vDF, predictors, response)
my_experiment_1.add_model(model_3)

So far we have only added three models to the experiment, but we could add many more in a real scenario. Using the experiment object, we can easily list the models in the experiment and pick the one with the best prediction performance based on a specified metric.

[43]:
my_experiment_1.list_models()
[43]:
model_name
model_type
auc
prc_auc
accuracy
log_loss
precision
recall
f1_score
mcc
informedness
markedness
csi
user_defined_metrics
1logistic_reg_mLogisticRegression0.72605102407473990.64255028939977830.69578313253012050.2552813960066930.6582733812949640.46803069053708440.547085201793722nan0.31100589714865470.368579787980200860.3765432098765432[null]
2svc_mLinearSVC0.72623279998309080.64220183580591880.69979919678714860.2689721138024380.66788321167883210.46803069053708440.5503759398496241nan0.31761746739658860.379794569019552330.3796680497925311[null]
3tree_mRandomForestClassifier0.72117266597619980.68903114817224890.70682730923694780.2510824691220670.72398190045248870.40920716112531970.5228758169934641.69089546247266060.308380714844328050.425917384323456270.35398230088495575[null]
Rows: 1-3 | Columns: 15
[44]:
top_model = my_experiment_1.load_best_model(metric="auc")

The experiment object facilitates not only model tracking but also makes cleanup super easy, especially in real-world scenarios where there is often a large number of leftover models. The ‘drop’ method drops from the database the info of the experiment and all associated models other than those specified in the keeping_models list.

[45]:
my_experiment_1.drop(keeping_models=[top_model.model_name])

Experiments are also helpful for performing grid search on hyper-parameters. The following example shows how they can be used to study the impact of the max_iter parameter on the prediction performance of LogisticRegression models.

[46]:
# creating an experiment
my_experiment_2 = mt.vExperiment(experiment_name = "my_exp_2",
                              test_relation = titanic_vDF,
                              X=predictors,
                              y=response,
                              experiment_type="binary")

# training LogisticRegression with different values of max_iter
for i in range(1, 5):
    model = LogisticRegression(max_iter=i)
    model.fit(titanic_vDF, predictors, response)
    my_experiment_2.add_model(model)

# plotting prc_auc vs max_iter
my_experiment_2.plot("max_iter", "prc_auc")

# cleaning all the models associated to the experimen from the database
my_experiment_2.drop()
../../../_images/notebooks_ml_model_tracking_versioning_index_15_0.png

Model Versioning#

In Vertica version 12.0.4, we added support for In-DB ML Model Versioning. Now, we have integrated it into VerticaPy so that users can utilize its capabilities along with the other tools in VerticaPy. In VerticaPy, model versioning is a wrapper around an SQL API already built in Vertica. For more information about the concepts of model versioning in Vertica, see the Vertica documentation.

To showcase model versioning, we will begin by registering the top_model picked from the above experiment.

[47]:
top_model.register("top_model_demo")
[47]:
True

When the model owner registers the model, its ownership changes to DBADMIN, and the previous owner receives USAGE privileges. Registered models are referred to by their registered_name and version. Only DBADMIN or a user with the MLSUPERVISOR role can change the status of a registered model. We have provided the RegisteredModel class in VerticaPy for working with registered models.

We will now make a RegisteredModel object for our recently registered model and change its status to “production”. We can then use the registered model for scoring.

[48]:
import verticapy.mlops.model_versioning as mv

rm = mv.RegisteredModel("top_model_demo")

To see the list of all models registered as “top_model_demo”, use the list_models() method.

[49]:
rm.list_models()
[49]:
Abc
registered_name
Varchar(128)
123
registered_version
Integer
Abc
status
Varchar(128)
📅
registered_time
Timestamptz(35)
123
model_id
Integer
Abc
schema_name
Varchar(128)
Abc
model_name
Varchar(128)
Abc
model_type
Varchar(128)
Abc
category
Varchar(128)
1top_model_demo3UNDER_REVIEW2023-10-19 15:56:21.141198-04:0045035996273853304publicsvc_mSVM_CLASSIFIERVERTICA_MODELS
Rows: 1-1 | Columns: 9

The model we just registered has a status of “under_review”. The next step is to change the status of the model to “staging”, which is meant for A/B testing the model. Assuming the model performs well, we will promote it to the “production” status. Please note that we should specify the right version of the registered model from the above table.

[51]:
# changing the status of the model to staging
rm.change_status(version=3, new_status="staging")

# changing the status of the model to production
rm.change_status(version=3, new_status="production")

There can only be one version of the registered model in “production” at any time. The following predict function applies to the model with “production” status by default. If you want to run the predict function on a model with a status other than “production”, you must also specify the model version.

[52]:
rm.predict(titanic_vDF, X=predictors, name="predicted_value")
[52]:
123
pclass
Integer
123
survived
Integer
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(8)
123
sibsp
Integer
123
parch
Integer
Abc
ticket
Varchar(36)
123
fare
Numeric(12)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Integer
Abc
Varchar(100)
123
predicted_value
Integer
110male30.012113781151.55C22 C26S[null]1351
210male45.00011378435.5TS[null][null]1
310male[null]0011379831.0[null]S[null][null][null]
410male28.00011305947.1[null]S[null][null]1
510male50.010PC 17761106.425C86C[null]621
610female36.000PC 1753131.6792A29C[null][null]1
710male30.00011305127.75C111C[null][null]1
810male46.000PC 1759379.2B82 B84C[null][null]1
910male40.0001120590.0B94S[null]1101
1010male[null]001746351.8625E46S[null][null][null]
1110male42.01011378952.0[null]S[null]381
1210male46.00069426.0[null]S[null]801
1310male[null]00PC 1761227.7208[null]C[null][null][null]
1410male46.0001305075.2417C6C[null]2921
1510male54.0001746351.8625E46S[null]1750
1610male65.00111350961.9792B30C[null]2340
1710male45.50011304328.5C124S[null]1661
1810male23.0001274993.5B24S[null][null]1
1910male29.01011377666.6C2S[null][null]1
2010male47.00011046552.0C110S[null]2071
2110male38.000199720.0[null]S[null][null]1
2210male22.000PC 17760135.6333[null]C[null]2321
2310male31.000PC 1759050.4958A24S[null][null]1
2410male36.0001304940.125A10C[null][null]1
2510male33.00011379026.55[null]S[null]1091
2610male56.0001776430.6958A7C[null][null]0
2710male62.00011351426.55C87S[null][null]0
2810male[null]00PC 1760527.7208[null]C[null][null][null]
2910female63.010PC 17483221.7792C55 C57S[null][null]1
3010male61.0003696332.3208D50S[null]460
3110male40.000PC 1760127.7208[null]C[null][null]1
3210male21.0013528177.2875D26S[null]1691
3310male27.002113503211.5C82C[null][null]1
3410male62.00011380726.55[null]S[null][null]0
3511female63.0101350277.9583D7S10[null]0
3611female32.0001181376.2917D15C8[null]1
3711female47.0111175152.5542D35S5[null]1
3811female29.000PC 17483221.7792C97S8[null]1
3911female19.0101196791.0792B49C7[null]1
4011female58.00011378326.55C103S8[null]0
4111female44.000PC 1761027.7208B4C6[null]1
4211female59.0201176951.4792C101SD[null]0
4311female41.00016966134.5E40C3[null]1
4411male42.000PC 1747626.2875E24S5[null]1
4511female53.000PC 1760627.4458[null]C6[null]0
4611female58.001PC 17755512.3292B51 B53 B55C3[null]1
4711male11.012113760120.0B96 B98S4[null]1
4811female36.012113760120.0B96 B98S4[null]1
4911female76.0101987778.85C46S6[null]0
5011female36.000PC 17608262.375B61C4[null]1
5111female39.011PC 1775683.1583E49C14[null]1
5211female38.010PC 1759971.2833C85C4[null]1
5311female33.000113781151.55[null]S8[null]1
5411female27.012F.C. 1275052.0B71S3[null]1
5511male4.0023363881.8583A34S5[null]1
5611female24.03219950263.0C23 C25 C27S10[null]1
5711female60.01419950263.0C23 C25 C27S10[null]1
5811male60.0111356779.2B41C5[null]0
5911female45.00111237859.4[null]C7[null]1
6011male49.0101745389.1042C92C5[null]1
6111male48.010PC 1757276.7292D33C3[null]1
6211male27.000PC 1757276.7292D49C3[null]1
6311female24.0001176783.1583C54C7[null]1
6411female52.0111274993.5B69S3[null]1
6511female16.00111136157.9792B18C4[null]1
6611female44.00111136157.9792B18C4[null]1
6711female30.000PC 17761106.425[null]C2[null]1
6811female49.0001746525.9292D17S8[null]1
6911male35.000PC 17755512.3292B101C3[null]1
7011female55.00011237727.7208[null]C6[null]0
7111female58.000PC 17569146.5208B80C[null][null]1
7211female15.00124160211.3375B5S2[null]1
7311female[null]10PC 1760482.1708[null]C6[null][null]
7411female39.000PC 17758108.9C105C8[null]1
7511female22.00111350961.9792B36C5[null]1
7611female17.010PC 17758108.9C65C8[null]1
7711male52.00011378630.5C104S6[null]0
7811female56.0011176783.1583C50C7[null]1
7911male[null]0011116326.0[null]S1[null][null]
8011female35.0101323657.75C28C11[null]1
8111male56.0001321335.5A26C3[null]0
8211male45.01116966134.5E34C3[null]1
8311female40.01116966134.5E34C3[null]1
8411female[null]00PC 1758579.2[null]CD[null][null]
8511female35.000PC 17755512.3292[null]C3[null]1
8611female21.00011379526.55[null]S8 10[null]1
8711male21.001PC 1759761.3792[null]CA[null]1
8811male[null]001994735.5C52SD[null][null]
8920male23.000C.A. 3103010.5[null]S[null][null]0
9020male28.000C.A./SOTON 3406810.5[null]S[null][null]0
9120male28.00024435826.0[null]S[null][null]0
9220male42.00021153513.0[null]S[null][null]0
9320male27.00022036713.0[null]S[null][null]0
9420male60.0112975039.0[null]S[null][null]0
9520male25.01023685326.0[null]S[null][null]0
9620male25.00023468613.0[null]S[null]970
9720male42.00024431013.0[null]S[null][null]0
9820female[null]00F.C.C. 1353421.0[null]S[null][null][null]
9920male18.000S.O.C. 1487973.5[null]S[null][null]1
10020male25.000C.A. 3102931.5[null]S[null][null]0
Rows: 1-100 of 1234 | Columns: 15

DBADMIN and users who are granted SELECT privileges on the v_monitor.model_status_history table are able to monitor the status history of registered models.

[53]:
rm.list_status_history()
[53]:
Abc
registered_name
Varchar(128)
123
registered_version
Integer
Abc
new_status
Varchar(128)
Abc
old_status
Varchar(128)
📅
status_change_time
Timestamptz(35)
123
operator_id
Integer
Abc
operator_name
Varchar(128)
123
model_id
Integer
Abc
schema_name
Varchar(128)
Abc
model_name
Varchar(128)
1top_model_demo3UNDER_REVIEWUNREGISTERED2023-10-19 15:56:21.150875-04:0045035996273704962afard45035996273853304publicsvc_m
2top_model_demo3STAGINGUNDER_REVIEW2023-10-19 15:56:38.961325-04:0045035996273704962afard45035996273853304publicsvc_m
3top_model_demo3PRODUCTIONSTAGING2023-10-19 15:56:39.088134-04:0045035996273704962afard45035996273853304publicsvc_m
4top_model_demo1UNDER_REVIEWUNREGISTERED2023-10-19 15:46:38.893000-04:0045035996273704962afard45035996273851606[null][null]
5top_model_demo1STAGINGUNDER_REVIEW2023-10-19 15:47:31.840529-04:0045035996273704962afard45035996273851606[null][null]
6top_model_demo1PRODUCTIONSTAGING2023-10-19 15:47:31.960033-04:0045035996273704962afard45035996273851606[null][null]
7top_model_demo2UNDER_REVIEWUNREGISTERED2023-10-19 15:53:28.648945-04:0045035996273704962afard45035996273852750[null][null]
8top_model_demo2STAGINGUNDER_REVIEW2023-10-19 15:54:02.120317-04:0045035996273704962afard45035996273852750[null][null]
9top_model_demo2PRODUCTIONSTAGING2023-10-19 15:54:02.238576-04:0045035996273704962afard45035996273852750[null][null]
Rows: 1-9 | Columns: 10

Conclusion#

The addition of model tracking and model versioning to the VerticaPy toolkit greatly improves VerticaPy’s MLOps capabilities. We are constantly working to improve VerticaPy and address the needs of data scientists who wish to harness the power of Vertica database to empower their data analyses. If you have any comments or questions, don’t hesitate to reach out in the VerticaPy github community.