End-to-End Machine Learning Solution with Vertica and Domino Data Lab Using VerticaPy

Domino Data Lab is a platform that enables data scientists in an organization to collaborate, build and deploy data science applications, train and govern machine learning models, and monitor their performance.

This document walks through an end-to-end solution: loading your data into Vertica™, connecting Vertica and Domino Data Lab using VerticaPy, and performing data science operations.

VerticaPy is a Python library with scikit-learn-like functionality for machine learning and advanced analytics in Vertica.

Vertica and Domino Data Lab High-Level Design

The following high-level design shows how Domino Data Lab connects to Vertica using VerticaPy and other components for machine learning and model training. Domino provides four IDEs: Jupyter Notebook, Jupyter Lab, VSCode, and RStudio. From any of them you can connect to Vertica using VerticaPy or RJDBC, as shown in the design diagram. You can then visualize the data to identify trends and train your models for prediction. The sections that follow provide step-by-step instructions for:

  • Using the Sample Dataset

  • Loading Data into Vertica

  • Creating a workspace in Jupyter notebook to connect to Vertica using VerticaPy

  • Exploring data to identify trends

  • Training and building the model for predictive analysis

Environment

To begin, set up the following environment:

  • Domino Data Lab Cloud Instance

  • Vertica Analytical Database 12.0.0

  • VerticaPy library installed on the Domino Data Lab environment

  • DbVisualizer or any other SQL client or ETL tool to load data into Vertica

Assumptions and Prerequisites

  • Domino Data Lab is already set up, either in the cloud or on-premises, and the instance is up and running.

  • No firewall/connection issues exist from the Domino instance to the Vertica instance.
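The connectivity assumption above can be checked from a Domino workspace before going further. The following is a minimal sketch that uses only the Python standard library. The helper name can_reach is hypothetical, and port 5433 is assumed from the JDBC URL used later in this document. A successful TCP connection shows only that the port is reachable, not that the database credentials are valid.

```python
import socket


def can_reach(host: str, port: int = 5433, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, calling can_reach("<IPAddress>") with the address of your Vertica node returns False when a firewall blocks the port or the host name cannot be resolved.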

Step by Step Machine Learning Solution for Predicting Medical Costs

The goal of this solution is to build and train a model to analyze and predict medical costs.

The example shows how to get started with a dataset in Vertica and how to use VerticaPy's visualization features from Domino to identify trends in the data. It also describes how to encode categorical features as numbers so that the model can use them. Finally, you create a regression model and train it on the dataset to predict medical charges.
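The workflow described above might look roughly like the following sketch. It is not the exact code of this solution, and several names are assumptions: the environment variable names, the connection name domino_vertica, the model name insurance.charges_model, and the column names age, bmi, children, sex, smoker, and charges in the insurance.insurance table. The import path for LinearRegression has changed across VerticaPy releases, so check the version installed in your Domino environment. The region column is left out here to keep the sketch short.

```python
import os


def vertica_conn_info(env=None) -> dict:
    """Build connection settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return {
        "host": env["VERTICA_HOST"],
        "port": int(env.get("VERTICA_PORT", "5433")),
        "database": env["VERTICA_DATABASE"],
        "user": env["VERTICA_USER"],
        "password": env["VERTICA_PASSWORD"],
    }


def train_charges_model(conn_info: dict, table: str = "insurance.insurance"):
    """Explore the table, encode two categorical columns, and fit a regression."""
    # Imported lazily so that the helper above works without VerticaPy installed.
    import verticapy as vp
    from verticapy.learn.linear_model import LinearRegression  # path varies by release

    # Save the connection under a name; by default this also connects.
    vp.new_connection(conn_info, name="domino_vertica")

    vdf = vp.vDataFrame(table)
    print(vdf.describe())  # summary statistics, a first look at the data

    # Encode the two-valued categorical columns (assumed names) as integers.
    for column in ("sex", "smoker"):
        vdf[column].label_encode()

    predictors = ["age", "bmi", "children", "sex", "smoker"]
    model = LinearRegression("insurance.charges_model")  # hypothetical model name
    model.fit(vdf, predictors, "charges")
    return model
```

Reading the settings from environment variables keeps credentials out of the notebook, which matters when the project is shared with collaborators.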


Sharing and Collaborating in Domino Data Lab

Domino Data Lab also provides a platform for multiple users in your organization to collaborate on projects, which supports validation, more efficient build pipelines, and proper documentation of code.

To invite other users to collaborate on your project:

  1. Go to the project dashboard that you want to share and click Settings.


  2. In the Project settings window, click Access & Sharing to view the project visibility and permissions.


  3. In the Collaborators and permissions section, enter the username.

    You can also look up a user by first name, last name, or organization name.

  4. Enter a welcome message for the user that you want to collaborate with and click Invite.

  5. A message confirms that the invitation has been sent and that the user will be added to the project.

    To verify, check that the user appears in the Collaborators and permissions section.


  6. Assign a role to the user. By default, the user is assigned the Contributor role.

  7. You can also remove the user by clicking Remove.

  8. The invited user can view the project on their Domino dashboard in the Collaborating Projects tab.


For more information, see Share and Collaborate in the Domino Data Lab documentation.

Summarizing other Domino IDE Environments

You can also use the following IDE environments to execute the same solution.

Jupyter Lab

Jupyter Lab is a web-based interactive development environment for notebooks, data, and code.

You can open a Jupyter notebook and follow the VerticaPy example to get started with machine learning.

VSCode

VSCode is another popular IDE for running Python code. You can create a Python file and execute it.
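As an illustration of such a Python file, the sketch below previews the insurance.insurance table through the vertica_python driver, which is typically installed alongside VerticaPy. The function names, the command-line options, and the VERTICA_PASSWORD environment variable are hypothetical. The defaults for port, user, and table are taken from the R example later in this document.

```python
import argparse
import os


def parse_args(argv=None) -> argparse.Namespace:
    """Read connection settings from the command line."""
    parser = argparse.ArgumentParser(description="Preview a Vertica table.")
    parser.add_argument("--host", required=True)
    parser.add_argument("--port", type=int, default=5433)
    parser.add_argument("--database", required=True)
    parser.add_argument("--user", default="dbadmin")
    parser.add_argument("--table", default="insurance.insurance")
    return parser.parse_args(argv)


def preview_table(args: argparse.Namespace, limit: int = 5) -> list:
    """Return the first `limit` rows of args.table."""
    import vertica_python  # imported lazily; requires the driver to be installed

    conn_info = {
        "host": args.host,
        "port": args.port,
        "database": args.database,
        "user": args.user,
        # Hypothetical variable name; keeps the password out of the script.
        "password": os.environ["VERTICA_PASSWORD"],
    }
    with vertica_python.connect(**conn_info) as connection:
        cursor = connection.cursor()
        # The table name is interpolated directly, so pass only trusted values.
        cursor.execute(f"SELECT * FROM {args.table} LIMIT {int(limit)}")
        return cursor.fetchall()


def main(argv=None) -> None:
    """Print the preview rows, one per line."""
    for row in preview_table(parse_args(argv)):
        print(row)
```

Saved as, say, preview.py with a closing call to main(), the script could be run from the VSCode terminal as python preview.py --host <IPAddress> --database <database>.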

RStudio

Domino also supports RStudio. However, you need to create your own solution to train the model and make predictions.

RStudio is a development environment for R, a language commonly used for statistical computing and graphics. After the development environment loads, click New Blank File > R Script to get started with R programming.

  1. Upload the Vertica JDBC driver to the environment so that R can connect to Vertica. Open the terminal and run "wget https://www.vertica.com/client_drivers/12.0.x/12.0.1-0/vertica-jdbc-12.0.1-0.jar" to download the JDBC JAR. The R code in the next step expects the JAR in /mnt.

  2. Enter the following code in the R file to install the necessary dependencies, connect to Vertica using the JDBC driver, and fetch the data by executing a SQL query.

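    # Install RJDBC and the packages it depends on
    install.packages('RJDBC', dependencies = TRUE)
    install.packages('DBI', dependencies = TRUE)
    install.packages('rJava', dependencies = TRUE)

    library(RJDBC)
    # Load the Vertica JDBC driver from the JAR downloaded in step 1
    vDriver <- JDBC(driverClass = 'com.vertica.jdbc.Driver', classPath = '/mnt/vertica-jdbc-12.0.1-0.jar')
    # Replace <IPAddress>, the database name, and the credentials with your own
    vertica <- dbConnect(vDriver, 'jdbc:vertica://<IPAddress>:5433/PartPub80DB', 'dbadmin', 'vert1caBdp')
    # Now run your queries
    myresults <- dbSendQuery(vertica, 'select * from insurance.insurance')
    insurance <- dbFetch(myresults)
    head(insurance)
    # Release the result set and close the connection
    dbClearResult(myresults)
    dbDisconnect(vertica)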
    


For More Information