End-to-End Machine Learning Solution with Vertica and Saagie Using VerticaPy

This video provides an end-to-end solution with Vertica and Saagie using VerticaPy. Analyze and model your data using Vertica's in-database machine learning capabilities and advance your ML pipeline with Saagie.

Saagie is a DataOps platform that brings together various technologies, so you can run data projects quickly, easily, and reliably. This platform allows you to use a combination of technologies to build and manage every step of a data project, from data extraction to visualization.

This document provides an end-to-end solution from loading your data into Vertica to connecting Vertica and Saagie using VerticaPy for performing data science operations.

VerticaPy is a Python library that exposes scikit-learn-like functionality for machine learning and advanced analytics in Vertica.

Vertica and Saagie High-Level Design

The following is a high-level design of how Saagie connects to Vertica using VerticaPy and other components for machine learning and model training. Saagie provides JOBS and APPS (Jupyter Notebook) from which you can connect to Vertica using VerticaPy as shown in the design diagram.

You can then explore and prepare data and train your models. The sections that follow provide step-by-step instructions for:

  • Using the sample dataset

  • Loading data into Vertica

  • Creating a job and app (Jupyter notebook) to connect to Vertica using VerticaPy

  • Exploring and preparing data

  • Training and building a model to evaluate and create clusters
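As a rough illustration of the connection step, the sketch below builds a connection-info dictionary from environment variables inside a Saagie job or notebook. The variable names (VERTICA_HOST and so on) and the connection name are hypothetical, and the VerticaPy calls in the trailing comment are assumed usage that you should check against the documentation for your VerticaPy version.

```python
import os


def vertica_conn_info_from_env(prefix="VERTICA_"):
    """Build a Vertica connection-info dict from environment variables.

    The variable names (VERTICA_HOST, VERTICA_PORT, ...) are hypothetical;
    use whatever names your Saagie project defines.
    """
    return {
        "host": os.environ[prefix + "HOST"],
        # 5433 is Vertica's default client port; kept as a string here.
        "port": os.environ.get(prefix + "PORT", "5433"),
        "database": os.environ[prefix + "DATABASE"],
        "user": os.environ[prefix + "USER"],
        "password": os.environ[prefix + "PASSWORD"],
    }


# Assumed usage, not executed here (requires VerticaPy and a reachable database):
#   import verticapy as vp
#   vp.new_connection(vertica_conn_info_from_env(), name="saagie_vertica")
#   vp.connect("saagie_vertica")
```

Reading credentials from the environment keeps them out of the notebook itself, which is one reasonable design choice among several.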

Environment

To begin, you'll need to set up the following environment:

  • Saagie Cloud Instance

  • Vertica Analytical Database 12.0.4

  • VerticaPy Library installed on the Saagie environment

  • Jupyter Notebook or any other ETL tool to load data into Vertica

Assumptions and Prerequisites

  • Saagie is already set up, either in the cloud or on-premises, and the instance is up and running.

  • No firewall/connection issues exist from Saagie to the Vertica instance.
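One quick way to check the connectivity assumption from a Saagie job or notebook is to try opening a TCP connection to the Vertica port. The following is a minimal sketch using only the Python standard library; the host name in the comment is hypothetical, and 5433 is Vertica's default client port, which your instance may override.

```python
import socket


def can_reach(host, port=5433, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call (hypothetical host name):
#   can_reach("vertica.example.internal")
```

A True result only shows that the port is reachable; it does not verify credentials or database availability.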

Step-by-Step Machine Learning Solution to Evaluate Movie Quality and Create Clusters

The goal of this solution is to build and train a model to analyze and evaluate the quality of movies and create clusters of similar movies.

The example explains how to get started with a dataset in Vertica and use data exploration, data preparation, and data modeling features in Saagie through VerticaPy. It normalizes numerical columns, categorizes the data, and creates dummy variables to help the model understand the categorical variables. Finally, it shows how to create regression and clustering models and train them on the dataset to determine the quality of movies and create clusters of similar movies.
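To make the preparation steps concrete, here is a plain Python sketch of what normalizing a numerical column and creating dummies for a categorical column do to the data. It does not use VerticaPy; in the solution itself these operations run in-database through VerticaPy. The function names are illustrative, and the use of z-score scaling with the population standard deviation is an assumption made for this example.

```python
from statistics import mean, pstdev


def zscore(values):
    """Scale values to zero mean and unit variance (population std, assumed)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return one 0/1 indicator list per distinct category (dummy variables)."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}
```

For instance, calling one_hot on a hypothetical genre column with the values "drama", "comedy", "drama" yields one indicator column for "comedy" and one for "drama", which regression and clustering models can consume as numbers.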


For More Information