VerticaPy

Python API for Vertica Data Science at Scale

Why VerticaPy?

Nowadays, 'Big Data' is one of the main topics in the data science world, and data scientists are often at the center of any organization. The benefits of becoming more data-driven are undeniable and are often needed to survive in the industry.


Vertica was the first real analytic columnar database and is still the fastest in the market. However, SQL alone isn't flexible enough to meet the needs of data scientists. Python has quickly become the most popular tool in this domain, owing much of its flexibility to its high level of abstraction and impressively large and ever-growing set of libraries. Its accessibility has led to the development of popular and performant APIs, like pandas and scikit-learn, and a dedicated community of data scientists.


However, Python only works in-memory as a single-node process. While distributed programming languages have tried to address this limitation, they are still generally in-memory and can never hope to process all of your data, and moving data is expensive. On top of all of this, data scientists must also find convenient ways to deploy their data and models. The whole process is time-consuming.


VerticaPy aims to solve all of these problems. The idea is simple: instead of moving data to your tools, VerticaPy brings your tools to the data.


History

When the first data science technologies and tools came onto the scene, optimization wasn't a high priority. Companies didn't pay much mind to how the needs of data storage and ingestion might change. Back then, databases were still used as data warehouses, and moving data around was often impossible without making compromises in security.


To address these problems, Vertica implemented the first in-database, scalable machine learning algorithms. That was back in 2015, and other databases have been trying to catch up ever since.


However, what SQL has in scalability, it lacks in flexibility. Python has the opposite problem: it's highly flexible, but not scalable. The idea of combining the strengths of these technologies was conceived in 2017 by Vertica data scientist Badr Ouali and, after 3 years of development, became a unique and powerful library, VerticaPy.

First Official Logo

The first version of the VerticaPy logo:


A Few Words from the Creator

"This Python Module is the result of my passion for Data Science. I love discovering everything possible in the data. I always kept a passion for mathematics and specially for statistics. When I saw the lack of libraries using as back-end the power of columnar MPP Database, I decided to help the Data Science Community by bringing the logic to the data."