One of the coolest things about working at Vertica is our amazing intern program, which often leads to full-time hires. Last year, Badr Ouali started the VerticaPy library, also known as vpython, as an internship project. A year later, he works for Vertica full time and has seen his work through to an open source release, now available on the Vertica GitHub and on PyPI. The library abstracts and streamlines data science functionality for manipulating large data sets stored in Vertica by taking advantage of what Vertica is known for: speed and built-in analytics and machine learning capabilities. This allows data scientists who are not SQL experts to analyze data at scale without moving it out of Vertica.
The library is a Python front-end that exposes functionality similar to that of scikit-learn. All the heavy computation for data exploration, preparation, and machine learning is pushed to Vertica. It supports the entire data science life cycle and uses a pipeline mechanism, called the Resilient Vertica Dataset (RVD), to sequence data preparation operations.
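To make the pushdown idea concrete, here is a minimal sketch of a lazy pipeline that records data preparation steps and only runs SQL when a result is requested. This is not VerticaPy's API: the `LazyDataset` class, its method names, and the `sales` table are hypothetical, and Python's built-in `sqlite3` module stands in for a Vertica connection so the example runs anywhere.

```python
import sqlite3


class LazyDataset:
    """Hypothetical RVD-like object: records steps, lets the database do the work."""

    def __init__(self, conn, table, steps=()):
        self.conn = conn
        self.table = table
        self.steps = tuple(steps)

    def filter(self, condition):
        # Record the step; nothing is executed yet.
        return LazyDataset(self.conn, self.table, self.steps + (("filter", condition),))

    def with_column(self, name, expression):
        # Record a derived column as a SQL expression.
        step = ("column", f"{expression} AS {name}")
        return LazyDataset(self.conn, self.table, self.steps + (step,))

    def to_sql(self):
        # Compile the recorded steps, in order, into one nested SQL query.
        # Plain string interpolation is used for brevity; it is not safe
        # for untrusted input.
        sql = f"SELECT * FROM {self.table}"
        for kind, arg in self.steps:
            if kind == "filter":
                sql = f"SELECT * FROM ({sql}) AS t WHERE {arg}"
            else:
                sql = f"SELECT *, {arg} FROM ({sql}) AS t"
        return sql

    def avg(self, column):
        # The aggregate runs inside the database; only one number comes back.
        query = f"SELECT AVG({column}) FROM ({self.to_sql()}) AS t"
        return self.conn.execute(query).fetchone()[0]


# In-memory SQLite database used as a stand-in for Vertica.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 20.0), ("east", 30.0), ("west", -5.0)],
)

pipeline = (
    LazyDataset(conn, "sales")
    .filter("amount > 0")
    .with_column("doubled", "amount * 2")
)
result = pipeline.avg("doubled")
print(pipeline.to_sql())
print(result)
```

In this sketch the rows never leave the database; the Python side only holds the list of steps and the final aggregate, which is the same division of labor the paragraph above describes.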
VerticaPy also includes multiple rendering capabilities, including pie charts, correlation matrices, scatterplots, hexbins, and more. Its simple methods issue SQL queries directly against your Vertica database.
To connect to your Vertica database, you can use either the JDBC or the ODBC client driver. The following figure summarizes how the VerticaPy library works:
For more information, see the full documentation on GitHub.