This blog post was authored by Soniya Shah.
One of the coolest things about working at Vertica is our amazing intern program, which often leads to full-time hires. Last year, Badr Ouali started the Vertica-ML-Python library, also known as vpython, as an internship project. A year later, he works for Vertica full time and has seen his work through to an open source release, now available on the Vertica GitHub. The library abstracts and streamlines data science functionality for manipulating large data sets stored in Vertica, taking advantage of what Vertica is known for: speed and built-in analytics and machine learning capabilities. This lets data scientists who are not SQL experts analyze data at scale without moving it out of Vertica.
The library is a Python front end that exposes functionality similar to that of scikit-learn. All the heavy computation for data exploration, preparation, and machine learning is pushed down to Vertica. The library supports the entire data science life cycle through the Resilient Vertica Dataset (RVD), a pipeline mechanism that sequences data preparation operations.
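To make the pushdown idea concrete, here is a minimal sketch of a pipeline that records preparation steps in order and only turns them into SQL when a result is requested. It is not the Vertica-ML-Python API: the `LazyDataset` class, its method names, and the `trips` table are all hypothetical, and the standard-library `sqlite3` module stands in for a Vertica connection so that the example runs anywhere.

```python
import sqlite3


class LazyDataset:
    """Hypothetical stand-in for a database-backed dataset (not the RVD class)."""

    def __init__(self, table, steps=None):
        self.table = table
        self.steps = list(steps or [])  # ordered data preparation steps

    def filter(self, condition):
        # Record the step only; nothing is sent to the database yet.
        return LazyDataset(self.table, self.steps + [("filter", condition)])

    def add_column(self, name, expression):
        return LazyDataset(self.table, self.steps + [("add", (name, expression))])

    def to_sql(self):
        # Fold the recorded steps, in order, into nested subqueries.
        query = f"SELECT * FROM {self.table}"
        for i, (kind, arg) in enumerate(self.steps):
            alias = f"step_{i}"
            if kind == "filter":
                query = f"SELECT * FROM ({query}) AS {alias} WHERE {arg}"
            else:
                name, expression = arg
                query = f"SELECT *, {expression} AS {name} FROM ({query}) AS {alias}"
        return query

    def mean(self, column, connection):
        # The aggregate runs inside the database; only one number comes back.
        sql = f"SELECT AVG({column}) FROM ({self.to_sql()}) AS final"
        return connection.execute(sql).fetchone()[0]


# Assumption: an in-memory SQLite database plays the role of Vertica here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (distance_km REAL, minutes REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(10.0, 30.0), (5.0, 10.0), (0.0, 5.0)],
)

trips = (
    LazyDataset("trips")
    .filter("distance_km > 0")
    .add_column("speed_kmh", "distance_km * 60.0 / minutes")
)

print(trips.to_sql())                 # the single query the pipeline compiles to
print(trips.mean("speed_kmh", conn))  # computed by the database, not in Python
```

The point of the design is that the Python side holds only a description of the work. The rows never leave the database, which is what allows this style of front end to scale with the engine underneath it.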
Vertica-ML-Python also includes multiple rendering capabilities, including pie charts, correlation matrices, scatter plots, hexbin plots, and more. Its simple methods issue SQL queries directly against your Vertica database.
To connect to your Vertica database, you can use either the JDBC or the ODBC client driver. The following figure summarizes how the vertica-ml-python library works:
For more information, see the full documentation on GitHub.