The Virtual DataFrame
The Virtual DataFrame is the main object and star of the library, and it acts as the perfect transition between small and Big Data. The principle is simple: since Vertica is a powerful columnar, massively parallel processing (MPP) database with many built-in functions, we want it to do as much of the computation work as possible.
Indeed, columnar orientation allows for high compression, and its structure inherently avoids unnecessary parsing when retrieving data. MPP lets us parallelize our computations across the different nodes.
The best way to take advantage of your data is by simply keeping it in your Vertica database, rather than within the limitations of working memory. VerticaPy pushes all computation to your Vertica database before aggregating the final result, so you can get the best of both worlds: Vertica's power and Python's flexibility.
With Python, it's easy to add abstractions, and the vDataFrame acts as the primary abstraction layer. Simple but powerful, it'll help any user through the data science life cycle.
Creating the vDataFrame
There are two main ways to create a vDataFrame.
The first is to create one directly using an existing relation.
from verticapy.datasets import load_titanic
load_titanic() # Loading the titanic dataset in Vertica
import verticapy as vp
vp.vDataFrame("public.titanic")
We can also create one using a customized relation.
vp.vDataFrame(sql = "SELECT pclass, AVG(survived) AS survived FROM titanic GROUP BY 1")
In-memory vs. In-database Loading and Processing
First, let's load the expedia dataset in Vertica.
vp.read_csv("data/expedia.csv", schema = "public", parse_nrows = 2000)
To understand the main difference between loading data into memory and loading data into a Vertica database, let's create a vDataFrame using an existing relation.
import time
start_time = time.time()
expedia = vp.vDataFrame("public.expedia")
print("elapsed time = {}".format(time.time() - start_time))
It took less than a second to create a vDataFrame. This dataset comes in at 6GB, which would be very expensive to hold on a personal machine, so we keep the data entirely in Vertica; nothing is loaded into memory.
Let's compare this to loading the data into memory with pandas. You can try to load the entire dataset in your computer if you have at least 8GB of memory.
import pandas as pd
L_nrows = [10000, 100000, 1000000, 2000000, 5000000, 10000000, 20000000]
L_time = []
for nrows in L_nrows:
    start_time = time.time()
    expedia_df = pd.read_csv("data/expedia.csv", nrows = nrows)
    elapsed_time = time.time() - start_time
    L_time.append(elapsed_time)
    print("nrows = {}; elapsed time = {}".format(nrows, elapsed_time))
Loading data into pandas is quite fast when the data volume is low (a few megabytes), but as the size of the dataset increases, loading becomes drastically more expensive.
import matplotlib.pyplot as plt
plt.plot(L_nrows, L_time)
plt.show()
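If the data must nonetheless stay on the pandas side, one common mitigation is chunked reading, which bounds memory usage but gives up whole-dataset operations such as a correlation matrix. A minimal, self-contained sketch (the inline CSV and column names are illustrative stand-ins, not part of the expedia dataset):

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for a file too large to load at once.
csv_data = io.StringIO(
    "is_booking,cnt\n"
    "1,3\n"
    "0,5\n"
    "1,2\n"
    "0,4\n"
)

total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Only one small chunk is materialized in memory at a time.
    total += chunk["cnt"].sum()

print(total)  # 14
```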
Performance will also drastically decrease.
start_time = time.time()
expedia_df.corr()
print("elapsed time = {}".format(time.time() - start_time))
We're only using a little more than half of the dataset, and it took almost 2 minutes to compute the correlation matrix.
Let's compute the entire correlation matrix using the vDataFrame.
start_time = time.time()
expedia.corr(show = False)
print("elapsed time = {}".format(time.time() - start_time))
It took almost 1 minute and 30 seconds on a single Community Edition Vertica cluster, without using any Vertica-specific features that might increase performance, such as projections or compression tuning.
VerticaPy caches the computed aggregations. With this cache available, we can repeat the correlation matrix computation almost instantaneously.
start_time = time.time()
expedia.corr(show = False)
print("elapsed time = {}".format(time.time() - start_time))
If needed, the cache can be deactivated.
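A one-line sketch of deactivating it (this assumes the 'cache' entry of 'set_option', which may vary across VerticaPy versions):

```python
import verticapy as vp

# Disable aggregation caching; every aggregation is recomputed in Vertica.
vp.set_option("cache", False)
```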
Let's look at the memory usage for less than half of the dataset: Pandas is taking more than 3.6 GB.
expedia_df.info()
Let's compare that to the total memory usage of the vDataFrame; but instead of loading half the dataset, we'll load the entire dataset: less than 44KB!
The vDataFrame remembers the user's modifications to the data, but never loads the data itself into memory.
expedia.memory_usage()
We can see a clear difference. With VerticaPy, we can take advantage of Vertica's structure and scalability and run fast queries without ever loading the data into memory. In-memory processing is limited by many factors, which often forces downsampling.
The Structure of the vDataFrame
A vDataFrame is composed of columns called vColumns. You can see these with the 'get_columns' method.
expedia.get_columns()
To access a vColumn, simply write its name between square brackets.
expedia["is_booking"]
VerticaPy is smart enough to not recompute an aggregation that it's already computed.
expedia["is_booking"].describe()
Each vColumn has its own catalog to save user modifications.
For example, we previously computed some aggregations for the column 'is_booking'. Let's look at the catalog of the vColumn.
expedia["is_booking"].catalog
It will save the most important aggregations to avoid recomputation. The catalog will be updated whenever we make major changes to our data.
We can also view the vDataFrame's backend SQL code generation by setting 'sql_on' with the 'set_option' function.
vp.set_option("sql_on", True)
expedia["cnt"].describe()
You can also display the elapsed time of the different queries by setting 'time_on' with the 'set_option' function. For example, let's compute the correlation matrix of the vDataFrame.
vp.set_option("time_on", True)
Note: In order to display matplotlib graphics in Jupyter, you'll need to use the '%matplotlib inline' command the first time you decide to draw a graphic.
expedia = vp.vDataFrame("public.expedia") # creating a new vDataFrame to delete the catalog
%matplotlib inline
expedia.corr()
All heavy computations are pushed to Vertica, and each aggregation is saved to each vColumn's catalog. If we call the 'corr' method again, it'll only take a couple seconds (time needed to draw the graphic).
start_time = time.time()
expedia.corr()
print("elapsed time = {}".format(time.time() - start_time))
We can turn off the display of the elapsed time and the SQL code generation.
vp.set_option("sql_on", False)
vp.set_option("time_on", False)
You can access the current vDataFrame relation with the 'current_relation' method.
print(expedia.current_relation())
Since we're working with SQL code generation, this relation will change according to the user's modifications. For example, let's impute the missing values of the vColumn 'orig_destination_distance' by its average and drop the vColumn 'is_package'.
expedia["orig_destination_distance"].fillna(method = "avg")
expedia["is_package"].drop()
print(expedia.current_relation())
Notice how dropping the vColumn 'is_package' simply removes it from the SELECT statement in our SQL query. Similarly, imputing a vColumn translates to the 'COALESCE' SQL function.
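For intuition, the rewritten relation looks roughly like this hand-written sketch (abridged; not the exact output of 'current_relation()', and the imputed average is left as a placeholder):

```sql
-- "is_package" simply disappears from the column list, and the missing
-- distances are imputed via COALESCE with a precomputed average.
SELECT
    COALESCE("orig_destination_distance", <average>) AS "orig_destination_distance",
    -- ... all remaining columns except the dropped "is_package" ...
FROM "public"."expedia"
```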
vDataFrame Attributes and Management
As we saw, the vDataFrame has many attributes and methods. vDataFrames have two types of attributes:
- Virtual Columns
- Main Attributes (columns, main_relation ...)
The vDataFrame's main attributes are stored in the _VERTICAPY_VARIABLES_ dictionary.
expedia._VERTICAPY_VARIABLES_
You should never change these attributes manually.
vDataFrame Data Types
The vDataFrame's behavior adapts to the data types of its vColumns: computing a histogram for a numerical data type is not the same as computing one for a categorical data type. The vDataFrame identifies four main categories:
- int: integers, treated as categorical data types when their cardinality is low and as numeric otherwise
- float: numerics
- date: date-like data types
- text: categorical data types
Other data types may be automatically treated as categorical. You can examine these different data types using the 'dtypes' method.
expedia.dtypes()
You can perform conversions with the 'astype' method.
expedia["hotel_market"].astype("varchar")
expedia["hotel_market"].ctype()
You can also get the vColumn category using the 'category' method.
expedia["hotel_market"].category()
Exporting / Saving / Loading a vDataFrame
The functions 'save' and 'load' allow the user to save and load their vDataFrame structure.
expedia.save()
expedia.filter("is_booking = 1")
In this example, we filtered some data and want to go back to the previous structure.
expedia = expedia.load()
print(expedia.shape())
Don't forget to use the help function when you need more information about the different functions!
help(expedia.load)
The vDataFrame works the same way as a view: nothing is stored in the database unless you do so explicitly with the 'to_db' method. If you want to save the result into a table, be sure to check the expected disk usage of exporting the vDataFrame first.
expedia.expected_store_usage(unit = "Gb")
After we decide that we have the space to store the vDataFrame, we can store it in our database.
expedia.to_db("public.expedia_clean",
              relation_type = "table")