VerticaPy

Python API for Vertica Data Science at Scale

Charts

Charts are a powerful tool for understanding and interpreting data. Most charts aggregate the data to represent the whole dataset; others downsample it and plot only a subset.

First, let's import the modules needed for this notebook.

In [22]:
# VerticaPy
from verticapy.datasets import load_titanic, load_iris, load_world, load_amazon
import verticapy as vp

# Numpy & Matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Let's start with pies and histograms. Drawing the pie or histogram of a categorical column in VerticaPy is quite easy.

In [13]:
vdf = load_titanic()
vdf["pclass"].hist()
vdf["pclass"].pie()
Out[13]:
<AxesSubplot:>

These methods draw the most frequent categories and merge the others into a single group. To change the number of categories displayed, use the 'max_cardinality' parameter.

In [2]:
vdf["home.dest"].hist()
vdf["home.dest"].hist(max_cardinality = 20)
Out[2]:
<AxesSubplot:xlabel='"home.dest"', ylabel='Density'>
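
To make the "keep the most frequent categories, merge the rest" idea concrete, here is a minimal pure-Python sketch. The helper `top_categories`, the "Others" label, and the toy data are assumptions made for illustration; VerticaPy performs the aggregation in the database, and the exact semantics of its 'max_cardinality' parameter may differ.

```python
from collections import Counter

def top_categories(values, max_cardinality, other_label="Others"):
    """Count values, keep the most frequent categories, merge the rest.

    Hypothetical helper: at most `max_cardinality` named categories are kept
    and every remaining value is counted under `other_label`.
    """
    counts = Counter(values)
    kept = counts.most_common(max_cardinality)
    merged = sum(counts.values()) - sum(n for _, n in kept)
    result = dict(kept)
    if merged:
        result[other_label] = merged
    return result

# Toy destinations standing in for a high-cardinality column
destinations = ["NY"] * 5 + ["London"] * 3 + ["Paris"] * 2 + ["Oslo", "Rome"]
summary = top_categories(destinations, max_cardinality=2)
print(summary)
```

Raising `max_cardinality` moves categories out of the merged group and into their own bars or slices.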

When dealing with numerical data types, the process is different. Vertica needs to discretize the numerical features to draw them. You can choose the bar width ('h' parameter) or let VerticaPy compute an optimal width using the Freedman-Diaconis rule.

In [3]:
vdf["age"].hist()
vdf["age"].hist(h = 5)
Out[3]:
<AxesSubplot:xlabel='"age"', ylabel='Density'>
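
For reference, the Freedman-Diaconis rule sets the bin width to h = 2 * IQR / n^(1/3), where IQR is the interquartile range and n the number of observations. The sketch below computes it in plain Python; the helper names and the linear-interpolation quantile are assumptions, and VerticaPy computes its quartiles in the database, so its result may differ slightly.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between the two closest ranks."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def freedman_diaconis_width(values):
    """Bin width h = 2 * IQR / n^(1/3)."""
    data = sorted(values)
    iqr = quantile(data, 0.75) - quantile(data, 0.25)
    return 2 * iqr / len(data) ** (1 / 3)

# Toy sample: quartiles are 2.75 and 6.25, so IQR = 3.5 and n^(1/3) = 2
ages = [1, 2, 3, 4, 5, 6, 7, 8]
h = freedman_diaconis_width(ages)
print(round(h, 2))
```

A wider spread or a smaller sample gives wider bars; passing 'h' explicitly bypasses the rule.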

You can also replace the number of occurrences with another aggregation using the 'method' and 'of' parameters.

In [4]:
vdf["age"].hist(method = "avg", of = "survived")
Out[4]:
<AxesSubplot:xlabel='"age"', ylabel='avg'>
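
The effect of 'method' and 'of' can be pictured as "group by bin, then aggregate another column". The following is a rough sketch of that idea with an average, assuming bins of fixed width `h` that start at multiples of `h`; the function name and the toy data are hypothetical and this is not VerticaPy's implementation.

```python
from collections import defaultdict

def aggregate_by_bin(xs, ys, h):
    """Average ys within bins of width h on xs; returns {bin_start: mean}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        if x is None or y is None:
            continue  # skip missing values
        start = (x // h) * h  # left edge of the bin containing x
        sums[start] += y
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}

# Toy data: each bar height becomes the survival rate of its age bin
ages = [2, 4, 11, 13, 17, 25]
survived = [1, 0, 1, 1, 0, 0]
avg_by_bin = aggregate_by_bin(ages, survived, h=10)
print(avg_by_bin)
```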

VerticaPy uses the same process for other graphics, like 2-dimensional histograms and bar charts.

In [5]:
vdf.bar(["pclass", "survived"])
vdf.hist(["fare", "pclass"],
         method = "avg",
         of = "survived")
Out[5]:
<AxesSubplot:xlabel='"fare"', ylabel='avg("survived")'>

Pivot tables give us aggregated information for every combination of categories and, unlike histograms or bar charts, show the exact values.

In [8]:
vdf.pivot_table(["pclass", "fare"], 
                method = "avg",
                of = "survived",
                fill_none = np.nan)
Out[8]:
"pclass"/"fare"  [0.00;42.69]       [42.69;85.38]      [85.38;128.07]     [128.07;170.76]    [170.76;213.45]    [213.45;256.14]    [256.14;298.83]    [512.28;554.97]
1                0.44               0.663716814159292  0.787878787878788  0.724137931034483  0.666666666666667  0.5                0.75               1.0
2                0.425101214574899  0.25               nan                nan                nan                nan                nan                nan
3                0.227987421383648  0.230769230769231  nan                nan                nan                nan                nan                nan
Rows: 1-3 | Columns: 9
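
As an illustration of what a pivot table computes, the sketch below averages a value for every (row, column) pair and fills empty combinations with a placeholder, in the spirit of 'fill_none'. The function `pivot_avg`, the "low"/"high" fare labels, and the data are assumptions for this example only.

```python
from collections import defaultdict
import math

def pivot_avg(rows, cols, values, fill_none=math.nan):
    """Average `values` for each (row, col) pair; fill empty cells."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r, c, v in zip(rows, cols, values):
        sums[(r, c)] += v
        counts[(r, c)] += 1
    row_keys = sorted(set(rows))
    col_keys = sorted(set(cols))
    return {
        r: {c: (sums[(r, c)] / counts[(r, c)] if (r, c) in counts else fill_none)
            for c in col_keys}
        for r in row_keys
    }

# Toy data: passenger class, a pre-binned fare label, and survival
pclass = [1, 1, 1, 2, 2, 3]
fare_bin = ["low", "low", "high", "low", "low", "low"]
survived = [1, 0, 1, 0, 1, 0]
table = pivot_avg(pclass, fare_bin, survived)
for key, row in table.items():
    print(key, row)
```

Cells with no observations, such as second-class passengers with a "high" fare here, come out as the fill value rather than zero, which keeps "no data" distinct from "nobody survived".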

Box plots are useful for understanding statistical dispersion.

In [9]:
vdf.boxplot(columns = ["age", "fare"])
vdf["age"].boxplot()
Out[9]:
<AxesSubplot:xlabel='"age"'>
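
The quantities a box plot summarizes can be computed by hand. This sketch assumes the common Tukey convention (whiskers at the most extreme points within 1.5 * IQR of the quartiles); the source does not state which whisker rule VerticaPy uses, and the helper names and sample fares are hypothetical.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between the two closest ranks."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def box_stats(values):
    """Quartiles, whiskers, and outliers under the 1.5 * IQR convention."""
    data = sorted(values)
    q1, median, q3 = (quantile(data, q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Toy fares with one extreme value
fares = [7, 8, 8, 10, 13, 15, 26, 30, 512]
stats = box_stats(fares)
print(stats)
```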

Scatter and bubble plots are also useful for identifying patterns in your data. Note, however, that these methods don't use aggregations; VerticaPy downsamples the data before plotting. You can use the 'max_nb_points' parameter to limit the number of points and avoid unnecessary memory usage.

In [11]:
vdf = load_iris()
vdf.scatter(["SepalLengthCm", "PetalWidthCm"], 
            catcol = "Species", 
            max_nb_points = 1000)
vdf.scatter(["SepalLengthCm", "PetalWidthCm", "SepalWidthCm"], 
            catcol = "Species", 
            max_nb_points = 1000)
vdf.bubble(["SepalLengthCm", "PetalWidthCm"], 
            size_bubble_col = "SepalWidthCm",
            catcol = "Species", 
            max_nb_points = 1000)
Out[11]:
<AxesSubplot:xlabel='"SepalLengthCm"', ylabel='"PetalWidthCm"'>
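
Downsampling before plotting can be sketched as drawing a uniform random subset whenever the data exceeds a point budget. The helper below is an assumption for illustration; the source does not describe how VerticaPy selects the points it keeps.

```python
import random

def downsample(points, max_nb_points, seed=0):
    """Return at most `max_nb_points` points, sampled without replacement."""
    if len(points) <= max_nb_points:
        return list(points)  # nothing to drop
    rng = random.Random(seed)  # seeded so the plot is reproducible
    return rng.sample(points, max_nb_points)

# 10,000 toy points reduced to a budget of 1,000
points = [(x, x * 0.5) for x in range(10_000)]
subset = downsample(points, max_nb_points=1000)
print(len(subset))
```

Only the sampled points need to be held in memory by the plotting client, which is why the budget bounds memory usage regardless of the table size.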

Hexbin plots can be useful for generating heatmaps. These summarize data in a similar way to scatter plots, but compute aggregations to get the final results.

In [12]:
vdf.hexbin(["SepalLengthCm", "SepalWidthCm"], 
            method = "avg", 
            of = "PetalWidthCm")
Out[12]:
<AxesSubplot:xlabel='"SepalLengthCm"', ylabel='"SepalWidthCm"'>

Hexbin, scatter, and bubble plots also allow you to provide a background image. The dataset used below is available here.

In [11]:
africa = vp.read_csv("data/africa_education.csv")
# displaying the average student score in Africa
africa.hexbin(["lon", "lat"],
              method = "avg",
              of = "zralocp",
              img = "img/africa.png")
# displaying schools in Africa
africa = africa.groupby(["country_long", "lat", "lon"])
africa.scatter(["lon", "lat"],
               catcol = "country_long",
               max_cardinality = 100,
               img = "img/africa.png")
Out[11]:
<AxesSubplot:xlabel='"lon"', ylabel='"lat"'>

It is also possible to use SHP datasets to draw maps.

In [21]:
# Africa Dataset
africa_world = load_world()
africa_world = africa_world[africa_world["continent"] == "Africa"]
ax = africa_world["geometry"].geo_plot(color = "white",
                                       edgecolor='black',)

# displaying schools in Africa
africa.scatter(["lon", "lat"],
               catcol = "country_long",
               ax = ax,
               max_cardinality = 100)
Out[21]:
<AxesSubplot:xlabel='"lon"', ylabel='"lat"'>

Time-series plots are also available with the 'plot' method.

In [23]:
vdf = load_amazon()
vdf.filter(vdf["state"]._in(['ACRE', 'RIO DE JANEIRO', 'PARÁ']))
vdf["number"].plot(ts = "date", by = "state")
5737 elements were filtered
Out[23]:
<AxesSubplot:xlabel='"date"', ylabel='"number"'>

Since time-series plots do not aggregate the data, it's important to choose the correct 'start_date' and 'end_date'.

In [24]:
vdf["number"].plot(ts = "date", 
                   by = "state", 
                   start_date = "2010-01-01")
Out[24]:
<AxesSubplot:xlabel='"date"', ylabel='"number"'>
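
Because every remaining point is drawn, restricting the time window directly controls how much data is plotted. Here is a minimal sketch of such a filter, assuming inclusive bounds; the function `in_window` and the sample series are hypothetical.

```python
from datetime import date

def in_window(rows, start_date=None, end_date=None):
    """Keep (day, value) rows whose day lies in [start_date, end_date]."""
    kept = []
    for day, value in rows:
        if start_date is not None and day < start_date:
            continue
        if end_date is not None and day > end_date:
            continue
        kept.append((day, value))
    return kept

# Toy series: only the rows from 2010 onward are kept
series = [(date(2009, 6, 1), 12), (date(2010, 1, 1), 7), (date(2011, 3, 1), 20)]
recent = in_window(series, start_date=date(2010, 1, 1))
print(len(recent))
```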

Every graphical function accepts an 'ax' parameter, the Matplotlib axes on which to draw. You can use it to overlay several plots on the same axes or to arrange them as subplots of one figure.

In [25]:
amazon = load_amazon()
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
amazon.acf(column = "number",
           ts = "date",
           by = ["state"],
           p = 12,
           ax = ax1)
ax1.set_xticklabels([])
ax1.set_xlabel("")
amazon.pacf(column = "number",
            ts = "date",
            by = ["state"],
            p = 12,
            ax = ax2)
plt.show()

You can customize your charts using Matplotlib input parameters.

In [26]:
vdf = load_iris()
vdf.bubble(["SepalLengthCm", "PetalWidthCm"], 
            size_bubble_col = "SepalWidthCm",
            catcol = "Species", 
            max_nb_points = 1000,
            color = ["r", "g", "b"],)
Out[26]:
<AxesSubplot:xlabel='"SepalLengthCm"', ylabel='"PetalWidthCm"'>

You can also draw responsive graphics with the Highcharts integration:

In [27]:
vdf.hchart(x = "PetalLengthCm",
           y = "SepalLengthCm",
           c = "Species",
           kind = "scatter")
Out[27]:

Graphics are powerful tools and can help us understand and visualize trends in our data.