Charts

Charts are a powerful tool for understanding and interpreting data. Most charts use aggregations to represent the dataset, and others downsample the data to represent a subset.

First, let’s import the modules needed for this notebook.

[1]:

# VerticaPy
from verticapy.datasets import load_titanic, load_iris, load_world, load_amazon
import verticapy as vp

# Numpy & Matplotlib
import numpy as np
import matplotlib.pyplot as plt

Let’s start with bar charts and pie charts. Drawing either one for a categorical column in VerticaPy takes a single method call.

[2]:

vp.set_option("plotting_lib", "highcharts")
vdf = load_titanic()
vdf["pclass"].bar()

[2]:

[3]:

vdf["pclass"].pie()

[3]:

These methods draw the most frequent categories and merge the rest into a single group. To change the number of categories displayed, use the ‘max_cardinality’ parameter.

[4]:

vdf["home.dest"].bar()

[4]:

[5]:

vdf["home.dest"].bar(max_cardinality = 5)

[5]:
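
As a rough illustration of what ‘max_cardinality’ does, the plain-Python sketch below keeps the most frequent categories and merges the rest into an ‘Others’ group. The `top_categories` helper and the sample destinations are hypothetical; VerticaPy performs the equivalent counting in the database, and its handling of ties may differ.

```python
from collections import Counter

def top_categories(values, max_cardinality=6):
    """Keep the most frequent categories and merge the rest into 'Others'."""
    counts = Counter(values)
    top = counts.most_common(max_cardinality)
    kept = {category for category, _ in top}
    # Sum the counts of every category that did not make the cut.
    others = sum(n for category, n in counts.items() if category not in kept)
    result = dict(top)
    if others:
        result["Others"] = others
    return result

# Made-up sample data, not taken from the Titanic dataset.
destinations = ["New York"] * 5 + ["London"] * 3 + ["Paris"] * 2 + ["Oslo", "Rome"]
merged = top_categories(destinations, max_cardinality=2)
print(merged)  # {'New York': 5, 'London': 3, 'Others': 4}
```

A lower ‘max_cardinality’ produces fewer, larger bars; a higher value shows more of the long tail.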

When dealing with numerical data types, the process is different. Vertica needs to discretize the numerical features to draw them. You can choose the bar width (‘h’ parameter) or let VerticaPy compute an optimal width using the Freedman-Diaconis rule.

[6]:

vdf["age"].hist()

[6]:

[7]:

vdf["age"].hist(h = 5)

[7]:
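
For reference, the Freedman-Diaconis rule sets the bar width to 2 · IQR / n^(1/3), where IQR is the interquartile range and n is the number of observations. The sketch below applies that formula in plain Python. The `freedman_diaconis_width` helper and its sample data are hypothetical, the quartile method is an assumption, and VerticaPy's in-database computation may differ in detail.

```python
import statistics

def freedman_diaconis_width(values):
    """Return the bin width 2 * IQR / n^(1/3) for a list of numbers."""
    data = sorted(values)
    # Quartiles computed with the 'inclusive' method (an assumption).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return 2 * iqr / len(data) ** (1 / 3)

ages = list(range(1, 101))  # made-up sample: the integers 1 to 100
h = freedman_diaconis_width(ages)
print(round(h, 2))
```

The width shrinks as the sample grows and widens as the data become more spread out, which keeps the number of bars reasonable in both cases.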

You can also replace the occurrence count with another aggregation by using the ‘method’ and ‘of’ parameters.

[8]:

vdf["age"].hist(method = "avg", of = "survived")

[8]:
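
Conceptually, combining ‘method’ and ‘of’ means that each bar shows an aggregate of a second column instead of a row count. The sketch below mimics `method = "avg"` with bins of width `h`. The `binned_average` helper and the sample values are hypothetical and stand in for work that VerticaPy delegates to the database.

```python
from collections import defaultdict

def binned_average(xs, ys, h):
    """Average ys inside each bin [k*h, (k+1)*h) of xs, skipping missing values."""
    sums = defaultdict(lambda: [0.0, 0])  # bin start -> [running total, count]
    for x, y in zip(xs, ys):
        if x is None or y is None:
            continue
        left = (x // h) * h  # left edge of the bin containing x
        sums[left][0] += y
        sums[left][1] += 1
    return {left: total / count for left, (total, count) in sorted(sums.items())}

# Made-up ages and survival flags, not real Titanic rows.
ages = [2, 4, 23, 27, 29, 41, 44]
survived = [1, 1, 0, 1, 0, 0, 0]
averages = binned_average(ages, survived, h=10)
print(averages)
```

Each key is the left edge of a bin and each value is the average of ‘survived’ for the rows that fall in that bin.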

VerticaPy uses the same process for other graphics, like 2-dimensional histograms and bar charts.

[9]:

vdf.bar(["pclass", "survived"])

[9]:

[10]:

vdf.hist(["fare", "pclass"],
         method = "avg",
         of = "survived")

[10]:

Pivot tables report an aggregate for every combination of categories, which gives more detail than a histogram or bar chart.

[11]:

vdf.pivot_table(["pclass", "fare"],
                method = "avg",
                of = "survived",
                fill_none = np.nan)

[11]:
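
To make the role of ‘fill_none’ concrete, the sketch below builds a small pivot table in plain Python: one cell per pair of categories, holding the average of a third column, with empty combinations filled by a placeholder. The `pivot_average` helper, the fare labels, and the data are hypothetical.

```python
import math
from collections import defaultdict

def pivot_average(rows, cols, values, fill_none=math.nan):
    """Average `values` for each (row, col) pair; fill empty pairs with `fill_none`."""
    cells = defaultdict(list)
    for r, c, v in zip(rows, cols, values):
        cells[(r, c)].append(v)
    filled = set(cells)  # pairs that actually occur in the data
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    return {
        r: {
            c: sum(cells[(r, c)]) / len(cells[(r, c)]) if (r, c) in filled else fill_none
            for c in col_labels
        }
        for r in row_labels
    }

# Made-up passenger classes, fare groups, and survival flags.
pclass = [1, 1, 2, 3, 3]
fare_group = ["high", "high", "low", "low", "low"]
survived = [1, 0, 1, 0, 0]
table = pivot_average(pclass, fare_group, survived)
print(table[1]["high"], table[2]["low"], table[3]["low"])  # 0.5 1.0 0.0
```

Combinations that never occur in the data, such as first class with a low fare here, receive the ‘fill_none’ value instead of being dropped.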

Box plots are useful for understanding statistical dispersion.

[12]:

vdf.boxplot(columns = ["age", "fare"])

[12]:

[13]:

vdf["age"].boxplot()

[13]:
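
The quantities that a conventional box plot displays can be computed by hand, as in the sketch below. It assumes the common Tukey convention, with whiskers reaching the most extreme points within 1.5 × IQR of the quartiles. The `box_plot_summary` helper and the data are hypothetical, and VerticaPy may compute its quartiles or whiskers differently.

```python
import statistics

def box_plot_summary(values, whisker=1.5):
    """Return quartiles, whisker ends, and outliers under the Tukey convention."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

fares = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # made-up values with one extreme point
summary = box_plot_summary(fares)
print(summary["outliers"])  # [50]
```

The box spans `q1` to `q3`, the whiskers extend to `whisker_low` and `whisker_high`, and anything listed under `outliers` is drawn as an individual point.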

Scatter and bubble plots are also useful for identifying patterns in your data. Note, however, that these methods don’t use aggregations; VerticaPy downsamples the data before plotting. You can use the ‘max_nb_points’ parameter to limit the number of points and avoid unnecessary memory usage.

[14]:

vdf = load_iris()
vdf.scatter(["SepalLengthCm", "PetalWidthCm"],
            by = "Species",
            max_nb_points = 1000)

[14]:

[15]:

vdf.scatter(["SepalLengthCm", "PetalWidthCm", "SepalWidthCm"],
            by = "Species",
            max_nb_points = 1000)

[15]:

[16]:

vdf.scatter(["SepalLengthCm", "PetalWidthCm"],
            size = "SepalWidthCm",
            by = "Species",
            max_nb_points = 1000)

[16]:
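
The effect of ‘max_nb_points’ can be pictured as capping the number of rows sent to the plotting library. The sketch below uses uniform random sampling, which is an assumption; how VerticaPy actually selects the rows is not described here. The `downsample` helper and the generated points are hypothetical.

```python
import random

def downsample(points, max_nb_points, seed=0):
    """Return at most `max_nb_points` points, chosen uniformly at random."""
    if len(points) <= max_nb_points:
        return list(points)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(points, max_nb_points)

# Made-up (x, y) pairs standing in for a large table.
points = [(i, i * 0.5) for i in range(5000)]
sample = downsample(points, max_nb_points=1000)
print(len(sample))  # 1000
```

Datasets that are already smaller than the limit pass through unchanged, which is why the Iris examples above still show every point.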

[17]:

help(vdf.scatter)

Help on method scatter in module verticapy.core.vdataframe._plotting:

scatter(columns: Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns'], by: Optional[str] = None, size: Optional[str] = None, cmap_col: Optional[str] = None, max_cardinality: int = 6, cat_priority: Union[NoneType, Annotated[Union[bool, float, str, datetime.timedelta, datetime.datetime], 'Python Scalar'], Annotated[Union[list, numpy.ndarray], 'Array Like Structure']] = None, max_nb_points: int = 20000, dimensions: tuple = None, bbox: Optional[tuple] = None, img: Optional[str] = None, chart: Union[ForwardRef('PlottingBase'), ForwardRef('TableSample'), ForwardRef('Axes'), ForwardRef('mFigure'), ForwardRef('Highchart'), ForwardRef('Highstock'), ForwardRef('Figure'), NoneType] = None, **style_kwargs) -> Union[ForwardRef('PlottingBase'), ForwardRef('TableSample'), ForwardRef('Axes'), ForwardRef('mFigure'), ForwardRef('Highchart'), ForwardRef('Highstock'), ForwardRef('Figure')] method of verticapy.core.vdataframe.base.vDataFrame instance
    Draws the scatter plot of the input vDataColumns.

    Parameters
    ----------
    columns: SQLColumns
        List of the vDataColumns names.
    by: str, optional
        Categorical vDataColumn used to label the data.
    size: str
        Numerical  vDataColumn used to represent  the
        Bubble size.
    cmap_col: str, optional
        Numerical  column used  to represent the  color
        map.
    max_cardinality: int, optional
        Maximum  number  of  distinct elements for  'by'
        to  be  used as categorical.  The less  frequent
        elements are gathered together  to create a
        new category: 'Others'.
    cat_priority: PythonScalar / ArrayLike, optional
        ArrayLike list of the different categories to
        consider when  labeling  the  data using  the
        vDataColumn 'by'.  The  other  categories  are
        filtered.
    max_nb_points: int, optional
        Maximum number of points to display.
    dimensions: tuple, optional
        Tuple of two  elements representing the IDs of the
        PCA's components. If empty and the number of input
        columns  is greater  than 3, the first and  second
        PCA are drawn.
    bbox: list, optional
        Tuple  of 4 elements to delimit the boundaries  of
        the  final Plot. It must be similar the  following
        list: [xmin, xmax, ymin, ymax]
    img: str, optional
        Path to the image to display as background.
    chart: PlottingObject, optional
        The chart object to plot on.
    **style_kwargs
        Any  optional  parameter  to pass to the  plotting
        functions.

    Returns
    -------
    obj
        Plotting Object.

Hexbin plots can be useful for generating heatmaps. These summarize data in a similar way to scatter plots, but compute aggregations to get the final results.

[18]:

vp.set_option("plotting_lib", "matplotlib")
vdf.hexbin(["SepalLengthCm", "SepalWidthCm"],
            method = "avg",
            of = "PetalWidthCm")

[18]:

<AxesSubplot:xlabel='SepalLengthCm', ylabel='SepalWidthCm'>

../../../_images/notebooks_data_exploration_charts_index_28_1.png

Hexbin, scatter, and bubble plots also allow you to provide a background image. The dataset used below is available here.

[19]:

africa = vp.read_csv("data/africa_education.csv")
# Display the average student score across Africa
africa.hexbin(["lon", "lat"],
              method = "avg",
              of = "zralocp",
              img = "img/africa.png")

[19]:

<AxesSubplot:xlabel='lon', ylabel='lat'>

../../../_images/notebooks_data_exploration_charts_index_30_1.png

It is also possible to use SHP datasets to draw maps.

[20]:

# Africa Dataset
africa_world = load_world()
africa_world = africa_world[africa_world["continent"] == "Africa"]
ax = africa_world["geometry"].geo_plot(color = "white",
                                       edgecolor='black',)

# displaying schools in Africa
africa.scatter(["lon", "lat"],
               by = "country_long",
               ax = ax,
               max_cardinality = 100)

[20]:

<AxesSubplot:xlabel='lon', ylabel='lat'>

../../../_images/notebooks_data_exploration_charts_index_32_1.png

Time-series plots are also available with the ‘plot’ method.

[21]:

vdf = load_amazon()
vdf.filter(vdf["state"]._in(['ACRE', 'RIO DE JANEIRO', 'PARÁ']))
vdf["number"].plot(ts = "date", by = "state")

5737 elements were filtered

[21]:

<AxesSubplot:xlabel='date', ylabel='number'>

../../../_images/notebooks_data_exploration_charts_index_34_2.png

Since time-series plots do not aggregate the data, it’s important to choose the correct ‘start_date’ and ‘end_date’.

[22]:

vdf["number"].plot(ts = "date",
                   by = "state",
                   start_date = "2010-01-01")

[22]:

<AxesSubplot:xlabel='date', ylabel='number'>

../../../_images/notebooks_data_exploration_charts_index_36_1.png
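
Because every row in the selected window becomes a point on the chart, narrowing the window is the main way to control how much data is drawn. The sketch below shows the kind of filtering that ‘start_date’ and ‘end_date’ imply. The `filter_by_date` helper and the yearly series are hypothetical, and treating both bounds as inclusive is an assumption.

```python
from datetime import date

def filter_by_date(rows, start_date=None, end_date=None):
    """Keep (date, value) rows whose date lies in [start_date, end_date]."""
    kept = []
    for day, value in rows:
        if start_date is not None and day < start_date:
            continue
        if end_date is not None and day > end_date:
            continue
        kept.append((day, value))
    return kept

# Made-up yearly series from 2005 to 2015.
series = [(date(year, 1, 1), year - 2000) for year in range(2005, 2016)]
recent = filter_by_date(series, start_date=date(2010, 1, 1))
print(len(recent))  # 6
```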

Plotting Libraries

VerticaPy currently integrates with three plotting libraries:

- Plotly
- Highcharts
- Matplotlib

Each library suits different use cases.

For example, in Matplotlib, each graphical function has a parameter ‘ax’ used to draw customized graphics. You can use this to place multiple plots in the same figure.

[23]:

vp.set_option("plotting_lib","matplotlib")
amazon = load_amazon()
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
amazon.acf(column = "number",
           ts = "date",
           by = ["state"],
           p = 12,
           ax = ax1)
ax1.set_xticklabels([])
ax1.set_xlabel("")
amazon.pacf(column = "number",
            ts = "date",
            by = ["state"],
            p = 12,
            ax = ax2)
plt.show()

../../../_images/notebooks_data_exploration_charts_index_39_1.png

You can customize your charts using some common input parameters, such as colors, height, and width.

[24]:

vdf = load_iris()
vdf.scatter(["SepalLengthCm", "PetalWidthCm"],
            size = "SepalWidthCm",
            by = "Species",
            max_nb_points = 1000,
            colors = ["red", "green", "blue"],)

[24]:

<AxesSubplot:xlabel='SepalLengthCm', ylabel='PetalWidthCm'>

../../../_images/notebooks_data_exploration_charts_index_41_1.png

Note: Other parameters that are specific to each plotting library are also possible. You can read the documentation of the plotting libraries to get more details.

To switch between libraries, use the following syntax:

[ ]:

vp.set_option("plotting_lib","highcharts")

You can also draw responsive graphics with the Highcharts or Plotly integration:

[25]:

vdf.scatter(["SepalLengthCm", "PetalWidthCm"],
            by = "Species",
            max_nb_points = 1000,
            colors = ["red", "green", "blue"],)

[25]:

Graphics are powerful tools and can help us understand and visualize trends in our data.