Charts
Charts are a powerful tool for understanding and interpreting data. Most charts use aggregations to represent the dataset, and others downsample the data to represent a subset.
First, let’s import the modules needed for this notebook.
[1]:
# VerticaPy
from verticapy.datasets import load_titanic, load_iris, load_world, load_amazon
import verticapy as vp
# Numpy & Matplotlib
import numpy as np
import matplotlib.pyplot as plt
Let’s start with pie charts and bar charts. Drawing either one for a categorical column in VerticaPy takes a single method call.
[2]:
vp.set_option("plotting_lib", "highcharts")
vdf = load_titanic()
vdf["pclass"].bar()
[2]:
[3]:
vdf["pclass"].pie()
[3]:
These methods draw the most frequent categories and merge the others. To change the number of categories displayed, use the ‘max_cardinality’ parameter.
[4]:
vdf["home.dest"].bar()
[4]:
[5]:
vdf["home.dest"].bar(max_cardinality = 5)
[5]:
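To make the merging behavior concrete, here is a minimal pure-Python sketch of the idea: keep the most frequent categories and fold the rest into a single bucket. The helper name `top_categories`, the sample values, and the 'Others' label are illustrative assumptions; VerticaPy computes this aggregation in the database, and its tie-breaking may differ.

```python
from collections import Counter

def top_categories(values, max_cardinality):
    """Hypothetical helper: keep the `max_cardinality` most frequent
    categories and merge the remaining ones into an 'Others' bucket."""
    counts = Counter(values)
    kept = dict(counts.most_common(max_cardinality))
    others = sum(counts.values()) - sum(kept.values())
    if others:
        kept["Others"] = others
    return kept

# Made-up destinations, for illustration only
destinations = ["NY", "NY", "NY", "London", "London", "Paris", "Rome"]
summary = top_categories(destinations, max_cardinality=2)
print(summary)
```

With a larger ‘max_cardinality’, more categories keep their own bar and the 'Others' bucket shrinks.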
When dealing with numerical data types, the process is different. Vertica needs to discretize the numerical features to draw them. You can choose the bar width (‘h’ parameter) or let VerticaPy compute an optimal width using the Freedman-Diaconis rule.
[6]:
vdf["age"].hist()
[6]:
[7]:
vdf["age"].hist(h = 5)
[7]:
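For reference, the textbook Freedman-Diaconis rule sets the bin width to 2 × IQR / n^(1/3). The sketch below computes it with the standard library; the function name and sample values are assumptions, and the exact width VerticaPy picks may differ (for example, because quantiles are computed in the database).

```python
import statistics

def freedman_diaconis_width(values):
    """Textbook Freedman-Diaconis bin width: 2 * IQR / n^(1/3).
    Illustrative only; not VerticaPy's internal implementation."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / len(values) ** (1 / 3)

# Made-up sample, for illustration only
ages = [1, 2, 3, 4, 5, 6, 7, 8]
h = freedman_diaconis_width(ages)
print(round(h, 2))
```

A wider interquartile range gives wider bars, while more rows give narrower ones.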
You can also replace the occurrence count with another aggregation by using the ‘method’ and ‘of’ parameters.
[8]:
vdf["age"].hist(method = "avg", of = "survived")
[8]:
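Conceptually, ‘method’ and ‘of’ change what each bar measures: instead of counting rows per bin, the chart aggregates another column inside each bin. A minimal sketch of that idea, assuming fixed-width bins and a hypothetical helper `binned_average`:

```python
import math
from collections import defaultdict

def binned_average(x_values, y_values, h):
    """Hypothetical helper: group x into bins of width h and
    average y inside each bin."""
    groups = defaultdict(list)
    for x, y in zip(x_values, y_values):
        bin_start = math.floor(x / h) * h  # left edge of the bin
        groups[bin_start].append(y)
    return {start: sum(ys) / len(ys) for start, ys in sorted(groups.items())}

# Made-up rows, for illustration only
ages = [2, 4, 11, 13, 14, 27]
survived = [1, 1, 0, 1, 0, 0]
result = binned_average(ages, survived, h=10)
print(result)
```

Each key is the left edge of an age bin, and each value is the average of ‘survived’ for the rows in that bin.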
VerticaPy uses the same process for other graphics, like 2-dimensional histograms and bar charts.
[9]:
vdf.bar(["pclass", "survived"])
[9]:
[10]:
vdf.hist(["fare", "pclass"],
method = "avg",
of = "survived")
[10]:
Pivot tables give us aggregated information for every category and are more powerful than histograms or bar charts.
[11]:
vdf.pivot_table(["pclass", "fare"],
method = "avg",
of = "survived",
fill_none = np.nan)
[11]:
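The following sketch shows the idea behind a pivot table in plain Python: one aggregated value per pair of categories, with empty cells replaced by a fill value, as ‘fill_none’ does above. The helper `pivot_average` and the sample rows are assumptions for illustration, not VerticaPy internals.

```python
import math
from collections import defaultdict

def pivot_average(rows, cols, values, fill_none=math.nan):
    """Hypothetical helper: average `values` for every (row, col) pair
    and fill the empty cells with `fill_none`."""
    sums = defaultdict(lambda: [0.0, 0])  # (row, col) -> [total, count]
    for r, c, v in zip(rows, cols, values):
        cell = sums[(r, c)]
        cell[0] += v
        cell[1] += 1
    table = {}
    for r in sorted(set(rows)):
        table[r] = {}
        for c in sorted(set(cols)):
            if (r, c) in sums:
                total, count = sums[(r, c)]
                table[r][c] = total / count
            else:
                table[r][c] = fill_none
    return table

# Made-up rows, for illustration only
pclass = [1, 1, 2, 3, 3]
fare_band = ["high", "high", "low", "low", "low"]
survived = [1, 0, 1, 0, 0]
table = pivot_average(pclass, fare_band, survived)
print(table)
```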
Box plots are useful for understanding statistical dispersion.
[12]:
vdf.boxplot(columns = ["age", "fare"])
[12]:
[13]:
vdf["age"].boxplot()
[13]:
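As a reminder of what a box plot summarizes, the sketch below computes the quartiles, the whiskers, and the outliers for a small sample. It assumes the common 1.5 × IQR whisker convention and the standard library's default quantile method; the quantiles VerticaPy retrieves from the database may differ slightly.

```python
import statistics

def box_stats(values, whisker=1.5):
    """Illustrative box plot summary using the 1.5 * IQR convention
    (an assumption, not necessarily what VerticaPy computes)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Made-up fares, for illustration only
fares = [1, 2, 3, 4, 5, 6, 7, 50]
stats = box_stats(fares)
print(stats)
```

Here the value 50 lies beyond the upper fence, so it is reported as an outlier rather than stretching the whisker.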
Scatter and bubble plots are also useful for identifying patterns in your data. Note, however, that these methods don’t use aggregations; VerticaPy downsamples the data before plotting. You can use the ‘max_nb_points’ parameter to limit the number of points and avoid unnecessary memory usage.
[14]:
vdf = load_iris()
vdf.scatter(["SepalLengthCm", "PetalWidthCm"],
by = "Species",
max_nb_points = 1000)
[14]:
[15]:
vdf.scatter(["SepalLengthCm", "PetalWidthCm", "SepalWidthCm"],
by = "Species",
max_nb_points = 1000)
[15]:
[16]:
vdf.scatter(["SepalLengthCm", "PetalWidthCm"],
size = "SepalWidthCm",
by = "Species",
max_nb_points = 1000)
[16]:
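The effect of ‘max_nb_points’ can be pictured as a simple random downsample: if the data already fits, nothing changes; otherwise only a subset is kept. This is a sketch under that assumption, with a hypothetical `downsample` helper; how VerticaPy actually samples rows in the database is not shown here.

```python
import random

def downsample(points, max_nb_points, seed=0):
    """Hypothetical helper: return all points if they fit,
    otherwise a uniform random subset of size `max_nb_points`."""
    if len(points) <= max_nb_points:
        return list(points)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(points, max_nb_points)

# Made-up points, for illustration only
points = [(x, x * 0.5) for x in range(5000)]
subset = downsample(points, max_nb_points=1000)
print(len(subset))
```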
[17]:
help(vdf.scatter)
Help on method scatter in module verticapy.core.vdataframe._plotting:
scatter(columns: Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns'], by: Optional[str] = None, size: Optional[str] = None, cmap_col: Optional[str] = None, max_cardinality: int = 6, cat_priority: Union[NoneType, Annotated[Union[bool, float, str, datetime.timedelta, datetime.datetime], 'Python Scalar'], Annotated[Union[list, numpy.ndarray], 'Array Like Structure']] = None, max_nb_points: int = 20000, dimensions: tuple = None, bbox: Optional[tuple] = None, img: Optional[str] = None, chart: Union[ForwardRef('PlottingBase'), ForwardRef('TableSample'), ForwardRef('Axes'), ForwardRef('mFigure'), ForwardRef('Highchart'), ForwardRef('Highstock'), ForwardRef('Figure'), NoneType] = None, **style_kwargs) -> Union[ForwardRef('PlottingBase'), ForwardRef('TableSample'), ForwardRef('Axes'), ForwardRef('mFigure'), ForwardRef('Highchart'), ForwardRef('Highstock'), ForwardRef('Figure')] method of verticapy.core.vdataframe.base.vDataFrame instance
Draws the scatter plot of the input vDataColumns.
Parameters
----------
columns: SQLColumns
List of the vDataColumns names.
by: str, optional
Categorical vDataColumn used to label the data.
size: str
Numerical vDataColumn used to represent the
Bubble size.
cmap_col: str, optional
Numerical column used to represent the color
map.
max_cardinality: int, optional
Maximum number of distinct elements for 'by'
to be used as categorical. The less frequent
elements are gathered together to create a
new category: 'Others'.
cat_priority: PythonScalar / ArrayLike, optional
ArrayLike list of the different categories to
consider when labeling the data using the
vDataColumn 'by'. The other categories are
filtered.
max_nb_points: int, optional
Maximum number of points to display.
dimensions: tuple, optional
Tuple of two elements representing the IDs of the
PCA's components. If empty and the number of input
columns is greater than 3, the first and second
PCA are drawn.
bbox: list, optional
Tuple of 4 elements to delimit the boundaries of
the final Plot. It must be similar the following
list: [xmin, xmax, ymin, ymax]
img: str, optional
Path to the image to display as background.
chart: PlottingObject, optional
The chart object to plot on.
**style_kwargs
Any optional parameter to pass to the plotting
functions.
Returns
-------
obj
Plotting Object.
Hexbin plots can be useful for generating heatmaps. These summarize data in a similar way to scatter plots, but compute aggregations to get the final results.
[18]:
vp.set_option("plotting_lib", "matplotlib")
vdf.hexbin(["SepalLengthCm", "SepalWidthCm"],
method = "avg",
of = "PetalWidthCm")
[18]:
<AxesSubplot:xlabel='SepalLengthCm', ylabel='SepalWidthCm'>
Hexbin, scatter, and bubble plots also allow you to provide a background image. The dataset used below is available here.
[19]:
africa = vp.read_csv("data/africa_education.csv")
# displaying avg students score in Africa
africa.hexbin(["lon", "lat"],
method = "avg",
of = "zralocp",
img = "img/africa.png")
[19]:
<AxesSubplot:xlabel='lon', ylabel='lat'>
It is also possible to use SHP datasets to draw maps.
[20]:
# Africa Dataset
africa_world = load_world()
africa_world = africa_world[africa_world["continent"] == "Africa"]
ax = africa_world["geometry"].geo_plot(color = "white",
edgecolor='black',)
# displaying schools in Africa
africa.scatter(["lon", "lat"],
by = "country_long",
ax = ax,
max_cardinality = 100)
[20]:
<AxesSubplot:xlabel='lon', ylabel='lat'>
Time-series plots are also available with the ‘plot’ method.
[21]:
vdf = load_amazon()
vdf.filter(vdf["state"]._in(['ACRE', 'RIO DE JANEIRO', 'PARÁ']))
vdf["number"].plot(ts = "date", by = "state")
5737 elements were filtered
[21]:
<AxesSubplot:xlabel='date', ylabel='number'>
Since time-series plots do not aggregate the data, it’s important to choose the correct ‘start_date’ and ‘end_date’.
[22]:
vdf["number"].plot(ts = "date",
by = "state",
start_date = "2010-01-01")
[22]:
<AxesSubplot:xlabel='date', ylabel='number'>
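Because every remaining point is drawn, narrowing the date window directly reduces the amount of data plotted. A minimal sketch of such a window filter, assuming inclusive bounds and a hypothetical `in_window` helper (the sample series is made up):

```python
from datetime import date

def in_window(series, start_date=None, end_date=None):
    """Hypothetical helper: keep the (date, value) pairs that fall
    inside the window; bounds are assumed inclusive, None means open."""
    return [
        (d, v)
        for d, v in series
        if (start_date is None or d >= start_date)
        and (end_date is None or d <= end_date)
    ]

# Made-up observations, for illustration only
series = [
    (date(2009, 6, 1), 12),
    (date(2010, 1, 1), 7),
    (date(2011, 3, 1), 30),
]
recent = in_window(series, start_date=date(2010, 1, 1))
print(len(recent))
```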
Plotting Libraries
VerticaPy currently integrates with three plotting libraries: Plotly, Highcharts, and Matplotlib.
Each library suits different use cases.
For example, in matplotlib, each graphical function has a parameter ‘ax’ used to draw customized graphics. You can use this to draw multiple plots on the same axes.
[23]:
vp.set_option("plotting_lib","matplotlib")
amazon = load_amazon()
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
amazon.acf(column = "number",
ts = "date",
by = ["state"],
p = 12,
ax = ax1)
ax1.set_xticklabels([])
ax1.set_xlabel("")
amazon.pacf(column = "number",
ts = "date",
by = ["state"],
p = 12,
ax = ax2)
plt.show()
You can customize your charts using some common input parameters, such as colors, height, and width.
[24]:
vdf = load_iris()
vdf.scatter(["SepalLengthCm", "PetalWidthCm"],
size = "SepalWidthCm",
by = "Species",
max_nb_points = 1000,
colors = ["red", "green", "blue"],)
[24]:
<AxesSubplot:xlabel='SepalLengthCm', ylabel='PetalWidthCm'>
Note: Other parameters that are specific to each plotting library are also possible. You can read the documentation of the plotting libraries to get more details.
To switch between libraries, set the ‘plotting_lib’ option:
[ ]:
vp.set_option("plotting_lib","highcharts")
You can also draw responsive graphics with the Highcharts or Plotly integration:
[25]:
vdf.scatter(["SepalLengthCm", "PetalWidthCm"],
by = "Species",
max_nb_points = 1000,
colors = ["red", "green", "blue"],)
[25]:
Graphics are powerful tools and can help us understand and visualize trends in our data.