Histogram

General

Let’s begin by importing VerticaPy.

import verticapy as vp

Let’s also import numpy to create a random dataset.

import numpy as np

Let’s generate a dataset with three normally distributed score columns.

N = 100 # Number of records

data = vp.vDataFrame({
  "score1": np.random.normal(5, 1, N),
  "score2": np.random.normal(8, 1.5, N),
  "score3": np.random.normal(10, 2, N),
})

In the context of data visualization, we have the flexibility to harness multiple plotting libraries to craft a wide range of graphical representations. VerticaPy, as a versatile tool, provides support for several graphic libraries, such as Matplotlib, Highcharts, and Plotly. Each of these libraries offers unique features and capabilities, allowing us to choose the most suitable one for our specific data visualization needs.

_images/plotting_libs.png

Note

To select the desired plotting library, we simply need to use the set_option function. VerticaPy offers the flexibility to smoothly transition between different plotting libraries. In instances where a particular graphic is not supported by the chosen library or is not supported within the VerticaPy framework, the tool will automatically generate a warning and then switch to an alternative library where the graphic can be created.

Note

In VerticaPy, histograms are used for numerical features. The bin width is computed automatically using methods such as Freedman–Diaconis and Sturges. However, you can still set it manually with the ‘h’ parameter. If you are working with categorical data, you may find bar charts more relevant.
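To give a feel for how such rules pick a bin width, here is a minimal sketch of the textbook Sturges and Freedman–Diaconis formulas in plain Python. The function names are hypothetical and the quartile convention is an assumption; this is not VerticaPy's internal code, which may combine or adjust these rules differently.

```python
import math
import random
import statistics


def sturges_width(values):
    """Bin width from Sturges' rule: k = ceil(log2(n)) + 1 equal-width bins."""
    n = len(values)
    k = math.ceil(math.log2(n)) + 1
    return (max(values) - min(values)) / k


def freedman_diaconis_width(values):
    """Bin width from the Freedman-Diaconis rule: h = 2 * IQR / n^(1/3)."""
    n = len(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / n ** (1 / 3)


# Illustrative data only: 100 draws from a normal distribution (mean 5, std 1).
random.seed(0)
scores = [random.gauss(5, 1) for _ in range(100)]

print("Sturges width:", sturges_width(scores))
print("Freedman-Diaconis width:", freedman_diaconis_width(scores))
```

Either value could then be passed as ‘h’ if you prefer to control the bin width yourself.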

We can switch to the Plotly library.

vp.set_option("plotting_lib", "plotly")

In VerticaPy, you can create a single histogram or multiple histograms within the same graphic.

data["score1"].hist()

The same histogram can also be drawn from SQL. First, we load the VerticaPy chart extension.

%load_ext verticapy.chart

Let us provide a value for the interval ‘h’.

h = 1

Now, we write the SQL query using Jupyter magic cells.

%%chart -k hist
SELECT
    FLOOR(score1 / :h) * :h AS score1,
    COUNT(*) / :N AS density
FROM :data
GROUP BY 1
ORDER BY 1;

Note

N represents the number of records, and h represents the histogram interval (the bin width). In Python, h is computed automatically, whereas in SQL it must be entered manually. In SQL, we compute the histogram bins using the FLOOR function: each value is assigned to the bin that starts at FLOOR(value / h) * h.
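The FLOOR-based binning in the query can be mirrored in plain Python to check what the SQL produces. The following is a minimal sketch with a hypothetical helper name and made-up sample values; it reproduces the GROUP BY and COUNT(*) / N logic, not VerticaPy's plotting code.

```python
import math
from collections import Counter


def floor_histogram(values, h):
    """Return {bin_start: density}, keyed like FLOOR(value / h) * h in the SQL."""
    n = len(values)
    # Each value falls into the bin whose lower edge is floor(v / h) * h.
    counts = Counter(math.floor(v / h) * h for v in values)
    # Dividing each count by n mirrors COUNT(*) / N; sorting mirrors ORDER BY 1.
    return {start: count / n for start, count in sorted(counts.items())}


# Illustrative values only, not taken from the dataset above.
sample = [4.2, 4.9, 5.1, 5.5, 5.8, 6.3, 3.7, 5.0]
densities = floor_histogram(sample, h=1)
print(densities)
```

Because every record lands in exactly one bin, the densities sum to 1, which is a quick sanity check for the SQL result as well.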

To draw multiple histograms within the same graphic, pass several columns.

data.hist(columns = ["score1", "score2", "score3"])