VerticaPy
Descriptive Statistics
The easiest way to understand data is to aggregate it. An aggregation is a number or a category that summarizes the data. VerticaPy lets you compute all well-known aggregations in a single line.
The 'agg' method is the best way to compute multiple aggregations on multiple columns at the same time.
import verticapy as vp
help(vp.vDataFrame.agg)
This is a tremendously useful function for understanding your data. Let's use the churn dataset.
vdf = vp.read_csv("data/churn.csv")
vdf.agg(func = ["min", "10%", "median", "90%", "max", "kurtosis", "unique"])
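To make the meaning of a few of these aggregations concrete, here is a minimal pure-Python sketch that does not use VerticaPy or a database. The values in `tenure_sample` are invented for illustration and are not taken from the churn dataset.

```python
import statistics

# Hypothetical sample values, invented for illustration only.
tenure_sample = [1, 5, 5, 12, 24, 36, 72]

summary = {
    "min": min(tenure_sample),
    "median": statistics.median(tenure_sample),
    "max": max(tenure_sample),
    "unique": len(set(tenure_sample)),  # number of distinct values
}
print(summary)
```

Each entry reduces the whole column to a single summary value, which is what 'agg' does for every requested function and column.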
Some methods, like 'describe', are abstractions of the 'agg' method; they simplify the call for computing a specific set of aggregations.
vdf.describe()
vdf.describe(method = "all")
vdf.describe(method = "categorical")
Multi-column aggregations can also be called with many built-in methods. For example, you can compute the 'avg' of all the numerical columns in just one line.
vdf.avg()
Or just the 'median' of a specific column.
vdf["tenure"].median()
By default, the approximate median is computed. Set the parameter 'approx' to False to get the exact median.
vdf["tenure"].median(approx=False)
You can also use the 'groupby' method to compute customized aggregations.
# SQL way
vdf.groupby(["gender",
             "Contract"],
            ["AVG(DECODE(Churn, 'Yes', 1, 0)) AS Churn"])
# Pythonic way
import verticapy.stats as st
vdf.groupby([vdf["gender"],
             vdf["Contract"]],
            [st.min(vdf["tenure"])._as("min_tenure"),
             st.max(vdf["tenure"])._as("max_tenure")])
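The SQL expression `AVG(DECODE(Churn, 'Yes', 1, 0))` maps each 'Yes' to 1 and any other value to 0, then averages the result, which gives the churn rate of each group. The following is a minimal sketch of the same logic in plain Python, with made-up rows (not from the churn dataset) and no database involved.

```python
from collections import defaultdict

# Hypothetical rows: (gender, Contract, Churn). Invented for illustration.
rows = [
    ("Female", "Month-to-month", "Yes"),
    ("Female", "Month-to-month", "No"),
    ("Male", "Two year", "No"),
    ("Male", "Two year", "No"),
    ("Male", "Month-to-month", "Yes"),
]

# Collect the 0/1 flags for each (gender, Contract) group.
flags = defaultdict(list)
for gender, contract, churn in rows:
    flags[(gender, contract)].append(1 if churn == "Yes" else 0)

# Averaging the flags gives the share of churned customers in each group.
churn_rate = {group: sum(values) / len(values) for group, values in flags.items()}
print(churn_rate)
```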
Computing many aggregations at the same time can be resource intensive. You can use the parameters 'ncols_block' and 'processes' to manage resources.
For example, the parameter 'ncols_block' divides the main query into smaller queries, each covering a specific number of columns. The parameter 'processes' sets the number of queries sent at the same time. A complete example is available in the vDataFrame.agg documentation.
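The idea behind 'ncols_block' and 'processes' can be sketched without a database. The helpers below (`split_into_blocks`, `fake_query`) and the column names are hypothetical and are not part of VerticaPy. A thread pool stands in for whatever concurrency mechanism the library actually uses. The sketch only shows how a list of columns can be split into blocks and how the number of simultaneous block queries can be capped.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(columns, ncols_block):
    """Split the column list into consecutive blocks of at most ncols_block columns."""
    return [columns[i:i + ncols_block] for i in range(0, len(columns), ncols_block)]

def fake_query(block):
    """Stand-in for one aggregation query; it only reports the columns it would cover."""
    return f"aggregate({', '.join(block)})"

# Hypothetical column names, for illustration only.
columns = ["c1", "c2", "c3", "c4", "c5"]
blocks = split_into_blocks(columns, ncols_block=2)

# At most `processes` block queries run at the same time.
processes = 2
with ThreadPoolExecutor(max_workers=processes) as pool:
    results = list(pool.map(fake_query, blocks))

print(blocks)
print(results)
```

Smaller blocks mean more, lighter queries; a higher number of concurrent queries shortens the total wait at the cost of more load on the database.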
