Descriptive Statistics#
The easiest way to understand data is to aggregate it. An aggregation is a number or a category which summarizes the data. VerticaPy lets you compute all well-known aggregation in a single line.
The ‘agg’ method is the best way to compute multiple aggregations on multiple columns at the same time.
[1]:
import verticapy as vp
help(vp.vDataFrame.agg)
Help on function aggregate in module verticapy.core.vdataframe._aggregate:
aggregate(self, func: Annotated[Union[str, list[str], ForwardRef('StringSQL'), list['StringSQL']], ''], columns: Optional[Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns']] = None, ncols_block: int = 20, processes: int = 1) -> verticapy.core.tablesample.base.TableSample
Aggregates the vDataFrame using the input functions.
Parameters
----------
func: SQLExpression
| List of the different aggregations:
| **aad**: average absolute deviation
| **approx_median**: approximate median
| **approx_q%**: approximate q quantile
(ex: approx_50% for the
approximate median)
| **approx_unique**: approximative cardinality
| **count**: number of non-missing elements
| **cvar**: conditional value at risk
| **dtype**: virtual column type
| **iqr**: interquartile range
| **kurtosis**: kurtosis
| **jb**: Jarque-Bera index
| **mad**: median absolute deviation
| **max**: maximum
| **mean**: average
| **median**: median
| **min**: minimum
| **mode**: most occurent element
| **percent**: percent of non-missing elements
| **q%**: q quantile (ex: 50% for the median)
Use the ``approx_q%`` (approximate quantile)
aggregation to get better performance.
| **prod**: product
| **range**: difference between the max and the min
| **sem**: standard error of the mean
| **skewness**: skewness
| **sum**: sum
| **std**: standard deviation
| **topk**: kth most occurent element (ex: top1 for the mode)
| **topk_percent**: kth most occurent element density
| **unique**: cardinality (count distinct)
| **var**: variance
| Other aggregations will work if supported by your database
version.
columns: SQLColumns, optional
List of the vDataColumn's names. If empty, depending on the
aggregations, all or only numerical vDataColumns are used.
ncols_block: int, optional
Number of columns used per query. Setting this parameter
divides what would otherwise be one large query into many
smaller queries called "blocks", whose size is determine by
the size of ncols_block.
processes: int, optional
Number of child processes to create. Setting this with the
ncols_block parameter lets you parallelize a single query into
many smaller queries, where each child process creates its own
connection to the database and sends one query. This can improve
query performance, but consumes more resources. If processes is
set to 1, the queries are sent iteratively from a single process.
Returns
-------
TableSample
result.
Examples
--------
For this example, we will use the following dataset:
.. code-block:: python
import verticapy as vp
data = vp.vDataFrame({
"x": [1, 2, 4, 9, 10, 15, 20, 22],
"y": [1, 2, 1, 2, 1, 1, 2, 1],
"z": [10, 12, 2, 1, 9, 8, 1, 3],
})
With the ``aggregate`` method, you have the flexibility to select specific
aggregates and the columns you wish to include in the query. This
allows for more precise control over the aggregation process and helps
tailor the results to your specific needs.
.. code-block:: python
data.aggregate(
func = ["min", "approx_10%", "approx_50%", "approx_90%", "max"],
columns = ["x", "y", "z"],
)
.. ipython:: python
:suppress:
import verticapy as vp
data = vp.vDataFrame({
"x": [1, 2, 4, 9, 10, 15, 20, 22],
"y": [1, 2, 1, 2, 1, 1, 2, 1],
"z": [10, 12, 2, 1, 9, 8, 1, 3],
})
result = data.aggregate(
func = ["min", "approx_10%", "approx_50%", "approx_90%", "max"],
columns = ["x", "y", "z"],
)
html_file = open("figures/core_vDataFrame_vDFAgg_aggregate_table.html", "w")
html_file.write(result._repr_html_())
html_file.close()
.. raw:: html
:file: SPHINX_DIRECTORY/figures/core_vDataFrame_vDFAgg_aggregate_table.html
.. note:: All the calculations are pushed to the database.
.. hint::
When the vDataFrame includes a large number of columns and many aggregates
need to be computed, it can be resource-intensive for the database. To
address this, you can use the ``ncols_block`` parameter to control the number
of blocks of aggregates to use and the ``processes`` parameter to manage
the number of processes. These blocks consist of specific columns, and
their aggregates are calculated first (or in parallel), then the subsequent
ones, and the results are combined at the end.
.. seealso::
| :py:mod:`verticapy.vDataColumn.aggregate` :
Aggregations for a specific column.
| :py:mod:`verticapy.vDataColumn.describe` :
Summarizes the information within the column.
| :py:mod:`verticapy.vDataFrame.describe` :
Summarizes the information for specific columns.
This is a tremendously useful function for understanding your data. Let’s use the churn dataset.
[2]:
vdf = vp.read_csv("data/churn.csv")
vdf.agg(func = ["min", "10%", "median", "90%", "max", "kurtosis", "unique"])
[2]:
min | 10% | median | 90% | max | kurtosis | unique | |
"SeniorCitizen" | 0 | 0 | 0.0 | 1 | 1 | 1.36259589579391 | 2 |
"Partner" | 0 | 0 | 0.0 | 1 | 1 | -1.9959534211947 | 2 |
"Dependents" | 0 | 0 | 0.0 | 1 | 1 | -1.2343780571695 | 2 |
"tenure" | 0 | 2 | 29.0 | 69 | 72 | -1.38737163597169 | 73 |
"PhoneService" | 0 | 1 | 1.0 | 1 | 1 | 5.43890755508706 | 2 |
"PaperlessBilling" | 0 | 0 | 1.0 | 1 | 1 | -1.85960618560884 | 2 |
"MonthlyCharges" | 18.25 | 20.05 | 70.35 | 102.6 | 118.75 | -1.25725969454951 | 1585 |
"TotalCharges" | 18.8 | 84.6 | 1397.475 | 5976.64 | 8684.8 | -0.231798760869362 | 6530 |
"Churn" | 0 | 0 | 0.0 | 1 | 1 | -0.870211342331981 | 2 |
Some methods, like ‘describe’, are abstractions of the ‘agg’ method; they simplify the call to computing specific aggregations.
[3]:
vdf.describe()
[3]:
count | mean | std | min | approx_25% | approx_50% | approx_75% | max | |
"SeniorCitizen" | 7043 | 0.162146812437882 | 0.368611605610013 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
"Partner" | 7043 | 0.483032798523356 | 0.499747510719986 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
"Dependents" | 7043 | 0.299588243646173 | 0.458110167510015 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
"tenure" | 7043 | 32.3711486582423 | 24.5594810230944 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
"PhoneService" | 7043 | 0.903166264375975 | 0.295752231783635 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
"PaperlessBilling" | 7043 | 0.592219224762175 | 0.491456924049407 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
"MonthlyCharges" | 7043 | 64.761692460599 | 30.0900470976785 | 18.25 | 35.5 | 70.3214285714286 | 89.85 | 118.75 |
"TotalCharges" | 7032 | 2283.30044084188 | 2266.77136188315 | 18.8 | 402.683333333333 | 1397.475 | 3798.2375 | 8684.8 |
"Churn" | 7043 | 0.265369870793694 | 0.441561305121947 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
[4]:
vdf.describe(method = "all")
[4]:
123 "SeniorCitizen"Int 100% | 010 "Partner"Boolean 100% | 010 "Dependents"Boolean 100% | 123 "tenure"Int 100% | 010 "PhoneService"Boolean 100% | 010 "PaperlessBilling"Boolean 100% | 123 "MonthlyCharges"Numeric(8,3) 100% | 123 "TotalCharges"Numeric(9,3) 99% | 010 "Churn"Boolean 100% | Abc "customerID"Varchar(20) 100% | Abc "gender"Varchar(20) 100% | Abc "MultipleLines"Varchar(100) 100% | Abc "InternetService"Varchar(22) 100% | Abc "OnlineSecurity"Varchar(38) 100% | Abc "OnlineBackup"Varchar(38) 100% | Abc "DeviceProtection"Varchar(38) 100% | Abc "TechSupport"Varchar(38) 100% | Abc "StreamingTV"Varchar(38) 100% | Abc "StreamingMovies"Varchar(38) 100% | Abc "Contract"Varchar(28) 100% | Abc "PaymentMethod"Varchar(50) 100% | |
dtype | int | boolean | boolean | int | boolean | boolean | numeric(8,3) | numeric(9,3) | boolean | varchar(20) | varchar(20) | varchar(100) | varchar(22) | varchar(38) | varchar(38) | varchar(38) | varchar(38) | varchar(38) | varchar(38) | varchar(28) | varchar(50) |
percent | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 99.844 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
count | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7032 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 | 7043 |
top | 0 | 1 | 20.05 | [null] | 0002-ORFBO | Male | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Electronic check | |||||
top_percent | 83.785 | 51.697 | 70.041 | 8.704 | 90.317 | 59.222 | 0.866 | 0.156 | 73.463 | 0.014 | 50.476 | 48.133 | 43.959 | 49.666 | 43.845 | 43.944 | 49.311 | 39.898 | 39.543 | 55.019 | 33.579 |
avg | 0.162146812437882 | 0.483032798523356 | 0.299588243646173 | 32.3711486582423 | 0.903166264375975 | 0.592219224762175 | 64.761692460599 | 2283.30044084188 | 0.265369870793694 | 10.0 | 4.99048700837711 | 3.7775095839841 | 6.30001419849496 | 5.97004117563538 | 6.02825500496947 | 6.02726111032231 | 5.97359079937527 | 6.06772682095698 | 6.07127644469686 | 11.3011500780917 | 18.5702115575749 |
stddev | 0.368611605610013 | 0.499747510719986 | 0.458110167510015 | 24.5594810230944 | 0.295752231783635 | 0.491456924049407 | 30.0900470976785 | 2266.77136188315 | 0.441561305121947 | 0.0 | 1.0000257472943 | 4.03038652639303 | 4.17881808536094 | 6.86654831448753 | 6.83681433733283 | 6.83732723501384 | 6.86475308909593 | 6.81629659124962 | 6.81443723618397 | 2.98505842452654 | 5.04042209794139 |
min | 0 | 0 | 0 | 0 | 0 | 0 | 18.25 | 18.8 | 0 | 10 | 4 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 8 | 12 |
approx_25% | 0 | 0 | 0 | 9 | 1 | 0 | 35.5 | 402.683333333333 | 0 | 10 | 4 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 8 | 16 |
approx_50% | 0 | 0 | 0 | 29 | 1 | 1 | 70.3214285714286 | 1397.475 | 0 | 10 | 4 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 14 | 16 |
approx_75% | 0 | 1 | 1 | 55 | 1 | 1 | 89.85 | 3798.2375 | 1 | 10 | 6 | 3 | 11 | 3 | 3 | 3 | 3 | 3 | 3 | 14 | 23 |
max | 1 | 1 | 1 | 72 | 1 | 1 | 118.75 | 8684.8 | 1 | 10 | 6 | 16 | 11 | 19 | 19 | 19 | 19 | 19 | 19 | 14 | 25 |
range | 1 | 1 | 1 | 72 | 1 | 1 | 100.5 | 8666.0 | 1 | 0 | 2 | 14 | 9 | 17 | 17 | 17 | 17 | 17 | 17 | 6 | 13 |
empty | [null] | [null] | [null] | [null] | [null] | [null] | [null] | [null] | [null] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
[5]:
vdf.describe(method = "categorical")
[5]:
dtype | count | top | top_percent | |
"customerID" | varchar(20) | 7043 | 0002-ORFBO | 0.014 |
"gender" | varchar(20) | 7043 | Male | 50.476 |
"SeniorCitizen" | int | 7043 | 0 | 83.785 |
"Partner" | boolean | 7043 | 0 | 51.697 |
"Dependents" | boolean | 7043 | 0 | 70.041 |
"tenure" | int | 7043 | 1 | 8.704 |
"PhoneService" | boolean | 7043 | 1 | 90.317 |
"MultipleLines" | varchar(100) | 7043 | No | 48.133 |
"InternetService" | varchar(22) | 7043 | Fiber optic | 43.959 |
"OnlineSecurity" | varchar(38) | 7043 | No | 49.666 |
"OnlineBackup" | varchar(38) | 7043 | No | 43.845 |
"DeviceProtection" | varchar(38) | 7043 | No | 43.944 |
"TechSupport" | varchar(38) | 7043 | No | 49.311 |
"StreamingTV" | varchar(38) | 7043 | No | 39.898 |
"StreamingMovies" | varchar(38) | 7043 | No | 39.543 |
"Contract" | varchar(28) | 7043 | Month-to-month | 55.019 |
"PaperlessBilling" | boolean | 7043 | 1 | 59.222 |
"PaymentMethod" | varchar(50) | 7043 | Electronic check | 33.579 |
"MonthlyCharges" | numeric(8,3) | 7043 | 20.05 | 0.866 |
"TotalCharges" | numeric(9,3) | 7032 | [null] | 0.156 |
"Churn" | boolean | 7043 | 0 | 73.463 |
Multi-column aggregations can also be called with many built-in methods. For example, you can compute the ‘avg’ of all the numerical columns in just one line.
[6]:
vdf.avg()
[6]:
avg | |
"SeniorCitizen" | 0.162146812437882 |
"Partner" | 0.483032798523356 |
"Dependents" | 0.299588243646173 |
"tenure" | 32.3711486582423 |
"PhoneService" | 0.903166264375975 |
"PaperlessBilling" | 0.592219224762175 |
"MonthlyCharges" | 64.761692460599 |
"TotalCharges" | 2283.30044084188 |
"Churn" | 0.265369870793694 |
Or just the ‘median’ of a specific column.
[7]:
vdf["tenure"].median()
[7]:
29.0
The approximate median is automatically computed. Set the parameter ‘approx’ to False to get the exact median.
[8]:
vdf["tenure"].median(approx=False)
[8]:
29.0
You can also use the ‘groupby’ method to compute customized aggregations.
[9]:
# SQL way
vdf.groupby(["gender",
"Contract"],
["AVG(DECODE(Churn, 'Yes', 1, 0)) AS Churn"])
[9]:
Abc genderVarchar(20) | Abc ContractVarchar(28) | 123 ChurnFloat(22) | |
1 | Female | Two year | 0.0260355029585799 |
2 | Female | Month-to-month | 0.437402597402597 |
3 | Female | One year | 0.104456824512535 |
4 | Male | Two year | 0.0305882352941176 |
5 | Male | Month-to-month | 0.416923076923077 |
6 | Male | One year | 0.120529801324503 |
[11]:
# Pythonic way
import verticapy.stats as st
vdf.groupby(["gender", "Contract"],
[st.min(vdf["tenure"])._as("min_tenure"),
st.max(vdf["tenure"])._as("max_tenure")])
[11]:
Abc genderVarchar(20) | Abc ContractVarchar(28) | 123 min_tenureInteger | 123 max_tenureInteger | |
1 | Male | One year | 0 | 72 |
2 | Female | One year | 1 | 72 |
3 | Female | Two year | 0 | 72 |
4 | Female | Month-to-month | 1 | 71 |
5 | Male | Two year | 0 | 72 |
6 | Male | Month-to-month | 1 | 72 |
Computing many aggregations at the same time can be resource intensive. You can use the parameters ‘ncols_block’ and ‘processes’ to manage the ressources.
For example, the parameter ‘ncols_block’ will divide the main query into smaller using a specific number of columns. The parameter ‘processes’ allows you to manage the number of queries you want to send at the same time. An entire example is available in the vDataFrame.agg documentation.