Descriptive Statistics#

The easiest way to understand data is to aggregate it. An aggregation is a number or a category which summarizes the data. VerticaPy lets you compute all well-known aggregation in a single line.

The ‘agg’ method is the best way to compute multiple aggregations on multiple columns at the same time.

[1]:

import verticapy as vp
help(vp.vDataFrame.agg)

Help on function aggregate in module verticapy.core.vdataframe._aggregate:

aggregate(self, func: Annotated[Union[str, list[str], ForwardRef('StringSQL'), list['StringSQL']], ''], columns: Optional[Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns']] = None, ncols_block: int = 20, processes: int = 1) -> verticapy.core.tablesample.base.TableSample
    Aggregates the vDataFrame using the input functions.

    Parameters
    ----------
    func: SQLExpression
        | List of the different aggregations:

        |    **aad**: average absolute deviation
        |    **approx_median**: approximate median
        |    **approx_q%**: approximate q quantile
                            (ex: approx_50% for the
                            approximate median)
        |    **approx_unique**: approximative cardinality
        |    **count**: number of non-missing elements
        |    **cvar**: conditional value at risk
        |    **dtype**: virtual column type
        |    **iqr**: interquartile range
        |    **kurtosis**: kurtosis
        |    **jb**: Jarque-Bera index
        |    **mad**: median absolute deviation
        |    **max**: maximum
        |    **mean**: average
        |    **median**: median
        |    **min**: minimum
        |    **mode**: most occurent element
        |    **percent**: percent of non-missing elements
        |    **q%**: q quantile (ex: 50% for the median)
                     Use the ``approx_q%`` (approximate quantile)
                     aggregation to get better performance.
        |    **prod**: product
        |    **range**: difference between the max and the min
        |    **sem**: standard error of the mean
        |    **skewness**: skewness
        |    **sum**: sum
        |    **std**: standard deviation
        |    **topk**: kth most occurent element (ex: top1 for the mode)
        |    **topk_percent**: kth most occurent element density
        |    **unique**: cardinality (count distinct)
        |    **var**: variance
        | Other aggregations will work if supported by your database
          version.
    columns: SQLColumns, optional
        List of  the vDataColumn's names. If empty,  depending on the
        aggregations, all or only numerical vDataColumns are used.
    ncols_block: int, optional
        Number  of columns  used per query.  Setting  this  parameter
        divides  what  would otherwise be  one large query into  many
        smaller  queries  called "blocks", whose size is determine by
        the size of ncols_block.
    processes: int, optional
        Number  of child processes  to  create. Setting  this  with  the
        ncols_block  parameter lets you parallelize a  single query into
        many smaller  queries, where each child process creates  its own
        connection to the database and sends one query. This can improve
        query performance, but consumes  more resources. If processes is
        set to 1, the queries are sent iteratively from a single process.

    Returns
    -------
    TableSample
        result.

    Examples
    --------
    For this example, we will use the following dataset:

    .. code-block:: python

        import verticapy as vp

        data = vp.vDataFrame({
            "x": [1, 2, 4, 9, 10, 15, 20, 22],
            "y": [1, 2, 1, 2, 1, 1, 2, 1],
            "z": [10, 12, 2, 1, 9, 8, 1, 3],
        })

    With the ``aggregate`` method, you have the flexibility to select specific
    aggregates and the columns you wish to include in the query. This
    allows for more precise control over the aggregation process and helps
    tailor the results to your specific needs.

    .. code-block:: python

        data.aggregate(
            func = ["min", "approx_10%", "approx_50%", "approx_90%", "max"],
            columns = ["x", "y", "z"],
        )

    .. ipython:: python
        :suppress:

        import verticapy as vp
        data = vp.vDataFrame({
            "x": [1, 2, 4, 9, 10, 15, 20, 22],
            "y": [1, 2, 1, 2, 1, 1, 2, 1],
            "z": [10, 12, 2, 1, 9, 8, 1, 3],
        })
        result = data.aggregate(
            func = ["min", "approx_10%", "approx_50%", "approx_90%", "max"],
            columns = ["x", "y", "z"],
        )
        html_file = open("figures/core_vDataFrame_vDFAgg_aggregate_table.html", "w")
        html_file.write(result._repr_html_())
        html_file.close()

    .. raw:: html
        :file: SPHINX_DIRECTORY/figures/core_vDataFrame_vDFAgg_aggregate_table.html

    .. note:: All the calculations are pushed to the database.

    .. hint::

        When the vDataFrame includes a large number of columns and many aggregates
        need to be computed, it can be resource-intensive for the database. To
        address this, you can use the ``ncols_block`` parameter to control the number
        of blocks of aggregates to use and the ``processes`` parameter to manage
        the number of processes. These blocks consist of specific columns, and
        their aggregates are calculated first (or in parallel), then the subsequent
        ones, and the results are combined at the end.

    .. seealso::
        | :py:mod:`verticapy.vDataColumn.aggregate` :
          Aggregations for a specific column.
        | :py:mod:`verticapy.vDataColumn.describe` :
          Summarizes the information within the column.
        | :py:mod:`verticapy.vDataFrame.describe` :
          Summarizes the information for specific columns.

This is a tremendously useful function for understanding your data. Let’s use the churn dataset.

[2]:

vdf = vp.read_csv("data/churn.csv")
vdf.agg(func = ["min", "10%", "median", "90%", "max", "kurtosis", "unique"])

[2]:

	min	10%	median	90%	max	kurtosis	unique
"SeniorCitizen"	0	0	0.0	1	1	1.36259589579391	2
"Partner"	0	0	0.0	1	1	-1.9959534211947	2
"Dependents"	0	0	0.0	1	1	-1.2343780571695	2
"tenure"	0	2	29.0	69	72	-1.38737163597169	73
"PhoneService"	0	1	1.0	1	1	5.43890755508706	2
"PaperlessBilling"	0	0	1.0	1	1	-1.85960618560884	2
"MonthlyCharges"	18.25	20.05	70.35	102.6	118.75	-1.25725969454951	1585
"TotalCharges"	18.8	84.6	1397.475	5976.64	8684.8	-0.231798760869362	6530
"Churn"	0	0	0.0	1	1	-0.870211342331981	2

Rows: 1-9 | Columns: 8

Some methods, like ‘describe’, are abstractions of the ‘agg’ method; they simplify the call to computing specific aggregations.

[3]:

vdf.describe()

[3]:

	count	mean	std	min	approx_25%	approx_50%	approx_75%	max
"SeniorCitizen"	7043	0.162146812437882	0.368611605610013	0.0	0.0	0.0	0.0	1.0
"Partner"	7043	0.483032798523356	0.499747510719986	0.0	0.0	0.0	1.0	1.0
"Dependents"	7043	0.299588243646173	0.458110167510015	0.0	0.0	0.0	1.0	1.0
"tenure"	7043	32.3711486582423	24.5594810230944	0.0	9.0	29.0	55.0	72.0
"PhoneService"	7043	0.903166264375975	0.295752231783635	0.0	1.0	1.0	1.0	1.0
"PaperlessBilling"	7043	0.592219224762175	0.491456924049407	0.0	0.0	1.0	1.0	1.0
"MonthlyCharges"	7043	64.761692460599	30.0900470976785	18.25	35.5	70.3214285714286	89.85	118.75
"TotalCharges"	7032	2283.30044084188	2266.77136188315	18.8	402.683333333333	1397.475	3798.2375	8684.8
"Churn"	7043	0.265369870793694	0.441561305121947	0.0	0.0	0.0	1.0	1.0

Rows: 1-9 | Columns: 9

[4]:

vdf.describe(method = "all")

[4]:

	123 "SeniorCitizen" Int 100%	010 "Partner" Boolean 100%	010 "Dependents" Boolean 100%	123 "tenure" Int 100%	010 "PhoneService" Boolean 100%	010 "PaperlessBilling" Boolean 100%	123 "MonthlyCharges" Numeric(8,3) 100%	123 "TotalCharges" Numeric(9,3) 99%	010 "Churn" Boolean 100%	Abc "customerID" Varchar(20) 100%	Abc "gender" Varchar(20) 100%	Abc "MultipleLines" Varchar(100) 100%	Abc "InternetService" Varchar(22) 100%	Abc "OnlineSecurity" Varchar(38) 100%	Abc "OnlineBackup" Varchar(38) 100%	Abc "DeviceProtection" Varchar(38) 100%	Abc "TechSupport" Varchar(38) 100%	Abc "StreamingTV" Varchar(38) 100%	Abc "StreamingMovies" Varchar(38) 100%	Abc "Contract" Varchar(28) 100%	Abc "PaymentMethod" Varchar(50) 100%
dtype	int	boolean	boolean	int	boolean	boolean	numeric(8,3)	numeric(9,3)	boolean	varchar(20)	varchar(20)	varchar(100)	varchar(22)	varchar(38)	varchar(38)	varchar(38)	varchar(38)	varchar(38)	varchar(38)	varchar(28)	varchar(50)
percent	100.0	100.0	100.0	100.0	100.0	100.0	100.0	99.844	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0
count	7043	7043	7043	7043	7043	7043	7043	7032	7043	7043	7043	7043	7043	7043	7043	7043	7043	7043	7043	7043	7043
top	0	❌	❌	1	✅	✅	20.05	[null]	❌	0002-ORFBO	Male	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Electronic check
top_percent	83.785	51.697	70.041	8.704	90.317	59.222	0.866	0.156	73.463	0.014	50.476	48.133	43.959	49.666	43.845	43.944	49.311	39.898	39.543	55.019	33.579
avg	0.162146812437882	0.483032798523356	0.299588243646173	32.3711486582423	0.903166264375975	0.592219224762175	64.761692460599	2283.30044084188	0.265369870793694	10.0	4.99048700837711	3.7775095839841	6.30001419849496	5.97004117563538	6.02825500496947	6.02726111032231	5.97359079937527	6.06772682095698	6.07127644469686	11.3011500780917	18.5702115575749
stddev	0.368611605610013	0.499747510719986	0.458110167510015	24.5594810230944	0.295752231783635	0.491456924049407	30.0900470976785	2266.77136188315	0.441561305121947	0.0	1.0000257472943	4.03038652639303	4.17881808536094	6.86654831448753	6.83681433733283	6.83732723501384	6.86475308909593	6.81629659124962	6.81443723618397	2.98505842452654	5.04042209794139
min	0	0	0	0	0	0	18.25	18.8	0	10	4	2	2	2	2	2	2	2	2	8	12
approx_25%	0	0	0	9	1	0	35.5	402.683333333333	0	10	4	2	3	2	2	2	2	2	2	8	16
approx_50%	0	0	0	29	1	1	70.3214285714286	1397.475	0	10	4	3	3	3	3	3	3	3	3	14	16
approx_75%	0	1	1	55	1	1	89.85	3798.2375	1	10	6	3	11	3	3	3	3	3	3	14	23
max	1	1	1	72	1	1	118.75	8684.8	1	10	6	16	11	19	19	19	19	19	19	14	25
range	1	1	1	72	1	1	100.5	8666.0	1	0	2	14	9	17	17	17	17	17	17	6	13
empty	[null]	[null]	[null]	[null]	[null]	[null]	[null]	[null]	[null]	0	0	0	0	0	0	0	0	0	0	0	0

Rows: 1-14 | Columns: 22

[5]:

vdf.describe(method = "categorical")

[5]:

	dtype	count	top	top_percent
"customerID"	varchar(20)	7043	0002-ORFBO	0.014
"gender"	varchar(20)	7043	Male	50.476
"SeniorCitizen"	int	7043	0	83.785
"Partner"	boolean	7043	0	51.697
"Dependents"	boolean	7043	0	70.041
"tenure"	int	7043	1	8.704
"PhoneService"	boolean	7043	1	90.317
"MultipleLines"	varchar(100)	7043	No	48.133
"InternetService"	varchar(22)	7043	Fiber optic	43.959
"OnlineSecurity"	varchar(38)	7043	No	49.666
"OnlineBackup"	varchar(38)	7043	No	43.845
"DeviceProtection"	varchar(38)	7043	No	43.944
"TechSupport"	varchar(38)	7043	No	49.311
"StreamingTV"	varchar(38)	7043	No	39.898
"StreamingMovies"	varchar(38)	7043	No	39.543
"Contract"	varchar(28)	7043	Month-to-month	55.019
"PaperlessBilling"	boolean	7043	1	59.222
"PaymentMethod"	varchar(50)	7043	Electronic check	33.579
"MonthlyCharges"	numeric(8,3)	7043	20.05	0.866
"TotalCharges"	numeric(9,3)	7032	[null]	0.156
"Churn"	boolean	7043	0	73.463

Rows: 1-21 | Columns: 5

Multi-column aggregations can also be called with many built-in methods. For example, you can compute the ‘avg’ of all the numerical columns in just one line.

[6]:

vdf.avg()

[6]:

	avg
"SeniorCitizen"	0.162146812437882
"Partner"	0.483032798523356
"Dependents"	0.299588243646173
"tenure"	32.3711486582423
"PhoneService"	0.903166264375975
"PaperlessBilling"	0.592219224762175
"MonthlyCharges"	64.761692460599
"TotalCharges"	2283.30044084188
"Churn"	0.265369870793694

Rows: 1-9 | Columns: 2

Or just the ‘median’ of a specific column.

[7]:

vdf["tenure"].median()

[7]:

29.0

The approximate median is automatically computed. Set the parameter ‘approx’ to False to get the exact median.

[8]:

vdf["tenure"].median(approx=False)

[8]:

29.0

You can also use the ‘groupby’ method to compute customized aggregations.

[9]:

# SQL way
vdf.groupby(["gender",
             "Contract"],
            ["AVG(DECODE(Churn, 'Yes', 1, 0)) AS Churn"])

[9]:

	Abc gender Varchar(20)	Abc Contract Varchar(28)	123 Churn Float(22)
1	Female	Two year	0.0260355029585799
2	Female	Month-to-month	0.437402597402597
3	Female	One year	0.104456824512535
4	Male	Two year	0.0305882352941176
5	Male	Month-to-month	0.416923076923077
6	Male	One year	0.120529801324503

Rows: 1-6 | Columns: 3

[11]:

# Pythonic way
import verticapy.stats as st
vdf.groupby(["gender", "Contract"],
            [st.min(vdf["tenure"])._as("min_tenure"),
             st.max(vdf["tenure"])._as("max_tenure")])

[11]:

	Abc gender Varchar(20)	Abc Contract Varchar(28)	123 min_tenure Integer	123 max_tenure Integer
1	Male	One year	0	72
2	Female	One year	1	72
3	Female	Two year	0	72
4	Female	Month-to-month	1	71
5	Male	Two year	0	72
6	Male	Month-to-month	1	72

Rows: 1-6 | Columns: 4

Computing many aggregations at the same time can be resource intensive. You can use the parameters ‘ncols_block’ and ‘processes’ to manage the ressources.

For example, the parameter ‘ncols_block’ will divide the main query into smaller using a specific number of columns. The parameter ‘processes’ allows you to manage the number of queries you want to send at the same time. An entire example is available in the vDataFrame.agg documentation.