Descriptive Statistics#

The easiest way to understand data is to aggregate it. An aggregation is a number or a category which summarizes the data. VerticaPy lets you compute all well-known aggregation in a single line.

The ‘agg’ method is the best way to compute multiple aggregations on multiple columns at the same time.

[1]:
import verticapy as vp
help(vp.vDataFrame.agg)
Help on function aggregate in module verticapy.core.vdataframe._aggregate:

aggregate(self, func: Annotated[Union[str, list[str], ForwardRef('StringSQL'), list['StringSQL']], ''], columns: Optional[Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns']] = None, ncols_block: int = 20, processes: int = 1) -> verticapy.core.tablesample.base.TableSample
    Aggregates the vDataFrame using the input functions.

    Parameters
    ----------
    func: SQLExpression
        | List of the different aggregations:

        |    **aad**: average absolute deviation
        |    **approx_median**: approximate median
        |    **approx_q%**: approximate q quantile
                            (ex: approx_50% for the
                            approximate median)
        |    **approx_unique**: approximative cardinality
        |    **count**: number of non-missing elements
        |    **cvar**: conditional value at risk
        |    **dtype**: virtual column type
        |    **iqr**: interquartile range
        |    **kurtosis**: kurtosis
        |    **jb**: Jarque-Bera index
        |    **mad**: median absolute deviation
        |    **max**: maximum
        |    **mean**: average
        |    **median**: median
        |    **min**: minimum
        |    **mode**: most occurent element
        |    **percent**: percent of non-missing elements
        |    **q%**: q quantile (ex: 50% for the median)
                     Use the ``approx_q%`` (approximate quantile)
                     aggregation to get better performance.
        |    **prod**: product
        |    **range**: difference between the max and the min
        |    **sem**: standard error of the mean
        |    **skewness**: skewness
        |    **sum**: sum
        |    **std**: standard deviation
        |    **topk**: kth most occurent element (ex: top1 for the mode)
        |    **topk_percent**: kth most occurent element density
        |    **unique**: cardinality (count distinct)
        |    **var**: variance
        | Other aggregations will work if supported by your database
          version.
    columns: SQLColumns, optional
        List of  the vDataColumn's names. If empty,  depending on the
        aggregations, all or only numerical vDataColumns are used.
    ncols_block: int, optional
        Number  of columns  used per query.  Setting  this  parameter
        divides  what  would otherwise be  one large query into  many
        smaller  queries  called "blocks", whose size is determine by
        the size of ncols_block.
    processes: int, optional
        Number  of child processes  to  create. Setting  this  with  the
        ncols_block  parameter lets you parallelize a  single query into
        many smaller  queries, where each child process creates  its own
        connection to the database and sends one query. This can improve
        query performance, but consumes  more resources. If processes is
        set to 1, the queries are sent iteratively from a single process.

    Returns
    -------
    TableSample
        result.

    Examples
    --------
    For this example, we will use the following dataset:

    .. code-block:: python

        import verticapy as vp

        data = vp.vDataFrame({
            "x": [1, 2, 4, 9, 10, 15, 20, 22],
            "y": [1, 2, 1, 2, 1, 1, 2, 1],
            "z": [10, 12, 2, 1, 9, 8, 1, 3],
        })

    With the ``aggregate`` method, you have the flexibility to select specific
    aggregates and the columns you wish to include in the query. This
    allows for more precise control over the aggregation process and helps
    tailor the results to your specific needs.

    .. code-block:: python

        data.aggregate(
            func = ["min", "approx_10%", "approx_50%", "approx_90%", "max"],
            columns = ["x", "y", "z"],
        )

    .. ipython:: python
        :suppress:

        import verticapy as vp
        data = vp.vDataFrame({
            "x": [1, 2, 4, 9, 10, 15, 20, 22],
            "y": [1, 2, 1, 2, 1, 1, 2, 1],
            "z": [10, 12, 2, 1, 9, 8, 1, 3],
        })
        result = data.aggregate(
            func = ["min", "approx_10%", "approx_50%", "approx_90%", "max"],
            columns = ["x", "y", "z"],
        )
        html_file = open("figures/core_vDataFrame_vDFAgg_aggregate_table.html", "w")
        html_file.write(result._repr_html_())
        html_file.close()

    .. raw:: html
        :file: SPHINX_DIRECTORY/figures/core_vDataFrame_vDFAgg_aggregate_table.html

    .. note:: All the calculations are pushed to the database.

    .. hint::

        When the vDataFrame includes a large number of columns and many aggregates
        need to be computed, it can be resource-intensive for the database. To
        address this, you can use the ``ncols_block`` parameter to control the number
        of blocks of aggregates to use and the ``processes`` parameter to manage
        the number of processes. These blocks consist of specific columns, and
        their aggregates are calculated first (or in parallel), then the subsequent
        ones, and the results are combined at the end.

    .. seealso::
        | :py:mod:`verticapy.vDataColumn.aggregate` :
          Aggregations for a specific column.
        | :py:mod:`verticapy.vDataColumn.describe` :
          Summarizes the information within the column.
        | :py:mod:`verticapy.vDataFrame.describe` :
          Summarizes the information for specific columns.

This is a tremendously useful function for understanding your data. Let’s use the churn dataset.

[2]:
vdf = vp.read_csv("data/churn.csv")
vdf.agg(func = ["min", "10%", "median", "90%", "max", "kurtosis", "unique"])
[2]:
min
10%
median
90%
max
kurtosis
unique
"SeniorCitizen"000.0111.362595895793912
"Partner"000.011-1.99595342119472
"Dependents"000.011-1.23437805716952
"tenure"0229.06972-1.3873716359716973
"PhoneService"011.0115.438907555087062
"PaperlessBilling"001.011-1.859606185608842
"MonthlyCharges"18.2520.0570.35102.6118.75-1.257259694549511585
"TotalCharges"18.884.61397.4755976.648684.8-0.2317987608693626530
"Churn"000.011-0.8702113423319812
Rows: 1-9 | Columns: 8

Some methods, like ‘describe’, are abstractions of the ‘agg’ method; they simplify the call to computing specific aggregations.

[3]:
vdf.describe()
[3]:
count
mean
std
min
approx_25%
approx_50%
approx_75%
max
"SeniorCitizen"70430.1621468124378820.3686116056100130.00.00.00.01.0
"Partner"70430.4830327985233560.4997475107199860.00.00.01.01.0
"Dependents"70430.2995882436461730.4581101675100150.00.00.01.01.0
"tenure"704332.371148658242324.55948102309440.09.029.055.072.0
"PhoneService"70430.9031662643759750.2957522317836350.01.01.01.01.0
"PaperlessBilling"70430.5922192247621750.4914569240494070.00.01.01.01.0
"MonthlyCharges"704364.76169246059930.090047097678518.2535.570.321428571428689.85118.75
"TotalCharges"70322283.300440841882266.7713618831518.8402.6833333333331397.4753798.23758684.8
"Churn"70430.2653698707936940.4415613051219470.00.00.01.01.0
Rows: 1-9 | Columns: 9
[4]:
vdf.describe(method = "all")
[4]:
123
"SeniorCitizen"
Int
100%
010
"Partner"
Boolean
100%
010
"Dependents"
Boolean
100%
123
"tenure"
Int
100%
010
"PhoneService"
Boolean
100%
010
"PaperlessBilling"
Boolean
100%
123
"MonthlyCharges"
Numeric(8,3)
100%
123
"TotalCharges"
Numeric(9,3)
99%
010
"Churn"
Boolean
100%
Abc
"customerID"
Varchar(20)
100%
Abc
"gender"
Varchar(20)
100%
Abc
"MultipleLines"
Varchar(100)
100%
Abc
"InternetService"
Varchar(22)
100%
Abc
"OnlineSecurity"
Varchar(38)
100%
Abc
"OnlineBackup"
Varchar(38)
100%
Abc
"DeviceProtection"
Varchar(38)
100%
Abc
"TechSupport"
Varchar(38)
100%
Abc
"StreamingTV"
Varchar(38)
100%
Abc
"StreamingMovies"
Varchar(38)
100%
Abc
"Contract"
Varchar(28)
100%
Abc
"PaymentMethod"
Varchar(50)
100%
dtypeintbooleanbooleanintbooleanbooleannumeric(8,3)numeric(9,3)booleanvarchar(20)varchar(20)varchar(100)varchar(22)varchar(38)varchar(38)varchar(38)varchar(38)varchar(38)varchar(38)varchar(28)varchar(50)
percent100.0100.0100.0100.0100.0100.0100.099.844100.0100.0100.0100.0100.0100.0100.0100.0100.0100.0100.0100.0100.0
count704370437043704370437043704370327043704370437043704370437043704370437043704370437043
top0
1
20.05[null]
0002-ORFBOMaleNoFiber opticNoNoNoNoNoNoMonth-to-monthElectronic check
top_percent83.78551.69770.0418.70490.31759.2220.8660.15673.4630.01450.47648.13343.95949.66643.84543.94449.31139.89839.54355.01933.579
avg0.1621468124378820.4830327985233560.29958824364617332.37114865824230.9031662643759750.59221922476217564.7616924605992283.300440841880.26536987079369410.04.990487008377113.77750958398416.300014198494965.970041175635386.028255004969476.027261110322315.973590799375276.067726820956986.0712764446968611.301150078091718.5702115575749
stddev0.3686116056100130.4997475107199860.45811016751001524.55948102309440.2957522317836350.49145692404940730.09004709767852266.771361883150.4415613051219470.01.00002574729434.030386526393034.178818085360946.866548314487536.836814337332836.837327235013846.864753089095936.816296591249626.814437236183972.985058424526545.04042209794139
min00000018.2518.8010422222222812
approx_25%00091035.5402.683333333333010423222222816
approx_50%000291170.32142857142861397.4750104333333331416
approx_75%011551189.853798.237511063113333331423
max1117211118.758684.8110616111919191919191425
range1117211100.58666.0102149171717171717613
empty[null][null][null][null][null][null][null][null][null]000000000000
Rows: 1-14 | Columns: 22
[5]:
vdf.describe(method = "categorical")
[5]:
dtype
count
top
top_percent
"customerID"varchar(20)70430002-ORFBO0.014
"gender"varchar(20)7043Male50.476
"SeniorCitizen"int7043083.785
"Partner"boolean7043051.697
"Dependents"boolean7043070.041
"tenure"int704318.704
"PhoneService"boolean7043190.317
"MultipleLines"varchar(100)7043No48.133
"InternetService"varchar(22)7043Fiber optic43.959
"OnlineSecurity"varchar(38)7043No49.666
"OnlineBackup"varchar(38)7043No43.845
"DeviceProtection"varchar(38)7043No43.944
"TechSupport"varchar(38)7043No49.311
"StreamingTV"varchar(38)7043No39.898
"StreamingMovies"varchar(38)7043No39.543
"Contract"varchar(28)7043Month-to-month55.019
"PaperlessBilling"boolean7043159.222
"PaymentMethod"varchar(50)7043Electronic check33.579
"MonthlyCharges"numeric(8,3)704320.050.866
"TotalCharges"numeric(9,3)7032[null]0.156
"Churn"boolean7043073.463
Rows: 1-21 | Columns: 5

Multi-column aggregations can also be called with many built-in methods. For example, you can compute the ‘avg’ of all the numerical columns in just one line.

[6]:
vdf.avg()
[6]:
avg
"SeniorCitizen"0.162146812437882
"Partner"0.483032798523356
"Dependents"0.299588243646173
"tenure"32.3711486582423
"PhoneService"0.903166264375975
"PaperlessBilling"0.592219224762175
"MonthlyCharges"64.761692460599
"TotalCharges"2283.30044084188
"Churn"0.265369870793694
Rows: 1-9 | Columns: 2

Or just the ‘median’ of a specific column.

[7]:
vdf["tenure"].median()
[7]:
29.0

The approximate median is automatically computed. Set the parameter ‘approx’ to False to get the exact median.

[8]:
vdf["tenure"].median(approx=False)
[8]:
29.0

You can also use the ‘groupby’ method to compute customized aggregations.

[9]:
# SQL way
vdf.groupby(["gender",
             "Contract"],
            ["AVG(DECODE(Churn, 'Yes', 1, 0)) AS Churn"])
[9]:
Abc
gender
Varchar(20)
Abc
Contract
Varchar(28)
123
Churn
Float(22)
1FemaleTwo year0.0260355029585799
2FemaleMonth-to-month0.437402597402597
3FemaleOne year0.104456824512535
4MaleTwo year0.0305882352941176
5MaleMonth-to-month0.416923076923077
6MaleOne year0.120529801324503
Rows: 1-6 | Columns: 3
[11]:
# Pythonic way
import verticapy.stats as st
vdf.groupby(["gender", "Contract"],
            [st.min(vdf["tenure"])._as("min_tenure"),
             st.max(vdf["tenure"])._as("max_tenure")])
[11]:
Abc
gender
Varchar(20)
Abc
Contract
Varchar(28)
123
min_tenure
Integer
123
max_tenure
Integer
1MaleOne year072
2FemaleOne year172
3FemaleTwo year072
4FemaleMonth-to-month171
5MaleTwo year072
6MaleMonth-to-month172
Rows: 1-6 | Columns: 4

Computing many aggregations at the same time can be resource intensive. You can use the parameters ‘ncols_block’ and ‘processes’ to manage the ressources.

For example, the parameter ‘ncols_block’ will divide the main query into smaller using a specific number of columns. The parameter ‘processes’ allows you to manage the number of queries you want to send at the same time. An entire example is available in the vDataFrame.agg documentation.