verticapy.vDataColumn.aggregate#

vDataColumn.aggregate(func: list) → TableSample#

Aggregates the vDataColumn using the input functions.

Parameters#

func: SQLExpression

List of the different aggregations:

  • aad:

    average absolute deviation.

  • approx_median:

    approximate median.

  • approx_q%:

    approximate q quantile (ex: approx_50% for the approximate median).

  • approx_unique:

    approximate cardinality.

  • count:

    number of non-missing elements.

  • cvar:

    conditional value at risk.

  • dtype:

    virtual column type.

  • iqr:

    interquartile range.

  • kurtosis:

    kurtosis.

  • jb:

    Jarque-Bera index.

  • mad:

    median absolute deviation.

  • max:

    maximum.

  • mean:

    average.

  • median:

    median.

  • min:

    minimum.

  • mode:

    most frequent element.

  • percent:

    percent of non-missing elements.

  • q%:

    q quantile (ex: 50% for the median). Use the approx_q% (approximate quantile) aggregation to get better performance.

  • prod:

    product.

  • range:

    difference between the max and the min.

  • sem:

    standard error of the mean.

  • skewness:

    skewness.

  • sum:

    sum.

  • std:

    standard deviation.

  • topk:

    kth most frequent element (ex: top1 for the mode).

  • topk_percent:

    density of the kth most frequent element.

  • unique:

    cardinality (count distinct).

  • var:

    variance.

Other aggregations will work if supported by your database version.
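
To make a few of the definitions above concrete, here is a minimal plain-Python sketch of what aad, mad, range, sem, and topk describe. It runs locally on ordinary lists and does not call VerticaPy or the database. The helper names are hypothetical, and using the sample standard deviation for sem and a percentage for the topk density are assumptions, so database results may differ in detail.

```python
import math
import statistics
from collections import Counter

# Illustrative values only; nothing here touches the database.
x = [1, 2, 4, 9, 10, 15, 20, 22]
y = [1, 2, 1, 2, 1, 1, 2, 1]


def aad(values):
    """Average absolute deviation from the mean."""
    mean = statistics.fmean(values)
    return sum(abs(v - mean) for v in values) / len(values)


def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def value_range(values):
    """Difference between the max and the min."""
    return max(values) - min(values)


def sem(values):
    """Standard error of the mean (assumption: sample standard deviation)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def topk(values, k):
    """kth most frequent element (k starts at 1) and its share of rows in percent."""
    element, count = Counter(values).most_common(k)[k - 1]
    return element, 100 * count / len(values)


print("aad(x)   =", aad(x))
print("mad(x)   =", mad(x))
print("range(x) =", value_range(x))
print("sem(x)   =", sem(x))
print("top1(y)  =", topk(y, 1))
print("top2(y)  =", topk(y, 2))
```

In VerticaPy itself, the corresponding names ("aad", "mad", "range", and so on) are passed as strings in func, and the computation happens in the database rather than in Python.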

columns: SQLColumns, optional

List of the vDataColumn’s names. If empty, depending on the aggregations, all or only numerical vDataColumns are used.

ncols_block: int, optional

Number of columns used per query. Setting this parameter divides what would otherwise be one large query into many smaller queries called “blocks”, whose size is determined by ncols_block.

processes: int, optional

Number of child processes to create. Setting this with the ncols_block parameter lets you parallelize a single query into many smaller queries, where each child process creates its own connection to the database and sends one query. This can improve query performance, but consumes more resources. If processes is set to 1, the queries are sent iteratively from a single process.
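
The block splitting described for ncols_block can be pictured with a short plain-Python sketch. The function split_into_blocks is a hypothetical helper written for illustration, not part of VerticaPy, and it makes no claim about how the library builds its queries internally.

```python
def split_into_blocks(columns, ncols_block):
    """Split a list of column names into consecutive blocks of at most ncols_block."""
    if ncols_block < 1:
        raise ValueError("ncols_block must be a positive integer")
    return [
        columns[start:start + ncols_block]
        for start in range(0, len(columns), ncols_block)
    ]


# Hypothetical column list; each block would correspond to one smaller query.
columns = ["x", "y", "z"]
blocks = split_into_blocks(columns, ncols_block=2)

for number, block in enumerate(blocks, start=1):
    print(f"block {number}: {block}")
```

Under this picture, processes = 1 would send the blocks one after another, while a larger value would let several workers each send one block over its own connection.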

Returns#

TableSample

result.

Examples#

For this example, we will use the following dataset:

import verticapy as vp

data = vp.vDataFrame(
    {
        "x": [1, 2, 4, 9, 10, 15, 20, 22],
        "y": [1, 2, 1, 2, 1, 1, 2, 1],
        "z": [10, 12, 2, 1, 9, 8, 1, 3],
    }
)

The aggregate method computes only the aggregations listed in func, so you can restrict the query to exactly the statistics you need.

data["x"].aggregate(
    func = ["min", "approx_10%", "approx_50%", "approx_90%", "max"],
)

            "x"
min         1.0
approx_10%  1.7
approx_50%  9.5
approx_90%  20.6
max         22.0
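
As a local cross-check, the values in this output coincide with quantiles obtained by linear interpolation between the sorted values of "x". The sketch below reproduces them with the Python standard library. It is only a sanity check on this small sample and is not a description of how the database computes approximate quantiles, which may give slightly different results on larger data.

```python
import statistics

# Same values as column "x" in the example dataset.
x = [1, 2, 4, 9, 10, 15, 20, 22]

# Nine cut points (10%, 20%, ..., 90%), interpolated linearly between
# order statistics, with min and max treated as the 0% and 100% points.
deciles = statistics.quantiles(x, n=10, method="inclusive")

local = {
    "min": min(x),
    "approx_10%": deciles[0],
    "approx_50%": deciles[4],
    "approx_90%": deciles[8],
    "max": max(x),
}

for name, value in local.items():
    print(f"{name:<12}{value}")
```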

Note

All the calculations are pushed to the database.

See also

vDataFrame.aggregate() : Aggregations for specific columns.
vDataColumn.describe() : Summarizes the information within the column.
vDataFrame.describe() : Summarizes the information for specific columns.