Loading...

verticapy.vDataFrame.groupby#

vDataFrame.groupby(columns: str | list[str], expr: str | list[str] | StringSQL | list[StringSQL] | None = None, rollup: bool | list[bool] = False, having: str | None = None) vDataFrame#

This method facilitates the aggregation of the vDataFrame by grouping its elements based on one or more specified criteria. Grouping is a critical operation in data analysis, as it allows us to segment data into subsets, making it easier to apply various aggregation functions and gain insights specific to each group.

The groupby method can be applied to one or more columns, and it is particularly valuable when we want to calculate aggregate statistics or perform operations within distinct categories or segments of our data. By grouping the elements, we can perform custom analyses, create summary statistics, or uncover patterns that might not be apparent when looking at the entire dataset as a whole. It is a foundational method in data analysis and is used extensively to explore and understand data dynamics in numerous domains.

Parameters#

columns: SQLColumns

List of the vDataColumns used to group the elements or a customized expression. If rollup is set to True, this can be a list of tuples.

expr: SQLExpression, optional

List of the different aggregations in pure SQL. Aliases can be used. For example, SUM(column) or AVG(column) AS  my_new_alias are valid whereas AVG is invalid. Aliases are recommended to keep the track of the features and to prevent ambiguous names. For example, the MODE function does not exist, but can be replicated by using the analytic method and then grouping the result.

rollup: bool / list of bools, optional

If set to True, the rollup operator is used. If set to a list of bools, the rollup operator is used on the matching indexes and the length of rollup must match the length of columns. For example, for columns = ['col1', ('col2', 'col3'), 'col4'] and rollup = [False, True, True], the rollup operator is used on the set (‘col2’, ‘col3’) and on ‘col4’.

having: str, optional

Expression used to filter the result.

Returns#

vDataFrame

object result of the grouping.

Examples#

For this example, we will use the following dataset:

import verticapy as vp

data = vp.vDataFrame(
    {
        "x": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
        "y": [1, 2, 1, 2, 1, 1, 2, 1],
        "z": [10, 12, 2, 1, 9, 8, 1, 3],
    }
)

You can perform grouping using a direct SQL statement.

data.groupby(
    columns = ["x"],
    expr = ["AVG(y) AS avg_y", "MIN(z) AS min_z"],
)
Abc
x
Varchar(1)
100%
...
123
avg_y
Float(22)
100%
123
min_z
Integer
100%
1B...1.333333333333331
2C...1.51
3A...1.333333333333332

Alternatively, you can achieve grouping using VerticaPy SQL functions, which offer a more Pythonic approach.

import verticapy.sql.functions as vpf

data.groupby(
    columns = ["x"],
    expr = [
        vpf.avg(data["y"])._as("avg_y"),
        vpf.min(data["z"])._as("min_z"),
    ],
)
Abc
x
Varchar(1)
100%
...
123
avg_y
Float(22)
100%
123
min_z
Integer
100%
1B...1.333333333333331
2C...1.51
3A...1.333333333333332

You can also perform rollup aggregations.

data.groupby(
    columns = ["x"],
    expr = [
        vpf.avg(data["y"])._as("avg_y"),
        vpf.min(data["z"])._as("min_z"),
    ],
    rollup = True,
)
Abc
x
Varchar(1)
75%
...
123
avg_y
Float(22)
100%
123
min_z
Integer
100%
1A...1.333333333333332
2[null]...1.3751
3B...1.333333333333331
4C...1.51

Note

All the calculations are pushed to the database.

Hint

For additional aggregation options, please refer to the aggregate method.

See also

vDataColumn.aggregate() : Aggregations for a specific column.
vDataFrame.aggregate() : Aggregates for particular columns.