verticapy.vDataFrame.groupby#

This method facilitates the aggregation of the vDataFrame by grouping its elements based on one or more specified criteria. Grouping is a critical operation in data analysis, as it allows us to segment data into subsets, making it easier to apply various aggregation functions and gain insights specific to each group.

The groupby method can be applied to one or more columns, and it is particularly valuable when we want to calculate aggregate statistics or perform operations within distinct categories or segments of our data. By grouping the elements, we can perform custom analyses, create summary statistics, or uncover patterns that might not be apparent when looking at the entire dataset as a whole. It is a foundational method in data analysis and is used extensively to explore and understand data dynamics in numerous domains.

Parameters#

columns: SQLColumns: List of the vDataColumns used to group the elements or a customized expression. If rollup is set to True, this can be a list of tuples.
expr: SQLExpression, optional: List of the different aggregations in pure SQL. Aliases can be used. For example, SUM(column) or AVG(column) AS my_new_alias are valid whereas AVG is invalid. Aliases are recommended to keep the track of the features and to prevent ambiguous names. For example, the MODE function does not exist, but can be replicated by using the analytic method and then grouping the result.
rollup: bool / list of bools, optional: If set to True, the rollup operator is used. If set to a list of bools, the rollup operator is used on the matching indexes and the length of rollup must match the length of columns. For example, for columns = ['col1', ('col2', 'col3'), 'col4'] and rollup = [False, True, True], the rollup operator is used on the set (‘col2’, ‘col3’) and on ‘col4’.
having: str, optional: Expression used to filter the result.

Returns#

vDataFrame: object result of the grouping.

Examples#

For this example, we will use the following dataset:

import verticapy as vp

data = vp.vDataFrame(
    {
        "x": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
        "y": [1, 2, 1, 2, 1, 1, 2, 1],
        "z": [10, 12, 2, 1, 9, 8, 1, 3],
    }
)

You can perform grouping using a direct SQL statement.

data.groupby(
    columns = ["x"],
    expr = ["AVG(y) AS avg_y", "MIN(z) AS min_z"],
)

	Abc x Varchar(1) 100%	...	123 avg_y Float(22) 100%	123 min_z Integer 100%
1	B	...	1.33333333333333	1
2	C	...	1.5	1
3	A	...	1.33333333333333	2

Alternatively, you can achieve grouping using VerticaPy SQL functions, which offer a more Pythonic approach.

import verticapy.sql.functions as vpf

data.groupby(
    columns = ["x"],
    expr = [
        vpf.avg(data["y"])._as("avg_y"),
        vpf.min(data["z"])._as("min_z"),
    ],
)

	Abc x Varchar(1) 100%	...	123 avg_y Float(22) 100%	123 min_z Integer 100%
1	B	...	1.33333333333333	1
2	C	...	1.5	1
3	A	...	1.33333333333333	2

You can also perform rollup aggregations.

data.groupby(
    columns = ["x"],
    expr = [
        vpf.avg(data["y"])._as("avg_y"),
        vpf.min(data["z"])._as("min_z"),
    ],
    rollup = True,
)

	Abc x Varchar(1) 75%	...	123 avg_y Float(22) 100%	123 min_z Integer 100%
1	A	...	1.33333333333333	2
2	[null]	...	1.375	1
3	B	...	1.33333333333333	1
4	C	...	1.5	1

Note

All the calculations are pushed to the database.

Hint

For additional aggregation options, please refer to the aggregate method.