verticapy.vDataFrame.groupby#
- vDataFrame.groupby(columns: str | list[str], expr: str | list[str] | StringSQL | list[StringSQL] | None = None, rollup: bool | list[bool] = False, having: str | None = None) vDataFrame #
This method facilitates the aggregation of the vDataFrame by grouping its elements based on one or more specified criteria. Grouping is a critical operation in data analysis, as it allows us to segment data into subsets, making it easier to apply various aggregation functions and gain insights specific to each group.
The
groupby
method can be applied to one or more columns, and it is particularly valuable when we want to calculate aggregate statistics or perform operations within distinct categories or segments of our data. By grouping the elements, we can perform custom analyses, create summary statistics, or uncover patterns that might not be apparent when looking at the entire dataset as a whole. It is a foundational method in data analysis and is used extensively to explore and understand data dynamics in numerous domains.Parameters#
- columns: SQLColumns
List of the vDataColumns used to group the elements or a customized expression. If rollup is set to True, this can be a list of tuples.
- expr: SQLExpression, optional
List of the different aggregations in pure SQL. Aliases can be used. For example,
SUM(column)
orAVG(column) AS my_new_alias
are valid whereasAVG
is invalid. Aliases are recommended to keep the track of the features and to prevent ambiguous names. For example, the MODE function does not exist, but can be replicated by using theanalytic
method and then grouping the result.- rollup: bool / list of bools, optional
If set to True, the rollup operator is used. If set to a list of bools, the rollup operator is used on the matching indexes and the length of
rollup
must match the length ofcolumns
. For example, forcolumns = ['col1', ('col2', 'col3'), 'col4']
androllup = [False, True, True]
, the rollup operator is used on the set (‘col2’, ‘col3’) and on ‘col4’.- having: str, optional
Expression used to filter the result.
Returns#
- vDataFrame
object result of the grouping.
Examples#
For this example, we will use the following dataset:
import verticapy as vp data = vp.vDataFrame( { "x": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], "y": [1, 2, 1, 2, 1, 1, 2, 1], "z": [10, 12, 2, 1, 9, 8, 1, 3], } )
You can perform grouping using a direct SQL statement.
data.groupby( columns = ["x"], expr = ["AVG(y) AS avg_y", "MIN(z) AS min_z"], )
AbcxVarchar(1)100%... 123avg_yFloat(22)100%123min_zInteger100%1 B ... 1.33333333333333 1 2 C ... 1.5 1 3 A ... 1.33333333333333 2 Alternatively, you can achieve grouping using VerticaPy SQL functions, which offer a more Pythonic approach.
import verticapy.sql.functions as vpf data.groupby( columns = ["x"], expr = [ vpf.avg(data["y"])._as("avg_y"), vpf.min(data["z"])._as("min_z"), ], )
AbcxVarchar(1)100%... 123avg_yFloat(22)100%123min_zInteger100%1 B ... 1.33333333333333 1 2 C ... 1.5 1 3 A ... 1.33333333333333 2 You can also perform rollup aggregations.
data.groupby( columns = ["x"], expr = [ vpf.avg(data["y"])._as("avg_y"), vpf.min(data["z"])._as("min_z"), ], rollup = True, )
AbcxVarchar(1)75%... 123avg_yFloat(22)100%123min_zInteger100%1 A ... 1.33333333333333 2 2 [null] ... 1.375 1 3 B ... 1.33333333333333 1 4 C ... 1.5 1 Note
All the calculations are pushed to the database.
Hint
For additional aggregation options, please refer to the
aggregate
method.See also
vDataColumn.
aggregate()
: Aggregations for a specific column.vDataFrame.
aggregate()
: Aggregates for particular columns.