verticapy.vDataFrame.describe

vDataFrame.describe(method: Literal['numerical', 'categorical', 'statistics', 'length', 'range', 'all', 'auto'] = 'auto', columns: str | list[str] | None = None, unique: bool = False, ncols_block: int = 20, processes: int = 1) → TableSample

Aggregates the vDataFrame using multiple statistical aggregations such as minimum (min), maximum (max), median, and cardinality (unique). The aggregations applied depend on the data type of each vDataColumn: numeric columns are aggregated with numerical aggregations (min, median, max…), while categorical columns are aggregated with categorical ones (cardinality, mode…). The method parameter selects which set of aggregations is computed.

Note

This function can be faster than the vDataFrame.aggregate() method because it uses specialized, optimized backend functions.

Parameters

method: str, optional

The set of aggregations to compute. One of:

all:
Aggregates all statistics for all vDataColumns. The exact method depends on the vDataColumn type (numerical dtype: numerical; timestamp dtype: range; categorical dtype: length).

auto:
Uses numerical if at least one vDataColumn of the vDataFrame is numerical; otherwise uses categorical.

categorical:
Uses only categorical aggregations.

length:
Applies numerical aggregations to the lengths of the values in all selected vDataColumns.

numerical:
Uses only numerical descriptive statistics, which are computed faster than with the aggregate method.

range:
Aggregates the vDataFrame using multiple statistical aggregations - min, max, range…

statistics:
Aggregates the vDataFrame using multiple statistical aggregations - kurtosis, skewness, min, max…
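The auto rule described above can be sketched in a few lines of plain Python. The helper resolve_auto_method and the category labels passed to it are hypothetical and are not part of VerticaPy; the sketch only restates the documented rule, assuming each column has already been classified as numerical or not.

```python
from typing import Iterable


def resolve_auto_method(column_categories: Iterable[str]) -> str:
    """Hypothetical helper (not a VerticaPy function).

    Returns "numerical" if at least one column is numerical,
    and "categorical" otherwise, as the 'auto' entry describes.
    """
    if any(category == "numerical" for category in column_categories):
        return "numerical"
    return "categorical"


# Three numerical columns and one text column: 'auto' resolves to numerical.
mixed = resolve_auto_method(["numerical", "numerical", "numerical", "categorical"])

# Only text columns: 'auto' resolves to categorical.
text_only = resolve_auto_method(["categorical"])

print(mixed, text_only)
```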

columns: SQLColumns, optional

List of vDataColumn names. If empty, the vDataColumns are selected according to the method parameter.

unique: bool, optional

If set to True, also computes the cardinality (number of distinct values) of each selected vDataColumn.

ncols_block: int, optional

Number of columns used per query. Setting this parameter divides what would otherwise be one large query into many smaller queries called “blocks”, whose size is determined by the ncols_block parameter.

processes: int, optional

Number of child processes to create. Combined with the ncols_block parameter, this lets you split a single large query into many smaller queries that run in parallel: each child process creates its own connection to the database and sends one query. This can improve query performance, but consumes more resources. If processes is set to 1, the queries are sent one after another from a single process.
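To illustrate how ncols_block partitions the work, here is a minimal sketch in plain Python. The helper split_into_blocks and the column names are hypothetical and are not part of VerticaPy; the sketch only shows how a column list could be divided into blocks of at most ncols_block columns, each block corresponding to one query.

```python
from typing import List


def split_into_blocks(columns: List[str], ncols_block: int) -> List[List[str]]:
    """Hypothetical helper (not a VerticaPy function).

    Splits a list of column names into consecutive blocks that each
    contain at most `ncols_block` names.
    """
    if ncols_block < 1:
        raise ValueError("ncols_block must be at least 1")
    return [
        columns[start : start + ncols_block]
        for start in range(0, len(columns), ncols_block)
    ]


# 45 illustrative column names with the default block size of 20.
columns = [f"col{i}" for i in range(45)]
blocks = split_into_blocks(columns, ncols_block=20)
block_sizes = [len(block) for block in blocks]

print(block_sizes)  # [20, 20, 5]
```

In this picture, processes=1 would send the three block queries one after another, while a larger value would let several child processes send them concurrently.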

Returns

TableSample: result.

Examples

For this example, we will use the following dataset:

import verticapy as vp

data = vp.vDataFrame(
    {
        "x": [1, 2, 4, 9, 10, 15, 20, 22],
        "y": [1, 2, 1, 2, 1, 1, 2, 1],
        "z": [10, 12, 2, 1, 9, 8, 1, 3],
        "c": ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'D'],
    }
)

The describe method supports several sets of aggregations, selected with the method parameter.

Setting method to "numerical" computes numerical aggregations.

data.describe(
    columns = ["x", "y", "z"],
    method = "numerical",
)

	...	approx_75%	max
"x"	...	16.25	22.0
"y"	...	2.0	2.0
"z"	...	9.25	12.0
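As a local cross-check of the two visible columns above, the same values can be recomputed in plain Python. This is only a sketch: describe runs in the database and reports approximate percentiles, so exact agreement is a property of this small dataset rather than a guarantee. Using statistics.quantiles with method="inclusive" is an assumption chosen because it reproduces the values shown.

```python
import statistics

# Same numeric columns as the example dataset.
data = {
    "x": [1, 2, 4, 9, 10, 15, 20, 22],
    "y": [1, 2, 1, 2, 1, 1, 2, 1],
    "z": [10, 12, 2, 1, 9, 8, 1, 3],
}

summary = {}
for name, values in data.items():
    # Quartiles with linear interpolation between the closest ranks.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    summary[name] = {"75%": q3, "max": float(max(values))}
    print(name, summary[name])
```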

Setting method to "categorical" computes categorical aggregations.

data.describe(
    columns = ["x", "y", "z", "c"],
    method = "categorical",
)

	...	top	top_percent
"x"	...	10	12.5
"y"	...	1	62.5
"z"	...	1	25.0
"c"	...	A	50.0
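The top and top_percent columns above can be reproduced locally with collections.Counter. This is a sketch for intuition only, not how VerticaPy computes them. Column "x" is left out because all of its values are distinct, so which value is reported as top is arbitrary.

```python
from collections import Counter

# Columns of the example dataset that have a unique most frequent value.
data = {
    "y": [1, 2, 1, 2, 1, 1, 2, 1],
    "z": [10, 12, 2, 1, 9, 8, 1, 3],
    "c": ["A", "A", "A", "A", "B", "B", "C", "D"],
}

top = {}
for name, values in data.items():
    # Most frequent value and its share of the rows, in percent.
    value, count = Counter(values).most_common(1)[0]
    top[name] = (value, 100.0 * count / len(values))
    print(name, top[name])
```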

Setting method to "all" computes both categorical and numerical aggregations.

data.describe(
    columns = ["x", "y", "z", "c"],
    method = "all",
)

	...	"z"	"c"
dtype	...	integer	varchar(1)
percent	...	100.0	100.0
count	...	8	8
top	...	1	A
top_percent	...	25.0	50.0
avg	...	5.75	1.0
stddev	...	4.46414285485707	0.0
min	...	1	1
approx_25%	...	1.75	1
approx_50%	...	5.5	1
approx_75%	...	9.25	1
max	...	12	1
range	...	11	0
empty	...	[null]	0
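For the categorical column "c", the numeric rows above (avg, stddev, min, max, range) are consistent with statistics taken over string lengths, as the description of the all method states. The sketch below recomputes them in plain Python; treating empty as the number of empty strings is an assumption made for illustration.

```python
import statistics

c = ["A", "A", "A", "A", "B", "B", "C", "D"]

# Every value is one character long, so all length statistics are trivial.
lengths = [len(value) for value in c]

length_stats = {
    "avg": statistics.mean(lengths),
    "stddev": statistics.stdev(lengths),  # sample standard deviation
    "min": min(lengths),
    "max": max(lengths),
    "range": max(lengths) - min(lengths),
    # Assumption: "empty" counts empty strings in the column.
    "empty": sum(1 for value in c if value == ""),
}

print(length_stats)
```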

Note

Other methods are available; their computational cost varies.

Note

All the calculations are pushed to the database.