verticapy.vDataFrame.describe
- vDataFrame.describe(method: Literal['numerical', 'categorical', 'statistics', 'length', 'range', 'all', 'auto'] = 'auto', columns: str | list[str] | None = None, unique: bool = False, ncols_block: int = 20, processes: int = 1) -> TableSample
This function aggregates the vDataFrame using multiple statistical aggregations such as minimum (min), maximum (max), median, and cardinality (unique). The aggregations applied depend on the data types of the vDataColumns: numerical columns are aggregated with numerical aggregations (min, median, max…), while categorical columns are aggregated with categorical ones (cardinality, mode…). The method parameter controls which set of aggregations is computed, so the summary can be tailored to a specific analysis.
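The dtype-dependent behaviour described above can be pictured with a small pure-Python sketch. It does not call VerticaPy and is not the library's implementation: the function names and the kind labels ("numerical", "timestamp", "categorical") are hypothetical, and the rules encoded are only the ones this page documents for method = "auto" and method = "all".

```python
# Illustrative only: these helpers are hypothetical and not part of VerticaPy.

# Per-kind method that this page documents for method="all".
ALL_METHOD_BY_KIND = {
    "numerical": "numerical",
    "timestamp": "range",
    "categorical": "length",
}


def resolve_auto_method(column_kinds):
    """Documented rule for method="auto": 'numerical' if at least one
    column is numerical, 'categorical' otherwise."""
    if "numerical" in column_kinds.values():
        return "numerical"
    return "categorical"


def plan_all(column_kinds):
    """Map every column to the method documented for method="all"."""
    return {name: ALL_METHOD_BY_KIND[kind] for name, kind in column_kinds.items()}


# Column kinds matching the example dataset used later on this page.
kinds = {"x": "numerical", "y": "numerical", "z": "numerical", "c": "categorical"}
auto_choice = resolve_auto_method(kinds)
all_plan = plan_all(kinds)
```

With these inputs, auto_choice is "numerical" because the frame has at least one numerical column, and all_plan assigns the length method to the varchar column "c".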
Note
This function can offer faster performance than the vDataFrame.aggregate() method, as it leverages specialized and optimized backend functions.
Parameters
- method: str, optional
The describe method.
- all:
Aggregates all statistics for all vDataColumns. The exact method depends on the vDataColumn type (numerical dtype: numerical; timestamp dtype: range; categorical dtype: length).
- auto:
Sets the method to numerical if at least one vDataColumn of the vDataFrame is numerical, categorical otherwise.
- categorical:
Uses only categorical aggregations.
- length:
Aggregates the vDataFrame using numerical aggregation on the length of all selected vDataColumns.
- numerical:
Uses only numerical descriptive statistics, which are computed faster than with the aggregate method.
- range:
Aggregates the vDataFrame using multiple statistical aggregations - min, max, range…
- statistics:
Aggregates the vDataFrame using multiple statistical aggregations - kurtosis, skewness, min, max…
- columns: SQLColumns, optional
List of vDataColumn names. If empty, the vDataColumns are selected depending on the parameter method.
- unique: bool, optional
If set to True, computes the cardinality of each vDataColumn.
- ncols_block: int, optional
Number of columns used per query. Setting this parameter divides what would otherwise be one large query into many smaller queries called “blocks”, whose size is determined by the ncols_block parameter.
- processes: int, optional
Number of child processes to create. Setting this with the ncols_block parameter lets you parallelize a single query into many smaller queries, where each child process creates its own connection to the database and sends one query. This can improve query performance, but consumes more resources. If processes is set to 1, the queries are sent iteratively from a single process.
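The interaction between ncols_block and processes can be sketched in plain Python. This is only an illustration under stated assumptions, not VerticaPy's implementation: split_into_blocks, describe_block and describe_in_blocks are hypothetical names, each "query" is simulated by a local min/max computation, and threads stand in for the child processes and database connections described above.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(columns, ncols_block):
    """Group column names into consecutive blocks of at most ncols_block names."""
    if ncols_block < 1:
        raise ValueError("ncols_block must be a positive integer")
    return [columns[i:i + ncols_block] for i in range(0, len(columns), ncols_block)]


def describe_block(data, block):
    """Stand-in for one query: (min, max) of every column in the block."""
    return {name: (min(data[name]), max(data[name])) for name in block}


def describe_in_blocks(data, ncols_block=20, processes=1):
    """Run one simulated query per block, then merge the partial results."""
    blocks = split_into_blocks(list(data), ncols_block)
    if processes == 1:
        # Blocks are handled one after another from a single worker.
        partial = [describe_block(data, block) for block in blocks]
    else:
        # Assumption: threads replace the child processes of the real method.
        with ThreadPoolExecutor(max_workers=processes) as pool:
            partial = list(pool.map(lambda block: describe_block(data, block), blocks))
    merged = {}
    for part in partial:
        merged.update(part)
    return merged


data = {
    "x": [1, 2, 4, 9, 10, 15, 20, 22],
    "y": [1, 2, 1, 2, 1, 1, 2, 1],
    "z": [10, 12, 2, 1, 9, 8, 1, 3],
}
sequential = describe_in_blocks(data, ncols_block=20, processes=1)
parallel = describe_in_blocks(data, ncols_block=2, processes=2)
```

Both calls return the same merged result. Only the number of simulated queries changes: one block in the first call, two blocks in the second.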
Returns
- TableSample
result.
Examples
For this example, we will use the following dataset:
import verticapy as vp

data = vp.vDataFrame(
    {
        "x": [1, 2, 4, 9, 10, 15, 20, 22],
        "y": [1, 2, 1, 2, 1, 1, 2, 1],
        "z": [10, 12, 2, 1, 9, 8, 1, 3],
        "c": ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'D'],
    }
)
The describe method supports several aggregation methods, selected with the method parameter.
Setting method to numerical computes numerical aggregations.
data.describe(
    columns = ["x", "y", "z"],
    method = "numerical",
)
      ...  approx_75%   max
"x"   ...       16.25  22.0
"y"   ...         2.0   2.0
"z"   ...        9.25  12.0
Setting method to categorical computes categorical aggregations.
data.describe(
    columns = ["x", "y", "z", "c"],
    method = "categorical",
)
      ...  top  top_percent
"x"   ...   10         12.5
"y"   ...    1         62.5
"z"   ...    1         25.0
"c"   ...    A         50.0
Setting method to all computes both categorical and numerical aggregations.
data.describe(
    columns = ["x", "y", "z", "c"],
    method = "all",
)
              ...  "z"               "c"
dtype         ...  integer           varchar(1)
percent       ...  100.0             100.0
count         ...  8                 8
top           ...  1                 A
top_percent   ...  25.0              50.0
avg           ...  5.75              1.0
stddev        ...  4.46414285485707  0.0
min           ...  1                 1
approx_25%    ...  1.75              1
approx_50%    ...  5.5               1
approx_75%    ...  9.25              1
max           ...  12                1
range         ...  11                0
empty         ...  [null]            0
Note
Many other methods are available, and their computational cost can vary.
Note
All the calculations are pushed to the database.
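As a sanity check, the figures shown in the numerical and categorical outputs above can be reproduced locally with the Python standard library. The sketch below does not use VerticaPy, and the helper names are hypothetical. It assumes linear interpolation between sorted values for the third quartile. On these eight rows that agrees with the approx_75% column, although the database may use an approximation on larger data. Column "x" is left out of the categorical check because all of its values are distinct, so its top value is an arbitrary pick among ties.

```python
from collections import Counter
from statistics import quantiles

data = {
    "x": [1, 2, 4, 9, 10, 15, 20, 22],
    "y": [1, 2, 1, 2, 1, 1, 2, 1],
    "z": [10, 12, 2, 1, 9, 8, 1, 3],
    "c": ["A", "A", "A", "A", "B", "B", "C", "D"],
}


def third_quartile(values):
    """75th percentile using linear interpolation between sorted values."""
    return quantiles(values, n=4, method="inclusive")[2]


def top_and_percent(values):
    """Most frequent value and its share of the rows, in percent."""
    value, count = Counter(values).most_common(1)[0]
    return value, 100.0 * count / len(values)


# (approx_75%, max) for the numerical columns.
numerical = {
    name: (third_quartile(data[name]), float(max(data[name])))
    for name in ("x", "y", "z")
}

# (top, top_percent) for the columns with an unambiguous mode.
categorical = {name: top_and_percent(data[name]) for name in ("y", "z", "c")}
```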
See also
vDataColumn.aggregate(): Aggregations for a specific column.
vDataFrame.aggregate(): Aggregations for specific columns.
vDataColumn.describe(): Summarizes the information within the column.