verticapy.vDataFrame.analytic#
- vDataFrame.analytic(func: str, columns: str | list[str] | None = None, by: str | list[str] | None = None, order_by: None | str | list[str] | dict = None, name: str | None = None, offset: int = 1, x_smoothing: float = 0.5, add_count: bool = True) vDataFrame #
Adds a new vDataColumn to the vDataFrame by using an advanced analytical function on one or two specific vDataColumns.
Warning
Some analytical functions can make the vDataFrame structure more resource intensive. It is best to check the structure of the vDataFrame with the
current_relation
method and save it with theto_db
method, uisng the parametersinplace = True
andrelation_type = table
.Parameters#
- func: str
Function to apply.
- aad:
average absolute deviation
- beta:
Beta Coefficient between 2 vDataColumns
- count:
number of non-missing elements
- corr:
Pearson’s correlation between 2 vDataColumns
- cov:
covariance between 2 vDataColumns
- dense_rank:
dense rank
- ema:
exponential moving average
- first_value:
first non null lead
- iqr:
interquartile range
- kurtosis:
kurtosis
- jb:
Jarque-Bera index
- lead:
next element
- lag:
previous element
- last_value:
first non null lag
- mad:
median absolute deviation
- max:
maximum
- mean:
average
- median :
median
- min:
minimum
- mode:
most occurent element
- q%:
q quantile (ex: 50% for the median)
- pct_change:
ratio between the current value and the previous one
- percent_rank :
percent rank
- prod:
product
- range:
difference between the max and the min
- rank:
rank
- row_number:
row number
- sem:
standard error of the mean
- skewness:
skewness
- sum:
sum
- std:
standard deviation
- unique:
cardinality (count distinct)
- var:
variance
Other analytical functions could work if they are part of your DB version.
- columns: SQLColumns, optional
Input vDataColumns. Must be a list of one or two elements.
- by: SQLColumns, optional
vDataColumns used in the partition.
- order_by: dict / list, optional
Either a list of the vDataColumns used to sort (in ascending order) the data, or a dictionary of vDataColumns and their sorting methods. For example, to sort by “column1” ASC and “column2” DESC, write:
{"column1": "asc", "column2": "desc"}
- name: str, optional
Name of the new vDataColumn. If empty, a default name based on the other parameters is generated.
- offset: int, optional
Lead/Lag offset if parameter
func
is the function ‘lead’/’lag’.- x_smoothing: float, optional
The smoothing parameter of the ‘ema’ if the function is ‘ema’. It must be a float in the range [0;1].
- add_count: bool, optional
If the
func
is set tomode
and this parameter is True, a column with the mode number of occurences is added to the vDataFrame.
Returns#
- vDataFrame
self
Examples#
Let’s begin by importing VerticaPy.
import verticapy as vp
Hint
By assigning an alias to
verticapy
, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions fromverticapy
are used as intended without interfering with functions from other libraries.Let us create a dummy dataset with negative values:
vdf = vp.vDataFrame( { "val" : [0.0, 10, 20], "cat": ['a', 'a', 'b'], }, )
123valNumeric(21)100%AbccatVarchar(1)100%1 0.0 a 2 10.0 a 3 20.0 b A
max
function can be conveniently applied using theanalytic
function. Below, we can find the maximum value by each category:vdf.analytic(func = "max", columns = "val", by = "cat")
123valNumeric(21)100%... AbccatVarchar(1)100%123max_val_by_catNumeric(21)100%1 0.0 ... a 10.0 2 10.0 ... a 10.0 3 20.0 ... b 20.0 Note
While the same task can be accomplished using pure SQL (see below), adopting a Pythonic approach can offer greater convenience and help avoid potential syntax errors.
vdf["val_max"] = "MAX(val) OVER (PARTITION BY cat)"
Note
Aggregations such as
mode
can be challenging to compute using pure SQL. This function is designed to simplify the process.See also