verticapy.vDataFrame.analytic#

Adds a new vDataColumn to the vDataFrame by using an advanced analytical function on one or two specific vDataColumns.

Warning

Some analytical functions can make the vDataFrame structure more resource intensive. It is best to check the structure of the vDataFrame with the current_relation method and save it with the to_db method, uisng the parameters inplace = True and relation_type = table.

Parameters#

func: str

Function to apply.

aad:
average absolute deviation
beta:
Beta Coefficient between 2 vDataColumns
count:
number of non-missing elements
corr:
Pearson’s correlation between 2 vDataColumns
cov:
covariance between 2 vDataColumns
dense_rank:
dense rank
ema:
exponential moving average
first_value:
first non null lead
iqr:
interquartile range
kurtosis:
kurtosis
jb:
Jarque-Bera index
lead:
next element
lag:
previous element
last_value:
first non null lag
mad:
median absolute deviation
max:
maximum
mean:
average
median :
median
min:
minimum
mode:
most occurent element
q%:
q quantile (ex: 50% for the median)
pct_change:
ratio between the current value and the previous one
percent_rank :
percent rank
prod:
product
range:
difference between the max and the min
rank:
rank
row_number:
row number
sem:
standard error of the mean
skewness:
skewness
sum:
sum
std:
standard deviation
unique:
cardinality (count distinct)
var:
variance

Other analytical functions could work if they are part of your DB version.

columns: SQLColumns, optional

Input vDataColumns. Must be a list of one or two elements.

by: SQLColumns, optional

vDataColumns used in the partition.

order_by: dict / list, optional

Either a list of the vDataColumns used to sort (in ascending order) the data, or a dictionary of vDataColumns and their sorting methods. For example, to sort by “column1” ASC and “column2” DESC, write: {"column1": "asc", "column2": "desc"}

name: str, optional

Name of the new vDataColumn. If empty, a default name based on the other parameters is generated.

offset: int, optional

Lead/Lag offset if parameter func is the function ‘lead’/’lag’.

x_smoothing: float, optional

The smoothing parameter of the ‘ema’ if the function is ‘ema’. It must be a float in the range [0;1].

add_count: bool, optional

If the func is set to mode and this parameter is True, a column with the mode number of occurences is added to the vDataFrame.

Returns#

vDataFrame: self

Examples#

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

Let us create a dummy dataset with negative values:

vdf = vp.vDataFrame(
    {
        "val" : [0.0, 10, 20],
        "cat": ['a', 'a', 'b'],
    },
)

	123 val Numeric(21) 100%	Abc cat Varchar(1) 100%
1	0.0	a
2	10.0	a
3	20.0	b

A max function can be conveniently applied using the analytic function. Below, we can find the maximum value by each category:

vdf.analytic(func = "max", columns = "val", by = "cat")

	123 val Numeric(21) 100%	...	Abc cat Varchar(1) 100%	123 max_val_by_cat Numeric(21) 100%
1	0.0	...	a	10.0
2	10.0	...	a	10.0
3	20.0	...	b	20.0

Note

While the same task can be accomplished using pure SQL (see below), adopting a Pythonic approach can offer greater convenience and help avoid potential syntax errors.

vdf["val_max"] = "MAX(val) OVER (PARTITION BY cat)"

Note

Aggregations such as mode can be challenging to compute using pure SQL. This function is designed to simplify the process.