Loading...

verticapy.vDataFrame.analytic#

vDataFrame.analytic(func: str, columns: str | list[str] | None = None, by: str | list[str] | None = None, order_by: None | str | list[str] | dict = None, name: str | None = None, offset: int = 1, x_smoothing: float = 0.5, add_count: bool = True) vDataFrame#

Adds a new vDataColumn to the vDataFrame by using an advanced analytical function on one or two specific vDataColumns.

Warning

Some analytical functions can make the vDataFrame structure more resource intensive. It is best to check the structure of the vDataFrame with the current_relation method and save it with the to_db method, uisng the parameters inplace = True and relation_type = table.

Parameters#

func: str

Function to apply.

  • aad:

    average absolute deviation

  • beta:

    Beta Coefficient between 2 vDataColumns

  • count:

    number of non-missing elements

  • corr:

    Pearson’s correlation between 2 vDataColumns

  • cov:

    covariance between 2 vDataColumns

  • dense_rank:

    dense rank

  • ema:

    exponential moving average

  • first_value:

    first non null lead

  • iqr:

    interquartile range

  • kurtosis:

    kurtosis

  • jb:

    Jarque-Bera index

  • lead:

    next element

  • lag:

    previous element

  • last_value:

    first non null lag

  • mad:

    median absolute deviation

  • max:

    maximum

  • mean:

    average

  • median :

    median

  • min:

    minimum

  • mode:

    most occurent element

  • q%:

    q quantile (ex: 50% for the median)

  • pct_change:

    ratio between the current value and the previous one

  • percent_rank :

    percent rank

  • prod:

    product

  • range:

    difference between the max and the min

  • rank:

    rank

  • row_number:

    row number

  • sem:

    standard error of the mean

  • skewness:

    skewness

  • sum:

    sum

  • std:

    standard deviation

  • unique:

    cardinality (count distinct)

  • var:

    variance

Other analytical functions could work if they are part of your DB version.

columns: SQLColumns, optional

Input vDataColumns. Must be a list of one or two elements.

by: SQLColumns, optional

vDataColumns used in the partition.

order_by: dict / list, optional

Either a list of the vDataColumns used to sort (in ascending order) the data, or a dictionary of vDataColumns and their sorting methods. For example, to sort by “column1” ASC and “column2” DESC, write: {"column1": "asc", "column2": "desc"}

name: str, optional

Name of the new vDataColumn. If empty, a default name based on the other parameters is generated.

offset: int, optional

Lead/Lag offset if parameter func is the function ‘lead’/’lag’.

x_smoothing: float, optional

The smoothing parameter of the ‘ema’ if the function is ‘ema’. It must be a float in the range [0;1].

add_count: bool, optional

If the func is set to mode and this parameter is True, a column with the mode number of occurences is added to the vDataFrame.

Returns#

vDataFrame

self

Examples#

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

Let us create a dummy dataset with negative values:

vdf = vp.vDataFrame(
    {
        "val" : [0.0, 10, 20],
        "cat": ['a', 'a', 'b'],
    },
)

123
val
Numeric(21)
100%
Abc
cat
Varchar(1)
100%
10.0a
210.0a
320.0b

A max function can be conveniently applied using the analytic function. Below, we can find the maximum value by each category:

vdf.analytic(func = "max", columns = "val", by = "cat")
123
val
Numeric(21)
100%
...
Abc
cat
Varchar(1)
100%
123
max_val_by_cat
Numeric(21)
100%
10.0...a10.0
210.0...a10.0
320.0...b20.0

Note

While the same task can be accomplished using pure SQL (see below), adopting a Pythonic approach can offer greater convenience and help avoid potential syntax errors.

vdf["val_max"] = "MAX(val) OVER (PARTITION BY cat)"

Note

Aggregations such as mode can be challenging to compute using pure SQL. This function is designed to simplify the process.

See also

vDataFrame.apply() : Applies each function of the dictionary to the input vDataColumn.
vDataColumn.apply_fun() : Applies a default function to the vDataColumn.