verticapy.vDataFrame.outliers

vDataFrame.outliers(columns: str | list[str] | None = None, name: str = 'distribution_outliers', threshold: float = 3.0, robust: bool = False) → vDataFrame

Adds a new vDataColumn labeled with 0 or 1, where 1 indicates that the record is a global outlier.

Parameters

columns: SQLColumns, optional

List of the vDataColumn names. If empty, all numerical vDataColumns are used.

name: str, optional

Name of the new vDataColumn.

threshold: float, optional

Critical score used as the cutoff. A record is labeled as an outlier when the absolute value of its score exceeds this threshold.

robust: bool, optional

If set to True, uses the Robust Z-Score instead of the Z-Score.
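The difference between the two scores can be sketched locally with NumPy. This is only an illustration of the standard definitions, not VerticaPy's implementation: the helper names (`z_scores`, `robust_z_scores`, `flag_outliers`) are hypothetical, and the use of the sample standard deviation and of the 1.4826 scaling constant for the median absolute deviation are assumptions.

```python
import numpy as np


def z_scores(values):
    """Classic Z-Score: distance from the mean in standard deviations."""
    values = np.asarray(values, dtype=float)
    # Assumption: sample standard deviation (ddof=1).
    return (values - values.mean()) / values.std(ddof=1)


def robust_z_scores(values):
    """Robust Z-Score: distance from the median in scaled MAD units."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    # 1.4826 makes the MAD comparable to a standard deviation for
    # normally distributed data. A MAD of zero would divide by zero;
    # this sketch does not handle that case.
    return (values - median) / (1.4826 * mad)


def flag_outliers(values, threshold=3.0, robust=False):
    """Return 1 where the absolute score exceeds the threshold, else 0."""
    scores = robust_z_scores(values) if robust else z_scores(values)
    return (np.abs(scores) > threshold).astype(int)


sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 1000.0]
print(flag_outliers(sample))
print(flag_outliers(sample, robust=True))
```

Because the mean and standard deviation are themselves pulled toward extreme values, the classic score of an outlier stays small in short samples, whereas the median-based score is barely affected by it.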

Returns

vDataFrame

self

Examples

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

Importing verticapy under an alias reduces the risk of naming collisions with other libraries. verticapy uses common function names such as “average” and “median”, and the alias makes it explicit which library's function is being called.

Let us create a vDataFrame that contains an outlier:

import numpy as np

# Ten draws from a standard normal distribution
data = np.random.normal(loc=0, scale=1, size=10)

# Append one extreme value to act as the outlier
data = np.append(data, [100])

vdf = vp.vDataFrame({"vals": data})
      vals (Numeric(22))
 1    -0.9692907406301432
 2     1.5361282477773475
 3    -0.845487667550313
 4     0.8760238645325606
 5    -2.1440816057858076
 6     0.6047497570245369
 7     0.11152711520356214
 8    -0.9210417560931947
 9     0.8408167674806595
10     1.1481774819371722
11     100.0

Now we can see which values are outliers by using the vDataFrame.outliers() method:

vdf.outliers()
      vals (Numeric(22))      distribution_outliers (Integer)
 1    -0.9692907406301432     0
 2     1.5361282477773475     0
 3    -0.845487667550313      0
 4     0.8760238645325606     0
 5    -2.1440816057858076     0
 6     0.6047497570245369     0
 7     0.11152711520356214    0
 8    -0.9210417560931947     0
 9     0.8408167674806595     0
10     1.1481774819371722     0
11     100.0                  1
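As a cross-check, the same labels can be reproduced outside the database from the values printed above. The snippet below is a minimal NumPy sketch, assuming the score is a plain Z-Score computed with the sample standard deviation; it does not call VerticaPy.

```python
import numpy as np

# Values copied from the "vals" column of the output above.
vals = np.array([
    -0.9692907406301432, 1.5361282477773475, -0.845487667550313,
    0.8760238645325606, -2.1440816057858076, 0.6047497570245369,
    0.11152711520356214, -0.9210417560931947, 0.8408167674806595,
    1.1481774819371722, 100.0,
])

# Assumption: sample standard deviation (ddof=1).
scores = (vals - vals.mean()) / vals.std(ddof=1)

# Same default threshold as in the signature above.
flags = (np.abs(scores) > 3.0).astype(int)

print(np.round(scores, 3))
print(flags)
```

Only the last record exceeds the default threshold of 3.0, and only narrowly: the appended value inflates both the mean and the standard deviation, so its own score ends up just above 3. In a sample this small, a slightly higher threshold would flag nothing.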

Note

This function can only identify global outliers in the distribution. For other types of outliers, it is recommended to create machine learning models.
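To illustrate what a global check misses, the sketch below builds two tight clusters with a single point halfway between them. The data and the nearest-neighbor gap check are hypothetical illustrations in NumPy, not VerticaPy features, and the Z-Score uses the sample standard deviation as an assumption.

```python
import numpy as np

# Two tight clusters around 0 and 100, plus one isolated point at 50.
low = np.linspace(-1.0, 1.0, 20)
high = np.linspace(99.0, 101.0, 20)
values = np.concatenate([low, high, [50.0]])

# Global check: Z-Score against the overall mean and standard deviation.
z = (values - values.mean()) / values.std(ddof=1)
global_flags = (np.abs(z) > 3.0).astype(int)

# Simple local check: distance from each value to its nearest neighbor.
sorted_vals = np.sort(values)
gaps = np.diff(sorted_vals)
nearest = np.minimum(np.r_[np.inf, gaps], np.r_[gaps, np.inf])
isolated_value = sorted_vals[np.argmax(nearest)]

print(global_flags.sum())  # number of records flagged globally
print(isolated_value)      # value farthest from any neighbor
```

The point at 50 sits exactly at the overall mean, so its Z-Score is close to zero and it is never flagged globally, even though it lies far from every other record.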

See also

vDataFrame.outliers_plot(): Plots the outliers.