verticapy.vDataFrame.outliers
- vDataFrame.outliers(columns: str | list[str] | None = None, name: str = 'distribution_outliers', threshold: float = 3.0, robust: bool = False) -> vDataFrame
Adds a new vDataColumn labeled with 0 or 1, where 1 indicates that the record is a global outlier.
Parameters
- columns: SQLColumns, optional
List of the vDataColumn names. If empty, all numerical vDataColumns are used.
- name: str, optional
Name of the new vDataColumn.
- threshold: float, optional
Threshold equal to the critical score. A record is labeled 1 when the absolute value of its score exceeds this threshold.
- robust: bool, optional
If set to True, uses the Robust Z-Score instead of the Z-Score.
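The two scores selected by robust can be sketched in plain NumPy. This is a minimal illustration under conventional definitions, not VerticaPy's in-database implementation: the helper name flag_outliers, the use of the sample standard deviation, and the 1.4826 scaling of the median absolute deviation (MAD) are all assumptions made for this sketch.

```python
import numpy as np


def flag_outliers(values, threshold=3.0, robust=False):
    """Hypothetical helper: label a value 1 if its score exceeds threshold."""
    x = np.asarray(values, dtype=float)
    if robust:
        # Robust Z-score: center on the median, scale by the MAD.
        center = np.median(x)
        mad = np.median(np.abs(x - center))
        scale = 1.4826 * mad  # assumed consistency constant for normal data
    else:
        # Classic Z-score: center on the mean, scale by the standard deviation.
        center = x.mean()
        scale = x.std(ddof=1)  # assumed: sample standard deviation
    if scale == 0:
        # No spread at all: nothing can be scored, so flag nothing.
        return np.zeros(x.shape, dtype=int)
    return (np.abs(x - center) / scale > threshold).astype(int)


# 18 well-behaved values plus one extreme value at the end.
inliers = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0] * 2
data = inliers + [100.0]

z_flags = flag_outliers(data)                    # classic Z-score
robust_flags = flag_outliers(data, robust=True)  # robust Z-score
```

With this data both calls flag only the final value. On small samples a single extreme value also inflates the standard deviation, which caps the classic score it can reach; the median-based score is less affected, which is one reason to consider robust=True.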
Returns
- vDataFrame
self
Examples
Let’s begin by importing VerticaPy.
import verticapy as vp
Hint
By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.
Let us create a vDataFrame that has some outliers:
import numpy as np
# Ten draws from a standard normal distribution.
data = np.random.normal(loc=0, scale=1, size=10)
# Append one extreme value to act as the outlier.
data = np.append(data, [100])
vdf = vp.vDataFrame({"vals": data})
      vals
      Numeric(22)
 1    -0.9692907406301432
 2    1.5361282477773475
 3    -0.845487667550313
 4    0.8760238645325606
 5    -2.1440816057858076
 6    0.6047497570245369
 7    0.11152711520356214
 8    -0.9210417560931947
 9    0.8408167674806595
10    1.1481774819371722
11    100.0
Now we can see which values are outliers by using the vDataFrame.outliers() method:
vdf.outliers()
      vals                    distribution_outliers
      Numeric(22)             Integer
 1    -0.9692907406301432     0
 2    1.5361282477773475      0
 3    -0.845487667550313      0
 4    0.8760238645325606      0
 5    -2.1440816057858076     0
 6    0.6047497570245369      0
 7    0.11152711520356214     0
 8    -0.9210417560931947     0
 9    0.8408167674806595      0
10    1.1481774819371722      0
11    100.0                   1
Note
This function can only identify global outliers in the distribution. For other types of outliers, it is recommended to create machine learning models.
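The limitation described in this note can be illustrated with plain NumPy. The sketch below assumes a conventional sample Z-score rather than VerticaPy's own computation, and the data is made up for illustration: two tight clusters with one isolated point between them. That point is unusual relative to its neighbors, yet it sits at the overall mean, so a global score does not flag it.

```python
import numpy as np

# Two tight clusters around 0 and 100, plus one isolated point at 50.
low = np.linspace(-1.0, 1.0, 10)
high = np.linspace(99.0, 101.0, 10)
values = np.concatenate([low, high, [50.0]])

# Conventional Z-score against the global mean and sample standard deviation.
z = np.abs(values - values.mean()) / values.std(ddof=1)
flags = (z > 3.0).astype(int)

# The isolated point has a score near zero because it equals the global mean,
# and no value in either cluster comes close to the threshold.
print("score of isolated point:", z[-1])
print("number of flagged values:", int(flags.sum()))
```

Detecting a point like this requires a method that looks at local structure, which is the kind of case the note refers to machine learning models.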
See also
vDataFrame.outliers_plot(): Plots the outliers.