verticapy.vDataColumn.fill_outliers

vDataColumn.fill_outliers(method: Literal['winsorize', 'null', 'mean'] = 'winsorize', threshold: int | float | Decimal = 4.0, use_threshold: bool = True, alpha: int | float | Decimal = 0.05) -> vDataFrame

Fills the vDataColumn's outliers using the input method.

Parameters

method: str, optional

Method used to fill the vDataColumn outliers.

  • mean:

    Replaces the upper and lower outliers with their respective averages.

  • null:

    Replaces the outliers with NULL.

  • winsorize:

    If ‘use_threshold’ is set to False, clips the vDataColumn using quantile(alpha) as the lower bound and quantile(1-alpha) as the upper bound; otherwise, clips it at the lower and upper bounds derived from the Z-score threshold.
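The three methods can be illustrated with a small pure-Python sketch. This is not VerticaPy's implementation: the helper name `fill_outliers_sketch` and the explicit `lower`/`upper` bounds are assumptions made for illustration, and "respective average" is read here as the mean of the lower outliers and the mean of the upper outliers.

```python
def fill_outliers_sketch(values, lower, upper, method="winsorize"):
    """Hypothetical helper: fill values outside [lower, upper] in a plain list."""
    low = [v for v in values if v < lower]    # lower outliers
    high = [v for v in values if v > upper]   # upper outliers
    if method == "winsorize":
        # Clip every value to the [lower, upper] interval.
        return [min(max(v, lower), upper) for v in values]
    if method == "null":
        # Replace every outlier with None (standing in for SQL NULL).
        return [None if (v < lower or v > upper) else v for v in values]
    if method == "mean":
        # Replace each group of outliers with that group's own average.
        low_avg = sum(low) / len(low) if low else None
        high_avg = sum(high) / len(high) if high else None
        return [low_avg if v < lower else high_avg if v > upper else v
                for v in values]
    raise ValueError(f"unknown method: {method!r}")


data = [-100, -50, 0, 10, 20, 500, 700]
winsorized = fill_outliers_sketch(data, lower=-10, upper=100, method="winsorize")
nulled = fill_outliers_sketch(data, lower=-10, upper=100, method="null")
averaged = fill_outliers_sketch(data, lower=-10, upper=100, method="mean")
```

In this sketch the bounds are passed in directly; in fill_outliers they come from either the Z-score threshold or the alpha quantiles, depending on ‘use_threshold’.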

threshold: PythonNumber, optional

Uses the Gaussian distribution to define the outliers. After the data is normalized (Z-score), any record whose absolute value is greater than the threshold is considered an outlier.
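The Z-score rule can be sketched in plain Python as follows. The helper `zscore_bounds` is hypothetical, and the use of the sample standard deviation is an assumption; the database may compute the statistics differently.

```python
from statistics import mean, stdev


def zscore_bounds(values, threshold=4.0):
    """Hypothetical helper: bounds outside which |Z-score| > threshold."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation
    return mu - threshold * sigma, mu + threshold * sigma


vals = [20, 10, 0, -20, 10, 20, 1200]

# With threshold=1, only the extreme value falls outside the bounds.
lower, upper = zscore_bounds(vals, threshold=1)
outliers = [v for v in vals if v < lower or v > upper]

# With the default threshold of 4, the bounds are wide enough to keep everything.
default_lower, default_upper = zscore_bounds(vals)
default_outliers = [v for v in vals if v < default_lower or v > default_upper]
```

On a sample this small, a single extreme value inflates the standard deviation, so a lower threshold may be needed before it is flagged.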

use_threshold: bool, optional

If set to True, uses the threshold instead of the ‘alpha’ parameter.

alpha: PythonNumber, optional

Number representing the outlier threshold. Values less than quantile(alpha) or greater than quantile(1-alpha) are filled. Used only when ‘use_threshold’ is set to False.
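The quantile-based bounds can be sketched in plain Python as follows. The helpers `quantile` and `quantile_bounds` are hypothetical, and linear interpolation between neighbouring values is an assumption; the database's quantile computation may return slightly different bounds.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def quantile_bounds(values, alpha=0.05):
    """Hypothetical helper: (quantile(alpha), quantile(1 - alpha))."""
    ordered = sorted(values)
    return quantile(ordered, alpha), quantile(ordered, 1 - alpha)


vals = [20, 10, 0, -20, 10, 20, 1200]
lower, upper = quantile_bounds(vals, alpha=0.05)

# Winsorize: clip every value to the quantile bounds.
clipped = [min(max(v, lower), upper) for v in vals]
```

Unlike the Z-score rule, this approach always treats the most extreme values on both sides as candidates for filling, which is why the lowest value is also clipped here.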

Returns

vDataFrame

self._parent

Examples

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will use a dummy dataset that has one outlier:

vdf = vp.vDataFrame({"vals": [20, 10, 0, -20, 10, 20, 1200]})

     vals
     Integer, 100%
1    20
2    10
3    0
4    -20
5    10
6    20
7    1200

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets section, which provides detailed information on each dataset and how to use it effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.

We can see that the data contains one extreme value (1200). To remove it, we can use the fill_outliers method.

vdf["vals"].fill_outliers(method = "null", threshold = 1)

     vals
     Integer, 85%
1    20
2    10
3    0
4    -20
5    10
6    20
7    [null]

Note

We can use either the alpha parameter or the Z-score threshold parameter. By default, the threshold is used.

See also

vDataFrame.fillna() : Fill the missing values using the input method.
vDataColumn.fill_outliers() : Fill the outliers using the input method.