verticapy.vDataColumn.drop_outliers
- vDataColumn.drop_outliers(threshold: int | float | Decimal = 4.0, use_threshold: bool = True, alpha: int | float | Decimal = 0.05) → vDataFrame
Drops outliers in the vDataColumn.
Parameters
- threshold: PythonNumber, optional
Uses the Gaussian distribution to identify outliers. After the data is normalized (Z-Score), a record whose absolute value is greater than the threshold is considered an outlier.
- use_threshold: bool, optional
If True, uses the threshold instead of the ‘alpha’ parameter.
- alpha: PythonNumber, optional
Number representing the outliers threshold. Values less than quantile(alpha) or greater than quantile(1-alpha) are dropped.
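As a rough illustration of the alpha-based rule, the pure-Python sketch below keeps only the values that lie between quantile(alpha) and quantile(1-alpha). It does not call VerticaPy: `quantile` and `alpha_filter` are hypothetical helpers, and the linear-interpolation quantile is an assumption, so the bounds computed in-database may differ slightly.

```python
def quantile(sorted_vals, q):
    """Quantile of an already sorted list using linear interpolation (assumed method)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def alpha_filter(values, alpha=0.05):
    """Keep values v with quantile(alpha) <= v <= quantile(1 - alpha)."""
    ordered = sorted(values)
    lower = quantile(ordered, alpha)
    upper = quantile(ordered, 1 - alpha)
    return [v for v in values if lower <= v <= upper]


vals = [20, 10, 0, -20, 10, 20, 1200]
kept_alpha = alpha_filter(vals, alpha=0.05)
print(kept_alpha)  # both tails are trimmed: -20 and 1200 are removed
```

Unlike the Z-Score rule, this rule trims both tails of the distribution, so on a small sample it can also remove the lowest value even when it is not extreme.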
Returns
- vDataFrame
self._parent
Examples
We import verticapy:
import verticapy as vp
Hint
By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.
For this example, we will use a dummy dataset that has one outlier:
vdf = vp.vDataFrame({"vals": [20, 10, 0, -20, 10, 20, 1200]})
     vals (Integer)
1    20
2    10
3    0
4    -20
5    10
6    20
7    1200
Note
VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets section, which provides detailed information on each dataset and how to use it effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.
Using drop_outliers, we can take out all the outliers in that column:
vdf["vals"].drop_outliers(threshold = 1.0)
Out[3]:
None    vals
1       20
2       10
3       0
4       -20
5       10
6       20
Rows: 6 | Column: vals | Type: integer
Note
By providing a custom threshold value, you can have more control over the treatment of outliers.
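To see why the threshold matters for this dataset, the pure-Python sketch below applies the Z-Score rule to the same values. It is only an illustration and does not call VerticaPy: `zscore_filter` is a hypothetical helper, and the use of the sample standard deviation is an assumption about how the normalization is computed.

```python
from statistics import mean, stdev


def zscore_filter(values, threshold=4.0):
    """Keep values whose absolute Z-Score does not exceed `threshold`."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (assumption)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs((v - mu) / sigma) <= threshold]


vals = [20, 10, 0, -20, 10, 20, 1200]

# The Z-Score of 1200 is roughly 2.27, so a threshold of 1.0 removes it...
kept_strict = zscore_filter(vals, threshold=1.0)
# ...while a threshold of 4.0 keeps every value.
kept_default = zscore_filter(vals, threshold=4.0)

print(kept_strict)
print(kept_default)
```

Under these assumptions, a threshold of 4.0 would leave this small dataset unchanged, which is why the example passes a lower value.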
See also
vDataColumn.drop(): Drops the input vDataColumn.
vDataFrame.drop_duplicates(): Drops the vDataFrame duplicates.