
Outliers
Outliers are data points that differ significantly from the rest of the data. While some outliers can reveal important information (machine failure, system fraud...), they can also be simple errors.
Some machine learning algorithms are sensitive to outliers: outliers can heavily bias the data and ruin the final predictions, which makes handling them one of the most important parts of data preparation.
Outliers consist of three main types:
- Global Outliers: Values far outside the entirety of their source dataset
- Contextual Outliers: Values that deviate significantly from the rest of the data points in the same context
- Collective Outliers: Values that aren't global or contextual outliers, but that, as a collection, deviate significantly from the entire dataset
Global outliers are often the most critical type and can introduce a significant amount of bias into the data. Fortunately, we can easily identify these outliers by computing the Z-Score.
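For a value $x$ in a column with mean $\mu$ and standard deviation $\sigma$, the Z-Score is:

$$z = \frac{x - \mu}{\sigma}$$

The larger $|z|$ is, the further the value lies from the bulk of the distribution; the convention used below is to treat $|z| > 3$ as a global outlier.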
Let's look at some examples using the 'Heart Disease' dataset. This dataset contains information on patients who are likely to have heart-related complications.
import verticapy as vp
vdf = vp.read_csv("data/heart.csv")
display(vdf)
Let's focus on a patient's maximum heart rate (thalach) and cholesterol (chol) to identify some outliers.
%matplotlib inline
vdf.scatter(["thalach", "chol"])
We can see some outliers in the distribution: people with high cholesterol and others with a very low heart rate. Let's compute the global outliers using the 'outliers' method.
vdf.outliers(["thalach", "chol"], "global_outliers")
vdf.scatter(["thalach", "chol"], catcol = "global_outliers")
It is also possible to draw an outlier plot using the 'outliers_plot' method.
vdf.outliers_plot(["thalach", "chol"])
We've detected some global outliers in the distribution, and we can impute them with the 'fill_outliers' method.
Generally, you can identify global outliers with the Z-Score. We'll consider a data point with a Z-Score greater than 3 to be an outlier. Some less precise techniques treat the data points belonging to the first and last alpha-quantiles as outliers. You're free to choose either of these strategies when filling outliers.
vdf["thalach"].fill_outliers(use_threshold = True,
threshold = 3.0,
method = "winsorize")
vdf["chol"].fill_outliers(use_threshold = True,
threshold = 3.0,
method = "winsorize")
vdf.scatter(["thalach", "chol"], catcol = "global_outliers")
Other techniques, like DBSCAN or the local outlier factor (LOF), can be used to check other data points for outliers. DBSCAN labels points in low-density regions as noise (cluster -1), which we can use as an outlier flag.
from verticapy.learn.cluster import DBSCAN
model = DBSCAN("dbscan_heart", eps = 20, min_samples = 10)
model.fit("heart", ["thalach", "chol"])
model.plot()
vdf_tmp = model.predict()
vdf_tmp["outliers_dbscan"] = "(dbscan_cluster = -1)::int"
vdf_tmp.scatter(["thalach", "chol"], catcol = "outliers_dbscan")
While DBSCAN identifies outliers when computing the clusters, LOF computes an outlier score by comparing each point's local density to that of its neighbors. Generally, a LOF score greater than 1.5 indicates an outlier.
from verticapy.learn.neighbors import LocalOutlierFactor
model = LocalOutlierFactor("lof_heart")
model.fit("heart", ["thalach", "chol",])
model.plot()
lof_heart = model.predict()
lof_heart["outliers"] = "(CASE WHEN lof_score > 1.5 THEN 1 ELSE 0 END)"
lof_heart.scatter(["thalach", "chol"], catcol = "outliers")
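If you'd rather discard the flagged rows than impute them, you can filter them out of the vDataFrame. A minimal sketch, assuming 'filter' accepts a SQL boolean expression as in recent VerticaPy releases:
# Keep only the rows whose LOF score is at or below the 1.5 cutoff
lof_heart.filter("lof_score <= 1.5")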
We have many other techniques, like k-means clustering, for finding outliers, but the most important method is the Z-Score. After identifying outliers, we just have to decide how to handle them, for example by treating them as missing values and imputing them. We'll focus on missing values in the next lesson.