VerticaPy

Python API for Vertica Data Science at Scale

Outliers

Outliers are data points that differ significantly from the rest of the data. While some outliers reveal important information (machine failure, system fraud, and so on), others are simply errors.

Some machine learning algorithms are sensitive to outliers, which can badly degrade the final predictions because of the bias they add to the data. Handling outliers is therefore one of the most important parts of data preparation.

Outliers fall into three main types:

  • Global Outliers: Values that lie far outside the entirety of their source dataset
  • Contextual Outliers: Values that deviate significantly from the rest of the data points in the same context
  • Collective Outliers: Values that aren't global or contextual outliers, but as a collection deviate significantly from the entire dataset

Global outliers are often the most critical type and can add a significant amount of bias into the data. Fortunately, we can easily identify these outliers by computing the Z-Score.
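
To make the Z-Score idea concrete, here is a minimal sketch in plain Python, independent of VerticaPy. The function names, the sample values, and the choice of the population standard deviation are assumptions made for illustration; they are not taken from the library.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the Z-Score of every value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation, non-zero
    return [(x - mu) / sigma for x in values]

def global_outliers(values, threshold=3.0):
    """Flag the values whose absolute Z-Score exceeds the threshold."""
    return [abs(z) > threshold for z in z_scores(values)]

# Hypothetical cholesterol-like readings with one extreme value.
chol = [200 + i for i in range(20)] + [600]
flags = global_outliers(chol)
print([x for x, is_outlier in zip(chol, flags) if is_outlier])  # prints [600]
```

The absolute value matters: a reading far below the mean is as suspicious as one far above it.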

Let's look at some examples using the 'Heart Disease' dataset. This dataset contains information on patients who are likely to have heart-related complications.

In [5]:
import verticapy as vp
vdf = vp.read_csv("data/heart.csv")
display(vdf)
Column types: every column is Int except oldpeak, which is Numeric(5,2).

  #  age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
  1   29    1   1       130   204    0        0      202      0      0.0      2   0     2       1
  2   29    1   1       130   204    0        0      202      0      0.0      2   0     2       1
  3   29    1   1       130   204    0        0      202      0      0.0      2   0     2       1
  4   29    1   1       130   204    0        0      202      0      0.0      2   0     2       1
  5   34    0   1       118   210    0        1      192      0      0.7      2   0     2       1
  6   34    0   1       118   210    0        1      192      0      0.7      2   0     2       1
  7   34    0   1       118   210    0        1      192      0      0.7      2   0     2       1
  8   34    1   3       118   182    0        0      174      0      0.0      2   0     2       1
  9   34    1   3       118   182    0        0      174      0      0.0      2   0     2       1
 10   34    1   3       118   182    0        0      174      0      0.0      2   0     2       1
 11   35    0   0       138   183    0        1      182      0      1.4      2   0     2       1
 12   35    0   0       138   183    0        1      182      0      1.4      2   0     2       1
 13   35    0   0       138   183    0        1      182      0      1.4      2   0     2       1
 14   35    0   0       138   183    0        1      182      0      1.4      2   0     2       1
 15   35    1   0       120   198    0        1      130      1      1.6      1   0     3       0
 16   35    1   0       120   198    0        1      130      1      1.6      1   0     3       0
 17   35    1   0       120   198    0        1      130      1      1.6      1   0     3       0
 18   35    1   0       120   198    0        1      130      1      1.6      1   0     3       0
 19   35    1   0       126   282    0        0      156      1      0.0      2   0     3       0
 20   35    1   0       126   282    0        0      156      1      0.0      2   0     3       0
 21   35    1   0       126   282    0        0      156      1      0.0      2   0     3       0
 22   35    1   1       122   192    0        1      174      0      0.0      2   0     2       1
 23   35    1   1       122   192    0        1      174      0      0.0      2   0     2       1
 24   35    1   1       122   192    0        1      174      0      0.0      2   0     2       1
 25   35    1   1       122   192    0        1      174      0      0.0      2   0     2       1
 26   37    0   2       120   215    0        1      170      0      0.0      2   0     2       1
 27   37    0   2       120   215    0        1      170      0      0.0      2   0     2       1
 28   37    0   2       120   215    0        1      170      0      0.0      2   0     2       1
 29   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
 30   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
 31   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
 32   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 33   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 34   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 35   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 36   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 37   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 38   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 39   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 40   38    1   3       120   231    0        1      182      1      3.8      1   0     3       0
 41   38    1   3       120   231    0        1      182      1      3.8      1   0     3       0
 42   38    1   3       120   231    0        1      182      1      3.8      1   0     3       0
 43   38    1   3       120   231    0        1      182      1      3.8      1   0     3       0
 44   39    0   2        94   199    0        1      179      0      0.0      2   0     2       1
 45   39    0   2        94   199    0        1      179      0      0.0      2   0     2       1
 46   39    0   2        94   199    0        1      179      0      0.0      2   0     2       1
 47   39    0   2       138   220    0        1      152      0      0.0      1   0     2       1
 48   39    0   2       138   220    0        1      152      0      0.0      1   0     2       1
 49   39    0   2       138   220    0        1      152      0      0.0      1   0     2       1
 50   39    0   2       138   220    0        1      152      0      0.0      1   0     2       1
 51   39    1   0       118   219    0        1      140      0      1.2      1   0     3       0
 52   39    1   0       118   219    0        1      140      0      1.2      1   0     3       0
 53   39    1   0       118   219    0        1      140      0      1.2      1   0     3       0
 54   39    1   0       118   219    0        1      140      0      1.2      1   0     3       0
 55   39    1   2       140   321    0        0      182      0      0.0      2   0     2       1
 56   39    1   2       140   321    0        0      182      0      0.0      2   0     2       1
 57   39    1   2       140   321    0        0      182      0      0.0      2   0     2       1
 58   40    1   0       110   167    0        0      114      1      2.0      1   0     3       0
 59   40    1   0       110   167    0        0      114      1      2.0      1   0     3       0
 60   40    1   0       110   167    0        0      114      1      2.0      1   0     3       0
 61   40    1   0       110   167    0        0      114      1      2.0      1   0     3       0
 62   40    1   0       152   223    0        1      181      0      0.0      2   0     3       0
 63   40    1   0       152   223    0        1      181      0      0.0      2   0     3       0
 64   40    1   0       152   223    0        1      181      0      0.0      2   0     3       0
 65   40    1   0       152   223    0        1      181      0      0.0      2   0     3       0
 66   40    1   3       140   199    0        1      178      1      1.4      2   0     3       1
 67   40    1   3       140   199    0        1      178      1      1.4      2   0     3       1
 68   40    1   3       140   199    0        1      178      1      1.4      2   0     3       1
 69   41    0   1       105   198    0        1      168      0      0.0      2   1     2       1
 70   41    0   1       105   198    0        1      168      0      0.0      2   1     2       1
 71   41    0   1       105   198    0        1      168      0      0.0      2   1     2       1
 72   41    0   1       126   306    0        1      163      0      0.0      2   0     2       1
 73   41    0   1       126   306    0        1      163      0      0.0      2   0     2       1
 74   41    0   1       126   306    0        1      163      0      0.0      2   0     2       1
 75   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
 76   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
 77   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
 78   41    0   2       112   268    0        0      172      1      0.0      2   0     2       1
 79   41    0   2       112   268    0        0      172      1      0.0      2   0     2       1
 80   41    0   2       112   268    0        0      172      1      0.0      2   0     2       1
 81   41    1   0       110   172    0        0      158      0      0.0      2   0     3       0
 82   41    1   0       110   172    0        0      158      0      0.0      2   0     3       0
 83   41    1   0       110   172    0        0      158      0      0.0      2   0     3       0
 84   41    1   1       110   235    0        1      153      0      0.0      2   0     2       1
 85   41    1   1       110   235    0        1      153      0      0.0      2   0     2       1
 86   41    1   1       110   235    0        1      153      0      0.0      2   0     2       1
 87   41    1   1       120   157    0        1      182      0      0.0      2   0     2       1
 88   41    1   1       120   157    0        1      182      0      0.0      2   0     2       1
 89   41    1   1       120   157    0        1      182      0      0.0      2   0     2       1
 90   41    1   1       120   157    0        1      182      0      0.0      2   0     2       1
 91   41    1   1       135   203    0        1      132      0      0.0      1   0     1       1
 92   41    1   1       135   203    0        1      132      0      0.0      1   0     1       1
 93   41    1   1       135   203    0        1      132      0      0.0      1   0     1       1
 94   41    1   1       135   203    0        1      132      0      0.0      1   0     1       1
 95   41    1   2       112   250    0        1      179      0      0.0      2   0     2       1
 96   41    1   2       112   250    0        1      179      0      0.0      2   0     2       1
 97   41    1   2       112   250    0        1      179      0      0.0      2   0     2       1
 98   41    1   2       130   214    0        0      168      0      2.0      1   0     2       1
 99   41    1   2       130   214    0        0      168      0      2.0      1   0     2       1
100   41    1   2       130   214    0        0      168      0      2.0      1   0     2       1
Rows: 1-100 | Columns: 14

Let's focus on the patients' maximum heart rate (thalach) and cholesterol (chol) to identify some outliers.

In [6]:
%matplotlib inline
vdf.scatter(["thalach", "chol"])
Out[6]:
<AxesSubplot:xlabel='"thalach"', ylabel='"chol"'>

We can see some outliers in the distribution: some people with very high cholesterol and others with a very low heart rate. Let's compute the global outliers using the 'outliers' method.

In [7]:
vdf.outliers(["thalach", "chol"], "global_outliers")
vdf.scatter(["thalach", "chol"], catcol = "global_outliers")
Out[7]:
<AxesSubplot:xlabel='"thalach"', ylabel='"chol"'>

It is also possible to draw an outlier plot using the 'outliers_plot' method.

In [8]:
vdf.outliers_plot(["thalach", "chol"])
Out[8]:
<AxesSubplot:xlabel='"thalach"', ylabel='"chol"'>

We've detected some global outliers in the distribution, and we can impute them with the 'fill_outliers' method.

Generally, you can identify global outliers with the Z-Score. We'll consider a data point an outlier if the absolute value of its Z-Score is greater than 3. Some less precise techniques treat the data points that fall below the alpha-quantile or above the (1 - alpha)-quantile as outliers. You're free to choose either of these strategies when filling outliers.

In [9]:
vdf["thalach"].fill_outliers(use_threshold = True,
                             threshold = 3.0,
                             method = "winsorize")
vdf["chol"].fill_outliers(use_threshold = True,
                          threshold = 3.0,
                          method = "winsorize")
vdf.scatter(["thalach", "chol"], catcol = "global_outliers")
Out[9]:
<AxesSubplot:xlabel='"thalach"', ylabel='"chol"'>
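
As a rough illustration of what winsorizing with a Z-Score threshold does, the sketch below clips every value to the interval [mean - threshold * std, mean + threshold * std] in plain Python. The function name, the sample data, and the use of the population standard deviation are assumptions for this example; the 'fill_outliers' method may compute its bounds differently.

```python
from statistics import mean, pstdev

def winsorize_by_z_score(values, threshold=3.0):
    """Clip each value to [mean - threshold * std, mean + threshold * std]."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    lower = mu - threshold * sigma
    upper = mu + threshold * sigma
    return [min(max(x, lower), upper) for x in values]

# Hypothetical cholesterol-like readings with one extreme value.
chol = [200 + i for i in range(20)] + [600]
clipped = winsorize_by_z_score(chol)
print(chol[-1], "->", round(clipped[-1], 1))
```

Values inside the bounds are left untouched; only the extreme reading is pulled back to the upper bound, so no row is dropped from the dataset.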

Other techniques like DBSCAN or local outlier factor (LOF) can be used to check other data points for outliers.

In [10]:
from verticapy.learn.cluster import DBSCAN

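# eps and min_samples control how dense a neighborhood must be to form a cluster
model = DBSCAN("dbscan_heart", eps = 20, min_samples = 10)
model.fit("heart", ["thalach", "chol"])
model.plot()
vdf_tmp = model.predict()
# points that belong to no cluster are labeled -1; flag them as outliers
vdf_tmp["outliers_dbscan"] = "(dbscan_cluster = -1)::int"
vdf_tmp.scatter(["thalach", "chol"], catcol = "outliers_dbscan")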
Out[10]:
<AxesSubplot:xlabel='"thalach"', ylabel='"chol"'>
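
The cell above flags the points with dbscan_cluster = -1, that is, the points DBSCAN treats as noise. The following brute-force sketch shows, in plain Python, which points end up as noise: those that are not core points and have no core point within eps. The function name and toy data are hypothetical, and counting a point as part of its own neighborhood is an assumption (it follows the common convention and may differ from the Vertica implementation).

```python
import math

def dbscan_noise(points, eps, min_samples):
    """Return one boolean per point: True when DBSCAN would label it as noise."""
    n = len(points)
    # Neighborhood of each point: indices within eps (the point itself included).
    neighborhoods = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_samples points in its neighborhood.
    core = [len(nb) >= min_samples for nb in neighborhoods]
    # Noise: neither a core point nor within eps of one (a border point).
    return [not any(core[j] for j in neighborhoods[i]) for i in range(n)]

# A dense 3x3 grid plus one isolated point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
noise = dbscan_noise(points, eps=1.5, min_samples=4)
print([p for p, is_noise in zip(points, noise) if is_noise])  # prints [(10, 10)]
```

Finding the noise points does not require building the clusters themselves, which keeps the sketch short.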

While DBSCAN identifies outliers when computing the clusters, LOF computes an outlier score. Generally, a LOF Score greater than 1.5 indicates an outlier.

In [11]:
from verticapy.learn.neighbors import LocalOutlierFactor

model = LocalOutlierFactor("lof_heart")
model.fit("heart", ["thalach", "chol"])
model.plot()
lof_heart = model.predict()
lof_heart["outliers"] = "(CASE WHEN lof_score > 1.5 THEN 1 ELSE 0 END)"
lof_heart.scatter(["thalach", "chol"], catcol = "outliers")
Out[11]:
<AxesSubplot:xlabel='"thalach"', ylabel='"chol"'>
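
To show where a LOF score comes from, here is a minimal brute-force sketch in plain Python. It compares the local density of each point with the density of its k nearest neighbors; a score close to 1 means the point is about as dense as its neighbors, while a much larger score means it is isolated. The function name, the parameter k, and the toy data are assumptions for illustration, and ties at the k-th distance are ignored for brevity.

```python
import math

def lof_scores(points, k=3):
    """Return the local outlier factor of every point (brute force, O(n^2)).

    Assumes distinct points, so that no reachability distance sums to zero.
    """
    n = len(points)
    # For each point, its k nearest other points as (distance, index) pairs.
    neighbors = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(dists[:k])
    # k-distance: distance from a point to its k-th nearest neighbor.
    k_distance = [nb[-1][0] for nb in neighbors]

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# A dense 3x3 grid plus one isolated point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
print([p for p, s in zip(points, scores) if s > 1.5])  # prints [(10, 10)]
```

On this toy data the grid points all score below 1.5, and only the isolated point exceeds the threshold used above.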

Many other techniques, such as k-means clustering, can also be used to find outliers, but the Z-Score remains the most important method. After identifying outliers, we only have to decide how to impute the affected values. We'll focus on missing values in the next lesson.