Outliers

Outliers are data points that differ significantly from the rest of the data. While some outliers reveal important information (machine failure, systems fraud…), others are simply errors.

Some machine learning algorithms are sensitive to outliers. Outliers can badly degrade the final predictions because of the bias they add to the data, so handling them is one of the most important parts of data preparation.

Outliers fall into three main types:

  • Global Outliers: Values far outside the entirety of their source dataset

  • Contextual Outliers: Values that deviate significantly from the rest of the data points in the same context

  • Collective Outliers: Values that aren’t global or contextual outliers, but as a collection deviate significantly from the entire dataset
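
The difference between a global and a contextual outlier can be illustrated with a small sketch in plain Python. The temperature readings, the `zscores` helper, and the cutoff of 2.5 are all hypothetical and chosen only for illustration. Deviation is scored here as the number of standard deviations from the mean, computed once over all readings and once per season (the context).

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical temperature readings in degrees Celsius, tagged with a context (season).
readings = [("winter", t) for t in [1, 3, 2, 0, 4, 2, 3, 1, 2, 30]]
readings += [("summer", t) for t in [28, 30, 31, 29, 32, 30, 27, 31, 29, 30]]

THRESHOLD = 2.5  # assumed cutoff, kept low because each group is small


def zscores(values):
    """Number of standard deviations each value lies from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Global check: score every reading against the whole dataset.
all_temps = [temp for _, temp in readings]
global_flags = [abs(z) > THRESHOLD for z in zscores(all_temps)]

# Contextual check: score every reading against its own season only.
by_season = defaultdict(list)
for season, temp in readings:
    by_season[season].append(temp)

contextual_outliers = [
    (season, temp)
    for season, temps in by_season.items()
    for temp, z in zip(temps, zscores(temps))
    if abs(z) > THRESHOLD
]

print(any(global_flags))
print(contextual_outliers)
```

In this toy data, a reading of 30 is unremarkable against the whole dataset but stands out within the winter readings, which is what makes it contextual rather than global.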

Global outliers are often the most critical type and can add a significant amount of bias to the data. Fortunately, they are easy to identify by computing the Z-Score, which measures how many standard deviations a value lies from the mean.
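
To make the Z-Score rule concrete, here is a minimal sketch in plain Python, independent of VerticaPy. It assumes the population standard deviation and a cutoff of 3; the function name `zscore_flags` and the cholesterol-like sample values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_flags(values, threshold=3.0):
    """Return True for each value whose absolute Z-Score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [abs(v - mu) / sigma > threshold for v in values]


# Hypothetical sample: twenty ordinary values plus one extreme value.
chol = [200, 210, 190, 205, 195, 220, 180, 215, 185, 200,
        210, 190, 205, 195, 220, 180, 215, 185, 200, 200,
        600]

flags = zscore_flags(chol)
outliers = [v for v, is_outlier in zip(chol, flags) if is_outlier]
print(outliers)
```

Note that a single extreme value also inflates the mean and the standard deviation, so with very small samples no value can reach a Z-Score of 3 at all.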

Let’s look at some examples using the ‘Heart Disease’ dataset. This dataset contains information on patients who are likely to have heart-related complications.

[1]:
import verticapy as vp

vp.set_option("plotting_lib","highcharts")
vp.drop("public.heart") # To make sure there is no other table with that name
vdf = vp.read_csv("data/heart.csv", schema = "public", table_name = "heart")
display(vdf)
The table "public"."heart" has been successfully created.
Column types: Integer for all columns except oldpeak, which is Numeric(7).

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
  1   29    1   1       130   204    0        0      202      0      0.0      2   0     2       1
  2   29    1   1       130   204    0        0      202      0      0.0      2   0     2       1
  3   29    1   1       130   204    0        0      202      0      0.0      2   0     2       1
  4   29    1   1       130   204    0        0      202      0      0.0      2   0     2       1
  5   34    0   1       118   210    0        1      192      0      0.7      2   0     2       1
  6   34    0   1       118   210    0        1      192      0      0.7      2   0     2       1
  7   34    0   1       118   210    0        1      192      0      0.7      2   0     2       1
  8   34    1   3       118   182    0        0      174      0      0.0      2   0     2       1
  9   34    1   3       118   182    0        0      174      0      0.0      2   0     2       1
 10   34    1   3       118   182    0        0      174      0      0.0      2   0     2       1
 11   35    0   0       138   183    0        1      182      0      1.4      2   0     2       1
 12   35    0   0       138   183    0        1      182      0      1.4      2   0     2       1
 13   35    0   0       138   183    0        1      182      0      1.4      2   0     2       1
 14   35    0   0       138   183    0        1      182      0      1.4      2   0     2       1
 15   35    1   0       120   198    0        1      130      1      1.6      1   0     3       0
 16   35    1   0       120   198    0        1      130      1      1.6      1   0     3       0
 17   35    1   0       120   198    0        1      130      1      1.6      1   0     3       0
 18   35    1   0       120   198    0        1      130      1      1.6      1   0     3       0
 19   35    1   0       126   282    0        0      156      1      0.0      2   0     3       0
 20   35    1   0       126   282    0        0      156      1      0.0      2   0     3       0
 21   35    1   0       126   282    0        0      156      1      0.0      2   0     3       0
 22   35    1   1       122   192    0        1      174      0      0.0      2   0     2       1
 23   35    1   1       122   192    0        1      174      0      0.0      2   0     2       1
 24   35    1   1       122   192    0        1      174      0      0.0      2   0     2       1
 25   35    1   1       122   192    0        1      174      0      0.0      2   0     2       1
 26   37    0   2       120   215    0        1      170      0      0.0      2   0     2       1
 27   37    0   2       120   215    0        1      170      0      0.0      2   0     2       1
 28   37    0   2       120   215    0        1      170      0      0.0      2   0     2       1
 29   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
 30   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
 31   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
 32   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 33   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 34   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 35   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 36   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 37   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 38   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 39   38    1   2       138   175    0        1      173      0      0.0      2   4     2       1
 40   38    1   3       120   231    0        1      182      1      3.8      1   0     3       0
 41   38    1   3       120   231    0        1      182      1      3.8      1   0     3       0
 42   38    1   3       120   231    0        1      182      1      3.8      1   0     3       0
 43   38    1   3       120   231    0        1      182      1      3.8      1   0     3       0
 44   39    0   2        94   199    0        1      179      0      0.0      2   0     2       1
 45   39    0   2        94   199    0        1      179      0      0.0      2   0     2       1
 46   39    0   2        94   199    0        1      179      0      0.0      2   0     2       1
 47   39    0   2       138   220    0        1      152      0      0.0      1   0     2       1
 48   39    0   2       138   220    0        1      152      0      0.0      1   0     2       1
 49   39    0   2       138   220    0        1      152      0      0.0      1   0     2       1
 50   39    0   2       138   220    0        1      152      0      0.0      1   0     2       1
 51   39    1   0       118   219    0        1      140      0      1.2      1   0     3       0
 52   39    1   0       118   219    0        1      140      0      1.2      1   0     3       0
 53   39    1   0       118   219    0        1      140      0      1.2      1   0     3       0
 54   39    1   0       118   219    0        1      140      0      1.2      1   0     3       0
 55   39    1   2       140   321    0        0      182      0      0.0      2   0     2       1
 56   39    1   2       140   321    0        0      182      0      0.0      2   0     2       1
 57   39    1   2       140   321    0        0      182      0      0.0      2   0     2       1
 58   40    1   0       110   167    0        0      114      1      2.0      1   0     3       0
 59   40    1   0       110   167    0        0      114      1      2.0      1   0     3       0
 60   40    1   0       110   167    0        0      114      1      2.0      1   0     3       0
 61   40    1   0       110   167    0        0      114      1      2.0      1   0     3       0
 62   40    1   0       152   223    0        1      181      0      0.0      2   0     3       0
 63   40    1   0       152   223    0        1      181      0      0.0      2   0     3       0
 64   40    1   0       152   223    0        1      181      0      0.0      2   0     3       0
 65   40    1   0       152   223    0        1      181      0      0.0      2   0     3       0
 66   40    1   3       140   199    0        1      178      1      1.4      2   0     3       1
 67   40    1   3       140   199    0        1      178      1      1.4      2   0     3       1
 68   40    1   3       140   199    0        1      178      1      1.4      2   0     3       1
 69   41    0   1       105   198    0        1      168      0      0.0      2   1     2       1
 70   41    0   1       105   198    0        1      168      0      0.0      2   1     2       1
 71   41    0   1       105   198    0        1      168      0      0.0      2   1     2       1
 72   41    0   1       126   306    0        1      163      0      0.0      2   0     2       1
 73   41    0   1       126   306    0        1      163      0      0.0      2   0     2       1
 74   41    0   1       126   306    0        1      163      0      0.0      2   0     2       1
 75   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
 76   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
 77   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
 78   41    0   2       112   268    0        0      172      1      0.0      2   0     2       1
 79   41    0   2       112   268    0        0      172      1      0.0      2   0     2       1
 80   41    0   2       112   268    0        0      172      1      0.0      2   0     2       1
 81   41    1   0       110   172    0        0      158      0      0.0      2   0     3       0
 82   41    1   0       110   172    0        0      158      0      0.0      2   0     3       0
 83   41    1   0       110   172    0        0      158      0      0.0      2   0     3       0
 84   41    1   1       110   235    0        1      153      0      0.0      2   0     2       1
 85   41    1   1       110   235    0        1      153      0      0.0      2   0     2       1
 86   41    1   1       110   235    0        1      153      0      0.0      2   0     2       1
 87   41    1   1       120   157    0        1      182      0      0.0      2   0     2       1
 88   41    1   1       120   157    0        1      182      0      0.0      2   0     2       1
 89   41    1   1       120   157    0        1      182      0      0.0      2   0     2       1
 90   41    1   1       120   157    0        1      182      0      0.0      2   0     2       1
 91   41    1   1       135   203    0        1      132      0      0.0      1   0     1       1
 92   41    1   1       135   203    0        1      132      0      0.0      1   0     1       1
 93   41    1   1       135   203    0        1      132      0      0.0      1   0     1       1
 94   41    1   1       135   203    0        1      132      0      0.0      1   0     1       1
 95   41    1   2       112   250    0        1      179      0      0.0      2   0     2       1
 96   41    1   2       112   250    0        1      179      0      0.0      2   0     2       1
 97   41    1   2       112   250    0        1      179      0      0.0      2   0     2       1
 98   41    1   2       130   214    0        0      168      0      2.0      1   0     2       1
 99   41    1   2       130   214    0        0      168      0      2.0      1   0     2       1
100   41    1   2       130   214    0        0      168      0      2.0      1   0     2       1
Rows: 1-100 | Columns: 14

Let’s focus on a patient’s maximum heart rate (thalach) and cholesterol level (chol) to identify some outliers.

[2]:
%matplotlib inline
vdf.scatter(["thalach", "chol"])
[2]:

The plot shows some outliers in the distribution: people with high cholesterol and others with a very low heart rate. Let’s compute the global outliers using the ‘outliers’ method.

[3]:
vdf.outliers(["thalach", "chol"], "global_outliers")
vdf.scatter(["thalach", "chol"], by = "global_outliers")
[3]:

It is also possible to draw an outlier plot using the ‘outliers_plot’ method.

[4]:
vdf.outliers_plot(["thalach", "chol"])
[4]:

We’ve detected some global outliers in the distribution and we can impute these with the ‘fill_outliers’ method.

Generally, you can identify global outliers with the Z-Score: we’ll treat any data point whose Z-Score exceeds 3 in absolute value as an outlier. A less precise alternative treats the data points below the alpha-quantile or above the (1 - alpha)-quantile as outliers. You’re free to choose either of these strategies when filling outliers.
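
As a rough illustration of the two strategies, the sketch below clips values in plain Python. It assumes that winsorizing means replacing any value beyond a lower or upper bound with the bound itself; the function names, the sample values, and the 5% default are hypothetical, and this is not VerticaPy's implementation.

```python
from statistics import mean, pstdev, quantiles


def winsorize_zscore(values, threshold=3.0):
    """Clip values to mean +/- threshold * standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    lower, upper = mu - threshold * sigma, mu + threshold * sigma
    return [min(max(v, lower), upper) for v in values]


def winsorize_quantile(values, alpha_percent=5):
    """Clip values to the alpha and (100 - alpha) percentiles."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    lower, upper = cuts[alpha_percent - 1], cuts[99 - alpha_percent]
    return [min(max(v, lower), upper) for v in values]


# Hypothetical sample: twenty ordinary values plus one extreme value.
chol = [200, 210, 190, 205, 195, 220, 180, 215, 185, 200,
        210, 190, 205, 195, 220, 180, 215, 185, 200, 200,
        600]

by_zscore = winsorize_zscore(chol)
by_quantile = winsorize_quantile(chol)
print(by_zscore[-1], by_quantile[-1])
```

The quantile variant always clips a fixed share of the data, even when no value is unusual, which is one reason it can be considered the less precise of the two.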

[5]:
vdf["thalach"].fill_outliers(use_threshold = True,
                             threshold = 3.0,
                             method = "winsorize")
vdf["chol"].fill_outliers(use_threshold = True,
                          threshold = 3.0,
                          method = "winsorize")
vdf.scatter(["thalach", "chol"], by = "global_outliers")
[5]:

Other techniques like DBSCAN or local outlier factor (LOF) can be used to check other data points for outliers.
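
For intuition about how a density-based method flags outliers, here is a compact sketch of the textbook DBSCAN procedure in plain Python. It is not the VerticaPy implementation; the `dbscan` function, the toy points, and the `eps` and `min_samples` values are assumptions made for illustration. Points that end up in no cluster receive the label -1.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = region(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            neighbors = region(j)
            if len(neighbors) >= min_samples:
                queue.extend(neighbors)  # core point: keep expanding
    return labels


# Toy data: a dense 3x3 grid plus one isolated point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
labels = dbscan(points, eps=1.5, min_samples=4)
noise = [p for p, label in zip(points, labels) if label == -1]
print(noise)
```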

[6]:
vdf
[6]:
Column types: Integer for all columns except chol (Numeric(34)), thalach (Numeric(33)), and oldpeak (Numeric(7)).

     age  sex  cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target  global_outliers
  1   29    1   1       130  204.0    0        0    202.0      0      0.0      2   0     2       1                0
  2   29    1   1       130  204.0    0        0    202.0      0      0.0      2   0     2       1                0
  3   29    1   1       130  204.0    0        0    202.0      0      0.0      2   0     2       1                0
  4   29    1   1       130  204.0    0        0    202.0      0      0.0      2   0     2       1                0
  5   34    0   1       118  210.0    0        1    192.0      0      0.7      2   0     2       1                0
  6   34    0   1       118  210.0    0        1    192.0      0      0.7      2   0     2       1                0
  7   34    0   1       118  210.0    0        1    192.0      0      0.7      2   0     2       1                0
  8   34    1   3       118  182.0    0        0    174.0      0      0.0      2   0     2       1                0
  9   34    1   3       118  182.0    0        0    174.0      0      0.0      2   0     2       1                0
 10   34    1   3       118  182.0    0        0    174.0      0      0.0      2   0     2       1                0
 11   35    0   0       138  183.0    0        1    182.0      0      1.4      2   0     2       1                0
 12   35    0   0       138  183.0    0        1    182.0      0      1.4      2   0     2       1                0
 13   35    0   0       138  183.0    0        1    182.0      0      1.4      2   0     2       1                0
 14   35    0   0       138  183.0    0        1    182.0      0      1.4      2   0     2       1                0
 15   35    1   0       120  198.0    0        1    130.0      1      1.6      1   0     3       0                0
 16   35    1   0       120  198.0    0        1    130.0      1      1.6      1   0     3       0                0
 17   35    1   0       120  198.0    0        1    130.0      1      1.6      1   0     3       0                0
 18   35    1   0       120  198.0    0        1    130.0      1      1.6      1   0     3       0                0
 19   35    1   0       126  282.0    0        0    156.0      1      0.0      2   0     3       0                0
 20   35    1   0       126  282.0    0        0    156.0      1      0.0      2   0     3       0                0
 21   35    1   0       126  282.0    0        0    156.0      1      0.0      2   0     3       0                0
 22   35    1   1       122  192.0    0        1    174.0      0      0.0      2   0     2       1                0
 23   35    1   1       122  192.0    0        1    174.0      0      0.0      2   0     2       1                0
 24   35    1   1       122  192.0    0        1    174.0      0      0.0      2   0     2       1                0
 25   35    1   1       122  192.0    0        1    174.0      0      0.0      2   0     2       1                0
 26   37    0   2       120  215.0    0        1    170.0      0      0.0      2   0     2       1                0
 27   37    0   2       120  215.0    0        1    170.0      0      0.0      2   0     2       1                0
 28   37    0   2       120  215.0    0        1    170.0      0      0.0      2   0     2       1                0
 29   37    1   2       130  250.0    0        1    187.0      0      3.5      0   0     2       1                0
 30   37    1   2       130  250.0    0        1    187.0      0      3.5      0   0     2       1                0
 31   37    1   2       130  250.0    0        1    187.0      0      3.5      0   0     2       1                0
 32   38    1   2       138  175.0    0        1    173.0      0      0.0      2   4     2       1                0
 33   38    1   2       138  175.0    0        1    173.0      0      0.0      2   4     2       1                0
 34   38    1   2       138  175.0    0        1    173.0      0      0.0      2   4     2       1                0
 35   38    1   2       138  175.0    0        1    173.0      0      0.0      2   4     2       1                0
 36   38    1   2       138  175.0    0        1    173.0      0      0.0      2   4     2       1                0
 37   38    1   2       138  175.0    0        1    173.0      0      0.0      2   4     2       1                0
 38   38    1   2       138  175.0    0        1    173.0      0      0.0      2   4     2       1                0
 39   38    1   2       138  175.0    0        1    173.0      0      0.0      2   4     2       1                0
 40   38    1   3       120  231.0    0        1    182.0      1      3.8      1   0     3       0                0
 41   38    1   3       120  231.0    0        1    182.0      1      3.8      1   0     3       0                0
 42   38    1   3       120  231.0    0        1    182.0      1      3.8      1   0     3       0                0
 43   38    1   3       120  231.0    0        1    182.0      1      3.8      1   0     3       0                0
 44   39    0   2        94  199.0    0        1    179.0      0      0.0      2   0     2       1                0
 45   39    0   2        94  199.0    0        1    179.0      0      0.0      2   0     2       1                0
 46   39    0   2        94  199.0    0        1    179.0      0      0.0      2   0     2       1                0
 47   39    0   2       138  220.0    0        1    152.0      0      0.0      1   0     2       1                0
 48   39    0   2       138  220.0    0        1    152.0      0      0.0      1   0     2       1                0
 49   39    0   2       138  220.0    0        1    152.0      0      0.0      1   0     2       1                0
 50   39    0   2       138  220.0    0        1    152.0      0      0.0      1   0     2       1                0
 51   39    1   0       118  219.0    0        1    140.0      0      1.2      1   0     3       0                0
 52   39    1   0       118  219.0    0        1    140.0      0      1.2      1   0     3       0                0
 53   39    1   0       118  219.0    0        1    140.0      0      1.2      1   0     3       0                0
 54   39    1   0       118  219.0    0        1    140.0      0      1.2      1   0     3       0                0
 55   39    1   2       140  321.0    0        0    182.0      0      0.0      2   0     2       1                0
 56   39    1   2       140  321.0    0        0    182.0      0      0.0      2   0     2       1                0
 57   39    1   2       140  321.0    0        0    182.0      0      0.0      2   0     2       1                0
 58   40    1   0       110  167.0    0        0    114.0      1      2.0      1   0     3       0                0
 59   40    1   0       110  167.0    0        0    114.0      1      2.0      1   0     3       0                0
 60   40    1   0       110  167.0    0        0    114.0      1      2.0      1   0     3       0                0
 61   40    1   0       110  167.0    0        0    114.0      1      2.0      1   0     3       0                0
 62   40    1   0       152  223.0    0        1    181.0      0      0.0      2   0     3       0                0
 63   40    1   0       152  223.0    0        1    181.0      0      0.0      2   0     3       0                0
 64   40    1   0       152  223.0    0        1    181.0      0      0.0      2   0     3       0                0
 65   40    1   0       152  223.0    0        1    181.0      0      0.0      2   0     3       0                0
 66   40    1   3       140  199.0    0        1    178.0      1      1.4      2   0     3       1                0
 67   40    1   3       140  199.0    0        1    178.0      1      1.4      2   0     3       1                0
 68   40    1   3       140  199.0    0        1    178.0      1      1.4      2   0     3       1                0
 69   41    0   1       105  198.0    0        1    168.0      0      0.0      2   1     2       1                0
 70   41    0   1       105  198.0    0        1    168.0      0      0.0      2   1     2       1                0
 71   41    0   1       105  198.0    0        1    168.0      0      0.0      2   1     2       1                0
 72   41    0   1       126  306.0    0        1    163.0      0      0.0      2   0     2       1                0
 73   41    0   1       126  306.0    0        1    163.0      0      0.0      2   0     2       1                0
 74   41    0   1       126  306.0    0        1    163.0      0      0.0      2   0     2       1                0
 75   41    0   1       130  204.0    0        0    172.0      0      1.4      2   0     2       1                0
 76   41    0   1       130  204.0    0        0    172.0      0      1.4      2   0     2       1                0
 77   41    0   1       130  204.0    0        0    172.0      0      1.4      2   0     2       1                0
 78   41    0   2       112  268.0    0        0    172.0      1      0.0      2   0     2       1                0
 79   41    0   2       112  268.0    0        0    172.0      1      0.0      2   0     2       1                0
 80   41    0   2       112  268.0    0        0    172.0      1      0.0      2   0     2       1                0
 81   41    1   0       110  172.0    0        0    158.0      0      0.0      2   0     3       0                0
 82   41    1   0       110  172.0    0        0    158.0      0      0.0      2   0     3       0                0
 83   41    1   0       110  172.0    0        0    158.0      0      0.0      2   0     3       0                0
 84   41    1   1       110  235.0    0        1    153.0      0      0.0      2   0     2       1                0
 85   41    1   1       110  235.0    0        1    153.0      0      0.0      2   0     2       1                0
 86   41    1   1       110  235.0    0        1    153.0      0      0.0      2   0     2       1                0
 87   41    1   1       120  157.0    0        1    182.0      0      0.0      2   0     2       1                0
 88   41    1   1       120  157.0    0        1    182.0      0      0.0      2   0     2       1                0
 89   41    1   1       120  157.0    0        1    182.0      0      0.0      2   0     2       1                0
 90   41    1   1       120  157.0    0        1    182.0      0      0.0      2   0     2       1                0
 91   41    1   1       135  203.0    0        1    132.0      0      0.0      1   0     1       1                0
 92   41    1   1       135  203.0    0        1    132.0      0      0.0      1   0     1       1                0
 93   41    1   1       135  203.0    0        1    132.0      0      0.0      1   0     1       1                0
 94   41    1   1       135  203.0    0        1    132.0      0      0.0      1   0     1       1                0
 95   41    1   2       112  250.0    0        1    179.0      0      0.0      2   0     2       1                0
 96   41    1   2       112  250.0    0        1    179.0      0      0.0      2   0     2       1                0
 97   41    1   2       112  250.0    0        1    179.0      0      0.0      2   0     2       1                0
 98   41    1   2       130  214.0    0        0    168.0      0      2.0      1   0     2       1                0
 99   41    1   2       130  214.0    0        0    168.0      0      2.0      1   0     2       1                0
100   41    1   2       130  214.0    0        0    168.0      0      2.0      1   0     2       1                0
Rows: 1-100 | Columns: 15
[7]:
from verticapy.learn.cluster import DBSCAN

vp.drop("dbscan_heart")
model = DBSCAN("dbscan_heart", eps = 20, min_samples = 10)
model.fit("public.heart", ["thalach", "chol"])
model.plot()
[7]:
[8]:
vdf_tmp = model.predict()
vdf_tmp["outliers_dbscan"] = "(dbscan_cluster = -1)::int"
vdf_tmp.scatter(["thalach", "chol"], by = "outliers_dbscan")
[8]:

While DBSCAN identifies outliers when computing the clusters, LOF computes an outlier score. Generally, a LOF Score greater than 1.5 indicates an outlier.
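
The LOF score itself can be sketched in a few lines of plain Python, following the standard definition (k-distance, reachability distance, local reachability density). The `lof_scores` function, the choice of `k = 3`, and the toy points are hypothetical, and the sketch assumes there are no duplicate points, since duplicates can make the density undefined.

```python
import math


def lof_scores(points, k=3):
    """Return the local outlier factor of each point (brute force, no duplicates)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        # Other points sorted by distance to point i; keep the k nearest.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_distance(j), dist(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Toy data: a dense 3x3 grid plus one isolated point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
flagged = [p for p, s in zip(points, scores) if s > 1.5]
print(flagged)
```

Points inside the grid have roughly the same density as their neighbors and score close to 1, while the isolated point is far less dense than its neighbors and scores well above the 1.5 cutoff.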

[9]:
from verticapy.learn.neighbors import LocalOutlierFactor

vp.drop("lof_heart")
model = LocalOutlierFactor("lof_heart")
model.fit("public.heart", ["thalach", "chol"])
model.plot()
[9]:
[10]:
lof_heart = model.predict()
lof_heart["outliers"] = "(CASE WHEN lof_score > 1.5 THEN 1 ELSE 0 END)"
lof_heart.scatter(["thalach", "chol"], by = "outliers")
[10]:

There are many other techniques for finding outliers, such as k-means clustering, but the most important method is the Z-Score. After identifying outliers, we have to decide how to handle them, for example by imputing them with the ‘fill_outliers’ method as we did earlier. We’ll focus on missing values in the next lesson.