Outliers
Outliers are data points that differ significantly from the rest of the data. Some outliers reveal important information (machine failure, system fraud, and so on), while others are simply errors.
Some machine learning algorithms are sensitive to outliers, which can badly degrade the final predictions by biasing the data. Handling outliers is therefore one of the most important parts of data preparation.
Outliers fall into three main types:
Global Outliers : Values far outside the entirety of their source dataset
Contextual Outliers : Values that deviate significantly from the rest of the data points in the same context
Collective Outliers : Values that aren’t global or contextual outliers individually, but that deviate significantly from the entire dataset as a collection
Global outliers are often the most critical type and can add a significant amount of bias to the data. Fortunately, we can easily identify them by computing the Z-Score, which measures how many standard deviations a value lies from the mean.
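To make the Z-Score idea concrete, here is a minimal sketch in plain Python. It is independent of VerticaPy: the helper names `z_scores` and `global_outliers` and the sample readings are hypothetical, and the threshold of 3 is only the conventional choice.

```python
import statistics

def z_scores(values):
    """Return the Z-Score of every value: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / std for v in values]

def global_outliers(values, threshold=3.0):
    """Return the indices whose absolute Z-Score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]

# Hypothetical cholesterol readings; the last one is deliberately extreme.
chol = [204, 210, 182, 183, 198, 282, 192, 215, 250, 175, 231,
        199, 220, 219, 321, 167, 223, 199, 198, 306, 564]

flagged = global_outliers(chol)
print(flagged)  # indices of the readings treated as global outliers
```

Note that with very small samples no value can reach a Z-Score of 3, so this rule only makes sense on reasonably large datasets.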
Let’s look at some examples using the ‘Heart Disease’ dataset. This dataset contains information on patients who are likely to have heart-related complications.
[1]:
import verticapy as vp
vp.set_option("plotting_lib","highcharts")
vp.drop("public.heart") # To make sure there is no other table with that name
vdf = vp.read_csv("data/heart.csv", schema = "public", table_name = "heart")
display(vdf)
The table "public"."heart" has been successfully created.
row | age (Integer) | sex (Integer) | cp (Integer) | trestbps (Integer) | chol (Integer) | fbs (Integer) | restecg (Integer) | thalach (Integer) | exang (Integer) | oldpeak (Numeric(7)) | slope (Integer) | ca (Integer) | thal (Integer) | target (Integer) |
1 | 29 | 1 | 1 | 130 | 204 | 0 | 0 | 202 | 0 | 0.0 | 2 | 0 | 2 | 1 |
2 | 29 | 1 | 1 | 130 | 204 | 0 | 0 | 202 | 0 | 0.0 | 2 | 0 | 2 | 1 |
3 | 29 | 1 | 1 | 130 | 204 | 0 | 0 | 202 | 0 | 0.0 | 2 | 0 | 2 | 1 |
4 | 29 | 1 | 1 | 130 | 204 | 0 | 0 | 202 | 0 | 0.0 | 2 | 0 | 2 | 1 |
5 | 34 | 0 | 1 | 118 | 210 | 0 | 1 | 192 | 0 | 0.7 | 2 | 0 | 2 | 1 |
6 | 34 | 0 | 1 | 118 | 210 | 0 | 1 | 192 | 0 | 0.7 | 2 | 0 | 2 | 1 |
7 | 34 | 0 | 1 | 118 | 210 | 0 | 1 | 192 | 0 | 0.7 | 2 | 0 | 2 | 1 |
8 | 34 | 1 | 3 | 118 | 182 | 0 | 0 | 174 | 0 | 0.0 | 2 | 0 | 2 | 1 |
9 | 34 | 1 | 3 | 118 | 182 | 0 | 0 | 174 | 0 | 0.0 | 2 | 0 | 2 | 1 |
10 | 34 | 1 | 3 | 118 | 182 | 0 | 0 | 174 | 0 | 0.0 | 2 | 0 | 2 | 1 |
11 | 35 | 0 | 0 | 138 | 183 | 0 | 1 | 182 | 0 | 1.4 | 2 | 0 | 2 | 1 |
12 | 35 | 0 | 0 | 138 | 183 | 0 | 1 | 182 | 0 | 1.4 | 2 | 0 | 2 | 1 |
13 | 35 | 0 | 0 | 138 | 183 | 0 | 1 | 182 | 0 | 1.4 | 2 | 0 | 2 | 1 |
14 | 35 | 0 | 0 | 138 | 183 | 0 | 1 | 182 | 0 | 1.4 | 2 | 0 | 2 | 1 |
15 | 35 | 1 | 0 | 120 | 198 | 0 | 1 | 130 | 1 | 1.6 | 1 | 0 | 3 | 0 |
16 | 35 | 1 | 0 | 120 | 198 | 0 | 1 | 130 | 1 | 1.6 | 1 | 0 | 3 | 0 |
17 | 35 | 1 | 0 | 120 | 198 | 0 | 1 | 130 | 1 | 1.6 | 1 | 0 | 3 | 0 |
18 | 35 | 1 | 0 | 120 | 198 | 0 | 1 | 130 | 1 | 1.6 | 1 | 0 | 3 | 0 |
19 | 35 | 1 | 0 | 126 | 282 | 0 | 0 | 156 | 1 | 0.0 | 2 | 0 | 3 | 0 |
20 | 35 | 1 | 0 | 126 | 282 | 0 | 0 | 156 | 1 | 0.0 | 2 | 0 | 3 | 0 |
21 | 35 | 1 | 0 | 126 | 282 | 0 | 0 | 156 | 1 | 0.0 | 2 | 0 | 3 | 0 |
22 | 35 | 1 | 1 | 122 | 192 | 0 | 1 | 174 | 0 | 0.0 | 2 | 0 | 2 | 1 |
23 | 35 | 1 | 1 | 122 | 192 | 0 | 1 | 174 | 0 | 0.0 | 2 | 0 | 2 | 1 |
24 | 35 | 1 | 1 | 122 | 192 | 0 | 1 | 174 | 0 | 0.0 | 2 | 0 | 2 | 1 |
25 | 35 | 1 | 1 | 122 | 192 | 0 | 1 | 174 | 0 | 0.0 | 2 | 0 | 2 | 1 |
26 | 37 | 0 | 2 | 120 | 215 | 0 | 1 | 170 | 0 | 0.0 | 2 | 0 | 2 | 1 |
27 | 37 | 0 | 2 | 120 | 215 | 0 | 1 | 170 | 0 | 0.0 | 2 | 0 | 2 | 1 |
28 | 37 | 0 | 2 | 120 | 215 | 0 | 1 | 170 | 0 | 0.0 | 2 | 0 | 2 | 1 |
29 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
30 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
31 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
32 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |
33 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |
34 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |
35 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |
36 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |
37 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |
38 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |
39 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |
40 | 38 | 1 | 3 | 120 | 231 | 0 | 1 | 182 | 1 | 3.8 | 1 | 0 | 3 | 0 |
41 | 38 | 1 | 3 | 120 | 231 | 0 | 1 | 182 | 1 | 3.8 | 1 | 0 | 3 | 0 |
42 | 38 | 1 | 3 | 120 | 231 | 0 | 1 | 182 | 1 | 3.8 | 1 | 0 | 3 | 0 |
43 | 38 | 1 | 3 | 120 | 231 | 0 | 1 | 182 | 1 | 3.8 | 1 | 0 | 3 | 0 |
44 | 39 | 0 | 2 | 94 | 199 | 0 | 1 | 179 | 0 | 0.0 | 2 | 0 | 2 | 1 |
45 | 39 | 0 | 2 | 94 | 199 | 0 | 1 | 179 | 0 | 0.0 | 2 | 0 | 2 | 1 |
46 | 39 | 0 | 2 | 94 | 199 | 0 | 1 | 179 | 0 | 0.0 | 2 | 0 | 2 | 1 |
47 | 39 | 0 | 2 | 138 | 220 | 0 | 1 | 152 | 0 | 0.0 | 1 | 0 | 2 | 1 |
48 | 39 | 0 | 2 | 138 | 220 | 0 | 1 | 152 | 0 | 0.0 | 1 | 0 | 2 | 1 |
49 | 39 | 0 | 2 | 138 | 220 | 0 | 1 | 152 | 0 | 0.0 | 1 | 0 | 2 | 1 |
50 | 39 | 0 | 2 | 138 | 220 | 0 | 1 | 152 | 0 | 0.0 | 1 | 0 | 2 | 1 |
51 | 39 | 1 | 0 | 118 | 219 | 0 | 1 | 140 | 0 | 1.2 | 1 | 0 | 3 | 0 |
52 | 39 | 1 | 0 | 118 | 219 | 0 | 1 | 140 | 0 | 1.2 | 1 | 0 | 3 | 0 |
53 | 39 | 1 | 0 | 118 | 219 | 0 | 1 | 140 | 0 | 1.2 | 1 | 0 | 3 | 0 |
54 | 39 | 1 | 0 | 118 | 219 | 0 | 1 | 140 | 0 | 1.2 | 1 | 0 | 3 | 0 |
55 | 39 | 1 | 2 | 140 | 321 | 0 | 0 | 182 | 0 | 0.0 | 2 | 0 | 2 | 1 |
56 | 39 | 1 | 2 | 140 | 321 | 0 | 0 | 182 | 0 | 0.0 | 2 | 0 | 2 | 1 |
57 | 39 | 1 | 2 | 140 | 321 | 0 | 0 | 182 | 0 | 0.0 | 2 | 0 | 2 | 1 |
58 | 40 | 1 | 0 | 110 | 167 | 0 | 0 | 114 | 1 | 2.0 | 1 | 0 | 3 | 0 |
59 | 40 | 1 | 0 | 110 | 167 | 0 | 0 | 114 | 1 | 2.0 | 1 | 0 | 3 | 0 |
60 | 40 | 1 | 0 | 110 | 167 | 0 | 0 | 114 | 1 | 2.0 | 1 | 0 | 3 | 0 |
61 | 40 | 1 | 0 | 110 | 167 | 0 | 0 | 114 | 1 | 2.0 | 1 | 0 | 3 | 0 |
62 | 40 | 1 | 0 | 152 | 223 | 0 | 1 | 181 | 0 | 0.0 | 2 | 0 | 3 | 0 |
63 | 40 | 1 | 0 | 152 | 223 | 0 | 1 | 181 | 0 | 0.0 | 2 | 0 | 3 | 0 |
64 | 40 | 1 | 0 | 152 | 223 | 0 | 1 | 181 | 0 | 0.0 | 2 | 0 | 3 | 0 |
65 | 40 | 1 | 0 | 152 | 223 | 0 | 1 | 181 | 0 | 0.0 | 2 | 0 | 3 | 0 |
66 | 40 | 1 | 3 | 140 | 199 | 0 | 1 | 178 | 1 | 1.4 | 2 | 0 | 3 | 1 |
67 | 40 | 1 | 3 | 140 | 199 | 0 | 1 | 178 | 1 | 1.4 | 2 | 0 | 3 | 1 |
68 | 40 | 1 | 3 | 140 | 199 | 0 | 1 | 178 | 1 | 1.4 | 2 | 0 | 3 | 1 |
69 | 41 | 0 | 1 | 105 | 198 | 0 | 1 | 168 | 0 | 0.0 | 2 | 1 | 2 | 1 |
70 | 41 | 0 | 1 | 105 | 198 | 0 | 1 | 168 | 0 | 0.0 | 2 | 1 | 2 | 1 |
71 | 41 | 0 | 1 | 105 | 198 | 0 | 1 | 168 | 0 | 0.0 | 2 | 1 | 2 | 1 |
72 | 41 | 0 | 1 | 126 | 306 | 0 | 1 | 163 | 0 | 0.0 | 2 | 0 | 2 | 1 |
73 | 41 | 0 | 1 | 126 | 306 | 0 | 1 | 163 | 0 | 0.0 | 2 | 0 | 2 | 1 |
74 | 41 | 0 | 1 | 126 | 306 | 0 | 1 | 163 | 0 | 0.0 | 2 | 0 | 2 | 1 |
75 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
76 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
77 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
78 | 41 | 0 | 2 | 112 | 268 | 0 | 0 | 172 | 1 | 0.0 | 2 | 0 | 2 | 1 |
79 | 41 | 0 | 2 | 112 | 268 | 0 | 0 | 172 | 1 | 0.0 | 2 | 0 | 2 | 1 |
80 | 41 | 0 | 2 | 112 | 268 | 0 | 0 | 172 | 1 | 0.0 | 2 | 0 | 2 | 1 |
81 | 41 | 1 | 0 | 110 | 172 | 0 | 0 | 158 | 0 | 0.0 | 2 | 0 | 3 | 0 |
82 | 41 | 1 | 0 | 110 | 172 | 0 | 0 | 158 | 0 | 0.0 | 2 | 0 | 3 | 0 |
83 | 41 | 1 | 0 | 110 | 172 | 0 | 0 | 158 | 0 | 0.0 | 2 | 0 | 3 | 0 |
84 | 41 | 1 | 1 | 110 | 235 | 0 | 1 | 153 | 0 | 0.0 | 2 | 0 | 2 | 1 |
85 | 41 | 1 | 1 | 110 | 235 | 0 | 1 | 153 | 0 | 0.0 | 2 | 0 | 2 | 1 |
86 | 41 | 1 | 1 | 110 | 235 | 0 | 1 | 153 | 0 | 0.0 | 2 | 0 | 2 | 1 |
87 | 41 | 1 | 1 | 120 | 157 | 0 | 1 | 182 | 0 | 0.0 | 2 | 0 | 2 | 1 |
88 | 41 | 1 | 1 | 120 | 157 | 0 | 1 | 182 | 0 | 0.0 | 2 | 0 | 2 | 1 |
89 | 41 | 1 | 1 | 120 | 157 | 0 | 1 | 182 | 0 | 0.0 | 2 | 0 | 2 | 1 |
90 | 41 | 1 | 1 | 120 | 157 | 0 | 1 | 182 | 0 | 0.0 | 2 | 0 | 2 | 1 |
91 | 41 | 1 | 1 | 135 | 203 | 0 | 1 | 132 | 0 | 0.0 | 1 | 0 | 1 | 1 |
92 | 41 | 1 | 1 | 135 | 203 | 0 | 1 | 132 | 0 | 0.0 | 1 | 0 | 1 | 1 |
93 | 41 | 1 | 1 | 135 | 203 | 0 | 1 | 132 | 0 | 0.0 | 1 | 0 | 1 | 1 |
94 | 41 | 1 | 1 | 135 | 203 | 0 | 1 | 132 | 0 | 0.0 | 1 | 0 | 1 | 1 |
95 | 41 | 1 | 2 | 112 | 250 | 0 | 1 | 179 | 0 | 0.0 | 2 | 0 | 2 | 1 |
96 | 41 | 1 | 2 | 112 | 250 | 0 | 1 | 179 | 0 | 0.0 | 2 | 0 | 2 | 1 |
97 | 41 | 1 | 2 | 112 | 250 | 0 | 1 | 179 | 0 | 0.0 | 2 | 0 | 2 | 1 |
98 | 41 | 1 | 2 | 130 | 214 | 0 | 0 | 168 | 0 | 2.0 | 1 | 0 | 2 | 1 |
99 | 41 | 1 | 2 | 130 | 214 | 0 | 0 | 168 | 0 | 2.0 | 1 | 0 | 2 | 1 |
100 | 41 | 1 | 2 | 130 | 214 | 0 | 0 | 168 | 0 | 2.0 | 1 | 0 | 2 | 1 |
Let’s focus on each patient’s maximum heart rate (thalach) and cholesterol (chol) to identify some outliers.
[2]:
%matplotlib inline
vdf.scatter(["thalach", "chol"])
[2]:
The scatter plot shows some outliers: patients with very high cholesterol and others with a very low maximum heart rate. Let’s compute the global outliers using the ‘outliers’ method.
[3]:
vdf.outliers(["thalach", "chol"], "global_outliers")
vdf.scatter(["thalach", "chol"], by = "global_outliers")
[3]:
It is also possible to draw an outlier plot using the ‘outliers_plot’ method.
[4]:
vdf.outliers_plot(["thalach", "chol"])
[4]:
We’ve detected some global outliers in the distribution and we can impute these with the ‘fill_outliers’ method.
Generally, you can identify global outliers with the Z-Score. Here, we treat any data point whose Z-Score has an absolute value greater than 3 as an outlier. Some less precise techniques instead treat the data points in the first and last alpha-quantiles as outliers. You’re free to choose either of these strategies when filling outliers.
[5]:
vdf["thalach"].fill_outliers(use_threshold = True,
                             threshold = 3.0,
                             method = "winsorize")
vdf["chol"].fill_outliers(use_threshold = True,
                          threshold = 3.0,
                          method = "winsorize")
vdf.scatter(["thalach", "chol"], by = "global_outliers")
[5]:
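The winsorizing step above can be pictured as clipping each value to the range allowed by the Z-Score threshold. The following plain-Python sketch illustrates that idea under the assumption that the bounds are mean ± threshold × standard deviation; the `winsorize` helper and the sample readings are hypothetical and are not VerticaPy's implementation.

```python
import statistics

def winsorize(values, threshold=3.0):
    """Clip values to [mean - threshold * std, mean + threshold * std]."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    lower = mean - threshold * std
    upper = mean + threshold * std
    clipped = [min(max(v, lower), upper) for v in values]
    return clipped, (lower, upper)

# Hypothetical cholesterol readings; the last one is deliberately extreme.
chol = [204, 210, 182, 183, 198, 282, 192, 215, 250, 175, 231,
        199, 220, 219, 321, 167, 223, 199, 198, 306, 564]

clipped, (lower, upper) = winsorize(chol)
print(lower, upper)   # bounds implied by the threshold
print(clipped[-1])    # the extreme reading is pulled back to the upper bound
```

Winsorizing keeps every row and only caps the extreme values, which is why it is often preferred to dropping outliers when the rest of the record is still useful.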
Other techniques like DBSCAN or local outlier factor (LOF) can be used to check other data points for outliers.
[6]:
vdf
[6]:
row | age (Integer) | sex (Integer) | cp (Integer) | trestbps (Integer) | chol (Numeric(34)) | fbs (Integer) | restecg (Integer) | thalach (Numeric(33)) | exang (Integer) | oldpeak (Numeric(7)) | slope (Integer) | ca (Integer) | thal (Integer) | target (Integer) | global_outliers (Integer) |
1 | 29 | 1 | 1 | 130 | 204.0 | 0 | 0 | 202.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
2 | 29 | 1 | 1 | 130 | 204.0 | 0 | 0 | 202.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
3 | 29 | 1 | 1 | 130 | 204.0 | 0 | 0 | 202.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
4 | 29 | 1 | 1 | 130 | 204.0 | 0 | 0 | 202.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
5 | 34 | 0 | 1 | 118 | 210.0 | 0 | 1 | 192.0 | 0 | 0.7 | 2 | 0 | 2 | 1 | 0 |
6 | 34 | 0 | 1 | 118 | 210.0 | 0 | 1 | 192.0 | 0 | 0.7 | 2 | 0 | 2 | 1 | 0 |
7 | 34 | 0 | 1 | 118 | 210.0 | 0 | 1 | 192.0 | 0 | 0.7 | 2 | 0 | 2 | 1 | 0 |
8 | 34 | 1 | 3 | 118 | 182.0 | 0 | 0 | 174.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
9 | 34 | 1 | 3 | 118 | 182.0 | 0 | 0 | 174.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
10 | 34 | 1 | 3 | 118 | 182.0 | 0 | 0 | 174.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
11 | 35 | 0 | 0 | 138 | 183.0 | 0 | 1 | 182.0 | 0 | 1.4 | 2 | 0 | 2 | 1 | 0 |
12 | 35 | 0 | 0 | 138 | 183.0 | 0 | 1 | 182.0 | 0 | 1.4 | 2 | 0 | 2 | 1 | 0 |
13 | 35 | 0 | 0 | 138 | 183.0 | 0 | 1 | 182.0 | 0 | 1.4 | 2 | 0 | 2 | 1 | 0 |
14 | 35 | 0 | 0 | 138 | 183.0 | 0 | 1 | 182.0 | 0 | 1.4 | 2 | 0 | 2 | 1 | 0 |
15 | 35 | 1 | 0 | 120 | 198.0 | 0 | 1 | 130.0 | 1 | 1.6 | 1 | 0 | 3 | 0 | 0 |
16 | 35 | 1 | 0 | 120 | 198.0 | 0 | 1 | 130.0 | 1 | 1.6 | 1 | 0 | 3 | 0 | 0 |
17 | 35 | 1 | 0 | 120 | 198.0 | 0 | 1 | 130.0 | 1 | 1.6 | 1 | 0 | 3 | 0 | 0 |
18 | 35 | 1 | 0 | 120 | 198.0 | 0 | 1 | 130.0 | 1 | 1.6 | 1 | 0 | 3 | 0 | 0 |
19 | 35 | 1 | 0 | 126 | 282.0 | 0 | 0 | 156.0 | 1 | 0.0 | 2 | 0 | 3 | 0 | 0 |
20 | 35 | 1 | 0 | 126 | 282.0 | 0 | 0 | 156.0 | 1 | 0.0 | 2 | 0 | 3 | 0 | 0 |
21 | 35 | 1 | 0 | 126 | 282.0 | 0 | 0 | 156.0 | 1 | 0.0 | 2 | 0 | 3 | 0 | 0 |
22 | 35 | 1 | 1 | 122 | 192.0 | 0 | 1 | 174.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
23 | 35 | 1 | 1 | 122 | 192.0 | 0 | 1 | 174.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
24 | 35 | 1 | 1 | 122 | 192.0 | 0 | 1 | 174.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
25 | 35 | 1 | 1 | 122 | 192.0 | 0 | 1 | 174.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
26 | 37 | 0 | 2 | 120 | 215.0 | 0 | 1 | 170.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
27 | 37 | 0 | 2 | 120 | 215.0 | 0 | 1 | 170.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
28 | 37 | 0 | 2 | 120 | 215.0 | 0 | 1 | 170.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
29 | 37 | 1 | 2 | 130 | 250.0 | 0 | 1 | 187.0 | 0 | 3.5 | 0 | 0 | 2 | 1 | 0 |
30 | 37 | 1 | 2 | 130 | 250.0 | 0 | 1 | 187.0 | 0 | 3.5 | 0 | 0 | 2 | 1 | 0 |
31 | 37 | 1 | 2 | 130 | 250.0 | 0 | 1 | 187.0 | 0 | 3.5 | 0 | 0 | 2 | 1 | 0 |
32 | 38 | 1 | 2 | 138 | 175.0 | 0 | 1 | 173.0 | 0 | 0.0 | 2 | 4 | 2 | 1 | 0 |
33 | 38 | 1 | 2 | 138 | 175.0 | 0 | 1 | 173.0 | 0 | 0.0 | 2 | 4 | 2 | 1 | 0 |
34 | 38 | 1 | 2 | 138 | 175.0 | 0 | 1 | 173.0 | 0 | 0.0 | 2 | 4 | 2 | 1 | 0 |
35 | 38 | 1 | 2 | 138 | 175.0 | 0 | 1 | 173.0 | 0 | 0.0 | 2 | 4 | 2 | 1 | 0 |
36 | 38 | 1 | 2 | 138 | 175.0 | 0 | 1 | 173.0 | 0 | 0.0 | 2 | 4 | 2 | 1 | 0 |
37 | 38 | 1 | 2 | 138 | 175.0 | 0 | 1 | 173.0 | 0 | 0.0 | 2 | 4 | 2 | 1 | 0 |
38 | 38 | 1 | 2 | 138 | 175.0 | 0 | 1 | 173.0 | 0 | 0.0 | 2 | 4 | 2 | 1 | 0 |
39 | 38 | 1 | 2 | 138 | 175.0 | 0 | 1 | 173.0 | 0 | 0.0 | 2 | 4 | 2 | 1 | 0 |
40 | 38 | 1 | 3 | 120 | 231.0 | 0 | 1 | 182.0 | 1 | 3.8 | 1 | 0 | 3 | 0 | 0 |
41 | 38 | 1 | 3 | 120 | 231.0 | 0 | 1 | 182.0 | 1 | 3.8 | 1 | 0 | 3 | 0 | 0 |
42 | 38 | 1 | 3 | 120 | 231.0 | 0 | 1 | 182.0 | 1 | 3.8 | 1 | 0 | 3 | 0 | 0 |
43 | 38 | 1 | 3 | 120 | 231.0 | 0 | 1 | 182.0 | 1 | 3.8 | 1 | 0 | 3 | 0 | 0 |
44 | 39 | 0 | 2 | 94 | 199.0 | 0 | 1 | 179.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
45 | 39 | 0 | 2 | 94 | 199.0 | 0 | 1 | 179.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
46 | 39 | 0 | 2 | 94 | 199.0 | 0 | 1 | 179.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
47 | 39 | 0 | 2 | 138 | 220.0 | 0 | 1 | 152.0 | 0 | 0.0 | 1 | 0 | 2 | 1 | 0 |
48 | 39 | 0 | 2 | 138 | 220.0 | 0 | 1 | 152.0 | 0 | 0.0 | 1 | 0 | 2 | 1 | 0 |
49 | 39 | 0 | 2 | 138 | 220.0 | 0 | 1 | 152.0 | 0 | 0.0 | 1 | 0 | 2 | 1 | 0 |
50 | 39 | 0 | 2 | 138 | 220.0 | 0 | 1 | 152.0 | 0 | 0.0 | 1 | 0 | 2 | 1 | 0 |
51 | 39 | 1 | 0 | 118 | 219.0 | 0 | 1 | 140.0 | 0 | 1.2 | 1 | 0 | 3 | 0 | 0 |
52 | 39 | 1 | 0 | 118 | 219.0 | 0 | 1 | 140.0 | 0 | 1.2 | 1 | 0 | 3 | 0 | 0 |
53 | 39 | 1 | 0 | 118 | 219.0 | 0 | 1 | 140.0 | 0 | 1.2 | 1 | 0 | 3 | 0 | 0 |
54 | 39 | 1 | 0 | 118 | 219.0 | 0 | 1 | 140.0 | 0 | 1.2 | 1 | 0 | 3 | 0 | 0 |
55 | 39 | 1 | 2 | 140 | 321.0 | 0 | 0 | 182.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
56 | 39 | 1 | 2 | 140 | 321.0 | 0 | 0 | 182.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
57 | 39 | 1 | 2 | 140 | 321.0 | 0 | 0 | 182.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
58 | 40 | 1 | 0 | 110 | 167.0 | 0 | 0 | 114.0 | 1 | 2.0 | 1 | 0 | 3 | 0 | 0 |
59 | 40 | 1 | 0 | 110 | 167.0 | 0 | 0 | 114.0 | 1 | 2.0 | 1 | 0 | 3 | 0 | 0 |
60 | 40 | 1 | 0 | 110 | 167.0 | 0 | 0 | 114.0 | 1 | 2.0 | 1 | 0 | 3 | 0 | 0 |
61 | 40 | 1 | 0 | 110 | 167.0 | 0 | 0 | 114.0 | 1 | 2.0 | 1 | 0 | 3 | 0 | 0 |
62 | 40 | 1 | 0 | 152 | 223.0 | 0 | 1 | 181.0 | 0 | 0.0 | 2 | 0 | 3 | 0 | 0 |
63 | 40 | 1 | 0 | 152 | 223.0 | 0 | 1 | 181.0 | 0 | 0.0 | 2 | 0 | 3 | 0 | 0 |
64 | 40 | 1 | 0 | 152 | 223.0 | 0 | 1 | 181.0 | 0 | 0.0 | 2 | 0 | 3 | 0 | 0 |
65 | 40 | 1 | 0 | 152 | 223.0 | 0 | 1 | 181.0 | 0 | 0.0 | 2 | 0 | 3 | 0 | 0 |
66 | 40 | 1 | 3 | 140 | 199.0 | 0 | 1 | 178.0 | 1 | 1.4 | 2 | 0 | 3 | 1 | 0 |
67 | 40 | 1 | 3 | 140 | 199.0 | 0 | 1 | 178.0 | 1 | 1.4 | 2 | 0 | 3 | 1 | 0 |
68 | 40 | 1 | 3 | 140 | 199.0 | 0 | 1 | 178.0 | 1 | 1.4 | 2 | 0 | 3 | 1 | 0 |
69 | 41 | 0 | 1 | 105 | 198.0 | 0 | 1 | 168.0 | 0 | 0.0 | 2 | 1 | 2 | 1 | 0 |
70 | 41 | 0 | 1 | 105 | 198.0 | 0 | 1 | 168.0 | 0 | 0.0 | 2 | 1 | 2 | 1 | 0 |
71 | 41 | 0 | 1 | 105 | 198.0 | 0 | 1 | 168.0 | 0 | 0.0 | 2 | 1 | 2 | 1 | 0 |
72 | 41 | 0 | 1 | 126 | 306.0 | 0 | 1 | 163.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
73 | 41 | 0 | 1 | 126 | 306.0 | 0 | 1 | 163.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
74 | 41 | 0 | 1 | 126 | 306.0 | 0 | 1 | 163.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
75 | 41 | 0 | 1 | 130 | 204.0 | 0 | 0 | 172.0 | 0 | 1.4 | 2 | 0 | 2 | 1 | 0 |
76 | 41 | 0 | 1 | 130 | 204.0 | 0 | 0 | 172.0 | 0 | 1.4 | 2 | 0 | 2 | 1 | 0 |
77 | 41 | 0 | 1 | 130 | 204.0 | 0 | 0 | 172.0 | 0 | 1.4 | 2 | 0 | 2 | 1 | 0 |
78 | 41 | 0 | 2 | 112 | 268.0 | 0 | 0 | 172.0 | 1 | 0.0 | 2 | 0 | 2 | 1 | 0 |
79 | 41 | 0 | 2 | 112 | 268.0 | 0 | 0 | 172.0 | 1 | 0.0 | 2 | 0 | 2 | 1 | 0 |
80 | 41 | 0 | 2 | 112 | 268.0 | 0 | 0 | 172.0 | 1 | 0.0 | 2 | 0 | 2 | 1 | 0 |
81 | 41 | 1 | 0 | 110 | 172.0 | 0 | 0 | 158.0 | 0 | 0.0 | 2 | 0 | 3 | 0 | 0 |
82 | 41 | 1 | 0 | 110 | 172.0 | 0 | 0 | 158.0 | 0 | 0.0 | 2 | 0 | 3 | 0 | 0 |
83 | 41 | 1 | 0 | 110 | 172.0 | 0 | 0 | 158.0 | 0 | 0.0 | 2 | 0 | 3 | 0 | 0 |
84 | 41 | 1 | 1 | 110 | 235.0 | 0 | 1 | 153.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
85 | 41 | 1 | 1 | 110 | 235.0 | 0 | 1 | 153.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
86 | 41 | 1 | 1 | 110 | 235.0 | 0 | 1 | 153.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
87 | 41 | 1 | 1 | 120 | 157.0 | 0 | 1 | 182.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
88 | 41 | 1 | 1 | 120 | 157.0 | 0 | 1 | 182.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
89 | 41 | 1 | 1 | 120 | 157.0 | 0 | 1 | 182.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
90 | 41 | 1 | 1 | 120 | 157.0 | 0 | 1 | 182.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
91 | 41 | 1 | 1 | 135 | 203.0 | 0 | 1 | 132.0 | 0 | 0.0 | 1 | 0 | 1 | 1 | 0 |
92 | 41 | 1 | 1 | 135 | 203.0 | 0 | 1 | 132.0 | 0 | 0.0 | 1 | 0 | 1 | 1 | 0 |
93 | 41 | 1 | 1 | 135 | 203.0 | 0 | 1 | 132.0 | 0 | 0.0 | 1 | 0 | 1 | 1 | 0 |
94 | 41 | 1 | 1 | 135 | 203.0 | 0 | 1 | 132.0 | 0 | 0.0 | 1 | 0 | 1 | 1 | 0 |
95 | 41 | 1 | 2 | 112 | 250.0 | 0 | 1 | 179.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
96 | 41 | 1 | 2 | 112 | 250.0 | 0 | 1 | 179.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
97 | 41 | 1 | 2 | 112 | 250.0 | 0 | 1 | 179.0 | 0 | 0.0 | 2 | 0 | 2 | 1 | 0 |
98 | 41 | 1 | 2 | 130 | 214.0 | 0 | 0 | 168.0 | 0 | 2.0 | 1 | 0 | 2 | 1 | 0 |
99 | 41 | 1 | 2 | 130 | 214.0 | 0 | 0 | 168.0 | 0 | 2.0 | 1 | 0 | 2 | 1 | 0 |
100 | 41 | 1 | 2 | 130 | 214.0 | 0 | 0 | 168.0 | 0 | 2.0 | 1 | 0 | 2 | 1 | 0 |
[7]:
from verticapy.learn.cluster import DBSCAN
vp.drop("dbscan_heart")
model = DBSCAN("dbscan_heart", eps = 20, min_samples = 10)
model.fit("public.heart", ["thalach", "chol"])
model.plot()
/opt/venv/lib/python3.10/site-packages/vertica_python/vertica/connection.py:659: UserWarning: [INFO] Cannot commit; no transaction in progress
warnings.warn(notice)
[7]:
[8]:
vdf_tmp = model.predict()
vdf_tmp["outliers_dbscan"] = "(dbscan_cluster = -1)::int"
vdf_tmp.scatter(["thalach", "chol"], by = "outliers_dbscan")
[8]:
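DBSCAN labels as noise (cluster -1 above) every point that is neither a core point nor within eps of a core point. The sketch below reproduces only that noise rule in plain Python on a handful of hypothetical (thalach, chol) pairs. The `dbscan_noise` helper is illustrative, and it assumes a point counts as its own neighbor, which may differ from the database implementation.

```python
import math

def dbscan_noise(points, eps, min_samples):
    """Return the indices of the points DBSCAN would label as noise."""
    # Neighborhood of each point: all points within eps (including itself).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Core points have at least min_samples points in their neighborhood.
    core = {i for i, n in enumerate(neighbors) if len(n) >= min_samples}
    # Noise: not a core point and not reachable from any core point.
    return [
        i for i, n in enumerate(neighbors)
        if i not in core and not any(j in core for j in n)
    ]

# Hypothetical (thalach, chol) pairs; the last one is far from the others.
points = [(150, 200), (152, 205), (148, 198), (155, 210),
          (151, 202), (149, 207), (100, 560)]

noise = dbscan_noise(points, eps=20, min_samples=3)
print(noise)  # indices of the points treated as outliers
```

Because eps is a raw distance, the result depends on the scale of each column; normalizing the columns first is a common precaution.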
While DBSCAN identifies outliers when computing the clusters, LOF computes an outlier score. Generally, a LOF Score greater than 1.5 indicates an outlier.
[9]:
from verticapy.learn.neighbors import LocalOutlierFactor
vp.drop("lof_heart")
model = LocalOutlierFactor("lof_heart")
model.fit("public.heart", ["thalach", "chol"])
model.plot()
[9]:
[10]:
lof_heart = model.predict()
lof_heart["outliers"] = "(CASE WHEN lof_score > 1.5 THEN 1 ELSE 0 END)"
lof_heart.scatter(["thalach", "chol"], by = "outliers")
[10]:
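To show where the LOF score comes from, here is a compact plain-Python sketch of the standard definition: each point's local reachability density is compared with that of its k nearest neighbors. The `lof_scores` helper, the choice k = 3, and the sample points are hypothetical. The sketch ignores distance ties and assumes no group of more than k identical points, which would make a density infinite.

```python
import math

def lof_scores(points, k):
    """Return the local outlier factor of every point (basic O(n^2) version)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    knn, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        knn.append(order[:k])                  # the k nearest neighbors of i
        k_dist.append(dist[i][order[k - 1]])   # distance to the k-th neighbor
    # Local reachability density: inverse of the mean reachability distance,
    # where reach(i, j) = max(k_dist[j], dist[i][j]).
    lrd = [
        k / sum(max(k_dist[j], dist[i][j]) for j in knn[i])
        for i in range(n)
    ]
    # LOF: average density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical (thalach, chol) pairs; the last one is far from the others.
points = [(150, 200), (152, 205), (148, 198), (155, 210),
          (151, 202), (149, 207), (100, 560)]

scores = lof_scores(points, k=3)
flagged = [i for i, s in enumerate(scores) if s > 1.5]
print(flagged)  # indices whose LOF score exceeds 1.5
```

Scores close to 1 mean a point is about as dense as its neighbors, and scores well above 1 mean it sits in a much sparser region.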
Many other techniques, such as k-means clustering, can also be used to find outliers, but the Z-Score remains the most important method. After identifying outliers, we have to decide how to handle them, for example by imputing them as we did with ‘fill_outliers’. We’ll focus on missing values in the next lesson.