Duplicates#

When merging different data sources, we’re likely to end up with duplicates that can add a lot of bias to and skew our data. Just imagine running a Telco marketing campaign and not removing your duplicates: you’ll end up targeting the same person multiple times!

Let’s use the Iris dataset to understand the tools VerticaPy gives you for handling duplicate values.

[28]:
from verticapy.datasets import load_iris
vdf = load_iris()
vdf = vdf.append(load_iris().sample(3)) # adding some duplicates
display(vdf)
123
Id
Integer
123
PetalLengthCm
Numeric(6,3)
123
PetalWidthCm
Numeric(6,3)
123
SepalLengthCm
Numeric(6,3)
123
SepalWidthCm
Numeric(6,3)
Abc
Species
Varchar(30)
111.40.25.13.5Iris-setosa
221.40.24.93.0Iris-setosa
331.30.24.73.2Iris-setosa
441.50.24.63.1Iris-setosa
551.40.25.03.6Iris-setosa
661.70.45.43.9Iris-setosa
771.40.34.63.4Iris-setosa
881.50.25.03.4Iris-setosa
991.40.24.42.9Iris-setosa
10101.50.14.93.1Iris-setosa
11111.50.25.43.7Iris-setosa
12121.60.24.83.4Iris-setosa
13131.40.14.83.0Iris-setosa
14141.10.14.33.0Iris-setosa
15151.20.25.84.0Iris-setosa
16161.50.45.74.4Iris-setosa
17171.30.45.43.9Iris-setosa
18181.40.35.13.5Iris-setosa
19191.70.35.73.8Iris-setosa
20201.50.35.13.8Iris-setosa
21211.70.25.43.4Iris-setosa
22221.50.45.13.7Iris-setosa
23231.00.24.63.6Iris-setosa
24241.70.55.13.3Iris-setosa
25251.90.24.83.4Iris-setosa
26261.60.25.03.0Iris-setosa
27271.60.45.03.4Iris-setosa
28281.50.25.23.5Iris-setosa
29291.40.25.23.4Iris-setosa
30301.60.24.73.2Iris-setosa
31311.60.24.83.1Iris-setosa
32321.50.45.43.4Iris-setosa
33331.50.15.24.1Iris-setosa
34341.40.25.54.2Iris-setosa
35351.50.14.93.1Iris-setosa
36361.20.25.03.2Iris-setosa
37371.30.25.53.5Iris-setosa
38381.50.14.93.1Iris-setosa
39391.30.24.43.0Iris-setosa
40401.50.25.13.4Iris-setosa
41411.30.35.03.5Iris-setosa
42421.30.34.52.3Iris-setosa
43431.30.24.43.2Iris-setosa
44441.60.65.03.5Iris-setosa
45451.90.45.13.8Iris-setosa
46461.40.34.83.0Iris-setosa
47471.60.25.13.8Iris-setosa
48481.40.24.63.2Iris-setosa
49491.50.25.33.7Iris-setosa
50501.40.25.03.3Iris-setosa
51514.71.47.03.2Iris-versicolor
52524.51.56.43.2Iris-versicolor
53534.91.56.93.1Iris-versicolor
54544.01.35.52.3Iris-versicolor
55554.61.56.52.8Iris-versicolor
56564.51.35.72.8Iris-versicolor
57574.71.66.33.3Iris-versicolor
58583.31.04.92.4Iris-versicolor
59594.61.36.62.9Iris-versicolor
60603.91.45.22.7Iris-versicolor
61613.51.05.02.0Iris-versicolor
62624.21.55.93.0Iris-versicolor
63634.01.06.02.2Iris-versicolor
64644.71.46.12.9Iris-versicolor
65653.61.35.62.9Iris-versicolor
66664.41.46.73.1Iris-versicolor
67674.51.55.63.0Iris-versicolor
68684.11.05.82.7Iris-versicolor
69694.51.56.22.2Iris-versicolor
70703.91.15.62.5Iris-versicolor
71714.81.85.93.2Iris-versicolor
72724.01.36.12.8Iris-versicolor
73734.91.56.32.5Iris-versicolor
74744.71.26.12.8Iris-versicolor
75754.31.36.42.9Iris-versicolor
76764.41.46.63.0Iris-versicolor
77774.81.46.82.8Iris-versicolor
78785.01.76.73.0Iris-versicolor
79794.51.56.02.9Iris-versicolor
80803.51.05.72.6Iris-versicolor
81813.81.15.52.4Iris-versicolor
82823.71.05.52.4Iris-versicolor
83833.91.25.82.7Iris-versicolor
84845.11.66.02.7Iris-versicolor
85854.51.55.43.0Iris-versicolor
86864.51.66.03.4Iris-versicolor
87874.71.56.73.1Iris-versicolor
88884.41.36.32.3Iris-versicolor
89894.11.35.63.0Iris-versicolor
90904.01.35.52.5Iris-versicolor
91914.41.25.52.6Iris-versicolor
92924.61.46.13.0Iris-versicolor
93934.01.25.82.6Iris-versicolor
94943.31.05.02.3Iris-versicolor
95954.21.35.62.7Iris-versicolor
96964.21.25.73.0Iris-versicolor
97974.21.35.72.9Iris-versicolor
98984.31.36.22.9Iris-versicolor
99993.01.15.12.5Iris-versicolor
1001004.11.35.72.8Iris-versicolor
Rows: 1-100 | Columns: 6

To find all the duplicates, you can use the ‘duplicated’ method.

[29]:
vdf.duplicated()
[29]:
123
Id
Integer
123
PetalLengthCm
Numeric(6,3)
123
PetalWidthCm
Numeric(6,3)
123
SepalLengthCm
Numeric(6,3)
123
SepalWidthCm
Numeric(6,3)
Abc
Species
Varchar(30)
123
occurrence
Integer
1331.50.15.24.1Iris-setosa2
2703.91.15.62.5Iris-versicolor2
31495.42.36.23.4Iris-virginica2
Rows: 3 | Columns: 7

As you might expect, some flowers might share the exact same characteristics. But we have to be careful; this doesn’t mean that they are real duplicates. In this case, we don’t have to drop them.

That said, if we did want to drop these duplicates, we can do so with the ‘drop_duplicates’ method.

[30]:
vdf.drop_duplicates()
3 elements were filtered.
[30]:
123
Id
Integer
123
PetalLengthCm
Numeric(6,3)
123
PetalWidthCm
Numeric(6,3)
123
SepalLengthCm
Numeric(6,3)
123
SepalWidthCm
Numeric(6,3)
Abc
Species
Varchar(30)
111.40.25.13.5Iris-setosa
221.40.24.93.0Iris-setosa
331.30.24.73.2Iris-setosa
441.50.24.63.1Iris-setosa
551.40.25.03.6Iris-setosa
661.70.45.43.9Iris-setosa
771.40.34.63.4Iris-setosa
881.50.25.03.4Iris-setosa
991.40.24.42.9Iris-setosa
10101.50.14.93.1Iris-setosa
11111.50.25.43.7Iris-setosa
12121.60.24.83.4Iris-setosa
13131.40.14.83.0Iris-setosa
14141.10.14.33.0Iris-setosa
15151.20.25.84.0Iris-setosa
16161.50.45.74.4Iris-setosa
17171.30.45.43.9Iris-setosa
18181.40.35.13.5Iris-setosa
19191.70.35.73.8Iris-setosa
20201.50.35.13.8Iris-setosa
21211.70.25.43.4Iris-setosa
22221.50.45.13.7Iris-setosa
23231.00.24.63.6Iris-setosa
24241.70.55.13.3Iris-setosa
25251.90.24.83.4Iris-setosa
26261.60.25.03.0Iris-setosa
27271.60.45.03.4Iris-setosa
28281.50.25.23.5Iris-setosa
29291.40.25.23.4Iris-setosa
30301.60.24.73.2Iris-setosa
31311.60.24.83.1Iris-setosa
32321.50.45.43.4Iris-setosa
33331.50.15.24.1Iris-setosa
34341.40.25.54.2Iris-setosa
35351.50.14.93.1Iris-setosa
36361.20.25.03.2Iris-setosa
37371.30.25.53.5Iris-setosa
38381.50.14.93.1Iris-setosa
39391.30.24.43.0Iris-setosa
40401.50.25.13.4Iris-setosa
41411.30.35.03.5Iris-setosa
42421.30.34.52.3Iris-setosa
43431.30.24.43.2Iris-setosa
44441.60.65.03.5Iris-setosa
45451.90.45.13.8Iris-setosa
46461.40.34.83.0Iris-setosa
47471.60.25.13.8Iris-setosa
48481.40.24.63.2Iris-setosa
49491.50.25.33.7Iris-setosa
50501.40.25.03.3Iris-setosa
51514.71.47.03.2Iris-versicolor
52524.51.56.43.2Iris-versicolor
53534.91.56.93.1Iris-versicolor
54544.01.35.52.3Iris-versicolor
55554.61.56.52.8Iris-versicolor
56564.51.35.72.8Iris-versicolor
57574.71.66.33.3Iris-versicolor
58583.31.04.92.4Iris-versicolor
59594.61.36.62.9Iris-versicolor
60603.91.45.22.7Iris-versicolor
61613.51.05.02.0Iris-versicolor
62624.21.55.93.0Iris-versicolor
63634.01.06.02.2Iris-versicolor
64644.71.46.12.9Iris-versicolor
65653.61.35.62.9Iris-versicolor
66664.41.46.73.1Iris-versicolor
67674.51.55.63.0Iris-versicolor
68684.11.05.82.7Iris-versicolor
69694.51.56.22.2Iris-versicolor
70703.91.15.62.5Iris-versicolor
71714.81.85.93.2Iris-versicolor
72724.01.36.12.8Iris-versicolor
73734.91.56.32.5Iris-versicolor
74744.71.26.12.8Iris-versicolor
75754.31.36.42.9Iris-versicolor
76764.41.46.63.0Iris-versicolor
77774.81.46.82.8Iris-versicolor
78785.01.76.73.0Iris-versicolor
79794.51.56.02.9Iris-versicolor
80803.51.05.72.6Iris-versicolor
81813.81.15.52.4Iris-versicolor
82823.71.05.52.4Iris-versicolor
83833.91.25.82.7Iris-versicolor
84845.11.66.02.7Iris-versicolor
85854.51.55.43.0Iris-versicolor
86864.51.66.03.4Iris-versicolor
87874.71.56.73.1Iris-versicolor
88884.41.36.32.3Iris-versicolor
89894.11.35.63.0Iris-versicolor
90904.01.35.52.5Iris-versicolor
91914.41.25.52.6Iris-versicolor
92924.61.46.13.0Iris-versicolor
93934.01.25.82.6Iris-versicolor
94943.31.05.02.3Iris-versicolor
95954.21.35.62.7Iris-versicolor
96964.21.25.73.0Iris-versicolor
97974.21.35.72.9Iris-versicolor
98984.31.36.22.9Iris-versicolor
99993.01.15.12.5Iris-versicolor
1001004.11.35.72.8Iris-versicolor
Rows: 1-100 of 150 | Columns: 6

Using this method will add an advanced analytical function to the SQL code generation which is quite expensive. You should only use this method after aggregating the data to avoid stacking heavy computations on top of each other.