
Duplicates
When merging different data sources, we're likely to end up with duplicates that can introduce bias and skew into our data. Just imagine running a Telco marketing campaign without removing your duplicates: you'll end up targeting the same person multiple times!
Let's use the Iris dataset to understand the tools VerticaPy gives you for handling duplicate values.
from verticapy.datasets import load_iris
vdf = load_iris()
vdf = vdf.append(load_iris().sample(3)) # adding some duplicates
display(vdf)
To find all the duplicates, you can use the 'duplicated' method.
vdf.duplicated()
As you might expect, some flowers share exactly the same characteristics. But we have to be careful: this doesn't mean that they are true duplicates, so in this case, we don't need to drop them.
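Conceptually, duplicate detection groups rows by all of their columns and flags any group with more than one member. The following pandas sketch illustrates that idea on a toy iris-like frame (pandas stands in here purely for illustration; VerticaPy's 'duplicated' pushes the equivalent logic down to Vertica as SQL):

```python
import pandas as pd

# Toy iris-like data with one exact duplicate row (rows 0 and 2)
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 5.1],
    "SepalWidthCm":  [3.5, 3.0, 3.5],
    "Species": ["setosa", "versicolor", "setosa"],
})

# Group on every column and count occurrences:
# any group with a count greater than 1 is a set of duplicate rows
dup_counts = (
    df.groupby(list(df.columns))
      .size()
      .reset_index(name="occurrence")
)
duplicates = dup_counts[dup_counts["occurrence"] > 1]
print(duplicates)
```

Only the row shared by indexes 0 and 2 is flagged, with an occurrence count of 2.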
That said, if we did want to drop these duplicates, we could do so with the 'drop_duplicates' method.
vdf.drop_duplicates()
Using this method adds an advanced analytical function to the generated SQL, which is quite expensive. You should only use this method after aggregating the data to avoid stacking heavy computations on top of each other.
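The "advanced analytical function" in question is a window function: a common SQL deduplication pattern computes ROW_NUMBER() over a partition of all the columns and keeps only the first row of each partition. A standalone sketch of that pattern using SQLite (SQLite is an assumption for illustration; VerticaPy generates comparable SQL for Vertica, and window functions require SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (sepal_length REAL, sepal_width REAL, species TEXT)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?, ?)",
    [(5.1, 3.5, "setosa"), (4.9, 3.0, "versicolor"), (5.1, 3.5, "setosa")],
)

# Number the rows within each group of identical rows, then keep only row 1.
# ROW_NUMBER() is the kind of analytical (window) function that makes this
# costly on large tables: it has to partition and sort the whole dataset.
rows = conn.execute("""
    SELECT sepal_length, sepal_width, species FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY sepal_length, sepal_width, species
        ) AS rn
        FROM iris
    ) WHERE rn = 1
""").fetchall()
print(rows)  # the duplicate (5.1, 3.5, 'setosa') survives only once
```

This is why the cost warning above matters: the window function forces a full partition-and-sort over every column of the table before a single row can be filtered out.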