verticapy.vDataFrame.duplicated#
- vDataFrame.duplicated(columns: str | list[str] | None = None, count: bool = False, limit: int = 30) → TableSample#
This function returns the combinations of values that occur more than once within the selected columns. It identifies which specific entries are duplicated, helping to detect and manage data redundancy and potential issues caused by duplicate information.
Warning
This function employs the ROW_NUMBER SQL function with multiple partition criteria. Note that as the number of partition columns increases, the computational cost can rise significantly. The ROW_NUMBER function assigns a unique rank to each row within its partition, so the more columns are involved in partitioning, the more complex and resource-intensive the operation becomes. Be mindful of the potential performance implications when partitioning on a large number of columns.
Parameters#
- columns: SQLColumns, optional
List of the vDataColumns names. If empty, all vDataColumns are selected.
- count: bool, optional
If set to True, the method also returns the count of each duplicate.
- limit: int, optional
Sets a limit on the number of elements to be displayed.
Returns#
- TableSample
result.
Examples#
For this example, we will use the following dataset:
import verticapy as vp

data = vp.vDataFrame(
    {
        "x": [1, 2, 4, 15, 1, 15, 20, 1],
        "y": [1, 2, 1, 1, 1, 1, 2, 1],
        "z": [10, 12, 9, 10, 9, 8, 1, 10],
    }
)
Now, let’s find duplicated rows.
data.duplicated(
    columns = ["x", "y", "z"],
)
    x (Integer) | ... | z (Integer) | occurrence (Integer)
1   1           | ... | 10          | 2
Note
All the calculations are pushed to the database.
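As a database-free sketch (plain Python with the example data above, not VerticaPy's actual implementation), finding duplicated rows amounts to counting identical (x, y, z) tuples and keeping those that occur more than once:

```python
from collections import Counter

# Same example data as above, as (x, y, z) row tuples.
rows = list(zip(
    [1, 2, 4, 15, 1, 15, 20, 1],   # x
    [1, 2, 1, 1, 1, 1, 2, 1],      # y
    [10, 12, 9, 10, 9, 8, 1, 10],  # z
))

# Keep only the tuples seen more than once, with their counts
# (mirrors the "occurrence" column in the result above).
duplicates = {row: n for row, n in Counter(rows).items() if n > 1}
print(duplicates)  # {(1, 1, 10): 2}
```

This matches the result above: the row (1, 1, 10) appears twice in the dataset, so it is reported with an occurrence of 2.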
Hint
For more precise control, please refer to the aggregate method.
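The ROW_NUMBER mechanics mentioned in the warning can be sketched in plain Python (illustrative only; in practice the work is done as SQL inside the database):

```python
from collections import defaultdict

def row_numbers(rows):
    """Assign a 1-based rank to each row within its partition,
    mimicking ROW_NUMBER() OVER (PARTITION BY <all columns>)."""
    seen = defaultdict(int)
    ranked = []
    for row in rows:
        seen[row] += 1
        ranked.append((row, seen[row]))
    return ranked

# A row whose rank exceeds 1 is a duplicate of an earlier row.
rows = [(1, 1, 10), (2, 2, 12), (1, 1, 10)]
dup_rows = [row for row, rn in row_numbers(rows) if rn > 1]
print(dup_rows)  # [(1, 1, 10)]
```

Each distinct combination of partition-column values forms its own partition, which is why the cost grows with the number of columns involved.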