
verticapy.vDataFrame.duplicated

vDataFrame.duplicated(columns: str | list[str] | None = None, count: bool = False, limit: int = 30) → TableSample

This function returns the rows (combinations of column values) that occur more than once in the dataset. It shows you which specific entries are duplicated, helping you detect and manage data redundancy and issues related to duplicate information.

Warning

This function employs the ROW_NUMBER SQL window function with the selected columns as partition criteria. ROW_NUMBER assigns a unique rank to each row within its partition, so the more columns are involved in partitioning, the more complex and resource-intensive the operation becomes. When partitioning on a large number of columns, be mindful of the potential performance cost.
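To illustrate the partitioning idea (a plain-Python analogue, not VerticaPy's actual implementation), numbering rows within each group of identical key values mimics ROW_NUMBER over a partition; any row whose number exceeds 1 duplicates an earlier row:

```python
from collections import defaultdict

# Toy rows; each tuple plays the role of the partition key columns.
rows = [("a", 1), ("b", 2), ("a", 1), ("a", 1), ("b", 3)]

# Assign a rank within each partition (group of equal key values),
# mimicking ROW_NUMBER() OVER (PARTITION BY ...).
seen = defaultdict(int)
ranked = []
for row in rows:
    seen[row] += 1
    ranked.append((row, seen[row]))

# Rows ranked above 1 are duplicates of an earlier row.
dups = [row for row, rank in ranked if rank > 1]
print(dups)  # [('a', 1), ('a', 1)]
```

The cost grows with the number of partition columns because every additional column widens the key that must be compared and grouped.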

Parameters

columns: SQLColumns, optional

List of the vDataColumn names. If empty, all vDataColumns are selected.

count: bool, optional

If set to True, the method also returns the count of each duplicate.

limit: int, optional

Sets a limit on the number of elements to be displayed.

Returns

TableSample

result.

Examples

For this example, we will use the following dataset:

import verticapy as vp

data = vp.vDataFrame(
    {
        "x": [1, 2, 4, 15, 1, 15, 20, 1],
        "y": [1, 2, 1, 1, 1, 1, 2, 1],
        "z": [10, 12, 9, 10, 9, 8, 1, 10],
    }
)

Now, let’s find duplicated rows.

data.duplicated(
    columns = ["x", "y", "z"],
)
x (Integer) | y (Integer) | z (Integer) | occurrence (Integer)
1           | 1           | 10          | 2
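The same result can be verified in plain Python (a sketch independent of VerticaPy): counting identical (x, y, z) tuples shows that only (1, 1, 10) occurs more than once.

```python
from collections import Counter

# The example dataset expressed as (x, y, z) tuples.
rows = [
    (1, 1, 10), (2, 2, 12), (4, 1, 9), (15, 1, 10),
    (1, 1, 9), (15, 1, 8), (20, 2, 1), (1, 1, 10),
]

# Keep only tuples occurring more than once, with their counts,
# matching the (x, y, z, occurrence) row shown above.
dup_rows = [row + (n,) for row, n in Counter(rows).items() if n > 1]
print(dup_rows)  # [(1, 1, 10, 2)]
```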

Note

All the calculations are pushed to the database.

Hint

For more precise control, please refer to the aggregate method.

See also

vDataColumn.nunique() : Cardinality for a specific column.
vDataFrame.nunique() : Cardinality for multiple columns.