
verticapy.vDataFrame.duplicated

vDataFrame.duplicated(columns: str | list[str] | None = None, count: bool = False, limit: int = 30) → TableSample

This function returns the rows (combinations of column values) that occur more than once in the dataset. It shows you which specific entries are duplicated, helping you detect and manage data redundancy and issues related to duplicate information.

Warning

This function employs the ROW_NUMBER SQL window function with the selected columns as partition criteria. ROW_NUMBER assigns a unique rank to each row within its partition, so the more columns are involved in partitioning, the more complex and resource-intensive the operation becomes. When partitioning on a large number of columns, be mindful of the potential performance cost.
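To illustrate the partitioning idea (a plain-Python analogue, not VerticaPy's actual implementation), numbering rows within each group of identical key values mimics ROW_NUMBER over a partition; any row whose number exceeds 1 duplicates an earlier row:

```python
from collections import defaultdict

# Toy rows; each tuple plays the role of the partition key columns.
rows = [("a", 1), ("b", 2), ("a", 1), ("a", 1), ("b", 3)]

# Assign a rank within each partition (group of equal key values),
# mimicking ROW_NUMBER() OVER (PARTITION BY ...).
seen = defaultdict(int)
ranked = []
for row in rows:
    seen[row] += 1
    ranked.append((row, seen[row]))

# Rows ranked above 1 are duplicates of an earlier row.
dups = [row for row, rank in ranked if rank > 1]
print(dups)  # [('a', 1), ('a', 1)]
```

The cost grows with the number of partition columns because every additional column widens the key that must be compared and grouped.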

Parameters

columns: SQLColumns, optional

List of the vDataColumn names. If empty, all vDataColumns are selected.

count: bool, optional

If set to True, the method also returns the count of each duplicate.

limit: int, optional

Sets a limit on the number of elements to be displayed.

Returns

TableSample

result.

Examples

For this example, we will use the following dataset:

import verticapy as vp

data = vp.vDataFrame(
    {
        "x": [1, 2, 4, 15, 1, 15, 20, 1],
        "y": [1, 2, 1, 1, 1, 1, 2, 1],
        "z": [10, 12, 9, 10, 9, 8, 1, 10],
    }
)

Now, let’s find duplicated rows.

data.duplicated(
    columns = ["x", "y", "z"],
)
x (Integer) | y (Integer) | z (Integer) | occurrence (Integer)
1           | 1           | 10          | 2
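The same result can be verified in plain Python (a sketch independent of VerticaPy): counting identical (x, y, z) tuples shows that only (1, 1, 10) occurs more than once.

```python
from collections import Counter

# The example dataset expressed as (x, y, z) tuples.
rows = [
    (1, 1, 10), (2, 2, 12), (4, 1, 9), (15, 1, 10),
    (1, 1, 9), (15, 1, 8), (20, 2, 1), (1, 1, 10),
]

# Keep only tuples occurring more than once, with their counts,
# matching the (x, y, z, occurrence) row shown above.
dup_rows = [row + (n,) for row, n in Counter(rows).items() if n > 1]
print(dup_rows)  # [(1, 1, 10, 2)]
```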

Note

All the calculations are pushed to the database.

Hint

For more precise control, please refer to the aggregate method.

See also

vDataColumn.nunique() : Cardinality for a specific column.
vDataFrame.nunique() : Cardinality for multiple columns.