
verticapy.vDataFrame.add_duplicates

vDataFrame.add_duplicates(weight: int | str, use_gcd: bool = True) → vDataFrame

Duplicates the vDataFrame using the input weight.

Parameters

weight: int | str

Name of the vDataColumn holding the weights, or an integer weight.

use_gcd: bool

If set to True, divides all weights by their GCD (Greatest Common Divisor) before duplicating, which avoids unnecessary duplicates.

Returns

vDataFrame

The output vDataFrame.

Examples

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

Assigning an alias to verticapy reduces the risk of name collisions with other libraries. verticapy uses common function names such as “average” and “median”, which can conflict with identically named functions elsewhere. Calling them through the alias makes it clear which library is being used.

Let us create a vDataFrame with multiple columns:

vdf = vp.vDataFrame(
    {
        "cats": ["A", "B", "C"],
        "reps": [2, 4, 8],
    },
)

    cats        reps
    Varchar(1)  Integer
1   A           2
2   B           4
3   C           8

We can add duplicates by the weight column:

vdf.add_duplicates("reps")

    cats
    Varchar(1)
1   A
2   B
3   C
4   B
5   C
6   C
7   C

Note

When use_gcd is True, VerticaPy computes the greatest common divisor (GCD) of the weight column and divides every weight by it before duplicating the rows. In the example above, the weights 2, 4, and 8 have a GCD of 2, so they are reduced to 1, 2, and 4, and the result has 7 rows instead of 14. The relative proportions between rows are preserved. This function is useful in machine learning preprocessing for increasing the weight of specific rows.
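The GCD reduction described in the note can be sketched in plain Python. This is only an illustration of the arithmetic, not VerticaPy's implementation: the helper `duplicate_rows` is a hypothetical name, it works on in-memory lists rather than a vDataFrame, and the order of its output rows is not meant to match the order returned by the database.

```python
from functools import reduce
from math import gcd


def duplicate_rows(rows, weights, use_gcd=True):
    """Repeat each row according to its integer weight.

    Hypothetical helper for illustration only; it is not part of VerticaPy.
    """
    if use_gcd:
        # GCD of all weights; 0 if the list is empty or all weights are 0.
        divisor = reduce(gcd, weights, 0)
        if divisor > 0:
            weights = [w // divisor for w in weights]
    # Emit each row as many times as its (possibly reduced) weight.
    return [row for row, w in zip(rows, weights) for _ in range(w)]


cats = ["A", "B", "C"]
reps = [2, 4, 8]

reduced = duplicate_rows(cats, reps)               # weights become 1, 2, 4
full = duplicate_rows(cats, reps, use_gcd=False)   # weights stay 2, 4, 8

print(reduced)
print(len(reduced), len(full))
```

With the weights reduced, "A", "B", and "C" appear 1, 2, and 4 times, for 7 rows in total. Without the reduction, the same proportions take 14 rows.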

See also

vDataFrame.sample() : Sampling the Dataset.