verticapy.vDataFrame.add_duplicates#
- vDataFrame.add_duplicates(weight: int | str, use_gcd: bool = True) vDataFrame #
Duplicates the
vDataFrame
using the input weight.Parameters#
- weight: str | integer
vDataColumn
orinteger
representing the weight.- use_gcd: bool
If set to True, uses the GCD (Greatest Common Divisor) to reduce all common weights to avoid unnecessary duplicates.
Returns#
- vDataFrame
the output
vDataFrame
.
Examples#
Let’s begin by importing VerticaPy.
import verticapy as vp
Hint
By assigning an alias to
verticapy
, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions fromverticapy
are used as intended without interfering with functions from other libraries.Let us create a
vDataFrame
with multiple columns:vdf = vp.vDataFrame( { "cats": ["A", "B", "C"], "reps": [2, 4, 8], }, )
AbccatsVarchar(1)100%123repsInteger100%1 A 2 2 B 4 3 C 8 We can add duplicates by the weight column:
vdf.add_duplicates("reps")
AbccatsVarchar(1)100%1 A 2 B 3 C 4 B 5 C 6 C 7 C Note
VerticaPy will find the greatest common divisor (gcd) of the weight column to normalize the weights by it, ensuring a meaningful minimum number of occurrences. It will then duplicate the different values. This function can be highly valuable in machine learning for preprocessing and increasing the weight of specific rows.
See also
vDataFrame.
sample()
: Sampling the Dataset.