vDataFrame.balance

In [ ]:
vDataFrame.balance(column: str, 
                   method: str = "hybrid", 
                   x: float = 0.5, 
                   order_by: list = [],)

Balances the dataset using the input method.

⚠ Warning: If the data is not sorted, the generated SQL code may differ between attempts.

Parameters

column : str
    Column used to compute the different categories.
method : str, optional
    The method with which to sample the data.
      • hybrid : hybrid sampling.
      • over : oversampling.
      • under : undersampling.
x : float, optional
    The desired ratio between the majority class and minority classes. Only used when method is 'over' or 'under'.
order_by : list, optional
    vColumns used to sort the data.
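
To make the role of x more concrete, here is a minimal plain-Python sketch of undersampling. It is not VerticaPy's implementation: the helper name undersample, the seed argument, and the reading of x as "minority count divided by each larger class's count" are all assumptions made for illustration. The class sizes are the titanic["embarked"] counts from the example on this page.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, key, x=0.5, seed=0):
    """Hypothetical helper: shrink every class larger than minority / x.

    Assumption: x is read as minority_count / class_count. This is an
    illustration of the idea, not the SQL that VerticaPy generates.
    """
    rng = random.Random(seed)

    # Group the rows by their category value.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    minority = min(len(members) for members in groups.values())
    cap = int(minority / x)  # largest size any class may keep

    sampled = []
    for label in sorted(groups):
        members = groups[label]
        if len(members) > cap:
            # Randomly keep only `cap` rows of an over-represented class.
            members = rng.sample(members, cap)
        sampled.extend(members)
    return sampled


# Class sizes taken from titanic["embarked"].topk() in the example.
rows = (
    [{"embarked": "S"}] * 873
    + [{"embarked": "C"}] * 253
    + [{"embarked": "Q"}] * 106
)
balanced = undersample(rows, "embarked", x=0.5)
counts = Counter(row["embarked"] for row in balanced)
print(counts)
```

Under this reading, both larger classes are cut to 212 rows while the 106 rows of Q are kept. The 'under' example on this page reports 220 and 205 rows for S and C, close to but not exactly 212, so the library's sampling is evidently not this exact procedure.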

Returns

vDataFrame : the sampled vDataFrame

Example

In [5]:
from verticapy.datasets import load_titanic
titanic = load_titanic()

# minority class is Q
titanic["embarked"].topk()
Out[5]:
     count  percent
S      873   70.746
C      253   20.502
Q      106    8.59
Rows: 1-3 | Columns: 3
In [7]:
# hybrid

balance = titanic.balance(column = "embarked")
balance["embarked"].topk()
Out[7]:
     count  percent
C      111   34.472
Q      106   32.919
S      105   32.609
Rows: 1-3 | Columns: 3
In [8]:
# over

balance = titanic.balance(column = "embarked", method = "over", x = 0.5)
balance["embarked"].topk()
Out[8]:
     count  percent
Q      106   47.964
S       61   27.602
C       54   24.434
Rows: 1-3 | Columns: 3
In [9]:
# under

balance = titanic.balance(column = "embarked", method = "under", x = 0.5)
balance["embarked"].topk()
Out[9]:
     count  percent
S      220   41.431
C      205   38.606
Q      106   19.962
Rows: 1-3 | Columns: 3
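
One way to compare the four outputs above is to compute, for each, the smallest class count divided by the largest. The sketch below does this in plain Python with the counts copied from the topk() results; imbalance_ratio is a hypothetical helper written for this page, not a VerticaPy function, and the min/max ratio is only one possible way to summarize balance.

```python
def imbalance_ratio(counts):
    """Hypothetical helper: smallest class count divided by the largest."""
    return min(counts.values()) / max(counts.values())


# Counts copied from the topk() outputs in the examples above.
outputs = {
    "original": {"S": 873, "C": 253, "Q": 106},
    "hybrid": {"C": 111, "Q": 106, "S": 105},
    "over": {"Q": 106, "S": 61, "C": 54},
    "under": {"S": 220, "C": 205, "Q": 106},
}

ratios = {name: round(imbalance_ratio(c), 3) for name, c in outputs.items()}
for name, ratio in ratios.items():
    print(f"{name:<9}{ratio}")
```

The ratio rises from about 0.12 in the original data to about 0.95 with the hybrid method, and to roughly 0.5 for the 'over' and 'under' calls, both of which used x = 0.5.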