verticapy.vDataFrame.balance

vDataFrame.balance(column: str, method: Literal['over', 'under'] = 'under', x: float = 0.5, order_by: str | list[str] | None = None) → vDataFrame

Balances the dataset using the input method.

Warning

If the data is not sorted, the generated SQL code may differ between attempts.

Parameters

column: str

Column used to compute the different categories.

method: str, optional

The method with which to sample the data.

over:
Oversampling.

under:
Undersampling.

x: float, optional

The desired ratio between the majority class and minority classes.

order_by: SQLColumns, optional

vDataColumns used to sort the data.
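To make the roles of method and x concrete, here is a minimal plain-Python sketch of generic random under- and oversampling. It is not VerticaPy's implementation, which generates SQL: the helper name balance_rows, the seed argument, and the interpretation of x as the target minority-to-majority ratio are all assumptions made for illustration. That interpretation does agree with the undersampling example further down this page.

```python
import math
import random
from collections import Counter, defaultdict


def balance_rows(rows, column, method="under", x=0.5, seed=0):
    """Hypothetical helper: balance a list of dicts by class (illustration only)."""
    if not 0 < x <= 1:
        raise ValueError("x must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so repeated runs agree

    # Group the rows by the value of the class column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    sizes = [len(group) for group in groups.values()]
    smallest, largest = min(sizes), max(sizes)

    balanced = []
    for group in groups.values():
        if method == "under":
            # Assumption: cap each class so that smallest / size >= x.
            cap = math.floor(smallest / x)
            balanced.extend(rng.sample(group, min(len(group), cap)))
        elif method == "over":
            # Assumption: grow small classes (sampling with replacement)
            # so that size / largest >= x.
            target = math.ceil(x * largest)
            extra = max(0, target - len(group))
            balanced.extend(group + rng.choices(group, k=extra))
        else:
            raise ValueError("method must be 'over' or 'under'")
    return balanced


rows = [{"category": c} for c in [0] * 9 + [1] * 2]
under = Counter(r["category"] for r in balance_rows(rows, "category", "under"))
over = Counter(r["category"] for r in balance_rows(rows, "category", "over"))
print(under[0], under[1])  # 4 2
print(over[0], over[1])    # 9 5
```

Under this reading, undersampling shrinks the large class while oversampling duplicates rows of the small class, so the two methods trade lost data against repeated data.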

Returns

vDataFrame: balanced vDataFrame

Examples

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will create a toy imbalanced dataset:

vdf = vp.vDataFrame(
    {
        "category" : [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
        "val": [12, 12, 14, 15, 10, 9, 10, 12, 12, 14, 16],
    }
)

	category (Integer)	val (Integer)
1	0	12
2	0	12
3	0	14
4	0	15
5	0	10
6	0	9
7	0	10
8	0	12
9	0	12
10	1	14
11	1	16

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets section, which provides detailed information on each dataset and how to use it effectively. These datasets are useful resources for practicing data analysis and machine learning within the VerticaPy environment.

In the above data, there are many more 0s than 1s in the category column. We can plot a histogram to visualize the imbalance:

vdf["category"].hist()

Now we can use the balance function to fix this:

balanced_vdf = vdf.balance(column="category", x=0.5)

	category (Integer)	val (Integer)
1	1	14
2	1	16
3	0	12
4	0	14
5	0	10
6	0	12

Note

By setting x to 0.5, we ensure that the ratio between the minority and majority classes is no more skewed than 0.5. Here, the 2 rows of class 1 are kept and class 0 is reduced from 9 rows to 4.
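The effect of x can be checked by hand from the two tables shown on this page. The sketch below uses only the standard library; class_ratio is a hypothetical helper name, and the label lists are copied from the category column before and after balancing.

```python
from collections import Counter


def class_ratio(labels):
    """Return the smallest class count divided by the largest class count."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# "category" values copied from the tables above.
before = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
after = [1, 1, 0, 0, 0, 0]

print(round(class_ratio(before), 3))  # 0.222
print(class_ratio(after))             # 0.5
```

The ratio moves from roughly 0.22 to exactly 0.5, which matches the x value passed to balance.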

Let’s visualize the distribution after balancing:

balanced_vdf["category"].hist()