BALANCE

Returns a view with an equal distribution of the input data based on the dependent variable.

Important: Before using a machine learning function, be aware that all the ongoing transactions might be committed.

Behavior Type

Immutable

Syntax

BALANCE ( 'output_view', 'input_relation', 'response_column', 'balance_method' 
            [USING PARAMETERS [sampling_ratio=value] ])

Arguments

output_view

The name of the View where Vertica saves the balanced data from the chosen input_relation.

 input_relation

The table or view that contains the data the function uses to create a more balanced data set.

 response_column

The dependent variable: a VARCHAR or INT column in the input_relation.

 balance_method
            

The imbalanced processing method to use. Selects data from the minority and majority classes.

Valid values

  • hybrid_sampling: Performs oversampling and undersampling on different classes so each class is equally represented.
  • over_sampling: Oversamples on all classes, with the exception of the most majority class, towards the most majority class's cardinality.
  • under_sampling: Undersamples on all classes, with the exception of the most minority class, towards the most minority class's cardinality. An alias of weighted_sampling.
  • weighted_sampling: An alias of under_sampling.

Parameters

sampling_ratio=value

The desired ratio between the majority class and the minority class. This value has no effect when used with balance method hybrid_sampling.

Default: 1.0

Privileges

To use BALANCE, you must either be a superuser or have CREATE privileges for the schema of the output view and SELECT privileges for the input table or view. There are no privileges needed on the function itself.

See GRANT (Schema) and GRANT (Table).

Examples

The following example shows how you can use the BALANCE function:

=> CREATE TABLE backyard_bugs (id identity, bug_type int, finder varchar(20));
CREATE TABLE

=> COPY backyard_bugs FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1|Ants
>> 1|Beetles
>> 3|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 3|Caterpillars
>> 2|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 1|Ladybugs
>> 3|Ladybugs
>> \.

=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs GROUP BY bug_type;
 bug_type | COUNT
----------+-------
        2 |     1
        1 |     3
        3 |     7
(3 rows)

=> SELECT BALANCE('backyard_bugs_balanced', 'backyard_bugs', 'bug_type', 'under_sampling');
         BALANCE
--------------------------
 Finished in 1 iteration

(1 row)

=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs_balanced GROUP BY bug_type;
----------+-------
        2 |     1
        1 |     2
        3 |     1
(3 rows)

See Also