Balancing Imbalanced Data
Imbalanced data occurs when the classes in a data set are unevenly distributed. It is common in financial transaction data, where most transactions are not fraudulent and only a small number are. A predictive model built on an imbalanced data set can appear to yield high accuracy yet fail to generalize to new data in the minority class. To avoid creating models with misleading accuracy, rebalance your imbalanced data before you build a predictive model.
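To see why accuracy alone can mislead, consider a hypothetical model that labels every transaction as not fraudulent. The sketch below uses the class counts from the example on this page (981 non-fraudulent rows, 19 fraudulent rows) as literal values, so it does not read any table. The column names accuracy and fraud_recall are illustrative only.

```sql
-- Hypothetical "always predict not fraudulent" baseline, using the class
-- counts from the transaction_data example (981 non-fraud, 19 fraud).
SELECT
    ROUND(981 / 1000.0, 3) AS accuracy,      -- 981 of 1000 rows are labeled correctly
    ROUND(0 / 19.0, 3)     AS fraud_recall;  -- none of the 19 fraud rows are detected
```

Such a model would be right on 98.1% of the rows while detecting no fraud at all, which is the kind of false confidence that rebalancing is meant to guard against.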
Before you begin the example, make sure that you have loaded the Machine Learning sample data.
The following example shows you how to use the BALANCE function to create a more balanced data set.
- View the distribution of the classes.

    => SELECT fraud, COUNT(fraud) FROM transaction_data GROUP BY fraud;
     fraud | COUNT
    -------+-------
     TRUE  |    19
     FALSE |   981
    (2 rows)

- Use the BALANCE function to create a more balanced data set.

    => SELECT BALANCE('balance_fin_data', 'transaction_data', 'fraud', 'under_sampling'
       USING PARAMETERS sampling_ratio = 0.2);
             BALANCE
    --------------------------
     Finished in 1 iteration
    (1 row)

- View the new distribution of the classes.

    => SELECT fraud, COUNT(fraud) FROM balance_fin_data GROUP BY fraud;
     fraud | COUNT
    -------+-------
     t     |    19
     f     |   236
    (2 rows)
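Raw counts can be hard to compare across data sets of different sizes. As a minimal sketch, the two queries below report each class as a percentage of its data set, before and after balancing. They assume the same transaction_data and balance_fin_data objects used in the example; the aliases n, pct, before_counts, and after_counts are illustrative names, not part of the sample data.

```sql
-- Share of each class in the original data.
SELECT fraud, n,
       ROUND(100.0 * n / SUM(n) OVER (), 1) AS pct
FROM (SELECT fraud, COUNT(fraud) AS n
      FROM transaction_data
      GROUP BY fraud) AS before_counts;

-- Share of each class after running BALANCE.
SELECT fraud, n,
       ROUND(100.0 * n / SUM(n) OVER (), 1) AS pct
FROM (SELECT fraud, COUNT(fraud) AS n
      FROM balance_fin_data
      GROUP BY fraud) AS after_counts;
```

With the counts shown in the example, the fraudulent class should rise from 1.9% of the rows (19 of 1000) to roughly 7.5% (19 of 255).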