Imputing Missing Values
You can use the IMPUTE function to impute missing values in your data set. The function helps you impute missing values with plausible data values. This impute example uses the small_input_impute
table. Using the function, you can specify either the mean or mode method.
These examples show how you can use the IMPUTE function on the small_input_impute
table.
Create the Table
First, create the small_input_impute
table:
=> CREATE TABLE small_input_impute(pid int, pclass int, gender int, x1 float, x2 float, x3 float, x4 INT, x5 char, x6 varchar);Then insert the table values:
=> INSERT INTO small_input_impute VALUES( 1, 0, 0, -9.445818, -9.740541, -9.786974, 3, 't', 'A'); => INSERT INTO small_input_impute VALUES( 2, 0, 0, -9.618292, -9.308881, -9.562255, 4, 't', 'A'); => INSERT INTO small_input_impute VALUES( 3, 0, 0, -9.060605, -9.390844, -9.559848, 6, 't', 'B'); => INSERT INTO small_input_impute VALUES( 4, 0, 0, -2.264599, -2.615146, -2.107290, 15, 't', 'B'); => INSERT INTO small_input_impute VALUES( 5, 0, 1, -2.590837, -2.892819, -2.702960, 2, 't', 'C'); => INSERT INTO small_input_impute VALUES( 6, 0, 1, -2.264599, -2.615146, -2.107290, 11, 't', 'C'); => INSERT INTO small_input_impute VALUES( 7, 1, 1, 3.829239, 3.087650, 'INFINITY', NULL, 'f', 'C'); => INSERT INTO small_input_impute VALUES( 8, 1, 1, 3.273592, NULL, 3.477332, 18, 'f', 'B'); => INSERT INTO small_input_impute VALUES( 9, 1, 1, NULL, 3.841606, 3.754375, 20, 'f', 'B'); => INSERT INTO small_input_impute VALUES( 10,1, 1, NULL, 3.841606, 3.754375, 20, 't', 'A'); => INSERT INTO small_input_impute VALUES( 11, 0, 0, -9.445818, -9.740541, -9.786974, 3, 't','B'); => INSERT INTO small_input_impute VALUES( 12, 0, 0, -9.618292, -9.308881, -9.562255, 4, 't', 'C'); => INSERT INTO small_input_impute VALUES( 13, 0, 0, -9.060605, -9.390844, -9.559848, 6, 't', 'C'); => INSERT INTO small_input_impute VALUES( 14, 0, 0, -2.264599, -2.615146, -2.107290, 15, 'f','A'); => INSERT INTO small_input_impute VALUES( 15, 0, 1, -2.590837, -2.892819, -2.702960, 2, 'f','A'); => INSERT INTO small_input_impute VALUES( 16, 0, 1, -2.264599, -2.615146, -2.107290, 11, 'f', 'A'); => INSERT INTO small_input_impute VALUES( 17, 1, 1, 3.829239, 3.087650, 'INFINITY', NULL, 'f', 'B'); => INSERT INTO small_input_impute VALUES( 18, 1, 1, 3.273592, NULL, 3.477332, 18, 't', 'B'); => INSERT INTO small_input_impute VALUES( 19, 1, 1, NULL, 3.841606, 3.754375, 20, 't', NULL); => INSERT INTO small_input_impute VALUES( 20,1, 1, NULL, 3.841606, 3.754375, 20, NULL, 'C');
You can see the table with the missing values:
=> SELECT * FROM small_input_impute; pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6 ----+--------+--------+-----------+-----------+-----------+----+----+---- 1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A 2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | | 3.477332 | 18 | f | B
10 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | | 3.477332 | 18 | t | B
19 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t |
20 | 1 | 1 | | 3.841606 | 3.754375 | 20 | | C
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C 7 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
9 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
17 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | B
(20 rows)
Specify the Mean Method
Execute the IMPUTE function, specifying the mean method, without using the partition_columns parameter:
=> SELECT IMPUTE('output_view','small_input_impute', 'pid, x1,x2,x3,x4','mean' USING PARAMETERS exclude_columns='pid'); impute -------------------------- Finished in 1 iteration (1 row)
View output_view
to see the imputed values:
=> SELECT * FROM output_view; x1 | x2 | x3 | x4 -------------+-----------+-------------------+---- -9.445818 | -9.740541 | -9.786974 | 3 -9.618292 | -9.308881 | -9.562255 | 4 -9.060605 | -9.390844 | -9.559848 | 6 -2.264599 | -2.615146 | -2.10729 | 15 -2.264599 | -2.615146 | -2.10729 | 11 3.273592 | -2.865835 | 3.477332 | 18 -3.517739875 | 3.841606 | 3.754375 | 20 3.273592 | -2.865835 | 3.477332 | 18 -3.517739875 | 3.841606 | 3.754375 | 20 -3.517739875 | 3.841606 | 3.754375 | 20 -3.517739875 | 3.841606 | 3.754375 | 20 -9.445818 | -9.740541 | -9.786974 | 3 -9.618292 | -9.308881 | -9.562255 | 4 3.829239 | 3.08765 | -2.76005944444444 | 11 -2.590837 | -2.892819 | -2.70296 | 2 3.829239 | 3.08765 | -2.76005944444444 | 11 -9.060605 | -9.390844 | -9.559848 | 6 -2.264599 | -2.615146 | -2.10729 | 15 -2.590837 | -2.892819 | -2.70296 | 2 -2.264599 | -2.615146 | -2.10729 | 11 (20 rows)
You can also execute the IMPUTE function, specifying the mean method and using the partition_columns parameter. This parameter uses the GROUP BY clause:
=> SELECT impute('output_view_group','small_input_impute', 'pid, x1,x2,x3,x4','mean' USING PARAMETERS exclude_columns='pid', partition_columns='pclass,gender'); impute -------------------------- Finished in 1 iteration (1 row)
View output_view_group
to see the imputed values:
=> SELECT * FROM output_view_group; pclass | gender | x1 | x2 | x3 | x4 -------+--------+-----------+------------------+------------------+------------------ 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 1 | 1 | 3.273592 | 3.59028733333333 | 3.477332 | 18 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 1 | 1 | 3.273592 | 3.59028733333333 | 3.477332 | 18 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 1 | 1 | 3.829239 | 3.08765 | 3.66202733333333 | 19 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 1 | 1 | 3.829239 | 3.08765 | 3.66202733333333 | 19 (20 rows)
Specify the Mode Method
Execute the IMPUTE function, specifying the mode method, without using the partition_columns parameter:
=> SELECT impute('output_view_mode','small_input_impute', 'pid, x5,x6','mode' USING PARAMETERS exclude_columns='pid'); impute -------------------------- Finished in 1 iteration (1 row)
View output_view_mode
to see the imputed values:
=> SELECT * FROM output_view_mode; x5 | x6 ---+---- t | A t | A t | B t | B t | C f | B t | A t | B t | B t | C f | B t | B t | C f | B t | C f | C t | C f | A f | A f | A (20 rows)
You can also execute the IMPUTE function, specifying the mode method and using the partition_columns parameter. This parameter uses the GROUP BY clause:
=> SELECT impute('output_view_mode_group','small_input_impute', 'pid, x5,x6','mode' USING PARAMETERS exclude_columns='pid',partition_columns='pclass,gender'); impute -------------------------- Finished in 1 iteration (1 row)
View output_view_mode_group
to see the imputed values:
=> SELECT * FROM output_view_mode_group; pclass | gender | x5 | x6 -------+--------+----+---- 0 | 0 | t | B 0 | 0 | t | C 0 | 0 | t | A 0 | 0 | t | A 0 | 0 | t | B 0 | 0 | t | B 0 | 0 | t | C 0 | 0 | f | A 0 | 1 | t | C 0 | 1 | t | C 0 | 1 | f | A 0 | 1 | f | A 1 | 1 | f | C 1 | 1 | f | B 1 | 1 | t | A 1 | 1 | t | B 1 | 1 | t | B 1 | 1 | f | C 1 | 1 | f | B 1 | 1 | f | B (20 rows)
See Also