Encoding Categorical Columns

Many machine learning algorithms cannot work with categorical data. To accommodate such algorithms, categorical data must be converted to numerical data before training. Directly mapping the categorical values into indices is not enough. For example, if your categorical feature has three distinct values "red", "green" and "blue", replacing them with 1, 2 and 3 may have a negative impact on the training process because algorithms usually rely on some kind of numerical distances between values to discriminate between them. In this case, the Euclidean distance from 1 to 3 is twice the distance from 1 to 2, which means the training process will think that "red" is much more different than "blue", while it is more similar to "green". Alternatively, one hot encoding maps each categorical value to a binary vector to avoid this problem. For example, "red" can be mapped to [1,0,0], "green" to [0,1,0] and "blue" to [0,0,1]. Now, the pair-wise distances between the three categories are all the same. One hot encoding allows you to convert categorical variables to binary values so that you can use different machine learning algorithms to evaluate your data.

The following example shows how you can apply one hot encoding to the Titanic data set. If you would like to read more about this data set, see the Kaggle site.

Suppose you want to use a logistic regression classifier to predict which passengers survived the sinking of the Titanic. You cannot use categorical features for logistic regression without one hot encoding. This data set has two categorical features that you can use. The "sex" feature can be either male or female. The "embarkation_point" feature can be one of the following:

  • S for Southampton
  • Q for Queenstown
  • C for Cherbourg

Before you begin the example, load the Machine Learning sample data.

1. Run the ONE_HOT_ENCODER_FIT function on the training data:

=> SELECT ONE_HOT_ENCODER_FIT('titanic_encoder', 'titanic_training', 'sex, embarkation_point');
ONE_HOT_ENCODER_FIT
---------------------
Success

(1 row)

2. View a summary of the titanic_encoder model:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='titanic_encoder');
GET_MODEL_SUMMARY
------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT one_hot_encoder_fit('public.titanic_encoder','titanic_training','sex, embarkation_point'  
USING PARAMETERS exclude_columns='', output_view='', extra_levels='{}');
==================
varchar_categories
==================
category_name    |category_level|category_level_index
-----------------+--------------+--------------------
embarkation_point|      C       |         0
embarkation_point|      Q       |         1
embarkation_point|      S       |         2
embarkation_point|              |         3
sex              |    female    |         0
sex              |     male     |         1
(1 row)

3. Run the GET_MODEL_ATTRIBUTE function. This function returns the categorical levels in their native data types, so they can be compared easily with the original table:

=> SELECT * FROM (SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='titanic_encoder', 
attr_name='varchar_categories')) AS attrs INNER JOIN (SELECT passenger_id, name, sex, age, 
embarkation_point FROM titanic_training) AS original_data ON attrs.category_level 
ILIKE original_data.embarkation_point ORDER BY original_data.passenger_id LIMIT 10;
category_name     | category_level | category_level_index | passenger_id |name             
|  sex   | age | embarkation_point
------------------+----------------+----------------------+--------------+-----------------------------
-----------------------+--------+-----+-------------------
embarkation_point | S              |                    2 |            1 | Braund, Mr. Owen Harris   
| male   |  22 | S
embarkation_point | C              |                    0 |            2 | Cumings, Mrs. John Bradley 
(Florence Briggs Thayer | female |  38 | C
embarkation_point | S              |                    2 |            3 | Heikkinen, Miss. Laina    
| female |  26 | S
embarkation_point | S              |                    2 |            4 | Futrelle, Mrs. Jacques Heath 
(Lily May Peel)       | female |  35 | S
embarkation_point | S              |                    2 |            5 | Allen, Mr. William Henry    
| male   |  35 | S
embarkation_point | Q              |                    1 |            6 | Moran, Mr. James           
| male   |     | Q
embarkation_point | S              |                    2 |            7 | McCarthy, Mr. Timothy J    
| male   |  54 | S
embarkation_point | S              |                    2 |            8 | Palsson, Master. Gosta Leonard
| male   |   2 | S
embarkation_point | S              |                    2 |            9 | Johnson, Mrs. Oscar W 
(Elisabeth Vilhelmina Berg)  | female |  27 | S
embarkation_point | C              |                    0 |           10 | Nasser, Mrs. Nicholas 
(Adele Achem)                | female |  14 | C
(10 rows)

4. Run the APPLY_ONE_HOT_ENCODER function on both the training and testing data:

=> CREATE VIEW titanic_training_encoded AS SELECT passenger_id, survived, pclass, sex_1, age, sibling_and_spouse_count, parent_and_child_count, fare, embarkation_point_1, embarkation_point_2 FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder') FROM titanic_training) AS sq; CREATE VIEW => CREATE VIEW titanic_testing_encoded AS SELECT passenger_id, name, pclass, sex_1, age, sibling_and_spouse_count, parent_and_child_count, fare, embarkation_point_1, embarkation_point_2 FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder') FROM titanic_testing) AS sq; CREATE VIEW

5. Then, train a logistic regression classifier on the training data, and execute the model on the testing data:

=> SELECT LOGISTIC_REG('titanic_log_reg', 'titanic_training_encoded', 'survived', '*' 
USING PARAMETERS exclude_columns='passenger_id, survived');
LOGISTIC_REG
---------------------------
Finished in 5 iterations
(1 row)

=> SELECT passenger_id, name, PREDICT_LOGISTIC_REG(pclass, sex_1, age, sibling_and_spouse_count, 
parent_and_child_count, fare, embarkation_point_1, embarkation_point_2 USING PARAMETERS 
model_name='titanic_log_reg') FROM titanic_testing_encoded ORDER BY passenger_id LIMIT 10;
passenger_id |                     name                     | PREDICT_LOGISTIC_REG
-------------+----------------------------------------------+----------------------
893          | Wilkes, Mrs. James (Ellen Needs)             |                    0
894          | Myles, Mr. Thomas Francis                    |                    0
895          | Wirz, Mr. Albert                             |                    0
896          | Hirvonen, Mrs. Alexander (Helga E Lindqvist) |                    1
897          | Svensson, Mr. Johan Cervin                   |                    0
898          | Connolly, Miss. Kate                         |                    1
899          | Caldwell, Mr. Albert Francis                 |                    0
900          | Abrahim, Mrs. Joseph (Sophie Halaut Easu)    |                    1
901          | Davies, Mr. John Samuel                      |                    0
902          | Ilieff, Mr. Ylio                             |
(10 rows)