Logistic Regression Using Vertica's In-Built Machine Learning Functions
This document provides step-by-step instructions for using logistic regression. As an example, we perform logistic regression on the Titanic dataset using Vertica's in-built machine learning functions.
In machine learning, logistic regression is used to predict the probability of the occurrence of an event. It is one of the simplest approaches to supervised learning. For example, it can be used to predict whether an email is spam (1) or not (0). Given an input X, we predict p(X) using the logistic function:

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

β0 and β1 are regression coefficients which are unknown and must be estimated from the training dataset. We choose values for β0 and β1 such that the predicted probability p(X) is as close as possible to the observed outcome for each training example. We estimate these coefficients using maximum likelihood estimation.
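Concretely, maximum likelihood estimation picks the coefficients that make the observed outcomes most probable. In its standard textbook form (writing yi for the observed 0/1 label of the i-th training example), the likelihood being maximized is:

ℓ(β0, β1) = ∏[i: yi = 1] p(xi) × ∏[j: yj = 0] (1 − p(xj))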
Let us apply this statistical technique, logistic regression, to the Titanic dataset using Vertica's in-built ML functions and predict the survival of a passenger.
The RMS Titanic was a British passenger liner that struck an iceberg on her maiden voyage and sank, taking the lives of more than 1,500 passengers. Our goal is to predict the survival probability for a given passenger. Many factors played into which passengers made it into lifeboats and which did not, most notably gender and class assignment. We will use two datasets (one to train and one to test) to create a model that can predict whether or not a passenger survived.
To download the dataset, clone the Vertica Machine Learning Github repository. Using a terminal window, run the following command:
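Assuming the dataset comes from Vertica's public Machine-Learning-Examples repository on GitHub (verify the location for your setup), the command would look like:

$ git clone https://github.com/vertica/Machine-Learning-Examples.git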
We will be using the titanic_training.csv and titanic_testing.csv files.
The first step is to load the data into the database. Let us create two tables: one for training and one for testing.
dbadmin=> CREATE TABLE titanic_training(passenger_id int, survived int, pclass int, name varchar(50), sex varchar(10), age int, sibling_and_spouse_count int, parent_and_child_count int, ticket varchar(15), fare float, cabin varchar(10), embarkation_point varchar(15));
CREATE TABLE
dbadmin=> CREATE TABLE titanic_testing(passenger_id int, pclass int, name varchar(50), sex varchar(10), age int, sibling_and_spouse_count int, parent_and_child_count int, ticket varchar(15), fare float, cabin varchar(10), embarkation_point varchar(15));
CREATE TABLE
Next, let us load data into the corresponding tables.
dbadmin=> COPY titanic_training FROM LOCAL 'titanic_training.csv' DELIMITER ',' ENCLOSED BY '"';
 Rows Loaded
-------------
         815
(1 row)

dbadmin=> COPY titanic_testing FROM LOCAL 'titanic_testing.csv' DELIMITER ',' ENCLOSED BY '"';
 Rows Loaded
-------------
         378
(1 row)
In order to evaluate the accuracy of our model, we download the survived values for the test dataset from Kaggle (the file named gender_submission.csv) and create a table named titanic_test_survival.
dbadmin=> CREATE TABLE titanic_test_survival(passenger_id int, survived int);
CREATE TABLE
dbadmin=> COPY titanic_test_survival FROM LOCAL 'gender_submission.csv' DELIMITER ',' ENCLOSED BY '"';
 Rows Loaded
-------------
         418
(1 row)
Next, we delete the rows from titanic_test_survival for which we don't have corresponding data in titanic_testing.
dbadmin=> DELETE FROM titanic_test_survival where passenger_id in (select passenger_id from titanic_test_survival where passenger_id not in (select passenger_id from titanic_testing));
 OUTPUT
--------
     40
(1 row)

dbadmin=> commit;
COMMIT
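As a quick sanity check (a hypothetical verification query, not part of the original session), the remaining row count should now match the 378 rows loaded into titanic_testing:

dbadmin=> SELECT COUNT(*) FROM titanic_test_survival;
 COUNT
-------
   378
(1 row)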
Data Exploration
We have 815 observations for training and 378 observations for testing. Based on the table definition, the dataset has 11 predictor variables and one target variable named survived. The target survived is categorical; among the predictors, sex, pclass, embarkation_point, cabin, name, ticket, sibling_and_spouse_count, and parent_and_child_count are categorical variables, whereas fare, age, and passenger_id are numerical variables.
In statistical terms, categorical variables are those that take a limited number of values, such as sex, whose values can be male or female. Since we have more than one predictor variable, we are performing multiple logistic regression, in which the single term β1X in the logistic function generalizes to β1X1 + β2X2 + … + βpXp. Here is a detailed description of the features of the Titanic dataset:
- Passenger_id is a unique identifying number assigned to each passenger.
- Survived is a flag that indicates if a passenger survived or died (0 = No, 1 = Yes).
- Pclass is the passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
- Name is the name of the passenger.
- Sex indicates the gender of the passenger (male or female).
- Age indicates the age of the passenger.
- sibling_and_spouse_count is the number of siblings/spouses aboard.
- parent_and_child_count is the number of parents/children aboard.
- Ticket indicates the ticket number issued to the passenger.
- Fare indicates the amount of money spent on the ticket.
- Cabin indicates the cabin category occupied by the passenger.
- Embarkation_point indicates the port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton).
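Before summarizing the distributions, it can be useful to eyeball a few raw rows. A minimal spot check (output omitted, since it simply echoes the loaded data):

dbadmin=> SELECT passenger_id, survived, pclass, sex, age, fare, embarkation_point
          FROM titanic_training
          LIMIT 5;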
Let us run the functions SUMMARIZE_NUMCOL and SUMMARIZE_CATCOL to get the distribution of the features.
dbadmin=> SELECT SUMMARIZE_NUMCOL(passenger_id, survived, pclass, age, sibling_and_spouse_count, parent_and_child_count, fare) OVER() FROM titanic_training;
COLUMN | COUNT | MEAN | STDDEV | MIN | 25% | 50% | 75% | MAX
---|---|---|---|---|---|---|---|---
age | 656 | 30.172256097561 | 14.3014190862079 | 1 | 21 | 29 | 39 | 80
fare | 815 | 33.1048303067485 | 50.7783288265627 | 0 | 7.925 | 14.4542 | 31.3875 | 512.3292
parent_and_child_count | 815 | 0.376687116564417 | 0.810978914656903 | 0 | 0 | 0 | 0 | 6
passenger_id | 815 | 447.566871165644 | 257.283317160943 | 1 | 229.5 | 449 | 667.5 | 891
pclass | 815 | 2.29325153374233 | 0.843323892994739 | 1 | 1 | 3 | 3 | 3
sibling_and_spouse_count | 815 | 0.532515337423313 | 1.10665195736505 | 0 | 0 | 0 | 1 | 8
survived | 815 | 0.36319018404908 | 0.481214303403727 | 0 | 0 | 0 | 1 | 1
dbadmin=> SELECT SUMMARIZE_CATCOL( sex USING PARAMETERS WITH_TOTALCOUNT = true) OVER() from titanic_training;
CATEGORY | COUNT | PERCENT
---|---|---
(total) | 815 | 100
male | 538 | 66.0122699386503
female | 277 | 33.9877300613497
dbadmin=> SELECT SUMMARIZE_CATCOL( embarkation_point USING PARAMETERS WITH_TOTALCOUNT = true) OVER() from titanic_training;
CATEGORY | COUNT | PERCENT
---|---|---
(total) | 815 | 100
S | 603 | 73.9877300613497
C | 150 | 18.4049079754601
Q | 60 | 7.36196319018405
(missing) | 2 | 0.245398773006135
Data Preparation
Data preparation is a very important step in machine learning. We need to make sure that our dataset does not contain missing values, duplicate rows, outliers, or imbalanced data. If this step is not performed properly, the model may not predict the outcome accurately.
Missing Values
We see that there are missing values for the columns age, cabin, and embarkation_point.
Table 1: Missing values in the Titanic dataset

Feature Name | Count of Missing Values
---|---
age | 159 (training data), 78 (testing data)
cabin | 622 (training data)
embarkation_point | 2 (training data)
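These counts can be reproduced with a simple query. A sketch, assuming empty CSV fields were loaded as NULL (Vertica's COPY default); run the analogous query against titanic_testing for the test data:

dbadmin=> SELECT COUNT(*) - COUNT(age)               AS age_missing,
                 COUNT(*) - COUNT(cabin)             AS cabin_missing,
                 COUNT(*) - COUNT(embarkation_point) AS embarkation_missing
          FROM titanic_training;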
Vertica provides an in-built ML function called IMPUTE, which replaces missing data based on the method we specify. Let us impute the missing values for the features identified in Table 1.
Cabin: Out of 815 observations in the training dataset, 622 are missing. Even filling this column with the most frequent cabin number does not make sense, since more than one person cannot have the same cabin number. So it is better not to use this feature. Let us go ahead and drop this column in both the titanic_training and titanic_testing tables.
dbadmin=> Alter table titanic_training DROP COLUMN cabin cascade;
ALTER TABLE
dbadmin=> Alter table titanic_testing DROP COLUMN cabin cascade;
ALTER TABLE
Note: If you get any errors while dropping columns, check the projection design and make sure that the column is not part of a segmentation or ORDER BY clause.
Age: We have 159 observations with missing values in the training data. Let us use the IMPUTE function to replace the missing data with the mean.
dbadmin=> SELECT IMPUTE('output_view_age','titanic_training','age','mean');
          IMPUTE
--------------------------
 Finished in 1 iteration
(1 row)
Now let us run the IMPUTE function on the test data as well for the age feature.
dbadmin=> SELECT IMPUTE('output_view_age_test','titanic_testing','age','mean');
          IMPUTE
--------------------------
 Finished in 1 iteration
(1 row)
embarkation_point: This is a categorical variable, and we have 2 missing values in the training data. Let us use the IMPUTE function to replace the missing data with the most frequent value (mode). We will use the view created in the step above as input.
dbadmin=> SELECT IMPUTE('output_view_embarkation','output_view_age','embarkation_point','mode');
          IMPUTE
--------------------------
 Finished in 1 iteration
(1 row)
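To confirm that the imputation worked (a hypothetical check, not shown in the original session), verify that no NULLs remain in the final view; the query below should return 0:

dbadmin=> SELECT COUNT(*) FROM output_view_embarkation
          WHERE age IS NULL OR embarkation_point IS NULL;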
Encoding Categorical Columns
We have two categorical columns in the Titanic dataset that need to be encoded: sex and embarkation_point. Most ML models require all input and output variables to be numeric. There are various methods to encode categorical columns; one-hot encoding is the most popular, converting each category into a binary indicator.
Vertica provides an in-built function called ONE_HOT_ENCODER_FIT. Let us fit a one-hot encoder on the training dataset. This function determines how each categorical feature maps to binary features and stores that mapping in a model.
dbadmin=> SELECT ONE_HOT_ENCODER_FIT('titanic_encoder', 'output_view_embarkation', 'sex, embarkation_point');
 ONE_HOT_ENCODER_FIT
---------------------
 Success
(1 row)
Let us view the encoded features present in the titanic_encoder model using the GET_MODEL_SUMMARY function.
dbadmin=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='titanic_encoder');
                               GET_MODEL_SUMMARY
---------------------------------------------------------------------------------
===========
call_string
===========
SELECT one_hot_encoder_fit('public.titanic_encoder','output_view_embarkation','sex, embarkation_point'
USING PARAMETERS exclude_columns='', output_view='', extra_levels='{}');

==================
varchar_categories
==================
  category_name  |category_level|category_level_index
-----------------+--------------+--------------------
embarkation_point|      C       |          0
embarkation_point|      Q       |          1
embarkation_point|      S       |          2
       sex       |    female    |          0
       sex       |     male     |          1

(1 row)
We see that both embarkation_point and sex have been encoded into binary format.
Now let us use the titanic_encoder model created in the step above and apply it to the training and testing data, creating views with encoded columns for each. Vertica provides a function named APPLY_ONE_HOT_ENCODER to help us achieve this.
When we run APPLY_ONE_HOT_ENCODER on the training dataset, we get the original columns plus the corresponding encoded columns. To build a model, we only need the encoded values for the categorical columns; note that we keep only k−1 of the k indicator columns for each feature (sex_1 for sex; embarkation_point_1 and embarkation_point_2 for embarkation_point), so the dropped level acts as the baseline. Along with that, we exclude name and ticket since they are varchar columns, and logistic regression requires the input training data to be numeric. Now let us select the required columns and create a view for training.
dbadmin=> CREATE VIEW titanic_training_encoded AS
          SELECT passenger_id, survived, pclass, sex_1, age, sibling_and_spouse_count,
                 parent_and_child_count, fare, embarkation_point_1, embarkation_point_2
          FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
                FROM output_view_embarkation) AS sq;
CREATE VIEW
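To spot-check the encoded view (a hypothetical preview; output omitted), look at a few rows. Per the encoder summary above, sex_1 is 1 for male and 0 for female, while embarkation_point_1 and embarkation_point_2 correspond to Q and S, with C as the baseline:

dbadmin=> SELECT sex_1, embarkation_point_1, embarkation_point_2
          FROM titanic_training_encoded
          LIMIT 5;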
Next, let us create a view for testing data.
dbadmin=> CREATE VIEW titanic_testing_encoded AS
          SELECT passenger_id, name, pclass, sex_1, age, sibling_and_spouse_count,
                 parent_and_child_count, fare, embarkation_point_1, embarkation_point_2
          FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
                FROM output_view_age_test) AS sq;
CREATE VIEW
Building Logistic Regression Model
Now that we are done with preprocessing the data, let us apply logistic regression to the training data and build a model named titanic_model. We pass passenger_id and survived to the exclude_columns parameter: survived is excluded because it is already passed as the response column, and passenger_id is excluded because it will not help in predicting whether a passenger survived.
dbadmin=> SELECT LOGISTIC_REG('titanic_model', 'titanic_training_encoded', 'survived', '*' USING PARAMETERS exclude_columns='passenger_id, survived');
       LOGISTIC_REG
---------------------------
 Finished in 5 iterations
(1 row)
Let us view the coefficients of titanic_model using the GET_MODEL_ATTRIBUTE function.

dbadmin=> SELECT GET_MODEL_ATTRIBUTE (USING PARAMETERS model_name='titanic_model', attr_name='details');
predictor | coefficient | std_err | z_value | p_value
---|---|---|---|---
Intercept | 4.79608507226254 | 0.588356390682034 | 8.15166648687681 | 3.58943081416261e-16
pclass | -1.04367548461173 | 0.150046006586495 | -6.95570317634611 | 3.50807788992785e-12
sex_1 | -2.83317816597054 | 0.213897983172462 | -13.245464608641 | 4.79306444216242e-40
age | -0.0308952083686641 | 0.00828154776064673 | -3.73060800487992 | 0.000191018233152909
sibling_and_spouse_count | -0.283460326254313 | 0.114119041328987 | -2.48390034610562 | 0.0129952168866586
parent_and_child_count | -0.142765927531063 | 0.123121694222583 | -1.15955135634315 | 0.246231515660041
fare | 0.00268052020303712 | 0.00245537645477208 | 1.09169418718969 | 0.274967541017099
embarkation_point_1 | -0.160265939268667 | 0.438429351725351 | -0.365545642959287 | 0.714704142353224
embarkation_point_2 | -0.338240678212526 | 0.252730668321822 | -1.33834441406935 | 0.180784189896433
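The coefficients are on the log-odds scale, so exponentiating them gives odds ratios. As an illustrative calculation (hypothetical, not part of the original walkthrough), the sex_1 coefficient translates into the survival odds for males relative to females:

dbadmin=> SELECT EXP(-2.83317816597054) AS male_odds_ratio;

This returns approximately 0.059: holding the other features fixed, a male passenger's odds of survival were about 6% of a female passenger's.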
Predicting Results
Let us apply the logistic regression model titanic_model to the test data and predict whether each passenger survived. We store the predicted results in a table named titanic_predicted_results along with the original survival values.
dbadmin=> CREATE TABLE titanic_predicted_results AS
          (SELECT t.passenger_id, t.name,
                  PREDICT_LOGISTIC_REG(t.pclass, t.sex_1, t.age, t.sibling_and_spouse_count,
                                       t.parent_and_child_count, t.fare, t.embarkation_point_1,
                                       t.embarkation_point_2 USING PARAMETERS model_name='titanic_model') AS predicted,
                  s.survived
           FROM titanic_testing_encoded t, titanic_test_survival s
           WHERE t.passenger_id = s.passenger_id
           ORDER BY t.passenger_id);
CREATE TABLE
dbadmin=> SELECT * FROM titanic_predicted_results LIMIT 7;
 passenger_id |                     name                      | predicted | survived
--------------+-----------------------------------------------+-----------+----------
          897 | Svensson, Mr. Johan Cervin                    |         0 |        0
          898 | Connolly, Miss. Kate                          |         1 |        1
          901 | Davies, Mr. John Samuel                       |         0 |        0
          902 | Ilieff, Mr. Ylio                              |         0 |        0
          907 | del Carlo, Mrs. Sebastiano (Argenia Genovesi) |         1 |        1
          908 | Keane, Mr. Daniel                             |         0 |        0
          917 | Robins, Mr. Alexander A                       |         0 |        0
(7 rows)
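By default, PREDICT_LOGISTIC_REG returns the predicted class (0 or 1). If you also want the underlying survival probabilities, the function accepts a type parameter; a minimal sketch (output omitted; confirm that your Vertica version supports type='probability'):

dbadmin=> SELECT t.passenger_id,
                 PREDICT_LOGISTIC_REG(t.pclass, t.sex_1, t.age, t.sibling_and_spouse_count,
                                      t.parent_and_child_count, t.fare, t.embarkation_point_1,
                                      t.embarkation_point_2
                                      USING PARAMETERS model_name='titanic_model', type='probability') AS survival_probability
          FROM titanic_testing_encoded t
          LIMIT 5;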
Model Evaluation
Now let us test whether our model predicts accurate results by creating a confusion matrix, which compares the actual values with the values predicted by our model.
dbadmin=> SELECT CONFUSION_MATRIX(original::int, predicted::int USING PARAMETERS num_classes=2) OVER()
          FROM (SELECT survived AS original, predicted FROM titanic_predicted_results ORDER BY passenger_id) AS prediction_output;
 actual_class | predicted_0 | predicted_1 |                    comment
--------------+-------------+-------------+-----------------------------------------------
            0 |         232 |          10 |
            1 |           9 |         127 | Of 378 rows, 378 were used and 0 were ignored
(2 rows)
Based on the confusion matrix, we see that our model predicted results for all 378 observations from the test dataset. It has predicted correctly for 359 observations, that is, 232 observations for class 0 and 127 observations for class 1.
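From those counts, the overall accuracy is (232 + 127) / 378 ≈ 0.95. The same figure can be reproduced directly from the predictions table (a hypothetical check, not in the original session):

dbadmin=> SELECT SUM(CASE WHEN predicted::int = survived THEN 1 ELSE 0 END) / COUNT(*)::float AS accuracy
          FROM titanic_predicted_results;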
Let us calculate the LIFT_TABLE, ROC, and ERROR_RATE of the model.
dbadmin=> SELECT ROC(original, predicted USING PARAMETERS num_bins=5, AUC=True) OVER()
          FROM (SELECT survived AS original, predicted FROM titanic_predicted_results ORDER BY passenger_id) AS prediction_output;
 decision_boundary | false_positive_rate | true_positive_rate |        AUC        |                    comment
-------------------+---------------------+--------------------+-------------------+-----------------------------------------------
 0                 | 1                   | 1                  |                   |
 0.2               | 0.0413223140495868  | 0.933823529411765  |                   |
 0.4               | 0.0413223140495868  | 0.933823529411765  |                   |
 0.6               | 0.0413223140495868  | 0.933823529411765  |                   |
 0.8               | 0.0413223140495868  | 0.933823529411765  |                   |
 1                 | 0                   | 0                  | 0.946250607681089 | Of 378 rows, 378 were used and 0 were ignored
(6 rows)

dbadmin=> SELECT ERROR_RATE(original::int, predicted::int USING PARAMETERS num_classes=2) OVER()
          FROM (SELECT survived AS original, predicted FROM titanic_predicted_results ORDER BY passenger_id) AS prediction_output;
 class |     error_rate     |                    comment
-------+--------------------+-----------------------------------------------
 0     | 0.0413223140495868 |
 1     | 0.0661764705882353 |
       | 0.0502645502645503 | Of 378 rows, 378 were used and 0 were ignored
(3 rows)

dbadmin=> SELECT LIFT_TABLE(original, predicted USING PARAMETERS num_bins=5) OVER()
          FROM (SELECT survived AS original, predicted FROM titanic_predicted_results ORDER BY passenger_id) AS prediction_output;
 decision_boundary | positive_prediction_ratio |       lift       |                    comment
-------------------+---------------------------+------------------+-----------------------------------------------
 1                 | 0                         | NaN              |
 0.8               | 0.933823529411765         | 2.57653499355947 |
 0.6               | 0.933823529411765         | 2.57653499355947 |
 0.4               | 0.933823529411765         | 2.57653499355947 |
 0.2               | 0.933823529411765         | 2.57653499355947 |
 0                 | 1                         | 1                | Of 378 rows, 378 were used and 0 were ignored
(6 rows)
We see that the ERROR_RATE is low: the lower the error rate, the more accurate our predictions. The ROC curve shows the performance of the classification model by plotting the true positive rate against the false positive rate at different decision boundaries. AUC (Area Under the Curve) is a key metric: the higher the AUC, the better the classifier. Here the AUC is about 0.95, meaning the model ranks survivors above non-survivors roughly 95% of the time. Lift measures the ratio between the results obtained with and without the model.

Based on the results, our positive prediction ratio is high, which is a good sign. The coefficient table above also shows small p-values for the strongest predictors (pclass, sex_1, age, and sibling_and_spouse_count), although fare, parent_and_child_count, and the embarkation_point dummies are not individually significant. Overall, our model did a good job of predicting the survival of the passengers!