Detect Outliers Using Isolation Forest

Posted August 2, 2022 by George Larionov


The Isolation Forest (IForest) algorithm is a powerful and scalable algorithm for identifying outliers in your data. It takes a unique approach that focuses directly on characterizing outliers rather than modeling normal data points; specifically, IForest exploits the two quantitative properties of outliers: they are few and different. A scalable implementation of the Isolation Forest algorithm is available in Vertica from version 12.0.0 onward. In this blog post, we go through the steps to get you started with the IForest algorithm, using a simple real-world example. We use Vertica’s vsql client program in this example, but similar steps can be done using VerticaPy as well.

IForest handles both numerical and categorical data out of the box, like other tree-based algorithms supported in Vertica such as RF_Classifier, RF_Regressor, XGB_Classifier, and XGB_Regressor. Additionally, as an unsupervised algorithm, IForest does not require data to be labeled as outlier/non-outlier, although you may want to label a small subset of your data in order to tune and test the model.

Dataset

This blog post example uses the KDD-Cup-1999 dataset, a web traffic dataset containing a variety of intrusions simulated in a military network environment. Each data point has 41 numerical and categorical features characterizing a network connection, such as its duration, protocol_type, service, src_bytes, dst_bytes, and so on. Each data point is also labeled with either the type of the associated cyber attack, such as snmpgetattack, buffer_overflow, guess_passwd, etc., or normal if the connection is not related to an attack.

We downloaded the dataset from the UC Irvine ML repository, but it is also available in the KDD archive, which provides more information about the dataset, including the names of the features, the types of attacks, and so on. To access the data, download the file ‘kddcup.data.gz’ (~18MB compressed).

Experiment

For this experiment, we train an IForest model on part of the dataset to see if it detects the abnormal data points as anomalies. The experiment shows how IForest might be used as an unsupervised algorithm for detecting security threats and fraud. Please note that because there is randomness involved in training an IForest model, you might not get exactly the same results displayed here when you try the steps yourself. Let’s go through the experiment step by step.

Step 1: Create table and load data

First, we create a table with the appropriate column names and types.

CREATE TABLE web_traffic_train(duration int, protocol_type VARCHAR, service VARCHAR, flag VARCHAR,
                               src_bytes INT, dst_bytes INT, land BOOL, wrong_fragment INT, urgent INT,
                               hot INT, num_failed_logins INT, logged_in BOOL, num_compromised INT, root_shell INT,
                               su_attempted INT, num_root INT, num_file_creations INT, num_shells INT, num_access_files INT,
                               num_outbound_cmds INT, is_host_login BOOL, is_guest_login BOOL, count INT, srv_count INT,
                               serror_rate FLOAT, srv_serror_rate FLOAT, rerror_rate FLOAT, srv_rerror_rate FLOAT,
                               same_srv_rate FLOAT, diff_srv_rate FLOAT, srv_diff_host_rate FLOAT, dst_host_count INT,
                               dst_host_srv_count INT, dst_host_same_srv_rate FLOAT, dst_host_diff_srv_rate FLOAT,
                               dst_host_same_src_port_rate FLOAT, dst_host_srv_diff_host_rate FLOAT,
                               dst_host_serror_rate FLOAT, dst_host_srv_serror_rate FLOAT, dst_host_rerror_rate FLOAT,
                               dst_host_srv_rerror_rate FLOAT, label VARCHAR);


Next, we load the data.

COPY web_traffic_train FROM LOCAL 'path_to_data/kddcup.data.gz' GZIP DELIMITER ',';
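As a quick sanity check, you can count the rows after the load completes. The full kddcup.data file should contain roughly 4.9 million connection records, though the exact count depends on which version of the file you downloaded.

-- Sanity check: confirm how many rows were loaded from kddcup.data.gz
SELECT COUNT(*) FROM web_traffic_train;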

For this example, we use only a subset of the data because the full dataset has more outliers than normal data points. We limit the scope to data points for connections to http endpoints, which leaves 623,091 rows, only 4,045 of which are abnormal (that is, associated with an attack). Instead of forcing Vertica to copy the data into a new table, let’s create a view.

CREATE VIEW web_traffic_http AS SELECT * FROM web_traffic_train WHERE service='http';
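If you want to confirm the numbers above, a quick aggregation over the view shows how many of the http data points are labeled normal versus associated with an attack. (Label values in this file end with a period, hence the comparison against 'normal.'.)

-- Count normal vs. attack-labeled rows in the http subset;
-- the totals should roughly match the 623,091 / 4,045 figures above.
SELECT label != 'normal.' AS is_attack, COUNT(*) AS num_rows
FROM web_traffic_http
GROUP BY label != 'normal.';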

Step 2: Train an IForest model

We call the iforest function to train our model. iforest is a meta-function with three required positional parameters and several optional named parameters. The positional parameters are model-name, input-relation, and input-columns, respectively. The named parameters can be used to customize the training, such as by specifying ntree and max_depth.

SELECT iforest('iforest_example', 'web_traffic_http', '*' USING PARAMETERS exclude_columns='label');
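The call above uses the default values for all named parameters. If you want to experiment with the training settings, a call along the following lines adjusts ntree and max_depth (the values shown are arbitrary, purely for illustration), and get_model_summary lets you inspect the trained model afterward.

-- Train a second model with custom settings; the ntree and max_depth values are arbitrary examples
SELECT iforest('iforest_tuned', 'web_traffic_http', '*'
               USING PARAMETERS exclude_columns='label', ntree=200, max_depth=10);

-- Inspect the trained model
SELECT get_model_summary(USING PARAMETERS model_name='iforest_tuned');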

Now we can apply the model to the same data, or to any new data points, to predict their anomaly scores. Let’s take a look at the result of the apply_iforest function when the model is applied to the training data. For each input row, apply_iforest returns a struct containing a predicted anomaly_score and a boolean is_anomaly indicating whether the row is considered an anomaly. The value of anomaly_score is between 0 and 1, with a score of 0 indicating that the data point is not an anomaly and a score of 1 indicating that it is; in practice, the value always falls somewhere between these two extremes. When all scores are close to 0.5, it usually means that there are very few anomalies in the dataset. The right threshold for interpreting a score as an anomaly is different for every application, so apply_iforest has an optional parameter named threshold (with default value 0.7) that lets you set an anomaly threshold suitable for your own use case.

In the following SQL code, we use the vsql command `\set` to create a vsql variable ‘columns’ that stores the column names as a single string so we don’t have to retype them every time. The variable is referenced by typing ‘:columns’, which substitutes the stored string at that point in the query.

\set columns 'duration, protocol_type, service, flag, src_bytes, dst_bytes, land, wrong_fragment, urgent, hot, num_failed_logins, logged_in, num_compromised, root_shell, su_attempted, num_root, num_file_creations, num_shells, num_access_files, num_outbound_cmds, is_host_login, is_guest_login, count, srv_count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate, dst_host_rerror_rate, dst_host_srv_rerror_rate'
 
=> SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example') from web_traffic_http LIMIT 5;
                      apply_iforest                      
----------------------------------------------------------
 {"anomaly_score":0.4818153103947974,"is_anomaly":false}
 {"anomaly_score":0.48019606645209719,"is_anomaly":false}
 {"anomaly_score":0.48019606645209719,"is_anomaly":false}
 {"anomaly_score":0.48002628098083308,"is_anomaly":false}
 {"anomaly_score":0.4961066842086145,"is_anomaly":false}
(5 rows)
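Since the anomaly scores are what ultimately drive the classification, it can be helpful to look at how they are distributed before picking a threshold. A simple aggregation like the following sketch, using the same struct-field access (pred.anomaly_score), summarizes the range of scores:

-- Summarize the distribution of anomaly scores over the http subset
WITH prediction_result AS (SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example') AS pred
                           FROM web_traffic_http)
SELECT MIN(pred.anomaly_score) AS min_score,
       AVG(pred.anomaly_score) AS avg_score,
       MAX(pred.anomaly_score) AS max_score
FROM prediction_result;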


Step 3: Examine the result and experiment with the threshold parameter

Now that the model can score our data points, we can examine the result of the apply_iforest function. The most interesting information is likely to be found in the rows where is_anomaly=true, and in whether they match up with the label column we excluded from training; one way to check this is sketched below.
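For example, a query along the following lines breaks the predicted anomalies down by their original label. The threshold value here is illustrative (it is the value we settle on shortly); with the default threshold, this query returns no rows at all for this dataset, which is exactly what the confusion matrix below makes visible.

-- Count predicted anomalies per original label (threshold value is illustrative; see the discussion below)
WITH prediction_result AS (SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example', threshold=0.5) AS pred, label
                           FROM web_traffic_http)
SELECT label, COUNT(*) AS num_predicted_anomalies
FROM prediction_result
WHERE pred.is_anomaly
GROUP BY label
ORDER BY num_predicted_anomalies DESC;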

First, let’s look at a confusion matrix without adjusting the threshold value.

=> WITH prediction_result AS (SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example') AS pred, label!='normal.' AS obs
                              FROM web_traffic_http)
   SELECT confusion_matrix(obs, pred.is_anomaly USING PARAMETERS num_classes=2) OVER() FROM prediction_result;
 
 actual_class | predicted_false | predicted_true |                       comment                      
--------------+-----------------+----------------+-----------------------------------------------------
 f            |          619046 |              0 |
 t            |            4045 |              0 | Of 623091 rows, 623091 were used and 0 were ignored
(2 rows)


Here, there are zero predicted_true values, indicating that the model has classified all our data points as regular (non-outlier) data points! We fix this issue by adjusting the threshold parameter in the apply_iforest function. This parameter controls how selective the function is in deciding which points are outliers. Each point whose anomaly_score is above the threshold value is classified as an outlier. Looking at the above results, it appears many of the points have an anomaly score near 0.5, so let’s set that as our threshold value and see what we get.

=> WITH prediction_result AS (SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example', threshold=0.5) AS pred, label!='normal.' AS obs
                              FROM web_traffic_http)
   SELECT confusion_matrix(obs, pred.is_anomaly USING PARAMETERS num_classes=2) OVER() FROM prediction_result;
 actual_class | predicted_false | predicted_true |                       comment                      
--------------+-----------------+----------------+-----------------------------------------------------
 f            |          606914 |          12132 |
 t            |            1878 |           2167 | Of 623091 rows, 623091 were used and 0 were ignored
(2 rows)


This result is an improvement, as we have correctly predicted about half of the outliers, but there are still a lot of false positives. We can experiment with different threshold values to get closer to an optimal point; for example, let’s try 0.51 and 0.52.

=> WITH prediction_result AS (SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example', threshold=0.51) AS pred, label!='normal.' AS obs
                              FROM web_traffic_http)
   SELECT confusion_matrix(obs, pred.is_anomaly USING PARAMETERS num_classes=2) OVER() FROM prediction_result;
 actual_class | predicted_false | predicted_true |                       comment                      
--------------+-----------------+----------------+-----------------------------------------------------
 f            |          618789 |            257 |
 t            |            2157 |           1888 | Of 623091 rows, 623091 were used and 0 were ignored
(2 rows)
 
=> WITH prediction_result AS (SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example', threshold=0.52) AS pred, label!='normal.' AS obs
                              FROM web_traffic_http)
   SELECT confusion_matrix(obs, pred.is_anomaly USING PARAMETERS num_classes=2) OVER() FROM prediction_result;
 actual_class | predicted_false | predicted_true |                       comment                      
--------------+-----------------+----------------+-----------------------------------------------------
 f            |          619006 |             40 |
 t            |            2187 |           1858 | Of 623091 rows, 623091 were used and 0 were ignored
(2 rows)


We can also use the receiver operating characteristic (ROC) function to see the tradeoff between the true positive rate (the fraction of outliers correctly classified as outliers) and the false positive rate (the fraction of regular data points incorrectly classified as outliers) across different threshold values. Note that it does not matter how we set the threshold here, since the threshold only affects the value of is_anomaly, not the value of anomaly_score. Each row returned by the ROC function represents a possible threshold value along with the true/false positive rates it would yield over the dataset. For the sake of presentation (the following function call returns 1000 rows), here we query only the values of decision_boundary between 0.49 and 0.51.

=> WITH prediction_result AS (SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example') AS pred, label!='normal.' AS obs
                              FROM web_traffic_http)
   SELECT * FROM (SELECT roc(obs, pred.anomaly_score using parameters num_bins=1000) OVER()
                  FROM prediction_result) AS subquery WHERE decision_boundary > 0.49 AND decision_boundary < 0.51;
 decision_boundary | false_positive_rate  | true_positive_rate | AUC | comment
-------------------+----------------------+--------------------+-----+---------
             0.491 |   0.0849775299412322 |  0.687515451174289 |     |
             0.492 |    0.081853367924193 |  0.670210135970334 |     |
             0.493 |   0.0717991877824911 |   0.66229913473424 |     |
             0.494 |   0.0595319249296498 |  0.636835599505562 |     |
             0.495 |   0.0534031396697499 |  0.614585908529048 |     |
             0.496 |   0.0448965020370053 |  0.601977750309023 |     |
             0.497 |   0.0404315672825606 |  0.583683559950556 |     |
             0.498 |   0.0328408551222365 |  0.555253399258344 |     |
             0.499 |   0.0275084565605787 |  0.548084054388133 |     |
               0.5 |   0.0195978974098855 |  0.535723114956737 |     |
             0.501 |   0.0148551157749182 |  0.521631644004944 |     |
             0.502 |   0.0117357999243998 |   0.51372064276885 |     |
             0.503 |  0.00702694145507765 |  0.500865265760198 |     |
             0.504 |  0.00543578344743363 |  0.497156983930779 |     |
             0.505 |  0.00265893003104777 |  0.492459826946848 |     |
             0.506 |  0.00212908249144652 |  0.485290482076638 |     |
             0.507 |  0.00110169518904896 |  0.477132262051916 |     |
             0.508 | 0.000974079470669385 |  0.471446229913473 |     |
             0.509 | 0.000491078207435312 |  0.468726823238566 |     |
(19 rows)
 
=> WITH prediction_result AS (SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example') AS pred, label!='normal.' AS obs
                              FROM web_traffic_http)
   SELECT * FROM (SELECT roc(obs, pred.anomaly_score using parameters num_bins=1000) OVER()
                  FROM prediction_result) AS subquery WHERE AUC IS NOT NULL;
 decision_boundary | false_positive_rate | true_positive_rate |        AUC        |                       comment                      
-------------------+---------------------+--------------------+-------------------+-----------------------------------------------------
                 1 |                   0 |                  0 | 0.964212676232183 | Of 623091 rows, 623091 were used and 0 were ignored
(1 row)


The ROC function also reports the Area Under the Curve (AUC, sometimes referred to as AUROC or the area under the ROC curve) value as 0.9642, which is quite high. The AUC scores the model based on how well it balances the true positive rate against the false positive rate. A perfect classifier achieves an AUC of 1, meaning a true positive rate of 1 with a false positive rate of 0. The dashed diagonal orange line in the graph below, on the other hand, represents a model that assigns classes at random, achieving an AUC of 0.5; any model with an AUC greater than 0.5 has therefore performed better than random. (The AUC appears only in the last row of the AUC column because it is accumulated over the steps of the ROC computation, which is why it did not show up in the first query above.) To get a better idea of what the different threshold values mean, we also drew the ROC curve in Python using the matplotlib package.

We see here the trade-off between true positives and false positives as the threshold value increases. We want to maximize true positives while minimizing false positives, but how many false positives we are willing to tolerate depends on our use case. For example, if this dataset represented the web traffic of users logging in to their bank accounts, we would likely want to classify outliers more aggressively, because allowing a fraudulent login attempt is very dangerous while the cost of locking some genuine users out of their accounts is acceptable. On the other hand, if we are filtering spam email, the stakes of letting a spam message through are lower, while blocking genuine emails is very annoying to users, so we would likely classify outliers less aggressively.

Looking at the ROC output, we may want to try out a different threshold in our confusion_matrix.

=> WITH prediction_result AS (SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example', threshold=0.501) AS pred, label!='normal.' AS obs
                              FROM web_traffic_http)
   SELECT confusion_matrix(obs, pred.is_anomaly USING PARAMETERS num_classes=2) OVER() FROM prediction_result;
 actual_class | predicted_false | predicted_true |                       comment                      
--------------+-----------------+----------------+-----------------------------------------------------
 f            |          609850 |           9196 |
 t            |            1935 |           2110 | Of 623091 rows, 623091 were used and 0 were ignored
(2 rows)


Note that while a threshold of 0.501 does produce a better result in terms of correctly predicted outliers, it also classifies more than 9,000 non-outlier rows as outliers. Again, choosing the best threshold value ultimately comes down to the specific use case for which the model is being used.
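If you want a purely statistical starting point, one common heuristic is to pick the threshold that maximizes the difference between the true positive rate and the false positive rate (Youden's J statistic) over the ROC output. The following sketch reuses the ROC query from above to report the top candidates; whether such a point is actually appropriate still depends on the relative costs of false positives and false negatives in your application.

-- One heuristic: find decision boundaries that maximize (true positive rate - false positive rate)
WITH prediction_result AS (SELECT apply_iforest(:columns USING PARAMETERS model_name='iforest_example') AS pred, label!='normal.' AS obs
                           FROM web_traffic_http)
SELECT decision_boundary, true_positive_rate, false_positive_rate,
       true_positive_rate - false_positive_rate AS youden_j
FROM (SELECT roc(obs, pred.anomaly_score USING PARAMETERS num_bins=1000) OVER()
      FROM prediction_result) AS roc_points
ORDER BY youden_j DESC
LIMIT 5;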

Summary

Here, we’ve walked through how to download and import a dataset into Vertica, train an IForest model, and tune the threshold parameter for prediction. Our example uses a web traffic dataset, but there are countless other applications for IForest. Here are just a few examples:

  • Fraud detection – emails, account logins, financial data, etc.
  • Fault detection
  • System health monitoring
  • Event detection in sensor networks
  • Etc.

All sorts of events we experience represent outliers from the norm. Earthquakes, stock spikes, system failures, tsunamis, extreme weather events, manufacturing defects, and diseases are all examples of outliers in vastly different fields. The Isolation Forest algorithm is another powerful tool in your machine learning toolbox for building, testing, and deploying solutions that identify such outliers as quickly and efficiently as possible.