Detecting Outliers

Before you perform an in-depth analysis of your data, you should first remove the outliers from the data. Outliers are data points that greatly differ from other similar data points. If you leave outliers in your data, then you risk misclassifying data, introducing bias, or incorrect calculations.

For this example we will use the baseball data set found on the Vertica GitHub page.

  1. Create a table.
    => CREATE TABLE baseball (id identity, first_name varchar(50), last_name varchar(50), dob DATE,
                               team varchar(20), hr int, hits int, avg float, salary float);
    CREATE TABLE
  2. Load the data.
    => COPY baseball FROM 'baseball.csv' DELIMITER ',';
     Rows Loaded
    -------------
            1000
    (1 row)
    
  3. Detect the outliers based on the hr, hits, and salary columns. We will use the id and team columns as our key columns. The DETECT_OUTLIERS function will create a view containing the outliers with the input and key columns. Before you use the DETECT_OUTLIERS function, make sure that you are a superuser or have CREATE privileges for the schema and SELECT privileges for the table.
    => SELECT DETECT_OUTLIERS('baseball_hr_hits_salary_outliers', 'baseball', 'hr, hits, salary', 'robust_zscore'
                             USING PARAMETERS outlier_threshold=3.0, key_columns='id, team');
         DETECT_OUTLIERS
    --------------------------
     Detected 5 outliers
    
    (1 row)
  4. Query the output view containing the outliers.
    => SELECT * FROM baseball_hr_hits_salary_outliers;
     id  |   team    |   hr    |  hits   |        salary
    -----+-----------+---------+---------+----------------------
      73 | Mauv      |    8888 |      34 | 9.99999999341471e+16
      89 | Pink      |  273333 | 4490260 |  4.4444444444828e+17
     147 | Maroon    | 1100037 |     230 |           9000101403
      87 | Green     |      80 |   64253 |          16032567.12
     222 | Goldenrod | 3200000 |     216 |          37008899.76
    (5 rows)
  5. Create a view omitting the outliers from the table.
    => CREATE VIEW clean_baseball AS
       SELECT * FROM baseball WHERE id NOT IN (SELECT id FROM baseball_hr_hits_salary_outliers);
    CREATE VIEW
  6. Perform your analysis.

See Also