Machine Learning

New Correlation Matrix Function

Vertica already offers aggregate function CORR for calculating the Pearson correlation between a pair of columns. New function CORR_MATRIX function performs the same calculation for many pair of columns with a single call, to provide insight about the relationship among the attributes of a data set:

CORR_MATRIX(input‑columns) OVER()

Arguments

The input‑columns argument is a comma-separated list of the columns in the input table. The input columns can be of any numeric type or BOOL, but they are internally converted to FLOAT. The number of input columns must be more than 1 and not more than 1600.

Return

CORR_MATRIX returns the correlation matrix in triplet format. That is, each pair-wise correlation is identified by three returned columns: name of the first variable, name of the second variable, and the correlation value of the pair. The function also returns two extra columns: number_of_ignored_input_rows and number_of_processed_input_rows. The value of the fourth/fifth column indicates the number of rows from the input which are ignored/used to calculate the corresponding correlation value. Any input pair with NULL, Inf, or NaN is ignored.

The correlation matrix is symmetric with a value of 1 on all diagonal elements; therefore, it can return only the value of elements above the diagonals—that is, upper triangle. Nevertheless, the function returns the entire matrix to simplify any later operations. Then, the number of output rows is:

(#input‑columns)^2

The first two output columns are of type VARCHAR(128), the third one is of type FLOAT, and the last two are of type INT.

Example

SELECT CORR_MATRIX("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") OVER() FROM iris; variable_name_1 | variable_name_2 | corr_value | number_of_ignored_input_rows | number_of_processed_input_rows ----------------+-----------------+-------------------+------------------------------+-------------------------------- Sepal.Length | Sepal.Width |-0.117569784133002 | 0 | 150 Sepal.Width | Sepal.Length |-0.117569784133002 | 0 | 150 Sepal.Length | Petal.Length |0.871753775886583 | 0 | 150 Petal.Length | Sepal.Length |0.871753775886583 | 0 | 150 Sepal.Length | Petal.Width |0.817941126271577 | 0 | 150 Petal.Width | Sepal.Length |0.817941126271577 | 0 | 150 Sepal.Width | Petal.Length |-0.42844010433054 | 0 | 150 Petal.Length | Sepal.Width |-0.42844010433054 | 0 | 150 Sepal.Width | Petal.Width |-0.366125932536439 | 0 | 150 Petal.Width | Sepal.Width |-0.366125932536439 | 0 | 150 Petal.Length | Petal.Width |0.962865431402796 | 0 | 150 Petal.Width | Petal.Length |0.962865431402796 | 0 | 150 Sepal.Length | Sepal.Length |1 | 0 | 150 Sepal.Width | Sepal.Width |1 | 0 | 150 Petal.Length | Petal.Length |1 | 0 | 150 Petal.Width | Petal.Width |1 | 0 | 150 (16 rows)