CORR_MATRIX

Takes an input relation with numeric columns, and calculates the Pearson Correlation Coefficient between each pair of its input columns. The function is implemented as a Multi-Phase Transform function.

Syntax

CORR_MATRIX ( input_columns ) OVER()

Arguments

input_columns

A comma-separated list of the columns in the input table. The input columns can be of any numeric type or BOOL, but they will be converted internally to FLOAT. The number of input columns must be more than 1 and not more than 1600.

Return

CORR_MATRIX returns the correlation matrix in triplet format. That is, each pair-wise correlation is identified by three returned columns: name of the first variable, name of the second variable, and the correlation value of the pair. The function also returns two extra columns: number_of_ignored_input_rows and number_of_processed_input_rows. The value of the fourth/fifth column indicates the number of rows from the input which are ignored/used to calculate the corresponding correlation value. Any input pair with NULL, Inf, or NaN is ignored.

The correlation matrix is symmetric with a value of 1 on all diagonal elements; therefore, it could return only the value of elements above the diagonals—that is, the upper triangle. Nevertheless, the function returns the entire matrix to simplify any later operations. Then, the number of output rows is:

(#input‑columns)^2

The first two output columns are of type VARCHAR(128), the third one is of type FLOAT, and the last two are of type INT.

Notes

The contents of the OVER clause must be empty.

The function returns no rows when the input table is empty.

When any of X_i and Y_i is NULL, Inf, or NaN, the pair will not be included in the calculation of CORR(X, Y). That is, any input pair with NULL, Inf, or NaN is ignored.

For the pair of (X,X), regardless of the contents of X: CORR(X,X) = 1, number_of_ignored_input_rows = 0, and number_of_processed_input_rows = #input_rows.

When (N*SUMX2 == SUMX*SUMX) or (N*SUMY2 == SUMY*SUMY) then value of CORR(X, Y) will be NULL. In theory it can happen in case of a column with constant values; nevertheless, it may not be always observed because of rounding error.

In the special case where all pair values of (X_i,Y_i) contain NULL, inf, or NaN, and X != Y: CORR(X,Y)=NULL.

Example

This example uses the iris dataset*.

 

SELECT CORR_MATRIX("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") OVER() FROM iris;
variable_name_1 | variable_name_2 | corr_value        | number_of_ignored_input_rows | number_of_processed_input_rows
----------------+-----------------+-------------------+------------------------------+--------------------------------
Sepal.Length    | Sepal.Width     |-0.117569784133002 | 0                            | 150
Sepal.Width     | Sepal.Length    |-0.117569784133002 | 0                            | 150
Sepal.Length    | Petal.Length    |0.871753775886583  | 0                            | 150
Petal.Length    | Sepal.Length    |0.871753775886583  | 0                            | 150
Sepal.Length    | Petal.Width     |0.817941126271577  | 0                            | 150
Petal.Width     | Sepal.Length    |0.817941126271577  | 0                            | 150
Sepal.Width     | Petal.Length    |-0.42844010433054  | 0                            | 150
Petal.Length    | Sepal.Width     |-0.42844010433054  | 0                            | 150
Sepal.Width     | Petal.Width     |-0.366125932536439 | 0                            | 150
Petal.Width     | Sepal.Width     |-0.366125932536439 | 0                            | 150
Petal.Length    | Petal.Width     |0.962865431402796  | 0                            | 150
Petal.Width     | Petal.Length    |0.962865431402796  | 0                            | 150
Sepal.Length    | Sepal.Length    |1                  | 0                            | 150
Sepal.Width     | Sepal.Width     |1                  | 0                            | 150
Petal.Length    | Petal.Length    |1                  | 0                            | 150
Petal.Width     | Petal.Width     |1                  | 0                            | 150
(16 rows)

 

* Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

http://archive.ics.uci.edu/ml/datasets/Iris