Arrays and Sets (Collections)

Tables can include one-dimensional collections (arrays or sets) of primitive types. An array is an ordered collection of elements that allows duplicate values, and a set is an unordered collection of unique values.

Consider an orders table with columns for order keys, customer keys, product keys, order prices, and order dates, some of which contain arrays. A basic query in Vertica returns the following:

=> SELECT * from orders LIMIT 5;
 orderkey | custkey |        prodkey         |         orderprices         | orderdate  
----------+---------+------------------------+-----------------------------+------------
    19626 |      91 | ["P1262","P68","P101"] | ["192.59","49.99","137.49"] | 2021-03-14
    25646 |     716 | ["P997","P31","P101"]  | ["91.39","29.99","147.49"]  | 2021-03-14
    25647 |     716 | ["P12"]                | ["8.99"]                    | 2021-03-14
    19743 |     161 | ["P68","P101"]         | ["49.99","137.49"]          | 2021-03-15
    19888 |     241 | ["P1262","P101"]       | ["197.59","142.49"]         | 2021-03-15
(5 rows)

As shown in this example, array values are returned in JSON format.

Set values are also returned in JSON array format:

=> SELECT custkey,email_addrs FROM customers LIMIT 4;
 custkey |                           email_addrs
---------+------------------------------------------------------------------------
 342176  | ["joe.smith@example.com"]
 342799  | ["bob@example.com","robert.jones@example.com"]
 342845  | ["br92@cs.example.edu"]
 342321  | ["789123@example-isp.com","sjohnson@eng.example.com","sara@johnson.example.name"]

Vertica supports several functions to manipulate arrays and sets.

Consider the same orders table, which includes an array of product keys for all items purchased in a single order. You can use the APPLY_COUNT_ELEMENTS function to find out how many items each order contains. The function returns the number of non-null elements in the prodkey array:

=> SELECT apply_count_elements(prodkey) FROM orders LIMIT 5;
apply_count_elements
--------------------
3
2
2
3
1
(5 rows)

Vertica also supports aggregate functions for collection elements. The same table has an orderprices column that holds an array of prices for the items purchased in a single order. You can use the APPLY_SUM function to find the total amount spent on each order:

=> SELECT apply_sum(orderprices) from orders LIMIT 5;
apply_sum
-----------
380.07
187.48
340.08
268.87
  8.99
(5 rows)

See Collection Functions for a comprehensive list of functions.
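To make the semantics of these two functions concrete, the following is a small Python model, not Vertica code. The helper names count_elements and sum_elements are hypothetical stand-ins for APPLY_COUNT_ELEMENTS and APPLY_SUM. The sketch assumes that APPLY_SUM skips null elements, and it uses Decimal so that the price strings add up without floating-point rounding.

```python
from decimal import Decimal


def count_elements(collection):
    """Model of APPLY_COUNT_ELEMENTS: count the non-null elements."""
    return sum(1 for element in collection if element is not None)


def sum_elements(collection):
    """Model of APPLY_SUM (assumption: null elements are skipped)."""
    return sum(
        (Decimal(element) for element in collection if element is not None),
        Decimal("0"),
    )


# Values taken from order 19626 in the example table.
prodkey = ["P1262", "P68", "P101"]
orderprices = ["192.59", "49.99", "137.49"]

print(count_elements(prodkey))    # 3
print(sum_elements(orderprices))  # 380.07
```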

You can include both column names and literal values in queries. The following example returns the product keys for orders where the number of items in each order is greater than two:

=> SELECT prodkey FROM orders WHERE apply_count_elements(prodkey)>2;
      prodkey
------------------------
 ["P1262","P68","P101"]
 ["P997","P31","P101"]
(2 rows)

Consider a more complex query that joins two tables, cust and orders, and returns the customer key, name, email, order key, product keys, and order prices for orders whose total is greater than 150:

=> SELECT custkey, cust_custname, cust_email, orderkey, prodkey, orderprices from orders
 JOIN cust ON custkey = cust_custkey
 WHERE apply_sum(orderprices)>150 ;
custkey|  cust_custname   |        cust_email         |   orderkey   |                  prodkey                  |        orderprices
-------+------------------+---------------------------+--------------+-------------------------------------------+---------------------------
342799 | "Ananya Patel"   | "ananyapatel98@gmail.com" | "113-341987" | ["MG-7190","VA-4028","EH-1247","MS-7018"] | [60.00,67.00,22.00,14.99]
342845 | "Molly Benton"   | "molly_benton@gmail.com"  | "111-952000" | ["ID-2586","IC-9010","MH-2401","JC-1905"] | [22.00,35.00,90.00,12.00]
342989 | "Natasha Abbasi" | "natsabbasi@live.com"     | "111-685238" | ["HP-4024"]                               | [650.00]
342176 | "Jose Martinez"  | "jmartinez@hotmail.com"   | "113-672238" | ["HP-4768","IC-9010"]                     | [899.00,60.00]
342845 | "Molly Benton"   | "molly_benton@gmail.com"  | "113-864153" | ["AE-7064","VA-4028","GW-1808"]           | [72.00,99.00,185.00]
(5 rows)

Ordering and Grouping

You can use Comparison Operators with collections. Null collections are ordered last. Otherwise, collections are compared element by element until there is a mismatch, and then they are ordered based on the non-matching elements. If all elements are equal up to the length of the shorter collection, then the shorter collection is ordered first.

You can use collections in the ORDER BY and GROUP BY clauses of queries. The following example shows ordering query results by an array column:

=> CREATE TABLE employees (id INT, department VARCHAR(50), grants ARRAY[VARCHAR], grant_values ARRAY[INT]);

=> COPY employees FROM STDIN;
42|Physics|[US-7376,DARPA-1567]|[65000,135000]
36|Physics|[US-7376,DARPA-1567]|[10000,25000]
33|Physics|[US-7376]|[30000]
36|Astronomy|[US-7376,DARPA-1567]|[5000,4000]
\.

=> SELECT * FROM employees ORDER BY grant_values;
 id | department |          grants          |  grant_values
----+------------+--------------------------+----------------
 36 | Astronomy  | ["US-7376","DARPA-1567"] | [5000,4000]
 36 | Physics    | ["US-7376","DARPA-1567"] | [10000,25000]
 33 | Physics    | ["US-7376"]              | [30000]
 42 | Physics    | ["US-7376","DARPA-1567"] | [65000,135000]
(4 rows)
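The ordering rules described in this section can be modeled in a few lines of Python. This is only an illustrative sketch, not how Vertica implements the comparison. The function name collection_sort_key is hypothetical, and the model assumes that the elements themselves are non-null and mutually comparable.

```python
def collection_sort_key(collection):
    """Sort key modeling the documented collection ordering.

    Null collections sort last. Non-null collections are compared element
    by element; when one is a prefix of the other, the shorter one sorts
    first. Python list comparison already behaves this way.
    """
    if collection is None:
        return (1, [])
    return (0, list(collection))


grant_values = [
    [65000, 135000],
    [10000, 25000],
    [30000],
    None,
    [5000, 4000],
    [5000],
]

ordered = sorted(grant_values, key=collection_sort_key)
for row in ordered:
    print(row)
```

The [5000] entry and the null entry are not in the employees table; they are added here to show the shorter-first and null-last rules.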

The following example queries the same table using GROUP BY:

=> SELECT department, grants, SUM(apply_sum(grant_values)) FROM employees GROUP BY grants, department;
 department |          grants          |  SUM
------------+--------------------------+--------
 Physics    | ["US-7376","DARPA-1567"] | 235000
 Astronomy  | ["US-7376","DARPA-1567"] |   9000
 Physics    | ["US-7376"]              |  30000
(3 rows)
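As a cross-check of the grouped totals, the same aggregation can be sketched in Python. This is a model of the query's logic, not Vertica code. Because a whole array acts as a grouping key, the sketch converts each grants list to a tuple so that it can be used as a dictionary key.

```python
from collections import defaultdict

# Rows from the employees table: id, department, grants, grant_values.
employees = [
    (42, "Physics", ["US-7376", "DARPA-1567"], [65000, 135000]),
    (36, "Physics", ["US-7376", "DARPA-1567"], [10000, 25000]),
    (33, "Physics", ["US-7376"], [30000]),
    (36, "Astronomy", ["US-7376", "DARPA-1567"], [5000, 4000]),
]

# GROUP BY grants, department with SUM(apply_sum(grant_values)).
totals = defaultdict(int)
for _id, department, grants, grant_values in employees:
    totals[(department, tuple(grants))] += sum(grant_values)

for (department, grants), total in totals.items():
    print(department, list(grants), total)
```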

See the "Functions and Operators" section on the ARRAY reference page for information on how Vertica orders collections. (The same information is also on the SET reference page.)

Null-Handling

Null semantics for collections are consistent with normal columns in most regards. See NULL Sort Order for more information on null-handling.

The null-safe equality operator (<=>) behaves differently from equality (=) when a collection is null rather than empty. Comparing a collection to NULL with = returns NULL (the result is undefined), while <=> always returns true or false.

=> SELECT ARRAY[1,3] = NULL;
?column?
----------

(1 row)

=> SELECT ARRAY[1,3] <=> NULL;
 ?column?
----------
 f
(1 row)

In the following example, the grants column in the table is null for employee 99.

=> SELECT grants = NULL FROM employees WHERE id=99;
 ?column?
----------

(1 row)

=> SELECT grants <=> NULL FROM employees WHERE id=99;
 ?column?
----------
 t
(1 row)

Empty arrays are not null and behave as expected.

=> SELECT ARRAY[]::ARRAY[INT] = ARRAY[]::ARRAY[INT];
 ?column?
----------
 t
(1 row)

Out-of-bound indexes into collections return NULL.

=> SELECT prodkey[2] from orders LIMIT 4;
prodkey
---------

"EH-1247"
"MH-2401"

(4 rows)

Array indexes are zero-based, so index 2 refers to the third element. The query returns NULL for the first and fourth rows because those arrays have fewer than three elements.
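The indexing behavior can be sketched in Python as follows. The function name element_at is hypothetical. The model uses zero-based positions, and it treats negative indexes as out of bounds, which is an assumption rather than documented behavior.

```python
def element_at(collection, index):
    """Return the element at a zero-based index, or None when out of bounds.

    Assumption: a negative index, or indexing a null collection, also
    yields None.
    """
    if collection is None or not 0 <= index < len(collection):
        return None
    return collection[index]


print(element_at(["HP-4024"], 2))                                    # None
print(element_at(["MG-7190", "VA-4028", "EH-1247", "MS-7018"], 2))  # EH-1247
```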

Casting

When the data type of an expression value is unambiguous, it is implicitly coerced to match the expected data type. However, the data type of an expression can be ambiguous. For example, a date literal can be interpreted as either a string or a timestamp. Write an explicit cast to avoid the default:

=> SELECT apply_count_elements(ARRAY['2019-01-20','2019-02-12','2019-03-23']::ARRAY[TIMESTAMP]);
apply_count_elements
--------------------
 3
(1 row)

You can cast arrays or sets of one scalar type to arrays or sets of other (compatible) types, following the same rules as for casting scalar values. Casting a collection casts each element of that collection. Casting an array to a set also removes any duplicates.

An array or set with a single null element must be explicitly cast because no type can be inferred.
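The element-wise nature of collection casts can be illustrated with a short Python model. This is a sketch of the rules stated above, not Vertica's implementation. The names cast_array and cast_array_to_set are hypothetical, and the model assumes that null elements stay null after the cast.

```python
from datetime import datetime


def cast_array(collection, cast_element):
    """Cast each element; null elements are assumed to stay null."""
    return [None if e is None else cast_element(e) for e in collection]


def cast_array_to_set(collection, cast_element):
    """Cast each element, then drop duplicates by building a set."""
    return set(cast_array(collection, cast_element))


dates = cast_array(["2019-01-20", "2019-02-12", "2019-03-23"],
                   datetime.fromisoformat)
print(len(dates))                                  # 3
print(cast_array(["1", "2", "2"], int))            # [1, 2, 2]
print(cast_array_to_set(["1", "2", "2"], int))     # {1, 2}
```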

See Data Type Coercion for more information on casting for data types.

Exploding Array Columns

You can simplify queries on elements stored in 1D arrays with EXPLODE, a function that takes array columns from a table and expands them. For each exploded array, the results include two columns, one for the array element index, and one for the value at that position. If the function explodes a single array, these columns are named position and value by default. If the function explodes two or more arrays, the columns for each array are named pos_column-name and val_column-name.

The function explodes the first N array columns in the parameter list, defaulting to one, and passes all other columns through.

The following examples illustrate using EXPLODE() with the OVER(PARTITION BEST) clause.

Consider an orders table with columns for order keys, customer keys, product keys, order prices, and email addresses, with some containing arrays. A basic query in Vertica results in the following:

=> SELECT orderkey, custkey, prodkey, orderprices, email_addrs FROM orders LIMIT 5;
  orderkey  | custkey |                    prodkey                    |            orderprices            |                                                  email_addrs
------------+---------+-----------------------------------------------+-----------------------------------+----------------------------------------------------------------------------------------------------------------
 113-341987 |  342799 | ["MG-7190 ","VA-4028 ","EH-1247 ","MS-7018 "] | ["60.00","67.00","22.00","14.99"] | ["bob@example.com","robert.jones@example.com"]
 111-952000 |  342845 | ["ID-2586 ","IC-9010 ","MH-2401 ","JC-1905 "] | ["22.00","35.00",null,"12.00"]    | ["br92@cs.example.edu"]
 111-345634 |  342536 | ["RS-0731 ","SJ-2021 "]                       | ["50.00",null]                    | [null]
 113-965086 |  342176 | ["GW-1808 "]                                  | ["108.00"]                        | ["joe.smith@example.com"]
 111-335121 |  342321 | ["TF-3556 "]                                  | ["50.00"]                         | ["789123@example-isp.com","alexjohnson@example.com","monica@eng.example.com","sara@johnson.example.name",null]
(5 rows)

This example expands the orderprices column for a specified customer and sorts the results by price in ascending order. The custkey and email_addrs columns are repeated for each array element.

=> SELECT EXPLODE(orderprices, custkey, email_addrs) OVER(PARTITION BEST) AS (position, orderprices, custkey, email_addrs)
   FROM orders WHERE custkey='342845' ORDER BY orderprices;
 position | orderprices | custkey |         email_addrs
----------+-------------+---------+------------------------------
        2 |             |  342845 | ["br92@cs.example.edu",null]
        3 |       12.00 |  342845 | ["br92@cs.example.edu",null]
        0 |       22.00 |  342845 | ["br92@cs.example.edu",null]
        1 |       35.00 |  342845 | ["br92@cs.example.edu",null]
(4 rows)

When you explode an array column that contains null values, the null values are displayed as empty.
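The shape of a single-array explode can be sketched in Python. The function name explode is a hypothetical stand-in for the SQL function. The model assumes a non-null input array and keeps null elements as rows with a null value, matching the output above.

```python
def explode(array, *passthrough):
    """Yield one (position, value, *passthrough) row per array element.

    Positions start at 0. Null elements are kept as rows whose value is
    None. Pass-through values are repeated on every row.
    """
    for position, value in enumerate(array):
        yield (position, value, *passthrough)


rows = list(explode(["22.00", "35.00", None, "12.00"], 342845))
for row in rows:
    print(row)
```

This sketch returns rows in array order; the SQL example differs only because it adds an ORDER BY clause.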

You can explode more than one array column by specifying the explode_count parameter.

=> SELECT EXPLODE(orderkey, prodkey, orderprices USING PARAMETERS explode_count=2)
OVER(PARTITION BEST)
AS (orderkey,pk_idx,pk_val,ord_idx,ord_val)
FROM orders
WHERE orderkey='113-341987';
  orderkey  | pk_idx |  pk_val  | ord_idx | ord_val
------------+--------+----------+---------+---------
 113-341987 |      0 | MG-7190  |       0 |   60.00
 113-341987 |      0 | MG-7190  |       1 |   67.00
 113-341987 |      0 | MG-7190  |       2 |   22.00
 113-341987 |      0 | MG-7190  |       3 |   14.99
 113-341987 |      1 | VA-4028  |       0 |   60.00
 113-341987 |      1 | VA-4028  |       1 |   67.00
 113-341987 |      1 | VA-4028  |       2 |   22.00
 113-341987 |      1 | VA-4028  |       3 |   14.99
 113-341987 |      2 | EH-1247  |       0 |   60.00
 113-341987 |      2 | EH-1247  |       1 |   67.00
 113-341987 |      2 | EH-1247  |       2 |   22.00
 113-341987 |      2 | EH-1247  |       3 |   14.99
 113-341987 |      3 | MS-7018  |       0 |   60.00
 113-341987 |      3 | MS-7018  |       1 |   67.00
 113-341987 |      3 | MS-7018  |       2 |   22.00
 113-341987 |      3 | MS-7018  |       3 |   14.99
(16 rows)
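The 16 rows arise because exploding two arrays pairs every element of the first with every element of the second. The following Python sketch models that cross product. The function name explode_two is hypothetical, and the row order shown is only one possible order, since a SQL result without ORDER BY has no guaranteed order.

```python
from itertools import product


def explode_two(first, second, *passthrough):
    """Pair every (index, value) of one array with every one of the other."""
    return [
        (*passthrough, i, a, j, b)
        for (i, a), (j, b) in product(enumerate(first), enumerate(second))
    ]


prodkey = ["MG-7190", "VA-4028", "EH-1247", "MS-7018"]
orderprices = ["60.00", "67.00", "22.00", "14.99"]

rows = explode_two(prodkey, orderprices, "113-341987")
print(len(rows))  # 16
print(rows[0])    # ('113-341987', 0, 'MG-7190', 0, '60.00')
```

Two arrays of four elements each therefore produce 4 x 4 = 16 rows.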

You might store data of a primitive type in a set, a collection of unordered, unique elements. To explode a set column, you must explicitly cast the set column as an array column, or you receive an error. The following example explodes the email_addrs set column for a specified customer:

=> SELECT EXPLODE(email_addrs::ARRAY[VARCHAR], custkey) OVER(PARTITION BEST) AS (position, email_addrs, custkey)
   FROM orders WHERE custkey='342321';
 position |        email_addrs        | custkey
----------+---------------------------+---------
        0 | 789123@example-isp.com    |  342321
        1 | alexjohnson@example.com   |  342321
        2 | monica@eng.example.com    |  342321
        3 | sara@johnson.example.name |  342321
        4 |                           |  342321
(5 rows)

See ARRAY and SET for more information on the implementation of these data types in Vertica.

Imploding and Filtering Arrays

The IMPLODE function is the inverse of EXPLODE: it takes a column and produces an array containing the column's values. Combined with GROUP BY, it can be used to reverse an explode operation.

You can use EXPLODE and IMPLODE together to filter array values. For example, in a set of orders where prices are an array value, you might want to query only the orders with prices below a certain threshold. Consider the following table:

=> SELECT * FROM orders;
 key |      prices
-----+-------------------
 567 | [27.99,18.99]
 789 | [108.0]
 345 | [14.99,35.99]
 123 | [60.0,67.0,14.99]
(4 rows)

You can use EXPLODE to expand the arrays. For clarity, this example creates a new table to hold the results. More typically, you would use EXPLODE and IMPLODE in subqueries instead of creating intermediate tables.

=> CREATE TABLE exploded AS
SELECT EXPLODE(prices,key) OVER (PARTITION BEST)
AS (position, itemprice, itemkey) FROM orders;

=> SELECT * FROM exploded;
 position | itemprice | itemkey
----------+-----------+---------
        0 |       108 |     789
        1 |     35.99 |     345
        0 |     14.99 |     345
        0 |     27.99 |     567
        0 |        60 |     123
        1 |     18.99 |     567
        1 |        67 |     123
        2 |     14.99 |     123
(8 rows)

You can now filter the exploded prices:

=> CREATE TABLE filtered AS
    SELECT position, itemprice, itemkey FROM exploded WHERE itemprice < 50.00;

=> SELECT * FROM filtered;
 position | itemprice | itemkey
----------+-----------+---------
        0 |     14.99 |     345
        0 |     27.99 |     567
        1 |     18.99 |     567
        1 |     35.99 |     345
        2 |     14.99 |     123
(5 rows)

Finally, you can use IMPLODE to reconstruct the arrays:

=> SELECT itemkey AS key, IMPLODE(itemprice) AS prices
FROM filtered
GROUP BY itemkey ORDER BY itemkey;
 key |    prices
-----+---------------
 123 | [14.99]
 345 | [14.99,35.99]
 567 | [27.99,18.99]
(3 rows)
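The three steps (explode, filter, implode) can be modeled end to end in Python. This is an illustrative sketch of the logic only. The function name filter_prices is hypothetical. The model keeps each array's original element order, which a SQL IMPLODE might not do unless the input is explicitly ordered.

```python
def filter_prices(orders, threshold):
    """Explode price arrays, keep prices below threshold, rebuild arrays."""
    # Explode: one (position, itemprice, itemkey) row per array element.
    exploded = [
        (position, price, key)
        for key, prices in orders.items()
        for position, price in enumerate(prices)
    ]
    # Filter: keep only rows whose price is below the threshold.
    filtered = [row for row in exploded if row[1] < threshold]
    # Implode: group the remaining prices by key.
    imploded = {}
    for _position, price, key in filtered:
        imploded.setdefault(key, []).append(price)
    return dict(sorted(imploded.items()))


orders = {
    567: [27.99, 18.99],
    789: [108.0],
    345: [14.99, 35.99],
    123: [60.0, 67.0, 14.99],
}

result = filter_prices(orders, 50.00)
for key, prices in result.items():
    print(key, prices)
```

Order 789 does not appear in the result because its only price exceeds the threshold, so no rows remain to implode for that key.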