Reading Structs as Expanded Columns

Use this method of reading structs for ORC files. For Parquet files, see Reading Complex Types from Parquet Files. Expanded columns are deprecated for Parquet files.

You can define columns in external tables to use the values from structs. Instead of reading the struct into a single column, you define columns for each field. For example, an address column could use a struct with strings for the street address, city/state, and postal code, such as { "street":"150 Cambridgepark Dr.", "city":"Cambridge MA", "postalcode":"02140"}. To read this struct, you define columns for the street address, city, and postal code.

The following example defines an external table for addresses:

=> CREATE EXTERNAL TABLE addresses (street VARCHAR, city VARCHAR, postalcode VARCHAR)
    AS COPY FROM '...' ORC;

A struct column might be just one of several columns in the data. In the following example, the "customer" table includes a name (VARCHAR), address (struct), and account number (INT).

=> CREATE EXTERNAL TABLE customers (name VARCHAR, street VARCHAR, city VARCHAR,
    postalcode VARCHAR, account INT)
    AS COPY FROM '...' ORC;

Within Vertica the structs are flattened; from the definition alone you cannot tell that some values come from structs.

ORC and Parquet files do not store field names for structs, only values. When defining columns for struct values, you must use the same order that was used in the source schema. Because you must also define all of the columns represented in the data, your external table will have one column for each primitive-type column and one column for each field in each struct, in the order that they appear in the data.

Handling Nulls

Struct fields, like columns of primitive types, can have null values. As with primitive types, a null struct field produces a null value for the corresponding column in the external table. In the following example, the last row in the data has an address struct with a null value for the street.

=> CREATE EXTERNAL TABLE customers (name varchar, street varchar, 
    city varchar, zipcode int, accountID int) 
    as copy from '...' orc;
CREATE TABLE
				
=> SELECT * FROM customers ORDER BY accountID;
        name        |       street        |   city   | zipcode | accountID
--------------------+---------------------+----------+---------+-----------
 Missy Cooper       | 911 San Marcos St   | Austin   |   73344 |        17
 Sheldon Cooper     | 100 Main St Apt 4B  | Pasadena |   91001 |       139
 Leonard Hofstadter | 100 Main St Apt 4A  | Pasadena |   91001 |       142
 Leslie Winkle      | 23 Fifth Ave Apt 8C | Pasadena |   91001 |       198
 Raj Koothrappali   |                     | Pasadena |   91001 |       294
(5 rows)		

Null field values appear when a struct is present but is missing field values. If the entire struct is null in the source data, by default Vertica rejects the row. A null struct is different from a struct where all fields are null.

The data used in the previous example contains a sixth row with a null address struct. To expand null structs instead of rejecting the row, use the flatten_complex_type_nulls parameter to the PARQUET() or ORC() clause:

=> CREATE EXTERNAL TABLE customers (...)
    AS COPY FROM '...' ORC(flatten_complex_type_nulls='True');

With this parameter set, Vertica reports null values for all fields when the struct itself is null. It does not report a null value in the struct column. If you use this parameter, note that you cannot tell from the query results whether the struct column was null or a struct with no values was present in the data.

With this table definition, the query returns all six rows:

=> CREATE EXTERNAL TABLE customers (name varchar, street varchar, 
    city varchar, zipcode int, accountID int) 
    as copy from '...' orc(flatten_complex_type_nulls='True');
CREATE TABLE
				
=> SELECT * FROM customers ORDER BY accountID;
        name        |       street        |   city   | zipcode | accountID
--------------------+---------------------+----------+---------+-----------
 Missy Cooper       | 911 San Marcos St   | Austin   |   73344 |        17
 Sheldon Cooper     | 100 Main St Apt 4B  | Pasadena |   91001 |       139
 Leonard Hofstadter | 100 Main St Apt 4A  | Pasadena |   91001 |       142
 Leslie Winkle      | 23 Fifth Ave Apt 8C | Pasadena |   91001 |       198
 Raj Koothrappali   |                     | Pasadena |   91001 |       294
 Stuart Bloom       |                     |          |         |       482
(6 rows)					

Example

Consider a table defined in Hive as follows:

create external table orderupc_with_structs
(
    struct1 struct<
        cal_dt:varchar(200) ,
        cmc_chn_str_nbr:int ,
        lane_nbr:int >,
    time_key int ,
    struct2 struct<
        tran_nbr:bigint ,
        cmc_chn_nbr:int ,
        str_tndr_typ_nbr:int >,
    struct3 struct<
        ord_amt:float ,
        ss_ord_amt:float ,
        ndc_ord_amt:float >,
    unique_upc_qty int 
)
stored as orc location '/user/data/orderupc_with_structs'
;

The data contains three struct columns with a total of nine fields, and two other (non-struct) columns. A Hive query returns the following results:

$ hive -e "select * from orderupc_with_structs limit 1" 2> /dev/null | head -1

{"cal_dt":"2006-01-08","cmc_chn_str_nbr":85,"lane_nbr":1}       35402   
{"tran_nbr":60080008501335402,"cmc_chn_nbr":2,"str_tndr_typ_nbr":1}     
{"ord_amt":19.74,"ss_ord_amt":19.74,"ndc_ord_amt":0.0} 16

In Vertica you define an external table with 11 expanded columns, one for each field, as follows:

=> CREATE EXTERNAL TABLE orderupc
(
    cal_dt varchar(200) ,
    cmc_chn_str_nbr int ,
    lane_nbr int ,
    time_key int ,
    tran_nbr bigint ,
    cmc_chn_nbr int ,
    str_tndr_typ_nbr int ,
    ord_amt float ,
    ss_ord_amt float ,
    ndc_ord_amt float ,
    unique_upc_qty int
)
AS COPY FROM 'hdfs:///user/data/orderupc_with_structs/*' ORC;

The struct names ("struct1" through "struct3") are not represented in these columns; instead the fields from each struct are individually defined. The following figure illustrates the mappings:

A Vertica query shows the following results:

=> SELECT * FROM orderupc LIMIT 1
   cal_dt   | cmc_chn_str_nbr | lane_nbr | time_key |     tran_nbr      |
 cmc_chn_nbr | str_tndr_typ_nbr |     ord_amt      |    ss_ord_amt    | 
ndc_ord_amt | unique_upc_qty 

------------+-----------------+----------+----------+-------------------+
-------------+------------------+------------------+------------------+
-------------+----------------

2006-03-30 |           34362 |        3 |    47905 |  60893436201547905 |
         214 |                0 | 30.7399997711182 | 30.7399997711182 |
           0 |             10 

(1 row)