Flexible Complex Types

When defining tables, you can use strongly-typed complex types to fully describe any combination of structs and arrays. However, there are times when you might prefer not to. For example, if the data contains a struct with a very large number of fields, and in your queries you will only read a few of them, you might prefer to not have to enumerate them all individually. If the data file's schema is still evolving and the type definitions might change, you might prefer not to fully define, and thus have to update, the complete structure. Further, a deeply-nested set of structs could exceed the nesting limit for the table if fully specified.

As an alternative to strongly-typed complex types, you can use flexible complex types.

About Flexible Types

Flexible types are a way to store complex or unstructured data as a binary blob in one column, in a way that allows access to individual elements of that data. This is the same approach that Vertica uses with flex tables, which support loading unstructured or semi-structured data. In a flex table, all data from a source is loaded into a single column named __raw__. From this column you can materialize other columns, such as a specific field in JSON data. Or you use special lookup functions in queries to read values directly out of the __raw__ column. For more about flex tables, see Understanding Flex Tables.

Vertica uses a similar approach with complex types. You can describe the types fully using the ROW and ARRAY types in your table definition. Or you can instead treat a complex type as a flexible type and not fully describe it. Each complex type that you choose to treat this way becomes its own flex-style column. You are not limited to a single column containing all data as in flex tables; instead, you can treat any complex type column, no matter how deeply it nests other types, as one flex-like column.

Defining Flexible Columns

To use a flexible complex type, declare the column as LONG VARBINARY. You might also need to set other parameters in the parser, as described in the parser documentation.

Consider a Parquet file with a restaurants table and the following columns:

  • name: varchar
  • cuisine type: varchar
  • location (cities): array[varchar]
  • menu: array of structs, each struct having an item name and a price

This data contains two complex columns, location (an array) and menu (an array of structs). The following example defines both columns as flexible columns by using LONG VARBINARY.

=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR, location_city LONG VARBINARY, menu LONG VARBINARY)
    AS COPY FROM '/data/rest*.parquet' 
    PARQUET(allow_long_varbinary_match_complex_type='True');

The allow_long_varbinary_match_complex_type parameter is specific to the Parquet parser. It is required if you define any column as a flexible type. Without this parameter, Vertica tries to match the LONG VARBINARY declaration in the table to a VARBINARY column in the Parquet file, finds a complex type instead, and reports a data-type mismatch.

You need not treat all complex columns as flexible types. The following definition is also valid.

=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR, location_city ARRAY[VARCHAR], menu LONG VARBINARY) 
    AS COPY FROM '/data/rest*.parquet' 
    PARQUET(allow_long_varbinary_match_complex_type='True');

For Parquet and ORC data, you can use the INFER_EXTERNAL_TABLE_DDL function to derive a table definition from a data file. By default, this function uses strong typing for complex types. To get flexible types, specify a value of 'long varbinary' for the optional vertica_type_for_complex_type argument.

Querying Flexible Columns

Flexible columns are stored as LONG VARBINARY, so selecting them directly produces unhelpful results. Instead, use the flex mapping functions to extract values from these columns. The MAPTOSTRING function translates the complex type to JSON, as shown in the following example.

=> SELECT name, location_city, MAPTOSTRING(menu) AS menu FROM restaurants;
       name        |       location_city        |                    menu                                                  
-------------------+---------+------------------+----------------------------------------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | {
    "0": {
        "item": "cheese pizza",
        "price": "$8.25"
    },
    "1": {
        "item": "spinach pizza",
        "price": "$10.50"
    }
}
 Bakersfield Tacos | ["Pittsburgh"]             | {
    "0": {
        "item": "veggie taco",
        "price": "$9.95"
    },
    "1": {
        "item": "steak taco",
        "price": "$10.95"
    }
}
(2 rows)

The menu column is an array of structs. Notice that the output is a set of key/value pairs, with the key being the array index. Bob's Pizzeria has two items on its menu, and each value is a struct. The first item ("0") is a struct with an "item" value of "cheese pizza" and a "price" of "$8.25".

You can use keys to access specific values. The following example selects the first menu item from each restaurant. Note that all keys are strings, even array indexes.

=> SELECT name, location_city, menu['0']['item'] AS item, menu['0']['price'] AS price FROM restaurants;
       name        |       location_city        |     item     | price
-------------------+----------------------------+--------------+-------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | cheese pizza | $8.25
 Bakersfield Tacos | ["Pittsburgh"]             | veggie taco  | $9.95
(2 rows)

Instead of accessing specific indexes, you can use the MAPITEMS function in a subquery to explode a flexible type, as in the following example.

=>  SELECT name, location_city, menu_items['item'], menu_items['price'] 
    FROM (SELECT mapitems(menu, name, location_city) OVER(PARTITION BEST) AS (indexes, menu_items, name, location_city) 
    FROM restaurants) explode_menu;
       name        |       location_city        |  menu_items   | menu_items
-------------------+----------------------------+---------------+------------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | cheese pizza  | $8.25
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | spinach pizza | $10.50
 Bakersfield Tacos | ["Pittsburgh"]             | veggie taco   | $9.95
 Bakersfield Tacos | ["Pittsburgh"]             | steak taco    | $10.95
(4 rows)

For a complete list of flex mapping functions, see Flex Table Functions.