Using Flexible Complex Types

In some cases, the complex types in a data file are structured such that you cannot fully describe their structure in a table definition. For example, you can load one-dimensional arrays into Vertica tables, but you cannot load multi-dimensional arrays, mixed types such as structs containing arrays, and other complex types directly. (You can use some complex types with external tables.)

In other cases, you could fully describe the structure in a table definition but might prefer not to. For example, if the data contains a struct with a very large number of fields, and in your queries you will only read a few of them, you might prefer to not have to enumerate them all individually. And if the data file's schema is still evolving and the type definitions might change, you might prefer not to fully define, and thus have to update, the complete structure. Further, for external tables, a deeply-nested set of structs could exceed the nesting limit for the table if fully specified.

About Flexible Types

Flexible types are a way to store complex or unstructured data as a binary blob in one column, in a way that allows access to individual elements of that data. This is the same approach that Vertica uses with flex tables, which support loading unstructured or semi-structured data. In a flex table, all data from a source is loaded into a single column named __raw__.  From this column you can materialize other columns, such as a specific field in JSON data. Or you use special lookup functions in queries to read values directly out of the __raw__ column. For more about flex tables, see Understanding Flex Tables.

Vertica uses a similar approach with complex types. You can describe the types fully using the ROW, ARRAY, and MAP types in your table definition, where supported. Or you can instead treat a complex type as a flexible type and not fully describe it. Each complex type that you choose to treat this way becomes its own flex-style column. You are not limited to a single column containing all data as in flex tables; instead, you can treat any complex type in Parquet, JSON, or Avro data, no matter how deeply it nests other types, as one flex-like column.

Defining Flexible Columns

To use a flexible complex type, declare the column as LONG VARBINARY. You might also need to set other parameters in the parser, as described in the parser documentation.

Consider a Parquet file with a restaurants table and the following columns:

  • name: varchar
  • cuisine type: varchar
  • location (cities): array[varchar]
  • menu: array of structs, each struct having an item name and a price

This data contains two complex columns, location (an array) and menu (an array of structs). The latter cannot be represented through strong typing, because ARRAY supports only primitive types. The following example defines both columns as flexible columns by using LONG VARBINARY.

=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR, location_city LONG VARBINARY, menu LONG VARBINARY)
    AS COPY FROM '/data/rest*.parquet' 
    PARQUET(allow_long_varbinary_match_complex_type='True');

The allow_long_varbinary_match_complex_type parameter is specific to the Parquet parser. It is required if you define any column as a flexible type. Without this parameter, Vertica tries to match the LONG VARBINARY declaration in the table to a varbinary column in the Parquet file, finds a complex type instead, and reports a data-type mismatch.

You need not treat all complex columns as flexible types. The following definition is also valid.

=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR, location_city ARRAY[VARCHAR], menu LONG VARBINARY) 
    AS COPY FROM '/data/rest*.parquet' 
    PARQUET(allow_long_varbinary_match_complex_type='True');

For Parquet data, you can use the INFER_EXTERNAL_TABLE_DDL function to derive a table definition from a data file. By default, this function uses strong typing for complex types. To get flexible types, specify a value of 'long varbinary' for the optional vertica_type_for_complex_type argument.

Querying Flexible Columns

Flexible columns are stored as LONG VARBINARY, so selecting them directly produces unhelpful results. Instead, use the flex mapping functions to extract values from these columns. The MAPTOSTRING function translates the complex type to JSON, as shown in the following example.

=> SELECT name, location_city, MAPTOSTRING(menu) AS menu FROM restaurants;
       name        |       location_city        |                    menu                                                  
-------------------+---------+------------------+----------------------------------------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | {
    "0": {
        "item": "cheese pizza",
        "price": "$8.25"
    },
    "1": {
        "item": "spinach pizza",
        "price": "$10.50"
    }
}
 Bakersfield Tacos | ["Pittsburgh"]             | {
    "0": {
        "item": "veggie taco",
        "price": "$9.95"
    },
    "1": {
        "item": "steak taco",
        "price": "$10.95"
    }
}
(2 rows)

The menu column is an array of structs. Notice that the output is a set of key/value pairs, with the key being the array index. Bob's Pizzeria has two items on its menu, and each value is a struct. The first item ("0") is a struct with an "item" value of "cheese pizza" and a "price" of "$8.25".

You can use keys to access specific values. The following example selects the first menu item from each restaurant. Note that all keys are strings, even array indexes.

=> SELECT name, location_city, menu['0']['item'] AS item, menu['0']['price'] AS price FROM restaurants;
       name        |       location_city        |     item     | price
-------------------+----------------------------+--------------+-------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | cheese pizza | $8.25
 Bakersfield Tacos | ["Pittsburgh"]             | veggie taco  | $9.95
(2 rows)

Instead of accessing specific indexes, you can use the MAPITEMS function in a subquery to explode a flexible type, as in the following example.

=>  SELECT name, location_city, menu_items['item'], menu_items['price'] 
    FROM (SELECT mapitems(menu, name, location_city) OVER(PARTITION BEST) AS (indexes, menu_items, name, location_city) 
    FROM restaurants) explode_menu;
       name        |       location_city        |  menu_items   | menu_items
-------------------+----------------------------+---------------+------------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | cheese pizza  | $8.25
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | spinach pizza | $10.50
 Bakersfield Tacos | ["Pittsburgh"]             | veggie taco   | $9.95
 Bakersfield Tacos | ["Pittsburgh"]             | steak taco    | $10.95
(4 rows)

For a complete list of flex mapping functions, see Flex Table Functions.