Complex Data Types for favroparser
If you load Avro data into a Flex table, Vertica loads all data into the __raw__ (VMap) column, including complex types found in the data. If you load Avro data into a columnar table, you specify individual columns but you can load an Avro complex type into a Vertica flexible complex type without fully specifying the schema. You define these columns in the table as LONG VARBINARY, and you can use Flex functions to extract values from it. See Flexible Complex Types.
For one-dimensional arrays of scalar types in columnar tables, you can also define the column as an array instead of using flexible complex type.
The favroparser
supports the following complex data types, indicated in the Avro file with the "type" field:
Flexible Types
The following example demonstrates the use of flexible complex types. Consider an Avro file containing the following data:
{ "name" : "Bob's pizzeria", "cuisine" : "Italian", "location_city" : ["Cambridge", "Pittsburgh"], "menu" : [{"item" : "cheese pizza", "price" : "$8.25"}, {"item" : "spinach pizza", "price" : "$10.50"}] } { "name" : "Bakersfield Tacos", "cuisine" : "Mexican", "location_city" : ["Pittsburgh"], "menu" : [{"item" : "veggie taco", "price" : "$9.95"}, {"item" : "steak taco", "price" : "$10.95"}] }
Create a table, specifying the location_city and menu columns (the complex types) as LONG VARBINARY:
=> CREATE TABLE restaurant(name VARCHAR, cuisine VARCHAR, location_city LONG VARBINARY, menu LONG VARBINARY);
Load the data using the Avro parser:
=> COPY restaurant FROM '/data/restaurant.avro' PARSER favroparser (flatten_maps=false, flatten_arrays=false);
Because we loaded the complex columns as LONG VARBINARY, directly querying the columns produces binary results:
=> SELECT * FROM restaurant; name | cuisine | location_city | menu -------------------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Bob's pizzeria | Italian | \001\000\000\000\037\000\000\000\002\000\000\000\014\000\000\000\025\000\000\000CambridgePittsburgh\002\000\000\000\014\000\000\000\015\000\000\00001 | \001\000\000\000\202\000\000\000\002\000\000\000\014\000\000\000F\000\000\000\001\000\000\000\035\000\000\000\002\000\000\000\014\000\000\000\030\000\000\000cheese pizza$8.25\002\000\000\000\014\000\000\000\020\000\000\000itemprice\001\000\000\000\037\000\000\000\002\000\000\000\014\000\000\000\031\000\000\000spinach pizza$10.50\002\000\000\000\014\000\000\000\020\000\000\000itemprice\002\000\000\000\014\000\000\000\015\000\000\00001 Bakersfield Tacos | Mexican | \001\000\000\000\022\000\000\000\001\000\000\000\010\000\000\000Pittsburgh\001\000\000\000\010\000\000\0000 | \001\000\000\000~\000\000\000\002\000\000\000\014\000\000\000E\000\000\000\001\000\000\000\034\000\000\000\002\000\000\000\014\000\000\000\027\000\000\000veggie taco$9.95\002\000\000\000\014\000\000\000\020\000\000\000itemprice\001\000\000\000\034\000\000\000\002\000\000\000\014\000\000\000\026\000\000\000steak taco$10.95\002\000\000\000\014\000\000\000\020\000\000\000itemprice\002\000\000\000\014\000\000\000\015\000\000\00001 (2 rows)
However, you can use Flex functions and direct access (through indices) to return readable values:
=> SELECT MAPTOSTRING(location_city), MAPTOSTRING(menu) FROM restaurant; maptostring | maptostring -------------------------------------------------+-------------------------------------------------------- { "0": "Cambridge", "1": "Pittsburgh" } | { "0": { "item": "cheese pizza", "price": "$8.25" }, "1": { "item": "spinach pizza", "price": "$10.50" } } { "0": "Pittsburgh" } | { "0": { "item": "veggie taco", "price": "$9.95" }, "1": { "item": "steak taco", "price": "$10.95" } } (2 rows) => SELECT menu['0']['item'] FROM restaurant; menu -------------- cheese pizza veggie taco (2 rows)
The COPY statement shown in this example sets flatten_maps
to false. Without that change, the keys for the complex columns would not work as expected, because record and array keys would be "flattened" at the top level. Querying menu['0']['item'] would produce no results. Instead, query flattened values as in the following example:
=> SELECT menu['0.item'] FROM restaurant; menu -------------- veggie taco cheese pizza (2 rows)
Flattening directives apply to the entire COPY statement. You cannot flatten some columns and not others, or prevent flattening values in a complex column that is itself within a flattened flex table.
Records
Records have the following attributes:
Attribute | Description |
---|---|
name | A JSON string for the name of the record |
fields |
A JSON array used to list fields. Each field is a JSON object:
|
The name
of each field is used as a virtual column name. If flatten_records = true
and several nesting levels are present, Vertica concatenates the record names to create the key_name
, as follows:
{ "type": "record", "name": "Profile", "fields" : [ {"name": "UserName", "type": "string"}, {"name": "Address", "type": "string"} ] }
{ "type": "record", "name": "Profile", "fields" : [ {VerticaUser}, {VerticaUser Address} ] }
Vertica creates virtual columns for the records as follows:
Name | Value |
---|---|
UserName
|
VerticaUser
|
Address
|
VerticaUser Address
|
Enums
Enums (enumerated values) use the type name enum
and support the following attributes:
Attribute | Description |
---|---|
name | A JSON string for the name of the enum |
symbols | A JSON array used to list symbols as JSON strings. All symbols in an enum must be unique and duplicates are prohibited |
Example:
{ "type": "enum", "name": "suit", "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"] }
Consider the preceding Avro schema with a record that contains a field with the value HEARTS
. In this case, the key value pair copied into the __raw__
column has suit
as the key and HEARTS
as the value.
Arrays
Arrays use the type name array
and support one attribute:
Attribute | Description |
---|---|
items | The schema of the array's items |
For example, declare an array of strings:
{"type": "array", "items": "string"}
You can treat arrays of scalar types like other Avro complex types, loading them into flex tables or columns, or you can use strong typing. This section describes the flex approach; the next section describes strongly-typed arrays.
Similar to the capabilities for Records
, you can nest and flatten Arrays
using flatten_arrays=true
:
{ "__name__" : "Order", <-- artificial __name__ key for record "customer_id" : "111222", "order_details" : { <-- array of records "0" : { <-- array index 0 "__name__" : "OrderDetail", "product_detail" : {
"__name__" : "Product", "price" : "46.21", "product_category" : { <- array of strings "0" : "electronics", "1" : "printers", "2" : "computers" }, "product_name" : "mycompany printer 123abc", "product_status" : "ONLY_FEW_LEFT" } }, "order_id" : "2389646", "total" : "132.43" }
Here is the result of flattening the array:
{ "order_details.0.__name__" : "OrderDetail", "order_details.0.product_detail.0.product_category" : "electronics", "order_details.0.product_detail.1.product_category" : "prnters", "order_details.0.product_detail.2.product_category" : "computers", "order_details.0.product_detail.__name__" : "Product", "order_details.0.product_detail.price" : "46.21", "order_details.0.product_detail.product_name" : "mycompany printer 123abc", "order_details.0.product_detail.product_status" : "ONLY_FEW_LEFT", "__name__" : "Order", "customer_id" : "111222", "order_id" : "2389646", "total" : "132.43" }
Arrays with Strong Typing
Instead of using flex tables or flexible columns, you can load Avro arrays into array columns in columnar tables. The parser ignores the flatten_arrays
parameter in this case.
A table column's type can be an array of any scalar type. You cannot declare arrays of arrays (multi-dimensional arrays) or have arrays of other complex types. For these uses of arrays, see the previous section.
For information on how to declare array columns, see Arrays and Sets (Collections).
Maps
Maps use the type name map
and support one attribute:
Attribute | Description |
---|---|
values | The schema of the map's items |
The favroparser
treats map keys as strings. For example, you can declare the map
type as a long as follows:
{"type": "map", "values": "long"}
Similar to Records
types, Maps
can also be nested and flattened using flatten_maps=true
.
The favroparser
inserts key-value pairs from the Avro map as key-value pairs in the __raw__
column. For an Avro record that has KeyX
with value 10
, and KeyY
with value 20
, favroparser
loads the key-value pairs as virtual columns KeyX
and KeyY
, with values 10
and 20
, respectively.
Unions
Vertica uses JSON arrays to represent Avro Unions
. Consider this example:
{"name":"TransactionID","type":["string","null"]}
The field TransactionID
can be a string
or null.
Fixed
Fixed (fixed
) Avro types support two attributes:
Attribute | Description |
---|---|
name | A string for the name of this data type |
size | An integer, specifying the number of bytes per value |
For example, you can declare a 16-byte quantity:
{"type": "fixed", "size": 16, "name": "md5"}
With the preceding declaration is the Avro file schema, consider a record that contains a field with the following byte values for the key md5
:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5]
The favroparser
loads the key value pair as an md5
key with the preceding byte values.