Complex Data Types for favroparser

If you load Avro data into a Flex table, Vertica loads all data into the __raw__ (VMap) column, including complex types found in the data. If you load Avro data into a columnar table, you specify individual columns but you can load an Avro complex type into a Vertica flexible complex type without fully specifying the schema. You define these columns in the table as LONG VARBINARY, and you can use Flex functions to extract values from it. See Using Flexible Complex Types.

For one-dimensional arrays of scalar types in columnar tables, you can also define the column as an array instead of using flexible complex type.

The favroparser supports the following complex data types, indicated in the Avro file with the "type" field:

Flexible Types

The following example demonstrates the use of flexible complex types. Consider an Avro file containing the following data:

{
    "name" : "Bob's pizzeria",
    "cuisine" : "Italian",
    "location_city" : ["Cambridge", "Pittsburgh"],
    "menu" : [{"item" : "cheese pizza", "price" : "$8.25"},
              {"item" : "spinach pizza", "price" : "$10.50"}]
}
{
    "name" : "Bakersfield Tacos",
    "cuisine" : "Mexican",
    "location_city" : ["Pittsburgh"],
    "menu" : [{"item" : "veggie taco", "price" : "$9.95"},
              {"item" : "steak taco", "price" : "$10.95"}]
}

Create a table, specifying the location_city and menu columns (the complex types) as LONG VARBINARY:

=> CREATE TABLE restaurant(name VARCHAR, cuisine VARCHAR, location_city LONG VARBINARY, menu LONG VARBINARY);

Load the data using the Avro parser:

=> COPY restaurant FROM '/data/restaurant.avro' PARSER favroparser (flatten_maps=false, flatten_arrays=false);

Because we loaded the complex columns as LONG VARBINARY, directly querying the columns produces binary results:

=> SELECT * FROM restaurant;
       name        | cuisine |                                                                     location_city                                                                     |                                                                                                                                                                                                                         menu                                                                                                                                                                                                                         
-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bob's pizzeria    | Italian | \001\000\000\000\037\000\000\000\002\000\000\000\014\000\000\000\025\000\000\000CambridgePittsburgh\002\000\000\000\014\000\000\000\015\000\000\00001 | \001\000\000\000\202\000\000\000\002\000\000\000\014\000\000\000F\000\000\000\001\000\000\000\035\000\000\000\002\000\000\000\014\000\000\000\030\000\000\000cheese pizza$8.25\002\000\000\000\014\000\000\000\020\000\000\000itemprice\001\000\000\000\037\000\000\000\002\000\000\000\014\000\000\000\031\000\000\000spinach pizza$10.50\002\000\000\000\014\000\000\000\020\000\000\000itemprice\002\000\000\000\014\000\000\000\015\000\000\00001
 Bakersfield Tacos | Mexican | \001\000\000\000\022\000\000\000\001\000\000\000\010\000\000\000Pittsburgh\001\000\000\000\010\000\000\0000                                           | \001\000\000\000~\000\000\000\002\000\000\000\014\000\000\000E\000\000\000\001\000\000\000\034\000\000\000\002\000\000\000\014\000\000\000\027\000\000\000veggie taco$9.95\002\000\000\000\014\000\000\000\020\000\000\000itemprice\001\000\000\000\034\000\000\000\002\000\000\000\014\000\000\000\026\000\000\000steak taco$10.95\002\000\000\000\014\000\000\000\020\000\000\000itemprice\002\000\000\000\014\000\000\000\015\000\000\00001
(2 rows)

However, you can use Flex functions and direct access (through indices) to return readable values:

=> SELECT MAPTOSTRING(location_city), MAPTOSTRING(menu) FROM restaurant;
                   maptostring                   |             maptostring       
-------------------------------------------------+--------------------------------------------------------
 {
    "0": "Cambridge",
    "1": "Pittsburgh"
} | {
    "0": {
        "item": "cheese pizza",
        "price": "$8.25"
    },
    "1": {
        "item": "spinach pizza",
        "price": "$10.50"
    }
}
 {
    "0": "Pittsburgh"
}                       | {
    "0": {
        "item": "veggie taco",
        "price": "$9.95"
    },
    "1": {
        "item": "steak taco",
        "price": "$10.95"
    }
}
(2 rows)
 
=> SELECT menu['0']['item'] FROM restaurant;
     menu    
--------------
 cheese pizza
 veggie taco
(2 rows)

The COPY statement shown in this example sets flatten_maps to false. Without that change, the keys for the complex columns would not work as expected, because record and array keys would be "flattened" at the top level. Querying menu['0']['item'] would produce no results. Instead, query flattened values as in the following example:

=> SELECT menu['0.item'] FROM restaurant;
     menu
--------------
 veggie taco
 cheese pizza
(2 rows)

Flattening directives apply to the entire COPY statement. You cannot flatten some columns and not others, or prevent flattening values in a complex column that is itself within a flattened flex table.

Records

Records have the following attributes:

Attribute Description
name A JSON string for the name of the record
fields

A JSON array used to list fields. Each field is a JSON object:

  • name: A JSON string for the name of the field
  • type: A JSON object used to define a schema or a JSON string used for naming a record definition

The name of each field is used as a virtual column name. If flatten_records = true and several nesting levels are present, Vertica concatenates the record names to create the key_name, as follows: 

{
  "type": "record",
  "name": "Profile",
  "fields" : [
      {"name": "UserName", "type": "string"},
      {"name": "Address", "type": "string"}
   ]
}
{
  "type": "record",
  "name": "Profile",
  "fields" : [
      {VerticaUser},
      {VerticaUser Address}
]
}

Vertica creates virtual columns for the records as follows:

Names Values
UserName VerticaUser
Address VerticaUser Address

Enums

Enums (enumerated values) use the type name enum and support the following attributes:

Attribute Description
name A JSON string for the name of the enum
symbols A JSON array used to list symbols as JSON strings. All symbols in an enum must be unique and duplicates are prohibited

Example:

{   
	"type": "enum",
	"name": "suit",
	"symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]
}

Consider the preceding Avro schema with a record that contains a field with the value HEARTS. In this case, the key value pair copied into the __raw__ column has suit as the key and HEARTS as the value.

Arrays

Arrays use the type name array and support one attribute:

Attribute Description
items The schema of the array's items

For example, declare an array of strings:

{"type": "array", "items": "string"}

You can treat arrays of scalar types like other Avro complex types, loading them into flex tables or columns, or you can use strong typing. This section describes the flex approach; the next section describes strongly-typed arrays.

Similar to the capabilities for Records, you can nest and flatten Arrays using flatten_arrays=true :

{
	"__name__" : "Order",                        <-- artificial __name__ key for record
	"customer_id" : "111222",
	"order_details" : {                                <-- array of records
	"0" : {                                               <-- array index 0
		"__name__" : "OrderDetail",
		"product_detail" : {
			"__name__" : "Product",
			 "price" : "46.21",
			 "product_category" : {           <- array of strings
				"0" : "electronics",
				"1" : "printers",
				"2" : "computers"
				},
		"product_name" : "mycompany printer 123abc",
		"product_status" : "ONLY_FEW_LEFT"
		}
	},
	"order_id" : "2389646",
	"total" : "132.43"
}		

Here is the result of flattening the array:

{
	"order_details.0.__name__" : "OrderDetail",
	"order_details.0.product_detail.0.product_category" : "electronics",
	"order_details.0.product_detail.1.product_category" : "prnters",
	"order_details.0.product_detail.2.product_category" : "computers",
	"order_details.0.product_detail.__name__" : "Product",
	"order_details.0.product_detail.price" : "46.21",
	"order_details.0.product_detail.product_name" : "mycompany printer 123abc",
	"order_details.0.product_detail.product_status" : "ONLY_FEW_LEFT",
	"__name__" : "Order",
	"customer_id" : "111222",
	"order_id" : "2389646",
	"total" : "132.43"
}

Arrays with Strong Typing

Instead of using flex tables or flexible columns, you can load Avro arrays into array columns in columnar tables. The parser ignores the flatten_arrays parameter in this case.

A table column's type can be an array of any scalar type. You cannot declare arrays of arrays (multi-dimensional arrays) or have arrays of other complex types. For these uses of arrays, see the previous section.

For information on how to declare array columns, see Arrays and Sets (Collections).

Maps

Maps use the type name map and support one attribute:

Attribute Description
values The schema of the map's items

The favroparser treats map keys as strings. For example, you can declare the map type as a long as follows: 

{"type": "map", "values": "long"}

Similar to Records types, Maps can also be nested and flattened using flatten_maps=true.

The favroparser inserts key-value pairs from the Avro map as key-value pairs in the __raw__ column. For an Avro record that has KeyX with value 10, and KeyY with value 20, favroparser loads the key-value pairs as virtual columns KeyX and KeyY, with values 10 and 20, respectively.

Unions

Vertica uses JSON arrays to represent Avro Unions. Consider this example:

{"name":"TransactionID","type":["string","null"]}

The field TransactionID can be a string or null.

Fixed

Fixed (fixed) Avro types support two attributes:

Attribute Description
name A string for the name of this data type
size An integer, specifying the number of bytes per value

For example, you can declare a 16-byte quantity:

{"type": "fixed", "size": 16, "name": "md5"}

With the preceding declaration is the Avro file schema, consider a record that contains a field with the following byte values for the key md5:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5]

The favroparser loads the key value pair as an md5 key with the preceding byte values.