Reading Structs as Inline Types

For data in Parquet format, you can treat a struct as a single column instead of expanding the struct fields into individual columns. You might do so to make your external table definition more closely match the layout of the data you are reading. You might also do so to reduce the number of columns in your table definition.

Defining Tables with Complex Types

Use the ROW expression to define a struct column. In the following example, the data has columns for customer name, address, and account number, and the address is a struct. The types you declare in Vertica must be compatible with the types in the Parquet data.

=> CREATE EXTERNAL TABLE customers (
    name VARCHAR, 
    address ROW(street VARCHAR, city VARCHAR, zipcode INT), 
    accountID INT)
  AS COPY FROM '...' PARQUET;	

Within the ROW, you specify the fields and their data types using the same syntax as for columns. Vertica treats the struct as a single column for purposes of queries. (To expand structs into their own columns instead, see Reading Structs as Expanded Columns.)

Structs can contain other structs. In the following example, employees have various personal information, including an address which is itself a struct.

=> CREATE EXTERNAL TABLE employees(
    employeeID INT, 
    personal ROW(
      name VARCHAR, 
      address ROW(street VARCHAR, city VARCHAR, zipcode INT),
      taxID INT), 
    department VARCHAR)
  AS COPY FROM '...' PARQUET;

If a struct contains fields with null values, those fields have null values in Vertica. If the struct value itself is null, Vertica rejects the row by default. See Handling Null Struct Values for more information.

By default, Vertica requires the definition of the external table to match the schema of the external data. For example, with the data used in the previous employees example, the following definition is an error:

=> CREATE EXTERNAL TABLE employees(
    employeeID INT, 
    personal ROW(
      name VARCHAR, 
      address ROW(street VARCHAR, city VARCHAR), 
      zipcode INT,
      taxID INT), 
    department VARCHAR)
  AS COPY FROM '...' PARQUET;
ERROR 9151: Datatype mismatch [...]

The data contains an address struct with three fields (street, city, zipcode), so the external table must also use a ROW with three fields. Changing the ROW to have two fields and promoting one of the fields to the parent ROW is a mismatch. Each ROW must match and, if structs are nested in the data, the complete structure must match.

To vary the table definition from the data's schema, add the skip_strong_schema_match parameter to the PARQUET function:

=> CREATE EXTERNAL TABLE employees(
    employeeID INT, 
    personal ROW(
      name VARCHAR, 
      address ROW(street VARCHAR, city VARCHAR), 
      zipcode INT,
      taxID INT), 
    department VARCHAR)
  AS COPY FROM '...' PARQUET(skip_strong_schema_match='true');

The skip_strong_schema_match parameter allows you to vary only containment. Primitive data types must still match and you must still refer to data columns and fields in the order in which they appear in the data.

Restrictions

ROW columns have several restrictions:

  • The maximum nesting depth is 10.
  • The maximum number of total columns and fields in a table is 1600.  The ROW itself is not counted, just its fields.
  • ROW columns cannot use any constraints (such as NOT NULL) or defaults.
  • ROW fields cannot be auto_increment or setof.
  • A ROW definition must include at least one field.
  • "Row" is a reserved keyword within a ROW definition, but is permitted as the name of a table or column.
  • ROW columns cannot be modified using ALTER TABLE...ALTER COLUMN. You must drop and recreate the external table to change a ROW column.
  • External tables containing ROW columns cannot also contain identity, auto-increment, or sequence columns.

Querying Complex Types

In queries, ROW columns are shown in output using JSON format. In the following example, the last row shows partial data.

=> SELECT * FROM customers ORDER BY accountID;
        name        |                              address                               | accountID
--------------------+--------------------------------------------------------------------+-----------
 Missy Cooper       | {"street":"911 San Marcos St","city":"Austin","zipcode":73344}     |        17
 Sheldon Cooper     | {"street":"100 Main St Apt 4B","city":"Pasadena","zipcode":91001}  |       139
 Leonard Hofstadter | {"street":"100 Main St Apt 4A","city":"Pasadena","zipcode":91001}  |       142
 Leslie Winkle      | {"street":"23 Fifth Ave Apt 8C","city":"Pasadena","zipcode":91001} |       198
 Raj Koothrappali   | {"street":null,"city":"Pasadena","zipcode":91001}                    |       294                          |       482
(5 rows)

Most values are cast to UTF-8 strings, as shown for street and city here. Integers and booleans are cast to JSON Numerics and thus not quoted.

Use dot notation (column.field) to access fields:

=> SELECT address.city FROM customers;
   city
----------
 Pasadena
 Pasadena
 Pasadena
 Pasadena
 Austin
(5 rows)

You can use row columns or specific fields to restrict queries, as in the following examples.

=> SELECT address FROM customers WHERE address.city ='Pasadena';
                              address
--------------------------------------------------------------------
 {"street":"100 Main St Apt 4B","city":"Pasadena","zipcode":91001}
 {"street":"100 Main St Apt 4A","city":"Pasadena","zipcode":91001}
 {"street":"23 Fifth Ave Apt 8C","city":"Pasadena","zipcode":91001}
 {"street":null,"city":"Pasadena","zipcode":91001}
(4 rows)

You can use the ROW syntax to specify literal values, such as the address in the WHERE clause in the following example.

=> SELECT name,address FROM customers 
   WHERE address = ROW('100 Main St Apt 4A','Pasadena',91001);
        name        |                              address                      
--------------------+-------------------------------------------------------------------
 Leonard Hofstadter | {"street":"100 Main St Apt 4A","city":"Pasadena","zipcode":91001}
(1 row)	

You can join on values from structs as you would from any other column:

=> SELECT accountID,department from customers JOIN employees 
   ON customers.name=employees.personal.name;
 accountID | department
-----------+------------
       139 | Physics
       142 | Physics
       294 | Astronomy

You can join on full structs.  The following example joins the addresses in the employees and customers tables.

=> SELECT employees.personal.name,customers.accountID FROM employees 
JOIN customers ON employees.personal.address=customers.address;
        name        | accountID
--------------------+-----------
 Sheldon Cooper     |       139
 Leonard Hofstadter |       142
(2 rows)

If a reference is ambiguous, Vertica prefers column names over field names.

You can use many operators and predicates with ROW columns, including JOIN, GROUP BY, ORDER BY, IS [NOT] NULL, and comparison operations in nullable filters. Some operators are nonsensical for structured data and are not supported. See the ROW reference page for a complete list.

System Tables

Information about all ROW types is recorded in the COMPLEX_TYPES system table. You must have read permission for the external table that uses a ROW to see its entries in this system table. ROW types are not shown in the TYPES system table.

Each row in COMPLEX_TYPES represents one field of one ROW. The field name is the name used in the table definition if present, or a generated name beginning with _field otherwise. Each row also includes the (generated) name of its containing ROW, a string beginning with _ct_. ("CT" stands for "complex type".)

The following example shows selected columns from this table after defining the employee table only.

=> CREATE EXTERNAL TABLE warehouse(
	name VARCHAR, id_map MAP<INT,VARCHAR>,
	data row(record INT, total FLOAT, description VARCHAR(100)), 
	prices ARRAY[INT], comment VARCHAR(200), sales_total FLOAT, storeID INT) 
  AS COPY FROM ... PARQUET;

=> SELECT type_id,type_kind,type_name,field_id,field_name,field_type_name,field_position 
	FROM COMPLEX_TYPES ORDER BY type_id,field_name;	
			
      type_id      | type_kind |       type_name       | field_id | field_name  | field_type_name | field_position
-------------------+-----------+-----------------------+----------+-------------+-----------------+----------------				
 45035996274278280 | Map       | _ct_45035996274278280 |        6 | key         | int             |              0
 45035996274278280 | Map       | _ct_45035996274278280 |        9 | value       | varchar(80)     |              1
 45035996274278282 | Row       | _ct_45035996274278282 |        9 | description | varchar(80)     |              2
 45035996274278282 | Row       | _ct_45035996274278282 |        6 | record      | int             |              0
 45035996274278282 | Row       | _ct_45035996274278282 |        7 | total       | float           |              1
 45035996274278284 | Array     | _ct_45035996274278284 |        6 |             | int             |              0
(6 rows)

This table shows the fields for the two ROW types defined in the table. When a ROW contains another ROW, as is the case here with the nested address field, the field_type_name column uses the generated name of the contained ROW. The same number, minus the leading "_ct_", serves as the field_id.