PARQUET (Parser)

Use the PARQUET parser with the COPY statement to load data in the Parquet format. When loading data into Vertica you can read all primitive types, UUIDs, and arrays of primitive types. When creating an external table you can additionally read structs and multi-dimensional arrays (arrays of arrays). See Loading Complex Types for further information on using arrays.
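
For example, a one-dimensional array of a primitive type can be loaded directly into a native table. The table and path here are illustrative:

=> CREATE TABLE readings (id INT, temps ARRAY[FLOAT]);
=> COPY readings FROM '/data/readings*.parquet' PARQUET;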

By default, the Parquet parser uses strong schema matching, meaning that the columns in the data must exactly match, in number and order, the columns in the table that loads the data. You can optionally use Loose Schema Matching.

When loading Parquet data, Vertica caches the Parquet metadata to improve efficiency. This cache uses local TEMP storage and is not used if TEMP is remote. See the ParquetMetadataCacheSizeMB configuration parameter to change the size of the cache.
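
For example, to enlarge the cache you could set the configuration parameter at the database level using the general-purpose SET_CONFIG_PARAMETER function (the value shown is illustrative):

=> SELECT SET_CONFIG_PARAMETER('ParquetMetadataCacheSizeMB', 4096);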

Vertica does not support Parquet files that were written using the DATA_PAGE_V2 page type.

Parameters

All parameters are optional.

hive_partition_cols
Comma-separated list of columns that are partition columns in the data. See Using Partition Columns.

flatten_complex_type_nulls
Boolean. Whether to flatten a null struct value to null values for all of its fields (true, default) or reject a row containing a null struct value (false). See Reading Structs.

This parameter is deprecated and will be removed in a future release.

allow_no_match
Boolean. Whether to accept a path containing a glob with no matching files and report zero rows in query results. If this parameter is not set, Vertica returns an error if the path in the FROM clause does not match at least one file.

allow_long_varbinary_match_complex_type
Boolean. Whether to enable flexible column types (see Using Flexible Complex Types). If true, the Parquet parser allows a complex type in the data to match a table column defined as LONG VARBINARY. If false, the Parquet parser requires strong typing of complex types. Setting this parameter does not prevent strong typing; set it to false if you want the use of flexible columns to be treated as an error.

do_soft_schema_match_by_name
Boolean. Whether to enable loose schema matching (true) instead of the strict matching based on column order in the table definition and Parquet file (false, default). See Loose Schema Matching for more information.

reject_on_materialized_type_error
Boolean, applies only if do_soft_schema_match_by_name is true. Specifies what to do when a value in the data cannot be coerced to the target column type during loose schema matching. A value of true (default) means to reject the row; a value of false means to use NULL for the value or, for strings that are too long, to truncate the value. See the table of type coercions for coercible type mappings.
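
For example, the following sketch (the table name and path are hypothetical) loads with loose schema matching and, because reject_on_materialized_type_error is false, replaces uncoercible values with NULL or a truncated string instead of rejecting the row:

=> COPY sales FROM '/data/sales*.parquet'
   PARQUET(do_soft_schema_match_by_name='True',
           reject_on_materialized_type_error='False');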

Loose Schema Matching

By default, the Parquet parser uses strong schema matching. This means that all columns in the Parquet data must be loaded, in the same order as in the data. However, there are times when you only want to pull certain columns, or you want to be able to accommodate future changes in the Parquet schema.

Use the do_soft_schema_match_by_name parameter to enable loose schema matching. This setting has the following effects:

  • Columns in the data are matched to columns in the table by name. The names must match exactly, but the comparison is case-insensitive.
  • Columns that exist in the Parquet data but are not part of the table definition are ignored.
  • Columns that exist in the table definition but not the Parquet data are filled with NULL. The parser logs an UNMATCHED_TABLE_COLUMNS_PARQUETPARSER event in QUERY_EVENTS.
  • If the same case-insensitive column name occurs more than once in the Parquet data, the parser uses the last one. (This situation can arise when using data written by tools that are case-sensitive.)
  • Hive partition columns do not need to appear last in the table definition like they do with strong schema matching. If a partition column specified in the table definition does not exist in the data path, the parser uses a value of NULL and logs a MISSING_HIVE_PARTITION event in the QUERY_EVENTS table.
  • Column types do not need to exactly match, so long as the data type in the Parquet file can be coerced to the type used by the table. If a type cannot be coerced, the parser logs a TYPE_MISMATCH_COLUMNS_PARQUETPARSER event in QUERY_EVENTS (see the query sketch after this list). If reject_on_materialized_type_error is true, the parser rejects the row. If it is false, the parser uses NULL or, for string values that are too long, truncates the value.
  • Columns using complex types (other than one-dimensional arrays of primitive types) can be defined but not queried.
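
Both logged event types can be reviewed after a load. A minimal sketch against the QUERY_EVENTS system table, filtering on the event names listed above:

=> SELECT event_type, event_details
   FROM query_events
   WHERE event_type IN ('UNMATCHED_TABLE_COLUMNS_PARQUETPARSER',
                        'TYPE_MISMATCH_COLUMNS_PARQUETPARSER');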

Loading Complex Types

Parquet data containing one-dimensional arrays (arrays of primitive types) can be loaded into Vertica-managed tables. The native array type is a different type from the external-table array type; that is, the two types have different OIDs. Syntactically they are the same, both in how you define them in tables and in how you query them.

The native array type does not support multi-dimensional arrays. If your data contains such arrays, you can read them directly using external tables. Alternatively, you can create the external table and then use CREATE TABLE AS SELECT to create a new Vertica-managed table, extracting each nested array as its own column.
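
A minimal sketch of that two-step pattern follows; the table names, the path, and the use of array subscripts to extract each nested array are assumptions for illustration:

=> CREATE EXTERNAL TABLE ext_grid (id INT, grid ARRAY[ARRAY[INT]])
    AS COPY FROM '/data/grid*.parquet' PARQUET;
=> CREATE TABLE flat_grid AS
    SELECT id, grid[0] AS row0, grid[1] AS row1 FROM ext_grid;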

Vertica supports only 3-level-encoded arrays, not 2-level-encoded.

Vertica does not support the required mode for complex types, only optional.

Data Types

The Parquet parser maps Parquet data types to Vertica data types as follows.

Parquet Logical Type     Vertica Data Type
--------------------     -----------------
StringLogicalType        VARCHAR
MapLogicalType           MAP
ListLogicalType          ARRAY/SET
IntLogicalType           INT/NUMERIC
DecimalLogicalType       NUMERIC
DateLogicalType          DATE
TimeLogicalType          TIME
TimestampLogicalType     TIMESTAMP
UUIDLogicalType          UUID

The following logical types are not supported:

  • EnumLogicalType

  • IntervalLogicalType

  • JSONLogicalType

  • BSONLogicalType

  • UnknownLogicalType

The Parquet parser supports the following mappings of physical types:

Parquet Physical Type    Vertica Data Type
---------------------    -----------------
BOOLEAN                  BOOLEAN
INT32/INT64              INT
INT96                    Supported only for TIMESTAMP
FLOAT                    DOUBLE
DOUBLE                   DOUBLE
BYTE_ARRAY               VARBINARY
FIXED_LEN_BYTE_ARRAY     BINARY

The following mappings are supported with type coercion and loose schema matching.

Parquet Physical Type                              Coercible to Vertica Data Type
-------------------------------------------------  ------------------------------
BOOLEAN                                            BOOLEAN
INT32, INT64, BOOLEAN                              INT
FLOAT, DOUBLE                                      DOUBLE
INT32, INT96                                       DATE
INT64, INT96                                       TIMESTAMP, TIMESTAMPTZ
INT64; if precision > 0: also INT32,               NUMERIC
  BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
BYTE_ARRAY                                         CHAR, VARCHAR, LONG VARCHAR,
                                                   BINARY, VARBINARY, LONG VARBINARY
FIXED_LEN_BYTE_ARRAY                               UUID

Examples

Use the PARQUET clause without parameters if your data is not partitioned.

=> COPY t FROM 's3://AWS_DataLake/sales.parquet' PARQUET;

In the following example, the "id" and "name" columns are included in the data and the "created" and "region" columns are partition columns. Partition columns must be listed last when defining columns with strong schema matching (the default).

=> CREATE EXTERNAL TABLE t (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 'hdfs:///path/*/*/*' 
   PARQUET(hive_partition_cols='created,region');

In the following example, the data directory contains no files.

=> CREATE EXTERNAL TABLE customers (...)
    AS COPY FROM 'hdfs:///data/*.parquet' PARQUET;
=> SELECT COUNT(*) FROM customers;
ERROR 7869: No files match when expanding glob: [hdfs:///data/*.parquet]

To read zero rows instead of producing an error, use the allow_no_match parameter with PARQUET:

=> CREATE EXTERNAL TABLE customers (...)
    AS COPY FROM 'hdfs:///data/*.parquet' PARQUET(allow_no_match='true');
=> SELECT COUNT(*) FROM customers;
 count 
------- 
     0 
(1 row) 

To allow reading a complex type (menu, in this example) as a flexible column type, use the allow_long_varbinary_match_complex_type parameter:

=> CREATE EXTERNAL TABLE restaurants
    (name VARCHAR, cuisine VARCHAR, location_city ARRAY[VARCHAR], menu LONG VARBINARY) 
    AS COPY FROM '/data/rest*.parquet' 
    PARQUET(allow_long_varbinary_match_complex_type='True');

To read only some columns from the restaurant data, use loose schema matching:

=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR)
    AS COPY FROM '/data/rest*.parquet'
    PARQUET(allow_long_varbinary_match_complex_type='True', 
            do_soft_schema_match_by_name='True');
			
=> SELECT * FROM restaurants;
       name        | cuisine 
-------------------+----------
 Bob's pizzeria    | Italian
 Bakersfield Tacos | Mexican
(2 rows)