Creating External Tables with ORC or Parquet Data

In the CREATE EXTERNAL TABLE AS COPY statement, specify a format of ORC or PARQUET as follows:

=> CREATE EXTERNAL TABLE tableName (columns)
     AS COPY FROM path ORC[(parameters)];
=> CREATE EXTERNAL TABLE tableName (columns)
     AS COPY FROM path PARQUET[(parameters)];

The ORC and Parquet parsers (see ORC (Parser) and PARQUET (Parser)) take several optional parameters.

The following example shows how you can read from all ORC files in a local directory.

=> CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT,
      a6 DOUBLE PRECISION, a7 BOOLEAN, a8 DATE, a9 TIMESTAMP,
      a10 VARCHAR(20), a11 CHAR(20), a12 BINARY(20),
      a13 DECIMAL(10,5))
   AS COPY FROM '/data/orc_test_*.orc' ORC;

The following example shows how to use a name service with the hdfs scheme. This example assumes that the name service, hadoopNS, is defined in the Hadoop configuration files that were copied to the Vertica cluster. (See Configuring the hdfs Scheme.)

=> CREATE EXTERNAL TABLE tt (a1 INT, a2 VARCHAR(20))
    AS COPY FROM 'hdfs://hadoopNS/data/file.parquet' PARQUET;

The following example shows how to load multiple ORC files from S3.

=> ALTER DATABASE DEFAULT SET PARAMETER AWSRegion = 'us-west-1';
ALTER DATABASE
=> CREATE EXTERNAL TABLE sales (...)
    AS COPY FROM 's3://Data_1/sales.orc', 's3://Data_2/sales.orc' ORC;

When defining an external table for ORC or Parquet data, you must define all of the data columns in the file. You may omit partition columns. Unlike with some other data sources, you cannot select only the data columns of interest. If you omit data columns, queries using the table abort with an error.

Vertica provides functions to assist with creating table definitions for Parquet data. The GET_METADATA function inspects a file and reports metadata, including information about columns. The INFER_EXTERNAL_TABLE_DDL function returns a starting point for the CREATE EXTERNAL TABLE statement (see Deriving a Table Definition from the Data). You can use EXPORT_OBJECTS to see the definition of an existing external table, including complex types.
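For example, the following calls inspect a Parquet file and then export the definition of an existing external table. The file path and table name are placeholders, and the output is omitted:

=> SELECT GET_METADATA('/data/sales/sales.parquet');

=> SELECT EXPORT_OBJECTS('', 'sales', FALSE);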

If the data is partitioned, you must alter the path value and specify the hive_partition_cols parameter for the ORC or PARQUET parser. You must also list the partition columns last in the column definitions. See Using Partition Columns.
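For example, the following sketch assumes ORC data under a hypothetical /data/inventory directory, partitioned by two hypothetical columns, year and month. Note the extra wildcards in the path (one per partition level) and the partition columns listed last:

=> CREATE EXTERNAL TABLE inventory (itemID INT, quantity INT, year INT, month INT)
    AS COPY FROM '/data/inventory/*/*/*' ORC(hive_partition_cols='year,month');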

If path is a path on the local file system of a Vertica node, specify the node using ON NODE in the COPY statement. Do not use COPY LOCAL. If path is in HDFS or S3, COPY defaults to ON ANY NODE, so you do not need to specify it.
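For example, a minimal sketch of a load from a node-local path, where the node name v_vmart_node0001 is a placeholder for a node in your cluster:

=> CREATE EXTERNAL TABLE t2 (a1 INT, a2 VARCHAR(20))
    AS COPY FROM '/data/orc_test_*.orc' ON v_vmart_node0001 ORC;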

If you load from multiple files in the same COPY statement and loading any of them aborts, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

Deriving a Table Definition from the Data

For data in Parquet format, you can use the INFER_EXTERNAL_TABLE_DDL function to inspect the data and produce a starting point. This function returns a CREATE EXTERNAL TABLE statement, which might require further editing. For columns where the function could not infer the data type, the function labels the type as unknown and emits a warning. For VARCHAR and VARBINARY columns, you might need to adjust the length. Always review the statement the function returns, but especially for tables with many columns, using this function can save time and effort.

In the following example, the data contains one materialized column and two partition columns. Partition columns are always of unknown type.

=> SELECT INFER_EXTERNAL_TABLE_DDL('/data/sales/*/*/*', 'sales');
WARNING 0:  This generated statement is incomplete because of one or more unknown column types. Fix these data types before creating the table
                        INFER_EXTERNAL_TABLE_DDL
------------------------------------------------------------------------
 create external table "sales"(
  "tx_id" int, 
  "date" UNKNOWN, 
  "region" UNKNOWN
) as copy from '/data/sales/*/*/*' parquet(hive_partition_cols='date,region');
(1 row)
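After reviewing the generated statement, you might replace the unknown types and adjust lengths before running it. The concrete types shown here are illustrative assumptions about the data:

=> CREATE EXTERNAL TABLE sales ("tx_id" INT, "date" DATE, "region" VARCHAR(50))
    AS COPY FROM '/data/sales/*/*/*' PARQUET(hive_partition_cols='date,region');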