Reading Hadoop Columnar File Formats

For data in certain Hadoop columnar formats, you can create external tables or copy data into Vertica directly, without going through Hive to get the metadata. Currently, Vertica supports direct reading for the ORC (Optimized Row Columnar) and Parquet formats. Vertica can read ORC and Parquet data from a local file system or from HDFS.

Vertica can also export data in the Parquet format. See Exporting Data.

Requirements

ORC or Parquet files must not use complex data types. Vertica supports all simple data types supported in Hive version 0.11 or later.

Files compressed by Hive or Impala must use Zlib (GZIP) or Snappy compression. Vertica does not support LZO compression for these formats.

Creating External Tables

In the CREATE EXTERNAL TABLE AS COPY statement, specify a format of ORC or PARQUET as follows:

=> CREATE EXTERNAL TABLE tableName (columns)
	AS COPY FROM path ORC[(hive_partition_cols='partitions')];
=> CREATE EXTERNAL TABLE tableName (columns)
	AS COPY FROM path PARQUET[(hive_partition_cols='partitions')];

Set the value of path based on where the file is located. If the file resides on the local file system of a Vertica node, use a local file path. If the file resides in HDFS, use a URL with the hdfs scheme, as in the examples later in this section.

When defining an external table, you must define all of the data columns in the file. You may omit partition columns. Unlike with some other data sources, you cannot select only the data columns of interest. If you omit data columns, the ORC or Parquet reader aborts with an error.

If the data is partitioned, you must alter the path value and specify the hive_partition_cols argument for the ORC or PARQUET parameter. You must also list the partition columns last in columns. See Using Partition Columns.
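
For example, the following sketch (the table name, directory layout, and partition columns created and region are hypothetical) reads partitioned ORC data. The partition columns appear last in the column list and are named in hive_partition_cols, and the path uses wildcards to descend through the partition directories:

=> CREATE EXTERNAL TABLE sales (id INT, amount FLOAT, created DATE, region VARCHAR(50))
	AS COPY FROM 'hdfs:///data/sales/*/*/*' ORC(hive_partition_cols='created,region');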

If path is on the local file system of a Vertica node, specify the node using ON NODE in the COPY statement. Do not use COPY LOCAL. If path is in HDFS, COPY defaults to ON ANY NODE, so you do not need to specify it.
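
For example, the following sketch (the node name v_vmart_node0001 and the path are placeholders) reads a file from the local file system of a single node:

=> CREATE EXTERNAL TABLE t (a1 INT, a2 VARCHAR(20))
	AS COPY FROM '/data/file.orc' ON v_vmart_node0001 ORC;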

Files stored in HDFS are governed by HDFS privileges. For files stored on the local disk, however, Vertica requires that users be granted access. All users who have administrative privileges have access. For other users, you must create a storage location and grant access to it. See CREATE EXTERNAL TABLE AS COPY. Only users who have access to both the Vertica user storage location and the HDFS directory can read from the table.
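
For example, a superuser might grant a user read access to a local directory along these lines (the path and the user name alice are placeholders):

=> CREATE LOCATION '/data/user_orc' ALL NODES USAGE 'USER';
=> GRANT READ ON LOCATION '/data/user_orc' TO alice;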

Loading Data

In the COPY statement, specify a format of ORC or PARQUET:

=> COPY tableName FROM path ORC[(hive_partition_cols='partitions')];
=> COPY tableName FROM path PARQUET[(hive_partition_cols='partitions')];

As with external tables, path can be a local path or an hdfs: URL. For information about the hive_partition_cols parameter, see Using Partition Columns.

Be aware that if you load from multiple files in the same COPY statement and any of them fails, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.
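
For example, the following sketch (the table name and path are hypothetical) loads every Parquet file matching the glob; if any one of those files cannot be loaded, no rows from any of them are loaded:

=> COPY sales FROM 'hdfs:///data/sales/*.parquet' PARQUET;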

Supported Data Types

Vertica can natively read columns of all data types supported in Hive version 0.11 and later except for complex types. If the data contains complex types such as maps, the COPY or CREATE EXTERNAL TABLE AS COPY statement aborts with an error message. Vertica does not attempt to read only some columns; either the entire file is read or the operation fails. For a complete list of supported types, see HIVE Data Types.

Examples

The following example shows how you can read from all ORC files in a local directory. This example uses all supported data types.

=> CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT, 
			  a6 DOUBLE PRECISION, a7 BOOLEAN, a8 DATE, a9 TIMESTAMP, 
			  a10 VARCHAR(20), a11 VARCHAR(20), a12 CHAR(20), a13 BINARY(20), 
			  a14 DECIMAL(10,5))
	AS COPY FROM '/data/orc_test_*.orc' ORC;

The following example shows how to use a name service with the hdfs scheme. This example assumes that the name service, hadoopNS, is defined in the Hadoop configuration files that were copied to the Vertica cluster.

=> CREATE EXTERNAL TABLE tt (a1 INT, a2 VARCHAR(20))
	AS COPY FROM 'hdfs://hadoopNS/data/file.parquet' PARQUET;
