Troubleshooting Reads from ORC and Parquet Files

You might encounter the following issues when reading ORC or Parquet files.

File Not Found or Permission Denied

If a query against an external table produces a file or permission error, ensure that the user executing the query has the necessary permissions in both Vertica and the file system. See the permissions section in Creating External Tables.
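For example, on the Vertica side you can grant query access as follows (the user name here is hypothetical); on the file-system side, confirm on every node that the files are readable by the operating-system user under which the database runs:

=> GRANT USAGE ON SCHEMA public TO report_user;
=> GRANT SELECT ON nation TO report_user;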

Reads from Parquet Files Report Unexpected Data-Type Mismatches

If a Parquet file contains a column of type STRING but the corresponding column in Vertica has a different type, such as INTEGER, the error message can be unclear: Vertica reports the type of the column in the Parquet file as BYTE_ARRAY, as in the following example:

ERROR 7247:  Datatype mismatch: column 2 in the parquet_cpp source 
[/tmp/nation.0.parquet] has type BYTE_ARRAY, expected int 

This behavior is specific to Parquet files; with an ORC file, the type is correctly reported as STRING. The problem occurs because Parquet does not natively support a STRING type and uses BYTE_ARRAY for strings instead. Because the Parquet file reports its type as BYTE_ARRAY, Vertica cannot determine whether the type is actually a BYTE_ARRAY or a STRING.
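If the Parquet column actually holds string data, the fix is to declare the corresponding Vertica column as a string type such as VARCHAR. A minimal sketch, reusing the file from the error above (the column layout is assumed):

=> CREATE EXTERNAL TABLE nation (nationkey BIGINT, name VARCHAR(500))
   AS COPY FROM '/tmp/nation.0.parquet' PARQUET;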

Queries on Parquet Tables Show No Results

A query against an external table can incorrectly report 0 rows if the Parquet file was written using an unsupported format. When this happens, Vertica records a message like the following:

WARNING> @db_node0001: 01000/9226: Number of rows [0] do not match with number 
of rows [18] in metadata for parquet source path /data/filename.parquet

This mismatch occurs for Parquet files that were written using the DATA_PAGE_V2 page type. Vertica cannot read this format.
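If you control the process that writes the files, one possible workaround is to rewrite the data with version 1 data pages. The following Hive sketch uses the parquet-mr property parquet.writer.version; the property name and the table names are assumptions to verify against your environment:

-- Hive: force version 1 data pages when rewriting (hypothetical tables)
SET parquet.writer.version=v1;
INSERT OVERWRITE TABLE sales_v1 SELECT * FROM sales;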

Error 7087: Wrong Number of Columns

When loading data, you might see an error stating that you have the wrong number of columns:

=> CREATE TABLE nation (nationkey bigint, name varchar(500),
      regionkey bigint, comment varchar(500));
=> COPY nation from :orc_dir ORC;
   ERROR 7087: Attempt to load 4 columns from an orc source
   [/tmp/orc_glob/test.orc] that has 9 columns

When you load data from Hadoop native file formats, your table must consume all of the columns in the file; otherwise, this error results. To avoid the problem, add the missing columns to your table definition.
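For the example above, that means re-creating the table with all nine columns. The five added columns below are hypothetical placeholders; match the names and types in your ORC file's actual schema:

=> DROP TABLE nation;
=> CREATE TABLE nation (nationkey bigint, name varchar(500),
      regionkey bigint, comment varchar(500),
      extra1 int, extra2 int, extra3 int, extra4 int, extra5 int);
=> COPY nation FROM :orc_dir ORC;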

For Parquet Data, Time Zones in Timestamp Values Are Not Correct

Reading timestamps from a Parquet file in Vertica can produce different values depending on the local time zone. This issue occurs because the Parquet format does not support the SQL TIMESTAMP data type. If you define the column in your table as TIMESTAMP, Vertica interprets timestamps read from Parquet files as values in the local time zone. Hive behaves the same way. When this situation occurs, Vertica produces a warning at query time, such as the following:

WARNING 0:  SQL TIMESTAMPTZ is more appropriate for Parquet TIMESTAMP
because values are stored in UTC

When creating the table in Vertica, you can avoid this issue by using the TIMESTAMPTZ data type instead of TIMESTAMP.
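For example (table and column names are illustrative):

=> CREATE EXTERNAL TABLE sales (sale_id INT, sale_time TIMESTAMPTZ)
   AS COPY FROM '/data/sales/*.parquet' PARQUET;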

Time zones can also be incorrect in ORC data, but the reason is different.

For ORC Data, Time Zones in Timestamp Values Are Not Correct

Vertica and Hive both use the Apache ORC library to interact with ORC data. The behavior of this library changed with Hive version 1.2.0, so how timestamps are represented depends on which version was used to write the data.

When writing timestamps, the ORC library now records the time zone in the stripe footer. Vertica looks for this value and applies it when loading timestamps. If the file was written with an older version of the library, the time zone is missing from the file.

If the file does not contain a time zone, Vertica uses the local time zone and logs an ORC_FILE_INFO event in the QUERY_EVENTS system table.

The first time you query a new ORC data source, you should query this table to look for missing time zone information:

=> SELECT event_category, event_type, event_description, operator_name, event_details, COUNT(event_type)
   FROM QUERY_EVENTS
   WHERE event_type = 'ORC_FILE_INFO'
   GROUP BY event_category, event_type, event_description, operator_name, event_details
   ORDER BY event_details;
 event_category |  event_type   |                 event_description                  | operator_name |                                          event_details                                          | count
----------------+---------------+----------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------+-------
 EXECUTION      | ORC_FILE_INFO | ORC file does not have writer timezone information | OrcParser     | Timestamp values in the ORC source [data/sales_stats.orc] will be computed using local timezone | 2
(1 row)
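Because Vertica falls back to the session time zone for such files, one possible workaround, assuming you know the zone in which the data was written, is to set the session time zone to match before querying (the zone below is hypothetical):

=> SET TIME ZONE TO 'America/New_York';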

Time zones can also be incorrect in Parquet data, but the reason is different.

Some Date and Timestamp Values Are Wrong by Several Days

When Hive writes ORC or Parquet files, it converts dates before 1583 from the Gregorian calendar to the Julian calendar. Vertica does not perform this conversion. If your file contains dates before this time, values in Hive and the corresponding values in Vertica can differ by up to ten days. This difference applies to both DATE and TIMESTAMP values.

Error 7226: Cannot Find Partition Column

When querying data, you might see an error message stating that a partition column is missing:

ERROR 7226: Cannot find partition column [region] in parquet source

This error can occur if you partition your ORC or Parquet data (see Using Partition Columns). If you create an external table and then change the partition structure, for example by renaming a column, you must then re-create the external table. If you see this error, update your table to match the partitioning on disk.
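For example, if the files are laid out in Hive-style partition directories such as /data/sales/region=east/, the table definition must name the partition column in the parser's hive_partition_cols parameter. A sketch with hypothetical paths and columns:

=> CREATE EXTERNAL TABLE sales (id INT, amount FLOAT, region VARCHAR)
   AS COPY FROM '/data/sales/*/*' PARQUET(hive_partition_cols='region');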

Error 6766: Is a Directory

When querying data you might see an error message stating that an input file is a directory:

ERROR 6766: Error reading from orc parser input stream 
[/tmp/orc_glob/more_nations]: Is a directory

This error occurs if the glob in the table's COPY FROM clause matches an empty directory. It applies only to files in the Linux file system; empty directories in HDFS are ignored.

To correct the error, make the glob more specific. Instead of *, for example, use *.orc.
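For example, building on the COPY statement shown earlier:

=> COPY nation FROM '/tmp/orc_glob/*.orc' ORC;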