This blog post was authored by Monica Cellio.
Parquet and ORC are widely used columnar file formats from the Hadoop ecosystem. Because these formats are columnar, they perform extremely well when queried as external tables in Vertica: queries take advantage of column selection, predicate pushdown, and partition pruning.
Vertica has supported reading Parquet and ORC data from HDFS or from the local file system since version 7.2.3. Increasingly, customers are storing Parquet and ORC data not in HDFS but in vast data lakes on S3. Starting with version 9.0, you can browse that data lake using Vertica without moving or transforming any data.
You can create external tables and query the data in place:
=> CREATE EXTERNAL TABLE sales (itemID int, date date, price float)
AS COPY FROM 's3://datalake/sales/*.parquet' PARQUET;
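Once the external table is defined, you can query it like any other table; Vertica reads only the columns the query references and pushes predicates down to the file reader. A hypothetical query (the date literal is illustrative):

=> SELECT itemID, price FROM sales WHERE date = '2017-03-15';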
When you find something that you want to analyze more deeply, and when you need the top performance of Vertica's ROS format, you can load selected data into Vertica:
=> COPY sales FROM 's3://data_1/salesQ1.orc', 's3://data_2/salesQ1.orc' ORC;
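Note that this form of COPY loads into an existing Vertica-native table, so a table named sales must already have been created in Vertica with columns matching the ORC data. A minimal sketch (column definitions assumed to match the files):

=> CREATE TABLE sales (itemID int, date date, price float);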
For both external tables and load, you can use partitioning to speed up queries by reducing the amount of data that Vertica has to read. This is called partition pruning. Partitions in Parquet and ORC data are represented in the directory structure rather than in the data files themselves. For example, data that is partitioned by date might have a Hive-style directory structure like this (paths illustrative):

s3://datalake/sales/date=2017-01-01/file1.parquet
s3://datalake/sales/date=2017-01-02/file2.parquet
s3://datalake/sales/date=2017-01-03/file3.parquet
If your data files use partitioning, you can take advantage of it in your COPY FROM statements:
=> CREATE EXTERNAL TABLE sales (itemID int, date date)
AS COPY FROM 's3://datalake/sales/*/*.parquet' PARQUET(hive_partition_cols='date');
When queries depend on date values, Vertica can automatically skip (prune) non-matching partitions and the data files they contain.
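For example, a query that filters on the partition column lets Vertica read only the matching date= directories and skip the rest entirely (the date range is illustrative):

=> SELECT COUNT(*) FROM sales WHERE date BETWEEN '2017-01-01' AND '2017-01-31';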
For more information about reading from S3, see Bulk-Loading Data in the Administrator's Guide. For more information about using partitions, see Using Partition Columns.