This blog post was authored by Monica Cellio.
Parquet and ORC are widely used columnar file formats from the Hadoop ecosystem. Because these formats are columnar, they perform extremely well when queried as external tables in Vertica: queries take advantage of column selection, predicate pushdown, and partition pruning.
Vertica has supported reading Parquet and ORC data from HDFS or from the local file system since version 7.2.3. Increasingly, customers are storing Parquet and ORC data not in HDFS but in vast data lakes on S3. Starting with version 9.0, you can browse that data lake using Vertica without moving or transforming any data.
You can create external tables and query the data in place:
=> CREATE EXTERNAL TABLE sales (itemID int, date date, price float)
AS COPY FROM 's3://datalake/sales/*.parquet' PARQUET;
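Once the external table is defined, you can query it like any other table; Vertica reads only the columns the query references and pushes predicates down to the file reader. A hypothetical query (the date literal is illustrative):

=> SELECT itemID, price FROM sales WHERE date = '2017-03-15';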
When you find something that you want to analyze more deeply, and when you need the top performance of Vertica's ROS format, you can load selected data into Vertica:
=> COPY sales FROM 's3://data_1/salesQ1.orc', 's3://data_2/salesQ1.orc' ORC;
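Note that this form of COPY loads into an existing Vertica-native table, so a table named sales must already have been created in Vertica with columns matching the ORC data. A minimal sketch (column definitions assumed to match the files):

=> CREATE TABLE sales (itemID int, date date, price float);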
For both external tables and load, you can use partitioning to speed up queries by reducing the amount of data that Vertica has to read. This is called partition pruning. Partitions in Parquet and ORC data are represented in the directory structure rather than in the data files themselves. For example, data that is partitioned by date might have a Hive-style directory structure like this (paths illustrative):

s3://datalake/sales/date=2017-01-01/file1.parquet
s3://datalake/sales/date=2017-01-02/file2.parquet
s3://datalake/sales/date=2017-01-03/file3.parquet
If your data files use partitioning, you can take advantage of it in your COPY FROM statements:
=> CREATE EXTERNAL TABLE sales (itemID int, date date)
AS COPY FROM 's3://datalake/sales/*/*.parquet' PARQUET(hive_partition_cols='date');
When queries depend on date values, Vertica can automatically skip (prune) non-matching partitions and the data files they contain.
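For example, a query that filters on the partition column lets Vertica read only the matching date= directories and skip the rest entirely (the date range is illustrative):

=> SELECT COUNT(*) FROM sales WHERE date BETWEEN '2017-01-01' AND '2017-01-31';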
For more information about reading from S3, see Bulk-Loading Data in the Administrator's Guide. For more information about using partitions, see Using Partition Columns.