This blog post was authored by Deepak Majeti.
Vertica customers often ask the following questions:
1. “We want to keep hot/warm data in Vertica and move warm/cold data to an open file format on cheap external storage. How do we do this?”
2. “How can we store the results from Vertica in an open file format on the Hadoop Distributed File System?”
Parquet Export, introduced in Vertica 8.1.1, addresses these questions. This feature allows Vertica customers to export data from Vertica to files in Parquet format. The rest of this blog post explains and evaluates this new Parquet Export feature.
What are Parquet Export capabilities?
With Parquet Export, you can move table data from Vertica to the Parquet file format on an external file system (such as a Linux file system or the Hadoop Distributed File System). The output result set of any SELECT statement can be exported to Parquet format. This means that users can apply expressions on top of table columns before exporting. Users can also use external tables to load data from a different file format (such as CSV or TXT) and export it to Parquet format. Additionally, this feature allows sorting and partitioning of the exported result set.
Note: Parquet Export does not alter the data that is stored inside Vertica.
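As an illustration of the external-table path, a hypothetical external table over CSV files (the table, column, and path names below are illustrative, not taken from the Vertica documentation) could be exported to Parquet with an expression applied to one of its columns:

=> CREATE EXTERNAL TABLE sales_ext(id INT, amount NUMERIC(10,2), sale_date DATE)
   AS COPY FROM '/data/sales/*.csv' DELIMITER ',';

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/sales_parquet')
   AS SELECT id, amount * 1.1 AS adjusted_amount, sale_date FROM sales_ext;

In this sketch, the CSV data never has to be loaded into Vertica's own storage; it flows from the external table straight into Parquet files, with the expression applied along the way.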
How do you use it?
Parquet Export uses the familiar EXPORT TO PARQUET syntax. The following is a sample query:
=> EXPORT TO PARQUET(directory = 'hdfs:///user1/parquet_data',
rowGroupSizeMB = 10,
compression = 'snappy')
OVER(PARTITION BY department_id)
AS SELECT name, designation, salary FROM employee_records;
For details, see EXPORT TO PARQUET in the Vertica documentation.
How do Sorting and Partitioning work while exporting to Parquet?
You can use an “OVER(ORDER BY transaction_id)” clause to specify the columns in the result set that need to be sorted before exporting.
You can use an “OVER(PARTITION BY transaction_date)” clause to specify the columns in the result set that need to be partitioned before exporting. The partitioning is similar to Hive-style partitioning, where directories are created hierarchically for each partition column value.
Sorting and partitioning the exported columns provide multiple benefits for managing the data. For instance, Vertica can apply predicate pushdown more efficiently when reading Parquet files whose columns are sorted and partitioned.
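As a sketch combining both clauses (the table and columns follow the earlier employee_records example; the exact directory layout shown is an assumption based on the Hive-style partitioning described above):

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/parquet_sorted')
   OVER(PARTITION BY department_id ORDER BY salary)
   AS SELECT name, designation, salary FROM employee_records;

With Hive-style partitioning, each department would land in its own subdirectory, for example hdfs:///user1/parquet_sorted/department_id=42/, with the rows inside each file sorted by salary.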
What are the atomicity and concurrency guarantees?
Parquet Export is atomic, which means that the export is based on a consistent snapshot of the SELECT statement’s result set and the files written are visible only after the export is completed. Therefore, a load from the directory where the files are being exported sees either all the data or none. This is achieved by first exporting the results to a temporary directory and then atomically renaming the directory to the specified directory.
Users can invoke multiple Parquet Export queries concurrently if the destination directories do not overlap. If any overlap between two queries is detected, the queries are aborted.
In the event of a failure or cancellation of the query, the partially exported Parquet files are automatically cleaned up. However, there are instances where the cleanup could fail.
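For example, two sessions can safely export to disjoint directories at the same time (the paths and predicates here are illustrative):

=> -- Session 1
=> EXPORT TO PARQUET(directory = 'hdfs:///exports/east')
   AS SELECT * FROM employee_records WHERE department_id < 50;

=> -- Session 2, running concurrently
=> EXPORT TO PARQUET(directory = 'hdfs:///exports/west')
   AS SELECT * FROM employee_records WHERE department_id >= 50;

If Session 2 instead targeted hdfs:///exports/east (or a directory nested inside it), the overlap would be detected and the queries aborted.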
Can generated Parquet files be read from other tools?
Vertica uses the standard Apache Parquet C++ library to write the Parquet files. This means that the files written follow the official Parquet specification and will continue to do so. Our developers contribute regularly to this project, which allows them to integrate Vertica efficiently with the Parquet C++ library by taking advantage of that implementation expertise.
How stable is Parquet Export?
Parquet Export is optimized to use the maximum allowed concurrency within the resources provided. Both Vertica and the Parquet C++ libraries have been tuned to use a minimal memory footprint. The following table details preliminary experimental results that were obtained in our lab:
Cluster: 5-node Vertica cluster. Each node (2 x Intel(R) Xeon(R) E5-2660 v3) has 40 processors @ 2.60 GHz and 256 GB RAM.
Table schema: CREATE TABLE products( total_gain numeric(32, 10) … );
Data: 400 million rows of random values (~735 GB of ROS data). Because the values are random, the data compresses poorly. This is the worst-case scenario and therefore provides an upper bound for the results.
Destination: Data is exported to each node’s local Linux file system.
Performance evaluation of exporting various Vertica data types
This experiment studied Parquet Export performance with various Vertica types. It is important for Parquet Export to perform similarly on all Vertica types. We studied this behavior by exporting data of each type: we executed an EXPORT statement for each column and measured how much time the export required.
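A single measurement in this experiment can be sketched as follows (the directory name is illustrative, and timing each statement with vsql's \timing meta-command is an assumption about the tooling used):

=> \timing on
=> EXPORT TO PARQUET(directory = '/local/export_numeric')
   AS SELECT total_gain FROM products;

Repeating this for each column of the products table yields one timing per Vertica type.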
The graph below shows the results. The time taken to export each Vertica type was proportional to the size of the type. We can therefore infer that the export performance is similar for all the types. The last bar in the graph (*) shows that the time taken to export all types together was similar to the total time taken to export each type individually. This shows that combining various types has no impact on Parquet Export performance.
Data Volume Scaling
This experiment studied Parquet Export performance with varying data sizes on a fixed number of nodes. It is important that the time taken increases only linearly as the data size increases. To study this, we executed Parquet Export with a varying number of rows.
The graph below shows that the time taken increased linearly with the data size. This shows that Parquet Export scales as the data volume increases.
Parquet Export helps Vertica users export data from Vertica's ROS format to the open Parquet file format. This feature has multiple use cases, including the ability to store cold data on cheap external storage. Parquet Export syntax is simple and provides powerful options such as sorting and partitioning of data.
The exported files follow the official Parquet specification. Preliminary performance results show that Parquet Export is stable and scalable.