Exporting Data

You might want to export data from Vertica, either to share it with other Hadoop-based applications or to move lower-priority data from ROS to less-expensive storage. You can use the EXPORT TO PARQUET statement to export a table (or part of one) as Parquet data.

You can export data to HDFS or to the local file system. You can export ROS data or data that is readable through external tables. After exporting ROS data, you can drop affected ROS partitions to reclaim storage space. If you need to access the data in Vertica again, you can create external tables from the exported data.
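For example, after dropping the affected ROS partitions, you could re-expose the exported files through an external table. The table name, column definitions, and path below are illustrative; match them to your exported schema:

```sql
=> CREATE EXTERNAL TABLE public.sales_archive (
       date DATE,
       transactionID INT,
       price NUMERIC(10,2)
   ) AS COPY FROM 'hdfs:///data/sales_data/*.parquet' PARQUET;
```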

To export data, use EXPORT TO PARQUET in combination with a SELECT statement, as in the following example:

=> EXPORT TO PARQUET(directory='hdfs:///data/sales_data') 
   AS SELECT * FROM public.sales;
 Rows Exported
---------------
         14336
(1 row)

The required directory argument specifies where to write the files. The directory must not already exist.

You can use EXPORT TO PARQUET to write queries across multiple tables in Vertica and export the results. With this approach you can take advantage of powerful, fast query execution in Vertica while making the results available to other Hadoop clients:

=> EXPORT TO PARQUET(directory='hdfs:///data/sales_by_region') 
   AS SELECT sale.price, sale.date, store.region 
   FROM public.sales sale 
   JOIN public.vendor store ON sale.distribID = store.ID;
 Rows Exported
---------------
         23301
(1 row)

EXPORT TO PARQUET takes optional parameters to specify compression format and row-group size (in MB), as in the following example:

=> EXPORT TO PARQUET(directory='hdfs:///data/sales_data',
                     compression = 'uncompressed', rowGroupSize = '32') 
   AS SELECT * FROM public.sales;
 Rows Exported
---------------
         14336
(1 row)

The default compression type is Snappy.

The row-group size affects memory consumption during export: an export thread consumes at least as much memory as the row-group size, so at the default value of 64 each thread consumes at least 64MB of RAM. The default is a compromise between writing larger row groups and leaving enough free memory for other Vertica operations. If you perform exports when the database is not otherwise under heavy load, you can improve later read performance by increasing the row-group size on export.
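For example, on a lightly loaded system you might export with a larger row group. The directory name and the value 512 here are illustrative:

```sql
=> EXPORT TO PARQUET(directory='hdfs:///data/sales_archive',
                     rowGroupSize = '512')
   AS SELECT * FROM public.sales;
```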

When exporting, you can use the optional OVER clause to specify how to partition and/or sort data. Partitioning reduces the sizes of the output data files and can improve performance when Vertica queries external tables containing this data. (See Using Partition Columns.) If you do not specify how to partition the data, Vertica optimizes the export for maximum parallelism.

To specify partition columns, use PARTITION BY in the OVER clause as in the following example:

=> EXPORT TO PARQUET(directory = 'hdfs:///data/export') 
  OVER(PARTITION BY date) AS SELECT date, price FROM public.sales;
 Rows Exported
---------------
         28337
(1 row)

You can sort values within each partition for a further performance improvement, using ORDER BY in the OVER clause. Sort on columns based on how likely they are to appear in query predicates; sort first on the columns that most frequently appear in comparison or range predicates:

=> EXPORT TO PARQUET(directory = 'hdfs:///data/export') 
  OVER(PARTITION BY date ORDER BY price) AS SELECT date, price FROM public.sales;
 Rows Exported
---------------
         28337
(1 row)

You can use ORDER BY even without partitioning. Storing data in sorted order can improve data access and predicate evaluation performance.
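For example, the following exports sorted but unpartitioned data (the directory name is illustrative):

```sql
=> EXPORT TO PARQUET(directory = 'hdfs:///data/export_sorted')
   OVER(ORDER BY price) AS SELECT date, price FROM public.sales;
```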

Targets in the OVER clause must be column references; they cannot be expressions. For more information about OVER, see SQL Analytics.
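If you want to partition on a computed value, one workaround is to compute it in a subquery and reference the resulting column in the OVER clause. The sale_year column and directory below are illustrative:

```sql
=> EXPORT TO PARQUET(directory = 'hdfs:///data/sales_by_year')
   OVER(PARTITION BY sale_year)
   AS SELECT * FROM (
       SELECT EXTRACT(YEAR FROM date) AS sale_year, price
       FROM public.sales
   ) AS annual;
```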

If you are exporting data to a local file system, you might want to force a single node to write all of the files. To do so, use an empty OVER clause.
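For example, the following writes all files from a single node to a local path (the path is illustrative):

```sql
=> EXPORT TO PARQUET(directory = '/data/local_export')
   OVER() AS SELECT date, price FROM public.sales;
```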

You cannot export columns with the TIME, TIMETZ, and INTERVAL data types. If your table includes columns of these types, exclude them by explicitly selecting only the columns you can export:

=> EXPORT TO PARQUET(directory='hdfs:///data/sales_data') 
   AS SELECT date, transactionID, price FROM public.sales;
 Rows Exported
---------------
         14336
(1 row)

You can perform only one export per output directory. If more than one concurrent export targets the same directory, only one succeeds.