Partitioning and Sorting Data

When exporting, you can use the optional OVER clause in the SELECT statement to specify how to partition and/or sort the exported data. Partitioning reduces the sizes of the output data files and can improve performance when Vertica queries external tables containing this data. (See Using Partition Columns.) If you do not specify how to partition the data, Vertica optimizes the export for maximum parallelism.

To specify partition columns, use PARTITION BY in the OVER clause as in the following example:

=> EXPORT TO PARQUET(directory = 'webhdfs:///data/export') 
  OVER(PARTITION BY date) AS SELECT transactionID, price FROM public.sales;
 Rows Exported
---------------
         28337
(1 row)

You can both partition by a column and include that column in the SELECT clause. Including the column allows you to sort it. Vertica still takes advantage of the partitioning during query execution.

You can sort values within a partition for a further performance improvement. Sort table columns based on the likelihood of their occurrence in query predicates; columns that most frequently occur in comparison or range predicates should be sorted first. You can sort values within each partition using ORDER BY in the OVER clause:

=> EXPORT TO PARQUET(directory = 'webhdfs:///data/export') 
  OVER(PARTITION BY date ORDER BY price) AS SELECT date, price FROM public.sales;
 Rows Exported
---------------
         28337
(1 row)

You can use ORDER BY even without partitioning. Storing data in sorted order can improve data access and predicate evaluation performance.

Targets in the OVER clause must be column references; they cannot be expressions. For more information about OVER, see SQL Analytics.

If you are exporting data to a local file system, you might want to force a single node to write all of the files. To do so, use an empty OVER clause.