EXPORT TO PARQUET
Exports a table, columns from a table, or query results to files in the Parquet format. You can use an OVER() clause to partition the data before export. Partitioning data can improve query performance by enabling partition pruning; see Improving Query Performance for Data Stored in HDFS.
You can export data stored in Vertica in ROS format and data from external tables.
Files exported to the local file system are owned by the Vertica superuser; files exported to HDFS are owned by the user who performed the export (see Notes).
EXPORT TO PARQUET returns the number of rows written.
During the export, Vertica writes files to a temporary directory in the same location as the destination and renames the directory when the export is complete. Do not attempt to use the files in the temporary directory.
Syntax
EXPORT TO PARQUET ( directory = 'directory' [, compression = 'compression_type'] [, rowGroupSizeMB = 'size'] ) [ OVER ( over-clause ) ] AS SELECT query-expression;
Parameters
directory | The destination directory for the Parquet files. The directory must not exist, and the current user must have permission to write to it. The directory can be in HDFS or on the local file system.
compression_type | Column compression type, either Snappy or Uncompressed. The default is Snappy.
size | The uncompressed size of exported row groups, in MB. The maximum value is 512 and the default is 64. The row groups in the exported files are smaller because Parquet files are compressed on write. For best performance, set size to be smaller than the HDFS block size.
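Neither optional parameter appears in the examples at the end of this page. A minimal sketch combining both (the directory path is hypothetical; public.T1 is the table used in the first example below):

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/uncompressed_data',
                     compression = 'Uncompressed',
                     rowGroupSizeMB = '128')
   AS SELECT * FROM public.T1;

The row group size of 128 stays under the 512 MB maximum described above.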
Arguments
over-clause | Specifies how to partition table data using PARTITION BY. Within partitions you can sort using ORDER BY. See SQL Analytics. This clause may contain column references but not expressions. If you partition data, Vertica creates a Hive-style partition directory structure; see Using Partition Columns for a description of the structure, and the sketch after this table. If you omit this clause, Vertica optimizes for maximum parallelism.
query-expression | Specifies the data to export. See SELECT for the syntax.
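As a sketch of the Hive-style layout that PARTITION BY produces (the directory path, table, and column names here are hypothetical, and the file names are illustrative only):

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/sales_by_region')
   OVER(PARTITION BY region)
   AS SELECT price, region FROM public.sales_summary;
-- Expected layout: one subdirectory per distinct value of region,
-- containing data files named as described under Notes:
--   hdfs:///user1/sales_by_region/region=East/...parquet
--   hdfs:///user1/sales_by_region/region=West/...parquet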
Privileges
- SELECT privileges on the source table
- USAGE privileges on the source table schema
- Write privileges for the destination directory
Notes
- This operation does not support the TIME, TIMETZ, and INTERVAL data types. DECIMAL precision must be 38 or less.
- Vertica does not convert TIMESTAMP values to UTC. To avoid problems arising from time zones, use TIMESTAMPTZ instead of TIMESTAMP (see the sketch following these notes).
- You must provide an alias column label for selected column targets that are expressions.
- The OVER() clause, if specified, can contain only column references, not an expression.
- This operation exports raw Flex columns as binary data.
- The exported Hive types might not be identical to the Vertica types. For example, a Vertica INT is exported as a Hive BIGINT. When defining Hive external tables to read exported data, you might have to adjust column definitions.
- If you specify a directory in the local file system that is not shared storage, Vertica distributes the files among nodes according to how the export is partitioned. To write all local files on one node, use an empty OVER() clause.
- If you specify a directory in the local file system, you must have a USER storage location.
- Vertica does not support simultaneous exports to the same directory in HDFS. The results are undefined.
- Output file names follow the pattern: [8-character hash]-[node name]-[thread_id].parquet.
- Parquet files exported to the local file system by any Vertica user are owned by the Vertica superuser. Parquet files exported to HDFS are owned by the Vertica user who exported the data.
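A minimal sketch of the TIMESTAMP workaround from the notes above (the table and column names are hypothetical): cast the column to TIMESTAMPTZ in the query expression. Because the cast is an expression, the selected column also needs an alias.

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/event_data')
   AS SELECT event_id, event_ts::TIMESTAMPTZ AS event_ts_tz
      FROM public.events;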
Examples
The following example demonstrates exporting all columns from the T1 table in the public schema, using Snappy compression (the default).
=> EXPORT TO PARQUET(directory = 'hdfs:///user1/data') AS SELECT * FROM public.T1;
 Rows Exported
---------------
         87436
(1 row)
The following example demonstrates exporting the results of a query that joins two tables.
=> EXPORT TO PARQUET(directory='hdfs:///data/sales_by_region')
   AS SELECT sale.price, sale.date, store.region
   FROM public.sales sale
   JOIN public.vendor store ON sale.distribID = store.ID;
 Rows Exported
---------------
         23301
(1 row)
The following example demonstrates partitioning and exporting data. EXPORT TO PARQUET first partitions the data on 'b' and then, within each partition, sorts by 'd'.
=> EXPORT TO PARQUET(directory = 'hdfs:///user2/data')
   OVER(PARTITION BY b ORDER BY d)
   AS SELECT b, d FROM public.T2;
 Rows Exported
---------------
        120931
(1 row)
The following example uses an alias column label for a selected column target that is an expression.
=> EXPORT TO PARQUET(directory = 'hdfs:///user3/data')
   OVER(ORDER BY col1)
   AS SELECT col1 + col1 AS A, col2 FROM public.T3;
 Rows Exported
---------------
         14336
(1 row)