Vertica Analytics Platform Version 9.2.x Documentation

EXPORT TO PARQUET

Exports a table, columns from a table, or query results to files in the Parquet format. You can use an OVER() clause to partition the data before export. Partitioning data can improve query performance by enabling partition pruning; see Improving Query Performance.

There are some limitations on the queries you can use in an export statement. See Query Restrictions.

You can export data stored in Vertica in ROS format and data from external tables.

EXPORT TO PARQUET returns the number of rows written.

During an export to HDFS or an NFS mount point, Vertica writes files to a temporary directory in the same location as the destination and renames the directory when the export is complete. Do not attempt to use the files in the temporary directory. During an export to S3, Vertica writes files directly to the destination path, so you must wait for the export to finish before reading the files. (For more about special S3 considerations, see Exporting to S3.)

Syntax

EXPORT TO PARQUET ( directory = 'path'
		    [, param=value [,...] ] )
	[ OVER (over‑clause) ]
	AS SELECT query‑expression;

Parameters

directory

The destination directory for the Parquet files. The directory must not exist, and the current user must have permission to write it. The destination can be on HDFS, S3, or an NFS mount point on the local file system.

compression

Column compression type, one of Snappy or Uncompressed. The default is Snappy.
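For illustration, the following statement exports with compression disabled; the directory path and table name here are assumptions:

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/data_raw',
     compression = 'Uncompressed')
   AS SELECT * FROM public.T1;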

rowGroupSizeMB

The uncompressed size of exported row groups, in MB (integer). The minimum value is 1 and the maximum value is fileSizeMB (or unlimited if fileSizeMB is 0). The default is 512. Because Parquet files are compressed on write, the row groups in the exported files are smaller than this value. For best performance when exporting to HDFS, set rowGroupSizeMB to be smaller than the HDFS block size.
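For example, on an HDFS cluster with a 128MB block size, you might set the row group size below that threshold; the path and table name in this sketch are assumptions:

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/data_grouped',
     rowGroupSizeMB = 64)
   AS SELECT * FROM public.T1;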

fileSizeMB

The maximum file size of a single output Parquet file, in MB (integer). By default, Vertica limits exports to a file size of 10GB. This value is a hint, not a hard limit. A value of 0 means there is no size limit.

This value affects the size of individual output files, not the total output size. For smaller values Vertica divides the output into more files; all data is still exported.
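As a sketch, the following statement caps individual output files at roughly 1GB, so a larger result set is divided across more files; the path and table name are assumptions:

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/data_split',
     fileSizeMB = 1024)
   AS SELECT * FROM public.T1;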

fileMode

For writes to HDFS only, permission to apply to all exported files. You can specify the value in Unix octal format (such as '665') or "user-group-other" format (such as 'rwxr-xr-x'). The value must be formatted as a string even if using the octal format.

Valid octal values range from '0' to '1777'. See HDFS Permissions in the Apache Hadoop documentation.

The default is '660' regardless of the value of fs.permissions.umask-mode in hdfs-site.xml.

When writing files to any destination other than HDFS, this parameter has no effect.

dirMode

For writes to HDFS only, permission to apply to all exported directories. Values follow the same rules as those for fileMode. Further, you must give the Vertica HDFS user full permission (at least 'rwx------' or '700').

The default is '755' regardless of the value of fs.permissions.umask-mode in hdfs-site.xml.

When writing files to any destination other than HDFS, this parameter has no effect.

Arguments

over‑clause

Specifies how to partition table data using PARTITION BY. Within partitions you can sort by using ORDER BY. See SQL Analytics. This clause may contain column references but not expressions.

If you partition data, Vertica creates a Hive-style partition directory structure, transforming column names to lowercase. See Using Partition Columns for a description of the directory structure.

If you omit this clause, Vertica optimizes for maximum parallelism.
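For instance, partitioning on a region column (the table and column names here are assumptions) produces one Hive-style subdirectory per distinct value, such as region=east and region=west, under the destination directory:

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/data_by_region')
   OVER (PARTITION BY region)
   AS SELECT id, amount, region FROM public.sales;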

query‑expression

Specifies the data to be exported. See SELECT for the syntax. See Query Restrictions for important limitations.

Privileges

  • SELECT privileges on the source table
  • USAGE privileges on source table schema
  • Write privileges for the destination directory

Query Restrictions

You must provide an alias column label for selected column targets that are expressions.

The query can contain only a single outer SELECT statement. For example, you cannot use UNION at the outer level, as the following attempt shows.

=> EXPORT TO PARQUET(directory = '/mnt/shared_nfs/accounts/rm') 
   OVER(PARTITION BY hash) 
   AS
   SELECT 1 as account_id, '{}' as json, 0 hash
   UNION ALL
   SELECT 2 as account_id, '{}' as json, 1 hash;
ERROR 8975:  Only a single outer SELECT statement is supported
HINT:  Please use a subquery for multiple outer SELECT statements

Instead, rewrite the query to use a subquery:

=> EXPORT TO PARQUET(directory = '/mnt/shared_nfs/accounts/rm')
   OVER(PARTITION BY hash)
   AS
   SELECT
    account_id,
    json
   FROM
   (
     SELECT 1 as account_id, '{}' as json, 0 hash
     UNION ALL
     SELECT 2 as account_id, '{}' as json, 1 hash
   ) a; 
 Rows Exported 
---------------
             2
(1 row)

To use composite statements such as UNION, INTERSECT, and EXCEPT, rewrite them as subqueries.

Data Types

EXPORT TO PARQUET imposes the following data type restrictions and conversions:

  • Decimal precision must be <= 38.
  • Vertica does not convert TIMESTAMP values to UTC. To avoid problems arising from time zones, use TIMESTAMPTZ instead of TIMESTAMP.
  • The exported Hive types might not be identical to the Vertica types. For example, a Vertica INT is exported as a Hive BIGINT. When defining Hive external tables to read exported data, you might have to adjust column definitions.
  • This operation exports raw Flex columns as binary data.
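Because TIMESTAMP values are not converted to UTC, one workaround is to cast to TIMESTAMPTZ in the export query itself. Note that the cast is an expression, so it requires an alias column label (see Query Restrictions). The table and column names in this sketch are assumptions:

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/events')
   AS SELECT event_id, event_time::TIMESTAMPTZ AS event_time_tz
   FROM public.events;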

Output Files

EXPORT TO PARQUET always creates the output directory, even if the query produces zero rows.

You must use a shared file location for output. If you use a directory in the local file system, it must be an NFS-mounted directory.

For output to the local file system, you must have a USER storage location.

Output file names follow the pattern: [8-character hash]-[node name]-[thread_id].parquet. Column names in partition directories are lowercase.

When exporting to S3 the maximum size of all output is 5TB. You might have to divide large exports into more than one piece.

Vertica does not support simultaneous exports to the same directory in HDFS or S3. The results are undefined.

When exporting to the local file system, the permission mode is 700 for directories and 600 for files. You cannot override these values.

Parquet files exported to a local file system by any Vertica user are owned by the Vertica superuser. Parquet files exported to HDFS or S3 are owned by the Vertica user who exported the data.

Examples

The following example demonstrates exporting all columns from the T1 table in the public schema, using Snappy compression (the default).

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/data') 
  AS SELECT * FROM public.T1;

The following example demonstrates exporting the results of a query using more than one table.

=> EXPORT TO PARQUET(directory='s3://DataLake/sales_by_region') 
   AS SELECT sale.price, sale.date, store.region 
   FROM public.sales sale 
   JOIN public.vendor store ON sale.distribID = store.ID;

The following example demonstrates partitioning and exporting data. EXPORT TO PARQUET first partitions the data on region and then, within each partition, sorts by store.

=> EXPORT TO PARQUET(directory = 's3://DataLake/user2/data') 
  OVER(PARTITION BY store.region ORDER BY store.ID) 
  AS SELECT sale.price, sale.date, store.ID 
  FROM public.sales sale
  JOIN public.vendor store ON sale.distribID = store.ID;

The following example uses an alias column label for a selected column target that is an expression.

=> EXPORT TO PARQUET(directory = 'hdfs:///user3/data') 
  OVER(ORDER BY col1) AS SELECT col1 + col1 AS A, col2
  FROM public.T3;

The following example sets permissions for the output.

=> EXPORT TO PARQUET(directory = 'hdfs:///user1/data',
     fileMode='432', dirMode='rwxrw-r-x') 
  AS SELECT * FROM public.T1;