Specifying COPY FROM Options

Each COPY statement requires a FROM clause to indicate the location of the file or files being loaded. In the FROM clause, you specify either a path to load from or STDIN. You can optionally specify the compression format of each file as UNCOMPRESSED, BZIP, GZIP, or LZO.

When using COPY in conjunction with a CREATE EXTERNAL TABLE statement, you cannot use the COPY FROM STDIN or LOCAL options. For most external tables, you must also define a user storage location to allow non-administrative users to query the table. See Required Permissions in Creating External Tables.

Some COPY FROM options are not available for all file types. See COPY Parameters.

Loading from STDIN

Using STDIN for the FROM clause lets you load uncompressed data or BZIP- or GZIP-compressed files.
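
Because STDIN data arrives over the client connection, one way to supply it is to pipe a file into vsql. The following is a minimal sketch; the table name, file path, and delimiter are hypothetical:

$ cat /data/sales.dat | vsql -c "COPY sales FROM STDIN DELIMITER '|';"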

Loading from a Specific Path

Use the pathToData argument to indicate the location of the file to load. You can load data from the following locations:

  • The local file system.
  • NFS, through a mount point on the local file system.
  • HDFS, using a URL of the form "hdfs:///path/to/data". For more information about HDFS URLs, see HDFS URL Format in Integrating with Apache Hadoop.
  • An S3 or Google Cloud Storage bucket, for data in text, delimited, Parquet, and ORC formats only. See Loading from an S3 Bucket and Loading from Google Cloud Storage.

When copying from the local file system, the COPY statement expects to find files in the same location on every node that participates in the query. If you are using NFS, then you can create an NFS mount point on each node. Doing so allows all database nodes to participate in the load for better performance without requiring files to be copied to all nodes.

Treat NFS mount points as local files in paths:

=> COPY sales FROM '/mount/sales.dat' ON ANY NODE;

You can specify more than one path in the same COPY statement, as in the following example.

=> COPY myTable FROM 'hdfs:///data/sales/01/*.dat', 'hdfs:///data/sales/02/*.dat',
   'hdfs:///data/sales/historical.dat';

If pathToData resolves to a storage location on a local file system (not HDFS), and the user invoking COPY is not a superuser, these permissions are required:

  • The storage location must have been created with the USER option (see CREATE LOCATION)
  • The user must already have been granted READ access to the storage location where the file or files exist, as described in GRANT (Storage Location)

Vertica prevents symbolic links from allowing unauthorized access.
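
As an illustration of these requirements, a superuser might set up read access for non-superusers as in the following sketch; the path, node clause, and role name are hypothetical:

=> CREATE LOCATION '/data/staging' ALL NODES USAGE 'USER';
=> CREATE ROLE LoadUsers;
=> GRANT READ ON LOCATION '/data/staging' TO LoadUsers;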

Loading with Wildcards (glob)

You can invoke COPY for a large number of files in a shared directory with a single statement such as:

=> COPY myTable FROM '/data/manyfiles/*.dat' ON ANY NODE;

The glob (*) must indicate a set of files, not directories. The following statement fails if /data/manyfiles contains any subdirectories:

=> COPY myTable FROM '/data/manyfiles/*' ON ANY NODE;

Using a wildcard with the ON ANY NODE clause expands the file list on the initiator node. COPY then distributes the individual files among all nodes, so that the load is evenly distributed across the entire cluster.

ON ANY NODE is the default for HDFS paths, as in the following example:

=> COPY myTable FROM 'hdfs:///data/manyfiles/*';

You can also distribute a file set across a subset of nodes, which you might do to balance concurrent loads. For example, this command distributes the loading of individual files among the three named nodes:

=> COPY myTable FROM '/mydirectory/ofmanyfiles/*.dat' 
   ON (v_vmart_node0001, v_vmart_node0002, v_vmart_node0003);

Distributing file loads across nodes depends on two configuration parameters, EnableApportionLoad and EnableApportionFileLoad. Both are enabled by default. See General Parameters for more information about these parameters.
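
One way to confirm the current settings is to query the CONFIGURATION_PARAMETERS system table, as in this sketch:

=> SELECT parameter_name, current_value
   FROM CONFIGURATION_PARAMETERS
   WHERE parameter_name IN ('EnableApportionLoad', 'EnableApportionFileLoad');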

Loading from an S3 Bucket

To access data in S3, you must first complete the following tasks:

  • By default, bucket access is restricted to the communal storage bucket. To load data from non-communal storage buckets, use an AWS access key.
  • Set the AWSRegion configuration parameter to tell Vertica which AWS region your S3 bucket is in, as in the following example. If the region is not correct, you might experience a delay before the load fails because Vertica retries several times before giving up. The default region is us-east-1.

    => ALTER SESSION SET AWSRegion='us-west-1';
  • To allow users without superuser privileges to access data in S3, create a USER storage location for the S3 path (see CREATE LOCATION) and grant users access, as in the following example:

    => CREATE LOCATION 's3://datalake' SHARED USAGE 'USER' LABEL 's3user';

    => CREATE ROLE ExtUsers;
       -- Assign users to this role using GRANT (Role).

    => GRANT READ ON LOCATION 's3://datalake' TO ExtUsers;
  • If you are using AWS STS temporary session tokens, set the AWSSessionToken parameter as shown in the following example.

    $ aws sts get-session-token 
      { 
        "Credentials": { 
            "AccessKeyId": "ASIAJZQNDVS727EHDHOQ", 
            "SecretAccessKey": "F+xnpkHbst6UPorlLGj/ilJhO5J2n3Yo7Mp4vYvd", 
            "SessionToken": "FQoDYXdzEKv//////////wEaDMWKxakEkCyuDH0UjyKsAe6/3REgW5VbWtpuYyVvSnEK1jzGPHi/jPOPNT7Kd+ftSnD3qdaQ7j28SUW9YYbD50lcXikz/HPlusPuX9sAJJb7w5oiwdg+ZasIS/+ejFgCzLeNE3kDAzLxKKsunvwuo7EhTTyqmlLkLtIWu9zFykzrR+3Tl76X7EUMOaoL31HOYsVEL5d9I9KInF0gE12ZB1yN16MsQVxpSCavOFHQsj/05zbxOQ4o0erY1gU=", 
            "Expiration": "2018-07-18T05:56:33Z" 
        } 
      } 
    					
    $ vsql
    => ALTER SESSION SET AWSAuth = 'ASIAJZQNDVS727EHDHOQ:F+xnpkHbst6UPorlLGj/ilJhO5J2n3Yo7Mp4vYvd';				
    => ALTER SESSION SET AWSSessionToken = 'FQoDYXdzEKv//////////wEaDMWKxakEkCyuDH0UjyKsAe6/3REgW5VbWtpuYyVvSnEK1jzGPHi/jPOPNT7Kd+ftSnD3qdaQ7j28SUW9YYbD50lcXikz/HPlusPuX9sAJJb7w5oiwdg+ZasIS/+ejFgCzLeNE3kDAzLxKKsunvwuo7EhTTyqmlLkLtIWu9zFykzrR+3Tl76X7EUMOaoL31HOYsVEL5d9I9KInF0gE12ZB1yN16MsQVxpSCavOFHQsj/05zbxOQ4o0erY1gU=';	

    If the token expires before your Vertica session ends, you must renew it. Session tokens are best suited to short-lived sessions.

You might need to set other S3 Parameters to specify a certificate authority. You can set parameters globally and for the current session with ALTER DATABASE…SET PARAMETER and ALTER SESSION…SET PARAMETER, respectively.

If you use session tokens, all AWS parameters must be set at the session level.

You can then load data from S3 as in the following example.

=> COPY t FROM 's3://datalake/sales.parquet' PARQUET;

You can specify either a path, as in the previous example, or a glob, if all files in the glob can be loaded together. In the following example, AWS_DataLake contains only ORC files.

=> COPY t FROM 's3://datalake/*' ORC;

You can specify a comma-separated list of S3 paths, as in the following example. All buckets must be in the same region. To load from more than one region, use separate COPY statements and change the value of AWSRegion between calls.

=> COPY t FROM 's3://AWS_Data_1/sales.parquet', 's3://AWS_Data_2/sales.parquet' PARQUET;

Parquet files can be partitioned, and Vertica can use partitioning information to improve query performance. See Using Partition Columns for more information.

Loading from Google Cloud Storage

To access data in Google Cloud Storage (GCS) you must first do the following tasks:

  • Create a default project, obtain a developer key, and enable S3 interoperability mode as described in the GCS documentation.

  • Set the GCSAuth configuration parameter as in the following example.

    => ALTER SESSION SET GCSAuth='id:secret';

When reading from GCS you do not need to specify a region.

You might need to set other Google Cloud Storage Parameters. You can set parameters globally and for the current session with ALTER DATABASE…SET PARAMETER and ALTER SESSION…SET PARAMETER, respectively.

If you use session tokens, all GCS parameters must be set at the session level.

You can then access data from GCS as in the following example.

=> COPY t FROM 'gs://DataLake/clicks.parquet' PARQUET;

You can load from more than one bucket in the same COPY statement.
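
For example, the following sketch loads Parquet files from two buckets in one statement; the bucket and file names are hypothetical:

=> COPY t FROM 'gs://DataLake/sales1.parquet', 'gs://OtherLake/sales2.parquet' PARQUET;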

Loading ORC and Parquet Files

In the COPY statement, specify a format of ORC or PARQUET:

=> COPY tableName FROM path ORC[(hive_partition_cols='partitions')];
=> COPY tableName FROM path PARQUET[(hive_partition_cols='partitions')];

For information about the hive_partition_cols parameter, see Using Partition Columns.
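
For illustration, a load over a Hive-style partitioned directory layout might look like the following sketch; the HDFS path and partition column names are hypothetical:

=> COPY sales FROM 'hdfs:///data/sales/*/*/*' PARQUET(hive_partition_cols='year,region');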

For information about performance improvements specific to these formats, see Improving Query Performance.

Be aware that if you load from multiple files in the same COPY statement and the load of any file aborts, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

Loading on Specific Nodes

You can optionally indicate which node or nodes should parse the input by using any of the following:

  • A node name
  • A set of nodes
  • ON ANY NODE

Using the ON ANY NODE clause indicates that the source file to load is available on all of the nodes. If you specify this clause, COPY opens the file and parses it from any node in the cluster. ON ANY NODE is the default for HDFS and S3 paths.

Using the ON nodeset clause indicates that the source file is on all named nodes. If you specify this clause, COPY opens the file and parses it from any node in the set. Be sure that the source file you specify is available and accessible on each applicable cluster node.
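
For example, if the file resides on only one node, you can name that node explicitly; the table, path, and node name in this sketch are hypothetical:

=> COPY sales FROM '/data/sales.dat' ON v_vmart_node0001 DELIMITER '|';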

Loading from a Local Client

To bulk-load data from a client without requiring database superuser privileges, use the COPY FROM LOCAL option. You can load from either STDIN or a specific path, but not from a specific node (or ON ANY NODE), because you are loading from the client. All local files are loaded and parsed serially by each COPY statement, so you cannot perform parallel loads with the LOCAL option. See Using Parallel Load Streams.
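
For example, a client-side load from a file on the client machine might look like this sketch; the table, path, and delimiter are hypothetical:

=> COPY sales FROM LOCAL '/home/dbuser/sales.dat' DELIMITER '|';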

You can load one or more files in the supported formats: UNCOMPRESSED, BZIP, GZIP, or LZO.

For specific information about saving rejected data and exceptions files when using COPY FROM LOCAL, see Capturing Load Rejections and Exceptions.

Loading Compressed Files (BZIP, GZIP, and LZO)

You can load files compressed with BZIP, GZIP, or LZO. When loading multiple files, you must indicate the compression format for each file. For example, this statement copies a BZIP-compressed file into the flex table twitter, using the fjsonparser:

=> COPY twitter FROM '/server1/TWITTER/tweets1.json.bz2' BZIP PARSER fjsonparser() DIRECT;
 Rows Loaded 
------------- 
      172094 
(1 row) 

The following statement loads an LZO file delimited with '|' characters into the flex table twitter:

=> COPY twitter FROM '/server1/TWITTER/tweets2.lzo' LZO DELIMITER '|';
 Rows Loaded
-------------
      19421
(1 row)

Vertica supports the following options to the lzop command used to compress the file:

  • compression level: -1 through -9, --fast, --best
  • -F, --no-checksum, --crc32, --adler32

For more information about these options, see lzop.org.
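
For example, you might compress a file with one of the supported compression levels and then load the result, as in this sketch; the file and table names are hypothetical:

$ lzop -7 /data/sales.dat
=> COPY sales FROM '/data/sales.dat.lzo' LZO DELIMITER '|';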

You cannot use BZIP, GZIP, and LZO compression with files in the ORC or Parquet formats.