Specifying COPY FROM Options

Each COPY statement requires a FROM option to indicate the location of the file or files being loaded. This syntax excerpt shows the available FROM keywords and their associated file format options:

FROM { STDIN [ BZIP | GZIP | LZO | UNCOMPRESSED ]
     | 'pathToData' [ ON nodename | ON nodeset | ON ANY NODE ]
       [ BZIP | GZIP | LZO | UNCOMPRESSED ] [, ...]
     | LOCAL { STDIN | 'pathToData' }
       [ BZIP | GZIP | LZO | UNCOMPRESSED ] [, ...]
}

Each of the FROM keywords lets you optionally specify the format of the load file as UNCOMPRESSED, BZIP, GZIP, or LZO.

Note: When using COPY in conjunction with a CREATE EXTERNAL TABLE statement, you cannot use the COPY FROM STDIN or LOCAL options.

Loading from STDIN

Using STDIN for the FROM option lets you load uncompressed data, BZIP, or GZIP files.
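
For example, here is a minimal sketch, assuming a hypothetical table named myTable and pipe-delimited input:

=> COPY myTable FROM STDIN DELIMITER '|';

In vsql, you then enter the data rows and end the input with a backslash and period (\.) on a line by itself.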

Loading from a Specific Path

You must use the pathToData argument to indicate the location of the file to load. You can load one or more files, using a path on the local file system or in HDFS. You can optionally indicate which node or nodes should parse the input by using the ON nodename, ON nodeset, or ON ANY NODE clause:

Using the ON ANY NODE clause indicates that the source file to load is available on all of the nodes. If you specify this clause, COPY opens the file and parses it from any node in the cluster. ON ANY NODE is the default for HDFS paths.

Using the ON nodeset clause indicates that the source file is on all named nodes. If you specify this clause, COPY opens the file and parses it from any node in the set. Be sure that the source file you specify is available and accessible on each applicable cluster node.
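
For example, the following sketches assume a hypothetical table named myTable and paths that are readable on the relevant nodes. The first statement lets any node open and parse the shared file; the second restricts parsing to two named nodes:

=> COPY myTable FROM '/data/shared/file1.dat' ON ANY NODE;
=> COPY myTable FROM '/data/local/file2.dat' ON (v_vmart_node0001, v_vmart_node0002);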

If pathToData resolves to a storage location on a local file system (not HDFS), and the user invoking COPY is not a superuser, these permissions are required:

When a non-superuser with the required privileges invokes COPY from that storage location, Vertica prevents symbolic links from allowing unauthorized access.

Loading BZIP and GZIP Files

You can load files compressed with BZIP or GZIP. When loading multiple files, you must indicate the compression format for each file. For example, this statement copies a BZIP file into the flex table twitter, using the fjsonparser:

=> COPY twitter FROM '/server1/TWITTER/tweets1.json.bz2' BZIP parser fjsonparser() direct; 
 Rows Loaded 
------------- 
      172094 
(1 row) 
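
When loading multiple compressed files with a single statement, indicate the format after each file name. The following sketch assumes a second, hypothetical GZIP file alongside the BZIP file shown above:

=> COPY twitter FROM '/server1/TWITTER/tweets1.json.bz2' BZIP,
                     '/server1/TWITTER/tweets3.json.gz' GZIP
   PARSER fjsonparser();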

Loading LZO Files

You can load LZO files using the same COPY statements as you use for BZIP and GZIP files. The following statement loads an LZO file delimited with '|' characters into the flex table twitter:

=> COPY twitter FROM '/server1/TWITTER/tweets2.lzo' LZO DELIMITER '|';
 Rows Loaded
-------------
      19421
(1 row)

Vertica supports several of the options to the lzop command used to compress the file.

For more information about these options, see lzop.org.

Loading with Wildcards (glob)

You can invoke COPY for a large number of files in a shared directory with a single statement such as:

=> COPY myTable FROM '/data/manyfiles/*.dat' ON ANY NODE;

The glob (*) must indicate a set of files, not directories. The following statement fails if /data/manyfiles contains any subdirectories:

=> COPY myTable FROM '/data/manyfiles/*' ON ANY NODE;

Using a wildcard with the ON ANY NODE clause expands the file list on the initiator node. COPY then distributes the individual files among all nodes, so that the load workload is evenly distributed across the entire cluster.

ON ANY NODE is the default for HDFS paths, as in the following example:

=> COPY myTable FROM 'hdfs:///data/manyfiles/*';

You can also distribute a file set across a subset of nodes, which you might do to balance concurrent loads. For example, this command distributes the loading of individual files among the three named nodes:

=> COPY myTable FROM '/mydirectory/ofmanyfiles/*.dat' 
   ON (v_vmart_node0001, v_vmart_node0002, v_vmart_node0003);

Distributing file loads across nodes depends on two configuration parameters, EnableApportionLoad and EnableApportionFileLoad. Both are enabled by default. See General Parameters for more information about these parameters.
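
As a sketch, assuming you have the privileges to read and set configuration parameters, you could check or change these settings with the built-in configuration functions (see General Parameters for the full syntax and current recommendations):

=> SELECT GET_CONFIG_PARAMETER('EnableApportionLoad');
=> SELECT SET_CONFIG_PARAMETER('EnableApportionLoad', 1);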

Loading from a Local Client

To bulk load data from a client without requiring database superuser privileges, use the COPY FROM LOCAL option. You can load from either STDIN or a specific path, but not from a specific node (or ON ANY NODE), because you are loading from the client. All local files are loaded and parsed serially with each COPY statement, so you cannot perform parallel loads with the LOCAL option. See Using Parallel Load Streams.

You can load one or more files in the supported formats: UNCOMPRESSED, BZIP, GZIP, or LZO.
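
For example, this sketch assumes a hypothetical table named myTable and a GZIP-compressed, comma-delimited file on the client machine:

=> COPY myTable FROM LOCAL '/home/dbuser/data.csv.gz' GZIP DELIMITER ',';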

For specific information about saving rejected data and exceptions files when using COPY FROM LOCAL, see Capturing Load Rejections and Exceptions.