Using Parallel Load Streams

Vertica can divide the work of loading data, taking advantage of parallelism to speed up the operation. Vertica supports several types of parallelism:

  • Distributed load: Files in a multi-file load are loaded on several nodes in parallel, instead of all being loaded on a single node.
  • Apportioned load: A single large file or other single source is divided into segments (portions), which are assigned to several nodes to be loaded in parallel.

    The file must be accessible and identical on every node that participates in the load. If the file differs between nodes, Vertica returns an incorrect or incomplete result, with no error or warning.

    Apportioned load is enabled by default. If you want to disable it, set the EnableApportionLoad configuration parameter to 0.

  • Cooperative parse: A source being loaded on a single node uses multi-threading to parallelize the parse.

    Cooperative parse is enabled by default. If you want to disable it, set the EnableCooperativeParse configuration parameter to 0.

See General Parameters for information about the configuration parameters.
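
As a minimal sketch of changing these settings, the following statements disable apportioned load and cooperative parse at the database level. This assumes a Vertica release that supports ALTER DATABASE ... SET PARAMETER; earlier releases might instead require the SET_CONFIG_PARAMETER function, so verify the syntax against the documentation for your version.

-- Assumption: DEFAULT refers to the current database.
=> ALTER DATABASE DEFAULT SET PARAMETER EnableApportionLoad = 0;
=> ALTER DATABASE DEFAULT SET PARAMETER EnableCooperativeParse = 0;

Setting either parameter back to 1 restores the default behavior.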

Specifying Distributed File Loads

The COPY FROM statement provides several ways to distribute a file load.

You can direct individual files in a multi-file load to specific nodes, as in the following example of distributed load.

=> COPY t FROM '/data/file1.dat' ON v_vmart_node0001, '/data/file2.dat' ON v_vmart_node0002;

You can use globbing (wildcard expansion) with the ON ANY NODE directive to specify a group of files. How Vertica distributes the files depends on whether apportioned load is enabled:

  • If apportioned load is enabled (the default), Vertica assigns different files to different nodes. Both the EnableApportionLoad and EnableApportionedFileLoad parameters must be set to 1.
  • If apportioned load is disabled, a single node loads all the data.

The following example loads all files that match the pattern.

=> COPY t FROM '/data/*.dat' ON ANY NODE;

If you have a single file instead of a group of files, you can still benefit from apportioned load if two conditions are met. The file must be large enough to divide into portions that are each at least ApportionedFileMinimumPortionSizeKB in size, and you must use a parser that supports apportioned load. The delimited parser built into Vertica supports apportioned load, but other parsers might not.
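
To check the settings that govern apportioned file loads on your system, you could query the current parameter values. The following is a sketch that assumes the CONFIGURATION_PARAMETERS system table in the V_MONITOR schema, with its parameter_name, current_value, and default_value columns, is available in your release.

-- Assumption: the apportioning parameters all contain "Apportion" in their names.
=> SELECT parameter_name, current_value, default_value
   FROM v_monitor.configuration_parameters
   WHERE parameter_name ILIKE '%Apportion%';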

The following example shows how you can load a single large file using multiple nodes.

=> COPY t FROM '/data/bigfile.dat' ON ANY NODE;

You can limit the nodes that participate in an apportioned load. Doing so is useful if you need to balance several concurrent loads. Vertica apportions each load individually; it does not account for other loads that might be in progress on those nodes. You can, therefore, potentially speed up your loads by managing apportioning yourself.

The following example shows how you can apportion loads on specific nodes.

=> COPY t FROM '/data/big1.dat' ON (v_vmart_node0001, v_vmart_node0002, v_vmart_node0003),
		'/data/big2.dat' ON (v_vmart_node0004, v_vmart_node0005);

Loaded files can use different compression formats, such as BZIP and GZIP. However, because file compression is handled by a filter, you cannot use apportioned load for a compressed file.
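
Compressed input can still be loaded in parallel as a distributed load if it is split into several files. The following sketch assumes two hypothetical GZIP files, part1.dat.gz and part2.dat.gz, each present on the node named in its ON clause. Each file is loaded whole by a single node rather than being apportioned.

-- Hypothetical file names; each compressed file is loaded by one node.
=> COPY t FROM '/data/part1.dat.gz' ON v_vmart_node0001 GZIP,
   '/data/part2.dat.gz' ON v_vmart_node0002 GZIP;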

Specifying Distributed Loads with Sources

You can also apportion loads using COPY WITH SOURCE. You can create sources and parsers with the User-Defined Load (UDL) API. If both the source and the parser support apportioned load, and the EnableApportionLoad parameter is set to 1, then Vertica attempts to divide the load among nodes.

The following example shows a load that you could apportion.

=> COPY t WITH SOURCE MySource() PARSER MyParser();

The built-in delimited parser supports apportioning, so you can use it with a user-defined source, as in the following example.

=> COPY t WITH SOURCE MySource();

Number of Load Streams

Although the number of files you can load is not restricted, the optimal number of load streams depends on several factors, including:

  • Number of nodes
  • Physical and logical schemas
  • Host processors
  • Memory
  • Disk space

Using too many load streams can deplete or reduce system memory required for optimal query processing. See Best Practices for Managing Workload Resources for advice on configuring load streams.
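
When tuning the number of concurrent load streams, it can help to see which loads are currently running. The following sketch assumes the LOAD_STREAMS system table in the V_MONITOR schema and the columns shown; check the system table reference for your release before relying on it.

-- Assumption: is_executing is a Boolean column that is true while a load is running.
=> SELECT stream_name, table_name, accepted_row_count, rejected_row_count
   FROM v_monitor.load_streams
   WHERE is_executing;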