Distributing a Load
Vertica can divide the work of loading data among multiple database nodes, taking advantage of parallelism to speed up the operation. How this is done depends on where the data is and what types of parallelism the parsers support.
Vertica can be most effective in distributing a load when the data to be loaded is found in shared storage available to all nodes. Sometimes, however, data is available only on specific nodes, which you must specify.
Types of Load Parallelism
Vertica supports several types of parallelism. The default (DELIMITED) parser uses load parallelism, and user-defined parsers can enable it.
- Distributed load: Files in a multi-file load are loaded on several nodes in parallel, instead of all being loaded on a single node.
- Apportioned load: A single large file or other single source is divided into segments (portions), which are assigned to several nodes to be loaded in parallel. Apportioned load is enabled by default. To disable it, set the EnableApportionLoad configuration parameter to 0.
- Cooperative parse: A source being loaded on a single node uses multi-threading to parallelize the parse. Cooperative parse is enabled by default. To disable it, set the EnableCooperativeParse configuration parameter to 0.
See General Parameters for information about the configuration parameters.
Loading on Specific Nodes
You can indicate which node or nodes should parse an input path by using any of the following:
- A node name:
ON node
- A set of nodes:
ON nodeset
(see Specifying Distributed File Loads) ON ANY NODE
Using the ON ANY NODE
clause indicates that the source file to load is available on all of the nodes. If you specify this clause, COPY opens the file and parses it from any node in the cluster. ON ANY NODE
is the default for HDFS and S3 paths.
Using the ON nodeset
clause indicates that the source file is on all named nodes. If you specify this clause, COPY opens the file and parses it from any node in the set. Be sure that the source file you specify is available and accessible on each applicable cluster node.
If the data to be loaded is on a client, use COPY FROM LOCAL instead of specifying nodes. All local files are loaded and parsed serially with each COPY statement, so you cannot perform parallel loads with the LOCAL option.
Specifying Distributed File Loads
You can direct individual files in a multi-file load to specific nodes, as in the following example of distributed load.
=> COPY t FROM '/data/file1.dat' ON v_vmart_node0001, '/data/file2.dat' ON v_vmart_node0002;
You can use globbing (wildcard expansion) to specify a group of files with the ON ANY NODE directive, as in the following example.
- If apportioned load is enabled (the default), Vertica assigns different files to different nodes. Both the EnableApportionedLoad and EnableApportionedFileLoad must be set to 1.
- If apportioned load is disabled, a single node loads all the data.
=> COPY t FROM '/data/*.dat' ON ANY NODE;
If you have a single file instead of a group of files, you can still, potentially, benefit from apportioned load. The file must be large enough to divide into portions at least equal to ApportionedFileMinimumPortionSizeKB in size, and this size must be large enough to contain at least one whole record. You must also use a parser that supports apportioned load. The delimited parser built into Vertica supports apportioned load, but other parsers might not.
The following example shows how you can load a single large file using multiple nodes.
=> COPY t FROM '/data/bigfile.dat' ON ANY NODE;
You can limit the nodes that participate in an apportioned load. Doing so is useful if you need to balance several concurrent loads. Vertica apportions each load individually; it does not account for other loads that might be in progress on those nodes. You can, therefore, potentially speed up your loads by managing apportioning yourself.
The following example shows how you can apportion loads on specific nodes.
=> COPY t FROM '/data/big1.dat' ON (v_vmart_node0001, v_vmart_node0002, v_vmart_node0003), '/data/big2.dat' ON (v_vmart_node0004, v_vmart_node0005);
Loaded files can be of different formats, such as BZIP, GZIP, and others. However, because file compression is a filter, you cannot use apportioned load for a compressed file.
Specifying Distributed Loads with Sources
You can also apportion loads using COPY WITH SOURCE. You can create sources and parsers with the user-defined load (UDL) API. If both the source and parser support apportioned load, and EnableApportionLoad is set, then Vertica attempts to divide the load among nodes.
The following example shows a load that you could apportion.
=> COPY t WITH SOURCE MySource() PARSER MyParser();
The built-in delimited parser supports apportioning, so you can use it with a user-defined source, as in the following example.
=> COPY t WITH SOURCE MySource();
Number of Load Streams
Although the number of files you can load is not restricted, the optimal number of load streams depends on several factors, including:
- Number of nodes
- Physical and logical schemas
- Host processors
- Memory
- Disk space
Using too many load streams can deplete or reduce system memory required for optimal query processing. See Best Practices for Managing Workload Resources for advice on configuring load streams.