Jim Knicely authored this tip.
Vertica can divide the work of loading data, taking advantage of parallelism to speed up the operation. One supported type of parallelism is called apportioned load.
An apportioned load divides a single large file or other single source into segments (portions), which are assigned to several nodes to be loaded in parallel.
I want to load a data file that contains 100,000,000 records.
dbadmin=> \! wc -l /home/dbadmin/big_data.txt
For my first load attempt, I’ll load the file from a single node in my three-node cluster.
dbadmin=> \timing
Timing is on.
dbadmin=> COPY big_data FROM '/home/dbadmin/big_data.txt' DIRECT;
Time: First fetch (1 row): 49078.222 ms. All rows formatted: 49078.268 ms
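Next, I’ll rerun the load, this time adding the “ON ANY NODE” option to the COPY command so that Vertica performs an apportioned load. This option tells Vertica that the file is available on every node, so make sure it is accessible at the same path across the cluster.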
dbadmin=> COPY big_data FROM '/home/dbadmin/big_data.txt' ON ANY NODE DIRECT;
Time: First fetch (1 row): 21141.006 ms. All rows formatted: 21141.045 ms
Wow! The apportioned load ran more than twice as fast as the single-node load (about 2.3 times), which works out to roughly 57% less elapsed time:
dbadmin=> SELECT 100 - (21141.006 / 49078.222 * 100) || '%' PCT_FASTER;
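The percentage above measures the reduction in elapsed time rather than a speed multiple. As a rough sketch, the same two timings can be expressed both ways in one query. The column aliases speedup_factor and pct_time_saved are illustrative names, not part of the original session.

```sql
-- Illustrative sketch: the aliases below are made up for this example.
-- The two constants are the "First fetch" timings (in ms) reported above.
SELECT (49078.222 / 21141.006)::NUMERIC(10,2)               AS speedup_factor,
       (100 - (21141.006 / 49078.222 * 100))::NUMERIC(10,2) AS pct_time_saved;
```

With these timings, the two values come out to roughly 2.32 and 56.92. Keep in mind that this compares a single run of each load, so results on another cluster or file may differ.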