Faster Data Loads with Apportioned Load: Quick Tip

Posted September 13, 2018 by Phil Molea, Sr. Information Developer, Vertica

Jim Knicely authored this tip.

Vertica can divide the work of loading data, taking advantage of parallelism to speed up the operation. One supported type of parallelism is called apportioned load.

An apportioned load divides a single large file or other single source into segments (portions), which are assigned to several nodes to be loaded in parallel.
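To make the idea of portions concrete, here is a minimal Python sketch that splits one source's byte range into contiguous portions, one per node. It is only an illustration: the function `apportion` and the node names are hypothetical, and this is not how Vertica chooses portion boundaries internally.

```python
def apportion(total_bytes, nodes):
    """Split the byte range [0, total_bytes) into one contiguous portion per node.

    Hypothetical helper for illustration only; not Vertica's algorithm.
    Returns a list of (node, start_offset, end_offset) tuples, end exclusive.
    """
    base, extra = divmod(total_bytes, len(nodes))
    portions = []
    start = 0
    for i, node in enumerate(nodes):
        # The first `extra` nodes take one additional byte so sizes differ by at most 1.
        size = base + (1 if i < extra else 0)
        portions.append((node, start, start + size))
        start += size
    return portions


# Example: a 10-byte source shared by three (made-up) node names.
for node, start, end in apportion(10, ["node01", "node02", "node03"]):
    print(node, start, end)
```

A real loader would also have to align each portion with record boundaries so that no row is split between two nodes; the sketch ignores that step.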

Example:

I want to load a data file that contains 100,000,000 records.

dbadmin=> \! wc -l /home/dbadmin/big_data.txt
100000000 /home/dbadmin/big_data.txt

For my first load attempt, I’ll load the file from a single node in my 3 node cluster.

dbadmin=> \timing
Timing is on.

dbadmin=> COPY big_data FROM '/home/dbadmin/big_data.txt' DIRECT;
 Rows Loaded
-------------
   100000000
(1 row)

Time: First fetch (1 row): 49078.222 ms. All rows formatted: 49078.268 ms

Next I will re-run the load, but this time include the “ON ANY NODE” option of the COPY command so that Vertica performs an apportioned load.

dbadmin=> COPY big_data FROM '/home/dbadmin/big_data.txt' ON ANY NODE DIRECT;
 Rows Loaded
-------------
   100000000
(1 row)

Time: First fetch (1 row): 21141.006 ms. All rows formatted: 21141.045 ms

Wow! An apportioned load executed over twice as fast as a single node load!

dbadmin=> SELECT 100 - (21141.006 / 49078.222 * 100) || '%' PCT_FASTER;
     PCT_FASTER
---------------------
 56.923855146993700%
(1 row)

Helpful link:

https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/ExtendingVertica/UDx/UDL/ApportionedLoad.htm
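As a cross-check on the two timings reported in the example, the short Python sketch below computes both the speedup factor and the percentage reduction in elapsed time. The variable names are my own; the two millisecond values are the "First fetch" times from the session output.

```python
# "First fetch" timings from the example session, in milliseconds.
single_node_ms = 49078.222   # COPY ... DIRECT
apportioned_ms = 21141.006   # COPY ... ON ANY NODE DIRECT

# How many times faster the apportioned load ran.
speedup = single_node_ms / apportioned_ms

# Share of the original elapsed time that was saved (the PCT_FASTER figure).
time_saved_pct = 100 * (1 - apportioned_ms / single_node_ms)

print(f"speedup: {speedup:.2f}x")
print(f"elapsed time reduced by: {time_saved_pct:.2f}%")
```

The two numbers describe the same result from different angles: the load ran roughly 2.3 times as fast, which is the same as finishing in about 57% less time.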

Have fun!