Data Loading in Vertica Using COPY

Bulk Loading with the COPY Statement

The COPY statement is the most efficient way to load large amounts of data into a Vertica database. It loads one or more files from a cluster host or a client system into the database. For bulk loading, the most useful forms of COPY are:

  • COPY LOCAL: Loads one or more data files from a local client system to the Vertica host, where the server processes the files.
  • COPY with source data on a Vertica cluster: Loads one or more data files that already reside on cluster nodes, converting formats such as JSON and CSV into Vertica's internal format.
  • COPY using User-Defined Load (UDL) functions: Loads data through custom sources, filters, or parsers, giving you control over how the data is read and parsed.
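
For example, a minimal COPY LOCAL invocation might look like the following (the table name and client-side path are hypothetical):

=> COPY sales FROM LOCAL '/home/user/sales.csv' DELIMITER ',';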

All forms of the COPY statement share the same underlying process, although each has its own limitations. Regardless of these differences, the COPY statement always has two phases:

  • Phase I (initiator) loads and parses the file and distributes the parsed data to the other nodes.
  • Phase II (executor) processes the data on all the nodes.

COPY uses a number of Execution Engine operators for bulk loading. The operators involved in loading one or more files include Load, Parse, Load Union, Segment, Sort/Merge, and DataTarget.

The COPY statement creates segments for each projection if the target table is segmented. Segmentation defines how the data is spread among the cluster nodes for query performance and fast data purges.
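
For illustration, segmentation is defined when a table or projection is created. A hypothetical segmented table might be declared as follows:

=> CREATE TABLE store_sales (
     sale_id INT,
     store_id INT,
     amount NUMERIC(10,2))
   SEGMENTED BY HASH(sale_id) ALL NODES;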

How the COPY Statement Loads Data

The COPY statement workflow for loading one or more files occurs in two phases:

Phase I

  1. The Load operator reads the source file into the database, and the Parse operator parses the loaded data.
  2. The Load Union operator merges the parsed data into one container before the data is segmented. This operator is active when loading multiple files and inactive when loading a single file.
  3. The Segment operator segments the parsed data for one or more projections, depending on the size of the data. Segmentation spreads the data evenly across the database nodes so that all nodes participate in executing queries; table partitioning additionally divides the data within each node.

Phase II

  1. The Sort operator sorts the segmented data for each projection, and the Merge operator merges the sorted data. The Sort and Merge operators work on the aggregated data.
  2. The DataTarget operator writes the data to disk.

The following figure shows the workflow of loading one or more files in two phases. The light-blue and dark-blue boxes represent Execution Engine operators.

[Figure: Loading One or More Files]

In Vertica 8.0 with apportioned load, if all the nodes have access to the source data, Phase I occurs on several nodes. An apportioned load is a divisible load: a single data file can be loaded on more than one node. The built-in delimited parser can apportion a load automatically if the source is available on multiple nodes and can be divided. If apportioned load is not available, Phase I occurs only on the nodes that read the file.

Phase II uses additional Execution Engine operators with pre-join projections and live aggregate projections. The following figure Pre-Join Projections shows the additional Execution Engine operators, JOIN and SCAN, for a dimension table. The following figure Live Aggregate Projections shows the additional GROUP BY/Top-K Execution Engine operator.

[Figure: Pre-Join Projections]

Pre-join projections add additional Execution Engine operators JOIN and SCAN for the dimension table.

[Figure: Live Aggregate Projections]

Live aggregate projections add GROUP BY/Top-K Execution Engine operators.

Load Methods

Depending on the data you are loading, the COPY statement offers three load methods:

  • COPY AUTO
  • COPY DIRECT
  • COPY TRICKLE

When to Use COPY AUTO

Note: This is applicable only to Vertica versions 10.0 and earlier.

COPY uses the AUTO method to load data into the WOS (Write Optimized Store). Use this default AUTO load method for smaller bulk loads. The AUTO option is most useful when you cannot determine the size of the file. Once the WOS is full, COPY continues loading directly into ROS (Read Optimized Store) containers on disk. ROS data is sorted and encoded.

When to Use COPY DIRECT

Note: This is applicable only to Vertica versions 10.0 and earlier. In Vertica 10.0 and later, DIRECT is the default load method.

COPY uses the DIRECT method to load data directly into ROS containers. Use the DIRECT load method for large bulk loads (100 MB or more). The DIRECT method improves performance for large files by avoiding the WOS and loading data into ROS containers. Using DIRECT to load many smaller data sets results in many ROS containers, which have to be combined later.
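
For example (hypothetical table and file path), a large bulk load can specify DIRECT explicitly:

=> COPY sales FROM '/data/sales.csv' DELIMITER ',' DIRECT;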

When to Use COPY TRICKLE

Note: This is applicable only to Vertica versions 10.0 and earlier.

COPY uses the TRICKLE method to load data directly into WOS. Use the TRICKLE load method to load data incrementally after you complete your initial bulk load. If the WOS becomes full, an error occurs and the entire data load is rolled back. Use this method only when you have a finely tuned load and moveout process at your site, and you are confident that the WOS can hold the data you are loading. This option is more efficient than AUTO when loading data into partitioned tables.

Loading Data into Vertica from External Sources

In addition to the local file system, Vertica provides several ways to load data from external sources such as HDFS, AWS S3, and Google Cloud Storage (GCS).

Copy from S3 (AWS)

To copy data from S3, you need access to the S3 bucket. There are several ways to authenticate to an S3 bucket, but first you need to create a target table.

=> CREATE TABLE Customers
  (FirstName varchar (50), 
   LastName varchar (50), 
   Company varchar (50));

Next, you need to set authentication parameters using one of the following methods.

Method 1: Using Session Tokens

AWS provides temporary security credentials that grant access to AWS resources. These credentials are short-lived, with a lifetime between 15 minutes and 36 hours. You can set the validity of the token by passing --duration-seconds to the aws sts get-session-token command. The command returns a session token from AWS in a response similar to the following:

{
    "Credentials": {
        "AccessKeyId": "ASIAXA3DIEK2DTGRZJVE",
        "SecretAccessKey": "pVVi8HSC5B4foA9XeCQyf9SR6VNLxVPQ/sX3bfDT",
        "SessionToken": "IQoJb3JpZ2luX2VjEHMaCXVzLWVhc3QtMSJHMEUCIFdMMX9+CpFS6mJDKxY8AnsC32hq1PRps3prBikB5UI0AiEA8nQTUYu1CdjiVQD2A74ARU1eJxsx4t5etvsIMlzW2ycq9AEIzP//////////ARAAGgw0ODI4NTUxMDExMDgiDGMVTf5EsfVhh3TPRyrIAeD+BQc76DwCu74cLu2YCETqemJBBQAFIATkH8LGsaLjXSEpDJiU2NSXU7kr6y8v1ijBfADgdsK0a4JaCKCswwPN6GOv86nW47exxe0RG/yUGxeHE9Hu6kzbiLq7cAOBixkz8e57qxhN8smN7Hg+HHtgvqMlzd1nWG89I7pmZEqqRHpYqz2DMuzb6lkJenB7zZDb1V9icNiQkVtYC+YXcvLs5zyd/9m6KXr5Lq9Zi2XGKcz+V756W79J9tk8J6pHVCVAgS2q+tdvMOCP1IMGOpgB9B2V7szJkOL+fTox4l7hQCZvqCljXjZIRATX62ff7OdSFe8zGfthyxNV67pb0iUVYGJm/ji/cRnF9jZ8KYk+StxlW81/lY5QFtdJpsYENLWg0+Zk/t+oxEJj/KnvJbj4bBDj373w3s7vuTg+01H5xtN1fz2jN8Hlu80N3ZNJG4eOBfodIvCRhjPsmFsKDKRx/MvLUH6r+1k=",
        "Expiration": "2021-04-13T14:54:24+00:00"
    }
}

After you obtain the session token, set the AWS S3 parameters at the session level. When using session tokens, you must set all AWS parameters at the session level.

ALTER SESSION SET AWSAuth ='<AccessKeyId:SecretAccessKey>';
ALTER SESSION SET AWSSessionToken ='<SessionToken>';
ALTER SESSION SET AWSRegion='us-east-1';
ALTER SESSION SET AWSEndPoint='s3.us-east-1.amazonaws.com';

Execute the following COPY command to load data from the S3 bucket to the table you created earlier.

 dbadmin=> copy Customers from 's3://s3-test-bucket/customers.csv' delimiter ',' rejected data as table customers_rejected;
 Rows Loaded
-------------
           7
(1 row)

Method 2: Using Session or Database Level Parameters

Use an access key ID and a secret access key to connect to your S3 bucket.

To set the parameters at the session level, use the following commands:

ALTER SESSION SET AWSAuth ='<AccessKeyId:SecretAccessKey>';
ALTER SESSION SET AWSRegion='us-east-1';
ALTER SESSION SET AWSEndPoint='s3.us-east-1.amazonaws.com';

To set the parameters at the database level, use the following commands:

ALTER DATABASE DEFAULT SET AWSAuth ='<AccessKeyId:SecretAccessKey>';
ALTER DATABASE DEFAULT SET AWSRegion='us-east-1';
ALTER DATABASE DEFAULT SET AWSEndPoint='s3.us-east-1.amazonaws.com';

Then, use the following command to load data from the S3 bucket into the table:

 dbadmin=> copy Customers from 's3://s3-test-bucket/customers.csv' delimiter ',' rejected data as table customers_rejected;
 Rows Loaded
-------------
           7
(1 row)

The following parameters are used when connecting to an S3 bucket. If you plan to use SSL, you must set additional parameters; for more information, see the Vertica documentation.

  • AWSSessionToken: A temporary security token generated by running the get-session-token command.
  • AWSAuth: An ID and secret key for authentication.
  • AWSEndpoint: The endpoint to use when interpreting S3 URLs. The endpoint varies based on the region you are connecting to. For more information, see Amazon Simple Storage Service endpoints.
  • AWSRegion: The AWS region containing the S3 bucket from which to read files. If you do not set the correct region, you might experience a delay before queries fail, because Vertica retries several times before giving up.

Copy from GCS

To copy data from Google Cloud Storage (GCS) to Vertica, you must set a default project in your GCS account. The default project is used for GET and PUT operations against the bucket. Use hash-based message authentication code (HMAC) keys to connect to cloud storage from Vertica. The access ID and secret can be set at either the database or session level.

ALTER SESSION SET GCSAuth= 'ID:secret';
Copy Customers from 'gs://gcsvertica/customers.csv' delimiter ',';
 Rows Loaded
-------------
           7
(1 row)
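
To set the credentials at the database level instead, the same parameter can be set with ALTER DATABASE, following the pattern used for S3 above:

ALTER DATABASE DEFAULT SET GCSAuth = 'ID:secret';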

The following parameters are used to connect to a GCS bucket.

  • GCSAuth: An ID and secret key for authentication.
  • GCSEndpoint: The connection endpoint address. Default: storage.googleapis.com
  • GCSEnableHttps: Specifies whether to use the HTTPS protocol when connecting to GCS.

Copy from HDFS

Vertica supports copying data from the Hadoop Distributed File System (HDFS). Vertica needs the Hadoop configuration files to connect and copy data. Copy hdfs-site.xml and core-site.xml to the same directory path on all Vertica nodes, and then set the HadoopConfDir parameter to that directory path.

ALTER DATABASE DEFAULT SET HadoopConfDir = '/etc/hadoop/conf';

To verify that the configuration files are visible on all nodes, the database administrator should run the HDFS_CLUSTER_CONFIG_CHECK meta-function:
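
=> SELECT HDFS_CLUSTER_CONFIG_CHECK();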

                HDFS_CLUSTER_CONFIG_CHECK
---------------------------------------------------------------------------------------------------------------------------------
-- hdfs_cluster_config_check --

        Hadoop Conf Path : [/etc/hadoop/conf]
  [OK] HadoopConfDir verified on all nodes
        Connection Timeout (seconds) : [60]
        Token Refresh Frequency (seconds) : [0]
        HadoopFSBlockSizeBytes (MB) : [64]

  [OK] Found [1] hadoop cluster configurations


------------- Cluster 1 -------------
        Is DefaultFS : [true]
        Nameservice : []
        Namenodes : [rhel7c.verticacorp:8020]
        High Availability : [false]
        RPC Encryption : [false]
        Kerberos Authentication : [false]
        HTTPS Only : [false]
  [INFO] Checking connections to [hdfs:///]
        vertica : [OK]
        dbadmin : [OK]

  [INFO] Checking connections to [http://rhel7c.verticacorp:9870]
  [OK] Can make authenticated external curl connection
  [INFO] Checking webhdfs
        vertica : [OK]
        dbadmin : [OK]
  [!] hdfs_cluster_config_check : [PASS]

(1 row)

Once the verification succeeds, you can copy the data by specifying the file path with the hdfs:/// URL prefix. Vertica gets the NameNode information from the Hadoop configuration files.

Copy Customers from 'hdfs:///tmp/customers.csv' delimiter ',';
 Rows Loaded
-------------
           7
(1 row)

Vertica System Tables That Monitor Data Loads

Vertica provides system tables that allow you to monitor your database loads:

  • LOAD_STREAMS: Monitors active and historical load metrics for load streams on each node and provides statistics about loaded and rejected records.
  • DC_LOAD_EVENTS: Stores information about important system events during load parsing:
    • Batchbegin
    • Sourcebegin
    • Parsebegin
    • Parsedone
    • Sourcedone
    • Batchdone
  • UDFS_EVENTS: Stores information about events involving file systems such as S3 and GCS, for example during COPY from AWS S3 or GCS.
  • UDFS_OPS_PER_HOUR: Summarizes statistics about S3 file system events for each hour.
  • UDFS_OPS_PER_MINUTE: Summarizes statistics about S3 and GCS file system events for each minute.
  • UDFS_STATISTICS: Summarizes statistics about file-system operations. It provides information about the Linux, S3, and WebHDFS file systems, and information about metadata (but not data) for the Libhdfs++ file system.
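
As a minimal sketch, you can query LOAD_STREAMS to watch loads in progress, for example:

=> SELECT table_name, accepted_row_count, rejected_row_count, parse_complete_percent
   FROM load_streams
   WHERE is_executing;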

Tuning the Data Load

Resource pool parameters and configuration parameters affect the performance of data loads.

Resource Pool Parameters

The following parameters specify characteristics of resource pools that help the database administrator manage resources for loading data.

  • PLANNEDCONCURRENCY: The preferred number of concurrently executing queries in the resource pool. This value determines the amount of memory allocated per COPY command.
  • MAXCONCURRENCY: The maximum number of concurrent execution slots available to the resource pool, which limits the number of concurrent COPY jobs.
  • EXECUTIONPARALLELISM: Limits the number of threads used to process any single query issued in this resource pool and assigned to the load. Vertica sets this value based on the number of cores, available memory, and amount of data in the system. Unless memory is limited or the amount of data is very small, Vertica sets this value to the number of cores on the node.
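
To see the current values of these parameters for a given pool, you can query the RESOURCE_POOLS system table, for example:

=> SELECT name, plannedconcurrency, maxconcurrency, executionparallelism
   FROM resource_pools
   WHERE name = 'general';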

Query Budget

The query_budget_kb column in the RESOURCE_POOL_STATUS system table displays the target memory for queries executed on the associated pool.

To check query_budget_kb, use the following command:

=> SELECT pool_name, query_budget_kb FROM resource_pool_status;

Before you modify the query budget, be aware of the following memory considerations:

  • If MEMORYSIZE > 0 and MAXMEMORYSIZE is empty or equal to MEMORYSIZE, then query budget = MEMORYSIZE / PLANNEDCONCURRENCY
  • If MEMORYSIZE = 0 and MAXMEMORYSIZE > 0, then query budget = (MAXMEMORYSIZE * 0.95) / PLANNEDCONCURRENCY
  • If MEMORYSIZE = 0 and MAXMEMORYSIZE is empty, then query budget = [(GENERAL pool memory * 0.95) - sum(MEMORYSIZE of other pools)] / PLANNEDCONCURRENCY
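
For example, with hypothetical settings of MEMORYSIZE = 0, MAXMEMORYSIZE = 40 GB, and PLANNEDCONCURRENCY = 10, the second formula gives a query budget of roughly (40 GB * 0.95) / 10, or about 3.8 GB per query.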

How to Change Resource Pool Parameters

To change resource pool parameters, use the following command:

=> ALTER RESOURCE POOL <pool_name> <parameter> <new_value>;
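
For example, to raise the planned concurrency of a hypothetical load pool:

=> ALTER RESOURCE POOL load_pool PLANNEDCONCURRENCY 8;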

Data Loading Configuration Parameters

The following configuration parameters can help you improve the performance of data loads.

  • EnableCooperativeParse: Enables multi-threaded cooperative parsing on a node. You can use this parameter for both delimited and fixed-width loads. Cooperative parse parallelization is local to the node of the source data.
  • SortWorkerThreads: Controls the number of sort worker threads; when set to 0, background sort threads are disabled. This parameter improves load performance when the bottleneck is in the parse or sort phase of the load.
  • ReuseDataConnections: Attempts to reuse TCP connections between query executions.
  • DataBufferDepth: Governs the number of buffers to allocate for data connections.
  • CompressNetworkData: Compresses the data traffic and reduces the required bandwidth.
  • EnableApportionLoad: Enables apportioned load for apportionable sources and parsers, splitting the data into appropriate portions. In Vertica 8.0, apportioned load works with the FilePortionSource source function.
  • MultiLevelNetworkRoutingFactor: Defines the network routing for large clusters and adjusts the count reduction factor.
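
These parameters are set with the same ALTER DATABASE (or ALTER SESSION) syntax used earlier for the AWS parameters. For example, assuming 1 is the enabling value for cooperative parse:

=> ALTER DATABASE DEFAULT SET EnableCooperativeParse = 1;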

Troubleshooting Load Scenarios

If you cannot resolve an issue when loading data into your database, contact Vertica Support.

Loading Large Files

Use Case: Large files require a long time to load.

Recommendation: Make the load parallel on each node in one of two ways:

  • Use the EnableApportionLoad parameter to parallelize the work load between different nodes in the cluster. For an apportioned load, share the files between the nodes loading the file, and install and use the FilePortionSource source UDx, as in the following statement:
=> COPY copy_test.store_sales_fact with source FilePortionSource(file='/data/test_copy/source_data5/Store_Sales_Fact.tbl', nodes='v_vdb_node0001,v_vdb_node0002,v_vdb_node0003') direct;
  • Split and stage the files in the NFS mount point so that all the nodes have access to the files on any node.

Both options have similar loading performance. However, the second option requires you to manually split the files.

Loading Multiple Small Files to the Same Target Table

Use Case: Using COPY DIRECT to load multiple small files degrades performance. Multiple statements with small files generate multiple ROS containers. A large number of ROS containers affects the performance of Vertica and requires additional work for the Tuple Mover after the load completes.

Recommendation: Consider the following options for loading multiple small files with the COPY statement:

  • Reduce the number of COPY statements by combining files into fewer loads. Fewer COPY statements mean fewer transactions, each loading more data.
  • Use Linux pipes to combine the files being loaded.
  • Combine multiple files in the same COPY statement, as shown in the example below; a single statement that loads several files performs better than several single-file statements.
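
For example (hypothetical paths), several small files can be loaded in a single statement by listing multiple sources:

=> COPY Customers FROM '/data/customers1.csv', '/data/customers2.csv', '/data/customers3.csv' DELIMITER ',' DIRECT;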

Loading Wide Tables

Use Case: Wide tables with large VARCHAR columns become the bottleneck of the workflow in Phase II of the COPY command.

Recommendation: Consider the following options for loading wide tables:

  • Change the LoadMergeChunkSizeK parameter as an exception for specific loads.
  • Use flex tables for wide tables and for multiple small tables. Loading wide tables into flex tables requires loading one field instead of many fields. Thus, it reduces the size of the catalog and improves overall database performance. The initial load is very fast, with data available to the users quickly. However, query performance is lower in comparison to columnar storage.
  • Use the GROUPED clause to load correlated columns in wide tables; see the sketch after this list. The GROUPED clause groups two or more columns into a single disk file. Two columns are correlated if the value of one column is related to the value of the other column.
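
As a sketch with a hypothetical trades table (columns stock, bid, and ask, where stock and bid are correlated), the GROUPED clause appears in the projection definition:

=> CREATE PROJECTION trades_p (
     stock ENCODING RLE,
     GROUPED(bid ENCODING DELTAVAL, ask))
   AS SELECT stock, bid, ask FROM trades
   ORDER BY stock, bid;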

You cannot resolve this issue by adding more resources, splitting the files, or parallelizing the work between the nodes. Contact Vertica Support and adjust the configuration parameters under their guidance.

Executor Nodes for Load

Use Case: An executor node is reserved for computation only and does not contain any data segments, so a dedicated node can devote all of its CPU to the load. Two use cases relate to this issue:

  • Large files must be loaded within 24 hours.
  • The parse operation blocks the workflow.

In such use cases, the following conditions usually exist:

  • Wide tables have a high number of kilobytes per row.
  • Files are compressed with GZIP to reduce network transfer time. When the files are placed in Vertica local storage, the COPY command uncompresses the data, which increases CPU usage.
  • Tables with wide segmentation keys use more CPU.

Recommendation: Use executor nodes only after consulting Vertica Support. Proper resource pool configuration results in appropriate usage of resources on executor and data nodes; because resource pool parameters apply across the cluster, Vertica Support can provide parameters for an efficient model. Instead of using executor nodes for loading, use servers with more CPU and assign exclusive CPUs to the load resource pool, thus limiting the resources dedicated to load operations.