Loading Data Using the HDFS Connector
You can use the HDFS User Defined Source (UDS) in a COPY statement to load data from HDFS files.
The syntax for using the HDFS UDS in a COPY statement is:
COPY tableName SOURCE Hdfs(url='WebHDFSFileURL', [username='username'], [low_speed_limit=speed]);
tableName
The name of the table to receive the copied data.

WebHDFSFileURL
A string containing one or more URLs that identify the file or files to be read. Use commas to separate multiple URLs. If a URL contains special characters, you must escape them using standard URL percent encoding. See The HDFS File URL below for details.

username
The username of a Hadoop user that has permissions to access the files you want to copy. If you are using Kerberos, omit this argument.

speed
The minimum data transmission rate, expressed in bytes per second, that the connector allows. The connector breaks any connection between the Hadoop and Vertica clusters that transmits data slower than this rate for more than one minute. After the connector breaks a connection for being too slow, it attempts to connect to another node in the Hadoop cluster that can supply the data the broken connection was retrieving. If the connector cannot retrieve the data at the minimum rate from any node, it terminates the COPY statement and returns an error message.
Default value: 1048576 (a transmission rate of 1MB per second)
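The following sketch shows these parameters together (the table name sales, the file path, and the host are assumptions for illustration only). It lowers the minimum acceptable transfer rate from the default 1MB per second to 512KB per second:

=> COPY sales SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/data/sales.txt',
->                        username='hadoopUser',
->                        low_speed_limit=524288);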
The HDFS File URL
The url parameter in the Hdfs function call is a string containing one or more comma-separated HTTP URLs. These URLs identify the files in HDFS that you want to load. The format for each URL in this string is:
http://NameNode:port/webhdfs/v1/HDFSFilePath
NameNode
The host name or IP address of the Hadoop cluster's name node.

port
The port number on which the WebHDFS service is running. This number is usually 50070 or 14000, but may be different in your Hadoop installation.

webhdfs/v1/
The protocol being used to retrieve the file. This portion of the URL is always the same. It tells Hadoop to use version 1 of the WebHDFS API.

HDFSFilePath
The path from the root of the HDFS filesystem to the file or files you want to load. This path can contain standard Linux wildcards.
Important: Any wildcards you use to specify multiple input files must resolve to files only. They must not match any directories. For example, a wildcard that matches a subdirectory causes the load to fail.
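Putting these pieces together, and assuming a name node host named hadoop listening on the default WebHDFS port of 50070 (both illustrative), a URL that loads every file ending in .txt in a hypothetical /user/hadoopUser/data directory would look like this:

http://hadoop:50070/webhdfs/v1/user/hadoopUser/data/*.txt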
The following example shows how to use the Vertica Connector for HDFS to load a single file named /tmp/test.txt. The Hadoop cluster's name node is named hadoop.

=> COPY testTable SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/tmp/test.txt', username='hadoopUser');
 Rows Loaded
-------------
           2
(1 row)
Copying Files in Parallel
The basic COPY statement in the previous example copies a single file. It runs on just a single host in the Vertica cluster because the Connector cannot break up the workload among nodes. Any data load that does not take advantage of all nodes in the Vertica cluster is inefficient.
To make loading data from HDFS more efficient, spread the data across multiple files on HDFS. This approach is often natural for data you want to load from HDFS. Hadoop MapReduce jobs usually store their output in multiple files.
You specify multiple files to be loaded in your Hdfs function call by:
- Using wildcards in the URL
- Supplying multiple comma-separated URLs in the url parameter of the Hdfs user-defined source function call
- Supplying multiple comma-separated URLs that contain wildcards
Loading multiple files through the Vertica Connector for HDFS results in an efficient load. The Vertica hosts connect directly to individual nodes in the Hadoop cluster to retrieve files. If Hadoop has broken files into multiple chunks, the Vertica hosts connect directly to the nodes storing each chunk.
The following example shows how to load all of the files whose filenames start with "part-" located in the /user/hadoopUser/output directory on HDFS. If there are at least as many files in this directory as there are nodes in the Vertica cluster, all nodes in the cluster load data from HDFS.

=> COPY Customers SOURCE
-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/hadoopUser/output/part-*',
->      username='hadoopUser');
 Rows Loaded
-------------
       40008
(1 row)
To load data from multiple directories on HDFS at once, use multiple comma-separated URLs in the URL string:

=> COPY Customers SOURCE
-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/HadoopUser/output/part-*, http://hadoop:50070/webhdfs/v1/user/AnotherUser/part-*',
->      username='hadoopUser');
 Rows Loaded
-------------
       80016
(1 row)
Note: Vertica statements must be less than 65,000 characters long. If you supply too many long URLs in a single statement, you could go over this limit. Normally, you would only approach this limit if you are automatically generating the COPY statement using a program or script.
Viewing Rejected Rows and Exceptions
COPY statements that use the Vertica Connector for HDFS use the same method for recording rejections and exceptions as other COPY statements. Rejected rows and exceptions are saved to log files. These log files are stored by default in the CopyErrorLogs subdirectory in the database's catalog directory. Due to the distributed nature of the Vertica Connector for HDFS, you cannot use the ON option to force all exception and rejected row information to be written to log files on a single Vertica host. Instead, you need to collect the log files from across the hosts to review all of the exceptions and rejections generated by the COPY statement.
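If you want the per-host files in a predictable location rather than the default CopyErrorLogs directory, one possible approach (a sketch only; the table name and file paths shown are assumptions) is to name explicit paths with COPY's EXCEPTIONS and REJECTED DATA parameters. Each Vertica host that participates in the load still writes its own files, so you still gather them from every host:

=> COPY testTable SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/tmp/test.txt',
->                            username='hadoopUser')
->      EXCEPTIONS '/tmp/test_load.exceptions'
->      REJECTED DATA '/tmp/test_load.rejections';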
For more about handling rejected rows, see Capturing Load Rejections and Exceptions.