Using the HDFS Connector

Deprecated: The HDFS Connector has been deprecated and will be removed in a future release. Use the hdfs URL scheme instead. See Reading Directly from HDFS.
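For example, a load that previously went through the connector can usually be rewritten to read the file directly through the hdfs scheme. The table name and path in this sketch are placeholders:

   => COPY sampleTable FROM 'hdfs:///user/hadoopUser/data/sample.csv' DELIMITER ',';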

The Hadoop Distributed File System (HDFS) is where Hadoop usually stores its input and output files. HDFS stores files redundantly across the Hadoop cluster, keeping them available even if some nodes are down. It also makes Hadoop more efficient by spreading file access across the cluster, which helps limit I/O bottlenecks.

The HDFS Connector lets you load files from HDFS into Vertica using the COPY statement. You can also create external tables that access data stored on HDFS as if it were in a native Vertica table. The connector is useful if your Hadoop job does not directly store its data in Vertica, or if you want to use User-Defined Extensions (UDxs) to load data stored in HDFS.
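The following sketches show both usage patterns. They assume a NameNode reachable over WebHDFS at hadoopNameNode:50070, a Hadoop account named hadoopUser, and placeholder table, column, and path names:

   => COPY sampleTable SOURCE Hdfs(url='http://hadoopNameNode:50070/webhdfs/v1/user/hadoopUser/data/*',
                                   username='hadoopUser');

   => CREATE EXTERNAL TABLE sampleExt (id INT, name VARCHAR(64))
      AS COPY SOURCE Hdfs(url='http://hadoopNameNode:50070/webhdfs/v1/user/hadoopUser/data/*',
                          username='hadoopUser');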

Note: The files you load from HDFS using the HDFS Connector usually have a delimited format, in which column values are separated by a character such as a comma or a pipe (|). This is the same format as other files you load with the COPY statement. Hadoop MapReduce jobs often output tab-delimited files.
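For example, loading tab-delimited MapReduce output might look like the following sketch, again with placeholder table and path names:

   => COPY mapReduceResults SOURCE Hdfs(url='http://hadoopNameNode:50070/webhdfs/v1/user/hadoopUser/output/part-*',
                                        username='hadoopUser')
      DELIMITER E'\t';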

The HDFS Connector takes advantage of the distributed nature of both Vertica and Hadoop. Individual nodes in the Vertica cluster connect directly to nodes in the Hadoop cluster when you load multiple files from HDFS.

Hadoop splits large files into file segments that it stores on different nodes. The connector retrieves these file segments directly from the nodes storing them, rather than relying on the Hadoop cluster to reassemble the file.

The connector is read-only; it cannot write data to HDFS.

The HDFS Connector can connect to a Hadoop cluster through either unauthenticated or Kerberos-authenticated connections.

In This Section