Integrating with Apache Hadoop

Apache™ Hadoop™, like Vertica, uses a cluster of nodes for distributed processing. The Hadoop component of primary interest for integration is HDFS, the Hadoop Distributed File System.

You can use Vertica with HDFS in several ways, as illustrated by the examples following this list:

  • You can import HDFS data into locally-stored ROS files.
  • You can access HDFS data in place using external tables. You can define the tables yourself or get schema information from Hive, a Hadoop component.
  • You can use HDFS as a storage location for ROS files.
  • You can export data from Vertica to share with other Hadoop components using a Hadoop columnar format. See Exporting Data to Files for more information.

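The following statements are a minimal sketch of these four approaches. The table names (sales, sales_ext), columns, HDFS paths, and the name service hadoopNS are placeholders, not values from this documentation; adjust them for your environment.

    -- 1. Import HDFS data into Vertica-managed (ROS) storage:
    COPY sales FROM 'hdfs:///data/sales/*.csv' DELIMITER ',';

    -- 2. Query HDFS data in place through an external table (ORC files in this sketch):
    CREATE EXTERNAL TABLE sales_ext (order_id INT, amount NUMERIC(10,2))
        AS COPY FROM 'hdfs:///data/sales_orc/*' ORC;

    -- 3. Use HDFS as a storage location for ROS files:
    CREATE LOCATION 'webhdfs://hadoopNS/vertica/colddata' ALL NODES SHARED
        USAGE 'data' LABEL 'coldstorage';

    -- 4. Export data in a Hadoop columnar format for use by other Hadoop components:
    EXPORT TO PARQUET (directory = 'hdfs:///data/sales_export')
        AS SELECT * FROM sales;
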
Hadoop file paths are expressed as URLs in the webhdfs or hdfs URL scheme. For more about using these schemes, see HDFS File System.
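As a rough illustration of the two schemes (the table t, the path, and the name service hadoopNS are placeholder values):

    -- webhdfs scheme with an explicit name service:
    COPY t FROM 'webhdfs://hadoopNS/opt/data/file1.txt' DELIMITER '|';

    -- hdfs scheme; the name service is resolved from the HDFS configuration files:
    COPY t FROM 'hdfs:///opt/data/file1.txt' DELIMITER '|';
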

Hadoop Distributions

Vertica can be used with Hadoop distributions from Hortonworks, Cloudera, and MapR. See Vertica Integrations for Hadoop for the specific versions that are supported.

If you are using Cloudera, you can manage your Vertica cluster using Cloudera Manager. See Integrating With Cloudera Manager.

If you are using MapR, see Integrating Vertica with the MapR Distribution of Hadoop.

WebHDFS Requirement

By default, if you use a URL in the hdfs scheme, Vertica uses the (deprecated) Libhdfs++ library instead of WebHDFS. However, it falls back to WebHDFS for features not available in Libhdfs++, such as encryption zones, wire encryption, or writes. Even if you always use URLs in the hdfs scheme to choose Libhdfs++, you must still have a WebHDFS service available to handle these fallback cases. In addition, for some uses, such as Eon Mode communal storage, you must use WebHDFS directly with the webhdfs scheme.

Support for Libhdfs++ is deprecated. In the future, URLs in the hdfs scheme will be automatically converted to the webhdfs scheme. To make this change in your database, set the HDFSUseWebHDFS configuration parameter to 1 (enabled).
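As a sketch, run with superuser privileges, either of the following forms sets the parameter at the database level; HDFSUseWebHDFS is named above, and the surrounding syntax is standard Vertica configuration syntax (which form you use may depend on your Vertica version):

    -- Enable database-wide so hdfs URLs are handled through WebHDFS:
    ALTER DATABASE DEFAULT SET PARAMETER HDFSUseWebHDFS = 1;

    -- Equivalent setting using the configuration function:
    SELECT SET_CONFIG_PARAMETER('HDFSUseWebHDFS', 1);
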

In This Section