Integrating with Apache Hadoop

Apache™ Hadoop™, like Vertica, uses a cluster of nodes for distributed processing. The primary component of interest is HDFS, the Hadoop Distributed File System.

You can use Vertica with HDFS in several ways; see Hadoop Interfaces for more information about these options.
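
For example, one common option is to read HDFS data in place by defining an external table over it. The following is a minimal sketch, assuming a hypothetical table name, column list, and data location:

=> CREATE EXTERNAL TABLE sales (id INT, amount FLOAT)
   AS COPY FROM 'hdfs:///data/sales/*.orc' ORC;
=> SELECT COUNT(*) FROM sales;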

A Hadoop cluster can use Kerberos authentication to protect data stored in HDFS. If yours does, Vertica can authenticate with Kerberos to access that data. See Using Kerberos with Hadoop.
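
As an illustration, Kerberos settings are supplied through database configuration parameters. A minimal sketch, assuming the realm and keytab path shown are placeholders for your own environment (see Using Kerberos with Hadoop for the full procedure):

=> SELECT SET_CONFIG_PARAMETER('KerberosServiceName', 'vertica');
=> SELECT SET_CONFIG_PARAMETER('KerberosRealm', 'EXAMPLE.COM');
=> SELECT SET_CONFIG_PARAMETER('KerberosKeytabFile', '/etc/krb5.keytab');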

Hadoop Distributions

Vertica can be used with Hadoop distributions from Hortonworks, Cloudera, and MapR. See Vertica Integrations for Hadoop for the specific versions that are supported.

If you are using Cloudera, you can manage your Vertica cluster using Cloudera Manager. See Integrating With Cloudera Manager.

If you are using MapR, see Integrating Vertica with the MapR Distribution of Hadoop.

Cluster Architecture

Vertica supports two cluster architectures. The architecture you use affects the decisions you make about integration, and your choice might also be limited by your license terms.

These layout options are described in Cluster Layout.

File Paths

Hadoop file paths are expressed as URLs, using the hdfs or webhdfs URL scheme. To escape a special character in a path, use URL (percent) encoding: every character other than a-z, A-Z, 0-9, '-', '.', '_', and '~' must be encoded as %NN, where NN is the character's two-digit hexadecimal value. The following example URL-encodes a file name that contains a space:

hdfs:///opt/data/my%20file.orc
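
The same file can also be addressed with the webhdfs scheme; a sketch, assuming a hypothetical NameNode host and the classic WebHDFS port:

webhdfs://namenode.example.com:50070/opt/data/my%20file.orc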

You can use globs (wildcard patterns such as * and character ranges such as [0-9]) in file paths. When Hive writes data, it sometimes creates temporary files with a "_COPYING" suffix. If you try to read these files into Vertica, the load fails with an error, because they are not complete files in a valid format. The following example (using a placeholder table name and an elided column list) copies only files ending in digits, the usual format for exports from Hive:

=> CREATE EXTERNAL TABLE t (...) AS COPY FROM 'hdfs:///data/parquet/*_[0-9]' PARQUET;
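
Once created, the external table can be queried like any other table. For example, using the placeholder table name from above:

=> SELECT COUNT(*) FROM t;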
