Integrating with Apache Hadoop
Apache™ Hadoop™, like Vertica, uses a cluster of nodes for distributed processing. The primary component of interest is HDFS, the Hadoop Distributed File System.
You can use Vertica with HDFS in several ways:
- You can import HDFS data into locally stored ROS files.
- You can access HDFS data in place using external tables. You can define the tables yourself or get schema information from HCatalog, a Hadoop component.
- You can use HDFS as a storage location for ROS files.
- You can export data from Vertica to share with other Hadoop components using a Hadoop columnar format.
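As an illustration of the external-table option above, the following is a minimal sketch; the table name, columns, path, and file format are hypothetical and would depend on your data:

```sql
-- Hypothetical example: define an external table over ORC files in HDFS,
-- then query it in place without importing the data.
=> CREATE EXTERNAL TABLE sales (id INT, amount FLOAT)
     AS COPY FROM 'hdfs:///data/sales/*.orc' ORC;
=> SELECT COUNT(*) FROM sales;
```

The data stays in HDFS; Vertica reads it at query time.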
See Hadoop Interfaces for more information about these options.
A Hadoop cluster can use Kerberos authentication to protect data stored in HDFS. If yours does, Vertica can authenticate through Kerberos to access that data. See Using Kerberos with Hadoop.
Hadoop Distributions
Vertica can be used with Hadoop distributions from Hortonworks, Cloudera, and MapR. See Vertica Integrations for Hadoop for the specific versions that are supported.
If you are using Cloudera, you can manage your Vertica cluster using Cloudera Manager. See Integrating With Cloudera Manager.
If you are using MapR, see Integrating Vertica with the MapR Distribution of Hadoop.
Cluster Architecture
Vertica supports two cluster architectures. The architecture you choose affects your integration decisions, and your license terms might limit your options.
- You can co-locate Vertica on some or all of your Hadoop nodes. Vertica can then take advantage of data locality.
- You can build a Vertica cluster that is separate from your Hadoop cluster. In this configuration, Vertica can fully use each of its nodes; it does not share resources with Hadoop.
These layout options are described in Cluster Layout.
File Paths
Hadoop file paths are expressed as URLs in the hdfs or webhdfs URL scheme. If you need to escape a special character in a path, use URL encoding: every character other than a-z, A-Z, 0-9, '-', '.', '_', and '~' must be percent-encoded as %NN, where NN is a two-digit hexadecimal number. The following example URL-encodes a file name containing a space:
hdfs:///opt/data/my%20file.orc
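The encoded path can then be used anywhere a path is accepted. In this hypothetical sketch, the table name and file format are placeholders:

```sql
-- Hypothetical example: load an ORC file whose name contains a space,
-- using the percent-encoded form of the path.
=> COPY t FROM 'hdfs:///opt/data/my%20file.orc' ORC;
```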
You can use globs, including regular expressions, in file paths. When Hive writes data, it sometimes creates temporary files with a "_COPYING" suffix. If you try to read these files into Vertica, you get an error because they are not in a valid format. The following example copies only files whose names end in digits, the usual naming format for exports from Hive:
=> CREATE EXTERNAL TABLE t (...) AS COPY FROM 'hdfs:///data/parquet/*_[0-9]' PARQUET;