Configuring the hdfs Scheme

Vertica uses information from the Hadoop cluster configuration to support the hdfs scheme. Vertica nodes must therefore have access to certain Hadoop configuration files. To use a cluster with High Availability (HA) Name Nodes or to read from more than one Hadoop cluster, you must perform additional configuration.

For both co-located and separate clusters that use Kerberos authentication, configure Vertica for Kerberos as explained in Configure Vertica for Kerberos Authentication.

The hdfs scheme still requires access to the WebHDFS service. In some special cases, Vertica cannot use the hdfs scheme and falls back to webhdfs.

Accessing Hadoop Configuration Files

To support the hdfs scheme, your Vertica nodes need access to certain Hadoop configuration files:

If Vertica is co-located on HDFS nodes, then those configuration files are already present.
If Vertica is running on a separate cluster, you must copy the required files to all database nodes. A simple way to do so is to configure your Vertica nodes as Hadoop edge nodes. Client applications run on edge nodes; from Hadoop's perspective, Vertica is a client application. You can use Ambari or Cloudera Manager to configure edge nodes. For more information, see the documentation from your Hadoop vendor.

Verify that the value of the HadoopConfDir configuration parameter (see Hadoop Parameters) includes a directory containing the files listed in the following table. If you do not set a value, Vertica looks for the files in /etc/hadoop/conf. To confirm that Vertica can find the configuration files, call the VERIFY_HADOOP_CONF_DIR meta-function.
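As a minimal sketch, the following statements set HadoopConfDir explicitly and then check that Vertica can read the files. The sketch assumes a Vertica version that accepts ALTER DATABASE DEFAULT; on other versions, name the database instead. The directory shown is only the default location mentioned above, so substitute the directory that actually holds your files.

```sql
-- Point Vertica at the directory that holds core-site.xml and hdfs-site.xml.
-- /etc/hadoop/conf is the default location; replace it if your files live elsewhere.
ALTER DATABASE DEFAULT SET PARAMETER HadoopConfDir = '/etc/hadoop/conf';

-- Report whether Vertica can find and read the Hadoop configuration files.
SELECT VERIFY_HADOOP_CONF_DIR();
```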

Vertica uses the following configuration files and properties. If a property is not defined, Vertica uses the defaults shown in the table. Your Hadoop configuration files must specify all properties that have no defaults.

File	Property	Default
core-site.xml	fs.defaultFS	none
hdfs-site.xml	dfs.client.failover.max.attempts	4
hdfs-site.xml	dfs.client.failover.connection.retries.on.timeouts	0
hdfs-site.xml	ipc.client.connect.timeout	30 seconds
hdfs-site.xml	ipc.client.connect.retry.interval	10 seconds
hdfs-site.xml	dfs.nameservices (HA Name Nodes only)	none

Using a Cluster with High Availability Name Nodes

If your Hadoop cluster uses High Availability (HA) Name Nodes, verify that the dfs.nameservices parameter and the individual name nodes are defined in hdfs-site.xml.

Using More Than One Hadoop Cluster

In some cases, a Vertica cluster requires access to more than one HDFS cluster. For example, your business might use separate HDFS clusters for separate regions, or you might need data from both test and deployment clusters.

To support multiple clusters, perform the following steps:

Copy the configuration files from all HDFS clusters to your database nodes. You can place the copied files in any location readable by Vertica. However, as a best practice, you should place them all in the same directory tree, with one subdirectory per HDFS cluster. The locations must be the same on all database nodes.
Set the HadoopConfDir configuration parameter. The value is a colon-separated path containing the directories for all of your HDFS clusters.
Use an explicit name node or name service in the URL when creating an external table or copying data. Do not use hdfs:///, which names neither a name node nor a name service and is therefore ambiguous when more than one cluster is configured. For more information about URLs, see URL Format.

Vertica connects directly to a name node or name service; it does not otherwise distinguish among HDFS clusters. Therefore, the names of HDFS name nodes and name services must be unique across all of the HDFS clusters that Vertica accesses.
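The steps above can be sketched as follows. This is an illustration only: the directory layout, the name service prodNS, the table definition, and the ORC file format are all hypothetical placeholders, and the ALTER DATABASE DEFAULT form assumes a Vertica version that supports it.

```sql
-- Hypothetical layout: one subdirectory per HDFS cluster,
-- copied to the same location on every database node.
ALTER DATABASE DEFAULT SET PARAMETER
    HadoopConfDir = '/etc/hadoop/conf/prod:/etc/hadoop/conf/test';

-- Name the name service explicitly in the URL instead of using hdfs:///.
-- "prodNS", the path, and the column list are placeholders.
CREATE EXTERNAL TABLE sales_ext (id INT, amount FLOAT)
    AS COPY FROM 'hdfs://prodNS/data/sales/*.orc' ORC;
```

Because the URL carries the name service, it is clear which cluster this table reads from even though the configuration for both clusters is loaded.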

Updating Configuration Files

If you update the configuration files after starting Vertica, use the following statement to refresh them:

=> SELECT CLEAR_HDFS_CACHES();

The CLEAR_HDFS_CACHES function also flushes information about which Name Node is active in a High Availability (HA) Hadoop cluster. As a result, the first request that uses the hdfs scheme after you call this function is slow, because the initial connection to the Name Node can take more than 15 seconds.