Configuring the hdfs Scheme

Vertica uses information from the Hadoop cluster configuration to support the hdfs scheme. Vertica nodes therefore must have access to certain Hadoop configuration files. To use a cluster with High Availability Name Nodes or to read from more than one Hadoop cluster, you must perform additional configuration.

For both co-located and separate clusters that use Kerberos authentication, configure Vertica for Kerberos as explained in Configure Vertica for Kerberos Authentication.

Using the hdfs scheme still requires access to the WebHDFS service. In some special cases, Vertica cannot use the hdfs scheme and instead falls back to the webhdfs scheme.

Accessing Hadoop Configuration Files

To support the hdfs scheme, your Vertica nodes need access to certain Hadoop configuration files:

Verify that the value of the HadoopConfDir configuration parameter (see Apache Hadoop Parameters) includes a directory containing the files listed in the following table. If you do not set a value, Vertica looks for the files in /etc/hadoop/conf. You can use the VERIFY_HADOOP_CONF_DIR meta-function to verify that Vertica can find configuration files.
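For example, to confirm that all database nodes can locate and read the configuration files, call the meta-function with no arguments:

=> SELECT VERIFY_HADOOP_CONF_DIR();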

Vertica uses the following configuration files and properties. If a property is not defined, Vertica uses the defaults shown in the table. Your Hadoop configuration files must specify all properties that have no defaults.

File            Properties                                           Default
core-site.xml   fs.defaultFS                                         none
                (for doAs users:) hadoop.proxyuser.username.users    none
                (for doAs users:) hadoop.proxyuser.username.hosts    none
hdfs-site.xml   dfs.client.failover.max.attempts                     4
                dfs.client.failover.connection.retries.on.timeouts   0
                ipc.client.connect.timeout                           30 seconds
                ipc.client.connect.retry.interval                    10 seconds
                (For HA NN:) dfs.nameservices                        none
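For example, a minimal core-site.xml defines fs.defaultFS as follows. The host name and port shown here are illustrative; use the values for your own cluster:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>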

Using a Cluster with High Availability Name Nodes

If your Hadoop cluster uses High Availability (HA) Name Nodes, verify that the dfs.nameservices parameter and the individual name nodes are defined in hdfs-site.xml.
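For example, an HA configuration in hdfs-site.xml names the name service and its individual name nodes. The service and node names shown here are illustrative:

<property>
  <name>dfs.nameservices</name>
  <value>hadoopNS</value>
</property>
<property>
  <name>dfs.ha.namenodes.hadoopNS</name>
  <value>nn1,nn2</value>
</property>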

Using More Than One Hadoop Cluster

In some cases, a Vertica cluster requires access to more than one HDFS cluster. For example, your business might use separate HDFS clusters for separate regions, or you might need data from both test and deployment clusters.

To support multiple clusters, perform the following steps:

  1. Copy the configuration files from all HDFS clusters to your database nodes. You can place the copied files in any location readable by Vertica. However, as a best practice, you should place them all in the same directory tree, with one subdirectory per HDFS cluster. The locations must be the same on all database nodes.
  2. Set the HadoopConfDir configuration parameter. The value is a colon-separated path containing the directories for all of your HDFS clusters.
  3. Use an explicit name node or name service in the URL when creating an external table or copying data. Do not use hdfs:/// because it could be ambiguous. For more information about URLs, see HDFS URL Format.
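For example, assuming a database named exampledb, two configuration directories, and a name service named hadoopNS1 (all of these names and paths are illustrative):

=> ALTER DATABASE exampledb SET HadoopConfDir = '/etc/hadoop/conf:/etc/hadoop/testconf';
=> CREATE EXTERNAL TABLE users (id INT, name VARCHAR(20))
   AS COPY FROM 'hdfs://hadoopNS1/data/users/*.csv';

Because the URL names the hadoopNS1 name service explicitly, Vertica can determine which cluster's configuration to use.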

Vertica connects directly to a name node or name service; it does not otherwise distinguish among HDFS clusters. Therefore, names of HDFS name nodes and name services must be globally unique.

Updating Configuration Files

If you update the configuration files after starting Vertica, use the following statement to refresh them:

=> SELECT CLEAR_HDFS_CACHES();

The CLEAR_HDFS_CACHES function also flushes information about which Name Node is active in a High Availability (HA) Hadoop cluster. Therefore, the first hdfs request after calling this function is slow, because the initial connection to the Name Node can take more than 15 seconds.