Configuring the hdfs Scheme
Vertica uses information from the Hadoop cluster configuration to support the hdfs scheme. Vertica nodes therefore must have access to certain Hadoop configuration files. To use a cluster with High Availability NameNode or to read from more than one Hadoop cluster, you must perform additional configuration.
For both co-located and separate clusters that use Kerberos authentication, configure Vertica for Kerberos as explained in Configure Vertica for Kerberos Authentication.
Using the hdfs scheme still requires access to the WebHDFS service. For some special cases, Vertica cannot use the hdfs scheme and falls back to webhdfs.
Accessing Hadoop Configuration Files
To support the hdfs scheme, your Vertica nodes need access to certain Hadoop configuration files:
- If Vertica is co-located on HDFS nodes, then those configuration files are already present.
- If Vertica is running on a separate cluster, you must copy the required files to all database nodes. A simple way to do so is to configure your Vertica nodes as Hadoop edge nodes. Client applications run on edge nodes; from Hadoop's perspective, Vertica is a client application. You can use Ambari or Cloudera Manager to configure edge nodes. For more information, see the documentation from your Hadoop vendor.
Verify that the value of the HadoopConfDir configuration parameter (see Apache Hadoop Parameters) includes a directory containing the files listed in the following table. If you do not set a value, Vertica looks for the files in /etc/hadoop/conf. Regardless of which Vertica user runs a query, the directory is accessed by the Linux user under which the Vertica server process runs, so that Linux user must be able to read it.
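As a minimal sketch, you can inspect and set the parameter from vsql. The path shown is the default location named above. The DEFAULT keyword (meaning the current database) is an assumption about your Vertica version; older releases might require the database name instead.

```sql
-- Show the value currently in effect for HadoopConfDir.
=> SELECT parameter_name, current_value FROM configuration_parameters WHERE parameter_name = 'HadoopConfDir';

-- Point Vertica at the directory that holds core-site.xml and hdfs-site.xml.
=> ALTER DATABASE DEFAULT SET HadoopConfDir = '/etc/hadoop/conf';
```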
Vertica uses the following configuration files and properties. If a property is not defined, Vertica uses the defaults shown in the table. Your Hadoop configuration files must specify all properties that have no defaults.
File | Properties | Default |
---|---|---|
core-site.xml | fs.defaultFS | none |
core-site.xml | (for doAs users:) hadoop.proxyuser.username.users | none |
core-site.xml | (for doAs users:) hadoop.proxyuser.username.hosts | none |
hdfs-site.xml | dfs.client.failover.max.attempts | 4 |
hdfs-site.xml | dfs.client.failover.connection.retries.on.timeouts | 0 |
hdfs-site.xml | ipc.client.connect.timeout | 30 seconds |
hdfs-site.xml | ipc.client.connect.retry.interval | 10 seconds |
hdfs-site.xml | (For HA NN:) dfs.nameservices | none |
Using a Cluster with High Availability NameNodes
If your Hadoop cluster uses High Availability (HA) NameNodes, verify that the dfs.nameservices parameter and the individual NameNodes are defined in hdfs-site.xml.
Using More Than One Hadoop Cluster
In some cases, a Vertica cluster requires access to more than one HDFS cluster. For example, your business might use separate HDFS clusters for separate regions, or you might need data from both test and deployment clusters.
To support multiple clusters, perform the following steps:
- Copy the configuration files from all HDFS clusters to your database nodes. You can place the copied files in any location readable by Vertica. However, as a best practice, you should place them all in the same directory tree, with one subdirectory per HDFS cluster. The locations must be the same on all database nodes.
- Set the HadoopConfDir configuration parameter. The value is a colon-separated path containing the directories for all of your HDFS clusters.
- Use an explicit NameNode or nameservice in the URL when creating an external table or copying data. Do not use hdfs:/// because it could be ambiguous. For more information about URLs, see HDFS URL Format.
Vertica connects directly to a NameNode or nameservice; it does not otherwise distinguish among HDFS clusters. Therefore, names of HDFS NameNodes and nameservices must be globally unique.
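The steps above might look like the following sketch. The directory paths, the nameservice name eastNS, the table definition, and the file path are hypothetical placeholders, not values from any real deployment. The DEFAULT keyword assumes a Vertica version that accepts it in place of the database name.

```sql
-- One subdirectory per HDFS cluster, joined with colons; the same paths must exist on every database node.
=> ALTER DATABASE DEFAULT SET HadoopConfDir = '/etc/hadoop-clusters/east/conf:/etc/hadoop-clusters/west/conf';

-- Name the nameservice explicitly in the URL instead of using hdfs:///
=> CREATE EXTERNAL TABLE sales_east (id INT, amount FLOAT) AS COPY FROM 'hdfs://eastNS/data/sales/*.parquet' PARQUET;
```

Because Vertica distinguishes clusters only by NameNode or nameservice name, a second cluster in this sketch would need a different nameservice name, such as westNS.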
Verifying the Configuration
Use the VERIFY_HADOOP_CONF_DIR function to verify that Vertica can find configuration files in HadoopConfDir.
Use the HDFS_CLUSTER_CONFIG_CHECK function to test access through the hdfs scheme.
For more information about testing your configuration, see Verifying HDFS Configuration.
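As a brief sketch, both functions can be called from vsql without arguments. Output is omitted here because it depends on your cluster and configuration.

```sql
-- Check that Vertica can find the configuration files in HadoopConfDir.
=> SELECT VERIFY_HADOOP_CONF_DIR();

-- Test access to the configured HDFS clusters through the hdfs scheme.
=> SELECT HDFS_CLUSTER_CONFIG_CHECK();
```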
Updating Configuration Files
If you update the configuration files after starting Vertica, use the following statement to refresh them:
=> SELECT CLEAR_HDFS_CACHES();
The CLEAR_HDFS_CACHES function also flushes information about which NameNode is active in a High Availability (HA) Hadoop cluster. Therefore, the first hdfs request after calling this function is slow, because the initial connection to the NameNode can take more than 15 seconds.