Configuring Vertica to Restore HDFS Storage Locations

Your Vertica cluster must be able to run the Hadoop distcp command to restore a backup of an HDFS storage location. The easiest way to enable your cluster to run this command is to install several Hadoop packages on each node. These packages must be from the same distribution and version of Hadoop that is running on your Hadoop cluster.

The steps you need to take depend on:

  • The distribution and version of Hadoop running on the Hadoop cluster containing your HDFS storage location.
  • The distribution of Linux running on your Vertica cluster.

Installing the Hadoop packages necessary to run distcp does not turn your Vertica database into a Hadoop cluster. This process installs just enough of the Hadoop support files on your cluster to run the distcp command. There is no additional overhead placed on the Vertica cluster, aside from a small amount of additional disk space consumed by the Hadoop support files.

Configuration Overview

The steps for configuring your Vertica cluster to restore backups of HDFS storage locations are:

  1. If necessary, install and configure a Java runtime on the hosts in the Vertica cluster.
  2. Find the location of your Hadoop distribution's package repository.
  3. Add the Hadoop distribution's package repository to the Linux package manager on all hosts in your cluster.
  4. Install the necessary Hadoop packages on your Vertica hosts.
  5. Set two configuration parameters in your Vertica database related to Java and Hadoop.
  6. If your HDFS storage location uses Kerberos, set additional configuration parameters to allow Vertica user credentials to be proxied.
  7. Confirm that the Hadoop distcp command runs on your Vertica hosts.

The following sections describe these steps in greater detail.

Installing a Java Runtime

Your Vertica cluster must have a Java Virtual Machine (JVM) installed to run the Hadoop distcp command. It already has a JVM installed if you have configured it to run Java UDxs or to use the HCatalog Connector.

If your Vertica database has a JVM installed, verify that your Hadoop distribution supports it. See your Hadoop distribution's documentation to determine which JVMs it supports.

If the JVM installed on your Vertica cluster is not supported by your Hadoop distribution, you must uninstall it and then install a JVM that is supported by both Vertica and your Hadoop distribution. See Vertica SDKs in Supported Platforms for a list of the JVMs compatible with Vertica.

If your Vertica cluster does not have a JVM (or its existing JVM is incompatible with your Hadoop distribution), follow the instructions in Installing the Java Runtime on Your Vertica Cluster.
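
As a quick check before comparing against your Hadoop distribution's supported JVMs, the following sketch reports which JVM, if any, is on a host's search path. The function name report_java is illustrative only; it is not part of Vertica or Hadoop.

```shell
# Print the JVM version line for this host, or a note if no java is on the PATH.
report_java() {
    if command -v java >/dev/null 2>&1; then
        # java -version writes to stderr, so redirect it to stdout.
        java -version 2>&1 | head -n 1
    else
        echo "no java executable found on PATH"
    fi
}

report_java
```

Run it on each host in the cluster, since the hosts may not all have the same JVM installed.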

Finding Your Hadoop Distribution's Package Repository

Many Hadoop distributions have their own installation system, such as Cloudera Manager or Hortonworks' Ambari. However, they also support manual installation using native Linux packages such as RPM and .deb files. These package files are maintained in a repository. You can configure your Vertica hosts to access this repository to download and install Hadoop packages.

Consult your Hadoop distribution's documentation to find the location of its Linux package repository. This information is often located in the portion of the documentation covering manual installation techniques.

Each Hadoop distribution maintains separate repositories for each of the major Linux package management systems. Find the specific repository for the Linux distribution running on your Vertica cluster. Be sure that the package repository you select matches the version of the Hadoop distribution installed on your Hadoop cluster.

Configuring Vertica Nodes to Access the Hadoop Distribution’s Package Repository

Configure the nodes in your Vertica cluster so they can access your Hadoop distribution's package repository. Your Hadoop distribution's documentation should explain how to add the repositories to your Linux platform. If the documentation does not explain how to add the repository to your packaging system, refer to your Linux distribution's documentation.

The steps you need to take depend on the package management system your Linux platform uses. Usually, the process involves:

  • Downloading a configuration file.
  • Adding the configuration file to the package management system's configuration directory.
  • For Debian-based Linux distributions, adding the Hadoop repository encryption key to the root account keyring.
  • Updating the package management system's index to have it discover new packages.

You must add the Hadoop repository to all hosts in your Vertica cluster.
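
For a yum-based host, the process might look like the following sketch. It only prints the commands so that you can review them before running anything. The repository file URL and the file name hadoop.repo are hypothetical placeholders; take the real values from your Hadoop distribution's documentation.

```shell
# Hypothetical placeholder: replace with the repository file URL from your
# Hadoop distribution's documentation.
REPO_FILE_URL="https://repo.example.com/hadoop/hadoop.repo"

# Print (not run) the commands for a yum-based host so they can be reviewed first.
print_yum_repo_steps() {
    local url="$1"
    # Download the configuration file into yum's configuration directory.
    echo "sudo curl -o /etc/yum.repos.d/hadoop.repo '$url'"
    # Refresh the package index so yum discovers the new packages.
    echo "sudo yum makecache"
}

print_yum_repo_steps "$REPO_FILE_URL"
```

Debian-based and SUSE hosts use different configuration directories and commands, so treat this only as an outline of the download, install, and refresh sequence.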

Installing the Required Hadoop Packages

After configuring the repository, you are ready to install the Hadoop packages. The packages you need to install are:

  • hadoop
  • hadoop-hdfs
  • hadoop-client

The names of the packages are usually the same across all Hadoop and Linux distributions. These packages often have additional dependencies. Always accept any additional packages that the Linux package manager asks to install.

To install these packages, use the package manager command for your Linux distribution:

  • On Red Hat and CentOS, the package manager command is yum.
  • On Debian and Ubuntu, the package manager command is apt-get.
  • On SUSE, the package manager command is zypper.

Consult your Linux distribution's documentation for instructions on installing packages.
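
The following sketch shows what the install command could look like for each of the package managers listed above. It prints the command for the first package manager it finds on the host rather than running it. The function name install_command is illustrative, and the package names assume your distribution uses the usual names given earlier.

```shell
PACKAGES="hadoop hadoop-hdfs hadoop-client"

# Build the install command for the given package manager name.
install_command() {
    case "$1" in
        yum)     echo "sudo yum install $PACKAGES" ;;
        apt-get) echo "sudo apt-get install $PACKAGES" ;;
        zypper)  echo "sudo zypper install $PACKAGES" ;;
        *)       echo "unsupported package manager: $1" >&2; return 1 ;;
    esac
}

# Pick the first package manager found on this host and show its command.
for pm in yum apt-get zypper; do
    if command -v "$pm" >/dev/null 2>&1; then
        install_command "$pm"
        break
    fi
done
```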

Setting Configuration Parameters

You must set two configuration parameters to enable Vertica to restore HDFS data:

  • JavaBinaryForUDx is the path to the Java executable. You may have already set this value to use Java UDxs or the HCatalog Connector. You can find the path for the default Java executable from the Bash command shell using the command:

    which java
  • HadoopHome is the path where Hadoop is installed on the Vertica hosts. This is the directory that contains bin/hadoop (the bin directory containing the Hadoop executable file). The default value for this parameter is /usr. The default value is correct if your Hadoop executable is located at /usr/bin/hadoop.

The following example demonstrates setting and then reviewing the values of these parameters.

=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';
=> SELECT get_config_parameter('JavaBinaryForUDx');
(1 row)

=> ALTER DATABASE mydb SET HadoopHome = '/usr';
=> SELECT get_config_parameter('HadoopHome');
(1 row)
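
If you are unsure which value to use for HadoopHome, you can derive it from the location of the hadoop executable, because HadoopHome is the directory that contains bin/hadoop. A minimal sketch, using an illustrative helper name:

```shell
# Given the full path to the hadoop executable, print the directory that
# contains bin/hadoop (the value expected by HadoopHome).
hadoop_home_from_path() {
    dirname "$(dirname "$1")"
}

hadoop_home_from_path /usr/bin/hadoop    # prints /usr

# On a host with Hadoop installed and on the search path:
#   hadoop_home_from_path "$(command -v hadoop)"
```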

There are additional parameters you may, optionally, set:

  • HadoopFSReadRetryTimeout and HadoopFSWriteRetryTimeout specify how long to wait before an HDFS read or write operation fails. The default value for each is 180 seconds. If you are confident that your file system will fail more quickly, you can potentially improve performance by lowering these values.

  • HadoopFSReplication is the number of replicas HDFS makes. By default, the Hadoop client chooses this value; Vertica uses the same value for all nodes. We recommend against changing this parameter unless directed to do so.

  • HadoopFSBlockSizeBytes is the block size to write to HDFS; larger files are divided into blocks of this size. The default is 64 MB.

Setting Kerberos Parameters

If your Vertica nodes are co-located on HDFS nodes and you are using Kerberos, you must change some Hadoop configuration parameters so that restoring from backups works. In yarn-site.xml on every Vertica node, set the following parameters:

Parameter                                                       Value
yarn.resourcemanager.proxy-user-privileges.enabled              true
yarn.resourcemanager.proxyusers.*.groups                        *
yarn.resourcemanager.proxyusers.*.hosts                         *
yarn.resourcemanager.proxyusers.*.users                         *
yarn.timeline-service.http-authentication.proxyusers.*.groups   *
yarn.timeline-service.http-authentication.proxyusers.*.hosts    *
yarn.timeline-service.http-authentication.proxyusers.*.users    *

No changes are needed on HDFS nodes that are not also Vertica nodes.
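
To confirm that a node's yarn-site.xml contains all of the parameters listed above, a sketch such as the following can help. It assumes each parameter appears as a <name>...</name> element with no extra whitespace inside the tags, and it checks only that the names are present, not their values. The function name and the example file path are illustrative assumptions.

```shell
# Parameter names that must be set in yarn-site.xml.
YARN_PARAMS="
yarn.resourcemanager.proxy-user-privileges.enabled
yarn.resourcemanager.proxyusers.*.groups
yarn.resourcemanager.proxyusers.*.hosts
yarn.resourcemanager.proxyusers.*.users
yarn.timeline-service.http-authentication.proxyusers.*.groups
yarn.timeline-service.http-authentication.proxyusers.*.hosts
yarn.timeline-service.http-authentication.proxyusers.*.users
"

# Print each parameter name that does not appear in the given file.
# Returns 0 only if every name is present.
missing_yarn_params() {
    local file="$1" missing=0 name
    while IFS= read -r name; do
        [ -z "$name" ] && continue
        # -F matches the name literally, so "*" and "." are not treated as patterns.
        if ! grep -qF "<name>$name</name>" "$file"; then
            echo "$name"
            missing=1
        fi
    done <<EOF
$YARN_PARAMS
EOF
    return "$missing"
}

# Example (the path is an assumption; use your node's actual yarn-site.xml):
#   missing_yarn_params /etc/hadoop/conf/yarn-site.xml
```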

Confirming that distcp Runs

After the packages are installed on all hosts in your cluster, your database should be able to run the Hadoop distcp command. To test it:

  1. Log in to any host in your cluster as the database superuser.
  2. At the Bash shell, enter the command:

    $ hadoop distcp
  3. The command should print a message similar to the following:

    usage: distcp OPTIONS [source_path...] <target_path>
     -async                 Should distcp execution be blocking
     -atomic                Commit all changes or none
     -bandwidth <arg>       Specify bandwidth per map in MB
     -delete                Delete from target, files missing in source
     -f <arg>               List of files that need to be copied
     -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
     -i                     Ignore failures during copy
     -log <arg>             Folder on DFS where distcp execution logs are
     -m <arg>               Max number of concurrent maps to use for copy
     -mapredSslConf <arg>   Configuration for ssl config file, to use with
     -overwrite             Choose to overwrite target files unconditionally,
                            even if they exist.
     -p <arg>               preserve status (rbugpc)(replication, block-size,
                            user, group, permission, checksum-type)
     -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n
     -skipcrccheck          Whether to skip CRC checks between source and
                            target paths.
     -strategy <arg>        Copy strategy to use. Default is dividing work
                            based on file sizes
     -tmp <arg>             Intermediate work path to be used for atomic
     -update                Update target, copying only missingfiles or
  4. Repeat these steps on the other hosts in your database to verify that all of the hosts can run distcp.
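
Repeating the check by hand on every host can be tedious. The sketch below runs the same command on a list of hosts over ssh and looks for the usage message in the output. It assumes that the database superuser can ssh to each host and that the hadoop command is on the search path of non-interactive ssh sessions. The function names and host names are hypothetical.

```shell
# Succeeds if the text on stdin looks like the distcp usage message.
looks_like_distcp_usage() {
    grep -q "usage: distcp"
}

# Run "hadoop distcp" on each host over ssh and report the result.
# Host names are passed as arguments.
check_distcp_on_hosts() {
    local host
    for host in "$@"; do
        # With no arguments distcp prints its usage text; capture both output
        # streams and judge by the text rather than the exit status.
        if ssh -n "$host" 'hadoop distcp' 2>&1 | looks_like_distcp_usage; then
            echo "$host: ok"
        else
            echo "$host: distcp usage message not seen"
        fi
    done
}

# Example with hypothetical host names:
#   check_distcp_on_hosts vertica01 vertica02 vertica03
```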


If you cannot run the distcp command, try the following steps:

  • If Bash cannot find the hadoop command, you may need to manually add Hadoop's bin directory to the system search path. An alternative is to create a symbolic link in an existing directory in the search path (such as /usr/bin) to the hadoop binary.
  • Ensure the version of Java installed on your Vertica cluster is compatible with your Hadoop distribution.
  • Review the Linux package installation tool's logs for errors. In some cases, packages may not be fully installed, or may not have been downloaded due to network issues.
  • Ensure that the database administrator account has permission to execute the hadoop command. You may need to add the account to a specific group in order to allow it to run the necessary commands.
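
For the first item in the list above, the following sketch shows one way to add Hadoop's bin directory to the search path without creating a duplicate entry, with the symbolic link alternative shown as a comment. The directory /usr/lib/hadoop/bin is a hypothetical location; substitute the directory that actually holds the hadoop binary on your hosts.

```shell
# Hypothetical location; use the directory that actually holds the hadoop binary.
HADOOP_BIN_DIR="/usr/lib/hadoop/bin"

# Print PATH-style list $1 with directory $2 appended, unless already present.
path_with_dir() {
    case ":$1:" in
        *":$2:"*) echo "$1" ;;
        *)        echo "$1:$2" ;;
    esac
}

# Add the directory for the current shell session:
#   PATH="$(path_with_dir "$PATH" "$HADOOP_BIN_DIR")"; export PATH
# Or link the binary into a directory already on the search path:
#   sudo ln -s "$HADOOP_BIN_DIR/hadoop" /usr/bin/hadoop
```

A PATH change made this way lasts only for the current session; to make it permanent, add it to the shell startup file of the database administrator account.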