Configuring Vertica to Restore HDFS Storage Locations

Your Vertica cluster must be able to run the Hadoop distcp command to restore a backup of an HDFS storage location. The easiest way to enable your cluster to run this command is to install several Hadoop packages on each node. These packages must be from the same distribution and version of Hadoop that is running on your Hadoop cluster.

The steps you need to take depend on the distribution of Hadoop you are using and the Linux distribution running on your Vertica cluster.

Note: Installing the Hadoop packages necessary to run distcp does not turn your Vertica database into a Hadoop cluster. This process installs just enough of the Hadoop support files on your cluster to run the distcp command. There is no additional overhead placed on the Vertica cluster, aside from a small amount of additional disk space consumed by the Hadoop support files.

Configuration Overview

The steps for configuring your Vertica cluster to restore backups of HDFS storage locations are:

  1. If necessary, install and configure a Java runtime on the hosts in the Vertica cluster.
  2. Find the location of your Hadoop distribution's package repository.
  3. Add the Hadoop distribution's package repository to the Linux package manager on all hosts in your cluster.
  4. Install the necessary Hadoop packages on your Vertica hosts.
  5. Set two configuration parameters in your Vertica database related to Java and Hadoop.
  6. If your HDFS storage location uses Kerberos, set additional configuration parameters to allow Vertica user credentials to be proxied.
  7. Confirm that the Hadoop distcp command runs on your Vertica hosts.

The following sections describe these steps in greater detail.

Installing a Java Runtime

Your Vertica cluster must have a Java Virtual Machine (JVM) installed to run the Hadoop distcp command. A JVM is already installed if you have configured the cluster to run user-defined extensions developed in Java or to access HDFS data using the HCatalog Connector.

If your Vertica database has a JVM installed, verify that your Hadoop distribution supports it. See your Hadoop distribution's documentation to determine which JVMs it supports.

If the JVM installed on your Vertica cluster is not supported by your Hadoop distribution, you must uninstall it and install a JVM that is supported by both Vertica and your Hadoop distribution. See Vertica SDKs in Supported Platforms for a list of the JVMs compatible with Vertica.

If your Vertica cluster does not have a JVM (or its existing JVM is incompatible with your Hadoop distribution), follow the instructions in Installing the Java Runtime on Your Vertica Cluster.
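To check whether a JVM is already installed, and to find the path of its executable (you need this path later for the JavaBinaryForUDx configuration parameter), you can run the following commands on each host:

$ which java
$ java -version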

Finding Your Hadoop Distribution's Package Repository

Many Hadoop distributions have their own installation system, such as Cloudera Manager or Hortonworks Ambari. However, they also support manual installation using native Linux packages such as RPM and .deb files. These package files are maintained in a repository, and you can configure your Vertica hosts to access this repository to download and install the Hadoop packages.

Consult your Hadoop distribution's documentation to find the location of its Linux package repository. This information is often located in the portion of the documentation covering manual installation techniques.

Each Hadoop distribution maintains separate repositories for each of the major Linux package management systems. Find the specific repository for the Linux distribution running on your Vertica cluster, and be sure that the package repository you select matches the version of the Hadoop distribution installed on your Hadoop cluster.

Configuring Vertica Nodes to Access the Hadoop Distribution’s Package Repository

Configure the nodes in your Vertica cluster so they can access your Hadoop distribution's package repository. Your Hadoop distribution's documentation should explain how to add the repositories to your Linux platform. If the documentation does not explain how to add the repository to your packaging system, refer to your Linux distribution's documentation.

The steps you need to take depend on the package management system your Linux platform uses. Usually, the process involves downloading a repository configuration file, importing the repository's GPG signing key, and updating the package list.

The following example demonstrates adding the Hortonworks 2.1 package repository to an Ubuntu 12.04 host. The steps in this example are explained in the Hortonworks documentation.

$ wget http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list \
  -O /etc/apt/sources.list.d/hdp.list  
--2014-08-20 11:06:00--  http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list
Connecting to 16.113.84.10:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 161 [binary/octet-stream]
Saving to: `/etc/apt/sources.list.d/hdp.list'

100%[======================================>] 161         --.-K/s   in 0s      

2014-08-20 11:06:00 (8.00 MB/s) - `/etc/apt/sources.list.d/hdp.list' saved [161/161]

$ gpg --keyserver pgp.mit.edu --recv-keys B9733A7A07513CAD
gpg: requesting key 07513CAD from hkp server pgp.mit.edu
gpg: /root/.gnupg/trustdb.gpg: trustdb created
gpg: key 07513CAD: public key "Jenkins (HDP Builds) <jenkin@hortonworks.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)

$ gpg -a --export 07513CAD | apt-key add -
OK

$ apt-get update
Hit http://us.archive.ubuntu.com precise Release.gpg
Hit http://extras.ubuntu.com precise Release.gpg                     
Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B]
Hit http://us.archive.ubuntu.com precise-updates Release.gpg                   
Get:2 http://public-repo-1.hortonworks.com HDP-UTILS Release.gpg [836 B]
Get:3 http://public-repo-1.hortonworks.com HDP Release.gpg [836 B]             
Hit http://us.archive.ubuntu.com precise-backports Release.gpg                 
Hit http://extras.ubuntu.com precise Release                                   
Get:4 http://security.ubuntu.com precise-security Release [50.7 kB]            
Get:5 http://public-repo-1.hortonworks.com HDP-UTILS Release [6,550 B]         
Hit http://us.archive.ubuntu.com precise Release                               
Hit http://extras.ubuntu.com precise/main Sources                              
Get:6 http://public-repo-1.hortonworks.com HDP Release [6,502 B]               
Hit http://us.archive.ubuntu.com precise-updates Release                       
Get:7 http://public-repo-1.hortonworks.com HDP-UTILS/main amd64 Packages [1,955 B]
Get:8 http://security.ubuntu.com precise-security/main Sources [108 kB]        
Get:9 http://public-repo-1.hortonworks.com HDP-UTILS/main i386 Packages [762 B]
                              . . .
Reading package lists... Done

You must add the Hadoop repository to all hosts in your Vertica cluster.
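On RPM-based platforms such as Red Hat or CentOS, the equivalent process uses yum instead of apt. A minimal sketch, using an illustrative repository URL (consult your Hadoop distribution's documentation for the actual URL for your version and platform):

$ wget http://public-repo-1.hortonworks.com/HDP/centos6/2.1.3.0/hdp.repo \
  -O /etc/yum.repos.d/hdp.repo
$ yum repolist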

Installing the Required Hadoop Packages

After configuring the repository, you are ready to install the Hadoop packages. The packages you need to install are:

  * hadoop
  * hadoop-hdfs
  * hadoop-client

The names of the packages are usually the same across all Hadoop and Linux distributions. These packages often have additional dependencies; always accept any additional packages that the Linux package manager asks to install.

To install these packages, use the package manager command for your Linux distribution (for example, apt-get on Debian and Ubuntu platforms, or yum on Red Hat and CentOS platforms).

Consult your Linux distribution's documentation for instructions on installing packages.

The following example demonstrates installing the required Hadoop packages from the Hortonworks 2.1 distribution on an Ubuntu 12.04 system.

# apt-get install hadoop hadoop-hdfs hadoop-client
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
  bigtop-jsvc hadoop-mapreduce hadoop-yarn zookeeper
The following NEW packages will be installed:
  bigtop-jsvc hadoop hadoop-client hadoop-hdfs hadoop-mapreduce hadoop-yarn
  zookeeper
0 upgraded, 7 newly installed, 0 to remove and 90 not upgraded.
Need to get 86.6 MB of archives.
After this operation, 99.8 MB of additional disk space will be used.
Do you want to continue [Y/n]? Y
Get:1 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      bigtop-jsvc amd64 1.0.10-1 [28.5 kB]
Get:2 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      zookeeper all 3.4.5.2.1.3.0-563 [6,820 kB]
Get:3 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      hadoop all 2.4.0.2.1.3.0-563 [21.5 MB]
Get:4 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      hadoop-hdfs all 2.4.0.2.1.3.0-563 [16.0 MB]
Get:5 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      hadoop-yarn all 2.4.0.2.1.3.0-563 [15.1 MB]
Get:6 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      hadoop-mapreduce all 2.4.0.2.1.3.0-563 [27.2 MB]
Get:7 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      hadoop-client all 2.4.0.2.1.3.0-563 [3,650 B]
Fetched 86.6 MB in 1min 2s (1,396 kB/s)                                        
Selecting previously unselected package bigtop-jsvc.
(Reading database ... 197894 files and directories currently installed.)
Unpacking bigtop-jsvc (from .../bigtop-jsvc_1.0.10-1_amd64.deb) ...
Selecting previously unselected package zookeeper.
Unpacking zookeeper (from .../zookeeper_3.4.5.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop.
Unpacking hadoop (from .../hadoop_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-hdfs.
Unpacking hadoop-hdfs (from .../hadoop-hdfs_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-yarn.
Unpacking hadoop-yarn (from .../hadoop-yarn_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-mapreduce.
Unpacking hadoop-mapreduce (from .../hadoop-mapreduce_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-client.
Unpacking hadoop-client (from .../hadoop-client_2.4.0.2.1.3.0-563_all.deb) ...
Processing triggers for man-db ...
Setting up bigtop-jsvc (1.0.10-1) ...
Setting up zookeeper (3.4.5.2.1.3.0-563) ...
update-alternatives: using /etc/zookeeper/conf.dist to provide /etc/zookeeper/conf (zookeeper-conf) in auto mode.
Setting up hadoop (2.4.0.2.1.3.0-563) ...
update-alternatives: using /etc/hadoop/conf.empty to provide /etc/hadoop/conf (hadoop-conf) in auto mode.
Setting up hadoop-hdfs (2.4.0.2.1.3.0-563) ...
Setting up hadoop-yarn (2.4.0.2.1.3.0-563) ...
Setting up hadoop-mapreduce (2.4.0.2.1.3.0-563) ...
Setting up hadoop-client (2.4.0.2.1.3.0-563) ...
Processing triggers for libc-bin ...
ldconfig deferred processing now taking place
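
On RPM-based platforms, the same packages are installed with yum. A minimal sketch, assuming the package names match the Debian example above:

# yum install hadoop hadoop-hdfs hadoop-client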

Setting Configuration Parameters

You must set two configuration parameters to enable Vertica to restore HDFS data:

  * JavaBinaryForUDx: the path of the Java executable on your Vertica hosts.
  * HadoopHome: the directory that contains the Hadoop executable at bin/hadoop. The default value is /usr.

The following example demonstrates setting and then reviewing the values of these parameters.

=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';
=> SELECT get_config_parameter('JavaBinaryForUDx');
 get_config_parameter
----------------------
 /usr/bin/java
(1 row)
=> ALTER DATABASE mydb SET HadoopHome = '/usr';

=> SELECT get_config_parameter('HadoopHome');
 get_config_parameter
----------------------
 /usr
(1 row)
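
You can confirm from the shell that both values point to real files. With the settings shown above, Vertica looks for the Java binary at /usr/bin/java and the Hadoop executable at $HadoopHome/bin/hadoop, that is, /usr/bin/hadoop:

$ ls -l /usr/bin/java /usr/bin/hadoop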

There are additional parameters that you can optionally set:

Setting Kerberos Parameters

If your Vertica nodes are co-located on HDFS nodes and you are using Kerberos, you must change some Hadoop configuration parameters for restores from backups to work. In yarn-site.xml on every Vertica node, set the following parameters:

Parameter                                                        Value
yarn.resourcemanager.proxy-user-privileges.enabled               true
yarn.resourcemanager.proxyusers.*.groups                         *
yarn.resourcemanager.proxyusers.*.hosts                          *
yarn.resourcemanager.proxyusers.*.users                          *
yarn.timeline-service.http-authentication.proxyusers.*.groups    *
yarn.timeline-service.http-authentication.proxyusers.*.hosts     *
yarn.timeline-service.http-authentication.proxyusers.*.users     *

No changes are needed on HDFS nodes that are not also Vertica nodes.
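
In yarn-site.xml, each of these parameters is expressed as a <property> element inside the <configuration> block. A sketch of the first two entries (the remaining parameters follow the same pattern):

<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.proxyusers.*.groups</name>
  <value>*</value>
</property>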

Confirming that distcp Runs

After the packages are installed on all hosts in your cluster, your database should be able to run the Hadoop distcp command. To test it:

  1. Log into any host in your cluster as the database administrator.
  2. At the Bash shell, enter the command:

    $ hadoop distcp
  3. The command should print a message similar to the following:

    usage: distcp OPTIONS [source_path...] <target_path>
                  OPTIONS
     -async                 Should distcp execution be blocking
     -atomic                Commit all changes or none
     -bandwidth <arg>       Specify bandwidth per map in MB
     -delete                Delete from target, files missing in source
     -f <arg>               List of files that need to be copied
     -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
     -i                     Ignore failures during copy
     -log <arg>             Folder on DFS where distcp execution logs are
                            saved
     -m <arg>               Max number of concurrent maps to use for copy
     -mapredSslConf <arg>   Configuration for ssl config file, to use with
                            hftps://
     -overwrite             Choose to overwrite target files unconditionally,
                            even if they exist.
     -p <arg>               preserve status (rbugpc)(replication, block-size,
                            user, group, permission, checksum-type)
     -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n
                            bytes
     -skipcrccheck          Whether to skip CRC checks between source and
                            target paths.
     -strategy <arg>        Copy strategy to use. Default is dividing work
                            based on file sizes
     -tmp <arg>             Intermediate work path to be used for atomic
                            commit
     -update                Update target, copying only missingfiles or
                            directories
  4. Repeat these steps on the other hosts in your database to verify that all of the hosts can run distcp.
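
The usage message confirms only that the distcp command is installed and runnable. For a fuller check, you can copy a test file between HDFS paths. A minimal sketch, assuming a hypothetical NameNode at hadoopNN.example.com and a test directory that you have permission to read and write:

$ hadoop distcp hdfs://hadoopNN.example.com:8020/tmp/distcp-test/source \
    hdfs://hadoopNN.example.com:8020/tmp/distcp-test/dest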

Troubleshooting

If you cannot run the distcp command, try the following steps: