Configuring Vertica to Restore HDFS Storage Locations

Your Vertica cluster must be able to run the Hadoop distcp command to restore a backup of an HDFS storage location. The easiest way to enable your cluster to run this command is to install several Hadoop packages on each node. These packages must be from the same distribution and version of Hadoop that is running on your Hadoop cluster.

The steps you need to take depend on the distribution of Hadoop you are using and the Linux distribution running on your Vertica cluster.

Note: Installing the Hadoop packages necessary to run distcp does not turn your Vertica database into a Hadoop cluster. This process installs just enough of the Hadoop support files on your cluster to run the distcp command. There is no additional overhead placed on the Vertica cluster, aside from a small amount of additional disk space consumed by the Hadoop support files.

Configuration Overview

The steps for configuring your Vertica cluster to restore backups of HDFS storage locations are:

  1. If necessary, install and configure a Java runtime on the hosts in the Vertica cluster.
  2. Find the location of your Hadoop distribution's package repository.
  3. Add the Hadoop distribution's package repository to the Linux package manager on all hosts in your cluster.
  4. Install the necessary Hadoop packages on your Vertica hosts.
  5. Set two configuration parameters in your Vertica database related to Java and Hadoop.
  6. If your HDFS storage location uses Kerberos, set additional configuration parameters to allow Vertica user credentials to be proxied.
  7. Confirm that the Hadoop distcp command runs on your Vertica hosts.

The following sections describe these steps in greater detail.

Installing a Java Runtime

Your Vertica cluster must have a Java Virtual Machine (JVM) installed to run the Hadoop distcp command. A JVM is already installed if you have configured the cluster to run user-defined extensions developed in Java or to access HDFS data using the HCatalog Connector.

If your Vertica database has a JVM installed, verify that your Hadoop distribution supports it. See your Hadoop distribution's documentation to determine which JVMs it supports.

If the JVM installed on your Vertica cluster is not supported by your Hadoop distribution, you must uninstall it and install a JVM that is supported by both Vertica and your Hadoop distribution. See Vertica SDKs in Supported Platforms for a list of the JVMs compatible with Vertica.

If your Vertica cluster does not have a JVM (or its existing JVM is incompatible with your Hadoop distribution), follow the instructions in Installing the Java Runtime on Your Vertica Cluster.
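To check whether a JVM is already installed, and to find the path of its executable (you need this path later for the JavaBinaryForUDx configuration parameter), you can run the following commands on each host:

$ which java
$ java -version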

Finding Your Hadoop Distribution's Package Repository

Many Hadoop distributions have their own installation system, such as Cloudera Manager or Hortonworks Ambari. However, they also support manual installation using native Linux packages such as RPM and .deb files. These package files are maintained in a repository, and you can configure your Vertica hosts to access this repository to download and install the Hadoop packages.

Consult your Hadoop distribution's documentation to find the location of its Linux package repository. This information is often located in the portion of the documentation covering manual installation techniques.

Each Hadoop distribution maintains separate repositories for each of the major Linux package management systems. Find the specific repository for the Linux distribution running on your Vertica cluster, and be sure that the package repository you select matches the version of the Hadoop distribution installed on your Hadoop cluster.

Configuring Vertica Nodes to Access the Hadoop Distribution’s Package Repository

Configure the nodes in your Vertica cluster so they can access your Hadoop distribution's package repository. Your Hadoop distribution's documentation should explain how to add the repositories to your Linux platform. If the documentation does not explain how to add the repository to your packaging system, refer to your Linux distribution's documentation.

The steps you need to take depend on the package management system your Linux platform uses. Usually, the process involves downloading a repository configuration file, importing the repository's GPG signing key, and updating the package list.

The following example demonstrates adding the Hortonworks 2.1 package repository to an Ubuntu 12.04 host. The steps in this example are explained in the Hortonworks documentation.

$ wget http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list \
  -O /etc/apt/sources.list.d/hdp.list  
--2014-08-20 11:06:00--  http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list
Connecting to 16.113.84.10:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 161 [binary/octet-stream]
Saving to: `/etc/apt/sources.list.d/hdp.list'

100%[======================================>] 161         --.-K/s   in 0s      

2014-08-20 11:06:00 (8.00 MB/s) - `/etc/apt/sources.list.d/hdp.list' saved [161/161]

$ gpg --keyserver pgp.mit.edu --recv-keys B9733A7A07513CAD
gpg: requesting key 07513CAD from hkp server pgp.mit.edu
gpg: /root/.gnupg/trustdb.gpg: trustdb created
gpg: key 07513CAD: public key "Jenkins (HDP Builds) <jenkin@hortonworks.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)

$ gpg -a --export 07513CAD | apt-key add -
OK

$ apt-get update
Hit http://us.archive.ubuntu.com precise Release.gpg
Hit http://extras.ubuntu.com precise Release.gpg                     
Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B]
Hit http://us.archive.ubuntu.com precise-updates Release.gpg                   
Get:2 http://public-repo-1.hortonworks.com HDP-UTILS Release.gpg [836 B]
Get:3 http://public-repo-1.hortonworks.com HDP Release.gpg [836 B]             
Hit http://us.archive.ubuntu.com precise-backports Release.gpg                 
Hit http://extras.ubuntu.com precise Release                                   
Get:4 http://security.ubuntu.com precise-security Release [50.7 kB]            
Get:5 http://public-repo-1.hortonworks.com HDP-UTILS Release [6,550 B]         
Hit http://us.archive.ubuntu.com precise Release                               
Hit http://extras.ubuntu.com precise/main Sources                              
Get:6 http://public-repo-1.hortonworks.com HDP Release [6,502 B]               
Hit http://us.archive.ubuntu.com precise-updates Release                       
Get:7 http://public-repo-1.hortonworks.com HDP-UTILS/main amd64 Packages [1,955 B]
Get:8 http://security.ubuntu.com precise-security/main Sources [108 kB]        
Get:9 http://public-repo-1.hortonworks.com HDP-UTILS/main i386 Packages [762 B]
                              . . .
Reading package lists... Done

You must add the Hadoop repository to all hosts in your Vertica cluster.
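On RPM-based platforms such as Red Hat or CentOS, the equivalent process uses yum instead of apt. A minimal sketch, using an illustrative repository URL (consult your Hadoop distribution's documentation for the actual URL for your version and platform):

$ wget http://public-repo-1.hortonworks.com/HDP/centos6/2.1.3.0/hdp.repo \
  -O /etc/yum.repos.d/hdp.repo
$ yum repolist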

Installing the Required Hadoop Packages

After configuring the repository, you are ready to install the Hadoop packages. The packages you need to install are:

  * hadoop
  * hadoop-hdfs
  * hadoop-client

The names of the packages are usually the same across all Hadoop and Linux distributions. These packages often have additional dependencies; always accept any additional packages that the Linux package manager asks to install.

To install these packages, use the package manager command for your Linux distribution (for example, apt-get on Debian and Ubuntu platforms, or yum on Red Hat and CentOS platforms).

Consult your Linux distribution's documentation for instructions on installing packages.

The following example demonstrates installing the required Hadoop packages from the Hortonworks 2.1 distribution on an Ubuntu 12.04 system.

# apt-get install hadoop hadoop-hdfs hadoop-client
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
  bigtop-jsvc hadoop-mapreduce hadoop-yarn zookeeper
The following NEW packages will be installed:
  bigtop-jsvc hadoop hadoop-client hadoop-hdfs hadoop-mapreduce hadoop-yarn
  zookeeper
0 upgraded, 7 newly installed, 0 to remove and 90 not upgraded.
Need to get 86.6 MB of archives.
After this operation, 99.8 MB of additional disk space will be used.
Do you want to continue [Y/n]? Y
Get:1 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      bigtop-jsvc amd64 1.0.10-1 [28.5 kB]
Get:2 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      zookeeper all 3.4.5.2.1.3.0-563 [6,820 kB]
Get:3 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      hadoop all 2.4.0.2.1.3.0-563 [21.5 MB]
Get:4 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      hadoop-hdfs all 2.4.0.2.1.3.0-563 [16.0 MB]
Get:5 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      hadoop-yarn all 2.4.0.2.1.3.0-563 [15.1 MB]
Get:6 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      hadoop-mapreduce all 2.4.0.2.1.3.0-563 [27.2 MB]
Get:7 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main 
      hadoop-client all 2.4.0.2.1.3.0-563 [3,650 B]
Fetched 86.6 MB in 1min 2s (1,396 kB/s)                                        
Selecting previously unselected package bigtop-jsvc.
(Reading database ... 197894 files and directories currently installed.)
Unpacking bigtop-jsvc (from .../bigtop-jsvc_1.0.10-1_amd64.deb) ...
Selecting previously unselected package zookeeper.
Unpacking zookeeper (from .../zookeeper_3.4.5.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop.
Unpacking hadoop (from .../hadoop_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-hdfs.
Unpacking hadoop-hdfs (from .../hadoop-hdfs_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-yarn.
Unpacking hadoop-yarn (from .../hadoop-yarn_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-mapreduce.
Unpacking hadoop-mapreduce (from .../hadoop-mapreduce_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-client.
Unpacking hadoop-client (from .../hadoop-client_2.4.0.2.1.3.0-563_all.deb) ...
Processing triggers for man-db ...
Setting up bigtop-jsvc (1.0.10-1) ...
Setting up zookeeper (3.4.5.2.1.3.0-563) ...
update-alternatives: using /etc/zookeeper/conf.dist to provide /etc/zookeeper/conf (zookeeper-conf) in auto mode.
Setting up hadoop (2.4.0.2.1.3.0-563) ...
update-alternatives: using /etc/hadoop/conf.empty to provide /etc/hadoop/conf (hadoop-conf) in auto mode.
Setting up hadoop-hdfs (2.4.0.2.1.3.0-563) ...
Setting up hadoop-yarn (2.4.0.2.1.3.0-563) ...
Setting up hadoop-mapreduce (2.4.0.2.1.3.0-563) ...
Setting up hadoop-client (2.4.0.2.1.3.0-563) ...
Processing triggers for libc-bin ...
ldconfig deferred processing now taking place
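
On RPM-based platforms, the same packages are installed with yum. A minimal sketch, assuming the package names match the Debian example above:

# yum install hadoop hadoop-hdfs hadoop-client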

Setting Configuration Parameters

You must set two configuration parameters to enable Vertica to restore HDFS data:

  * JavaBinaryForUDx: the path of the Java executable on your Vertica hosts.
  * HadoopHome: the directory that contains the Hadoop executable at bin/hadoop. The default value is /usr.

The following example demonstrates setting and then reviewing the values of these parameters.

=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';
=> SELECT get_config_parameter('JavaBinaryForUDx');
 get_config_parameter
----------------------
 /usr/bin/java
(1 row)
=> ALTER DATABASE mydb SET HadoopHome = '/usr';

=> SELECT get_config_parameter('HadoopHome');
 get_config_parameter
----------------------
 /usr
(1 row)
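
You can confirm from the shell that both values point to real files. With the settings shown above, Vertica looks for the Java binary at /usr/bin/java and the Hadoop executable at $HadoopHome/bin/hadoop, that is, /usr/bin/hadoop:

$ ls -l /usr/bin/java /usr/bin/hadoop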

There are additional parameters that you can optionally set:

Setting Kerberos Parameters

If your Vertica nodes are co-located on HDFS nodes and you are using Kerberos, you must change some Hadoop configuration parameters for restores from backups to work. In yarn-site.xml on every Vertica node, set the following parameters:

Parameter                                                        Value
yarn.resourcemanager.proxy-user-privileges.enabled               true
yarn.resourcemanager.proxyusers.*.groups                         *
yarn.resourcemanager.proxyusers.*.hosts                          *
yarn.resourcemanager.proxyusers.*.users                          *
yarn.timeline-service.http-authentication.proxyusers.*.groups    *
yarn.timeline-service.http-authentication.proxyusers.*.hosts     *
yarn.timeline-service.http-authentication.proxyusers.*.users     *

No changes are needed on HDFS nodes that are not also Vertica nodes.
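
In yarn-site.xml, each of these parameters is expressed as a <property> element inside the <configuration> block. A sketch of the first two entries (the remaining parameters follow the same pattern):

<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.proxyusers.*.groups</name>
  <value>*</value>
</property>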

Confirming that distcp Runs

After the packages are installed on all hosts in your cluster, your database should be able to run the Hadoop distcp command. To test it:

  1. Log into any host in your cluster as the database administrator.
  2. At the Bash shell, enter the command:

    $ hadoop distcp
  3. The command should print a message similar to the following:

    usage: distcp OPTIONS [source_path...] <target_path>
                  OPTIONS
     -async                 Should distcp execution be blocking
     -atomic                Commit all changes or none
     -bandwidth <arg>       Specify bandwidth per map in MB
     -delete                Delete from target, files missing in source
     -f <arg>               List of files that need to be copied
     -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
     -i                     Ignore failures during copy
     -log <arg>             Folder on DFS where distcp execution logs are
                            saved
     -m <arg>               Max number of concurrent maps to use for copy
     -mapredSslConf <arg>   Configuration for ssl config file, to use with
                            hftps://
     -overwrite             Choose to overwrite target files unconditionally,
                            even if they exist.
     -p <arg>               preserve status (rbugpc)(replication, block-size,
                            user, group, permission, checksum-type)
     -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n
                            bytes
     -skipcrccheck          Whether to skip CRC checks between source and
                            target paths.
     -strategy <arg>        Copy strategy to use. Default is dividing work
                            based on file sizes
     -tmp <arg>             Intermediate work path to be used for atomic
                            commit
     -update                Update target, copying only missingfiles or
                            directories
  4. Repeat these steps on the other hosts in your database to verify that all of the hosts can run distcp.
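
The usage message confirms only that the distcp command is installed and runnable. For a fuller check, you can copy a test file between HDFS paths. A minimal sketch, assuming a hypothetical NameNode at hadoopNN.example.com and a test directory that you have permission to read and write:

$ hadoop distcp hdfs://hadoopNN.example.com:8020/tmp/distcp-test/source \
    hdfs://hadoopNN.example.com:8020/tmp/distcp-test/dest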

Troubleshooting

If you cannot run the distcp command, try the following steps: