Requirements for Backing Up and Restoring HDFS Storage Locations

There are several considerations for backing up and restoring HDFS storage locations:

  • The HDFS directory for the storage location must have snapshotting enabled. You can either directly configure this yourself or enable the database administrator’s Hadoop account to do it for you automatically; a minimal example of enabling snapshots appears after this list. See Hadoop Configuration for Backup and Restore for more information.
  • If the Hadoop cluster uses Kerberos, Vertica nodes must have access to certain Hadoop configuration files. See Configuring Kerberos below.
  • To restore an HDFS storage location, your Vertica cluster must be able to run the Hadoop distcp command. See Configuring distcp on a Vertica Cluster below.
  • HDFS storage locations do not support object-level backups. You must perform a full database backup to back up the data in your HDFS storage locations.
  • Data in an HDFS storage location is backed up to HDFS. This backup guards against accidental deletion or corruption of data. It does not prevent data loss in the case of a catastrophic failure of the entire Hadoop cluster. To prevent data loss, you must have a backup and disaster recovery plan for your Hadoop cluster.

    Data stored on the Linux native file system is still backed up to the location you specify in the backup configuration file. The vbr backup script handles that data and the data in HDFS storage locations separately.
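
As a minimal sketch of the first requirement, you (or the Hadoop administrator) can enable snapshots on the storage location's directory with the standard HDFS administration command; the path shown here is only a placeholder for your actual storage location directory:

$ hdfs dfsadmin -allowSnapshot /user/dbadmin/verticadata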

Configuring Kerberos

If HDFS uses Kerberos, then to back up your HDFS storage locations you must take the following additional steps:

  1. Grant Hadoop superuser privileges to the Kerberos principals for each Vertica node.
  2. Copy Hadoop configuration files to your database nodes as explained in Accessing Hadoop Configuration Files. Vertica needs access to core-site.xml, hdfs-site.xml, and yarn-site.xml for backup and restore. If your Vertica nodes are co-located on HDFS nodes, these files are already present.
  3. Set the HadoopConfDir configuration parameter to the location of the directory containing these files. If the files are spread across multiple directories, the value can be a colon-separated list of paths, as in the following example:

    => ALTER DATABASE exampledb SET HadoopConfDir = '/etc/hadoop/conf:/etc/hadoop/test';

    All three configuration files must be present on this path on every database node.
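
To confirm the setting on your database, you can query the configuration_parameters system table, following the same pattern used later in this document; this is only a verification sketch:

=> SELECT current_value FROM configuration_parameters WHERE parameter_name = 'HadoopConfDir';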

If your Vertica nodes are co-located on HDFS nodes and you are using Kerberos, you must also change some Hadoop configuration parameters. These changes are needed in order for restoring from backups to work. In yarn-site.xml on every Vertica node, set the following parameters:

Parameter                                                       Value
yarn.resourcemanager.proxy-user-privileges.enabled              true
yarn.resourcemanager.proxyusers.*.groups                        *
yarn.resourcemanager.proxyusers.*.hosts                         *
yarn.resourcemanager.proxyusers.*.users                         *
yarn.timeline-service.http-authentication.proxyusers.*.groups   *
yarn.timeline-service.http-authentication.proxyusers.*.hosts    *
yarn.timeline-service.http-authentication.proxyusers.*.users    *

No changes are needed on HDFS nodes that are not also Vertica nodes.
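
As an illustration only, each of these settings appears in yarn-site.xml as a standard Hadoop property element. The following sketch shows the first parameter from the table; the remaining parameters follow the same pattern:

<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>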

Configuring distcp on a Vertica Cluster

Your Vertica cluster must be able to run the Hadoop distcp command to restore a backup of an HDFS storage location. The easiest way to enable your cluster to run this command is to install several Hadoop packages on each node. These packages must be from the same distribution and version of Hadoop that is running on your Hadoop cluster.

The steps you need to take depend on:

  • The distribution and version of Hadoop running on the Hadoop cluster containing your HDFS storage location.
  • The distribution of Linux running on your Vertica cluster.

Installing the Hadoop packages necessary to run distcp does not turn your Vertica database into a Hadoop cluster. This process installs just enough of the Hadoop support files on your cluster to run the distcp command. There is no additional overhead placed on the Vertica cluster, aside from a small amount of additional disk space consumed by the Hadoop support files.

Configuration Overview

The steps for configuring your Vertica cluster to restore backups of HDFS storage locations are:

  1. If necessary, install and configure a Java runtime on the hosts in the Vertica cluster.
  2. Find the location of your Hadoop distribution's package repository.
  3. Add the Hadoop distribution's package repository to the Linux package manager on all hosts in your cluster.
  4. Install the necessary Hadoop packages on your Vertica hosts.
  5. Set two configuration parameters in your Vertica database related to Java and Hadoop.
  6. Confirm that the Hadoop distcp command runs on your Vertica hosts.

The following sections describe these steps in greater detail.

Installing a Java Runtime

Your Vertica cluster must have a Java Virtual Machine (JVM) installed to run the Hadoop distcp command. It already has a JVM installed if you have configured it to run Java UDxs or to access Hadoop data using the HCatalog Connector.

If your Vertica database has a JVM installed, verify that your Hadoop distribution supports it. See your Hadoop distribution's documentation to determine which JVMs it supports.

If the JVM installed on your Vertica cluster is not supported by your Hadoop distribution, you must uninstall it and then install a JVM that is supported by both Vertica and your Hadoop distribution. See Vertica SDKs for a list of the JVMs compatible with Vertica.

If your Vertica cluster does not have a JVM (or its existing JVM is incompatible with your Hadoop distribution), follow the instructions in Installing the Java Runtime on Your Vertica Cluster.
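
To check whether a JVM is already installed, and which version it is, run the standard version check from the Bash shell on each node (the exact output depends on the vendor and version of the installed JVM):

$ java -version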

Finding Your Hadoop Distribution's Package Repository

Many Hadoop distributions have their own installation system, such as Cloudera Manager or Ambari. However, they also support manual installation using native Linux packages such as RPM and .deb files. These package files are maintained in a repository. You can configure your Vertica hosts to access this repository to download and install Hadoop packages.

Consult your Hadoop distribution's documentation to find the location of its Linux package repository. This information is often located in the portion of the documentation covering manual installation techniques.

Each Hadoop distribution maintains separate repositories for each of the major Linux package management systems. Find the specific repository for the Linux distribution running your Vertica cluster. Be sure that the package repository that you select matches the version used by your Hadoop cluster.

Configuring Vertica Nodes to Access the Hadoop Distribution’s Package Repository

Configure the nodes in your Vertica cluster so they can access your Hadoop distribution's package repository. Your Hadoop distribution's documentation should explain how to add the repositories to your Linux platform. If the documentation does not explain how to add the repository to your packaging system, refer to your Linux distribution's documentation.

The steps you need to take depend on the package management system your Linux platform uses. Usually, the process involves:

  • Downloading a configuration file.
  • Adding the configuration file to the package management system's configuration directory.
  • For Debian-based Linux distributions, adding the Hadoop repository encryption key to the root account keyring.
  • Updating the package management system's index to have it discover new packages.

You must add the Hadoop repository to all hosts in your Vertica cluster.
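
As a sketch only, on a yum-based system the process typically looks like the following; the repository file name and URL are placeholders for the ones published by your Hadoop vendor:

# Download the vendor's repository definition into yum's configuration directory
$ sudo curl -o /etc/yum.repos.d/hadoop.repo http://example.com/hadoop/hadoop.repo
# Rebuild the package index so yum can discover the new packages
$ sudo yum makecache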

Installing the Required Hadoop Packages

After configuring the repository, you are ready to install the Hadoop packages. The packages you need to install are:

  • hadoop
  • hadoop-hdfs
  • hadoop-client

The names of the packages are usually the same across all Hadoop and Linux distributions. These packages often have additional dependencies. Always accept any additional packages that the Linux package manager asks to install.

To install these packages, use the package manager command for your Linux distribution:

  • On Red Hat and CentOS, the package manager command is yum.
  • On Debian and Ubuntu, the package manager command is apt-get.
  • On SUSE the package manager command is zypper.

Consult your Linux distribution's documentation for instructions on installing packages.
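
For example, on a yum-based system the installation command would look like the following; accept any additional dependencies that yum asks to install:

$ sudo yum install hadoop hadoop-hdfs hadoop-client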

Setting Configuration Parameters

You must set two Hadoop configuration parameters to enable Vertica to restore HDFS data:

  • JavaBinaryForUDx is the path to the Java executable. You may have already set this value to use Java UDxs or the HCatalog Connector. You can find the path for the default Java executable from the Bash command shell using the command:

    $ which java
  • HadoopHome is the directory that contains bin/hadoop (that is, the parent of the bin directory that holds the Hadoop executable). The default value for this parameter is /usr, which is correct if your Hadoop executable is located at /usr/bin/hadoop.

The following example shows how to set and then review the values of these parameters:

=> ALTER DATABASE DEFAULT SET PARAMETER JavaBinaryForUDx = '/usr/bin/java';
=> SELECT current_value FROM configuration_parameters WHERE parameter_name = 'JavaBinaryForUDx';
 current_value
---------------
 /usr/bin/java
(1 row)
=> ALTER DATABASE DEFAULT SET HadoopHome = '/usr';
=> SELECT current_value FROM configuration_parameters WHERE parameter_name = 'HadoopHome';
 current_value
---------------
 /usr
(1 row)

You can also set the following parameters:

  • HadoopFSReadRetryTimeout and HadoopFSWriteRetryTimeout specify how long to wait before failing. The default value for each is 180 seconds. If you are confident that your file system will fail more quickly, you can improve performance by lowering these values; see the example after this list.

  • HadoopFSReplication specifies the number of replicas HDFS makes. By default, the Hadoop client chooses this; Vertica uses the same value for all nodes.

    Do not change this setting unless directed otherwise by Vertica support.

  • HadoopFSBlockSizeBytes is the block size to write to HDFS; larger files are divided into blocks of this size. The default is 64MB.
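
For example, the following sketch lowers the read retry timeout to 90 seconds; the value is illustrative, and assumes you have confirmed that your file system reports failures well within that window:

=> ALTER DATABASE DEFAULT SET HadoopFSReadRetryTimeout = 90;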

Confirming that distcp Runs

After the packages are installed on all hosts in your cluster, your database should be able to run the Hadoop distcp command. To test it:

  1. Log into any host in your cluster as the database superuser.
  2. At the Bash shell, enter the command:

    $ hadoop distcp
  3. The command should print a message similar to the following:

    usage: distcp OPTIONS [source_path...] <target_path>
                  OPTIONS
     -async                 Should distcp execution be blocking
     -atomic                Commit all changes or none
     -bandwidth <arg>       Specify bandwidth per map in MB
     -delete                Delete from target, files missing in source
     -f <arg>               List of files that need to be copied
     -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
     -i                     Ignore failures during copy
     -log <arg>             Folder on DFS where distcp execution logs are
                            saved
     -m <arg>               Max number of concurrent maps to use for copy
     -mapredSslConf <arg>   Configuration for ssl config file, to use with
                            hftps://
     -overwrite             Choose to overwrite target files unconditionally,
                            even if they exist.
     -p <arg>               preserve status (rbugpc)(replication, block-size,
                            user, group, permission, checksum-type)
     -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n
                            bytes
     -skipcrccheck          Whether to skip CRC checks between source and
                            target paths.
     -strategy <arg>        Copy strategy to use. Default is dividing work
                            based on file sizes
     -tmp <arg>             Intermediate work path to be used for atomic
                            commit
     -update                Update target, copying only missingfiles or
                            directories
  4. Repeat these steps on the other hosts in your database to verify that all of the hosts can run distcp.

Troubleshooting

If you cannot run the distcp command, try the following steps:

  • If Bash cannot find the hadoop command, you may need to manually add Hadoop's bin directory to the system search path. An alternative is to create a symbolic link in an existing directory in the search path (such as /usr/bin) to the hadoop binary; see the example after this list.
  • Ensure the version of Java installed on your Vertica cluster is compatible with your Hadoop distribution.
  • Review the Linux package installation tool's logs for errors. In some cases, packages may not be fully installed, or may not have been downloaded due to network issues.
  • Ensure that the database administrator account has permission to execute the hadoop command. You might need to add the account to a specific group in order to allow it to run the necessary commands.
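
As a sketch of the symbolic link workaround described in the first item, assuming a hypothetical Hadoop installation directory of /opt/hadoop:

# Make the hadoop binary visible on the default search path
$ sudo ln -s /opt/hadoop/bin/hadoop /usr/bin/hadoop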