Requirements for HDFS Storage Locations
Caution:
If you use any HDFS storage locations, the HDFS data must be available at the time you start Vertica. Your HDFS cluster must be operational, and the ROS files must be present. If you have moved data files, or if they have become corrupted, or if your HDFS cluster is not responsive, Vertica cannot start.
To store Vertica's data on HDFS, verify that:
- Your Hadoop cluster has WebHDFS enabled.
- All of the nodes in your Vertica cluster can connect to all of the nodes in your Hadoop cluster. Any firewall between the two clusters must allow connections on the ports used by HDFS. See Testing Your Hadoop WebHDFS Configuration for a procedure to test the connectivity between your Vertica and Hadoop clusters.
- If your HDFS cluster is unsecured, you have a Hadoop user whose username matches the name of the Vertica database administrator (usually named dbadmin). This Hadoop user must have read and write access to the HDFS directory where you want Vertica to store its data.
- If your HDFS cluster uses Kerberos authentication, you have a Kerberos principal for Vertica, and it has read and write access to the HDFS directory that will be used for the storage location. See Configuring Kerberos. The Kerberos KDC must also be running.
- Your HDFS cluster has enough storage available for Vertica data. See HDFS Space Requirements below for details.
- The data you store in an HDFS-backed storage location does not expand your database's size beyond any data allowance in your Vertica license. Vertica counts data stored in an HDFS-backed storage location as part of any data allowance set by your license. See Managing Licenses in the Administrator's Guide for more information.
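Because HDFS-backed data counts toward that allowance, it can be useful to check current license compliance before adding a large HDFS storage location. A minimal check, run in vsql as the database administrator (the report format varies by Vertica version):
-- Report current database size against the license's raw data allowance.
SELECT GET_COMPLIANCE_STATUS();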
HDFS Space Requirements
If your Vertica database is K-safe, HDFS-based storage locations contain two copies of the data you store in them: the primary projection and the buddy projection. If you have enabled HDFS's data-redundancy feature, Hadoop stores both projections multiple times. This duplication might seem excessive, but it is similar to how RAID level 1 or higher stores redundant copies of both the primary and buddy projections. The redundant copies also improve HDFS performance by enabling multiple nodes to serve requests for a file.
Verify that your HDFS installation has sufficient space available for redundant storage of both the primary and buddy projections of your K-safe data. You can adjust the number of duplicates stored by HDFS by setting the HadoopFSReplication configuration parameter. See Troubleshooting HDFS Storage Locations for details.
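For example, because K-safety already gives each file a buddy copy within Vertica, some deployments lower the HDFS replication factor for the files Vertica writes. The statement below is a sketch only; choose the value that matches your redundancy requirements.
-- Reduce the number of copies HDFS keeps of each file Vertica writes.
SELECT SET_CONFIG_PARAMETER('HadoopFSReplication', 1);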
Additional Requirements for Backing Up Data Stored on HDFS
To back up your data stored in HDFS storage locations, your Hadoop cluster must have snapshotting enabled for the directories to be used for backups. The easiest way to do this is to give the database administrator's account superuser privileges in Hadoop, so that snapshotting can be enabled automatically. Alternatively, use Hadoop to enable snapshotting for each directory before using it for backups.
In addition, your Vertica database must:
- Have enough Hadoop components and libraries installed to run the Hadoop distcp command as the Vertica database-administrator user (usually dbadmin).
- Have the JavaBinaryForUDx and HadoopHome configuration parameters set correctly.
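For example, you might set these parameters with the SET_CONFIG_PARAMETER function. The paths below are placeholders: point JavaBinaryForUDx at the Java executable on your nodes, and HadoopHome at the directory that contains bin/hadoop.
-- Location of the Java executable used to run Hadoop tooling.
SELECT SET_CONFIG_PARAMETER('JavaBinaryForUDx', '/usr/bin/java');
-- Directory containing bin/hadoop (so that $HadoopHome/bin/hadoop exists).
SELECT SET_CONFIG_PARAMETER('HadoopHome', '/usr');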
Caution: After you have created an HDFS storage location, full database backups will fail with the error message:
ERROR 5127: Unable to create snapshot No such file /usr/bin/hadoop: check the HadoopHome configuration parameter
This error occurs because the backup script cannot back up the HDFS storage locations. You must configure Vertica and Hadoop to enable the backup script to back up these locations. After you complete that configuration, you can once again perform full database backups.
See Backing Up HDFS Storage Locations for details on configuring your Vertica and Hadoop clusters to enable HDFS storage location backup.
Best Practices for SQL on Apache Hadoop
If you are using the Vertica for SQL on Apache Hadoop product, OpenText recommends the following best practices for storage locations:
- Place only data type storage locations on HDFS storage.
- Place temp space directly on the local Linux file system, not in HDFS.
- For the best performance, place the Vertica catalog directly on the local Linux file system.
- Create the database first on a local Linux file system. Then, you can extend the database to HDFS storage locations and set storage policies that place data blocks exclusively on the HDFS storage location (see the sketch after this list).
- For better performance, if you are running Vertica only on a subset of the HDFS nodes, do not run the HDFS balancer on them. The HDFS balancer can move data blocks farther away, causing Vertica to read non-local data during query execution. Queries run faster when they do not require network I/O.
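The following sketch illustrates the create-locally-then-extend pattern: it assumes the database already exists on local storage, and the WebHDFS URL, port, label, and table name are placeholders. The exact CREATE LOCATION options (for example, whether the location must be SHARED) depend on your Vertica version.
-- Add an HDFS-backed data storage location on every node and give it a label.
CREATE LOCATION 'webhdfs://hadoopNameNode:50070/user/dbadmin/verticadata'
    ALL NODES USAGE 'data' LABEL 'coldstorage';
-- Use a storage policy to place a table's data on the labeled HDFS location.
SELECT SET_OBJECT_STORAGE_POLICY('myschema.mytable', 'coldstorage');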
Generally, HDFS requires approximately 2 GB of memory for each node in the cluster. To support this requirement in your Vertica configuration:
- Create a 2-GB resource pool.
- Do not assign any Vertica execution resources to this pool. This approach reserves the memory for use by HDFS.
Alternatively, use Ambari or Cloudera Manager to find the maximum heap size required by HDFS and set the size of the resource pool to that value.
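A minimal sketch of such a pool follows; the pool name is a placeholder, and 2G reflects the general guideline above (substitute the heap size reported by Ambari or Cloudera Manager if you use that approach).
-- Reserve memory for HDFS by dedicating it to a pool that no Vertica users run in.
CREATE RESOURCE POOL hdfs_reserve MEMORYSIZE '2G';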
For more about how to configure resource pools, see Managing Workloads.