Configuring Kerberos
Vertica can connect with Hadoop in several ways, and how you manage Kerberos authentication varies by connection type. If you use Kerberos, you must use it for both your HDFS and Vertica clusters.
Prerequisite: Set Up Users and the Keytab File
If you have not already configured Kerberos authentication for Vertica, follow the instructions in Configure Vertica for Kerberos Authentication. Of particular importance for Hadoop integration:
- Create one Kerberos principal per node.
- Place the keytab file(s) in the same location on each database node and set KerberosKeytabFile to its location. (See Specify the Location of the Keytab File.)
- Set KerberosServiceName to the name of the principal, as shown in the example after this list. (See Inform Vertica About the Kerberos Principal.)
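For example, a minimal sketch of setting both parameters, assuming a database named exampledb, a keytab at /etc/vertica.keytab, and a service principal named vertica (all hypothetical values; substitute your own):
=> ALTER DATABASE exampledb SET KerberosKeytabFile = '/etc/vertica.keytab';
=> ALTER DATABASE exampledb SET KerberosServiceName = 'vertica';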
Reads with the hdfs Scheme
Vertica can access files stored in HDFS using the hdfs URL scheme instead of using WebHDFS. Vertica authenticates using the current user's Kerberos principal, not the database's Kerberos principal. No additional Kerberos-specific configuration is required.
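For example, a read through the hdfs scheme is an ordinary load or external table definition; this sketch assumes a hypothetical table named sales and ORC files under /data/sales in HDFS:
=> COPY sales FROM 'hdfs:///data/sales/*.orc' ORC;
Because the hdfs scheme authenticates with the current user's Kerberos ticket, the statement succeeds as long as that user has read access to those files in HDFS.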
HCatalog Connector
You use the HCatalog Connector to query data in Hive. The HCatalog Connector executes queries on behalf of Vertica users. If the current user has a Kerberos key, then Vertica passes that key to the HCatalog Connector automatically. Verify that the administrator of your HDFS cluster has granted HDFS access to all users who need access to Hive.
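As an illustration, once an HCatalog schema has been defined (see Configuring Vertica for HCatalog), querying Hive data is an ordinary SELECT that runs with the current user's Kerberos credentials; the schema and table names here are hypothetical:
=> SELECT * FROM hcat_schema.hive_sales LIMIT 10;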
In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that you enable all Hadoop components to impersonate the Vertica user. The easiest way to do so is to set the proxyuser property using wildcards for all users on all hosts and in all groups. Consult your Hadoop documentation for instructions. Make sure you set this property before running hcatUtil (see Configuring Vertica for HCatalog).
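For reference, a sketch of the wildcard proxyuser settings in core-site.xml, assuming the impersonating Hadoop service runs as a user named hive (substitute the service users your distribution actually uses, and consult your Hadoop documentation for the authoritative property names):
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>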
HDFS Storage Location
You can create a database storage location in HDFS. An HDFS storage location provides improved performance compared to other HDFS interfaces (such as the HCatalog Connector). By loading the data into Vertica rather than querying it in place through an external table, you can reduce query response times.
To use a storage location in HDFS with Kerberos, take the following steps:
- Create a Kerberos principal for each Vertica node as described under Prerequisites.
- Give all node principals read and write permission to the HDFS directory you will use as a storage location.
If you plan to back up your HDFS storage locations, take the following additional steps:
- Grant Hadoop superuser privileges to the new principals.
- Configure backups, including setting the HadoopConfigDir configuration parameter (see the example after this list), following the instructions in Configuring Hadoop and Vertica to Enable Backup of HDFS Storage.
- Configure user impersonation to be able to restore from backups following the instructions in "Setting Kerberos Parameters" in Configuring Vertica to Restore HDFS Storage Locations.
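A minimal sketch of setting HadoopConfigDir, assuming a database named exampledb and that your Hadoop configuration files (core-site.xml and hdfs-site.xml) live under /etc/hadoop/conf on every database node (hypothetical values):
=> ALTER DATABASE exampledb SET HadoopConfigDir = '/etc/hadoop/conf';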
Because the keytab file supplies the principal used to create the location, you must have it in place before creating the storage location. After you deploy keytab files to all database nodes, use the CREATE LOCATION statement to create the storage location as usual.
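For example, a sketch of creating the storage location once the keytab files are deployed; the HDFS path and label are hypothetical, and the exact CREATE LOCATION options (such as SHARED) depend on your Vertica version:
=> CREATE LOCATION 'hdfs:///vertica/colddata' ALL NODES SHARED USAGE 'data' LABEL 'coldstorage';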
HDFS Connector
The HDFS Connector loads data from HDFS into Vertica on behalf of the user. If the user performing the data load has a Kerberos key, then the connector uses it to access HDFS. Verify that all users who use this connector have been granted access to HDFS.
The HDFS Connector is deprecated. Use the other Hadoop interfaces instead.
Verifying Kerberos Configuration
Use the KERBEROS_HDFS_CONFIG_CHECK metafunction to verify that Vertica can use Kerberos to access HDFS. You can call it with no parameters to test all paths described in the Hadoop configuration files, or you can specify hdfs, webhdfs, and WebHCat servers to test individually.
=> SELECT KERBEROS_HDFS_CONFIG_CHECK();

=> SELECT KERBEROS_HDFS_CONFIG_CHECK('node1.example.com:9433', 'node2.example.com:10443', 'node2.example.com:14443');
This function does not yet check access to HiveServer2.
Token Expiration
Vertica attempts to automatically refresh Hadoop tokens before they expire, but you can also set a minimum refresh frequency if you prefer. Use the HadoopFSTokenRefreshFrequency configuration parameter to specify the frequency in seconds:
=> ALTER DATABASE exampledb SET HadoopFSTokenRefreshFrequency = '86400';
If the current age of the token is greater than the value specified in this parameter, Vertica refreshes the token before accessing data stored in HDFS.
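To confirm the value in effect, you can check the parameter with SHOW DATABASE, assuming the same exampledb database as above and a Vertica version that supports SHOW DATABASE for database-level parameters:
=> SHOW DATABASE exampledb HadoopFSTokenRefreshFrequency;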