Configuring Kerberos
Vertica can connect with Hadoop in several ways, and how you manage Kerberos authentication varies by connection type. If you use Kerberos, you must use it for both your HDFS and Vertica clusters.
Vertica can interact with more than one Kerberos realm. To configure multiple realms, see Multi-realm Support in Security and Authentication.
Prerequisite: Set Up Users and the Keytab File
If you have not already configured Kerberos authentication for Vertica, follow the instructions in Configure Vertica for Kerberos Authentication. Of particular importance for Hadoop integration:
- Create one Kerberos principal per node.
- Place the keytab files in the same location on each database node and set configuration parameter KerberosKeytabFile to that location.
- Set KerberosServiceName to the name of the principal. (See Inform Vertica About the Kerberos Principal.)
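For example, assuming the keytab file has been copied to /krb5/vertica.keytab on every node and the principals use the service name vertica (both values are placeholders for your own deployment, and exampledb is a sample database name), you could set these parameters as follows:

```sql
-- Point all nodes at the keytab file (same path on every node):
=> ALTER DATABASE exampledb SET KerberosKeytabFile = '/krb5/vertica.keytab';
-- Tell Vertica the service name portion of the principal:
=> ALTER DATABASE exampledb SET KerberosServiceName = 'vertica';
```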
Reads with the hdfs Scheme
Vertica can access files stored in HDFS using the hdfs URL scheme instead of WebHDFS. Vertica authenticates using the current user's Kerberos principal, not the database's Kerberos principal. No additional Kerberos-specific configuration is required.
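For example, a user who holds a valid Kerberos ticket can load HDFS data directly with COPY. (The table name and path below are illustrative only.)

```sql
-- Vertica authenticates to HDFS as the current user's Kerberos principal:
=> COPY sales FROM 'hdfs:///data/sales/2017/*.csv' DELIMITER ',';
```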
HCatalog Connector
You use the HCatalog Connector to query data in Hive. How you configure the HCatalog Connector depends on how Hive manages authorization.
- If Hive uses Sentry to manage authorization, and if Sentry uses ACL synchronization, then the HCatalog Connector must access Hive as the current user. Verify that the EnableHCatImpersonation configuration parameter is set to 1 (the default). ACL synchronization automatically provides authorized users with read access to the underlying HDFS files.
- If Hive uses Sentry without ACL synchronization, then the HCatalog Connector must access Hive data as the Vertica principal. (The user still authenticates and accesses metadata normally.) Set the EnableHCatImpersonation configuration parameter to 0. The Vertica principal must have read access through Sentry.
- If Hive uses Ranger to manage authorization, and the Vertica users have read access to the underlying HDFS files, then you can use user impersonation. Verify that the EnableHCatImpersonation configuration parameter is set to 1 (the default). You can, instead, disable user impersonation and give the Vertica principal read access to the HDFS files.
- If Hive uses either Sentry or Ranger, the HCatalog Connector must use HiveServer2 (the default). WebHCat does not support authorization services.
- If Hive does not use an authorization service, or if you are connecting to Hive using WebHCat instead of HiveServer2, then the HCatalog Connector accesses Hive as the current user. Verify that EnableHCatImpersonation is set to 1. All users must have read access to the underlying HDFS files.
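For the cases above that require disabling impersonation, such as Sentry without ACL synchronization, you can set the parameter at the database level. (The database name exampledb is a placeholder.)

```sql
-- Access Hive data as the Vertica principal instead of the current user:
=> ALTER DATABASE exampledb SET EnableHCatImpersonation = 0;
```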
In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that you enable all Hadoop components to impersonate the Vertica user. The easiest way to do so is to set the proxyuser property using wildcards for all users on all hosts and in all groups. Consult your Hadoop documentation for instructions. Make sure you set this property before running hcatUtil (see Configuring Vertica for HCatalog).
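For example, assuming your database runs under an OS user named vertica (adjust the user name to match your deployment), a wildcard proxyuser configuration in core-site.xml might look like the following:

```xml
<!-- Allow the vertica user to be impersonated from any host... -->
<property>
  <name>hadoop.proxyuser.vertica.hosts</name>
  <value>*</value>
</property>
<!-- ...and on behalf of members of any group. -->
<property>
  <name>hadoop.proxyuser.vertica.groups</name>
  <value>*</value>
</property>
```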
HDFS Storage Location
You can create a database storage location in HDFS. An HDFS storage location provides improved performance compared to other HDFS interfaces (such as the HCatalog Connector). Because the data is stored in Vertica's native format rather than read through an external table, query response times are reduced.
To use a storage location in HDFS with Kerberos, take the following steps:
- Create a Kerberos principal for each Vertica node as described under Prerequisites.
- Give all node principals read and write permission to the HDFS directory you will use as a storage location.
If you plan to back up your HDFS storage locations, take the following additional steps:
- Grant Hadoop superuser privileges to the new principals.
- Configure backups, including setting the HadoopConfigDir configuration parameter, following the instructions in Configuring Hadoop and Vertica to Enable Backup of HDFS Storage.
- Configure user impersonation to be able to restore from backups following the instructions in "Setting Kerberos Parameters" in Configuring Vertica to Restore HDFS Storage Locations.
Because the keytab file supplies the principal used to create the location, you must have it in place before creating the storage location. After you deploy keytab files to all database nodes, use the CREATE LOCATION statement to create the storage location as usual.
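For example, after the keytab files are deployed, a statement like the following creates an HDFS storage location on all nodes. (The nameservice, path, and label are placeholders for your own values.)

```sql
-- Create a shared data storage location in HDFS on all nodes:
=> CREATE LOCATION 'hdfs://hadoopNS/vertica/colddata' ALL NODES SHARED
   USAGE 'data' LABEL 'coldstorage';
```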
Verifying Kerberos Configuration
Use the KERBEROS_HDFS_CONFIG_CHECK metafunction to verify that Vertica can use Kerberos to access HDFS. You can call it with no parameters to test all paths described in the Hadoop configuration files. Alternatively, you can specify hdfs, webhdfs, and WebHCat servers to test individually.
=> SELECT KERBEROS_HDFS_CONFIG_CHECK();
=> SELECT KERBEROS_HDFS_CONFIG_CHECK('node1.example.com:9433', 'node2.example.com:10443', 'node2.example.com:14443');
This function does not yet check access to HiveServer2.
Token Expiration
Vertica attempts to automatically refresh Hadoop tokens before they expire, but you can also set a minimum refresh frequency if you prefer. Use the HadoopFSTokenRefreshFrequency configuration parameter to specify the frequency in seconds:
=> ALTER DATABASE exampledb SET HadoopFSTokenRefreshFrequency = '86400';
If the current age of the token is greater than the value specified in this parameter, Vertica refreshes the token before accessing data stored in HDFS.