Configuring Vertica for HCatalog

Before you can use the HCatalog Connector, you must add certain Hadoop and Hive libraries to your Vertica installation.  You must also copy the Hadoop configuration files that specify various connection properties. Vertica uses the values in those configuration files to make its own connections to Hadoop.

You need only make these changes on one node in your cluster. After you do this, you can install the HCatalog Connector.

Copy Hadoop Libraries and Configuration Files

Vertica provides a tool, hcatUtil, to collect the required files from Hadoop. This tool copies selected libraries and XML configuration files from your Hadoop cluster to your Vertica cluster. This tool might also need access to additional libraries:

  • If you plan to use Hive to query files that use Snappy compression, you need access to the Snappy native libraries, libhadoop*.so and libsnappy*.so.

  • If you plan to use Hive to query files that use LZO compression, you need access to the hadoop-lzo-*.jar and libgplcompression.so* libraries. In core-site.xml you must also edit the io.compression.codecs property to include com.hadoop.compression.lzo.LzopCodec; a quick check is sketched after the lists below.

  • If you plan to use a JSON SerDe with a Hive table, you need access to its library. This is the same library that you used to configure Hive; for example:

    hive> add jar /home/release/json-serde-1.3-jar-with-dependencies.jar;
    
    hive> create external table nationjson (id int,name string,rank int,text string) 
          ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
          LOCATION '/user/release/vt/nationjson'; 
  • If you are using any other libraries that are not standard across all supported Hadoop versions, you need access to those libraries.

If any of these cases applies to you, do one of the following:

  • Include the path(s) in the path you specify as the value of --hadoopHiveHome, or
  • Copy the file(s) to a directory already on that path.
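
For the LZO case, you can confirm the codec registration before running hcatUtil. The following is a minimal sketch; $HADOOP_CONF_DIR and the codec list shown are illustrative, not prescriptive:

  # Sketch: confirm that core-site.xml registers the LZO codec.
  grep -A 2 'io.compression.codecs' "$HADOOP_CONF_DIR/core-site.xml"
  # The <value> element should include com.hadoop.compression.lzo.LzopCodec,
  # for example:
  #   <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzopCodec</value>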

If Vertica is not co-located on a Hadoop node, you should do the following:

  1. Copy /opt/vertica/packages/hcat/tools/hcatUtil to a Hadoop node and run it there, specifying a temporary output directory. Your Hadoop, Hive, and HCatalog lib paths might be different. In newer versions of Hadoop the HCatalog directory is usually a subdirectory under the Hive directory, and Cloudera creates a new directory for each revision of the configuration files. Use the values from your environment in the following command:

     hcatUtil --copyJars \
         --hadoopHiveHome="$HADOOP_HOME/lib;$HIVE_HOME/lib;/hcatalog/dist/share" \
         --hadoopHiveConfPath="$HADOOP_CONF_DIR;$HIVE_CONF_DIR;$WEBHCAT_CONF_DIR" \
         --hcatLibPath="/tmp/hadoop-files"

     If you are using Hive LLAP, specify the hive2 directories.

  2. Verify that all necessary files were copied:

     hcatUtil --verifyJars --hcatLibPath=/tmp/hadoop-files

  3. Copy that output directory (/tmp/hadoop-files, in this example) to /opt/vertica/packages/hcat/lib on the Vertica node you will connect to when installing the HCatalog connector; see the example after these steps. If you are updating a Vertica cluster to use a new Hadoop cluster (or a new version of Hadoop), first remove all JAR files in /opt/vertica/packages/hcat/lib except vertica-hcatalogudl.jar.

  4. Verify that all necessary files were copied:

     hcatUtil --verifyJars --hcatLibPath=/opt/vertica/packages/hcat/lib
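
For step 3, one way to move the collected files is with scp. This is a sketch; the hostname vertica-node and the paths are examples, so adjust them for your environment:

  # Run from the Hadoop node: copy the collected files to the Vertica node.
  scp -r /tmp/hadoop-files/* vertica-node:/opt/vertica/packages/hcat/lib/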

If Vertica is co-located on some or all Hadoop nodes, you can do this in one step on a shared node. Your Hadoop, Hive, and HCatalog lib paths might be different; use the values from your environment in the following command:

hcatUtil --copyJars \
    --hadoopHiveHome="$HADOOP_HOME/lib;$HIVE_HOME/lib;/hcatalog/dist/share" \
    --hadoopHiveConfPath="$HADOOP_CONF_DIR;$HIVE_CONF_DIR;$WEBHCAT_CONF_DIR" \
    --hcatLibPath="/opt/vertica/packages/hcat/lib"
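
After the copy completes, you can run the same verification as in the multi-node procedure. A minimal sketch, assuming the files landed in the default lib directory:

  hcatUtil --verifyJars --hcatLibPath=/opt/vertica/packages/hcat/lib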

The hcatUtil script has the following arguments:

-c, --copyJars

Copy the required JAR files from hadoopHiveHome and configuration files from hadoopHiveConfPath.

-v, --verifyJars

Verify that the required files are present in hcatLibPath. Check the output of hcatUtil for error and warning messages.

--hadoopHiveHome="value1;value2;..."

Paths to the Hadoop, Hive, and HCatalog home directories. Separate directories by semicolons (;). Enclose paths in double quotes.

Always place $HADOOP_HOME on the path before $HIVE_HOME. In some Hadoop distributions, these two directories contain different versions of the same library.

--hadoopHiveConfPath="value1;value2;..."

Paths of the following configuration files:

  • hive-site.xml
  • core-site.xml
  • yarn-site.xml
  • webhcat-site.xml (optional with the default configuration; required if you use WebHCat instead of HiveServer2)
  • hdfs-site.xml

Separate directories by semicolons (;). Enclose paths in double quotes.

In previous releases of Vertica this parameter was optional under some conditions. It is now required.

--hcatLibPath="value"

Output path for the libraries and configuration files. On a Vertica node, use /opt/vertica/packages/hcat/lib. If you have previously run hcatUtil with a different version of Hadoop, first remove the old JAR files from the output directory (all except vertica-hcatalogudl.jar).
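
If you need that cleanup, the following is a hedged one-liner, assuming the default path and GNU find:

  # Sketch: remove previously copied JARs, keeping vertica-hcatalogudl.jar.
  find /opt/vertica/packages/hcat/lib -maxdepth 1 -name '*.jar' \
       ! -name 'vertica-hcatalogudl.jar' -delete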

After you have copied the files and verified them, install the HCatalog connector.

Install the HCatalog Connector

On the node to which you copied the files with hcatUtil, install the HCatalog Connector by running the install.sql script. This script resides in the ddl/ folder under your HCatalog connector installation path. It creates the library and defines VHCatSource and VHCatParser.
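
For example, a minimal sketch assuming the default installation path under /opt/vertica/packages/hcat:

  # Run on the node that holds the files copied by hcatUtil.
  vsql -f /opt/vertica/packages/hcat/ddl/install.sql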

The data that was copied using hcatUtil is now stored in the database. If you change any of those values in Hadoop, you need to rerun hcatUtil and install.sql. The following statement returns the names of the libraries and configuration files currently being used:

=> SELECT dependencies FROM user_libraries WHERE lib_name='VHCatalogLib';

Now you can create HCatalog schema parameters, which point to your existing Hadoop services, as described in Defining a Schema Using the HCatalog Connector.

Upgrading to a New Version of Vertica

After upgrading to a new version of Vertica, perform the following steps:

  1. Uninstall the HCatalog Connector using the uninstall.sql script. This script resides in the ddl/ folder under your HCatalog connector installation path.

  2. Delete the contents of the hcatLibPath directory except for vertica-hcatalogudl.jar.

  3. Rerun hcatUtil.

  4. Reinstall the HCatalog Connector using the install.sql script.
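
Taken together, the sequence looks roughly like this. This is a sketch; the paths assume the default layout, and the hcatUtil invocation is abbreviated:

  vsql -f /opt/vertica/packages/hcat/ddl/uninstall.sql   # 1. uninstall
  find /opt/vertica/packages/hcat/lib -mindepth 1 \
       ! -name 'vertica-hcatalogudl.jar' -delete         # 2. clean hcatLibPath
  # 3. rerun hcatUtil --copyJars as described earlier, then:
  vsql -f /opt/vertica/packages/hcat/ddl/install.sql     # 4. reinstall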

For more information about upgrading Vertica, see Upgrade Vertica.

Additional Options for Hadoop Columnar File Formats

When reading Hadoop columnar file formats (ORC or Parquet), the HCatalog Connector attempts to use the built-in readers. When doing so, it uses the hdfs scheme by default. To use the hdfs scheme, you must first perform the configuration described in Configuring the hdfs Scheme.

To have the HCatalog Connector use the webhdfs scheme instead, use ALTER DATABASE to set HDFSUseWebHDFS to 1. This setting applies to all HDFS access, not just the HCatalog Connector.
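
For example, a minimal sketch; mydb is a placeholder for your database name:

  # Sketch: make the connector (and all other HDFS access) use webhdfs.
  vsql -c "ALTER DATABASE mydb SET PARAMETER HDFSUseWebHDFS = 1;"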