CREATE HCATALOG SCHEMA

Define a schema for data stored in a Hive data warehouse using the HCatalog Connector. For more information, see Using the HCatalog Connector in Integrating with Apache Hadoop.

Most of the optional parameters are read out of Hadoop configuration files if available. If you copied the Hadoop configuration files as described in Configuring Vertica for HCatalog, you can omit most parameters. By default this statement uses the values specified in those configuration files. If the configuration files are complete, the following is a valid statement:

=> CREATE HCATALOG SCHEMA hcat;

If a value is not specified in the configuration files and a default is shown in the parameter list, then that default value is used.

Some parameters apply only if you are using HiveServer2 (the default). Others apply only if you are using WebHCat, a legacy Hadoop service. When using HiveServer2, use HIVESERVER2_HOSTNAME to specify the server host. When using WebHCat, use WEBSERVICE_HOSTNAME to specify the server host.

If you need to use WebHCat you must also set the HCatalogConnectorUseHiveServer2 configuration parameter to 0. See Apache Hadoop Parameters.

After creating the schema, you can change many (but not all) parameters using ALTER HCATALOG SCHEMA.

Syntax

CREATE HCATALOG SCHEMA [IF NOT EXISTS] schemaName
    [AUTHORIZATION user-id] 
    [WITH [param=value [,...] ] ]

Arguments

Argument Description
[IF NOT EXISTS]

If given, the statement exits without an error when the schema named in schemaName already exists.

schemaName

The name of the schema to create in the Vertica catalog. The tables in the Hive database will be available through this schema.

AUTHORIZATION user-id The name of a Vertica account to own the schema being created. This parameter is ignored if Kerberos authentication is being used; in that case the current vsql user is used.

Parameters

Parameter Description
HOSTNAME

The hostname, IP address, or URI of the database server that stores the Hive data warehouse's metastore information.

If you specify this parameter and do not also specify PORT, then this value must be in the URI format used for hive.metastore.uris in hive-site.xml.

If the Hive metastore supports High Availability, you can specify a comma-separated list of URIs for this value.

If this value is not specified, hive-site.xml must be available.

PORT The port number on which the metastore database is running. If you specify this parameter, you must also specify HOSTNAME and it must be a name or IP address (not a URI).
HIVESERVER2_HOSTNAME

The hostname or IP address of the HiveServer2 service. This parameter is optional if in hive-site.xml you set one of the following properties:

  • hive.server2.thrift.bind.host to a valid host
  • hive.server2.support.dynamic.service.discovery to true

This parameter is ignored if you are using WebHCat.

WEBSERVICE_HOSTNAME The hostname or IP address of the WebHCat service, if using WebHCat instead of HiveServer2. If this value is not specified, webhcat-site.xml must be available.
WEBSERVICE_PORT The port number on which the WebHCat service is running, if using WebHCat instead of HiveServer2. If this value is not specified, webhcat-site.xml must be available.
WEBHDFS_ADDRESS The host and port ("host:port") for the WebHDFS service. This parameter is used only for reading ORC and Parquet files. If this value is not set, hdfs-site.xml must be available to read these file types through the HCatalog Connector.
HCATALOG_SCHEMA The name of the Hive schema or database that the Vertica schema is being mapped to. The default is schemaName.
CUSTOM_PARTITIONS Whether the Hive schema uses custom partition locations ('YES' or 'NO'). If the schema uses custom partition locations, then Vertica queries Hive to get those locations when executing queries. These additional Hive queries can be expensive, so use this parameter only if you need to. The default is 'NO' (disabled). For more information, see Using Partitioned Data in Integrating with Apache Hadoop.
HCATALOG_USER The username of the HCatalog user to use when making calls to the HiveServer2 or WebHCat server. The default is the current database user.
HCATALOG_CONNECTION_TIMEOUT The number of seconds the HCatalog Connector waits for a successful connection to the HiveServer or WebHCat server. A value of 0 means wait indefinitely.
HCATALOG_SLOW_TRANSFER_LIMIT The lowest data transfer rate (in bytes per second) from the HiveServer2 or WebHCat server that the HCatalog Connector accepts. See HCATALOG_SLOW_TRANSFER_TIME for details.
HCATALOG_SLOW_TRANSFER_TIME The number of seconds the HCatalog Connector waits before enforcing the data transfer rate lower limit. After this time has passed, the HCatalog Connector tests whether the data transfer rate is at least as fast as the value set in HCATALOG_SLOW_TRANSFER_LIMIT. If it is not, then the HCatalog Connector breaks the connection and terminates the query.
SSL_CONFIG The path of the Hadoop ssl-client.xml configuration file. This parameter is required if you are using HiveServer2 and it uses SSL wire encryption. This parameter is ignored if you are using WebHCat.

The default values for HCATALOG_CONNECTOR_TIMEOUT, HCATALOG_SLOW_TRANSFER_LIMIT, and HCATALOG_SLOW_TRANSFER_TIME are set by the database configuration parameters HCatConnectionTimeout, HCatSlowTransferLimit, and HCatSlowTransferTime. See Apache Hadoop Parameters in the Administrator's Guide for more information.

Configuration Files

The HCatalog Connector uses the following values from the Hadoop configuration files if you do not override them when creating the schema.

File Properties
hive-site.xml hive.server2.thrift.bind.host (used for HIVESERVER2_HOSTNAME)
  hive.server2.thrift.port
  hive.server2.transport.mode
  hive.server2.authentication
  hive.server2.authentication.kerberos.principal
  hive.server2.support.dynamic.service.discovery
  hive.zookeeper.quorum (used as HIVESERVER2_HOSTNAME if dynamic service discovery is enabled)
  hive.zookeeper.client.port
  hive.server2.zookeeper.namespace
  hive.metastore.uris (used for HOSTNAME and PORT)
ssl-client.xml ssl.client.truststore.location
  ssl.client.truststore.password

Privileges

The user must be a superuser or be granted all permissions on the database to use this statement.

The user also requires access to Hive data in one of the following ways:

  • Have USAGE permissions on hcatalog_schema, if Hive does not use an authorization service (Sentry or Ranger) to manage access.
  • Have permission through an authorization service, if Hive uses it to manage access. In this case you must either set EnableHCatImpersonation to 0, to access data as the Vertica principal, or grant users access to the HDFS data. For Sentry, you can use ACL synchronization to manage HDFS access.
  • Be the dbadmin user, with or without an authorization service.

Examples

The following example shows how to use CREATE HCATALOG SCHEMA to define a new schema for tables stored in a Hive database and then query the system tables that contain information about those tables:

=> CREATE HCATALOG SCHEMA hcat WITH HOSTNAME='hcathost' PORT=9083 
   HCATALOG_SCHEMA='default' HIVESERVER2_HOSTNAME='hs.example.com' 
   SSL_CONFIG='/etc/hadoop/conf/ssl-client.xml' HCATALOG_USER='admin';
CREATE SCHEMA
=> \x
Expanded display is on.

=> SELECT * FROM v_catalog.hcatalog_schemata;
-[ RECORD 1 ]----------------+-------------------------------------------
schema_id                    | 45035996273748224
schema_name                  | hcat
schema_owner_id              | 45035996273704962
schema_owner                 | admin
create_time                  | 2017-12-05 14:43:03.353404-05
hostname                     | hcathost
port                         | -1
hiveserver2_hostname         | hs.example.com
webservice_hostname          |
webservice_port              | 50111
webhdfs_address              | hs.example.com:50070
hcatalog_schema_name         | default
ssl_config                   | /etc/hadoop/conf/ssl-client.xml 
hcatalog_user_name           | admin
hcatalog_connection_timeout  | -1
hcatalog_slow_transfer_limit | -1
hcatalog_slow_transfer_time  | -1
custom_partitions            | f

=> SELECT * FROM v_catalog.hcatalog_table_list;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273748224
table_schema       | hcat
hcatalog_schema    | default
table_name         | nation
hcatalog_user_name | admin
-[ RECORD 2 ]------+------------------
table_schema_id    | 45035996273748224
table_schema       | hcat
hcatalog_schema    | default
table_name         | raw
hcatalog_user_name | admin
-[ RECORD 3 ]------+------------------
table_schema_id    | 45035996273748224
table_schema       | hcat
hcatalog_schema    | default
table_name         | raw_rcfile
hcatalog_user_name | admin
-[ RECORD 4 ]------+------------------
table_schema_id    | 45035996273748224
table_schema       | hcat
hcatalog_schema    | default
table_name         | raw_sequence
hcatalog_user_name | admin

The following example shows how to specify more than one metastore host.

=> CREATE HCATALOG SCHEMA hcat
   WITH HOSTNAME='thrift://node1.example.com:9083,thrift://node2.example.com:9083';

The following example shows how to include custom partition locations:

=> CREATE HCATALOG SCHEMA hcat WITH HCATALOG_SCHEMA='default'
	HIVESERVER2_HOSTNAME='hs.example.com' 
	CUSTOM_PARTITIONS='yes';