HDFS File System (Libhdfs++)
HDFS is the Hadoop Distributed File System. This page describes the hdfs
scheme, which uses the Libhdfs++ library. You can also access data in HDFS through WebHDFS. Regardless of which scheme you use, the WebHDFS service must be available. For more information about WebHDFS, see the Hadoop documentation.
URI Format
One of the following:
hdfs://[nameservice]/path
hdfs://namenode-host:port/path
Characters may be URL-encoded (%NN where NN is a two-digit hexadecimal number) but are not required to be, except that the '%' character must be encoded.
To use the default name service specified in the HDFS configuration files, omit nameservice and use hdfs:///. Use this shorthand only for reading external data, not for creating a storage location.
Always specify a name service or host explicitly when using Vertica with more than one HDFS cluster. The name service or host name must be globally unique. Using hdfs:/// could produce unexpected results because Vertica uses the first value of fs.defaultFS that it finds.
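For example, with a name service named hadoopNS defined as the default in the Hadoop configuration files, the following two COPY statements read the same external data, the first with an explicit name service and the second with the shorthand. This is a minimal sketch; the table name and path are illustrative.
=> COPY sales FROM 'hdfs://hadoopNS/data/sales.csv';
=> COPY sales FROM 'hdfs:///data/sales.csv';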
Authentication
Vertica can use Kerberos authentication with Cloudera or Hortonworks HDFS clusters. See Accessing Kerberized HDFS Data.
Configuration Parameters
The following database configuration parameters apply to the HDFS file system. You can set parameters globally and for the current session with ALTER DATABASE…SET PARAMETER and ALTER SESSION…SET PARAMETER, respectively. For more information about these parameters, see Apache Hadoop Parameters.
| Parameter | Description |
|---|---|
| EnableHDFSBlockInfoCache | Boolean, specifies whether to distribute block location metadata collected during planning on the initiator to all database nodes for execution, reducing NameNode contention. Disabled by default. |
| HadoopConfDir | Directory path containing the XML configuration files copied from Hadoop. The same path must be valid on every Vertica node. The files are accessed by the Linux user under which the Vertica server process runs. |
| HadoopImpersonationConfig | Session parameter specifying the delegation token or Hadoop user for HDFS access. See HadoopImpersonationConfig Format for information about the value of this parameter and Proxy Users and Delegation Tokens for more general context. |
| HDFSUseWebHDFS | Boolean, specifies whether URLs in the hdfs scheme use WebHDFS instead of Libhdfs++ for file access. |
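For example, the following statements set HadoopConfDir globally and enable the block-info cache for the current session. This is a minimal sketch: the path is illustrative, and DEFAULT refers to the current database.
=> ALTER DATABASE DEFAULT SET PARAMETER HadoopConfDir = '/etc/hadoop/conf';
=> ALTER SESSION SET PARAMETER EnableHDFSBlockInfoCache = 1;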
Configuration Files
The path specified in HadoopConfDir must include a directory containing the files listed in the following table. Vertica reads these files at database start time. If you do not set a value, Vertica looks for the files in /etc/hadoop/conf.
If a property is not defined, Vertica uses the defaults shown in the table. If no default is specified for a property, the configuration files must specify a value.
| File | Properties | Default |
|---|---|---|
| core-site.xml | fs.defaultFS | none |
| | (for doAs users:) hadoop.proxyuser.username.users | none |
| | (for doAs users:) hadoop.proxyuser.username.hosts | none |
| hdfs-site.xml | dfs.client.failover.max.attempts | 15 |
| | dfs.client.failover.sleep.base.millis | 500 |
| | dfs.client.failover.sleep.max.millis | 15000 |
| | (For HA NN:) dfs.nameservices | none |
| | (WebHDFS:) dfs.namenode.http-address or dfs.namenode.https-address | none |
| | (WebHDFS:) dfs.datanode.http.address or dfs.datanode.https.address | none |
If you are using High Availability (HA) NameNodes, the individual NameNodes must also be defined in hdfs-site.xml.
If you are using Eon Mode with communal storage on HDFS and you set dfs.encrypt.data.transfer, you must use the swebhdfs scheme for communal storage.
To verify that Vertica can find configuration files in HadoopConfDir, use the VERIFY_HADOOP_CONF_DIR function.
To test access through the hdfs scheme, use the HDFS_CLUSTER_CONFIG_CHECK function.
For more information about testing your configuration, see Verifying HDFS Configuration.
To reread the configuration files, use the CLEAR_HDFS_CACHES function.
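For example (results depend on your cluster configuration):
=> SELECT VERIFY_HADOOP_CONF_DIR();
=> SELECT HDFS_CLUSTER_CONFIG_CHECK();
=> SELECT CLEAR_HDFS_CACHES();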
Examples
The following example creates an external table using data in HDFS, specifying a name service.
=> CREATE EXTERNAL TABLE users (id INT, name VARCHAR(20)) AS COPY FROM 'hdfs://hadoopNS/data/users.csv';
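The following sketch shows the same table defined with an explicit NameNode host and port instead of a name service; the host name and port are placeholders for your own cluster.
=> CREATE EXTERNAL TABLE users (id INT, name VARCHAR(20)) AS COPY FROM 'hdfs://namenode.example.com:8020/data/users.csv';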