Reading Directly from HDFS

When reading files from HDFS, you can use the hdfs scheme instead of the webhdfs scheme. Using the hdfs scheme can improve performance and stability by bypassing the WebHDFS service.

You can use the hdfs scheme with COPY and with CREATE EXTERNAL TABLE AS COPY. When using the hdfs scheme with COPY, you do not need to specify ON ANY NODE.

To support direct access, Vertica requires access to certain configuration files from your HDFS cluster. See Configuring the hdfs Scheme.

URL Format

You specify the location of a file in HDFS using a URL. In most cases, you use the hdfs:/// URL prefix (three slashes), and then specify the file path. Because this form names no host or nameservice, Vertica uses the fs.defaultFS Hadoop configuration parameter to locate the cluster that holds the data. The following example loads data stored in HDFS.

=> COPY t FROM 'hdfs:///opt/data/file1.dat';

You can specify a host and port explicitly using the following format: hdfs://host:port/. The specified host is the Name Node, not an individual data node. If you are using High Availability (HA) Name Nodes, do not use an explicit host, because high availability is provided through nameservices instead.
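
As a sketch of the explicit form, the following statement loads the same file as the earlier example but names the Name Node directly. The host namenode.example.com and the port 8020 are placeholders chosen for illustration; substitute the values used by your own cluster.

=> -- Placeholder host and port (assumptions); replace with your Name Node's values.
=> COPY t FROM 'hdfs://namenode.example.com:8020/opt/data/file1.dat';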

If your HDFS cluster uses High Availability Name Nodes or defines nameservices, use the nameservice instead of the host and port, in the format hdfs://nameservice/. The nameservice you specify must be defined in hdfs-site.xml.

The following example shows how you can use a nameservice, hadoopNS, with the hdfs scheme.

=> CREATE EXTERNAL TABLE tt (a1 INT, a2 VARCHAR(20))
	AS COPY FROM 'hdfs://hadoopNS/data/file.csv';

If you are using Vertica to access data from more than one HDFS cluster, always use explicit nameservices or hosts in the URL. Using hdfs:/// could produce unexpected results because Vertica uses the first value of fs.defaultFS that it finds. To access multiple HDFS clusters, you must use host names and nameservice names that are globally unique. See Configuring the hdfs Scheme for more information.
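
For instance, loading from two clusters might look like the following sketch. The nameservices clusterA and clusterB are hypothetical; each would have to be defined in the hdfs-site.xml configuration available to Vertica, and the two names must differ from each other.

=> -- clusterA and clusterB are hypothetical nameservice names (assumptions).
=> COPY t FROM 'hdfs://clusterA/opt/data/file1.dat';
=> COPY t FROM 'hdfs://clusterB/opt/data/file1.dat';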

Note: All characters in URLs that are not a–z, A–Z, 0–9, '-', '.', '_' or '~' must be converted to URL encoding (%NN where NN is a two-digit hexadecimal number). For example, use %20 for space.
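
As an illustration of the encoding rule, a file whose name contains a space would be referenced as shown below. The file name "my file.dat" is hypothetical.

=> -- Hypothetical file "my file.dat"; the space is written as %20.
=> COPY t FROM 'hdfs:///opt/data/my%20file.dat';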

Kerberos Authentication

If the file you want to read resides on an HDFS cluster that uses Kerberos authentication, Vertica uses the current user's principal to authenticate. It does not use the database's principal.

You can use the KERBEROS_HDFS_CONFIG_CHECK metafunction to verify that Vertica is correctly configured for Kerberos access.
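
A minimal invocation might look like the following. This sketch assumes the function can be called without arguments; see the function's reference page for the parameters it accepts and for how to interpret its output.

=> -- Assumes the no-argument form is supported; check the function reference.
=> SELECT KERBEROS_HDFS_CONFIG_CHECK();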