HDFS URL Format
You specify the location of a file in HDFS using a URL. In most cases, you use the hdfs:/// URL prefix (three slashes) with COPY, and then specify the file path. The hdfs scheme uses the Libhdfs++ library to read files and is more efficient than WebHDFS.
Do not use hdfs:/// when creating a storage location, because you do not want a storage location to depend on the value of HadoopConfDir, which can change. Use this shorthand only for reading external data.
The following example loads data stored in HDFS.
=> COPY users FROM 'hdfs:///data/users.csv';
Vertica uses the fs.defaultFS Hadoop configuration parameter to find the NameNode, which it uses to access the data. You can instead specify a host and port explicitly using the following format: hdfs://host:port/. The specified host is the NameNode, not an individual data node. If you are using High Availability (HA) NameNodes, do not use an explicit host, because high availability is provided through nameservices instead.
Your HDFS cluster might use High Availability NameNodes or define nameservices. If so, you should use the nameservice instead of the host and port, in the format hdfs://nameservice/. The nameservice you specify must be defined in hdfs-site.xml.
The following example shows how you can use a nameservice, hadoopNS, with the hdfs scheme.
=> CREATE EXTERNAL TABLE users (id INT, name VARCHAR(20)) AS COPY FROM 'hdfs://hadoopNS/data/users.csv';
If you are using Vertica to access data from more than one HDFS cluster, always use explicit nameservices or hosts in the URL. Using hdfs:/// could produce unexpected results because Vertica uses the first value of fs.defaultFS that it finds. To access multiple HDFS clusters, you must use host and service names that are globally unique. See Configuring the hdfs Scheme for more information.
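If you generate load statements from a script, one way to catch the hdfs:/// shorthand before it reaches a multi-cluster environment is to check each URL for an explicit host or nameservice. The following is a minimal sketch in Python using only the standard library. The function name has_explicit_authority and the host namenode.example.com are hypothetical and are not part of Vertica.

```python
from urllib.parse import urlsplit


def has_explicit_authority(url):
    """Return True if an hdfs URL names a host or nameservice.

    Hypothetical helper, not a Vertica API. A URL such as
    hdfs:///data/users.csv has an empty authority and therefore
    relies on fs.defaultFS.
    """
    parts = urlsplit(url)
    if parts.scheme != "hdfs":
        raise ValueError("expected a URL that uses the hdfs scheme")
    # netloc is the text between '//' and the next '/': empty for hdfs:///
    return parts.netloc != ""


# hadoopNS comes from the example above; namenode.example.com is made up.
for url in (
    "hdfs:///data/users.csv",
    "hdfs://hadoopNS/data/users.csv",
    "hdfs://namenode.example.com:8020/data/users.csv",
):
    print(url, has_explicit_authority(url))
```

This check only inspects the URL text. It does not confirm that the nameservice is defined in hdfs-site.xml or that the host is reachable.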
All characters in URLs that are not a–z, A–Z, 0–9, '-', '.', '_' or '~' must be converted to URL encoding (%NN where NN is a two-digit hexadecimal number). For example, use %20 for space.
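As an illustration of the encoding rule, the following Python sketch builds an hdfs URL from a file path and percent-encodes every character outside the unreserved set listed above. It assumes that the slashes separating path components stay unencoded and that non-ASCII characters are encoded as UTF-8 bytes. The function name hdfs_url and the sample paths are hypothetical.

```python
from urllib.parse import quote


def hdfs_url(path, authority=""):
    """Build an hdfs URL with a percent-encoded path.

    Hypothetical helper, not a Vertica API. authority is '' (use
    fs.defaultFS), 'host:port', or a nameservice.
    """
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    # quote() leaves a-z, A-Z, 0-9, '-', '.', '_' and '~' unchanged;
    # safe="/" additionally keeps the path separators (an assumption).
    return "hdfs://" + authority + quote(path, safe="/")


print(hdfs_url("/data/my users.csv"))          # space becomes %20
print(hdfs_url("/data/a+b.csv", "hadoopNS"))   # plus sign becomes %2B
```

The resulting string can then be placed in the FROM clause of a COPY statement in the same way as the URLs in the earlier examples.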