Creating an External Table with an HDFS Source
You can use the HDFS Connector as a source for an external table, which lets you directly query the contents of files on the Hadoop Distributed File System (HDFS). See Using External Tables in the Administrator's Guide for more information on external tables. If your HDFS data is in ORC or Parquet format, using the special readers for those formats might provide better performance. See Reading Hadoop Columnar File Formats.
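For example, if the files on HDFS are in ORC format, you could define the external table with the ORC parser instead of the HDFS Connector source. The following is a minimal sketch, assuming a Vertica release that includes the ORC parser; the table definition and path are hypothetical, and the exact syntax is described in Reading Hadoop Columnar File Formats.

-- Hypothetical table and path; assumes a release that provides the ORC parser.
CREATE EXTERNAL TABLE orcExample (A VARCHAR(10), B INTEGER, C INTEGER, D INTEGER)
    AS COPY FROM 'hdfs:///user/hadoopUser/example/orc_output/*' ORC;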
Using an external table to access data stored on an HDFS cluster is useful when you need to extract data from files that are periodically updated, or from directories to which additional files are added on HDFS. It saves you from having to drop previously loaded data and then reload the data using a COPY statement. The external table always accesses the current version of the files on HDFS.
Note: An external table performs a bulk load each time it is queried. Querying it is significantly slower than querying an internal Vertica table. You should use external tables only for infrequently-run queries (such as daily reports). If you need to query the content of the HDFS files frequently, either use COPY to load the entire content of the files into Vertica, or save the results of a query run on the external table to an internal table, which you then use for repeated queries.
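One way to do the latter is to materialize the current contents of the external table into an internal table with CREATE TABLE ... AS SELECT, and then run your repeated queries against the internal table. The following is a minimal sketch; the table names are hypothetical.

-- Hypothetical names: hdfsExternalTable is an external table defined as shown later
-- in this topic; hdfsLocalCopy is an internal Vertica table holding a snapshot of it.
CREATE TABLE hdfsLocalCopy AS SELECT * FROM hdfsExternalTable;
-- Subsequent queries read the internal copy instead of reloading from HDFS.
SELECT COUNT(*) FROM hdfsLocalCopy;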
To create an external table that reads data from HDFS, use the HDFS User-Defined Source (UDS) in a CREATE EXTERNAL TABLE AS COPY statement. The COPY portion of this statement has the same format as the COPY statement used to load data from HDFS. See Loading Data Using the HDFS Connector for more information.
The following simple example shows how to create an external table that extracts data from every file in the /user/hadoopUser/example/output directory using the HDFS Connector.
=> CREATE EXTERNAL TABLE hadoopExample (A VARCHAR(10), B INTEGER, C INTEGER, D INTEGER)
->    AS COPY SOURCE Hdfs(url=
->    'http://hadoop01:50070/webhdfs/v1/user/hadoopUser/example/output/*',
->    username='hadoopUser');
CREATE TABLE
=> SELECT * FROM hadoopExample;
   A   | B | C | D
-------+---+---+---
 test1 | 1 | 2 | 3
 test1 | 3 | 4 | 5
(2 rows)
Later, after another Hadoop job adds contents to the output directory, querying the table produces different results:
=> SELECT * FROM hadoopExample;
   A   | B  | C  | D
-------+----+----+----
 test3 | 10 | 11 | 12
 test3 | 13 | 14 | 15
 test2 |  6 |  7 |  8
 test2 |  9 |  0 | 10
 test1 |  1 |  2 |  3
 test1 |  3 |  4 |  5
(6 rows)
Load Errors in External Tables
Normally, querying an external table on HDFS does not produce any errors, even if the underlying COPY statement rejects rows (for example, rows containing columns whose contents are incompatible with the data types in the table). Rejected rows are handled the same way as in a standard COPY statement: they are written to a rejected data file and noted in the exceptions file. For more information on how COPY handles rejected rows and exceptions, see Capturing Load Rejections and Exceptions in the Administrator's Guide.
Rejected data and exceptions files are created on each Vertica node that loads data through the Vertica Connector for HDFS. You cannot specify a single node to receive all of the rejected row and exception information.
Note: Because the connector is read-only, there is no way to store rejection and exception information on HDFS.
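If you want the rejected rows and exceptions written to locations you choose, you can try adding the standard COPY EXCEPTIONS and REJECTED DATA clauses to the external table definition. The following is a minimal sketch, assuming your Vertica version accepts these clauses in CREATE EXTERNAL TABLE AS COPY; the table name and local file paths are hypothetical, and each node that loads data writes its own files.

-- Hypothetical table name and local paths; each loading node writes its own
-- rejected-data and exceptions files to these locations on its local file system.
CREATE EXTERNAL TABLE hadoopExampleWithLogs (A VARCHAR(10), B INTEGER, C INTEGER, D INTEGER)
    AS COPY SOURCE Hdfs(url='http://hadoop01:50070/webhdfs/v1/user/hadoopUser/example/output/*',
                        username='hadoopUser')
    REJECTED DATA '/tmp/hadoopExample_rejected.txt'
    EXCEPTIONS '/tmp/hadoopExample_exceptions.log';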
Fatal errors during the transfer of data (for example, specifying files that do not exist on HDFS) are not reported until you query the external table. The following example shows what happens if you recreate the table based on a file that does not exist on HDFS.
=> DROP TABLE hadoopExample;
DROP TABLE
=> CREATE EXTERNAL TABLE hadoopExample (A INTEGER, B INTEGER, C INTEGER, D INTEGER)
->    AS COPY SOURCE HDFS(url='http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt',
->    username='hadoopUser');
CREATE TABLE
=> SELECT * FROM hadoopExample;
ERROR 0: Error calling plan() in User Function HdfsFactory at [src/Hdfs.cpp:222], error code: 0, message: No files match [http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt]
Note that the connector does not attempt to read the file until you actually query the table; only then does it return an error.