Querying External Tables
After you create an external table, you can query it as you would query any other table. Suppose we have created the following external tables:
=> CREATE EXTERNAL TABLE catalog (id INT, description VARCHAR, category VARCHAR) AS COPY FROM 'hdfs:///dat/catalog.csv' DELIMITER ','; CREATE TABLE => CREATE EXTERNAL TABLE inventory(storeID INT, prodID INT, quantity INT) AS COPY FROM 'hdfs:///dat/inventory.csv' DELIMITER ','; CREATE TABLE
We can now write queries against these tables, such as the following:
=> SELECT * FROM catalog; id | description | category ----+----------------------+------------- 10 | 24in monitor | computers 11 | 27in monitor | computers 12 | 24in IPS monitor | computers 20 | 1TB USB drive | computers 21 | 2TB USB drive | computers 22 | 32GB USB thumb drive | computers 30 | 40in LED TV | electronics 31 | 50in LED TV | electronics 32 | 60in plasma TV | electronics (9 rows) => SELECT * FROM inventory; storeID | prodID | quantity ---------+--------+---------- 502 | 10 | 17 502 | 11 | 2 517 | 10 | 1 517 | 12 | 2 517 | 12 | 4 542 | 10 | 3 542 | 11 | 11 542 | 12 | 1 (8 rows) => SELECT inventory.storeID,catalog.description,inventory.quantity FROM inventory JOIN catalog ON inventory.prodID = catalog.id; storeID | description | quantity ---------+------------------+---------- 502 | 24in monitor | 17 517 | 24in monitor | 1 542 | 24in monitor | 3 502 | 27in monitor | 2 542 | 27in monitor | 11 517 | 24in IPS monitor | 2 517 | 24in IPS monitor | 4 542 | 24in IPS monitor | 1 (8 rows)
One important difference between external tables and Vertica-managed tables is that querying an external table reads the external data every time. (See How External Tables Differ from Vertica-Managed Tables.) Specifically, each time a select query references the external table, Vertica parses the COPY statement definition again to access the data. Certain errors in either your table definition or your data do not become apparent until you run a query, so test your external tables before deploying them in a production environment.
Querying external table data with an incorrect COPY FROM statement definition can potentially result in many rejected rows. To limit the number of rejections, Vertica sets the maximum number of retained rejections with the
ExternalTablesExceptionsLimit configuration parameter. The default value is 100. Setting the
–1 removes the limit, but is not recommended.
If COPY errors reach the maximum number of rejections, the external table query continues, but COPY generates a warning in the
vertica.log file and does not report subsequent rejected rows.
ExternalTablesExceptionsLimit configuration parameter differs from using the COPY statement
REJECTMAX parameter to set a low rejection threshold. The
REJECTMAX value controls how many rejected rows to permit before causing the load to fail. If COPY encounters a number of rejected rows equal to or greater than
REJECTMAX, COPY aborts execution instead of logging a warning in
Queries that include joins perform better if the smaller table is the inner one. For Vertica-managed tables, the query optimizer uses cardinality to choose the inner table. For external tables, the query optimizer uses the row count if available.
After you create an external table, use ANALYZE_EXTERNAL_ROW_COUNT to collect this information. Calling this function is potentially expensive because it has to materialize one column of the table to be able to count the rows, so do this analysis when your database is not busy with critical queries. (This is why Vertica does not perform this operation automatically when you create the table.)
The query optimizer uses the results of your most-recent call to this function when planning queries. If the volume of data changes significantly, therefore, you should run it again to provide updated statistics. A difference of a few percent does not matter, but if your data volume grows by 20% or more, you should repeat this operation when able.
Using External Tables with User-Defined Load (UDL) Functions
You can use external tables in conjunction with UDL functions that you create. For more information about using UDLs, see User Defined Load (UDL) in Extending Vertica.