Creating External Tables

To create an external table, you combine a table definition with a COPY statement using the CREATE EXTERNAL TABLE AS COPY statement. CREATE EXTERNAL TABLE AS COPY uses a subset of parameters from CREATE TABLE and COPY.

You define your table columns as you would for a Vertica native table using CREATE TABLE. You also specify a COPY FROM clause to describe how to read the data, as you would for loading data. How you specify the FROM path depends on where the file is located and the data format. See Specifying Where to Load Data From and Data Formats.

When defining an external table for ORC or Parquet data, you must define all of the data columns in the file. You may omit partition columns. If you omit data columns, queries using the table abort with an error. For other data formats, you can select only the data columns of interest.

If you load from multiple ORC or Parquet files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

When you create an external table, data is not added to the database and no projections are created. Instead, Vertica performs a syntactic check of the CREATE EXTERNAL TABLE AS COPY statement and stores the table name and COPY statement definition in the catalog. Each time a SELECT query references an external table, Vertica parses and executes the stored COPY statement to obtain the referenced data. Any problems in the table definition, such as incorrect column types, can be discovered only by querying the table.

Successfully returning data from an external table requires that the COPY definition be correct, and that other dependencies, such as files, nodes, and other resources are accessible and available at query time. If the table definition uses globs (wildcards), and files are added or deleted, the data in the external table can change between queries.

The following example defines an external table for delimited data stored in HDFS:

=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT) 
    AS COPY FROM 'hdfs:///dat/ext1.csv' DELIMITER ',';

The following example uses data in the ORC format that is stored in S3. The data has two partition columns, represented as directories in the file structure. For more information about partitions, see Using Partition Columns.

=> CREATE EXTERNAL TABLE transactions (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 's3://datalake/sales/*/*/*' 
   ORC(hive_partition_cols='created,region');

The following example shows how you can read from all Parquet files in a local directory, with no partition columns:

=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT) 
    AS COPY FROM '/data/sales/*.parquet' PARQUET;

When using the ORC and Parquet formats, Vertica supports some additional options in the COPY statement and data structures for columns. See ORC (Parser) and PARQUET (Parser).

If ORC or Parquet data is partitioned, Vertica expects Hive-style partitioning. If you see unexpected results when reading data, verify that globs in your file paths correctly align with the partition structure. See Troubleshooting External Tables.
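
In a Hive-style layout, each partition column appears as one directory level named column=value. For example, the transactions table shown earlier could read files organized as follows (a hypothetical layout; the three glob levels in the COPY path match the two partition directories plus the file names):

s3://datalake/sales/created=2021-01-04/region=east/f0001.orc
s3://datalake/sales/created=2021-01-04/region=west/f0002.orc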

Special Considerations for External Tables

If the maximum length of a column is smaller than the actual data, such as a VARCHAR that is too short, Vertica truncates the data and logs the event.

You can see unexpected query results if constraints on columns cause values to be rejected:

  • If you specify a NOT NULL column constraint and the data contains null values, those rows are rejected.
  • If you use ENFORCELENGTH, values that are too long are rejected rather than being truncated (see the sketch after this list).
  • When reading ORC data, if you declare a scalar precision and some data does not fit, that row is rejected. For example, if you specify a column as Decimal(6,5), a value of 123.456 is rejected.
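
The following sketch shows a table definition that rejects over-length values rather than truncating them; the table name and file path are hypothetical:

=> CREATE EXTERNAL TABLE codes (code VARCHAR(4))
    AS COPY FROM '/data/codes/*.csv' DELIMITER ',' ENFORCELENGTH;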

One way to determine whether column constraints have caused data to be rejected is to compare COUNT on a column with COUNT(*): if the results differ, some rows were rejected.
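
For example, the following query, using the sales table defined earlier and one of its columns, compares the two counts. A smaller value in the first column can indicate rejected rows (note that COUNT on a column also excludes NULL values):

=> SELECT COUNT(price), COUNT(*) FROM sales;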

When using the COPY parameter ON ANY NODE, confirm that the source file definition is identical on all nodes. Specifying different external files can produce inconsistent results.

If your data is in Parquet or ORC format, you can take advantage of partitioning to limit the amount of data that Vertica reads. These formats are special in this respect because they embed metadata in the file headers. For more information about using partitioned data, see Using Partition Columns.

Canceling a CREATE EXTERNAL TABLE AS COPY statement can cause unpredictable results. If you realize after beginning the operation that your table definition is incorrect (for example, you inadvertently specify the wrong external location), wait for the query to complete. When the external table exists, use DROP TABLE to remove its definition.

When working with a new external data source, consider setting REJECTMAX to 1 to make problems in the data apparent. Testing in this way allows you to discover problems in the data before running production queries against it.
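
For example, a first test of a new source might look like the following sketch; the table name and path are hypothetical:

=> CREATE EXTERNAL TABLE ext_test (id INT, name VARCHAR(50))
    AS COPY FROM '/data/new_source/*.csv' DELIMITER ',' REJECTMAX 1;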

After you create an external table, analyze its row count to improve query performance. See Improving Query Performance for External Tables.
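
For example, the ANALYZE_EXTERNAL_ROW_COUNT function described in that topic can be called directly, assuming the sales table defined earlier:

=> SELECT ANALYZE_EXTERNAL_ROW_COUNT('sales');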

Required Permissions

In addition to having permission in Vertica, users must have read access to the external data.

  • For data on the local disk, this access is governed by local file permissions.
  • For data in HDFS, access might be governed by Kerberos authentication. See Accessing Kerberized HDFS Data.
  • For data on S3, you need access through an AWS IAM role. See S3 Object Store.
  • For data in GCS, you must enable S3 compatibility before reading data. See Google Cloud Storage (GCS) Object Store.

By default, you must also be a database superuser to access external tables through a SELECT statement.

In most cases, to allow users without superuser access to query external tables, an administrator must create a USER storage location and grant those users read access to the location. See CREATE LOCATION and GRANT (Storage Location). This location must be a parent of the path used in the COPY statement when creating the external table. This requirement does not apply to external tables stored in HDFS. The following example shows granting access to a user named Bob to any external table whose data is located under /tmp (including in subdirectories to any depth):

=> CREATE LOCATION '/tmp' ALL NODES USAGE 'user';  
=> GRANT ALL ON LOCATION '/tmp' TO Bob;

Organizing External Table Data

If the data you store in external tables changes regularly (for instance, monthly in the case of recent historical data), your COPY definition statement can use wildcards (globs) to make parsing the stored COPY statement definition more dynamic. For example, if you store monthly data on an NFS mount, you could organize monthly files within a top-level directory for a calendar year, such as:

/2018/monthly/

In this case, the external table COPY statement includes a wildcard definition such as the following:

=> CREATE EXTERNAL TABLE archive (...) AS COPY FROM '/nfs_name/2018/monthly/*';

Whenever a Vertica query references the external table archive, Vertica parses the stored COPY statement, and all data in the top-level monthly directory is accessible to the query.

Deriving a Table Definition from the Data

You can use the INFER_EXTERNAL_TABLE_DDL function to inspect Parquet or ORC data and produce a starting point for a table definition. This function returns a CREATE EXTERNAL TABLE statement, which might require further editing. For columns where the function cannot infer the data type, it labels the type as unknown and emits a warning. For VARCHAR and VARBINARY columns, you might need to adjust the length. Always review the statement the function returns; especially for tables with many columns, it can save considerable time and effort.

In the following example, the data contains one materialized column and two partition columns. Partition columns are always of unknown type.

=> SELECT INFER_EXTERNAL_TABLE_DDL('/data/sales/*/*/*'
          USING PARAMETERS format = 'parquet', table_name = 'sales');
WARNING 9311:  This generated statement is incomplete because of one or more unknown column types. Fix these data types before creating the table
                        INFER_EXTERNAL_TABLE_DDL
------------------------------------------------------------------------
 create external table "sales"(
  "tx_id" int, 
  "date" UNKNOWN, 
  "region" UNKNOWN
) as copy from '/data/sales/*/*/*' parquet(hive_partition_cols='date,region');
(1 row)

For Parquet files, you can use the GET_METADATA function to inspect a file and report metadata including information about columns.
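
For example, the following call reports the metadata of a single file; the path is hypothetical:

=> SELECT GET_METADATA('/data/sales/2021/f0001.parquet');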

Validating Table Definitions

When you create an external table, Vertica validates the syntax of the CREATE EXTERNAL TABLE AS COPY FROM statement. For example, if you omit a required keyword in the statement, creating the external table fails:

=> CREATE EXTERNAL TABLE ext (ts timestamp, d varchar) 
    AS COPY '/home/dbadmin/designer.log'; 
ERROR 2778:  COPY requires a data source; either a FROM clause or a WITH SOURCE for a user-defined source

Checking other components of the COPY definition, such as path statements and node availability, does not occur until a SELECT query references the external table.

To validate an external table definition, run a SELECT query that references the external table. Check that the returned query data is what you expect. If the query does not return data correctly, check the COPY exception and rejected data log files.
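
For example, a bounded query such as the following, using the sales table defined earlier, forces Vertica to execute the stored COPY statement without retrieving the full result set:

=> SELECT * FROM sales LIMIT 10;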

Because the COPY definition determines what occurs when you query an external table, COPY statement errors can reveal underlying problems. For more information about COPY exceptions and rejections, see Handling Messy Data.

Viewing External Table Definitions

When you create an external table, Vertica stores the COPY definition statement in the table_definition column of the TABLES system table.

To list all external tables, select the tables whose table_definition column is not empty, as shown:

=> SELECT * FROM TABLES WHERE table_definition <> '';

Use a query such as the following to list the external table definitions:

=> SELECT table_name, table_definition FROM TABLES;
 table_name |                                table_definition                                                   
------------+----------------------------------------------------------------------
 t1         | COPY            FROM 'TMPDIR/external_table.dat'  DELIMITER ','
 t1_copy    | COPY            FROM 'TMPDIR/external_table.dat'  DELIMITER ','
 t2         | COPY FROM 'TMPDIR/external_table2.dat' DELIMITER ','
(3 rows)