Creating External Tables
To create an external table you combine a table definition with a copy statement using the CREATE EXTERNAL TABLE AS COPY statement. With this statement, you define your table columns as you would for a Vertica-managed database using CREATE TABLE. You also specify a COPY FROM clause to describe how to read the data, as you would for loading data. CREATE EXTERNAL TABLE AS COPY
uses a subset of parameters from CREATE TABLE
and COPY
.
How you specify the FROM path depends on where the file is located. See Specifying Where to Load Data From.
When you create an external table, data is not added to the database and no projections are created. Instead, Vertica performs a syntactic check of the CREATE EXTERNAL TABLE AS COPY
statement and stores the table name and COPY
statement definition in the catalog. Each time a SELECT
query references an external table, Vertica parses and executes the stored COPY
statement to obtain the referenced data. Any problems in the table definition, such as incorrect column types, can be discovered only by querying the table.
Successfully returning data from an external table requires that the COPY
definition be correct, and that other dependencies, such as files, nodes, and other resources are accessible and available at query time. If the table definition uses globs (wildcards), and files are added or deleted, the data in the external table can change between queries.
The following example defines an external table for data stored in HDFS:
=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT) AS COPY FROM 'hdfs:///dat/ext1.csv' DELIMITER ',';
The following example uses data in the Parquet format that is stored in S3:
=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT) AS COPY FROM 's3://datalake/sales/*.parquet' PARQUET;
When using the ORC and Parquet formats, Vertica supports some additional options in the COPY
statement and data structures for columns. See Using External Tables with ORC and Parquet Data.
If ORC or Parquet data is partitioned, Vertica expects Hive-style partitioning. If you see unexpected results when reading data, verify that globs in your file paths correctly align with the partition structure. See Troubleshooting Reads from ORC and Parquet Files.
Special Considerations for External Tables
If the maximum length of a column is smaller than the actual data, such as a VARCHAR that is too short, Vertica truncates the data and logs the event.
You can see unexpected query results if constraints on columns cause values to be rejected:
- If you specify a NOT NULL column constraint and the data contains null values, those rows are rejected.
- If you use ENFORCELENGTH, values that are too long are rejected rather than being truncated.
- When reading ORC data, if you declare a scalar precision and some data does not fit, that row is rejected. For example, if you specify a column as Decimal(6,5), a value of 123.456 is rejected.
One way to know if column constraints have caused data to be rejected is if COUNT on a column returns a different value than COUNT(*).
When using the COPY parameter ON ANY NODE
, confirm that the source file definition is identical on all nodes. Specifying different external files can produce inconsistent results.
If your data is in Parquet or ORC format, you can take advantage of partitioning to limit the amount of data that Vertica reads. These formats are special in this respect because they embed metadata in the file headers. For more information about using partitioned data, see Using Partition Columns.
Canceling a CREATE EXTERNAL TABLE AS COPY statement can cause unpredictable results. If you realize after beginning the operation that your table definition is incorrect (for example, you inadvertently specify the wrong external location), wait for the query to complete. When the external table exists, use DROP TABLE to remove its definition.
When working with a new external data source, consider setting REJECTMAX to 1 to make problems in the data apparent. Testing in this way allows you to discover problems in the data before running production queries against it.
After you create an external table, analyze its row count to improve query performance. See Improving Query Performance for External Tables.
Required Permissions
In addition to having permission in Vertica, users must have read access to the external data.
- For data on the local disk this access is governed by local file permissions.
- For data in HDFS, access might be governed by Kerberos authentication. See Accessing Kerberized HDFS Data.
- For data on S3, you need access through an AWS IAM role. See S3 Object Store.
For data in GCS, you must enable S3 compatibility before reading data. See Google Cloud Storage (GCS) Object Store.
By default, you must also be a database superuser to access external tables through a SELECT statement.
In most cases, to allow users without superuser access to query external tables, an administrator must create a 'user' storage location and grant those users read access to the location. See CREATE LOCATION and GRANT (Storage Location). This location must be a parent of the path used in the COPY statement when creating the external table. This requirement does not apply to external tables stored in HDFS. The following example shows granting access to a user named Bob to any external table whose data is located under /tmp (including in subdirectories to any depth):
=> CREATE LOCATION '/tmp' ALL NODES USAGE 'user'; => GRANT ALL ON LOCATION '/tmp' to Bob;
Organizing External Table Data
If the data you store in external tables changes regularly (for instance, each month in the case of storing recent historical data), your COPY definition statement can use wildcards (globs) to make parsing the stored COPY statement definition more dynamic. For instance, if you store monthly data on an NFS mount, you could organize monthly files within a top-level directory for a calendar year, such as:
/2018/monthly/
In this case, the external table COPY statement includes a wildcard definition such as the following:
=> CREATE EXTERNAL TABLE archive (...) AS COPY FROM '/nfs_name/2018/monthly/*'
Whenever a Vertica query references the external table archive
, and Vertica parses the COPY statement, all stored data in the top-level monthly
directory is accessible to the query.
Validating Table Definitions
When you create an external table, Vertica validates the syntax of the CREATE EXTERNAL TABLE AS COPY FROM
statement. For example, if you omit a required keyword in the statement (such as FROM
), creating the external table fails:
=> CREATE EXTERNAL TABLE ext (ts timestamp, d varchar) AS COPY '/home/dbadmin/designer.log'; ERROR 2778: COPY requires a data source; either a FROM clause or a WITH SOURCE for a user-defined source
Checking other components of the COPY definition, such as path statements and node availability, does not occur until a SELECT query references the external table.
To validate an external table definition, run a SELECT query that references the external table. Check that the returned query data is what you expect. If the query does not return data correctly, check the COPY exception and rejected data log files.
Because the COPY definition determines what occurs when you query an external table, COPY statement errors can reveal underlying problems. For more information about COPY exceptions and rejections, see Handling Messy Data.
Viewing External Table Definitions
When you create an external table, Vertica stores the COPY definition statement in the table_definition
column of the v_catalog.tables
system table.
To list all tables, use a select *
query, as shown:
=> SELECT * FROM v_catalog.tables WHERE table_definition <> '';
Use a query such as the following to list the external table definitions (table_definition
):
=> SELECT table_name, table_definition FROM v_catalog.tables; table_name | table_definition ------------+---------------------------------------------------------------------- t1 | COPY FROM 'TMPDIR/external_table.dat' DELIMITER ',' t1_copy | COPY FROM 'TMPDIR/external_table.dat' DELIMITER ',' t2 | COPY FROM 'TMPDIR/external_table2.dat' DELIMITER ',' (3 rows)