Working with External Data
An alternative to importing data into Vertica is to query it in place. Querying external data instead of importing it can be advantageous in some cases:
- If you want to explore data, such as in a data lake, before selecting data to load into Vertica.
- If you are one of several consumers sharing the same data, for example in a data lake, then reading it in place eliminates concerns about whether query results are up to date. There's only one copy, so all consumers see the same data.
- If your data changes rapidly but you do not want to stream it into Vertica. Queries against the external source automatically see the latest updates.
- If you have a very large volume of data and do not want to increase your license capacity.
- If you have lower-priority data in Vertica that you still want to be able to query.
To query external data, you must describe your data as an external table. Like Vertica-managed tables, external tables have table definitions and can be queried. Unlike Vertica-managed tables, the data for an external table is not stored in Vertica; Vertica loads selected data from the external source as needed. For some formats, the query planner can take advantage of partitions and sorting in the data, so querying an external table does not necessarily mean loading all of the data at query time. (For more information about Vertica-managed tables, see Working with Vertica-Managed Tables.)
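As a minimal sketch of what describing data as an external table can look like, the following defines a table over Parquet files and then queries it. The table name, column names, and the S3 path are hypothetical and chosen only for illustration. The sketch also assumes that the column definitions match the files and that access to the storage location has already been configured.

```sql
-- Hypothetical table, columns, and path, for illustration only.
-- The definition is stored in Vertica; the data stays in the external files.
CREATE EXTERNAL TABLE sales (
    sale_id   INT,
    sale_date DATE,
    amount    NUMERIC(10,2)
)
AS COPY FROM 's3://example-bucket/sales/*.parquet' PARQUET;

-- Query the external table like any other table.
-- Vertica reads the needed data from the source at query time.
SELECT sale_date, SUM(amount) AS total
FROM sales
GROUP BY sale_date
ORDER BY sale_date;
```

Because the data is read when the query runs, files added to the (hypothetical) location after the table is defined are picked up by later queries, provided they match the path pattern.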
There is one special type of external data not covered in this section. If you are reading data from Hadoop, and specifically from a Hive data warehouse, then instead of defining your own external tables you can read the schema information from Hive. For more information, see Using the HCatalog Connector in Integrating with Apache Hadoop.