INFER_EXTERNAL_TABLE_DDL

Inspects a file in Parquet or ORC format and returns a CREATE EXTERNAL TABLE AS COPY statement that can be used to read the file. This statement might be incomplete. It could also contain more columns or columns with longer names than what Vertica supports; this function does not enforce Vertica system limits. Always inspect the output and address any issues before using it to create a table.

Parquet and ORC files contain insufficient information to infer the type of partition columns, so this function shows these columns with a data type of UNKNOWN and emits a warning.

The function handles most data types found in Parquet and ORC data, including complex types. If a type is not supported in Vertica, the function emits a warning.

By default, the function fully defines complex types, including handling nested types. You can instead treat the column as a single LONG VARBINARY value, which is useful if your data contains combinations of nesting that Vertica does not support.

Syntax

INFER_EXTERNAL_TABLE_DDL( 'path' USING PARAMETERS param=value[,...] )

The following syntax is deprecated:

INFER_EXTERNAL_TABLE_DDL( 'path', 'table_name' [, 'vertica_type_for_complex_type'] )

Arguments

path

Path to a file or directory. Any path that is valid for COPY is valid for this function. The function operates only on files in the Parquet and ORC formats.

Parameters

format

Input format (string), either 'parquet' or 'orc'. This parameter is required.
table_name

The name of the external table to create. This parameter is required.

Do not include a schema name as part of the table name; use the table_schema parameter.

table_schema

The schema in which to create the external table.

vertica_type_for_complex_type

Type used to represent all columns of complex types, if you do not want to expand them fully. The only supported value is LONG VARBINARY. For more information, see Flexible Complex Types.

Privileges

Non-superuser: READ privileges on the USER-accessible storage location.
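
For example, a superuser can grant the required access on a storage location like this (the path and user name here are illustrative):

=> GRANT READ ON LOCATION '/data' TO bob;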

Examples

In the following example, the input file contains data for a table with two INT columns. The table definition can be fully inferred, and you can use the returned SQL statement as-is.

=> SELECT INFER_EXTERNAL_TABLE_DDL('/data/orders/*.orc'
        USING PARAMETERS format = 'orc', table_name = 'orders');
 
                                      INFER_EXTERNAL_TABLE_DDL
--------------------------------------------------------------------------------------------------
create external table "orders" (
  "id" int, 
  "quantity" int
) as copy from '/data/orders/*.orc' orc;
(1 row)

To create a table in a schema, use the table_schema parameter. Do not add the schema to the table name; the function treats a qualified name as a single name that contains a period, not as a schema and table name.
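
For example, the following call reuses the data from the first example (the schema name here is illustrative) and produces a definition that qualifies the table name as "store"."orders":

=> SELECT INFER_EXTERNAL_TABLE_DDL('/data/orders/*.orc'
        USING PARAMETERS format = 'orc', table_name = 'orders', table_schema = 'store');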

The following example shows output with complex types. You can use the definition as-is or modify the VARCHAR sizes:

=> SELECT INFER_EXTERNAL_TABLE_DDL('/data/people/*.parquet'
        USING PARAMETERS format = 'parquet', table_name = 'employees');
WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
                    INFER_EXTERNAL_TABLE_DDL 
-------------------------------------------------------------------------
 create external table "employees"(
  "employeeID" int, 
  "personal" Row(
    "name" varchar, 
    "address" Row(
      "street" varchar, 
      "city" varchar, 
      "zipcode" int
    ), 
    "taxID" int
  ), 
  "department" varchar
 ) as copy from '/data/people/*.parquet' parquet;
(1 row)

In the following example, the input file contains a map in the "prods" column. You can read a map as an array of rows:

=> SELECT INFER_EXTERNAL_TABLE_DDL('/data/orders.parquet' 
    USING PARAMETERS format='parquet', table_name='orders');
WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
                        INFER_EXTERNAL_TABLE_DDL
------------------------------------------------------------------------
 create external table "orders"(
  "orderkey" int,
  "custkey" int,
  "prods" Array[Row(
    "key" varchar,
    "value" numeric(12,2)
  )],
  "orderdate" date
 ) as copy from '/data/orders.parquet' parquet;
(1 row)
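
If you do not want to expand complex types, or your data contains combinations of nesting that Vertica does not support, you can use the vertica_type_for_complex_type parameter instead. The following sketch reuses the employees example; with this parameter, a complex column such as "personal" is defined as a single LONG VARBINARY column:

=> SELECT INFER_EXTERNAL_TABLE_DDL('/data/people/*.parquet'
        USING PARAMETERS format = 'parquet', table_name = 'employees',
        vertica_type_for_complex_type = 'LONG VARBINARY');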

The following example uses partition columns. The types of partition columns cannot be determined from the data, so you must edit the generated statement to specify them. In this example, the date and region columns are in the data in addition to being partition columns, so the table definition shows them twice:

=> SELECT INFER_EXTERNAL_TABLE_DDL('/data/sales/*/*/*'
        USING PARAMETERS format = 'parquet', table_name = 'sales');
WARNING 9262: This generated statement is incomplete because of one or more unknown column types. 
Fix these data types before creating the table
                        INFER_EXTERNAL_TABLE_DDL
------------------------------------------------------------------------
 create external table "sales"(
  "tx_id" int, 
  "date" UNKNOWN, 
  "region" UNKNOWN
) as copy from '/data/sales/*/*/*' parquet(hive_partition_cols='date,region');
(1 row)
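
Based on the output shown, you might edit the statement as follows before creating the table (the types here are illustrative; choose types that match your partition values):

=> CREATE EXTERNAL TABLE "sales"(
  "tx_id" int,
  "date" date,
  "region" varchar(50)
) AS COPY FROM '/data/sales/*/*/*' parquet(hive_partition_cols='date,region');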

For VARCHAR and VARBINARY columns, this function does not specify a length. The Vertica default length for these types is 80 bytes. If the data values are longer, using this table definition unmodified could cause data to be truncated. Always review VARCHAR and VARBINARY columns to determine if you need to specify a length. This function emits a warning if the input file contains columns of these types:

WARNING 9311: This generated statement contains one or more varchar/varbinary columns which default to length 80
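
For example, you might edit the employees definition shown earlier to give each VARCHAR column an explicit length (the sizes below are illustrative):

=> CREATE EXTERNAL TABLE "employees"(
  "employeeID" int,
  "personal" Row(
    "name" varchar(50),
    "address" Row(
      "street" varchar(100),
      "city" varchar(50),
      "zipcode" int
    ),
    "taxID" int
  ),
  "department" varchar(50)
) AS COPY FROM '/data/people/*.parquet' parquet;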