Handling Messy Data
Loading data with COPY
has two main phases, parsing and loading. During parsing, if COPY
encounters errors it rejects the faulty data and continues loading data. Rejected data is created whenever COPY
cannot parse a row of data. Following are some parser errors that can cause a rejected row:
- Unsupported parser options
- Incorrect data types for the table into which data is being loaded, including incorrect data types for members of collections
- Malformed context for the parser in use
- Missing delimiters
Optionally, COPY
can reject data and continue loading when transforming data during the load phase. This behavior is controlled by a configuration parameter. By default, COPY
aborts a load if it encounters errors during the loading phase.
Several optional parameters let you determine how strictly COPY
handles rejections. For example, you can have COPY
fail when it rejects a single row, or allow a specific number of rejections before the load fails. This section presents the parameters to determine how COPY
handles rejected data.
Save Rejected Rows (REJECTED DATA and EXCEPTIONS)
The COPY
statement automatically saves a copy of each rejected row in a rejected-data file. COPY
also saves a corresponding explanation of what caused the rejection in an exceptions file. By default, Vertica saves both files in a database catalog subdirectory, called CopyErrorLogs
, as shown in this example:
v_mart_node003_catalog\CopyErrorLogs\trans-STDIN-copy-from-rejected-data.1
v_mart_node003_catalog\CopyErrorLogs\trans-STDIN-copy-from-exceptions.1
You can optionally save COPY
rejections and exceptions in one of two other ways:
- Use the
REJECTED DATA reject_path
andEXCEPTIONS except_path
parameters to save both outputs to locations of your choice.REJECTED DATA
records rejected rows, whileEXCEPTIONS
records a description of why each row was rejected. If a path value is an existing directory or ends in '/', or the load includes multiple sources, files are written in that directory. (COPY
creates the directory if it does not exist.) If a path value is a file,COPY
uses it as a file prefix if multiple files are written. - Use the
REJECTED DATA AS TABLE reject_table
clause. This option writes both the rejected data and the exception descriptions to the same table. For more information, see Saving Rejected Data To a Table.
Vertica recommends saving rejected data to a table. However, saving to a table excludes saving to a default or specific rejected data file.
If you save rejected data to a table, the table files are stored in the data subdirectory. For example, in a VMart database
installation, rejected data table records are stored in the RejectionTableData
directory as follows:
=> cd v_mart_node003_data\RejectionTableData\ => ls TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805099_1.1 TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805113_2.2 . . . TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.1 TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.2
COPY LOCAL Rejected Data
For COPY LOCAL
operations, if you use REJECTED DATA
or EXCEPTIONS
with a file path, the files are written on the client. If you want rejections to be available on all nodes, use REJECTED DATA AS TABLE
instead of REJECTED DATA
.
Enforce Truncating or Rejecting Rows (ENFORCELENGTH)
When parsing data of type CHAR
, VARCHAR
, BINARY
, or VARBINARY
, rows may exceed the target table length. By default, COPY
truncates such rows without rejecting them.
Use the ENFORCELENGTH
parameter to reject rows that exceed the target table.
For example, loading 'abc'
into a table column specified as VARCHAR(2)
results in COPY
truncating the value to 'ab'
and loading it. Loading the same row with the ENFORCELENGTH
parameter causes COPY
to reject the row.
Vertica supports NATIVE
and NATIVE VARCHAR
values up to 65K. If any value exceeds this limit, COPY
rejects the row, even when ENFORCELENGTH
is not in use.
Specify a Maximum Number of Rejections (REJECTMAX)
The REJECTMAX
parameter specifies the maximum number of logical records that can be rejected before a load fails. A rejected row consists of the data that could not be parsed (or optionally transformed) into the corresponding data type during a bulk load. Rejected data does not indicate referential constraints. For information about using constraints, and the option of enforcing constraints during bulk loading, see Constraints.
When the number of rejected records becomes equal to the REJECTMAX
value, the load fails. If you do not specify a value for REJECTMAX
, or if the value is 0, COPY
allows an unlimited number of exceptions to occur.
If you allow COPY
to reject rows and proceed when it encounters transformation errors, consider using REJECTMAX
to limit the impact. See Handling Transformation Errors.
Handling Transformation Errors
By default, COPY
aborts a load if it encounters errors when performing transformations. This is the default because rejecting transformation errors is potentially more expensive than rejecting parse errors. Sometimes, however, you would prefer to load the data anyway and reject the problematic rows, the way it does for parse errors.
To have COPY
treat errors in transformation expressions like parse errors, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General Parameters.) Rows that are rejected during transformation, in the expression-evaluation phase of a data load, are written to the same destination as rows rejected during parsing. Use REJECTED DATA
or REJECTED DATA AS TABLE
to specify the output location.
You might want to enable transformation rejections if your data contains a few bad rows. By enabling these rejections, you can load the majority of your data and proceed. Vertica recommends using REJECTMAX
when enabling transformation rejections.
If your data contains many bad values, then the performance for loading the good rows could be worse than with parser errors.
Abort Data Loads for Any Error (ABORT ON ERROR)
Using the ABORT ON ERROR
argument is the most restrictive way to load data, because no exceptions or rejections are allowed. A COPY
operation stops if any row is rejected. No data is loaded and Vertica rolls back the command.
If you use the ABORT ON ERROR
as part of a CREATE EXTERNAL TABLE AS COPY FROM
statement, the option is used whenever a query references the external table. The offending error is saved in the COPY
exceptions or rejected data file.
Understanding Row Rejections and Rollback Errors
Depending on the type of error that COPY
encounters, Vertica does one of the following:
- Rejects the offending row and loads other rows into a table
- Rolls back the entire
COPY
statement without loading any data
If you specify ABORT ON ERROR
with the COPY
statement, the load automatically rolls back if COPY
cannot parse any row.
The following table summarizes the reasons for rejected rows or rollbacks.
Rejected Rows | Load Rollback |
---|---|
|
|
This example illustrates what happens when Vertica cannot coerce a row to the requested data type. For example, in the following COPY
statement, "a::INT + b::INT"
is a SQL expression in which a
and b
are derived values:
=> CREATE TABLE t (i INT); => COPY t (a FILLER VARCHAR, b FILLER VARCHAR, i AS a::INT + b::INT) FROM STDIN; Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself. >> cat|dog >> \.
Vertica cannot parse the row to the requested data type and rejects the row:
ERROR 2827: Could not convert "cat" from column "*FILLER*".a to an int8
If a
resolved to 'cat'
and b
to 'dog'
, the next expression 'cat'::INT + 'dog'::INT
would return an expression evaluator error:
=> SELECT 'cat'::INT + 'dog'::INT; ERROR 3681: Invalid input syntax for integer: "cat"
The following COPY
statement would also roll back because Vertica cannot parse the row to the requested data type:
=> COPY t (a FILLER VARCHAR, i AS a::INT) FROM STDIN;
In the following COPY
statement, Vertica rejects only the offending row without rolling back the statement. Instead of evaluating the 'cat'
row as a VARCHAR type, COPY parses 'cat'
directly as an INTEGER.
=> COPY t (a FILLER INT, i AS a) FROM STDIN;
In the following example, transformation errors are rejected instead of aborting the load.
=> ALTER DATABASE DEFAULT SET CopyFaultTolerantExpressions = 1; ALTER DATABASE => CREATE TABLE sales (price INTEGER); COPY sales FROM STDIN REJECTED DATA AS TABLE sales_rej; dollars \. => SELECT rejected_data, rejected_reason FROM sales_rej; rejected_data | rejected_reason ---------------+----------------------------------------------- dollars | Invalid integer format 'dollars' for column 1 (price) (1 row)