Handling Messy Data
Loading data with
COPY has two main phases, parsing and loading. During parsing, if
COPY encounters errors it rejects the faulty data and continues loading data. Rejected data is created whenever
COPY cannot parse a row of data. Following are some parser errors that can cause a rejected row:
- Unsupported parser options
- Incorrect data types for the table into which data is being loaded, including incorrect data types for members of collections
- Malformed context for the parser in use
- Missing delimiters
COPY can reject data and continue loading when transforming data during the load phase. This behavior is controlled by a configuration parameter. By default,
COPY aborts a load if it encounters errors during the loading phase.
Several optional parameters let you determine how strictly
COPY handles rejections. For example, you can have
COPY fail when it rejects a single row, or allow a specific number of rejections before the load fails. This section presents the parameters to determine how
COPY handles rejected data.
Save Rejected Rows (REJECTED DATA and EXCEPTIONS)
COPY statement automatically saves a copy of each rejected row in a rejected-data file.
COPY also saves a corresponding explanation of what caused the rejection in an exceptions file. By default, Vertica saves both files in a database catalog subdirectory, called
CopyErrorLogs, as shown in this example:
You can optionally save
COPY rejections and exceptions in one of two other ways:
- Use the
REJECTED DATA reject_pathand
EXCEPTIONS except_pathparameters to save both outputs to locations of your choice.
REJECTED DATArecords rejected rows, while
EXCEPTIONSrecords a description of why each row was rejected. If a path value is an existing directory or ends in '/', or the load includes multiple sources, files are written in that directory. (
COPYcreates the directory if it does not exist.) If a path value is a file,
COPYuses it as a file prefix if multiple files are written.
- Use the
REJECTED DATA AS TABLE reject_tableclause. This option writes both the rejected data and the exception descriptions to the same table. For more information, see Saving Rejected Data To a Table.
Vertica recommends saving rejected data to a table. However, saving to a table excludes saving to a default or specific rejected data file.
If you save rejected data to a table, the table files are stored in the data subdirectory. For example, in a VMart database
installation, rejected data table records are stored in the
RejectionTableData directory as follows:
=> cd v_mart_node003_data\RejectionTableData\ => ls TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805099_1.1 TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805113_2.2 . . . TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.1 TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.2
COPY LOCAL Rejected Data
COPY LOCAL operations, if you use
REJECTED DATA or
EXCEPTIONS with a file path, the files are written on the client. If you want rejections to be available on all nodes, use
REJECTED DATA AS TABLE instead of
Enforce Truncating or Rejecting Rows (ENFORCELENGTH)
When parsing data of type
VARBINARY, rows may exceed the target table length. By default,
COPY truncates such rows without rejecting them.
ENFORCELENGTH parameter to reject rows that exceed the target table.
For example, loading
'abc' into a table column specified as
VARCHAR(2) results in
COPY truncating the value to
'ab' and loading it. Loading the same row with the
ENFORCELENGTH parameter causes
COPY to reject the row.
NATIVE VARCHAR values up to 65K. If any value exceeds this limit,
COPY rejects the row, even when
ENFORCELENGTH is not in use.
Specify a Maximum Number of Rejections (REJECTMAX)
REJECTMAX parameter specifies the maximum number of logical records that can be rejected before a load fails. A rejected row consists of the data that could not be parsed (or optionally transformed) into the corresponding data type during a bulk load. Rejected data does not indicate referential constraints. For information about using constraints, and the option of enforcing constraints during bulk loading, see Constraints.
When the number of rejected records becomes equal to the
REJECTMAX value, the load fails. If you do not specify a value for
REJECTMAX, or if the value is 0,
COPY allows an unlimited number of exceptions to occur.
If you allow
COPY to reject rows and proceed when it encounters transformation errors, consider using
REJECTMAX to limit the impact. See Handling Transformation Errors.
COPY aborts a load if it encounters errors when performing transformations. This is the default because rejecting transformation errors is potentially more expensive than rejecting parse errors. Sometimes, however, you would prefer to load the data anyway and reject the problematic rows, the way it does for parse errors.
COPY treat errors in transformation expressions like parse errors, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General Parameters.) Rows that are rejected during transformation, in the expression-evaluation phase of a data load, are written to the same destination as rows rejected during parsing. Use
REJECTED DATA or
REJECTED DATA AS TABLE to specify the output location.
You might want to enable transformation rejections if your data contains a few bad rows. By enabling these rejections, you can load the majority of your data and proceed. Vertica recommends using
REJECTMAX when enabling transformation rejections.
If your data contains many bad values, then the performance for loading the good rows could be worse than with parser errors.
Abort Data Loads for Any Error (ABORT ON ERROR)
ABORT ON ERROR argument is the most restrictive way to load data, because no exceptions or rejections are allowed. A
COPY operation stops if any row is rejected. No data is loaded and Vertica rolls back the command.
If you use the
ABORT ON ERROR as part of a
CREATE EXTERNAL TABLE AS COPY FROM statement, the option is used whenever a query references the external table. The offending error is saved in the
COPY exceptions or rejected data file.
Understanding Row Rejections and Rollback Errors
Depending on the type of error that
COPY encounters, Vertica does one of the following:
- Rejects the offending row and loads other rows into a table
- Rolls back the entire
COPYstatement without loading any data
If you specify
ABORT ON ERROR with the
COPY statement, the load automatically rolls back if
COPY cannot parse any row.
The following table summarizes the reasons for rejected rows or rollbacks.
|Rejected Rows||Load Rollback|
This example illustrates what happens when Vertica cannot coerce a row to the requested data type. For example, in the following
"a::INT + b::INT" is a SQL expression in which
b are derived values:
=> CREATE TABLE t (i INT); => COPY t (a FILLER VARCHAR, b FILLER VARCHAR, i AS a::INT + b::INT) FROM STDIN; Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself. >> cat|dog >> \.
Vertica cannot parse the row to the requested data type and rejects the row:
ERROR 2827: Could not convert "cat" from column "*FILLER*".a to an int8
a resolved to
'dog', the next expression
'cat'::INT + 'dog'::INT would return an expression evaluator error:
=> SELECT 'cat'::INT + 'dog'::INT; ERROR 3681: Invalid input syntax for integer: "cat"
COPY statement would also roll back because Vertica cannot parse the row to the requested data type:
=> COPY t (a FILLER VARCHAR, i AS a::INT) FROM STDIN;
In the following
COPY statement, Vertica rejects only the offending row without rolling back the statement. Instead of evaluating the
'cat' row as a VARCHAR type, COPY parses
'cat' directly as an INTEGER.
=> COPY t (a FILLER INT, i AS a) FROM STDIN;
In the following example, transformation errors are rejected instead of aborting the load.
=> ALTER DATABASE DEFAULT SET CopyFaultTolerantExpressions = 1; ALTER DATABASE => CREATE TABLE sales (price INTEGER); COPY sales FROM STDIN REJECTED DATA AS TABLE sales_rej; dollars \. => SELECT rejected_data, rejected_reason FROM sales_rej; rejected_data | rejected_reason ---------------+----------------------------------------------- dollars | Invalid integer format 'dollars' for column 1 (price) (1 row)