What’s up with rejected data?

Posted April 12, 2017 by Chana Zolty

This blog post was authored by Kanti Mann.

In a perfect world, any and all data you attempt to load into your database would seamlessly and accurately move from point A to point B. Unfortunately, this doesn’t always happen. Occasionally, data fails to load into its destination table, and you’ll probably want to know what didn’t load, and why. Luckily, COPY makes it easy for you to both find out what happened, and hopefully, fix any problems and load your data again. This blog explores why COPY rejects data from being loaded, describes some options you have for handling rejections, and provides you with some best practices.

Why is data rejected?

In Vertica, the COPY statement has two main stages: parsing and loading (there are other stages, but we’ll stick to these two). COPY rejects data only if it encounters problems during its parser phase. That’s when you end up with rejected data. Potential causes for parsing errors include:
  • – Unsupported parser options
  • – Incorrect data types for the table into which data is being loaded
  • – Malformed context for the parser in use
  • – Missing delimiters
The good news is that you can control how strictly Vertica handles any parse errors. For example, you may want the whole load to fail if even one row is rejected. For that, use the optional parameter ABORT ON ERROR. Or, you may want to limit the number of rejected rows you’ll permit. Use REJECTMAX to set a threshold after which you want COPY to roll back the load process.

What happens to rejected data?

Vertica gives you these options to save rejected data:
  • – Do nothing. Vertica automatically saves a rejected data file and an accompanying explanation of each rejected row (the exception), to files in a catalog subdirectory called CopyErrorLogs.
  • – Specify file locations of your choice using the REJECTED DATA and EXCEPTIONS parameters.
  • – Save rejected data to a table. Using a table lets you query what data was rejected, and why. You can then fix any incorrect data, and reload it.

What save option should I use?

Vertica recommends saving rejected data to a table. The table contains both the rejected data and the exception in one location. Saving rejected data to a table is simple, using the REJECTED DATA AS TABLE reject_table clause in the COPY statement.

If you specify a rejected table that doesn’t yet exist, Vertica creates it for you. If the table exists, Vertica appends any additional rejected rows. If your COPY statement specifies NO COMMIT, Vertica saves the rejected data table as a LOCAL TEMP table and returns a message that it has created a LOCAL TEMP table.

Either way, Vertica stores the table in a data subdirectory, called RejectionTableData directory, where you can list the reject_table: dbadmin: ~/VMart/v_vmart_node0001_data/RejectionTableData$ ls TABLE_REJECTED_RECORDS_"reject_table"_v_vmart_node0001-449788:0x629b2e_45035996275600188_1.1 In the following example, this COPY statement loads values from STDIN into a table (sample_table) that has one column of type INT. The statement specifies reject_table for rejected data: => COPY sample_table FROM STDIN REJECTED DATA AS TABLE reject_table; End with a backslash and a period on a line by itself. >> 1 >> 2 >> 3 >> a >> \. The last value is not a character, which is not allowed for a designated INTEGER column. After this load attempt, we can query our reject_table to see its contents: => SELECT * FROM reject_table; -[ RECORD 1 ]-------------+--------------------------------------------- node_name | v_vmart_node0001 file_name | STDIN session_id | v_vmart_node0001-449788:0x629b2e transaction_id | 45035996275600188 statement_id | 1 batch_number | 0 row_number | 6 rejected_data | a rejected_data_orig_length | 1 rejected_reason | Invalid integer format 'a' for column 1 (c1) Note: Rejected data tables are a little different from flex or columnar tables. They do not support DML and DDL activities. For best results, periodically drop any rejected data tables that you no longer use.

What if I need to save rejected data to files?

If you cannot use a table to save COPY rejected data, feel free to continue using either the COPY default location for rejected data files, or to specify a path and name of your choice.

When you specify REJECTED DATA AS a file of your choice, all rejected data should be written to that file. However, this is not always feasible because of COPY parallelism, where loading tasks use different threads.

The resource pool controls the number of threads COPY uses per load. During cooperative parsing, COPY uses all available threads. Otherwise, COPY uses one thread per source or portion, as follows:
  • – When loading from a user-defined source, COPY creates a thread for each object returned from the prepareUDSources() method.
  • – When loading from files, such as a glob, the number of threads depends on the number of files and their sizes. COPY uses at least one thread per file, but more if a file exceeds a certain size limit.

What determines how many rejected data files exist?

Even when you supply a path and name for a rejected data file, multiple COPY threads trying their best to write to your file will step on each other’s rejected data toes!  To avoid this from happening, each thread generates its own rejected data files as required. COPY then treats your specified rejected data file, my_reject_file as a prefix.  Each thread can write to its own my_reject_file.suffix_value. The suffix_value is either a thread index (if multiple threads are performing parallel parsing), such as .1, .2, and so on, or an offset for a UDL apportioned load.

So, each time you give COPY your own rejected file name, expect to see multiple files of that name (each with a numeric suffix) once COPY is complete.

To learn more about rejected data, including truncated rows, exporting tables, and other delights, see our documentation.