UDParser Class

You can subclass the UDParser class when you need to parse data that is in a format that the COPY statement's native parser cannot handle.

During parser execution, Vertica always calls three methods: setup(), process(), and destroy().  It might also call getRejectedRecord().

UDParser Constructor

The UDParser class performs important initialization required by all subclasses, including initializing the StreamWriter object used by the parser. Therefore, your constructor must call super().

UDParser Methods

Your UDParser subclass must override process() or processWithMetadata():

processWithMetadata() is available only for user-defined extensions (UDxs) written in the C++ programming language.

  • process() reads the raw input stream as one large file. If there are any errors or failures, the entire load fails. You can implement process() with a source or filter that implements processWithMetadata(), but it might result in parsing errors.
    You can implement process() when the upstream source or filter implements processWithMetadata(), but it might result in parsing errors.

  • processWithMetadata() is useful when the data source has metadata about record boundaries available in some structured format that's separate from the data payload. With this interface, the source emits a record length for each record in addition to the data.

    By implementing processWithMetadata() instead of process() in each phase, you can retain this record length metadata throughout the load stack, which enables a more efficient parse that can recover from errors on a per-message basis, rather than a per-file or per-source basis. KafkaSource and Kafka parsers (KafkaAvroParser, KafkaJSONParser, and KafkaParser) use this mechanism to support per-Kafka-message rejections when individual Kafka messages are corrupted.

    To implement processWithMetadata(), you must override useSideChannel() to return true.

Additionally, you must override getRejectedRecord() to return information about rejected records.

Optionally, you can override the other UDParser class methods. For the signatures of these methods and other language-specific information, see Parser Classes (C++) and Parser Classes (Java).

Parser Execution

The following sections detail the execution sequence each time a user-defined parser is called. The following example overrides the process() method.

Setting Up
COPY calls setup() before the first time it calls process(). Use setup() to perform any initial setup tasks that your parser needs to parse data. This setup includes retrieving parameters from the class context structure or initializing data structures for use during filtering. Vertica calls this method before calling the process() method for the first time. Your object might be destroyed and re-created during use, so make sure that your object is restartable.

Parsing
COPY calls process() repeatedly during query execution. Vertica passes this method a buffer of data to parse into columns and rows and one of the following input states defined by InputState:

  • OK: currently at the start of or in the middle of a stream
  • END_OF_FILE: no further data is available.
  • END_OF_CHUNK: the current data ends on a record boundary and the parser should consume all of it before returning. This input state only occurs when using a chunker.
  • START_OF_PORTION: the input does not start at the beginning of a source. The parser should find the first end-of-record mark. This input state only occurs when using apportioned load. You can use the getPortion() method to access the offset and size of the portion.
  • END_OF_PORTION: the source has reached the end of its portion. The parser should finish processing the last record it started and advance no further. This input state only occurs when using apportioned load.

The parser must reject any data that it cannot parse, so that Vertica can report the rejection and write the rejected data to files.

The process() method must parse as much data as it can from the input buffer. The buffer might not end on a row boundary. Therefore, it might have to stop parsing in the middle of a row of input and ask for more data. The input can contain null bytes, if the source file contains them, and is not automatically null-terminated.

A parser has an associated StreamWriter object, which performs the actual writing of the data. When your parser extracts a column value, it uses one of the type-specific methods on StreamWriter to write that value to the output stream. See Writing Data for more information about these methods.

A single call to process() might write several rows of data. When your parser finishes processing a row of data, it must call next() on its StreamWriter to advance the output stream to a new row. (Usually a parser finishes processing a row because it encounters an end-of-row marker.)

When your process() method reaches the end of the buffer, it tells Vertica its current state by returning one of the following values defined by StreamState:

  • INPUT_NEEDED: the parser has reached the end of the buffer and needs more data to parse.
  • DONE: the parser has reached the end of the input data stream.
  • REJECT: the parser has rejected the last row of data it read (see Rejecting Rows).

Tearing Down
COPY calls destroy() after the last time that process() is called. It frees any resources reserved by the setup() or process() method.

Vertica calls this method after the process() method indicates it has completed parsing the data source. However, sometimes data sources that have not yet been processed might remain. In such cases, Vertica might later call setup() on the object again and have it parse the data in a new data stream. Therefore, write your destroy() method so that it leaves an instance of your UDParser subclass in a state where setup() can be safely called again.

Reporting Rejections
If process() rejects a row, Vertica calls getRejectedRecord() to report it. Usually, this method returns an instance of the RejectedRecord class with details of the rejected row.

Writing Data

A parser has an associated StreamWriter object, which you access by calling getStreamWriter(). In your process() implementation, use the setType() methods on the StreamWriter object to write values in a row to specific column indexes. Verify that the data types you write match the data types expected by the schema.

The following example shows how you can write a value of type long to the fourth column (index 3) in the current row:

StreamWriter writer = getStreamWriter();
...
writer.setLongValue(3, 98.6);

StreamWriter provides methods for all the basic types, such as setBooleanValue(), setStringValue(), and so on. See the API documentation for a complete list of StreamWriter methods, including options that take primitive types or explicitly set entries to null.

The Java API supports additional options for writing data.  See Parser Classes.

Rejecting Rows

If your parser finds data it cannot parse, it should reject the row by:

  1. Saving details about the rejected row data and the reason for the rejection. These pieces of information can be directly stored in a RejectedRecord object, or in fields on your UDParser subclass, until they are needed.
  2. Updating the row's position in the input buffer by updating input.offset so it can resume parsing with the next row.
  3. Signaling that it has rejected a row by returning with the value StreamState.REJECT.
  4. Returning an instance of the RejectedRecord class with the details about the rejected row.

Breaking Up Large Loads

Vertica provides two ways to break up large loads. Apportioned Load allows you to distribute a load among several database nodes. Cooperative Parse (C++ only) allows you to distribute a load among several threads on one node.