UDParser Class

You can subclass the UDParser class when you need to parse data that is in a format that the COPY statement's native parser cannot handle.

During parser execution, Vertica always calls three methods: setup(), process(), and destroy(). If the parser rejects a row, Vertica also calls getRejectedRecord().

UDParser Methods

The UDParser class defines the following methods. Your subclass must override the process() and getRejectedRecord() methods. You can optionally override the other methods.

For the signatures of these methods and language-specific information, see Parser Classes (C++) and Parser Classes (Java).

The UDParser class performs important initialization required by all subclasses, including initializing the StreamWriter object used by the parser. Therefore, your constructor must call super().

Setting Up

COPY calls setup() before the first time it calls process(). Use setup() to perform any initial setup tasks that your parser needs to parse data. This setup includes retrieving parameters from the class context structure or initializing data structures for use during parsing. Your object might be destroyed and re-created during use, so make sure that your object is restartable.
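A minimal sketch of a restartable setup() follows. The class is a hypothetical stand-in rather than a subclass of the real UDParser, and the "delimiter" parameter, the Map argument, and the helper methods are assumptions made for illustration. The point is only that setup() re-reads its parameters and resets all working state each time it runs.

```java
import java.util.Map;

// Hypothetical stand-in, not the Vertica SDK: models only the setup() hook.
class RestartableParserSketch {
    private char delimiter;   // parameter read during setup()
    private long rowsSeen;    // working state, reset on every setup()

    // Assumed to run before the first process() call, and possibly again later.
    void setup(Map<String, String> params) {
        String d = params.getOrDefault("delimiter", ",");
        delimiter = d.isEmpty() ? ',' : d.charAt(0);
        rowsSeen = 0;         // never carry counters over from an earlier run
    }

    void noteRow() { rowsSeen++; }
    char delimiter() { return delimiter; }
    long rowsSeen() { return rowsSeen; }
}
```

Because setup() assigns every field it depends on, calling it a second time leaves the object in the same state as a newly constructed one.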

Parsing

COPY calls process() repeatedly during query execution. On each call, Vertica passes the method a buffer of data to parse into columns and rows, along with one of the following input states defined by InputState:

  • OK: currently at the start of or in the middle of a stream.
  • END_OF_FILE: no further data is available.
  • END_OF_CHUNK: the current data ends on a record boundary and the parser should consume all of it before returning. This input state only occurs when using a chunker.
  • START_OF_PORTION: the input does not start at the beginning of a source. The parser should find the first end-of-record mark. This input state only occurs when using apportioned load. You can use the getPortion() method to access the offset and size of the portion.
  • END_OF_PORTION: the source has reached the end of its portion. The parser should finish processing the last record it started and advance no further. This input state only occurs when using apportioned load.

The parser must reject any data that it cannot parse, so that Vertica can report the rejection and write the rejected data to files.

The process() method must parse as much data as it can from the input buffer. Because the buffer might not end on a row boundary, the method might have to stop parsing in the middle of a row of input and ask for more data. The input can contain null bytes, if the source file contains them, and is not automatically null-terminated.
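The following sketch shows one way to consume only complete rows and leave a trailing partial row for the next call. It assumes newline-terminated rows and models the input buffer as a byte array with an explicit length. None of these names come from the SDK. Note that the loop is bounded by the length, not by a terminating null byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not the Vertica SDK: a byte array models the input buffer.
class RowScanSketch {
    final List<String> rows = new ArrayList<>();

    // Collects every complete newline-terminated row in data[offset, length)
    // and returns the new offset. Bytes of a trailing partial row stay
    // unconsumed so they can be parsed once more input arrives.
    int consumeCompleteRows(byte[] data, int offset, int length) {
        int rowStart = offset;
        for (int i = offset; i < length; i++) {
            if (data[i] == '\n') {
                rows.add(new String(data, rowStart, i - rowStart, StandardCharsets.UTF_8));
                rowStart = i + 1;
            }
        }
        return rowStart;
    }
}
```

For the input "a,1\nb,2\nc," the method collects two rows and returns 8, leaving the partial row "c," in the buffer.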

A parser has an associated StreamWriter object, which performs the actual writing of the data. When your parser extracts a column value, it uses one of the type-specific methods on StreamWriter to write that value to the output stream. See Writing Data for more information about these methods.

A single call to process() might write several rows of data. When your parser finishes processing a row of data, it must call next() on its StreamWriter to advance the output stream to a new row. (Usually a parser finishes processing a row because it encounters an end-of-row marker.)
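To make the write-then-advance pattern concrete, here is a small sketch that uses an in-memory stand-in for StreamWriter. The stand-in class, its constructor, and the "name,number" row layout are assumptions. Only the method names setStringValue(), setLongValue(), and next() are taken from this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not the Vertica StreamWriter: records rows in memory.
class RowWriterSketch {
    final List<Object[]> rows = new ArrayList<>();
    private final int columns;
    private Object[] current;

    RowWriterSketch(int columns) {
        this.columns = columns;
        this.current = new Object[columns];
    }

    void setLongValue(int idx, long v) { current[idx] = v; }
    void setStringValue(int idx, String v) { current[idx] = v; }

    // Commits the current row and starts an empty one.
    void next() {
        rows.add(current);
        current = new Object[columns];
    }
}

class WriteRowsSketch {
    // Writes every "name,number" line in the text as one output row.
    static void writeAll(String text, RowWriterSketch writer) {
        for (String line : text.split("\n")) {
            String[] fields = line.split(",");
            writer.setStringValue(0, fields[0]);
            writer.setLongValue(1, Long.parseLong(fields[1]));
            writer.next();   // end of row reached, so advance the output stream
        }
    }
}
```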

When your process() method reaches the end of the buffer, it tells Vertica its current state by returning one of the following values defined by StreamState:

  • INPUT_NEEDED: the parser has reached the end of the buffer and needs more data to parse.
  • DONE: the parser has reached the end of the input data stream.
  • REJECT: the parser has rejected the last row of data it read (see Rejecting Rows).
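The choice among these return values can be sketched as a small decision function. The enum below is a hypothetical stand-in that only mirrors the constant names listed above, and the two boolean inputs are a simplification. Chunker and apportioned-load input states are outside the scope of this sketch.

```java
// Hypothetical stand-in, not the Vertica SDK: only the constants listed above.
enum StreamStateSketch { INPUT_NEEDED, DONE, REJECT }

class ReturnStateSketch {
    // Picks the value a process() implementation might return after scanning the buffer.
    static StreamStateSketch choose(boolean rowRejected, boolean endOfFile) {
        if (rowRejected) {
            return StreamStateSketch.REJECT;        // report the bad row first
        }
        if (endOfFile) {
            return StreamStateSketch.DONE;          // no further data will arrive
        }
        return StreamStateSketch.INPUT_NEEDED;      // ask for the next buffer
    }
}
```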

Tearing Down

COPY calls destroy() after the last time that process() is called. Use this method to free any resources reserved by the setup() or process() method.

Vertica calls this method after the process() method indicates it has completed parsing the data source. However, sometimes data sources that have not yet been processed might remain. In such cases, Vertica might later call setup() on the object again and have it parse the data in a new data stream. Therefore, write your destroy() method so that it leaves an instance of your UDParser subclass in a state where setup() can be safely called again.
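A sketch of a destroy() that leaves the object reusable might look like the following. The class is a hypothetical stand-in, and the StringReader simply represents any resource that setup() reserves. The idea is that destroy() releases the resource and clears the reference, so a later setup() starts clean.

```java
import java.io.StringReader;

// Hypothetical stand-in, not the Vertica SDK: models setup() and destroy() only.
class TeardownSketch {
    private StringReader lookup;   // stands in for any resource reserved by setup()

    void setup() {
        lookup = new StringReader("lookup data");   // always a fresh resource
    }

    void destroy() {
        if (lookup != null) {      // safe if setup() never ran or destroy() repeats
            lookup.close();
            lookup = null;
        }
    }

    boolean holdsResource() { return lookup != null; }
}
```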

Reporting Rejections

If process() rejects a row, Vertica calls getRejectedRecord() to report it. Usually, this method returns an instance of the RejectedRecord class with details of the rejected row.

Writing Data

A parser has an associated StreamWriter object, which you access by calling getStreamWriter(). In your process() implementation, use the type-specific set methods on the StreamWriter object, such as setLongValue(), to write values in a row to specific column indexes. Verify that the data types you write match the data types expected by the schema.

The following example shows how you can write a value of type long to the fourth column (index 3) in the current row:

StreamWriter writer = getStreamWriter();
...
writer.setLongValue(3, 98);

StreamWriter provides methods for all the basic types, such as setBooleanValue(), setStringValue(), and so on. See the API documentation for a complete list of StreamWriter methods, including options that take primitive types or explicitly set entries to null.

The Java API supports additional options for writing data. See Parser Classes (Java).

Rejecting Rows

If your parser finds data it cannot parse, it should reject the row by:

  1. Saving details about the rejected row data and the reason for the rejection. These pieces of information can be directly stored in a RejectedRecord object, or in fields on your UDParser subclass, until they are needed.
  2. Advancing input.offset past the rejected row so that parsing can resume with the next row.
  3. Signaling that it has rejected a row by returning with the value StreamState.REJECT.
  4. Returning an instance of the RejectedRecord class with the details about the rejected row when Vertica calls getRejectedRecord().
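The four steps above can be sketched as follows. All types here are hypothetical stand-ins rather than the SDK's RejectedRecord and StreamState classes, and the input is modeled as a String with an integer offset. Returning INPUT_NEEDED after each accepted row is a simplification so that the sketch handles one row per call.

```java
// Hypothetical stand-ins, not the Vertica SDK.
enum RejectStateSketch { INPUT_NEEDED, REJECT }

class RejectedRecordSketch {
    final String reason;
    final String data;

    RejectedRecordSketch(String reason, String data) {
        this.reason = reason;
        this.data = data;
    }
}

class RejectingParserSketch {
    int offset;                                // models input.offset
    private RejectedRecordSketch lastReject;   // step 1: details kept until requested

    // Parses one newline-terminated row that is expected to hold an integer.
    RejectStateSketch processRow(String input) {
        int end = input.indexOf('\n', offset);
        if (end < 0) {
            return RejectStateSketch.INPUT_NEEDED;   // row is incomplete
        }
        String row = input.substring(offset, end);
        offset = end + 1;                      // step 2: resume at the next row
        try {
            Long.parseLong(row.trim());
            return RejectStateSketch.INPUT_NEEDED;
        } catch (NumberFormatException e) {
            lastReject = new RejectedRecordSketch("not an integer", row);
            return RejectStateSketch.REJECT;   // step 3: signal the rejection
        }
    }

    // Step 4: hand back the stored details when asked.
    RejectedRecordSketch getRejectedRecord() { return lastReject; }
}
```

Storing the details in a field before returning keeps the rejection information available until it is requested, which matches the first option described in step 1.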

Breaking Up Large Loads

Vertica provides two ways to break up large loads. Apportioned Load allows you to distribute a load among several database nodes. Cooperative Parse (C++ only) allows you to distribute a load among several threads on one node.