UDChunker Class

You can subclass the UDChunker class to allow your parser to support Cooperative Parse. This class is available only in the C++ API.

Fundamentally, a UDChunker is a very simple parser. Like UDParser, it has three methods: setup(), process(), and destroy(). You must override process(); you may override the others. This class has one additional method, alignPortion(), which you must implement if you want to enable Apportioned Load for your UDChunker.

For the signatures of these methods, see Parser Classes.

Setting Up and Tearing Down

As with UDParser, you can define initialization and cleanup code for your chunker. Vertica calls setup() before the first call to process() and destroy() after the last call to process(). Your object might be reused among multiple load sources, so make sure that setup() completely initializes all fields.
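The reuse rule above can be illustrated with a minimal sketch. The class below is hypothetical and does not derive from UDChunker; a real chunker would use the method signatures listed in Parser Classes. The field names and the helper methods noteRecord() and notePortionEnd() are assumptions made for illustration only.

```cpp
#include <cstddef>

// Hypothetical skeleton, not the SDK class: it only shows that setup()
// resets every per-source field.
class CountingChunker {
public:
    // Reset all state, because the same object may serve several sources.
    void setup() {
        recordsSeen_ = 0;
        pastPortion_ = false;
    }

    // Stand-ins for state changes that process() would make.
    void noteRecord() { ++recordsSeen_; }
    void notePortionEnd() { pastPortion_ = true; }

    // Release any per-source resources here; this sketch holds none.
    void destroy() {}

    std::size_t recordsSeen() const { return recordsSeen_; }
    bool pastPortion() const { return pastPortion_; }

private:
    std::size_t recordsSeen_ = 0;
    bool pastPortion_ = false;
};
```

If setup() left either field untouched, state from one load source would leak into the next source handled by the same object.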

Chunking

Vertica calls process() to divide an input into chunks that can be parsed independently. The method takes an input buffer and an indicator of the input state:

  • OK: the input buffer begins at the start of a stream or in the middle of one.
  • END_OF_FILE: no further data is available.
  • END_OF_PORTION: the source has reached the end of its portion. This state occurs only when using apportioned load.

If the input state is END_OF_FILE, the chunker should set the input.offset marker to input.size and return DONE. Returning INPUT_NEEDED is an error.

If the input state is OK, the chunker should read data from the input buffer and find record boundaries. If it finds the end of at least one record, it should align the input.offset marker with the byte after the end of the last record in the buffer and return CHUNK_ALIGNED. For example, if the input is "abc~def" and "~" is a record terminator, this method should set input.offset to 4, the position of "d". If process() reaches the end of the input without finding a record boundary, it should return INPUT_NEEDED.

You can divide the input into smaller chunks, but consuming all available records in the input can perform better. For example, a chunker could scan backwards from the end of the input to find a record terminator, which might mark the end of the last of many records in the input, and return everything up to that point as one chunk without scanning through the rest of the input.
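As a rough illustration of the END_OF_FILE and OK cases, the following sketch scans backwards for a single-byte record terminator. The enum and struct definitions are simplified stand-ins that reuse the names from this section; the real types come from the Vertica SDK headers and may differ. The free function chunkProcess() and its terminator parameter are hypothetical, and portion handling is left out.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-ins for the SDK types named in the text (assumptions).
enum InputState { OK, END_OF_FILE, END_OF_PORTION };
enum StreamState { INPUT_NEEDED, DONE, CHUNK_ALIGNED };

struct DataBuffer {
    const char *buf;     // input bytes, not null-terminated
    std::size_t size;    // number of valid bytes in buf
    std::size_t offset;  // first byte not yet consumed
};

// Hypothetical free-function version of process().
StreamState chunkProcess(DataBuffer &input, InputState state, char terminator) {
    assert(input.offset <= input.size);

    if (state == END_OF_FILE) {
        // No more data will arrive: consume everything that is left.
        input.offset = input.size;
        return DONE;
    }

    // Scan backwards so that only the tail of the buffer is examined.
    for (std::size_t i = input.size; i > input.offset; --i) {
        if (input.buf[i - 1] == terminator) {
            input.offset = i;  // byte after the last terminator
            return CHUNK_ALIGNED;
        }
    }
    return INPUT_NEEDED;  // no record boundary in the buffer yet
}
```

For the input "abc~def" with "~" as the terminator, this sketch stops at the first byte it tests that matches, leaves input.offset at 4, and returns CHUNK_ALIGNED.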

If the input state is END_OF_PORTION, the chunker should behave as it does for an input state of OK, except that it should also set a flag recording that the end of the portion has been reached. When called again, it should find the first record in the next portion and align the chunk to that record.
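One possible reading of the flag-based behavior is sketched below. The types are simplified stand-ins for the SDK types, and the class, the flag name pastPortion_, and the single-byte terminator are hypothetical. Two points are assumptions of this sketch rather than statements from the text: that the buffer passed with END_OF_PORTION ends at the portion boundary, and that the final chunk, which ends at the first record boundary in the next portion, is reported with DONE.

```cpp
#include <cstddef>

// Simplified stand-ins for the SDK types named in the text (assumptions).
enum InputState { OK, END_OF_FILE, END_OF_PORTION };
enum StreamState { INPUT_NEEDED, DONE, CHUNK_ALIGNED };

struct DataBuffer {
    const char *buf;
    std::size_t size;
    std::size_t offset;
};

// Hypothetical chunker; a real one would derive from UDChunker.
class PortionAwareChunker {
public:
    explicit PortionAwareChunker(char terminator) : terminator_(terminator) {}

    void setup() { pastPortion_ = false; }

    StreamState process(DataBuffer &input, InputState state) {
        if (pastPortion_) {
            // The portion has ended: finish the record that straddles the
            // boundary by scanning forward to the first terminator.
            for (std::size_t i = input.offset; i < input.size; ++i) {
                if (input.buf[i] == terminator_) {
                    input.offset = i + 1;
                    return DONE;  // assumption: last chunk for this portion
                }
            }
            if (state == END_OF_FILE) {
                input.offset = input.size;
                return DONE;
            }
            return INPUT_NEEDED;
        }

        if (state == END_OF_FILE) {
            input.offset = input.size;
            return DONE;
        }
        if (state == END_OF_PORTION) {
            pastPortion_ = true;  // remember for the next call
        }

        // Same as the OK case: align to the last terminator in the buffer.
        for (std::size_t i = input.size; i > input.offset; --i) {
            if (input.buf[i - 1] == terminator_) {
                input.offset = i;
                return CHUNK_ALIGNED;
            }
        }
        return INPUT_NEEDED;
    }

private:
    char terminator_;
    bool pastPortion_ = false;
};
```

Resetting the flag in setup() keeps the object safe to reuse for another load source.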

The input data can contain null bytes, if the source file contains them. The input argument is not automatically null-terminated.
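In practice this means scanning by explicit length rather than with C-string functions. The helper below is a minimal sketch; the name findTerminator() and its parameters are hypothetical and not part of the SDK.

```cpp
#include <cstddef>
#include <cstring>

// Returns the index of the first terminator in buf[0, size), or size if
// there is none. The length is passed explicitly, so embedded null bytes
// are treated as ordinary data.
std::size_t findTerminator(const char *buf, std::size_t size, char terminator) {
    if (size == 0) {
        return 0;  // nothing to scan; 0 == size means "not found"
    }
    const void *hit = std::memchr(buf, terminator, size);
    if (hit == nullptr) {
        return size;
    }
    return static_cast<std::size_t>(static_cast<const char *>(hit) - buf);
}
```

A function such as strlen() would stop at the first null byte and report a length shorter than the actual input.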

The process() method must not block indefinitely. If this method cannot proceed for an extended period of time, it should return KEEP_GOING. Failing to return KEEP_GOING has several consequences, such as preventing the user from canceling the query.

See Chunker Example: Delimited Parser and Chunker for an example of the process() method using chunking.

Aligning Portions

If your chunker supports apportioned load, implement the alignPortion() method. Vertica calls this method one or more times, before calling process(), to align the input offset with the beginning of the first complete chunk in the portion. The method takes an input buffer and an indicator of the input state:

  • START_OF_PORTION: the beginning of the buffer corresponds to the start of the portion. You can use the getPortion() method to access the offset and size of the portion.
  • OK: the input buffer is in the middle of a portion.
  • END_OF_PORTION: the end of the buffer is at or beyond the end of the portion.
  • END_OF_FILE: no further data is available.

The method should scan from the beginning of the buffer to the start of the first complete record. It should set input.offset to this position and return one of the following values:

  • DONE, if it found a chunk. input.offset is the first byte of the chunk.
  • INPUT_NEEDED, if the input buffer does not contain the start of any chunk. It is an error to return this from an input state of END_OF_FILE.
  • REJECT, if the portion (not buffer) does not contain the start of any chunk.
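The return values above can be sketched for a single-byte record terminator as follows. The types are simplified stand-ins for the SDK types, and the free function alignToFirstRecord() is hypothetical. The sketch assumes that the portion starts in the middle of a record, so the first complete record begins right after the first terminator. It does not use getPortion(), and it leaves input.offset unchanged when it asks for more input.

```cpp
#include <cstddef>

// Simplified stand-ins for the SDK types named in the text (assumptions).
enum InputState { OK, END_OF_FILE, START_OF_PORTION, END_OF_PORTION };
enum StreamState { INPUT_NEEDED, DONE, REJECT };

struct DataBuffer {
    const char *buf;
    std::size_t size;
    std::size_t offset;
};

// Hypothetical free-function version of alignPortion().
StreamState alignToFirstRecord(DataBuffer &input, InputState state,
                               char terminator) {
    // Scan forward for the end of the record that began before this portion.
    for (std::size_t i = input.offset; i < input.size; ++i) {
        if (input.buf[i] == terminator) {
            input.offset = i + 1;  // first byte of the first complete record
            return DONE;
        }
    }
    // No boundary found and no more data for this portion will arrive.
    if (state == END_OF_PORTION || state == END_OF_FILE) {
        return REJECT;
    }
    // More of the portion is still to come; offset is left untouched.
    return INPUT_NEEDED;
}
```

Because the END_OF_FILE case returns REJECT, this sketch never returns INPUT_NEEDED from that state.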