Data Streaming Integration Terms

Vertica uses the following terms to describe its streaming feature. These are general terms, which may differ from each specific streaming platform's terminology.

Terminology

Term Description
Host A data streaming server.
Source A feed of messages in a common category which streams into the same Vertica target tables. In Apache Kafka, a source is known as a topic.
Partition Unit of parallelism within data streaming. Data streaming splits a source into multiple partitions, which can each be served in parallel to consumers such as a Vertica database. Within a partition, messages are usually ordered chronologically.
Offset An index into a partition. This index is the position within an ordered queue of messages, not an index into an opaque byte stream.
Message A unit of data within data streaming. The data is typically in JSON or Avro format. Messages are loaded as rows into Vertica tables, and are uniquely identified by their source, partition, and offset.

Data Loader Terminology

Data Loader Term Description

Scheduler

An external tool that schedules data loads from a streaming data source into Vertica.
Microbatch

A microbatch represents a single segment of a data load from a streaming data source. It encompasses all of the information the scheduler needs to perform a load from a streaming data source into Vertica.

Frame

The window of time during which a Scheduler executes microbatches to load data. This window controls the duration of each COPY statement the scheduler runs as a part of the microbatch. During the frame, the scheduler gives an active microbatch from each source an opportunity to load data. It gives priority to microbatches that need more time to load data based on the history of previous microbatches.

Stream

A feed of messages that is identified by a source and partition.

The offset uniquely identifies the position within a particular source-partition stream.

Lane

A thread within a job scheduler instance that issues microbatches to perform the load.

The number of lanes available is based on the PlannedConcurrency of the job scheduler's resource pool. Multiple lanes allow the scheduler to run microbatches for different sources in parallel during a frame.