Data Streaming Integration Terms

Vertica integrates with data streaming applications through a number of components. To use Vertica with data streaming, you should be familiar with these terms.

Terminology

Host
A data streaming server.

Source
A feed of messages in a common category that streams into the same Vertica target tables. In Apache Kafka, a source is known as a topic.

Partition
The unit of parallelism within data streaming. Data streaming splits a source into multiple partitions, each of which can be served in parallel to consumers such as a Vertica database. Within a partition, messages are ordered chronologically.

Offset
An index into a partition. The index is a position within an ordered queue of messages, not an index into an opaque byte stream.

Message
A unit of data within data streaming. The data is typically in JSON or Avro format. Messages are loaded as rows into Vertica tables and are uniquely identified by their source, partition, and offset.
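
To make the identity of a message concrete: it is the combination of source, partition, and offset, not the payload. The short Python sketch below is purely illustrative and not part of Vertica or Kafka; the MessageId and Message classes and the web_clicks source name are hypothetical. It shows how two messages with identical payloads remain distinct because their offsets differ.

```python
from dataclasses import dataclass
import json

@dataclass(frozen=True)
class MessageId:
    """Uniquely identifies a message by its source (topic), partition, and offset."""
    source: str      # e.g. the Kafka topic name
    partition: int   # partition within the source
    offset: int      # position within the partition's ordered queue of messages

@dataclass
class Message:
    """A single streamed message and the identity it would load under."""
    id: MessageId
    payload: dict    # parsed JSON (or decoded Avro) that becomes a table row

def parse_message(source: str, partition: int, offset: int, raw: bytes) -> Message:
    """Decode a raw JSON message into the row that would be loaded into a Vertica table."""
    return Message(id=MessageId(source, partition, offset), payload=json.loads(raw))

# Two messages with the same payload but different offsets are still distinct messages,
# and therefore load as distinct rows.
m1 = parse_message("web_clicks", 0, 41, b'{"user": "a", "page": "/home"}')
m2 = parse_message("web_clicks", 0, 42, b'{"user": "a", "page": "/home"}')
assert m1.payload == m2.payload and m1.id != m2.id
```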

Data Loader Terminology

Job scheduler
A tool for continuously loading data from data streaming into Vertica.

Micro-batch
An atomic pair of operations that (a) loads data from all sources configured for the micro-batch into a Vertica target table and (b) updates the progress (offsets) within the streams. Because the micro-batch is a single transaction, it rolls back if either operation fails, so each message is loaded exactly once. (See the sketch after this list.)

Frame
The duration of time during which the scheduler attempts to load data from each configured source once.

Stream
A feed of messages identified by a source and partition. The offset uniquely identifies a position within a particular source-partition stream.

Lane
A thread within a job scheduler instance that issues micro-batches to perform the load. The number of available lanes is based on the PlannedConcurrency of the job scheduler's resource pool. Multiple lanes allow micro-batches to run in parallel during a frame.
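
To make the exactly-once behavior of a micro-batch concrete, here is a minimal, self-contained Python sketch. It uses SQLite purely as a stand-in for the Vertica target table and the scheduler's offset bookkeeping; the table and column names (web_clicks, stream_offsets) are hypothetical, and the real scheduler issues COPY statements against Vertica rather than SQL INSERTs. The point it illustrates is that the data load and the offset update either commit together or roll back together.

```python
import json
import sqlite3

# In-memory stand-in for the Vertica target table and the scheduler's
# offset-tracking table. SQLite keeps the sketch runnable; it is not how
# the scheduler is implemented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE web_clicks (user TEXT, page TEXT)")
db.execute("CREATE TABLE stream_offsets ("
           "source TEXT, partition_id INTEGER, next_offset INTEGER, "
           "PRIMARY KEY (source, partition_id))")
db.execute("INSERT INTO stream_offsets VALUES ('web_clicks', 0, 41)")
db.commit()

def run_microbatch(messages):
    """Load a batch of (source, partition, offset, payload) messages and advance
    the stored offsets in a single transaction. If anything fails, both the loaded
    rows and the offset update roll back, so no message is recorded as consumed
    without its row being loaded (exactly-once)."""
    try:
        for source, partition, offset, raw in messages:
            row = json.loads(raw)
            db.execute("INSERT INTO web_clicks VALUES (?, ?)",
                       (row["user"], row["page"]))
            db.execute("UPDATE stream_offsets SET next_offset = ? "
                       "WHERE source = ? AND partition_id = ?",
                       (offset + 1, source, partition))
        db.commit()
    except Exception:
        db.rollback()
        raise

run_microbatch([
    ("web_clicks", 0, 41, b'{"user": "a", "page": "/home"}'),
    ("web_clicks", 0, 42, b'{"user": "b", "page": "/docs"}'),
])
print(db.execute("SELECT next_offset FROM stream_offsets").fetchone())  # (43,)
```

A subsequent micro-batch for this stream would begin reading at the stored next_offset, which is how the scheduler resumes each source-partition stream from where the last committed micro-batch left off.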