Data Streaming Integration Terms
Vertica integrates with data streaming applications through a number of components. To use Vertica with data streaming, you should be familiar with these terms.
Terminology
Term | Description |
---|---|
Host | A data streaming server. |
Source | A feed of messages in a common category that streams into the same Vertica target tables. In Apache Kafka, a source is known as a topic. |
Partition | Unit of parallelism within data streaming. Data streaming splits a source into multiple partitions, which can each be served in parallel to consumers such as a Vertica database. Within a partition, all messages are ordered chronologically. |
Offset | An index into a partition. This index is the position within an ordered queue of messages, not an index into an opaque byte stream. |
Message | A unit of data within data streaming. The data is typically in JSON or Avro format. Messages are loaded as rows into Vertica tables, and are uniquely identified by their source, partition, and offset, as the sketch after this table shows. |
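The addressing described above can be pictured as a small data structure. The following Python sketch is illustrative only; the class name MessagePosition and the example topic name are assumptions, not part of Vertica or Kafka. It shows how a (source, partition, offset) triple uniquely identifies a message and why ordering is meaningful only within a single partition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MessagePosition:
    """Uniquely identifies a message by its source (topic), the partition
    within that source, and the offset within that partition."""
    source: str      # e.g. a Kafka topic name (hypothetical example below)
    partition: int   # unit of parallelism within the source
    offset: int      # index into the ordered queue of messages, not a byte position

# Within a single partition, offsets define chronological order:
earlier = MessagePosition(source="web_clicks", partition=0, offset=41)
later   = MessagePosition(source="web_clicks", partition=0, offset=42)
assert later.offset > earlier.offset  # ordering holds only inside one partition
```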
Data Loader Terminology
Data Loader Term | Description |
---|---|
Job scheduler | A tool for continuous loading of data from data streaming into Vertica. |
Micro-batch | A pair of statements that: a) load data from all sources configured for this micro-batch into a Vertica target table; and b) update the progress within the streams. Because the micro-batch is an atomic transaction, it rolls back if any part of these operations fails, so each message is loaded exactly once. A sketch of this pattern appears after this table. |
Frame | Duration of time in which the scheduler attempts to load each configured source once. |
Stream | A feed of messages identified by a source and partition. The offset uniquely identifies a position within a particular source-partition stream. |
Lane | A thread within a job scheduler instance that issues micro-batches to perform the load. The number of available lanes is based on the PlannedConcurrency of the job scheduler's resource pool. Multiple lanes allow micro-batches to run in parallel during a frame. |
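To make the micro-batch guarantee concrete, here is a minimal Python sketch of the exactly-once pattern it relies on. It assumes a generic DB-API connection, a hypothetical progress table named stream_progress, and a hypothetical function run_microbatch; it is not the scheduler's actual implementation, which the job scheduler handles for you.

```python
from dataclasses import dataclass

@dataclass
class Message:
    source: str
    partition: int
    offset: int
    payload: str

def run_microbatch(conn, target_table, messages):
    """Load a batch of messages and advance stream offsets in one transaction.

    Because both steps commit together (or not at all), a message is never
    recorded as loaded without its offset advancing, and never skipped:
    each message is loaded exactly once.
    """
    cur = conn.cursor()
    try:
        # a) load data from all configured sources into the target table
        cur.executemany(
            f"INSERT INTO {target_table} (payload) VALUES (%s)",
            [(m.payload,) for m in messages],
        )
        # b) update the progress within each source-partition stream
        for m in messages:
            cur.execute(
                "UPDATE stream_progress SET last_offset = %s "
                "WHERE source = %s AND stream_partition = %s",
                (m.offset, m.source, m.partition),
            )
        conn.commit()    # both steps become visible together
    except Exception:
        conn.rollback()  # neither step takes effect; the batch can be retried
        raise
```

In the scheduler itself, multiple lanes issue micro-batches like this in parallel within each frame.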