Vertica Analytics Platform Version 10.1.x Documentation

Data Streaming Integration Terms

Vertica uses the following terms to describe its data streaming integration. The terms are generic and may differ from the terminology of a specific streaming platform.

Terminology

Term	Description
Host	A data streaming server.
Source	A feed of messages in a common category which streams into the same Vertica target tables. In Apache Kafka, a source is known as a topic.
Partition	The unit of parallelism within data streaming. A streaming platform splits a source into multiple partitions, each of which can be served in parallel to consumers such as a Vertica database. Within a partition, messages are usually ordered chronologically.
Offset	An index into a partition. This index is the position within an ordered queue of messages, not an index into an opaque byte stream.
Message	A unit of data within data streaming. The data is typically in JSON or Avro format. Messages are loaded as rows into Vertica tables, and are uniquely identified by their source, partition, and offset.
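
The relationship between source, partition, offset, and message can be illustrated with a minimal Python sketch. This is only a toy model, not Vertica or Kafka code: the MessageId class, the source name web_clicks, and the sample payloads are all hypothetical. It shows that an offset counts messages in an ordered queue (not bytes), and that the (source, partition, offset) triple is enough to identify one message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageId:
    """Hypothetical identifier: the triple that uniquely names a message."""
    source: str      # for example, a Kafka topic name
    partition: int   # partition number within the source
    offset: int      # position in the partition's ordered message queue


# One partition modeled as an ordered list of messages (made-up payloads).
# The offset is the message's position in this list, regardless of how
# many bytes each payload occupies.
partition_0 = ['{"id": 1}', '{"id": 2, "note": "longer payload"}', '{"id": 3}']

# Map each message identifier to its payload, as rows might be keyed.
rows = {
    MessageId("web_clicks", 0, offset): payload
    for offset, payload in enumerate(partition_0)
}

# Look up the second message (offset 1) of partition 0.
second = MessageId("web_clicks", 0, 1)
print(rows[second])
```

Because the identifier is a frozen dataclass, two identifiers with the same source, partition, and offset compare equal and hash the same, which mirrors the uniqueness described in the Message row.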

Data Loader Terminology

Data Loader Term	Description
Scheduler	An external tool that schedules data loads from a streaming data source into Vertica.
Microbatch	A single segment of a data load from a streaming data source. It contains all of the information the scheduler needs to perform a load from a streaming data source into Vertica.
Frame	The window of time during which the scheduler executes microbatches to load data. This window controls the duration of each COPY statement the scheduler runs as part of a microbatch. During the frame, the scheduler gives an active microbatch from each source an opportunity to load data. Based on the history of previous microbatches, it gives priority to microbatches that need more time to load data.
Stream	A feed of messages that is identified by a source and partition. The offset uniquely identifies the position within a particular source-partition stream.
Lane	A thread within a job scheduler instance that issues microbatches to perform the load. The number of lanes available is based on the PlannedConcurrency of the job scheduler's resource pool. Multiple lanes allow the scheduler to run microbatches for different sources in parallel during a frame.
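
To make the interplay of microbatches, frames, and lanes more concrete, the following Python sketch models one frame. It is a simplified illustration under stated assumptions, not the scheduler's actual algorithm: the Microbatch class, the avg_load_seconds field, and the greedy "longest first, least-busy lane" assignment are all hypothetical stand-ins for "priority based on the history of previous microbatches" and "lanes run microbatches for different sources in parallel".

```python
import heapq
from dataclasses import dataclass


@dataclass
class Microbatch:
    source: str
    avg_load_seconds: float  # hypothetical summary of previous load times


def assign_to_lanes(microbatches, lane_count):
    """Toy model of one frame: order microbatches by how much time they
    have historically needed, then give each to the least-busy lane."""
    # Each heap entry is (busy_seconds, lane_number, assigned_sources).
    lanes = [(0.0, lane, []) for lane in range(lane_count)]
    heapq.heapify(lanes)

    # Microbatches that have needed more time are considered first.
    ordered = sorted(microbatches, key=lambda m: m.avg_load_seconds, reverse=True)
    for mb in ordered:
        busy, lane, sources = heapq.heappop(lanes)
        sources.append(mb.source)
        heapq.heappush(lanes, (busy + mb.avg_load_seconds, lane, sources))

    return {lane: sources for _, lane, sources in sorted(lanes, key=lambda t: t[1])}


batches = [
    Microbatch("clicks", 4.0),
    Microbatch("orders", 1.0),
    Microbatch("logs", 2.5),
]
plan = assign_to_lanes(batches, lane_count=2)
print(plan)
```

In this toy run with two lanes, the microbatch with the longest load history gets a lane to itself while the two shorter ones share the other lane. In a real deployment the number of lanes comes from the PlannedConcurrency of the scheduler's resource pool rather than from a function argument.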

© 2007 - 2024 Open Text Corporation
