Data Pipelines: Vertica and Kafka

This blog post was authored by Tom Wall and Soniya Shah.

At Vertica, we want to make it as easy as possible for your Vertica environment to coexist with other tools and technologies. We know that one size does not fit all. Sometimes you need a customized, end-to-end view of your system.

Imagine you’re on a team of mobile game developers. Your game went viral – the user base is growing rapidly and you need to scale it up to handle the load. Additionally, you want to keep users engaged by enhancing the experience based on user feedback. Vertica is great for powering the analytics of user engagement, but to perform that analysis, you need a way to get data into your system.

Apache Kafka is a great fit for this use case: Kafka acts as a high-performance message bus that connects disparate systems. It decouples applications that generate data (producers) from the systems that use the data (consumers), so that they can be managed and scaled independently. It is designed with a streaming abstraction in mind: data continuously arrives from producers and is sent to consumers with scalable throughput and low latency.

Data enters Kafka as messages, which are organized into common categories called topics. Conceptually, a message in a topic is similar to a row in a Vertica table. For scalability, Kafka divides topics into partitions that tools such as Vertica can read and write in parallel.
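To make the topic and partition idea concrete, here is a toy in-memory model in Python. It is only a sketch: the Topic class, its methods, and the CRC32-based partition choice are hypothetical names and assumptions made for illustration, not Kafka's actual client API or partitioning algorithm.

```python
import zlib


class Topic:
    """Toy model of a topic: a named set of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a message and return (partition, offset)."""
        # Assumption for illustration: pick the partition from a CRC32 of the key.
        # Real Kafka clients use their own partitioner.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset, max_messages):
        """Read up to max_messages from one partition, starting at offset."""
        return self.partitions[partition][offset:offset + max_messages]
```

Because each partition is an independent, ordered log, several readers can each take one partition and an offset and work in parallel without coordinating on every message.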

In Vertica, you can use the Kafka integration features to automatically load data into your database as it streams through Kafka. These features include a scheduler that solves many of the problems of designing a consistent, fault-tolerant, scalable load pipeline, so you don’t have to. The scheduler works by:

• Implementing an infinite streaming abstraction by dispatching sequences of small loads known as microbatches
• Tracking the position within the stream atomically alongside the data for exactly-once consumption
• Dynamically adapting to busy workloads with intelligent scheduling heuristics and resource manager integration
• Maintaining configuration and runtime state in Vertica tables, so that it can be managed via CLI and monitored using SQL and tools like Vertica Management Console

The functionality that powers the scheduler is also available for more direct use. If you need more control or customization than the scheduler offers, you can use the same tools and techniques it employs to develop and customize your own robust data pipelines. The Kafka UDx plugin offers several functions for loading, parsing, and metadata operations against Kafka.

Using this functionality, Vertica can act as both a consumer and a producer. In addition to loading data from Kafka, Vertica can send query results and monitoring data to Kafka for use in other systems connected to it. This creates a closed loop of scalable processing functionality and opens up other opportunities to route data to the right systems for the right job. Data can move easily between Vertica and any application or system that has Kafka connectivity, so you can focus less on data infrastructure and more on solving your business needs.

For more information about integrating with Kafka, see the Vertica documentation.