Data Streaming Job Scheduler

The data streaming job scheduler is a tool that continuously loads streaming data into Vertica. The scheduler is packaged with the Vertica RPM and is installed along with it. For information on job scheduler requirements, refer to Vertica Integration for Apache Kafka.

You can use the scheduler from any node by running the vkconfig script:

/opt/vertica/packages/kafka/bin/vkconfig

Note: If you do not want the scheduler to use Vertica host resources, or if you want to limit user access to the Vertica nodes, you can run the scheduler from a separate host: install the Vertica RPM on that host but do not create a database on it.

What the Scheduler Does

A scheduler instance works by creating frames and issuing micro-batches that load data into tables in your Vertica database. During each frame, the scheduler loads data from all enabled sources into their Vertica target tables. As soon as one frame completes, the scheduler schedules the next.

You can add as many sources as you want to a single scheduler. The scheduler then collects data from all of these sources in every frame. This option is helpful if you have a large number of sources.
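The relationship between frames, sources, and micro-batches can be pictured with a small model. The following Python sketch is illustrative only and is not scheduler code: the Source class, the function names, and the source names are hypothetical, and the even split of frame time across enabled sources is an assumption, because this section does not describe how the scheduler divides a frame.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for a configured streaming source."""
    name: str
    enabled: bool = True


def plan_frame(sources, frame_duration_s):
    """Return (source name, time budget) pairs for one frame.

    Assumption: frame time is split evenly across enabled sources.
    Disabled sources receive no micro-batch.
    """
    active = [s for s in sources if s.enabled]
    if not active:
        return []
    budget = frame_duration_s / len(active)
    return [(s.name, budget) for s in active]


def run_frames(sources, frame_duration_s, frame_count):
    """Plan frames back to back: each frame starts when the previous one ends."""
    schedule = []
    for frame_no in range(frame_count):
        start = frame_no * frame_duration_s
        schedule.append((start, plan_frame(sources, frame_duration_s)))
    return schedule


sources = [Source("web_hits"), Source("clicks"), Source("old_topic", enabled=False)]
for start, batches in run_frames(sources, frame_duration_s=10.0, frame_count=2):
    print(start, batches)
```

In this model every enabled source gets one micro-batch per frame, so adding a source shortens the time available to each of the others.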

What Happens When You Create a Scheduler

When you create a new scheduler, the following events occur:

When the script creates the schema and associated tables, it sets the LOCKTIMEOUT session configuration parameter to 0 for the session running the micro-batches. When LOCKTIMEOUT is 0, data loads continuously because the scheduler does not have to wait for a lock to be released. If a table is already locked, Vertica cancels the frame and records an error in the events table.
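The effect of a zero lock timeout can be illustrated with an ordinary in-process lock. The following Python sketch is an analogy under stated assumptions, not Vertica code: run_microbatch, the event text, and the batch names are hypothetical, and a threading.Lock stands in for a table lock.

```python
import threading


def run_microbatch(table_lock, load, events):
    """Try to load without waiting for the lock; record an error if it is held."""
    # blocking=False mirrors a zero lock timeout: fail at once instead of waiting.
    if not table_lock.acquire(blocking=False):
        events.append("ERROR: table locked, frame cancelled")
        return False
    try:
        load()
        return True
    finally:
        table_lock.release()


lock = threading.Lock()
events, loaded = [], []

# The lock is free, so the load runs immediately.
run_microbatch(lock, lambda: loaded.append("batch-1"), events)

# Simulate another session holding the table lock.
lock.acquire()
run_microbatch(lock, lambda: loaded.append("batch-2"), events)  # cancelled, not queued
lock.release()

print(loaded)
print(events)
```

The second call returns immediately and leaves an error record rather than stalling the caller, which is the trade-off the zero timeout makes.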

The script creates the resource pool with defaults that benefit loading data into Vertica. While you can alter this pool to your business needs, OpenText strongly recommends following these guidelines:

Validating Schedulers

When you create or configure a scheduler, Vertica validates the settings that you provide. Vertica checks the following settings:

You can configure validation checking using the --validation-type parameter in Scheduler Utility Options.

Synchronizing Schedulers

By default, Vertica automatically synchronizes source information with host clusters. You can configure the synchronization interval using the --refresh-config Scheduler Utility Option. Vertica synchronizes the following settings:

You can configure synchronization settings using the --auto-sync parameter in Scheduler Utility Options.

Launching a Scheduler

To launch a scheduler, you must have a running streaming instance at a location that Vertica can access. You must also configure the scheduler and set up the sources to stream from before you launch it.

When you launch a scheduler, the scheduler collects data from your sources, starting at the specified offset. You can view the stream_microbatch_history table to see what the scheduler is doing at any given time.

To learn how to create, configure, and launch a scheduler, see Using Streaming Data with Vertica in this guide.

You can also choose to bypass the scheduler. For example, you might want to do a single load with a specific range of offsets. For more information, see Using COPY with Data Streaming in this guide.

Launching Multiple Schedulers for High Availability

For high availability, you can launch two or more identical schedulers that target the same configuration schema. To tell the schedulers apart, give each one a name with the --instance-name option, described in Launch Utility Options. Only one scheduler is active at a time. The others remain in stand-by mode and take over scheduling only if the active scheduler fails, which allows the stream to continue without interruption.
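As a minimal sketch, the launch commands for two such schedulers could be assembled as follows. The instance names sched_a and sched_b and the launch_command helper are hypothetical, weblog.conf is the configuration file used in the example later in this section, and the sketch assumes that both instances read the same configuration file. The code only prints the commands and does not run them.

```python
import shlex

# Path to the vkconfig script, as given earlier in this section.
VKCONFIG = "/opt/vertica/packages/kafka/bin/vkconfig"


def launch_command(conf_file, instance_name):
    """Build the argument list that launches one named scheduler instance."""
    return [VKCONFIG, "launch", "--conf", conf_file, "--instance-name", instance_name]


# Both instances share one configuration, so they target the same schema.
for name in ("sched_a", "sched_b"):
    print(shlex.join(launch_command("weblog.conf", name)))
```

Each printed command would be run on the host where that instance should live. Which instance becomes active is decided by the schedulers themselves, not by this helper.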

Viewing Schedulers from the MC

You can also view the status of Kafka jobs from Management Console (MC). For more information, refer to Viewing Load History.

Updating Schedulers After Vertica Upgrades

A scheduler is compatible only with the version of Vertica that created it. When you upgrade Vertica to a new major version or service pack, you must also update your schedulers using the --upgrade option before you can restart them. If you try to launch a scheduler that has not been updated, the launch fails with an error message. For example:

$ vkconfig launch --conf weblog.conf
com.vertica.solutions.kafka.exception.FatalException: Configured scheduler schema and current scheduler configuration schema version do not match. Upgrade configuration by running: vkconfig scheduler --upgrade
	at com.vertica.solutions.kafka.scheduler.StreamCoordinator.assertVersion(StreamCoordinator.java:64)
	at com.vertica.solutions.kafka.scheduler.StreamCoordinator.run(StreamCoordinator.java:125)
	at com.vertica.solutions.kafka.Launcher.run(Launcher.java:205)
	at com.vertica.solutions.kafka.Launcher.main(Launcher.java:258)
Scheduler instance failed. Check log file.
$ vkconfig scheduler --upgrade --conf weblog.conf
Checking if UPGRADE necessary...
UPGRADE required, running UPGRADE...
UPGRADE completed successfully, now the scheduler configuration schema version is v8.1.1
$ vkconfig launch --conf weblog.conf
                   .  .  .
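The upgrade-then-launch sequence in the transcript above could be wrapped in a small helper. The following Python sketch rests on assumptions that this section does not confirm: that vkconfig returns a nonzero exit status when a command fails, and that running scheduler --upgrade is harmless when no upgrade is needed (the "Checking if UPGRADE necessary..." line suggests a check happens first). The upgrade_then_launch function and its run parameter are hypothetical.

```python
import subprocess

# Path to the vkconfig script, as given earlier in this section.
VKCONFIG = "/opt/vertica/packages/kafka/bin/vkconfig"


def upgrade_then_launch(conf_file, run=subprocess.call):
    """Run the schema upgrade check, then launch the scheduler.

    `run` takes an argument list and returns an exit status. It is injectable
    so that the sequence can be exercised without a Vertica installation.
    """
    upgrade = [VKCONFIG, "scheduler", "--upgrade", "--conf", conf_file]
    launch = [VKCONFIG, "launch", "--conf", conf_file]

    status = run(upgrade)
    if status != 0:
        # Assumption: nonzero means the upgrade failed, so do not launch.
        return status
    # With the default runner, this call does not return until vkconfig exits.
    return run(launch)


# Example call (requires a Vertica installation):
# upgrade_then_launch("weblog.conf")
```

Passing a stand-in function as run makes it possible to check the order of the commands before pointing the helper at a real scheduler.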