Data Streaming Job Scheduler
The data streaming job scheduler is a tool that continuously loads streaming data into Vertica. The scheduler is packaged with the Vertica RPM and is installed along with it. For information on job scheduler requirements, refer to Vertica Integration for Apache Kafka.
You can use the scheduler from any node by running the vkconfig script:
/opt/vertica/packages/kafka/bin/vkconfig
Note: If you do not want the scheduler to use Vertica host resources, or if you want to limit user access to the Vertica nodes, install the Vertica RPM on a separate host but do not create a database on that host.
What the Scheduler Does
A scheduler instance works by creating frames and issuing micro-batches that load data into tables in your Vertica database. Within a single frame duration, the scheduler loads data from all enabled sources into their Vertica target tables. As soon as one frame completes, the scheduler starts the next.
You can add as many sources as you want to a single scheduler. The scheduler then collects data from all of these sources during each frame. This option is helpful if you have a large number of sources.
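The frame and micro-batch relationship described above can be pictured with a small Python model. This is only an illustrative sketch, not the scheduler's implementation: the Source class, the plan_frame and run_frames helpers, and the even split of the frame across enabled sources are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for a configured streaming source."""
    name: str
    enabled: bool = True


def plan_frame(sources, frame_seconds):
    """Return (source name, time budget) pairs for one frame.

    Assumption: the frame is split evenly across enabled sources.
    The real scheduler's allocation policy may differ.
    """
    active = [s for s in sources if s.enabled]
    if not active:
        return []
    budget = frame_seconds / len(active)
    return [(s.name, budget) for s in active]


def run_frames(sources, frame_seconds, frame_count):
    """Plan frames back to back: each frame follows the previous one."""
    return [(frame_no, plan_frame(sources, frame_seconds))
            for frame_no in range(frame_count)]
```

In this model, a disabled source is simply skipped when the frame is planned, and every enabled source receives one micro-batch per frame.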
What Happens When You Create a Scheduler
When you create a new scheduler, the following events occur:
- The script creates a new Vertica schema with a name you specify (default is stream_config). You use this name to identify the scheduler during configuration.
- The script creates Data Streaming Schema Tables for the Vertica schema.
- The script creates the resource pool kafka_default_pool, if it does not already exist.
When the script creates the schema and associated tables, it sets the LOCKTIMEOUT session configuration parameter to 0 for the session running the micro-batches. When LOCKTIMEOUT is 0, data loads continuously because the scheduler does not have to wait for a lock to be released. If a table is already locked, Vertica cancels the frame and records an error in the events table.
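The effect of a lock timeout of 0 resembles a non-blocking lock attempt. The following Python sketch is an analogy only: run_microbatch, the in-memory events list, and the use of threading.Lock are assumptions for illustration and do not reflect how Vertica implements table locks.

```python
import threading


def run_microbatch(table_lock, load, events):
    """Try to load without waiting for the table lock.

    Mirrors a lock timeout of 0: if the lock is already held, give up
    immediately and record an event instead of blocking.
    """
    if not table_lock.acquire(blocking=False):
        events.append("lock unavailable, frame cancelled")
        return False
    try:
        load()
        return True
    finally:
        # Always release so that the next micro-batch can proceed.
        table_lock.release()
```

Because the attempt never blocks, a contended table costs the scheduler one cancelled frame rather than an open-ended wait.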
The script creates the resource pool with defaults that benefit loading data into Vertica. While you can alter this pool to your business needs, OpenText strongly recommends following these guidelines:
- Leave the QUEUETIMEOUT parameter set to 0. This value is the default for job scheduler resource pools. A value of 0 allows data to load continuously. If the scheduler has to wait for resources, it cannot progress, compromising scheduling configurations.
- Leave reflexive moveout enabled. This option is turned on automatically when you create a scheduler. With reflexive moveout turned on, the Tuple Mover automatically performs a moveout operation when data is committed, so that your WOS always has space to load data. For large volumes of data (>100 MB), use a load method of DIRECT.
Validating Schedulers
When you create or configure a scheduler, Vertica validates the settings that you provide. Vertica performs the following checks:
- Confirms that all brokers in the specified cluster exist.
- Confirms that the specified source exists.
- Compares the host being configured to all existing cluster hosts. If the host already exists, Vertica cancels the configuration.
- Verifies that the number of partitions equals the number provided by the user. If no number of partitions is specified, Vertica sets the value to the number of partitions the source has in the cluster.
- Compares the cluster host list to the cluster being configured. If the cluster already exists, Vertica cancels the configuration.
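Some of the checks listed above can be expressed as a short Python sketch. It covers only the source, partition, and host checks, and every name in it (ValidationError, validate_source, validate_cluster, and the dictionary shapes) is hypothetical; the actual validation logic is internal to the scheduler.

```python
class ValidationError(Exception):
    """Raised when a check fails and the configuration is cancelled."""


def validate_source(name, cluster_sources, partitions=None):
    """Check one source and return the partition count to record.

    cluster_sources maps each source name in the cluster to its
    partition count.
    """
    if name not in cluster_sources:
        raise ValidationError(f"source {name!r} does not exist")
    actual = cluster_sources[name]
    if partitions is None:
        # No count supplied: fall back to what the cluster reports.
        return actual
    if partitions != actual:
        raise ValidationError(
            f"source {name!r} has {actual} partitions, not {partitions}")
    return partitions


def validate_cluster(hosts, existing_clusters):
    """Reject a cluster whose hosts are already part of a known cluster.

    existing_clusters maps a cluster name to its list of hosts.
    """
    new_hosts = set(hosts)
    for cluster_name, known_hosts in existing_clusters.items():
        overlap = new_hosts & set(known_hosts)
        if overlap:
            raise ValidationError(
                f"hosts {sorted(overlap)} already belong to {cluster_name!r}")
```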
You can configure validation checking using the --validation-type parameter in Scheduler Utility Options.
Synchronizing Schedulers
By default, Vertica automatically synchronizes source information with host clusters. You can configure the synchronization interval using the --refresh-config Scheduler Utility Option. Vertica synchronizes the following settings:
- Updates the broker list for each cluster.
- Confirms that each source exists. If a source does not exist, Vertica disables it. If Vertica cannot reach the cluster at all, it takes no action.
- Updates the number of partitions for each source.
You can configure synchronization settings using the --auto-sync parameter in Scheduler Utility Options.
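A single synchronization pass, as described in the list above, might look like the following Python sketch. The sync_cluster function, the shape of the config dictionary, and the fetch_metadata callable are assumptions made for the example; they are not part of the scheduler's interface.

```python
def sync_cluster(config, fetch_metadata):
    """Refresh one cluster's cached settings in place.

    config:         {"brokers": [...],
                     "sources": {name: {"enabled": bool, "partitions": int}}}
    fetch_metadata: callable that returns
                    {"brokers": [...], "topics": {name: partition count}}
                    or raises ConnectionError if the cluster is unreachable.
    Returns True if the settings were synchronized, False otherwise.
    """
    try:
        meta = fetch_metadata()
    except ConnectionError:
        # Cluster unreachable: take no action, keep the cached settings.
        return False

    config["brokers"] = list(meta["brokers"])
    for name, source in config["sources"].items():
        if name not in meta["topics"]:
            # The source no longer exists in the cluster: disable it.
            source["enabled"] = False
        else:
            source["partitions"] = meta["topics"][name]
    return True
```

The sketch separates "the source is gone" from "the cluster cannot be reached": only the first case disables a source.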
Launching a Scheduler
To launch a scheduler, you must have a running streaming instance at a location that Vertica can access. Additionally, you must configure the scheduler and set up sources for streaming.
When you launch a scheduler, the scheduler collects data from your sources, starting at the specified offset. You can view the stream_microbatch_history table to see what the scheduler is doing at any given time.
To learn how to create, configure, and launch a scheduler, see Using Streaming Data with Vertica in this guide.
You can also choose to bypass the scheduler. For example, you might want to do a single load with a specific range of offsets. For more information, see Using COPY with Data Streaming in this guide.
Launching Multiple Schedulers for High Availability
For high availability, you can launch two or more identical schedulers that target the same configuration schema. You can distinguish these schedulers using the --instance-name CLI option with the Launch Utility Options. A scheduler that is not active remains in stand-by mode and performs scheduling only if the active scheduler fails. In that case, the stand-by scheduler takes over so that the stream continues without interruption.
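The active and stand-by roles can be pictured as instances competing for a single shared lease. The Python sketch below is a toy model only. This section does not describe how the schedulers coordinate, so the SchedulerLease class and its methods are hypothetical and are not Vertica's mechanism.

```python
class SchedulerLease:
    """Hypothetical shared record of which scheduler instance is active."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, instance_name):
        """Become active if no other instance holds the lease."""
        if self.holder is None:
            self.holder = instance_name
        return self.holder == instance_name

    def release(self, instance_name):
        """Give up the lease, for example when the active instance fails."""
        if self.holder == instance_name:
            self.holder = None
```

In this model, a stand-by instance keeps calling try_acquire and starts scheduling only once the call succeeds, which happens after the active instance has released the lease.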
Viewing Schedulers from the MC
You can also view the status of Kafka jobs from the Management Console (MC). For more information, refer to Viewing Load History.
Updating Schedulers After Vertica Upgrades
A scheduler is compatible only with the version of Vertica that created it. When you upgrade Vertica to a new major version or service pack, you must also update your schedulers using the --upgrade option before you can restart them. If you try to launch a scheduler that has not been updated, the launch fails with an error message. For example:
$ vkconfig launch --conf weblog.conf
com.vertica.solutions.kafka.exception.FatalException: Configured scheduler schema and current scheduler configuration schema version do not match. Upgrade configuration by running: vkconfig scheduler --upgrade
    at com.vertica.solutions.kafka.scheduler.StreamCoordinator.assertVersion(StreamCoordinator.java:64)
    at com.vertica.solutions.kafka.scheduler.StreamCoordinator.run(StreamCoordinator.java:125)
    at com.vertica.solutions.kafka.Launcher.run(Launcher.java:205)
    at com.vertica.solutions.kafka.Launcher.main(Launcher.java:258)
Scheduler instance failed. Check log file. Check log file.
$ vkconfig scheduler --upgrade --conf weblog.conf
Checking if UPGRADE necessary...
UPGRADE required, running UPGRADE...
UPGRADE completed successfully, now the scheduler configuration schema version is v8.1.1
$ vkconfig launch --conf weblog.conf
. . .