Best Practices for Preparing Your Cluster for Rebalance

This document describes how to prepare your cluster for rebalance and how to estimate how long it will take to rebalance your database.

This document was co-authored by Rakesh Bankula and Soniya Shah.

How to Prepare Your Cluster for Rebalance

The following section describes recommendations for preparing your cluster for rebalance. Some of these recommendations are required, and others are optional, but strongly advised.

Required Operations

Cleanup

The time it takes to rebalance your database depends on the size of your data and the number of objects in the database, such as schemas, tables, projections, and ROS containers. For best results, perform the following:

  • Drop schemas and tables that are not required or redundant.
  • Check tables that contain multiple projections and drop redundant projections. Use the following query to determine the number of projections per table:

=> SELECT projection_schema, anchor_table_name, COUNT(DISTINCT projection_name) proj_count 
FROM projections GROUP BY 1,2 ORDER BY 3 DESC;

  • Check fact tables that contain a large number of ROS containers and drop any older partitions that are no longer needed. You can use the following query to get information about tables that have a large number of partitions:

=> SELECT table_schema, projection_name, COUNT(DISTINCT partition_key) partition_count 
FROM partitions GROUP BY 1,2 ORDER BY 3 DESC;

Advance AHM

If your tables contain many deleted records and the AHM epoch is held back too many epochs behind the current epoch, rebalance can take longer. This happens because ROS container splits may run into replay deletes during rebalance and consume more time. To prevent this issue, advance the AHM by running the following:

=> SELECT make_ahm_now();

Verify that the AHM is one epoch behind the current epoch by running the following:

=> SELECT get_ahm_epoch(), get_last_good_epoch(), get_current_epoch();

After you verify the AHM is updated, Vertica recommends that you purge any deleted records before you begin your database rebalance. In particular, consider purging any deleted records from fact tables that contain millions of records.
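
For example, assuming a hypothetical fact table named public.fact_sales, you can purge its deleted records with the PURGE_TABLE meta-function:

=> SELECT PURGE_TABLE('public.fact_sales');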

Configuration Parameter Setting

Vertica runs queries on the storage_containers system table multiple times during the rebalance process to determine the order in which tables can be rebalanced without running out of disk space. In databases that contain a large number of ROS containers, this query can take several minutes to return results, and running it multiple times lengthens the overall rebalance time.

If disk space is not a concern in your database, you can avoid running these queries by setting the configuration parameter RebalanceQueryStorageContainers to 0. This parameter is available in Vertica 7.2.3-2 and later.
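
For example, a minimal sketch using the SET_CONFIG_PARAMETER meta-function (the exact syntax for setting configuration parameters varies by Vertica version):

=> SELECT SET_CONFIG_PARAMETER('RebalanceQueryStorageContainers', 0);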

Resource Pool Settings

On the Refresh pool, set MAXCONCURRENCY and PLANNEDCONCURRENCY equal to the number of CPU cores on your nodes.

If you are performing a rebalance during a maintenance window, temporarily release memory from any user-defined resource pools that have an initial MEMORYSIZE set by running ALTER RESOURCE POOL to set MEMORYSIZE to 0 on those pools. After you complete the rebalance, you can revert to the original resource pool settings.
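
For example, on a hypothetical 24-core node with a user-defined pool named etl_pool (the pool name and values are illustrative):

=> ALTER RESOURCE POOL refresh MAXCONCURRENCY 24 PLANNEDCONCURRENCY 24;
=> ALTER RESOURCE POOL etl_pool MEMORYSIZE '0%';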

Check Rebalance Success

After rebalance completes, run the following statement:

=> SELECT get_node_dependencies();

If rebalance completed successfully, this command returns a number of rows equal to the number of nodes in the cluster plus one. For example, on a 10-node cluster, it returns 11 rows, and each row contains a string of 10 binary digits (1s and 0s).

New Nodes

When you select new nodes to add to an existing cluster, verify that the system configuration of the new nodes is identical to or better than that of the existing nodes in the cluster. This configuration includes the number of CPUs, the clock speed of the processors, memory size, disk I/O, and network throughput. To check that the system configuration is identical or better, run vioperf, vnetperf, and vcpuperf on the new nodes and compare the results to the existing nodes.

Configure your operating system parameters on the new nodes to match the parameters on the existing nodes.

Optional Operations

Upgrade

If you are using a Vertica version earlier than 8.0.1-x, consider upgrading to the latest Vertica version to improve your rebalance performance. Rebalance improvements available in Vertica 8.0.1-x and later include increased parallelism when splitting storage containers and more efficient rebalance monitoring tables, such as rebalance_table_status and rebalance_projection_status.

Backup

Perform a backup of your current database before you add or remove nodes and start rebalance activity. Performing a backup is an important and recommended task before making any major configuration changes to your cluster.

Maintenance Window

If possible, perform your rebalance during a maintenance window by stopping ETL and all user queries. Doing so devotes all system resources to the rebalance, helping it complete as quickly as possible. If an exclusive maintenance window is not possible, try to minimize ETL load, update, and delete queries while allowing SELECT queries.

Restart the Database

If possible, restart your database before you begin rebalance. If your database has been up for a long time, restarting helps reduce the catalog size and improves overall rebalance performance.

Estimate Rebalance Time

The database rebalance process involves the following steps:

  1. Splitting ROS containers on each existing node.
  2. Transferring ROS containers to all nodes.
  3. Internal query on system tables to find the disk space used per projection.
  4. Miscellaneous operations, like catalog updates and node dependencies calculations.

You can use the following equation to estimate the time it will take to rebalance your database:

Total time to rebalance the database = ROS split time + ROS transfer time + query on system tables + miscellaneous operations (node dependencies calculation, catalog updates)

1. ROS Split Time Estimation

There is no historical timing information available for splitting ROS containers. However, the time it takes to split a ROS container of a given projection is approximately equal to the time it takes to merge a ROS container of the same size.

To estimate the time needed to split ROS containers, you must find the number of ROS containers, the size of the ROS containers, and the time it takes the Tuple Mover to merge ROS containers of different sizes. The time it takes a Tuple Mover mergeout operation to complete might not be available in any DC or system table for every size of ROS container belonging to every projection in the database. For this estimation, we group the ROS containers into buckets based on size and then find the count of ROS containers, as well as the average mergeout time, for each of these groups.

Average Time to Split ROS Containers of Different Sizes

This calculation excludes mergeout operations that went through replay delete, because a ROS split during rebalance does not go through replay delete if the AHM was advanced before the rebalance started. It also excludes mergeout operations related to user-issued purge operations.

By reviewing more than 250 different scrutinizes collected from different customers with different database configurations, number of nodes, and data sizes, we have calculated the average mergeout time of different ROS container sizes. These timings are listed in the Appendix. You can use these mergeout timings to estimate the ROS split time. For more accurate information, run the query in the Appendix to obtain the times for your database.

Number of ROS Containers

To find the number of ROS containers for each of the bucket sizes, we used the information in the storage_containers system table. The query to find the number of ROS containers for each of the bucket sizes is in the Appendix.

ROS split time = [(number of ROS containers of a given size) x (average TM mergeout time for that ROS size), summed over all size buckets] / (number of CPU cores)

For example on a system with the following configuration:

CPU cores: 24

ROS containers, 1GB in size: 105,901

ROS containers, between 1GB and 2GB in size: 94

Average time Tuple Mover mergeout takes to merge 1GB: 9 seconds

Average time Tuple Mover mergeout takes to merge 2GB: 18 seconds

The equation would look like the following:

ROS split time = [(105,901 ROS containers x 9 seconds) + (94 ROS containers x 18 seconds)] / (24 CPU cores) = 39,783 seconds ≈ 11 hours

2. ROS Transfer Time Estimation

This step depends on network throughput and the amount of data on each node. To estimate the time for this step, we need to find the size of the data on each node and the network speed.
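
As a rough sketch, the transfer time is the amount of data that must move off each node divided by the effective network throughput:

ROS transfer time ≈ (data to be moved per node) / (effective network throughput)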

For example, on a system with the following configuration:

Network throughput: 10Gbps

Data on each node: 3TB

When you double the number of nodes in the cluster and perform rebalance, roughly half the data on each existing node (about 1.5 TB in this example) must move to the new nodes. At 10Gbps (roughly 1.25 GB/s), that transfer takes about 20 minutes in the ideal case; allowing for concurrent reads, writes, and other rebalance overhead, the estimate for ROS transfer time is between 1 and 1.5 hours.

3. Internal System Table Query

Vertica runs queries on the storage_containers system table multiple times during the rebalance process to determine the order in which tables can be rebalanced without running out of disk space. In databases that contain a large number of ROS containers, this query can take several minutes to return results, and running it multiple times lengthens the overall rebalance time.

If disk space is not a concern in your database, you can avoid running these queries by setting the configuration parameter RebalanceQueryStorageContainers to 0. This configuration parameter is available in Vertica version 7.2.3-2 and later. If your database is running an earlier Vertica version, we recommend that you upgrade to the latest version and disable this configuration parameter before you start to rebalance your database.

4. Miscellaneous Rebalance Operations

This step must be performed for each projection in the system. To compute the time required for this step, you must find the number of projections in your database. Based on a review of multiple scrutinizes collected from different customer systems, this step took, on average, about 250 milliseconds per projection.
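
To find the projection count, you can run a simple query against the projections system table (a minimal sketch; depending on your Vertica version, this table may list each buddy projection as a separate row):

=> SELECT COUNT(*) AS projection_count FROM projections;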

For example, on a database with 30,000 projections, both segmented and unsegmented, this step takes about 2 hours (30,000 x 0.25 seconds = 7,500 seconds).
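
Putting the worked examples together gives a rough end-to-end illustration (the system table query time varies by database and is not included here):

Total time ≈ 11 hours (ROS split) + 1 to 1.5 hours (ROS transfer) + about 2 hours (miscellaneous operations) ≈ 14 to 15 hours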

A Customer's Experience with Rebalance

A Vertica customer using Vertica 7.1.2-6 had a cluster of HP Gen8 systems with 24 cores, 256 GB of memory, and a 10Gbps network. Each node had 2.1 TB of data, 25,000 segmented projections, and 11,000 unsegmented projections. This customer completed rebalance after removing 5 nodes from a 16-node cluster in about 13 hours. There were no ETL or user queries running during the 13 hours of rebalance activity.

For More Information

Understanding Rebalancing, Part 1: What Happens During Rebalancing

Understanding Rebalancing, Part 2: Optimizing for Rebalancing

Note: In Vertica 8.0, the new table rebalance_operations is available. You can use this table to monitor your rebalance progress.
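
For example, while a rebalance is running, you can check its progress with a simple query (the available columns vary by Vertica version):

=> SELECT * FROM rebalance_operations;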

Appendix

The merge_time table DDL and query to populate the data:

=> CREATE TABLE merge_time
(ros_size varchar(25), duration_sec float);

The calculated average mergeout time based on 250 different scrutinizes collected from different customers with different clusters and data sizes:

ros_size                 |  time_sec
-------------------------+------------
A_Less_than_100MB        |    0.844
B_Between_100MB_to_200MB |    2.000
C_Between_200MB_to_400MB |    6.833
D_Between_400MB_to_600MB |   11.095
E_Between_600MB_to_800MB |   13.000
F_Between_800MB_to_1GB   |   23.000
G_Between_1GB_to_2GB     |   35.000
H_Between_2GB_to_4GB     |   68.000
I_Between_4GB_to_8GB     |  137.846
J_Between_8GB_to_16GB    |  320.444
K_Between_16GB_to_32GB   |  617.650
L_Greater_than_32GB      | 2528.500

You can insert the average values listed above into the merge_time table or run the following query to get information from your database:

=> INSERT INTO merge_time
SELECT ros_size, max(avg_duration) time_sec FROM (
SELECT
s.node_name,
CASE
WHEN s.total_size_in_bytes < 100000000 THEN 'A_Less_than_100MB'
WHEN (s.total_size_in_bytes >= 100000000 AND
s.total_size_in_bytes < 200000000) THEN 'B_Between_100MB_to_200MB'
WHEN (s.total_size_in_bytes >= 200000000 AND
s.total_size_in_bytes < 400000000) THEN 'C_Between_200MB_to_400MB'
WHEN (s.total_size_in_bytes >= 400000000 AND
s.total_size_in_bytes < 600000000) THEN 'D_Between_400MB_to_600MB'
WHEN (s.total_size_in_bytes >= 600000000 AND
s.total_size_in_bytes < 800000000) THEN 'E_Between_600MB_to_800MB'
WHEN (s.total_size_in_bytes >= 800000000 AND
s.total_size_in_bytes < 1000000000) THEN 'F_Between_800MB_to_1GB'
WHEN (s.total_size_in_bytes >= 1000000000 AND
s.total_size_in_bytes < 2000000000) THEN 'G_Between_1GB_to_2GB'
WHEN (s.total_size_in_bytes >= 2000000000 AND
s.total_size_in_bytes < 4000000000) THEN 'H_Between_2GB_to_4GB'
WHEN (s.total_size_in_bytes >= 4000000000 AND
s.total_size_in_bytes < 8000000000) THEN 'I_Between_4GB_to_8GB'
WHEN (s.total_size_in_bytes >= 8000000000 AND
s.total_size_in_bytes < 16000000000) THEN 'J_Between_8GB_to_16GB'
WHEN (s.total_size_in_bytes >= 16000000000 AND
s.total_size_in_bytes < 32000000000) THEN 'K_Between_16GB_to_32GB'
ELSE 'L_Greater_than_32GB'
END AS ros_size,
avg(DATEDIFF(SECOND, s.time, c.time)) AS avg_duration
FROM dc_tuple_mover_events s
JOIN dc_tuple_mover_events c
ON s.node_name = c.node_name
AND s.projection_oid = c.projection_oid
AND s.transaction_id = c.transaction_id
AND s.session_id = c.session_id
WHERE s.operation = 'Mergeout'
AND c.operation = 'Mergeout'
AND s.event = 'Start'
AND c.event = 'Complete'
AND s.container_count > 1
AND c.container_count > 1
AND s.transaction_id NOT IN
  (SELECT DISTINCT transaction_id FROM dc_tuple_mover_events
   WHERE event ILIKE '%replay delete%')
GROUP BY 1,2 ) f
GROUP BY 1;

=> COMMIT;

The ROS count table DDL and query to populate the data:

=> CREATE TABLE public.ros_count
(ros_size varchar(25), ros_count int);

=> INSERT INTO ros_count
SELECT ros_size, max(ros_count) max_ros_cnt FROM (
SELECT
node_name,
CASE
WHEN used_bytes < 100000000 THEN 'A_Less_than_100MB'
WHEN (used_bytes >= 100000000 AND
used_bytes < 200000000) THEN 'B_Between_100MB_to_200MB'
WHEN (used_bytes >= 200000000 AND
used_bytes < 400000000) THEN 'C_Between_200MB_to_400MB'
WHEN (used_bytes >= 400000000 AND
used_bytes < 600000000) THEN 'D_Between_400MB_to_600MB'
WHEN (used_bytes >= 600000000 AND
used_bytes < 800000000) THEN 'E_Between_600MB_to_800MB'
WHEN (used_bytes >= 800000000 AND
used_bytes < 1000000000) THEN 'F_Between_800MB_to_1GB'
WHEN (used_bytes >= 1000000000 AND
used_bytes < 2000000000) THEN 'G_Between_1GB_to_2GB'
WHEN (used_bytes >= 2000000000 AND
used_bytes < 4000000000) THEN 'H_Between_2GB_to_4GB'
WHEN (used_bytes >= 4000000000 AND
used_bytes < 8000000000) THEN 'I_Between_4GB_to_8GB'
WHEN (used_bytes >= 8000000000 AND
used_bytes < 16000000000) THEN 'J_Between_8GB_to_16GB'
WHEN (used_bytes >= 16000000000 AND
used_bytes < 32000000000) THEN 'K_Between_16GB_to_32GB'
ELSE 'L_Greater_than_32GB'
END AS ros_size,
COUNT(*) AS ros_count
FROM storage_containers
GROUP BY 1, 2 ORDER BY 1, 2) f
GROUP BY 1;

=> COMMIT;

The query to calculate the ROS split time:

=> SELECT sum(duration_sec*ros_count) split_time_seconds FROM merge_time m JOIN ros_count r 
ON r.ros_size=m.ros_size;

This time is in seconds. Divide it by the number of CPU cores per node to get the estimated ROS split time.
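
For example, you can fold the division into the query itself; here 24 is a hypothetical per-node core count, so substitute your own:

=> SELECT sum(duration_sec*ros_count)/24 AS split_time_seconds 
FROM merge_time m JOIN ros_count r 
ON r.ros_size=m.ros_size;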