Large Cluster Best Practices

Keep the following best practices in mind when you are planning and managing a large cluster implementation.

Planning the number of control nodes

To assess how many cluster nodes should be control nodes, start with the square root of the total number of nodes you expect to have in the database cluster. This value helps satisfy both data K-safety and rack fault tolerance for the cluster. Depending on the result, you might need to adjust the number of control nodes to match your physical hardware and rack count. For example, if you have 121 nodes (a square root of 11) distributed across 8 racks, you might increase the number of control nodes to 16 so that each rack has two control nodes.
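
As a minimal sketch, the following query derives a starting value from the V_CATALOG.NODES system table on a running database; round the result up or down afterward to fit your rack layout, as described above:

    -- Starting point: square root of the node count, rounded up.
    -- Adjust the result to your rack count (for example, 121 nodes across
    -- 8 racks: 11 rounds up to 16 so that each rack gets two control nodes).
    SELECT CEILING(SQRT(COUNT(*))) AS suggested_control_nodes
    FROM v_catalog.nodes;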

See Planning a Large Cluster Arrangement.

Control node assignment/realignment

After you specify the number of control nodes, you must update the control hosts' Spread configuration files to reflect the catalog change. Certain cluster management functions might require that you run other functions, restart the database, or both.

If, for example, you drop a control node, the cluster nodes that point to it must be reassigned to another control node. If a control node fails, all nodes assigned to it also fail, so you need to use the Administration Tools to restart the database. In this scenario, call the REALIGN_CONTROL_NODES() and RELOAD_SPREAD(true) functions, which notify nodes of the changes and realign fault groups. Calling RELOAD_SPREAD(true) connects existing cluster nodes to their newly assigned control nodes.
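
For example, the realignment described above might be run as the following sequence (a sketch only; whether a database restart is required depends on the scenario):

    -- Reassign cluster nodes whose control node was dropped and realign
    -- fault groups.
    SELECT REALIGN_CONTROL_NODES();

    -- Rewrite the Spread configuration and connect nodes to their newly
    -- assigned control nodes.
    SELECT RELOAD_SPREAD(true);

    -- If the scenario requires it, restart the database through the
    -- Administration Tools.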

On the other hand, running REALIGN_CONTROL_NODES() multiple times in a row does not change the layout beyond the initial realignment, so you do not need to restart the database. However, if you add or drop a node and then run REALIGN_CONTROL_NODES(), the call could change many node assignments.

Adding or dropping nodes, whether they are control nodes or non-control nodes, can change the control node assignments of other nodes in the cluster.

For more information, see Defining and Realigning Control Nodes on an Existing Cluster and Rebalancing Data Across Nodes.

Allocate standby nodes

Have as many standby nodes available as you can, ideally on racks you are already using in the cluster. If a node suffers a non-transient failure, use the Administration Tools "Replace Host" utility to swap in a standby node.

Standby node availability is especially important for control nodes. If the node you are replacing is a control node, all nodes assigned to that control node's host group must be taken offline while you swap in the standby node. For details on node replacement, see Replacing Nodes.
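
Before replacing a host, you can identify nodes that are down with an illustrative query like the following against the V_CATALOG.NODES system table; the actual replacement steps are in Replacing Nodes:

    -- List nodes that are not UP to identify candidates for replacement.
    SELECT node_name, node_state
    FROM v_catalog.nodes
    WHERE node_state <> 'UP';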

Plan for cluster growth

If you plan to expand an existing cluster to 120 or more nodes, you can configure the number of control nodes for the cluster after you add the new nodes. See Defining and Realigning Control Nodes.
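
One possible sequence after adding nodes is sketched below. It assumes your Vertica version provides the SET_CONTROL_SET_SIZE and REBALANCE_CLUSTER meta-functions; see Defining and Realigning Control Nodes for the authoritative procedure, and the value 16 is only a placeholder:

    -- Set the number of control nodes for the enlarged cluster.
    SELECT SET_CONTROL_SET_SIZE(16);

    -- Realign control node assignments and rewrite the Spread configuration.
    SELECT REALIGN_CONTROL_NODES();
    SELECT RELOAD_SPREAD(true);

    -- After restarting the database, rebalance data across the nodes.
    SELECT REBALANCE_CLUSTER();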

Write custom fault groups

When you deploy a large cluster, Vertica automatically creates fault groups around control nodes, placing nodes that share a control node into the same fault group. Alternatively, you can specify which cluster nodes should reside in a particular correlated failure group and share a control node. See High Availability With Fault Groups in Vertica Concepts.
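
For example, the following statements sketch a rack-aligned fault group; the group and node names are placeholders for your own cluster:

    -- Create a fault group for the nodes that share a rack.
    CREATE FAULT GROUP rack1;

    -- Add the rack's nodes to the group so they share a control node.
    ALTER FAULT GROUP rack1 ADD NODE v_exampledb_node0001;
    ALTER FAULT GROUP rack1 ADD NODE v_exampledb_node0002;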

Use segmented projections

On large-cluster setups, minimize the use of unsegmented projections in favor of segmented projections. When you use segmented projections, Vertica creates buddy projections and distributes copies of segmented projections across database nodes. If a node fails, data remains available on the other cluster nodes.
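
For example, a segmented projection over a hypothetical sales table might look like the following sketch; specifying KSAFE tells Vertica to create the buddy projections:

    -- Segment the projection across all nodes; table and column names are
    -- placeholders. KSAFE 1 creates buddy projections for fault tolerance.
    CREATE PROJECTION sales_seg AS
    SELECT order_id, customer_id, amount
    FROM sales
    SEGMENTED BY HASH(order_id) ALL NODES KSAFE 1;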

Use the Database Designer

OpenText recommends that you use the Database Designer to create your physical schema. If you choose to design projections manually, segment large tables across all database nodes and create unsegmented (replicated) projections for small tables on all database nodes.
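
For example, a replicated projection for a small, hypothetical dimension table might be sketched as follows:

    -- Replicate a small dimension table on every node instead of segmenting
    -- it; table and column names are placeholders.
    CREATE PROJECTION region_dim_rep AS
    SELECT region_id, region_name
    FROM region_dim
    UNSEGMENTED ALL NODES;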