Adjusting Spread Daemon Timeouts For Virtual Environments

You may see Vertica nodes leave the database even though they are still running. This issue can happen on networks that are prone to spikes in latency or in virtual environments where a node's VM may be paused for a short period of time. You can adjust a setting in Vertica to help prevent this issue from occurring.

Vertica relies on spread daemons to pass messages between database nodes. When a node fails to respond to a spread message after a timeout period, Vertica assumes the node is down and starts to remove it from the database.

By default, Vertica sets the spread timeout period to 8 seconds when the IP addresses of all nodes (or all control nodes, if the database is using Large Cluster) start with the same two bytes. For example, Vertica sets the spread timeout to 8 seconds if the IP addresses of all your nodes fall within a single range that shares the same first two bytes. When the nodes' IP addresses do not all start with the same two bytes, Vertica sets the timeout to 25 seconds.
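The default rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not Vertica's actual implementation. The function name, the constants, and the sample addresses are hypothetical, and the sketch assumes IPv4 addresses.

```python
import ipaddress

SAME_PREFIX_TIMEOUT_MS = 8000    # all nodes share the first two bytes
MIXED_PREFIX_TIMEOUT_MS = 25000  # nodes do not share the first two bytes


def default_spread_timeout_ms(addresses):
    """Return the default timeout (ms) that the rule above implies for IPv4 nodes."""
    if not addresses:
        raise ValueError("at least one node address is required")
    # .packed yields the four raw bytes of an IPv4 address; keep the first two.
    prefixes = {ipaddress.IPv4Address(a).packed[:2] for a in addresses}
    return SAME_PREFIX_TIMEOUT_MS if len(prefixes) == 1 else MIXED_PREFIX_TIMEOUT_MS


# Hypothetical addresses, chosen only for illustration.
print(default_spread_timeout_ms(["10.20.0.11", "10.20.3.12", "10.20.7.13"]))  # 8000
print(default_spread_timeout_ms(["10.20.0.11", "10.21.0.12"]))                # 25000
```

For a Large Cluster database, you would pass only the control node addresses.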

If network delays or temporary pauses of a VM last longer than the spread timeout period, you may see up nodes leave the database. In these cases, you can increase the spread timeout to reduce or eliminate instances where up nodes leave the database.

Azure's Memory-Preserving Updates and Spread Timeouts

In Azure, you may see running nodes leave the database due to scheduled maintenance. Azure's maintenance downtime is usually well-defined. For example, Azure's memory-preserving updates can pause a VM for up to 30 seconds while performing maintenance on the system hosting the VM. This pause does not disrupt the node, which continues normal operation once Azure resumes it. See the Azure documentation's topic on Maintenance for virtual machines in Azure for more information about updates. If Azure pauses a node for longer than the spread timeout period, Vertica interprets the node's failure to respond to a spread message as the node going down, even though the node will resume running normally.

If you deploy your Vertica cluster using the Azure Marketplace, the spread timeout defaults to 35 seconds. If you manually create your cluster in Azure, the spread timeout defaults to 8 or 25 seconds, as described earlier.

Setting the Spread Timeout

When you know your network or nodes may be unable to respond for a specific amount of time, you can increase the spread timeout period to longer than this time. Adjust the timeout to the period of time the node may be unable to respond, plus an additional 5 seconds as a safety margin.

For example, if you know Azure's memory-preserving maintenance can pause your VMs for up to 30 seconds, set the spread timeout to 35 seconds.
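The arithmetic above is simple, but the unit conversion is easy to get wrong: the SET_SPREAD_OPTION function described later in this section takes the timeout as a string in milliseconds. The following sketch shows one way to derive that value. The function names and the constant are hypothetical helpers, not part of Vertica.

```python
SAFETY_MARGIN_SECONDS = 5  # safety margin recommended in the text above


def token_timeout_value(max_pause_seconds, margin_seconds=SAFETY_MARGIN_SECONDS):
    """Return the timeout as a millisecond string: longest expected pause plus margin."""
    if max_pause_seconds < 0 or margin_seconds < 0:
        raise ValueError("durations must not be negative")
    return str(round((max_pause_seconds + margin_seconds) * 1000))


def set_timeout_statement(max_pause_seconds):
    """Build the SQL statement that applies the computed timeout."""
    value = token_timeout_value(max_pause_seconds)
    return f"SELECT SET_SPREAD_OPTION('TokenTimeout', '{value}');"


# Azure memory-preserving updates can pause a VM for up to 30 seconds.
print(token_timeout_value(30))    # 35000
print(set_timeout_statement(30))  # SELECT SET_SPREAD_OPTION('TokenTimeout', '35000');
```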

If you do not know exactly how long network or node disruptions can last, you can try increasing the spread timeout gradually, until you see reduced instances of up nodes leaving the database. Be as conservative with this setting as you can.

Vertica cannot react to a node going down or being shut down improperly until the timeout period has elapsed. Setting spread’s timeout too high can therefore lengthen the delay before queries restart after a node goes down.

You can see the current setting of the spread timeout by querying the SPREAD_STATE table:

=> SELECT * FROM V_MONITOR.SPREAD_STATE;
    node_name     | token_timeout
------------------+---------------
 v_vmart_node0003 |          8000
 v_vmart_node0001 |          8000
 v_vmart_node0002 |          8000
(3 rows)

You change the spread timeout using the SET_SPREAD_OPTION function to set the token timeout to a new value. This value is a string, and sets the timeout in milliseconds.

Changing spread settings using SET_SPREAD_OPTION will have a minor impact on your cluster as it pauses while the change propagates to the entire cluster.

This example sets the timeout to 35 seconds (35000ms):

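=> SELECT SET_SPREAD_OPTION( 'TokenTimeout', '35000');
NOTICE 9003:  Spread has been notified about the change
                   SET_SPREAD_OPTION
-------------------------------------------------------
 Spread option 'TokenTimeout' has been set to '35000'.

(1 row)

=> SELECT * FROM V_MONITOR.SPREAD_STATE;
    node_name     | token_timeout
------------------+---------------
 v_vmart_node0001 |         35000
 v_vmart_node0002 |         35000
 v_vmart_node0003 |         35000
(3 rows)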

The changes you make to the spread timeout may not take effect immediately. It may take some time before you see the settings change in the V_MONITOR.SPREAD_STATE table.
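Because the change may take time to appear, it can help to check which nodes still report the old value. The sketch below assumes you have already fetched the node_name and token_timeout columns from V_MONITOR.SPREAD_STATE into a list of tuples using whatever client you use. The function name and the sample rows are hypothetical.

```python
def nodes_pending(rows, expected_ms):
    """Return the names of nodes whose token_timeout does not yet match expected_ms."""
    return sorted(name for name, timeout in rows if int(timeout) != expected_ms)


# Hypothetical snapshot taken shortly after setting the timeout to 35000 ms.
rows = [
    ("v_vmart_node0001", 35000),
    ("v_vmart_node0002", 8000),
    ("v_vmart_node0003", 35000),
]
print(nodes_pending(rows, 35000))  # ['v_vmart_node0002']
```

An empty list means every node in the snapshot reports the new timeout.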
