Configuring TCP Idle Settings for Long-Running Idle Sessions

Posted August 14, 2017 by Soniya Shah, Information Developer


Important: Whenever this post recommends changing a setting, change it on all nodes in the cluster. Running different settings on different nodes is not advisable.

Have you ever encountered one of the following types of errors?

==> VSQL

    vsql => select sleep(3600);
    ERROR: could not receive data from server: Connection timed out

==> rsync

    rsync: connection unexpectedly closed (0 bytes received so far) [sender]
    rsync error: unexplained error (code 255) at io.c(601) [sender=3.0.7]

==> vertica.log

    @v_db_node0018: 08006/5167: Unexpected EOF on client connection
    @v_db_node0018: 00000/4719: Session v_db_node0018-291924:0x4dbf8 ended; closing connection (connCnt 3)

With each of these errors, the server reports that the client disconnected, and the client reports that it was disconnected from the server. The root cause is an intermediate network element (such as a firewall, gateway, switch, router, or load balancer) whose idle timeout threshold is lower than the hosts' keepalive time, so it drops the idle connection and cuts the client and the server off from one another.

You may run into these errors if you:

• Work in a cloud environment
• Run copy cluster or backup operations across WAN links
• Execute long-running queries or reports
• Keep connections to the database open with little or no activity

If you encounter these errors, check the following parameters on both your client and the Vertica server:

    # cat /proc/sys/net/ipv4/tcp_keepalive_time
    7200
    # cat /proc/sys/net/ipv4/tcp_keepalive_intvl
    75
    # cat /proc/sys/net/ipv4/tcp_keepalive_probes
    9

The first two parameters are expressed in seconds, and the last is a plain count. With these values, the keepalive routines wait for two hours (7200 seconds) of idle time before sending the first keepalive probe, and then resend the probe every 75 seconds. If nine consecutive probes receive no ACK response, the connection is marked as broken.
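To make the arithmetic concrete, here is a minimal sketch that adds up the worst-case time before an idle connection to a dead peer is marked as broken. The function name keepalive_detect_seconds is made up for this illustration; the inputs are the three parameters described above.

```shell
#!/bin/sh
# Worst-case seconds before an idle connection to a dead peer is marked broken:
# idle wait before the first probe + (probe interval * unanswered probes).
# The function name is hypothetical, chosen only for this example.
keepalive_detect_seconds() {
    # $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
    echo $(( $1 + $2 * $3 ))
}

keepalive_detect_seconds 7200 75 9   # default values shown above: prints 7875
keepalive_detect_seconds 600 60 20   # values suggested later in this post: prints 1800
```

For an intermediate device with its own idle timeout, the first term matters most: the first probe only goes out after tcp_keepalive_time seconds of silence, so that value needs to be lower than the device's timeout.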

To modify these values, you must write new values into the files. For example, suppose you decide to configure the host so that keepalive routines start after 10 minutes of channel inactivity, and then send probes at intervals of one minute.

You can change the settings by running the following as root. Note that for the new settings to take effect, you must restart the process:

    # echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
    # echo 60 > /proc/sys/net/ipv4/tcp_keepalive_intvl
    # echo 20 > /proc/sys/net/ipv4/tcp_keepalive_probes

Values written to /proc do not survive a reboot. Use sysctl to make them persistent by adding the following lines to /etc/sysctl.conf. You must make these changes on all Vertica nodes and relevant SQL clients:

    net.ipv4.tcp_keepalive_intvl = 60
    net.ipv4.tcp_keepalive_probes = 20
    net.ipv4.tcp_keepalive_time = 600

If you are using AWS Elastic Load Balancing, you must also increase the load balancer's idle timeout to a value greater than net.ipv4.tcp_keepalive_time. For more information, see AWS Elastic Load Balancing with Vertica.
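The following sketch generates the /etc/sysctl.conf lines from three values and checks that the first probe would be sent before an intermediate device's idle timeout expires. The function names and the 900-second timeout are assumptions made for illustration; substitute the real timeout of your firewall or load balancer.

```shell
#!/bin/sh
# Print a sysctl.conf fragment for the given keepalive values.
# $1 = keepalive time (s), $2 = probe interval (s), $3 = probe count
# Function names here are hypothetical helpers, not part of any tool.
render_keepalive_conf() {
    printf 'net.ipv4.tcp_keepalive_time = %s\n' "$1"
    printf 'net.ipv4.tcp_keepalive_intvl = %s\n' "$2"
    printf 'net.ipv4.tcp_keepalive_probes = %s\n' "$3"
}

# Succeed only if the first probe is sent before the device's idle timeout.
# $1 = keepalive time (s), $2 = idle timeout of the intermediate device (s)
keepalive_beats_idle_timeout() {
    [ "$1" -lt "$2" ]
}

render_keepalive_conf 600 60 20
if keepalive_beats_idle_timeout 600 900; then   # 900 is a made-up example timeout
    echo "ok: first probe is sent before the idle timeout"
else
    echo "warning: idle timeout expires before the first probe"
fi
```

The script only prints text and changes nothing on the host. To apply the result, add the generated lines to /etc/sysctl.conf as root and load them with sysctl -p.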

If you are using a Windows client, the following registry keys must be changed.

Warning

Changes made to the Windows registry happen immediately, and no backup is automatically made. Do not edit the Windows registry unless you are confident about doing so. Microsoft has issued the following warning with respect to the Registry Editor:

Using Registry Editor incorrectly can cause serious, system-wide problems that may require you to re-install Windows to correct them. Microsoft cannot guarantee that any problems resulting from the use of Registry Editor can be solved. Use this tool at your own risk.

Note: What you see when opening or backing up the registry may vary slightly according to your operating system.

These settings may vary depending on your Windows version:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

KeepAliveInterval
Key: Tcpip\Parameters
Value Type: REG_DWORD (time in milliseconds)
Valid Range: 1-0xFFFFFFFF
Default: 1000 (one second)
Change to: 60000 (60 seconds)

KeepAliveTime
Key: Tcpip\Parameters
Value Type: REG_DWORD (time in milliseconds)
Valid Range: 1-0xFFFFFFFF
Default: 7200000 (two hours)
Change to: 600000 (10 minutes)

TcpMaxDataRetransmissions
Key: Tcpip\Parameters
Value Type: REG_DWORD (number)
Valid Range: 0-0xFFFFFFFF
Default: 5
Change to: 20

We recommend lowering the TCP keepalive values so that an idle connection is checked after 10 minutes rather than after two hours.

Notes:

1. Changing your idle timeout configuration might not work. Occasionally, intermediate devices detect keepalives and terminate the connection despite configuration changes. These devices then enforce their own idle settings.

2. It may take a few iterations to find the right setting in your environment, depending on the specific idle timeout configuration of the intermediate device.
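One way to approach those iterations is to hold a session idle for progressively longer periods and note where it first fails. The sketch below is an assumption-laden illustration, not a Vertica tool: probe_idle_limit, simulated_probe, and idle_query are hypothetical names, and idle_query assumes that vsql can connect using your default connection settings.

```shell
#!/bin/sh
# probe_idle_limit CMD DURATION...
# Runs "CMD DURATION" for each duration in the order given and prints the
# first duration at which CMD fails, or "none" if every run succeeds.
probe_idle_limit() {
    probe_cmd=$1
    shift
    for secs in "$@"; do
        if ! "$probe_cmd" "$secs" >/dev/null 2>&1; then
            echo "$secs"
            return 0
        fi
    done
    echo "none"
}

# Hypothetical real probe: keep a vsql session idle for $1 seconds.
# Assumes vsql connects with default settings; defined here but not called.
idle_query() {
    vsql -c "select sleep($1);"
}

# Stand-in probe so the sketch runs anywhere: pretends that connections
# idle for 900 seconds or longer get dropped.
simulated_probe() {
    [ "$1" -lt 900 ]
}

probe_idle_limit simulated_probe 300 600 900 1800   # prints 900
```

Against a real cluster you would call probe_idle_limit idle_query 300 600 900 1800 instead. Each run blocks for the full duration, so expect the whole probe to take as long as the durations added together.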