Load Balancing on Elastic Kubernetes Clusters

Posted March 7, 2023 by Sruthi Anumula, Senior Database Support Engineer

Long-running sessions can fail after you deploy Vertica on Amazon Elastic Kubernetes Service (EKS) with LoadBalancer as the service type. For example, a query that sleeps for 70 seconds loses its connection:

dbadmin@v-sc-0:/$ /opt/vertica/bin/vsql -h internal-acc1a79a37984458b9930acd01cba3f5-782667786.us-east-1.elb.amazonaws.com -c "select sleep(70);"
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost

When the load balancer type is not specified in the YAML file, EKS creates a Classic Load Balancer by default, with a connection idle timeout of 60 seconds. According to the documentation, the service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout annotation can raise this timeout to at most 4000 seconds. But what if a query takes longer than 4000 seconds?

dbadmin@v-sc-0:/$ /opt/vertica/bin/vsql -h internal-acc1a79a37984458b9930acd01cba3f5-782667786.us-east-1.elb.amazonaws.com -c "select sleep(4001);"
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost

There is a solution: instead of a Classic Load Balancer, use a Network Load Balancer (NLB). Vertica recommends an NLB for best performance.

Include the following annotation in the YAML file when deploying the database so that an NLB is created:

serviceName: LoadbalancerService
serviceType: LoadBalancer
serviceAnnotations:
  service.beta.kubernetes.io/aws-load-balancer-type: "nlb-ip"

AWS, however, sets the TCP flow idle timeout of an NLB to 350 seconds by default, and at the time of writing this value could not be changed. Starting with Vertica 11.x, the database supports its own set of keepalive parameters that can replace the TCP keepalive parameter values. By default, all Vertica keepalive parameters are set to 0, which means the TCP keepalive settings are used. Adjust the database keepalive settings so that keepalives are sent less than 350 seconds apart. Use the following statements to change the database keepalive settings:

ALTER DATABASE DEFAULT SET KeepAliveIdleTime = 300;
ALTER DATABASE DEFAULT SET KeepAliveProbeInterval = 60;
ALTER DATABASE DEFAULT SET KeepAliveProbeCount = 20;
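
To see why these values fit under the 350-second limit, the timing can be worked out with a little shell arithmetic. This is only a minimal sketch: first_probe_ok and dead_peer_after are hypothetical helper names, not Vertica or AWS tooling, and the calculation assumes the Vertica parameters behave like the corresponding Linux TCP keepalive settings (idle time before the first probe, interval between unanswered probes, number of unanswered probes before the connection is dropped).

```shell
#!/bin/sh
# Sketch: check keepalive timing against a load balancer idle timeout.
# The helper functions below are illustrative only (assumed names).

# first_probe_ok IDLE_TIME LB_TIMEOUT
# Succeeds when the first keepalive is sent before the load balancer
# would drop the idle flow.
first_probe_ok() {
    [ "$1" -lt "$2" ]
}

# dead_peer_after IDLE_TIME PROBE_INTERVAL PROBE_COUNT
# Prints the approximate number of seconds until an unresponsive peer
# is given up on: the idle time plus every unanswered probe interval.
dead_peer_after() {
    echo $(( $1 + $2 * $3 ))
}

# Values from the ALTER DATABASE statements and the 350-second NLB timeout.
if first_probe_ok 300 350; then
    echo "first keepalive at 300s, before the 350s NLB timeout"
fi
echo "unresponsive peer detected after about $(dead_peer_after 300 60 20)s"
```

With the values above, the first keepalive goes out 50 seconds before the NLB would drop the idle flow, while a peer that stops answering is given up on after roughly 1500 seconds under these assumptions.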

Vertica now completes a sleep test that runs for 355 seconds, which is longer than the NLB idle timeout of 350 seconds:

dbadmin@v1-sc-0:/$ date;vsql -h a21a604ba68664737b89a5bf267147f5-7d71000035490e9f.elb.us-east-1.amazonaws.com -c 'select sleep(355)';date
Tue Feb 21 21:17:18 UTC 2023
sleep
-------
0
(1 row)

Tue Feb 21 21:23:13 UTC 2023
dbadmin@v1-sc-0:/$

References

Configure TCP keepalive with AWS Network Load Balancer
Limiting the Number and Length of Client Connections
AWS Load Balancer Controller Annotations