Using Hadoop Rack Locality to Boost Vertica Performance

Posted May 4, 2017 by Soniya Shah, Information Developer

This blog post was authored by Monica Cellio.

When database nodes are co-located on Hadoop data nodes, Vertica can take advantage of the Hadoop rack configuration to execute queries against ORC and Parquet data. Moving query execution closer to the data reduces network latency and can improve performance.

Vertica automatically uses database nodes that are co-located with the HDFS nodes that contain the data. This feature, called node locality, requires no additional configuration.

When Vertica is co-located on only a subset of HDFS nodes, sometimes there is no database node that is co-located with the data. In that case, performance is usually better if the query uses a database node in the same rack as the data. This feature, new in version 8.1, is called rack locality. Using rack locality, you can see performance improvements with just one Vertica node per rack, which reduces the effort of adding Vertica to an existing Hadoop cluster.
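The preference order described above (same node first, then same rack, then anything else) can be sketched in Python. This is only an illustration of the idea, not Vertica's actual planner logic. The function name `pick_database_node`, the host-to-rack table, and the placement of db01, db02, and db03 on specific data nodes are all assumptions made for the example.

```python
# Illustrative sketch only: Vertica's real node selection is internal to the database.
# HOST_RACK and DB_NODE_HOST are hypothetical example data.

HOST_RACK = {
    "dn11.example.com": "/rack1",
    "dn12.example.com": "/rack1",
    "dn21.example.com": "/rack2",
    "dn22.example.com": "/rack2",
    "dn31.example.com": "/rack3",
}

# Assumed placement: one Vertica node per rack.
DB_NODE_HOST = {
    "db01": "dn11.example.com",
    "db02": "dn21.example.com",
    "db03": "dn31.example.com",
}


def pick_database_node(block_hosts, host_rack, db_node_host):
    """Return (database_node, locality) for data stored on block_hosts.

    Preference order:
      1. a database node on a host that holds the data ("node")
      2. a database node in the same rack as the data ("rack")
      3. any database node ("remote")
    """
    candidates = sorted(db_node_host.items())

    # 1. Node locality: the database node shares a host with the data.
    for db, host in candidates:
        if host in block_hosts:
            return db, "node"

    # 2. Rack locality: the database node shares a rack with the data.
    data_racks = {host_rack[h] for h in block_hosts}
    for db, host in candidates:
        if host_rack[host] in data_racks:
            return db, "rack"

    # 3. No nearby node: fall back to the first database node.
    return candidates[0][0], "remote"


# dn12 hosts no database node, but db01 is in the same rack.
print(pick_database_node({"dn12.example.com"}, HOST_RACK, DB_NODE_HOST))
# prints ('db01', 'rack')
```

With only one database node per rack in this example, every block stored in rack1, rack2, or rack3 resolves to either node or rack locality, which is the situation the paragraph above describes.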

Consider an HDFS cluster with the following layout:

(Not shown: the nodes in each rack are connected by a local network, the rack networks are connected to a public network, and the Vertica nodes are connected by a private network.)

HDFS takes advantage of rack locality too, so the cluster already has a description of the rack structure in a topology mapping file. You can use this to configure Vertica.

Here’s an excerpt from the topology mapping file for the HDFS cluster shown in the diagram:

    [network_topology]
    dn11.example.com=/rack1
    10.20.41.51=/rack1
    dn12.example.com=/rack1
    10.20.41.52=/rack1
    dn13.example.com=/rack1
    10.20.41.53=/rack1
    ...
    dn21.example.com=/rack2
    10.20.41.71=/rack2
    dn22.example.com=/rack2
    10.20.41.72=/rack2
    dn23.example.com=/rack2
    10.20.41.73=/rack2
    ...
    dn31.example.com=/rack3
    10.20.41.91=/rack3
    dn32.example.com=/rack3
    10.20.41.92=/rack3
    dn33.example.com=/rack3
    10.20.41.93=/rack3
    ...

You can use this data to create a Fault Group Description for the Vertica nodes in this cluster. Vertica uses this information to route query execution to the nodes closest to the data.

    /rack1 /rack2 /rack3
    /rack1 = db01
    /rack2 = db02
    /rack3 = db03

Rack locality works with multi-level racks. If one set of racks is in your west-coast data center and another in your east-coast data center, Vertica understands that structure.
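As a rough illustration of how the topology mapping relates to the fault group description, the following Python sketch derives one from the other. It is not a Vertica tool. The helper names `parse_topology` and `fault_group_description` are made up for this example, and the table saying which data node hosts each Vertica node is an assumption, because the topology file itself does not record where the database nodes run. The sketch handles single-level racks only.

```python
# Illustrative sketch: derive a fault group description from a topology mapping.
# The helper names and DB_NODE_HOST placement are hypothetical.

TOPOLOGY = """\
[network_topology]
dn11.example.com=/rack1
10.20.41.51=/rack1
dn21.example.com=/rack2
10.20.41.71=/rack2
dn31.example.com=/rack3
10.20.41.91=/rack3
"""

# Assumed placement of Vertica nodes on HDFS data nodes.
DB_NODE_HOST = {
    "db01": "dn11.example.com",
    "db02": "dn21.example.com",
    "db03": "dn31.example.com",
}


def parse_topology(text):
    """Map each host (name or IP) to its rack, skipping headers and elisions."""
    host_rack = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("[") or "=" not in line:
            continue
        host, rack = line.split("=", 1)
        host_rack[host.strip()] = rack.strip()
    return host_rack


def fault_group_description(host_rack, db_node_host):
    """First line lists the racks; each later line maps a rack to its nodes."""
    nodes_by_rack = {}
    for db, host in sorted(db_node_host.items()):
        nodes_by_rack.setdefault(host_rack[host], []).append(db)
    racks = sorted(nodes_by_rack)
    lines = [" ".join(racks)]
    lines += [f"{rack} = {' '.join(nodes_by_rack[rack])}" for rack in racks]
    return "\n".join(lines)


print(fault_group_description(parse_topology(TOPOLOGY), DB_NODE_HOST))
```

For the sample data, the printed result has the same shape as the fault group description above: a first line naming the racks, followed by one line per rack listing its database nodes.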

See Configuring Rack Locality in the Vertica documentation for instructions on creating and using fault groups.