Hash Joins Versus Merge Joins

The Vertica optimizer implements a join with one of the following algorithms:

The optimizer automatically chooses the most appropriate algorithm to execute a query, given the projections that are available.

Facilitating Merge Joins

To facilitate a merge join, create projections for the joined tables that are sorted on the join predicate columns. The join predicate columns should be the first columns in the ORDER BY clause.

For example, tables first and second are defined as follows, with projections first_p1 and second_p1, respectively. The projections are sorted on data_first and data_second:

CREATE TABLE first ( id INT, data_first INT );
CREATE PROJECTION first_p1 AS SELECT * FROM first ORDER BY data_first;

CREATE TABLE second ( id INT, data_second INT );
CREATE PROJECTION second_p1 AS SELECT * FROM first ORDER BY data_second;

When you join these tables on unsorted columns first.id and second.id, Vertica uses the hash join algorithm:

 EXPLAIN SELECT first.data_first, second.data_second FROM first JOIN second ON first.id = second.id;

 Access Path:
 +-JOIN HASH [Cost: 752, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)

You can facilitate execution of this query with the merge join algorithm by creating projections first_p2 and second_p2, which are sorted on join columns first_p2.id and second_p2.id, respectively:

CREATE PROJECTION first_p2 AS SELECT id, data_first FROM first ORDER BY id SEGMENTED BY hash(id, data_first) ALL NODES;
CREATE PROJECTION second_p2 AS SELECT id, data_second FROM second ORDER BY id SEGMENTED BY hash(id, data_second) ALL NODES;

If the query joins significant amounts of data, the query optimizer uses the merge algorithm:

EXPLAIN SELECT first.data_first, second.data_second FROM first JOIN second ON first.id = second.id;

 Access Path:
 +-JOIN MERGEJOIN(inputs presorted) [Cost: 731, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)

You can also facilitate a merge join by using subqueries to pre-sort the join predicate columns. For example:

SELECT first.id, first.data_first, second.data_second FROM 
  (SELECT * FROM first ORDER BY id ) first JOIN (SELECT * FROM second ORDER BY id) second ON first.id = second.id;