Hash Joins Versus Merge Joins
The Vertica optimizer implements a join with one of the following algorithms:
- Merge join is used when projections of the joined tables are sorted on the join columns. Merge joins are faster and uses less memory than hash joins.
- Hash join is used when projections of the joined tables are not already sorted on the join columns. In this case, the optimizer builds an in-memory hash table on the inner table's join column. The optimizer then scans the outer table for matches to the hash table, and joins data from the two tables accordingly. The cost of performing a hash join is low if the entire hash table can fit in memory. Cost rises significantly if the hash table must be written to disk.
The optimizer automatically chooses the most appropriate algorithm to execute a query, given the projections that are available.
Facilitating Merge Joins
To facilitate a merge join, create projections for the joined tables that are sorted on the join predicate columns. The join predicate columns should be the first columns in the
ORDER BY clause.
For example, tables
second are defined as follows, with projections
second_p1, respectively. The projections are sorted on
CREATE TABLE first ( id INT, data_first INT ); CREATE PROJECTION first_p1 AS SELECT * FROM first ORDER BY data_first; CREATE TABLE second ( id INT, data_second INT ); CREATE PROJECTION second_p1 AS SELECT * FROM first ORDER BY data_second;
When you join these tables on unsorted columns
second.id, Vertica uses the hash join algorithm:
EXPLAIN SELECT first.data_first, second.data_second FROM first JOIN second ON first.id = second.id; Access Path: +-JOIN HASH [Cost: 752, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)
You can facilitate execution of this query with the merge join algorithm by creating projections
second_p2, which are sorted on join columns
CREATE PROJECTION first_p2 AS SELECT id, data_first FROM first ORDER BY id SEGMENTED BY hash(id, data_first) ALL NODES; CREATE PROJECTION second_p2 AS SELECT id, data_second FROM second ORDER BY id SEGMENTED BY hash(id, data_second) ALL NODES;
If the query joins significant amounts of data, the query optimizer uses the merge algorithm:
EXPLAIN SELECT first.data_first, second.data_second FROM first JOIN second ON first.id = second.id; Access Path: +-JOIN MERGEJOIN(inputs presorted) [Cost: 731, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)
You can also facilitate a merge join by using subqueries to pre-sort the join predicate columns. For example:
SELECT first.id, first.data_first, second.data_second FROM (SELECT * FROM first ORDER BY id ) first JOIN (SELECT * FROM second ORDER BY id) second ON first.id = second.id;