
- Merge join is used when projections of the joined tables are sorted on the join columns. Merge joins are faster and use less memory than hash joins.
- Hash join is used when projections of the joined tables are not already sorted on the join columns. In this case, the optimizer builds an in-memory hash table on the inner table’s join column. The optimizer then scans the outer table for matches to the hash table and joins data from the two tables accordingly. The cost of performing a hash join is low if the entire hash table can fit in memory. Cost rises significantly if the hash table must be written to disk.
No problem. Here are two other methods that you can employ to force a merge join over a hash join so that your query can complete.
Example:
If I create these two simple tables:
dbadmin=> create table f (c1 int, c2 varchar(10));
CREATE TABLE
dbadmin=> create table d (c1 int, c2 varchar(10));
CREATE TABLE
A join of the 2 tables will use a hash join:
explain select f.* from f left join d on d.c1 = f.c1;
Access Path:
+-JOIN HASH [LeftOuter] [Cost: 0, Rows: 0] (PATH ID: 1) Outer (LOCAL ROUND ROBIN)
| Join Cond: (d.c1 = f.c1)
| +-- Outer -> SELECT [Cost: 0, Rows: 0] (PATH ID: 2)
| +-- Inner -> SELECT [Cost: 0, Rows: 0] (PATH ID: 3
But I can force a merge join by ordering each join input by the join columns:
explain select f.* from (select * from f order by 1) f left join (select * from d order by 1) d on d.c1 = f.c1;
Access Path:
+-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 0, Rows: 0] (PATH ID: 1)
| Join Cond: (d.c1 = f.c1)
| +-- Outer -> SELECT [Cost: 0, Rows: 0] (PATH ID: 2)
| | +---> SORT [Cost: 0, Rows: 0] (PATH ID: 3)
| | | Order: f.c1 ASC
| +-- Inner -> SELECT [Cost: 0, Rows: 0] (PATH ID: 5)
| | +---> SORT [Cost: 0, Rows: 0] (PATH ID: 6)
| | | Order: d.c1 ASC
I can also use Vertica hints to force a merge join:
explain select /* + SYNTACTIC_JOIN*/ f.* from f left join /*+JTYPE(FM)*/ d on d.c1 = f.c1;
Access Path:
+-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 0, Rows: 0] (PATH ID: 1) Outer (SORT ON JOIN KEY) Inner (SORT ON JOIN KEY)
| Join Cond: (d.c1 = f.c1)
| +-- Outer -> SELECT [Cost: 0, Rows: 0] (PATH ID: 2)
| +-- Inner -> SELECT [Cost: 0, Rows: 0] (PATH ID: 3)
Helpful Links:
https://www.vertica.com/docs/latest/HTML/Content/Authoring/AnalyzingData/Optimizations/HashJoinsVs.MergeJoins.htm
Have fun!