Forcing a Merge Join

Posted April 23, 2019 by Jim Knicely, Big Data Solutions Architect, Vertica

Hand writing the text: Helpful Tips
The Vertica optimizer implements a join with one of the following algorithms:

  • Merge join is used when projections of the joined tables are sorted on the join columns. Merge joins are faster and use less memory than hash joins.
  • Hash join is used when projections of the joined tables are not already sorted on the join columns. In this case, the optimizer builds an in-memory hash table on the inner table’s join column. The optimizer then scans the outer table for matches to the hash table and joins data from the two tables accordingly. The cost of performing a hash join is low if the entire hash table can fit in memory. Cost rises significantly if the hash table must be written to disk.
If you have limited resources (i.e. memory), some of your queries might not be able to perform a hash join, and Vertica will let you know with the “Join inner did not fit in memory” error. Typically, you can get around this error by optimizing your table projections or telling Vertica you want to enable join spill. But what if neither of those solutions is an option?

No problem. Here are two other methods that you can employ to force a merge join over a hash join so that your query can complete.

Example:

If I create these two simple tables: dbadmin=> create table f (c1 int, c2 varchar(10)); CREATE TABLE dbadmin=> create table d (c1 int, c2 varchar(10)); CREATE TABLE A join of the 2 tables will use a hash join: explain select f.* from f left join d on d.c1 = f.c1; Access Path: +-JOIN HASH [LeftOuter] [Cost: 0, Rows: 0] (PATH ID: 1) Outer (LOCAL ROUND ROBIN) | Join Cond: (d.c1 = f.c1) | +-- Outer -> SELECT [Cost: 0, Rows: 0] (PATH ID: 2) | +-- Inner -> SELECT [Cost: 0, Rows: 0] (PATH ID: 3 But I can force a merge join by ordering each join input by the join columns: explain select f.* from (select * from f order by 1) f left join (select * from d order by 1) d on d.c1 = f.c1; Access Path: +-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 0, Rows: 0] (PATH ID: 1) | Join Cond: (d.c1 = f.c1) | +-- Outer -> SELECT [Cost: 0, Rows: 0] (PATH ID: 2) | | +---> SORT [Cost: 0, Rows: 0] (PATH ID: 3) | | | Order: f.c1 ASC | +-- Inner -> SELECT [Cost: 0, Rows: 0] (PATH ID: 5) | | +---> SORT [Cost: 0, Rows: 0] (PATH ID: 6) | | | Order: d.c1 ASC I can also use Vertica hints to force a merge join: explain select /* + SYNTACTIC_JOIN*/ f.* from f left join /*+JTYPE(FM)*/ d on d.c1 = f.c1; Access Path: +-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 0, Rows: 0] (PATH ID: 1) Outer (SORT ON JOIN KEY) Inner (SORT ON JOIN KEY) | Join Cond: (d.c1 = f.c1) | +-- Outer -> SELECT [Cost: 0, Rows: 0] (PATH ID: 2) | +-- Inner -> SELECT [Cost: 0, Rows: 0] (PATH ID: 3) Helpful Links:

https://www.vertica.com/docs/latest/HTML/Content/Authoring/AnalyzingData/Optimizations/HashJoinsVs.MergeJoins.htm

Have fun!