Watch Out for Data Induced “Cross” Joins

Posted May 20, 2019 by Jim Knicely, Big Data Solutions Architect, Vertica

Helpful Tips text with hand and marker over data center background
I recently worked with a client who reported that a query in Vertica would not return data. That is, the query ran forever. The query was pretty simple, joining two 100 billion+ tables together. After some investigation, we realized that the query could potentially return trillions and trillions of records!

Why? Because the join keys actually performed a CROSS JOIN. In this case, it turned out to be bad data in one of the tables. I call this a data-induced cross join, but it was actually just an unexpected many to many join. The point is, be careful with your join conditions!

Example: dbadmin=> CREATE TABLE t1 (c INT); CREATE TABLE dbadmin=> CREATE TABLE t2 (c INT); CREATE TABLE dbadmin=> INSERT INTO t1 SELECT 1; OUTPUT -------- 1 (1 row) dbadmin=> INSERT INTO t1 SELECT 1; OUTPUT -------- 1 (1 row) dbadmin=> INSERT INTO t2 SELECT 1; OUTPUT -------- 1 (1 row) dbadmin=> INSERT INTO t2 SELECT 1; OUTPUT -------- 1 (1 row) dbadmin=> SELECT t1.*, t2.* FROM t1 JOIN t2 USING (c); c | c ---+--- 1 | 1 1 | 1 1 | 1 1 | 1 (4 rows) dbadmin=> SELECT t1.*, t2.* FROM t1 CROSS JOIN t2; c | c ---+--- 1 | 1 1 | 1 1 | 1 1 | 1 (4 rows) See how we are getting a cross join from the inner join? Imagine the result set from joining two tables that each have 100 billion+ records!

Helpful Link:

https://www.vertica.com/docs/latest/HTML/Content/Authoring/AnalyzingData/Queries/Joins/CrossJoins.htm

Have fun!