
You can use range joins to categorize data into buckets. Vertica provides performance optimizations for <, <=, >, >=, and BETWEEN predicates. These optimizations are particularly useful when a column from one table is restricted to be in a range specified by two columns of another table.
Range joins can be long running if you have very large table(s). This post provides you with options to work with range joins by using the following methods:
• Convert range join to inline lookup function—create a new function to group the age into two categories.
• Convert range join to equi join – where a range predicate (such as between) is replaced by an = predicate
The following describes a typical use case for using a range join:
• There is a large fact table (person) with individual age data
• There is a dimension table (buckets), which is used to classify each individual in the person table along two dimensions, based on age
=> CREATE TABLE person (ssn varchar primary key, classid int, dateOfBirth date, dateDeceased date) ORDER BY classid, ssn segmented by hash(classid) all nodes ksafe;
=> CREATE TABLE buckets (classid int NOT NULL, fromValue int NOT NULL, toValue int NOT NULL, description varchar(100) NOT NULL) ORDER BY classid, fromValue, toValue unsegmented all nodes ksafe;
=> SELECT * FROM person;
ssn | classid | dateOfBirth | dateDeceased
--------------+---------+-------------+--------------
170-749-963 | 2 | 1957-04-28 | 1978-04-28
924-435-7894 | 1 | 0198-04-28 | 1994-04-28
25-154-3821 | 1 | 1952-04-28 | 1987-04-28
430-263-3135 | 2 | 1910-04-28 | 1977-04-28
705-218-7617 | 1 | 1964-04-28 | 1992-04-28
502-687-4360 | 2 | 1957-04-28 | 1963-04-28
568-959-8677 | 1 | 1927-04-28 |
692-368-8739 | 2 | 1944-04-28 |
715-83-4558 | 1 | 1954-04-28 |
732-260-3942 | 2 | 1984-04-28 |
121-01-1111 | 1 | |
(11 rows)
=> SELECT * FROM buckets;
classid | fromValue | toValue | description
---------+-----------+---------+-------------
1 | 0 | 2 | infant
1 | 3 | 5 | toddler
1 | 6 | 12 | child
1 | 13 | 19 | teen
1 | 20 | 62 | adult
1 | 63 | 100 | senior
2 | 0 | 18 | dependent
2 | 19 | 100 | taxpayer
(8 rows)
Buckets classify the person based on their age and categorizes them into taxpayers and dependents. Use Query 1 to join the buckets and person table. The description column in this query is a classification of the person based on the data in the buckets table. It is a result of the range join operation on the person’s age (as derived from dateDeceased and dateOfBirth) and the range data in the buckets table (fromValue and toValue).
=> SELECT p.ssn,p.classid,dateOfBirth,dateDeceased,description
FROM person p
JOIN buckets b
ON (p.classid=b.classid
AND
COALESCE(DATEDIFF('year', p.dateOfBirth, COALESCE(p.dateDeceased,SYSDATE)),-5)
BETWEEN b.fromValue AND b.toValue) -- <== This is the range join that can add time to this query. It checks to see if a persons age is between a given range
order by 1,2,3,4,5;
ssn | classid | dateOfBirth | dateDeceased | description
--------------+---------+-------------+--------------+-------------
170-749-963 | 2 | 1957-04-28 | 1978-04-28 | taxpayer
25-154-3821 | 1 | 1952-04-28 | 1987-04-28 | adult
430-263-3135 | 2 | 1910-04-28 | 1977-04-28 | taxpayer
502-687-4360 | 2 | 1957-04-28 | 1963-04-28 | dependent
568-959-8677 | 1 | 1927-04-28 | | senior
692-368-8739 | 2 | 1944-04-28 | | taxpayer
705-218-7617 | 1 | 1964-04-28 | 1992-04-28 | adult
715-83-4558 | 1 | 1954-04-28 | | senior
732-260-3942 | 2 | 1984-04-28 | | taxpayer
(9 rows)
Many times this range join can be slow running and can increase the time it takes for a query to complete. Let’s see two options to eliminate range joins.
Option 1: Convert the range join to inline lookup function
To eliminate the slow running range join, replace the entire join by a materialized column on the person table. This option is only possible if the buckets table is small.The following is an example of how you can modify the person table to eliminate the buckets table entirely:
=> ALTER TABLE person add column description varchar default
case
WHEN classid=2 THEN
( CASE
WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=0 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=18 then 'dependent'
WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=19 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) <=100 then 'tax payer'
ELSE null
END)
when classid=1 then
( CASE
WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=0 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=2 then 'infant'
WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=3 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=5 then 'toddler'
WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=6 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=12 then 'child'
WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=13 and
COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=19 then 'teen'
WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=20 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=62 then 'adult'
WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=63 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=100 then 'senior'
ELSE null
END)
ELSE null END;
Rewrite Query 1 as follows. This query has no range join and the query produces the same output as before.
=> SELECT ssn,classid,dateOfBirth,dateDeceased,description
FROM person
WHERE description is not null
ORDER by 1,2,3,4,5;
ssn | classid | dateOfBirth | dateDeceased | description
--------------+---------+-------------+--------------+-------------
170-749-963 | 2 | 1957-04-28 | 1978-04-28 | tax payer
25-154-3821 | 1 | 1952-04-28 | 1987-04-28 | adult
430-263-3135 | 2 | 1910-04-28 | 1977-04-28 | tax payer
502-687-4360 | 2 | 1957-04-28 | 1963-04-28 | dependent
568-959-8677 | 1 | 1927-04-28 | | senior
692-368-8739 | 2 | 1944-04-28 | | tax payer
705-218-7617 | 1 | 1964-04-28 | 1992-04-28 | adult
715-83-4558 | 1 | 1954-04-28 | | senior
732-260-3942 | 2 | 1984-04-28 | | tax payer
(9 rows)
Option 2: Convert the range join to an equi join
If the buckets table is very large you cannot write a function to materialize the description column. Use another alternative and flatten the buckets table as shown below.Here, we use the buckets2 table instead of the buckets table. The buckets2table is a flattened table and does not have a fromValue and toValue for age. Instead, the table has a single value age, and a corresponding description.
=> SELECT * FROM buckets2;
classid | age | description
---------+-----+-------------
1 | 0 | infant
1 | 1 | infant
1 | 2 | infant
1 | 3 | toddler
1 | 4 | toddler
1 | 5 | toddler
2 | 6 | child
…
…
2 | 0 | dependent
2 | 1 | dependent
2 | 2 | dependent
2 | 3 | dependent
…
2 | 19 | taxpayer
…
2 | 100 | taxpayer
(202 rows)
Rewrite Query 1 as follows. This query has no range join (between clause), instead there are equi-joins (in bold).
=> SELECT * FROM
(
SELECT ssn,p.classid,dateOfBirth,dateDeceased,description FROM person p join buckets2 b using (classid, age) where p.age > 0
union
SELECT ssn,p.classid,dateOfBirth,dateDeceased,description FROM person p join buckets2 b on (p.classid = b.classid and datediff('year',dateofbirth,sysdate) = b.age) WHERE p.age < 0
) AS X
ORDER BY 1,2,3,4,5;
ssn | classid | dateOfBirth | dateDeceased | description
--------------+---------+-------------+--------------+-------------
170-749-963 | 2 | 1957-04-28 | 1978-04-28 | taxpayer
25-154-3821 | 1 | 1952-04-28 | 1987-04-28 | adult
430-263-3135 | 2 | 1910-04-28 | 1977-04-28 | taxpayer
502-687-4360 | 2 | 1957-04-28 | 1963-04-28 | dependent
568-959-8677 | 1 | 1927-04-28 | | senior
692-368-8739 | 2 | 1944-04-28 | | taxpayer
705-218-7617 | 1 | 1964-04-28 | 1992-04-28 | adult
715-83-4558 | 1 | 1954-04-28 | | senior
732-260-3942 | 2 | 1984-04-28 | | taxpayer
(9 rows)