City in Blur Motion
As a Vertica user, you know that using joins can improve query performance by combining records from one or more tables. But sometimes, you need to develop complex joins. Vertica supports many different kinds of joins that perform different functions based on your needs.
In this blog, we’ll give you a refresher on join algorithms and predicates, as well as the following Vertica join types:
- Inner joins
- Left, right, and full outer joins
- Natural joins
- Cross joins
In Vertica, we refer to the tables participating in the join as left or right. The left table is specified first in the join statement, and the right table is the table mentioned second.
Examples used in this blog
For this blog, we’ll use the following scenario:
Your organization wants employees to increase or refresh their computer programming skills. To accomplish this, you put together a list of courses from a local university that your employees can take. To help keep track of employees and completed classes, you reference two tables: employees and courses, described in more detail below:
employees: A table with detailed employee information for all of your organization’s 87 employees. For room, the following outputs only show a portion of the tables:
=> SELECT * FROM employees ORDER BY personal_id ASC;
personal_id | employee_name | employee_age | employee_gender
1 | John | 43 | M
2 | Dan | 52 | M
3 | Lori | 38 | F
4 | Tom | 55 | M
5 | Mary | 39 | F
6 | Virginia | 66 | F
7 | Gary | 63 | M
8 | Rebecca | 19 | F
9 | Steven | 32 | M
10 | Jessica | 25 | F
courses: A table the university uses to record each time someone takes a course. This includes people from outside your organization.
=> SELECT * FROM courses ORDER BY record_id ASC;
record_id | personal_id | course_id | course_name | date_taken
1 | 10 | 1 | Intro to Comp Sci | 2015-01-05
2 | 7 | 4 | Java 303 | 2015-04-05
3 | 53 | 3 | Database Architecture | 2015-08-03
4 | 4 | 2 | SQL 101 | 2016-01-13
5 | 6 | 4 | Java 303 | 2014-09-10
6 | 6 | 1 | Intro to Comp Sci | 2015-03-29
7 | 2 | 2 | SQL 101 | 2014-10-30
8 | 27 | 2 | SQL 101 | 2013-07-04
9 | 8 | 4 | Java 303 | 2016-01-05
10 | 3 | 3 | Database Architecture | 2015-07-13
You can gain valuable insight into the relationship between the employees and courses by performing a variety of joins. All the examples in this blog will use the employees and courses tables.
How does Vertica perform a join?
Vertica uses two basic algorithms to perform a join operation:
- A hash join is used to join large data sets. In a hash join, Vertica uses the smaller (or inner) table to build an in-memory hash table on the join column. The Vertica execution engine then scans the larger (or outer) table and examines the hash table to look for matches. The table size is determined by the number of rows times the size of each row. The optimizer chooses a hash join when projections are not sorted on the join columns. Although there are no sort requirements, the cost for performing a hash join can rise if the entire hash table can’t fit in memory.
- If both inputs are pre-sorted on the join column, the optimizer chooses a merge join.
The good news is, you don’t necessarily have to pay a whole lot of attention to the join algorithms because the Vertica optimizer automatically chooses the most appropriate algorithm given the query and projections in a system. You can facilitate a merge join by adding a projection that is sorted on the join keys.
In the JOIN SQL statement, the join predicate specifies how Vertica should join the tables. You specify the join predicates (for example, relational operators like <, <=, >, >=, <>, =, <=>) in the ON clause along with the columns from the left and right tables that should be combined.
Vertica supports any arbitrary join expression with both matching and non-matching column values. If the clause uses an equality predicate, indicated with an equal sign (=), the join is considered an equi-join. Consequently, non equi-joins use predicates other than the equal sign, for example, the greater than sign (>).
For the examples in this blog, we’ll use the equality predicate (=).
Now that you have a basic understanding of what a join is for, let’s look at the different types of joins you can perform with Vertica.
An inner join combines records from two tables based on a join predicate, and returns rows from columns specified in the SELECT statement that satisfy that predicate. Since we’re using the equality predicate, performing an inner join will return values from columns in the SELECT statement where the columns specified in the JOIN statement match. If a row from the left table’s joined column matches three rows from the right table’s specified column, the join will return three rows (provided the columns are specified in the SELECT statement). When a value in the joined column appears in one table but not the other, that row is not returned.
In the diagram below, the green shaded area represents values that are returned by an inner join. Keep in mind, duplicate values may be returned.
The query used to create an inner join looks like this:
=> SELECTAn inner join is the most commonly used type of join so the INNER keyword is optional.
If you want to list your employees who have taken classes, along with information about these classes, you can join the two tables on their common column, personal_id. This inner join will include rows from both table’s where the personal_id from the employees table matches a value in the courses tables personal_id column.
=> SELECT e.personal_id AS e_personal_id, employee_name, c.personal_id AS c_personal_id, c.course_name, c.date_taken
FROM employees e
INNER JOIN courses c ON e.personal_id = c.personal_id
ORDER BY e.personal_id ASC;
Some employees, like Virginia, show up multiple times because they have taken multiple classes (i.e., her personal id shows up twice in the courses table). Other employees, like Mary, don’t show up at all because they haven’t taken any classes yet (i.e., isn’t in the courses table).
But what if you want all the rows from one or both tables to be included in your result set? For that, you can use an outer join.
Outer joins extend the functionality of inner joins by letting you preserve rows in one or both tables that do not have matching rows in the other table. Create outer joins using the following syntax:
=> SELECT FROM [ LEFT | RIGHT | FULL ] OUTER JOIN ON
Vertica gives you the option of performing left outer joins, right outer joins, or full outer joins.
Left outer joins
A left outer join preserves a complete set of records from the left (preserved) table, along with any matched records in the right (non-preserved) table. Where Vertica finds no match, it inserts a null value for the right table.
In the diagram below, the green area represents values that are returned by a left outer join:
Using our previous example, let’s say you want to see a list of all your employees along with classes they’ve taken, if applicable. The result of a left outer join will be a record of every employee and a blank entry for rows from the courses table if that employee hasn’t taken a course.
=> SELECT e.personal_id AS e_personal_id, employee_name, c.personal_id AS c_personal_id, c.course_name, c.date_taken FROM employees e LEFT OUTER JOIN courses c ON e.personal_id = c.personal_id ORDER BY e.personal_id ASC;
Right outer joins
When performing a right outer join, Vertica returns a complete set of records from the right-joined (preserved) table, as well as matched values from the left-joined (non-preserved) table. Where Vertica finds no match, it returns null values for the left table.
In the diagram below, the green area represents values that are returned by a right outer join:
You can use this example to list all courses taken, whether or not one of your employees participated.
=> SELECT e.personal_id AS e_personal_id, employee_name, c.personal_id AS c_personal_id, c.course_name, c.date_taken FROM employees e RIGHT OUTER JOIN courses c ON e.personal_id = c.personal_id ORDER BY e.personal_id ASC;
Full outer joins
You can combine left outer joins and right outer joins with a full outer join. A full outer join returns all records specified in the SELECT list, including nulls (missing matches), from either table in the join.
In the diagram below, the green area represents values that are returned by a full outer join:
A full outer join is useful if you want to see, for example, each employee who has taken a particular class and each class that has been taken by one of your employees, but you also want to see all the employees who have not taken any courses, as well as any course that has not been attended by one of your employees:
=> SELECT e.personal_id AS e_personal_id, employee_name, c.personal_id AS c_personal_id, c.course_name, c.date_taken FROM employees e FULL OUTER JOIN courses c ON e.personal_id = c.personal_id ORDER BY e.personal_id ASC;
A natural join is just a join with an implicit join predicate. An implicit join predicate doesn’t include any explicit join conditions, such as e.personal_id = c.personal_id. Instead, implicit connections are formed by matching all pairs of columns in the tables that have the same name and compatible data types. (If the data types are incompatible, Vertica returns an error.) The result set contains only one column representing the pair of equally-named columns.
Natural joins take the following form:
=> SELECT FROM NATURAL [ INNER | LEFT OUTER | RIGHT OUTER | FULL OUTER ] JOIN
Vertica performs a natural join by creating an inner join (default) on the column common to both tables. This is useful both syntactically, and if you don’t know the names or commonality of columns.
For example, this query naturally joins the two personal_id columns.
=> SELECT * FROM employees e NATURAL JOIN courses c;
Use a cross join when you want to all possible combinations of matching one table with another.
Cross joins take the following form:
=> SELECT FROM CROSS JOIN
For example, you may want to compare every employee with every course record to see who they’ve taken classes with (identified by the personal_id number). In the result set, Vertica retrieves a record from the employees table and creates a row for every record in the courses table. It then does the same for the rest of the records in the employee table until each row in the employee table is displayed with each row of the courses table. The number of rows equals the number of rows in the first table multiplied by the number of rows in the second table. In our example, the output is 100 rows (10 * 10). The example output below only shows a sample:
=> SELECT * FROM employees e CROSS JOIN courses ORDER BY e.personal_id ASC;
For more information about joins, including restrictions and optimization, see the Vertica documentation.
Also read our blog on range joins!