verticapy.vDataFrame.join#

Joins the vDataFrame with another one or an input_relation.

Warning

Joins can make the vDataFrame structure heavier. It is recommended that you check the current structure using the current_relation method and save it with the to_db method, using the parameters inplace = True and relation_type = table.

Parameters#

input_relation: SQLRelation

Relation to join with.

on: tuple | dict | list, optional

If using a list: List of 3-tuples. Each tuple must include (key1, key2, operator) — where key1 is the key of the vDataFrame, key2 is the key of the input_relation, and operator is one of the following:

‘=’:
exact match
‘<’:
key1 < key2
‘>’:
key1 > key2
‘<=’:
key1 <= key2
‘>=’:
key1 >= key2
‘llike’:
key1 LIKE ‘%’ || key2 || ‘%’
‘rlike’:
key2 LIKE ‘%’ || key1 || ‘%’
‘linterpolate’:
key1 INTERPOLATE key2
‘rinterpolate’:
key2 INTERPOLATE key1

Some operators need 5-tuples: (key1, key2, operator, operator2, x) where operator2 is a simple operator (=, >, <, <=, >=), x is a float or an integer, and operator is one of the following:

‘jaro’:
JARO(key1, key2) operator2 x
‘jarow’:
JARO_WINCKLER(key1, key2) operator2 x
‘lev’:
LEVENSHTEIN(key1, key2) operator2 x

If using a dictionary: This parameter must include all the different keys. It must be similar to the following: {"relationA_key1": "relationB_key1" ...,"relationA_keyk": "relationB_keyk"} where relationA is the current vDataFrame and relationB is the input_relation or the input vDataFrame.

on_interpolate: dict, optional

Dictionary of all unique keys. This is used to join two event series together using some ordered attribute. Event series joins let you compare values from two series directly, rather than having to normalize the series to the same measurement interval. The dict must be similar to the following: {"relationA_key1": "relationB_key1" ...,"relationA_keyk": "relationB_keyk"} where relationA is the current vDataFrame and relationB is the input_relation or the input vDataFrame.

how: str, optional

Join Type.

left:
Left Join.
right:
Right Join.
cross:
Cross Join.
full:
Full Outer Join.
natural:
Natural Join.
inner:
Inner Join.

expr1: SQLExpression, optional

List of the different columns in pure SQL to select from the current vDataFrame, optionally as aliases. Aliases are recommended to avoid ambiguous names. For example: column or column AS my_new_alias.

expr2: SQLExpression, optional

Returns#

vDataFrame: object result of the join.

Examples#

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

Let us create two vDataFrame which we can JOIN for this example:

employees_data = vp.vDataFrame(
    {
        "employee_id": [1, 2, 3, 4],
        "employee_name": ['Alice', 'Bob', 'Charlie', 'David'],
        "department_id": [101, 102, 101, 103],
    },
)


departments_data = vp.vDataFrame(
    {
        "department_id": [101, 102, 104],
        "department_name": ['HR', 'Finance', 'Marketing'],
    }
)

	123 employee_id Integer 100%	...	Abc employee_name Varchar(7) 100%	123 department_id Integer 100%
1	1	...	Alice	101
2	2	...	Bob	102
3	3	...	Charlie	101
4	4	...	David	103

	123 department_id Integer 100%	Abc department_name Varchar(9) 100%
1	101	HR
2	102	Finance
3	104	Marketing

Let us look at the different type of JOINs available below:

INNER JOIN
LEFT JOIN
RIGHT JOIN
FULL JOIN

After that we will also have a look at:

Other operators.
Special operators like Jaro-Winkler.

INNER JOIN#

We can conveniently JOIN the two vDataFrame using the key column. Let us perform an INNER JOIN. INNER JOIN is executed to combine rows from both the main table and the input_relation based on a specified condition. Only the rows with matching values in the specified column are included in the result. If there is no match, those rows are excluded from the output.

result = employees_data.join(
    input_relation = departments_data,
    on = [("department_id", "department_id", "=")],
    how = "inner",
    expr1 = [
        "employee_id AS ID",
        "employee_name AS Name",
    ],
    expr2 = ["department_name AS Dep"],
)

	123 ID Integer 100%	...	Abc Name Varchar(7) 100%	Abc Dep Varchar(9) 100%
1	1	...	Alice	HR
2	2	...	Bob	Finance
3	3	...	Charlie	HR

LEFT JOIN#

Similarly we can perform a LEFT JOIN which ensures that all rows from the main table are included in the result, and matching rows from the input_relation are included if they exist. If there is no match, the columns from the input relation will contain NULL values for the corresponding rows in the result.

left_join_result = employees_data.join(
    input_relation = departments_data,
    on = [("department_id", "department_id", "=")],
    how = "left",
    expr1 = ["employee_id AS ID", "employee_name AS Name"],
    expr2 = ["department_name AS Dep"],
)

	123 ID Integer 100%	...	Abc Name Varchar(7) 100%	Abc Dep Varchar(9) 75%
1	3	...	Charlie	HR
2	1	...	Alice	HR
3	4	...	David	[null]
4	2	...	Bob	Finance

RIGHT JOIN#

A RIGHT JOIN is employed to include all rows from the input_relation in the result, regardless of whether there are matching values in the main table. Rows from the main table are included if there are matching values, and for non-matching rows, the columns from the main table will contain NULL values in the result.

right_join_result = employees_data.join(
    input_relation = departments_data,
    on = [("department_id", "department_id", "=")],
    how = "right",
    expr1 = ["employee_id AS ID", "employee_name AS Name"],
    expr2 = ["department_name AS Dep"],
)

	123 ID Integer 75%	...	Abc Name Varchar(7) 75%	Abc Dep Varchar(9) 100%
1	1	...	Alice	HR
2	3	...	Charlie	HR
3	[null]	...	[null]	Marketing
4	2	...	Bob	Finance

FULL JOIN#

A FULL JOIN is utilized to include all rows from both the main table and the input_relation in the result. Matching rows are included based on the specified condition, and for non-matching rows in either table, the columns from the non-matching side will contain NULL values in the result. This ensures that all rows from both tables are represented in the output.

full_join_result = employees_data.join(
    input_relation = departments_data,
    on = [("department_id", "department_id", "=")],
    how = "full",
    expr1 = ["employee_id AS ID", "employee_name AS Name"],
    expr2 = ["department_name AS Dep"],
)

	123 ID Integer 80%	...	Abc Name Varchar(7) 80%	Abc Dep Varchar(9) 80%
1	1	...	Alice	HR
2	2	...	Bob	Finance
3	3	...	Charlie	HR
4	4	...	David	[null]
5	[null]	...	[null]	Marketing

OTHER OPERATORS#

Let us explore some additional features of joins. For that let us create another table:

additional_departments_data = vp.vDataFrame(
    {
        "department_size": [12, 8, 8, 10],
        "department": ['HR', 'Fin', 'Mar', 'IT'],
    }
)

	123 department_size Integer 100%	Abc department Varchar(3) 100%
1	12	HR
2	8	Fin
3	8	Mar
4	10	IT

Notice the names are a bit different than the “department_name” column in the previous department_data table. In such cases we can utilize the llike operator:

department_join = departments_data.join(
    input_relation = additional_departments_data,
    on = [("department_name", "department", "llike")],
    how = "inner",
    expr1 = ["department_id AS ID", "department_name AS Dep"],
    expr2 = ["department_size AS Size"],
)

	123 ID Integer 80%	...	Abc Name Varchar(7) 80%	Abc Dep Varchar(9) 80%
1	1	...	Alice	HR
2	2	...	Bob	Finance
3	3	...	Charlie	HR
4	4	...	David	[null]
5	[null]	...	[null]	Marketing

Note

VerticaPy provides an array of join options and diverse operators, delivering an exceptional user experience.

JARO-WINKLER#

VerticaPy also allows you to JOIN tables using the Jaro-Winkler method. It is a string similarity metric used to compare the similarity between two strings. This method can be particularly useful in scenarios where slight spelling mistakes are expected between keys of different tables.

Let us create two tables for this example:

users_data = vp.vDataFrame(
    {
        "user_id": [1, 2, 3],
        "email": ['alice@email.com', 'bob@email.com', 'charlie@email.com'],
        "username": ['Ali', 'Bob', 'Charlie'],
        "age": [25, 30, 22],
    },
)


orders_data = vp.vDataFrame(
    {
        "order_id": [101, 102, 103],
        "email": ['Alice@email.com', 'bob@email.com', 'charlee@email.com'],
        "product_name": ['Laptop', 'Headphones', 'Smartphone'],
        "quantity": [2, 1, 3],
    }
)

	123 user_id Integer 100%	...	Abc email Varchar(17) 100%	123 age Integer 100%
1	1	...	alice@email.com	25
2	2	...	bob@email.com	30
3	3	...	charlie@email.com	22

	123 order_id Integer 100%	...	Abc email Varchar(17) 100%	123 quantity Integer 100%
1	101	...	Alice@email.com	2
2	102	...	bob@email.com	1
3	103	...	charlee@email.com	3

Notice that some emails are not correctly spelled, so we can use the jarow option to JOIN them:

result = users_data.join(
    input_relation = orders_data,
    on = [("email", "email", "jarow", ">=", 0.9)],
    how = "inner",
    expr1 = [
        "user_id AS ID",
        "username AS Name",
        "email",
    ],
    expr2 = ["product_name AS Item", "quantity AS Qty"],
)

	123 order_id Integer 100%	...	Abc email Varchar(17) 100%	123 quantity Integer 100%
1	101	...	Alice@email.com	2
2	102	...	bob@email.com	1
3	103	...	charlee@email.com	3