Loading...

verticapy.vDataFrame.join#

vDataFrame.join(input_relation: str | vDataFrame, on: None | tuple | dict | list = None, on_interpolate: dict | None = None, how: Literal['left', 'right', 'cross', 'full', 'natural', 'self', 'inner', None] = 'natural', expr1: str | list[str] | StringSQL | list[StringSQL] | None = None, expr2: str | list[str] | StringSQL | list[StringSQL] | None = None) vDataFrame#

Joins the vDataFrame with another one or an input_relation.

Warning

Joins can make the vDataFrame structure heavier. It is recommended that you check the current structure using the current_relation method and save it with the to_db method, using the parameters inplace = True and relation_type = table.

Parameters#

input_relation: SQLRelation

Relation to join with.

on: tuple | dict | list, optional

If using a list: List of 3-tuples. Each tuple must include (key1, key2, operator) — where key1 is the key of the vDataFrame, key2 is the key of the input_relation, and operator is one of the following:

  • ‘=’:

    exact match

  • ‘<’:

    key1 < key2

  • ‘>’:

    key1 > key2

  • ‘<=’:

    key1 <= key2

  • ‘>=’:

    key1 >= key2

  • ‘llike’:

    key1 LIKE ‘%’ || key2 || ‘%’

  • ‘rlike’:

    key2 LIKE ‘%’ || key1 || ‘%’

  • ‘linterpolate’:

    key1 INTERPOLATE key2

  • ‘rinterpolate’:

    key2 INTERPOLATE key1

Some operators need 5-tuples: (key1, key2, operator, operator2, x) where operator2 is a simple operator (=, >, <, <=, >=), x is a float or an integer, and operator is one of the following:

  • ‘jaro’:

    JARO(key1, key2) operator2 x

  • ‘jarow’:

    JARO_WINCKLER(key1, key2) operator2 x

  • ‘lev’:

    LEVENSHTEIN(key1, key2) operator2 x

If using a dictionary: This parameter must include all the different keys. It must be similar to the following: {"relationA_key1": "relationB_key1" ...,"relationA_keyk": "relationB_keyk"} where relationA is the current vDataFrame and relationB is the input_relation or the input vDataFrame.

on_interpolate: dict, optional

Dictionary of all unique keys. This is used to join two event series together using some ordered attribute. Event series joins let you compare values from two series directly, rather than having to normalize the series to the same measurement interval. The dict must be similar to the following: {"relationA_key1": "relationB_key1" ...,"relationA_keyk": "relationB_keyk"} where relationA is the current vDataFrame and relationB is the input_relation or the input vDataFrame.

how: str, optional

Join Type.

  • left:

    Left Join.

  • right:

    Right Join.

  • cross:

    Cross Join.

  • full:

    Full Outer Join.

  • natural:

    Natural Join.

  • inner:

    Inner Join.

expr1: SQLExpression, optional

List of the different columns in pure SQL to select from the current vDataFrame, optionally as aliases. Aliases are recommended to avoid ambiguous names. For example: column or column AS my_new_alias.

expr2: SQLExpression, optional

List of the different columns in pure SQL to select from the current vDataFrame, optionally as aliases. Aliases are recommended to avoid ambiguous names. For example: column or column AS my_new_alias.

Returns#

vDataFrame

object result of the join.

Examples#

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

Let us create two vDataFrame which we can JOIN for this example:

employees_data = vp.vDataFrame(
    {
        "employee_id": [1, 2, 3, 4],
        "employee_name": ['Alice', 'Bob', 'Charlie', 'David'],
        "department_id": [101, 102, 101, 103],
    },
)


departments_data = vp.vDataFrame(
    {
        "department_id": [101, 102, 104],
        "department_name": ['HR', 'Finance', 'Marketing'],
    }
)

123
employee_id
Integer
100%
...
Abc
employee_name
Varchar(7)
100%
123
department_id
Integer
100%
11...Alice101
22...Bob102
33...Charlie101
44...David103
123
department_id
Integer
100%
Abc
department_name
Varchar(9)
100%
1101HR
2102Finance
3104Marketing

Let us look at the different type of JOINs available below:

  • INNER JOIN

  • LEFT JOIN

  • RIGHT JOIN

  • FULL JOIN

After that we will also have a look at:

  • Other operators.

  • Special operators like Jaro-Winkler.

INNER JOIN#

We can conveniently JOIN the two vDataFrame using the key column. Let us perform an INNER JOIN. INNER JOIN is executed to combine rows from both the main table and the input_relation based on a specified condition. Only the rows with matching values in the specified column are included in the result. If there is no match, those rows are excluded from the output.

result = employees_data.join(
    input_relation = departments_data,
    on = [("department_id", "department_id", "=")],
    how = "inner",
    expr1 = [
        "employee_id AS ID",
        "employee_name AS Name",
    ],
    expr2 = ["department_name AS Dep"],
)

123
ID
Integer
100%
...
Abc
Name
Varchar(7)
100%
Abc
Dep
Varchar(9)
100%
11...AliceHR
22...BobFinance
33...CharlieHR

LEFT JOIN#

Similarly we can perform a LEFT JOIN which ensures that all rows from the main table are included in the result, and matching rows from the input_relation are included if they exist. If there is no match, the columns from the input relation will contain NULL values for the corresponding rows in the result.

left_join_result = employees_data.join(
    input_relation = departments_data,
    on = [("department_id", "department_id", "=")],
    how = "left",
    expr1 = ["employee_id AS ID", "employee_name AS Name"],
    expr2 = ["department_name AS Dep"],
)

123
ID
Integer
100%
...
Abc
Name
Varchar(7)
100%
Abc
Dep
Varchar(9)
75%
13...CharlieHR
21...AliceHR
34...David[null]
42...BobFinance

RIGHT JOIN#

A RIGHT JOIN is employed to include all rows from the input_relation in the result, regardless of whether there are matching values in the main table. Rows from the main table are included if there are matching values, and for non-matching rows, the columns from the main table will contain NULL values in the result.

right_join_result = employees_data.join(
    input_relation = departments_data,
    on = [("department_id", "department_id", "=")],
    how = "right",
    expr1 = ["employee_id AS ID", "employee_name AS Name"],
    expr2 = ["department_name AS Dep"],
)

123
ID
Integer
75%
...
Abc
Name
Varchar(7)
75%
Abc
Dep
Varchar(9)
100%
11...AliceHR
23...CharlieHR
3[null]...[null]Marketing
42...BobFinance

FULL JOIN#

A FULL JOIN is utilized to include all rows from both the main table and the input_relation in the result. Matching rows are included based on the specified condition, and for non-matching rows in either table, the columns from the non-matching side will contain NULL values in the result. This ensures that all rows from both tables are represented in the output.

full_join_result = employees_data.join(
    input_relation = departments_data,
    on = [("department_id", "department_id", "=")],
    how = "full",
    expr1 = ["employee_id AS ID", "employee_name AS Name"],
    expr2 = ["department_name AS Dep"],
)

123
ID
Integer
80%
...
Abc
Name
Varchar(7)
80%
Abc
Dep
Varchar(9)
80%
11...AliceHR
22...BobFinance
33...CharlieHR
44...David[null]
5[null]...[null]Marketing

OTHER OPERATORS#

Let us explore some additional features of joins. For that let us create another table:

additional_departments_data = vp.vDataFrame(
    {
        "department_size": [12, 8, 8, 10],
        "department": ['HR', 'Fin', 'Mar', 'IT'],
    }
)

123
department_size
Integer
100%
Abc
department
Varchar(3)
100%
112HR
28Fin
38Mar
410IT

Notice the names are a bit different than the “department_name” column in the previous department_data table. In such cases we can utilize the llike operator:

department_join = departments_data.join(
    input_relation = additional_departments_data,
    on = [("department_name", "department", "llike")],
    how = "inner",
    expr1 = ["department_id AS ID", "department_name AS Dep"],
    expr2 = ["department_size AS Size"],
)

123
ID
Integer
80%
...
Abc
Name
Varchar(7)
80%
Abc
Dep
Varchar(9)
80%
11...AliceHR
22...BobFinance
33...CharlieHR
44...David[null]
5[null]...[null]Marketing

Note

VerticaPy provides an array of join options and diverse operators, delivering an exceptional user experience.

JARO-WINKLER#

VerticaPy also allows you to JOIN tables using the Jaro-Winkler method. It is a string similarity metric used to compare the similarity between two strings. This method can be particularly useful in scenarios where slight spelling mistakes are expected between keys of different tables.

Let us create two tables for this example:

users_data = vp.vDataFrame(
    {
        "user_id": [1, 2, 3],
        "email": ['alice@email.com', 'bob@email.com', 'charlie@email.com'],
        "username": ['Ali', 'Bob', 'Charlie'],
        "age": [25, 30, 22],
    },
)


orders_data = vp.vDataFrame(
    {
        "order_id": [101, 102, 103],
        "email": ['Alice@email.com', 'bob@email.com', 'charlee@email.com'],
        "product_name": ['Laptop', 'Headphones', 'Smartphone'],
        "quantity": [2, 1, 3],
    }
)

123
user_id
Integer
100%
...
Abc
email
Varchar(17)
100%
123
age
Integer
100%
11...alice@email.com25
22...bob@email.com30
33...charlie@email.com22
123
order_id
Integer
100%
...
Abc
email
Varchar(17)
100%
123
quantity
Integer
100%
1101...Alice@email.com2
2102...bob@email.com1
3103...charlee@email.com3

Notice that some emails are not correctly spelled, so we can use the jarow option to JOIN them:

result = users_data.join(
    input_relation = orders_data,
    on = [("email", "email", "jarow", ">=", 0.9)],
    how = "inner",
    expr1 = [
        "user_id AS ID",
        "username AS Name",
        "email",
    ],
    expr2 = ["product_name AS Item", "quantity AS Qty"],
)

123
order_id
Integer
100%
...
Abc
email
Varchar(17)
100%
123
quantity
Integer
100%
1101...Alice@email.com2
2102...bob@email.com1
3103...charlee@email.com3

See also

vDataFrame.append() : Append a vDataFrame with another one or an input_relation.