verticapy.vDataFrame.join#
- vDataFrame.join(input_relation: str | vDataFrame, on: None | tuple | dict | list = None, on_interpolate: dict | None = None, how: Literal['left', 'right', 'cross', 'full', 'natural', 'self', 'inner', None] = 'natural', expr1: str | list[str] | StringSQL | list[StringSQL] | None = None, expr2: str | list[str] | StringSQL | list[StringSQL] | None = None) vDataFrame #
Joins the
vDataFrame
with another one or aninput_relation
.Warning
Joins can make the vDataFrame structure heavier. It is recommended that you check the current structure using the
current_relation
method and save it with theto_db
method, using the parametersinplace = True
andrelation_type = table
.Parameters#
- input_relation: SQLRelation
Relation to join with.
- on: tuple | dict | list, optional
If using a list: List of 3-tuples. Each tuple must include (key1, key2, operator) — where
key1
is the key of thevDataFrame
,key2
is the key of theinput_relation
, andoperator
is one of the following:- ‘=’:
exact match
- ‘<’:
key1 < key2
- ‘>’:
key1 > key2
- ‘<=’:
key1 <= key2
- ‘>=’:
key1 >= key2
- ‘llike’:
key1 LIKE ‘%’ || key2 || ‘%’
- ‘rlike’:
key2 LIKE ‘%’ || key1 || ‘%’
- ‘linterpolate’:
key1 INTERPOLATE key2
- ‘rinterpolate’:
key2 INTERPOLATE key1
Some operators need 5-tuples:
(key1, key2, operator, operator2, x)
whereoperator2
is a simple operator(=, >, <, <=, >=)
, x is afloat
or aninteger
, andoperator
is one of the following:- ‘jaro’:
JARO(key1, key2) operator2 x
- ‘jarow’:
JARO_WINCKLER(key1, key2) operator2 x
- ‘lev’:
LEVENSHTEIN(key1, key2) operator2 x
If using a dictionary: This parameter must include all the different keys. It must be similar to the following:
{"relationA_key1": "relationB_key1" ...,"relationA_keyk": "relationB_keyk"}
whererelationA
is the currentvDataFrame
andrelationB
is theinput_relation
or the inputvDataFrame
.- on_interpolate: dict, optional
Dictionary of all unique keys. This is used to join two event series together using some ordered attribute. Event series joins let you compare values from two series directly, rather than having to normalize the series to the same measurement interval. The dict must be similar to the following:
{"relationA_key1": "relationB_key1" ...,"relationA_keyk": "relationB_keyk"}
whererelationA
is the currentvDataFrame
andrelationB
is theinput_relation
or the inputvDataFrame
.- how: str, optional
Join Type.
- left:
Left Join.
- right:
Right Join.
- cross:
Cross Join.
- full:
Full Outer Join.
- natural:
Natural Join.
- inner:
Inner Join.
- expr1: SQLExpression, optional
List of the different columns in pure SQL to select from the current
vDataFrame
, optionally as aliases. Aliases are recommended to avoid ambiguous names. For example:column
orcolumn AS my_new_alias
.- expr2: SQLExpression, optional
List of the different columns in pure SQL to select from the current
vDataFrame
, optionally as aliases. Aliases are recommended to avoid ambiguous names. For example:column
orcolumn AS my_new_alias
.
Returns#
- vDataFrame
object result of the join.
Examples#
Let’s begin by importing VerticaPy.
import verticapy as vp
Hint
By assigning an alias to
verticapy
, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions fromverticapy
are used as intended without interfering with functions from other libraries.Let us create two
vDataFrame
which we can JOIN for this example:employees_data = vp.vDataFrame( { "employee_id": [1, 2, 3, 4], "employee_name": ['Alice', 'Bob', 'Charlie', 'David'], "department_id": [101, 102, 101, 103], }, ) departments_data = vp.vDataFrame( { "department_id": [101, 102, 104], "department_name": ['HR', 'Finance', 'Marketing'], } )
123employee_idInteger100%... Abcemployee_nameVarchar(7)100%123department_idInteger100%1 1 ... Alice 101 2 2 ... Bob 102 3 3 ... Charlie 101 4 4 ... David 103 123department_idInteger100%Abcdepartment_nameVarchar(9)100%1 101 HR 2 102 Finance 3 104 Marketing Let us look at the different type of JOINs available below:
INNER JOIN
LEFT JOIN
RIGHT JOIN
FULL JOIN
After that we will also have a look at:
Other operators.
Special operators like Jaro-Winkler.
INNER JOIN#
We can conveniently JOIN the two
vDataFrame
using the key column. Let us perform an INNER JOIN. INNER JOIN is executed to combine rows from both the main table and theinput_relation
based on a specified condition. Only the rows with matching values in the specified column are included in the result. If there is no match, those rows are excluded from the output.result = employees_data.join( input_relation = departments_data, on = [("department_id", "department_id", "=")], how = "inner", expr1 = [ "employee_id AS ID", "employee_name AS Name", ], expr2 = ["department_name AS Dep"], )
123IDInteger100%... AbcNameVarchar(7)100%AbcDepVarchar(9)100%1 1 ... Alice HR 2 2 ... Bob Finance 3 3 ... Charlie HR LEFT JOIN#
Similarly we can perform a LEFT JOIN which ensures that all rows from the main table are included in the result, and matching rows from the
input_relation
are included if they exist. If there is no match, the columns from the input relation will containNULL
values for the corresponding rows in the result.left_join_result = employees_data.join( input_relation = departments_data, on = [("department_id", "department_id", "=")], how = "left", expr1 = ["employee_id AS ID", "employee_name AS Name"], expr2 = ["department_name AS Dep"], )
123IDInteger100%... AbcNameVarchar(7)100%AbcDepVarchar(9)75%1 3 ... Charlie HR 2 1 ... Alice HR 3 4 ... David [null] 4 2 ... Bob Finance RIGHT JOIN#
A RIGHT JOIN is employed to include all rows from the
input_relation
in the result, regardless of whether there are matching values in the main table. Rows from the main table are included if there are matching values, and for non-matching rows, the columns from the main table will contain NULL values in the result.right_join_result = employees_data.join( input_relation = departments_data, on = [("department_id", "department_id", "=")], how = "right", expr1 = ["employee_id AS ID", "employee_name AS Name"], expr2 = ["department_name AS Dep"], )
123IDInteger75%... AbcNameVarchar(7)75%AbcDepVarchar(9)100%1 1 ... Alice HR 2 3 ... Charlie HR 3 [null] ... [null] Marketing 4 2 ... Bob Finance FULL JOIN#
A FULL JOIN is utilized to include all rows from both the main table and the
input_relation
in the result. Matching rows are included based on the specified condition, and for non-matching rows in either table, the columns from the non-matching side will contain NULL values in the result. This ensures that all rows from both tables are represented in the output.full_join_result = employees_data.join( input_relation = departments_data, on = [("department_id", "department_id", "=")], how = "full", expr1 = ["employee_id AS ID", "employee_name AS Name"], expr2 = ["department_name AS Dep"], )
123IDInteger80%... AbcNameVarchar(7)80%AbcDepVarchar(9)80%1 1 ... Alice HR 2 2 ... Bob Finance 3 3 ... Charlie HR 4 4 ... David [null] 5 [null] ... [null] Marketing OTHER OPERATORS#
Let us explore some additional features of joins. For that let us create another table:
additional_departments_data = vp.vDataFrame( { "department_size": [12, 8, 8, 10], "department": ['HR', 'Fin', 'Mar', 'IT'], } )
123department_sizeInteger100%AbcdepartmentVarchar(3)100%1 12 HR 2 8 Fin 3 8 Mar 4 10 IT Notice the names are a bit different than the “department_name” column in the previous
department_data
table. In such cases we can utilize thellike
operator:department_join = departments_data.join( input_relation = additional_departments_data, on = [("department_name", "department", "llike")], how = "inner", expr1 = ["department_id AS ID", "department_name AS Dep"], expr2 = ["department_size AS Size"], )
123IDInteger80%... AbcNameVarchar(7)80%AbcDepVarchar(9)80%1 1 ... Alice HR 2 2 ... Bob Finance 3 3 ... Charlie HR 4 4 ... David [null] 5 [null] ... [null] Marketing Note
VerticaPy provides an array of join options and diverse operators, delivering an exceptional user experience.
JARO-WINKLER#
VerticaPy also allows you to JOIN tables using the Jaro-Winkler method. It is a string similarity metric used to compare the similarity between two strings. This method can be particularly useful in scenarios where slight spelling mistakes are expected between keys of different tables.
Let us create two tables for this example:
users_data = vp.vDataFrame( { "user_id": [1, 2, 3], "email": ['alice@email.com', 'bob@email.com', 'charlie@email.com'], "username": ['Ali', 'Bob', 'Charlie'], "age": [25, 30, 22], }, ) orders_data = vp.vDataFrame( { "order_id": [101, 102, 103], "email": ['Alice@email.com', 'bob@email.com', 'charlee@email.com'], "product_name": ['Laptop', 'Headphones', 'Smartphone'], "quantity": [2, 1, 3], } )
123user_idInteger100%... AbcemailVarchar(17)100%123ageInteger100%1 1 ... alice@email.com 25 2 2 ... bob@email.com 30 3 3 ... charlie@email.com 22 123order_idInteger100%... AbcemailVarchar(17)100%123quantityInteger100%1 101 ... Alice@email.com 2 2 102 ... bob@email.com 1 3 103 ... charlee@email.com 3 Notice that some emails are not correctly spelled, so we can use the
jarow
option to JOIN them:result = users_data.join( input_relation = orders_data, on = [("email", "email", "jarow", ">=", 0.9)], how = "inner", expr1 = [ "user_id AS ID", "username AS Name", "email", ], expr2 = ["product_name AS Item", "quantity AS Qty"], )
123order_idInteger100%... AbcemailVarchar(17)100%123quantityInteger100%1 101 ... Alice@email.com 2 2 102 ... bob@email.com 1 3 103 ... charlee@email.com 3 See also