verticapy.vDataFrame.merge_similar_names

vDataFrame.merge_similar_names(skip_word: str | list[str]) → vDataFrame

Merges columns with similar names. The function generates a COALESCE statement that merges the matching columns into a single column whose name excludes the input words. Note that the order of the columns in the COALESCE statement follows the column order returned by the ‘get_columns’ method.
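To make the name-matching step concrete, here is a minimal pure-Python sketch of how such a COALESCE expression could be assembled. It is not VerticaPy's implementation: the helper name build_coalesce_statements is hypothetical, and the sketch assumes that skip words are removed by plain substring replacement and that columns are processed in the order given.

```python
def build_coalesce_statements(columns, skip_word):
    """Illustrative only: group columns whose names match once the skip
    words are removed, and return one SQL expression per output column."""
    # Accept a single word or a list of words (assumption of this sketch).
    skip_words = [skip_word] if isinstance(skip_word, str) else list(skip_word)

    # Map each cleaned name to the original columns that reduce to it.
    # Dicts keep insertion order, so the input column order is preserved.
    groups = {}
    for col in columns:
        target = col
        for word in skip_words:
            target = target.replace(word, "")
        groups.setdefault(target, []).append(col)

    statements = []
    for target, sources in groups.items():
        if len(sources) == 1:
            # Nothing to merge: the column is only renamed.
            statements.append(f'"{sources[0]}" AS "{target}"')
        else:
            quoted = ", ".join(f'"{c}"' for c in sources)
            statements.append(f'COALESCE({quoted}) AS "{target}"')
    return statements


statements = build_coalesce_statements(
    ["age.phone", "age.information.phone"], [".information"]
)
print(statements[0])
```

Because the sketch keeps the input order, passing the columns in a different order changes the order of the arguments inside COALESCE, and therefore which column takes precedence when both hold a value.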

Parameters

skip_word: str | list[str]: Word or list of words to exclude from the provided column names. For example, if two columns are named ‘age.information.phone’ and ‘age.phone’ and skip_word is set to ['.information'], then the two columns are merged with the following COALESCE statement: COALESCE("age.phone", "age.information.phone") AS "age.phone"

Returns

vDataFrame: An object containing the merged columns.

Examples

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, let’s generate a dataset with two columns that hold the same information under slightly different names, each with some missing values:

vdf = vp.vDataFrame(
    {
        "user.id": [12, None, 13],
        "id": [12, 11, None],
    }
)

	user.id (Integer, 66% non-null)	id (Integer, 66% non-null)
1	12	12
2	[null]	11
3	13	[null]

To remove the redundant column, we can combine the two columns using merge_similar_names:

vdf.merge_similar_names(skip_word="user.")

	id (Integer, 100% non-null)
1	12
2	11
3	13
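The merged values follow from COALESCE semantics: for each row, the first non-NULL argument is returned. The sketch below reproduces that row-level behavior in plain Python on the same data, with None standing in for NULL. The coalesce helper is a hypothetical stand-in for the SQL function, not part of VerticaPy.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# Same data as the example dataset; None plays the role of SQL NULL.
user_id = [12, None, 13]
plain_id = [12, 11, None]

# Row by row, take the first non-missing value across the two columns.
merged = [coalesce(u, i) for u, i in zip(user_id, plain_id)]
print(merged)
```

Each row has a value in at least one of the two columns, which is why the merged column has no missing values.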

Note

This function is particularly useful when flattening highly nested JSON files. Such files may contain redundant features and inconsistencies. The function is designed to merge these features, ensuring consistent information.
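As a rough illustration of how such redundant names can arise, the following pure-Python sketch flattens a few nested records into dotted column names. It does not use VerticaPy's own JSON ingestion: the flatten helper and the sample records are assumptions made for this example only.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single dict with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical records in which the same identifier appears at two levels.
records = [
    {"user": {"id": 12}, "id": 12},
    {"id": 11},
    {"user": {"id": 13}},
]

flat_records = [flatten(r) for r in records]
columns = sorted({name for r in flat_records for name in r})
print(columns)
```

Flattening yields both an "id" and a "user.id" column, each with missing values, which is the situation merge_similar_names is meant to resolve.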