verticapy.vDataFrame.merge_similar_names

vDataFrame.merge_similar_names(skip_word: str | list[str]) → vDataFrame

Merges columns with similar names. The function generates a COALESCE statement that merges the columns into a single column whose name excludes the input words. Note that the order of the variables in the COALESCE statement follows the order returned by the ‘get_columns’ method.

Parameters

skip_word: str | list[str]

Word or list of words to exclude from the provided column names. For example, if two columns are named ‘age.information.phone’ and ‘age.phone’ and skip_word is set to ['.information'], the two columns are merged with the following COALESCE statement: COALESCE("age.phone", "age.information.phone") AS "age.phone"
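
The grouping described above can be illustrated with a minimal pure-Python sketch. It is not VerticaPy's implementation: the helper names quote_ident, group_similar, and coalesce_sql are hypothetical, and the sketch assumes that a skip word is removed by plain substring replacement and that the input list is already in ‘get_columns’ order.

```python
from __future__ import annotations


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def group_similar(columns: list[str], skip_word: str | list[str]) -> dict[str, list[str]]:
    """Group column names by the name left after removing every skip word.

    Assumption: skip words are removed by simple substring replacement.
    """
    skip_words = [skip_word] if isinstance(skip_word, str) else list(skip_word)
    groups: dict[str, list[str]] = {}
    for col in columns:  # input order is preserved inside each group
        stripped = col
        for word in skip_words:
            stripped = stripped.replace(word, "")
        groups.setdefault(stripped, []).append(col)
    return groups


def coalesce_sql(groups: dict[str, list[str]]) -> str:
    """Build one select-list expression per group of similar names."""
    parts = []
    for target, members in groups.items():
        quoted = ", ".join(quote_ident(m) for m in members)
        if len(members) == 1:
            # Nothing to merge: keep the column, renamed to the stripped name.
            parts.append(f"{quoted} AS {quote_ident(target)}")
        else:
            parts.append(f"COALESCE({quoted}) AS {quote_ident(target)}")
    return ",\n".join(parts)


# Assumed column order, chosen to match the statement quoted above.
columns = ["age.phone", "age.information.phone"]
print(coalesce_sql(group_similar(columns, [".information"])))
```

Because COALESCE returns its first non-null argument, the column order decides which value wins when both columns are populated for the same row.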

Returns

vDataFrame

An object containing the merged element.

Examples

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, let’s generate a dataset with two columns that duplicate each other, differ slightly in name, and contain some missing values:

vdf = vp.vDataFrame(
    {
        "user.id": [12, None, 13],
        "id": [12, 11, None],
    }
)

    user.id    id
    Integer    Integer
    66%        66%
1   12         12
2   [null]     11
3   13         [null]

To remove the redundant column, we can combine the two columns using merge_similar_names:

vdf.merge_similar_names(skip_word = "user.")

    id
    Integer
    100%
1   12
2   11
3   13

Note

This function is particularly useful when flattening highly nested JSON files. Such files may contain redundant features and inconsistencies. The function is designed to merge these features, ensuring consistent information.
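
As a rough illustration of how such redundant names can arise, the sketch below flattens nested records into dotted column names in plain Python. The flatten helper, the "." separator, and the sample records are assumptions made for this example only; they are not part of VerticaPy.

```python
from __future__ import annotations

from typing import Any


def flatten(record: dict[str, Any], prefix: str = "", sep: str = ".") -> dict[str, Any]:
    """Flatten a nested dict into a single level with dotted key names."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the nested object, carrying the path as a prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical records in which the same identifier appears at two nesting levels.
records = [
    {"user": {"id": 12}, "id": 12},
    {"id": 11},
    {"user": {"id": 13}},
]

# The union of flattened keys gives the column names of the resulting table.
columns = sorted({name for record in records for name in flatten(record)})
print(columns)
```

Here the flattened data ends up with both a ‘user.id’ and an ‘id’ column, each partially populated, which is the situation the example above resolves with skip_word set to "user.".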

See also

vDataFrame.pivot() : Pivots the vDataFrame.