verticapy.vDataFrame.cdt

vDataFrame.cdt(columns: str | list[str] | None = None, max_cardinality: int = 20, nbins: int = 10, tcdt: bool = True, drop_transf_cols: bool = True) -> vDataFrame

Returns the complete disjunctive table (CDT) of the vDataFrame. Numerical features are transformed to categorical using the vDataFrame.discretize() method. Applying PCA to the transformed complete disjunctive table (TCDT) yields MCA (Multiple Correspondence Analysis).

Warning

This method can become computationally expensive when used with categorical variables with many categories.

Parameters

columns: SQLColumns, optional

List of vDataColumn names.

max_cardinality: int, optional

For any categorical variable, keeps the most frequent categories (up to max_cardinality) and merges the less frequent categories into a single new category.

nbins: int, optional

Number of bins used for the discretization (must be > 1).

tcdt: bool, optional

If set to True, returns the transformed complete disjunctive table (TCDT).

drop_transf_cols: bool, optional

If set to True, drops the columns used during the transformation.
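
The merging described for max_cardinality can be illustrated with a minimal pure-Python sketch. It is not VerticaPy's implementation: the helper name merge_rare_categories, the "Others" label, and the choice to keep exactly max_cardinality categories are all assumptions made for illustration.

```python
from collections import Counter


def merge_rare_categories(values, max_cardinality, other_label="Others"):
    """Keep the most frequent categories and relabel the rest.

    Hypothetical helper: the label "Others" and keeping exactly
    `max_cardinality` categories are illustrative assumptions.
    """
    counts = Counter(values)
    kept = {category for category, _ in counts.most_common(max_cardinality)}
    return [v if v in kept else other_label for v in values]


colors = ["red", "red", "red", "blue", "blue", "green", "teal"]
merged = merge_rare_categories(colors, max_cardinality=2)
print(merged)
```

Capping the cardinality this way bounds the number of indicator columns the table can produce for each categorical variable, which is the cost the warning above refers to.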

Returns

vDataFrame

The CDT relation.

Examples

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

Assigning an alias to verticapy reduces the risk of name collisions with other libraries. This matters because verticapy uses common function names such as “average” and “median”. With the alias, calls resolve to the verticapy functions as intended and do not interfere with functions from other libraries.

Let us create a vDataFrame with multiple columns:

vdf = vp.vDataFrame(
    {
        "id": [0, 1, 2, 3, 4, 5],
        "cats": ["A", "B", "C", "A", "B", "C"],
        "vals": [2, 4, 8, 1, 4, 2],
    },
)

     id        ...   cats         vals
     Integer         Varchar(1)   Integer
     100%            100%         100%
1    0         ...   A            2
2    1         ...   B            4
3    2         ...   C            8
4    3         ...   A            1
5    4         ...   B            4
6    5         ...   C            2

We can create the complete disjunctive table of the vDataFrame:

vdf.cdt(columns=["cats", "vals"], tcdt=False)

     id        ...   cats_A   vals_8
     Integer         Bool     Bool
     100%            100%     100%
1    0         ...   1        0
2    1         ...   0        0
3    2         ...   0        1
4    3         ...   1        0
5    4         ...   0        0
6    5         ...   0        0
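
To make the indicator encoding concrete, here is a minimal pure-Python sketch that builds one 0/1 column per observed category from the same example data. It only illustrates the idea of a complete disjunctive table and is not how VerticaPy computes it; the helper name complete_disjunctive_table is hypothetical.

```python
def complete_disjunctive_table(data, columns):
    """Return one 0/1 indicator list per (column, category) pair.

    Hypothetical helper for illustration; `data` maps column names
    to equally sized lists of values.
    """
    table = {}
    for col in columns:
        for category in sorted(set(data[col])):
            table[f"{col}_{category}"] = [int(v == category) for v in data[col]]
    return table


data = {
    "id": [0, 1, 2, 3, 4, 5],
    "cats": ["A", "B", "C", "A", "B", "C"],
    "vals": [2, 4, 8, 1, 4, 2],
}
cdt = complete_disjunctive_table(data, ["cats", "vals"])
print(cdt["cats_A"])  # [1, 0, 0, 1, 0, 0]
print(cdt["vals_8"])  # [0, 0, 1, 0, 0, 0]
```

The two printed columns agree with the cats_A and vals_8 columns displayed above. Each row contains exactly one 1 per original column.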

The same approach creates the transformed complete disjunctive table of the vDataFrame:

vdf.cdt(columns=["cats", "vals"], tcdt=True)

     id        ...   cats_A        vals_8
     Integer         Numeric(36)   Numeric(36)
     100%            100%          100%
1    0         ...   -0.5          -1.0
2    1         ...   -1.0          -1.0
3    2         ...   -1.0          0.0
4    3         ...   -0.5          -1.0
5    4         ...   -1.0          -1.0
6    5         ...   -1.0          -1.0
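
The values displayed above are consistent with each 0/1 indicator being divided by its column sum and then shifted by -1. That formula is inferred from this output only and is an assumption rather than a documented definition; the helper name transform_indicator is hypothetical.

```python
def transform_indicator(indicator):
    """Rescale a 0/1 indicator column as x / sum(column) - 1.

    Assumed formula, inferred from the example output; it requires
    at least one 1 in the column so that the sum is non-zero.
    """
    total = sum(indicator)
    return [x / total - 1 for x in indicator]


# Indicator columns for the example data (cats == "A", vals == 8).
cats_A = [1, 0, 0, 1, 0, 0]
vals_8 = [0, 0, 1, 0, 0, 0]

print(transform_indicator(cats_A))  # [-0.5, -1.0, -1.0, -0.5, -1.0, -1.0]
print(transform_indicator(vals_8))  # [-1.0, -1.0, 0.0, -1.0, -1.0, -1.0]
```

Under this reading, a category that occurs in two rows maps its 1 to -0.5, and a category that occurs in one row maps its 1 to 0.0. Absent categories always map to -1.0.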

Note

This method can be useful for building an MCA (Multiple Correspondence Analysis) model on top of a PCA (Principal Component Analysis) one. The complete disjunctive table encodes the original categorical data as binary indicators of the absence or presence of each category. The transformed complete disjunctive table (TCDT) is the rescaled version of that table, and applying PCA to it yields MCA.

See also

PCA : Principal Component Analysis.