verticapy.vDataFrame.cdt

vDataFrame.cdt(columns: str | list[str] | None = None, max_cardinality: int = 20, nbins: int = 10, tcdt: bool = True, drop_transf_cols: bool = True) → vDataFrame

Returns the complete disjunctive table (CDT) of the vDataFrame. Numerical features are first converted to categorical ones using the vDataFrame.discretize() method. Applying PCA to the transformed complete disjunctive table (TCDT) yields MCA (Multiple Correspondence Analysis).

Warning

This method can become computationally expensive when used with categorical variables with many categories.

Parameters

columns: SQLColumns, optional: List of the vDataColumn names.
max_cardinality: int, optional: For any categorical variable, keeps only the most frequent categories and merges the remaining, less frequent ones into a single new category.
nbins: int, optional: Number of bins used for the discretization (must be > 1).
tcdt: bool, optional: If set to True, returns the transformed complete disjunctive table (TCDT).
drop_transf_cols: bool, optional: If set to True, drops the columns used during the transformation.
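
The merging described for max_cardinality can be illustrated with a minimal pure-Python sketch. It is not VerticaPy's implementation: the helper name merge_rare_categories, the "Others" label, and the choice to keep exactly max_cardinality categories are all assumptions made for illustration.

```python
from collections import Counter


def merge_rare_categories(values, max_cardinality, other_label="Others"):
    """Keep the most frequent categories and relabel all remaining ones.

    Hypothetical helper: the label "Others" and the exact number of
    categories kept are assumptions, not VerticaPy behavior.
    """
    counts = Counter(values)
    # The `max_cardinality` most frequent categories are kept as-is.
    kept = {cat for cat, _ in counts.most_common(max_cardinality)}
    return [v if v in kept else other_label for v in values]


colors = ["red", "red", "red", "blue", "blue", "green", "teal"]
merged = merge_rare_categories(colors, max_cardinality=2)
print(merged)
```

Capping the number of categories this way bounds the number of indicator columns the table can produce, which is what keeps the method tractable on high-cardinality variables.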

Returns

vDataFrame: the CDT relation.

Examples

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

Let us create a vDataFrame with multiple columns:

vdf = vp.vDataFrame(
    {
        "id": [0, 1, 2, 3, 4, 5],
        "cats": ["A", "B", "C", "A", "B", "C"],
        "vals": [2, 4, 8, 1, 4, 2],
    },
)

	123 id Integer 100%	...	Abc cats Varchar(1) 100%	123 vals Integer 100%
1	0	...	A	2
2	1	...	B	4
3	2	...	C	8
4	3	...	A	1
5	4	...	B	4
6	5	...	C	2

We can create the complete disjunctive table of the vDataFrame:

vdf.cdt(columns=["cats", "vals"], tcdt=False)

	123 id Integer 100%	...	123 cats_A Bool 100%	123 vals_8 Bool 100%
1	0	...	1	0
2	1	...	0	0
3	2	...	0	1
4	3	...	1	0
5	4	...	0	0
6	5	...	0	0
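
To make the encoding concrete, the indicator columns shown above can be reproduced with a small pure-Python sketch that needs no database connection. The helper complete_disjunctive_table is hypothetical and only mirrors the displayed output; it treats vals as categorical, as the vals_8 column name suggests, and drops the source columns as drop_transf_cols=True would.

```python
def complete_disjunctive_table(rows, columns):
    """Build one 0/1 indicator column per (column, category) pair.

    Hypothetical helper for illustration; not part of VerticaPy.
    """
    # Distinct categories of each encoded column, sorted for stable output.
    categories = {c: sorted({row[c] for row in rows}, key=str) for c in columns}
    table = []
    for row in rows:
        # Keep the columns that are not encoded (here: "id").
        out = {k: v for k, v in row.items() if k not in columns}
        for c in columns:
            for cat in categories[c]:
                out[f"{c}_{cat}"] = int(row[c] == cat)
        table.append(out)
    return table


rows = [
    {"id": i, "cats": c, "vals": v}
    for i, (c, v) in enumerate(zip("ABCABC", [2, 4, 8, 1, 4, 2]))
]
cdt = complete_disjunctive_table(rows, ["cats", "vals"])
print([r["cats_A"] for r in cdt])  # [1, 0, 0, 1, 0, 0]
print([r["vals_8"] for r in cdt])  # [0, 0, 1, 0, 0, 0]
```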

The same can be done to create the transformed complete disjunctive table of the vDataFrame:

vdf.cdt(columns=["cats", "vals"], tcdt=True)

	123 id Integer 100%	...	123 cats_A Numeric(36) 100%	123 vals_8 Numeric(36) 100%
1	0	...	-0.5	-1.0
2	1	...	-1.0	-1.0
3	2	...	-1.0	0.0
4	3	...	-0.5	-1.0
5	4	...	-1.0	-1.0
6	5	...	-1.0	-1.0
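
The values displayed above are consistent with rescaling each 0/1 indicator column as x / sum(column) - 1. This is an inference from the example output only, not a statement about how VerticaPy computes the table internally; the helper transform_indicator below is hypothetical.

```python
def transform_indicator(column):
    """Rescale a 0/1 indicator column as x / sum(column) - 1.

    Assumption: this formula is inferred from the example output and
    may not match the library's actual computation.
    """
    total = sum(column)
    return [x / total - 1 for x in column]


cats_a = [1, 0, 0, 1, 0, 0]  # indicator for cats == "A"
vals_8 = [0, 0, 1, 0, 0, 0]  # indicator for vals == 8
print(transform_indicator(cats_a))  # [-0.5, -1.0, -1.0, -0.5, -1.0, -1.0]
print(transform_indicator(vals_8))  # [-1.0, -1.0, 0.0, -1.0, -1.0, -1.0]
```

Under this rescaling, a rare category (such as vals_8, present once) receives a larger value for its single occurrence than a common one (cats_A, present twice).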

Note

This method can be useful for building an MCA (Multiple Correspondence Analysis) model on top of a PCA (Principal Component Analysis) one. A complete disjunctive table encodes the original categorical data as binary indicators that mark the presence or absence of each category; the transformed complete disjunctive table (TCDT) is the rescaled version of that table used in MCA.