
verticapy.machine_learning.memmodel.cluster.KPrototypes#

class verticapy.machine_learning.memmodel.cluster.KPrototypes(clusters: list | ndarray, p: int = 2, gamma: float = 1.0, is_categorical: list | ndarray | None = None)#

InMemoryModel implementation of KPrototypes.

Parameters#

clusters: ArrayLike

list of the model’s cluster centers.

p: int, optional

The p value of the p-distance used to compare numerical attributes.

gamma: float, optional

Weighting factor for categorical columns. It determines the relative importance of numerical and categorical attributes.

is_categorical: list | numpy.array, optional

ArrayLike of booleans indicating whether X[idx] is a categorical variable: True means categorical, False means numerical. If empty, all the variables are considered categorical.

Note

The KPrototypes algorithm lets you use categorical variables directly, without encoding them first.

Attributes#

Attributes are identical to the input parameters, followed by an underscore (‘_’).

Examples#

Initialization

Import the required module.

from verticapy.machine_learning.memmodel.cluster import KPrototypes

A KPrototypes model is defined by its cluster centroids. Optionally, you can also provide the p value, gamma, and information about which variables are categorical. In this example, we will use the following:

clusters = [
    [0.5, 'high'],
    [1, 'low'],
    [100, 'high'],
]


p = 2

gamma = 1.0

is_categorical = [0, 1]

Let’s create a KPrototypes model.

model_kp = KPrototypes(clusters, p, gamma, is_categorical)

Create a dataset.

data = [[2, 'low']]

Making In-Memory Predictions

Use the predict() method to make predictions.

model_kp.predict(data)[0]
Out[8]: 1

Note

KPrototypes assigns an id to each cluster. In this example, the cluster with centroid [0.5, 'high'] has id = 0, the cluster with centroid [1, 'low'] has id = 1, and so on. The predict() method returns the id of the predicted cluster.
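The id selection can be illustrated with a short sketch: predict() effectively returns the index of the nearest centroid. The distances below are the ones transform() returns for data = [[2, 'low']] in this example.

```python
# Distances from the point [2, 'low'] to the three centroids,
# as returned by transform() in this example.
distances = [3.25, 1.0, 9605.0]

# The predicted cluster id is the index of the smallest distance.
cluster_id = distances.index(min(distances))
print(cluster_id)  # 1
```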

Use the predict_proba() method to compute the predicted probabilities for each cluster.

model_kp.predict_proba(data)
Out[9]: array([[2.35275386e-01, 7.64645005e-01, 7.96090583e-05]])
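These probabilities can be reproduced by hand: judging from the output above, they are inverse-distance weights normalized to sum to 1. The sketch below is illustrative, not the library's internal code.

```python
# Distances to the three centroids, as returned by transform() in this example.
distances = [3.25, 1.0, 9605.0]

# Inverse-distance weights, normalized so the row sums to 1.
inverse = [1 / d for d in distances]
total = sum(inverse)
probabilities = [w / total for w in inverse]
print(probabilities)  # ≈ [0.235275, 0.764645, 7.96e-05]
```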

Use the transform() method to compute the distance from each cluster.

model_kp.transform(data)
Out[10]: array([[3.250e+00, 1.000e+00, 9.605e+03]])
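The distances themselves can be checked by hand. Judging from the array above, the numerical part is the sum of |x - c|**p (without the 1/p root) and the categorical part adds gamma per mismatching category; the sketch below reproduces the output under that assumption.

```python
clusters = [[0.5, 'high'], [1, 'low'], [100, 'high']]
p, gamma = 2, 1.0
point = [2, 'low']

distances = []
for center in clusters:
    numeric = abs(point[0] - center[0]) ** p        # numerical column (col1)
    categorical = gamma * (point[1] != center[1])   # categorical column (col2)
    distances.append(numeric + categorical)

print(distances)  # [3.25, 1.0, 9605.0]
```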

Deploy SQL Code

Let’s use the following column names:

cnames = ['col1', 'col2']

Use the predict_sql() method to get the SQL code needed to deploy the model using its attributes.

model_kp.predict_sql(cnames)
Out[12]: "CASE WHEN col1 IS NULL OR col2 IS NULL THEN NULL WHEN POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) <= POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) AND POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) <= POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1)) THEN 2 WHEN POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1)) <= POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) THEN 1 ELSE 0 END"

Use the predict_proba_sql() method to get the SQL code needed to deploy the model that computes predicted probabilities.

model_kp.predict_proba_sql(cnames)
Out[13]: 
["(CASE WHEN POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) = 0 THEN 1.0 ELSE 1 / (POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))) / (1 / (POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))) + 1 / (POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1))) + 1 / (POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)))) END)",
 "(CASE WHEN POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1)) = 0 THEN 1.0 ELSE 1 / (POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1))) / (1 / (POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))) + 1 / (POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1))) + 1 / (POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)))) END)",
 "(CASE WHEN POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) = 0 THEN 1.0 ELSE 1 / (POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))) / (1 / (POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))) + 1 / (POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1))) + 1 / (POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)))) END)"]

Use the transform_sql() method to get the SQL code needed to deploy the model that computes the distance from each cluster.

model_kp.transform_sql(cnames)
Out[14]: 
["POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))",
 "POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1))",
 "POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))"]

Hint

This object can be pickled and used in any in-memory environment, just like scikit-learn models.

__init__(clusters: list | ndarray, p: int = 2, gamma: float = 1.0, is_categorical: list | ndarray | None = None) None#

Methods

__init__(clusters[, p, gamma, is_categorical])

get_attributes()

Returns the model attributes.

predict(X)

Predicts clusters using the input matrix.

predict_proba(X)

Predicts the probability of each input to belong to the model clusters.

predict_proba_sql(X)

Returns the SQL code needed to deploy the model probabilities.

predict_sql(X)

Returns the SQL code needed to deploy the model using its attributes.

set_attributes(**kwargs)

Sets the model attributes.

transform(X)

Transforms and returns the distance to each cluster.

transform_sql(X)

Transforms and returns the SQL distance to each cluster.

Attributes

object_type

Must be overridden in child class