verticapy.machine_learning.memmodel.cluster.KPrototypes#
- class verticapy.machine_learning.memmodel.cluster.KPrototypes(clusters: list | ndarray, p: int = 2, gamma: float = 1.0, is_categorical: list | ndarray | None = None)#
InMemoryModel implementation of KPrototypes.

Parameters#
- clusters: ArrayLike
  List of the model's cluster centers.
- p: int, optional
  The p corresponding to one of the p-distances.
- gamma: float, optional
  Weighting factor for categorical columns. This determines the relative importance of numerical and categorical attributes.
- is_categorical: list | numpy.array, optional
  ArrayLike of booleans indicating whether X[idx] is a categorical variable, where True indicates categorical and False numerical. If empty, all the variables are considered categorical.
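The role of gamma can be made concrete with a small pure-Python sketch of the K-Prototypes dissimilarity between a point and a centroid. This is an illustration of the general formula, not VerticaPy's internal code; the numeric part sums |x - c|**p without taking the p-th root, which is consistent with the transform() values shown in the examples below.

```python
def kprototypes_dissimilarity(x, centroid, is_categorical, p=2, gamma=1.0):
    """Numeric p-distance term plus gamma-weighted categorical mismatches."""
    numeric = sum(
        abs(xi - ci) ** p
        for xi, ci, cat in zip(x, centroid, is_categorical)
        if not cat
    )
    mismatches = sum(
        xi != ci
        for xi, ci, cat in zip(x, centroid, is_categorical)
        if cat
    )
    return numeric + gamma * mismatches

# With gamma = 0.5 a category mismatch costs half a unit;
# with gamma = 10 it dominates the numeric term.
print(kprototypes_dissimilarity([2, 'low'], [1, 'high'], [0, 1], gamma=0.5))   # 1.5
print(kprototypes_dissimilarity([2, 'low'], [1, 'high'], [0, 1], gamma=10.0))  # 11.0
```

Larger gamma values push the clustering toward categorical agreement; gamma = 0 reduces the dissimilarity to the purely numeric part.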
Note

The KPrototypes algorithm allows you to use categorical variables directly, without the need to encode them.

Attributes#
Attributes are identical to the input parameters, followed by an underscore ('_').
Examples#
Initialization
Import the required module.
from verticapy.machine_learning.memmodel.cluster import KPrototypes
A KPrototypes model is defined by its cluster centroids. Optionally, you can also provide the p value, gamma, and information about which variables are categorical. In this example, we will use the following:

clusters = [
    [0.5, 'high'],
    [1, 'low'],
    [100, 'high'],
]
p = 2
gamma = 1.0
is_categorical = [0, 1]
Let’s create a KPrototypes model.

model_kp = KPrototypes(clusters, p, gamma, is_categorical)
Create a dataset.
data = [[2, 'low']]
Making In-Memory Predictions
Use the predict() method to do predictions.

model_kp.predict(data)[0]
Out[8]: 1
Note

KPrototypes assigns an id to identify each cluster. In this example, the cluster with centroid [0.5, 'high'] has id = 0, the cluster with centroid [1, 'low'] has id = 1, and so on. The predict() method returns the id of the predicted cluster.
Use the predict_proba() method to compute the predicted probabilities for each cluster.

model_kp.predict_proba(data)
Out[9]: array([[2.35275386e-01, 7.64645005e-01, 7.96090583e-05]])
Use the transform() method to compute the distance from each cluster.

model_kp.transform(data)
Out[10]: array([[3.250e+00, 1.000e+00, 9.605e+03]])
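The three in-memory results above can be reproduced by hand. The sketch below assumes, consistently with the numbers shown, that the distance to a centroid is the sum of |x - c|**p over numeric columns plus gamma times the number of categorical mismatches, that predict() returns the index of the smallest distance, and that predict_proba() normalizes the inverse distances:

```python
clusters = [[0.5, 'high'], [1, 'low'], [100, 'high']]
p, gamma = 2, 1.0
is_categorical = [0, 1]
x = [2, 'low']

def dist(point, centroid):
    # Numeric part: |x - c| ** p summed over non-categorical columns.
    num = sum(abs(a - b) ** p
              for a, b, cat in zip(point, centroid, is_categorical) if not cat)
    # Categorical part: gamma * number of mismatching categories.
    mis = sum(a != b
              for a, b, cat in zip(point, centroid, is_categorical) if cat)
    return num + gamma * mis

distances = [dist(x, c) for c in clusters]
print(distances)                      # [3.25, 1.0, 9605.0]

# predict(): index of the closest centroid.
predicted = min(range(len(distances)), key=distances.__getitem__)
print(predicted)                      # 1

# predict_proba(): normalized inverse distances.
inv = [1 / d for d in distances]
probas = [v / sum(inv) for v in inv]
print([round(v, 6) for v in probas])  # [0.235275, 0.764645, 8e-05]
```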
Deploy SQL Code
Let’s use the following column names:
cnames = ['col1', 'col2']
Use the predict_sql() method to get the SQL code needed to deploy the model using its attributes.

model_kp.predict_sql(cnames)
Out[12]: "CASE WHEN col1 IS NULL OR col2 IS NULL THEN NULL WHEN POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) <= POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) AND POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) <= POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1)) THEN 2 WHEN POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1)) <= POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) THEN 1 ELSE 0 END"
Use the predict_proba_sql() method to get the SQL code needed to deploy the model that computes predicted probabilities.

model_kp.predict_proba_sql(cnames)
Out[13]: ["(CASE WHEN POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) = 0 THEN 1.0 ELSE 1 / (POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))) / (1 / (POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))) + 1 / (POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1))) + 1 / (POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)))) END)",
 "(CASE WHEN POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1)) = 0 THEN 1.0 ELSE 1 / (POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1))) / (1 / (POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))) + 1 / (POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1))) + 1 / (POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)))) END)",
 "(CASE WHEN POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)) = 0 THEN 1.0 ELSE 1 / (POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))) / (1 / (POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))) + 1 / (POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1))) + 1 / (POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1)))) END)"]
Use the transform_sql() method to get the SQL code needed to deploy the model that computes the distance from each cluster.

model_kp.transform_sql(cnames)
Out[14]: ["POWER(POWER(col1 - 0.5, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))",
 "POWER(POWER(col1 - 1, 2), 1 / 2) + 1.0 * (ABS((col2 = 'low')::int - 1))",
 "POWER(POWER(col1 - 100, 2), 1 / 2) + 1.0 * (ABS((col2 = 'high')::int - 1))"]
Hint

This object can be pickled and used in any in-memory environment, just like scikit-learn models.
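A minimal sketch of that pickling workflow with the standard pickle module. A plain stand-in class is used here so the example runs without VerticaPy; any picklable model object follows the same dumps/loads (or dump/load to a file) pattern:

```python
import pickle

# Hypothetical stand-in for a picklable in-memory model such as the
# KPrototypes object above; it just holds its attributes.
class TinyModel:
    def __init__(self, clusters):
        self.clusters = clusters

    def n_clusters(self):
        return len(self.clusters)

model = TinyModel([[0.5, 'high'], [1, 'low'], [100, 'high']])
blob = pickle.dumps(model)      # serialize to bytes
restored = pickle.loads(blob)   # deserialize; the class must be importable
print(restored.n_clusters())    # 3
```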
- __init__(clusters: list | ndarray, p: int = 2, gamma: float = 1.0, is_categorical: list | ndarray | None = None) → None#
Methods

__init__(clusters[, p, gamma, is_categorical])
get_attributes()
    Returns the model attributes.
predict(X)
    Predicts clusters using the input matrix.
predict_proba(X)
    Predicts the probability of each input to belong to the model clusters.
predict_proba_sql(X)
    Returns the SQL code needed to deploy the model probabilities.
predict_sql(X)
    Returns the SQL code needed to deploy the model using its attributes.
set_attributes(**kwargs)
    Sets the model attributes.
transform(X)
    Transforms and returns the distance to each cluster.
transform_sql(X)
    Transforms and returns the SQL distance to each cluster.
Attributes
Must be overridden in child class