verticapy.machine_learning.memmodel.cluster.KMeans#

class verticapy.machine_learning.memmodel.cluster.KMeans(clusters: list | ndarray, p: int = 2)#

InMemoryModel implementation of KMeans.

Parameters#

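clusters: ArrayLike: List of the model’s cluster centers (centroids), one entry per cluster.
p: int, optional: The order p of the p-distance used to measure the distance between a point and each centroid. The default, p = 2, corresponds to the Euclidean distance.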

Note

memmodel objects are defined entirely by their attributes. For example, the cluster centroids and the p value define a KMeans model.

Attributes#

Attributes have the same names as the input parameters, followed by an underscore (‘_’); for example, the clusters parameter is stored as clusters_.

Examples#

Initialization

Import the required module.

from verticapy.machine_learning.memmodel.cluster import KMeans

A KMeans model is defined by its cluster centroids and the p value. In this example, we will use the following:

clusters = [[0.5, 0.6], [1, 2], [100, 200]]

p = 2

Let’s create a KMeans model.

model_km = KMeans(clusters, p)

Create a dataset.

data = [[2, 3]]

Making In-Memory Predictions

Use the predict() method to make predictions.

model_km.predict(data)[0]
Out[6]: 1

Note

KMeans assigns an id to each cluster, following the order of the centroids in clusters. In this example, the cluster with centroid [0.5, 0.6] has id = 0, the cluster with centroid [1, 2] has id = 1, and so on. The predict() method returns the id of the predicted cluster.
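To make the id assignment concrete, the following is a minimal pure-Python sketch of a nearest-centroid lookup for p = 2. The helper name nearest_cluster is hypothetical and this is not VerticaPy's implementation; in particular, how the library breaks exact ties may differ from this sketch.

```python
import math


def nearest_cluster(point, clusters):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, centroid) for centroid in clusters]
    # The cluster id is simply the position of the centroid in `clusters`.
    return min(range(len(clusters)), key=distances.__getitem__)


clusters = [[0.5, 0.6], [1, 2], [100, 200]]
print(nearest_cluster([2, 3], clusters))  # 1, the same id as model_km.predict(data)[0]
```

The point [2, 3] is closest to the second centroid, [1, 2], so the returned id is 1.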

Use the predict_proba() method to compute the predicted probability of each cluster.

model_km.predict_proba(data)
Out[7]: array([[0.33177263, 0.66395985, 0.00426752]])
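The values above are consistent with normalizing the inverse distances to each centroid, which is also the form of the expressions returned by predict_proba_sql(). The sketch below reproduces that calculation in pure Python for p = 2. The function name inverse_distance_scores is hypothetical, and the handling of a point that lies exactly on a centroid is an assumption of this sketch rather than documented library behavior.

```python
import math


def inverse_distance_scores(point, clusters):
    """Score each cluster by its inverse distance, normalized to sum to 1."""
    distances = [math.dist(point, centroid) for centroid in clusters]
    if 0.0 in distances:
        # Assumption of this sketch: a point sitting on a centroid
        # gets 1.0 for that cluster and 0.0 for the others.
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [weight / total for weight in inverse]


clusters = [[0.5, 0.6], [1, 2], [100, 200]]
scores = inverse_distance_scores([2, 3], clusters)
print([round(s, 5) for s in scores])  # approximately [0.33177, 0.66396, 0.00427]
```

Closer centroids receive larger scores, and the scores of one point always sum to 1.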

Use the transform() method to compute the distance from each input row to each cluster centroid.

model_km.transform(data)
Out[8]: array([[  2.83019434,   1.41421356, 220.02954347]])
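As an illustration of what these distances are, here is a small sketch of a p-distance, assuming the standard Minkowski definition (the p-th root of the sum of absolute coordinate differences raised to the power p). The helper p_distance is hypothetical; the absolute value comes from the textbook definition and is not something the page states about the library.

```python
def p_distance(point, centroid, p=2):
    """Minkowski distance of order p between two equal-length sequences."""
    # abs() follows the textbook definition (an assumption of this sketch);
    # it makes no difference for even p such as the default p = 2.
    return sum(abs(x - c) ** p for x, c in zip(point, centroid)) ** (1 / p)


clusters = [[0.5, 0.6], [1, 2], [100, 200]]
distances = [p_distance([2, 3], centroid, p=2) for centroid in clusters]
print([round(d, 5) for d in distances])  # approximately [2.83019, 1.41421, 220.02954]
```

With p = 2 this is the Euclidean distance, and the three values agree with the transform() output above.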

Deploy SQL Code

Let’s use the following column names:

cnames = ['col1', 'col2']

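Use the predict_sql() method to get the SQL code needed to deploy the model using its attributes. The result is a single CASE expression that evaluates to the id of the closest cluster, or NULL if any input column is NULL.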

model_km.predict_sql(cnames)
Out[10]: 'CASE WHEN col1 IS NULL OR col2 IS NULL THEN NULL WHEN POWER(POWER(col1 - 100.0, 2) + POWER(col2 - 200.0, 2), 1 / 2) <= POWER(POWER(col1 - 0.5, 2) + POWER(col2 - 0.6, 2), 1 / 2) AND POWER(POWER(col1 - 100.0, 2) + POWER(col2 - 200.0, 2), 1 / 2) <= POWER(POWER(col1 - 1.0, 2) + POWER(col2 - 2.0, 2), 1 / 2) THEN 2 WHEN POWER(POWER(col1 - 1.0, 2) + POWER(col2 - 2.0, 2), 1 / 2) <= POWER(POWER(col1 - 0.5, 2) + POWER(col2 - 0.6, 2), 1 / 2) THEN 1 ELSE 0 END'
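Because the returned value is a plain SQL expression, it can be placed in the select list of a query. The sketch below shows one way to do that with string formatting. The helper build_scoring_query, the table name public.points, and the alias cluster_id are all hypothetical, and the CASE expression is a shortened stand-in for the full string returned by predict_sql().

```python
def build_scoring_query(expression, table, columns, alias="cluster_id"):
    """Wrap a generated SQL expression in a SELECT over the input columns."""
    select_list = ", ".join(columns + [f"{expression} AS {alias}"])
    return f"SELECT {select_list} FROM {table};"


# Shortened stand-in for the string returned by model_km.predict_sql(cnames).
predict_expression = "CASE WHEN col1 IS NULL OR col2 IS NULL THEN NULL ELSE 0 END"

# "public.points" is a hypothetical table containing col1 and col2.
query = build_scoring_query(predict_expression, "public.points", ["col1", "col2"])
print(query)
```

The column names passed to predict_sql() must match the columns of the table being scored, since they are inserted into the expression verbatim.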

Use the predict_proba_sql() method to get the SQL code needed to compute the predicted probabilities. It returns one expression per cluster.

model_km.predict_proba_sql(cnames)
Out[11]: 
['(CASE WHEN POWER(POWER(col1 - 0.5, 2) + POWER(col2 - 0.6, 2), 1 / 2) = 0 THEN 1.0 ELSE 1 / (POWER(POWER(col1 - 0.5, 2) + POWER(col2 - 0.6, 2), 1 / 2)) / (1 / (POWER(POWER(col1 - 0.5, 2) + POWER(col2 - 0.6, 2), 1 / 2)) + 1 / (POWER(POWER(col1 - 1.0, 2) + POWER(col2 - 2.0, 2), 1 / 2)) + 1 / (POWER(POWER(col1 - 100.0, 2) + POWER(col2 - 200.0, 2), 1 / 2))) END)',
 '(CASE WHEN POWER(POWER(col1 - 1.0, 2) + POWER(col2 - 2.0, 2), 1 / 2) = 0 THEN 1.0 ELSE 1 / (POWER(POWER(col1 - 1.0, 2) + POWER(col2 - 2.0, 2), 1 / 2)) / (1 / (POWER(POWER(col1 - 0.5, 2) + POWER(col2 - 0.6, 2), 1 / 2)) + 1 / (POWER(POWER(col1 - 1.0, 2) + POWER(col2 - 2.0, 2), 1 / 2)) + 1 / (POWER(POWER(col1 - 100.0, 2) + POWER(col2 - 200.0, 2), 1 / 2))) END)',
 '(CASE WHEN POWER(POWER(col1 - 100.0, 2) + POWER(col2 - 200.0, 2), 1 / 2) = 0 THEN 1.0 ELSE 1 / (POWER(POWER(col1 - 100.0, 2) + POWER(col2 - 200.0, 2), 1 / 2)) / (1 / (POWER(POWER(col1 - 0.5, 2) + POWER(col2 - 0.6, 2), 1 / 2)) + 1 / (POWER(POWER(col1 - 1.0, 2) + POWER(col2 - 2.0, 2), 1 / 2)) + 1 / (POWER(POWER(col1 - 100.0, 2) + POWER(col2 - 200.0, 2), 1 / 2))) END)']

Use the transform_sql() method to get the SQL code needed to compute the distance to each cluster. It returns one expression per cluster.

model_km.transform_sql(cnames)
Out[12]: 
['POWER(POWER(col1 - 0.5, 2) + POWER(col2 - 0.6, 2), 1 / 2)',
 'POWER(POWER(col1 - 1.0, 2) + POWER(col2 - 2.0, 2), 1 / 2)',
 'POWER(POWER(col1 - 100.0, 2) + POWER(col2 - 200.0, 2), 1 / 2)']
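The expressions above follow a regular pattern: one POWER term per column, summed, then raised to 1 / p. As a rough illustration of how such strings can be derived from the centroids and column names, the sketch below rebuilds them with string formatting. The function distance_sql is hypothetical and is not the library's code; it only mirrors the output shown on this page.

```python
def distance_sql(clusters, columns, p=2):
    """Build one SQL distance expression per centroid."""
    expressions = []
    for centroid in clusters:
        # One POWER(column - coordinate, p) term per input column.
        terms = [
            f"POWER({column} - {float(coordinate)}, {p})"
            for column, coordinate in zip(columns, centroid)
        ]
        expressions.append(f"POWER({' + '.join(terms)}, 1 / {p})")
    return expressions


clusters = [[0.5, 0.6], [1, 2], [100, 200]]
for expression in distance_sql(clusters, ["col1", "col2"], p=2):
    print(expression)
```

For the example model, the three generated strings are the same as the transform_sql() output listed above.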

Hint

This object can be pickled and used in any in-memory environment, just like scikit-learn models.

__init__(clusters: list | ndarray, p: int = 2) → None#

Methods

`__init__`(clusters[, p])
`get_attributes`()	Returns the model attributes.
`predict`(X)	Predicts clusters using the input matrix.
`predict_proba`(X)	Predicts the probability of each input to belong to the model clusters.
`predict_proba_sql`(X)	Returns the SQL code needed to deploy the model probabilities.
`predict_sql`(X)	Returns the SQL code needed to deploy the model using its attributes.
`set_attributes`(**kwargs)	Sets the model attributes.
`transform`(X)	Transforms and returns the distance to each cluster.
`transform_sql`(X)	Transforms and returns the SQL distance to each cluster.

Attributes

object_type

Must be overridden in child class