
verticapy.machine_learning.model_selection.best_k

verticapy.machine_learning.model_selection.best_k(input_relation: Annotated[str | vDataFrame, ''], X: Annotated[str | list[str], 'STRING representing one column or a list of columns'] | None = None, n_cluster: tuple | list = (1, 100), init: Literal['kmeanspp', 'random', None] = None, max_iter: int = 50, tol: float = 0.0001, use_kprototype: bool = False, gamma: float = 1.0, elbow_score_stop: float = 0.8, **kwargs) → int

Finds the KMeans / KPrototypes k based on a score.

Parameters

input_relation: SQLRelation

Relation used to train the model.

X: SQLColumns, optional

List of the predictor columns. If empty, all numerical columns are used.

n_cluster: tuple | list, optional

Tuple representing the number of clusters to start and end with. This can also be a customized list with various k values to test.

init: str | list, optional

The method used to find the initial cluster centers.

  • kmeanspp:

    Only available when use_kprototype = False. Uses the k-means++ method to initialize the centers.

  • random:

    Randomly subsamples the data to find initial centers.

Default value is kmeanspp if use_kprototype = False; otherwise, random.

max_iter: int, optional

The maximum number of iterations for the algorithm.

tol: float, optional

Determines whether the algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of tol from the previous iteration.

use_kprototype: bool, optional

If set to True, the function uses the KPrototypes algorithm instead of KMeans. KPrototypes can handle categorical features.

gamma: float, optional

Only if use_kprototype = True. Weighting factor for categorical columns. It determines the relative importance of numerical and categorical attributes.

elbow_score_stop: float, optional

Stops searching for parameters when the specified elbow score is reached.
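To make the tol parameter concrete, here is a minimal pure-Python sketch of the convergence test it describes. This is an illustration of the rule, not VerticaPy's implementation; the function name and tuple-based center representation are assumptions for the example.

```python
import math

def has_converged(old_centers, new_centers, tol=1e-4):
    """Convergence test in the spirit of the tol parameter:
    the algorithm is considered converged once no center has
    moved more than a distance of tol since the last iteration."""
    max_shift = max(
        math.dist(old, new)
        for old, new in zip(old_centers, new_centers)
    )
    return max_shift <= tol

# A tiny shift (below tol) counts as converged; a large one does not.
has_converged([(0.0, 0.0)], [(0.0, 0.00005)])  # True
has_converged([(0.0, 0.0)], [(1.0, 1.0)])      # False
```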

Returns

int

The optimal k for k-means / k-prototypes.

Examples

The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Elbow Curve page.

Load data for machine learning

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will use the iris dataset.

import verticapy.datasets as vpd

data = vpd.load_iris()
[Output: the iris vDataFrame with columns SepalLengthCm (Numeric(7)), SepalWidthCm (Numeric(7)), PetalLengthCm (Numeric(7)), PetalWidthCm (Numeric(7)), and Species (Varchar(30)). Rows: 1-100 | Columns: 5]

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets on the Datasets page, which provides detailed information on each dataset and how to use it effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.

Data Exploration

Through a quick scatter plot, we can observe that the data has three main clusters.

data.scatter(
    columns = ["PetalLengthCm", "SepalLengthCm"],
    by = "Species",
)

Elbow Score

Let’s compute the optimal k for our KMeans algorithm and check if it aligns with the three clusters we observed earlier.

from verticapy.machine_learning.model_selection import best_k

best_k(
    input_relation = data,
    X = data.get_columns(exclude_columns = "Species"),  # all columns except Species
    n_cluster = (1, 100),
    init = "kmeanspp",
    elbow_score_stop = 0.9,
)

Out[3]: 4
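Conceptually, best_k evaluates increasing values of k and stops as soon as the elbow score reaches elbow_score_stop. The pure-Python sketch below illustrates that search loop under stated assumptions: the score_for_k callable is a stand-in for fitting a k-cluster model and scoring it, and the function name is invented for this example; it is not VerticaPy's internal code.

```python
def find_best_k(score_for_k, n_cluster=(1, 100), elbow_score_stop=0.8):
    """Search loop in the spirit of best_k: test each candidate k in
    order and return the first one whose elbow score reaches the
    stopping threshold; otherwise return the best-scoring k tried."""
    candidates = (
        range(n_cluster[0], n_cluster[1] + 1)
        if isinstance(n_cluster, tuple)
        else n_cluster  # a customized list of k values to test
    )
    best, best_score = None, -1.0
    for k in candidates:
        score = score_for_k(k)  # e.g. BCSS / TSS of a k-cluster model
        if score > best_score:
            best, best_score = k, score
        if score >= elbow_score_stop:
            return k
    return best

# Toy score that improves with k and saturates at 1.0:
find_best_k(lambda k: min(1.0, k / 5), (1, 100), elbow_score_stop=0.9)  # 5
```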

Note

You can experiment with the Elbow score to determine the optimal number of clusters. The score is based on the ratio of the Between-Cluster Sum of Squares to the Total Sum of Squares, providing a way to assess the clustering quality. A score of 1 indicates a perfect clustering.
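The ratio described above can be written out in plain Python. The sketch below uses 1-D points for brevity and is an illustration of the formula only, not VerticaPy's internals:

```python
def elbow_score(points, labels):
    """Elbow score = Between-Cluster Sum of Squares / Total Sum of Squares.
    points: 1-D values; labels: the cluster id assigned to each point."""
    mean = sum(points) / len(points)
    tss = sum((p - mean) ** 2 for p in points)  # total sum of squares
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    # BCSS: size-weighted squared distance of each centroid from the global mean
    bcss = sum(
        len(c) * ((sum(c) / len(c)) - mean) ** 2
        for c in clusters.values()
    )
    return bcss / tss

# Two tight, well-separated clusters -> score close to 1:
elbow_score([1.0, 1.1, 9.0, 9.1], [0, 0, 1, 1])
```

With a single cluster the centroid coincides with the global mean, so BCSS (and the score) is 0; as the clustering explains more of the variance, the score approaches 1.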

See also

elbow() : Draws an Elbow curve.