
verticapy.machine_learning.model_selection.best_k

verticapy.machine_learning.model_selection.best_k(input_relation: Annotated[str | vDataFrame, ''], X: Annotated[str | list[str], 'STRING representing one column or a list of columns'] | None = None, n_cluster: tuple | list = (1, 100), init: Literal['kmeanspp', 'random', None] = None, max_iter: int = 50, tol: float = 0.0001, use_kprototype: bool = False, gamma: float = 1.0, elbow_score_stop: float = 0.8, **kwargs) → int

Finds the KMeans / KPrototypes k based on a score.

Parameters

input_relation: SQLRelation

Relation used to train the model.

X: SQLColumns, optional

List of the predictor columns. If empty, all numerical columns are used.

n_cluster: tuple | list, optional

Tuple representing the number of clusters to start and end with. This can also be a customized list with various k values to test.

init: str | list, optional

The method used to find the initial cluster centers.

  • kmeanspp:

    Only available when use_kprototype = False. Uses the k-means++ method to initialize the centers.

  • random:

    Randomly subsamples the data to find initial centers.

Default value is kmeanspp if use_kprototype = False; otherwise, random.

max_iter: int, optional

The maximum number of iterations for the algorithm.

tol: float, optional

Determines whether the algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of tol from the previous iteration.

use_kprototype: bool, optional

If set to True, the function uses the KPrototypes algorithm instead of KMeans. KPrototypes can handle categorical features.

gamma: float, optional

Only if use_kprototype = True. Weighting factor for categorical columns. It determines the relative importance of numerical and categorical attributes.

elbow_score_stop: float, optional

Stops searching for parameters when the specified elbow score is reached.
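To make the tol parameter concrete, here is a minimal pure-Python sketch of the convergence test it describes. This is an illustration of the rule, not VerticaPy's implementation; the function name and tuple-based center representation are assumptions for the example.

```python
import math

def has_converged(old_centers, new_centers, tol=1e-4):
    """Convergence test in the spirit of the tol parameter:
    the algorithm is considered converged once no center has
    moved more than a distance of tol since the last iteration."""
    max_shift = max(
        math.dist(old, new)
        for old, new in zip(old_centers, new_centers)
    )
    return max_shift <= tol

# A tiny shift (below tol) counts as converged; a large one does not.
has_converged([(0.0, 0.0)], [(0.0, 0.00005)])  # True
has_converged([(0.0, 0.0)], [(1.0, 1.0)])      # False
```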

Returns

int

The optimal k for k-means / k-prototypes.

Examples

The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Elbow Curve page.

Load data for machine learning

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will use the iris dataset.

import verticapy.datasets as vpd

data = vpd.load_iris()
[Output: the iris vDataFrame with columns SepalLengthCm (Numeric(7)), SepalWidthCm (Numeric(7)), PetalLengthCm (Numeric(7)), PetalWidthCm (Numeric(7)), and Species (Varchar(30)). Rows: 1-100 | Columns: 5]

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets on the Datasets page, which provides detailed information on each dataset and how to use it effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.

Data Exploration

Through a quick scatter plot, we can observe that the data has three main clusters.

data.scatter(
    columns = ["PetalLengthCm", "SepalLengthCm"],
    by = "Species",
)

Elbow Score

Let’s compute the optimal k for our KMeans algorithm and check if it aligns with the three clusters we observed earlier.

from verticapy.machine_learning.model_selection import best_k

best_k(
    input_relation = data,
    X = data.get_columns(exclude_columns = "Species"),  # all columns except Species
    n_cluster = (1, 100),
    init = "kmeanspp",
    elbow_score_stop = 0.9,
)

Out[3]: 4
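Conceptually, best_k evaluates increasing values of k and stops as soon as the elbow score reaches elbow_score_stop. The pure-Python sketch below illustrates that search loop under stated assumptions: the score_for_k callable is a stand-in for fitting a k-cluster model and scoring it, and the function name is invented for this example; it is not VerticaPy's internal code.

```python
def find_best_k(score_for_k, n_cluster=(1, 100), elbow_score_stop=0.8):
    """Search loop in the spirit of best_k: test each candidate k in
    order and return the first one whose elbow score reaches the
    stopping threshold; otherwise return the best-scoring k tried."""
    candidates = (
        range(n_cluster[0], n_cluster[1] + 1)
        if isinstance(n_cluster, tuple)
        else n_cluster  # a customized list of k values to test
    )
    best, best_score = None, -1.0
    for k in candidates:
        score = score_for_k(k)  # e.g. BCSS / TSS of a k-cluster model
        if score > best_score:
            best, best_score = k, score
        if score >= elbow_score_stop:
            return k
    return best

# Toy score that improves with k and saturates at 1.0:
find_best_k(lambda k: min(1.0, k / 5), (1, 100), elbow_score_stop=0.9)  # 5
```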

Note

You can experiment with the Elbow score to determine the optimal number of clusters. The score is based on the ratio of the Between-Cluster Sum of Squares to the Total Sum of Squares, providing a way to assess the clustering quality. A score of 1 indicates a perfect clustering.
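The ratio described above can be written out in plain Python. The sketch below uses 1-D points for brevity and is an illustration of the formula only, not VerticaPy's internals:

```python
def elbow_score(points, labels):
    """Elbow score = Between-Cluster Sum of Squares / Total Sum of Squares.
    points: 1-D values; labels: the cluster id assigned to each point."""
    mean = sum(points) / len(points)
    tss = sum((p - mean) ** 2 for p in points)  # total sum of squares
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    # BCSS: size-weighted squared distance of each centroid from the global mean
    bcss = sum(
        len(c) * ((sum(c) / len(c)) - mean) ** 2
        for c in clusters.values()
    )
    return bcss / tss

# Two tight, well-separated clusters -> score close to 1:
elbow_score([1.0, 1.1, 9.0, 9.1], [0, 0, 1, 1])
```

With a single cluster the centroid coincides with the global mean, so BCSS (and the score) is 0; as the clustering explains more of the variance, the score approaches 1.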

See also

elbow() : Draws an Elbow curve.