
verticapy.machine_learning.model_selection.best_k¶
- verticapy.machine_learning.model_selection.best_k(input_relation: Annotated[str | vDataFrame, ''], X: Annotated[str | list[str], 'STRING representing one column or a list of columns'] | None = None, n_cluster: tuple | list = (1, 100), init: Literal['kmeanspp', 'random', None] = None, max_iter: int = 50, tol: float = 0.0001, use_kprototype: bool = False, gamma: float = 1.0, elbow_score_stop: float = 0.8, **kwargs) int ¶
Finds the KMeans / KPrototypes k based on a score.
Parameters¶
- input_relation: SQLRelation
Relation used to train the model.
- X: SQLColumns, optional
List of the predictor columns. If empty, all numerical columns are used.
- n_cluster: tuple | list, optional
Tuple representing the number of clusters to start and end with. This can also be a customized list of the k values to test.
- init: str | list, optional
The method used to find the initial cluster centers.
- kmeanspp:
Only available when use_kprototype = False. Uses the k-means++ method to initialize the centers.
- random:
Randomly subsamples the data to find the initial centers.
Default value is kmeanspp if use_kprototype = False; otherwise, random.
- max_iter: int, optional
The maximum number of iterations for the algorithm.
- tol: float, optional
Determines whether the algorithm has converged. The algorithm is considered converged when no center has moved more than a distance of tol since the previous iteration.
- use_kprototype: bool, optional
If set to True, the function uses the KPrototypes algorithm instead of KMeans. KPrototypes can handle categorical features.
- gamma: float, optional
Only used when use_kprototype = True. Weighting factor for the categorical columns; it determines the relative importance of numerical and categorical attributes.
- elbow_score_stop: float, optional
Stops searching for parameters when the specified elbow score is reached.
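The convergence rule described for tol can be illustrated in a few lines. The sketch below is pure Python for clarity, not the VerticaPy or Vertica implementation; the function name and distance choice (Euclidean) are illustrative assumptions:

```python
import math

def has_converged(old_centers, new_centers, tol=1e-4):
    # Converged when no center has moved more than `tol`
    # (Euclidean distance) since the previous iteration.
    for old, new in zip(old_centers, new_centers):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)))
        if dist > tol:
            return False
    return True

# One center moved ~5e-5 (within tol), the other moved 0.2 (not converged).
print(has_converged([(0.0, 0.0), (1.0, 1.0)],
                    [(0.00003, 0.00004), (1.2, 1.0)]))  # False
print(has_converged([(0.0, 0.0)], [(0.00003, 0.00004)]))  # True
```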
Returns¶
- int
The optimal KMeans / KPrototypes k.
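Conceptually, the search stops at the first k whose elbow score reaches elbow_score_stop. The following pure-Python sketch shows only that stopping logic; the score function is a stand-in, not VerticaPy's internal model fitting:

```python
def find_best_k(score_for_k, n_cluster=(1, 100), elbow_score_stop=0.8):
    # Accept either a (start, stop) tuple or an explicit list of k values,
    # mirroring the two accepted forms of `n_cluster`.
    ks = range(*n_cluster) if isinstance(n_cluster, tuple) else n_cluster
    for k in ks:
        if score_for_k(k) >= elbow_score_stop:
            return k          # first k reaching the threshold
    return max(ks)            # fallback: largest k tried

# Stand-in score curve that saturates as k grows.
scores = {1: 0.0, 2: 0.55, 3: 0.82, 4: 0.91, 5: 0.93}
print(find_best_k(lambda k: scores.get(k, 0.95), (1, 6), 0.9))  # 4
```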
Examples¶
The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Elbow Curve page.
Load data for machine learning¶
We import verticapy:
import verticapy as vp
Hint
By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like "average" and "median", which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.
For this example, we will use the iris dataset.
import verticapy.datasets as vpd

data = vpd.load_iris()
[Output: interactive preview of the iris dataset — columns: SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species | Rows: 1-100 | Columns: 5]
Note
VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets on the Datasets page, which provides detailed information on each dataset and how to use them effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.
Data Exploration¶
Through a quick scatter plot, we can observe that the data has three main clusters.
data.scatter(
    columns = ["PetalLengthCm", "SepalLengthCm"],
    by = "Species",
)
Elbow Score¶
Let’s compute the optimal k for our KMeans algorithm and check whether it aligns with the three clusters we observed earlier.
from verticapy.machine_learning.model_selection import best_k

best_k(
    input_relation = data,
    X = data.get_columns(exclude_columns = "Species"),  # all columns except Species
    n_cluster = (1, 100),
    init = "kmeanspp",
    elbow_score_stop = 0.9,
)
Out[3]: 4
Note
You can experiment with the elbow score to determine the optimal number of clusters. The score is based on the ratio of the Between-Cluster Sum of Squares to the Total Sum of Squares, providing a way to assess the clustering quality. A score of 1 indicates a perfect clustering.
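The ratio described in the note can be computed directly. Here is a small pure-Python sketch for 1-D points (illustrative only, not the VerticaPy internals):

```python
def elbow_score(points, labels):
    # Elbow score = Between-Cluster Sum of Squares / Total Sum of Squares.
    mean = sum(points) / len(points)
    tss = sum((p - mean) ** 2 for p in points)
    bss = 0.0
    for lbl in set(labels):
        cluster = [p for p, l in zip(points, labels) if l == lbl]
        c_mean = sum(cluster) / len(cluster)
        bss += len(cluster) * (c_mean - mean) ** 2
    return bss / tss  # 1.0 would mean a perfect clustering

# Two tight, well-separated clusters -> score close to 1.
points = [1.0, 1.1, 0.9, 10.0, 10.1, 9.9]
labels = [0, 0, 0, 1, 1, 1]
print(round(elbow_score(points, labels), 4))  # 0.9997
```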
See also
elbow()
: Draws an Elbow curve.