
verticapy.machine_learning.model_selection.elbow#

verticapy.machine_learning.model_selection.elbow(input_relation: str | vDataFrame, X: str | list[str] | None = None, n_cluster: tuple | list = (1, 15), init: Literal['kmeanspp', 'random', None] = None, max_iter: int = 50, tol: float = 0.0001, use_kprototype: bool = False, gamma: float = 1.0, show: bool = True, chart: PlottingBase | TableSample | Axes | mFigure | Highchart | Highstock | Figure | None = None, **style_kwargs) → TableSample#

Draws an Elbow curve.

Parameters#

input_relation: SQLRelation

Relation used to train the model.

X: SQLColumns, optional

list of the predictor columns. If empty, all numerical columns are used.

n_cluster: tuple | list, optional

Tuple representing the number of clusters to start and end with. This can also be a customized list of specific k values to test (see the sketch after this parameter list).

init: str | list, optional

The method used to find the initial cluster centers.

  • kmeanspp:

    Only available when use_kprototype = False. Uses the k-means++ method to initialize the centers.

  • random:

    Randomly subsamples the data to find initial centers.

Default value is kmeanspp if use_kprototype = False; otherwise, random.

max_iter: int, optional

The maximum number of iterations for the algorithm.

tol: float, optional

Determines whether the algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of tol from the previous iteration.

use_kprototype: bool, optional

If set to True, the function uses the KPrototypes algorithm instead of KMeans. KPrototypes can handle categorical features.

gamma: float, optional

Only used when use_kprototype = True. Weighting factor for categorical columns; it determines the relative importance of numerical and categorical attributes.

show: bool, optional

If set to True, the Plotting object is returned.

chart: PlottingObject, optional

The chart object to plot on.

**style_kwargs

Any optional parameter to pass to the Plotting functions.
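For illustration, the n_cluster and use_kprototype parameters described above can be combined as in the following minimal sketch. It is not part of the worked example below and assumes an open database connection and a vDataFrame named data with hypothetical columns col1, col2, and cat_col:

from verticapy.machine_learning.model_selection import elbow

# Sweep a (start, end) range of k values with k-means++ initialization.
elbow(input_relation = data, X = ["col1", "col2"], n_cluster = (1, 15), init = "kmeanspp")

# Test only a handful of specific k values by passing an explicit list.
elbow(input_relation = data, X = ["col1", "col2"], n_cluster = [2, 3, 5, 8])

# Mixed numerical / categorical predictors: switch to KPrototypes and
# weight the categorical columns with gamma.
elbow(
    input_relation = data,
    X = ["col1", "cat_col"],
    n_cluster = (1, 10),
    use_kprototype = True,
    gamma = 1.0,
    init = "random",  # kmeanspp is not available with KPrototypes
)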

Returns#

TableSample

nb_clusters, total_within_cluster_ss, between_cluster_ss, total_ss, elbow_score
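The returned TableSample can also be inspected programmatically. A minimal sketch, assuming the standard TableSample interface (a values dictionary of column lists and a to_pandas() conversion) and the same hypothetical data and columns as above:

from verticapy.machine_learning.model_selection import elbow

# Compute the scores without drawing the chart.
result = elbow(input_relation = data, X = ["col1", "col2"], show = False)

# Each column listed above is available as a list in result.values;
# result.to_pandas() returns the same table as a pandas DataFrame.
for k, score in zip(result.values["nb_clusters"], result.values["elbow_score"]):
    print(f"k = {k}: elbow score = {score}")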

Examples#

The following examples provide a basic understanding of usage. For more detailed examples, please refer to the Elbow Curve page.

Load data for machine learning#

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will use the iris dataset.

import verticapy.datasets as vpd

data = vpd.load_iris()
[Output: the first rows of the iris vDataFrame, with columns SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm (Numeric(7)) and Species (Varchar(30)).]

Rows: 1-100 | Columns: 5

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets page, which provides detailed information on each dataset and how to use it effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.

Data Exploration#

Through a quick scatter plot, we can observe that the data has three main clusters.

data.scatter(
    columns = ["PetalLengthCm", "SepalLengthCm"],
    by = "Species",
)

Elbow Curve#

Let’s compute the optimal k for our KMeans algorithm and check if it aligns with the three clusters we observed earlier.

To achieve this, let’s create the Elbow curve.

from verticapy.machine_learning.model_selection import elbow

elbow(
    input_relation = data,
    X = data.get_columns(exclude_columns = "Species"), # All columns except Species
    n_cluster = (1, 100),
    init = "kmeanspp",
)
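Because the iris dataset also contains the categorical Species column, the same curve could be drawn with KPrototypes instead of KMeans. This is a sketch rather than part of the original example, and the gamma value is an arbitrary choice:

elbow(
    input_relation = data,
    X = data.get_columns(),   # keep Species this time
    n_cluster = (1, 10),
    use_kprototype = True,    # KPrototypes handles the categorical column
    gamma = 1.0,              # relative weight of the categorical features
    init = "random",          # kmeanspp is not available with KPrototypes
)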

Note

You can experiment with the Elbow score to determine the optimal number of clusters. The score is based on the ratio of Between-Cluster Sum of Squares to Total Sum of Squares, providing a way to assess the clustering accuracy. A score of 1 indicates perfect clustering.
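To make this concrete, the score can be compared with the between-cluster to total sum-of-squares ratio reported in the same TableSample. A hypothetical check, assuming the column names listed in the Returns section:

res = elbow(
    input_relation = data,
    X = data.get_columns(exclude_columns = "Species"),
    n_cluster = (1, 10),
    show = False,
)

vals = res.values
for k, between, total, score in zip(
    vals["nb_clusters"], vals["between_cluster_ss"], vals["total_ss"], vals["elbow_score"]
):
    # The elbow score should track the explained-variance ratio between / total.
    print(k, between / total, score)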

Note

It’s evident from the Elbow curve that k=3 is a suitable choice, indicating the optimal number of clusters for the KMeans algorithm.

See also

best_k() : Finds the KMeans / KPrototypes k based on a score.
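A hedged usage sketch for best_k(), assuming it accepts the same data arguments as elbow() and returns the selected number of clusters:

from verticapy.machine_learning.model_selection import best_k

k = best_k(
    input_relation = data,
    X = data.get_columns(exclude_columns = "Species"),
    n_cluster = (1, 10),
)
print(k)  # expected to be close to 3 for the iris dataset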