K-Means Clustering

 

Published: March 2, 2015

Updated: November 9, 2016

Configures and initializes a K-means clustering model

Category: Machine Learning / Initialize Model / Clustering

You can use the K-Means Clustering module to create an untrained K-means clustering model. K-means is one of the simplest and best-known unsupervised learning algorithms, and can be used for a variety of machine learning tasks, such as detecting abnormal data, clustering text documents, and analyzing a dataset before applying classification or regression methods.

After you have configured the module parameters, you must pass the untrained model to the Train Clustering Model or the Sweep Clustering modules to train the model on a set of input data that you provide.

Because the K-means algorithm is an unsupervised learning method, the data you use to train the model does not need a label column. In other words, you don’t need to know any of the cluster categories in advance; the algorithm will find possible categories based solely on the data.

If your training data already has labels, you can use one of the supervised classification methods provided in Azure Machine Learning. Or, you can use the label values to guide selection of the clusters.

In general, clustering uses iterative techniques to group cases in a dataset into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and eventually for making predictions. Clustering models also can help you identify relationships in a dataset that you might not logically derive by browsing or simple observation.

For these reasons, clustering is often used in the early phases of machine learning tasks, to explore the data and discover unexpected correlations.

When you configure a clustering model using the k-means method, you must specify a target number k indicating the number of centroids you want in the model. The centroid is a point that is representative of each cluster. The K-means algorithm assigns each incoming data point to one of the clusters by minimizing the within-cluster sum of squares.
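
The within-cluster sum of squares mentioned above can be written as follows. The symbols are notation introduced here for illustration and do not come from the module itself: $C_j$ is the set of data points assigned to cluster $j$, $\mu_j$ is the centroid of that cluster, and $k$ is the number of centroids.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

A lower value of $J$ means that points lie closer to their assigned centroids.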

The K-means algorithm begins with an initial set of centroids, which are like central starting points for each cluster, and then uses Lloyd's algorithm to iteratively refine the locations of the centroids. The K-means algorithm stops building and refining clusters when it meets one or more of these conditions:

  • The centroids stabilize, meaning that cluster assignments for individual points no longer change and the algorithm has converged on a solution.

  • The algorithm has completed the specified number of iterations.

After completing the training phase, you use the Assign Data to Clusters module to assign new cases to one of the clusters that was found by the k-means algorithm. Cluster assignment is performed by computing the distance between the new case and the centroid of each cluster. Each new case is assigned to the cluster with the nearest centroid.
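
The nearest-centroid rule can be illustrated with a minimal Python sketch. This is not the code used by the Assign Data to Clusters module; the function names, the sample centroids, and the definition of cosine distance as 1 minus cosine similarity are all assumptions made for illustration. The `distance` argument plays the role of the Metric setting described later in this topic.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points of equal dimension.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # Assumed definition: 1 - cosine similarity. Undefined for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def assign_to_cluster(case, centroids, distance=euclidean_distance):
    # Return the index of the centroid nearest to the new case.
    return min(range(len(centroids)), key=lambda j: distance(case, centroids[j]))

# Hypothetical centroids from a trained model, and one new case.
centroids = [(1.0, 0.0), (10.0, 10.0)]
new_case = (3.0, 3.0)

by_euclidean = assign_to_cluster(new_case, centroids)
by_cosine = assign_to_cluster(new_case, centroids, distance=cosine_distance)
```

In this sketch the same case lands in cluster 0 under Euclidean distance but in cluster 1 under cosine distance, because cosine distance compares only the direction of the vectors and ignores their length.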

Generating the Best Clustering Model

In general, the seeding process used during clustering can significantly affect the model. Seeding means the initial selection of points as potential centroids.

For example, if the dataset contains many outliers, and an outlier is chosen to seed the clusters, no other data points would fit well with that cluster and the cluster could be a singleton -- that is, a cluster with only one point.

There are various ways to avoid this problem:

  • Use a parameter sweep to change the number of centroids and try multiple seed values.

  • Create multiple models, varying the metric or iterating more.

  • Use a method such as PCA to find variables that have a detrimental effect on clustering. See the Find similar companies sample for a demonstration of this technique.

Important

In general, with clustering models, it is possible that any given configuration will result in a locally optimized set of clusters: in other words, the set of clusters you get really suits only the current data points, and is not generalizable. If you used a different initial configuration, the K-means method might find a different, perhaps superior, configuration.

Therefore, we recommend that you always experiment with the parameters, create multiple models, and compare the resulting models.

  1. Add the K-Means Clustering module to your experiment.

  2. Specify how you want the model to be trained, by setting the Create trainer mode option.

    • Single Parameter. If you know the exact parameters you want to use in the clustering model, you can provide a specific set of values as arguments.

    • Parameter Range. If you are not sure of the best parameters, you can find the optimal parameters by specifying multiple values and using the Sweep Clustering module to find the optimal configuration.

      The trainer iterates over multiple combinations of the settings you provided and determines the combination of values that produces the optimal clustering results.

  3. For Number of Centroids, type the number of clusters you want the algorithm to begin with.

    The model is not guaranteed to produce exactly this number of clusters, but it starts with this number of data points and iterates to find the optimal configuration, as described in the Technical Notes section.

    If you are performing a parameter sweep, the name of the property changes to Range for Number of Centroids. You can use the Range Builder to specify a range, or you can type a series of numbers representing different numbers of clusters to create when initializing each model.

  4. The properties Initialization or Initialization for sweep are used to specify the algorithm that is used to define the initial cluster configuration.

    • First N. Some initial number of data points are chosen from the data set and used as the initial means.

      Also called the Forgy method.

    • Random. The algorithm randomly places a data point in a cluster and then computes the initial mean to be the centroid of the cluster's randomly assigned points.

      Also called the random partition method.

    • K-Means++. This is the default method for initializing clusters.

      The K-means++ algorithm was proposed in 2007 by David Arthur and Sergei Vassilvitskii to avoid poor clustering by the standard K-means algorithm. K-means++ improves upon standard K-means by using a different method for choosing the initial cluster centers.

    • K-Means++Fast. A variant of the K-means++ algorithm that was optimized for faster clustering.

    • Evenly. Centroids are located equidistant from each other in the d-dimensional space of n data points.

    • Use label column. The values in the label column are used to guide the selection of centroids.

  5. For Random number seed, optionally type a value to use as the seed for the cluster initialization. This value can have a significant effect on cluster selection.

    If you use a parameter sweep, you can specify that multiple initial seeds be created, to look for the best initial seed value. For Number of seeds to sweep, type the total number of random seed values to use as starting points.

  6. For Metric, choose the function to use for measuring the distance between cluster vectors, or between new data points and the randomly chosen centroid. Azure Machine Learning supports the following cluster distance metrics:

    • Euclidean. The Euclidean distance is commonly used as a measure of cluster scatter for K-means clustering. This metric is preferred because it minimizes the mean distance between points and the centroids.

    • Cosine. The cosine function is used to measure cluster similarity. Cosine similarity is useful in cases where you do not care about the length of a vector, only its angle.

  7. For Iterations, type the number of times the algorithm should iterate over the training data before finalizing the selection of centroids.

    You can adjust this parameter to balance accuracy vs. training time.

  8. For Assign label mode, choose an option that specifies how a label column, if present in the dataset, should be handled.

    Because K-means clustering is an unsupervised machine learning method, labels are optional. However, if your dataset already has a label column, you can use those values to guide selection of the clusters, or you can specify that the values be ignored.

    • Ignore label column. The values in the label column are ignored and are not used in building the model.

    • Fill missing values. The label column values are used as features to help build the clusters. If any rows are missing a label, the value is imputed by using other features.

    • Overwrite from closest to center. The label column values are replaced with predicted label values, using the label of the point that is closest to the current centroid.

  9. Train the model.

    • If you set Create trainer mode to Single Parameter, add a tagged dataset and train the model by using the Train Clustering Model module.

    • If you set Create trainer mode to Parameter Range, add a tagged dataset and train the model using Sweep Clustering. You can use the model trained using those parameters, or you can make a note of the parameter settings to use when configuring a learner.
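
To make the K-Means++ initialization option from step 4 more concrete, the following Python sketch shows the seeding idea published by Arthur and Vassilvitskii: the first centroid is picked uniformly at random, and each later centroid is picked with probability proportional to its squared distance from the nearest centroid chosen so far. This is an illustration only, not the module's implementation; the function names, the `seed` argument, and the sample data are assumptions.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seed(points, k, seed=None):
    # Hypothetical helper: choose k initial centroids from `points`.
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid,
        # so far-away points are more likely to become the next centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # All remaining points coincide with a chosen centroid; fall back to uniform.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# Two well-separated groups of points (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeans_pp_seed(points, k=2, seed=0)
```

Because a point that is already a centroid has weight zero, it cannot be selected again when the points are distinct, and the second centroid in this example is far more likely to come from the opposite group than from the same one.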

For examples of how K-means clustering is used in Azure Machine Learning, see these experiments in the Model Gallery:

Technical Notes

Given a specific number of clusters (K) to find for a set of D-dimensional data points with N data points, the K-means algorithm builds the clusters as follows:

  1. The module initializes a K-by-D array to hold the final centroids that define the K clusters found.

  2. By default, the module assigns the first K data points in order to the K clusters.

  3. Starting with an initial set of K centroids, the method uses Lloyd's algorithm to iteratively refine the locations of the centroids.

  4. The algorithm terminates when the centroids stabilize or when a specified number of iterations are completed.

  5. A similarity metric (by default, Euclidean distance) is used to assign each data point to the cluster that has the closest centroid.
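
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the module's code: the function names, the `initial_centroids` argument, the handling of empty clusters, and the sample data are all hypothetical. The sketch uses the first K data points as initial centroids by default, applies Lloyd's algorithm with squared Euclidean distance, and stops when assignments no longer change or the iteration limit is reached.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    # Index of the closest centroid (ties go to the lower index).
    return min(range(len(centroids)), key=lambda j: squared_distance(point, centroids[j]))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iterations=100, initial_centroids=None):
    # Default initialization: the first K data points become the centroids.
    centroids = list(initial_centroids) if initial_centroids else [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [nearest(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # Centroids have stabilized: no point changed cluster.
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # Assumption: an empty cluster keeps its old centroid.
                centroids[j] = mean(members)
    # Final assignment against the final centroids.
    assignments = [nearest(p, centroids) for p in points]
    return centroids, assignments

def wcss(points, centroids, assignments):
    # Within-cluster sum of squares for a finished clustering.
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))

# Four corners of a 4-by-1 rectangle (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

first_k = kmeans(points, k=2)
first_k_wcss = wcss(points, *first_k)

alternative = kmeans(points, k=2, initial_centroids=[(0.0, 0.0), (4.0, 0.0)])
alternative_wcss = wcss(points, *alternative)
```

With the first two points as seeds, the sketch converges to a top row and a bottom row (within-cluster sum of squares 16.0), while the alternative seeds yield the left and right pairs (1.0). Both runs converge, which illustrates why the initial configuration can lead to a locally optimal result and why comparing several models is worthwhile.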

Warning

  • If you pass a parameter range to Train Clustering Model, it will use only the first value in the parameter range list.
  • If you pass a single set of parameter values to the Sweep Clustering module, when it expects a range of settings for each parameter, it ignores the values and uses the default values for the learner.
  • If you select the Parameter Range option and enter a single value for any parameter, that single value you specified will be used throughout the sweep, even if other parameters change across a range of values.

Module Parameters

Name                | Range         | Type                           | Default   | Description
Number of Centroids | >=2           | Integer                        | 2         | Number of Centroids
Metric              | List (subset) | Metric                         | Euclidean | Selected metric
Initialization      | List          | Centroid initialization method | K-Means++ | Initialization algorithm
Iterations          | >=1           | Integer                        | 100       | Number of iterations

Outputs

Name            | Type               | Description
Untrained model | ICluster interface | Untrained K-Means clustering model

For a list of all exceptions, see Machine Learning Module Error Codes.

Exception  | Description
Error 0003 | Exception occurs if one or more inputs are null or empty.

Clustering
Assign Data to Clusters
Train Clustering Model
Sweep Clustering
