Microsoft Clustering Algorithm Technical Reference
Topic Status: Some information in this topic is preview and subject to change in future releases. Preview information describes new features or changes to existing features in Microsoft SQL Server 2016 Community Technology Preview 2 (CTP2).
This section explains the implementation of the Microsoft Clustering algorithm, including the parameters that you can use to control the behavior of clustering models. It also provides guidance about how to improve performance when you create and process clustering models.
For additional information about how to use clustering models, see the following topics:
The Microsoft Clustering algorithm provides two methods for creating clusters and assigning data points to the clusters. The first, the Kmeans algorithm, is a hard clustering method. This means that a data point can belong to only one cluster, and that a single probability is calculated for the membership of each data point in that cluster. The second method, the Expectation Maximization (EM) method, is a soft clustering method. This means that a data point always belongs to multiple clusters, and that a probability is calculated for each combination of data point and cluster.
You can choose which algorithm to use by setting the CLUSTERING_METHOD parameter. The default method for clustering is scalable EM.
EM Clustering
In EM clustering, the algorithm iteratively refines an initial cluster model to fit the data and determines the probability that a data point exists in a cluster. The algorithm ends the process when the probabilistic model fits the data. The function used to determine the fit is the loglikelihood of the data given the model.
If empty clusters are generated during the process, or if the membership of one or more of the clusters falls below a given threshold, the clusters with low populations are reseeded at new points and the EM algorithm is rerun.
The results of the EM clustering method are probabilistic. This means that every data point belongs to all clusters, but each assignment of a data point to a cluster has a different probability. Because the method allows for clusters to overlap, the sum of items in all the clusters may exceed the total items in the training set. In the mining model results, scores that indicate support are adjusted to account for this.
The EM algorithm is the default algorithm used in Microsoft clustering models. This algorithm is used as the default because it offers multiple advantages in comparison to kmeans clustering:

Requires one database scan, at most.

Will work despite limited memory (RAM).

Has the ability to use a forwardonly cursor.

Outperforms sampling approaches.
The Microsoft implementation provides two options: scalable and nonscalable EM. By default, in scalable EM, the first 50,000 records are used to seed the initial scan. If this is successful, the model uses this data only. If the model cannot be fit using 50,000 records, an additional 50,000 records are read. In nonscalable EM, the entire dataset is read regardless of its size. This method might create more accurate clusters, but the memory requirements can be significant. Because scalable EM operates on a local buffer, iterating through the data is much faster, and the algorithm makes much better use of the CPU memory cache than nonscalable EM. Moreover, scalable EM is three times faster than nonscalable EM, even if all the data can fit in main memory. In the majority of cases, the performance improvement does not lead to lower quality of the complete model.
For a technical report that describes the implementation of EM in the Microsoft Clustering algorithm, see Scaling EM (Expectation Maximization) Clustering to Large Databases.
KMeans Clustering
Kmeans clustering is a wellknown method of assigning cluster membership by minimizing the differences among items in a cluster while maximizing the distance between clusters. The "means" in kmeans refers to the centroid of the cluster, which is a data point that is chosen arbitrarily and then refined iteratively until it represents the true mean of all data points in the cluster. The "k" refers to an arbitrary number of points that are used to seed the clustering process. The kmeans algorithm calculates the squared Euclidean distances between data records in a cluster and the vector that represents the cluster mean, and converges on a final set of k clusters when that sum reaches its minimum value.
The kmeans algorithm assigns each data point to exactly one cluster, and does not allow for uncertainty in membership. Membership in a cluster is expressed as a distance from the centroid.
Typically, the kmeans algorithm is used for creating clusters of continuous attributes, where calculating distance to a mean is straightforward. However, the Microsoft implementation adapts the kmeans method to cluster discrete attributes, by using probabilities. For discrete attributes, the distance of a data point from a particular cluster is calculated as follows:
1  P(data point, cluster)
Note 

The Microsoft Clustering algorithm does not expose the distance function used in computing kmeans, and measures of distance are not available in the completed model. However, you can use a prediction function to return a value that corresponds to distance, where distance is computed as the probability of a data point belonging to the cluster. For more information, see ClusterProbability (DMX). 
The kmeans algorithm provides two methods of sampling the data set: nonscalable Kmeans, which loads the entire data set and makes one clustering pass, or scalable kmeans, where the algorithm uses the first 50,000 cases and reads more cases only if it needs more data to achieve a good fit of model to data.
Updates to the Microsoft Clustering Algorithm in SQL Server 2008
In SQL Server 2008, the default configuration of the Microsoft clustering algorithm was changed to use the internal parameter, NORMALIZATION = 1. Normalization is performed using zscore statistics, and assumes normal distribution. The intent of this change in the default behavior is to minimize the effect of attributes that might have large magnitudes and many outliers. However, zscore normalization may alter the clustering results on distributions that are not normal (such as uniform distributions). To prevent normalization and obtain the same behavior as the Kmeans clustering algorithm in SQL Server 2005, you can use the Parameter Settings dialog box to add the custom parameter, NORMALIZATION, and set its value to 0.
Note 

The NORMALIZATION parameter is an internal property of the Microsoft Clustering algorithm and is not supported. In general, the use of normalization is recommended in clustering models to improve model results. 
The Microsoft Clustering algorithm supports several parameters that affect the behavior, performance, and accuracy of the resulting mining model.
Setting Algorithm Parameters
The following table describes the parameters that can be used with the Microsoft Clustering algorithm. These parameters affect both the performance and accuracy of the resulting mining model.
Modeling Flags
The algorithm supports the following modeling flags. You define modeling flags when you create the mining structure or mining model. The modeling flags specify how values in each column are handled during analysis.
Modeling Flag 
Description 

MODEL_EXISTENCE_ONLY 
The column will be treated as having two possible states: Missing and Existing. A null is a missing value. Applies to mining model column. 
NOT NULL 
The column cannot contain a null. An error will result if Analysis Services encounters a null during model training. Applies to mining structure column. 
A clustering model must contain a key column and input columns. You can also define input columns as being predictable. Columns set to Predict Only are not used to build clusters. The distribution of these values in the clusters are calculated after the clusters are built.
Input and Predictable Columns
The Microsoft Clustering algorithm supports the specific input columns and predictable columns that are listed in the following table. For more information about what the content types mean when used in a mining model, see Content Types (Data Mining).
Column 
Content types 

Input attribute 
Continuous, Cyclical, Discrete, Discretized, Key, Table, Ordered 
Predictable attribute 
Continuous, Cyclical, Discrete, Discretized, Table, Ordered 
Note 

Cyclical and Ordered content types are supported, but the algorithm treats them as discrete values and does not perform special processing. 