Mining Model Content for Sequence Clustering Models (Analysis Services  Data Mining)
Topic Status: Some information in this topic is preview and subject to change in future releases. Preview information describes new features or changes to existing features in Microsoft SQL Server 2016 Community Technology Preview 2 (CTP2).
This topic describes mining model content that is specific to models that use the Microsoft Sequence Clustering algorithm. For an explanation of general and statistical terminology related to mining model content that applies to all model types, see Mining Model Content (Analysis Services  Data Mining).
A sequence clustering model has a single parent node (NODE_TYPE = 1) that represents the model and its metadata. The parent node, which is labeled (All), has a related sequence node (NODE_TYPE = 13) that lists all the transitions that were detected in the training data.
The algorithm also creates a number of clusters, based on the transitions that were found in the data and any other input attributes included when creating the model, such as customer demographics and so forth. Each cluster (NODE_TYPE = 5) contains its own sequence node (NODE_TYPE = 13) that lists only the transitions that were used in generating that specific cluster. From the sequence node, you can drill down to view the details of individual state transitions (NODE_TYPE = 14).
For an explanation of sequence and state transitions, with examples, see Microsoft Sequence Clustering Algorithm.
A sequence clustering model has a unique structure that combines two kinds of objects with very different types of information: the first are clusters, and the second are state transitions.
The clusters created by sequence clustering are like the clusters created by the Microsoft Clustering algorithm. Each cluster has a profile and characteristics. However, in sequence clustering, each cluster additionally contains a single child node that lists the sequences in that cluster. Each sequence node contains multiple child nodes that describe the state transitions in detail, with probabilities.
There are almost always more sequences in the model than you can find in any single case, because the sequences can be chained together. Microsoft Analysis Services stores pointers from one state to the other so that you can count the number of times each transition happens. You can also find information about how many times the sequence occurred, and measure its probability of occurring as compared to the entire set of observed states.
The following table summarizes how information is stored in the model, and how the nodes are related.
Node 
Has child node 
NODE_DISTRIBUTION table 

Model root 
Multiple cluster nodes Node with sequences for entire model 
Lists all products in the model, with support and probability. Because the clustering method permits partial membership in multiple clusters, support and probability can have fractional values. That is, instead of counting a single case once, each case can potentially belong to multiple clusters. Therefore, when the final cluster membership is determined, the value is adjusted by the probability of that cluster. 
Sequence node for model 
Multiple transition nodes 
Lists all products in the model, with support and probability. Because the number of sequences is known for the model, at this level, calculations for support and probability are straightforward:

Individual cluster nodes 
Node with sequences for that cluster only 
Lists all products in a cluster, but provides support and probability values only for products that are characteristic of the cluster. Support represents the adjusted support value for each case in this cluster. Probability values are adjusted probability. 
Sequence nodes for individual clusters 
Multiple nodes with transitions for sequences in that cluster only 
Exactly the same information as in individual cluster nodes. 
Transitions 
No children 
Lists transitions for the related first state. Support is an adjusted support value, indicating the cases that take part in each transition. Probability is the adjusted probability, represented as a percentage. 
NODE_DISTRIBUTION Table
The NODE_DISTRIBUTION table provides detailed probability and support information for the transitions and sequences for a specific cluster.
A row is always added to the transition table to represent possible Missing values. For information about what the Missing value means, and how it affects calculations, see Missing Values (Analysis Services  Data Mining).
The calculations for support and probability differ depending on whether the calculation applies to the training cases or to the finished model. This is because the default clustering method, Expectation Maximization (EM), assumes that any case can belong to more than one cluster. When calculating support for the cases in the model, it is possible to use raw counts and raw probabilities. However, the probabilities for any particular sequence in a cluster must be weighted by the sum of all possible sequence and cluster combinations.
Cardinality
In a clustering model, the cardinality of the parent node generally tells you how many clusters are in the model. However, a sequence clustering model has two kinds of nodes at the cluster level: one kind of node contains clusters, and the other kind of node contains a list of sequences for the model as a whole.
Therefore, to learn the number of clusters in the model, you can take the value of NODE_CARDINALITY for the (All) node and subtract one. For example, if the model created 9 clusters, the cardinality of the model root is 10. This is because the model contains 9 cluster nodes, each with its own sequence node, plus one additional sequence node labeled cluster 10, which represents the sequences for the model.
An example might help clarify how the information is stored, and how you can interpret it. For example, you can find the largest order, meaning the longest observed chain in the underlying AdventureWorksDW2012 data, by using the following query:
USE AdventureWorksDW2012 SELECT DISTINCT OrderNumber, Count(*) FROM vAssocSeqLineItems GROUP BY OrderNumber ORDER BY Count(*) DESC
From these results, you find that the order numbers 'SO72656', 'SO58845', and 'SO70714' contain the largest sequences, with eight items each. By using the order IDs, you can view the details of a particular order to see which items were purchased, and in what order.
OrderNumber  LineNumber  Model 

SO58845  1  Mountain500 
SO58845  2  LL Mountain Tire 
SO58845  3  Mountain Tire Tube 
SO58845  4  Fender Set  Mountain 
SO58845  5  Mountain Bottle Cage 
SO58845  6  Water Bottle 
SO58845  7  Sport100 
SO58845  8  LongSleeve Logo Jersey 
However, some customers who purchase the Mountain500 might purchase different products. You can view all the products that follow the Mountain500 by viewing the list of sequences in the model. The following procedures walk you through viewing these sequences by using the two viewers provided in Analysis Services:
To view related sequences by using the Sequence Clustering viewer
In Object Explorer, rightclick the [Sequence Clustering] model, and select Browse.
In the Sequence Clustering viewer, click the State Transitions tab.
In the Cluster dropdown list, ensure that Population (All) is selected.
Move the slider bar at the left of the pane all the way to the top, to show all links.
In the diagram, locate Mountain500, and click the node in the diagram.
The highlighted lines point to the next states (the products that were purchased after the Mountain500) and the numbers indicate the probability. Compare these to the results in the generic model content viewer.
To view related sequences by using the generic model content viewer
In Object Explorer, rightclick the [Sequence Clustering] model, and select Browse.
In the viewer dropdown list, select the Microsoft Generic Content Tree Viewer.
In the Node caption pane, click the node named Sequence level for cluster 16.
In the Node details pane, find the NODE_DISTRIBUTION row, and click anywhere in the nested table.
The top row is always for the Missing value. This row is sequence state 0.
Press the down arrow key, or use the scroll bars, to move down through the nested table until you see the row, Mountain500.
This row is sequence state 20.
Note You can obtain the row number for a particular sequence state programmatically, but if you are just browsing, it might be easier to simply copy the nested table into an Excel workbook.
Return to the Node caption pane, and expand the node, Sequence level for cluster 16, if it is not already expanded.
Look among its child nodes for Transition row for sequence state 20. Click the transition node.
The nested NODE_DISTRIBUTION table contains the following products and probabilities. Compare these to the results in the State Transition tab of the Sequence Clustering viewer.
The following table shows the results from the NODE_DISTRIBUTION table, together with the rounded probability values that are displayed in the graphical viewer.
Product  Support (NODE_DISTRIBUTION table)  Probability (NODE_DISTRIBUTION) table)  Probability (from graph) 

Missing  48.447887  0.138028169  (not shown) 
Cycling Cap  10.876056  0.030985915  0.03 
Fender Set  Mountain  80.087324  0.228169014  0.23 
HalfFinger Gloves  0.9887324  0.002816901  0.00 
Hydration Pack  0.9887324  0.002816901  0.00 
LL Mountain Tire  51.414085  0.146478873  0.15 
LongSleeve Logo Jersey  2.9661972  0.008450704  0.01 
Mountain Bottle Cage  87.997183  0.250704225  0.25 
Mountain Tire Tube  16.808451  0.047887324  0.05 
ShortSleeve Classic Jersey  10.876056  0.030985915  0.03 
Sport100  20.76338  0.05915493  0.06 
Water Bottle  18.785915  0.053521127  0.25 
Although the case that we initially selected from the training data contained the product 'Mountain500' followed by 'LL Mountain Tire', you can see that there are many other possible sequences. To find detailed information for any particular cluster, you must repeat the process of drilling down from the list of sequences in the cluster to the actual transitions for each state, or product.
You can jump from the sequence listed in one particular cluster, to the transition row. From that transition row, you can determine which product is next, and jump back to that product in the list of sequences. By repeating this process for each first and second state you can work through long chains of states.
A common scenario for sequence clustering is to track user clicks on a Web site. For example, if the data were from records of customer purchases on the Adventure Works ecommerce Web site, the resulting sequence clustering model could be used to infer user behavior, to redesign the ecommerce site to solve navigation problems, or to promote sales.
For example, analysis might show that users always follow a particular chain of products, regardless of demographics. Also, you might find that users frequently exit the site after clicking on a particular product. Given that finding, you might ask what additional paths you could provide to users that would induce users to stay on the Web site.
If you do not have additional information to use in classifying your users, then you can simply use the sequence information to collect data about navigation to better understand overall behavior. However, if you can collect information about customers and match that information with your customer database, you can combine the power of clustering with prediction on sequences to provide recommendations that are tailored to the user, or perhaps based on the path of navigation to the current page.
Another use of the extensive state and transition information compiled by a sequence clustering model is to determine which possible paths are never used. For example, if you have many visitors going to pages 14, but visitors never continue on to page 5, you might investigate whether there are problems that prevent navigation to page 5. You can do this by querying the model content, and comparing it against a list of possible paths. Graphs that tell you all the navigation paths in a Web site can be created programmatically, or by using a variety of site analysis tools.
To find out how to obtain the list of observed paths by querying the model content, and to see other examples of queries on a sequence clustering model, see Sequence Clustering Model Query Examples.