Microsoft Association Algorithm Technical Reference
The Microsoft Association Rules algorithm is a straightforward implementation of the well-known Apriori algorithm.
Both the Microsoft Decision Trees algorithm and the Microsoft Association Rules algorithm can be used to analyze associations, but the rules that are found by each algorithm can differ. In a decision trees model, the splits that lead to specific rules are based on information gain, whereas in an association model, rules are based completely on confidence. Therefore, in an association model, a strong rule, or one that has high confidence, might not necessarily be interesting because it does not provide new information.
The Apriori algorithm does not analyze patterns, but rather generates and then counts candidate itemsets. An item can represent an event, a product, or the value of an attribute, depending on the type of data that is being analyzed.
In the most common type of association model, Boolean variables, representing a Yes/No or Missing/Existing value, are assigned to each attribute, such as a product or event name. A market basket analysis is an example of an association rules model that uses Boolean variables to represent the presence or absence of particular products in a customer's shopping basket.
For each itemset, the algorithm then creates scores that represent support and confidence. These scores can be used to rank and derive interesting rules from the itemsets.
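The generate-and-count approach described above can be sketched in a few lines of Python. This is a minimal illustration of the general Apriori idea, not the Analysis Services implementation. The function name `frequent_itemsets`, its parameters, and the sample baskets are all hypothetical.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support, max_size=3):
    """Return {itemset: case count} for itemsets found in at least min_support cases."""
    baskets = [frozenset(b) for b in baskets]

    # Level 1: count how many cases contain each single item.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    frequent = dict(current)

    size = 2
    while current and size <= max_size:
        # Generate candidates: keep a combination only if every
        # (size - 1)-subset was frequent at the previous level.
        items = sorted(set().union(*current))
        candidates = [
            frozenset(c)
            for c in combinations(items, size)
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        ]
        # Count the cases that contain each candidate.
        current = {}
        for cand in candidates:
            n = sum(1 for basket in baskets if cand <= basket)
            if n >= min_support:
                current[cand] = n
        frequent.update(current)
        size += 1
    return frequent

# Hypothetical market-basket data: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
result = frequent_itemsets(baskets, min_support=2)
```

With this sample data, `{butter, milk}` appears in only one basket, so it is dropped, and no three-item candidate is ever generated.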
Association models can also be created for numerical attributes. If the attributes are continuous, the numbers can be discretized, or grouped in buckets. The discretized values can then be handled either as Booleans or as attribute-value pairs.
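As a rough illustration of discretization, the following sketch groups continuous values into equal-width buckets and then turns each bucket into an attribute-value item. Equal-width bucketing is an assumption chosen for simplicity; the discretization method that Analysis Services applies may differ. The function name, the bucket labels, and the sample ages are hypothetical.

```python
def discretize(values, bucket_count):
    """Assign each value to one of bucket_count equal-width buckets (0-based)."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / bucket_count or 1.0
    labels = []
    for v in values:
        # Clamp so that the maximum value lands in the last bucket.
        index = min(int((v - lo) / width), bucket_count - 1)
        labels.append(index)
    return labels

ages = [18, 25, 31, 47, 52, 64]
buckets = discretize(ages, 3)

# Treat each discretized value as an attribute-value pair.
items = [f"Age=bucket{b}" for b in buckets]
```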
Support, Probability, and Importance
Support, which is sometimes referred to as frequency, means the number of cases that contain the targeted item or combination of items. Only items that have at least the specified amount of support can be included in the model.
A frequent itemset refers to a collection of items where the combination of items also has support above the threshold defined by the MINIMUM_SUPPORT parameter. For example, if the itemset is {A,B,C} and the MINIMUM_SUPPORT value is 10, each individual item A, B, and C must be found in at least 10 cases to be included in the model, and the combination of items {A,B,C} must also be found in at least 10 cases.
Note You can also control the number of itemsets in a mining model by specifying the maximum length of an itemset, where length means the number of items.
By default, the support for any particular item or itemset represents a count of the cases that contain that item or items. However, you can also express MINIMUM_SUPPORT as a percentage of the total cases in the data set, by typing the number as a decimal value less than 1. For example, if you specify a MINIMUM_SUPPORT value of 0.03, it means that at least 3% of the total cases in the data set must contain this item or itemset for inclusion in the model. You should experiment with your model to determine whether using a count or percentage makes more sense.
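The two ways of reading MINIMUM_SUPPORT can be sketched as follows. This is only an illustration of the rule stated above (values below 1 are treated as a fraction of all cases, other values as a case count). The helper names and the sample numbers are hypothetical.

```python
def support_threshold(minimum_support, total_cases):
    """Return the minimum number of cases implied by a MINIMUM_SUPPORT value."""
    if minimum_support < 1:
        # A decimal value below 1 is read as a fraction of all cases.
        return minimum_support * total_cases
    # Any other value is read as an absolute case count.
    return float(minimum_support)

def is_frequent(case_count, minimum_support, total_cases):
    """Check whether an item or itemset meets the support threshold."""
    return case_count >= support_threshold(minimum_support, total_cases)

total_cases = 2000
# An itemset found in 45 cases passes a count threshold of 10 ...
as_count = is_frequent(45, 10, total_cases)
# ... but fails a 3% threshold, which works out to about 60 cases here.
as_fraction = is_frequent(45, 0.03, total_cases)
```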
In contrast, the threshold for rules is expressed not as a count or percentage, but as a probability, sometimes referred to as confidence. For example, if the itemset {A,B,C} occurs in 50 cases, but the itemset {A,B,D} also occurs in 50 cases, and the itemset {A,B} in another 50 cases, {A,B} is clearly not a strong predictor of {C}. Therefore, to weight a particular outcome against all known outcomes, Analysis Services calculates the probability of the individual rule (such as If {A,B} Then {C}) by dividing the support for the itemset {A,B,C} by the support for all related itemsets.
You can restrict the number of rules that a model produces by setting a value for MINIMUM_PROBABILITY.
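The {A,B} example above can be worked through in code. The sketch below assumes that "all related itemsets" means every case containing the left-hand side of the rule, which is the usual definition of confidence. The function name and the threshold value of 0.4 are hypothetical.

```python
def rule_probability(cases, left, right):
    """Probability (confidence) of the rule: If left Then right."""
    left, right = frozenset(left), frozenset(right)
    # Cases that contain the left-hand side, with or without the right-hand side.
    left_count = sum(1 for c in cases if left <= c)
    # Cases that contain both sides of the rule.
    both_count = sum(1 for c in cases if (left | right) <= c)
    return both_count / left_count if left_count else 0.0

# 50 cases each of {A,B,C}, {A,B,D}, and {A,B} alone, as in the example above.
cases = [frozenset("ABC")] * 50 + [frozenset("ABD")] * 50 + [frozenset("AB")] * 50

p = rule_probability(cases, "AB", "C")  # 50 of the 150 cases with {A,B} contain C

# A hypothetical MINIMUM_PROBABILITY of 0.4 would exclude this rule.
minimum_probability = 0.4
keeps_rule = p >= minimum_probability
```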
For each rule that is created, Analysis Services outputs a score that indicates its importance, which is also referred to as lift. Importance is calculated differently for itemsets and rules.
The importance of an itemset is calculated as the probability of the itemset divided by the compound probability of the individual items in the itemset. For example, if an itemset contains {A,B}, Analysis Services first counts all the cases that contain the combination of A and B, divides that count by the total number of cases, and then normalizes the probability.
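A minimal sketch of the itemset calculation follows. It assumes that the "compound probability" is the product of the individual item probabilities, and it does not attempt to reproduce any further normalization that Analysis Services applies. The function name and sample cases are hypothetical.

```python
def itemset_importance(cases, itemset):
    """Probability of the itemset divided by the product of its item probabilities."""
    total = len(cases)
    itemset = frozenset(itemset)
    # Probability that a case contains every item in the itemset.
    joint = sum(1 for c in cases if itemset <= c) / total
    # Probability expected if the items occurred independently of each other.
    independent = 1.0
    for item in itemset:
        independent *= sum(1 for c in cases if item in c) / total
    return joint / independent if independent else 0.0

# Hypothetical data: A and B always occur together, in half of the cases.
cases = [frozenset("AB"), frozenset("AB"), frozenset("C"), frozenset("C")]
importance = itemset_importance(cases, "AB")
```

A value above 1 suggests that the items occur together more often than independence would predict; a value near 1 suggests no association.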
The importance of a rule is calculated by the log likelihood of the right-hand side of the rule, given the left-hand side of the rule. For example, in the rule If {A} Then {B}, Analysis Services calculates the ratio of cases with A and B over cases with B but without A, and then normalizes that ratio by using a logarithmic scale.
Feature Selection
The Microsoft Association Rules algorithm does not perform any kind of automatic feature selection. Instead, the algorithm provides parameters that control the data that is used by the algorithm. These include limits on the size of each itemset, and the maximum and minimum support required to add an itemset to the model.
To filter out items and events that are too common and therefore uninteresting, decrease the value of MAXIMUM_SUPPORT to remove very frequent itemsets from the model.
To filter out items and itemsets that are rare, increase the value of MINIMUM_SUPPORT.
To filter out rules, increase the value of MINIMUM_PROBABILITY.
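The effect of the two support limits can be pictured as a simple filter over counted itemsets. The sketch below only illustrates the idea; the function name, the sample counts, and the threshold values are hypothetical, and the parameters are passed as plain case counts.

```python
def filter_itemsets(supports, minimum_support, maximum_support):
    """Keep itemsets whose case count lies within the two support limits."""
    return {
        itemset: count
        for itemset, count in supports.items()
        if minimum_support <= count <= maximum_support
    }

# Hypothetical case counts per itemset.
supports = {
    ("bag",): 950,             # too common to be interesting
    ("bike",): 120,
    ("bike", "helmet"): 40,
    ("tent",): 3,              # too rare
}
kept = filter_itemsets(supports, minimum_support=10, maximum_support=500)
```

Lowering `maximum_support` removes the very frequent itemsets, and raising `minimum_support` removes the rare ones.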
The Microsoft Association Rules algorithm supports several parameters that affect the behavior, performance, and accuracy of the resulting mining model.
Setting Algorithm Parameters
You can change the parameters for a mining model at any time by using the Data Mining Designer in Business Intelligence Development Studio. You can also change parameters programmatically by using the AlgorithmParameters collection in AMO, or by using the MiningModels Element (ASSL) in XMLA. The following table describes each parameter.
Note You cannot change the parameters in an existing model by using a DMX statement; you must specify the parameters in the DMX CREATE MODEL or ALTER STRUCTURE… ADD MODEL when you create the model.
Modeling Flags
The following modeling flags are supported for use with the Microsoft Association Rules algorithm.
An association model must contain a key column, input columns, and a single predictable column.
Input and Predictable Columns
The Microsoft Association Rules algorithm supports the specific input columns and predictable columns that are listed in the following table. For more information about the meaning of content types in a mining model, see Content Types (Data Mining).
| Column | Content types |
|---|---|
| Input attribute | Cyclical, Discrete, Discretized, Key, Table, Ordered |
| Predictable attribute | Cyclical, Discrete, Discretized, Table, Ordered |
Note Cyclical and Ordered content types are supported, but the algorithm treats them as discrete values and does not perform special processing.
Note