Cross-Validation Report (Analysis Services - Data Mining)
Cross-validation lets you partition a mining structure into cross-sections and iteratively train and test models against each cross-section of the data. You specify several partitions to divide the data into, and each partition is used in turn as the test data, while the remaining data is used to train a new model. Analysis Services then generates a set of standard accuracy metrics for each model. By comparing the metrics for the models generated for each cross-section, you can get a good idea of how reliable the mining model is for the whole data set.
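The partition-and-rotate scheme just described can be sketched in Python. This is an illustrative sketch only; Analysis Services performs this partitioning internally, and cross_validation_folds is a hypothetical helper, not part of the product.

```python
def cross_validation_folds(cases, fold_count):
    """Split the cases into fold_count partitions and yield (train, test)
    pairs, where each partition serves exactly once as the test set."""
    folds = [cases[i::fold_count] for i in range(fold_count)]
    for i in range(fold_count):
        test = folds[i]
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, test

# With 9 cases and 3 folds, each iteration tests on 3 cases
# and trains a new model on the remaining 6.
for train, test in cross_validation_folds(list(range(9)), 3):
    assert len(test) == 3 and len(train) == 6
```

A separate model is trained for each pair, which is why run time grows with the number of folds.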
Cross-validation cannot be used with any model that contains a KEY TIME or KEY SEQUENCE column.
This section explains the information that is presented in the Cross-Validation report, on the Mining Accuracy Chart tab of Data Mining Designer. For information about how to create a report, see Cross-Validation Tab (Mining Accuracy Chart View).
To cross-validate mining models associated with a mining structure, you use either the Cross-Validation tab of the Mining Accuracy View in Data Mining Designer, or the cross-validation stored procedures, to configure the following options:
Specify the number of folds.
Specify the maximum number of cases to use for cross-validation. This number is divided by the number of folds.
Specify the predictable column. Optionally, specify a predictable state.
If your mining structure contains clustering models, specify #Cluster instead of selecting a predictable column. The report returns results only for clustering models.
Optionally, set parameters that control how the accuracy of prediction is assessed.
Choosing a Cross-Validation Data Set
When you use the Cross-Validation tab of the Mining Accuracy Chart view, there are two ways to control the amount and type of data that is used during cross-validation: you can specify the number of folds, and you can limit the number of cases. By default, cross-validation in Business Intelligence Development Studio uses the training cases for each model. If a filter is associated with the model, the filter is applied.
The Fold Count value specifies the number of cross-sections of the data set to create. Each fold is used as the testing data set, and the data in the remaining folds is used to train a new model. Therefore, if you used the minimum value of 2, half the data set would be used for testing and the other half for training.
If the mining structure is not stored in an instance of Analysis Services, but is created as a temporary, or session, structure, the maximum number of folds that you can use is 10. If the mining structure is stored in an instance of Analysis Services, you cannot create more folds than there are cases; if the value that you set for Fold Count exceeds the number of cases, the number of cases is used instead.
As you increase the number of folds, the time that is required to perform cross-validation increases accordingly, because a model must be generated and tested for each fold. You might experience performance problems if the number of folds is too high.
The Max Cases value specifies the total number of cases, across all folds, that can be used for cross-validation. Therefore, the number of cases in any particular fold is equal to the Max Cases value divided by the Fold Count value. The default value is 0, which means that all cases in the mining structure are used.
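The fold-size arithmetic described above can be sketched as follows. This is an illustrative sketch; cases_per_fold is a hypothetical helper and is not part of Analysis Services.

```python
def cases_per_fold(total_cases, fold_count, max_cases=0):
    """Max Cases caps the total number of cases across all folds;
    the default of 0 means that all cases in the structure are used."""
    used = total_cases if max_cases == 0 else min(max_cases, total_cases)
    return used // fold_count

# 10,000 cases in the structure, Max Cases = 1,000, Fold Count = 10:
# each fold receives 1,000 / 10 = 100 cases.
assert cases_per_fold(10000, 10, max_cases=1000) == 100
```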
If you perform cross-validation by using one of the stored procedures, you provide values for the FoldCount and MaxCases properties as parameters to the stored procedure.
If you use the cross-validation stored procedures, you can also set the DataSet parameter to define the data set used for testing. The options for the data set include training set only, testing and training set, and combinations of training and testing set with mining model filters. For more information, see SystemGetCrossValidationResults (Analysis Services - Data Mining).
Choosing Models and Columns to Validate
When you use the Cross-Validation tab in Data Mining Designer, you must first select the predictable column from a list. Typically, a mining structure can support many mining models, not all of which use the same predictable column. When you run cross-validation, only those models that use the same predictable column can be included in the report.
To choose a predictable attribute, click Target Attribute and select the column from the list. If the target attribute is a nested column, or a column in a nested table, you must type the name of the nested column using the format <Nested Table Name>(key).<Nested Column>. If the only column used from the nested table is the key column, you can use <Nested Table Name>(key).
Note If you use the stored procedures, you can exercise more control over the models that are tested. For more information, see SystemGetCrossValidationResults (Analysis Services - Data Mining).
After you select the predictable attribute, Analysis Services automatically tests all models that use the same predictable attribute.
If the target attribute contains discrete values, you can optionally type a target state after you have selected the predictable column, if there is a specific value that you want to predict.
The selection of the target state affects the measures that are returned. If you specify a target attribute (that is, a column name) but do not pick a specific value that you want the model to predict, by default the model is evaluated on its prediction of the most probable state.
If you are cross-validating a clustering model, there is no predictable column; instead, you select #Cluster from the list of predictable attributes in the Target Attribute list box. After you have selected #Cluster, other options that are not relevant to clustering models, such as Target State, are disabled. Analysis Services will test all clustering models that are associated with the mining structure.
Setting the Accuracy Threshold
You can control the standard for measuring prediction accuracy by setting a value for Target Threshold. A threshold represents a kind of accuracy bar. Each prediction is assigned a probability that the predicted value is correct. Therefore, if you set the Target Threshold value closer to 1, you require the probability for any particular prediction to be fairly high in order for it to be counted as a good prediction. Conversely, if you set Target Threshold closer to 0, even predictions with lower probability values are counted as "good" predictions.
There is no recommended threshold value because the probability of any prediction depends on your data and the type of prediction you are making. You should review some predictions at different probability levels to determine an appropriate accuracy bar for your data. This step is important because the value that you set for Target Threshold has a strong effect on the measured accuracy of the model.
For example, suppose your structure contains three models that predict the target state with probabilities of 0.05, 0.15, and 0.8. If you set the threshold to 0.5, only one prediction is counted as being correct. If you set Target Threshold to 0.10, two predictions are counted as being correct.
When Target Threshold is set to null, which is the default value, the most probable state is used as the target. In the example just cited, all three models would have correct predictions. Therefore, when you compare models, you should consider the threshold used for each cross-validation instance. You can also assess the mean probabilities for all cases in a particular model by using the measures Mean Likelihood and Root Mean Square Error that are provided in the cross-validation report.
Limitations when Using the Cross-Validation Tab
If you perform cross-validation by using the cross-validation report in Business Intelligence Development Studio, there are some limitations on the models that you can test and the parameters you can set.
By default, all models associated with the selected mining structure are cross-validated. You cannot specify the model or a list of models.
Cross-validation is not supported for models that are based on the Microsoft Time Series algorithm or the Microsoft Sequence Clustering algorithm.
The report cannot be created if your mining structure does not contain any models that can be tested by cross-validation.
If the mining structure contains both clustering and non-clustering models and you do not choose the #Cluster option, results for both types of models are displayed in the same report, even though the attribute, state, and threshold settings might not be appropriate for the clustering models.
Some parameter values are restricted. For example, a warning is displayed if the number of folds is more than 10, because generating so many models might cause the report to display slowly.
If you want to specify advanced settings, you must use the cross-validation stored procedures. For more information, see Data Mining Stored Procedures (Analysis Services - Data Mining).
After you have specified the parameters and clicked Refresh, the results of cross-validation are displayed in the results grid. This section explains the contents of each column in the results grid.
In addition to some basic information about the number of folds in the data, and the amount of data in each fold, Analysis Services displays a set of metrics about each model, categorized by type. The following table lists the tests and metrics, with an explanation of what the metric means.
Measures and Descriptions
Clustering models
Case likelihood: Indication of how likely it is that a case belongs to a particular cluster.
Classification models
True Positive: Count of cases in which the model predicted the target state, the case actually contains the target state, and the prediction probability is greater than the Target Threshold.
False Positive: Count of cases in which the model predicted the target state with a probability greater than the Target Threshold, but the case does not contain the target state.
True Negative: Count of cases in which the model did not predict the target state, and the case does not contain the target state.
False Negative: Count of cases in which the model did not predict the target state, but the case actually contains the target state.
Pass/Fail: Count of cases scored when no target state is specified. A case counts as Pass if the predicted state with the highest probability matches the actual state and the probability is greater than the Target Threshold; otherwise, the case counts as Fail.
Lift: The ratio of the actual prediction probability to the marginal probability in the test cases, excluding rows with missing values. This metric shows how much the probability improves when the model is used.
Root Mean Square Error: Square root of the mean error for all partition cases, divided by the number of cases in the partition, excluding cases with missing values.
Log Score: The logarithm of the actual probability for each case, summed, and then divided by the number of rows in the input dataset, excluding cases with missing values. Because probability is represented as a fraction, log scores are always negative numbers. A number closer to 0 indicates a better prediction.
Estimation models
Root Mean Square Error: Average error of the predicted value with respect to the actual value, expressed as the square root of the mean of the squared errors.
Mean Absolute Error: Sum of the absolute error for all cases in the partition, divided by the number of cases, excluding rows with missing values.
Log Score: The logarithm of the actual probability for each case, summed, and then divided by the number of rows in the input dataset, excluding cases with missing values. Because probability is represented as a fraction, log scores are always negative numbers. A number closer to 0 indicates a better prediction.
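As a rough illustration, the estimation measures above can be computed as in the following Python sketch. These are hypothetical helpers: the natural logarithm is assumed for the log score, and details such as the handling of missing values are omitted.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mean_absolute_error(actual, predicted):
    """Sum of the absolute errors, divided by the number of cases."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

def log_score(case_probabilities):
    """Mean log of the probability the model assigned to each actual
    outcome; always <= 0, and closer to 0 indicates a better prediction."""
    n = len(case_probabilities)
    return sum(math.log(p) for p in case_probabilities) / n
```

For example, a model that assigns probability 1.0 to every actual outcome has a log score of exactly 0, and any imperfect model scores below that.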
Aggregate measures provide an indication of the variance in the results for each partition.
Mean: Average of the partition values for a particular measure.
Standard Deviation: Average of the deviation from the mean for a specific measure, across all the partitions in a model.
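The aggregate measures can be sketched as follows. This is a hypothetical helper, and the population standard deviation is assumed here; the report may use a slightly different estimator.

```python
import math

def aggregate(partition_values):
    """Return (mean, standard deviation) of one accuracy measure
    across all partitions of a model."""
    n = len(partition_values)
    mean = sum(partition_values) / n
    variance = sum((v - mean) ** 2 for v in partition_values) / n
    return mean, math.sqrt(variance)

# Identical results in every partition imply zero deviation,
# suggesting the model behaves consistently across the data set.
assert aggregate([0.9, 0.9, 0.9]) == (0.9, 0.0)
```

A small standard deviation across partitions is the signal that the model's accuracy is stable for the whole data set, which is the point of running cross-validation.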
These measures of accuracy are computed for each target attribute, and for each attribute you can specify or omit a target value. Some records might not have any value for the target attribute, which is a special case, called the missing value. Rows that have missing values are not counted when computing the accuracy measure for a particular target attribute. Also, because the scores are computed for each attribute individually, if values are present for the target attribute but missing for other attributes, it does not affect the score for the target attribute.