Mining Model Content for Linear Regression Models (Analysis Services  Data Mining)
Topic Status: Some information in this topic is preview and subject to change in future releases. Preview information describes new features or changes to existing features in Microsoft SQL Server 2016 Community Technology Preview 2 (CTP2).
This topic describes mining model content that is specific to models that use the Microsoft Linear Regression algorithm. For a general explanation of mining model content for all model types, see Mining Model Content (Analysis Services  Data Mining).
A linear regression model has an extremely simple structure. Each model has a single parent node that represents the model and its metadata, and a regression tree node (NODE_TYPE = 25) that that contains the regression formula for each predictable attribute.
Linear regression models use the same algorithm as Microsoft Decision Trees, but different parameters are used to constrain the tree, and only continuous attributes are accepted as inputs. However, because linear regression models are based on the Microsoft Decision Trees algorithm, linear regression models are displayed by using the Microsoft Decision Tree Viewer. For information, see Browse a Model Using the Microsoft Tree Viewer.
The next section explains how to interpret information in the regression formula node. This information applies not only to linear regression models, but also to decision trees models that contain regressions in a portion of the tree.
This section provides detail and examples only for those columns in the mining model content that have particular relevance for linear regression.
For information about generalpurpose columns in the schema rowset, see Mining Model Content (Analysis Services  Data Mining).
When you create a model by using the Microsoft Linear Regression algorithm, the data mining engine creates a special instance of a decision trees model and supplies parameters that constrain the tree to contain all the training data in a single node. All continuous inputs are flagged and evaluated as potential regressors, but only those regressors that fit the data are retained as regressors in the final model. The analysis produces either a single regression formula for each regressor or no regression formula at all.
You can view the complete regression formula in the Mining Legend, by clicking the (All) node in the Microsoft Tree Viewer.
Also, when you create a decision trees model that includes a continuous predictable attribute, sometimes the tree has regression nodes that share the properties of regression tree nodes.
Most of the important information in a regression node is contained in the NODE_DISTRIBUTION table. The following example illustrates the layout of the NODE_DISTRIBUTION table. In this example, the Targeted Mailing mining structure has been used to create a linear regression model that predicts customer income based on age. The model is for the purpose of illustration only, because it can be built easily using the existing AdventureWorks2012 sample data and mining structure.
ATTRIBUTE_NAME 
ATTRIBUTE_VALUE 
SUPPORT 
PROBABILITY 
VARIANCE 
VALUETYPE 

Yearly Income 
Missing 
0 
0.000457142857142857 
0 
1 
Yearly Income 
57220.8876687257 
17484 
0.999542857142857 
1041275619.52776 
3 
Age 
471.687717702463 
0 
0 
126.969442359327 
7 
Age 
234.680904692439 
0 
0 
0 
8 
Age 
45.4269617936399 
0 
0 
126.969442359327 
9 

35793.5477381267 
0 
0 
1012968919.28372 
11 
The NODE_DISTRIBUTION table contains multiple rows, each grouped by a variable. The first two rows are always value types 1 and 3, and describe the target attribute. The succeeding rows provide details about the formula for a particular regressor. A regressor is an input variable that has a linear relationship with the output variable. You can have multiple regressors, and each regressor will have a separate row for the coefficient (VALUETYPE = 7), score gain (VALUETYPE = 8), and statistics (VALUETYPE = 9). Finally, the table has a row that contains the intercept of the equation (VALUETYPE = 11).
Elements of the Regression Formula
The nested NODE_DISTRIBUTION table contains each element of the regression formula in a separate row. The first two rows of data in the example results contain information about the predictable attribute, Yearly Income, which models the dependent variable. The SUPPORT column shows the count of cases in support of the two states of this attribute: either a Yearly Income value was available, or the Yearly Income value was missing.
The VARIANCE column tells you the computed variance of the predictable attribute. Variance is a measure of how scattered the values are in a sample, given an expected distribution. Variance here is calculated by taking the average of the squared deviation from the mean. The square root of the variance is also known as standard deviation. Analysis Services does not provide the standard deviation but you can easily calculate it.
For each regressor, three rows are output. They contain the coefficient, score gain, and regressor statistics.
Finally, the table contains a row that provides the intercept for the equation.
Coefficient
For each regressor, a coefficient (VALUETYPE = 7) is calculated. The coefficient itself appears in the ATTRIBUTE_VALUE column, whereas the VARIANCE column tells you the variance for the coefficient. The coefficients are calculated so as to maximize linearity.
Score Gain
The score gain (VALUETYPE = 8) for each regressor represents the interestingness score of the attribute. You can use this value to estimate the usefulness of multiple regressors.
Statistics
The regressor statistic (VALUETYPE = 9) is the mean for the attribute for cases that have a value. The ATTRIBUTE_VALUE column contains the mean itself, whereas the VARIANCE column contains the sum of deviations from the mean.
Intercept
Normally, the intercept (VALUETYPE = 11) or residual in a regression equation tells you the value of the predictable attribute, at the point where the input attribute, is 0. In many cases, this might not happen, and could lead to counterintuitive results.
For example, in a model that predicts income based on age, it is useless to learn the income at age 0. In reallife, it is typically more useful to know about the behavior of the line with respect to average values. Therefore, SQL Server Analysis Services modifies the intercept to express each regressor in a relationship with the mean.
This adjustment is difficult to see in the mining model content, but is apparent if you view the completed equation in the Mining Legend of the Microsoft Tree Viewer. The regression formula is shifted away from the 0 point to the point that represents the mean. This presents a view that is more intuitive given the current data.
Therefore, assuming that the mean age is around 45, the intercept (VALUETYPE = 11) for the regression formula tells you the mean income.