SQL Server 2005
Unearth the New Data Mining Features of Analysis Services 2005
This article is based upon a prerelease version of SQL Server 2005. All information contained herein is subject to change.
This article discusses:
- SQL Server 2005 data mining functionality
- New and enhanced algorithms
- Object model programmability
- Integration and extensibility
This article uses the following technologies:
SQL and Visual Studio
As the Beta 2 of SQL Server™ 2005 draws nearer, it's time to talk about some of the new features it has in store for database developers. To whet your appetite, there's no better place to start than with the new data mining features of SQL Server 2005 Analysis Services, where you'll find enhancements to two existing algorithms, plus five new algorithms, and more than a dozen added visualizations to help you get a handle on your data relationships. [For more on the new algorithms, see the sidebar "New Algorithms in SQL Server 2005."] If you take only a cursory look, however, you will miss all of the rich functionality you can get from enhancements to the Data Mining Extensions to SQL (DMX), programmability models, and integration with online analytical processing (OLAP), Data Transformation Services (DTS), and Reporting Services. These improvements make it possible to create a new breed of intelligent apps with embedded data mining technology.
Data mining in SQL Server 2005 reveals greatly improved tools. The Business Intelligence (BI) Development Studio integrated into Visual Studio® lets you add data mining to your development toolbox. A data mining project can exist alongside any other development project in your Visual Studio solutions. For example, you can do all of the following without leaving the designer: design models to predict product cross-sales in an Analysis Services project, periodically retrain your models in a DTS project, and build a Web site that takes advantage of the predictive power of data mining in a Web project—then package them up in one portable solution.
Data mining objects in an Analysis Services project are files like any other project file and can be shared, examined, or checked into source control programs for safekeeping. Of course, you can also use the BI Development Studio to connect directly to any Analysis Services database to create, or even edit objects on a live server.
Wizards and Designers
SQL Server 2005 Analysis Services provides wizards and designers for all major objects. Wizards are used to initially create any object, and in Analysis Services 2005 they are flexible enough that in many cases you do not need to further refine your objects afterwards. When you do, customized designers allow you to edit your objects down to the finest detail. For the purposes of data mining, there are two wizards and designers you will use most frequently: the Data Source View Wizard and its designer, along with the Data Mining Wizard and its designer.
In Analysis Services 2000, to access relational data you simply specified a data source and then selected the tables you wanted to see. If the view of the data in the relational source wasn't exactly what you needed for modeling, you had to modify the source by either creating new tables or views that represented the data in the form you required.
Analysis Services 2005 provides a new Data Source View (DSV) object, which provides an abstraction layer on top of your data source. In this layer you specify which tables from the source are available, as well as friendly names and annotations for all tables, columns, table keys, and relationships (if not specified in the source). Additionally, you can create both virtual columns, called Named Calculations, on existing tables and virtual views that are called Named Queries. When you create and process mining models in Analysis Services, the DSV representation, along with the virtual columns and views, is used as the data source.
So, why is this important? Suppose you were mining survey responses with a range of 1 to 4 where 1 equals "Very Dissatisfied" and 4 equals "Very Satisfied." For mining purposes, all you care about is whether the respondent was satisfied or not. The Create Named Calculation feature allows you to create a new variable. In this case, you would simply right-click the table in the DSV designer and select "Create Named Calculation." In the named calculation dialog you enter the name of the column, "Satisfied", and an expression representing the value, which in this case is:
CASE SatisfactionLevel WHEN 3 THEN 'Yes'
WHEN 4 THEN 'Yes'
Once the column is added to the table, it appears as any other column. You can even use the expanded data exploration tools which are found in the DSV in order to see column distributions, as shown in Figure 1. Another important facet is that the DSV allows you to select tables from heterogeneous data sources, which permits modeling across distributed data.
Figure 1 DSV Designer in the BI Development Studio
Creating and Editing Models
The new Data Mining Wizard allows you to create complex models in a few easy steps. The wizard samples your data to detect how to appropriately model it based on the capabilities of the algorithm you select. Also new in the wizard is the ability to easily create an arbitrary number of nested tables. Nested tables allow you to combine sparse data, such as transactional data, along with dense data, such as demographic data. For example, you can now easily model the impact demographics have on product sales or the opposite: which product sales data is a reflection of which demographics. The ability to provide multiple nested tables in a single model allows for even more advanced modeling. For example, you could determine which concession items are associated with which movies, or even how sales from September can predict sales in October.
Once you're finished with the wizard, you land in the free-form Data Mining designer. The designer allows you to tweak each and every column setting to your heart's delight (and more importantly, to your modeling requirements). The designer also provides the first exposure to another new data mining feature—the Mining Structure.
The Mining Structure is a set of columns that represents the domain of the business problem and a set of Mining Models that applies those columns to an algorithm to solve that problem. Each model within a structure specifies which algorithm and parameters to use and how each column is used in that model. For example, you could create a structure to see the impact of age on customer churn. You can add the age column to the structure multiple times; perhaps you'd add it once as a continuous column and once as a discrete column that automatically buckets the age into discrete ranges depending on the data distribution. Your structure could then have two models, one using the continuous age and another using the bucketed age. Processing of the models in the structure occurs in parallel with a single data read. Once the models are complete, you can use them to compare how the different configurations affect your problem.
Browsing and Querying Models
SQL Server 2005 Data Mining ships with 12 custom visualizations for understanding the patterns unearthed from your data. Each of these views is a Windows® Forms control that can be embedded into custom data mining applications. To compare your models and determine a model's overall quality, four different types of accuracy charts are included: a lift chart, a profit chart, a confusion matrix, and an actual-versus-predicted plot.
Also included is a Microsoft® Access-style query builder that provides a graphical interface for generating the DMX queries to perform predictions against mining models. The query builder allows drag and drop selection of columns from your model or source data, plus it provides a comprehensive list of all the functions that apply to the model's algorithm with full function prototypes (see the sidebar "DMX.")
The query builder uses the SHAPE operator to automatically handle the query generation for models requiring nested table input. In addition, you can set the query builder to generate "singleton" queries where the input data is specified directly in the query. This allows you to specify the exact inputs for a prediction query so you can experiment by directly seeing the impact of changing inputs on the predicted output.
Figure 5 Time Series Model
Neural Nets Neural Networks is an artificial intelligence technique that excels in finding complex relationships in data that would be missed by other algorithms. While the patterns it finds can be difficult to explain, Neural Nets is currently the best algorithm for finding non-linearities that exist in your data. Due to its thoroughness, it can be slower than other algorithms like Naive Bayes.
Data flow transforms for Text Mining included with SQL Server 2005 DTS allows you to manipulate unstructured textual data. This lets you take verbatim data, such as survey responses or e-mail, and convert them into structured data to use as input for analysis using the standard data mining algorithms.
Below the surface of SQL Server Data Mining lies a powerful and flexible server. The SQL Server team, of which I am a member, approached data mining not from the typical analytical point of view, but rather from the developer's point of view. We asked ourselves what we would have to do in order to make data mining accessible to the developer and how developers could take advantage of the power of data mining and embed that power into their applications. The answer was to create a fully extensible server-based model that focuses on strong algorithms, a powerful API, and deep integration.
The core of all data mining APIs in SQL Server is the DMX language, which is described in the OLE DB for Data Mining specification, plus language enhancements for SQL Server 2005 Data Mining. The simple DMX sample in the "DMX" sidebar demonstrates how to get the churn probability for all customers. If you are only interested in the most likely customers to churn, DMX supports TOP and ORDER BY so you can issue queries, as shown in the following lines of code:
SELECT TOP 25 t.CustomerID
NATURAL PREDICTION JOIN
OPENQUERY('CustomerDataSource', 'SELECT * FROM Customers')
ORDER BY PredictProbability([Churned],True) DESC
Previously complex DMX statements can now be made simpler by extending and overloading functions; an example of this is a function that retrieves the probability of a particular state, as just shown. The ability to expose content information in normal queries allows you to drill through to the source data to show the supporting facts for each bit of learned information in the model. For example, "What cases are in this cluster?" or "What cases support this rule?"
One of the biggest changes in DMX is the expanded set of source data you can use for a PREDICTION JOIN statement. In SQL Server 2000, your only options were OPENROWSET or a singleton SELECT query, but in SQL Server 2005 you will be able to use OPENQUERY to predict against named data sources. DMX statements also allow you to predict against the output of another prediction, which permits the chaining of models or allows you to predict against multiple models in the same query. MDX statements allow you to perform predictions directly against OLAP cubes and an XML for Analysis (XMLA) rowset parameter allows you to perform predictions on the server against data on the client (regardless of the server's ability to directly access that data source).
Data Mining APIs
Because Analysis Services is now a native XMLA server, a data mining model can instantly become a Web service that's accessible from any device running on any platform by using standard SOAP protocols. In addition to the root-level XMLA API, data mining in SQL Server 2005 has two new APIs: ADOMD.NET for client applications and Analysis Management Objects (AMO) for management and design applications. Of course, the standard OLE DB and OLE DB-derived interfaces (ADO, ADO.NET) are still available for your use as well.
ADOMD.NET is the managed client interface for data mining. It exposes standard ADO.NET objects (such as connections, commands, DataReaders, and so on), plus a complete object model for allowing access to data mining models. Collections of models and columns allow you to browse model metadata, while the MiningContentNode object lets you programmatically access the learned content of a model.
AMO provides a full suite of objects for creating and maintaining data mining objects. Among other things, it covers model security, processing, and backup and restore.
Managed Stored Procedures and UDFs
SQL Server 2005 Analysis Services lets you load managed assemblies onto the server. An object model, tentatively called "Server ADOMD," is almost identical to ADOMD.NET. This allows you to create stored procedures and user-defined functions (UDFs) that directly access the content of a model on the server. This is invaluable when the model contains potentially thousands or even millions of rules and your client only needs a small subset of them. In fact, the data mining viewers call stored procedures themselves to reduce the amount of information required on the client. Using the content object model in a UDF allows you to do interesting things, like return the reason for the prediction along with the result. For instance, you can have a function like the following:
Public Function GetNodeDescription(ByVal strNodeId As String) As String
Return Context.CurrentMiningModel.GetNodeFromUniqueName( _
You would then call it from DMX like this:
FROM CustomerChurnModel PREDICTION JOIN
In addition to your own UDFs, standard Visual Basic for Applications (VBA) and Excel functions can also be called as UDFs. For example, you can use the Excel VBA Log function to get the log probability as follows:
FROM CustomerChurnModel PREDICTION JOIN
Data mining applications can also take advantage of the rest of the features of the Business Intelligence platform provided in SQL Server 2005. OLAP, DTS, and Reporting Services all integrate with data mining in meaningful ways.
As in SQL Server 2000, data mining models can be built directly on OLAP cubes. However, whereas previously a model always used all of the data in the cube, those models can now be built on arbitrary slices of a cube. OLAP Mining Models are built by first choosing a dimension and level whose members you want to analyze, such as customers, and then choosing the properties and measures related to those members as columns of your mining model. Related dimensions, such as a product dimension, can be selected as nested tables and independent slices can be applied to each one. This ability allows you to build models that are able to answer questions like "What is the relationship between brand X and Y products sold in April to males under 30 years old who live in Washington State?"
Once an OLAP model is trained, the content of the model can be applied back to the cube through a data mining dimension. This type of dimension allows you to slice your cube into the patterns discovered by the data mining algorithm. For example, you can create a clustering model, segmenting your cube into natural groups, and then track measures across those groups over time using standard cube-browsing tools.
DTS in SQL Server 2005 has the traditional task flow environment and a new data pipeline environment. Data mining training and prediction tasks can interact with DTS environment variables as input to or output from the data mining engine. Data mining transforms work directly on the data pipeline, allowing you to take any of an entire range of actions from simply enhancing your data with predictive results to allowing complex data transformations to occur prior to training or prediction. By working with your data directly on the pipeline itself, you won't need to stage the data by putting it in a temporary resultset.
Finally, the results of data mining can be deployed directly through integration with Reporting Services. Any data mining query can be served up as a report, and a data mining query can be used as a source for report distribution. Intelligent reports could display the results of the customer churn query I've described.DMX
Microsoft introduced data mining as part of its standard database offering in SQL Server 2000 Analysis Services. While it contained two powerful data mining algorithms, its major contribution to data mining was the OLE DB for Data Mining interfaces and, in particular, the Data Mining Extensions to SQL language (DMX). DMX provides a familiar query interface for data mining purposes. Model creation and training become familiar CREATE and INSERT INTO statements. Prediction and content discovery become SELECT statements. For example, to create and train a hypothetical customer churn model, you could use statements such as:
CREATE MINING MODEL CustomerChurn
(CustID LONG KEY,
Income LONG CONTINUOUS,
LastVisit DATETIME CONTINUOUS,
Profession TEXT DISCRETE,
Churned BOOLEAN DISCRETE PREDICT)
INSERT INTO CustomerChurn (CustId, Income, LastVisit, Profession, Risk)
OPENQUERY('My Datasource ', 'SELECT CustomerID, Income, LastVisit
Profession, Risk FROM Customers')
To subsequently perform predictions and content queries, you use standard SELECT statements. For predictions, a new join semantic, called the PREDICTION JOIN, joins your source data to the model as inputs to the algorithm. To access the content, you select from a special content object available from the model. Hypothetical statements to predict a customer's likelihood to churn and to access model content would look like this:
SELECT NewCustomers.CustID, PredictProbability(Churned, True)
FROM CustomerChurn NATURAL PREDICTION JOIN
OPENQUERY ('My Datasource', 'SELECT * FROM NewCustomers') AS
SELECT * FROM CustomerChurn.CONTENT
One of the lessons I've learned while doing data mining work is that in many cases, if your data mining solution doesn't have the algorithm the customer needs, the customer can't and won't use your solution. This need for a dynamic product is why SQL Server 2005 allows additional data mining algorithms to be added to the engine through a set of extensibility APIs. The COM APIs used by third-party algorithms to integrate into the data mining engine are virtually identical to the APIs used by the Microsoft algorithms. This means that there are no built-in limitations to third-party algorithms and they can take advantage of features such as the DMX syntax, stored procedures, user-defined functions, and BI system integration without jumping through hoops. Custom viewers for both third-party and Microsoft algorithms can be integrated into the data mining tools as well.
SQL Server 2005 takes what was once an arcane technology and makes it appealing to a much wider audience. By providing easy-to-use tools, familiar APIs, and integration across the Microsoft BI suite, data mining is finally available to all users who can most benefit from it through the applied creativity of the developers who build applications on SQL Server. SQL Server 2005 Beta 2 contains an excellent data mining tutorial to get you started.
is the development lead for the Data Mining engine in the SQL Server division of Microsoft. He has spoken at many technical conferences and is the co-author of an upcoming book on SQL Server 2005 Data Mining. You can reach Jamie at email@example.com