.gif)
SQL Server Community
Article
Writer: Mark Tabladillo, Ph.D
Technical Reviewer: Jeannine Takaki, SQL Server UE
Published: August 2009
Applies to: SQL Server 2005, SQL Server 2008
Summary:This community article
provides an introduction to data mining, and defines related terminology. The
article provides examples from the daily news of questions that potentially can
be answered by data mining, and broadly outlines the approach to creating and
deploying data mining solutions in a business. Finally, the author explains the
advantages of using Microsoft data mining technologies, and provides a list of
resources for more information.
Introduction
This community article discusses why people should use
Microsoft data mining. The Microsoft
data mining tools leverage the strengths of Microsoft SQL Server data
management software and the Microsoft Office system. Data
mining options in a Microsoft solution range from ad hoc analysis within Microsoft
Office Excel, to collaborative solutions using Office SharePoint Server,
customized Microsoft .NET connection software applications, and
production-level stored procedures in SQL Server. However, many
enterprise environments include software applications and data warehouse
solutions from multiple vendors, and the Microsoft technology respects these
designs by allowing for open data connectivity and access with other software
products, including other data mining solutions.
Why Should I Be Interested in Data
Mining?
Data mining is an analytical activity that requires people
to be successful. Microsoft has defined three specific job roles for people who
would be interested in data mining (Microsoft, 2008b). These roles are organized by function, and a
specific person or team may span duties or interests in more than one group:
• Analysts – Subject matter experts who understand the data and are
able to translate the results of data mining into actionable business
information.
• Developers – Intend to integrate data mining within custom
application solutions.
• Architects – Plan to meet the needs of analysts and developers, and
may have a strong interest in data integration and reporting technologies.
You may take on one or
more of these roles, even if you have another title like Chief Information
Officer or Software Architect. Anyone
who does more than basic data analysis can benefit from data mining.
To get you started with on data mining, this community article provides the following
basic information:
• What is data mining?
• Is the correct term “Data Mining”
or “Machine Learning”?
• What do we get from data mining?
• What are some problems that data
mining solves?
• What are the costs for data mining?
• Are there advantages to using Microsoft
data mining technologies?
What Is Data Mining
A new book on this
topic was just released, Data Mining with Microsoft SQL Server 2008 (MacLennan,
Tang, & Crivat, 2009). These authors
have worked on development of the Microsoft data mining technology. Their definition of data mining is “the
process of analyzing data to find hidden patterns using automatic
methodologies” (MacLennan et al., 2009, p. 1).
First, data mining is
an activity. When you use Microsoft
technology to do data mining, you use SQL Server Analysis Services, which
requires a SQL Server license. By moving
the activity to a server, Microsoft makes data-intensive processes run in an
environment designed for processing efficiency and connectivity to enterprise
systems. Of course, SQL Server can run
on a desktop, and I run it on the Windows Vista operating system for product
demonstrations. However, data mining
works best in a server environment. In
other implementations you may know of, people have created data mining applications
for either desktop or server application.
Second, data mining
means analysis. The job role of “analyst”
describes the person who typically is most interested in doing data
mining. Microsoft has made a free
plug-in available for Office Excel 2007 which allows an analyst to use
Microsoft Office to connect to a SQL Server implementation of data mining. The Office Excel plug-in lets the analyst
perform data mining on Office Excel data. The patterns in the data are analyzed
on the server and the results are sent to Office Excel. Even without this plug-in, a trained analyst
can use the free Business Intelligence Development Studio (BIDS) which ships
with SQL Server to create data analysis solutions. The BIDS product is provided in both SQL
Server 2005 or SQL Server 2008, and using the Microsoft Visual Studio
development system framework, an analyst can create a project or solution
(collection of projects), either of which can include Microsoft data
mining. The main point is that data
mining is another important tool which analysts can and should use.
Third, data mining
reveals hidden patterns. You might think
the adjective “hidden” to be counterintuitive, since you already have and see
all the data. In fact, finding patterns
is a normal human activity, and analysts can visually perform pattern matching
and recognition on smaller datasets. However,
an analyst might not discover all
the patterns even in a small dataset. Data
mining assumes that the data is empirical, and its results come solely from the
information presented, not from any outside information or known patterns. Data mining in general (and Microsoft data
mining in particular) creates conclusions based on accepted mathematical
techniques for pattern matching, and outputs these as models, or patterns that could
be considered as an explanation of an empirical dataset. Typically, the advanced mathematical techniques
used in data mining come from academia or professional mathematicians, and data
mining is an active area of university research. For our everyday use, data mining software makes
it easy to apply sophisticated mathematics to discover hidden patterns.
Finally, data mining is an activity that can be automated or
programmed. Software developers are an
important audience for Microsoft data mining technology. Microsoft has not only provided user tools
like BIDS and the Office Excel 2007 plug-in, but also provided the DMX (Data
Mining Extensions) language and the .NET Framework for custom applications or
solutions. Programmability extends the
activity beyond just ad-hoc analysis, and allows analysts to leverage their
findings into full enterprise-level solutions.
Microsoft data mining is designed to be automated as needed, which
greatly extends the potential for custom development and the possibilities for
enterprise automation.
“Data Mining” or “Machine Learning”?
Many terms have been used in consulting and academia to
describe data mining in general (and what Microsoft data mining in particular
does). I too use alternative terms with
regular frequency. I use the term predictive analytics to describe any
statistical technique which helps provide a predictive look into the
future. The Microsoft data mining tools
can provide predictive analysis, but also could be used for descriptive (or
historical) analysis too. Because the
tools have dual use, the term predictive analysis does not sufficiently
describe Microsoft’s use or implementation.
Another term I use is machine
learning. Increasingly, technical
universities offer courses in data mining or machine learning. Many times, the curriculum and topics are
similar or identical. However, in use,
most of the books I have on this topic prefer the term data mining (Berry &
Linoff, 1997, 2000; Bigus, 1996; Cerrito, 2006; Han & Kamber, 2006; Langit,
Goff, Mauri, Malik, & Welch, 2009; MacLennan et al., 2009; Tan, Steinbach,
& Kuman, 2006; Tang & MacLennan, 2005; Witten & Frank, 2005). To illustrate the convergence of terminology,
one book that I have is titled “The Elements of Statistical Learning” (Hastie,
Tibshirani, & Friedman, 2001) but the subtitle for the book is “Data Mining,
Inference, and Prediction”.
To me, the term mining has a connotation of being on a
hunt and looking for something valuable.
Even though diamonds and silver and coal are mined and are all valuable,
the usual business metaphor is mining for gold.
Also, the mining imagery involves a data
miner, a person doing the activity.
By contrast, the term machine learning implies a machine doing
automatic processing. The goal of
machine learning is to achieve artificial intelligence, and it is important to
advance academic theory and computer science to best leverage information
technology. However, the term data
mining suggests perhaps a more romantic notion of a data miner on a quest for
discovery, needing the aid of the computer sidekick in pursuit of hidden
patterns.
Another term used by
the Association for Computing Machinery (Association for Computing Machinery,
2008) is knowledge discovery in databases
(KDD). This association has been a
respected force in the scientific computing area for decades, and attracts
membership from leading academics and industry professionals. They use this term in the name of an annual
conference, and the papers submitted to the conference cover data mining,
predictive analytics, and machine learning.
In summary, both
industry and academia have settled on the term data mining, but like me, you may find yourself using other terms
depending on the situation.
What Do We Get from Data Mining?
I would love to say
that we could solve all the world’s problems if we had enough data. If you have been talking with some data
mining vendors or consultants, they might even imply that we could!
Realistically, however, we all have to read beyond the marketing spin.
Back to the core
definition: data mining reveals hidden
patterns, and we get to see those patterns as models. It would be appropriate to say statistical
models, because these models provide either descriptive or predictive
statistical information. Statistics are
important to analysts, but they are already important to anyone using business
intelligence systems, even without data mining.
Therefore, this article does not talk in detail about statistics. A good
data mining text should include what statistical information you would need to
know for data mining applications.
So what is a
model? I have had fun asking this
question to live audiences. The term has
various connotations that include fashion models, model cars, or model homes. Generally a model is a description of
something else, and in data mining, a model describes a hidden pattern in the
data.
Microsoft data mining
produces models. Some of the Microsoft models are expressed as equations, but
more normally, the technology presents results showing relationships based on
data. Data mining results are entirely dependent on the data that was used to
create the model.
The following diagram
comes from the SQL Server Books Online and illustrates the content of a model.
What this diagram tells you is that a Microsoft mining model includes metadata
(information about the data), patterns
(which could be rules or formulas), and bindings
(meaning data bindings to the mining structure, the place where the data is
defined for data mining).
.jpg)
Figure 1. Mining Model
Architecture (Microsoft, 2008a)
It is important to
note that the result of data mining is not just an output equation. That difference was important to me coming
from an applied statistical background, since in much of that work, the output
is an equation with variables and weights.
Data mining can and in some cases should present equations, but as a
general rule, there might not be a specific equation or group of equations that
completely describes an entire model.
In a many enterprise
business environments, the goal of analysis is to solve some business
objective. Even in places where
statistics and statistical analysts have provided solutions, the output should
provide actionable information. Decision
makers want insight into how to make decisions.
Whether these results come from statistics or data mining, what counts
is translating mathematical results into business language. That translation requires that the person who
does the analysis should know both the industry-specific objectives and the
data mining technology. Data mining does
not have life independent of a professional analyst who can interpret and apply
the results toward a business question.
Thus, you cannot say
that data mining cannot solve the world’s problems. Data mining doesn’t
directly solve any problems—instead, it is the data miner, or the trained
professional, who bridges the gap between a data mining model and a business
problem.
Organizations that are
serious about using data mining technology therefore should include ongoing
training and development of their analysts, and make sure that their data
miners learn about the technologies and tools for solutions, and not focus
solely on the essential industry-specific objectives
If you use Microsoft
data mining, as a professional data miner it is important that you invest in
understanding what the possible models are and what they produce. Microsoft has provided free information online,
and I have provided additional resources at the end of this paper. In addition to free information, larger
companies should leverage data mining consultants and data mining training to
accelerate either their entry into data mining or to or current use.
What Are Some Problems that Data Mining Solves?
Some typical problems
are categorized in the book Data Mining with SQL Server 2008 (MacLennan et al.,
2009, pp. 4-5):
• Recommendation generation – After a customer chooses one or more
products, data mining suggests another product.
• Anomaly detection – Commonly, fraud detection in the financial
industry means looking for that one transaction or one customer among thousands
who might be committing fraud. Data
mining can find a single observation among even the millions which might be
different.
• Churn analysis – The term churn refers to losing a repeat customer
or client, and knowing what early indicators might indicate someone is ready to
switch can be important.
• Risk management – Credit ratings are often based on multivariate
formulas which help predict levels of risk.
• Customer segmentation – Grouping customers or clients together,
even by their own self-determined characteristics, can allow large
organizations to manage marketing campaigns or even just organize their service
professionals around similar groupings.
• Targeted ads – Marketers use data mining to deliver customized ads
online, but organizations always want to know how to tailor any communications
to be based on what they already know about their customers or clients.
• Forecasting – Time-series analysis takes data from the past, and
provides a look into the future, even when there are seasonal increases or
declines.
In past presentations,
I have drawn many examples from the current news. I like the newspaper USA Today, and I have
read from that paper to discover examples of questions that data mining can answer. With that experience in mind, I’m going to
read today’s news from MSN.com (Microsoft, 2009b) and see where data mining may
provide more insight:
U.S. news – One story reports that the U.S. economy shrank at the fastest clip since
1982. Data mining provides time-series
tools which help provide comparable reporting even when seasonal trends might
influence the data. In the United States,
many retailers experience increased sales during the holiday season between
November and January. Experienced
retailers know to adjust their analysis to account for what is a different time
compared to other times.
World news – One story tells us that record heat has been scorching southern
Australia. More climate data can provide the associative variables to help
understand why temperatures rise. The
classic statistical wisdom is that correlation does not prove causality,
meaning that just because two metrics might rise and fall together does not
mean that one causes the other.
Causality is a tricky subject (and beyond the scope of this community
article). However, data mining might
provide insight to what variables have strong associations with the temperature
outcome.
Business news – One story says that consumers are more upbeat in January. Data mining can provide an important
contribution to surveillance (meaning collecting and analyzing surveys). Perhaps you have been surveyed for some
study, and we know of pollsters surveying voters just before and after
elections. Most surveys are performed
for a sample of the full population, and therefore analysis requires applying
weights to the results to make the sample look more like the population. Data mining can provide insight into how to
provide weights.
Sports news– One story mentions a specific American football team, and asks whether
this team (the Pittsburgh Steelers) has the best defense line of all time. Sports are of topical interest to people
around the world, and there are more soccer fans in the world than American
football fans. People want to know which
team or which players are the best, either in the current year, or compared
historically. All we have from history
are specific team performance metrics, and we can create a computer model
(meaning applied data mining) to help us discover pattern insights into the
sports teams and players considered the best in the world or best in history.
Health news– One story talks about salmonella. Recently in the United States, there
was a salmonella outbreak in the peanut industry (believed to be a single
processing plant). In any investigation
(for health or criminal reasons) an investigator needs to sort through multiple
events and factors and decide which ones have a likely causal relationship on a
specific outcome. I cautioned earlier
about drawing conclusions about causality when there is only associative evidence
(guilt by association). In some situations, there might be multiple possible
causes. Salmonella is a common bacterium,
but there are many such kinds of bacteria, and every peanut processing facility
needs to take multiple precautions against the various known causal
agents. Sorting among the known
candidates to determine the actual culprit requires data collection, and when
the collected data involves complex multivariate relationships, data mining may
provide insight for discovery.
Microsoft has some
case studies online describing specific business implementations of SQL Server
(Microsoft, 2009a), but the more important point is to learn about the
algorithms and what they do. Once you
have learned more about the technology, when faced with a question, you might
find it easier to consider data mining as one piece of the technology that you
would use in a solution. Therefore, the
focus of this community article is not to talk in detail about any data mining
algorithms, but instead point to general questions that data mining can help
solve.
Not all applications
have to do with making money or increasing profit (though these uses are common
and valuable). Nonprofit organizations, including
many hospitals, can and do use data mining to help discover hidden patterns as
they promote their own charitable objectives.
All organizations are increasingly creating or accessing larger and
larger databases, and analysts looking at this data would be wise to learn more
about what data mining can do for them.
More details can be found in the data mining texts referenced at the end
of this article.
What Are the Costs for Data Mining?
Several quick cost
issues to consider:
• Licensing or purchase fees –
Software costs vary, and though some software packages are free, others have
licensing fees. SQL Server provides ways
to monitor usage.
• Hardware and maintenance – Whether
desktop or server, all hardware has associated maintenance costs, and backups
for data are included in this cost.
• Data collection – Sometimes data
are not available to solve a particular question, and other times, only part of
the data is available. Purchasing data
may be part of a solution, and other times, might require a new data collection
strategy.
• Data preparation – Preparing data
for mining means not only having it, but also cleaning and perhaps transforming
the data for appropriate use. Someone
needs to do that data preparation for analysis (and such tasks are common in
most business intelligence systems).
• Personal development – Data mining
does not work all by itself once you have loaded the software and data onto a
computer. A trained data mining analyst
commits to learning the technology and also what might be currently possible
through custom development. Analysis
teams should commit to learning more as new techniques and applications become
available.
Larger organizations
should intentionally include data mining costs into their normal budgeting
process, where these costs can be properly adjusted and evaluated. The qualitative costs include the cultural
shifts associated with new technology implementation.
Advantages of Microsoft Data Mining
There are many data
mining software products available, free and otherwise. Some important advantages of the Microsoft
data mining tools include:
• Tight implementation with a world-class database (SQL Server). Leveraging the performance, security, and
optimization features of this database platform is important because Microsoft
has created one of the world-leading databases.
Many organizations already have data mining licensed through SQL Server
2005 or SQL Server 2008. This technology
can access data in other formats through ADO.NET providers, including Oracle,
Teradata, DB2, and SAS. Many desktop
data mining products (alone) do not provide the scalability or security and
access features in SQL Server.
• Applications for production-level business intelligence. A team of developers can integrate data
mining into a current business intelligence solution. The integration works best with Windows, but
because the interface can be Web-based, you need not be in a Windows-based BI
solution to integrate these tools.
Specifically, Office SharePoint Server provides a way to distribute and
collaborate on data mining solutions through a Web browser.
• Production-quality use and output.
Data modeling has the image of strong individuals leading the way. However, as we move forward, Microsoft
technology makes it easier to support team ownership of data mining projects,
and we can expect that future iterations will help enable teams to data mine
together. Those newer cultural norms
come from the SQL Server and business intelligence cultures, and I believe this
change is a welcome addition to how statistical analysts have traditionally
worked. We need our heroes, but we also
need teams too.
• Extensibility. It is
possible to extend Microsoft data mining to implement algorithms not in the
product. For example, someone has
written a Support Vector Machine (SVM) algorithm, and that code is on Codeplex
(CodePlex Open Source Community, 2009).
Also, using the .NET Framework, a software developer could integrate other
standalone data mining solutions into the Microsoft solution (the integration
depends on how open the other data mining solution is). Microsoft has provided some insider tips and
techniques for extensibility in the Web site supported by the Microsoft data
mining team (SQL Server Data Mining Team, 2009).
On their Web site,
Microsoft provides much free information and tutorials on their software, as
part of their commitment to customer, client, and partner service excellence.
Many resources that I find useful for Microsoft data mining are included in the
Resources section at the end of this paper.
Conclusion
This community article
provides an introduction to data mining, describes what the term means, and takes
a look at the models that data mining produces.
The article also discussed some common business problems solved with
data mining. Finally, the article
focuses on Microsoft data mining, and outlines advantages for that implementation.
Data mining is an
active research field, and you could spend years reading peer-reviewed articles
and textbooks on different aspects of the topic. The field has been historically dominated by
academic people, and there's much careful thought behind the not only the
algorithms but the statistical philosophies of analysis and synthesis. Though I
have provided data mining training, and teach at the university level, I
consider myself a lifelong student of this topic. You might be or become an important part of
that story. I encourage you to share
what you know and learn.
Resources
Association for
Computing Machinery. (2008). Welcome Page.
Retrieved June 15, 2009, from http://www.acm.org/
Berry, M. J. A., &
Linoff, G. (1997). Data Mining Techniques.
New York, NY: John Wiley & Sons Inc.
Berry, M. J. A., &
Linoff, G. (2000). Mastering Data Mining.
New York, NY: John Wiley & Sons Inc.
Bigus, J. P. (1996). Data Mining with Neural Networks. New
York, NY: McGraw-Hill Companies, Inc.
Cerrito, P. B. (2006).
Introduction to Data Mining Using SAS Enterprise
Miner. Cary, NC: SAS Institute Inc.
CodePlex Open Source
Community. (2009). Support Vector Machine plug-in in Analysis Services. Retrieved June 15, 2009, from http://www.codeplex.com/svmplugin
Han, J., & Kamber,
M. (2006). Data Mining: Concepts and
Techniques (Second ed.). San Francisco, CA: Elsevier, Inc.
Hastie, T.,
Tibshirani, R., & Friedman, J. (2001). The
Elements of Statistical Learning: Data Mining, Inference, and Prediction.
New York, NY: Springer Science & Business Media, LLC.
Langit, L., Goff, K.
S., Mauri, D., Malik, S., & Welch, J. (2009). Smart Business Intelligence Solutions with Microsoft® SQL Server® 2008.
Redmond, WA: Microsoft Press.
MacLennan, J., Tang,
Z., & Crivat, B. (2009). Data Mining
with SQL Server 2008. Indianapolis, IN: Wiley Publishing Inc.
MarkTab Consulting.
(2009). MarkTab.net Data Mining Portal.
Retrieved June 15, 2009, from http://www.marktab.net
Microsoft. (2008a).
Mining Models (Analysis Services - Data Mining). Retrieved June 15, 2009, from
ms-help://MS.SQLCC.v10/MS.SQLSVR.v10.en/s10dm_0evalplan/html/cd4df273-0c6a-4b3e-9572-8a7e313111e8.htm
Microsoft. (2008b).
SQL Server 2008 Books Online -- How Do I Browse by Job Role (Analysis Services
- Data Mining). Retrieved August 15,
2008, from http://msdn.microsoft.com/en-us/library/bb510519.aspx
Microsoft. (2009a).
Microsoft SQL Server 2008 Product Information - Case Studies. Retrieved June 15, 2009, from http://www.microsoft.com/sqlserver/2008/en/us/case-studies.aspx
Microsoft. (2009b).
MSN Web Portal. Retrieved January 15,
2009, from http://www.msn.com
SQL Server Data Mining
Team. (2009). SQLServerDataMining.com Home Page. Retrieved June 15, 2009, from http://www.sqlserverdatamining.com/ssdm/
Tan, P.-N., Steinbach,
M., & Kuman, V. (2006). Introduction
to Data Mining. Boston, MA: Pearson Education, Inc.
Tang, Z., &
MacLennan, J. (2005). Data Mining with
SQL Server 2005. Indianapolis, IN: Wiley Publishing, Inc.
Witten, I. H., &
Frank, E. (2005). Data Mining: Practical
Machine Learning Tools and Techniques (Second ed.). San Francisco, CA:
Elsevier, Inc.
About the author: Mark
Tabladillo is a regular contributor to the MSDN data mining forums. He is owner
of MarkTab Inc., has worked as
a consultant for Solid Quality Mentors,
and is an associate faculty member of the University of Phoenix.
For more information:
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this
paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to
5 (excellent), how would you rate this paper and why have you given it this
rating? For example:
- Are
you rating it high due to having good examples, excellent screen shots, clear
writing, or another reason?
- Are
you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback
will help us improve the quality of the papers that we release.
Send feedback.