By Andrew Wong, Senior Consultant, and Dean Sutcliffe, Product Manager, for Data
Quality Manager.
Data Quality Manager
(DQM) is a commercial data integration application that provides Data
Profiling, Data Cleansing and Matching, and Management Dashboard
functionalities. It lets the user conduct data investigations and profiling,
and measure gaps against specified business rules.
Since this article was
written before the general availability of SQL Server 2008, it does not discuss
the Data Profiling task which is new in Integration Services in SQL Server 2008.
Introduction
Steps in a data quality
project
Most data quality projects follow a four-step process:
1. Profile the data.
2. Prepare a data quality score card.
3. Parse & deduplicate the data.
4. Test the data.
This is typically an iterative process.
.gif)
Figure 1. The
typical iterative steps in a data quality project.
On the market today, there are a number of specialized data
quality tools, with varying levels of capabilities and cost. Within the
Microsoft product line, SQL Server Integration Services has many of the basic
data quality functionalities. Integration Services can also be extended to
handle complex data profiling and deduplication tasks. The robustness and high
performance of SQL Server make it an ideal platform for complex, high-volume data
integration tasks.
This article discusses the building of a data quality
application that uses SQL Server, Integration Services, and Microsoft Excel.
The authors gained this experience by implementing a number of data quality
projects, and acquired their engineering knowledge through the development of a
commercial data quality software product, Data Quality Manager.
Architecture of a data quality application
The key to success and high performance in a data quality
system is to process the data (that is, profile, parse and match the data) in situ within the SQL Server
environment. Most data quality tools make a copy of the data, and process this copy
in memory to obtain a result. While this is satisfactory for small data sets, the
ERP systems and business intelligence platforms of most companies produce sizable
quantities of raw data that seriously challenge this approach.
In our experience, we have found that the most efficient
approach is to process the data closest to its home ¡V that is, in situ in SQL Server tables. More
importantly, by using SQL Server as the processing engine, we are allocating
the complex, heavy-duty logic and processing to SQL Server¡¦s data engine, which
is built specifically for intensive data manipulation. In addition, SQL Server
has a number of facilities, such as special indices and the SQL optimizer
engine, that it can use to improve its overall performance.
Data Profiling
Data profiling analyzes the content of critical data items,
and tests whether the data is ¡§fit for purpose¡¨. For example, in the context of
a customer database, the data profiling question could be: ¡§Are all telephone
numbers complete, do they contain valid area codes for the United States, and
is each area code consistent with the Zip code of the address?¡¨ Data profiling
measures the degree of fit of the data to specified business rules, and
presents the result as a percentage score that identifies the percentage of the
data that passes the business rule.
Business context
The most commonly encountered profiling issues are related
to consumer and commercial customer details. For instance, a business may want
to profile the consistency of the following data elements:
Consumer customer
name
- How
many customers have a Full Name?
- What
percentage of customers have a Full Name that is spelled correctly?
- Is
the Title of the customer spelled correctly?
- Is
the Gender of the Customer consistent with the customer's Title?
- What
is the most likely Gender based on the Name?
Phone contact details
- Do all
phone numbers follow the same or a valid pattern?
- How
many land line fields contain a mobile number?
- Are
all phone numbers in the United States?
- How
many phone numbers have no area code?
- How
many phone numbers have an invalid area code?
- Can
the phone number be validated from external reference sources?
Commercial customer
name
- Does
the Name contain an organization name or a personal name?
- If
the name is the name of a legal entity, what is its company type (Ltd,
Pty, Inc, and so forth)?
- Does
the company's registration number follow a consistent pattern for that
State?
- Determine
the Standard Industrial Classification (SIC) based on the Company name.
Customer address
- Is
the structure of the address valid?
- Is
the address complete?
- What
is the Zip code, City, County, and State of each address?
- Is
the Zip code consistent with the City, County, and State?
- Is
the Zip code consistent with telephone area code?
- What
is the building type for each address?
- Can
the address be validated from external Postal Service sources?
E-mail address and
Web site URL
- Is
the structure of the e-mail address correct?
- List
the company and country code for each e-mail address.
- List
the organization type (.com or .gov) of each e-mail address.
- Is
the e-mail address a company e-mail address that is consistent with the company
name?
- Is
the e-mail address consistent with the name of the individual?
- Is
the Web site URL structurally correct?
- Can
the Web site URL be validated as active and accessible from the Internet?
The concerns of most manufacturing organizations center on product
and supply chain details. The profiling issues for a manufacturing organization
may include:
Product master record
- How
many products have similar descriptions?
- List
all variation of how a product is described.
- Find
the same of similarly products that is classified in different Category.
- List
the products that can be ordered from different branch office of the same supplier
E-Commerce supply
chain
- Which
products are incorrectly classified under the United Nations Standard
Products and Services Code (UNSPSC)?
- In how
many categories does a product appear?
- Which
supplier's products do not use standard nomenclature?
- Which
product does not follow the industry standard for unit of measure, package
size, and so forth?
Building profiling capabilities in SQL Server
The general approach in building profiling capabilities in SQL
Server is:
1. Abstract business rules into logic, which is translated
into Transact-SQL as a profile action.
2. Abstract data into patterns and token lists.
Externalizing parts of the business rules into patterns and reference
tokens makes the system extendable. These patterns and tokens can be adapted to
other projects, or updated to handle data changes in the future.
Profile actions are built as Transact-SQL user-defined
functions and stored procedures that use associated token lists. These items are
embedded as database objects in SQL Server, and can be called by any utility
that can execute a query on the server.
In building the Data Quality Manage product, we built a data
profiling user interface that provides an interactive platform for users and
analysts. The profile result sets are essentially result sets in tables, which
can be displayed by using Excel Services or a similar reporting service. Salient
information is summarized into a score card and a dashboard. This facilitates
communication with project sponsors and senior management.
.gif)
Figure 2. The
user interface for data profiling in Data Quality Manager.
Profiling functions
In the example below, we have a client who wishes to use the
phone number field for their sales force, regardless of which state or country
they reside in. The common profiling questions are:
- Is a
phone number of the right length?
- Does
it have the correct area code?
- How
many different patterns do we have?
- Do we
have a textual label (for example, area code 202 ¡V Washington, D.C)?
Below is a simple profiling function that can be run on any
SQL-92-compliant server. The purpose of the function is to query a phone number
field, then store the result. From the result, the analyst can understand the
current state of the data, and assess how much effort is required to get the
data into a state fit for use.
Looking at this phone number field on its own will only give
us a high level view of the data. However, if we include reference data (such
as the area code list for the United States) or alternative fields from within
the data set, we can further validate the data or repair values when they are
missing or deemed incorrect.
The function below walks the length of the data, testing one
character at a time to determine whether it is a letter, number, space or
something else, then returns a code that reflects what it found.
Create FUNCTION [dbo].[ufn_PatternDisplay] (@txtstring NVarchar(255))
RETURNS varchar(255)
--|| with encryption -->< --|| <--remove for build
as
BEGIN
-- =============================================
-- Author: Dean Sutcliffe
-- Create date: 4 July 2006
-- Description: replace char in a string with AN # based on content
-- =============================================
declare @i int ,@Output varchar(255),@temp Nchar(1)
set @i=1
set @Output=''
WHILE (@i) <(len(@txtstring) +1)
BEGIN
set @temp=substring(@txtstring,@i,1)
Select @Output=@Output + case
When @temp like '[a-Z]' then 'A' -- if between a-Z call it ¡¥A¡¦
When @temp like '[0-9]' then 'N' --if it is a number call it ¡¥N¡¦
When @temp=' ' then ' ' -- if a space call it a Space
Else '#' end -- for everything else it's #
set @i=@i+1
END
RETURN @Output
END
Presenting profiling results
The results of profile actions are contained in result
tables. In Data Quality Manager, we have chosen Microsoft Excel as the
presentation platform. Excel offers ease of use and seamless integration with
the rest of the Microsoft Office suite. This streamlines the process of producing
data quality reports.
As an example, here are the data profiling results of the pattern
function example shown above:
RecordCount | Customer_Phone_pattern | data_sample | % |
123,374 | #NN# NNNN NNNN | (02) 4272 9777 | 36.20% |
56,754 | NN NNNNNNNN | 02 42716136 | 16.70% |
45,678 | NNNNNNNNNNNNN | 12012282087 | 13.40% |
34,677 | NNNNNNNNNNN | 293281900 | 10.20% |
34,667 | NN NNNN NNNN | 02 2816 1447 | 10.20% |
10,004 | Null | Null | 2.90% |
9,701 | NNNNNNNNN | 26348337 | 2.80% |
6,976 | NNNNNNNNNNNN | 2026236969 | 2.00% |
6,754 | #NNNNNNNNNNNN | 3.1018E+11 | 2.00% |
4,567 | NN#NNNN#NNNN | 02.9417.0799 | 1.30% |
3,456 | NNN NNN NNNN | 027 572 9265 | 1.00% |
2,345 | NNN#NNN#NNNN | 072-638-5294 | 0.70% |
1,004 | NNNNNNNNNN | 49360830 | 0.30% |
345 | #NNNNNNNNNNN | 5603934334 | 0.10% |
90 | NNNNNNNN | 17400407 | 0.00% |
90 | AAAAAAA | UNKNOWN | 0.00% |
79 | AAAAAA | FOLLOW UP | 0.00% |
47 | #NN#N#NNNNNNNNN | +31(0)102175866 | 0.00% |
From the table above, we can see that data has been keyed to
differing standards. We have a small number of clearly invalid records like ¡§UNKNOWN¡¨,
a small number of NULL records, and so forth. We can now classify each pattern
against business rules (as Valid or Invalid), and derive an overall data quality
score for this phone number data.
We can also plan the next steps in how we are going to
handle the phone number data. If the database also contains address data, then we
can determine whether a number like 002026236969 is in Maryland in the United
States, or in Sydney, Australia. After this decision is made, we can reformat
the number appropriately.
Profiling with SQL Server Integration Services
The profiling function demonstrated above can be wrapped in
an Execute SQL task and published in an Integration Services package. By doing
so, the core functionality of Integration Services is extended with data
quality tools that are tailored specifically for your data environment.
.gif)
Figure 3. Data
profiling in an Integration Services package
Profiling with SQL Server
These profiling functions can also be called inline, built
into an API or the SOAP interface of a Web service, or used as part of a Transact-SQL
GROUP BY query.
Here is an example of calling the profiling function
described above as a SQL statement:
Select Count(*) as RecordCount,
[dbo].[ufn_PatternDisplay]([Customer_Phone]) as Customer_Phone_pattern,
Min([Customer_Phone]) as data_sample
From [tbl_Customer_Details] CtmrDtls
Group by [dbo].[ufn_PatternDisplay]([Customer_Phone])
Building a Data Quality Score Card
Using a score card to manage and communicate
Data profiling results should then be summarized in a score
card that shows how each of the business rules is measured. This provides a
business summary of data quality issues in your organization. The score card is
an essential business communication tool. It lets the business understand the
scope and implication of data quality issues, and to decide on appropriate
actions.
When presenting the score card in Data Quality Manager, we
present a hierarchy of three concepts:
- Business
rules.
- The critical
data items associated with each rule.
- The
data profiling result.
As an illustration, consider the following example.
ACME Manufacturing has conducted a data quality profiling
project, and has presented the outcomes in the following structure:
Business Rules | Critical Data Items | Data Quality profile |
Supplier details must be current | Company name | Completed with a legal entity name |
Non-blank |
Company registration number | Must be completed |
Registration number has a valid structure |
Number found in field |
Parent company | Non blank when parent id has been set |
Domicile country | Valid (on reference list) |
Web site URL | Completed, with valid structure |
Valid if URL is provided |
This structure identifies a number of key business rules. This
includes "Supplier details must be current" (as shown in the sample
above), "Vendor terms must be valid", "Customer delivery details
must be up-to-date", and so forth.
Each of these business rules is supported by a number of critical
data items. For example, the data items of "Company Name",
"Registration number", "Parent Company", "Domicile
Country", and "Web site UR " are critical data items that
support the "Supplier details must be current" business rule.
Under the "Company Name" data items, the data quality
profile measures the percentage that are "Non-blank", and "Completed
with a legal entity name". For the critical data item of "Web site
URL", the profile measures the percentage that are "Completed, with
valid structure", and "Valid if URL is provided".
A simple score card can now be assembled by using the profile
percentage scores. The profile can be weighted, and target values can be set.
Then a more sophisticated score card that reflects business priorities starts
to emerge.
Building the score card
In Data Quality Manager, profiling results are aggregated
into a summary result table, and presented in Microsoft Excel.
A summary chart in the form of a polar chart gives a business-wide
overview of the data quality summary. Traffic light icons provide an instant
view of the data quality score against target. A detailed grid shows the data quality score
and target for each of the critical data items.
.gif)
Figure 4. The
summary view in a data quality score card in Data Quality Manager.
.gif)
Figure 5. The
detail view in a data quality score card in Data Quality Manager.
Using the score card
The score card is a management tool that explains data
quality issues in terms of what is important for the business.
For example, the business may discuss:
- Which
data items are at the core of business operations?
- What
is the type of, or the source of, data quality issues?
- What
are the implications for internal processes in data quality improvement?
- Which
organizational unit is responsible for data quality improvement?
- What
is the hidden cost of poor data quality to the business?
- What
level of improvement in data quality is practical?
- How
much data quality improvement has there been since the beginning of a data
quality program?
We have found that over two-thirds of the effort in a
typical data quality project is spent in data profiling and in communicating
with the score card. This highlights the fact that a data quality project is
essentially an exercise in business education and communication. Presenting the
profile findings in a score card often transforms the data quality discussion
from a technical exercise to a business imperative.
Parsing, Deduplication, and Data Enhancement
Patterns and parsing
Parsing is the decomposition of a piece of compound data
into its business components. It is a fundamental step in matching, deduping,
and enhancing data. By breaking a string into its components, pattern and token
lists can be produced, and a solid foundation can be built for sophisticated or
fuzzy matching.
Parsing is always specific to the underlying business
context. For instance, a 10-digit phone number and a 10-digit product number
may have a similar pattern of aaabbbcccc. The phone number parser will
recognize that "aaa" is the area code. The parser may use a zip
code/area code cross-reference list to confirm that the results are correct.
Some of the commonly used parsing functions include:
- Street
addresses (Western style: US, UK, AU)
- Street
addresses (European style: GN, FA, DA)
- Street
addresses (double-byte: HK, China)
- Postal
data
- Company
name
- Person
name
- Product
description
- Email
address
- Phone
number
- Account
and cost center name
Matching without parsing
SQL Server Integration Services provides a fuzzy matching
capability, but its effectiveness depends on matching data items that come from
the same business context. In other words, Integration Services fuzzy matching
works better when the data is properly parsed.
Let us consider this example to illustrate this point. The
business needs to dedupe a Customers list that contains the name, two columns
of address data, and the phone number. Below is an example of 4 customer
records that include 2 individuals at the same address.
Name | Address1 | Address2 | Phone no |
John Dawkins | 1530 Olmo Way, Walnut Creek, CA 94595 | 961 555 9874 |
J Dawson | 15/30 Walnut Creek Road, Walnut Creek, CA94597 | 555 9847 |
John Dawking | Unit 15, 30 Olmo Way | Monks Creek, CA 94598 | 961 555 9847 |
Jenny Dawkins | 1530 Olmo Way, Walnut Creek, CA 94595 | | 961 555 9847 |
.gif)
Figure 6. Fuzzy
matching without prior parsing in an Integration Services package.
When we apply Fuzzy Grouping, SQL Server Integration
Services will match the set of 4 records into a single set. This is due to the
fact that Jenny Dawkins and John Dawkins qualify as a fuzzy match. In the
context of our effort to identify distinct individuals, this is a false result
(as known as a false-positive).
Match Id | Name | Address1 | Address2 | Phone no |
Set 1 | John Dawkins | 1530 Olmo Way, Walnut Creek, CA 94595 | 961 555 9874 |
| J Dawson | 15/30 Walnut Creek Road, Walnut Creek, CA94597 | 555 9847 |
| John Dawking | Unit 15, 30 Olmo Way | Monks Creek, CA 94598 | 961 555 9847 |
| Jenny Dawkins | 1530 Olmo Way, Walnut Creek, CA 94595 | | 961 555 9847 |
Parsing before matching
In this next example, we have added two preliminary parsing
steps to parse the name and address into their components.
.gif)
Figure 7. Fuzzy
matching with prior parsing in an Integration Services package.
Parsing the input string into its components allows the fuzzy
matching algorithms to operate on specific components of the data, and to match
data coming from the same domain. More importantly, it allows the matching
process to be ¡¥discriminating¡¦ when it comes to the relative importance of
components. For instance, we may find that Street name and Zip code are more
important than Street number or Unit type. We may also find that Phone number is
of less importance, if the data is organized geographically.
The following table shows the same 4 records after
preliminary parsing of the name and address data:
Name | Initial | Surname | Unit | Unit Type | Street No | Road Name | Road Type | City | State | Zip Code | Area Code | Phone No |
John | J | Dawkins | | | 1530 | Olmo | Way | Walnut Creek | CA | 94595 | 961 | 5559847 |
| J | Dawson | 15 | | 30 | Walnut Creek | Road | Walnut Creek | CA | 94597 | | 5559847 |
John | J | Dawking | 15 | U | 30 | Olmo | Way | Monks Creek | CA | 94598 | 961 | 5559847 |
Jenny | J | Dawkins | 15 | U | 30 | Olmo | Way | Monks Creek | CA | 94598 | 961 | 5559847 |
By adding a parsing function in the job stream, we have
improved the results of matching:
Match Id | Name | Address1 | Address2 | Phone no |
Set 1 | John Dawkins | 1530 Olmo Way, Walnut Creek, CA 94595 | 961 555 9874 |
J Dawson | 15/30 Walnut Creek Road, Walnut Creek, CA94597 | 555 9847 |
John Dawking | Unit 15, 30 Olmo Way | Monks Creek, CA 94598 | 961 555 9847 |
Set 2 | Jenny Dawkins | 1530 Olmo Way, Walnut Creek, CA 94595 | |
The parsing function is an essential component of a data quality
system. We have found that parsing data before applying fuzzy matching greatly
increases the effectiveness of a deduplication process. It makes the matching
process more discriminating and more targeted, and provides you with the degree
of control that you need over the matching process.
Conclusion
SQL Server Integration Services has much of the basic
functionality required to ensure data quality. This article has illustrated how
SQL Server can be extended, especially in the area of profiling and parsing. The
architecture, performance, and robustness of SQL Server make it an ideal
platform to handle complex and demanding Data Quality requirements.
About the authors and their product
Andrew Wong, Senior Consultant,
DQM. Andrew is a lead consultant in data quality projects for DQM. He has substantial
experience in large-scale data integration, systems migration and business
intelligence projects. Andrew has developed a set of data integration
methodologies that have data quality and data governance principles as their
centerpiece. He consults in the industry on the planning and management of data
quality and data migration projects, writes regularly in data quality journals,
and is the key architect behind the DQM product.
Dean Sutcliffe, Product
Manager, DQM. Dean is Lead Architect and product manager of the DQM product. Dean
has over 10 years of practical experience in data cleansing and data
deduplication, accumulated through operating a data cleansing bureau in a direct
marketing agency. During this period, he has observed many varieties of data
quality issues, in almost every industry. He has developed a practical approach
to data quality management, as he has demonstrated with great skill as the chief
designer of DQM. Dean is a Microsoft Certified Professional with substantial
knowledge of SQL Server 2005, and has been working with SQL Server since
version 4.x.
The DQM (Data Quality
Manager) product. The learnings
from this article are based on the practical experience that the authors gained
from the development and implementation of DQM on a number of clients. DQM is a
data quality application that provides data profiling; data cleansing and
matching, and data quality score card and dashboard. Users can conduct data
investigation, data profiling and measure gaps against business rules. DQM can
parse unstructured data into its components; cleanse and standardize the
content; and match items to identify duplications. The application is
preinstalled with templates which correspond to the most commonly encountered
data quality issues in the industry. This feature reduces learning time and speeds
up a data quality project. DQM is built primarily on SQL Server technologies ¡V
its profiling, parsing and matching logic is implemented as stored procedures
and user defined functions. It can be deployed as a batch process or as a Web-based
real-time system. It has an easy-to-use user interface, and uses Integration
Services to provide an alternative launch platform. DQM has demonstrated high
performance and scalability to very large data volumes.