SQL Server

Data Quality Testing Using SQL Server 2012 Data Quality Services

Raj Kamal
Gunjan Jain

Download the Code Sample

Microsoft SQL Server 2012 Data Quality Services (DQS) is a powerful tool that can help your IT teams (especially QA and testers) detect and prevent data quality issues that affect your customers. In this article, we demonstrate how test teams can use DQS as a way of proactively uncovering—and solving—data quality issues.

If not identified and corrected early on, defective data can contaminate all downstream systems and information assets. The growing awareness of such repercussions has led to major public initiatives, such as the Data Quality Act in the United States and Directive 2003/98/EC of the European Parliament. Enterprises must present data that meet stringent quality standards, especially in light of recent compliance regulations.

SQL Server 2012 DQS is a knowledge-driven solution, with the Data Quality Knowledge Base (DQKB) at its heart. A DQKB stores all the knowledge related to a specific type of data source. The organization’s data expert (often referred to as a data steward) maintains the DQKB. For each data domain, the DQKB stores all identified terms, spelling errors, validation and business rules, and reference data that can be used to perform data quality actions on the data source. SQL Server 2012 DQS addresses the following data quality problems, which  are typical candidates for test scenarios in a data-centric project.

  • Completeness
  • Conformity
  • Consistency
  • Accuracy
  • Validity
  • Duplication

In this article, we cover some basic, common data quality issues. We provide four test scenarios to illustrate how DQS deals with each issue:  standard data, syntax and format, domain and validity.

To follow along with this article, you need to have SQL Server 2012 installed. You can findinstallation instructions at DQS Forum. You also need a sample database. Download and run the scripts from the link at the beginning of this article to populate your sample database with the following two tables, which we’ll be using for demonstration.

NameGenderAge
JohnMale18
MikeUn23
RahulMale27
AmarMNULL
SenoritaFemale35
JohnFemale18

Table: Employee A

NameEmailDegree
Johnjohn@test.comBE
MikeMike test.comBachelor
Rahul LLB
Amaramar@test.comNULL
Senoritasenorita@testMA
NULLtest@test.comBE

Table: Employee B

Throughout the rest of this article, you’ll see how a test team can use SQL Server 2012 DQS to test data quality requirements for tables like this. Ideally, a QA project team defines various test scenarios and the test team focuses on defining and then running the data quality rules. If your organization doesn’t have separate QA and test teams, you can still define and test the knowledge bases you create, providing value for your customers.

DQS Test Planning

Creating a knowledge base isn’t a core activity of a test team, but your test team can still take this step even if the project team doesn’t adopt SQL Server DQS formally as a project requirement. To create a knowledge base, you launch the Data Quality Client and click the new knowledge base, which you’ll see in the left pane of the user interface. You then name the knowledge base and provide the mandatory field and description. Select Knowledge Discovery from the activity, and then click Next.

You can create a new knowledge base in three possible modes. You see these options on the right side of the Knowledge Base Management window. The selection depends on the activity you want to do.

  • Domain management: Create from scratch with no guidance. You define all business rules under it.
  • Knowledge discovery: Use a sample data set to guide you in building your knowledge base.
  • Matching policy: Use a sample data set. 

Figure 1 shows how the window looks when you click the New Knowledge Base button. You are expected to provide the sample database details.

Creating a knowledge base and its domains
Figure 1 Creating a knowledge base and its domains

To choose a database sample, select either SQL Server or Microsoft Excel as your Data Source. Once you select a database, all its tables are listed in Table/View. For this example, select a table that has corrupt data (one of the two we provide).

When you create a knowledge base, you also need to create a domain. Any domain you create has many built-in options, such as Speller and null check. To start, click Create a Domain. Then fill in the required fields.

Once the mapping is done, click Next and then click Start to upload the file for analysis. Figure 2 shows the initial analysis of the data. Because we haven’t yet defined any custom rules, only built-in DQS algorithms are executed on the sample database.

First execution of the knowledge base
Figure 2 First execution of the knowledge base

The records are categorized according to the following criteria:

  • New records: All of them are new because it’s the first pass.
  • Unique: Number of unique records.
  • Valid: Number of valid records—again, because it’s the first pass, everything is valid.
  • Completeness: Availability of null value check.

Now it’s time to analyze the data. In the first analysis, all the default DQS algorithms, including spell checking and checking for null values, are executed.

Figure 3 includes a list of the values that Data Quality Client has found in the knowledge discovery analysis of the data. You can flag these values as invalid or errors as appropriate or leave them at their default value if they are correct. As you can see in Figure 3, we reviewed the values found for the Degree domain, flagging them as Correct, Error or Invalid, and assigned a correct value to any incorrect ones. Once you have flagged the incorrect values, you can enter the correct values in the Correct To column. Handily, the client then groups your corrected values under the correct value in the list.

First level of correction
Figure 3 First level of correction

The final step is to click Next and publish the knowledge base (that is, store the results on Data Quality Server).

DQS Test Design

In the test design phase, you modify the domains you created when you created the knowledge base. You modify the existing domain rules to include the business-related rules. You can also create new domains during domain management. To begin, launch Data Quality Client. Under Knowledge Management, click the existing knowledge base name and select Domain Management. The window in Figure 4 should appear.

Domain management window
Figure 4 Domain management window

Within this window, you create the domain rules for the data you want to cleanse. First, go to the Domain Rules tab to define the business rule. If you open the knowledge base created in the previous section in domain management mode, a list of domains appears. The tabs in Figure 4 show the options for setting the data quality rules that apply to a domain:

  • Domain Properties: The properties set when a domain is created (when a new knowledge base is defined). You can’t edit these properties through domain management.
  • Reference Data: Data quality services that allow you to validate data against external sources.
  • Domain Rules: Syntax-based validation that DQS can apply.
  • Term-Based Relations: Functions as a find-and-replace engine. In it, you can repair common errors, such as changing “si” to “is” or expand common abbreviations, such as revising “Corp.” to “Corporation.”

We’ll now run through four basic data quality test scenarios so you can see how easy it is to control data quality using Data Quality Services.

Data Quality Test Scenario 1

Data quality issue: Standard data—that is, data constrained according to certain rules (Dom_Age)

Business rule: The degree should be three or fewer characters—that is, the length of the data should be three or fewer characters.

Now we can show you how DQS helps define different types of rules. You can combine rule types with AND/OR operators, making them quite flexible. You can also have multiple rules per domain if you want to simplify managing multiple nondependent conditions. You build new rules within the Domain Rules tab, as shown in Figure 5.

Defining a business rule for the Degree domain
Figure 5 Defining a business rule for the Degree domain

To build a new rule, click the New Rule button at the top right of the window. Select the condition you want from the drop-down list. Then click Apply All Rules.

A working window appears. The rule “Length is less than or equal to 3” means that invalid value counts will increase, decreasing the valid value count. Keep in mind that the rules you create don’t apply to any cases in which a correction is already defined in the domain values. Click Yes to confirm the rule. Now when the modified rule is executed along with other rules, the results will reflect your changes.

Data Quality Test Scenario 2

Data quality issue: Email syntax and format (Dom_Email)

Business rule: Email should have the correct syntax and format.

Regular expression: ^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$

You build a rule for the Email domain by using a regular expression. To create the rule, just follow the steps described in Test Scenario 1. Once you click Apply All Rules, the window in Figure 6 appears.

Defining a business rule for the Email domain
Figure 6 Defining a business rule for the Email domain

As in Test Scenario 1, you finally click Yes to confirm the rule, which will reflect the changes you made once the modified rule is executed along with other rules.

To save this business rule, click Publish, as shown in Figure 7. Remember that you can’t modify a rule once it’s published.

Saving the modified business rule for the Email domain
Figure 7 Saving the modified business rule for the Email domain

Data Quality Test Scenario 3

Data quality issue: Age domain  (Dom_Age)

Business rule: The age should be greater than 20 and less than 30.

To build the business rule for the Age domain, just set the correct values in the drop-down boxes, as shown in Figure 8.

Defining the business rule for the Age domain
Figure 8 Defining the business rule for the Age domain

Data Quality Test Scenario 4

Data quality issue: Validity

Business rule: The gender field should contain either M or F. Discard all other values.

To build the rule for the Gender domain, set the values in the drop-down lists and click Apply All Rules, as shown in Figure 9.

Defining the business rule for the Gender domain
Figure 9 Defining the business rule for the Gender domain

Figure 10 shows the message box that appears when you click Apply All Rules. It warns about the increase of invalid values and the consequent decrease of valid values. Click Yes if you’re sure this rule is as you want it, and then publish it. Again, you can’t modify rules after they have been published.

Executing the business rule for the Gender domain
Figure 10 Executing the business rule for the Gender domain

DQS Test Execution

By now, you should see how easy it is to create knowledge bases and rules in SQL Server 2012 DQS. The next step is to execute the tests scenarios and DQS rules we created. For creating knowledge bases and defining domain rules, we used a sample database. For the test execution phase, we can execute the rules we created on the production database.

To create a new project, from the middle pane of the home screen, just click New Data Quality Project. In the window shown in Figure 11, you provide a name and a description for your project and select the knowledge base  defined in the previous steps to use for cleansing. Then click Next.

Creating a new test project
Figure 11 Creating a new test project

In the window shown in Figure 12, you select the database, which loads all the mappings for that database. Click Next to see the results.

Selecting domains to execute
Figure 12 Selecting domains to execute

To execute the DQS algorithms and the domain rules we created, click Start. The results are shown in Figure 13. Click Next when execution is finished.

Executing the test project
Figure 13 Executing the test project

We have now executed all rules on the data and can move to cleansing the data.

Bug Fixing and Verification

Projects are applications of knowledge bases across sets of data. There are two modes of operation: cleaning and matching. Cleaning is the application of the knowledge base against a set of data to test and refine the rules that are in place. Matching is the application of the knowledge base against a set of data and applying what appear to be fuzzy matching algorithms over the data to match it to existing records.

Let’s now look at how you can fix some common errors through DQS. With DQS, you can fix bugs or find and export them for further testing instead of manually going back to SQL Server to fix them. You can log any issues in your bug-tracking tool.

After executing all the rules created in the knowledge base, the results window shows the results for each domain. Figure 14 shows the five tabs in which you can study each domain and the data associated with it:

  • Suggested: Suggested changes by DQS algorithm
  • New: New values
  • Invalid: Incorrect values
  • Corrected: Values  a user has corrected
  • Correct: Correct values 

Data cleansing
Figure 14 Data cleansing

You can easily correct the values that appear on the Invalid tab. Just add the appropriate value in the Correct To field and select Approve. (See Figure 15.) The value will be changed to the value you provided. Click Next when you have reviewed the invalid values for all domains.

Correcting invalid values
Figure 15 Correcting invalid values

The columns at the left of Figure 16 provide information about what has been corrected, including the reason for the correction. Your data is now cleansed and ready to export to either SQL Server or Excel.

Exporting the cleansed data to a SQL Server table
Figure 16 Exporting the cleansed data to a SQL Server table

Benefits and Takeaways

SQL Server 2012 DQS is an effective tool that can be instrumental in verifying and improving the data quality of data warehousing, business intelligence or other data-centric projects. DQS isn’t only for business users and data stewards, however. IT teams, especially the QA and testing members, have long been craving such a tool.

SQL Server DQS helps with both data verification and data validation practices and can make it easier to prevent data quality issues proactively rather than having to correct errors in shipped products—when changes can be expensive to fix. In addition, automating your data quality processes enables live unattended monitoring that will notify stakeholders about potential critical issues.

You can find additional resources for SQL Server 2012 DQS on the Microsoft TechNet Wiki: Data Quality Services, DQS operations, SQL Server “Denali,” SQL Server 2012 and Testing.


Raj Kamal is a senior test consultant with the Global Delivery team at Microsoft. He specializes in different types of testing techniques, test automation and testability in domains such as manufacturing, health care and higher education. He has helped teams develop test automation strategies and architectures at Cognizant Technology Solutions, Oracle Corporation and Microsoft. He also provides training in automated testing architectures and design. He is QAI (CSTE) and ISTQB certified. His passion:  geektester.You can contact him at rajkamal@microsoft.com.

Gunjan Jain is an associate test consultant with the Global Delivery team at Microsoft. She specializes in testing scenarios such as web testing and automation testing in new terrains. Before joining Microsoft, she worked at Wipro and SunGard. She still tries to find time for developing web applications, something she did in her engineering days.You can contact her at gunjain@microsoft.com.

 

Rate: