The following table lists the tasks, containers, data sources and destinations, and transformations that are used within the sample.
|
Element
|
Purpose
|
|---|
|
Execute SQL task
|
The Execute SQL task is named Create Customer Address Reference Table View, Populate NewCustomer Input Table and Create Output Tables.
This Execute SQL task runs a stored procedure that creates the input table, CustomerLeads, and also creates the three output tables, ExistingCustomerLeads, NewCustomerLeads, and DuplicateCustomerLeads. For its return code, the stored procedure returns either a value of 0 or 1. A value of 0 indicates that the stored procedure ran successfully; a value of 1 indicates that the stored procedure did not run successfully. The task stores the return code in a package variable, ReturnCode.
Between the Execute SQL task and the Data Flow task is a precedence constraint. This precedence constraint allows the Data Flow task to run only if the following conditions are true:
-
The Execute SQL task finishes successfully.
-
The ReturnCode variable contains a value of 0.
|
|
Data Flow task
|
The Data Flow task, Fuzzy Lookup Data Flow Task, executes the data flow in the package.
|
|
OLE DB source
|
The OLE DB source, OLE DB Source - Customer Leads, reads records from the CustomerLeads table.
|
|
Lookup transformation
|
The Lookup transformation, Lookup against Existing Customers, performs an exact lookup to identify existing customers. The Lookup transformation compares each row of customer records in the input table, CustomerLeads, against the entries in the Lookup reference dataset, and then does one of the following actions:
-
If the row has a matching entry in the reference dataset, the transformation directs the row into the into the ExistingCustomerLeads table.
-
If the row has no matching entry in the reference dataset, the transformation directs the row to the Fuzzy Lookup transformation, Fuzzy Lookup against Existing Customers, for further comparison.
The Lookup transformation runs in Partial cache mode, and caches rows that have no matching entries in the reference dataset.
|
|
Derived Column transformation
|
The Derived Column transformation, Derived Column, adds the _Similarity columns to each row and sets the column value to 1.
|
|
Fuzzy Lookup transformation
|
The Fuzzy Lookup transformation, Fuzzy Lookup against Existing Customers, performs a fuzzy lookup to identify customer records that are fuzzy matches of existing customer records.
The transformation adds a _Similarity column that contains a similarity score to each row. A score of 0.0 means no match was found, whereas a score of 1.0 means an exact match was found. A score between 0.0 and 1.0 is a measure of similarity in which a value closer to 1.0 indicates greater similarity.
|
|
Conditional Split transformation
|
The first Conditional Split transformation, Conditional Split on _Similarity, directs input rows to one of two outputs depending on the value of the similarity score determined by the fuzzy lookup. Rows with a similarity score >= .70 are written to the ExistingCustomerLeads table. Rows with similarity scores < 70 are probably valid new customer leads and additional cleaning is done on these rows.
The second Conditional Split transformation, Conditional Split on Canonical Record for Group, directs input rows to one of two outputs depending on whether the data row is a duplicate. If the values of the _key_in and _key_out columns are equal, the row is used as the canonical row in the group, and the canonical row is inserted into the NewCustomerLeads table. If the _key_in and _key_out columns are not equal, the row is treated as a fuzzy duplicate and the row is inserted into the DuplicateCustomerLeads table.
|
|
Union All transformation
|
The Union All transformation, Union All, merges rows of existing customers—both exact and fuzzy matches—into one dataset.
|
|
Fuzzy Grouping transformation
|
The Fuzzy Grouping transformation, Fuzzy Grouping, groups customers who are likely duplicates. The transformation adds three columns _key_in, _key_out and _score to each row. _key_in is a unique identifier assigned to each input row and _key_out contains the particular _key_in assigned to the row that best represents all the rows in a fuzzy group. All rows in a fuzzy group will have the same _key_out value. The _score column is a value between 0.0 and 1.0 that describes the textual similarity between a given input row and the row selected to be the canonical value.
|
|
OLE DB destinations
|
The OLE DB destination, OLE DB Destination - Existing Customers, inserts rows into the ExistingCustomerLeads table.
The OLE DB destination, OLE DB Destination - Unique Customer Leads, inserts rows into the NewCustomerLeads table.
The OLE DB destination, OLE DB Destination - Duplicate Customer Leads, inserts rows into the DuplicateCustomerLeads table.
|
|
Execute SQL task
|
The Execute SQL task, Return Row Count for ExistingCustomersLeads table, runs a parameterized SQL statement. The statement returns a count for the number of rows in the ExistingCustomerLeads table where the _Similarity column has a value of 1. The SQL statement stores this row count as a Single row result set in a package variable, RowCount. The _Similarity column value is stored in a package variable that is mapped to the parameter in the SQL statement.
A _Similarity value of 1 means that there is an exact match between the customer record in the input table, CustomerLeads, and the customer record in the Lookup reference dataset.
|
|
Script task
|
The Script task, Display Row Count, displays the row count value for the ExistingCustomerLeads table, based on the value stored in the RowCount package variable.
|
|
File connection manager
|
The File connection manager, CreateTables.sql, connects to the file that contains the SQL the package uses.
|
|
OLE DB connection manager
|
The OLE DB connection manager, (local).AdventureWorks, connects to the AdventureWorks database on the local server.
|
The following table describes the data in the output tables.