
Controlling Fuzzy Matching Behavior
The Fuzzy Lookup transformation includes three features for customizing the lookup it performs: maximum number of matches to return per input row, token delimiters, and similarity thresholds.
The transformation returns zero or more matches up to the number of matches specified. Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.
The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared to each other.
The similarity thresholds can be set at the component and join levels. The join-level similarity threshold is only available when the transformation performs a fuzzy match between columns in the input and the reference table. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates. You specify the similarity threshold by setting the MinSimilarity property at the component and join levels. To satisfy the similarity that is specified at the component level, all rows must have a similarity across all matches that is greater than or equal to the similarity threshold that is specified at the component level. That is, you cannot specify a very close match at the component level unless the matches at the row or join level are equally close.
Each match includes a similarity score and a confidence score. The similarity score is a mathematical measure of the textural similarity between the input record and the record that Fuzzy Lookup transformation returns from the reference table. The confidence score is a measure of how likely it is that a particular value is the best match among the matches found in the reference table. The confidence score assigned to a record depends on the other matching records that are returned. For example, matching St. and Saint returns a low similarity score regardless of other matches. If Saint is the only match returned, the confidence score is high. If both Saint and St. appear in the reference table, the confidence in St. is high and the confidence in Saint is low. However, high similarity may not mean high confidence. For example, if you are looking up the value Chapter 4, the returned results Chapter 1, Chapter 2, and Chapter 3 have a high similarity score but a low confidence score because it is unclear which of the results is the best match.
The similarity score is represented by a decimal value between 0 and 1, where a similarity score of 1 means an exact match between the value in the input column and the value in the reference table. The confidence score, also a decimal value between 0 and 1, indicates the confidence in the match. If no usable match is found, similarity and confidence scores of 0 are assigned to the row, and the output columns copied from the reference table will contain null values.
Sometimes, Fuzzy Lookup may not locate appropriate matches in the reference table. This can occur if the input value that is used in a lookup is a single, short word. For example, helo is not matched with the value hello in a reference table when no other tokens are present in that column or any other column in the row.
The transformation output columns include the input columns that are marked as pass-through columns, the selected columns in the lookup table, and the following additional columns:
-
_Similarity, a column that describes the similarity between values in the input and reference columns.
-
_Confidence, a column that describes the quality of the match.
The transformation uses the connection to the SQL Server database to create the temporary tables that the fuzzy matching algorithm uses.