Friday 24 October 2014

Quality Stage Overview


The data that drives today’s business systems often comes from a variety of sources and disparate data structures. As organizations grow, they retain old data systems and augment them with new and improved systems. Thus, data becomes difficult to manage and use and a clear picture of a customer, product, or buying trend can be practically impossible to ascertain.

Data quality can be broadly defined from a qualitative viewpoint as information that you can trust or ensuring that data at a particular point in time is suitable for its purpose. A more specific quantitative definition can include the level of compliance that an enterprise’s data environment attains against independently defined rules that describe that data environment. The emphasis is on the business user’s perception of data quality—what is delivered to the user and the semantics of the data that is presented. In this case, data quality might well depend upon the data itself (such as correct data types, consistent formatting, retrievability, and usability) and the processes and applications that deliver this data to the business user.

The source of data quality issues is a lack of common standards on how to store data and an inconsistency in how the data is input. Different business operations are often very creative with the data values that they introduce into your application environments. Inconsistency across sources makes understanding relationships between critical business entities such as customers and products very difficult. In many cases, there is no reliable and persistent key that you can use across the enterprise to get all the information that is associated with a single customer or product.

Without high-quality data, strategic systems cannot match and integrate all related data to provide a complete view of the organization and the interrelationships within it. CIOs can no longer count on a return on the investments made in critical business applications. The solution calls for a product that can automatically re-engineer and match all types of customer, product, and enterprise data, in batch or at the transaction level in real time.

IBM Information Server addresses these requirements with an integrated software platform that provides the full spectrum of tools and technologies that are required to address data quality issues. These technologies include data profiling (IBM WebSphere Information Analyzer and IBM WebSphere AuditStage), data cleansing (IBM WebSphere QualityStage), and data movement and transformation (IBM WebSphere DataStage) as follows:

·         IBM WebSphere Information Analyzer (the focus of IBM WebSphere Information Analyzer and Data Quality Assessment) is a new module of IBM Information Server that represents the next generation data profiling and analysis tool. It is designed to help business and data analysts understand the content, quality, and structure of their data sources by automating the data discovery process. The IBM WebSphere Information Analyzer product module helps increase the productivity of data personnel and improve return on investment (ROI) and time to benefit of data-intensive projects.

·         IBM WebSphere DataStage integrates data across multiple, high-volume data sources and target applications. It integrates data on demand with a high performance parallel framework, extended metadata management, and enterprise connectivity.

·         IBM WebSphere QualityStage enables customers to use the following processes to make a significant impact on the data that drives an organization’s success.
Ø  Investigation of source data to understand the nature, scope, and detail of data quality challenges.
Ø  Standardization to ensure that data is formatted and conforms to organization-wide specifications, including name and firm standards as well as address cleansing and verification.
Ø  Matching of data to identify duplicate records within and across data sets.
Ø  Survivorship to eliminate duplicate records and create the “best record view” of data.


As briefly introduced, the five types of issues that generally occur within enterprise data stores are as follows:

  • Lack of information standards: Names, addresses, part numbers, and other data are entered in inconsistent ways, particularly across different systems. These differences make records look different even when they are actually the same.
  • Data surprises in individual fields: Data in a database is often misplaced, while in other cases some fields are used for multiple purposes.
  • Information is buried in free-form fields: In this case, valuable information is hidden away in text fields. Because these fields are difficult to query using SQL, this information is often not taken advantage of, although it most likely has value to the business. This type of problem is common in product information and help desk case records.
  • Data myopia (our term for the lack of consistent identifiers across different systems): Without adequate foreign-key relationships, it is impossible to get a complete view of information across systems.
  • Redundancy within individual tables: This issue is extremely common, where data is re-entered into systems because the data entry mechanism is not aware that the original record is already there. This issue is a common side effect of the lack of standards, but it is one of the worst data quality problems, because it links directly to costs and customer dissatisfaction.

IBM WebSphere QualityStage helps to identify and resolve all these issues for any type of data. It helps to ensure that systems deliver accurate and complete information to business users across the enterprise. IBM WebSphere QualityStage is a data re-engineering environment that is designed to help programmers, programmer analysts, business analysts, and others cleanse and
enrich data to meet business objectives and data quality management standards.

A process for re-engineering data can help accomplish the following goals:

  • Resolve conflicting and ambiguous meanings for data values
  • Identify new or hidden attributes from free-form and loosely controlled source fields
  • Standardize data to make it easier to find
  • Identify duplication and relationships among such business entities as customers, prospects, vendors, suppliers, parts, locations, and events
  • Create one unique view of the business entity
  • Facilitate enrichment of re-engineered data, such as adding information from vendor sources or applying standard postal certification routines

IBM WebSphere QualityStage data quality functions include:

·         Free-form text investigation that allows you to recognize and parse out individual fields of data from free-form text
Investigation provides the ability to determine the number and frequency of the unique values found in either single-domain or free-form columns:

Ø  For single-domain columns, the unique values can represent complete data values, partial data values (that is prefixes, suffixes, or substrings), simple data formats (alpha, numeric, blanks, or special characters), or combinations of values and simple data formats.
Ø  Free-form columns often contain multiple (usually related) data elements such as person names, business names, or postal addresses. For free-form columns, the unique values can represent individual tokens (words) that are determined by parsing the free-form data or lexical patterns providing context to help understand the free-form data.

·         Standardization that allows individual fields to be parsed and made uniform according to your own standards
Standardization provides the ability to normalize your information to defined standards. This incorporates the ability to parse free-form columns into single-domain data elements to create a consistent representation of the input data and to ensure data values conform to standard representation.

·         Address verification and correction that uses postal information to standardize, validate, and enrich address data
For addresses, standardization incorporates postal standards with delivered rule sets. It also prepares data elements for more effective matching.

·         Record linkage and matching that allows duplicates to be identified from individual sources, and common records across sources to be identified and linked
The central strength of IBM WebSphere QualityStage is its ability to match data from different records, even when it appears very different. IBM WebSphere QualityStage utilizes a statistical matching technique called Probabilistic Record Linkage that provides the highest optimization of each contributing data element to a match and an unlimited number of elements for the highest confidence in your information. This matching allows each individual field score to be summed to produce a final score that precisely measures the information content of the matching fields. That final score, or match weight, is an accurate gauge of the probability of a match.
Because of this ability to match records, IBM WebSphere QualityStage is a key enabler of creating a single view of customers or products.

·         Survivorship that allows the best data from across different systems to be merged into a consolidated record
Survivorship is the process of aggregating or consolidating a group of records into a single unique representation of the data: a single consolidated record that retains the best-of-breed information from the individual records. Survivorship incorporates a flexible process of assessing data at the record or individual field level based on specific survivorship rules. These rules can include completeness of data, frequency of data, or logical conditions.
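As an illustration only, the following Python sketch (not QualityStage code; the column names, rule names, and sample records are invented for the example) shows how simple survivorship rules such as completeness, frequency, and most-recent-source could consolidate a group of matched records into a single best record:

# Minimal sketch (not QualityStage's implementation): consolidating a group of
# duplicate records into one "best record" using simple survivorship rules.
from collections import Counter

def survive(records, rules):
    """records: list of dicts that matching has grouped as the same entity.
    rules: {column: rule_name} where rule_name is 'longest' (completeness),
    'most_frequent', or 'latest' (assumes a 'last_updated' column)."""
    best = {}
    for column, rule in rules.items():
        values = [r[column] for r in records if r.get(column)]
        if not values:
            best[column] = None
        elif rule == "longest":            # completeness: keep the fullest value
            best[column] = max(values, key=len)
        elif rule == "most_frequent":      # frequency: keep the consensus value
            best[column] = Counter(values).most_common(1)[0][0]
        elif rule == "latest":             # logical condition: most recent source wins
            best[column] = max(records, key=lambda r: r["last_updated"])[column]
    return best

group = [
    {"name": "J. Smith",   "phone": "5551234", "last_updated": "2013-01-10"},
    {"name": "John Smith", "phone": "5551234", "last_updated": "2014-06-02"},
    {"name": "John Smith", "phone": "",        "last_updated": "2012-11-30"},
]
print(survive(group, {"name": "longest", "phone": "most_frequent"}))
# {'name': 'John Smith', 'phone': '5551234'}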

These capabilities of IBM WebSphere QualityStage support the re-engineering of data to meet complex business goals and challenges.


IBM WebSphere QualityStage is built around a services-oriented vision for structuring data quality tasks that are used by many new enterprise system architectures. As part of the integrated IBM Information Server platform, it is supported by a broad range of shared services and benefits from the reuse of several suite components.
IBM WebSphere QualityStage (and IBM WebSphere DataStage) share the same infrastructure for importing and exporting data, for designing, deploying, and running jobs, and for reporting. The developer uses the same design canvas to specify the flow of data from preparation to transformation and delivery.
Multiple discrete services give IBM WebSphere QualityStage the flexibility to match increasingly varied customer environments and tiered architectures.


With reference to the figure above, the following suite components are shared between IBM WebSphere QualityStage and IBM Information Server:
  • Unified user interface
The IBM WebSphere DataStage and QualityStage Designer provides a development environment. The IBM WebSphere DataStage and QualityStage Administrator provides access to deployment and administrative functions. IBM WebSphere QualityStage is integrated tightly with IBM WebSphere DataStage and shares the same design canvas, which enables users to design jobs with data transformation stages and data quality stages in the same session.
  • Common services
IBM WebSphere QualityStage uses the common services in IBM Information Server for logging and security. Because metadata is shared “live” across tools, you can access services such as impact analysis without leaving the design environment. You can also access domain-specific services for enterprise data cleansing such as investigate, standardize, match, and survive from this layer.
  • Common repository
The repository holds data to be shared by multiple projects. Clients can access metadata and results of data analysis from the respective service layers.
  • Common parallel processing engine
The parallel processing engine addresses high throughput requirements for analyzing large quantities of source data and handling increasing volumes of work in decreasing time frames.
  • Common connectors
Any data source that is supported by IBM Information Server can be used as input to an IBM WebSphere QualityStage job by using connectors. The connectors also enable access to the common repository from the processing engine.

IBM WebSphere QualityStage uses one or more of the stages shown in the following figure to improve an organization’s data quality.


The main functions are investigate, standardize, match, and survive as follows:
  • Investigate source data to understand the nature, scope, and detail of data quality challenges.
  • Standardize data to ensure that it is formatted and conforms to organization-wide specifications including name and firm standards as well as address cleansing and verification. Investigate can be used to assess the effectiveness of Standardize. The standard rules can then be augmented to improve the standardized data.
  • Match data to identify duplicate records within and across data sets.
  • Survive appropriate data by eliminating duplicate records and creating the best record view of data.

IBM WebSphere QualityStage comprises a set of stages, a Match Designer, and related capabilities that provide a development environment for building data-cleansing tasks called jobs. The IBM WebSphere QualityStage components include the Match Designer for designing and testing match passes and the IBM WebSphere QualityStage stage types, including Investigate, Standardize, Match Frequency, Reference Match, Unduplicate Match, and Survive. (This is not a
comprehensive list.)
When an IBM WebSphere QualityStage job is compiled, the Designer (client) transfers the logic (developed on the client) to the server. On execution, the IBM WebSphere QualityStage load modules perform the actual data engineering tasks such as investigate, standardize, match, and survive. These load modules operate through control parameter statements that are passed to the modules during processing. Each IBM WebSphere QualityStage operation is referred to as a stage. Complex data engineering tasks can be performed by linking individual stages together in a job comprising multiple job steps or stages.

IBM WebSphere QualityStage comes with customizable rules to prepare complex information about your business entities for a variety of transactional, operational, and analytical purposes.

IBM WebSphere QualityStage automates the conversion of data into verified standard formats including the use of probabilistic matching, in which variables that are common to records (such as the given name, date of birth, or gender) are matched when unique identifiers are not available.

IBM WebSphere QualityStage components include the Match Designer, for designing and testing match passes, and a set of data-cleansing operations called stages. Information is extracted from the source system and then measured, cleansed, enriched, consolidated, and loaded into the target system. At run time, data cleansing jobs typically consist of the following sequence of
stages:
·         Standardize stage parses free-form or fixed-format columns into single-domain data elements to create a consistent representation of the input data. This stage ensures that each data element has the same content and format and also standardizes spelling formats and abbreviations.
·         Match stages ensure data integrity by linking records from one or more data sources that correspond to the same customer, supplier, or other entity. Matching can be used to identify duplicate entities that are caused by data entry variations or account-oriented business practices. For example:
Ø  Unduplicate Match jobs group records into sets (within a single data set) that have similar attributes.
Ø  Reference Match stage matches reference data to source data between two data sets.
The probabilistic matching capability and dynamic weighting strategies of IBM WebSphere QualityStage help you to create high-quality, accurate data and identify core business information consistently, such as customer, location, and product throughout the enterprise. IBM WebSphere QualityStage standardizes and matches any type of information.
·         Survive stage ensures that the best available data survives and is consequently prepared for the target correctly.

Business intelligence packages that are available with IBM WebSphere QualityStage provide data enrichment that is based on business rules. These rules can resolve issues with common data quality problems such as invalid address fields throughout multiple geographies. The following packages are available:
  • Worldwide Address Verification and Enhancement System (WAVES) matches address data against standard postal reference data that helps you verify address information for 233 countries and regions.
  • Multinational geocoding is used for spatial information management and location-based services by adding longitude, latitude, and census information to location data.
  • Postal certification rules provide certified address verification and enhancement to address fields to enable mailers to meet the local requirements to qualify for postal discounts.

By ensuring data quality, IBM WebSphere QualityStage reduces the time and cost to implement CRM, business intelligence, ERP, and other strategic customer-related IT initiatives.

Some of the scenarios for data cleansing include:

  • Obtaining one view of households
Knowing that your customers (stored in different source systems) belong to the same household (share the same mailing address) can facilitate marketing and mail campaigns.
  • Obtaining a consolidated view of an entity
Knowing the total quarterly sales from the prescriptions of one doctor can help pharmaceutical companies market to that doctor effectively. This information needs to be extracted and consolidated from existing systems with different standards and formats, including information buried in free-form fields, incorrect data values, discrepancies between field metadata and actual data in the field, and duplicates.
  • Obtaining a single real-time view of a customer
Customer Service Representatives (CSR) and customers using self-service portals require a single view of all their health, dental, and benefit plans with the insurance company. Typically, these plans are implemented in different systems, and the data resides in different sources in different formats.

IBM WebSphere QualityStage performs the preparation stage of enterprise data integration, often referred to as data cleansing. IBM WebSphere QualityStage takes advantage of the source systems analysis that is performed by IBM WebSphere Information Analyzer and supports the transformation functions of IBM WebSphere DataStage. Working together, these products automate what was previously a manual or neglected activity within a data integration effort—data quality assurance.


Understanding your data is a necessary precursor to cleansing and consolidation. You can use a data profiling product such as IBM WebSphere Information Analyzer to perform a comprehensive analysis across many columns and tables to create a direct input into the cleansing and matching process by using shared metadata. The Investigate stage supports the initial level of column or domain analysis and extends it through the analysis of free-form text fields to create this input.

The Investigate stage shows the actual condition of data in existing sources and identifies and corrects data problems before they corrupt new systems.

Investigation parses and analyzes free-form fields, creating patterns to reveal field formats; counts unique values; and classifies, or assigns a business meaning to, each occurrence of a value within a field. The Investigate stage also allows users to validate field contents in a specific column or domain. The actual full-volume data files must be investigated to ensure that as many data anomalies as possible are identified. The Investigate stage involves performing one or more of the following functions:
  • Parsing
  • Classifying
  • Creation of Word or Pattern Distributions

Specific objectives of the investigation process include, but are not limited to the following:
  • Investigate data values and anomalies to be considered for special handling during design or development. Any recommended special handling is documented in a Business Requirements Document. Examples include:
      Out of range, default (such as all nines in the telephone number) or unique values.
      Free-form text that requires parsing to improve “addressability” to key components for optimal matching.
      Values that do not match metadata labels and require new field definitions for matching and load format (such as DBA, C/O, Attention text, drivers license numbers, or other comment information in a name or address field).
      Values that overlap adjacent fields and thus require a re-alignment of field content. For example, name information or city, state, and postal code that extend into street and mailing address fields.
      Invalid formats that can cause conversion problems such as alphanumeric in character or numeric only fields.
      Blank or missing data.
      Special character and punctuation handling.
  • Discover additional tokens (key words) to add to the classification table, such as name prefixes, name suffixes, street types, unit types, and business words.
  • Verify the usefulness of data fields for the purpose of entity matching (record linkage). For example, does a Tax ID field contain primarily blank or invalid values, making it an unreliable field to use in the identification of a unique customer?
  • Validate business requirements or assumptions about the data. Results can lead to changes in the business requirements and can assist in driving the technical solution.

The Investigate stage takes a single input, which can be a link from any database connector that is supported by IBM WebSphere DataStage, from a flat file or data set or from any processing stage. Inputs to the Investigate stage can be fixed length or variable.

Use of investigation (or data profiling) is critical to assess how data can be used for further steps of standardization, matching, and survivorship or consolidation. Data that is used in these additional steps is stored in two types of fields—single-domain and multiple-domain.

During the investigation process, it is important to document fields that require parsing and standardization, standardization only, or no standardization at all. You must also consider whether data content will be used for matching, survivorship, pass-through (static data) or whether it will not be carried forward into the new system or target database.

·         Static fields
Ø  Can be used in the matching process.
Ø  Might require some standardization of format and default or out of range values.
Ø  Typically re-joined with re-engineered data prior to matching or load formatting.
Character Investigate is usually sufficient to evaluate these fields.

·         Single-domain fields
Ø  Categorized as either Entity Identifiers or Entity Clarifiers:
§  Entity Identifiers
Examples include ZIP Code (U.S. postal code), Social Security Number (SSN), TIN, and telephone numbers. These can serve as critical match fields if their quality is high. These fields typically have a specific format (such as 9-digit numeric for SSN, or Alpha-Numeric-Alpha-Numeric-Alpha-Numeric for Canadian postal code). Character Discrete Investigate is commonly used, with the mask set to all Cs.
§  Entity Clarifiers
Examples include name prefix, gender, and marital status. This type of field usually has a limited number of known values. Character Discrete Investigate is commonly used, with the mask set to all Ts.
Ø  Typically used for a single purpose.
Ø  Can exist in several common formats.
Ø  No parsing is usually necessary.
Ø  Can serve as strong blocking and match fields if quality is high.
Ø  Standardize stage processing usually requires only removal of special characters, creation of a common default value and format, and concatenation of data.
·         Multiple-domain fields
Ø  Typically these are large free-form fields such as multiple Address fields.
Ø  Comprised of multiple single-domain values, such as house number, street name, unit, and first name.
Ø  Intelligent parsing is required on these fields for optimal matching.
Ø  Can serve as good match fields and multiple-component blocking candidates.
Ø  Standardization requires removal of most special characters and strong data typing.
Ø  Occasionally contain business word interference, such as reference numbers, comments, and so forth.
Ø  Can contain additional relationships such as Attn, C/O, DBA, and Trustee.
Word Investigate is commonly used for multi-domain fields.
The Character Investigate option parses a single-domain field (one that contains one data element or token, such as SSN, telephone number, date, or ZIP code) to analyze and classify data. Character investigation provides you with the option of investigating multiple columns individually (Character Discrete) or integrated as one unit of data (Character Concatenate).

The investigation process generates a column frequency report that presents information about frequency values for a specific column.
  • Character Discrete Investigate

The Character Discrete Investigate analyzes multiple single-domain columns. This option allows you to investigate a large number of columns with little effort. A Character Discrete Investigate produces a column frequency report that treats each column as a separate token for frequency count and analysis.

  • Character Concatenate Investigate

The Character Concatenate Investigate option performs cross-column correlations between multiple columns to determine relationships. With this option, you select two or more columns from anywhere in the record (the columns do not have to be contiguous) to be investigated as a single data column. To create the pattern analysis, the tokens are concatenated with no spaces between the tokens.

The Character Investigate process generates a column frequency report that presents information about frequency values for a specific column. A pattern report is prepared for all types of investigations and displays the count, percentage of data that matches this pattern, the generated pattern, and sample data. This output can be presented in a wide range of formats to conform to
standard reporting tools.

For character investigations, you use column masks to select the characters that are included in the frequency count or pattern analysis and the characters that are displayed as part of samples in the pattern report. You apply a mask symbol to each character in the selected columns. You can use the following mask characters:
  • Mask C
Displays the character and includes it in the frequency count and pattern analysis. Use the C column mask to inspect the values in your columns and to certify that false data does not appear in a column such as 99999 for a postal code or 111111111 for a national ID number.
  • Mask T
Displays the type of character and includes the character in the frequency count and pattern analysis. Use the T column mask when you want to inspect the type of data in a character position, for example, 10-digit telephone numbers such as nnn-nnn-nnnn or (nnn)-nnn-nnnn.
  • Mask X
Excludes the character from the frequency count and the pattern analysis; however, it includes the character in the sample data. Use the X column mask to include data from the column in the sample but not as a token or part of a token for investigation.

For example, you set up an Investigate job to analyze the first two characters of a ZIP code to determine the frequency distribution based on a state (each state is defined by the first two characters of the ZIP code). You set the column mask for the ZIP code to CCXXX. The qsInvPattern column of the pattern report displays only the first two characters. The frequency count is based on the number of records in the file that start with the first two characters of the ZIP code. In the qsInvSample column of the pattern report, you will see all five characters of the ZIP code.

You can also use the X column mask with the Character Concatenate option to specify one or more columns to appear as part of the sample only.
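The following Python sketch is a rough illustration (not the product’s implementation) of how the C, T, and X column masks could drive a pattern report; the sample ZIP codes and the helper function are invented for the example:

# Illustrative sketch only: how C/T/X column masks could drive a pattern report.
# 'C' keeps the character, 'T' keeps its type (a=alpha, n=numeric, b=blank),
# 'X' drops it from the pattern but leaves it visible in the sample.
from collections import Counter

def apply_mask(value, mask):
    pattern = []
    for ch, m in zip(value, mask):
        if m == "C":
            pattern.append(ch)
        elif m == "T":
            pattern.append("n" if ch.isdigit() else "a" if ch.isalpha() else "b" if ch == " " else ch)
        # 'X' contributes nothing to the pattern
    return "".join(pattern)

zips = ["10001", "10128", "90210", "94105", "11111"]
mask = "CCXXX"                      # first two characters only, as in the ZIP code example above
counts = Counter(apply_mask(z, mask) for z in zips)
for pattern, count in counts.most_common():
    sample = next(z for z in zips if apply_mask(z, mask) == pattern)
    # loosely: qsInvPattern, qsInvCount, qsInvPercent, qsInvSample
    print(pattern, count, f"{count/len(zips):.0%}", sample)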

Character Investigate generates reports with the following fields:
  • qsInvColumnName identifies the name of the column that is investigated.
  • qsInvPattern displays the character and includes the character in the frequency count and pattern analysis.
  • qsInvSample shows one or more samples of the content of this column. The number to be displayed is configurable.
  • qsInvCount shows the actual number of occurrences of the value in the qsInvPattern column.
  • qsInvPercent shows the percentage of occurrences of the value in the qsInvPattern column relative to the total number of records in this file.
The Word Investigation option parses free-form data fields into individual tokens and analyzes them to create patterns. For example, to create the patterns in address data, the Word Investigation option uses a set of rules for classifying personal names, business names, and addresses. The Word Investigation option also provides frequency counts on the tokens.

The Investigate stage provides pre-built rule sets for investigating patterns on names and postal addresses for a number of different countries. For example, for the U.S., the Investigate stage parses the following components:
  • USPREP
Name, address, and area data.
  • USNAME
Individual and organization names.
  • USADDR
Street and mailing addresses.
  • USAREA
City, state, ZIP code, and so on.

The test field “123 St. Virginia St.” is analyzed in the following way:
1.     Field parsing would break the address into the individual tokens of 123, St., Virginia, and St.
2.     Lexical analysis determines the business significance of each piece:
a.     123 = number
b.    St. = street type
c.     Virginia = alpha
d.    St. = street type
3.     Context analysis identifies the various data structures and content as 123, St. Virginia, St.
a.     123 = House number
b.    St. Virginia = Street address
c.     St. = Street type
When you specify Word Investigate, you select the rule by which you want the columns investigated and then select one or more columns to examine.
Word Investigate parses the free-form data column into individual elements, or tokens, each of which is a word, a number, or a mixture, separated by one or more spaces or special characters. The process compares each token with classified tokens in the Classification table for that rule set. If the token matches a word in the Classification table, Investigate assigns the class for that token to represent it in the pattern. For tokens that do not match any classified token, Investigate examines the pattern and assigns classes as shown in the following table.
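As a hedged illustration of this parse-and-classify step, the following Python sketch uses a tiny, invented classification table; real rule sets ship with much larger tables (street types, directionals, unit types, and so on) and more sophisticated parsing:

# Hypothetical sketch of Word Investigate's parse-and-classify step.
import re

CLASSIFICATION = {"ST": "T", "STREET": "T", "AVE": "T", "N": "D", "SW": "D"}  # T=street type, D=directional

def classify(token):
    key = token.strip(".").upper()
    if key in CLASSIFICATION:
        return CLASSIFICATION[key]
    if key.isdigit():
        return "^"          # unclassified numeric
    if key.isalpha():
        return "?"          # unclassified alpha
    return "~"              # leftover special characters

def investigate(free_form):
    tokens = [t for t in re.split(r"[\s,]+", free_form) if t]
    pattern = "".join(classify(t) for t in tokens)
    return tokens, pattern

print(investigate("123 St. Virginia St."))
# (['123', 'St.', 'Virginia', 'St.'], '^T?T')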
Word Investigate generates two reports (Token report and Pattern report) as follows:

Token report
  • qsInvCount indicates the number of times this token was encountered across the entire input data.
  • qsInvWord identifies the individual token or word value that is found inside the selected input columns.
  • qsInvClassCode identifies the classification of the token that is based on the selected rule set classification table. Unclassified tokens, if selected, get a question mark “?” for alpha or a caret “^” for numeric.

Pattern report
  • qsInvColumnName identifies the name of the column that is investigated.
  • qsInvPattern displays the character and includes the character in the frequency count and pattern analysis.
  • qsInvSample shows one or more samples of the content of this column. The number to be displayed is configurable.
  • qsInvCount shows the actual number of occurrences of the value in the qsInvPattern column.
  • qsInvPercent shows the percentage of occurrences of the value in the qsInvPattern column relative to the total number of records in this file.


Standardizing data involves moving free-form data (columns that contain more than one data entry) into fixed columns and manipulating data to conform to standard conventions. The process identifies and corrects invalid values, standardizes spelling formats and abbreviations, and validates the format and content of the data.

The Standardize stage builds on the interpretation of the data during the Investigate stage. To standardize the data, it uses the same prebuilt tables and rule sets that the Investigate stage used to investigate the data. Standardize reformats data from multiple systems and creates a consistent data presentation with fixed and discrete columns, according to your company requirements.

The Standardize stage processes the data with the following outcome:
  • Creates fixed-column, addressable data
  • Facilitates effective matching
  • Enables output formatting

The Standardize stage uses the data content and placement within the record context to determine the meaning of each data element. To correctly parse and identify each element or token and place them in the appropriate column in the output file, the Standardize stage uses rule sets to support specific conditioning and standardization requirements. These can be standard rule sets that are designed to comply with the name (individual and business) and address conventions of a specific country. Alternatively, they can be custom rule sets to focus on challenges such as product or part descriptions, consistent metadata definitions, insurance claim data, or other industry specific challenges. The Standardize rule sets can assimilate the data and append additional information from the input data, such as gender.

These rule sets are the same as those used in the Investigate stage. You can run the rules as they are shipped, or you can customize them to process obscure data not covered by the standard rule sets. You can also create your own custom rule sets from scratch.

Standardize ensures that each data type has the same content and format. Standardized data is important for the following reasons:
  • Effectively matches data
  • Facilitates a consistent format for the output data

The Standardize stage parses free-form and fixed-format columns into single-domain columns to create a consistent representation of the input data.
  • Free-form columns contain alphanumeric information of any length as long as it is less than or equal to the maximum column length defined for that column.
  • Fixed-format columns contain only one specific type of information, such as only numeric, character, or alphanumeric, and have a specific format.

The following rules are examples of the types of rules that the Standardize stage might support depending on the rule set in use:
  • Assign data to its appropriate metadata fields. Standardize ensures that the data within a specific field is being used for the business purpose defined in the metadata. For example, credit records might have a driver’s license number in the address line 1 field and a customer’s address in the address line 2 field. To synchronize data with its appropriate metadata field, the driver’s license number can be moved to a separate field for the driver’s licenses.
  • Decompose free-form fields into single component fields. For example, the customer’s address can be decomposed into House number, Street name, PO Box, Rural Route, and other smaller component fields.
  • Identify new data fields based on the underlying data. New fields, that do not exist on input, such as Gender Flag, Individual/Business Record Indicator, or Nickname, can be populated by the application, based on table or file look-ups.
  • Break up records storing multiple entities. It might be necessary to create a separate record for each person or entity that is represented on a single input record (such as joint accounts). A separate record allows for a more complete linkage of all entities in the input files.
  • Exclude records that do not meet minimum criteria. Based on defined business rules, the application can be required to exclude or reject records that do not meet basic requirements (for example, records that do not contain a name or address).

The Standardize stage takes a single input, which can be a link from any database connector supported by IBM WebSphere DataStage, a flat file or data set, or any processing stage. It is not necessary to restrict the data to fixed-length columns. Standardize will accept all basic data types (non-vector or non-aggregate) other than binary.

The Standardize stage has only one output link. This link can send the raw input and the standardized output to any other stage. The Standardize stage creates multiple columns that you can send along with the input columns to the output link. Any columns from the original input can be written to the output along with additional data created by the Standardize stage based on the input data (such as SOUNDEX phonetic or NYSIIS codes). The Match stage and other stages can use the output from the Standardize stage—you can use any of the additional data for blocking and matching columns in Match stages.
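The following Python sketch is a deliberately simplified illustration of the kind of parsing a Standardize rule set performs on a free-form address line; the output column names and the abbreviation table are invented and far smaller than a real, table-driven rule set such as USADDR:

# Simplified sketch: parse a free-form address line into single-domain columns
# and normalize abbreviations. Real rule sets are table-driven and far more complete.
import re

STREET_TYPES = {"ST": "ST", "STREET": "ST", "AVE": "AVE", "AVENUE": "AVE", "RD": "RD", "ROAD": "RD"}

def standardize_address(line):
    tokens = [t.strip(".,").upper() for t in re.split(r"\s+", line) if t.strip(".,")]
    out = {"HouseNumber": None, "StreetName": None, "StreetType": None}
    if tokens and tokens[0].isdigit():
        out["HouseNumber"] = tokens.pop(0)        # leading numeric token = house number
    if tokens and tokens[-1] in STREET_TYPES:
        out["StreetType"] = STREET_TYPES[tokens.pop()]   # trailing street type, standardized
    out["StreetName"] = " ".join(tokens) or None
    return out

print(standardize_address("123 Main street"))
# {'HouseNumber': '123', 'StreetName': 'MAIN', 'StreetType': 'ST'}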
The Standardize stage has three categories of rule sets:

  • Domain Pre-processor

For a specific country, the Standardize stage identifies and assigns a data domain to the name, address, and area columns in each record. You can use the output from this file as the input to the country-appropriate Domain-Specific rule sets.

There is one rule set for each country.

These rule sets evaluate the mixed-domain input from a file for a specific country. Domain-preprocessor rule sets follow a naming convention that starts with a country abbreviation and ends with prep (an abbreviation for preprocessor) such as USPREP for the U.S. and GBPREP for Great Britain.

Note: These rule sets do not perform standardization but parse the columns in each record and each token into one of the appropriate domain-specific column sets, which are Name, Area, or Address.

The domain-preprocessor rule sets do not assume a data domain with a column position. You need to delimit every column or group of columns with literals, as shown in the Standardize Rule Process window in the following figure.
·         Domain Specific

These rule sets evaluate the domain-specific input from a file for a specific country. There are three domain-specific rule sets for each country as follows:

Ø  Name including individual names, organization names, attention instructions, and secondary names.
Ø  Address including unit number, street name, type, and directionals.
Ø  Area including cities, states, region, and other locale information.

This category creates consistent and industry-standard data storage structures, and matching structures such as blocking keys and primary match keys.

·         Validation

The Validation rule sets are used to standardize common business data including Date, Email Address, Phone Number, and Taxpayer ID/Social Security Number. These rules are configured for U.S. formats.

The rule sets output two types of columns as follows:
Ø  Business Intelligence columns which help focus on critical information contained within output data. For example, valid data (VD) and valid flag (VF) columns.
Ø  Reporting/Error columns which provide details on the data value that fails validation and the reason code detailing the validation error.

There are four rule sets: VDATE, VEMAIL, VPHONE, and VTAXID.

·         VDATE

The VDATE rule set validates dates that include day, month, and year. The following information pertains to the VDATE validation rule set:
¾      Punctuation, such as hyphens or slashes, is removed during the parsing step.
¾      The rule set outputs two types of columns: Business Intelligence columns and Reporting/Error columns.
¾      There are no significant default Classification table entries used with this rule set.
¾      The standard output format is CCYYMMDD.

Specific date formats are required for the input data. Expected input date formats include any of the following:
¾      mmddccyy (09211991)
¾      mmmddccyy (OCT021983)
¾      mmmdccyy (OCT21983)
¾      mmddccyy (04101986)
¾      mm/dd/ccyy (10/23/1960)
¾      m/d/ccyy (1/3/1960)
¾      mm/d/ccyy (10/3/1960)
¾      m/dd/ccyy (1/13/1960)
¾      mm-dd-ccyy (04-01-1960)
¾      m-d-ccyy (1-3-1960)
¾      m-dd-ccyy (1-13-1960)
¾      ccyy-mm-dd (1990-10-22)

Examples of the output string corresponding to a particular input format are as
follows:
¾      Input format 1990-10-22; output result 19901022
¾      Input format 1/13/1960; output result 19600113
¾      Input format OCT021983; output result 19831002

If a data value passes validation, this rule set populates the following two Business
Intelligence column values:
¾      The valid date {VD} data column which is populated with the eight numeric bytes
¾      The valid flag {VF} field which is populated with the value T

If a data value fails validation, this rule set populates the Reporting Error fields
“Invalid Data” (which has the invalid data value) and “Invalid Reason” which has a
code such as IF (invalid input format), IM (invalid month), IT (invalid table such as a
date of 11111111), MM (invalid numeric month), FB (invalid day of February—leap
year), M0 (invalid day for months with 30 days), and M1 (invalid day for months with
31 days).


·         VEMAIL

The VEMAIL rule set identifies the format, components and completeness of e-mail
addresses as follows:
¾      All e-mail addresses should have a user, domain, and top-level qualifier.
¾      Punctuation such as hyphens (-), at signs (@), and periods (.) are used as key delimiters during the parsing step.
¾      The default classification table for this rule set contains common domain (for instance, ORG, COM, EDU, GOV, and so forth.) and sub-domain qualifiers (for example, country and state codes).

The parsing parameters parse the address into multiple tokens (for example, the e-mail address John_Smith@abccorp.com has three tokens: John_Smith, abccorp, and com).
If a data value is validated, this rule set populates the following Business Intelligence
fields:
¾      User {US}
¾      Domain {DM}
¾      Top-level Qualifier {TL}
¾      URL {RL}
If a data value fails validation, this rule set outputs the Reporting Error fields
“Unhandled Data” (contains the unhandled data value) and “Unhandled Patterns”
(contains the unhandled pattern).


·         VPHONE

The VPHONE rule set validates the value and standardizes the format of a U.S. telephone number. Punctuation such as hyphens (-) and parentheses ( ) is removed during the parsing step.

If the data value passes validation, this rule set outputs the following Business
Intelligence field values:
¾      Valid Phone Number {VD} field which is populated with the numeric telephone number
¾      Phone Number Extension {VX} field which is populated with the extension number
¾      Valid flag {VF} field which is populated with T
If the data value fails any one of the validation requirements the “Invalid Data” and
the “Invalid Reason” fields are populated. Invalid Reason has a code such as IL
(invalid length—main telephone without extension must be 7 or 10 bytes.), IT (invalid
value), and IP (invalid pattern or format).


·         VTAXID

The VTAXID rule set validates the value and standardizes the format of a tax ID or
national ID number as follows:
Punctuation such as hyphens (-) is removed during the parsing step.

There are no significant default Classification table entries used with this rule set.
If a data value passes the listed criteria, it is considered a valid value, and the rule set outputs two Business Intelligence field values:
¾      TAX_ID/SSN valid data {VD} field which is populated with the nine numeric bytes
¾      Valid flag {VF} field which is populated with the value T.
If the data value fails any one of the validation requirements the “Invalid Data” and the “Invalid Reason” fields are populated. Invalid Reason codes include IP (data value did not contain nine, and only nine, numeric characters), IT (data value was found on the Invalid Data table), and Z3 (first three numeric characters are all zeros).
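As an illustration of the validation pattern that these rule sets share, the following Python sketch mimics VTAXID-style checks and reason codes (IP, IT, Z3); the invalid-value table and the function itself are invented stand-ins, not the rule set’s actual logic:

# Rough sketch of the kind of checks a VTAXID-style validation rule performs.
INVALID_TABLE = {"000000000", "111111111", "999999999"}   # illustrative entries only

def validate_taxid(value):
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 9 or not digits.isdigit():
        return {"InvalidData": value, "InvalidReason": "IP"}   # not exactly nine numerics
    if digits in INVALID_TABLE:
        return {"InvalidData": value, "InvalidReason": "IT"}   # found on the invalid data table
    if digits.startswith("000"):
        return {"InvalidData": value, "InvalidReason": "Z3"}   # first three digits all zeros
    return {"VD": digits, "VF": "T"}                           # valid data + valid flag

print(validate_taxid("123-45-6789"))   # {'VD': '123456789', 'VF': 'T'}
print(validate_taxid("000-12-3456"))   # {'InvalidData': '000-12-3456', 'InvalidReason': 'Z3'}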



Data matching finds records in a single data source or independent data sources that refer to the same entity (such as a person, organization, location, product, or material) even if there is no predetermined key.

To increase its usability and completeness, data can be consolidated or linked (matched) along any relationship, such as a common person, business, place, product, part, or event. You can also use matching to find duplicate entities that are caused by data entry violations or account-oriented business practices.
During the data matching stage, IBM WebSphere QualityStage takes the following actions:

  • Identifies duplicate records (such as customers, suppliers, products, or parts) within one or more data sources
  • Provides householding for individuals (such as a family or group of individuals at a location) and householding for commercial entities (multiple businesses in the same location or different locations)
  • Enables the creation of match groups across data sources that might or might not have a predetermined key

There are two types of match stage:
  • Unduplicate match locates and groups all similar records within a single input data source. This process identifies potential duplicate records, which might then be removed.
  • Reference Match identifies relationships among records in two data sources.

Matching is a two-step process: first you block records and then you match them.
Blocking provides a method of limiting the number of pairs to examine. When you partition data sources into mutually-exclusive and exhaustive subsets and only search for matches within a subset, the process of matching becomes manageable.

Basic blocking concepts include:

  • Blocking partitions the sources into subsets that make computation feasible.
  • Block size is the single most important factor in match performance.
  • Blocks should be as small as possible without causing block overflows. Smaller blocks are more efficient than larger blocks during matching.

To understand the concept of blocking, consider a column that contains age data. If there are 100 possible ages, blocking partitions the source data into 100 subsets. The first subset is all people with an age of zero, the next is people with an age of 1, and so on. These subsets are called blocks.

If the age values are uniformly distributed, 10 records out of each 1000-record source contain data for people of age 0, 10 records for people of age 1, and so on.

The pairs of records to be compared are taken from records in the same block:
  • The first block consists of all people of age 0 on each data source. This value is 10 times 10 or 100 record pairs.
  • The second block consists of all people on each data source with an age of 1.

When the process is complete, you have compared 100 (blocks) x 100 (pairs in a block) = 10,000 pairs, rather than the 1,000,000 record pairs that are required without blocking.
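The following Python sketch illustrates this blocking arithmetic on two invented 1000-record sources; it is not QualityStage code, just a way to see the candidate pair counts:

# Sketch of the blocking arithmetic described above: partition two 1000-record
# sources by a blocking variable and count the candidate pairs.
from collections import defaultdict
from itertools import product
import random

random.seed(0)
source_a = [{"id": i, "age": random.randint(0, 99)} for i in range(1000)]
source_b = [{"id": i, "age": random.randint(0, 99)} for i in range(1000)]

def build_blocks(records, key):
    blocks = defaultdict(list)
    for r in records:
        blocks[r[key]].append(r)
    return blocks

blocks_a, blocks_b = build_blocks(source_a, "age"), build_blocks(source_b, "age")
pairs = sum(len(blocks_a[k]) * len(blocks_b[k]) for k in blocks_a if k in blocks_b)
print(pairs)            # roughly 10,000 pairs instead of 1,000,000 without blocking

# Only pairs inside the same block ever reach the matcher:
candidate_pairs = [(a["id"], b["id"]) for k in blocks_a if k in blocks_b
                   for a, b in product(blocks_a[k], blocks_b[k])]
print(len(candidate_pairs) == pairs)   # True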

You can also combine multiple blocking variables into a single block for a single pass. For example, blocking on age and gender divides the sources into sets of 0-year-old males, 0-year-old females, 1-year-old males, 1-year-old females, and so on.

Blocking creates a set of records that all have the same values in the blocking variables. Records in different blocks (which means they do not have the same values in the blocking variables) are classified as unmatched automatically. If you believe that the unmatched records might still include matches, then a second pass can be defined with the new blocking variables to look for matches in the newly constituted blocks.
Blocking strategies are used to limit the number of records to compare. Smaller blocks are many times more efficient than larger blocks. Use restrictive blocking schemes in the first pass with blocking variables (such as SSN) that are expected to match records definitively. Unmatched records can then be subject to less restrictive blocking variables such as first and last name. Thus
progressively, fewer and fewer unmatched records can be subject to matching in subsequent passes that have less and less restrictive blocking variables. The reason for having multiple passes is to provide for cases where the data in the blocking variables is error-prone.

The matcher computes composite weights for all pairs in the set. IBM WebSphere QualityStage then finds the mathematically optimal sets of matches and duplicates on each source.

To do this, the matcher maintains a matrix of weights that can be optimized. This matrix is size limited so the number of records in a set is limited. If the matrix size is exceeded, all records in the set are skipped and must be handled by a subsequent matching pass.

  • Avoiding block overflow
To avoid block overflow, the number of records to be compared in one block must not exceed system memory on the server and the total number of records on either source in one block must not exceed the block overflow setting.
You can determine whether block overflow has occurred in the Match Designer tool’s Total Statistics panel, as shown in the following figure. This information is provided for each match pass.


·         Applying blocking variables

The best blocking variables are those with the largest number of values possible and the highest reliability. Base a blocking strategy on the following principles:
Ø  Make blocking sets as small as possible. A good size is 10 to 20 records per source. Efficiency becomes quite poor when blocks exceed 100 records per source.
Ø  Use multiple passes to define different sets.
Ø  Find easy matches quickly and then widen the net in your subsequent passes.
The match process only looks for matches in records that have the same values in the blocking columns.

After you create a block of records, the Match stage compares columns that you specified as matching columns (in the Match specification) to determine the best match for a record.

To determine whether a record is a match, the Match stage performs the following steps:

·         Calculates a weight for each comparison, according to the probability associated with each column. The Match stage uses two probabilities for each column:

Ø  The m probability reflects the error rate for the column.

You set the probability that a column agrees provided that the record pair is a match.

The m probability is one minus the error rate of the column. If a column in a sample of matched records disagrees 10% of the time, the m probability for this column is 1 - 0.1, or 0.9. The closer the m probability is to one, the more critical is a disagreement on the column. You can use a very high m probability to force columns that are very important to have a high penalty for disagreeing.

Ø  The u probability is the likelihood that a column agrees at random.

If a u probability is high, the weight is low. The u probability is the probability that the column agrees when the record pair does not match, that is, the probability that the column agrees at random.

For example, gender agrees at random 50% of the time. With a uniform distribution, there are four possible combinations, and gender agrees in two of the four, giving a 0.5 u probability. If you assign a high u probability, a low weight is calculated when a column agrees.

During the match run, the Match stage calculates a u probability for each comparison using the frequency distribution statistics for the column, described in “Match Frequency Stage”. Usually your concern for u probability is to provide a base point between a rare event (1 out of the total # of records) and a common event (1 out of 2). A value of 0.1 or 0.01 is a typical starting point. If you need to control the u probability (such as columns for an individual identification number, a national ID number, or a Patient Identification Number), specify the vartype to be NOFREQ which indicates that frequency distribution information should not be used by the Match stage.

·         Weights are used in the Match stage to determine by what percentage a column matches and to verify that the matches have a high degree of agreement. For each matching column, the Match stage computes a weight using the following equations:

Ø  An agreement weight is computed when the comparison between a pair of columns agrees.
log2(m probability / u probability)
Ø  A disagreement weight is computed when the comparison between the pair of columns disagrees.
log2((1 - m probability)/(1 - u probability))

The higher the m probability, the larger the penalty (disagreement weight) for a field that does not match, because errors are relatively rare events; the lower the u probability, the greater the potential weight for the comparison. Additionally, the calculated weight or penalty represents the maximum or minimum of a range of scoring. Specific match comparisons can generate a score within the range, depending on the comparison used.

The Match stage adds the weights assigned to each column comparison and obtains a composite weight for the record.

That is, the composite weight is the sum of the agreement or disagreement weights of the individual column comparisons. The higher the composite weight, the greater the agreement.
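The following Python sketch applies the agreement and disagreement weight formulas quoted above to a single record pair; the columns and the m and u values are illustrative, not tuned probabilities:

# Sketch of the agreement/disagreement weight formulas applied to a record pair.
from math import log2

def column_weight(agrees, m, u):
    # agreement weight when the columns agree, disagreement weight otherwise
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))

columns = {
    # column: (m probability, u probability)
    "GENDER":   (0.9, 0.5),
    "LASTNAME": (0.95, 0.01),
    "ZIP":      (0.9, 0.1),
}

pair_agreement = {"GENDER": True, "LASTNAME": True, "ZIP": False}
composite = sum(column_weight(pair_agreement[c], m, u) for c, (m, u) in columns.items())
print(round(composite, 2))   # higher composite weight = greater agreement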

·         In the Match specification, you can specify cutoff thresholds as follows:
Ø  Match Cutoff threshold

When a record pair receives a composite weight greater than or equal to this weight, the pair is declared a match.

Ø  Clerical Cutoff threshold

When a record pair receives a composite weight greater than or equal to this weight and less than the match cutoff, the pair is marked for clerical review. This weight must be equal to or less than the match cutoff weight. If you do not want a clerical review, set the clerical cutoff equal to the match cutoff.

Ø  Duplicate threshold

The lowest weight that a record pair can have to be considered a duplicate. This cutoff weight is optional and must be higher than the match cutoff weight.

This cutoff is only used with Reference Match and not with Unduplicate Match.

These thresholds are used to send the results of matching to the appropriate outputs.
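As a small illustration, the following Python sketch routes a scored record pair using made-up cutoff values; the composite weight is assumed to come from a calculation like the one sketched earlier:

# Sketch of how the cutoffs route a scored record pair. Threshold values are invented.
def classify_pair(composite_weight, match_cutoff=12.0, clerical_cutoff=8.0):
    if composite_weight >= match_cutoff:
        return "match"
    if composite_weight >= clerical_cutoff:
        return "clerical review"
    return "nonmatch"

for w in (15.2, 9.7, 3.1):
    print(w, "->", classify_pair(w))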

You use the Match Designer tool to set which columns are to be compared for the matching process. When the match specification is completed, you can use it in the Unduplicate Match stage, Reference Match stage, and Match Frequency stages. You can then apply the results from these match stages to the next stage of your project, such as Survive or for loading into a database.
The Match Frequency stage gives you direct control over the disposition of generated frequency data. This stage provides results that can be used by the Match Designer tool and match stages, but enables you to generate the frequency data independent of running the matches. You can generate frequency information by using any data that provides the fields that are needed by a match. Then you can let the generated frequency data flow into a match stage, store it for later use, or both.

The Match Frequency stage takes one input data file. This data file can be from one of the following places:
·         A link from a sample data set file for use in the Match Designer tool
·         A link from any parallel database, file or processing stage that carries information you need for the following stages:
Ø  An unduplicate match
Ø  The data source input of a reference match
Ø  The reference source input of a reference match

The Match Frequency stage takes a single output link that carries four columns:
  • qsFreqVal
  • qsFreqCounts
  • qsFreqColumnID
  • qsFreqHeaderFlag
The data in these columns provides the necessary information to give input to the Match Designer tool and to the Unduplicate Match and Reference Match stages.

When you are configuring the Match Frequency stage, you can choose from the following options:
  • Designate a match specification from the repository, so that frequency information is generated only for the match columns in the specification.
  • Do not designate a match specification, thereby generating frequency information for all columns of the input.
When multiple match specifications are defined for the same table, it is more efficient to generate a single match frequency output for all the columns in the table, and then use it in the different match stages.
  • Increase the number of frequencies reported in the output. (The default is 100.) You should increase this number if you are processing a large number of records.
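The following Python sketch is a rough stand-in for the kind of per-column value frequencies the Match Frequency stage produces (the column names and records are invented); such frequencies let the matcher treat agreement on a rare value as stronger evidence than agreement on a common one:

# Rough sketch of per-column frequency data (value, count) for matching.
from collections import Counter

records = [
    {"LASTNAME": "SMITH", "ZIP": "10001"},
    {"LASTNAME": "SMITH", "ZIP": "10128"},
    {"LASTNAME": "ZWICKY", "ZIP": "10001"},
]

for column in ("LASTNAME", "ZIP"):
    for value, count in Counter(r[column] for r in records).items():
        print(column, value, count)      # loosely: qsFreqColumnID, qsFreqVal, qsFreqCounts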
The Match Designer is a tool for creating a match specification.

When you create the match specification, you can select from more than 25 types of comparisons, such as CHAR (character comparison) and UNCERT (character uncertainty comparison). The comparisons are algorithms that are based on the type of data in the columns. The most common comparisons, and where they are typically used, are as follows.
Common comparisons
Where mostly used

CHAR (character comparison) character by character, left to right.


This is most useful to ensure a direct comparison, particular on code related fields such as Gender.

CNT_DIFF (count differences) tolerates a set number of key stroke errors (typically 1-2) on a prorated basis.


This is most commonly used with data such as telephone number or SSN where there is higher probability of key stroke issues.

DATE8 (basic tolerance of date errors) tolerates a set number of variations in days on a prorated basis.


This is most commonly used with data such as date of birth, date of encounter, and so forth where there is possibility of slight error in the day entered.

MULT_UNCERT - (word uncertainty in single field) tolerates phonetic errors,
transpositions, random insertion, deletion, and replacement of characters and words using a comparison threshold.


This is the most common comparison for business names, arrays of identifiers, or other descriptive data where multiple words exist that can vary in position.

NAME_UNCERT - (shortest value character uncertainty) tolerates phonetic errors, transpositions, random insertion, deletion, and replacement of characters using a comparison threshold but only to a maximum of the shortest value length.


This is the standard comparison for personal names.

PREFIX - (shortest value character comparison) character by character, left to right to a maximum of the shortest value length.


This is most useful to ensure matching of truncated data such as middle initial.
UNCERT - (character uncertainty) tolerates phonetic errors, transpositions, random insertion, deletion, and replacement of characters using a comparison threshold.

This is the standard comparison for street names or other basic text data.
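To make two of these comparisons concrete, the sketch below captures the basic idea behind CNT_DIFF and PREFIX in Python. The real comparisons contribute prorated weights rather than a simple true or false result, so treat this only as an approximation of the logic.

def cnt_diff_agrees(a, b, max_errors=1):
    """CNT_DIFF idea: tolerate a small number of keystroke differences between
    values of equal length, such as telephone numbers or SSNs."""
    if len(a) != len(b):
        return False
    return sum(1 for x, y in zip(a, b) if x != y) <= max_errors

def prefix_agrees(a, b):
    """PREFIX idea: compare character by character only up to the length of the
    shorter value, so truncated data (such as a middle initial) can still match."""
    shortest = min(len(a), len(b))
    return a[:shortest] == b[:shortest]

print(cnt_diff_agrees("5551234", "5551243", max_errors=2))  # True: two differing digits
print(prefix_agrees("J", "JOHN"))                           # True: initial matches full value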

You can use the Match Designer to create multiple match specifications that include one or more passes. Each pass is separately defined and is stored in the repository to be reused. The following figure shows a panel that specifies the match column (GENDER), the comparison type (CHAR, character comparison), and the m (.9) and u (.01) probabilities chosen.
You can run each pass on the complete source data or test data that is created from a representative subset of your production data and view the results in a variety of graphic displays. You can also use frequency information that is generated by the Match Frequency stage to help create your match specifications.
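The m and u probabilities shown above feed the composite weight for each record pair. As a hedged illustration, the standard probabilistic record linkage formulation derives an agreement weight of log2(m/u) and a disagreement weight of log2((1-m)/(1-u)) for each column; the exact computation inside the Match Designer may differ, but the sketch shows why m = .9 and u = .01 produce a strong positive weight on agreement and a moderate penalty on disagreement.

from math import log2

def field_weights(m, u):
    """Agreement and disagreement weights in the classic log-ratio form."""
    return log2(m / u), log2((1 - m) / (1 - u))

agree, disagree = field_weights(0.9, 0.01)
print(round(agree, 2), round(disagree, 2))  # approximately 6.49 and -3.31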

You create a separate match specification for each type of match as follows:

·         For Unduplicate Match, you can create a match specification that is one of the following types:
Ø  Unduplicate (dependent) removes duplicates from match consideration in subsequent passes
Ø  Unduplicate Independent includes all duplicates in subsequent passes

·         For Reference Match, you can create a match specification that is one of the following types:
Ø  One-to-One to create a two-file match specification based on a one-to-one relationship. In this match type, a record on the reference source can match only one data source record.
Ø  Reference to create a two-file match specification based on a many-to-one relationship.
There are multiple flavors of this type as follows:
§  Many-to-One in which a record on the reference source can match many data source records. However, any one data source record can only match one reference source record.

For example, 101 Main St. on the data source matches to two records on the reference source: 101-199 Main St SW and 101-199 Main St SE. The first reference source record is the matched record and the other is disregarded.

§  Many-to-one Multiple, in which each reference source record that has the same weight as the matched pair when it is scored against the data record is flagged as a duplicate record. Any one data source record might match more than one reference source record. For example, 101 Main St. on the data source matches two records on the reference source: 101-199 Main St SW and 101-199 Main St SE. One reference source record is the matched record and the other is a duplicate.

§  Many-to-one Duplicates is similar to the Many-to-one Multiple option, except that additional reference source records that match to a level above the duplicate cutoff value are flagged as duplicates. Thus, records with lower weights than the match weight can be flagged as duplicates.

For example, 101 Main St on the data source matches three records on the reference source: 101-199 Main St SW, 101-199 Main St SE, and 101 Main Rd. The data source record receives 101-199 Main St SW as the match, and both of the other addresses could be flagged as duplicates.
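The following sketch, which is illustrative only and not a QualityStage API, shows how the three flavors differ when one data record is scored against several reference candidates that are all at or above the match cutoff. The weights and cutoff value in the example are hypothetical.

def pick_reference_match(candidates, duplicate_cutoff=None, flavor="many-to-one"):
    """candidates: list of (reference_record, composite_weight) pairs for one data record."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    (match, best_weight), others = ranked[0], ranked[1:]
    duplicates = []
    for record, weight in others:
        if flavor == "many-to-one":
            break                                    # extra candidates are disregarded
        if flavor == "many-to-one multiple" and weight == best_weight:
            duplicates.append(record)                # only ties with the matched pair
        if (flavor == "many-to-one duplicates"
                and duplicate_cutoff is not None and weight >= duplicate_cutoff):
            duplicates.append(record)                # also anything at or above the duplicate cutoff
    return match, duplicates

print(pick_reference_match(
    [("101-199 Main St SW", 18), ("101-199 Main St SE", 18), ("101 Main Rd", 16)],
    duplicate_cutoff=15,
    flavor="many-to-one duplicates",
))  # ('101-199 Main St SW', ['101-199 Main St SE', '101 Main Rd'])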

The inputs to the Match Designer tool are:
  • Input data source which is usually a representative subset of the data or reference source
  • Frequency information from the Match Frequency stage as described in “Match Frequency Stage”

The output of the Match Designer tool is a test results database—you must provide the ODBC Data Source Name (DSN) of this test results database (on the client computer), and the user name and password for accessing it.

The main area of the Match Designer is made up of two tabs:

·         Compose tab

On the Compose tab, you design and fine-tune the Match passes and add them to the Match job. You can add, delete, and modify Match passes in this section. For each Match pass, you can specify blocking fields, matching fields, and cutoff weights, and view the weight histogram, data results, and Match pass statistics.

Ø  The Match Type area is a kind of sandbox for designing jobs that displays the current Match job. In this area, you can rearrange the order in which the Match passes run in the Match job, add or remove passes, and create new passes.
Ø  The Match Pass Holding area is used to keep iterations of a particular pass definition or alternate approaches to a pass. The passes in the holding area do not run as part of the Match job, but they can be tested in isolation. You can add any of the Match passes in the holding area to the Match job by moving the Match pass to the Match Type area. Likewise, you can remove any of the Match passes from the Match job by moving it from the type area into the Match Pass Holding area. Any pass, whether in the type or holding area, can be test run. This approach lets you try different pass definitions without losing alternate definitions.
Ø  The Blocking Columns area designates the fields that must match exactly for records to be placed in the same processing group for the match (a small sketch of this grouping follows this list). The right pane shows the histogram and data sections when the run is complete. You can sort and search the data columns from the match results.
Ø  The Cutoff values area identifies the match cutoff and the clerical cutoff.
The match cutoff is set so that record pairs above the cutoff have a high probability of matching and those below the cutoff have a high probability of not being a match. The composite weights assigned to each record pair create a distribution of scores that range from very high positive to very high negative.

§  Within the distribution of positive values, define a value or cutoff at which any record pair receiving a weight equal to or greater than this cutoff is considered a match.
This is referred to as the match cutoff.
§  Conversely, define a cutoff at which any record pair receiving a weight equal to or less than this cutoff is considered a non-match. Record pairs with weights that fall between these two cutoff values are considered clerical review cases.
This is referred to as the clerical cutoff.

If more than one record pair receives a composite weight higher than the match cutoff weight, those records are declared duplicates.

The duplicate cutoff weight is optional and must be higher than the match cutoff weight. This cutoff is not used with the Unduplicate Match stage. The way in which duplicate records are handled is based on the type of matching that you selected.

Any record pair that falls below the clerical cutoff becomes a residual and is eligible for the next matching pass.

·         Total statistics tab

This tab provides statistical data in a graphical format for all the passes that run.

The cumulative statistics are of value only if you test multiple passes consecutively, in the order that they appear in the match specification. The Total Statistics page displays the following information:

Ø  Cumulative statistics for the current runs of all passes in the match specification.
Ø  Individual statistics for the current run of each pass.
Ø  Charts that compare the statistics for the current run of all passes.
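As referenced in the Blocking Columns description above, blocking can be pictured as grouping records by the exact values of the blocking columns so that only records within the same group are compared and scored. The sketch below illustrates the concept only; it is not the stage's implementation.

from collections import defaultdict
from itertools import combinations

def blocks(records, blocking_columns):
    """Group records by the exact values of the blocking columns."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[col] for col in blocking_columns)].append(rec)
    return groups

def candidate_pairs(records, blocking_columns):
    """Yield only the record pairs that share a blocking key,
    instead of comparing every record with every other record."""
    for group in blocks(records, blocking_columns).values():
        yield from combinations(group, 2)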
The Unduplicate Match stage locates and groups all similar records within a single input data source. This process identifies potential duplicate records, which might then be removed.

An example is the need to eliminate duplicates from a consolidation of mailing lists purchased from multiple sources.

The Unduplicate Match stage accomplishes the following actions:

  • Categorizes all records with weights above the match cutoff as a set of duplicates.
  • Identifies a master record by selecting the record within the set that matches to itself with the highest weight. The master record is associated with its set of duplicates.
  • Determines that records not part of a set of duplicates are residuals. The residuals and the master records are generally made available for the next pass.
  • Excludes duplicates in subsequent passes. However, you can choose the Independent match type (described later) if you want duplicates to be included in subsequent passes.

The output of the Unduplicate Match stage can include master records, duplicates above the match cutoff, clerical duplicates, and residuals. You can use this output as input to the Survive stage.

The Unduplicate Match stage uses two match types as follows:

  • Dependent: This match type is the default choice. With this match type, after the first pass, duplicates are removed from match consideration in subsequent passes.
  • Independent: With this match type, duplicates are included in subsequent passes.
Across all passes, all records that match to any given record are also considered matched to each other. For example, if records A and B match in pass 1, records B and C match in pass 2, and records C and D match in pass 3, then records A, B, C and D are matched to each other.

The following table shows sample data that illustrates how to use the Independent match type option with the Unduplicate Match stage. The table shows four records that describe the same person. You require that all records concerning the same person match without regard to address.

Record   Name              Address            Tax Identifier
1        William Nickson   123 Rodeo Drive    123456789
2        Bill Nixon        123 Rodeo Drive
3        B Nickson         978 Sunset Blvd    123456789
4        Nickson           456 Western Ave.   123456789

The matching process using this data can yield different results depending on whether you choose the Dependent or Independent match type:

  • Dependent Match

The first pass blocks and matches on Name and Address. Records 1 and 2 are considered a matched pair, while records 3 and 4 are considered residuals.

If Record 2 (without the TaxID) is selected as the master, and Record 1 is considered a duplicate, then Record 1 is not available for the second pass.

If the second pass blocks and matches on Name and TaxID, then only Records 3 and 4 match. The result is two groups of matched records: Records 1 and 2, and Records 3 and 4.

  • Independent Match

The first pass results are the same as the Dependent Match. Records 1 and 2 are considered a matched pair, and records 3 and 4 are considered residuals.

If Record 2 (without the TaxID) is selected as the master record in the second pass, the duplicate record, Record 1, is also compared to the rest of the records. When you block on Name and TaxID, records 1, 3, and 4 match. Because Record 1 matched Record 2 in the first pass, the output is one group with all four records linked.
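One way to picture the Independent option's behavior in this example is as a union-find over the matched pairs from all passes: any records that match a common record end up in the same group. This is a conceptual sketch only, not how the stage is implemented.

def group_matches(num_records, matched_pairs):
    """matched_pairs across all passes, for example [(1, 2), (1, 3), (1, 4)]
    for the Name and TaxID example above (records numbered 1 to num_records)."""
    parent = list(range(num_records + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)           # merge the two groups

    groups = {}
    for rec in range(1, num_records + 1):
        groups.setdefault(find(rec), []).append(rec)
    return list(groups.values())

# pass 1 matched records 1 and 2; pass 2 matched record 1 with records 3 and 4,
# so all four records are linked into a single group:
print(group_matches(4, [(1, 2), (1, 3), (1, 4)]))  # [[1, 2, 3, 4]]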
The Reference Match stage identifies relationships among records. This match can group records that are being compared in different ways as follows:

  • One-to-many matching

Identifies all records in one data source that correspond to a record for the same individual, event, household, or street address in a second data source. Only one record in the reference source can match one record in the data source because the matching applies to individual events.

An example of one-to-many matching is finding the same individual (based on comparing social security numbers) in two different sources such as a department of motor vehicles (DMV) list and a voter registration list. There should be at most only one record matching on social security number in each list.

  • Many-to-one matching

Multiple records in the data file can match a single record in the reference file. For example, matching a transaction data source to a master data source allows many transactions for one person in the master data source.

An example of many-to-one matching is matching the ZIP codes in a customer file with the list of valid ZIP codes. More than one record in the customer file can have the same ZIP code in it.

When you configure the Reference Match stage, you select an existing Reference-type match specification stored in the repository, select the match type, and select the match outputs, as shown in the following figure.
The Reference Match stage requires standardized data and reference data as source data, a reference match specification, and frequency information for both sources. Briefly, these inputs are:

  • Source data from any parallel database, file or processing stage. Typically, the source is standardized data from the Standardize stage.
  • Reference source against which the source data is matched.
  • Frequency information for that data as generated by the Match Frequency stage.
  • Match specification to group and match the data as described for the “Match Designer tool”.

The output of the Reference Match stage includes master records, clerical review records, duplicates, and residuals. You can use this output as input to the Survive stage. Briefly, the Reference Match stage delivers up to six outputs as follows:

  • Match contains matched records for both inputs
  • Clerical has records that fall in the clerical range for both inputs
  • Data Duplicate contains duplicates in the data source
  • Reference Duplicate contains duplicates in the reference source
  • Data Residual contains records that are non-matches from the data input
  • Reference Residual contains records that are non-matches from the reference input


The Survive stage consolidates duplicate records, creating a best-of-breed representation of the matched data that companies can use to load a master data record, cross-populate all data sources, or both.

During the Survive stage, IBM WebSphere QualityStage takes the following actions:
  • Replaces existing data with “better” data from other records, based on user-specified rules
  • Supplies missing values in one record with values from other records on the same entity
  • Populates missing values in one record with values from corresponding records that have been identified as a group in the matching stage
  • Enriches existing data with external data

The Survive stage constructs column values from groups of related or duplicate records and stores the column values in the survived record (the best result) from each group.

The Survive job is the last job in the IBM WebSphere QualityStage workflow and is usually run after the Unduplicate Match stage job. The output from the Unduplicate Match stage, and in some cases the Reference Match stage, becomes the source data that you use for the Survive stage.

The Survive stage requires one input source, which can be sourced from a database connector, flat file, data set, or another processing stage. If your input is the result of a match stage, you need to set up another stage (for example, a Funnel stage) to combine the master, duplicate records, and clerical reviewed duplicates into one input source.

While it is not necessary to process the data through the match stages before you use the Survive stage, the source data must include related or duplicate groups of rows. Also, the data must be able to be sorted on one or more columns that identify each group. These columns are referred to as group keys.

  • To order the records, you sort on the group key or keys so that all records in a group are contiguous. The Survive stage sorts records automatically if the “Don’t Pre-sort Input” option is not selected in the Survive Stage window. However, the automatic sort provides no control over the order of the records within each group. To control the order within groups, you can presort the input by using the Sort stage.
  • The Survive stage accepts all basic data types (non-vector, non-aggregate) other than binary.
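A minimal stand-in for this input preparation (combining the match outputs with a Funnel stage and presorting on the group key) might look like the following sketch; the column names here are made up, not the actual match output columns.

# combine the master, duplicate, and clerical records into one input source
masters = [{"group_id": 7, "name": "William Nickson"}]
duplicates = [{"group_id": 7, "name": "Bill Nixon"}]
clerical_duplicates = []

# presort on the group key so that every group arrives as a contiguous block
survive_input = sorted(masters + duplicates + clerical_duplicates,
                       key=lambda rec: rec["group_id"])
print(survive_input)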

The Survive stage can have only one output link. This link can send output to any other stage. You specify which columns and column values from each group create the output record for the group. The output record can include an entire input record, or selected columns from the record, or selected columns from different records in the group.

  • You select column values based on rules for testing the columns.

A rule contains a set of conditions and a list of one or more target columns. The rows are tested one by one in the order in which they appear within the match set. For each row tested, the values of the “test” row are updated.
  • To select a best candidate match, you can specify multiple columns such as the record creation date, data source from which the record originated, length of data in the column, and frequency of data in a group

You set up and configure the Survive stage to create a survived record. You need to configure the Survive stage before you run a Survive job. Before you can add source data to the Survive stage, all input records must be combined into one input source.

When you configure the Survive stage, you choose simple rules that are provided in the New Rules window, or you select the Complex Survive Expression option to create your own custom rules. You use some or all of the columns from the source file, add a rule to each column, and apply the data.

After the Survive stage processes the records to select the best record, this information is sent to the target file.

To be considered the best candidate for the output record, a target requires a rule that comprises one or more targets and a TRUE conditional expression. A condition is made up of:
  • Column names
  • Constant or literal values
  • Operators that specify comparison or arithmetic operations

The Survive stage evaluates the columns to the rule (simple or complex) and selects those that meet the conditions of the rule as best columns. The Survive stage reads the first record and evaluates the record according to any rule that you select. The evaluation process uses the following method:

  • If the first record has no best columns, the selected rule for the target record is evaluated against all the columns in the record. If a target record passes the test, its columns become best columns and a “b” appears in front of the column names; this is only visible when building the rule and is not part of the evaluation process.

  • Each subsequent record in the group is evaluated in relation to the current record. If a target record passes the test then its columns become the best columns and replace any existing best columns. If none of the current columns meets the conditions, the best columns remain unchanged.

  • After all records in the group are evaluated, the values that are designated as the best values are combined in the output record. Survive continues the process with the next records in the next group.
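In rough Python terms, and assuming a single target column and a caller-supplied rule rather than the packaged Techniques, the evaluation process above looks like the following sketch: each sorted group is walked record by record, and a record's value replaces the current best value whenever the rule passes.

from itertools import groupby

def survive(records, group_key, target_column, rule):
    """records must already be sorted on group_key so each group is contiguous.
    rule(candidate_value, best_value) returns True when the candidate should
    become the new best value for the target column."""
    survivors = []
    for key, group in groupby(records, key=lambda rec: rec[group_key]):
        best = None
        for rec in group:
            if best is None or rule(rec[target_column], best):
                best = rec[target_column]          # this value becomes the best column
        survivors.append({group_key: key, target_column: best})
    return survivors

# hypothetical rule in the spirit of the Longest Field technique listed below
longest = lambda candidate, best: len(candidate.strip()) > len(best.strip())
rows = [{"id": "A", "name": "B Nickson"}, {"id": "A", "name": "William Nickson"}]
print(survive(rows, "id", "name", longest))  # [{'id': 'A', 'name': 'William Nickson'}]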

To apply survival rules to the input columns, the Survive stage includes:

·         A set of predefined Techniques (packaged survive expressions) shown in the following table from which you can select the desired technique. These techniques are sometimes referred to as simple survive rules.

Technique                  Pattern
Shortest Field             SIZEOF(TRIM(c."column")) <= SIZEOF(TRIM(b."column"))
Longest Field              SIZEOF(TRIM(c."column")) >= SIZEOF(TRIM(b."column"))
Most Frequent              FREQUENCY
Most Frequent (non blank)  FREQUENCY (skips missing values when counting most frequent)
Equals                     c."column" = "DATA"
Not Equals                 c."column" <> "DATA"
Greater Than               c."column" >= "DATA"
Less Than                  c."column" <= "DATA"
At Least One               1 (at least one record survives, regardless of other rules)

You define a simple rule by specifying each of the following elements:
Ø  Target column or columns
Ø  Column to analyze
Ø  Technique to apply to the column that is being analyzed
The rule is defined for you depending on the Technique that you select.

The following figure shows an example of a simple rule that uses the Longest Field technique.

·         The Rule Expression Builder for creating your own complex expressions. The rules created by the Rule Expression Builder are also referred to as complex survive rules.

In the Rule Expression Builder, you define a rule by specifying each of the following elements:
Ø  A current record from the Columns list
Ø  One or both functions (SIZEOF and TRIM)
Ø  An operator such as =, <>, <, >, +, -, *, /, %, AND, OR and NOT
Ø  A best record from the Columns list
Ø  One or both functions
In a complex rule, you can use more than one column (other than the target itself) to define the rule for a target.

The following figure shows a sample complex expression rule that was built with the Rule Expression Builder.
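As a final illustration, a complex rule that uses more than one column might, in Python terms, look like the sketch below. The column names are hypothetical, and the expression in the comment is only an approximation of what the Rule Expression Builder would produce.

def complex_rule(c, b):
    """Candidate row c replaces the current best row b when its trimmed ADDRESS
    is longer and its ZIP agrees with the best row's ZIP. Roughly comparable to
    a builder expression such as:
    SIZEOF(TRIM(c."ADDRESS")) > SIZEOF(TRIM(b."ADDRESS")) AND c."ZIP" = b."ZIP"
    """
    return (len(c["ADDRESS"].strip()) > len(b["ADDRESS"].strip())
            and c["ZIP"] == b["ZIP"])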
