The
data that drives today’s business systems often comes from a variety of sources
and disparate data structures. As organizations grow, they retain old data
systems and augment them with new and improved systems. Thus, data becomes difficult to manage and use, and a clear picture of a customer, product, or buying trend can be practically impossible to ascertain.
Data quality can be broadly defined from a qualitative viewpoint as information that you can trust, or as assurance that data at a particular point in time is suitable for its purpose. A more specific quantitative definition can include the level of compliance that an enterprise's data environment attains against independently defined rules that describe that data environment. The emphasis is on the business user's perception of data quality: what is delivered to the user and the semantics of the data that is presented. In this case, data quality might well depend upon the data itself (such as correct data types, consistent formatting, retrievability, and usability) and the processes and applications that deliver this data to the business user.
The
source of data quality issues is a lack of common standards on how to store
data and an inconsistency in how the data is input. Different business
operations are often very creative with the data values that they introduce into
your application environments. Inconsistency across sources makes understanding
relationships between critical business entities such as customers and products
very difficult. In many cases, there is no reliable and persistent key that you
can use across the enterprise to get all the information that is associated
with a single customer or product.
Without
high-quality data, strategic systems cannot match and integrate all related
data to provide a complete view of the organization and the interrelationships
within it. CIOs can no longer count on a return on the investments made in
critical business applications. The solution calls for a product that can
automatically re-engineer and match all types of customer, product, and
enterprise data, in batch or at the transaction level in real time.
IBM
Information Server addresses these requirements with an integrated software
platform that provides the full spectrum of tools and technologies that are
required to address data quality issues. These technologies include data
profiling (IBM WebSphere Information Analyzer and IBM WebSphere AuditStage),
data cleansing (IBM WebSphere QualityStage), and data movement and
transformation (IBM WebSphere DataStage) as follows:
·
IBM WebSphere Information Analyzer (the focus of IBM WebSphere
Information Analyzer and Data Quality Assessment) is a new module of IBM
Information Server that represents the next-generation data profiling and
analysis tool. It is designed to help business and data analysts understand the
content, quality, and structure of their data sources by automating the data
discovery process. The IBM WebSphere Information Analyzer product module helps
increase the productivity of data personnel and improve return on investment
(ROI) and time to benefit of data-intensive projects.
·
IBM WebSphere DataStage integrates data across multiple, high-volume data sources and target applications. It integrates data on demand with a high-performance parallel framework, extended metadata management, and enterprise connectivity.
·
IBM WebSphere QualityStage enables customers to use the following processes to make a significant impact on the data that drives an organization's success:
Ø Investigation of source data to
understand the nature, scope, and detail of data quality challenges.
Ø Standardization to ensure that
data is formatted and conforms to organization-wide specifications, including
name and firm standards as well as address cleansing and verification.
Ø Matching of data to identify
duplicate records within and across data sets.
Ø Survivorship to eliminate
duplicate records and create the “best record view” of data.
As
briefly introduced, the five types of issues that generally occur within
enterprise data stores are as follows:
- Lack of information
standards:
Names, addresses, part numbers, and other data are entered in inconsistent
ways, particularly across different systems. These differences make
records look different even when they are actually the same.
- Data surprises in individual
fields: Data
in a database is often misplaced, while in other cases some fields are
used for multiple purposes.
- Information is buried in
free-form fields: In
this case, valuable information is hidden away in text fields. Because
these fields are difficult to query using SQL, this information is often
not taken advantage of, although it most likely has value to the business.
This type of problem is common in product information and help desk case
records.
- Data myopia (our term for the lack of
consistent identifiers across different systems): Without adequate
foreign-key relationships, it is impossible to get a complete view of
information across systems.
- Redundancy within individual
tables: This
issue is extremely common, where data is re-entered into systems because
the data entry mechanism is not aware that the original record is already
there. This issue is a common side effect of the lack of standards, but it
is one of the worst data quality problems, because it links directly to
costs and customer dissatisfaction.
IBM
WebSphere QualityStage helps to identify and resolve all these issues for any
type of data. It helps to ensure that systems deliver accurate and complete
information to business users across the enterprise. IBM WebSphere QualityStage
is a data re-engineering environment that is designed to help programmers,
programmer analysts, business analysts, and others cleanse and
enrich
data to meet business objectives and data quality management standards.
A process for re-engineering data
can help accomplish the following goals:
- Resolve conflicting and ambiguous meanings for
data values
- Identify new or hidden attributes from free-form
and loosely controlled source fields
- Standardize data to make it easier to find
- Identify duplication and relationships among
such business entities as customers, prospects, vendors, suppliers, parts,
locations, and events
- Create one unique view of the business entity
- Facilitate enrichment of re-engineered data,
such as adding information from vendor sources or applying standard postal
certification routines
IBM
WebSphere QualityStage data quality functions include:
·
Free-form
text investigation that allows you to recognize and parse out individual fields
of data from free-form text
Investigation
provides the ability to determine the number and frequency of the unique values
found in either single-domain or free-form columns:
Ø For single-domain columns, the unique values can
represent complete data values, partial data values (that is, prefixes,
suffixes, or substrings), simple data formats (alpha, numeric, blanks, or
special characters), or combinations of values and simple data formats.
Ø Free-form columns often contain multiple
(usually related) data elements such as person names, business names, or postal
addresses. For free-form columns, the unique values can represent individual
tokens (words) that are determined by parsing the free-form data or lexical
patterns providing context to help understand the free-form data.
·
Standardization
that allows individual fields to be parsed and made uniform according to your
own standards
Standardization
provides the ability to normalize your information to defined standards. This
incorporates the ability to parse free-form columns into single-domain data
elements to create a consistent representation of the input data and to ensure
data values conform to standard representation.
·
Address
verification and correction that uses postal information to standardize,
validate, and enrich address data
For
addresses, standardization incorporates postal standards with delivered rule
sets. It also prepares data elements for more effective matching.
·
Record
linkage and matching that allows duplicates to be identified from individual
sources, and common records across sources to be identified and linked
The central strength of IBM WebSphere QualityStage is its ability to match data from different records, even when the data appears very different. IBM WebSphere QualityStage utilizes a statistical matching technique called Probabilistic Record Linkage that weights the contribution of each data element to a match and supports an unlimited number of elements for the highest confidence in your information. This matching allows each individual field score to be summed to produce a final score that precisely measures the information content of the matching fields. That final score, or match weight, is an accurate gauge of the probability of a match.
Because
of this ability to match records, IBM WebSphere QualityStage is a key enabler
of creating a single view of customers or products.
·
Survivorship that allows the best data from across different systems to be merged into a consolidated record

Survivorship is the process of aggregating or consolidating a group of records into a single unique representation of the data: a single consolidated record that retains the best-of-breed information from the given individual records. Survivorship incorporates a flexible process of assessing data at the record or individual field level based on specific survivorship rules. These rules can include completeness of data, frequency of data, or logical conditions.
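As a rough illustration of this survivorship idea (not the IBM WebSphere QualityStage implementation), the following Python sketch consolidates a group of matched records into a single best-of-breed record using simple completeness and frequency rules, with a length-based tie-breaker standing in for a logical condition; the column names and records are hypothetical.

from collections import Counter

# Hypothetical matched group: three records believed to describe the same customer.
group = [
    {"name": "J. Smith",   "phone": "",             "city": "Boston"},
    {"name": "John Smith", "phone": "617-555-0100", "city": "Boston"},
    {"name": "John Smith", "phone": "",             "city": "Bostn"},
]

def survive(records):
    """Build one consolidated record, column by column."""
    survived = {}
    for col in records[0]:
        values = [r[col] for r in records if r[col]]   # completeness rule: ignore blanks
        if not values:
            survived[col] = ""
            continue
        freq = Counter(values)
        best_count = max(freq.values())
        candidates = [v for v, c in freq.items() if c == best_count]
        # frequency rule first, then a simple tie-breaker: prefer the longest value
        survived[col] = max(candidates, key=len)
    return survived

print(survive(group))   # {'name': 'John Smith', 'phone': '617-555-0100', 'city': 'Boston'}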
These
capabilities of IBM WebSphere QualityStage support the re-engineering of data
to meet complex business goals and challenges.
IBM
WebSphere QualityStage is built around a services-oriented vision for
structuring data quality tasks that are used by many new enterprise system
architectures. As part of the integrated IBM Information Server platform, it is
supported by a broad range of shared services and benefits from the reuse of
several suite components.
IBM WebSphere QualityStage and IBM WebSphere DataStage share the same infrastructure for importing and exporting data, for designing, deploying, and running jobs, and for reporting. The developer uses the same design canvas to
specify the flow of data from preparation to transformation and delivery.
Multiple
discrete services give IBM WebSphere QualityStage the flexibility to match
increasingly varied customer environments and tiered architectures.
With reference to the above figure, the following suite components are shared between IBM WebSphere QualityStage and IBM Information Server:
- Unified user interface
The
IBM WebSphere DataStage and QualityStage Designer provides a development
environment. The IBM WebSphere DataStage and QualityStage Administrator
provides access to deployment and administrative functions. IBM WebSphere
QualityStage is integrated tightly with IBM WebSphere DataStage and shares the
same design canvas, which enables users to design jobs with data transformation
stages and data quality stages in the same session.
- Common services
IBM
WebSphere QualityStage uses the common services in IBM Information Server for
logging and security. Because metadata is shared “live” across tools, you can
access services such as impact analysis without leaving the design environment.
You can also access domain-specific services for enterprise data cleansing such
as investigate, standardize, match, and survive from this layer.
- Common repository
The
repository holds data to be shared by multiple projects. Clients can access
metadata and results of data analysis from the respective service layers.
- Common parallel processing engine
The
parallel processing engine addresses high throughput requirements for analyzing
large quantities of source data and handling increasing volumes of work in
decreasing time frames.
- Common connectors
Any data source that is supported by IBM Information Server can be used as input to an IBM WebSphere QualityStage job by using connectors. The connectors also
enable access to the common repository from the processing engine.
IBM WebSphere QualityStage uses one or more of the stages shown in the following figure to improve an organization's data quality.
The
main functions are investigate, standardize, match, and survive as follows:
- Investigate source data to understand the
nature, scope, and detail of data quality challenges.
- Standardize data to ensure that it is
formatted and conforms to organization-wide specifications including name
and firm standards as well as address cleansing and verification.
Investigate can be used to assess the effectiveness of Standardize. The standard rules can then be augmented to improve the standardized data.
- Match data to identify duplicate
records within and across data sets.
- Survive appropriate data by
eliminating duplicate records and creating the best record view of data.
IBM
WebSphere QualityStage comprises a set of stages, a Match Designer, and related
capabilities that provide a development environment for building data-cleansing
tasks called jobs. The IBM WebSphere QualityStage
components include the Match Designer for designing and testing match passes
and the IBM WebSphere QualityStage stage types, including Investigate,
Standardize, Match Frequency, Reference Match, Unduplicate Match, and Survive.
(This is not a
comprehensive
list.)
When an IBM WebSphere QualityStage job is compiled, the Designer (client) transfers the logic (developed on the client) to the Server. On execution, the IBM WebSphere QualityStage load modules perform the actual data engineering tasks such as investigate, standardize, match, and survive. These load modules operate
through control parameter statements that are passed to the modules during
processing. Each IBM WebSphere QualityStage operation is referred to as a stage. Complex data engineering tasks
can be performed by linking individual stages together in a job comprising
multiple job steps or stages.
IBM
WebSphere QualityStage comes with customizable rules to prepare complex
information about your business entities for a variety of transactional,
operational, and analytical purposes.
IBM
WebSphere QualityStage automates the conversion of data into verified standard
formats including the use of probabilistic matching, in which variables that
are common to records (such as the given name, date of birth, or gender) are
matched when unique identifiers are not available.
IBM
WebSphere QualityStage components include the Match Designer, for designing and
testing match passes, and a set of data-cleansing operations called stages. Information is extracted from
the source system and then measured, cleansed, enriched, consolidated, and
loaded into the target system. At run time, data cleansing jobs typically
consist of the following sequence of
stages:
·
Standardize
stage parses
free-form or fixed-format columns into single-domain data elements to create a
consistent representation of the input data. This stage ensures that each data
element has the same content and format and also standardizes spelling formats
and abbreviations.
·
Match
stages ensure
data integrity by linking records from one or more data sources that correspond
to the same customer, supplier, or other entity. Matching can be used to
identify duplicate entities that are caused by data entry variations or
account-oriented business practices. For example:
Ø Unduplicate Match jobs group
records into sets (within a single data set) that have similar attributes.
Ø Reference Match stage matches
reference data to source data between two data sets.
The
probabilistic matching capability and dynamic weighting strategies of IBM
WebSphere QualityStage help you to create high-quality, accurate data and
identify core business information consistently, such as customer, location,
and product throughout the enterprise. IBM WebSphere QualityStage standardizes
and matches any type of information.
·
Survive
stage ensures
that the best available data survives and is consequently prepared for the
target correctly.
Business
intelligence packages that are available with IBM WebSphere QualityStage
provide data enrichment that is based on business rules. These rules can
resolve issues with common data quality problems such as invalid address fields
throughout multiple geographies. The following packages are available:
- Worldwide Address
Verification and Enhancement System (WAVES) matches address data against
standard postal reference data that helps you verify address information
for 233 countries and regions.
- Multinational geocoding is
used for spatial information management and location-based services by
adding longitude, latitude, and census information to location data.
- Postal certification rules
provide certified address verification and enhancement to address fields
to enable mailers to meet the local requirements to qualify for postal
discounts.
By
ensuring data quality, IBM WebSphere QualityStage reduces the time and cost to
implement CRM, business intelligence, ERP, and other strategic customer-related
IT initiatives.
Some of
the scenarios for data cleansing include:
- Obtaining one view of
households
Knowing
that your customers (stored in different source systems) belong to the same
household (share the same mailing address) can facilitate marketing and mail
campaigns.
- Obtaining a consolidated view
of an entity
Knowing
the total quarterly sales from the prescriptions of one doctor can help
pharmaceutical companies effectively market to them. This information needs to
be extracted and consolidated from existing systems with different standards
and formats, including information buried in free-form fields, incorrect data
values, discrepancies between field metadata and actual data in the field, and
duplicates.
- Obtaining a single real-time
view of a customer
Customer
Service Representatives (CSR) and customers using self-service portals require
a single view of all their health, dental, and benefit plans with the insurance
company. Typically, these plans are implemented in different systems, and the
data resides in different sources in different formats.
IBM
WebSphere QualityStage performs the preparation stage of enterprise data
integration, often referred to as data
cleansing. IBM
WebSphere QualityStage takes advantage of the source systems analysis that is
performed by IBM WebSphere Information Analyzer and supports the transformation
functions of IBM WebSphere DataStage. Working together, these products automate
what was previously a manual or neglected activity within a data integration
effort—data quality assurance.
Understanding
your data is a necessary precursor to cleansing and consolidation. You can use
a data profiling product such as IBM WebSphere Information Analyzer to perform
a comprehensive analysis across many columns and tables to create a direct
input into the cleansing and matching process by using shared metadata. The
Investigate stage supports the initial level of column or domain analysis and
extends it through the analysis of free-form text fields to create this input.
The
Investigate stage shows the actual condition of data in existing sources and
identifies and corrects data problems before they corrupt new systems.
Investigation parses and analyzes free-form fields to create patterns that reveal field formats, counts unique values, and classifies (assigns a business meaning to) each occurrence of a value within a field. The Investigate stage also allows users to validate field contents in a specific column or domain. The actual full-volume data files must be investigated to ensure that as many data anomalies as possible are identified. The Investigate stage involves performing one or more of the following functions:
- Parsing
- Classifying
- Creating word or pattern distributions
Specific objectives of the investigation process include, but are not limited to, the following:
- Investigate data values and
anomalies to be considered for special handling during design or
development. Any recommended special handling is documented in a Business
Requirements Document. Examples include:
– Out of range, default (such as all
nines in the telephone number) or unique values.
– Free-form text that requires
parsing to improve “addressability” to key components for optimal matching.
– Values that do not match metadata
labels and require new field definitions for matching and load format (such as
DBA, C/O, Attention text, driver's license numbers, or other comment information
in a name or address field).
– Values that overlap adjacent
fields and thus require a re-alignment of field content. For example, name
information or city, state, and postal code that extend into street and mailing
address fields.
– Invalid formats that can cause
conversion problems such as alphanumeric in character or numeric only fields.
– Blank or missing data.
– Special character and punctuation
handling.
- Discover additional tokens
(key words) to add to the classification table, such as name prefixes,
name suffixes, street types, unit types, and business words.
- Verify the usefulness of data
fields for the purpose of entity matching (record linkage). For example,
does a Tax ID field contain primarily blank or invalid values, making it
an unreliable field to use in the identification of a unique customer?
- Validate business requirements
or assumptions about the data. Results can lead to changes in the business
requirements and can assist in driving the technical solution.
The
Investigate stage takes a single input, which can be a link from any database
connector that is supported by IBM WebSphere DataStage, from a flat file or
data set, or from any processing stage. Inputs to the Investigate stage can be fixed or variable length.
Use of
investigation (or data profiling) is critical to assess how data can be used
for further steps of standardization, matching, and survivorship or
consolidation. Data that is used in these additional steps is stored in two
types of fields—single-domain and multiple-domain.
During
the investigation process, it is important to document fields that require parsing
and standardization, standardization only, or no standardization at all. You
must also consider whether data content will be used for matching,
survivorship, pass-through (static data) or whether it will not be carried
forward into the new system or target database.
·
Static
fields
Ø Can be used in the matching
process.
Ø Might require some standardization
of format and default or out of range values.
Ø Typically re-joined with re-engineered data prior to matching or load formatting.
Character Investigate is usually sufficient to evaluate these fields.
·
Single-domain
fields
Ø Categorized as either Entity Identifiers or Entity Clarifiers:
§ Entity Identifiers
Examples
include ZIP Code (U.S.
postal code), Social Security Number (SSN), TIN, and telephone numbers. These
can serve as critical match fields if their quality is high. These fields
typically have a specific format (such as 9-digit numeric for SSN, or
Alpha-Numeric-Alpha-Numeric-Alpha-Numeric for Canadian postal code). Character
Discrete Investigate is commonly used, with the mask set to all Cs.
§ Entity Clarifiers
Examples
include name prefix, gender, and marital status. This type of field usually has
a limited number of known values. Character Discrete Investigate is commonly
used, with the mask set to all Ts.
Ø Typically used for a single
purpose.
Ø Can exist in several common
formats.
Ø No parsing is usually necessary.
Ø Can serve as strong blocking and
match fields if quality is high.
Ø The Standardize stage usually only requires removal of special characters, creation of a common default value and format, and concatenation of data.
·
Multiple-domain
fields
Ø Typically these are large
free-form fields such as multiple Address fields.
Ø Comprised of multiple
single-domain values, such as house number, street name, unit, and first name.
Ø Intelligent parsing is required on
these fields for optimal matching.
Ø Can serve as good match fields and
multiple-component blocking candidates.
Ø Standardization requires removal
of most special characters and strong data typing.
Ø Occasionally contain business word
interference, such as reference numbers, comments, and so forth.
Ø Can contain additional
relationships such as Attn, C/O, DBA, and Trustee.
Word Investigate is commonly used for multiple-domain fields.
The
Character Investigate option parses a single-domain field (one that contains
one data element or token, such as SSN, telephone number, date, or ZIP code) to
analyze and classify data. Character investigation provides you with the option
of investigating multiple columns individually (Character Discrete) or
integrated as one unit of data (Character Concatenate).
The
investigation process generates a column frequency report that presents
information about frequency values for a specific column.
- Character Discrete
Investigate
The
Character Discrete Investigate analyzes multiple single-domain columns. This
option allows you to investigate a large number of columns with little effort.
A Character Discrete Investigate produces a column frequency report that treats
each column as a separate token for frequency count and analysis.
- Character Concatenate
Investigate
The
Character Concatenate Investigate option performs cross-column correlations
between multiple columns to determine relationships. With this option, you
select two or more columns from anywhere in the record (the columns do not have
to be contiguous) to be investigated as a single data column. To create the
pattern analysis, the tokens are concatenated with no spaces between the
tokens.
The
Character Investigate process generates a column frequency report that presents
information about frequency values for a specific column. A pattern report is
prepared for all types of investigations and displays the count, percentage of
data that matches this pattern, the generated pattern, and sample data. This
output can be presented in a wide range of formats to conform to
standard
reporting tools.
For
character investigations, you use column masks to select the characters
that are included in the frequency count
or pattern analysis and the characters that are displayed as part of samples in
the pattern report. You apply a mask symbol to each character in the selected columns. You can
use the following mask characters:
- Mask C
Displays
the character and includes it in the frequency count and pattern analysis. Use
the C column mask to inspect the values
in your columns and to certify that false data does not appear in a column such
as 99999 for a postal code or 111111111 for a national ID number.
- Mask T
Displays the type of character and includes the character in the frequency count and pattern analysis. Use the T column mask when you want to inspect the type of data in a character position, for example, telephone number formats such as nnn-nnn-nnnn or (nnn)-nnn-nnnn.
- Mask X
Excludes
the character in the frequency count or the pattern analysis; however, it
includes it in the sample data. Use the X
column mask to
include data from the column in the sample but not as a token or part of the
token for investigation.
For
example, you set up an Investigate job to analyze the first two characters of a
ZIP code
to determine the
frequency distribution based on a state (each state is defined by the first two
characters of the ZIP code). You set the column mask for the ZIP code to CCXXX. The qsInvPattern column of the
pattern report displays only the first two characters. The frequency count is
based on the number of records in the file that start with the first two
characters of the ZIP code. In the qsInvSample column of the pattern report,
you will see all five characters of the ZIP code.
You
can also use the X column mask with the Character Concatenate option to specify
one or more columns to appear as part of the sample only.
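A minimal Python sketch of this masking idea, assuming a simplified handling of the C, T, and X mask characters (this is not the Investigate stage implementation), shows how the CCXXX mask from the ZIP code example yields a frequency count on the first two characters while keeping the full value as the sample:

from collections import Counter

def apply_mask(value, mask):
    """Return (pattern_token, sample) for one value under a C/T/X column mask."""
    pattern, sample = "", ""
    for ch, m in zip(value, mask):
        if m == "C":                      # keep the character itself
            pattern += ch
            sample += ch
        elif m == "T":                    # keep only the character type
            pattern += "n" if ch.isdigit() else ("a" if ch.isalpha() else ch)
            sample += ch
        elif m == "X":                    # excluded from the pattern, kept in the sample
            sample += ch
    return pattern, sample

print(apply_mask("01801", "CCXXX"))       # ('01', '01801') - pattern versus full sample

zip_codes = ["01801", "01720", "10024", "99999"]
freq = Counter(apply_mask(z, "CCXXX")[0] for z in zip_codes)
print(freq)   # Counter({'01': 2, '10': 1, '99': 1}) - frequency by the first two ZIP characters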
Character Investigate generates reports with the following fields:
- qsInvColumnName identifies the name of the column that is investigated.
- qsInvPattern displays the value or pattern that is generated for the column and used in the frequency count and pattern analysis.
- qsInvSample shows one or more samples of the content of this
column. The number to be displayed is configurable.
- qsInvCount shows the actual number of occurrences of the
value in the qsInvPattern column.
- qsInvPercent shows the percentage of occurrences of the value in the qsInvPattern column relative to the total number of records in this file.
The
Word Investigation option parses free-form data fields into individual tokens
and analyzes them to create patterns. For example, to create the patterns in
address data, the Word Investigation option uses a set of rules for classifying
personal names, business names, and addresses. The Word Investigation option
also provides frequency counts on the tokens.
The
Investigate stage provides pre-built rule sets for investigating patterns on
names and postal addresses for a number of different countries. For example, for the U.S., the Investigate stage uses the following rule sets:
- USPREP
Name,
address, and area data.
- USNAME
Individual and organization names.
- USADDR
Street and mailing addresses.
- USAREA
City, state, ZIP code, and so on.
The text string “123 St. Virginia St.” is analyzed in the following way:
1. Field parsing breaks the address into the individual tokens 123, St., Virginia, and St.
2. Lexical analysis determines the business significance of each piece:
a. 123 = number
b. St. = street type
c. Virginia = alpha
d. St. = street type
3. Context analysis identifies the various data structures and content as 123 St. Virginia, St.:
a. 123 = House number
b. St. Virginia = Street address
c. St. = Street type
When
you specify Word Investigate, you select the rule by which you want the columns
investigated and then select one or more columns to examine.
Word Investigate parses the free-form data column into individual elements, or tokens, each of which is a word, a number, or a mixture, separated by one or more spaces or special characters. The process compares each token with the classified tokens in the Classification table for that rule set. If the token matches a word in the Classification table, Investigate assigns the class for that token to represent it in the pattern. For tokens that do not match any classified token, Investigate examines the pattern and assigns classes as shown in the following table.
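As a toy illustration of this parse-and-classify behavior (not the actual rule set logic), the following Python sketch tokenizes the earlier “123 St. Virginia St.” example against a miniature, hypothetical classification table and builds a pattern, using “?” for unclassified alpha tokens and “^” for unclassified numeric tokens as described above.

import re

# A miniature, hypothetical classification table; real rule sets ship with much larger ones.
CLASSIFICATION = {"ST": "T", "STREET": "T", "AVE": "T", "N": "D", "APT": "U"}

def investigate_words(text):
    """Parse a free-form value into tokens and build a pattern from their classes."""
    tokens = re.split(r"[\s,]+", text.strip())
    pattern = []
    for tok in tokens:
        key = tok.upper().rstrip(".")
        if key in CLASSIFICATION:
            pattern.append(CLASSIFICATION[key])      # classified token
        elif key.isdigit():
            pattern.append("^")                      # unclassified numeric
        else:
            pattern.append("?")                      # unclassified alpha
    return tokens, "".join(pattern)

print(investigate_words("123 St. Virginia St."))
# (['123', 'St.', 'Virginia', 'St.'], '^T?T')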
Word Investigate generates two reports (a Token report and a Pattern report) as follows:
Token report
- qsInvCount indicates the number of times this token was
encountered across the entire input data.
- qsInvWord identifies the individual token or word value
that is found inside the selected input columns.
- qsInvClassCode identifies the classification
of the token that is based on the selected rule set classification table.
Unclassified tokens, if selected, get a question mark “?” for alpha or a caret “^” for numeric.
Pattern report
- qsInvColumnName identifies the name of the column that is investigated.
- qsInvPattern displays the value or pattern that is generated for the column and used in the frequency count and pattern analysis.
- qsInvSample shows one or more samples of the content of this
column. The number to be displayed is configurable.
- qsInvCount shows the actual number of occurrences of the
value in the qsInvPattern column.
- qsInvPercent shows the percentage of occurrences of the value in the qsInvPattern column relative to the total number of records in this file.
Standardizing
data involves moving free-form data (columns that contain more than one data
entry) into fixed columns and manipulating data to conform to standard
conventions. The process identifies and corrects invalid values, standardizes
spelling formats and abbreviations, and validates the format and content of the
data.
The
Standardize stage builds on the interpretation of the data during the
Investigate stage. The Standardize stage uses the same prebuilt tables and rule
sets that the Investigate stage used to investigate the data to standardize the
data. Standardize reformats data from multiple systems and creates a consistent
data presentation with fixed and discrete columns, according to your company
requirements.
The Standardize stage processes
the data with the following outcome:
- Creates fixed-column, addressable data
- Facilitates effective matching
- Enables output formatting
The
Standardize stage uses the data content and placement within the record context
to determine the meaning of each data element. To correctly parse and identify
each element or token and place them in the appropriate column in the output
file, the Standardize stage uses rule sets to support specific conditioning and
standardization requirements. These can be standard rule sets that are designed
to comply with the name (individual and business) and address conventions of a
specific country. Alternatively, they can be custom rule sets to focus on
challenges such as product or part descriptions, consistent metadata
definitions, insurance claim data, or other industry specific challenges. The
Standardize rule sets can assimilate the data and append additional information
from the input data, such as gender.
These
rule sets are the same as those used in the Investigate stage. You can run the
rules as they are shipped, or you can customize them to process obscure data
not covered by the standard rule sets. You can also create your own custom rule
sets from scratch.
Standardize
ensures that each data type has the same content and format. Standardized data
is important for the following reasons:
- It enables effective matching of data
- It facilitates a consistent format for the output data
The
Standardize stage parses free-form and fixed-format columns into single-domain
columns to create a consistent representation of the input data.
- Free-form columns contain
alphanumeric information of any length as long as it is less than or equal
to the maximum column length defined for that column.
- Fixed-format columns contain
only one specific type of information, such as only numeric, character, or
alphanumeric, and have a specific format.
The
following rules are examples of the types of rules that the Standardize stage
might support depending on the rule set in use:
- Assign data to its
appropriate metadata fields. Standardize ensures that the data within a
specific field is being used for the business purpose defined in the
metadata. For example, credit records might have a driver’s license number
in the address line 1 field and a customer’s address in the address line 2
field. To synchronize data with its appropriate metadata field, the
driver’s license number can be moved to a separate field for the driver’s
licenses.
- Decompose free-form fields
into single component fields. For example, the customer’s address can be
decomposed into House number, Street name, PO Box, Rural Route, and other
smaller component fields.
- Identify new data fields based on the underlying data. New fields that do not exist on input, such as Gender Flag, Individual/Business Record Indicator, or Nickname, can be populated by the application based on table or file look-ups.
- Break up records storing
multiple entities. It might be necessary to create a separate record for
each person or entity that is represented on a single input record (such
as joint accounts). A separate record allows for a more complete linkage
of all entities in the input files.
- Exclude records that do not
meet minimum criteria. Based on defined business rules, the application
can be required to exclude or reject records that do not meet basic
requirements (for example, records that do not contain a name or address).
The
Standardize stage takes a single input, which can be a link from any database
connector supported by IBM WebSphere DataStage, a flat file or data set, or any
processing stage. It is not necessary to restrict the data to fixed-length
columns. Standardize will accept all basic data types (non-vector or
non-aggregate) other than binary.
The
Standardize stage has only one output link. This link can send the raw input
and the standardized output to any other stage. The Standardize stage creates
multiple columns that you can send along with the input columns to the output
link. Any columns from the original input can be written to the output along
with additional data created by the Standardize stage based on the input data
(such as SOUNDEX phonetic or NYSIIS codes). The Match stage and other stages
can use the output from the Standardize stage—you can use any of the additional
data for blocking and matching columns in Match stages.
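As a highly simplified illustration of what a standardization rule does (the shipped rule sets are far more sophisticated), the following Python sketch parses a free-form address line into fixed, single-domain columns and normalizes a street-type abbreviation; the column names and the lookup table are hypothetical.

import re

STREET_TYPES = {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD"}   # tiny, hypothetical lookup

def standardize_address(free_form):
    """Parse one free-form address line into fixed, single-domain output columns."""
    tokens = re.split(r"\s+", free_form.upper().replace(".", "").strip())
    record = {"HouseNumber": "", "StreetName": "", "StreetType": ""}
    if tokens and tokens[0].isdigit():
        record["HouseNumber"] = tokens.pop(0)
    if tokens and tokens[-1] in STREET_TYPES:
        record["StreetType"] = STREET_TYPES[tokens.pop()]        # standardize the abbreviation
    record["StreetName"] = " ".join(tokens)
    return record

print(standardize_address("123 Main St."))
# {'HouseNumber': '123', 'StreetName': 'MAIN', 'StreetType': 'STREET'}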
The
Standardize stage has three categories of rule sets:
- Domain Pre-processor
For
a specific country, the Standardize stage identifies and assigns a data domain
to the name, address, and area columns in each record. You can use the output
from this file as the input to the country-appropriate Domain-Specific rule
sets.
There
is one rule set for each country.
These rule sets evaluate the mixed-domain input from a file for a specific country. Domain-preprocessor rule sets follow a naming convention that starts with a country abbreviation and ends with prep (an abbreviation for preprocessor), such as USPREP for the U.S. and GBPREP for Great Britain.
Note: These rule sets do not perform standardization but parse the columns in each record and each token into one of the appropriate domain-specific column sets, which are Name, Area, or Address.
The domain-preprocessor rule sets do not assume a data domain with a column position. You need to delimit every column or group of columns with literals, as shown in the Standardize Rule Process window in the following figure.
·
Domain
Specific
These rule sets evaluate the domain-specific input from a file for a specific country. There are three domain-specific rule sets for each country as follows:
Ø Name including individual names,
organization names, attention instructions, and secondary names.
Ø Address including unit number,
street name, type, and directionals.
Ø Area including cities, states,
region, and other locale information.
This
category creates consistent and industry-standard data storage structures, and
matching structures such as blocking keys and primary match keys.
·
Validation
The
Validation rule sets are used to standardize common business data including
Date, Email Address, Phone Number, and Taxpayer ID/Social Security Number.
These rules are configured for U.S.
formats.
The
rule sets output two types of columns as follows:
Ø Business Intelligence columns
which help focus on critical information contained within output data. For
example, valid data (VD) and valid flag (VF) columns.
Ø Reporting/Error columns which
provide details on the data value that fails validation and the reason code
detailing the validation error.
There are four validation rule sets: VDATE, VEMAIL, VPHONE, and VTAXID.

· VDATE

Validates dates that include day, month, and year. The following information pertains to the VDATE validation rule set:

Ø Punctuation, such as hyphens or slashes, is removed during the parsing step.
Ø The rule set outputs two types of columns: Business Intelligence columns and Reporting/Error columns.
Ø There are no significant default Classification table entries used with this rule set.
Ø The standard output format is CCYYMMDD.

Specific formats are required for the input data. Expected input date formats include any of the following:

Ø mmddccyy (09211991)
Ø mmmddccyy (OCT021983)
Ø mmmdccyy (OCT21983)
Ø mmddccyy (04101986)
Ø mm/dd/ccyy (10/23/1960)
Ø m/d/ccyy (1/3/1960)
Ø mm/d/ccyy (10/3/1960)
Ø m/dd/ccyy (1/13/1960)
Ø mm-dd-ccyy (04-01-1960)
Ø m-d-ccyy (1-3-1960)
Ø m-dd-ccyy (1-13-1960)
Ø ccyy-mm-dd (1990-10-22)

Examples of the output string corresponding to a particular input format are as follows:

Ø Input format 1990-10-22; output result 19901022
Ø Input format 1/13/1960; output result 19600113
Ø Input format OCT021983; output result 19831002

If a data value passes validation, this rule set populates the following two Business Intelligence column values:

Ø The valid date {VD} data column, which is populated with the eight numeric bytes
Ø The valid flag {VF} field, which is populated with the value T

If a data value fails validation, this rule set populates the Reporting Error fields “Invalid Data” (which has the invalid data value) and “Invalid Reason”, which has a code such as IF (invalid input format), IM (invalid month), IT (invalid table, such as a date of 11111111), MM (invalid numeric month), FB (invalid day of February, leap year), M0 (invalid day for months with 30 days), and M1 (invalid day for months with 31 days).

· VEMAIL

The VEMAIL rule set identifies the format, components, and completeness of e-mail addresses as follows:

Ø All e-mail addresses should have a user, domain, and top-level qualifier.
Ø Punctuation such as hyphens (-), at signs (@), and periods (.) is used as key delimiters during the parsing step.
Ø The default classification table for this rule set contains common domain qualifiers (for instance, ORG, COM, EDU, GOV, and so forth) and sub-domain qualifiers (for example, country and state codes).

The parsing parameters parse the address into multiple tokens (for example, the e-mail address John_Smith@abccorp.com has three tokens: John_Smith, abccorp, and com).

If a data value is validated, this rule set populates the following Business Intelligence fields:

Ø User {US}
Ø Domain {DM}
Ø Top-level Qualifier {TL}
Ø URL {RL}

If a data value fails validation, this rule set outputs the Reporting Error fields “Unhandled Data” (contains the unhandled data value) and “Unhandled Patterns” (contains the unhandled pattern).

· VPHONE

The VPHONE rule set validates the value and standardizes the format of a telephone number. Punctuation such as hyphens (-) and parentheses ( ) is removed during the parsing step.

If the data value passes validation, this rule set outputs the following Business Intelligence field values:

Ø Valid Phone Number {VD} field, which is populated with the numeric telephone number
Ø Phone Number Extension {VX} field, which is populated with the extension number
Ø Valid flag {VF} field, which is populated with T

If the data value fails any one of the validation requirements, the “Invalid Data” and “Invalid Reason” fields are populated. Invalid Reason has a code such as IL (invalid length: the main telephone number without extension must be 7 or 10 bytes), IT (invalid value), and IP (invalid pattern or format).

· VTAXID

The VTAXID rule set validates the value and standardizes the format of a tax ID or national ID number as follows:

Ø Punctuation such as hyphens (-) is removed during the parsing step.
Ø There are no significant default Classification table entries used with this rule set.

If a data value passes the listed criteria, it is considered a valid value, and the rule set outputs two Business Intelligence field values:

Ø TAX_ID/SSN valid data {VD} field, which is populated with the nine numeric bytes
Ø Valid flag {VF} field, which is populated with the value T

If the data value fails any one of the validation requirements, the “Invalid Data” and “Invalid Reason” fields are populated. Invalid Reason codes include IP (data value did not contain nine, and only nine, numeric characters), IT (data value was found on the Invalid Data table), and Z3 (first three numeric characters are all zeros).
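A minimal Python sketch of the kind of checking a date validation rule set performs, using a few of the input formats listed above, might look as follows. It is not the VDATE implementation: it collapses every failure into the single reason code IF, whereas the real rule set distinguishes codes such as IM and FB.

from datetime import datetime

# A few of the expected input formats listed above (strptime equivalents).
INPUT_FORMATS = ["%m%d%Y", "%b%d%Y", "%m/%d/%Y", "%m-%d-%Y", "%Y-%m-%d"]

def validate_date(value):
    """Return (VD, VF, invalid_reason): CCYYMMDD plus a T flag, or a reason code on failure."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
            return parsed.strftime("%Y%m%d"), "T", ""   # standard output format CCYYMMDD
        except ValueError:
            continue
    return "", "", "IF"                                  # IF: invalid input format

print(validate_date("1/13/1960"))     # ('19600113', 'T', '')
print(validate_date("OCT021983"))     # ('19831002', 'T', '')
print(validate_date("22 Oct 1990"))   # ('', '', 'IF')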
Data
matching finds records in a single data source or independent data sources that
refer to the same entity (such as a person, organization, location, product, or
material) even if there is no predetermined key.
To
increase its usability and completeness, data can be consolidated or linked
(matched) along any relationship, such as a common person, business, place,
product, part, or event. You can also use matching to find duplicate entities that are caused by data entry variations or account-oriented business practices.
During
the data matching stage, IBM WebSphere QualityStage takes the following
actions:
- Identifies duplicate records
(such as customers, suppliers, products, or parts) within one or more data
sources
- Provides householding for
individuals (such as a family or group of individuals at a location) and
householding for commercial entities (multiple businesses in the same
location or different locations)
- Enables the creation of match
groups across data sources that might or might not have a predetermined
key
There
are two types of match stage:
- Unduplicate Match locates and groups all
similar records within a single input data source. This process identifies
potential duplicate records, which might then be removed.
- Reference Match identifies relationships
among records in two data sources.
Matching
is a two-step process: first you block records and then you match them.
Blocking
provides a method of limiting the number of pairs to examine. When you
partition data sources into mutually-exclusive and exhaustive subsets and only
search for matches within a subset, the process of matching becomes manageable.
Basic
blocking concepts include:
- Blocking partitions the
sources into subsets that make computation feasible.
- Block size is the single most
important factor in match performance.
- Blocks should be as small as possible without causing block overflows. Smaller blocks are more efficient than larger blocks during matching.
To understand
the concept of blocking, consider a column that contains age data. If there are
100 possible ages, blocking partitions the source data into 100 subsets. The
first subset is all people with an age of zero, the next is people with an age
of 1, and so on. These subsets are called blocks.
If the
age values are uniformly distributed, 10 records out of the 1000-record source
contain data for people of age 0 on each source, 10 records for people of age
1, and so on.
The
pairs of records to be compared are taken from records in the same block:
- The first block consists of
all people of age 0 on each data source. This value is 10 times 10 or 100
record pairs.
- The second block consists of
all people on each data source with an age of 1.
When
the process is complete, you compared 100 (blocks) x 100 (pairs in a block) =
10 000 pairs, rather than the 1 000 000 record pairs that are required without
blocking.
You can
also combine multiple blocking variables into a single block for a single pass.
For example, blocking on age and gender divides the sources into sets of
0-year-old males, 0-year-old females, 1-year-old males, 1-year-old females, and
so on.
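To make the pair-count arithmetic above concrete, the following Python sketch builds blocks over two hypothetical 1000-record sources and counts the candidate pairs. The record layout and blocking columns are invented for illustration, and this is not how the engine implements blocking internally.

from collections import defaultdict
from itertools import product

def candidate_pairs(source_a, source_b, block_keys):
    """Pair up only those records that share the same values in the blocking columns."""
    blocks_a, blocks_b = defaultdict(list), defaultdict(list)
    for rec in source_a:
        blocks_a[tuple(rec[k] for k in block_keys)].append(rec)
    for rec in source_b:
        blocks_b[tuple(rec[k] for k in block_keys)].append(rec)
    for key, recs_a in blocks_a.items():
        for pair in product(recs_a, blocks_b.get(key, [])):
            yield pair

# Two hypothetical 1000-record sources with uniformly distributed ages 0-99.
a = [{"id": i, "age": i % 100, "gender": "MF"[(i // 100) % 2]} for i in range(1000)]
b = [{"id": i, "age": i % 100, "gender": "MF"[(i // 100) % 2]} for i in range(1000)]

print(sum(1 for _ in candidate_pairs(a, b, ["age"])))             # 10000 instead of 1000000
print(sum(1 for _ in candidate_pairs(a, b, ["age", "gender"])))   # 5000: a finer partition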
Blocking
creates a set of records that all have the same values in the blocking
variables. Records in different blocks (which means they do not have the same
values in the blocking variables) are classified as unmatched automatically. If
you believe that the unmatched records might still include matches, then a
second pass can be defined with the new blocking variables to look for matches
in the newly constituted blocks.
Blocking
strategies are used to limit the number of records to compare. Smaller blocks
are many times more efficient than larger blocks. Use restrictive blocking
schemes in the first pass with blocking variables (such as SSN) that are
expected to match records definitively. Unmatched records can then be subject
to less restrictive blocking variables, such as first and last name. Thus, progressively fewer unmatched records are subject to matching in subsequent passes that use less and less restrictive blocking variables. The reason for having multiple passes is to provide for cases where the data in the blocking variables is error-prone.
The
matcher computes composite weights for all pairs in the set. IBM WebSphere
QualityStage then finds the mathematically optimal sets of matches and
duplicates on each source.
To do
this, the matcher maintains a matrix of weights that can be optimized. This
matrix is size limited so the number of records in a set is limited. If the
matrix size is exceeded, all records in the set are skipped and must be handled
by a subsequent matching pass.
- Avoiding block overflow
To
avoid block overflow, the number of records to be compared in one block must
not exceed system memory on the server and the total number of records on
either source in one block must not exceed the block overflow setting.
You can determine whether block overflow has occurred in the Match Designer tool's Total Statistics panel, as shown in the following figure. This information is provided for each match pass.
·
Applying
blocking variables
The
best blocking variables are those with the largest number of values possible
and the highest reliability. Base
a blocking strategy on the following principles:
Ø Make blocking sets as small as
possible. A good size is 10 to 20 records per source. Efficiency becomes quite
poor when blocks exceed 100 records per source.
Ø Use multiple passes to define
different sets.
Ø Find easy matches quickly and then
widen the net in your subsequent passes.
The
match process only looks for matches in records that have the same values in
the blocking columns.
After you create a block of records, the Match stage compares the columns that you specified as matching columns (in the Match specification) to determine the best match for a record.
To
determine whether a record is a match, the Match stage performs the following
steps:
·
Calculates
a weight for each comparison, according to the probability associated with each
column. The Match stage uses two probabilities for each column:
Ø The m probability reflects the error
rate for the column.
You
set the probability that a column agrees provided that the record pair is a
match.
The
m probability is one minus the error
rate of the column. If a column in a sample of matched records disagrees 10% of
the time, the m probability for this column is 1 -
0.1, or 0.9. The closer the m
probability is to
one, the more critical is a disagreement on the column. You can use a very high
m probability to force columns that
are very important to have a high penalty for disagreeing.
Ø The u probability is the likelihood that a column agrees at random.
If
a u probability is high, the weight is
low. The u probability is the probability
that the column agrees, provided that the record pair does not match. The u probability is the probability
that the column agrees at random.
For
example, the probability that gender agrees at random is 50% of the time. With
a uniform distribution, you have four combinations in which gender agrees in
two of the four, or a 0.5 u
probability. If
you assign a high u probability, a low weight is
calculated when a column agrees.
During the match run, the Match stage calculates a u probability for each comparison using the frequency distribution statistics for the column, described in “Match Frequency Stage”. Usually your concern for the u probability is to provide a base point between a rare event (1 out of the total number of records) and a common event (1 out of 2). A value of 0.1 or 0.01 is a typical starting point. If you need to control the u probability (such as for columns for an individual identification number, a national ID number, or a Patient Identification Number), specify the vartype to be NOFREQ, which indicates that frequency distribution information should not be used by the Match stage.
·
Weights
are used in the Match stage to determine by what percentage a column matches
and to verify that the matches have a high degree of agreement. For each
matching column, the Match stage computes a weight using the following
equations:
Ø An agreement weight is computed
when the comparison between a pair of columns agrees.
log2(m probability / u
probability)
Ø A disagreement weight is computed
when the comparison between the pair of columns disagrees.
log2((1 - m probability)/(1 - u
probability))
The higher the m probability, the higher the disagreement weight for a field that does not match, because errors are relatively rare events; the lower the u probability, the greater the potential weight for the comparison. Additionally, the calculated weight or penalty represents a maximum or minimum of the scoring range. Specific match comparisons can generate a score within that range depending on the comparison used.
The Match stage adds the weights assigned to each column comparison to obtain a composite weight for the record pair: for each column, the agreement weight is added if the column pair agrees, and the disagreement weight is added if it disagrees. The higher the composite weight, the greater the agreement. A worked sketch of these weights and the cutoffs appears after the list below.
·
In
the Match specification, you can specify cutoff threshold as follows:
Ø Match Cutoff threshold
When
a record pair receives a composite weight greater than or equal to this weight,
the pair is declared a match.
Ø Clerical Cutoff threshold
When
a record pair receives a composite weight greater than or equal to this weight
and less than the match cutoff, the pair is marked for clerical review. This
weight must be equal to or less than the match cutoff weight. If you do not
want a clerical review, set the clerical cutoff equal to the match cutoff.
Ø Duplicate threshold
The
lowest weight that a record pair can have to be considered a duplicate. This
cutoff weight is optional and must be higher than the match cutoff weight.
This
cutoff is only used with Reference Match and not with Unduplicate Match.
These
thresholds are used to send the results of matching to the appropriate outputs.
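As a rough illustration of how the weights and cutoffs described above fit together, the following Python sketch scores one record pair with the agreement and disagreement formulas and then routes it by the cutoff values. The column names, m and u probabilities, and cutoff values are hypothetical, and the per-column comparison here is exact equality, whereas the real match comparisons tolerate errors.

from math import log2

# Hypothetical m and u probabilities for each matching column.
COLUMNS = {"surname": (0.95, 0.01), "gender": (0.90, 0.50), "zip": (0.90, 0.10)}

def composite_weight(rec_a, rec_b):
    """Sum the agreement or disagreement weight of each column comparison."""
    total = 0.0
    for col, (m, u) in COLUMNS.items():
        if rec_a[col] == rec_b[col]:
            total += log2(m / u)                 # agreement weight
        else:
            total += log2((1 - m) / (1 - u))     # disagreement weight
    return total

def classify(weight, match_cutoff=5.0, clerical_cutoff=2.0):
    """Route a pair by its composite weight using the match and clerical cutoffs."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical review"
    return "nonmatch"

a = {"surname": "SMITH", "gender": "M", "zip": "01801"}
b = {"surname": "SMITH", "gender": "M", "zip": "01720"}
w = composite_weight(a, b)                       # 6.57 + 0.85 - 3.17 = 4.25
print(round(w, 2), classify(w))                  # 4.25 clerical review

With the clerical cutoff set equal to the match cutoff, as suggested above when no clerical review is wanted, the same pair would simply be declared a nonmatch.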
You use the Match Designer tool to set which columns are to be compared for the matching process. When the match specification is completed, you can use it in the Unduplicate Match, Reference Match, and Match Frequency stages. You can then apply the results from these match stages to the next stage of your project, such as Survive, or load them into a database.
The
Match Frequency stage gives you direct control over the disposition of
generated frequency data. This stage provides results that can be used by the
Match Designer tool and match stages, but enables you to generate the frequency
data independent of running the matches. You can generate frequency information
by using any data that provides the fields that are needed by a match. Then you
can let the generated frequency data flow into a match stage, store it for
later use, or both.
The
Match Frequency stage takes one input data file. This data file can be from one
of the following places:
·
A
link from a sample data set file for use in the Match Designer tool
·
A
link from any parallel database, file or processing stage that carries
information you need for the following stages:
Ø An unduplicate match
Ø The data source input of a
reference match
Ø The reference source input of a
reference match
The
Match Frequency stage takes a single output link that carries four columns:
- qsFreqVal
- qsFreqCounts
- qsFreqColumnID
- qsFreqHeaderFlag
The data in these columns provides the necessary information to give input to the Match Designer tool and to the Unduplicate Match and Reference Match stages.
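The following Python sketch only illustrates the idea of generating per-column frequency data once and reusing it; it does not reproduce the actual layout of the qsFreqVal, qsFreqCounts, qsFreqColumnID, and qsFreqHeaderFlag columns, and the record structure shown is hypothetical.

from collections import Counter

def column_frequencies(records, columns):
    """Generate (column, value, count) rows - an illustration of reusable frequency data."""
    for col in columns:
        counts = Counter(rec[col] for rec in records)
        for value, count in counts.most_common():
            yield {"column": col, "value": value, "count": count}

data = [{"gender": "M", "zip": "01801"},
        {"gender": "F", "zip": "01801"},
        {"gender": "M", "zip": "10024"}]
for row in column_frequencies(data, ["gender", "zip"]):
    print(row)
# {'column': 'gender', 'value': 'M', 'count': 2} ... and so on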
When
you are configuring the Match Frequency stage, you can choose from the
following options:
- Designate a match
specification from the repository, so that frequency information is
generated only for the match columns in the specification.
- Do not designate a match
specification, thereby generating frequency information for all columns of the input.
When
multiple match specifications are defined for the same table, it is more
efficient to generate a single match frequency output for all the columns in
the table, and then use it in the different match stages.
- Increase the number of
frequencies reported in the output. (The default is 100.) You should
increase this number if you are processing a large number of records.
The Match Designer is a tool for creating a match specification.
When you create the match specification, you can select from more than 25 types of comparisons, such as CHAR (character comparison) and UNCERT (character uncertainty comparison). The comparisons are algorithms that are based on the type of data in the columns. The most common comparisons, and where they are used, are as follows:
Common comparisons | Where mostly used
CHAR (character comparison): compares character by character, left to right. | This is most useful to ensure a direct comparison, particularly on code-related fields such as Gender.
CNT_DIFF (count differences): tolerates a set number of keystroke errors (typically 1-2) on a prorated basis. | This is most commonly used with data such as telephone number or SSN, where there is a higher probability of keystroke errors.
DATE8 (basic tolerance of date errors): tolerates a set number of variations in days on a prorated basis. | This is most commonly used with data such as date of birth or date of encounter, where a slight error in the day entered is possible.
MULT_UNCERT (word uncertainty in a single field): tolerates phonetic errors, transpositions, and random insertion, deletion, and replacement of characters and words, using a comparison threshold. | This is the most common comparison for business names, arrays of identifiers, or other descriptive data where multiple words exist that can vary in position.
NAME_UNCERT (shortest value character uncertainty): tolerates phonetic errors, transpositions, and random insertion, deletion, and replacement of characters, using a comparison threshold, but only to a maximum of the shortest value length. | This is the standard comparison for personal names.
PREFIX (shortest value character comparison): compares character by character, left to right, to a maximum of the shortest value length. | This is most useful to ensure matching of truncated data such as a middle initial.
UNCERT (character uncertainty): tolerates phonetic errors, transpositions, and random insertion, deletion, and replacement of characters, using a comparison threshold. | This is the standard comparison for street names or other basic text data.
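To make these comparison types more concrete, the following sketch approximates a few of them with simple Python stand-ins. These are illustrative only; the actual QualityStage comparison algorithms and their scoring are more sophisticated than the functions shown here.

# Illustrative approximations of a few comparison types; simple stand-ins
# chosen for readability, not fidelity to the product's algorithms.
from difflib import SequenceMatcher

def char_compare(a: str, b: str) -> bool:
    """CHAR-style: exact character-by-character comparison."""
    return a == b

def prefix_compare(a: str, b: str) -> bool:
    """PREFIX-style: compare left to right up to the shorter value's length."""
    n = min(len(a), len(b))
    return n > 0 and a[:n] == b[:n]

def cnt_diff_compare(a: str, b: str, tolerated_errors: int = 1) -> bool:
    """CNT_DIFF-style: tolerate a small number of differing positions."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= tolerated_errors

def uncert_compare(a: str, b: str, threshold: float = 0.8) -> bool:
    """UNCERT-style: fuzzy similarity score compared against a threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(char_compare("M", "M"))                     # True
print(prefix_compare("J", "JOHN"))                # True (truncated middle initial)
print(cnt_diff_compare("5551234", "5551243", 2))  # True (two keystroke errors)
print(uncert_compare("NIKSON", "NICKSON"))        # True (typo/phonetic variation)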
You can
use the Match Designer to create multiple match specifications that include one
or more passes. Each pass is separately defined and is stored in the repository
to be reused. The following figure shows a panel that specifies the match column
(GENDER), the comparison type (CHAR, character comparison), and the m (.9) and u (.01) probabilities that were chosen.
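The m probability is the likelihood that a column agrees when the record pair is a true match, and the u probability is the likelihood that it agrees by chance on a non-match. In the probabilistic (Fellegi-Sunter style) model that this kind of weighting follows, those probabilities translate into per-column agreement and disagreement weights that contribute to the pair's composite weight, roughly as sketched below. Treat the formulas as the standard textbook form, not as the product's exact internals.

import math

def agreement_weight(m: float, u: float) -> float:
    # Weight added to the composite weight when the column values agree.
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    # Negative weight added when the column values disagree.
    return math.log2((1 - m) / (1 - u))

# Using the GENDER example above: m = 0.9, u = 0.01
print(round(agreement_weight(0.9, 0.01), 2))     # ~6.49
print(round(disagreement_weight(0.9, 0.01), 2))  # ~-3.31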
You can
run each pass on the complete source data or test data that is created from a
representative subset of your production data and view the results in a variety
of graphic displays. You can also use frequency information that is generated
by the Match Frequency stage to help create your match specifications.
You
create a separate match specification for each type of match, as follows:
·
For
Unduplicate Match, you can create a match specification that is one of the
following types:
Ø Unduplicate (dependent) removes
duplicates from match consideration in subsequent passes
Ø Unduplicate Independent includes
all duplicates in subsequent passes
·
For
Reference Match, you can create a match specification that is one of the
following types:
Ø One-to-One to create a two-file
match specification based on a one-to-one relationship. In this match type, a
record on the reference source can match only one data source record.
Ø Reference to create a two-file
match specification based on a many-to-one relationship.
There
are multiple flavors of this type as follows:
§ Many-to-One in which a record on
the reference source can match many data source records. However, any one data
source record can only match one reference source record.
For
example, 101 Main St.
on the data source matches to two records on the reference source: 101-199 Main St SW
and 101-199 Main St SE .
The first reference source record is the matched record and the other is
disregarded.
§ Many-to-one Multiple in which each
reference source record that scores the same weight as the matched pair when it
is compared against the data record is flagged as a duplicate record. Any one
data source record might match more than one reference source record. For
example, 101 Main St. on the data source matches two records on the reference
source: 101-199 Main St SW and 101-199 Main St SE. One reference source record
is the matched record and the other is the duplicate.
§ Many-to-one Duplicates is similar
to the Many-to-one Multiple option, except that additional reference source
records that match to a level above the duplicate cutoff value are flagged as
duplicates. Thus, records with lower weights than the match weight can be
flagged as duplicates.
For
example, 101 Main St on the data source matches three records on the reference
source: 101-199 Main St SW, 101-199 Main St SE, and 101 Main Rd. The record
101-199 Main St SW is the matched record, and both of the other addresses can
be flagged as duplicates.
The
inputs to the Match Designer tool are:
- Input data source which is
usually a representative subset of the data or reference source
- Frequency information from
the Match Frequency stage as described in “Match Frequency Stage”
The
output of the Match Designer tool is a test results database—you must provide
the ODBC Data Source Name (DSN) of this test results database (on the client
computer), and the user name and password for accessing it.
The
main area of the Match Designer is made up of two tabs:
·
Compose
tab
On
the Compose tab, you design and fine-tune the Match passes and add them to the
Match job. You can add, delete, and modify Match passes in this section. For
each Match pass, you can specify blocking fields, matching fields, and cutoff
weights, and view the weight histogram, data results, and Match pass
statistics.
Ø The Match Type area is a design
sandbox that displays the current Match job. In this area, you can rearrange
the order in which the Match passes run in the Match job, add or remove passes,
and create new passes.
Ø The Match Pass Holding area is
used to keep iterations of a particular pass definition or alternate approaches
to a pass. The passes in the holding area do not run as part of the match job,
but can be tested in isolation. You can add any of the Match passes in the
holding area to the Match job by moving the Match pass to the Match Type area.
Also, you can remove any of the Match passes from the Match job by moving it
from the type area into the Match Pass Holding Area. Any pass, whether in the
type or holding areas, can be test run. This approach lets you perform trial
runs of different pass definitions without losing alternate definitions.
Ø The Blocking Columns area
designates the fields that must match exactly for records to be in the same
processing group for the match. The right pane shows the histogram and data
sections when the run is complete. You can sort and search the data columns
from the match results.
Ø The Cutoff Values area identifies the
match cutoff and the clerical cutoff.
The
match cutoff ensures that any record pairs above the cutoff have a high
probability of being a match and those below the cutoff have a high probability
of not being a match. The composite weights that are assigned to each record
pair create a distribution of scores that range from very high positive to very
high negative.
§ Within the distribution of
positive values, define a value or cutoff at which any record pair receiving a
weight equal to or greater than this cutoff is considered a match.
This
is referred to as the match cutoff.
§ Conversely, define a cutoff at
which any record pair receiving a weight equal to or less than this cutoff is
considered a non-match. Record pairs with weights that fall between these two
cutoff values are considered clerical review cases.
This
is referred to as the clerical cutoff.
If
more than one record pair receives a composite weight higher than the match
cutoff weight, those records are declared duplicates.
The
duplicate cutoff weight is optional and must be
higher than the match cutoff weight. This cutoff is not used with the
Unduplicate Match stage. The way in which duplicate records are handled depends
on the type of matching that you selected.
Any
record pair that falls below the clerical cutoff becomes a residual and is
eligible for the next matching pass (see the sketch after this list).
·
Total
statistics tab
This
tab provides statistical data in a graphical format for all the passes that
run.
The
cumulative statistics are of value only if you test multiple passes
consecutively, in the order that they appear in the match specification. The
Total Statistics page displays the following information:
Ø Cumulative statistics for the
current runs of all passes in the match specification.
Ø Individual statistics for the
current run of each pass.
Ø Charts that compare the statistics
for the current run of all passes.
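As a way to keep the match and clerical cutoffs straight, the following minimal sketch classifies a record pair by its composite weight, using hypothetical cutoff values. It models only the decision logic described above, not the stage itself.

def classify_pair(composite_weight: float,
                  match_cutoff: float,
                  clerical_cutoff: float) -> str:
    """Classify one record pair by its composite weight.

    Weights at or above the match cutoff are matches, weights at or below
    the clerical cutoff are non-matches (residuals for the next pass), and
    weights between the two cutoffs go to clerical review.
    """
    if composite_weight >= match_cutoff:
        return "match"
    if composite_weight <= clerical_cutoff:
        return "residual"
    return "clerical review"

# Hypothetical cutoff values chosen for illustration only.
for weight in (12.4, 7.8, -2.5):
    print(weight, classify_pair(weight, match_cutoff=10.0, clerical_cutoff=5.0))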
The
Unduplicate Match stage locates and groups all similar records within a single
input data source. This process identifies potential duplicate records, which
might then be removed.
An
example is the need to eliminate duplicates from a consolidation of mailing
lists purchased from multiple sources.
The
Unduplicate Match stage accomplishes the following actions:
- Categorizes all records with
weights above the match cutoff as a set of duplicates.
- Identifies a master record by
selecting the record within the set that matches to itself with the
highest weight. The master record is associated with its set of
duplicates.
- Determines that records not
part of a set of duplicates are residuals. The residuals and the master
records are generally made available for the next pass.
- Excludes duplicates in
subsequent passes. However, you can choose the Independent match type
(described later) if you want duplicates to be included in subsequent passes.
The
output of the Unduplicate Match stage can include master records, duplicates
above the match cutoff, clerical duplicates, and residuals. You can use this
output as input to the Survive stage.
The
Unduplicate Match stage uses two match types, as follows:
- Dependent: This match type is
the default choice. With this match type, after the first pass, duplicates
are removed from match consideration in subsequent passes.
- Independent: With this match
type, duplicates are included in subsequent passes.
Across
all passes, all records that match to any given record are also considered
matched to each other. For example, if records A and B match in pass 1, records
B and C match in pass 2, and records C and D match in pass 3, then records A,
B, C and D are matched to each other.
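One way to picture this transitive linking with the Independent match type is as a union of the matched pairs from all passes, as in the following conceptual sketch. It illustrates the grouping behavior only and is not how the stage is implemented.

# Conceptual sketch: pairs matched in individual passes are merged into one
# group by transitive closure, as in the A-B, B-C, C-D example above.
def group_matches(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    groups = {}
    for record in parent:
        groups.setdefault(find(record), set()).add(record)
    return list(groups.values())

# Pass 1 matched A-B, pass 2 matched B-C, pass 3 matched C-D.
print(group_matches([("A", "B"), ("B", "C"), ("C", "D")]))
# -> [{'A', 'B', 'C', 'D'}]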
The
following table shows sample data that illustrates how to use the Independent
match type option with the Unduplicate Match stage. The table shows four
records that describe the same person. You require that all records concerning
the same person match without regard to address.
Record | Name | Address | Tax Identifier
1 | William Nickson | | 123456789
2 | Bill Nixon | |
3 | B Nickson | 978 Sunset Blvd | 123456789
4 | Nickson | | 123456789
The
matching process using this data can yield different results depending on
whether you choose the Dependent or Independent match type:
- Dependent Match
The
first pass blocks and matches on Name and Address. Records 1 and 2 are
considered a matched pair, while records 3 and 4 are considered residuals.
If
Record 2 (without the TaxID) is selected as the master, and Record 1 is
considered a duplicate, then Record 1 is not available for the second pass.
If
the second pass blocks and matches on Name and TaxID, then only Records 3 and 4
match. The result is two groups of matched records: Records 1 and 2, and
Records 3 and 4.
- Independent Match
The
first pass results are the same as the Dependent Match. Records 1 and 2 are
considered a matched pair, and records 3 and 4 are considered residuals.
If
Record 2 (without the TaxID) is selected as the master record in the second
pass, the duplicate record, Record 1, is also compared to the rest of the
records. When you block on Name and TaxID, records 1, 3, and 4 match. Because
Record 1 matched Record 2 in the first pass, the output is one group with all
four records linked.
The
Reference Match stage identifies relationships among records. This match can
group records that are being compared in different ways as follows:
- One-to-many matching
Identifies
all records in one data source that correspond to a record for the same
individual, event, household, or street address in a second data source. Only
one record in the reference source can match one record in the data source
because the matching applies to individual events.
An
example of one-to-many matching is finding the same individual (based on
comparing social security numbers) in two different sources such as a
department of motor vehicles (DMV) list and a voter registration list. There
should be at most only one record matching on social security number in each
list.
- Many-to-one matching
Multiple
records in the data file can match a single record in the reference file. For
example, matching a transaction data source to a master data source allows many
transactions for one person in the master data source.
An
example of many-to-one matching is matching the ZIP codes in a customer file
with the list of valid ZIP codes. More than one record in the customer file can
have the same ZIP code in it.
When
you configure the Reference Match stage, you select an existing Reference-type
match specification that is stored in the repository, select the match type,
and select the match outputs, as shown in the following figure.
The
Reference Match stage requires standardized data and reference data as source
data, a reference match specification, and frequency information for both
sources. Briefly, these inputs are:
- Source data from any parallel
database, file or processing stage. Typically, the source is standardized
data from the Standardize stage.
- Reference source against
which the source data is matched.
- Frequency information for
that data as generated by the Match Frequency stage
- Match specification to group
and match the data as described for the “Match Designer tool”
The
output of the Reference Match stage includes master records, clerical review
records, duplicates, and residuals. You can use this output as input to the
Survive stage. Briefly, the Reference Match stage delivers up to six outputs as
follows:
- Match contains matched
records for both inputs
- Clerical has records that
fall in the clerical range for both inputs
- Data Duplicate contains
duplicates in the data source
- Reference Duplicate contains
duplicates in the reference source
- Data Residual contains
records that are non-matches from the data input
- Reference Residual contains
records that are non-matches from the reference input
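To make the fan-out concrete, the following minimal sketch routes already-classified records to these six outputs. The category labels come from the list above, while the routing function itself is purely illustrative; the real stage derives the categories from the cutoff weights during matching.

# Illustrative routing of classified records to the six Reference Match outputs.
from collections import defaultdict

OUTPUT_LINKS = (
    "Match", "Clerical", "Data Duplicate",
    "Reference Duplicate", "Data Residual", "Reference Residual",
)

def route(records):
    """Group records by their pre-assigned output category."""
    outputs = defaultdict(list)
    for record in records:
        category = record["category"]
        if category not in OUTPUT_LINKS:
            raise ValueError(f"unknown category: {category}")
        outputs[category].append(record)
    return outputs

sample = [
    {"id": 1, "category": "Match"},
    {"id": 2, "category": "Data Residual"},
    {"id": 3, "category": "Reference Duplicate"},
]
for link, recs in route(sample).items():
    print(link, [r["id"] for r in recs])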
The
Survive stage consolidates duplicate records to create a best-of-breed
representation of the matched data, which companies can use to load a master
data record, cross-populate all data sources, or both.
During
the Survive stage, IBM WebSphere QualityStage takes the following actions:
- Replaces existing data with
“better” data from other records based on user specified rules
- Supplies missing values in
one record with values from other records on the same entity
- Populates missing values in
one record with values from corresponding records which have been
identified as a group in the matching stage
- Enriches existing data with
external data
The
Survive stage constructs column values from groups of related or duplicate
records and stores the column values in the survived record (the best result)
from each group.
The
Survive job is the last job in the IBM WebSphere QualityStage workflow and is
usually run after the Unduplicate Match stage job. The output from the
Unduplicate Match stage, and in some cases the Reference Match stage, becomes
the source data that you use for the Survive stage.
The
Survive stage requires one input source, which can be sourced from a database
connector, flat file, data set, or another processing stage. If your input is
the result of a match stage, you need to set up another stage (for example, a
Funnel stage) to combine the master, duplicate records, and clerical reviewed
duplicates into one input source.
While
it is not necessary to process the data through the match stages before you use
the Survive stage, the source data must include related or duplicate groups of
rows. Also, the data must be able to be sorted on one or more columns that
identify each group. These columns are referred to as group keys.
- To order the records, you
sort on the group key or keys so that all records in a group are
contiguous. The Survive stage sorts records automatically if the “Don’t
Pre-sort Input” option is not selected in the Survive Stage window.
However, the automatic sort provides no control over the order of the
records within each group. To control the order within groups, you can
presort the input by using the Sort stage.
- The Survive stage accepts all
basic data types (non-vector, non-aggregate) other than binary.
The
Survive stage can have only one output link. This link can send output to any
other stage. You specify which columns and column values from each group create
the output record for the group. The output record can include an entire input
record, or selected columns from the record, or selected columns from different
records in the group.
- You select column values
based on rules for testing the columns.
A
rule contains a set of conditions and a list of one or more target columns. The
rows are tested one by one in the order in which they appear within the match
set. For each row that is tested, the values of the "test" row are updated.
- To select a best candidate
match, you can specify multiple columns such as the record creation date,
data source from which the record originated, length of data in the
column, and frequency of data in a group
You set
up and configure the Survive stage to create a survived record. You need to
configure the Survive stage before you run a Survive job. Before you can add
source data to the Survive stage, all input records must be combined into one
input source.
When
you configure the Survive stage, you choose simple rules that are provided in
the New Rules window or you select the Complex Survive Expression® to create
your own custom rules. You use some or all of the columns from the source file,
add a rule to each column, and apply the data.
After
the Survive stage processes the records to select the best record, this
information is sent to the target file.
To
consider a target as the best candidate for the output record requires a rule
that comprises one or more targets and a TRUE conditional expression. A
condition is made up of:
- Column names
- Constant or literal values
- Operators that specify
comparison or arithmetic operations
The
Survive stage evaluates the columns to the rule (simple or complex) and selects
those that meet the conditions of the rule as best columns. The Survive stage
reads the first record and evaluates the record according to any rule that you
select. The evaluation process uses the following method:
- If the first record has no
best columns, then the selected rule for the target record is evaluated
against all the columns in the record. If a target record passes the test,
its columns become best columns and a "b" appears in front of the column
names (this is visible only when you build the rule and is not part of the
evaluation process).
- Each subsequent record in the
group is evaluated in relation to the current record. If a target record
passes the test then its columns become the best columns and replace any
existing best columns. If none of the current columns meets the
conditions, the best columns remain unchanged.
- After all records in the
group are evaluated, the values that are designated as the best values are
combined in the output record. Survive continues the process with the next
records in the next group.
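The following sketch mimics that evaluation loop for a single matched group, using two illustrative rules: a Longest Field style rule for NAME and a keep-first-non-blank rule for TAXID. It is a simplified model of the behavior described above (the rule shapes and helper names are assumptions), not QualityStage code.

# Simplified model of best-column selection within one matched group.
# Each rule names a target column and a test that compares the current
# record value (c) with the best value so far (b); the rule shapes echo
# the c."column" / b."column" notation used in the techniques that follow.
def survive(group, rules):
    best = {column: None for column in rules}
    for current in group:                            # records are tested in order
        for column, test in rules.items():
            if test(current.get(column), best[column]):
                best[column] = current.get(column)   # becomes the best column
    return best

rules = {
    # Longest Field style: keep the longer trimmed value.
    "NAME": lambda c, b: c is not None and len(c.strip()) >= len((b or "").strip()),
    # Keep the first non-blank value seen.
    "TAXID": lambda c, b: b in (None, "") and c not in (None, ""),
}

group = [
    {"NAME": "Bill Nixon", "TAXID": ""},
    {"NAME": "William Nickson", "TAXID": "123456789"},
    {"NAME": "Nickson", "TAXID": "123456789"},
]
print(survive(group, rules))
# -> {'NAME': 'William Nickson', 'TAXID': '123456789'}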
To
apply survival rules to the input columns, the Survive stage includes:
·
A
set of predefined Techniques (packaged survive expressions) shown in the
following table from which you can select the desired technique. These
techniques are sometimes referred to as simple
survive rules.
Technique | Pattern
Shortest Field | SIZEOF(TRIM(c."column")) <= SIZEOF(TRIM(b."column"))
Longest Field | SIZEOF(TRIM(c."column")) >= SIZEOF(TRIM(b."column"))
Most Frequent | FREQUENCY
Most Frequent (non blank) | FREQUENCY (skips missing values when counting the most frequent)
Equals | c."column" = "DATA"
Not Equals | c."column" <> "DATA"
Greater Than | c."column" >= "DATA"
Less Than | c."column" <= "DATA"
At Least One | 1 (at least one record survives, regardless of other rules)
You
define a simple rule by specifying each of the following elements:
Ø Target column or columns
Ø Column to analyze
Ø Technique to apply to the column
that is being analyzed
The
rule is defined for you depending on the Technique that you select.
The following
figure shows an example of a simple rule that uses the Longest Field technique.
·
The
Rule Expression Builder for creating your own complex expressions. The rules
created by the Rule Expression Builder are also referred to as complex survive
rules.
In
the Rule Expression Builder, you define a rule by specifying each of the
following elements:
Ø A current record from the Columns
list
Ø One or both functions (SIZEOF and
TRIM)
Ø An operator such as =, <>,
<, >, +, -, *, /, %, AND, OR and NOT
Ø A best record from the Columns
list
Ø One or both functions
In
a complex rule, you can use more than one column (other than the target itself)
to define the rule for a target.
The following
figure shows a sample complex expression rule that was created with the Rule
Expression Builder.