INVESTIGATE STAGE
Understanding your data is a necessary precursor to cleansing and consolidation. You can use a data profiling product such as IBM WebSphere Information Analyzer to perform a comprehensive analysis across many columns and tables to create a direct input into the cleansing and matching process by using shared metadata. The Investigate stage supports the initial level of column or domain analysis and extends it through the analysis of free-form text fields to create this input.
The Investigate stage shows the actual condition of data in
existing sources and identifies and corrects data problems before they corrupt
new systems.
Investigation parses and analyzes free-form fields, creating
patterns to reveal field formats; counts unique values; and classifies data,
that is, assigns a business meaning to each occurrence of a value within a field. The
Investigate stage also allows users to validate field contents in a specific
column or domain. The actual full-volume data files must be investigated to
assure that as many data anomalies as possible are identified. The Investigate
stage involves performing one or more of the following functions:
Parsing
Classifying
Creating Word or Pattern Distributions
Specific objectives of the investigation process include, but are
not limited to, the following:
Investigate data values and anomalies to be considered for special
handling during design or development. Any recommended special handling is
documented in a Business Requirements Document. Examples include:
Out-of-range, default (such as all nines in a telephone number),
or unique values.
Free-form text that requires parsing to improve “addressability”
to key components for optimal matching.
Values that do not match metadata labels and require new field
definitions for matching and load format (such as DBA, C/O, Attention text,
driver's license numbers, or other comment information in a name or address
field).
Values that overlap adjacent fields and thus require a
re-alignment of field content. For example, name information or city, state,
and postal code that extend into street and mailing address fields.
Invalid formats that can cause conversion problems, such as
alphanumeric values in character-only or numeric-only fields.
Blank or missing data.
Special character and punctuation handling.
Discover additional tokens (key words) to add to the
classification table, such as name prefixes, name suffixes, street types, unit
types, and business words.
Verify the usefulness of data fields for the purpose of entity
matching (record linkage). For example, does a Tax ID field contain primarily
blank or invalid values, making it an unreliable field to use in the
identification of a unique customer?
Validate business requirements or assumptions about the data.
Results can lead to changes in the business requirements and can assist in
driving the technical solution.
The Investigate stage takes a single input, which can be a link
from any database connector that is supported by IBM WebSphere DataStage, from
a flat file or data set, or from any processing stage. Inputs to the Investigate
stage can be fixed-length or variable-length.
Use of investigation (or data profiling) is critical to assess how
data can be used for further steps of standardization, matching, and
survivorship or consolidation. Data that is used in these additional steps is
stored in two types of fields: single-domain and multiple-domain.
During the investigation process, it is important to document
fields that require parsing and standardization, standardization only, or no
standardization at all. You must also consider whether data content will be
used for matching, survivorship, pass-through (static data) or whether it will
not be carried forward into the new system or target database.
Static fields
Can be used in the matching process.
Might require some standardization of format and default or out of
range values.
Typically re-joined with re-engineered data just prior to matching
or load formatting.
Character investigation is usually sufficient to evaluate these fields.
Single-domain fields
Categorized as either Entity Identifiers or Entity Clarifiers:
Entity Identifiers
Examples include ZIP Code (U.S. postal code), Social Security
Number (SSN), TIN, and telephone numbers. These can serve as critical match
fields if their quality is high. These fields typically have a specific format
(such as 9-digit numeric for SSN, or Alpha-Numeric-Alpha-Numeric-Alpha-Numeric
for Canadian postal code). Character Discrete Investigate is commonly used,
with the mask set to all Cs (see the sketch after this list).
Entity Clarifiers
Examples include name prefix, gender, and marital status. This
type of field usually has a limited number of known values. Character Discrete
Investigate is commonly used, with the mask set to all Ts.
Typically used for a single purpose.
Can exist in several common formats.
No parsing is usually necessary.
Can serve as strong blocking and match fields if quality is high.
The Standardize stage usually requires only removal of special
characters, creation of a common default value and format, and
concatenation of data.
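QualityStage performs these investigations inside the stage itself, but the idea behind a Character Discrete investigation of an identifier field is easy to illustrate. The following minimal Python sketch is our illustration, not QualityStage code, and the column values are made up. It shows how a C-mask view (actual values) surfaces defaults such as all nines, and how a T-mask view (patterns) surfaces format anomalies in an SSN field:

```python
# Minimal sketch (not QualityStage code) of Character Discrete
# investigation on a single-domain identifier column; data is made up.
from collections import Counter

def mask_t(value: str) -> str:
    """Type (T) mask: alpha -> a, numeric -> n, blank -> b, rest as-is."""
    return "".join("a" if c.isalpha() else "n" if c.isdigit()
                   else "b" if c == " " else c for c in value)

ssn = ["123456789", "987654321", "999999999", "999999999", "12-345", ""]

# C-mask view: frequency of actual values exposes default values.
print(Counter(ssn).most_common(2))        # the all-nines default pops out

# T-mask view: frequency of patterns exposes format anomalies.
print(Counter(mask_t(v) for v in ssn))    # a reliable SSN column should
                                          # be dominated by 'nnnnnnnnn'
```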
Multiple-domain fields
Typically these are large free-form fields such as multiple
Address fields.
Composed of multiple single-domain values, such as house number,
street name, unit, and first name.
Intelligent parsing is required on these fields for optimal
matching (see the parsing sketch after this list).
Can serve as good match fields and multiple-component blocking
candidates.
Standardization requires removal of most special characters and
strong data typing.
Occasionally contain business word interference, such as reference
numbers, comments, and so forth.
Can contain additional relationships such as Attn, C/O, DBA, and
Trustee.
Word Investigate is commonly used for multi-domain fields.
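To make the parsing requirement concrete, here is the kind of decomposition a word investigation prepares for. The address line and the split shown are illustrative only; in QualityStage, the rule set's classification table drives the parsing:

```python
# Illustrative only: one free-form (multiple-domain) address field holds
# several single-domain values that must be parsed apart before matching.
line = "123 MAIN ST APT 4B C/O JOHN SMITH"

# A naive whitespace split already shows the mixed domains in one field;
# a real rule set classifies each token (house number, street type, unit,
# relationship such as C/O, and so on).
tokens = line.split()
print(tokens)  # ['123', 'MAIN', 'ST', 'APT', '4B', 'C/O', 'JOHN', 'SMITH']
```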
Summary:
– Analyze free-form and single-domain fields
– Generate frequency counts of unique values
– Discover patterns in single-domain and free-form fields
– Understand and validate business rules within data fields
Methods:
– Character Discrete
– Character Concatenate
– Word
Investigate Field Masks:
Field masks are options that control the representation of the data:
Character (C), Type (T), and Skipped (X).
Mask C: for viewing the actual character values of the data
Mask X: for ignoring characters
Mask T: for viewing the data patterns
Tokens:
Individual units of data (actual field values)
Investigate Stage Job Design:
Investigate Methods:
a) Character Discrete: analyzes multiple single-domain columns.
b) Character Concatenate: performs cross-column correlations between
multiple columns to determine relationships.
c) Word Investigation: identifies free-form fields that may require parsing
and standardization, and discovers key words for classification.
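A hedged sketch of the difference between the two character methods follows; the column names and values are invented, and this is not a QualityStage API:

```python
# Sketch contrasting Character Discrete and Character Concatenate
# (invented columns and values; not a QualityStage API).
from collections import Counter

rows = [
    {"gender": "M", "marital_status": "S"},
    {"gender": "F", "marital_status": "M"},
    {"gender": "F", "marital_status": "S"},
]

# a) Character Discrete: each single-domain column analyzed on its own.
for col in ("gender", "marital_status"):
    print(col, Counter(r[col] for r in rows))

# b) Character Concatenate: columns combined so that frequency counts
#    reveal cross-column value correlations.
print(Counter(r["gender"] + "|" + r["marital_status"] for r in rows))
```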
Pattern Report Output Columns:
qsInvColumnName
identifies the name of the column that is investigated.
qsInvPattern
shows the pattern that is generated from the column values by the field
masks; the frequency count and pattern analysis are based on this pattern.
qsInvSample
shows one or more samples of the content of this column. The number to be
displayed is configurable.
qsInvCount
shows the actual number of occurrences of the value in the qsInvPattern column.
qsInvPercent
shows the percentage occurrences of the value in the qsInvPattern column to the
total number of records on this file.
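The pattern report is, in essence, a frequency table over the generated patterns. A minimal sketch of how rows with these columns could be derived (the input values and the column name zip_code are made up; QualityStage computes this inside the stage):

```python
# Sketch: deriving pattern-report rows for one investigated column
# (made-up data; not QualityStage code).
from collections import defaultdict

def mask_t(value: str) -> str:
    return "".join("a" if c.isalpha() else "n" if c.isdigit()
                   else "b" if c == " " else c for c in value)

values = ["60601", "60614", "60605", "ABC12"]
buckets = defaultdict(list)
for v in values:
    buckets[mask_t(v)].append(v)

for pattern, vals in buckets.items():
    print({"qsInvColumnName": "zip_code",
           "qsInvPattern": pattern,
           "qsInvSample": vals[0],   # number of samples is configurable
           "qsInvCount": len(vals),
           "qsInvPercent": round(100 * len(vals) / len(values), 2)})
```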
Token Report Output Columns:
qsInvCount
indicates the number of times this token was encountered across the entire
input data.
qsInvWord
identifies the individual token or word value that is found inside the selected
input columns.
qsInvClassCode
identifies the classification of the token that is based on the selected rule
set classification table. Unclassified tokens, if selected, get a question mark
“?” for alpha or a caret “^” for numeric.
Summary & Important Points
About Masks:
Example:
Input text = IBM Quality Stage 8
If you apply Mask C: IBM Quality Stage 8
If you apply Mask T: aaabaaaaaaabaaaaabn
If you apply Mask C & Mask X, like cccXcccccccXcccccXc, then the output
will be IBMQualityStage8
If you apply Mask T & Mask X, like TTTXTTTTTTTXTTTTTXT, then the output
will be aaaaaaaaaaaaaaan
Mask X simply skips the character at that position and gives you the result
for the remaining characters from the input.
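The mask logic above is simple enough to sketch in a few lines of Python. This is our illustration of the documented semantics, not QualityStage code; it reproduces the example exactly:

```python
# Sketch of applying C/T/X masks position by position, reproducing the
# worked example above (illustrative, not QualityStage code).
def apply_mask(text: str, mask: str) -> str:
    out = []
    for ch, m in zip(text, mask):
        if m in "Xx":            # X: skip this position entirely
            continue
        if m in "Cc":            # C: keep the actual character
            out.append(ch)
        elif m in "Tt":          # T: a = alpha, n = numeric, b = blank
            out.append("a" if ch.isalpha() else
                       "n" if ch.isdigit() else
                       "b" if ch == " " else ch)
    return "".join(out)

text = "IBM Quality Stage 8"
print(apply_mask(text, "C" * len(text)))        # IBM Quality Stage 8
print(apply_mask(text, "T" * len(text)))        # aaabaaaaaaabaaaaabn
print(apply_mask(text, "cccXcccccccXcccccXc"))  # IBMQualityStage8
print(apply_mask(text, "TTTXTTTTTTTXTTTTTXT"))  # aaaaaaaaaaaaaaan
```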
About Advanced Options Presented in the Investigate Stage:
Number of Samples: If desired, increase the number of samples that appear
for each pattern in the pattern report. Default is 1.
Frequency Cutoff: Patterns with a frequency of less than this number do not
appear in the pattern or token reports. Default is 1. If desired, enter a
higher number. (For example, if you enter 4, any pattern that occurs three
times or less does not appear in the report.)
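Continuing the pattern-report sketch above, both options map directly onto the frequency table. The parameter names and bucket contents here are illustrative:

```python
# Sketch: the two advanced options applied to pattern buckets like the
# ones built in the report sketch above (names are illustrative).
buckets = {"nnnnn": ["60601", "60614", "60605", "60622"],
           "aaann": ["ABC12"]}

NUM_SAMPLES = 2        # "Number of Samples" (default 1)
FREQUENCY_CUTOFF = 4   # with 4, patterns seen 3 times or less are dropped

report = [(p, vals[:NUM_SAMPLES], len(vals))
          for p, vals in buckets.items() if len(vals) >= FREQUENCY_CUTOFF]
print(report)          # only the 'nnnnn' pattern survives the cutoff
```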
Token Report Options
For Word investigations, the following Token Report Options are available:
Treat Successive Unclassified Words as One Word. Select to strip out the
spaces between unclassified words (concatenating them into one word). For
example, THOMAS ALVA EDISON becomes THOMASALVAEDISON. This option reduces
the number of patterns in both the pattern and token reports.
Include Unclassified Numeric Tokens. Select to list all number tokens in
the token report. For example, when investigating an address column, you
probably do not want to see house and apartment numbers, but you might want
to see numbers if you are investigating part numbers.
Include Unclassified Alphas. Select to include all word tokens that are not
in the classification table for the given rule set in the token reports. If
you do not select this option, the report includes only tokens from the
rule set's classification table.
Include Mixed Types and Punctuated Words. Select to include tokens with
leading or trailing numerics, such as 109TH and 42ND, in the token report.
Classified Tokens
You can include classified tokens by choosing any of the following options:
Standard Abbreviation. Select to display the standardized representation of
the token from the rule set's classification table.
Original Spelling. Select to display the form the token takes in the data
file.
Corrected Spelling. Corrects any misspellings, as long as the
classification table has a weight assigned to the token.
Additional Options
You can select any of the following options:
Separator List. Includes all special characters that separate tokens. Edit
the list to add or remove characters.
Strip List. Includes all special characters from the Separator List that
are not to become tokens. (For example, the pound sign (#) by default is
not part of this list, thus APT#3A is three tokens: APT, #, and 3A.) Edit
the list to add or remove characters.
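A hedged sketch of how a word investigation might tokenize under a separator list and strip list and mark unclassified tokens; the table contents and class codes are invented, not a real rule set:

```python
# Sketch of Word investigation tokenizing with a separator list and a
# strip list, then classifying tokens (illustrative tables, not a real
# QualityStage rule set).
import re

SEPARATORS = " ,/#-"    # characters that separate tokens
STRIP = " ,-"           # separators that do not survive as tokens
CLASSIFICATION = {"APT": "U", "ST": "T"}   # token -> class code (invented)

def tokenize(field: str):
    parts = re.split("([" + re.escape(SEPARATORS) + "])", field)
    # '#' is in the separator list but not the strip list, so it stays
    return [t for t in parts if t and t not in STRIP]

def classify(token: str) -> str:
    if token in CLASSIFICATION:
        return CLASSIFICATION[token]
    return "^" if token.isdigit() else "?"   # unclassified numeric/alpha

for tok in tokenize("APT#3A MAIN ST"):
    print(tok, classify(tok))
# APT#3A yields three tokens (APT, #, 3A), exactly as described above.
```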
If you need more information on QualityStage, follow our blog. We will keep
updating it with QualityStage information.