Saturday 25 October 2014

QUALITYSTAGE - INVESTIGATE STAGE OVERVIEW


INVESTIGATE STAGE



Understanding your data is a necessary precursor to cleansing and consolidation. You can use a data profiling product such as IBM WebSphere Information Analyzer to perform a comprehensive analysis across many columns and tables to create a direct input into the cleansing and matching process by using shared metadata. The Investigate stage supports the initial level of column or domain analysis and extends it through the analysis of free-form text fields to create this input.

The Investigate stage shows the actual condition of data in existing sources and identifies and corrects data problems before they corrupt new systems.

Investigation parses and analyzes free-form fields, creating patterns to reveal field formats; counts unique values; and classifies, or assigns a business meaning to, each occurrence of a value within a field. The Investigate stage also allows users to validate field contents in a specific column or domain. The actual full-volume data files must be investigated to ensure that as many data anomalies as possible are identified. The Investigate stage involves performing one or more of the following functions:
Parsing
Classifying
Creation of Word or Pattern Distributions
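As a rough illustration only, these three functions can be mimicked in a few lines of Python. The classification table, class codes, and addresses below are invented for the sketch; real rule sets and their classification tables ship with QualityStage:

```python
from collections import Counter

# Hypothetical mini classification table: token -> class code
CLASS_TABLE = {"ST": "T", "AVE": "T", "APT": "U", "N": "D"}

def parse(field):
    # Parsing: break a free-form field into individual tokens
    return field.upper().split()

def classify(token):
    # Classifying: assign a business meaning to each token;
    # unclassified numerics get "^", unclassified alphas get "?"
    if token in CLASS_TABLE:
        return CLASS_TABLE[token]
    return "^" if token.isdigit() else "?"

def pattern_of(field):
    # Pattern distribution input: one class code per token
    return "".join(classify(t) for t in parse(field))

addresses = ["123 N MAIN ST", "456 OAK AVE", "789 N MAIN ST"]
distribution = Counter(pattern_of(a) for a in addresses)
```

Here "123 N MAIN ST" parses into four tokens that classify to the pattern "^D?T", and the distribution counts how often each pattern occurs across the file.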

Specific objectives of the investigation process include, but are not limited to the following:
Investigate data values and anomalies to be considered for special handling during design or development. Any recommended special handling is documented in a Business Requirements Document. Examples include:
Out of range, default (such as all nines in the telephone number) or unique values.
Free-form text that requires parsing to improve “addressability” to key components for optimal matching.
Values that do not match metadata labels and require new field definitions for matching and load format (such as DBA, C/O, Attention text, drivers license numbers, or other comment information in a name or address field).
Values that overlap adjacent fields and thus require a re-alignment of field content. For example, name information or city, state, and postal code that extend into street and mailing address fields.
Invalid formats that can cause conversion problems, such as alphanumeric values in character-only or numeric-only fields.
Blank or missing data.
Special character and punctuation handling.
Discover additional tokens (key words) to add to the classification table, such as name prefixes, name suffixes, street types, unit types, and business words.
Verify the usefulness of data fields for the purpose of entity matching (record linkage). For example, does a Tax ID field contain primarily blank or invalid values, making it an unreliable field to use in the identification of a unique customer?
Validate business requirements or assumptions about the data. Results can lead to changes in the business requirements and can assist in driving the technical solution.

The Investigate stage takes a single input, which can be a link from any database connector that is supported by IBM WebSphere DataStage, from a flat file or data set or from any processing stage. Inputs to the Investigate stage can be fixed length or variable.

Use of investigation (or data profiling) is critical to assess how data can be used for further steps of standardization, matching, and survivorship or consolidation. Data that is used in these additional steps is stored in two types of fields—single-domain and multiple-domain.

During the investigation process, it is important to document fields that require parsing and standardization, standardization only, or no standardization at all. You must also consider whether data content will be used for matching, survivorship, pass-through (static data) or whether it will not be carried forward into the new system or target database.

Static fields
Can be used in the matching process.
Might require some standardization of format and default or out of range values.
Typically re-joined with re-engineered data just prior to matching or load formatting.
A Character Investigate is usually sufficient to evaluate these fields.

Single-domain fields
Categorized as either Entity Identifiers or Entity Clarifiers:
Entity Identifiers
Examples include ZIP Code (U.S. postal code), Social Security Number (SSN), TIN, and telephone numbers. These can serve as critical match fields if their quality is high. These fields typically have a specific format (such as 9-digit numeric for SSN, or Alpha-Numeric-Alpha-Numeric-Alpha-Numeric for Canadian postal code). Character Discrete Investigate is commonly used, with the mask set to all Ts, so that the format of the values can be verified.
Entity Clarifiers
Examples include name prefix, gender, and marital status. This type of field usually has a limited number of known values. Character Discrete Investigate is commonly used, with the mask set to all Cs, so that the actual values can be inspected.
Typically used for a single purpose.
Can exist in several common formats.
No parsing is usually necessary.
Can serve as strong blocking and match fields if quality is high.
The Standardize stage usually requires only the removal of special characters, creation of a common default value and format, and concatenation of data.
Multiple-domain fields
Typically these are large free-form fields such as multiple Address fields.
Composed of multiple single-domain values, such as house number, street name, unit, and first name.
Intelligent parsing is required on these fields for optimal matching.
Can serve as good match fields and multiple-component blocking candidates.
Standardization requires removal of most special characters and strong data typing.
Occasionally contain business word interference, such as reference numbers, comments, and so forth.
Can contain additional relationships such as Attn, C/O, DBA, and Trustee.
Word Investigate is commonly used for multi-domain fields.

Summary:

– Analyze free-form and single-domain fields
– Generate frequency counts of unique values
– Discover patterns in single-domain and free-form fields
– Understand and validate business rules within data fields
Methods
– Character Discrete
– Character Concatenate
– Word
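The difference between the two character methods can be sketched in Python (the column names and values here are made up for illustration). Character Discrete profiles each selected column on its own, while Character Concatenate joins the selected columns into one virtual column before counting, which exposes cross-column value combinations:

```python
from collections import Counter

# Toy input records with two single-domain columns
rows = [
    {"gender": "M", "marital": "S"},
    {"gender": "F", "marital": "M"},
    {"gender": "M", "marital": "M"},
]

# Character Discrete: an independent frequency distribution per column
discrete = {col: Counter(r[col] for r in rows)
            for col in ("gender", "marital")}

# Character Concatenate: columns joined first, then counted together
concatenate = Counter(r["gender"] + r["marital"] for r in rows)
```

With this data, Discrete reports that "M" occurs twice in the gender column, while Concatenate reports each gender/marital combination ("MS", "FM", "MM") separately.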


Investigate Field Masks:

Field Masks:

Options that control the representation of the data: Character (C), Type (T), and Skipped (X).

Mask C: For viewing the actual character values of the data

Mask X: For ignoring characters

Mask T: For viewing the data patterns


Tokens:

Individual units of data (actual field values)


Investigate Stage Job Design:




 Investigate Methods:

a) Character Discrete: analyzes multiple single-domain columns.

Investigate Stage Properties:


Output:




b) Character Concatenate: performs cross-column correlations between multiple columns to determine relationships.


Stage Properties:



Output Results:




c) Word Investigation: identifies free-form fields that may require parsing and standardization, and discovers key words for classification.

Stage Properties:



Pattern Output:




Pattern Report Output Columns:

qsInvColumnName identifies the name of the column that is investigated.
qsInvPattern displays the pattern that is generated for the column value and is included in the frequency count and pattern analysis.
qsInvSample shows one or more samples of the content of this column. The number to be displayed is configurable.
qsInvCount shows the actual number of occurrences of the value in the qsInvPattern column.
qsInvPercent shows the occurrences of the value in the qsInvPattern column as a percentage of the total number of records in the file.
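The pattern report columns above can be approximated with a small frequency count in Python. This is an illustrative sketch only, using the T-mask pattern notation from this post (letters to a, digits to n, blanks to b), not the stage's actual implementation:

```python
def t_pattern(value):
    # Type (T) mask: letters -> a, digits -> n, blanks -> b
    return "".join("a" if c.isalpha() else
                   "n" if c.isdigit() else
                   "b" if c == " " else c
                   for c in value)

def pattern_report(column_name, values, samples=1):
    # Group the column's values by their generated pattern
    by_pattern = {}
    for v in values:
        by_pattern.setdefault(t_pattern(v), []).append(v)
    total = len(values)
    rows = [
        {
            "qsInvColumnName": column_name,
            "qsInvPattern": pat,
            "qsInvSample": vals[:samples],   # configurable sample count
            "qsInvCount": len(vals),
            "qsInvPercent": round(100.0 * len(vals) / total, 2),
        }
        for pat, vals in by_pattern.items()
    ]
    rows.sort(key=lambda r: r["qsInvCount"], reverse=True)
    return rows

report = pattern_report("phone", ["5551234", "5559876", "555-0000"])
```

For this toy phone column, the pattern "nnnnnnn" covers two of the three records (66.67%) and "nnn-nnnn" covers the remaining one.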


Token Output:


Token report output columns:
qsInvCount indicates the number of times this token was encountered across the entire input data.
qsInvWord identifies the individual token or word value that is found inside the selected input columns.
qsInvClassCode identifies the classification of the token that is based on the selected rule set classification table. Unclassified tokens, if selected, get a question mark “?” for alpha or a caret “^” for numeric.
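The token report columns can likewise be sketched with a frequency count. The classification table below is a made-up stand-in for a rule set's real table; unclassified tokens get “?” for alpha and “^” for numeric, as described above:

```python
from collections import Counter

# Hypothetical classification table for the sketch
CLASS_TABLE = {"ST": "T", "APT": "U"}

def token_report(values):
    # Split every input value into tokens and count them across the data
    tokens = [t for v in values for t in v.upper().split()]
    return [
        {
            "qsInvCount": n,
            "qsInvWord": tok,
            "qsInvClassCode": CLASS_TABLE.get(
                tok, "^" if tok.isdigit() else "?"),
        }
        for tok, n in Counter(tokens).most_common()
    ]

report_rows = token_report(["123 MAIN ST", "MAIN ST APT 4"])
```

Here MAIN and ST each appear twice across the two records; ST and APT pick up their class codes from the table, while 123 and 4 fall through as unclassified numerics.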


Summary & Important points:

About Masks:

Example:

Input text = IBM Quality Stage 8

If you apply Mask C: IBM Quality Stage 8
If you apply Mask T: aaabaaaaaaabaaaaabn
If you apply mask C & Mask X like cccXcccccccXcccccXc then the output will be
IBMQualityStage8
If you apply Mask T & Mask X like TTTXTTTTTTTXTTTTTXT then the output will be
aaaaaaaaaaaaaaan

The Mask X simply skips the character in that position and gives you the result for the remaining characters from the input.
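The mask behavior above can be reproduced with a short Python sketch (the a/n/b pattern characters follow this post's example; this is not the stage's actual code):

```python
def apply_mask(value, mask):
    out = []
    for ch, m in zip(value, mask):
        if m == "X":        # Mask X: skip this position entirely
            continue
        if m == "C":        # Mask C: keep the actual character
            out.append(ch)
        else:               # Mask T: show the data pattern
            out.append("a" if ch.isalpha() else
                       "n" if ch.isdigit() else
                       "b" if ch == " " else ch)
    return "".join(out)

text = "IBM Quality Stage 8"
```

Applying an all-C mask returns the text unchanged, an all-T mask yields "aaabaaaaaaabaaaaabn", and mixing in X at the blank positions drops those characters from the result, exactly as in the example above.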



About Advanced Options in the Investigate Stage:


Number of Samples. If desired, increase the number of samples that appear for each pattern in the pattern report. The default is 1.

Frequency Cutoff. Patterns with a frequency of less than this number do not appear in the pattern or token reports. The default is 1. If desired, enter a higher number. (For example, if you enter 4, any pattern that occurs three times or fewer does not appear in the report.)
Token Report Options
For Word investigations, the following Token Report Options are available:
Treat Successive Unclassified Words as One Word. Select to strip out the spaces between unclassified words (concatenating them into one word). For example, THOMAS ALVA EDISON becomes THOMASALVAEDISON. This option reduces the number of patterns in both the pattern and token reports.
Include Unclassified Numeric Tokens. Select to list all number tokens in the token report. For example, when investigating an address column, you probably do not want to see house and apartment numbers, but you might want to see numbers if you are investigating part numbers.
Include Unclassified Alphas. Select to include all word tokens that are not in the classification table for the given rule set in the token reports. If you do not select this option, the report includes only tokens from the rule set's classification table.
Include Mixed Types and Punctuated Words. Select to include tokens with leading or trailing numerics, such as 109TH and 42ND, in the token report.
Classified Tokens
You can include classified tokens by choosing any of the following options:
Standard Abbreviation. Select to display the standardized representation of the token from the rule set's classification table.
Original Spelling. Select to display the form the token takes in the data file.
Corrected Spelling. Corrects any misspellings as long as the Classification table has a weight assigned to the token.
Additional Options
You can select any of the following options:
Separator List. Includes all special characters that separate tokens. Edit the list to add or remove characters.
Strip List. Includes all special characters from the Separator List that are not to be treated as tokens themselves. (For example, the pound sign (#) by default is not part of this list, thus APT#3A is three tokens: APT, #, and 3A.) Edit the list to add or remove characters.
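A sketch of how the Separator and Strip lists interact during tokenization (the default lists below are abbreviated guesses for illustration, not the product's actual defaults):

```python
def tokenize(value, separators="#&/-,. ", strip_list=", "):
    # Characters in the separator list break the value into tokens;
    # separators that are NOT in the strip list survive as tokens themselves.
    tokens, current = [], ""
    for ch in value:
        if ch in separators:
            if current:
                tokens.append(current)
                current = ""
            if ch not in strip_list:
                tokens.append(ch)   # e.g. "#" becomes its own token
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens
```

With these lists, "APT#3A" yields the three tokens APT, #, and 3A from the example above, while blanks and commas split tokens without appearing in the output.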



If you need more information on QualityStage, follow our blog. We will keep updating this blog with QualityStage information.

