Saturday 25 October 2014

Unix Script to Get DSLog after a DataStage Job Finishes


#!/usr/bin/ksh
# --------------------------------------------------------------------------------------------------
# name: GetDSLog.ksh
#
# purpose:    
# extract datastage log entries for a given job and save them to
# a designated file on the server.
#
# usage: GetDSLog.ksh <dsjob> <logfile> [<Project>]
#
# parameters:
#
#     $1 - <dsjob>   is the name of the DataStage job, required
#     $2 - <logfile> is the name, including path, of the file to write the log entries to, required
#     $3 - <Project> is the name of the DataStage project, optional. If not supplied,
#                    the project is looked up in the app_parameter table.
# examples:    
#
# GetDSLog.ksh DsjobSequencer "/datastage/common/log/member.log"
#
#
# -----------------------------------------------------------------------------
# set environment. Lookup the SetEnv.ksh in the user's home directory first
# and then in the /ohd/common/bin directory.
# -----------------------------------------------------------------------------
# set -vx

# -----------------------------------------------------------------------------
# write log entries
# $1 is the actual message to log
# -----------------------------------------------------------------------------
upLog()
{
\echo "$1"
}

# -----------------------------------------------------------------------------
# abend the job and return -1 exit code
# $1 is the actual message to display
# -----------------------------------------------------------------------------
glAbend()
{
upLog "$1"
  exit 1
}      

# -----------------------------------------------------------------------------
# assign a job parameter
# $1 is the name of the parameter
# $2 is the value
# -----------------------------------------------------------------------------
AssignParam()
{
param=$1
value=$2
if [[ -n $(\echo "$JobParam" | \grep -w $param) ]]; then
ParamList=$ParamList" -param $param=$value"
\echo "$param=$value" >> $CurrParams
fi
}

# -----------------------------------------------------------------------------
# Execute/run a job
# return: 0 if successful
# 1 otherwise
# -----------------------------------------------------------------------------
RunAndCheck()
{
#
# reset the extract job if it is not in a runnable state
#
   dsJStat=$($DSBin/dsjob -jobinfo $dsProject $eljob | \awk -F: '{print $2}')

if [[ -z $dsJStat ]]; then
upLog "unable to get job status"
return 1
fi

RunFailed=$(\echo $dsJStat |\grep "RUN FAILED")
NotCompiled=$(\echo $dsJStat |\grep "NOT COMPILED")
Stopped=$(\echo $dsJStat |\grep "STOPPED")
Unknown=$(\echo $dsJStat |\grep "UNKNOWN")

   if [[ -n $RunFailed || -n $NotCompiled || -n $Stopped || -n $Unknown ]]; then
upLog "restting......: $eljob"
    $DSBin/dsjob -run -mode RESET $dsProject $eljob
    if [[ $? != 0 ]]; then
upLog "unable to reset job $eljob!"
return 1
fi
fi

ParamList=" "
  JobParam=$($DSBin/dsjob -lparams $dsProject $eljob)
AssignParam "jobName" "$dsjob"
AssignParam "logFileName" "$logto"

$DSBin/dsjob -run $ParamList -jobstatus -warn $MaxWarns $dsProject $eljob > /dev/null

   rc=$?
   upLog "ds exit code..: $rc"

   dsJStat=$($DSBin/dsjob -jobinfo $dsProject $eljob | \awk -F: '{print $2}')

   if [[ -n $(\echo "$dsJStat" | \grep 'RUN OK') ]]; then Stat=OK; fi
   if [[ -n $(\echo "$dsJStat" | \grep 'RUN with WARNINGS') ]]; then Stat=WARNING; fi
   if [[ -n $(\echo "$dsJStat" | \grep 'RUN FAILED') ]]; then Stat=FATAL; fi

   if [[ $Stat = FATAL ]]; then

      dsID=$($DSBin/dsjob -lognewest $dsProject $eljob $Stat| \awk -F= '{print $NF}')
      dsErr=$($DSBin/dsjob -logdetail $dsProject $eljob $dsID)
      upLog "Status........: fail!"
upLog "$dsErr"
      return 1

   else

      upLog "Status........: success!"
return 0
fi
}

# -----------------------------------------------------------------------------
# main
# -----------------------------------------------------------------------------
if [ $# -lt 2 ] ; then
  printf "Error: $0 - Required parameter(s) was not supplied!\n\n"
  printf "Usage: $0 <dslog> <logfile> [<Project>]\n\n"
  printf "where: <dsjob>     : is the name of the DataStage job and is required\n"
  printf "       <logfile>   : is the name of the log file including path, required.\n"
  printf "       <Project>   : is the name of the project and is optional. If not supplied\n"
  printf "                     the default project in the parameter file is used.\n\n"
  exit 1
fi

Me=$(\echo $0 | \awk -F/ '{print $NF}')
mypath=$(\echo $0 | \awk -F"$Me" '{print $1}')

dsjob=$1
logto=$2
dsProject=$3

export DSHOME=$(cat /.dshome)
DSBin=$DSHOME/bin
Start=$(\date +%Y%m%d%H%M%S)
MaxWarns=50

filename=$(\echo $logto |\awk -F/ '{print $NF}')
if [[ -n $(\echo $logto |\grep "\/") ]]; then
logpath=$(\echo $logto |\awk -F"/$filename" '{print $1}')
else
logpath="./"
fi

if [[ $logpath = "./" || $logpath = "." ]]; then logpath=$PWD; fi

if [[ ! -d $logpath ]]; then
glAbend "the directory given for the log file ($logto) does not exist!"
fi

logto="$logpath/$filename"

eljob="jcExtractLog.$(\echo $dsjob |\sed 's/\./_/g')"

upLog "Extract Job Started"
upLog "dsJob.........: $dsjob"
upLog "log file......: $logto"
upLog "Project.......: $dsProject"

RunAndCheck

\echo "------------------------------" >> $logFile

exit 0
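A hypothetical invocation could look like the following (the job name, log path and project name are made up for illustration); the script runs the jcExtractLog extract job for the named DataStage job and writes the extracted log entries to the file given as the second argument:

./GetDSLog.ksh jsLoadMembers /datastage/common/log/jsLoadMembers.log DSProject1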



QUALITY STAGE-INVESTIGATE STAGE OVERVIEW


INVESTIGATE STAGE



Understanding your data is a necessary precursor to cleansing and consolidation. You can use a data profiling product such as IBM WebSphere Information Analyzer to perform a comprehensive analysis across many columns and tables to create a direct input into the cleansing and matching process by using shared metadata. The Investigate stage supports the initial level of column or domain analysis and extends it through the analysis of free-form text fields to create this input.

The Investigate stage shows the actual condition of data in existing sources and identifies and corrects data problems before they corrupt new systems.

Investigation parses and analyzes free-form fields, creating patterns to reveal field formats, counting unique values, and classifying (assigning a business meaning to) each occurrence of a value within a field. The Investigate stage also allows users to validate field contents in a specific column or domain. The actual full-volume data files must be investigated to ensure that as many data anomalies as possible are identified. The Investigate stage involves performing one or more of the following functions:
Parsing
Classifying
Creation of Word or Pattern Distributions

Specific objectives of the investigation process include, but are not limited to the following:
Investigate data values and anomalies to be considered for special handling during design or development. Any recommended special handling is documented in a Business Requirements Document. Examples include:
Out of range, default (such as all nines in the telephone number) or unique values.
Free-form text that requires parsing to improve “addressability” to key components for optimal matching.
Values that do not match metadata labels and require new field definitions for matching and load format (such as DBA, C/O, Attention text, driver's license numbers, or other comment information in a name or address field).
Values that overlap adjacent fields and thus require a re-alignment of field content. For example, name information or city, state, and postal code that extend into street and mailing address fields.
Invalid formats that can cause conversion problems, such as alphanumeric values in character-only or numeric-only fields.
Blank or missing data.
Special character and punctuation handling.
Discover additional tokens (key words) to add to the classification table, such as name prefixes, name suffixes, street types, unit types, and business words.
Verify the usefulness of data fields for the purpose of entity matching (record linkage). For example, does a Tax ID field contain primarily blank or invalid values, making it an unreliable field to use in the identification of a unique customer?
Validate business requirements or assumptions about the data. Results can lead to changes in the business requirements and can assist in driving the technical solution.

The Investigate stage takes a single input, which can be a link from any database connector that is supported by IBM WebSphere DataStage, from a flat file or data set or from any processing stage. Inputs to the Investigate stage can be fixed length or variable.

Use of investigation (or data profiling) is critical to assess how data can be used for further steps of standardization, matching, and survivorship or consolidation. Data that is used in these additional steps is stored in two types of fields—single-domain and multiple-domain.

During the investigation process, it is important to document fields that require parsing and standardization, standardization only, or no standardization at all. You must also consider whether data content will be used for matching, survivorship, pass-through (static data) or whether it will not be carried forward into the new system or target database.

Static fields
Can be used in the matching process.
Might require some standardization of format and default or out of range values.
Typically re-joined with re-engineered data prior to matching or load formatting.
Character Investigate is usually sufficient to evaluate these fields.

Single-domain fields
Categorized as either Entity Identifiers or Entity Clarifiers:
Entity Identifiers
Examples include ZIP Code (U.S. postal code), Social Security Number (SSN), TIN, and telephone numbers. These can serve as critical match fields if their quality is high. These fields typically have a specific format (such as 9-digit numeric for SSN, or Alpha-Numeric-Alpha-Numeric-Alpha-Numeric for Canadian postal code); a rough shell sketch of such format checks appears after these field-type lists. Character Discrete Investigate is commonly used, with the mask set to all Cs.
Entity Clarifiers
Examples include name prefix, gender, and marital status. This type of field usually has a limited number of known values. Character Discrete Investigate is commonly used, with the mask set to all Ts.
Typically used for a single purpose.
Can exist in several common formats.
No parsing is usually necessary.
Can serve as strong blocking and match fields if quality is high.
The Standardize stage usually only requires removing special characters, creating a common default value and format, and concatenating data.
Multiple-domain fields
Typically these are large free-form fields such as multiple Address fields.
Comprised of multiple single-domain values, such as house number, street name, unit, and first name.
Intelligent parsing is required on these fields for optimal matching.
Can serve as good match fields and multiple-component blocking candidates.
Standardization requires removal of most special characters and strong data typing.
Occasionally contain business word interference, such as reference numbers, comments, and so forth.
Can contain additional relationships such as Attn, C/O, DBA, and Trustee.
Word Investigate is commonly used for multi-domain fields.
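As a rough illustration of the identifier formats mentioned above (this is plain ksh, not QualityStage, and the sample values are made up), a 9-digit SSN and a Canadian postal code in Alpha-Numeric-Alpha-Numeric-Alpha-Numeric form could be checked like this:

#!/usr/bin/ksh
# rough format checks for single-domain identifiers; sample values are hypothetical
ssn="123456789"
postal="K1A0B1"

# SSN: exactly nine digits
if [[ $ssn = +([0-9]) && ${#ssn} -eq 9 ]]; then
    echo "SSN matches the 9-digit numeric format"
fi

# Canadian postal code: Alpha-Numeric-Alpha-Numeric-Alpha-Numeric
if [[ $postal = [A-Za-z][0-9][A-Za-z][0-9][A-Za-z][0-9] ]]; then
    echo "Postal code matches the A-N-A-N-A-N format"
fi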

Summary:

– Analyze free-form and single-domain fields
– Generate frequency counts of unique values
– Discover patterns in single-domain and free-form fields
– Understand and validate business rules within data fields
Methods:
– Character Discrete
– Character Concatenate
– Word


Investigate Field Masks:

Field Masks:

Options that control the representation of the data: Character (C), Type (T), Skipped (X).

Mask C: For viewing the actual character values of the data

Mask X: For ignoring characters

Mask T: For viewing the data patterns


Tokens:

Individual units of data (actual field values)


Investigate Stage Job Design:




 Investigate Methods:

a) Character Discrete: analyzes multiple single-domain columns.

Investigate stage properties


Output :




b) Character Concatenate: performs cross-column correlations between multiple columns to determine relationships.


Stage Properties:



Output Results :




c) Word Investigation: identifies free-form fields that may require parsing and standardization, and discovers key words for classification.

Stage Properties:



Pattern Output:




Pattern Report output Columns:

qsInvColumnName identifies the name of the column that is investigated.
qsInvPattern displays the pattern that is generated for the column value.
qsInvSample shows one or more samples of the content of this column. The number to be displayed is configurable.
qsInvCount shows the actual number of occurrences of the value in the qsInvPattern column.
qsInvPercent shows the percentage of occurrences of the value in the qsInvPattern column relative to the total number of records in the file.
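For a feel of how these columns relate, here is a rough shell approximation (it is not the Investigate stage itself, only an illustration): it reads a single column from a text file, one value per line (the file name zipcode.txt is hypothetical), builds a simple pattern for each value, and prints pattern, sample, count and percentage in the spirit of qsInvPattern, qsInvSample, qsInvCount and qsInvPercent:

# rough pattern-frequency sketch; letters become "a", digits become "n"
awk '{ p=$0; gsub(/[A-Za-z]/,"a",p); gsub(/[0-9]/,"n",p)
       cnt[p]++; sample[p]=$0; total++ }
     END { for (p in cnt)
               printf "%-12s %-12s %8d %8.2f\n", p, sample[p], cnt[p], 100*cnt[p]/total }' zipcode.txt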


Token Output:


Token report output columns:
qsInvCount indicates the number of times this token was encountered across the entire input data.
qsInvWord identifies the individual token or word value that is found inside the selected input columns.
qsInvClassCode identifies the classification of the token based on the selected rule set's classification table. Unclassified tokens, if selected, get a question mark “?” for alpha or a caret “^” for numeric.
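As a rough shell approximation of the token report (again, not the Investigate stage, and the input file name address.txt is hypothetical), the sketch below counts whitespace-separated tokens across a free-form column; no rule set classification is attempted, so every token is "unclassified" and is shown with "?" for alpha or "^" for numeric, as described above:

awk '{ for (i=1; i<=NF; i++) cnt[$i]++ }
     END { for (w in cnt) { code = (w ~ /^[0-9]+$/) ? "^" : "?"
                            printf "%8d  %-15s %s\n", cnt[w], w, code } }' address.txt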


Summary & Important points:

About Masks:

Example:

Input text = IBM Quality Stage 8

If you apply Mask C: IBM Quality Stage 8
If you apply Mask T: aaabaaaaaaabaaaaabn
If you apply Mask C & Mask X like CCCXCCCCCCCXCCCCCXC then the output will be
IBMQualityStage8
If you apply Mask T & Mask X like TTTXTTTTTTTXTTTTTXT then the output will be
aaaaaaaaaaaaaaan

Mask X simply skips the character at that position and gives you the result for the remaining characters coming from the input.
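To see what the T mask does outside the tool, here is a minimal awk sketch (it is not the Investigate stage itself, just an approximation of the pattern generation): letters become a, digits become n, blanks become b; a position masked with C would be left unchanged, and a position masked with X would simply be dropped before this step:

echo "IBM Quality Stage 8" | awk '{
    gsub(/[A-Za-z]/, "a")   # every letter contributes an "a" to the pattern
    gsub(/[0-9]/, "n")      # every digit contributes an "n"
    gsub(/ /, "b")          # blanks appear as "b"
    print                   # prints: aaabaaaaaaabaaaaabn
}'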



About Advanced options Presented in the Investigate Stage:


Number of Samples: If desired, increase the number of samples that appear for each pattern in the pattern report. Default is 1.

Frequency Cutoff: Patterns with a frequency of less than this number will not appear in the pattern or token reports. Default is 1. If desired, enter a higher number. (For example, if you enter 4, any pattern that occurs three times or fewer does not appear in the report.)
Token Report Options
For Word investigations, the following Token Report Options are available:
Treat Successive Unclassified Words as One Word. Select to strip out the spaces between unclassified words (concatenating them into one word). For example, THOMAS ALVA EDISON becomes THOMASALVAEDISON. This option reduces the number of patterns in both the pattern and token reports.
Include Unclassified Numeric Tokens. Select to list all number tokens in the token report. For example, when investigating an address column, you probably do not want to see house and apartment numbers, but you might want to see numbers if you are investigating part numbers.
Include Unclassified Alphas. Select to include all word tokens that are not in the classification table for the given rule set in the token reports. If you do not select this option, the report includes only tokens from the rule set's classification table.
Include Mixed Types and Punctuated Words. Select to include tokens with leading or trailing numerics, such as 109TH and 42ND, in the token report.
Classified Tokens
You can include classified tokens by choosing any of the following options:
Standard Abbreviation. Select to display the standardized representation of the token from the rule set's classification table.
Original Spelling. Select to display the form the token takes in the data file.
Corrected Spelling. Corrects any misspellings as long as the Classification table has a weight assigned to the token.
Additional Options
You can select any of the following options:
Separator List. Includes all special characters that separate tokens. Edit the list to add or remove characters.
Strip List. Includes all special characters from the Separator List that are not to be a token. (For example, the pound sign (#) by default is not part of this list, thus APT#3A is three tokens: APT, #, and 3A.) Edit the list to add or remove characters. 
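The APT#3A example can be mimicked with a small shell sketch (this is not QualityStage's parser, only an illustration of the difference): a character that is in the Separator List but not in the Strip List is kept as its own token, while a character that is also in the Strip List still separates the words but is thrown away.

# "#" as a separator that is kept as a token -> APT # 3A (three tokens)
echo "APT#3A" | awk '{ gsub(/#/, " # "); print }'

# "#" as a strip character: it separates, but is discarded -> APT 3A (two tokens)
echo "APT#3A" | awk '{ gsub(/#/, " "); print }'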



If you need more information on QualityStage, follow our blog. We will keep updating this blog with QualityStage information.


Some Roles and Responsibilities for a QualityStage Resume



Roles and Responsibilities of QS:

  1. Acting as a Global DQ Focal as part of the DQ Team, helping out the local DQ Teams as well as training the teams on QualityStage.

  2. Working with the SMEs of the source systems to understand the business process and gather the requirements.

  3. Preparing the DQ requirement sheets and taking sign-off from the SMEs before getting started with the design of the actual jobs.

  4. Generating the Pattern report and Token report as requested by the Business to identify the data patterns and inconsistencies in the source data.

  5. Involved in discussions with the SMEs to validate the reports generated by the Investigate stage and preparing the ACL (Action Check List) document from the suggestions made by the Business.

  6. Updating/modifying existing rule sets according to the ACL document to get the best output in further cycles.

  7. Effectively used the Standardize stage to standardize the source data by using the existing rule sets such as name, address and area for multiple countries, and generated the valid and invalid data reports.

  8. Involved in discussions with the Business again to validate the standardized reports and prepared the ACL document again with the action points.

  9. Updating/modifying the existing rule sets, specifically the PAT file, to handle unhandled patterns according to the ACL document wherever needed.

  10. Involved in building customized rule sets for the countries for which there are no default rule sets.

  11. Involved in the design of Match templates based on the suggestions made by the Business and validated the results to identify the duplicates coming from the source, depending on the different match types.

  12. Designed the jobs by calling the prepared Match templates and generated the results.

  13. Designed the jobs by using the different survive techniques available in the Survive stage, passing the Match stage output as input, to get the best breed of record to the target.

  14. Involved in discussions with the Business to validate the data report generated by Survive and taking sign-off from them on the data if there are no issues.

  15. Effectively worked on the AVI stage to validate the address information coming from multiple countries.

  16. Generated the Suggestion report requested by the source team by using AVI to get the suggestions made by the stage to validate the data.

  17. Generated the Correction reports, such as the valid data report and invalid data report requested by the Business, to correct invalid data such as incorrect postal codes, cities, PO Boxes, states and organization names coming from the source.

  18. Implemented the business rules defined by the Business to validate the source data by using the Data Rules stage and generated the reports to identify the valid and invalid data.


Roles and Responsibilities of IA:


1. Involved in IA administration: creating the IA projects, adding data sources to the projects, importing the metadata of the tables into the IA projects and giving IA permissions to the users to profile the data.

2. Involved in profiling the source data by using the IA tool.

3. Ran column analysis on the data and analyzed it to identify valid and invalid data column by column, and filled the DQ result sheet with the results of the column analysis.

4. Used various types of reference tables, such as the Valid Reference table, Invalid Reference table and Range Reference table, to validate the data in column analysis.

5. Created virtual columns to validate the data on more than one column at a time, or on a combination of columns.

6. Created virtual tables with some sample data from another table by placing a filter condition to validate the data.

7. Generated different kinds of IA reports for column analysis, such as the Invalid Format report and Invalid Data report.

8. Ran key analysis, both single-column and multi-column, to identify the primary key of the data and generated the IA reports for key analysis.

9. Ran cross-domain analysis to identify what percentage of one table's data matches the paired table's data and created the IA reports for cross-domain analysis.

10. Ran baseline analysis to compare the column analysis results of one cycle with another and generated the IA reports for baseline analysis.

11. Created the IA data rule definitions and data rule set definitions to validate the source data, and derived the data rules and data rule sets by binding the actual table columns to the rules.

12. Ran the data rules and rule sets to identify the valid and invalid data and exported the results to CSV files to share them with the Business.

13. Expertise in creating rule definitions and rules, and in executing rules in bulk from the command line to make the process faster.

XML File Read in DataStage


1.     Introduction:

DataStage is an ETL (Extract, Transform, Load) tool and a powerful data integration tool. Many DataStage users know how to load sequential files, CSV files and tables using DataStage. Currently, in most organizations, data is also stored in the form of XML files. XML is used to structure, store, and transport information.

2.     Purpose

The purpose of this document is to give simple steps to load XML files using DataStage. The document covers each step with screenshots for better understanding.

3.     Introduction to XML file:

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML is used in many aspects of web development, often to simplify data storage and sharing. A tag is a markup construct that begins with < and ends with >. Tags come in the following flavors:
·         start-tags; for example: <section>
·         end-tags; for example: </section>










4.     Steps to Load XML files

4.1 Sample XML file-Creation:

·         Place the text into Notepad
·         Save it with a .xml extension
·         Open the XML file with Internet Explorer. If it does not show any parser error, the file is well-formed.
·         Ex: Employee sample XML file. It gives information about employees (a sketch of such a file follows below).
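For reference, a minimal employee sample file (the element names and values here are only an illustration, not a required layout) can be created from the shell like this and then opened in the browser for the parser check mentioned above:

# write a small sample employee XML file (names and values are illustrative only)
cat > employee.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<employees>
   <employee>
      <id>101</id>
      <name>John Smith</name>
      <department>Finance</department>
   </employee>
   <employee>
      <id>102</id>
      <name>Jane Doe</name>
      <department>IT</department>
   </employee>
</employees>
EOF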

4.2 Data Stage Designer-Login:

·         Go to Start -> Programs -> IBM Information Server -> IBM WebSphere DataStage and QualityStage Designer
·         Or else click the shortcut on the Desktop
·         Enter the Domain, Username, Password and Project details

4.3 Parallel Job –Creation:

·         Log in to the DataStage Designer window
·         File -> New-> Parallel Job

4.4 XML File Metadata –Import

·         In the DataStage Designer window go to
·         Import -> Table Definition-> XML Table Definition
·         A new XML Metadata Importer window will open.
·         In the XML Metadata Importer go to
·         File -> Open -> File
·         Give the XML file name in the file name tab.
·         The XML file metadata will appear on the screen.
·         Check all the check boxes on the screen. (If any fields are not required, do not check their boxes.)
·         Check one of the key columns in the Table Definition.

·         Then go to File -> Save
·         Give the DataStage table definition name and folder path, and then click OK

·         File -> Exit (to come out of the XML Metadata Importer window)












4.5 XML Job Design

·         Drag two Sequential File stages from the Palette -> File
·         The first file is used to give the XML file name and the second file is used to load the XML data
·         Drag the XML Input stage from the Palette -> Real Time
·         The XML Input stage will import the metadata of the XML file and process the transactions.
·         Then save the job: in the Designer window go to File -> Save As
·         Give the Item name and Folder path


·         Double click on the first Sequential File stage
·         In the Stage tab, give the XML file name in the File path
·         In the Output tab, give any value for the column name, SQL type and Length


·         In the Format tab, select the record type value as ‘implicit’ and the Delimiter as ‘none’, and click OK
·         Next, open the XML Input stage; in the Stage tab, under the General tab, check the below values

·         Stage tab -> Transformation settings, check the below fields and load the XML parsing code for the file.
·         Click on the Load button and select the XML file, then click OK
·         After loading the parsing code, the screen should look like the screenshot below


·         In the Input tab, under the XML Source tab, select the below options.

·         In the Output tab, under Transformation Settings, select the below options.
·         In the Output tab, under Columns, load the metadata of the XML file.

·         After all steps are done, click OK in the XML Input stage.





·         Open the target Sequential File stage, give the target file name, then click OK
·         Finally, compile the job, then run it.
·         After a successful run, the data will be loaded into the target file.
·         Target file output.