Friday 24 October 2014

How does the History process work?


History processing is the process of keeping and maintaining current and historical information based on dates.
REC_ACDT & EFF_STARTDT --> (Arrival date) This date column marks when the record becomes effective. When a record appears in the Warehouse for the first time, this column is set to the run date of the job.
REC_EDT & EFF_ENDDT --> (Expiring date) This date marks the end of the record's validity. When a record appears for the first time, EFF_ENDDT is set to 9999-12-31.
We do history processing on Warehouse tables and on tables that contain data fields and non-key fields.
There are 3 kinds of history process:
1) Standard History Process(SHP)
2) Refresh History Process(RHP)
3) Expire History Process (EHP)
--> Standard History Process: In SHP the old data is kept, but when a new version of a record arrives, the old version becomes a non-active record. When a record is upserted (updated and inserted), the end date of the old version is changed to the date the new version is inserted, and the end date of the new version becomes 9999-12-31, because we don't yet know when it will expire.
--> Refresh History Process: In RHP the old data is not kept; the new data simply replaces the old data, and the start date of the record becomes the date of the new upsert. In RHP the end date is always 9999-12-31, because expired data never exists.
--> Expire History Process: In EHP no old records stay active; all of them get expired unless they are repeated in the new load. Only the new data is active, until the next load is inserted.
* Tables with examples of each History Process type and the differences between each type are shown at the end of this post; a small code sketch of the SHP idea follows below.
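To make the idea concrete, here is a minimal Python sketch of SHP-style processing. The REC_ACDT/REC_EDT column names come from the description above; the in-memory layout, the function name and the "NK" key field are illustrative assumptions, not the actual DataStage implementation.

HIGH_DATE = "9999-12-31"

def shp_upsert(target, source_rows, run_date):
    """Standard History Process sketch: expire the old version, insert the new one."""
    for src in source_rows:
        # find the currently active version of this natural key, if any
        active = next((t for t in target
                       if t["NK"] == src["NK"] and t["REC_EDT"] == HIGH_DATE), None)
        if active and all(active[c] == src[c] for c in src):
            continue                        # nothing changed, keep the active row
        if active:
            active["REC_EDT"] = run_date    # old version is no longer active
        new_row = dict(src)
        new_row["REC_ACDT"] = run_date      # arrival date = run date of the job
        new_row["REC_EDT"] = HIGH_DATE      # open-ended until the next change
        target.append(new_row)
    return target

RHP would instead overwrite the active row in place (so REC_EDT stays 9999-12-31), and EHP would additionally expire every active row whose key does not appear in the incoming load.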
======= ===============
What parameters do I need to consider when calling Generic History processing?

The generic history processing sequence sqGenericHistoryProcess requires its parameters to be passed correctly. There are a few common issues to consider.
Check the parameter help text, but please note:
1.      When creating the PX dataset for use as input to the history processing sequence, please always ensure that RCP is turned OFF on the output link.
2.      Parameter pSELECT_COLS should be all columns in the target table EXCEPT REC_TS
3.      DO NOT include REC_ACDT as a "key column"

======== ==============
Why does my job loop in the Generic run script waiting for PX nodes?

All Datasequences that are called using the generic run scripts and have the $APT_CONFIG_FILE parameter defined will go through the dynamic node allocation process. The set-up requirements are detailed in the Datastage Dynamic Node Allocation doc.
The process that triggers the allocation of nodes was amended to be centralised. The detail of this is in the PX Centralised Node Allocation doc.
In essence you will need to run:
$UKDWENV/bin/DS_PX_node_control.ksh
====== ================
How do the Generic Components of DataStage work?

Generic Components:
-Key Lookup container
-Key Generator Container
-Rejection handler container
Multiple Instance Jobs / Generic Sequences
1) Generic HP Sequence
2) Generic Re-Startable
3) Generic Extraction Sequence
-First step of outbound job --> Extraction
-Key Lookup container
It is used for retrieving the surrogate key from the XREF table, provided that we have the natural key as the input.
When the natural key is passed to the container, it refers to the latest-timestamp DataSet, which is stored in a shared data path (pSHAREDDATAPATH). This DataSet contains 2 columns: the surrogate key and the natural key. The lookup matches the incoming natural key from the job against the natural key in this DataSet; if no surrogate key is found for it, the record is rejected.
The container reads the natural key as one of the input parameters.
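As a rough Python sketch of what the container does, assuming the latest-timestamp DataSet has already been read into a dictionary from natural key (NK) to surrogate key (SK) — the names here are illustrative, not the container's real internals:

def key_lookup(input_rows, nk_to_sk):
    # NKs found in the shared NK -> SK DataSet get their surrogate key,
    # the rest go down the reject link.
    matched, rejected = [], []
    for row in input_rows:
        sk = nk_to_sk.get(row["NK"])
        if sk is None:
            rejected.append(row)           # no surrogate key yet -> reject
        else:
            matched.append({**row, "SK": sk})
    return matched, rejected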
Key Generator container
It is also used for retrieving the surrogate key from the XREF, but additionally it generates surrogate keys for those records which do not have a surrogate key yet.
When the natural key is passed to the container, it refers to the latest-timestamp DataSet, which contains the SK and NK. The lookup checks whether that NK already has an SK. If the SK is not found, a TD macro (stored procedure) gets the next value for the SK and assigns it to that NK. This way every NK has its respective SK passed down the link.
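By contrast, the Key Generator behaviour can be sketched like this; the next_sk counter merely stands in for the Teradata macro that hands out the next surrogate key, and all names are assumptions:

def key_generate(input_rows, nk_to_sk, next_sk):
    # Reuse the SK when the NK is already in the XREF, otherwise allocate
    # the next SK (the TD macro is simulated by a simple counter).
    out = []
    for row in input_rows:
        sk = nk_to_sk.get(row["NK"])
        if sk is None:
            sk = next_sk                   # stand-in for the TD macro call
            next_sk += 1
            nk_to_sk[row["NK"]] = sk       # the new NK/SK pair joins the XREF
        out.append({**row, "SK": sk})
    return out, next_sk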
Rejection handler container
This container takes the rejected records as its input.
It produces the rejection DS file in the temp file path.
It checks the input data against the constraint in the transformer; if a record is rejected, the rejection columns (such as REJ_NO) are generated and the record is written to a reject file created at the Temp File Path (a simplified sketch follows the field list below).
The DS file name will be as follows:
#SUBJECTNAME.rej
-Fields of the reject file:
REJ_NO: The sequence number of the rejection.
REJ_DESC: Description of why the record was rejected ("Invalid account", for example).
REC_TYPE: The record type (the column itself).
REJ_STAGE: The name of the cleanse job that produced the rejection.
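A simplified Python sketch of the rejection handling, using the four fields above; the constraint callback, file handling and delimiter are assumptions for illustration only:

def handle_rejects(rows, constraint, subject_name, temp_path, cleanse_job):
    # constraint(row) returns None when the row passes, or a reject reason.
    rejects = []
    for i, row in enumerate(rows, start=1):
        reason = constraint(row)
        if reason:
            rejects.append({
                "REJ_NO": i,                      # sequence number
                "REJ_DESC": reason,               # e.g. "Invalid account"
                "REC_TYPE": row.get("REC_TYPE"),  # type of the rejected record
                "REJ_STAGE": cleanse_job,         # cleanse job name
            })
    # the reject file is created in the temp file path as <SUBJECTNAME>.rej
    with open(f"{temp_path}/{subject_name}.rej", "w") as out:
        for r in rejects:
            out.write("|".join(str(v) for v in r.values()) + "\n")
    return rejects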
Multiple Instance Jobs / Generic Sequences
Shared jobs can be used across several projects and several instances of a job can be run at the same time. This is called multiple instance jobs.
In DS Director we are able to find the log of a multiple instance job because the log name will carry the invocation id.
Invocation id:

<job_name> <subject_name> <schema_name><table_name>
The invocation id is composed of several names, one of which is the subject name; it identifies different instances of the same sequence running under different projects. Some sequences may have more than one instance running at the same time in different projects, which is why a sequence must have a different identification for each project.
Generic Extraction Sequence:
The first step of an Outbound job is extraction, but if you have, for example, more than 20 tables, extracting them one by one would take too much time, so you can call the generic extraction sequence and do it faster. It is used to extract data from the warehouse for outbound jobs.
It can be found under /jobs/shared/PxExtractDB2_generic
Parameters:
pTABLE_NAME: The table name.
pCLEASEDDATASET
pDATASET_NAME: The Data Set name is built from the first 5 column names; if the table has more than 5 columns, only the first five are used and _TRNC.ds is appended, just to show that the table has more than 5 columns.
pMode: The mode of extraction.
"S": the user defines the DataSet name and the SQL query.
"T": the default type of extraction is used.
pSCHEMAPATH
pSELECT_COLS: The columns you want to select.
pWHERE_CLAUSE: Put your where clause.
pCMD_RD_FLG
$TDPATH
pSELECT_STMT
Generic Re-startable load sequence:
This is used to load the data. You can use as many generic restartable sequences as you need; however, each one has a limitation of 6 keys, so if a table has 3 keys, each key uses one of the 6 key slots within the generic restartable sequence, and the key slots that are not used are set to " ".
If something goes wrong during the loading process you can restart the sequence, because this generic container checks whether the records already exist in the target before loading them.
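In rough Python terms, the restart safety net amounts to the check below; the key handling is simplified and the names are assumptions (the real sequence drives this through a lookup DataSet and up to 6 key fields):

def restartable_load(rows, already_loaded_keys, key_fields):
    # Before loading, look up which key combinations already exist in the
    # target and only load the missing rows, so a re-run is safe.
    to_load = []
    for row in rows:
        key = tuple(row.get(k, " ") for k in key_fields)   # unused keys -> " "
        if key not in already_loaded_keys:
            to_load.append(row)
    return to_load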
Parameters:
pKEYFIELD: The name of the key field (e.g. "IP_ID").
pKEYFIELD_TYPE: The type of the key field (e.g. "int64").
pNUMKEYS: The number of keys that will be used.
pFORCE_LOOKUP: Forces a check of whether the table was already loaded.
pOPTIONAL_FILTER: The extraction needs a where clause, so this carries the where clause to check.
pLOOKUP_DATASET_ALREADY_EXISTS: Indicates whether a lookup dataset already exists.
History processing container
It takes care of history processing and loads the data.
You can define the type of history process you want by setting the relevant parameter to SHP, RHP or EHP.
pFUNCFIELD: The function. (“UPDATE”)
pHPTYPE: Type of history processing. (“SHP”)
pFUNC_TYPE: The type of the function. “string[1]”
pCMT_RD_FLG: “N”
pSELECT_COLS
- DO NOT include REC_ACDT as a "key column"

====== ================
Partitioning Techniques:
When the data is partitioned, the job runs faster and performance improves.
*Node: A node is a processing space, manually defined by the administrator, depending on the available hardware resources.
-Hash Partitioning:

It is based on a function of the hash partitioning keys. Records with the same values for all hash key fields are grouped on the same processing node.
Hash partitioning ensures that records with the same keys end up in the same partition, so we can sort the records using the hash key fields and remove duplicates.
After a hash partition, partitions will not necessarily have the same number of records.
--> Records are partitioned based on the hash key.
--> A single primary partitioning key must be defined, plus as many secondary keys as required.
(Primary key + secondary keys <= total number of columns)
--> The primary partitioning key should ideally be the field with the highest cardinality.
--> Fields that assume only two values are poor choices of hash key, because resources are wasted (only two nodes would be used), which decreases performance. A sketch of the idea follows this list.
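A minimal sketch of the assignment, using Python's built-in hash purely for illustration (the PX engine uses its own hash function):

def hash_partition(rows, key_fields, num_nodes):
    # Records with the same hash-key values always land on the same partition,
    # so per-partition sorting and duplicate removal remain correct.
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        parts[hash(key) % num_nodes].append(row)
    return parts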
-Modulus Partitioning:

Partitioning is based on the mathematical modulus operation:
Partition_number = field_value mod number_of_partitions
-->Field value is a numeric field of the input data set
-->Number of partitions is the number of processing nodes.
If field value = 125 and number of partitions = 4, this record will be placed in partition number 1, because 125 mod 4 = 1.
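The same calculation written out in Python, just to confirm the example:

def modulus_partition(field_value, num_partitions):
    # partition_number = field_value mod number_of_partitions
    return field_value % num_partitions

print(modulus_partition(125, 4))   # prints 1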
-Round Robin Partitioning:

Data is distributed circularly starting from the first partition. When the last partition is reached, the next partition will be the first one.
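Sketched in Python, round robin simply cycles over the partitions:

from itertools import cycle

def round_robin_partition(rows, num_partitions):
    # Rows are dealt out one at a time: partition 0, 1, ..., n-1, 0, 1, ...
    parts = [[] for _ in range(num_partitions)]
    for row, p in zip(rows, cycle(range(num_partitions))):
        parts[p].append(row)
    return parts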
-Random Partitioning:

Data is randomly distributed amongst partitions. Random partitioning can rebalance the partitions to ensure that each partition receives a similar amount of data.
-Same Partitioning:

This is the fastest partitioning method, because the data is not repartitioned. The input of a stage is exactly the output of the previous stage.
-Entire Partitioning:

Every partition node receives the entire input data set.
-Auto Partitioning:

It is the default method in DataStage. DataStage determines the best partitioning method depending on the stage. Generally, the chosen method is Round Robin for the initial partitioning and Same for intermediate stages.

======== ===========
Concepts of Parallelism:
It defines how data and processing are distributed.
Sequential processing: Each process needs to complete before passing the data to the next stage. It takes much longer than parallel processing. Server jobs are processed sequentially.
Pipeline Parallelism: As soon as a row is processed, it moves on to the next stage in the pipeline, so downstream stages start working before upstream stages finish.
Partition Parallelism: Data is partitioned among the nodes so that processing is done in parallel.
Pipeline + Partition Parallelism: It is the combination of both concepts above. A Parallel job uses both types of processing at the same time. (This is the one we use in BI.) A small sketch follows.
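A toy Python sketch of the combination, assuming a two-stage flow (clean, then enrich) over rows partitioned by a key; the data, stage names and use of multiprocessing are illustrative assumptions, not the PX engine itself:

from multiprocessing import Pool

NUM_PARTITIONS = 2                      # analogous to the number of PX nodes

def clean(row):                         # stage 1
    return {**row, "name": row["name"].upper()}

def enrich(row):                        # stage 2
    return {**row, "flag": "Y"}

def run_partition(rows):
    # Pipeline idea: each row flows through clean -> enrich one after the
    # other, without waiting for the whole partition to finish stage 1.
    return [enrich(clean(r)) for r in rows]

def partition(rows, n):
    # Partition idea: hash-partition the rows across n "nodes".
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[hash(r["ip_id"]) % n].append(r)
    return parts

if __name__ == "__main__":
    rows = [{"ip_id": i, "name": "cust%d" % i} for i in range(8)]
    with Pool(NUM_PARTITIONS) as pool:  # partitions are processed in parallel
        results = pool.map(run_partition, partition(rows, NUM_PARTITIONS))
    print(results)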


SHP Example

(RA = REC_ACDT, the arrival date; RE = REC_EDT, the expiring date.)

December 17th

Source table
C.NO  C.NAME  C.ADDR     C.TNO
100   ABC     BANK-Mega  1234
200   XYZ     BANK-Mega  1111

Target table
C.NO  C.NAME  ADDR       TNO   RA          RE
1111  ABC     BANK-Mega  1234  2010-12-17  9999-12-31
2222  XYZ     BANK-Mega  1111  2010-12-17  9999-12-31

December 18th

Source table
C.NO  C.NAME  C.ADDR     C.TNO
100   ABC     ABCD-BANK  1234
200   XYZ     BANK-Mega  9999
300   PQR     NO ADDR    1113

Target table
C.NO  C.NAME  ADDR       TNO   RA          RE
1111  ABC     BANK-Mega  1234  2010-12-17  2010-12-18
2222  XYZ     BANK-Mega  1111  2010-12-17  2010-12-18
1111  ABC     ABCD-BANK  1234  2010-12-18  9999-12-31
2222  XYZ     BANK-Mega  9999  2010-12-18  9999-12-31
3333  PQR     NO ADDR    1113  2010-12-18  9999-12-31

December 19th

Source table
C.NO  C.NAME  C.ADDR     C.TNO
100   ABC     ABCD-BANK  1234
200   XYZ     BANK-COP   9999
300   PQR     NO ADDR    1113

Target table
C.NO  C.NAME  ADDR       TNO   RA          RE
1111  ABC     BANK-Mega  1234  2010-12-17  2010-12-18
2222  XYZ     BANK-Mega  1111  2010-12-17  2010-12-18
1111  ABC     ABCD-BANK  1234  2010-12-18  9999-12-31
2222  XYZ     BANK-Mega  9999  2010-12-18  2010-12-19
3333  PQR     NO ADDR    1113  2010-12-18  9999-12-31
2222  XYZ     BANK-COP   9999  2010-12-19  9999-12-31

RHP Example

December 17th

Source table
C.NO  C.NAME  C.ADDR     C.TNO
100   ABC     BANK-Mega  1234
200   XYZ     BANK-Mega  1111

Target table
C.NO  C.NAME  ADDR       TNO   RA          RE
1111  ABC     BANK-Mega  1234  2010-12-17  9999-12-31
2222  XYZ     BANK-Mega  1111  2010-12-17  9999-12-31

December 18th

Source table
C.NO  C.NAME  C.ADDR     C.TNO
200   XYZ     BANK-Mega  9999
300   PQR     NO ADDR    1113

Target table
C.NO  C.NAME  ADDR       TNO   RA          RE
1111  ABC     BANK-Mega  1234  2010-12-17  9999-12-31
2222  XYZ     BANK-Mega  9999  2010-12-18  9999-12-31
3333  PQR     NO ADDR    1113  2010-12-18  9999-12-31

December 19th

Source table
C.NO  C.NAME  C.ADDR     C.TNO
200   XYZ     BANK-COP   9999
300   PQR     NO ADDR    1113

Target table
C.NO  C.NAME  ADDR       TNO   RA          RE
1111  ABC     BANK-Mega  1234  2010-12-17  9999-12-31
2222  XYZ     BANK-COP   9999  2010-12-19  9999-12-31
3333  PQR     NO ADDR    1113  2010-12-18  9999-12-31

EHP Example

December 17th

Source table
C.NO  C.NAME  C.ADDR     C.TNO
100   ABC     BANK-Mega  1234
200   XYZ     BANK-Mega  1111

Target table
C.NO  C.NAME  ADDR       TNO   RA          RE
1111  ABC     BANK-Mega  1234  2010-12-17  9999-12-31
2222  XYZ     BANK-Mega  1111  2010-12-17  9999-12-31

December 18th

Source table
C.NO  C.NAME  C.ADDR     C.TNO
200   XYZ     BANK-Mega  9999
300   PQR     NO ADDR    1113

Target table
C.NO  C.NAME  ADDR       TNO   RA          RE
1111  ABC     BANK-Mega  1234  2010-12-17  2010-12-18
2222  XYZ     BANK-Mega  9999  2010-12-18  9999-12-31
3333  PQR     NO ADDR    1113  2010-12-18  9999-12-31

December 19th

Source table
C.NO  C.NAME  C.ADDR     C.TNO
400   MNP     NO ADDR    4321

Target table
C.NO  C.NAME  ADDR       TNO   RA          RE
1111  ABC     BANK-Mega  1234  2010-12-17  2010-12-18
2222  XYZ     BANK-Mega  9999  2010-12-18  2010-12-19
3333  PQR     NO ADDR    1113  2010-12-18  2010-12-19
4444  MNP     NO ADDR    4321  2010-12-19  9999-12-31

Differences between SHP, RHP and EHP

SHP:
- All the changes are stored.
- When a record is changed, an UPSERT operation is performed.
- To find the latest change of a record, look for the record whose REC_EDT value is 9999-12-31.
- When a record is in the target table but not in the source table, nothing is done.

RHP:
- Just the newest (refreshed) information is maintained.
- When a record is changed, it is deleted and reinserted with the changes.
- All values in the REC_EDT column are equal to 9999-12-31.
- When a record is in the target table but not in the source table, nothing is done.

EHP:
- Expired records (without their whole change history) and refreshed records are stored.
- When a record is changed, it is refreshed as in RHP.
- Expired or inactive records have a REC_EDT value different from 9999-12-31; active records have REC_EDT equal to 9999-12-31.
- When a record is in the target table but not in the source table, it is expired.