How does the History Process work?
History Process is the process of keeping and maintaining old and current information based on dates.
REC_ACDT & EFF_STARTDT --> (Arrival date) This date column applies at three points of the record's life. When the record appears in the Warehouse for the first time, this column is set based on the run date of the job.
REC_EDT & EFF_ENDDT --> (Expiring date) This date column represents the end of the record. When the record appears for the first time, this column is set to 9999-12-31.
We perform history processing on Warehouse tables and on tables that contain data fields and non-key fields.
There are 3 kinds of history process:
1) Standard History Process (SHP)
2) Refresh History Process (RHP)
3) Expire History Process (EHP)
--> Standard History Process: In SHP the old data is maintained, but when new records arrive, the old data becomes non-active. When a record is upserted (updated and inserted), the end date of the old record is changed to the arrival date of the new data, and the end date of the new record becomes 9999-12-31, because we do not yet know when it will expire.
--> Refresh History Process: In RHP the old data is not maintained; the new data simply replaces the old data, and the start date of the record becomes the date of the new upsert. In RHP the end date is always 9999-12-31, because old data never exists.
--> Expire History Process: In EHP no old data stays active; all existing records are expired unless they are repeated in the new records. Only the new data is active, until newer data is inserted.
* The tables with examples of each History Process type, and the differences between each type of HP, are in the attached file.
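As an illustration only, the core date logic of the three history-processing types can be sketched in Python as follows. This is a minimal sketch, not the actual DataStage implementation; the record layout, keying and function names are assumptions made for the example.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)              # end date of an active record

def shp(target, source, run_date):
    """Standard HP: keep history; expire the old version and insert the new one."""
    for key, row in source.items():
        versions = target.setdefault(key, [])
        active = next((v for v in versions if v["RE"] == OPEN_END), None)
        if active is None or active["data"] != row:
            if active is not None:
                active["RE"] = run_date            # old record becomes non-active
            versions.append({"data": row, "RA": run_date, "RE": OPEN_END})
    return target

def rhp(target, source, run_date):
    """Refresh HP: no history; a changed record simply replaces the old one."""
    for key, row in source.items():
        current = target.get(key)
        if current is None or current[0]["data"] != row:
            target[key] = [{"data": row, "RA": run_date, "RE": OPEN_END}]
    return target

def ehp(target, source, run_date):
    """Expire HP: records missing from the source are expired; records present
    in the source are refreshed as in RHP (only the newest version stays active)."""
    for key, versions in target.items():
        if key not in source:                      # no longer arriving -> expire
            for v in versions:
                if v["RE"] == OPEN_END:
                    v["RE"] = run_date
    for key, row in source.items():
        versions = target.setdefault(key, [])
        active = next((v for v in versions if v["RE"] == OPEN_END), None)
        if active is None or active["data"] != row:
            if active is not None:
                versions.remove(active)            # drop the old active version
            versions.append({"data": row, "RA": run_date, "RE": OPEN_END})
    return target
```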
======= ===============
What parameters do I need to consider when calling Generic History processing?
The generic history processing sequence sqGenericHistoryProcess requires the parameters to be passed correctly. There are a few common issues to consider. Check the parameter help text, but please note:
1. When creating the PX dataset for use as input to the history processing sequence, always ensure that RCP is turned OFF on the output link.
2. Parameter pSELECT_COLS should contain all columns in the target table EXCEPT REC_TS.
3. DO NOT include REC_ACDT as a "key column" (see the illustration below).
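As a hypothetical illustration of rules 2 and 3 (the table and column names below are invented for the example):

```python
# Columns of an imaginary target table, used only for this illustration.
target_columns = ["IP_ID", "IP_NM", "REC_ACDT", "REC_EDT", "REC_TS"]
key_columns = ["IP_ID"]                              # business key only

# Rule 2: pSELECT_COLS = all target-table columns EXCEPT REC_TS
pSELECT_COLS = ",".join(c for c in target_columns if c != "REC_TS")

# Rule 3: REC_ACDT must never be passed as a key column
assert "REC_ACDT" not in key_columns
```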
======== ==============
Why does my job loop in the Generic run script waiting for PX nodes?
All sequences that are called using the generic run scripts and have the $APT_CONFIG_FILE parameter defined will go through the dynamic node allocation process. The set-up requirements are detailed in the DataStage Dynamic Node Allocation doc. The process to trigger the allocation of nodes was amended to be centralised; the detail of this is in the PX Centralised Node Allocation doc.
In essence you will need to run:
$UKDWENV/bin/DS_PX_node_control.ksh
====== ================
How do the Generic Components of DataStage work?
Generic Components:
- Key Lookup container
- Key Generator container
- Rejection handler container
Multiple Instance Jobs / Generic Sequences:
1) Generic HP Sequence
2) Generic Re-Startable Load Sequence
3) Generic Extraction Sequence
- First step of an outbound job --> Extraction
Key Lookup container
It is used for retrieving the surrogate key from the XREF table, provided that we have the natural key as the input.
When the natural key is passed to the container, it refers to the latest-timestamp DataSet stored in a shared data path (pSHAREDDATAPATH). This DataSet contains 2 columns: surrogate key and natural key. The lookup is performed between the natural key from this DataSet and the incoming natural key from the job; if no matching surrogate key is found, the record is rejected.
The container reads the natural key as one of its input parameters.
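A minimal sketch of the lookup-and-reject behaviour described above, assuming an in-memory XREF mapping (the real container works on a DataStage DataSet):

```python
# Field names and the dict-based XREF are assumptions made for the example.
def key_lookup(xref, incoming_rows):
    """xref: {natural_key: surrogate_key} loaded from the latest-timestamp DataSet."""
    matched, rejected = [], []
    for row in incoming_rows:
        sk = xref.get(row["natural_key"])
        if sk is None:
            rejected.append(row)                    # no surrogate key found -> reject
        else:
            matched.append({**row, "surrogate_key": sk})
    return matched, rejected
```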
Key Generator container
It is also used for retrieving the surrogate key from the XREF table, but it also generates surrogate keys for those records that do not have a surrogate key yet.
When the natural key is passed to the container, it refers to the latest-timestamp DataSet, which contains the SK and NK. The lookup checks whether that NK already has an SK or not. If the SK is not found, a TD macro (stored procedure) gets the next value for the SK and assigns it to that NK. This way every NK has its respective SK passed to the link.
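A minimal sketch of the generate-if-missing behaviour described above, assuming an in-memory XREF and a simple counter in place of the TD macro:

```python
# Names are assumptions; the real component obtains the next surrogate key
# value from a TD macro (stored procedure), not a Python counter.
def key_generate(xref, incoming_rows, next_sk):
    """xref: {natural_key: surrogate_key}; next_sk: next unused surrogate key value."""
    out = []
    for row in incoming_rows:
        nk = row["natural_key"]
        if nk not in xref:                          # NK has no SK yet -> generate one
            xref[nk] = next_sk
            next_sk += 1
        out.append({**row, "surrogate_key": xref[nk]})
    return out, next_sk
```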
Rejection handler container
This container takes the rejected records as its input.
It produces the rejection DS file in the temp file path.
It checks the input data against the constraint in the transformer; if a record is rejected, the container generates the 4th column (REC_NO) and the record is written to a reject file created at the Temp File Path (see the sketch below).
The DS file name will be as follows:
#SUBJECTNAME.rej
Fields of the reject file:
REJ_NO: The sequence number.
REJ_DESC: Description of why the record was rejected ("Invalid account", for example).
REC_TYPE: The column itself.
REJ_STAGE: Cleanse job name.
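A minimal sketch of how such a reject file could be produced, assuming an invented row layout and a plain text file in place of the DataStage DataSet:

```python
# The row layout, "reason"/"record" fields and output format are assumptions;
# the real container writes a DataSet named #SUBJECTNAME.rej in the Temp File Path.
def handle_rejects(rejected_rows, subject_name, stage_name, temp_path):
    reject_records = []
    for seq, row in enumerate(rejected_rows, start=1):
        reject_records.append({
            "REJ_NO": seq,                          # sequence number
            "REJ_DESC": row.get("reason", ""),      # why the record was rejected
            "REC_TYPE": row.get("record", ""),      # the rejected column/record itself
            "REJ_STAGE": stage_name,                # cleanse job name
        })
    with open(f"{temp_path}/{subject_name}.rej", "w") as reject_file:
        for rec in reject_records:
            reject_file.write(str(rec) + "\n")
    return reject_records
```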
Multiple Instance Jobs / Generic Sequences
Shared jobs can be used across several projects, and several instances of a job can be run at the same time. These are called multiple instance jobs.
In DS Director we are able to find the log of a multiple instance job because the log name carries the invocation id.
Invocation id:
<job_name> <subject_name> <schema_name> <table_name>
The invocation ID is composed of several names; one of them is the subject name, which identifies different instances of the same sequence running under different projects. Some sequences may have more than one instance running at the same time in different projects, so a sequence must have a different identification for each project.
Generic Extraction Sequence:
The first step of the Outbound job is extraction, but if you have, for example, more than 20 tables, it will take too much time to extract them one by one, so you can call a generic Extraction sequence and do it faster. It is used to extract data from the warehouse for outbound jobs.
It can be found under /jobs/shared/PxExtractDB2_generic
Parameters:
pTABLE_NAME: The table name.
pCLEASEDDATASET
pDATASET_NAME: It builds the Data Set name from the first 5 column names; if the table has more than 5 columns, it takes the first five and appends _TRNC.ds, to show that there are more than 5 columns in that table (see the sketch after this list).
pMode: The mode of extraction.
"S": the user defines the DataSet name and SQL query.
"T": the default type of extraction.
pSCHEMAPATH
pSELECT_COLS: The columns you want to select.
pWHERE_CLAUSE: Your where clause.
pCMD_RD_FLG
$TDPATH
pSELECT_STMT
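A hedged sketch of the pDATASET_NAME naming rule above; the separator between the column names is an assumption made for the example:

```python
def dataset_name(table_columns):
    # More than 5 columns: keep the first five names and mark the name as truncated.
    if len(table_columns) > 5:
        return "_".join(table_columns[:5]) + "_TRNC.ds"
    return "_".join(table_columns) + ".ds"

# dataset_name(["C_NO", "C_NAME", "ADDR", "TNO", "RA", "RE"])
# -> "C_NO_C_NAME_ADDR_TNO_RA_TRNC.ds"
```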
Generic Re-startable load sequence:
This is used to load the data. You can use as many generic restartable sequences as you need; however, each one has a limitation of 6 keys, so if a table has 3 keys, each key will use one of the 6 key slots within the generic restartable sequence, and the keys that aren't used are set to " " (see the sketch below).
In case something goes wrong during the loading process, you can restart your sequence, because this generic container looks up whether the records already exist in the target before loading them.
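A minimal sketch of the 6-key padding behaviour described above, with illustrative key names:

```python
MAX_KEYS = 6   # the generic restartable sequence supports at most 6 keys

def pad_keys(table_keys):
    if len(table_keys) > MAX_KEYS:
        raise ValueError("the generic restartable sequence supports at most 6 keys")
    # Unused key slots are filled with " "
    return table_keys + [" "] * (MAX_KEYS - len(table_keys))

# pad_keys(["IP_ID", "ACCT_ID", "PROD_ID"])
# -> ["IP_ID", "ACCT_ID", "PROD_ID", " ", " ", " "]
```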
Parameters:
pKEYFIELD: The name of the key field ("IP_ID").
pKEYFIELD_TYPE: The type of the key field ("int64").
pNUMKEYS: Number of keys that will be used.
pFORCE_LOOKUP: Checks whether the table was already loaded.
pOPTIONAL_FILTER: The extraction needs to have a where clause, so this checks the where clause.
pLOOKUP_DATASET_ALREADY_EXISTS: Indicates whether a lookup dataset already exists.
History processing container
It takes care of history processing and loads the data.
You can define the type of history process you want by setting the relevant parameter to SHP, RHP or EHP.
pFUNCFIELD: The function ("UPDATE").
pHPTYPE: Type of history processing ("SHP").
pFUNC_TYPE: The type of the function ("string[1]").
pCMT_RD_FLG: "N"
pSELECT_COLS
- DO NOT include REC_ACDT as a "key column".
====== ================
Partitioning Techniques:
When the data is partitioned, the job runs faster and performance is improved.
*Node: A node is a processing space, manually defined by the administrator, depending on the resources of the HD.
-Hash Partitioning:
It is based on a function of the hash partitioning keys. Records with the same values for all hash key fields are grouped in the same processing node.
Hash partitioning ensures that records with the same keys are grouped in the same partition, so we can sort the records using the hash key fields and remove duplicates (see the sketch after this list).
After a hash partition, the partitions will not necessarily have the same number of records.
--> Records are partitioned based on the hash key.
--> A single primary partitioning key must be defined, and as many secondary keys as required.
(Primary key + secondary keys <= total number of columns)
--> The primary partitioning key should be the one with the biggest cardinality.
--> Fields that assume only two values are poor choices of hash key, because resources are wasted (just two nodes would be used), which decreases performance.
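A minimal sketch of the hash partitioning idea, with invented field names, showing that identical key values always map to the same partition:

```python
def hash_partition(rows, key_fields, num_partitions):
    # Rows with the same hash-key values always land in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        partitions[hash(key) % num_partitions].append(row)
    return partitions
```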
-Modulus Partitioning:
Partitioning based on the mathematical modulus operation:
partition_number = field_value mod number_of_partitions
--> field_value is a numeric field of the input data set.
--> number_of_partitions is the number of processing nodes.
If field_value = 125 and number_of_partitions = 4, the record will be placed in partition number 1, because 125 mod 4 = 1 (see the sketch below).
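The worked example above expressed as a tiny sketch:

```python
def modulus_partition(field_value, num_partitions):
    return field_value % num_partitions

assert modulus_partition(125, 4) == 1   # 125 mod 4 = 1 -> partition 1
```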
-Round Robin Partitioning:
Data is distributed circularly, starting from the first partition. When the last partition is reached, the next record goes back to the first partition.
-Random Partitioning:
Data is randomly distributed amongst partitions. Random partitioning can rebalance the partitions to ensure that each partition receives a similar amount of data.
-Same Partitioning:
This is the fastest partitioning method, because data is not repartitioned. The input of a stage is exactly the output of the previous stage.
-Entire Partitioning:
Every partition node receives the entire input data set.
-Auto Partitioning:
This is the default method in DataStage. DataStage determines the best partitioning method depending on the stage. Generally the chosen method is Round Robin for initial partitioning and Same for intermediate stages.
======== ===========
Concepts of Parallelism:
It defines how data is distributed.
Sequential file processing:
The process needs to complete before passing the data to the next stage. It takes much longer compared to parallel processing. A server job is processed sequentially.
Pipeline Parallelism:
As soon as a row is processed, it moves on to the next stage.
Partition Parallelism:
Data is partitioned among the nodes so that processing is done in parallel.
Pipeline + Partition Parallelism:
It is the union of both concepts above. A parallel job uses both types of processing at the same time. (This is the one we use in BI.)
SHP Example

December 17th
Source table:
| C.NO | C.NAME | C.ADDR    | C.TNO |
| 100  | ABC    | BANK-Mega | 1234  |
| 200  | XYZ    | BANK-Mega | 1111  |
Target table:
| C.NO | C.NAME | ADDR      | TNO  | RA         | RE         |
| 1111 | ABC    | BANK-Mega | 1234 | 2010-12-17 | 9999-12-31 |
| 2222 | XYZ    | BANK-Mega | 1111 | 2010-12-17 | 9999-12-31 |

December 18th
Source table:
| C.NO | C.NAME | C.ADDR    | C.TNO |
| 100  | ABC    | ABCD-BANK | 1234  |
| 200  | XYZ    | BANK-Mega | 9999  |
| 300  | PQR    | NO ADDR   | 1113  |
Target table:
| C.NO | C.NAME | ADDR      | TNO  | RA         | RE         |
| 1111 | ABC    | BANK-Mega | 1234 | 2010-12-17 | 2010-12-18 |
| 2222 | XYZ    | BANK-Mega | 1111 | 2010-12-17 | 2010-12-18 |
| 1111 | ABC    | ABCD-BANK | 1234 | 2010-12-18 | 9999-12-31 |
| 2222 | XYZ    | BANK-Mega | 9999 | 2010-12-18 | 9999-12-31 |
| 3333 | PQR    | NO ADDR   | 1113 | 2010-12-18 | 9999-12-31 |

December 19th
Source table:
| C.NO | C.NAME | C.ADDR    | C.TNO |
| 100  | ABC    | ABCD-BANK | 1234  |
| 200  | XYZ    | BANK-COP  | 9999  |
| 300  | PQR    | NO ADDR   | 1113  |
Target table:
| C.NO | C.NAME | ADDR      | TNO  | RA         | RE         |
| 1111 | ABC    | BANK-Mega | 1234 | 2010-12-17 | 2010-12-18 |
| 2222 | XYZ    | BANK-Mega | 1111 | 2010-12-17 | 2010-12-18 |
| 1111 | ABC    | ABCD-BANK | 1234 | 2010-12-18 | 9999-12-31 |
| 2222 | XYZ    | BANK-Mega | 9999 | 2010-12-18 | 2010-12-19 |
| 3333 | PQR    | NO ADDR   | 1113 | 2010-12-18 | 9999-12-31 |
| 2222 | XYZ    | BANK-COP  | 9999 | 2010-12-19 | 9999-12-31 |
RHP Example

December 17th
Source table:
| C.NO | C.NAME | C.ADDR    | C.TNO |
| 100  | ABC    | BANK-Mega | 1234  |
| 200  | XYZ    | BANK-Mega | 1111  |
Target table:
| C.NO | C.NAME | ADDR      | TNO  | RA         | RE         |
| 1111 | ABC    | BANK-Mega | 1234 | 2010-12-17 | 9999-12-31 |
| 2222 | XYZ    | BANK-Mega | 1111 | 2010-12-17 | 9999-12-31 |

December 18th
Source table:
| C.NO | C.NAME | C.ADDR    | C.TNO |
| 200  | XYZ    | BANK-Mega | 9999  |
| 300  | PQR    | NO ADDR   | 1113  |
Target table:
| C.NO | C.NAME | ADDR      | TNO  | RA         | RE         |
| 1111 | ABC    | BANK-Mega | 1234 | 2010-12-17 | 9999-12-31 |
| 2222 | XYZ    | BANK-Mega | 9999 | 2010-12-18 | 9999-12-31 |
| 3333 | PQR    | NO ADDR   | 1113 | 2010-12-18 | 9999-12-31 |

December 19th
Source table:
| C.NO | C.NAME | C.ADDR   | C.TNO |
| 200  | XYZ    | BANK-COP | 9999  |
| 300  | PQR    | NO ADDR  | 1113  |
Target table:
| C.NO | C.NAME | ADDR      | TNO  | RA         | RE         |
| 1111 | ABC    | BANK-Mega | 1234 | 2010-12-17 | 9999-12-31 |
| 2222 | XYZ    | BANK-COP  | 9999 | 2010-12-19 | 9999-12-31 |
| 3333 | PQR    | NO ADDR   | 1113 | 2010-12-18 | 9999-12-31 |
EHP Example

December 17th
Source table:
| C.NO | C.NAME | C.ADDR    | C.TNO |
| 100  | ABC    | BANK-Mega | 1234  |
| 200  | XYZ    | BANK-Mega | 1111  |
Target table:
| C.NO | C.NAME | ADDR      | TNO  | RA         | RE         |
| 1111 | ABC    | BANK-Mega | 1234 | 2010-12-17 | 9999-12-31 |
| 2222 | XYZ    | BANK-Mega | 1111 | 2010-12-17 | 9999-12-31 |

December 18th
Source table:
| C.NO | C.NAME | C.ADDR    | C.TNO |
| 200  | XYZ    | BANK-Mega | 9999  |
| 300  | PQR    | NO ADDR   | 1113  |
Target table:
| C.NO | C.NAME | ADDR      | TNO  | RA         | RE         |
| 1111 | ABC    | BANK-Mega | 1234 | 2010-12-17 | 2010-12-18 |
| 2222 | XYZ    | BANK-Mega | 9999 | 2010-12-18 | 9999-12-31 |
| 3333 | PQR    | NO ADDR   | 1113 | 2010-12-18 | 9999-12-31 |

December 19th
Source table:
| C.NO | C.NAME | C.ADDR  | C.TNO |
| 400  | MNP    | NO ADDR | 4321  |
Target table:
| C.NO | C.NAME | ADDR      | TNO  | RA         | RE         |
| 1111 | ABC    | BANK-Mega | 1234 | 2010-12-17 | 2010-12-18 |
| 2222 | XYZ    | BANK-Mega | 9999 | 2010-12-18 | 2010-12-19 |
| 3333 | PQR    | NO ADDR   | 1113 | 2010-12-18 | 2010-12-19 |
| 4444 | MNP    | NO ADDR   | 4321 | 2010-12-19 | 9999-12-31 |
Differences between SHP, RHP and EHP

SHP:
- All the changes are stored.
- When a record is changed, an UPSERT operation is performed.
- To find the latest change to a record, we must search for the record whose REC_EDT value is 9999-12-31.
- When a record is in the target table but not in the source table, nothing is done.

RHP:
- Only the newest (refreshed) information is maintained.
- When a record is changed, it is deleted and reinserted with the changes.
- All values in the REC_EDT column are equal to 9999-12-31.
- When a record is in the target table but not in the source table, nothing is done.

EHP:
- Expired records (without their whole change history) and refreshed records are stored.
- When a record is changed, it is refreshed as in RHP.
- Expired or inactive records have a REC_EDT value different from 9999-12-31; active records have REC_EDT equal to 9999-12-31.
- When a record is in the target table but not in the source table, it is expired.