DATASTAGE HISTORY
Introduction
DataStage Enterprise Edition is a package of three products: DataStage Server Edition, the parallel extender with parallel ETL jobs and the MetaStage product described on the Metadata Workbench entry. The flagship tool of Enterprise Edition is parallel ETL jobs.
History
During the 1990s the data integration vendors such as Ascential and Informatica were competing to deliver tools that provided a wide range of data connectivity and transformation functions in a mostly code free environment. Towards the late 1990s data stores were becoming large, data warehouses and business intelligence was demanding larger volumes of data loads. The physical architecture of these loads was hitting a limit on the volume that a single server could handle and was moving towards clusters or grids of servers.
The data integration vendors need to be able to integrate data across a massively scalable architecture to keep up with the increased data volumes.
Ascential started to roll out a parallel capability in the DataStage Server Editionproduct called multiple instance jobs. This allowed some additional manual programming to partition and process data in parallel. In November 2001 they switched to a buy approach and purchased Torrent Systems for $46 million.
Torrent had the capability to run tools on a massively parallel processing (MPP) platform.
Datastage Versions
This section lists each major release of DataStage Enterprise Edition and the enhancements for DataStage parallel jobs. For a list of enhancements to the client tools see the versions on the DataStage Server Edition page is it is the version that has been delivered with every release going back to DataStage 1.
All release of DataStage 7 can import and upgrade DataStage 6 export files. DataStage 8 can only import and upgrade DataStage 7.5.1 or 7.5.2 jobs.
DataStage 6
Released in September 2002, ten months after the acquisition of Torrent, it was the first version of DataStage to feature the Parallel Extender (PX), the parallel platform that allows processes to run in parallel across a multiple processor environment.
- New parallel job type with a new set of parallel stages. Some with the same name as server job stages but with different properties and options.
- Server job shared container for parallel jobs.
- CPU based licensing instead of server based licensing.
- Support for SAS 6.12 and 8.2.
This release was followed by the client only 6.0.1 release that fixed a number problems.
DataStage 7
Release September 2003 it uses much the same architecture of the previous version with improvements to the usability. This was the first release to have no server job improvements but many parallel job improvements.
- XML Pack 2.0 provides improved XML metadata support for parallel jobs.
- National Language Support (NLS) for parallel jobs but not for all parallel stages.
- Parallel shared and local stages.
- Enhanced transformer with improved reject row handling, string handling, timestamp conversion and compile performance.
- Modify, Switch and Filter stages added.
- Multiple-instance parallel jobs.
- Non blocking funnel stage.
DataStage 7.5
Unknown release date.
- Parallel complex flat file stage.
- A parallel job message handler for demoting or removing warning messages from the job log.
- Lookup stage changes from a property screen to a drag and drop mapping screen.
- Multi node import of sequential files.
- Additional options for sequential file and file set stages such as Read First Rows, Row Number Column and First Line is Column Names.
- View data support for custom stages.
- New Parallel Advanced Job Developers Guide.
DataStage 7.5.1
Released in March 2005.
- New SQL Builder for building SQL query statements from a database plugin stage.
- Command line job search function added.
- DataStage parallel jobs for Unix System Services (USS) on the mainframe.
- Remote job deployment to deliver and run jobs across a cluster or grid.
- Vector support in the parallel transformer stage.
- Sybase and ODBC stages added to parallel jobs.
- Complex Flat File stage improvements: multiple output links, automatically generated fillers, MVS dataset support.
- Thread based job monitoring for parallel jobs.
DataStage 7.5X2
Released in December 2004 this was the first release of parallel jobs that could run on Windows. While the Server runs on all the same Unix and Linux platforms as 7.5.1 it adds the additional platform of Windows 2003 Standard or Enterprise on the Intel x86 Processor Family.
There were no changes to parallel jobs in this release apart from the capability to compile and run them on Windows.
DataStage 8
Released in October 2006 for Windows and April 2007 for Unix this is the first version to run on the IBM Information Server. There are a number of parallel job improvements in this release:
NEW FEATURES IN DATASTAGE 8.1
DataStage continued to enhance it’s capabilities to manage data quality and data integration solutions. DataStage 8.0 introduced many new features to make development and maintenance of project comfortable. These enhancements include data quality management, connectivity methods, implementation of slowly changing dimension.
IBM Information Server, consist of the following components, WebSphere DataStage and Quality Stage, WebSphere Information Analyzer, Federation Server, and Business Glossary, common administration, logging and reporting. These components are designed to provide much more efficient ways to manage metadata and develop ETL solutions. Components can be deployed based on client need.
I. The Metadata Server.
With the Hawk release, DataStage has created common administration, logging and reporting and this will improve metadata reporting available, compared to prior releases.
II. QualityStage
Data Quality is highly critical for data integration projects. With earlier releases such as MetaStage, Quality Stages used to add lot of additional overhead in installation, training and implementation. With new release of QualityStage, integration projects using standardization, matching and survivorship to improve quality will be more accessible and easier to use. Also, developer will be able to design jobs with data transformation stages and data quality stages in the same session. Designer is called DataStage and QualityStage Designer in current release, based on it’s usage.
III. Frictionless Connectivity and Connection Objects:
Managing connectivity information and propagating connectivity information between different environments, has added additional development and maintenance overhead. These new objects help in connecting to remote database connectivity easier. Earlier releases, development team may need to spend considerable time in resolving connectivity issues with the database. DataStage 8 will help the team by providing frictionless connectivity and connectivity objects, ensure reusability and reduces risk of data issues due to wrong connectivity information.
IV. Parallel job range lookup.
It’s always important to get different options to access data for lookup and accessing over a range is always better option when data range is available for improving performance. Range lookup has been merged into the existing lookup form and are easy to use.
V. Slowly Changing Dimension StageData Warehouse developers need to develop complex jobs to implement Slowly Changing Dimension. With this stage introduced in DataStage 8, following enhancements can be done easily, surrogate key generation, there is the slowly changing dimension stage and updates passed to in memory lookups. That's it for me with DBMS generated keys, I'm only doing the keys in the ETL job from now on! DataStage server jobs have the hash file lookup where you can read and write to it at the same time, parallel jobs will have the updateable lookup.
VI. Collaboration
This new feature allows developers to open any job, which is already opened by other developers. This copy of developer will be READ ONLY. This helps the developers in reducing wait time, when job is currently LOCKED by other user. New enhancements also allows you to unlock the job associated with a disconnected session from the web console in an easier way than prior releases.
VII. Session Disconnection.
With this feature an administrator can disconnect sessions and unlock jobs.
VIII. Improved SQL Builder.
This feature reduces the effort spent in synchronizing SQL Select list to the DataStage column list. This will ensure that column mismatches. Adding to this in ODBC Connector, you will be able to complex queries with GUI, which includes adding columns and where clause to the statement.
IX. Improved job startup times.
With this new enhancement, when lot of small parallel jobs gets invocated, this will have less impact on DataStage long running jobs. Connectivity and resource allocation for parallel jobs has improved and load is balanced based on job requirement.
X. Common logging
With this new feature, DataStage has introduced common logging of DataStage job logs. This helps in searching from DataStage log. DataStage has also introduced time based and record based job monitoring.
These are add on products (at an additional fee) that attach themselves to source databases and perform change data capture. Most source system database owners I've come across don't like you playing with their production transactional database and will not let you near it with a ten foot poll, but I guess there are exceptions:
- Oracle
- Microsoft SQL Server
- DB2 for z/OS
- IMS
There are three ways to get incremental feeds on the Information Server: the CDC products for DataStage, the Replication Server (renamed Information Integrator: Replication Edition, does DB2 replication very well) and the change data capture functions within DataStage jobs such as the parallel CDC stage.
These are the functions that are not in DataStage 8,
· dssearch command line function
- dsjob "-import"
- Version Control tool
- Released jobs
- Oracle 8i native database stages
- ClickPack
The loss of the Version Control tool is not a big deal as the import/export functions have been improved. Building a release file as an export in version 8 is easier than building it in the Version Control tool in version 7.
The common connection objects functionality means the very wide range of DataStage database connections are now available across Information Server products.
Latest supported databases for version 8:
- DB2 8.1, 8.2 and 9.1
- Oracle 9i, 10i, 10gR2 not Oracle 8
- SQL Server 2005 plus stored procedures.
- Teradata v2r5.1, v2r6.0, v2r6.1 (DB server) / 8.1 (TTU) plus Teradata Parallel Transport (TPT) and stored procedures and macro support, reject links for bulk loads, restart capability for parallel bulk loads.
- Sybase ASE 15, Sybase IQ 11.5, 12.5, 12.7
- Informix 10 (IDS)
- SAS 612, 8.1, 9.1 and 9.1.3
- IBM WS MQ 6.1, WS MB 5.1
- Netezza v3.1
- ODBC 3.5 standard and level 3 compliant
- UniData 6 and UniVerse ?
- Red Brick ?
A new stage from the IBM software family, new stages from new partners and the convergence of QualityStage functions into Datastage. Apart from the SCD stage these all come at an additional cost.
- WebSphere Federation and Classic Federation
- Netezza Enterprise Stage
- SFTP Enterprise Stage
- iWay Enterprise Stage
- Slowly Changing Dimension: for type 1 and type 2 SCDs.
- Six QualityStage stages
- Complex Flat File Stage: Multi Format File (MFF) in addition to existing cobol file support.
- Surrogate Key Generator: the key sourceis a new feature included in this stage which is maintained via integrated state file or DBMS sequence.
- Lookup Stage: Range Look-up is a new function which is equivalent to the operator between. Lookup against a range of values was difficult to implement in previous DataStage versions. By having this functionality in the lookup stage, comparing a source column to a range of two lookup columns or a lookup column to a range of two source columns can be easily implemented.
- Transformer Stage: new surrogate key functions Initialize() and GetNextKey().
- Enterprise FTP Stage: now choose between ftp and sftp transfer.
Secure FTP (SFTP) Select this option if you want to transfer files between computers in a secured channel. Secure FTP (SFTP) uses the SSH (Secured Shell) protected channel for data transfer between computers over a nonsecure network such as a TCP/IP network. Before you can use SFTP to transfer files, you should configure the SSH connection without any pass phrase for RSA authentication.
This is a big area of improvement.
LOB/BLOC/CLOB Data: pictures, documents etc of any size can now be moved between databases. Connector can transfer large objects (LOB) using inline or reference methods.However, a connector is the only stage that does reference methods so another connector is needed to transfer the LOB inline later in the job.
Reject Links: Connecter has its own reject-handling function which eliminates the need to add a Modify or a Transformer stage for capturing SQL errors or for aborting jobs. A choice between number of rows or percentage or rows rejected can be specified for terminating the job run.
Schema Reconciliation: Connector has a schema reconciliation function that automatically compares DataStage schemas to external-resource schemas such as a database. Schemas include data types, attributes and field lengths. Based on the reconciliation rules that you specify, runtime errors or extra transformation on mismatched schemas can be avoided.
Improved SQL Builder that supports more database types.
Connector is the best stage to use for your database because it gives themaximum parallel performance and offers more features compared to database
Test button The Test Button on connectors allows developers to test database connections without having to view the data or to run the job.
Connectors are for accessing external data sources and can be used to read, write, look up and filter data or simply to test the database connectivity during job design.
Drag and drop your configured database connections onto jobs.
Before and after SQL defined per job or per node with a failure handling option. Neater than previous versions.
DataStage 8 gives you access to the latest versions of databases that DataStage 7 may never get. Extra functions on all connectors includes improved reject handling, LOB support and easier stage configuration.
Note the database compatibility for the Metadata Server repository is the latest versions of the three DBMS engines. DB2 is an optional extra in the bundle if you don't want to use an existing database.
- IBM UDB DB2 ESE 9
-IBM Information Server does not support the Database Partitioning Feature (DPF) for use in the repository layer
-DB2 Restricted Enterprise Edition 9 is included with IBM Information Server and is an optional part of the installation however its use is restricted to hosting the IBM Information Server repository layer and cannot be used for other applications - Oracle 10g
- SQL Server 2005
Different enterprise packs are available in version 8. These packs are:
- SAP BW Pack
- BAPI: (Staging Business API) loads from any source to BW.
- OpenHub: extract data from BW.
- SAP R/3 Pack
- ABAP: (Advanced Business Application Processing) auto generate ABAP, Extraction Object Builder, SQL Builder, Load and execute ABAP from DataStage, CPI-C Data Transfer, FTP Data Transfer, ABAP syntax check, background execution of ABAP.
- IDoc: create source system, IDoc listener for extract, receive IDocs, send IDocs.
- BAPI: BAPI explorer, import export Tables Parameters Activation, call and commit BAPI.
- Siebel Pack
- EIM: (data integration manager) interface tables
- Business Component: access business views via Siebel Java Data Bean
- Direct Access: use a metadata browser to select data to extract
- Hierarchy: for extracts from Siebel to SAP BW.
- Oracle Applications Pack
- Oracle flex fields: extract using enhanced processing techniques.
- Oracle reference data structures: simplified access using the Hierarchy Access component.
- Metadata browser and importer
- DataStage Pack for PeopleSoft Enterprise
- Import business metadata via a metadata browser.
- Extract data from PeopleSoft tables and trees.
- JD Edwards Pack
- Standard ODBC calls
- Pre-joined database tables via business views
These packs can be used by server and/or parallel jobs to interact with other coding languages. This lets you access programming modules or functions within a job:
- Java Pack: Produce or consume rows for DataStage Parallel or Server jobs. Using a java transformer.
- Web Service Pack: Access web services operations in a Server job transformer or Server routine.
- XML Pack: Read, write or transform XML files in parallel or server jobs.
The DataStage stages, custom stages, transformer functions and routines will usually be faster at transforming data than these packs however they are useful for re-using existing code.
NEW FEATURES IN DATASTAGE 8.5
DataStage 8.5 is out and IBM has made some significant improvements this time around. Let’s see some of the important enhancements in the new DataStage 8.5 version
Its Fast!
DataStage 8.5 is considerably faster than its previous version (8.1). Tasks like saving, renaming, compiling are faster by nearly 40%. The run time performance of jobs has also improved.
The parallel engine
on DataStage has been tuned to improve performance and resource usage has reduced by 5% when compared to DataStage 8.1
XML data
DataStage has historically been inefficient at handling XML files, but in 8.5 IBM has given us a great XML processing package. DataStage 8.5 can now process large XML files (over 30 GB) with ease. Also, we can now process XML data in parallel.
The new XML transform stage can data from multiple sources into a single XML output stream. If you think that is cool, it can also do it the other way around i.e., multiple XML input to a single output stream.
It can also convert data from one XML format to another.
Transformer Stage
It is one of the most used and the most important stages on DataStage and it just got better in 8.5
a. Transformer Looping:
Over the years DataStage programmers have been using workarounds to implement this concept. Now IBM has included it directly in the transformer stage.
There are two types of looping’s available
Output looping: Where we can output multiple output links for a single input link
Ex:
Input Record:
Salesman_name City_1 City_2 City_3
Jason Bourne New York Madrid New Delhi
Output Record:
Salesman_name City
Jason Bourne New York
Jason Bourne Madrid
Jason Bourne New Delhi
This is achieved using a new system variable @ITERATION
Input looping: We can now aggregate input records within the transformer and assign the aggregated data to the original input link while sending it to the output.
b. Transformer change detection:
SaveInputRecord() – Save a record to be used for later transformations within the job
GetInputRecord() – Retrieve the saved record as when it is required for comparisons
c. System Variables:
i. @ITERATION: Used in the looping mechanism
ii. LastRow(): Indicates the last row in the job
iii. LastRowInGroup(): Will return the last row in the group based on the key column
Below are links to the posts that explain this concept with an example -
DataStage – Transformer Looping Example 1
DataStage – Transformer Looping Example 2
d. New NULL Handling features:
In DataStage 8.5 we need not explicitly handle NULL values. Record dropping is arrested if the target column is nullable. We need not handle NULL values explicitly when using functions over columns that have NULL values. And also stage variables are now nullable by default.
APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING has been prepared to support backward compatibility
e. New Data functions:
There are a host of new date functions incorporated into DataStage 8.5. I personally found the below function most useful
DataFromComponents(years, months, daysofmonth)
Ex: DataFromComponenets(2012,07,20) will output 2012-07-20
DataOffsetByComponents(basedate, years offset, month offset, daysofmonth offset)
Ex: DataOffsetByComponents(2012-07-20, 2,1,1) will output 2014-08-21
DataOffsetByComponents(2012-07-20, -4,0,0) will output 2008-07-20
I will write another detailed blog on the new data functions shortly
Parallel Debugger:
DataStage 8.5 now has a built in debugger functionality. We can now set breakpoints on the links in our jobs.
When the job is run in debug mode, it will stop when it encounters a breakpoint. From here we can step to the next action on that link or skip to the next row of data.
Functionality Enhancements:
- Mask encryption for before and after job subroutines
- Ability to copy permissions from one project to a new project
- Improvements in the multi-client manager
- New audit tracing and enhanced exception dialog
- Enhanced project creation failure details
Vertical Pivoting:
At long last vertical pivoting has been added
Integration with CVS
Now in DataStage 8.5 we have the feature that integrates directly with version control systems like CVS. We can now Check-in and Check-out directly from DataStage
Information Architecture Diagraming Tool:
Now solution architects can draw detailed integration solution plans for data warehouses from within DataStage
Balanced Optimizer:
As you all know DataStage is an ETL tool. But now with Balanced Optimizer directly being integrated we have the ELT (Extract Load and Transform) feature.
With this we can extract the data, load it and perform the transformations inside the database engine.
DataStage Editions
Enterprise Edition: a name give to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
Server Edition: the name of the original version of DataStage representing Server Jobs. Early DataStage versions only contained Server Jobs. DataStage 5 added Sequence Jobs and DataStage 6 added Parallel Jobs via Enterprise Edition.
MVS Edition: mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.
DataStage for PeopleSoft: a server edition with prebuilt PeopleSoft EPM jobs under an OEM arrangement with PeopeSoft and Oracle Corporation.
DataStage TX: for processing complex transactions and messages, formerly known as Mercator.
DataStage SOA: Real Time Integration pack can turn server or parallel jobs into SOA services.
NEW FEATURES OF DATA STAGE 8.7
NEW FEATURES OF DATA STAGE 9.1
http://dsxchange.net/uploads/91_DataStage_Deep_Dive_DSXchange.pdf
Difference between Informatica and Data stage
Both Datastage and Informatica are powerful ETL tools . Both tools do almost exactly the same thing in almost exactly the same way. Performance, maintainability, learning curve are all similar and comparable. Below are the few things which I would like highlight regarding both these tools.
Multiple Partitions
Informatica offers partitioning as dynamic partitioning which defaults a workflow not at every Stage/Object level in a mapping/job. Informatica offers other partitioning choices as well at the workflow level.
DataStage's pipeline partitioning uses multiple partitions, processed and then re-collected with DataStage. DataStage lets control a job design based on the logic of the processing instead of defaulting the whole pipeline flow to one partition type. DataStage offers 7 different types of multi-processing partitions.
User Interface
Informatica offers access to the development and monitoring effort through its 4 GUIs - offered as Informatica PowerDesigner, Repository Manager, Worflow Designer, Workflow Manager.
Data Stage caters to development and monitoring its jobs through 3 GUIs - IBM Data Stage Designer(for development), Job Sequence Designer(workflow design) and Director(for monitoring).
Version Control
Informatica offers instant version control through its repository server managed with “Repository Manager” GUI console. A mapping with work-in-progress cannot be opened until saved and checked back into the repository. Version control is done by using checkin and check out.
Version Control was offered as a component until version Ascential DataStage7.5.x. Ascential was acquired by IBM and when DataStagewas integrated into IBM Information Server with Data Stage at version 8.0.1, the support of version control as a component was discontinued.
Repository based flow
Informatica, offers a step-by-step effort of creating a data integration solution. Each object created while mapping a source with a target gets saved into the repository project folder categorized by - Sources, Targets, Transformations, Mappings, Mapplets, User-defined functions, Business Components, Cubes and Dimensions. Each object created can be shared, dropped into a mapping across cross-functional development teams. Thus increasing re-usability. Projects are folder based and inter-viewable.
Data Stage offers a project based integration solution, projects are not interviewable. Every project needs a role based access. The step-by-step effort in mapping a source to a target lineages into a job. For sharing objects within a job, separate objects need to be created called containers that are local/shared.
Data Encryption
Informatica has an offering within Power Center Designer as a separate transformation called “Data Masking Transformation”.
Data Masking or encryption needs to be done before reachingDataStage Server.
Variety of Transformations
Informatica offers about 30 general transformations for processing incoming data.
Datastage offers about 40 data transforming stages/objects.Datastage is more powerful transformation engine by using functions (Oconv and IConv) and routines. We can do almost any transformation.
Source_- Target flow
Within Informatica’s PowerCenter Designer, first a source definition needs to be created using “Source Analyzer” that imports the metadata, then a target definition is created using “Target Designer”, then a transformation using “Transformation Developer” is created, and finally maps a source-transformation-target using “Mapping Designer”.
Datastage lets drag and drop a functionality i.e a stage within in one canvas area for a pipeline source-target job. With DataStage within the “DataStage Designer” import of both source and target metadata is needed, proceeding with variety of stages offered as database stages, transformation stages, etc.
The biggest difference between both the vendor offerings in this area is Informatica forces you to be organized through a step-by-step design process, while DataStage leaves the organization as a choice and gives you flexibility in dragging and dropping objects based on the logic flow.
Checking Dependencies
Informatica offers a separate edition – Advanced edition that helps with data lineage and impact analysis. We can go to separate targets and source and check all the dependencies on that.
DataStage offers through Designer by right clicking on a job to perform dependencies or impact analysis.
Components Used
The Informatica ETL transformations are very specific purpose, so you tend to need more boxes on the page to do the same thing. eg. A simple transform in Informatica would have a Source Table, Source Qualifier, Lookup, Router, 2 Update Strategies, and 2 Target Tables (9 boxes).
In DataStage, you would have a Table and Hashed File for the lookup, plus a Source Relational Stage, Transformation Stage, and 2 links to a target Relational Stage (5 boxes). This visual clutter in Informatica is a bit annoying.
Type of link
To link two components in Informatica, you have to link at the column level.We have to connect each and every column bw the two componenents
In DataStage, you link at the component level, and then map individual columns. This allows you to have coding templates that are all linked up - just add columns. I find this a big advantage in DS.
Reusability
Informatica offers ease of re-usability through Mapplets and Worklets for re-using mappings and workflows.This really improves the performance
DataStage offers re-usability of a job through containers(local&shared). To re-use a Job Sequence(workflow), you will need to make a copy, compile and run.
Code Generation and Compilation
Informatica’s thrust is the auto-generated code. A mapping gets created by dropping a source-transformation-target that doesn’t need to be compiled.
DataStage requires to compile a job in order to run it successfully.
Heterogeneous Sources
In Informatica we can use both heterogenous source and homogenous source.
Datastage does not perform very well with heterogeneous sources. You might end up extracting data from all the sources and putting them into a hash and start your transformation
Slowly Changing Dimension
Informatica supports Full History, Recent Values, Current & Previous Values using SCD wizards.
DataStage supports only through Custom scripts and does not have a wizard to do this
Dynamic Lookup Cache
Informatica's marvellous Dynamic Cache Lookup has no equivalent in DS Server Edition. The same saves some effort and is very easily maintainable.
Your blog is in a convincing manner, thanks for sharing such an information with lots of your effort and time
ReplyDeletedatastage online training India
datastage online training Hyderabad