1. Datastage Parallel Jobs vs. Datastage Server Jobs:
1) The basic difference between server and parallel jobs is the degree of parallelism.
Server job stages do not have built-in partitioning and parallelism mechanisms for
extracting and loading data between different stages.
What we can do to enhance speed and performance in server jobs is enable inter-process
row buffering through the Administrator. This helps stages exchange data as soon as it is
available on the link.
We can also use the IPC stage, which helps one passive stage read data from another as soon
as data is available. In other words, stages do not have to wait for the entire set of records
to be read first and then transferred to the next stage. The Link Partitioner and Link Collector
stages can be used to achieve a certain degree of partitioning parallelism.
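The IPC-style behaviour described above, where one stage consumes rows as soon as the previous stage produces them, can be sketched in plain Python with a bounded queue between two threads. This is only an analogy (Datastage jobs are not written in Python); the stage names and row values are made up for illustration.

```python
import queue
import threading

def extract(out_q):
    # Producer "stage": pushes each row as soon as it is read.
    for row in range(5):
        out_q.put(row)
    out_q.put(None)  # end-of-data marker

def transform(in_q, results):
    # Consumer "stage": processes each row as soon as it arrives,
    # without waiting for the full record set.
    while True:
        row = in_q.get()
        if row is None:
            break
        results.append(row * 10)

link = queue.Queue(maxsize=2)  # small buffer, like inter-process row buffering
results = []
t1 = threading.Thread(target=extract, args=(link,))
t2 = threading.Thread(target=transform, args=(link, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 10, 20, 30, 40]
```

The small `maxsize` on the queue is the point: the downstream stage starts working after the first row, not after the last one.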
The above features, which have to be explicitly set up in server jobs, are built into Datastage PX.
2) The PX engine runs on a multiprocessor system and takes full advantage of the
processing nodes defined in the configuration file. Both SMP and MPP architectures are
supported by Datastage PX.
3) PX takes advantage of both pipeline parallelism and partitioning parallelism. Pipeline
parallelism means that as soon as data is available between stages (in pipes or links), it
can be exchanged between them without waiting for the entire record set to be read.
Partitioning parallelism means that the entire record set is partitioned into small sets and
processed on different nodes (logical processors). For example, if there are 100 records
and 4 logical nodes, each node would process 25 records. This
enhances the speed at which loading takes place to an amazing degree. Imagine situations
where billions of records have to be loaded daily; this is where Datastage PX comes as a
boon for the ETL process and surpasses all other ETL tools in the market.
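As a rough illustration of partitioning parallelism, here is a minimal Python sketch of a round-robin partitioner splitting 100 records across 4 logical nodes, as in the example above. The function name and record values are hypothetical; the real PX engine offers several partitioning methods (round robin, hash, modulus, and others).

```python
def partition(records, nodes):
    # Round-robin partitioner: record i goes to partition i % nodes,
    # loosely mimicking how a record set is split across logical nodes.
    parts = [[] for _ in range(nodes)]
    for i, rec in enumerate(records):
        parts[i % nodes].append(rec)
    return parts

records = list(range(100))  # 100 sample records
parts = partition(records, 4)
print([len(p) for p in parts])  # [25, 25, 25, 25]
```

Each of the 4 partitions ends up with 25 records, which the engine would then process concurrently on separate nodes.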
4) In parallel jobs we have the Data Set, which acts as intermediate data storage between
linked stages. It is the best storage option because it stores the data in Datastage's internal format.
5) In parallel jobs we can choose to display the generated OSH, which gives information about how the job
works.
6) The parallel Transformer has no reference link, whereas in server jobs a reference
link can be given to the Transformer. Parallel stages can use both BASIC and parallel-oriented
functions.
7) Datastage server jobs are executed by the Datastage server environment, but parallel jobs are executed under
the control of the Datastage runtime environment.
8) Server jobs are compiled into BASIC (interpreted pseudo-code), and parallel jobs are compiled into
OSH (Orchestrate shell script).
9)Debugging and Testing Stages are available only in the Parallel Extender.
10) Many processing stages are not included in Server jobs, for example Join, CDC, Lookup,
etc.
11) Among the file stages, the Hashed File is available only in Server jobs, while Complex Flat File, Data Set and Lookup
File Set are available only in parallel jobs.
12) The Server Transformer supports BASIC language compatibility, while the parallel Transformer has C++
language compatibility.
14) Lookup against a sequential file is possible in parallel jobs.
15) In parallel jobs we can specify multiple file paths to fetch data from, using a file pattern (similar
to the Folder stage in Server), while in server jobs we can specify only one file name per input link.
16) In Server we can simultaneously give input as well as output links to a Sequential File stage.
But in parallel jobs an output link means a reject link, that is, a link that collects
records that fail to load into the sequential file for some reason.
17) Another difference is the file size restriction: the sequential file size in server jobs is limited to 2 GB, while in parallel jobs there is no limit.
18) The parallel Sequential File stage has filter options too, where we can specify a file pattern.
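The file-pattern read mentioned in points 15 and 18 behaves much like shell globbing. A small Python sketch with the standard `glob` module shows the idea; the file names and the `sales_*.txt` pattern are invented for this example.

```python
import glob
import os
import tempfile

# Create a few sample input files in a scratch directory
# (the names and the pattern below are hypothetical).
tmp = tempfile.mkdtemp()
for name in ("sales_01.txt", "sales_02.txt", "notes.log"):
    with open(os.path.join(tmp, name), "w") as f:
        f.write("data\n")

# A pattern such as sales_*.txt picks up several input files at once,
# roughly like the parallel Sequential File stage's file-pattern read.
matched = sorted(glob.glob(os.path.join(tmp, "sales_*.txt")))
print([os.path.basename(p) for p in matched])  # ['sales_01.txt', 'sales_02.txt']
```

The non-matching `notes.log` file is skipped, just as a file pattern in the stage would ignore files that do not fit the mask.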
Introduction to Datastage Enterprise Edition (EE)
Datastage Enterprise Edition, formerly known as Datastage PX (Parallel Extender), has
recently become part of IBM InfoSphere Information Server, and its official name is
IBM InfoSphere DataStage.
In the recent versions of Datastage (7.5, 8, 8.1), IBM has not released any updates to
Datastage Server Edition (although it is still available in Datastage 8), and they seem to
put the biggest effort into developing and enriching the Enterprise Edition of the InfoSphere
product line.
Key Datastage Enterprise Edition concepts.
Project Environment:
1. We work with flat files and an Oracle database as sources.
2. We get data in two ways:
Push technique
Pull technique
3. Most of the time we get the data using the push technique (with the push technique, the client himself
sends data to our server environment).
4. If the situation is such that it is our responsibility to fetch the data from the client's server (the client gives
us properly authenticated privileges to access his server), then we go for the pull
technique.
5. In our Unix environment (server) we have a particular file structure.
6. Whatever files we get from the client are placed in the drop box.
7. Then we move the received files to the input files folder.
8. From there we load the files into the staging area, where we cleanse the data.
9. After applying the required business logic (transformations), we move the data to the
ODS (Operational Data Store). From there on, we apply SCDs to the data we got
from the ODS.
10. Then the resulting data is sent to the data warehouse.
11. Afterwards, whatever files we had in the input files folder are moved to the archive
folder (for backup and future purposes).
12. While running some jobs, if we want to send the resulting data to the output files folder,
we specify the path of the output files folder (i.e. the data file generated after execution).
13. For dataset files we give the path of the dataset folder where we want to store the datasets
related to our project.
14. The reject files folder contains files from staging and the ODS; these files are generally
produced as part of cleansing and transformation.
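The folder flow in steps 6 through 11 (drop box to input files folder to archive) can be sketched with Python's standard library. All directory and file names here are illustrative stand-ins for the project's actual Unix file structure.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical project file structure (names are illustrative only).
root = Path(tempfile.mkdtemp())
for sub in ("dropbox", "input", "archive"):
    (root / sub).mkdir()

# A file arrives from the client in the drop box.
(root / "dropbox" / "sales_feed.txt").write_text("raw rows\n")

# Step 7: move received files from the drop box to the input files folder.
for f in (root / "dropbox").iterdir():
    shutil.move(str(f), str(root / "input" / f.name))

# Step 11: after the load, move the input files to the archive folder.
for f in (root / "input").iterdir():
    shutil.move(str(f), str(root / "archive" / f.name))

archived = sorted(p.name for p in (root / "archive").iterdir())
print(archived)  # ['sales_feed.txt']
```

In practice this kind of housekeeping would typically be done by shell scripts or job-sequence routines rather than Python, but the movement of files between the folders is the same.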
About the project:
1. Flat files and a database are our sources.
2. It is a sales domain, and the main intention of this project is to get the total sales
information based on location.
3. Because of the U.S. recession, Publix is facing bad sales and bad revenue in particular
locations, while at the same time doing very well in terms of revenue in certain
places.
4. To identify the total revenue and the bad-sales information, Publix kicked off this project.
5. In our project we have 18 dimension tables and 11 fact tables.
6. Of those, I was involved in developing 4 dimensions and 2 fact tables.
7. That is, 20 DS jobs for the 4 dimensions and 9 jobs for the 2 fact tables.
8. Our data warehouse size is 1.5 TB.
9. This project follows a top-down approach.
10. We load the data directly into the data warehouse; there are no data marts in our project.