Friday, March 14, 2014

Departition Components

Departition components combine multiple flow partitions of data records into a single flow, as follows:

Concatenate:

Concatenate appends multiple flows of data records one after the other.

1) It reads all the records from the first in-port flow and copies them to the out port.
2) After reading all the records from the first flow, it reads the records from the second in-port flow and appends them after the first flow's records.

Gather:

Gather combines the data records from multiple flow partitions arbitrarily.

  • Not key-based.
  • Result ordering is unpredictable.
  • Has no effect on upstream processing.
  • The most useful method for efficient collection of data from multiple partitions and for repartitioning.
  • Used most frequently.


Merge:
  • Key-based.
  • Result ordering is sorted if each input is sorted.
  • Possibly synchronizes pipelined computation.
  • May even serialize.
  • Useful for creating ordered data flows.
  • Along with Gather, Merge is the other departitioner of choice.
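The three strategies above can be sketched in Python, using plain lists as stand-ins for Ab Initio flow partitions (the records here are made up for illustration; this is an analogy, not the actual component behavior):

```python
# Sketch of the three departition strategies over two hypothetical partitions.
import heapq
from itertools import chain

partitions = [
    [{"key": 1, "val": "a"}, {"key": 4, "val": "d"}],   # flow partition 0
    [{"key": 2, "val": "b"}, {"key": 3, "val": "c"}],   # flow partition 1
]

# Concatenate: read partition 0 fully, then partition 1, and so on.
concatenated = list(chain(*partitions))

# Gather: combine records as they arrive; ordering is arbitrary.
# (Round-robin interleaving here just suggests the unpredictability.)
def gather(parts):
    iters = [iter(p) for p in parts]
    out = []
    while iters:
        for it in list(iters):
            try:
                out.append(next(it))
            except StopIteration:
                iters.remove(it)
    return out

gathered = gather(partitions)

# Merge: key-based; output stays sorted if each input is sorted on the key.
merged = list(heapq.merge(*partitions, key=lambda r: r["key"]))

print([r["key"] for r in concatenated])  # [1, 4, 2, 3]
print([r["key"] for r in merged])        # [1, 2, 3, 4]
```

Note how Merge preserves key order because each input is already sorted, while Gather's ordering depends entirely on arrival order.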

Ab-initio Components (Datasets)

In this section we discuss the basic and important Ab initio components:

  • Datasets components
  • Partition
  • Sort
  • Departition
  • Validate
  • Miscellaneous
  • Transform
1) Dataset Components

Dataset components represent data records or act upon data records as follows:

Input file:

An Input File is a source file that the graph reads as input. It can be either a serial file or a multifile, depending on the requirement.

Input table:

It unloads data records from a database into the Ab Initio graph, letting you specify either the source table or an SQL statement that extracts the data records from one or more tables.

Intermediate file:

It represents one or more serial files or a multifile of intermediate results that a graph writes during execution and saves for review after execution.


Lookup:

A Lookup file basically contains one or more serial files or a multifile of data records small enough to be held in main memory, letting a transform retrieve records much more quickly than if they were kept on disk.
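The speed advantage comes from probing an in-memory structure rather than reading from disk per record. A rough Python analogy, with a made-up country-code reference table standing in for the lookup file:

```python
# Sketch: load a small reference dataset into memory once, then probe it
# per record. The country-code data below is invented for illustration.
country_codes = {}                      # stands in for the lookup file in memory
for line in ["US,United States", "IN,India", "DE,Germany"]:
    code, name = line.split(",")
    country_codes[code] = name

def transform(record):
    # O(1) in-memory probe instead of a disk read for every record
    record["country_name"] = country_codes.get(record["country_code"], "UNKNOWN")
    return record

print(transform({"id": 1, "country_code": "IN"}))
```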

Output file:

It writes the data records to one or more serial files or multifiles.

Output table:

It writes data records to a database, letting you write the records either directly, by naming the destination table, or by writing the SQL that inserts the records into one or more tables.

Read Multiple files:

Reads data sequentially from a list of files.

Write Multiple files:

Writes the records to a set of output files.

Thursday, March 13, 2014

More about Ab-initio Overview & graph

Ab-initio:

It is an important ETL (Extract, Transform, Load) tool used to analyze data for business purposes.

ETL Process

Extract :

In this process the required data is extracted from sources such as text files, databases, and other source systems.

Transform :

In this process the extracted data is converted into the format required for analysis. It involves the tasks below:

  • Applying business rules (derivations, calculating new values and dimensions)
  • Cleaning (e.g. 'Male' to 'm' or 'Female' to 'f')
  • Filtering (loading only the selected columns or records)
  • Joining together data from various sources (e.g. lookup files)
Load :

Loading the data into an enterprise data warehouse or other data repository.
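The three steps can be illustrated with a toy Python sketch (the records and business rules below are invented for illustration):

```python
# Toy end-to-end ETL sketch: extract rows, apply transform rules, load.
raw = [                                     # Extract: rows from a source
    {"name": "Alice", "gender": "Female", "salary": "50000"},
    {"name": "Bob",   "gender": "Male",   "salary": "60000"},
]

def transform(row):
    return {
        "name": row["name"],
        "gender": row["gender"][0].lower(),    # Cleaning: 'Female' -> 'f'
        "salary": int(row["salary"]),          # type conversion
        "bonus": int(row["salary"]) * 0.10,    # Business rule: 10% bonus
    }

warehouse = []                              # Load: stand-in for the target store
for row in raw:
    warehouse.append(transform(row))

print(warehouse[0])  # {'name': 'Alice', 'gender': 'f', 'salary': 50000, 'bonus': 5000.0}
```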

Ab-initio software is based on two main programs:

CO>OPERATING SYSTEM:

It is the Ab-initio server, which the system administrator installs on a UNIX or Windows NT host. It is layered on top of the native operating system.

CO>OPERATING SYSTEM provides the following features:

  • Manages and runs Ab initio graphs and controls the ETL process.
  • Monitors and debugs ETL processes.
  • Manages metadata and interacts with the EME.
  • Works with the GDE, which is installed on a PC (the GDE computer) and configured to communicate with the host.
The Graphical Development Environment (GDE) is basically used:

  • To design and run graphs
  • To represent ETL processes as Ab initio graphs
  • It is user friendly

Overview of Graph:

A graph is a data flow diagram that defines the various processing stages of a task and the streams of data as they move from one stage to another. In a graph, a component represents a stage and a flow represents a data stream. You build a graph in the GDE by dragging and dropping components, connecting them with flows, and defining values for parameters; you then run, debug, and tune the graph in the GDE.

Building a graph is the process of developing an Ab-initio application, and thus graph development is known as graph programming.

Parts of the Graph:

  • Metadata
Metadata is any information about data or about how to process it. Metadata is of two types:

1) Technical Metadata:

Technical metadata is related to the graphs. It includes the information needed to build a graph, e.g. record formats, transform functions, job histories, versions, etc. You can store a graph's technical metadata in a file, a data store, or the EME.

2) Enterprise Metadata:

This includes user-defined information such as job functions, roles, categories, and so on.


  • Layout
1) Layout determines the location of resources.
2) A layout is either serial or parallel.
3) A serial layout specifies one node or one directory.
4) A parallel layout specifies multiple nodes or multiple directories.


Phase:

Phases are basically used to break up a graph into blocks for performance tuning. A phase limits the number of simultaneous processes by dividing the graph into stages. The main use of phases is to avoid deadlock. The temporary files generated by a phase break are deleted at the end of the phase, regardless of whether the job succeeded or not.

Checkpoint:

The temporary files generated at a checkpoint are not deleted, so the job can restart from the last good checkpoint. Checkpoints are used for recovery.


Sandbox:

A sandbox is a user's own personal workspace. It is a collection of graphs and related files that are stored in a single directory tree and that can be treated as a group for version control, navigation, and migration.

A sandbox consists of five folders:

db: database files
dml: record formats
mp: graphs
run: deployed scripts
xfr: transforms



Abinitio Beginning

  1. What is Abinitio?
Ans) Ab initio is a Latin phrase meaning "from the beginning". It is a powerful ETL tool; ETL stands for Extract, Transform, and Load. It is widely used in data warehousing, and its main objective is to process data for the enterprise.

This software works as a client server model.

The client is the GDE, i.e. the Graphical Development Environment; the server is the CO>OPERATING system. Parallelism and integration are the main parts of data warehousing. Ab initio code is called a graph and has the extension .mp.

Ab-initio has 13 built-in component categories that are used to carry out operations. These are as follows:

  • Sort
  • Compress
  • Deprecated
  • Partition
  • Transform
  • Continuous
  • Dataset
  • FTP
  • Miscellaneous
  • Translate
  • Validate
  • Database
  • Departition


Important Questions:


Q1) Which component should be used to lower the size of a file?
Ans) Deflate or Compress are the components that can be used to reduce the size of a file.

Q2) Can a graph run infinitely? If yes, how?
Ans) A graph can run infinitely by calling its own .ksh script at the end of the script.

Q3) What does a lock mean in Ab initio?
Ans) A graph must be locked to give a developer permission to edit the object if needed. For example, if another developer wants to change the same object, he will be warned that the graph has already been locked by another user. This is basically a protection mechanism.

Q4) What is the EME?
Ans) EME stands for Enterprise Meta Environment. It is basically a repository that stores all objects, and it is also called the version controller. It keeps track of graphs and other objects.

Q5) What role does xfr play in Ab initio?
Ans) An xfr is basically used to store a mapping. It is useful because rewriting the code takes time, and reusing an xfr saves that effort.

Q6) What is the difference between a phase and a checkpoint?
Ans) A phase deletes its intermediate (temporary) files before a new phase begins, which is quite different from a checkpoint. A checkpoint keeps its temporary files until the end of the graph, so the graph can restart from the last good process.

Q7) How much memory do we need for a graph?
Ans) A rough calculation is 8 MB plus MAX_CORE plus the size of the files in the phase.

Q8) How can the term "standard environment" be defined?
Ans) The term standard environment is basically used when a setup includes more than one project, i.e. private and public projects.

Q9) What is the difference between DB config and cfg?
Ans) The similarity is that both are used for database connectivity. The difference is that cfg is used for the Informix database, whereas DB config is used for SQL Server and Oracle.

Q10) What is the difference between the Scan and Rollup components?
Ans) Scan is basically used to get a cumulative (running) summary of records, while Rollup is used to get one summarized record per key group.
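The difference can be sketched in Python on made-up customer records (input sorted on the key, as Rollup expects); this is an analogy for the semantics, not the components themselves:

```python
# Scan vs Rollup semantics: Scan emits one running-summary record per input
# record; Rollup emits one summarized record per key group.
from itertools import groupby

records = [
    {"cust": "A", "amt": 10},
    {"cust": "A", "amt": 20},
    {"cust": "B", "amt": 5},
]

# Scan: one output record per input record, with the cumulative summary so far
scan_out = []
totals = {}
for r in records:
    totals[r["cust"]] = totals.get(r["cust"], 0) + r["amt"]
    scan_out.append({"cust": r["cust"], "running_amt": totals[r["cust"]]})

# Rollup: one output record per key group (input assumed sorted on the key)
rollup_out = [
    {"cust": k, "total_amt": sum(r["amt"] for r in grp)}
    for k, grp in groupby(records, key=lambda r: r["cust"])
]

print(scan_out)    # running totals per record: 10, 30, 5
print(rollup_out)  # one total per customer: A=30, B=5
```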

Q11) What layouts are supported in Ab initio?
Ans) Ab initio supports both serial and parallel layouts. A parallel layout is basically related to the degree of parallelism.

Q12) What is the definition of a multistage component?
Ans) Multistage components are basically transform components that include multiple packages (stages).

Q13) What can we say about partition by key and partition by round-robin?
Ans) Partition by key, also known as hash partitioning, is used when we have diverse keys; it is basically used for parallel data processing. Round-robin is a technique that distributes the data uniformly across every partition.
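The two schemes can be sketched in Python with hypothetical records and three partitions:

```python
# Sketch of the two partitioning schemes for n = 3 partitions.
def partition_by_key(records, key, n):
    # Hash partition: records with equal keys always land in the same partition.
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(r[key]) % n].append(r)
    return parts

def partition_round_robin(records, n):
    # Round robin: records are dealt out evenly, ignoring their content.
    parts = [[] for _ in range(n)]
    for i, r in enumerate(records):
        parts[i % n].append(r)
    return parts

records = [{"id": i, "dept": d} for i, d in enumerate("ABABAB")]
by_key = partition_by_key(records, "dept", 3)
rr = partition_round_robin(records, 3)

# Round robin gives uniform sizes regardless of key skew
print([len(p) for p in rr])  # [2, 2, 2]
```

The trade-off is that partition by key keeps each key group together (needed for keyed operations like Rollup), while round-robin balances load but scatters key groups across partitions.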

Q14) What is the driving port?
Ans) The driving port is basically used to increase the performance of a graph.

Q15) What is the reason for a database to contain stored procedures?
Ans) The main reason is reduced network traffic. Stored procedures are precompiled SQL blocks, so execution time is reduced. Because the procedure is stored in the database, the application calls it and it executes faster than SQL that is not already compiled, which increases application performance. Stored procedures also provide reusability for other applications.

Q16) What is the reason for using parameterized graphs?
Ans) When we want to use the same graph many times for various files, we should set parameters in the graph. In this way we can keep one generic graph to achieve it.

Q17) What is the difference between API and Utility mode?
Ans) API and Utility mode are both used as connection interfaces for performing the database tasks required by the user.
The difference is that API is slower but provides a high degree of flexibility; API is also considered to offer more diagnostic features.

Q18) What methods exist for performance tuning?
Ans) The main approach is to use a Join component when two tables need to be brought together. Alternatively, we can write the query so that the join happens at the database level; the advantage is that it hits the database only once, which improves performance.