Thursday, March 13, 2014

More about Ab-initio Overview & graph

Ab-initio :

It is an important ETL (Extract, Transform and Load) tool used to analyze data for business purposes.

ETL Process

Extract :

In this step, the required data is extracted from sources such as text files, databases and other source systems.

Transform :

In this step, the extracted data is converted into the format required for analysis. It involves the following tasks:

  • Applying business rules (derivations, calculating new values and dimensions)
  • Cleaning (e.g. mapping 'Male' to 'm' and 'Female' to 'f')
  • Filtering (selecting only the required records and columns to load)
  • Joining data from various sources (e.g. lookup files)

Load :

In this step, the transformed data is loaded into an enterprise data warehouse or another data repository.
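
The three steps above can be sketched in plain Python. This is only an illustration of the ETL idea, not Ab-initio code: the sample data, the column names, the "amount in cents" business rule and the list standing in for a warehouse are all assumptions made for the example.

```python
import csv
import io

# Hypothetical source data; a real job would read from files or a database.
SOURCE = """id,name,gender,amount
1,Asha,Female,100
2,Ravi,Male,250
3,Mina,Female,
"""

# Cleaning rule from the text: 'Male' becomes 'm', 'Female' becomes 'f'.
GENDER_CODES = {"Male": "m", "Female": "f"}


def extract(text):
    """Extract: read rows from the source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: filter records and columns, clean values, derive a new value."""
    out = []
    for row in rows:
        if not row["amount"]:  # filtering: skip records with no amount
            continue
        out.append({
            "id": int(row["id"]),
            "gender": GENDER_CODES[row["gender"]],     # cleaning
            "amount_cents": int(row["amount"]) * 100,  # assumed business rule
        })  # the 'name' column is not selected for loading
    return out


def load(rows, target):
    """Load: append rows to the target (a list standing in for a warehouse)."""
    target.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(extract(SOURCE)), warehouse)
```

After this runs, `warehouse` holds only the cleaned records that passed the filter, and `loaded` is the number of records written.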

Ab-initio software is based on two main programs:

CO>OPERATING SYSTEM :

It is the Ab-initio server software, which the system administrator installs on a host running Unix or Windows NT. It is layered on top of the native operating system.

CO>OPERATING SYSTEM provides the following features:

  • Managing and running Ab-initio graphs and controlling the ETL process
  • Monitoring and debugging ETL processes
  • Managing metadata and interacting with the EME (Enterprise Meta>Environment)

GRAPHICAL DEVELOPMENT ENVIRONMENT (GDE) :

It is installed on a PC (the GDE computer) and configured to communicate with the host.

The GDE is basically used for:

  • Designing and running graphs
  • Building ETL processes, which Ab-initio represents as graphs
  • Offering a user-friendly interface

Overview of Graph:

A graph is a data flow diagram that defines the various processing stages of a task and the streams of data as they move from one stage to another. In a graph, a component represents a stage and a flow represents a data stream. You build a graph in the GDE by dragging and dropping components, connecting them with flows and then defining values for their parameters. You can also run, debug and tune the graph in the GDE.

Building a graph amounts to developing an Ab-initio application, so graph development is also known as graph programming.
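
As a rough analogy, components and flows can be modelled with Python generators: each function is a component and each generator passed between them is a flow. This is a sketch of the concept only. The component names and sample records are hypothetical, and it does not reflect how the Co>Operating System actually executes a graph.

```python
def read_input(records):
    """Input component: puts records onto a flow."""
    for record in records:
        yield record


def filter_component(flow, predicate):
    """Filter component: passes on only the records that satisfy the predicate."""
    for record in flow:
        if predicate(record):
            yield record


def reformat_component(flow, transform):
    """Reformat component: applies a transform function to every record."""
    for record in flow:
        yield transform(record)


def write_output(flow):
    """Output component: collects everything that arrives on the flow."""
    return list(flow)


# "Drawing" the graph: connect the components with flows.
source = [{"id": 1, "qty": 5}, {"id": 2, "qty": 0}, {"id": 3, "qty": 2}]
flow1 = read_input(source)
flow2 = filter_component(flow1, lambda r: r["qty"] > 0)
flow3 = reformat_component(flow2, lambda r: {"id": r["id"], "double_qty": r["qty"] * 2})
result = write_output(flow3)
```

The predicate and the transform play the role of component parameters: the wiring stays the same while their values change the behaviour.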

Parts of the Graph:

  • Metadata
Metadata is any information about data or about how to process it. Metadata is of two types:

1) Technical Metadata:

Technical metadata is the information related to graphs, such as what is needed to build them: record formats, transform functions, job histories, versions etc. You can store the technical metadata of a graph in a file, a data store or the EME.

2) Enterprise Metadata:

This includes user-defined information such as job functions, roles, categories and so on.


  • Layout
1) A layout determines the location of a resource.
2) A layout is either serial or parallel.
3) A serial layout specifies a single node or a single directory.
4) A parallel layout specifies multiple nodes or multiple directories.
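
The difference between the two layouts can be illustrated with a small Python sketch that spreads records over the directories of a layout. The directory paths and the "key modulo number of partitions" rule are assumptions chosen for the example, not Ab-initio's own partitioning logic.

```python
def partition_by_key(records, key, layout):
    """Distribute records over the directories of a layout using key modulo."""
    partitions = {path: [] for path in layout}
    for record in records:
        target = layout[record[key] % len(layout)]
        partitions[target].append(record)
    return partitions


# Hypothetical layouts: one directory (serial) versus three (parallel).
serial_layout = ["/data/serial"]
parallel_layout = ["/data/part0", "/data/part1", "/data/part2"]

records = [{"id": i} for i in range(6)]
serial = partition_by_key(records, "id", serial_layout)
parallel = partition_by_key(records, "id", parallel_layout)
```

With the serial layout every record ends up in the single directory, while the parallel layout splits the same records across three locations that could be processed at the same time.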


Phase:

Phases break a graph into blocks for performance tuning. Breaking the graph into phases limits the number of processes that run simultaneously, since one phase must finish before the next begins. The main use of phases is to avoid deadlock. The temporary files generated at a phase break are deleted at the end of the phase, regardless of whether the job succeeded or not.

Checkpoint:

Checkpoints are used for recovery. The temporary files generated at a checkpoint are not deleted, so if the job fails it can be restarted from the last successfully completed checkpoint.
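
The recovery behaviour can be sketched in Python. This is a simplified model, assuming a checkpoint after every phase and using a dictionary in place of the checkpoint files kept on disk. The phase functions and the simulated failure are hypothetical.

```python
def run_graph(phases, state):
    """Run phases in order, skipping any already recorded as completed.

    `state` stands in for the checkpoint data that survives a failed run.
    """
    executed = []
    for index, phase in enumerate(phases):
        if index < state["completed"]:
            continue                    # finished before the failure, skip it
        phase()                         # may raise and abort the run
        executed.append(phase.__name__)
        state["completed"] = index + 1  # checkpoint after a successful phase
    return executed


attempts = {"transform": 0}


def extract():
    pass


def transform():
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("simulated failure on the first attempt")


def load():
    pass


state = {"completed": 0}
phases = [extract, transform, load]

try:
    run_graph(phases, state)            # first run fails in the second phase
except RuntimeError:
    pass
completed_after_failure = state["completed"]

second_run = run_graph(phases, state)   # restart resumes after the checkpoint
```

On the restart the first phase is not repeated, because the checkpoint recorded it as done. Only the failed phase and the ones after it run again.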


Sandbox:

A sandbox is a user's own personal workspace. It is a collection of graphs and related files that are stored in a single directory tree and can be treated as a group for version control, navigation and migration.

A sandbox consists of five folders:

db : database-related files
dml : record formats
mp : graphs
run : deployed scripts
xfr : transforms
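
To make the directory tree concrete, the Python sketch below creates the five folders under a temporary root. It only illustrates the layout: a real sandbox is created through Ab-initio's own tooling and carries additional project settings, and the `my_sandbox` name is hypothetical.

```python
import tempfile
from pathlib import Path

# The five sandbox folders listed above.
SANDBOX_DIRS = ["db", "dml", "mp", "run", "xfr"]


def create_sandbox(root):
    """Create the sandbox folders under `root` and return their names, sorted."""
    root = Path(root)
    for name in SANDBOX_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Use a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    created = create_sandbox(Path(tmp) / "my_sandbox")
```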









 
