Data integration-on-hadoop

Data Integration on Hadoop Sanjay Kaluskar Senior Architect, Informatica Feb 2011

Introduction Challenges Results of analysis or mining are only as good as the completeness & quality of underlying data Need for the right level of abstraction & tools Data integration & data quality tools have tackled these challenges for many years! More than 4,200 enterprises worldwide rely on Informatica

Files Applications Databases Hadoop HBase HDFS Data sources Transact-SQL Java C/C++ SQL Web services OCI Java JMS BAPI JDBC PL/SQL ODBC Hive XQuery vi PIG Word Notepad Sqoop CLI Access methods & languages Excel ,[object Object]

Vendor neutrality/flexibilityDeveloper tools

Lookup example ‘Bangalore’, …, 234, … ‘Chennai’, …, 82, … ‘Mumbai’, …, 872, … ‘Delhi’, …, 11, … ‘Chennai’, …, 43, … ‘xxx’, …, 2, … Database table HDFS file Your choices ,[object Object]

Could use PIG/Hive to leverage the join operator

Implement Java code to lookup the database table

Need to use access method based on the vendor,[object Object]

Or… you could start with a mapping STORE Filter Load

Goals of the prototype Enable Hadoop developers to leverage Data Transformation and Data Quality logic Ability to invoke mappletsfrom Hadoop Lower the barrier to Hadoop entry by using Informatica Developer as the toolset Ability to run a mapping on Hadoop

MappletInvocation Generation of the UDF of the right type Output-only mapplet Load UDF Input-only mapplet  Store UDF Input/output  Eval UDF Packaging into a jar Compiled UDF Other meta-data: connections, reference tables Invokes Informatica engine (DTM) at runtime

Mapplet Invocation (contd.) Challenges UDF execution is per-tuple; mappletsare optimized for batch execution Connection info/reference data need to be plugged in Runtime dependencies: 280 jars, 558 native dependencies Benefits PIG user can leverage Informatica functionality Connectivity to many (50+) data sources Specialized transformations Re-use of already developed logic

Mapping Deployment: Idea Leverage PIG Map to equivalent operators where possible Let the PIG compiler optimize & translate to Hadoop jobs Wraps some transformations as UDFs Transformations with no equivalents, e.g., standardizer, address validator Transformations with richer functionality, e.g., case-insensitive sorter

LeveragingInformaticaTransformations Case converter UDF Native PIG Source UDFs Lookup UDF Target UDF Native PIG Native PIG Informatica Transformation (Translated to PIG UDFs)

Mapping Deployment Design Leverages PIG operators where possible Wraps other transformations as UDFs Relies on optimization by the PIG compiler Challenges Finding equivalent operators and expressions Limitations of the UDF model – no notion of a user defined operator Benefits Re-use of already developed logic Easy way for Informatica users to start using Hadoop simultaneously; can also use the designer

Enterprise Connectivity for Hadoop programs Hadoop Cluster Weblogs Databases BI HDFS Name Node DW/DM Metadata Repository Graphical IDE for Hadoop Development Semi-structured Un-structured Data Node HDFS Enterprise Applications HDFS Job Tracker Informatica & HadoopBig Picture Transformation Engine for custom data processing

Files Applications Databases Hadoop HBase HDFS Data sources Java C/C++ SQL Web services JMS OCI Java BAPI PL/SQL XQuery vi Hive PIG Word Notepad Sqoop Access methods & languages Excel ,[object Object]

Data integration-on-hadoop

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Data integration-on-hadoop

Ähnlich wie Data integration-on-hadoop (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Data integration-on-hadoop

Hinweis der Redaktion