Apache Hadoop India Summit 2011 talk "Data Integration on Hadoop" by Sanjay Kaluskar
1. Data Integration on Hadoop Sanjay Kaluskar Senior Architect, Informatica Feb 2011
2. Introduction
Challenges
- Results of analysis or mining are only as good as the completeness & quality of the underlying data
- Need for the right level of abstraction & tools
Data integration & data quality tools have tackled these challenges for many years! More than 4,200 enterprises worldwide rely on Informatica.
10. Goals of the prototype
- Enable Hadoop developers to leverage Data Transformation and Data Quality logic
- Ability to invoke mapplets from Hadoop
- Lower the barrier to Hadoop entry by using Informatica Developer as the toolset
- Ability to run a mapping on Hadoop
11. Mapplet Invocation
- Generation of a UDF of the right type:
  - Output-only mapplet -> Load UDF
  - Input-only mapplet -> Store UDF
  - Input/output mapplet -> Eval UDF
- Packaging into a jar: the compiled UDF plus other metadata (connections, reference tables)
- Invokes the Informatica engine (DTM) at runtime
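In PIG Latin, the three UDF flavors above would surface roughly as in the sketch below. The jar name, UDF class names, and connection file are hypothetical, chosen only to illustrate how a generated mapplet UDF could be registered and invoked:

```pig
-- Register the jar generated from the mapplet (all names hypothetical)
REGISTER infa_mapplet_udfs.jar;

-- Input/output mapplet exposed as an Eval UDF, applied per tuple
raw      = LOAD 'customers.txt' USING PigStorage('\t')
           AS (name:chararray, addr:chararray);
cleansed = FOREACH raw GENERATE name, com.informatica.udf.StandardizeAddress(addr);

-- Output-only mapplet exposed as a Load UDF (acts as a data source)
refs = LOAD 'ref_table' USING com.informatica.udf.MappletLoader('connection.properties');

-- Input-only mapplet exposed as a Store UDF (acts as a data sink)
STORE cleansed INTO 'target' USING com.informatica.udf.MappletStorer('connection.properties');
```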
12. Mapplet Invocation (contd.)
Challenges
- UDF execution is per-tuple; mapplets are optimized for batch execution
- Connection info/reference data need to be plugged in
- Runtime dependencies: 280 jars, 558 native dependencies
Benefits
- A PIG user can leverage Informatica functionality
- Connectivity to many (50+) data sources
- Specialized transformations
- Re-use of already developed logic
13. Mapping Deployment: Idea
- Leverage PIG
  - Map to equivalent operators where possible
  - Let the PIG compiler optimize & translate to Hadoop jobs
- Wrap some transformations as UDFs
  - Transformations with no PIG equivalents, e.g., standardizer, address validator
  - Transformations with richer functionality, e.g., case-insensitive sorter
16. Mapping Deployment Design
- Leverages PIG operators where possible
- Wraps other transformations as UDFs
- Relies on optimization by the PIG compiler
Challenges
- Finding equivalent operators and expressions
- Limitations of the UDF model: no notion of a user-defined operator
Benefits
- Re-use of already developed logic
- An easy way for Informatica users to start using Hadoop; they can also continue to use the designer
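A deployed mapping along these lines might translate into a PIG script such as the following sketch: filters and joins map to native PIG operators that the PIG compiler can optimize into Hadoop jobs, while a transformation with no equivalent is wrapped as a UDF (the UDF class name and field layout are hypothetical):

```pig
-- A filter and join in the mapping translate to native PIG operators
orders    = LOAD 'orders'    AS (id:int, cust:chararray, amt:double);
big       = FILTER orders BY amt > 1000.0;
customers = LOAD 'customers' AS (cust:chararray, addr:chararray);
joined    = JOIN big BY cust, customers BY cust;

-- A transformation with no PIG equivalent (e.g., an address validator)
-- is wrapped as a UDF instead (class name hypothetical)
validated = FOREACH joined GENERATE big::id, big::cust,
            com.informatica.udf.ValidateAddress(customers::addr);
```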
17. Enterprise Connectivity for Hadoop Programs: Informatica & Hadoop Big Picture
[Architecture diagram: a Hadoop cluster (Name Node, Job Tracker, Data Nodes with HDFS) fed by weblogs, databases, enterprise applications, and semi-structured/un-structured data; Informatica provides enterprise connectivity, a metadata repository, a graphical IDE for Hadoop development, and a transformation engine for custom data processing, with results flowing to the DW/DM and BI tools.]
24. Informatica Extras…
Specialized transformations
- Matching
- Address validation
- Standardization
Connectivity
Other tools
- Data federation
- Analyst tool
- Administration
- Metadata manager
- Business glossary
26. Hadoop Connector for Enterprise Data Access
- Opens up all the connectivity available from Informatica for Hadoop processing
- Sqoop-based connectors
- Hadoop sources & targets in mappings
Benefits
- Load data from enterprise data sources into Hadoop
- Extract summarized data from Hadoop to load into the DW and other targets
- Data federation
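In a mapping, such a connector appears as an ordinary Hadoop source or target; in PIG terms the round trip might look like the sketch below (the loader/storer class names and connection files are hypothetical, standing in for the Sqoop-based connectors):

```pig
-- Load data from an enterprise source into Hadoop through the connector
sales = LOAD 'SALES'
        USING com.informatica.connect.EnterpriseLoader('db_connection.properties')
        AS (region:chararray, amt:double);

-- Summarize inside Hadoop
by_region = GROUP sales BY region;
summary   = FOREACH by_region GENERATE group AS region, SUM(sales.amt) AS total;

-- Extract the summarized data to a DW target
STORE summary INTO 'DW_SALES_SUMMARY'
      USING com.informatica.connect.EnterpriseStorer('dw_connection.properties');
```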
27. Informatica Developer Tool for Hadoop
[Architecture diagram: an Informatica developer builds Hadoop mappings in the Informatica Developer (Hadoop Designer) and deploys them to the Hadoop cluster as a PIG script with PIG UDFs; on each Data Node, the embedded Informatica engine (eDTM) runs mapplets for complex transformations (address cleansing, dedupe/matching, hierarchical data parsing) and enterprise data access, backed by HDFS and the Metadata Repository.]
28. Reuse Informatica Components in Hadoop
[Architecture diagram: a Hadoop developer invokes Informatica transformations from MapReduce/PIG scripts; mapplets built in the Informatica Developer tool are exposed as PIG UDFs, and on each Data Node the embedded Informatica engine (eDTM) executes them for complex transformations (dedupe/matching, hierarchical data parsing) and enterprise data access, backed by HDFS and the Metadata Repository.]