Data Integration on Hadoop Sanjay Kaluskar Senior Architect, Informatica Feb 2011
Introduction: Challenges
- Results of analysis or mining are only as good as the completeness & quality of the underlying data
- Need for the right level of abstraction & tools
- Data integration & data quality tools have tackled these challenges for many years!
- More than 4,200 enterprises worldwide rely on Informatica
Data sources: Files, Applications, Databases, Hadoop (HBase, HDFS)
Access methods & languages: Transact-SQL, Java, C/C++, SQL, Web services, OCI, JMS, BAPI, JDBC, PL/SQL, ODBC, Hive, XQuery, vi, PIG, Word, Notepad, Sqoop CLI, Excel
- Vendor neutrality/flexibility
- Developer tools
Lookup example
Database table / HDFS file: ‘Bangalore’, …, 234, …; ‘Chennai’, …, 82, …; ‘Mumbai’, …, 872, …; ‘Delhi’, …, 11, …; ‘Chennai’, …, 43, …; ‘xxx’, …, 2, …
Your choices:
- Could use PIG/Hive to leverage the join operator
- Implement Java code to look up the database table
- Need to use an access method based on the vendor
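The hand-coded route above amounts to a lookup join: load the small database table into memory and stream the big HDFS file against it. A minimal Python sketch of that idea, using hypothetical city/count data mirroring the slide's example (the enrichment values are invented for illustration):

```python
def lookup_join(hdfs_rows, db_table):
    """Join streamed HDFS rows against a small in-memory lookup table.

    hdfs_rows: iterable of (city, count) tuples read from the big file.
    db_table:  dict mapping city -> enrichment value (the database table).
    Rows with no match (like 'xxx' in the slide) are dropped, as an
    inner join would do.
    """
    for city, count in hdfs_rows:
        if city in db_table:
            yield city, count, db_table[city]

# Hypothetical data mirroring the slide's example
db_table = {"Bangalore": "KA", "Chennai": "TN", "Mumbai": "MH", "Delhi": "DL"}
rows = [("Bangalore", 234), ("Chennai", 82), ("xxx", 2)]
result = list(lookup_join(rows, db_table))
# result contains the matched rows enriched with the lookup value
```

In PIG this is what a replicated join does; in hand-written Java it is exactly the per-vendor connectivity code the slide warns about.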
Or… you could start with a mapping: Load → Filter → Store
Goals of the prototype
- Enable Hadoop developers to leverage Data Transformation and Data Quality logic: ability to invoke mapplets from Hadoop
- Lower the barrier to Hadoop entry by using Informatica Developer as the toolset: ability to run a mapping on Hadoop
Mapplet Invocation
- Generation of the UDF of the right type: output-only mapplet → Load UDF; input-only mapplet → Store UDF; input/output → Eval UDF
- Packaging into a jar: compiled UDF plus other metadata (connections, reference tables)
- Invokes the Informatica engine (DTM) at runtime
Mapplet Invocation (contd.)
Challenges
- UDF execution is per-tuple; mapplets are optimized for batch execution
- Connection info/reference data need to be plugged in
- Runtime dependencies: 280 jars, 558 native dependencies
Benefits
- PIG users can leverage Informatica functionality
- Connectivity to many (50+) data sources
- Specialized transformations
- Re-use of already developed logic
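The per-tuple vs. batch mismatch noted above can be bridged by buffering: the UDF framework calls in one tuple at a time, while the wrapped engine is handed whole batches. A minimal Python sketch of that pattern (the batch function is a stand-in for the engine, not the actual DTM API):

```python
class BatchingUdf:
    """Wrap a batch-oriented engine behind a per-tuple exec() interface.

    The UDF framework calls exec() once per tuple, but the underlying
    engine (a stand-in here for something like the Informatica DTM,
    which is optimized for batch execution) is handed whole batches,
    so tuples are buffered and flushed in groups.
    """

    def __init__(self, batch_fn, batch_size=3):
        self.batch_fn = batch_fn      # processes a list of tuples at once
        self.batch_size = batch_size
        self.buffer = []
        self.results = []

    def exec(self, tup):
        self.buffer.append(tup)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered to the batch engine in one call.
        if self.buffer:
            self.results.extend(self.batch_fn(self.buffer))
            self.buffer = []

udf = BatchingUdf(lambda batch: [x * 10 for x in batch], batch_size=3)
for t in [1, 2, 3, 4]:
    udf.exec(t)
udf.flush()   # the caller must flush the tail at end of input
```

The cost is that results are only available after a flush, which is why the slide argues PIG should support array (batch) execution for UDFs natively.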
Mapping Deployment: Idea
Leverage PIG
- Map to equivalent operators where possible
- Let the PIG compiler optimize & translate to Hadoop jobs
Wrap some transformations as UDFs
- Transformations with no equivalents, e.g., standardizer, address validator
- Transformations with richer functionality, e.g., case-insensitive sorter
Leveraging PIG Operators
Leveraging Informatica Transformations
Informatica transformations are translated to PIG UDFs (case converter UDF, source UDFs, lookup UDF, target UDF); the rest of the script is native PIG.
Mapping Deployment Design
- Leverages PIG operators where possible
- Wraps other transformations as UDFs
- Relies on optimization by the PIG compiler
Challenges
- Finding equivalent operators and expressions
- Limitations of the UDF model – no notion of a user-defined operator
Benefits
- Re-use of already developed logic
- Easy way for Informatica users to start using Hadoop; they can also use the designer
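The design above boils down to a translation table. A hypothetical Python sketch of the dispatch step: transformations with a native PIG equivalent are mapped to that operator, and everything else is wrapped as a UDF (the operator names here are illustrative, not Informatica's actual transformation set):

```python
# Hypothetical translation table: which mapping transformations have
# native PIG equivalents. Names are illustrative only.
NATIVE_EQUIVALENTS = {
    "Filter": "FILTER",
    "Sorter": "ORDER BY",
    "Joiner": "JOIN",
    "Aggregator": "GROUP ... FOREACH",
}

def translate(transformations):
    """Return (operator, is_udf) pairs for a list of transformation names."""
    plan = []
    for t in transformations:
        if t in NATIVE_EQUIVALENTS:
            # Let the PIG compiler optimize the native operator.
            plan.append((NATIVE_EQUIVALENTS[t], False))
        else:
            # No PIG equivalent (e.g. an address validator): wrap as a UDF.
            plan.append((f"{t}_UDF", True))
    return plan

plan = translate(["Filter", "AddressValidator", "Sorter"])
```

This also makes the stated limitation concrete: a wrapped UDF is a black box to the PIG optimizer, whereas the native operators on the left-hand side can be reordered and combined by the compiler.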
Informatica & Hadoop: Big Picture
- Enterprise connectivity for Hadoop programs
- Graphical IDE for Hadoop development
- Transformation engine for custom data processing
(Architecture: Hadoop cluster with HDFS Name Node, Data Nodes and Job Tracker; Metadata Repository; sources such as weblogs, databases, semi-structured and unstructured data, enterprise applications; targets such as BI and DW/DM)
Data sources: Files, Applications, Databases, Hadoop (HBase, HDFS)
Access methods & languages: Java, C/C++, SQL, Web services, JMS, OCI, BAPI, PL/SQL, XQuery, vi, Hive, PIG, Word, Notepad, Sqoop, Excel
- Connectivity
- Rich transforms


Informatica Extras…
- Specialized transformations: matching, address validation, standardization
- Connectivity
- Other tools: data federation, analyst tool, administration, metadata manager, business glossary
Hadoop Connector for Enterprise data access
- Opens up all the connectivity available from Informatica for Hadoop processing
- Sqoop-based connectors
- Hadoop sources & targets in mappings
Benefits
- Load data from Enterprise data sources into Hadoop
- Extract summarized data from Hadoop to load into DW and other targets
- Data federation
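A Sqoop-style import parallelizes the load by partitioning the source table on a split-by key, one key range per map task. A minimal Python sketch of that range-splitting logic (illustrative only, not Sqoop's actual implementation):

```python
def split_ranges(min_id, max_id, num_mappers):
    """Split the inclusive key range [min_id, max_id] into contiguous
    sub-ranges, one per mapper, the way a Sqoop-style import partitions
    a table on its split-by column before issuing parallel SELECTs."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges, lo = [], min_id
    for i in range(num_mappers):
        # The first `extra` mappers each take one extra key.
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# e.g. 10 keys across 4 mappers
parts = split_ranges(1, 10, 4)
```

Each range then becomes an independent query (conceptually `WHERE id BETWEEN lo AND hi`) whose results land in a separate HDFS file, which is what lets the load scale with the cluster.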
Informatica Hadoop Developer
- The Informatica developer builds Hadoop mappings with the Informatica Developer tool for Hadoop and deploys them to the Hadoop cluster
- Mapping → PIG script; eDTM, mapplets, etc. → PIG UDF
- Complex transformations: address cleansing, dedup/matching, hierarchical data parsing; enterprise data access
(Components: Informatica Developer Hadoop Designer, Metadata Repository, Data Nodes running the PIG script with HDFS, UDFs, the Informatica eDTM and mapplets)
Reuse Informatica Components in Hadoop
- Invoke Informatica transformations from your Hadoop MapReduce/PIG scripts
- The Hadoop developer invokes Informatica UDFs from PIG scripts
- Informatica Developer tool: mapplets → PIG UDF
- Complex transformations: dedupe/matching, hierarchical data parsing; enterprise data access
(Components: Data Nodes running the PIG script with HDFS, UDFs, the Informatica eDTM and mapplets; Metadata Repository)

Editor's notes

  1. Many examples of users using Hadoop for analysis/mining. Example: social networking data – need to identify the same user across multiple applications. Map/reduce functions are powerful but low level. Informatica is the leader. >>> You may wonder: how? Why do so many people use Informatica tools?
  2. Historical perspective – proliferation of data sources over time, data fragmentation. Developer productivity is due to higher-level abstraction, built-in transformations and re-use. Vendor neutrality is not at the cost of performance; it gives the flexibility to move to a different vendor easily. The challenge of being productive with Hadoop is similar. >>> Let's make this more concrete with a very simple example.
  3. This could be sales data that you want to analyze.
  4. PIG script calls the lookup as a UDF. Appealing for somebody familiar with PIG. >>> Or for somebody familiar with Informatica…
  5. This is more appealing for Informatica users. >>> We started prototyping with these ideas.
  6. Choice of PIG: appeal to two different user segments. >>> Next I will go into some implementation details.
  7. A mapplet may be treated as a source, target or a function. >>> Just a few more low-level details for the Hadoop hackers.
  8. PIG should have array execution for UDFs. Ideally we don't want the runtime to access the Informatica domain. Distcache seems like the right solution: it works for native libs, with some problems for jars. Address Doctor supports 240 countries! >>> Next we will look at mapping deployment.
     Registering each individual jar is tedious & error-prone; also, PIG re-packs everything together, which overwrites files. A top-level jar with other jars on the class path needs the jars distributed preserving the dir structure, which rules out mapred.cache.files. Problem with mapred.cache.archives: can't add the top-level jar to the classpath, since mapred.job.classpath.files entries must be from mapred.cache.files. Problem with mapred.child.java.opts: can't add to java.class.path, but can add to java.library.path.
  9. Leveraging PIG saves us a lot of work and avoids re-inventing the wheel. >>> Details of conversion.
  10. So many similarities. Note the dummy files in load & store. The concept could be generalized – currently, the parallelism is a problem. >>> Let's look at an example.
  11. Anybody curious what this translates into? >>> Some implementation details.
  12. >>> Where are we going with all this?
  13. Sqoop adapters; reader/writers to allow HDFS sources/targets. >>> To summarize.
  14. Any quick questions? >>> I didn't mention some non-trivial extras.