Apache Hadoop India Summit 2011 talk "Data Integration on Hadoop" by Sanjay Kaluskar
1. Data Integration on Hadoop Sanjay Kaluskar Senior Architect, Informatica Feb 2011
2. Introduction
Challenges
- Results of analysis or mining are only as good as the completeness & quality of the underlying data
- Need for the right level of abstraction & tools
Data integration & data quality tools have tackled these challenges for many years! More than 4,200 enterprises worldwide rely on Informatica.
10. Goals of the prototype
- Enable Hadoop developers to leverage Data Transformation and Data Quality logic
- Ability to invoke mapplets from Hadoop
- Lower the barrier to Hadoop entry by using Informatica Developer as the toolset
- Ability to run a mapping on Hadoop
11. Mapplet Invocation
- Generation of a UDF of the right type:
  - Output-only mapplet -> Load UDF
  - Input-only mapplet -> Store UDF
  - Input/output mapplet -> Eval UDF
- Packaging into a jar: the compiled UDF plus other metadata (connections, reference tables)
- Invokes the Informatica engine (DTM) at runtime
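In PIG Latin, the three UDF flavors above would surface roughly as in the sketch below. The jar name, UDF class names, and connection file are hypothetical, chosen only to illustrate how a generated mapplet UDF could be registered and invoked:

```pig
-- Register the jar generated from the mapplet (all names hypothetical)
REGISTER infa_mapplet_udfs.jar;

-- Input/output mapplet exposed as an Eval UDF, applied per tuple
raw      = LOAD 'customers.txt' USING PigStorage('\t')
           AS (name:chararray, addr:chararray);
cleansed = FOREACH raw GENERATE name, com.informatica.udf.StandardizeAddress(addr);

-- Output-only mapplet exposed as a Load UDF (acts as a data source)
refs = LOAD 'ref_table' USING com.informatica.udf.MappletLoader('connection.properties');

-- Input-only mapplet exposed as a Store UDF (acts as a data sink)
STORE cleansed INTO 'target' USING com.informatica.udf.MappletStorer('connection.properties');
```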
12. Mapplet Invocation (contd.)
Challenges
- UDF execution is per-tuple; mapplets are optimized for batch execution
- Connection info/reference data need to be plugged in
- Runtime dependencies: 280 jars, 558 native dependencies
Benefits
- A PIG user can leverage Informatica functionality
- Connectivity to many (50+) data sources
- Specialized transformations
- Re-use of already developed logic
13. Mapping Deployment: Idea
- Leverage PIG
  - Map to equivalent operators where possible
  - Let the PIG compiler optimize & translate to Hadoop jobs
- Wrap some transformations as UDFs
  - Transformations with no PIG equivalents, e.g., standardizer, address validator
  - Transformations with richer functionality, e.g., case-insensitive sorter
16. Mapping Deployment Design
- Leverages PIG operators where possible
- Wraps other transformations as UDFs
- Relies on optimization by the PIG compiler
Challenges
- Finding equivalent operators and expressions
- Limitations of the UDF model: no notion of a user-defined operator
Benefits
- Re-use of already developed logic
- An easy way for Informatica users to start using Hadoop; they can also continue to use the designer
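A deployed mapping along these lines might translate into a PIG script such as the following sketch: filters and joins map to native PIG operators that the PIG compiler can optimize into Hadoop jobs, while a transformation with no equivalent is wrapped as a UDF (the UDF class name and field layout are hypothetical):

```pig
-- A filter and join in the mapping translate to native PIG operators
orders    = LOAD 'orders'    AS (id:int, cust:chararray, amt:double);
big       = FILTER orders BY amt > 1000.0;
customers = LOAD 'customers' AS (cust:chararray, addr:chararray);
joined    = JOIN big BY cust, customers BY cust;

-- A transformation with no PIG equivalent (e.g., an address validator)
-- is wrapped as a UDF instead (class name hypothetical)
validated = FOREACH joined GENERATE big::id, big::cust,
            com.informatica.udf.ValidateAddress(customers::addr);
```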
17. Enterprise Connectivity for Hadoop Programs: Informatica & Hadoop Big Picture
[Architecture diagram: a Hadoop cluster (Name Node, Job Tracker, Data Nodes with HDFS) fed by weblogs, databases, enterprise applications, and semi-structured/un-structured data; Informatica provides enterprise connectivity, a metadata repository, a graphical IDE for Hadoop development, and a transformation engine for custom data processing, with results flowing to the DW/DM and BI tools.]
24. Informatica Extras…
Specialized transformations
- Matching
- Address validation
- Standardization
Connectivity
Other tools
- Data federation
- Analyst tool
- Administration
- Metadata manager
- Business glossary
26. Hadoop Connector for Enterprise Data Access
- Opens up all the connectivity available from Informatica for Hadoop processing
- Sqoop-based connectors
- Hadoop sources & targets in mappings
Benefits
- Load data from enterprise data sources into Hadoop
- Extract summarized data from Hadoop to load into the DW and other targets
- Data federation
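In a mapping, such a connector appears as an ordinary Hadoop source or target; in PIG terms the round trip might look like the sketch below (the loader/storer class names and connection files are hypothetical, standing in for the Sqoop-based connectors):

```pig
-- Load data from an enterprise source into Hadoop through the connector
sales = LOAD 'SALES'
        USING com.informatica.connect.EnterpriseLoader('db_connection.properties')
        AS (region:chararray, amt:double);

-- Summarize inside Hadoop
by_region = GROUP sales BY region;
summary   = FOREACH by_region GENERATE group AS region, SUM(sales.amt) AS total;

-- Extract the summarized data to a DW target
STORE summary INTO 'DW_SALES_SUMMARY'
      USING com.informatica.connect.EnterpriseStorer('dw_connection.properties');
```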
27. Informatica Developer Tool for Hadoop
[Architecture diagram: an Informatica developer builds Hadoop mappings in the Informatica Developer (Hadoop Designer) and deploys them to the Hadoop cluster as a PIG script with PIG UDFs; on each Data Node, the embedded Informatica engine (eDTM) runs mapplets for complex transformations (address cleansing, dedupe/matching, hierarchical data parsing) and enterprise data access, backed by HDFS and the Metadata Repository.]
28. Reuse Informatica Components in Hadoop
[Architecture diagram: a Hadoop developer invokes Informatica transformations from MapReduce/PIG scripts; mapplets built in the Informatica Developer tool are exposed as PIG UDFs, and on each Data Node the embedded Informatica engine (eDTM) executes them for complex transformations (dedupe/matching, hierarchical data parsing) and enterprise data access, backed by HDFS and the Metadata Repository.]