Data Ingestion, Extraction, and Preparation for Hadoop

Sanjay Kaluskar, Sr. Architect, Informatica
David Teniente, Data Architect, Rackspace
Safe Harbor Statement

• The information being provided today is for informational purposes only. The development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future.
• Some of the comments we will make today are forward-looking statements, including statements concerning our product portfolio, our growth and operational strategies, our opportunities, customer adoption of and demand for our products and services, the use and expected benefits of our products and services by customers, the expected benefit from our partnerships, and our expectations regarding future industry trends and macroeconomic developments.
• All forward-looking statements are based upon current expectations and beliefs. However, actual results could differ materially. There are many reasons why actual results may differ from our current expectations. These forward-looking statements should not be relied upon as representing our views as of any subsequent date, and Informatica undertakes no obligation to update forward-looking statements to reflect events or circumstances after the date that they are made.
• Please refer to our recent SEC filings, including the Form 10-Q for the quarter ended September 30th, 2011, for a detailed discussion of the risk factors that may affect our results. Copies of these documents may be obtained from the SEC or by contacting our Investor Relations department.
The Hadoop Data Processing Pipeline
Informatica PowerCenter + PowerExchange

[Pipeline diagram: PowerCenter + PowerExchange alongside the four stages]
1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Sources: Marketing Campaigns, Customer Profile, Account Transactions, Product & Service Offerings, Social Media, Customer Service Logs & Surveys
Targets: Sales & Marketing Data Mart, Customer Service Portal
Legend: Available Today / 1H 2012
Options

Structured (e.g. OLTP, OLAP)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, Sqoop
• Parse & Prepare Data: N/A
• Transform & Cleanse Data: Hive, PIG, MR; Future: Informatica Roadmap

Unstructured, semi-structured (e.g. web logs, JSON)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka
• Parse & Prepare Data: Informatica HParser, PIG/Hive UDFs, MR
• Transform & Cleanse Data: Hive, PIG, MR; Future: Informatica Roadmap
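For the structured row above, a minimal command-line sketch of what Sqoop-based ingestion into Hive can look like; the JDBC URL, credentials and table names are hypothetical placeholders, not taken from the deck:

# Pull a relational table into HDFS and register it as a Hive table.
# Connection string, user and table name are illustrative only.
sqoop import \
  --connect jdbc:mysql://dbhost/crm \
  --username etl_user -P \
  --table ACCOUNT_TRANSACTIONS \
  --hive-import \
  --hive-table account_transactions \
  --num-mappers 4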
Unleash the Power of Hadoop
With High Performance Universal Data Access

Messaging and Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
Relational and Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC
SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
Mainframe and Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, Binary Flat Files, Tape Formats…
Industry Standards: EDI–X12, EDI-Fact, RosettaNet, HL7, HIPAA, AST, FIX, Cargo IMP, MVR
Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP
XML Standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
MPP Appliances: EMC/Greenplum, Vertica, AsterData
Social Media: Facebook, Twitter, LinkedIn
Ingest Data

[Diagram: Access Data → Pre-Process → Ingest Data]
Sources: Web server; Databases, Data Warehouse; Message Queues, Email, Social Media; ERP, CRM; Mainframe
Access Data: PowerExchange (Batch, CDC, Real-time)
Pre-Process: PowerCenter (e.g. Filter, Join, Cleanse); reuse PowerCenter mappings
Ingest Data: HDFS, HIVE
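Outside the PowerExchange/PowerCenter path, the simplest "copy files" ingest route from the Options slide is plain HDFS commands plus a Hive external table; a minimal sketch with hypothetical paths and schema:

# Stage raw web-server logs into HDFS (paths are illustrative).
hadoop fs -mkdir -p /landing/weblogs/2012-01-15
hadoop fs -put /var/log/httpd/access_log /landing/weblogs/2012-01-15/

# Expose the staged files to Hive without moving them again.
cat > weblogs.hql <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs_raw (line STRING)
LOCATION '/landing/weblogs/2012-01-15';
EOF
hive -f weblogs.hql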
Extract Data

[Diagram: Extract Data → Post-Process → Deliver Data]
Extract Data: HDFS
Post-Process: PowerCenter (e.g. transform to target schema); reuse PowerCenter mappings
Deliver Data: PowerExchange (Batch) to Web server; Databases, Data Warehouse; ERP, CRM; Mainframe
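As a point of comparison with the PowerExchange delivery path, a command-line sketch of pushing Hive-produced results back to a relational target with Sqoop; the connection details, target table and warehouse path are hypothetical:

# Export the files behind a Hive-managed results table to a relational target.
# Hive's default warehouse layout and ^A field delimiter are assumed.
sqoop export \
  --connect jdbc:mysql://dwhost/marts \
  --username etl_user -P \
  --table SENTIMENT_SUMMARY \
  --export-dir /user/hive/warehouse/sentiment_summary \
  --input-fields-terminated-by '\001'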
[Screenshots: the four configuration steps]
1. Create Ingest or Extract Mapping
2. Create Hadoop Connection
3. Configure Workflow
4. Create & Load Into Hive Table
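Step 4 is driven from the workflow above; for readers without the Informatica tooling, the roughly equivalent HiveQL for creating a table and loading staged files looks like this (table name, columns and path are illustrative):

# Create a Hive table and move staged HDFS files into it.
cat > load_transactions.hql <<'EOF'
CREATE TABLE IF NOT EXISTS account_transactions (
  account_id STRING,
  tx_ts      STRING,
  amount     DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/landing/transactions/2012-01-15'
INTO TABLE account_transactions;
EOF
hive -f load_transactions.hql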
The Hadoop Data Processing Pipeline
Informatica HParser

[Pipeline diagram: HParser alongside stage 2, Parse & Prepare Data on Hadoop]
1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Sources: Marketing Campaigns, Customer Profile, Account Transactions, Product & Service Offerings, Social Media, Customer Service Logs & Surveys
Targets: Sales & Marketing Data Mart, Customer Service Portal
Legend: Available Today / 1H 2012
Options

Structured (e.g. OLTP, OLAP)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, Sqoop
• Parse & Prepare Data: N/A
• Transform & Cleanse Data: Hive, PIG, MR; Future: Informatica Roadmap

Unstructured, semi-structured (e.g. web logs, JSON)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka
• Parse & Prepare Data: Informatica HParser, PIG/Hive UDFs, MR
• Transform & Cleanse Data: Hive, PIG, MR; Future: Informatica Roadmap
Informatica HParser
Productivity: Data Transformation Studio

[Screenshot: Data Transformation Studio]
Informatica HParser
Productivity: Data Transformation Studio

Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI – V2.0 Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO
Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML
Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC
B2B Standards: UNEDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI
Other: IATA-PADIS, PLMXML, NEIM

• Out of the box transformations for all messages in all versions
• Easy example based visual enhancements and edits
• Updates and new versions delivered from Informatica
• Enhanced validations
• Definition is done using business (industry) terminology and definitions
Informatica HParser
How does it work?

[Diagram: HParser runs on the Hadoop cluster, pulling transformations from a services repository and writing results to HDFS]

hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt

1. Develop an HParser transformation
2. Deploy the transformation
3. Run HParser on Hadoop to produce tabular data
4. Analyze the data with HIVE / PIG / MapReduce / Other
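For step 4, assuming HParser wrote its tabular output as tab-delimited text under an HDFS directory (the path and columns below are hypothetical), Hive can query it through an external table:

# Point an external Hive table at the parser output and run a first query.
cat > analyze_parsed.hql <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS parsed_messages (
  msg_type STRING,
  sender   STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/output/my_parser';

SELECT msg_type, COUNT(*) AS msg_count
FROM parsed_messages
GROUP BY msg_type;
EOF
hive -f analyze_parsed.hql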
The Hadoop Data Processing Pipeline
Informatica Roadmap

[Pipeline diagram: the four stages]
1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Sources: Marketing Campaigns, Customer Profile, Account Transactions, Product & Service Offerings, Social Media, Customer Service Logs & Surveys
Targets: Sales & Marketing Data Mart, Customer Service Portal
Legend: Available Today / 1H 2012
Options

Structured (e.g. OLTP, OLAP)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, Sqoop
• Parse & Prepare Data: N/A
• Transform & Cleanse Data: Hive, PIG, MR; Future: Informatica Roadmap

Unstructured, semi-structured (e.g. web logs, JSON)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka
• Parse & Prepare Data: Informatica HParser, PIG/Hive UDFs, MR
• Transform & Cleanse Data: Hive, PIG, MR; Future: Informatica Roadmap
Informatica Hadoop Roadmap – 1H 2012

• Process data on Hadoop
   • IDE, administration, monitoring, workflow
   • Data processing flow designed through the IDE: Source/Target, Filter, Join, Lookup, etc.
   • Execution on the Hadoop cluster (pushdown via Hive)

• Flexibility to plug in custom code
   • Hive and PIG UDFs (see the sketch below)
   • MR scripts

• Productivity with optimal performance
   • Exploit Hive performance characteristics
   • Optimize the end-to-end data flow for performance
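The "plug in custom code" bullet includes Hive UDFs; a sketch of how a pre-built UDF jar is typically registered and called from HiveQL (the jar path, class name and function name are hypothetical):

# Register a custom UDF jar with Hive and use the function in a query.
cat > use_udf.hql <<'EOF'
ADD JAR /opt/udfs/sentiment-udf.jar;
CREATE TEMPORARY FUNCTION score_sentiment AS 'com.example.hive.SentimentScoreUDF';

SELECT post_id, score_sentiment(post_text) AS sentiment_score
FROM social_posts;
EOF
hive -f use_udf.hql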
Mapping for Hive execution

[Screenshot: a mapping, with the Source highlighted]
• Logical representation of processing steps
• Validate & configure for Hive translation
• Preview the generated Hive code:

INSERT INTO STG0 SELECT * FROM StockAnalysis0;
INSERT INTO STG1 SELECT * FROM StockAnalysis1;
INSERT INTO STG2 SELECT * FROM StockAnalysis2;
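The statements above are only the previewed output; a hand-written sketch of the same staging pattern, assuming the staging tables simply mirror the source tables (the names and LIKE-based schemas are illustrative, not Informatica's actual generated code):

# Recreate the previewed staging inserts by hand, for illustration only.
cat > staging_flow.hql <<'EOF'
CREATE TABLE IF NOT EXISTS STG0 LIKE StockAnalysis0;
CREATE TABLE IF NOT EXISTS STG1 LIKE StockAnalysis1;
CREATE TABLE IF NOT EXISTS STG2 LIKE StockAnalysis2;

INSERT INTO TABLE STG0 SELECT * FROM StockAnalysis0;
INSERT INTO TABLE STG1 SELECT * FROM StockAnalysis1;
INSERT INTO TABLE STG2 SELECT * FROM StockAnalysis2;
EOF
hive -f staging_flow.hql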
Takeaways

• Universal connectivity
   • Completeness and enrichment of raw data for holistic analysis
   • Prevent Hadoop from becoming another silo accessible to only a few experts

• Maximum productivity
   • Collaborative development environment
      • Right level of abstraction for data processing logic
      • Reuse of algorithms and data flow logic
   • Metadata-driven processing
      • Document data lineage for auditing and impact analysis
      • Deploy on any platform for optimal performance and utilization
Customer Sentiment - Reaching beyond NPS (Net Promoter Score) and surveys

Gaining insight into our customers' sentiment will improve Rackspace's ability to provide Fanatical Support™.

Objectives:
• What are "they" saying?
• Gauge the level of sentiment
• Fanatical Support™ for the win
   • Increase NPS
   • Increase MRR
   • Decrease churn
   • Provide the right products
   • Keep our promises
Customer Sentiment Use Cases
Pulling it all together

• Case 1: Match social media posts with a customer; determine a probable match.
• Case 2: Determine the sentiment of a post, searching key words and scoring the post (see the sketch below).
• Case 3: Determine correlations between posts, ticket volume and NPS leading to negative or positive sentiments.
• Case 4: Determine correlations in sentiments with products/configurations which lead to negative or positive sentiments.
• Case 5: The ability to trend all inputs over time…
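A minimal HiveQL sketch of the keyword search-and-score idea in Case 2, assuming a social_posts table with post_id and post_text columns; the table, columns and keyword list are illustrative only:

# Score each post by counting positive and negative keyword hits.
cat > score_posts.hql <<'EOF'
SELECT post_id,
       (CASE WHEN lower(post_text) LIKE '%great%'  THEN 1 ELSE 0 END
      + CASE WHEN lower(post_text) LIKE '%thanks%' THEN 1 ELSE 0 END
      - CASE WHEN lower(post_text) LIKE '%outage%' THEN 1 ELSE 0 END
      - CASE WHEN lower(post_text) LIKE '%slow%'   THEN 1 ELSE 0 END) AS sentiment_score
FROM social_posts;
EOF
hive -f score_posts.hql

A real scoring pass would likely use a richer keyword dictionary or a UDF (as on the roadmap slide), but the shape of the query is the same.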
Rackspace Fanatical Support™
Big Data Environment

[Architecture diagram]
Data sources (DBs, flat files, data streams): Oracle, MySQL, MS SQL, Postgres, DB2, Excel, CSV, Flat File, XML, EDI, Binary, Sys Logs, Messaging, APIs
Ingestion: message bus / port listening into Hadoop HDFS
Indirect analytics over Hadoop: Greenplum DB feeding BI Analytics and the BI Stack
Direct analytics over Hadoop: Search, Analytics, Algorithmic
Twitter Feed for Rackspace
Using Informatica

[Screenshots: Input Data and Output Data]
23

Weitere ähnliche Inhalte

Was ist angesagt?

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance ImprovementBiju Nair
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdfvishal choudhary
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Databricks
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and visionStephan Ewen
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARNAdarsh Pannu
 
Pf: the OpenBSD packet filter
Pf: the OpenBSD packet filterPf: the OpenBSD packet filter
Pf: the OpenBSD packet filterGiovanni Bechis
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Databricks
 
(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and ProtobufGuido Schmutz
 
Ansible Automation - Enterprise Use Cases | Juncheng Anthony Lin
Ansible Automation - Enterprise Use Cases | Juncheng Anthony LinAnsible Automation - Enterprise Use Cases | Juncheng Anthony Lin
Ansible Automation - Enterprise Use Cases | Juncheng Anthony LinVietnam Open Infrastructure User Group
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 

Was ist angesagt? (20)

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdf
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Apache KAfka
Apache KAfkaApache KAfka
Apache KAfka
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Pf: the OpenBSD packet filter
Pf: the OpenBSD packet filterPf: the OpenBSD packet filter
Pf: the OpenBSD packet filter
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf(Big) Data Serialization with Avro and Protobuf
(Big) Data Serialization with Avro and Protobuf
 
Ansible Automation - Enterprise Use Cases | Juncheng Anthony Lin
Ansible Automation - Enterprise Use Cases | Juncheng Anthony LinAnsible Automation - Enterprise Use Cases | Juncheng Anthony Lin
Ansible Automation - Enterprise Use Cases | Juncheng Anthony Lin
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 

Andere mochten auch

Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestionVinod Nayal
 
Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestionTreasure Data, Inc.
 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopDataWorks Summit
 
Big Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data PlatformBig Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data PlatformNavneet Gupta
 
Gobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopYinan Li
 
XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map ReduceEdureka!
 
Designing a Real Time Data Ingestion Pipeline
Designing a Real Time Data Ingestion PipelineDesigning a Real Time Data Ingestion Pipeline
Designing a Real Time Data Ingestion PipelineDataScience
 
Efficient processing of large and complex XML documents in Hadoop
Efficient processing of large and complex XML documents in HadoopEfficient processing of large and complex XML documents in Hadoop
Efficient processing of large and complex XML documents in HadoopDataWorks Summit
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Pat Patterson
 
A poster version of HadoopXML
A poster version of HadoopXMLA poster version of HadoopXML
A poster version of HadoopXMLKyong-Ha Lee
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word countJeff Patti
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse OptimisationBigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse OptimisationExcelerate Systems
 
La plateforme OpenData 3.0 pour libĂŠrer et valoriser les donnĂŠes
La plateforme OpenData 3.0 pour libĂŠrer et valoriser les donnĂŠes  La plateforme OpenData 3.0 pour libĂŠrer et valoriser les donnĂŠes
La plateforme OpenData 3.0 pour libĂŠrer et valoriser les donnĂŠes Excelerate Systems
 
Turning Text Into Insights: An Introduction to Topic Models
Turning Text Into Insights: An Introduction to Topic ModelsTurning Text Into Insights: An Introduction to Topic Models
Turning Text Into Insights: An Introduction to Topic ModelsDataScience
 
Scalable Hadoop in the cloud
Scalable Hadoop in the cloudScalable Hadoop in the cloud
Scalable Hadoop in the cloudTreasure Data, Inc.
 
The Data-Drive Paradigm
The Data-Drive ParadigmThe Data-Drive Paradigm
The Data-Drive ParadigmLucidworks
 
Search in 2020: Presented by Will Hayes, Lucidworks
Search in 2020: Presented by Will Hayes, LucidworksSearch in 2020: Presented by Will Hayes, Lucidworks
Search in 2020: Presented by Will Hayes, LucidworksLucidworks
 
Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014Lin Qiao
 
Meson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on MesosMeson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on MesosAntony Arokiasamy
 

Andere mochten auch (20)

Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestion
 
Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestion
 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into Hadoop
 
Big Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data PlatformBig Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data Platform
 
Gobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for Hadoop
 
XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map Reduce
 
Designing a Real Time Data Ingestion Pipeline
Designing a Real Time Data Ingestion PipelineDesigning a Real Time Data Ingestion Pipeline
Designing a Real Time Data Ingestion Pipeline
 
Efficient processing of large and complex XML documents in Hadoop
Efficient processing of large and complex XML documents in HadoopEfficient processing of large and complex XML documents in Hadoop
Efficient processing of large and complex XML documents in Hadoop
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!
 
A poster version of HadoopXML
A poster version of HadoopXMLA poster version of HadoopXML
A poster version of HadoopXML
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse OptimisationBigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
 
La plateforme OpenData 3.0 pour libĂŠrer et valoriser les donnĂŠes
La plateforme OpenData 3.0 pour libĂŠrer et valoriser les donnĂŠes  La plateforme OpenData 3.0 pour libĂŠrer et valoriser les donnĂŠes
La plateforme OpenData 3.0 pour libĂŠrer et valoriser les donnĂŠes
 
Turning Text Into Insights: An Introduction to Topic Models
Turning Text Into Insights: An Introduction to Topic ModelsTurning Text Into Insights: An Introduction to Topic Models
Turning Text Into Insights: An Introduction to Topic Models
 
Scalable Hadoop in the cloud
Scalable Hadoop in the cloudScalable Hadoop in the cloud
Scalable Hadoop in the cloud
 
The Data-Drive Paradigm
The Data-Drive ParadigmThe Data-Drive Paradigm
The Data-Drive Paradigm
 
Search in 2020: Presented by Will Hayes, Lucidworks
Search in 2020: Presented by Will Hayes, LucidworksSearch in 2020: Presented by Will Hayes, Lucidworks
Search in 2020: Presented by Will Hayes, Lucidworks
 
Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014
 
Meson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on MesosMeson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on Mesos
 

Ähnlich wie Data Ingestion, Extraction & Parsing on Hadoop

Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Cloudera, Inc.
 
Hadoop India Summit, Feb 2011 - Informatica
Hadoop India Summit, Feb 2011 - InformaticaHadoop India Summit, Feb 2011 - Informatica
Hadoop India Summit, Feb 2011 - InformaticaSanjeev Kumar
 
Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Etu Solution
 
Informatica
InformaticaInformatica
Informaticamukharji
 
SnapLogic corporate presentation
SnapLogic corporate presentationSnapLogic corporate presentation
SnapLogic corporate presentationpbridges
 
Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems divjeev
 
Talk IT_ Oracle_김태완_110831
Talk IT_ Oracle_김태완_110831Talk IT_ Oracle_김태완_110831
Talk IT_ Oracle_김태완_110831Cana Ko
 
Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Hortonworks
 
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev KumarApache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev KumarYahoo Developer Network
 
Big Data and HPC
Big Data and HPCBig Data and HPC
Big Data and HPCNetApp
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondTeradata Aster
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Cloudera, Inc.
 
Bring your SAP and Enterprise Data to Hadoop, Apache Kafka and the Cloud
Bring your SAP and Enterprise Data to Hadoop, Apache Kafka and the CloudBring your SAP and Enterprise Data to Hadoop, Apache Kafka and the Cloud
Bring your SAP and Enterprise Data to Hadoop, Apache Kafka and the CloudDataWorks Summit/Hadoop Summit
 
Hadoop for shanghai dev meetup
Hadoop for shanghai dev meetupHadoop for shanghai dev meetup
Hadoop for shanghai dev meetupRoby Chen
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopCloudera, Inc.
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaJeffrey T. Pollock
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinerySteve Loughran
 

Ähnlich wie Data Ingestion, Extraction & Parsing on Hadoop (20)

Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
 
Hadoop India Summit, Feb 2011 - Informatica
Hadoop India Summit, Feb 2011 - InformaticaHadoop India Summit, Feb 2011 - Informatica
Hadoop India Summit, Feb 2011 - Informatica
 
OOP 2014
OOP 2014OOP 2014
OOP 2014
 
Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台
 
Informatica
InformaticaInformatica
Informatica
 
SnapLogic corporate presentation
SnapLogic corporate presentationSnapLogic corporate presentation
SnapLogic corporate presentation
 
Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems
 
Talk IT_ Oracle_김태완_110831
Talk IT_ Oracle_김태완_110831Talk IT_ Oracle_김태완_110831
Talk IT_ Oracle_김태완_110831
 
Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014
 
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev KumarApache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
 
Big Data and HPC
Big Data and HPCBig Data and HPC
Big Data and HPC
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and Beyond
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
 
Bring your SAP and Enterprise Data to Hadoop, Apache Kafka and the Cloud
Bring your SAP and Enterprise Data to Hadoop, Apache Kafka and the CloudBring your SAP and Enterprise Data to Hadoop, Apache Kafka and the Cloud
Bring your SAP and Enterprise Data to Hadoop, Apache Kafka and the Cloud
 
2012 06 hortonworks paris hug
2012 06 hortonworks paris hug2012 06 hortonworks paris hug
2012 06 hortonworks paris hug
 
Hadoop for shanghai dev meetup
Hadoop for shanghai dev meetupHadoop for shanghai dev meetup
Hadoop for shanghai dev meetup
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
 

KĂźrzlich hochgeladen

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

KĂźrzlich hochgeladen (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Data Ingestion, Extraction & Parsing on Hadoop

  • 1. Data Ingestion, Extraction, and Preparation for Hadoop Sanjay Kaluskar, Sr. Architect, Informatica David Teniente, Data Architect, Rackspace 1
  • 2. Safe Harbor Statement • The information being provided today is for informational purposes only. The development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future. • Some of the comments we will make today are forward-looking statements including statements concerning our product portfolio, our growth and operational strategies, our opportunities, customer adoption of and demand for our products and services, the use and expected benefits of our products and services by customers, the expected benefit from our partnerships and our expectations regarding future industry trends and macroeconomic development. • All forward-looking statements are based upon current expectations and beliefs. However, actual results could differ materially. There are many reasons why actual results may differ from our current expectations. These forward-looking statements should not be relied upon as representing our views as of any subsequent date and Informatica undertakes no obligation to update forward-looking statements to reflect events or circumstances after the date that they are made. • Please refer to our recent SEC filings including the Form 10-Q for the quarter ended September 30th, 2011 for a detailed discussion of the risk factors that may affect our results. Copies of these documents may be obtained from the SEC or by contacting our Investor Relations department. 2
  • 3. The Hadoop Data Processing Pipeline Informatica PowerCenter + PowerExchange Available Today Sales & Marketing Customer Service 1H / 2012 Data mart Portal 4. Extract Data from Hadoop 3. Transform & Cleanse Data on Hadoop 2. Parse & Prepare Data on PowerCenter + Hadoop PowerExchange 1. Ingest Data into Hadoop Product & Service Customer Service Marketing Campaigns Customer Profile Account Transactions Social Media Offerings Logs & Surveys 3
  • 4. Options Ingest/Extract Parse & Prepare Transform & Data Data Cleanse Data Structured (e.g. Informatica N/A Hive, PIG, MR, OLTP, OLAP) PowerCenter + Future: PowerExchange, Informatica Sqoop Roadmap Unstructured, Informatica Informatica Hive, PIG, MR, semi-structured PowerCenter + HParser, Future: (e.g. web logs, PowerExchange, PIG/Hive UDFs, Informatica JSON) copy files, Flume, MR Roadmap Scribe, Kafka 4
  • 5. Unleash the Power of Hadoop With High Performance Universal Data Access Messaging, Packaged and Web Services WebSphere MQ Web Services JD Edwards SAP NetWeaver Applications JMS TIBCO Lotus Notes SAP NetWeaver BI MSMQ webMethods Oracle E-Business SAS SAP NetWeaver XI PeopleSoft Siebel Relational and Oracle Informix SaaS/BPO Flat Files Salesforce CRM ADP DB2 UDB Teradata Hewitt DB2/400 Netezza Force.com RightNow SAP By Design SQL Server ODBC Oracle OnDemand Sybase JDBC NetSuite Mainframe Industry and Midrange EDI–X12 AST Standards ADABAS VSAM Datacom C-ISAM EDI-Fact FIX DB2 Binary Flat Files RosettaNet Cargo IMP IDMS Tape Formats… HL7 MVR IMS HIPAA Unstructured Data and Files Word, Excel Flat files XML Standards PDF ASCII reports XML ebXML StarOffice HTML LegalXML HL7 v3.0 WordPerfect RPG IFX ACORD (AL3, XML) Email (POP, IMPA) ANSI cXML HTTP LDAP MPP Appliances EMC/Greenplum AsterData Facebook LinkedIn Vertica Twitter Social Media 5
  • 6. Ingest Data Access Data Pre-Process Ingest Data Web server PowerExchange PowerCenter Databases, Data Warehouse Batch HDFS Message Queues, CDC HIVE Email, Social Media e.g. Filter, Join, Cle anse Real-time ERP, CRM Reuse PowerCenter mappings Mainframe 6
  • 7. Extract Data Extract Data Post-Process Deliver Data Web server PowerCenter PowerExchange Databases, HDFS Batch Data Warehouse e.g. Transform ERP, CRM to target schema Reuse Mainframe PowerCenter mappings 7
  • 8. 1. Create Ingest or Extract Mapping 2. Create Hadoop Connection 3. Configure Workflow 4. Create & Load Into Hive Table 8
  • 9. The Hadoop Data Processing Pipeline Informatica HParser Available Today Sales & Marketing Customer Service 1H / 2012 Data mart Portal 4. Extract Data from Hadoop 3. Transform & Cleanse Data on Hadoop 2. Parse & Prepare Data on HParser Hadoop 1. Ingest Data into Hadoop Product & Service Customer Service Marketing Campaigns Customer Profile Account Transactions Social Media Offerings Logs & Surveys 9
  • 10. Options Ingest/Extract Parse & Prepare Transform & Data Data Cleanse Data Structured (e.g. Informatica N/A Hive, PIG, MR, OLTP, OLAP) PowerCenter + Future: PowerExchange, Informatica Sqoop Roadmap Unstructured, Informatica Informatica Hive, PIG, MR, semi-structured PowerCenter + HParser, Future: (e.g. web logs, PowerExchange, PIG/Hive UDFs, Informatica JSON) copy files, Flume, MR Roadmap Scribe, Kafka 10
• 11. Informatica HParser Productivity: Data Transformation Studio
• 12. Informatica HParser Productivity: Data Transformation Studio
   • Out-of-the-box transformations for all messages in all versions
   • Easy example-based visual enhancements and edits
   • Updates and new versions delivered from Informatica
   • Definition is done using business (industry) terminology and definitions
   • Enhanced validations
   Standards categories: Financial, Insurance, B2B Standards, Healthcare, Other. Supported formats include SWIFT MT, SWIFT MX, DTCC-NSCC, UNEDIFACT, ACORD-AL3, ACORD XML, EDI-X12, NACHA, EDI ARR, FIX, EDI UCS+WINS, Telekurs, EDI VICS, FpML, RosettaNet, BAI V2.0 Lockbox, OAGI, CREST, DEX, IFX, HL7, TWIST, HL7 V3, UNIFI (ISO 20022), HIPAA, IATA-PADIS, SEPA, NCPDP, FIXML, PLMXML, CDISC, MISMO, NEIM.
• 13. Informatica HParser: How does it work?
   Command line on the Hadoop cluster: hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
   1. Develop an HParser transformation
   2. Deploy the transformation (Svc Repository)
   3. Run HParser on Hadoop to produce tabular data in HDFS
   4. Analyze the data with HIVE / PIG / MapReduce / Other (example below)
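As a rough illustration of step 4, HParser's tabular output in HDFS can be exposed to Hive as an external table and queried directly; the directory, delimiter and columns below are assumptions made for the example, not HParser's actual output layout.

   -- Hypothetical external table over HParser's tabular output directory
   CREATE EXTERNAL TABLE IF NOT EXISTS parsed_weblog (
     client_ip   STRING,
     request_ts  STRING,
     url         STRING,
     status_code INT
   )
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t'
   LOCATION '/output/my_parser/';

   -- Example analysis: server-error count per URL
   SELECT url, COUNT(*) AS errors
   FROM parsed_weblog
   WHERE status_code >= 500
   GROUP BY url;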
• 14. The Hadoop Data Processing Pipeline: Informatica Roadmap
   Legend: Available Today / 1H 2012
   Pipeline steps: 1. Ingest Data into Hadoop; 2. Parse & Prepare Data on Hadoop; 3. Transform & Cleanse Data on Hadoop; 4. Extract Data from Hadoop
   Sources: Marketing Campaigns, Customer Profile, Account Transactions, Product & Service Offerings, Social Media, Customer Service Logs & Surveys
   Targets: Sales & Marketing Data mart, Customer Service Portal
• 15. Options
   Data type | Ingest/Extract Data | Parse & Prepare Data | Transform & Cleanse Data
   Structured (e.g. OLTP, OLAP) | Informatica PowerCenter + PowerExchange, Sqoop | N/A | Hive, PIG, MR; Future: Informatica Roadmap
   Unstructured, semi-structured (e.g. web logs, JSON) | Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka | Informatica HParser, PIG/Hive UDFs, MR | Hive, PIG, MR; Future: Informatica Roadmap
• 16. Informatica Hadoop Roadmap – 1H 2012
   • Process data on Hadoop
     • IDE, administration, monitoring, workflow
     • Data processing flow designed through the IDE: Source/Target, Filter, Join, Lookup, etc.
     • Execution on the Hadoop cluster (pushdown via Hive)
   • Flexibility to plug in custom code
     • Hive and PIG UDFs (see the sketch below)
     • MR scripts
   • Productivity with optimal performance
     • Exploit Hive performance characteristics
     • Optimize the end-to-end data flow for performance
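To make the custom-code bullet concrete, the sketch below shows the usual way a user-supplied Hive UDF is registered and called from HiveQL; the jar, class and table names are hypothetical and are not part of the Informatica roadmap itself.

   -- Register a user-packaged UDF (names are hypothetical)
   ADD JAR /user/demo/udfs/sentiment-udf.jar;
   CREATE TEMPORARY FUNCTION clean_text AS 'com.example.hive.udf.CleanText';

   -- Use the UDF inside a normal Hive query
   SELECT clean_text(post_body) AS cleaned_post
   FROM social_posts;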
• 17. Mapping for Hive Execution
   A mapping is a logical representation of the processing steps, starting from the Source. Validate and configure it for Hive translation, then preview the generated Hive code, for example:
   INSERT INTO STG0 SELECT * FROM StockAnalysis0;
   INSERT INTO STG1 SELECT * FROM StockAnalysis1;
   INSERT INTO STG2 SELECT * FROM StockAnalysis2;
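The three INSERT statements above are the generated-code preview shown on the slide. Purely to illustrate what pushdown via Hive can look like, a filter-plus-join step in a mapping might translate into HiveQL along the following lines; the STG3 target, ref_symbols table and columns are invented for the example and are not actual generated code.

   -- Illustrative translation of a Filter + Joiner step into Hive
   INSERT INTO TABLE STG3
   SELECT t.symbol, t.trade_ts, t.price, r.sector
   FROM StockAnalysis0 t
   JOIN ref_symbols r ON (t.symbol = r.symbol)
   WHERE t.price > 0;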
• 18. Takeaways
   • Universal connectivity
     • Completeness and enrichment of raw data for holistic analysis
     • Prevent Hadoop from becoming another silo accessible to a few experts
   • Maximum productivity
     • Collaborative development environment
     • Right level of abstraction for data processing logic
     • Reuse of algorithms and data flow logic
   • Metadata-driven processing
     • Document data lineage for auditing and impact analysis
     • Deploy on any platform for optimal performance and utilization
• 19. Customer Sentiment: Reaching beyond NPS (Net Promoter Score) and surveys
   Gaining insight into our customers' sentiment will improve Rackspace's ability to provide Fanatical Support™.
   Objectives:
   • What are "they" saying?
   • Gauge the level of sentiment
   • Fanatical Support™ for the win
   • Increase NPS
   • Increase MRR
   • Decrease churn
   • Provide the right products
   • Keep our promises
• 20. Customer Sentiment Use Cases: Pulling it all together
   Case 1: Match social media posts with Customer; determine a probable match.
   Case 2: Determine the sentiment of a post, searching key words and scoring the post (a HiveQL sketch follows below).
   Case 3: Determine correlations between posts, ticket volume and NPS leading to negative or positive sentiments.
   Case 4: Determine correlations in sentiments with products/configurations which lead to negative or positive sentiments.
   Case 5: The ability to trend all inputs over time…
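As a toy sketch of Case 2 (keyword search and scoring), a Hive query over a hypothetical posts table could assign a crude score by counting positive and negative keyword hits; the table, columns and word lists are invented for illustration and are not Rackspace's implementation.

   -- Naive keyword scoring of social media posts (illustrative only)
   SELECT
     post_id,
     (CASE WHEN lower(post_body) LIKE '%fanatical%' THEN 1 ELSE 0 END
      + CASE WHEN lower(post_body) LIKE '%great%'    THEN 1 ELSE 0 END
      - CASE WHEN lower(post_body) LIKE '%outage%'   THEN 1 ELSE 0 END
      - CASE WHEN lower(post_body) LIKE '%down%'     THEN 1 ELSE 0 END) AS sentiment_score
   FROM social_posts;

In practice the keyword lists would live in a lookup table, and the resulting scores would be joined back to customer, ticket and NPS data to support Cases 1 and 3.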
• 21. Rackspace Fanatical Support™ Big Data Environment
   Data Sources (DBs, flat files, data streams): Oracle, MySql, MS SQL, Greenplum DB, Postgres, DB2, Excel, CSV, Flat File, XML, EDI, Binary, Sys Logs; Message bus / port listening
   Hadoop HDFS, exposed through Search, Analytics and Messaging APIs
   Indirect analytics over Hadoop: BI Analytics, BI Stack
   Direct analytics over Hadoop: Algorithmic
• 22. Twitter Feed for Rackspace Using Informatica: Input Data and Output Data
  • 23. 23

Editor's Notes

1. * EXAMPLE * Some talking points to cover over the next few slides on PowerExchange for Hadoop…
   • Access all data sources
   • Ability to pre-process (e.g. filter) before landing in HDFS and post-process to fit the target schema
   • Performance of load via partitioning, native APIs, grid, pushdown to source or target, process offloading
   • Productivity via the visual designer
   • Different latencies (batch, near real-time)
   One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. All too often developers resort to reinventing the wheel by building custom adapters and scripts that require expert knowledge of the source systems, applications, data structures and formats. Once they overcome this hurdle they need to make sure their custom code will perform and scale as data volumes grow. Along with the need for speed, security and reliability are often overlooked, which increases the risk of non-compliance and system downtime. Needless to say, building a robust custom adapter takes time and can be costly to maintain as software versions change. Sometimes the end result is adapters that lack direct connectivity between the source systems and Hadoop, which means you need to temporarily stage the data before it can move into Hadoop, increasing storage costs.
   Informatica PowerExchange can access data from virtually any data source at any latency (e.g. batch, real-time, or near real-time) and deliver all your data directly into Hadoop (see Figure 2). Similarly, Informatica PowerExchange can deliver data from Hadoop to your enterprise applications and information management systems. You can schedule batch loads to move data from multiple source systems directly into Hadoop without any staging. Alternatively, you can move only changed data from relational and mainframe systems directly into Hadoop. For real-time data feeds, you can move data off message queues and deliver it into Hadoop. Informatica PowerExchange accesses data through native APIs to ensure optimal performance and is designed to minimize the impact on source systems through caching and process offloading. To further increase the performance of data flows between the source systems and Hadoop, PowerCenter supports data partitioning to distribute the processing across CPUs.
   Informatica PowerExchange for Hadoop is integrated with PowerCenter so that you can pre-process data from multiple data sources before it lands in Hadoop. This lets you leverage the source system metadata, since that information is not retained in the Hadoop File System (HDFS). For example, you can perform lookups, filters, or relational joins based on primary and foreign key relationships before data is delivered to HDFS. You can also push down the pre-processing to the source system to limit data movement and unnecessary data duplication in Hadoop. Common design patterns for data flows into or out of Hadoop can be generated in PowerCenter using parameterized templates built in Microsoft Visio to dramatically increase productivity. To securely and reliably manage the file transfer and collection of very large data files from both inside and outside the firewall, you can use Informatica Managed File Transfer (MFT).
2. Sanjay's notes: Flume and Scribe are options for streaming ingestion of log files; Kafka is for near real-time.
3. See the PWX for Hadoop white paper.
   • Does not require expert knowledge of source systems
   • Delivers data directly to Hadoop without any intermediate staging
   • Accesses data through native APIs for optimal performance
   • Brings in both unmodeled/unstructured and structured relational data to make the analysis complete
   • Use an example to illustrate combining both the unstructured and structured data needed for analysis
  4. Have lineage of where data came from
5. Informatica announced on Nov 2 the industry's first data parser for Hadoop.
   The solution is designed to provide a powerful data parsing alternative to organizations seeking to achieve the full potential of Big Data in Hadoop with efficiency and scale. It addresses the industry's growing demand for turning unstructured, complex data into a structured or semi-structured format in Hadoop to drive insights and improve operations. Tapping our industry-leading experience in parsing unstructured data and handling industry formats and documents within and across the enterprise, Informatica pioneered the development of a data parser that exploits the parallelism of the MapReduce framework.
   Using an engine-based, interactive tool to simplify the data parsing process, Informatica HParser processes complex files and messages in Hadoop with the following three offerings:
   • Informatica HParser for logs, Omniture, XML and JSON (community edition), free of charge
   • Informatica HParser for industry standards (commercial edition)
   • Informatica HParser for documents (commercial edition)
   With HParser, organizations can derive unique benefits:
   • Accelerate deployment using out-of-the-box, ready-to-use transformations and industry standards
   • Increase productivity for tackling diverse complex formats, including proprietary log files
   • Speed the development of parsing by exploiting the parallelism inside MapReduce
   • Optimize performance in data parsing for large files including logs, XML, JSON and industry standards
   Informatica also provides a free 30-day trial of the commercial edition of HParser for Documents to users interested in learning about the design environment for data transformation.
6. Define the extraction/transformation logic using the designer. Run the parser as a standalone MR job; the command-line arguments are the script, input, and output files. Parallelism is across files, with no support for file splits.
7. Describe each of the future capabilities in the bullets. You can design and specify the entire end-to-end flow of your data processing pipeline with the flexibility to insert custom code. Choose the right level of abstraction to define your data flow; don't reinvent the wheel. Informatica provides the right level of abstraction for data processing for rapid development (e.g. a metadata-driven development environment) and easy maintenance (e.g. complete specification and lineage of data).