The document discusses options for ingesting, extracting, parsing, and transforming data on Hadoop using Informatica products. It outlines Informatica's current capabilities for data integration with Hadoop and a roadmap, targeted for the first half of 2012, for processing data directly on Hadoop: users will be able to design data processing flows visually and execute them on the Hadoop cluster with optimized performance.
Data Ingestion, Extraction & Parsing on Hadoop
1. Data Ingestion, Extraction, and Preparation for Hadoop
Sanjay Kaluskar, Sr. Architect, Informatica
David Teniente, Data Architect, Rackspace
2. Safe Harbor Statement
• The information being provided today is for informational purposes only. The development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future.
• Some of the comments we will make today are forward-looking statements, including statements concerning our product portfolio, our growth and operational strategies, our opportunities, customer adoption of and demand for our products and services, the use and expected benefits of our products and services by customers, the expected benefit from our partnerships, and our expectations regarding future industry trends and macroeconomic development.
• All forward-looking statements are based upon current expectations and beliefs. However, actual results could differ materially. There are many reasons why actual results may differ from our current expectations. These forward-looking statements should not be relied upon as representing our views as of any subsequent date, and Informatica undertakes no obligation to update forward-looking statements to reflect events or circumstances after the date that they are made.
• Please refer to our recent SEC filings, including the Form 10-Q for the quarter ended September 30, 2011, for a detailed discussion of the risk factors that may affect our results. Copies of these documents may be obtained from the SEC or by contacting our Investor Relations department.
3. The Hadoop Data Processing Pipeline
Informatica PowerCenter + PowerExchange
(Pipeline diagram; legend distinguishes capabilities Available Today from those planned for 1H 2012.)
1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop
Sources: Product & Service Offerings, Marketing Campaigns, Customer Profile, Account Transactions, Customer Service Logs & Surveys, Social Media
Consumers: Sales & Marketing Data mart, Customer Service Portal
PowerCenter + PowerExchange cover the ingest and extract stages today.
5. Unleash the Power of Hadoop
With High Performance Universal Data Access
Messaging, Packaged Applications and Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, TIBCO, webMethods, Web Services, JD Edwards, SAP NetWeaver, SAP NetWeaver BI, Oracle E-Business, PeopleSoft, Siebel, Lotus Notes, SAS
Relational and Flat Files: Oracle, Informix, DB2 UDB, Teradata, DB2/400, Netezza, SQL Server, ODBC, Sybase, JDBC
SaaS/BPO: Salesforce CRM, ADP, Hewitt, Force.com, RightNow, SAP By Design, Oracle OnDemand, NetSuite
Mainframe and Midrange: ADABAS, VSAM, Datacom, C-ISAM, DB2, Binary Flat Files, IDMS, IMS, Tape Formats…
Industry Standards: EDI-X12, EDI-Fact, FIX, RosettaNet, Cargo IMP, HL7, MVR, HIPAA, AST
Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP
XML Standards: XML, ebXML, LegalXML, HL7 v3.0, IFX, ACORD (AL3, XML), cXML
MPP Appliances: EMC/Greenplum, AsterData, Vertica
Social Media: Facebook, LinkedIn, Twitter
6. Ingest Data
Access Data → Pre-Process → Ingest Data
Sources accessed via PowerExchange: web servers, databases and data warehouses, message queues, email, social media, ERP, CRM, and mainframe systems.
Latencies: batch, change data capture (CDC), and real-time.
Pre-process in PowerCenter (e.g. filter, join, cleanse) before landing the data in HDFS or HIVE; existing PowerCenter mappings can be reused. A minimal API-level sketch of the landing step follows below.
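PowerExchange and PowerCenter do this work through a visual designer, but as a rough sense of what the final landing step amounts to, here is a minimal sketch using the standard HDFS Java client; the NameNode URI and file paths are invented for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

// Minimal HDFS ingest sketch: copy a pre-processed local extract into HDFS.
// The NameNode URI and both paths are hypothetical.
public class IngestToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Land the filtered/joined/cleansed file in HDFS.
        fs.copyFromLocalFile(new Path("/staging/customer_profile.csv"),
                             new Path("/data/raw/customer_profile.csv"));
        fs.close();
    }
}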
7. Extract Data
Extract Data → Post-Process → Deliver Data
Data is extracted from HDFS via PowerCenter, post-processed (e.g. transformed to the target schema), and delivered in batch through PowerExchange to web servers, databases and data warehouses, ERP, CRM, and mainframe systems; existing PowerCenter mappings can be reused. A minimal sketch of the extract step follows below.
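Conversely, a minimal sketch of pulling a result file back out of HDFS so it can be mapped to a target schema and delivered; again, the URI and path are invented.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

// Minimal HDFS extract sketch: stream a result file out of HDFS for
// post-processing and delivery. URI and path are hypothetical.
public class ExtractFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"),
                                       new Configuration());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/out/part-00000"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Map each record to the target schema here before loading
                // it into the warehouse, ERP, or other destination.
                System.out.println(line);
            }
        }
        fs.close();
    }
}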
9. The Hadoop Data Processing Pipeline
Informatica HParser
(Pipeline diagram; legend distinguishes capabilities Available Today from those planned for 1H 2012. HParser covers the parse & prepare stage.)
1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop
Sources: Product & Service Offerings, Marketing Campaigns, Customer Profile, Account Transactions, Customer Service Logs & Surveys, Social Media
Consumers: Sales & Marketing Data mart, Customer Service Portal
12. Informatica HParser
Productivity: Data Transformation Studio
• Out-of-the-box transformations for all messages in all versions
• Easy example-based visual enhancements and edits
• Updates and new versions delivered from Informatica
• Definition is done using business (industry) terminology and definitions
• Enhanced validations
Supported standards:
Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI V2.0 Lockbox, CREST, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO, IFX
Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML
B2B Standards: UNEDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI, DEX, IATA-PADIS, PLMXML, CDISC
Healthcare: HL7, HL7 V3, HIPAA, NCPDP
Other: NEIM
13. Informatica HParser
How does it work?
1. Develop an HParser transformation in the studio
2. Deploy the transformation to the repository
3. Run HParser on the Hadoop cluster to produce tabular data in HDFS
4. Analyze the data with HIVE / PIG / MapReduce / Other
Invocation on the Hadoop cluster:
hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
A sketch of step 4 follows below.
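As a hedged illustration of step 4, the tabular output could be queried from Java through the HiveServer2 JDBC driver, assuming the parsed files have been exposed as a Hive table; the host, port, and table name here are invented.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Query the tabular output that HParser wrote to HDFS, assuming it is
// registered as a Hive table. Host, port, and table are hypothetical.
public class AnalyzeParsedData {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT msg_type, COUNT(*) FROM parsed_messages GROUP BY msg_type")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}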
14. The Hadoop Data Processing Pipeline
Informatica Roadmap
(Pipeline diagram; legend distinguishes capabilities Available Today from those planned for 1H 2012.)
1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop
Sources: Product & Service Offerings, Marketing Campaigns, Customer Profile, Account Transactions, Customer Service Logs & Surveys, Social Media
Consumers: Sales & Marketing Data mart, Customer Service Portal
16. Informatica Hadoop Roadmap - 1H 2012
• Process data on Hadoop
  • IDE, administration, monitoring, workflow
  • Data processing flow designed through the IDE: Source/Target, Filter, Join, Lookup, etc.
  • Execution on the Hadoop cluster (pushdown via Hive)
• Flexibility to plug in custom code
  • Hive and PIG UDFs (see the sketch below)
  • MR scripts
• Productivity with optimal performance
  • Exploit Hive performance characteristics
  • Optimize end-to-end data flow for performance
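The "plug in custom code" bullet mentions Hive and PIG UDFs. As a hedged illustration (not Informatica's code), a minimal Hive UDF in Java looks like this; the class name and behavior are invented.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Illustrative Hive UDF: normalize a string column. After packaging into
// a jar, it is registered in Hive via ADD JAR and CREATE TEMPORARY
// FUNCTION, then called in queries like any built-in function.
public final class NormalizeText extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().trim().toLowerCase());
    }
}

Once registered, generated Hive queries in the data flow could invoke such a function alongside the built-in operators.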
17. Mapping for Hive execution
(Screenshot: a mapping as a logical representation of processing steps, validated and configured for Hive translation, with a preview of the generated Hive code.)
INSERT INTO STG0 SELECT * FROM StockAnalysis0;
INSERT INTO STG1 SELECT * FROM StockAnalysis1;
INSERT INTO STG2 SELECT * FROM StockAnalysis2;
18. Takeaways
• Universal connectivity
• Completeness and enrichment of raw data for holistic analysis
• Prevent Hadoop from becoming another silo accessible to only a few experts
• Maximum productivity
• Collaborative development environment
• Right level of abstraction for data processing logic
• Re-use of algorithms and data flow logic
• Metadata-driven processing
• Document data lineage for auditing and impact analysis
• Deploy on any platform for optimal performance and utilization
19. Customer Sentiment - Reaching beyond NPS (Net Promoter Score) and surveys
Gaining insight into our customers' sentiment will improve Rackspace's ability to provide Fanatical Support™
Objectives:
• What are "they" saying?
• Gauge the level of sentiment
• Fanatical Support™ for the win
• Increase NPS
• Increase MRR
• Decrease churn
• Provide the right products
• Keep our promises
20. Customer Sentiment Use Cases
Pulling it all together
Case 1: Match social media posts with a customer; determine a probable match.
Case 2: Determine the sentiment of a post by searching key words and scoring the post.
Case 3: Determine correlations between posts, ticket volume, and NPS leading to negative or positive sentiments.
Case 4: Determine correlations in sentiments with products/configurations which lead to negative or positive sentiments.
Case 5: The ability to trend all inputs over time…
A toy illustration of the Case 2 scoring logic follows below.
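Case 2 describes keyword search and scoring. A toy version of that scoring logic, with word lists and weights invented here for illustration (a real system would use a curated lexicon and run inside a MapReduce job or UDF), might look like:

import java.util.HashMap;
import java.util.Map;

// Toy keyword-based sentiment scorer for Case 2. Words and weights are
// invented; positive words add to the score, negative words subtract.
public class SentimentScorer {
    private static final Map<String, Integer> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("love", 2);
        WEIGHTS.put("great", 1);
        WEIGHTS.put("slow", -1);
        WEIGHTS.put("outage", -2);
    }

    public static int score(String post) {
        int score = 0;
        for (String token : post.toLowerCase().split("\\W+")) {
            score += WEIGHTS.getOrDefault(token, 0);
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("Love the support, but the server was slow"));
    }
}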
21. Rackspace Fanatical Support™
Big Data Environment
Data sources (DBs, flat files, data streams): Oracle, MySQL, MS SQL, Postgres, DB2, Excel, CSV, flat files, XML, EDI, binary, sys logs, APIs
Ingestion: message bus / port listening into Hadoop HDFS
Indirect analytics over Hadoop: Greenplum DB, BI analytics, BI stack
Direct analytics over Hadoop: search, analytics, messaging, algorithmic
22. Twitter Feed for Rackspace
Using Informatica
(Example screenshots of input and output data.)
Speaker notes: Some talking points to cover over the next few slides on PowerExchange for Hadoop…
• Access all data sources
• Ability to pre-process (e.g. filter) before landing to HDFS, and post-process to fit the target schema
• Performance of load via partitioning, native APIs, grid, pushdown to source or target, process offloading
• Productivity via visual designer
• Different latencies (batch, near real-time)

One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. All too often developers resort to reinventing the wheel by building custom adapters and scripts that require expert knowledge of the source systems, applications, data structures, and formats. Once they overcome this hurdle they need to make sure their custom code will perform and scale as data volumes grow. Along with the need for speed, security and reliability are often overlooked, which increases the risk of non-compliance and system downtime. Needless to say, building a robust custom adapter takes time and can be costly to maintain as software versions change. Sometimes the end result is adapters that lack direct connectivity between the source systems and Hadoop, which means the data must be temporarily staged before it can move into Hadoop, increasing storage costs.

Informatica PowerExchange can access data from virtually any data source at any latency (e.g. batch, real-time, or near real-time) and deliver all your data directly into Hadoop (see Figure 2). Similarly, Informatica PowerExchange can deliver data from Hadoop to your enterprise applications and information management systems. You can schedule batch loads to move data from multiple source systems directly into Hadoop without any staging. Alternatively, you can move only changed data from relational and mainframe systems directly into Hadoop. For real-time data feeds, you can move data off of message queues and deliver it into Hadoop. Informatica PowerExchange accesses data through native APIs to ensure optimal performance and is designed to minimize the impact on source systems through caching and process offloading. To further increase the performance of data flows between the source systems and Hadoop, PowerCenter supports data partitioning to distribute the processing across CPUs.

Informatica PowerExchange for Hadoop is integrated with PowerCenter so that you can pre-process data from multiple data sources before the data lands in Hadoop. This lets you leverage the source system metadata, since this information is not retained in the Hadoop Distributed File System (HDFS). For example, you can perform lookups, filters, or relational joins based on primary and foreign key relationships before data is delivered to HDFS. You can also push down the pre-processing to the source system to limit data movement and unnecessary data duplication in Hadoop. Common design patterns for data flows into or out of Hadoop can be generated in PowerCenter using parameterized templates built in Microsoft Visio to dramatically increase productivity. To securely and reliably manage the file transfer and collection of very large data files from both inside and outside the firewall, you can use Informatica Managed File Transfer (MFT).
Sanjay's notes: Flume and Scribe are options for streaming ingestion of log files; Kafka is for near real-time.
See the PWX for Hadoop white paper:
• Does not require expert knowledge of source systems
• Delivers data directly to Hadoop without any intermediate staging
• Accesses data through native APIs for optimal performance
• Brings in both un-modeled/unstructured and structured relational data to make the analysis complete
• Use an example to illustrate combining the unstructured and structured data needed for analysis
• Have lineage of where data came from
Informatica announced on Nov 2 the industry's first data parser for Hadoop. The solution is designed to provide a powerful data parsing alternative to organizations seeking to achieve the full potential of Big Data in Hadoop with efficiency and scale. It addresses the industry's growing demand for turning unstructured, complex data into a structured or semi-structured format in Hadoop to drive insights and improve operations. Tapping its industry-leading experience in parsing unstructured data and handling industry formats and documents within and across the enterprise, Informatica pioneered the development of a data parser that exploits the parallelism of the MapReduce framework.

Using an engine-based, interactive tool to simplify the data parsing process, Informatica HParser processes complex files and messages in Hadoop with the following three offerings:
• Informatica HParser for logs, Omniture, XML and JSON (community edition), free of charge
• Informatica HParser for industry standards (commercial edition)
• Informatica HParser for documents (commercial edition)

With HParser, organizations can:
• Accelerate deployment using out-of-the-box, ready-to-use transformations and industry standards
• Increase productivity when tackling diverse complex formats, including proprietary log files
• Speed the development of parsing by exploiting the parallelism inside MapReduce
• Optimize performance in data parsing for large files, including logs, XML, JSON and industry standards

Informatica also provides a free 30-day trial of the commercial edition of HParser for Documents to users interested in learning about the design environment for data transformation.
• Define the extraction/transformation logic using the designer
• Run the parser as a standalone MR job
• Command line arguments are the script, input, and output files
• Parallelism is across files; no support for file splits
Describe each of the future capabilities in the bullets. You can design and specify the entire end-to-end flow of your data processing pipeline, with the flexibility to insert custom code. Choose the right level of abstraction to define your data flow; don't reinvent the wheel. Informatica provides the right level of abstraction for data processing for rapid development (e.g. a metadata-driven development environment) and easy maintenance (e.g. complete specification and lineage of data).