1. Putting Business Intelligence to Work on Hadoop Data Stores
Ian Fyfe, Chief Technology Evangelist, Pentaho
© 2010, Pentaho. All Rights Reserved. www.pentaho.com. Worldwide: +1 (866) 660-7555
2. Session Abstract
This presentation will cover how to overcome Hadoop's constraints to get more out of your business data analysis.
Hadoop is an inexpensive way of storing large volumes of data, and it is also scalable and redundant. But getting data out of Hadoop is tough due to the lack of a built-in query language. Also, because users experience high latency (up to several minutes per query), Hadoop is not appropriate for ad hoc query, reporting, and business analysis with traditional tools.
The first step in overcoming Hadoop's constraints is connecting to Hive, a data warehouse infrastructure built on top of Hadoop, which provides the relational structure necessary for scheduled reporting on large datasets stored in Hadoop files. Hive also provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query this data.
But to really unlock the power of Hadoop, you must be able to efficiently extract data stored across multiple (often tens or hundreds of) nodes with a user-friendly ETL (extract, transform, and load) tool that then lets you move your Hadoop data into a relational data mart or warehouse, where you can use BI tools for analysis.
Attendees will learn how an IT person without Java programming skills can:
Integrate with Hadoop and Hive to bring ETL, data warehousing, and BI applications to the tasks of analyzing Big Data;
Provide key data integration and transformation functionality to Hadoop data;
Manage and control Hadoop jobs using a graphical interface;
Integrate Hadoop data with data from other sources to drive compelling reporting and analytics for today's massive volumes of data.
3. THE CASE FOR BIG DATA
4. The Case for Big Data
Enterprises increasingly face needs to store, process and maintain larger and larger volumes of structured and unstructured data
Compliance
Competitive Advantage
Challenges associated with big data
Cost: storage and processing power
Timeliness of data processing
Why Hadoop?
[Chart: Google Trends for "Hadoop"]
Low cost, reliable scale-out architecture for storing massive amounts of data
Parallel, distributed computing framework for processing data
Proven success in solving Big Data problems at Fortune 500 companies like Google, Yahoo!, IBM and GE
Vibrant community, exploding interest, strong commercial investments
5. Hadoop for Data Integration and BI
Top Use Cases for Hadoop*
1. "mine data for improved business intelligence"
2. "reducing cost of data analysis"
3. "log analysis"
Top Challenges with Hadoop*
1. Steep technical learning curve
2. Hiring qualified people
3. Availability of appropriate products and tools
Unfortunately, Hadoop was not designed specifically for ETL and BI use cases:
It's not a database
High-latency queries and jobs are not ideal for all BI use cases
Skill set mismatch for traditional ETL users and BI solution architects
*Based on a survey of 100+ Hadoop users conducted by Karmasphere, Sept. 2010
6. ESTABLISHING AN ARCHITECTURE FOR BIG DATA
7. Example Use Cases Today
Transactional
• Fraud detection
• Financial services/stock markets
Sub-Transactional
• Weblogs
• Social/online media
• Telecoms events
8. Example Use Cases Today
Non-Transactional
• Web pages, blogs, etc.
• Documents
• Physical events
• Application events
• Machine events
In most cases structured or semi-structured
9. Traditional Business Intelligence (BI)
[Diagram labels: Data Source; Data Mart(s); Tape/Trash; several question marks]
10. Data Lake
• Single source
• Large volume
• Not distilled
• Typically no more than 0-2 lakes per company
• Known and unknown questions
• Multiple user communities
• Don't fit in traditional RDBMS with a reasonable cost
11. Data Lake Requirements
• Store all the data
• Satisfy routine reporting and analysis
• Satisfy ad-hoc query / analysis / reporting
• Balance performance and cost
12. What if...
[Diagram labels: Data Source; Data Lake(s); Data Mart(s); Ad-Hoc; Data Warehouse]
13. Big Data Does Not Replace Data Marts
It's not a database
High latency
Optimized for massive data-crunching
Big Data databases are immature
Databases are no-SQL
14. What Hadoop Really is...
Core Components
HDFS
A distributed file system allowing massive storage across a cluster of commodity servers
MapReduce
Framework for distributed computation; common use cases include aggregating, sorting, and filtering BIG data sets
Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
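The idea that a problem is split into fragments that can be computed in isolation can be sketched without a cluster. The following is a minimal, in-memory illustration in plain Java, not Hadoop's API: the class MapReduceSketch and its methods map, reduce, and wordCount are names assumed for this example, and the grouping step stands in for the shuffle that Hadoop performs between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the map, shuffle, and reduce phases (no Hadoop dependency). */
final class MapReduceSketch {

    /** Map phase: turn one input line into (word, 1) pairs. Each line is handled in isolation. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    /** Reduce phase: combine all values collected for a single key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Driver: map every line, group the pairs by key (the "shuffle"), then reduce each group. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "big clusters")));
    }
}
```

Because each call to map sees only one line and each call to reduce sees only one key, either could be rerun on any node, which is the property the slide describes.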
15. What Hadoop Really is...
Related Projects
Hive: a data warehouse infrastructure on top of Hadoop
Implements a SQL-like query language, including a JDBC driver
Allows MapReduce developers to plug in custom mappers and reducers
HBase: the Hadoop database. AH HA!
A variant of NoSQL databases, problematic for traditional BI
Best at storing large amounts of unstructured data
16. Hadoop and BI?
Distributed processing
Distributed file system
Commodity hardware
Platform independent (in theory)
Scales out beyond technology and/or economy of an RDBMS
In many cases it's the only viable solution
17. Hadoop and BI?
90% of new Hadoop use cases are transformation of semi/structured data*
* of those companies we've talked to...
18. Hadoop and BI?
"The working conditions within Hadoop are shocking"
ETL Developer
19. Hadoop and BI?
Instead of this...
20. Hadoop and BI?
You have to do this in Java...
public void map(
    Text key,
    Text value,
    OutputCollector output,
    Reporter reporter)
public void reduce(
    Text key,
    Iterator values,
    OutputCollector output,
    Reporter reporter)
21. People don't use Hadoop for BI because they want to...
22. ...they do it because they have to...
23. ... and unfortunately it wasn't designed for most BI requirements
24. Why not add to Hadoop the things it's missing...
25. ... until it can do what we need it to?
26. If only we had a Java, embeddable, data transformation engine...
27. A Data Integration Engine for Hadoop
[Diagram labels: Data Marts, Data Warehouse, Analytical Applications; Data Integration Engine (shown three times); Hadoop; Design; Deploy; Orchestrate]
28. [Architecture diagram. Stages: Load, Optimize, Visualize. Components: Applications & Systems; Hadoop (Files / HDFS, Hive); RDBMS (DM & DW); Web Tier (Reporting / Dashboards / Analysis)]
29. [Architecture diagram. Components: Applications & Systems; Hadoop (Files / HDFS, Hive); RDBMS (DM & DW); Web Tier (Reporting / Dashboards / Analysis); Metadata]
30. [Diagram labels: Data Source; Data Lake(s); Data Mart(s); Ad-Hoc; Data Warehouse]
31. [Architecture diagram. Components: Applications & Systems; Data Lake (Hadoop); RDBMS; Web Tier (Reporting / Dashboards / Analysis)]
32. Product Requirements for BI Against Hadoop
Lower technical barriers through graphical ETL environment for creating and managing Hadoop MapReduce jobs
Extreme ETL scalability through deployment across the Hadoop cluster
Easily spin-off high performance data marts for interactive analysis
Easily integrate data from Hadoop with data from other sources
Provide end-to-end BI addressing common BI use cases with Hadoop including reporting, ad hoc query and interactive analysis
Reduce costs through subscription-based pricing, reduced dependency on scarce technical resources, and easier maintainability
[Diagram labels: Agile BI (Interactive Analysis, Batch Reporting, Ad Hoc Query); Data Marts; Hive; Hadoop; Data Integration Jobs; Log Files; DBs and other sources]
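As a rough illustration of spinning off a data mart, the sketch below condenses raw log lines into one summary row per day. The tab-delimited layout (date, page, bytes), the class DataMartSketch, and the DailySummary row are all assumptions made for this example; the presentation does not specify a data format, and a real job would write the resulting rows to a relational data mart rather than keep them in memory.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical ETL step: condense raw, tab-delimited web log lines into daily summary rows. */
final class DataMartSketch {

    /** One row of the (made-up) summary table: hits and bytes served per day. */
    static final class DailySummary {
        long hits;
        long bytes;
    }

    /**
     * Expects lines with three tab-separated fields: date, page, bytes served.
     * Lines that do not match that shape are skipped rather than failing the whole run.
     */
    static Map<String, DailySummary> summarize(List<String> rawLines) {
        Map<String, DailySummary> byDay = new TreeMap<>();
        for (String line : rawLines) {
            String[] fields = line.split("\t");
            if (fields.length != 3) {
                continue; // malformed record: wrong number of fields
            }
            long size;
            try {
                size = Long.parseLong(fields[2].trim());
            } catch (NumberFormatException e) {
                continue; // malformed record: byte count is not a number
            }
            DailySummary row = byDay.computeIfAbsent(fields[0], day -> new DailySummary());
            row.hits++;
            row.bytes += size;
        }
        return byDay;
    }

    public static void main(String[] args) {
        Map<String, DailySummary> mart = summarize(List.of(
            "2010-09-30\t/index.html\t5120",
            "2010-09-30\t/pricing.html\t2048",
            "not a log line"));
        mart.forEach((day, row) ->
            System.out.println(day + " hits=" + row.hits + " bytes=" + row.bytes));
    }
}
```

The summary is far smaller than the raw input, which is what makes it a reasonable fit for a low-latency relational store while the detail stays in Hadoop.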
33. THE ROAD AHEAD
34. The Road Ahead
Other NoSQL Integration
Facilitate BI use cases on top of HBase, possibly others like MongoDB, Cassandra
Streaming Data Source Support
In support of near-realtime use cases
Long/always running data processing jobs
Contiguous Meta-data
Data Lineage and Impact Analysis covering the entire big data architecture
The End of MapReduce (... as a concept ETL users need to understand)
Push down optimization of Transformations that generate native MapReduce tasks in Hadoop
35. Hadoop Distro Wars
The Apache Software Foundation
36. Tools That Make Hadoop Easier
e.g. Apache Pig
Pig is a platform for analyzing large data sets
Produces sequences of MapReduce programs
Integrate Pig scripts into enterprise data integration workflows, e.g.
1. Submit and monitor a series of Pig and MapReduce jobs
2. Process a database bulk load step to ready data for ad-hoc analysis or report bursting
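The two workflow steps above amount to running jobs in order and only loading the database if the earlier jobs succeeded. A minimal sketch of that control flow in plain Java follows; WorkflowSketch and the step names are hypothetical, and the lambdas stand in for real Pig, MapReduce, and bulk-load invocations, which this example does not attempt to model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical workflow runner: executes named steps in order and stops at the first failure. */
final class WorkflowSketch {
    // LinkedHashMap keeps the steps in the order they were registered.
    private final Map<String, BooleanSupplier> steps = new LinkedHashMap<>();

    /** Registers a step; the supplier returns true when the step succeeded. */
    WorkflowSketch step(String name, BooleanSupplier action) {
        steps.put(name, action);
        return this;
    }

    /** Runs the steps in registration order and returns one "name: STATUS" line per step. */
    List<String> run() {
        List<String> log = new ArrayList<>();
        boolean failed = false;
        for (Map.Entry<String, BooleanSupplier> step : steps.entrySet()) {
            if (failed) {
                log.add(step.getKey() + ": SKIPPED");
                continue;
            }
            boolean ok;
            try {
                ok = step.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // treat an exception as a failed step
            }
            log.add(step.getKey() + (ok ? ": OK" : ": FAILED"));
            failed = !ok;
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> log = new WorkflowSketch()
            .step("pig-script", () -> true)
            .step("mapreduce-job", () -> true)
            .step("bulk-load", () -> true)
            .run();
        log.forEach(System.out::println);
    }
}
```

Skipping the bulk load after an upstream failure keeps partial results out of the data mart; whether to retry or alert instead is a design choice left open here.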
37. Growth in Adoption of Other NoSQL Big Data Platforms
HBase: the Hadoop database
mongoDB: scalable, high-performance, document-oriented database
LexisNexis HPCC: a data-intensive computing system platform
Many others
38. Summary
Hadoop and other Big Data NoSQL platforms
Great at storing and processing large diverse data volumes
Not designed for Business Intelligence
Choosing the right BI technology can unlock your Big Data to drive actionable insights
Graphical user interfaces
Scalable
Spin-off data marts
Integrate data into data warehouses
Integrated dashboards, reporting, data analysis, data integration
39. Thank You!
ifyfe@pentaho.com