Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013
1. Ask Bigger Questions with Cloudera and Apache Hadoop
Graham Gear
graham@cloudera.com
June 2013
2. Data Has Changed in the Last 30 Years
[Chart: Data Growth, 1980 to 2012. Labels: End-User Applications, The Internet, Mobile Devices, Sophisticated Machines]
• Structured data: 10%
• Unstructured data: 90%
3. Data Management Strategies Have Stayed the Same
• Raw data on SAN, NAS and tape
• Data moved from storage to compute
• Relational models with predesigned schemas
4. Too Much Data, Too Many Sources
• Can’t ingest fast enough
5. Too Much Data, Too Many Sources
• Can’t ingest fast enough
• Costs too much to store
6. Too Much Data, Too Many Sources
• Can’t ingest fast enough
• Costs too much to store
• Exists in different places
7. Too Much Data, Too Many Sources
• Can’t ingest fast enough
• Costs too much to store
• Exists in different places
• Archived data is lost
8. Can’t Use It The Way You Want To
• Analysis and processing takes too long
9. Can’t Use It The Way You Want To
• Analysis and processing takes too long
• Data exists in silos
10. Can’t Use It The Way You Want To
• Analysis and processing takes too long
• Data exists in silos
• Can’t ask new questions
11. Can’t Use It The Way You Want To
• Analysis and processing takes too long
• Data exists in silos
• Can’t ask new questions
• Can’t analyze unstructured data
13. Ask Bigger Questions
• When customer x visits my store, what can I recommend based on their recent web behavior across our various brand websites?
• What is the best location in North America to efficiently produce both tomato plants and corn?
• What does every fraudulent activity in the last 2 years have in common that will help us identify and proactively prevent the next incident?
• Are hotel room sales at Christmas slow because of inventory or competitive pricing?

• What did customer x view on their last website visit?
• What makes tomato plants more fruitful than others?
• What incidents of fraud did we detect last year?
• What search terms are used most often when looking for hotels in NYC?
14. The Cloudera Approach
Meet enterprise demands with a new way to think about data.
THE OLD WAY: Multiple platforms for multiple workloads
COMPLEX, FRAGMENTED, COSTLY
• Data silos by department or LOB
• Lots of data stored in expensive specialized systems
• Analysts pull select data into EDW
• No one has a complete view
THE CLOUDERA WAY: Single data platform to support BI, Reporting & App Serving
SIMPLIFIED, UNIFIED, EFFICIENT
• Bulk of data stored on scalable low cost platform
• Perform end-to-end workflows
• Specialized systems reserved for specialized workloads
• Provides data access across departments or LOB
15. Cloudera Enterprise: The Platform for Big Data
A Revolutionary Solution Built on Apache Hadoop
• BRINGS STORAGE & COMPUTE TOGETHER
• WORKS WITH EVERY TYPE OF DATA
• CHANGES THE ECONOMICS OF DATA MANAGEMENT
[Diagram: INGEST, STORE, EXPLORE, PROCESS, ANALYZE, SERVE; CDH, CLOUDERA MANAGER, CLOUDERA NAVIGATOR, CLOUDERA SUPPORT]
16. Cloudera Enterprise Includes Advanced System Management & Support for the Core CDH Projects
CDH: 100% OPEN SOURCE HADOOP DISTRIBUTION
• CORE PROJECTS: HDFS, MAPREDUCE, FLUME, HCATALOG, HIVE, HUE, MAHOUT, OOZIE, PIG, SQOOP, WHIRR, ZOOKEEPER
• PREMIUM PROJECTS: HBASE, IMPALA, SEARCH (BETA)
• CONNECTORS: MICROSTRATEGY, NETEZZA, ORACLE, QLIKVIEW, TABLEAU, TERADATA
CLOUDERA MANAGER: END-TO-END SYSTEM MANAGEMENT
• DEPLOYMENT, MONITORING, API, SNMP, CONFIG ROLLBACKS, PHONE HOME, SERVICE MGMT, DIAGNOSTICS, ROLLING UPGRADES, LDAP, REPORTING, BACKUP/DR
CLOUDERA SUPPORT: BEST-IN-CLASS TECHNICAL SUPPORT, COMMUNITY ADVOCACY & INDEMNIFICATION
CLOUDERA NAVIGATOR: END-TO-END DATA MANAGEMENT
• ACCESS MGMT, DATA AUDIT
[Diagram labels: CORE HADOOP PROJECTS, CLOUDERA MANAGER, CLOUDERA NAVIGATOR, HBASE, IMPALA, SEARCH]
17. RTD Subscription Includes Support & Indemnity for Apache HBase
18. RTQ Subscription Includes Support & Indemnity for Cloudera Impala
19. RTS Subscription Includes Support & Indemnity for Cloudera Search
20. BDR Subscription Includes Centralized Management For Disaster Recovery Workflows
21. Navigator Subscription Enables Cloudera Navigator for Automated Data Management
23. A multinational bank saves millions by optimizing DW for analytics & reducing data storage costs by 99%.
Ask Bigger Questions: How can we optimize our data warehouse investment?
24. Cloudera optimizes the EDW, saves millions
The Challenge:
• Teradata EDW at capacity: ETL processes consume 7 days; takes 5 weeks to make historical data available for analysis
• Performance issues in business critical apps; little room for discovery, analytics, ROI from opportunities
Multinational bank saves millions by optimizing existing DW for analytics & reducing data storage costs by 99%.
The Solution:
• Cloudera Enterprise offloads data storage, processing & some analytics from EDW
• Teradata can focus on operational functions & analytics
25. A Semiconductor Manufacturer uses predictive analytics to take preventative action on chips likely to fail.
Ask Bigger Questions: Which semiconductor chips will fail?
26. Cloudera enables better predictions
The Challenge:
• Want to capture greater granular and historical data for more accurate predictive yield modeling
• Storing 9 months’ data on Oracle is expensive
Semiconductor manufacturer can prevent chip failure with more accurate predictive yield models.
The Solution:
• Dell | Cloudera solution for Apache Hadoop
• 53 nodes; plan to store up to 10 years (~10PB)
• Capturing & processing data from each phase of manufacturing process
CONFIDENTIAL - RESTRICTED
27. The quant risk LOB within a multinational bank saves millions through better risk exposure analysis & fraud prevention.
Ask Bigger Questions: How can we prevent fraud?
28. Cloudera delivers savings through fraud prevention
The Challenge:
• Fraud detection is a cumbersome, multi-step analytic process requiring data sampling
• 2B transactions/month necessitate constant revisions to risk profiles
• Highly tuned 100TB Teradata DW drives over-budget capital reserves & lower investment returns
Quant risk LOB in multinational bank saves millions through better risk exposure analysis & fraud prevention.
The Solution:
• Cloudera Enterprise data factory for fraud prevention, credit & operational risk analysis
• Look at every incidence of fraud for 5 years for each person
• Reduced costs; expensive CPU no longer consumed by data processing
29. BlackBerry eliminates data sampling & simplifies data processing for better, more comprehensive analysis.
Ask Bigger Questions: How do we retain customers in a competitive market?
30. Cloudera delivers ROI through storage alone
The Challenge:
• BlackBerry Services generates 0.5PB (50-60TB compressed) data per day
• RDBMS is expensive; limited to 1% data sampling for analytics
BlackBerry can analyze all their data vs. relying on 1% sample for better network capacity trending & management.
The Solution:
• Cloudera Enterprise manages global data set of ~100PB
• Collecting device content, machine-generated log data, audit details
• 90% ETL code base reduction
31. A global retailer’s customers benefit from more personalized communications and offers based on interactions across all channels.
Ask Bigger Questions: How can we offer customers the best experience?
32. Cloudera optimizes the DW for improved ROI
The Challenge:
• Need to correlate online/offline data across disparate, costly legacy DWs
• Takes up to 4 weeks to get data from one group, which inhibits productivity
Global retailer’s customers benefit from more personalized communications based on interactions across all channels.
The Solution:
• Cloudera Enterprise with Impala: 1PB over 250 nodes
• Consolidated platform for Big Data with single environment for query and machine learning