Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
1. Advanced Analytics Part I: Use All Your Data
DC: Rob Morrow, Senior Systems Engineer
MD: Chris Bove, Senior Systems Engineer
August 6
2. From BI to Advanced Analytics
• What happened, where, and when?
• What will happen?
• How and why did it happen?
• How can we do better?
Chart labels: Time; Data Size; Facts; Interpretations
3. Traditional Analytics Process
Stages shown (Time-to-Insight):
• Operationalize Model
• In-Database Model Scoring
• Data Cleansing & Processing
• Data Extraction
• Data Exploration & Discovery
• In-Memory Model Development
4. Accessing & Sharing the Data is Difficult
Data sources shown: DW; External; Multi-structured; Structured
5. “Are we there yet?”
1. Find the data
2. Get access to data
3. Learn about the data
4. Move sample data to ADW
5. Analysis. Finally!
6. Operationalize the model
Data Discovery: 6-9 Months
7. Process steps (6-9 Months): 1. Find the data; 2. Get access to data; 3. Learn about the data; 4. Move to ADW; 5. Analysis. Finally!; 6. Operationalize the model
Users & Business Influencers (Data Scientist, Business Analysts):
“I’m sick of waiting for my data, I’m going to make my own copy.”
Technical Influencers (DBA/DW Admins):
“I need to get those data scientists the data they want, or else they will stand up another data mart, and I will have to manage it sooner or later.”
Ouch!
Executives (Executive Sponsors, LOB Manager (PM, Director, R&D, etc.)):
“We don’t have the information we need to answer key business questions.”
8. Solution: Cloudera EDH
Platform diagram labels:
• Unified Scale-out Storage For Any Type of Data: Elastic, Fault-tolerant, Self-healing, In-memory capabilities
• Resource Management
• Batch Processing; Analytic MPP DBMS; Search Engine; Online NoSQL DBMS; Stream Processing; Machine Learning
• SQL; Streaming; File System (NFS)
• Data Management; System Management
• Metadata, Security, Audit, Lineage
• Training & Services
Callouts:
• Faster data discovery: Search, Navigator
• Multiple tools on one platform: Impala, Spark, Hadoop MapReduce
• Use all data with centralized mgmt & security: Metadata, Security, Cloudera Manager, Training & Services
• Operationalize Models: Flume / Spark Streaming, HBase
9. Enterprise Data Hub
Solution: Cloudera EDH
Platform diagram with the same labels as slide 8.
10. Analytics with EDH
Traditional process (Time-to-Insight): Operationalize Model; In-Database Model Scoring; Data Cleansing & Processing; Data Exploration & Discovery; In-Memory Model Development
With EDH (Deliver Insight Sooner): Data Exploration & Discovery; Data Cleansing & Processing; Operationalize Model; Data Extraction; In-Platform Model Dev & Scoring
11. Solution Benefits
Use all your data
• Use 100x more data, and more types of data, with existing tools
• Reduce sampling and increase model accuracy and precision
• Centralize information security, metadata, management, and governance
Shorten analytics lifecycle
• Compress the cycle time from data to insights
• Facilitate data discovery with real-time SQL and Search
• Track data life-cycle in place
• Define, test, deploy, and update models all within the EDH
Do more with data
• Deliver multi-genre analytics in a single platform
• Apply diverse concurrent analytics to your full datasets in-place
• Protect existing technology and skillset investments
12. Business Value Delivered
Users & Business Influencers (Data Scientist, Business Analysts)
“I’m sick of waiting for my data, I’m going to make my own copy.”
• Acquire data necessary for projects
• Develop analysis/models with better fit faster
• Share data sets to empower others
Technical Influencers (DBA/DW Admins)
“I need to get those data scientists the data they want, or else they will stand up another data mart, which I will have to manage sooner or later.”
• Spend less time and money reconciling shadow IT environments
• Shared security, metadata, management, and governance
Executives, Buyers (Executive Sponsors, LOB Manager (Marketing, Sales, R&D, etc.))
“We don’t have the information we need to answer key business questions.”
• Acquire necessary information sooner to make critical business decisions
14. Historical Archive: Tape vs Data
Aggregate Data Value
• Direct access to data has value; data stored offsite/offline has cost
• A single 8k record may have nearly zero value, but 10,000? 10,000,000?
• What is the business value of testing the predictive power of current data?
Data Availability
• Assuming locality, is 110MB per drive fast enough?
• Certainly not fast enough to be included in any current analytics.
• Striping across tape drives is science fiction. Complex tiers, anyone?
• I/O IS the problem. Not CPU.
Data Volume/Cost
• Everyone practices backups. How about restores? Full site restores?
• Can’t we just more aggressively compress online data?
• “Tape is cheap”. It had better be, because the data isn’t easily usable.
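To put the "is 110MB per drive fast enough?" question in numbers, here is a rough back-of-the-envelope sketch. It assumes the slide's figure means about 110 MB per second of sequential read per drive; the function name, the decimal units, and the 1 TB example size are illustrative assumptions, not figures from the talk.

```python
def hours_to_read(total_bytes, mb_per_second_per_drive=110, drives=1):
    """Estimate hours needed to read total_bytes sequentially.

    Assumes decimal megabytes (1 MB = 1,000,000 bytes) and perfectly
    parallel reads when more than one drive is used.
    """
    bytes_per_second = mb_per_second_per_drive * 1_000_000 * drives
    return total_bytes / bytes_per_second / 3600


one_terabyte = 10**12  # hypothetical archive size

# One drive working alone versus ten drives reading in parallel.
print(f"1 drive:   {hours_to_read(one_terabyte):.2f} hours per TB")
print(f"10 drives: {hours_to_read(one_terabyte, drives=10):.2f} hours per TB")
```

Under these assumptions a single drive needs roughly two and a half hours per terabyte, which is why the slide treats I/O, not CPU, as the limiting factor.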
15. Spark Streaming: What is it?
• Spark Streaming processes data in micro-batches of Resilient Distributed Datasets (RDDs)
• Consistent with HDFS architectural principles
• Processing individual records creates inconsistencies (simultaneous writes), AKA Storm.
16. What can you do with it?: Stream It
• Streaming “Windows” allow time-sliced atomic updates to analytics
• Discretized Stream (DStream): sequence of RDDs arranged as lines/words
• Window: sequence of DStreams, time-arranged as windows
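The windowing idea can be sketched without a cluster. The snippet below is a plain-Python illustration, not the Spark Streaming API: each micro-batch is a list of lines, and a sliding window combines the word counts of the most recent batches. The function name, the batch contents, and the window length are all hypothetical.

```python
from collections import Counter, deque


def windowed_word_counts(batches, window_length):
    """Yield word counts over the last `window_length` micro-batches.

    `batches` is an iterable of micro-batches; each micro-batch is a
    list of text lines, standing in for one RDD of a DStream.
    """
    window = deque(maxlen=window_length)  # oldest batch drops off automatically
    for batch in batches:
        window.append(Counter(word for line in batch for word in line.split()))
        total = Counter()
        for counts in window:
            total.update(counts)
        yield dict(total)


# Three hypothetical micro-batches of log lines.
batches = [["error disk"], ["error net"], ["ok"]]
results = list(windowed_word_counts(batches, window_length=2))
for i, counts in enumerate(results):
    print(f"window ending at batch {i}: {counts}")
```

Each yielded dictionary is computed from whole batches only, which mirrors the slide's point that updates arrive in time-sliced units rather than record by record.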
17. What can you do with it?: ML
“What about Machine Learning?”
Spark-ML: Same input format and algorithms as Mahout. Uses Resilient Distributed Datasets in-memory.
Useful for:
• Clustering (k-Means, etc.)
• Classification (email, sentiment)
• Recommenders (ratings correlation)
• Dimensionality Reduction (PCA, SVD)
18. Model Effectiveness and Sampling
• Some statisticians (medical) find it hard to turn the corner on the sampling topic:
  • ANOVA vs Multiple Regression. Same tests**, one’s a vector without the power problems
• Algorithm choice should be related to, not restricted by, data volume.
• Best approach = simple algorithm, lots of data
• Sampling should still be used, but to test model effectiveness. Not to fix IT.
**Source: Applied Multiple Regression/Correlation Analysis (Cohen & Cohen, 1983)
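One way to read "sample to test model effectiveness" is a hold-out check: fit a simple model on the bulk of the data and use a random sample only to measure how well it predicts. The sketch below assumes synthetic data (y = 2x + 1 plus noise), a one-variable least-squares line, and a 200-row hold-out; none of these choices come from the talk.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared prediction error of the line on `points`."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = []
for _ in range(1000):
    x = rng.uniform(0, 10)
    data.append((x, 2 * x + 1 + rng.gauss(0, 1)))  # synthetic, noisy line

rng.shuffle(data)
holdout, training = data[:200], data[200:]  # the sample is used only for testing

slope, intercept = fit_line(training)
holdout_error = mean_squared_error(holdout, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} holdout MSE={holdout_error:.3f}")
```

The model sees all of the training rows; the sample exists to estimate effectiveness, not to shrink the data so that it fits the infrastructure.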
19. Which dataset offers better predictive power?
Remember, this is not testing for an effect…
Dataset 1 (Alice, Bob, Chuck, Donna, Eddie, Frank, Gina; ratings present for only some people):
Uses work computer for shopping: 1, 4, 5, 1
Moves data between networks: 4, 5, 2
Works long hours: 4, 3, 3
System/Network admin privs: 5
Dataset 2:
                                Alice  Bob  Chuck  Donna  Eddie  Frank  Gina
Uses work computer for shopping     1    4      2      4      5      1     3
Moves data between networks         4    3      1      5      1      4     3
Works long hours                    2    4      3      3      4      3     2
System/Network admin privs          1    2      1      5      3      5     4
1. As we add dimensions, average distance increases. Add data.
2. Fewer “neighbors” within a certain radius of any given point when the dataset is smaller. Add data.
3. Are you looking at similarity (r/cosine) or are you using dissimilarity (Euclidean)?
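Points 1 and 3 can be checked on the fully populated dataset. The sketch below assumes the values map to people in left-to-right reading order (that assignment is an assumption), computes the average pairwise Euclidean distance as features are added one at a time, and contrasts Euclidean distance with cosine similarity for one pair.

```python
import math
from itertools import combinations

# Ratings per person, in the order: shopping, moves data, long hours, admin privs.
# Column assignment is assumed from the reading order of the slide.
people = {
    "Alice": (1, 4, 2, 1),
    "Bob":   (4, 3, 4, 2),
    "Chuck": (2, 1, 3, 1),
    "Donna": (4, 5, 3, 5),
    "Eddie": (5, 1, 4, 3),
    "Frank": (1, 4, 3, 5),
    "Gina":  (3, 3, 2, 4),
}


def cosine_similarity(a, b):
    """Similarity of direction: 1.0 means the rating patterns are proportional."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def average_distance(vectors, dims):
    """Mean Euclidean distance over all pairs, using only the first `dims` features."""
    pairs = list(combinations(vectors, 2))
    return sum(math.dist(a[:dims], b[:dims]) for a, b in pairs) / len(pairs)


vectors = list(people.values())
averages = [average_distance(vectors, dims) for dims in range(1, 5)]
for dims, avg in enumerate(averages, start=1):
    print(f"{dims} feature(s): average pairwise distance {avg:.2f}")

a, f = people["Alice"], people["Frank"]
print(f"Alice vs Frank: Euclidean {math.dist(a, f):.2f}, cosine {cosine_similarity(a, f):.2f}")
```

Adding a feature can only add a non-negative term under the square root, so the average distance never shrinks as dimensions grow; more rows are what keep neighborhoods populated.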
20. Algorithms: Clustering
Sort documents, emails, objects by text class and group terms/documents into distinct categories. Produce visualization.
Question: What’s an emerging topic among users?
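As a minimal sketch of the grouping step, here is a small k-means in plain Python. It is not the Mahout or Spark implementation; the two-dimensional points stand in for hypothetical document feature vectors, and starting from the first k points is a deliberately naive initialization.

```python
import math


def kmeans(points, k, iterations=20):
    """Very small k-means: returns (centroids, label per point)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points

    def nearest(p):
        return min(range(k), key=lambda i: math.dist(p, centroids[i]))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]

    return centroids, [nearest(p) for p in points]


# Hypothetical feature vectors, e.g. counts of two terms per document.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Points that share a label form one category; a topic would show up as "emerging" when a new cluster appears or an existing one grows between runs.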
21. Algorithms: Naïve Bayesian Classifier
Given a training set, sort documents by content: Spam/Not, Religion/Politics/Art, etc.
Question: Which content “looks like” other content?
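A minimal sketch of the idea, assuming a bag-of-words multinomial model with add-one smoothing. The training sentences, labels, and function names are invented for illustration and are not taken from the talk or from any library.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log-probability for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothing avoids zero probabilities
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule today", "not_spam"),
    ("project meeting notes", "not_spam"),
])
print(classify("cheap money now", model))
print(classify("project meeting today", model))
```

New content is assigned to whichever class's word frequencies it "looks like" most, which is the question the slide poses.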
22. Algorithms: Recommender Systems
• User-based filtering for cold start (AKA “likes”)
• Item-based (user similarity) filtering once there is sufficient user data
Question: If user thinks “A” is useful, how about “B”, “C”? How similar is one user’s pattern to another?
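The "how similar is one user's pattern to another" question can be sketched with a Pearson correlation over the items two users have both rated. The users, items, ratings, and the "borrow from the single nearest user" rule below are illustrative assumptions, not the method from the talk.

```python
import math


def pearson(a, b):
    """Pearson correlation over items rated by both users (0.0 if undefined)."""
    shared = [item for item in a if item in b]
    n = len(shared)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in shared) / n
    mean_b = sum(b[i] for i in shared) / n
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in shared)
    var_a = sum((a[i] - mean_a) ** 2 for i in shared)
    var_b = sum((b[i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def recommend(target, ratings):
    """Find the most similar other user and list their items the target has not rated."""
    others = [user for user in ratings if user != target]
    nearest = max(others, key=lambda user: pearson(ratings[target], ratings[user]))
    unseen = {item: score for item, score in ratings[nearest].items()
              if item not in ratings[target]}
    return nearest, sorted(unseen, key=unseen.get, reverse=True)


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "ben": {"A": 5, "B": 5, "C": 1, "D": 4, "E": 2},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
}
print(recommend("ann", ratings))
```

Here the user who rated "A", "B", and "C" most like the target supplies the candidates, ordered by that user's own ratings.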
24. Query results from large analyses in Impala
• Brings real-time query capabilities to Hadoop
• It’s fast! Natively written in C++
• Same great SQL query language as Hive
25. Analytics to users: HUE
• Included in EDH
• Multi-capability interface for analytics
• Interactive graph libraries
• Customizable Search, Impala, Hive, Pig Apps
• But Also: Tableau, Pentaho, Platfora, ZoomData, SAS…
26. Cloudera Manager: End-to-End Administration for CDH
1. Manage: Easily deploy, configure & optimize clusters
2. Monitor: Maintain a central view of all activity
3. Diagnose: Easily identify and resolve issues
4. Integrate: Use Cloudera Manager with existing tools
28. Enterprise Services
Ingestion & ETL Pilot
• Reference implementation up to 3 sources, 5 transformations, 1 target
• Create, execute, test, and review a custom ingestion/ETL plan
Security Integration
• Implementation of role based access control with the data processing environment
Hadoop Cluster Deployment Certification
• Fully review hardware, data sources, typical jobs, and existing SLAs
• Develop, implement, benchmark, and document Hadoop deployment
29. Path to Success – Services & Training
• Hadoop Cluster Deployment Certification: 1 week
• Ingestion & ETL Pilot: 2 weeks
• Security Integration: 1 week
• Cloudera Admin Training: 3 days
• Hive/Pig Training: 2 days
• Data Science: 3 days
• Developer Training: 4 days