Cloudera Breakfast Series, Analytics Part 1: Use All Your Data

Advanced Analytics Part I: Use All Your Data

DC: Rob Morrow, Senior Systems Engineer
MD: Chris Bove, Senior Systems Engineer

August 6

From BI to Advanced Analytics

What happened, where, and when?
What will happen?
How and why did it happen?
How can we do better?

[Chart labels: Time, Data Size, Facts, Interpretations]

Traditional Analytics Process

[Process diagram stages, in extraction order:]
Operationalize Model
In-Database Model Scoring
Data Cleansing & Processing
Data Extraction
Data Exploration & Discovery
In-Memory Model Development

Time-to-Insight

Accessing & Sharing the Data is Difficult

[Diagram labels: DW, External, Multi-structured, Structured]

“Are we there yet?”

1. Find the data
2. Get access to data
3. Learn about the data
4. Move sample data to ADW
5. Analysis. Finally!
6. Operationalize the model

Data Discovery: 6-9 Months

Siloed Platforms Challenge Collaboration

[Diagram labels: Data Sources; Enterprise Apps; Departmental Warehouse (x2); Siloed Analytics (x2); Reporting; Non-Agile Models; Prioritized Operational Processes]

Static schemas accrete over time

1. Find the data
2. Get access to data
3. Learn about the data
4. Move to ADW
5. Analysis. Finally!
6. Operationalize the model

6-9 Months

Users & Business Influencers
Data Scientist, Business Analysts
“I’m sick of waiting for my data, I’m going to make my own copy.”

Technical Influencers
DBA/DW Admins
“I need to get those data scientists the data they want, or else they will stand up another data mart, I will have to manage it sooner or later.”

Executives
Executive Sponsors, LOB Manager (PM, Director, R&D, etc.)
“We don’t have the information we need to answer key business questions.”

Ouch!

Solution: Cloudera EDH

[Platform diagram layers:]
Unified Scale-out Storage For Any Type of Data
Elastic, Fault-tolerant, Self-healing, In-memory capabilities
Resource Management
Batch Processing | Analytic MPP DBMS | Search Engine | Online NoSQL DBMS | Stream Processing | Machine Learning
SQL | Streaming | File System (NFS)
Data Management
System Management
Metadata, Security, Audit, Lineage
Training & Services

[Callouts, in extraction order:]
Search
Faster data discovery
Navigator
Multiple tools on one platform
Impala, Spark, Hadoop MapReduce
Use all data with centralized mgmt & security
Metadata, Security
Cloudera Manager
Training & Services
Operationalize Models
Flume / Spark Streaming
HBase

Enterprise Data Hub

Solution: Cloudera EDH

[Platform diagram layers:]
Unified Scale-out Storage For Any Type of Data
Elastic, Fault-tolerant, Self-healing, In-memory capabilities
Resource Management
Batch Processing | Analytic MPP DBMS | Search Engine | Online NoSQL DBMS | Stream Processing | Machine Learning
SQL | Streaming | File System (NFS)
Data Management
System Management
Metadata, Security, Audit, Lineage
Training & Services

Analytics with EDH

[Process diagram, first row:]
Operationalize Model
In-Database Model Scoring
Data Cleansing & Processing
Data Exploration & Discovery
In-Memory Model Development
Time-to-Insight

[Process diagram, second row:]
Data Exploration & Discovery
Data Cleansing & Processing
Operationalize Model
Data Extraction
In-Platform Model Dev & Scoring
Deliver Insight Sooner

Solution Benefits

Use all your data
•  Use 100x more data, and more types of data, with existing tools
•  Reduce sampling and increase model accuracy and precision
•  Centralize information security, metadata, management, and governance

Shorten analytics lifecycle
•  Compress the cycle time from data to insights
•  Facilitate data discovery with real-time SQL and Search
•  Track data life-cycle in place
•  Define, test, deploy, and update models all within the EDH

Do more with data
•  Deliver multi-genre analytics in a single platform
•  Apply diverse concurrent analytics to your full datasets in-place
•  Protect existing technology and skillset investments

Business Value Delivered

Buyers

Users & Business Influencers: Data Scientist, Business Analysts
“I’m sick of waiting for my data, I’m going to make my own copy.”
•  Acquire data necessary for projects
•  Develop analysis/models with better fit faster
•  Share data sets to empower others

Technical Influencers: DBA/DW Admins
“I need to get those data scientists the data they want, or else they will stand up another data mart, which I will have to manage sooner or later.”
•  Spend less time and money reconciling shadow IT environments
•  Shared security, metadata, management, and governance

Executives: Executive Sponsors, LOB Manager (Marketing, Sales, R&D, etc.)
“We don’t have the information we need to answer key business questions.”
•  Acquire necessary information sooner to make critical business decisions

Data Access: Stores and Connectors

[Diagram labels, in extraction order:]
Thrift, pdf/Word/txt, csv
CONNECTORS: ORACLE, NETEZZA, ODBC/JDBC, TERADATA, MongoDB, Splunk/Hunk, MICROSTRATEGY
IMPALA, HBASE, SOLR, SPARK, ACCUMULO, ZoomData
Hive, Sqoop, Flume
Partner Native Connectors
Revolution R, Parquet, Sequence, JSON, Binary, SkyTree, Avro

Historical Archive: Tape vs Data

Aggregate Data Value
•  Direct access to data has value; data stored offsite/offline has cost
•  A single 8k record may have nearly zero value, but 10,000? 10,000,000?
•  What is the business value of testing the predictive power of current data?

Data Availability
•  Assuming locality, is 110MB per drive fast enough?
•  Certainly not fast enough to be included in any current analytics.
•  Striping across tape drives is science fiction. Complex tiers, anyone?
•  I/O IS the problem. Not CPU.

Data Volume/Cost
•  Everyone practices backups. How about restores? Full site restores?
•  Can’t we just more aggressively compress online data?
•  “Tape is cheap”. It had better be, because the data isn’t easily usable.
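
To put the "is 110MB per drive fast enough?" question in numbers, here is a rough back-of-the-envelope sketch. It reads the slide's figure as about 110 MB/s of sequential read throughput per drive, which is an assumption (the slide gives no time unit), and the 10 TB archive size and drive counts are hypothetical.

```python
# Assumption: "110MB per drive" means ~110 MB/s sustained sequential read per drive.
MB_PER_SECOND_PER_DRIVE = 110

def hours_to_scan(terabytes: float, drives: int = 1) -> float:
    """Hours needed to read `terabytes` once, spread evenly over `drives`."""
    megabytes = terabytes * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = megabytes / (MB_PER_SECOND_PER_DRIVE * drives)
    return seconds / 3600

# Hypothetical 10 TB archive: one drive versus 100 drives read in parallel.
one_drive = hours_to_scan(10, drives=1)
hundred_drives = hours_to_scan(10, drives=100)
print(f"10 TB on 1 drive:    {one_drive:.1f} h")
print(f"10 TB on 100 drives: {hundred_drives:.2f} h")
```

Under these assumptions a single drive needs roughly a day for one pass, while the same scan spread across many drives takes minutes, which is the sense in which I/O, not CPU, is the constraint.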

Spark Streaming: What is it?

Spark Streaming processes data in micro-batches: Resilient Distributed Datasets (RDDs)
Consistent with HDFS architectural principles
Processing individual records (AKA Storm) can create inconsistencies (simultaneous writes).

What can you do with it?: Stream It

Streaming “Windows” allow time-sliced atomic updates to Analytics
Discretized Stream (DStream): Sequence of RDDs arranged as lines/words
Window: Sequence of DStreams time-arranged as windows
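
The micro-batch and window idea can be sketched without a cluster. The following is plain Python, not the Spark Streaming API: the function names are hypothetical, and batches are cut by record count rather than by time purely to keep the example self-contained.

```python
from collections import Counter, deque

def micro_batches(records, batch_size):
    """Group an incoming record sequence into fixed-size micro-batches."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def windowed_word_counts(batches, window_length):
    """After each batch, yield word counts over the last `window_length` batches."""
    window = deque(maxlen=window_length)  # the oldest batch drops out automatically
    for batch in batches:
        # Each batch is counted as a unit, so results only change at batch boundaries.
        window.append(Counter(word for line in batch for word in line.split()))
        total = Counter()
        for counts in window:
            total.update(counts)
        yield total

lines = ["a b", "b c", "c d", "d e", "e f", "f a"]
results = list(windowed_word_counts(micro_batches(lines, 2), window_length=2))
for i, counts in enumerate(results):
    print(i, dict(counts))
```

Because every update covers a whole batch, a reader of the counts never sees a half-applied record, which is the "atomic update" property the slide points at.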

What can you do with it?: ML

“What about Machine Learning?”

Spark-ML: Same input format and algorithms as Mahout.
Uses Resilient Distributed Datasets in-memory

Useful for:
Clustering (k-Means, etc.)
Classification (email, sentiment)
Recommenders (ratings correlation)
Dimensionality Reduction (PCA, SVD)

Model Effectiveness and Sampling

•  Some statisticians (medical) find it hard to turn the corner on the sampling topic:
•  ANOVA vs Multiple Regression. Same tests**, one’s a vector without the Power problems
•  Algorithm choice should be related to, not restricted by, data volume.
•  Best approach = simple algorithm, lots of data
•  Sampling should still be used, but to test model effectiveness. Not to fix IT.

**Source: Applied Multiple Regression/Correlation Analysis (Cohen & Cohen, 1983)
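
One way to read "sampling to test model effectiveness" is a held-out test sample: fit on most of the data, then measure on rows the model never saw. The sketch below is only an illustration; the toy data, the threshold "model", and the function names are all hypothetical.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predict, labelled_rows):
    """Fraction of rows where the prediction matches the label."""
    correct = sum(1 for features, label in labelled_rows if predict(features) == label)
    return correct / len(labelled_rows)

# Toy data: one numeric feature; the label is True when the feature is >= 50.
rows = [((x,), x >= 50) for x in range(100)]
train_rows, test_rows = holdout_split(rows)

# A deliberately simple model: the smallest positive value seen in training.
threshold = min(features[0] for features, label in train_rows if label)
test_accuracy = accuracy(lambda features: features[0] >= threshold, test_rows)
print(f"threshold={threshold}, held-out accuracy={test_accuracy:.2f}")
```

The model is trained on all the training rows rather than on a small sample of them; the sample exists only to check how well the fitted model generalizes.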

Which dataset offers better predictive power?

Remember, this is not testing for an effect…

Dataset 1 (sparse; the column positions of the filled cells are not recoverable from the slide)
Columns: Alice, Bob, Chuck, Donna, Eddie, Frank, Gina
Uses work computer for shopping:   1, 4, 5, 1
Moves data between networks:       4, 5, 2
Works long hours:                  4, 3, 3
System/Network admin privs:        5

Dataset 2 (fully populated)
                                   Alice  Bob  Chuck  Donna  Eddie  Frank  Gina
Uses work computer for shopping      1     4     2      4      5      1     3
Moves data between networks          4     3     1      5      1      4     3
Works long hours                     2     4     3      3      4      3     2
System/Network admin privs           1     2     1      5      3      5     4

1. As we add dimensions, average distance increases. Add Data.
2. Fewer “neighbors” within a certain radius of any given point when the dataset is smaller. Add Data.
3. Are you looking at similarity (r/cosine) or are you using dissimilarity (Euclidean)?
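
Points 1 and 3 can be checked directly on the fully populated table from the slide. This is a small sketch in plain Python (3.8 or later for `math.dist`); the helper names are hypothetical, and the ratings are copied from that table with features ordered shopping, moves data, long hours, admin privileges.

```python
import math
from itertools import combinations

# One vector per person: (shopping, moves data, long hours, admin privs).
people = {
    "Alice": (1, 4, 2, 1),
    "Bob":   (4, 3, 4, 2),
    "Chuck": (2, 1, 3, 1),
    "Donna": (4, 5, 3, 5),
    "Eddie": (5, 1, 4, 3),
    "Frank": (1, 4, 3, 5),
    "Gina":  (3, 3, 2, 4),
}

def mean_pairwise_distance(vectors, dims):
    """Average Euclidean distance over all pairs, using only the first `dims` features."""
    pairs = list(combinations(vectors, 2))
    return sum(math.dist(a[:dims], b[:dims]) for a, b in pairs) / len(pairs)

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

vectors = list(people.values())
mean_by_dims = [mean_pairwise_distance(vectors, d) for d in range(1, 5)]
print("mean distance with 1..4 features:", [round(m, 2) for m in mean_by_dims])

euclid_alice_frank = math.dist(people["Alice"], people["Frank"])
cosine_alice_frank = cosine_similarity(people["Alice"], people["Frank"])
print("Alice vs Frank: Euclidean", round(euclid_alice_frank, 3),
      "cosine", round(cosine_alice_frank, 3))
```

Each added feature can only add a non-negative term to every squared distance, so the average never shrinks as dimensions are added. The Alice and Frank comparison shows how a dissimilarity measure and a similarity measure describe the same pair on different scales.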

Algorithms: Clustering

Sort documents, emails, objects by text class and group terms/documents into distinct categories. Produce visualization.

Question: What’s an emerging topic among users?
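
As a rough illustration of the grouping step, here is a tiny k-means sketch in plain Python. It is not the Mahout or Spark implementation; the 2-D points stand in for hypothetical term-count vectors, and the starting centroids are chosen by hand so the run is deterministic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical documents as (count of term X, count of term Y).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

With real text the vectors would have one dimension per term, and a cluster whose membership grows over time is one candidate answer to the "emerging topic" question.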

Algorithms: Naïve Bayesian Classifier

Given a training set, sort documents by content: Spam/Not, Religion/Politics/Art, etc.

Question: Which content “looks like” other content?
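
A minimal sketch of the idea, assuming a multinomial Naive Bayes with add-one smoothing over whitespace-split words. The training sentences and function names are made up for illustration; a production classifier would need real tokenization and far more data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (label, text). Returns (label counts, per-label word counts, vocabulary)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the probability.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("spam", "win cash prize now"),
    ("spam", "cheap cash offer now"),
    ("not spam", "meeting agenda for monday"),
    ("not spam", "project status meeting notes"),
]
model = train_naive_bayes(training)
print(classify("cash prize offer", model))
print(classify("monday meeting notes", model))
```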

Algorithms: Recommender Systems

•  User-based filtering for cold start (AKA “likes”)
•  Item-based (user similarity) filtering once there is sufficient user data

Question: If user thinks “A” is useful, how about “B”, “C”?
How similar is one user’s pattern to another?
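
The "if A is useful, how about B or C?" question can be sketched as item-to-item cosine similarity over a ratings table. The users, items, and ratings below are hypothetical, and this is one simple similarity choice rather than the method any particular library uses.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5},
    "u3": {"A": 1, "C": 5},
    "u4": {"B": 1, "C": 4},
}

def item_vector(item):
    """All ratings given to one item, keyed by user."""
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(v1, v2):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(v1) & set(v2)
    if not shared:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in shared)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2)

sim_ab = cosine(item_vector("A"), item_vector("B"))
sim_ac = cosine(item_vector("A"), item_vector("C"))
print(f"sim(A, B) = {sim_ab:.3f}")
print(f"sim(A, C) = {sim_ac:.3f}")
```

In this toy table people who rate A highly also rate B highly, so B would be suggested ahead of C. Swapping the roles of users and items in the same function gives the user-to-user comparison from the second question.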

Easily Convert between bits/bytes and numbers/words with Avro

•  Serialization
•  Expressive
    •  Records, arrays, unions, enums
•  Efficient
    •  Compact binary, compressed, splittable
•  Interoperable
    •  Langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
    •  Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
•  Dynamic
    •  Can read & write w/o generating code first
•  Evolvable
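
Avro schemas are written as JSON, so the "expressive" types can be shown with nothing more than Python's standard `json` module. The record below is a hypothetical example (the `UserEvent` name, namespace, and fields are invented), and the sketch only builds and round-trips the schema text; it does not call an Avro library.

```python
import json

# Hypothetical schema using a record, an enum, an array, and a union.
user_event_schema = {
    "type": "record",
    "name": "UserEvent",
    "namespace": "example.analytics",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "action",
         "type": {"type": "enum", "name": "Action",
                  "symbols": ["VIEW", "CLICK", "PURCHASE"]}},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        # A union with null plus a default is the usual way to add an optional field later.
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
}

schema_json = json.dumps(user_event_schema, indent=2)
print(schema_json)
```

The optional `referrer` field hints at the "evolvable" bullet: readers using the newer schema can still read older records, filling in the default for the missing field.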

Query results from large analyses in Impala

•  Brings real-time query capabilities to Hadoop
•  It’s fast! Natively written in C++
•  Same great SQL query language as Hive

Analytics to users: HUE

•  Included in EDH
•  Multi-capability interface for analytics
•  Interactive graph libraries
•  Customizable Search, Impala, Hive, Pig Apps
•  But also: Tableau, Pentaho, Platfora, ZoomData, SAS…

Cloudera Manager

End-to-End Administration for CDH

1. Manage: Easily deploy, configure & optimize clusters
2. Monitor: Maintain a central view of all activity
3. Diagnose: Easily identify and resolve issues
4. Integrate: Use Cloudera Manager with existing tools

Enterprise Services

Ingestion & ETL Pilot
Reference implementation up to 3 sources, 5 transformations, 1 target
Create, execute, test, and review a custom ingestion/ETL plan

Security Integration
Implementation of role based access control with the data processing environment

Hadoop Cluster Deployment Certification
Fully review hardware, data sources, typical jobs, and existing SLAs
Develop, implement, benchmark, and document Hadoop deployment

Path to Success – Services & Training

Hadoop Cluster Deployment Certification: 1 week
Ingestion & ETL Pilot: 2 weeks
Security Integration: 1 week
Cloudera Admin Training: 3 days
Hive/Pig Training: 2 days
Data Science: 3 days
Developer Training: 4 days

Nominations are open for the Data Impact Awards!

•  Winners will receive:
    •  Free Strata + Hadoop World pass
    •  Free seat to any public Cloudera University Training
    •  Invitation to exclusive awards dinner
    •  Bragging rights

Submission deadline: September 12th

©2014 Cloudera, Inc. All rights reserved.
