Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
1. Advanced Analytics Part II: Do More with Your Data (and your people too..)
DC: Rob Morrow, Senior Systems Engineer
MD: Richard Im, Senior Systems Engineer
September 3
2. Doing more with Cloudera Enterprise Data Hub and why there is no Competition to EDH.
Do more with data
• Deliver multi-genre analytics in a single platform
• Apply diverse concurrent analytics to your full datasets in-place
• Connect easily to Partner products
• Not more copies of the data; More analytics without moving the data
• The ONLY way to do that is with a single Tool-Rich platform that has Analytics baked-in. It’s more than just Map/Reduce…
• Protect the data products:
  • System Changes (Installation/Upgrades: CM)
  • Data Change (Navigator)
  • Data Over-writes (Spark vs Storm)
  • Unauthorized Access (Sentry, Kerberos, LDAP)
  • Encryption, Data Movement (Rhino, Intel, Gazzang)
3. Business Intelligence
If all we want to know is what has already happened, then BI is a fine answer.
[Figure: axes labeled Time, Facts, Interpretations; plotted question: What happened, where, and when?]
4. How about the Future? How about Root Cause? How about “What Ifs”? How about “fuzzy” multivariate questions?
[Figure: axes labeled Time, Data Size, Facts, Interpretations; plotted questions: What happened, where, and when? How and why did it happen? What will happen? How can we do better?]
5. A drive-by of Statistical options on EDH
Do more with data
• Oryx: Recommender System used for Production Purposes. Real-Time Updates.
• MADlib: In-database Machine Learning libraries for Impala.
• Impyla: Python UDFs inside Impala. Really fast.
• Mahout: Yep, still there. Easy-to-use Bayesian models, Classifiers, Clustering, etc. Requires data wrangling.
• SparkML: Exact same algorithms as Mahout, only faster because it’s in-memory.
• SAS Visual Analytics: Utilizes Cloudera EDH as data hosting platform.
• RHadoop/Revolution Analytics: Quickly becoming de facto Data Science platform due to cost/performance
• H2O: In-Development, but very promising alternative to R/SAS
6. Spark: Why do we need it?
• 10x better performance than M/R
• Less code, written using Scala instead of Java
• Allows foundational improvements that drive many new features: in-memory indexing, etc.
7. How Cloudera Helps…
• Best Data Wrangling Tools on the planet, Period. And all of them at scale!
  • If you can write code in any language available inside Linux (Perl, shell, Java, Python, SQL), you can data wrangle on Cloudera’s EDH. Not to mention partners like Informatica, Pentaho, etc.
• Advanced Analytical Methods baked-in, at-scale and many different ways to utilize them.
  • Cloudera excels at having the right tools to answer tough, fuzzy questions that have an impact on the real world. If we don’t have it, we invent it then give it back!
• Try Another model, improve this one, track changes in results, all on one platform, and securely!
  • The combination of Cloudera EDH’s Navigator, Spark, Mahout, and Sentry means you can run advanced models in an Enterprise-supportable way that allows granular tracking of every step.
• Cloudera has been providing advanced methods to the Government for years.
  • Cloudera is deployed to answer other tough questions all over the Intelligence Community (HUMINT, SIGINT, COMMINT, IMINT)
Which Mission problem do you want to address?
These DO NOT and WILL NOT show up in our marketing collateral. So, talk to your Account SE!
[Figure labels: Prepare Data, Model Data, Mgt, Contextual Modelling]
8. Advanced Analytic Tools already on Cloudera’s Enterprise Data Hub
• Advanced Information Locating methods Available out-of-the-box in Cloudera Search
  • Fuzzy Search finds “boat” and “float” with the same search query and is easily tunable.
  • Batch-query allows users to upload a file with thousands/millions of search terms and which index they want to search, and returns a down-selectable set of data for all terms.
  • In-memory indexing coming soon!
• Graph Analytics Tools Extend EDH to uncover hidden relationships
  • Whether you choose Titan Aurelius, Giraph, or the Spark-enabled GraphX for your Graph analytic workload, Cloudera supports your Mission.
  • Spark technology was invented specifically to address iterative and self-referential data at the speed of memory!
• Machine Learning Methods to Predict future events and Rapidly Categorize current trends
  • Whether you choose a Naïve Bayesian filter to mobilize against Insider Threat, a Classifier to group objects for more effective models, or Principal Components Analysis to reveal the 3 relevant factors out of thousands, Cloudera’s EDH is ready for your workload.
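To make the “boat”/“float” fuzzy-search claim concrete, here is a minimal sketch of edit-distance matching in plain Python. It does not call Cloudera Search; it only illustrates the idea under the assumption that fuzzy matching is driven by an edit-distance threshold. The names `edit_distance`, `fuzzy_match`, and `max_edits` are made up for this illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_match(query, terms, max_edits=2):
    """Return the indexed terms within max_edits of the query (hypothetical helper)."""
    return [term for term in terms if edit_distance(query, term) <= max_edits]


# Toy "index"; the terms are illustrative only.
indexed_terms = ["boat", "float", "bridge"]
print(fuzzy_match("boat", indexed_terms))
```

Raising or lowering `max_edits` is the kind of tuning knob the slide alludes to: a threshold of 1 would no longer return “float” for the query “boat”.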
9. Current Ways to make decisions:
• Highest Paid Person in the Organization: “I think we should go this way…”
OR
• Politics: “People like us more when we go this way…”
OR
• Subject Matter Expert: “My training/intuition says we should go this way…”
VS
• Data Scientist
10. Data Science Teams in the Real World. You already have the people…
• Estimates required capital (people and political), team care/feeding. Chooses alternatives. Fallout of each path?
• Provides legit and relevant hypotheses.
• Provides models/Probabilities
• Provides Code, tools, code. Tools.
11. Strategies that Drive Data Science into your Organization:
• Choose a single Specific Business/Mission problem
  • Never Choose IT first, and then look for a problem it solves. Great way to create a Welfare Program, but the most expensive and lowest payback projects in Government have this approach in common.
• Seek to Disconfirm an Idea/Approach
  • Never begin with an idea/tool you’d like to confirm. Science doesn’t work that way. Sorry.
• Promote Equal Contributions from Areas of Expertise and Acquire Tools as-needed
  • Do not allow micro-steering due to “personalities”. This is quite hard in practice, actually.
• Be honest about Team/Organizational Weaknesses
  • Though it takes slightly longer, creating a data-oriented center of gravity reduces risk, increases effectiveness, and balances contributions while naturally creating independent measures of success. But you can’t begin until you understand the gaps.
12. Let’s Pick One Area and Go Deeper: Principal Components Analysis
If a multivariate dataset is viewed as a set of coordinates in a high-dimensional data space, PCA can supply the user with a lower-dimensionality picture.
If the original data has 1B variables, how few dimensions (principal components) do you need in order to explain 90% of the variance? 80%? 60%?
How few dimensions do you need for a visualization? Humans easily visualize 3D with current technology, but no more.
[Figure: A 25-element high-dimensional space vector (term co-occurrences) of the word "road" rendered in gray scale. Original was 300K terms.]
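The “how few dimensions for 90% of the variance?” question can be sketched in a few lines. This is an illustrative, standard-library-only version that assumes a small in-memory dataset; at the scale discussed here you would reach for a distributed library instead. The toy data, the helper names (`covariance_matrix`, `top_eigenvalues`, `components_needed`), and the use of power iteration with deflation are all choices made for this example, not something taken from the talk.

```python
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvalues(matrix, k, iterations=500):
    """Approximate the k largest eigenvalues of a symmetric positive
    semi-definite matrix using power iteration with deflation."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    rng = random.Random(0)
    eigenvalues = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        mv = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = sum(v[i] * mv[i] for i in range(d))  # Rayleigh quotient
        eigenvalues.append(lam)
        for i in range(d):                         # deflate: remove this component
            for j in range(d):
                m[i][j] -= lam * v[i] * v[j]
    return eigenvalues


def components_needed(rows, target):
    """Smallest number of principal components whose cumulative share of
    total variance reaches the target fraction (e.g. 0.9)."""
    cov = covariance_matrix(rows)
    total = sum(cov[i][i] for i in range(len(cov)))  # trace = total variance
    running = 0.0
    eigenvalues = top_eigenvalues(cov, len(cov))
    for count, lam in enumerate(eigenvalues, start=1):
        running += lam
        if running / total >= target:
            return count
    return len(eigenvalues)


# Toy data: five observed variables driven by two hidden factors plus small noise.
rng = random.Random(42)
rows = []
for _ in range(500):
    a, b = rng.gauss(0, 10), rng.gauss(0, 5)
    rows.append([a, a + rng.gauss(0, 0.1), b, b + rng.gauss(0, 0.1), rng.gauss(0, 0.1)])

for target in (0.9, 0.6):
    print(target, components_needed(rows, target))
```

The toy data is built so that two latent factors carry nearly all of the variance, which is the small-scale analogue of finding the few relevant factors out of thousands.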
13. Real-World: Rescue Objectivity from Subjectivity
People [who want money] complain of an increased focus on DoD training for “culture” over “language”, stating that…
A) “Culture” trains concurrently whenever you train “Language” because the two are linked.
B) This hypothesis isn’t testable because it’s far too subjective and too domain-specific. Assessment requires SMEs, from the “language” skills community, to make recommendations.
Circular logic, huh? :)
Yielding 2 Distinct Hypotheses:
1) “Culture” is focused upon more than “Language” in DoD Policy
2) “Language” is directionally linked to “Culture” (if you acquire “language”, you also acquire “culture”)
14. Real-World: Create Objectivity from the Subjective
I don’t buy it. So, we tested it and published the results… here’s how:
Point 1: As a ratio among Policy docs, “Culture” has fewer mentions than “Language” in every document except 2.
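A per-document mention ratio like the one in Point 1 could be computed roughly as follows. This is only a sketch: the document names and texts are invented placeholders, and matching on the stems "cultur" and "language" is an assumption about how mentions might be counted, not the published method.

```python
import re


def count_mentions(text, stem):
    """Count words in the text that start with the given lowercase stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word.startswith(stem))


def culture_to_language_ratio(text):
    """Ratio of 'culture' mentions to 'language' mentions in one document.
    Returns None when the document never mentions language."""
    language = count_mentions(text, "language")
    culture = count_mentions(text, "cultur")
    if language == 0:
        return None
    return culture / language


# Hypothetical stand-ins for policy documents.
policy_docs = {
    "doc_a": "Language training and language policy shape culture.",
    "doc_b": "Cultural awareness and culture training support language skills.",
}

for name, text in policy_docs.items():
    print(name, culture_to_language_ratio(text))
```

A ratio below 1 means a document mentions “Culture” less often than “Language”, which is the pattern the slide reports for all but two documents.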
15. Algorithms: “Fuzzy” Hypothesis Testing
Point 2a: After building a high-dimensional model based on weighted co-occurrences which uses all English Language topics from Wikipedia to define “meaning”, it turns out the opposite is true.
First, the Representational Density of Culture is Higher Than Language (average distance is lower)
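As a rough illustration of Point 2a, the sketch below builds distance-weighted co-occurrence vectors from a toy corpus and measures how close a word sits to its nearest neighbors. Everything here is an assumption made for the example: the three-sentence corpus, the window size, the 1/distance weighting, cosine distance, and the helper names. The actual model used Wikipedia-scale data and may have defined density differently.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector of weighted co-occurrence counts.
    Words closer together inside the window get a larger weight."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1.0 / abs(i - j)
    return dict(vectors)


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def mean_neighbor_distance(vectors, word, k=3):
    """Average distance from a word to its k nearest neighbors.
    A lower value means the word sits in a denser region of the space."""
    target = vectors[word]
    distances = sorted(cosine_distance(target, vec)
                       for other, vec in vectors.items() if other != word)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Toy corpus, for illustration only.
corpus = [
    "culture shapes language",
    "culture shapes custom",
    "policy funds language training",
]
vectors = cooccurrence_vectors(corpus)
for word in ("culture", "language"):
    print(word, mean_neighbor_distance(vectors, word, k=2))
```

Comparing the two printed averages is the small-scale analogue of the slide’s claim: the word with the lower average distance has the higher representational density.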
16. Algorithms: “Fuzzy” Hypothesis Testing
Point 2b: Using the same model, we grabbed the 20 nearest-neighbors to each word. Then, after reducing the dimensionality, we showed:
Though “Language” tends to include more function words and fewer semantically rich terms, it DOES have “Culture” as its nearest non-derivative neighbor. But the inverse is not true.
Meaning: Knowing something about Culture contributes to Language, but knowing something about Language contributes far less to Culture (and is equivalent to or less than other noise/function words.)
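The one-way relationship in Point 2b comes down to nearest-neighbor lists not being symmetric: A can have B as its closest neighbor while B has several closer neighbors than A. A minimal sketch, using made-up 2-D vectors rather than the real model (which used 20 neighbors and a reduced high-dimensional space); the angles, words, and function names are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(vectors, word, k):
    """The k words most similar to the given word, most similar first."""
    target = vectors[word]
    others = [other for other in vectors if other != word]
    others.sort(key=lambda other: cosine_similarity(target, vectors[other]),
                reverse=True)
    return others[:k]


def is_one_way_neighbor(vectors, a, b, k):
    """True when b is among a's k nearest neighbors but a is not among b's."""
    return (b in nearest_neighbors(vectors, a, k)
            and a not in nearest_neighbors(vectors, b, k))


def unit_vector(degrees):
    """Point on the unit circle at the given angle (toy embedding)."""
    radians = math.radians(degrees)
    return (math.cos(radians), math.sin(radians))


# Made-up embedding: "culture" sits in a crowded neighborhood, "language" does not.
vectors = {
    "language": unit_vector(0),
    "culture": unit_vector(30),
    "custom": unit_vector(33),
    "tradition": unit_vector(36),
    "ritual": unit_vector(39),
}

print(nearest_neighbors(vectors, "language", 1))
print(nearest_neighbors(vectors, "culture", 1))
print(is_one_way_neighbor(vectors, "language", "culture", 1))
```

In this toy layout “language” points to “culture” as its nearest neighbor, while “culture” has closer neighbors of its own, mirroring the directional link described on the slide.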
19. Why do this stuff?
• Results Published in peer-reviewed scientific journal:
  • Abbe & Morrow, “Lexical and Semantic Analysis of Culture and Foreign Language Policies”, Journal of Culture, Language and International Security, May 2014
• It impacts the DoD/Government.
  • A problem exists when someone establishes an oligarchy: “You can’t measure it easily, therefore let me make all the decisions for you.”
• It uses DATA to drive the decision, and not whichever sub-organization is the most popular for this 5 minutes.
  • Inserts harmony and objectivity. It’s a cool “fuzzy” problem.
20. Analytics to users: HUE
• Included in EDH
• Multi-capability interface for analytics
• Interactive graph libraries
• Customizable Search, Impala, Hive, Pig Apps
• But Also: Tableau, Pentaho, Platfora, ZoomData, SAS…
21. Cloudera Manager: End-to-End Administration for CDH
1 Manage: Easily deploy, configure & optimize clusters
2 Monitor: Maintain a central view of all activity
3 Diagnose: Easily identify and resolve issues
4 Integrate: Use Cloudera Manager with existing tools
23. Enterprise Services
• Ingestion & ETL Pilot
  • Reference implementation up to 3 sources, 5 transformations, 1 target
  • Create, execute, test, and review a custom ingestion/ETL plan
• Security Integration
  • Implementation of role based access control with the data processing environment
• Hadoop Cluster Deployment Certification
  • Fully review hardware, data sources, typical jobs, and existing SLAs
  • Develop, implement, benchmark, and document Hadoop deployment
24. Path to Success – Services & Training
• Hadoop Cluster Deployment Certification: 1 week
• Ingestion & ETL Pilot: 2 weeks
• Security Integration: 1 week
• Cloudera Admin Training: 3 days
• Hive/Pig Training: 2 days
• Data Science: 3 days
• Developer Training: 4 days