9. Put it away, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
17. The solution?
[Diagram: data scattered across separate stores labeled EDW, Another EDW, Yet Another EDW, Analytical DB, and OLTP]
18. Ummm…you dropped something
[Diagram: even more data piling up around EDW, Another EDW, Yet Another EDW, Analytical DB, and OLTP]
21. Wait, you’ve seen this before.
[Diagram: streams of data flowing through a "Sausage Factory"]
24. “Prices, Stupid passwords, and Boring Statistics.” - Hans Rosling
http://www.youtube.com/watch?v=hVimVzgtD6w
25. Your data silos are lonely places.
[Diagram: separate silos of data labeled EDW, Accounts, Customers, and Web Properties]
26. …Data likes to be together.
[Diagram: data from EDW, Accounts, Customers, and Web Properties gathered in one place]
27. Data likes to socialize too.
[Diagram: EDW, Accounts, Customers, and Web Properties data joined by CDR, Machine Data, Facebook, Twitter, and Weather Data]
28. New types of data don’t quite fit into your pristine view of the world.
[Diagram: Logs and Machine Data arriving at "My Little Data Empire", marked with question marks]
29. To resolve this, some people take hints from Lord Of The Rings...
31. …but that has its problems too.
[Diagram: data passing through several ETL steps into an EDW with a Schema]
39. If you could design a system that would handle this, what would it look like?
40. It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
[Diagram: a grid of Storage nodes]
41. It would probably need a completely parallel processing framework that took tasks to the data…
[Diagram: a grid of nodes, each pairing Processing with Storage]
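As a rough illustration of "taking tasks to the data", here is a toy scheduler in Python. It is only a sketch under assumptions: the node names, the `block_locations` map, and the `assign_tasks` helper are hypothetical and are not part of any Hadoop API. Each task is placed on a node that already stores its block when that node has a free slot, and falls back to a remote node otherwise.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores the block.

    block_locations: dict of block id -> list of node names holding a copy
    free_slots: dict of node name -> number of tasks the node can still accept
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment = {}
    for block, holders in sorted(block_locations.items()):
        # Prefer a node that holds the block and still has a free slot.
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fall back to any node with capacity; the block must cross the network.
            remote = [n for n in sorted(slots) if slots[n] > 0]
            if not remote:
                raise RuntimeError("no free slots left")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Hypothetical cluster: three nodes with one free slot each.
block_locations = {
    "block1": ["node1", "node2"],
    "block2": ["node1", "node3"],
    "block3": ["node1"],
}
free_slots = {"node1": 1, "node2": 1, "node3": 1}
plan = assign_tasks(block_locations, free_slots)
for block, (node, is_local) in sorted(plan.items()):
    print(block, "->", node, "(local)" if is_local else "(remote)")
```

The greedy choice here is deliberately simple and not optimal: block3 ends up remote because node1's only slot was already taken by block1. A real scheduler weighs more factors than this sketch does.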
42. It would probably run on commodity hardware, virtualized machines, and common OS platforms
[Diagram: a grid of nodes, each pairing Processing with Storage]
43. It would probably be open source so innovation could happen as quickly as possible
47. HDFS stores data in blocks and replicates those blocks
[Diagram: block1, block2, and block3 each stored three times across different Processing/Storage nodes]
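The block-and-replica idea can be sketched in a few lines of Python. This is a toy model, not HDFS code: `split_into_blocks`, `place_replicas`, the node names, and the tiny 128-byte block size are all assumptions made for illustration, and the round-robin placement stands in for HDFS's real, rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Return a dict of block index -> list of distinct nodes holding a copy.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


data = b"x" * 300  # stand-in for a file
blocks = split_into_blocks(data, block_size=128)
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
placement = place_replicas(len(blocks), nodes)
for block_id, holders in placement.items():
    print("block", block_id, "->", holders)
```

With three copies per block, every block lives on three different nodes, so losing any single node never removes the only copy of a block.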
48. If a block fails then HDFS always has the other copies and heals itself
[Diagram: one stored copy is marked X as failed; block1, block2, and block3 still each have three copies across the remaining nodes]
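A minimal sketch of the healing step, again as a toy model rather than HDFS's actual mechanism: the `heal` function, the node names, and the placement map are hypothetical. When a node disappears, any block that has fewer live copies than the target is copied from a surviving replica onto another live node.

```python
def heal(placement, live_nodes, replication=3):
    """Drop replicas on dead nodes, then re-copy under-replicated blocks."""
    healed = {}
    for block, holders in placement.items():
        alive = [n for n in holders if n in live_nodes]
        if not alive:
            raise RuntimeError(f"{block}: all replicas lost, nothing to copy from")
        # Candidate targets: live nodes that do not already hold this block.
        spare = [n for n in sorted(live_nodes) if n not in alive]
        missing = max(0, replication - len(alive))
        healed[block] = alive + spare[:missing]
    return healed


placement = {
    "block1": ["node1", "node2", "node3"],
    "block2": ["node2", "node3", "node4"],
    "block3": ["node3", "node4", "node1"],
}
live_nodes = {"node1", "node2", "node3", "node5"}  # node4 has failed
healed = heal(placement, live_nodes)
for block, holders in sorted(healed.items()):
    print(block, "->", holders)
```

Note the limit this sketch makes explicit: healing only works while at least one copy of the block survives.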
49. MapReduce is a programming paradigm that is completely parallel
[Diagram: blocks of data feeding several Mappers, whose output flows to Reducers that each write their own data]
50. MapReduce has three phases: Map, Sort/Shuffle, Reduce
[Diagram: Key, Value pairs entering Mappers, being regrouped, and arriving at Reducers that emit new Key, Value pairs]
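The three phases can be imitated in plain, single-process Python to show how key/value pairs move between them. This is a sketch of the paradigm, not Hadoop's API: the function names `map_phase`, `shuffle_sort`, and `reduce_phase` are made up for illustration, and the word-count job is just a conventional example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Map: turn every input record into zero or more (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_sort(pairs):
    """Sort/Shuffle: bring all values for the same key together, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped, reducer):
    """Reduce: collapse each key's list of values into one output pair."""
    return [reducer(key, values) for key, values in grouped]


def word_mapper(line):
    return [(word, 1) for word in line.lower().split()]


def sum_reducer(key, values):
    return (key, sum(values))


lines = ["data likes to be together", "data likes to socialize too"]
counts = reduce_phase(shuffle_sort(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

In a real cluster the mappers and reducers run on many machines at once; only the shuffle step requires moving data between them.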
51. MapReduce applies to a lot of data processing problems
[Diagram: blocks of data feeding several Mappers, whose output flows to Reducers that each write their own data]
52. MapReduce goes a long way, but not all data processing and analytics are solved the same way
53. Sometimes your data application needs parallel processing and inter-process communication
[Diagram: several Processes, each with its own data, exchanging information with one another]
55. Sometimes your machine learning data application needs to process in memory and iterate
[Diagram: data loaded into a chain of Processes that repeatedly work on it]
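To make "process in memory and iterate" concrete, here is a small gradient-descent loop in Python. It is only an illustration: the `fit_slope` function and the sample points are invented for this sketch. The point is the access pattern, where the same data set is loaded once and then scanned many times, which is awkward to express as a series of independent batch jobs that each re-read their input.

```python
def fit_slope(points, steps=200, learning_rate=0.01):
    """Fit y = w * x by gradient descent, reusing the same in-memory points each pass."""
    w = 0.0
    n = len(points)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in points) / n
        w -= learning_rate * grad
    return w


# Loaded once and kept in memory for every iteration.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = fit_slope(points)
print(round(w, 6))
```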
59. YARN abstracts resource management so you can run more than just MapReduce
[Diagram: MapReduce V2, MapReduce V?, STORM, Giraph, Tez, MPI, HBase, Spark, … and more, running on YARN over HDFS2]
67. Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
68. Tez can chain tasks together into one job to get Map – Reduce – Reduce jobs suitable for things like Hive SQL projections, group by, and order by
[Diagram: blocks of data feeding TezMap tasks, followed by two successive layers of TezReduce tasks that write the final data]
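The Map – Reduce – Reduce shape can be sketched as a chain of stages in plain Python. Nothing here is the Tez API: `run_chain`, the stage functions, and the sample rows are hypothetical, chosen to mirror a query that projects two columns, groups by customer, and orders by the total. The point of the chain is that the second reduce consumes the first reduce's output directly, instead of starting a second job that reads intermediate results back from storage.

```python
from collections import defaultdict


def map_stage(rows):
    """Projection: keep only the columns the query needs, as (key, value) pairs."""
    return [(row["customer"], row["amount"]) for row in rows]


def reduce_group_by(pairs):
    """First reduce: GROUP BY customer with SUM(amount)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def reduce_order_by(totals):
    """Second reduce: ORDER BY total descending, ties broken by customer name."""
    return sorted(totals, key=lambda kv: (-kv[1], kv[0]))


def run_chain(rows, stages):
    """Run the stages back to back, handing each stage's output to the next."""
    result = rows
    for stage in stages:
        result = stage(result)
    return result


rows = [
    {"customer": "a", "amount": 5, "region": "x"},
    {"customer": "b", "amount": 7, "region": "y"},
    {"customer": "a", "amount": 4, "region": "x"},
    {"customer": "c", "amount": 7, "region": "z"},
]
ranked = run_chain(rows, [map_stage, reduce_group_by, reduce_order_by])
print(ranked)
```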
69. Tez can provide long-running containers for applications like Hive to side-step batch process startups you would have with MapReduce