10. Hadoop

Distributed filesystem
- Files as big as you want
- Horizontal scalability
- Failover

Distributed computing
- MapReduce
- Batch oriented
  - Input files are processed and converted into output files
- Horizontal scalability
11. Easier Hadoop Java API
- But keeping similar efficiency

Common design patterns covered
- Compound records
- Secondary sorting
- Joins

Other improvements
- Instance-based configuration
- First-class multiple input/output

A Tuple MapReduce implementation for Hadoop
12. Tuple MapReduce

Our evolution of Google's MapReduce

Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: "Tuple MapReduce: Beyond Classic MapReduce." In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10-13, 2012.
14. Tuple MapReduce

Main constraint
- The group-by clause must be a subset of the sort-by clause (see the sketch below)

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
- Pangool -> Tuple MapReduce over Hadoop
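Below is a minimal sketch in plain Java (not the Pangool API; the (shop, card) fields are illustrative) of why the constraint makes sense: once tuples are sorted by the sort-by clause, grouping by a subset of it works because every group arrives as a contiguous run of the sorted stream, so a single linear scan can delimit groups.

    import java.util.*;

    // Minimal sketch, not the Pangool API: tuples sorted by (shop, card)
    // can be grouped by (shop) because the group-by fields are a subset
    // of the sort-by fields, making each group contiguous after sorting.
    public class GroupBySubsetDemo {
        record Tuple(String shop, String card) {}

        public static void main(String[] args) {
            List<Tuple> tuples = new ArrayList<>(List.of(
                    new Tuple("Shop 1", "5678"),
                    new Tuple("Shop 2", "1234"),
                    new Tuple("Shop 1", "1234")));

            // Sort-by clause: (shop, card).
            tuples.sort(Comparator.comparing(Tuple::shop).thenComparing(Tuple::card));

            // Group-by clause: (shop). A single scan sees each group whole.
            String currentGroup = null;
            for (Tuple t : tuples) {
                if (!t.shop().equals(currentGroup)) {
                    currentGroup = t.shop();
                    System.out.println("New group: " + currentGroup);
                }
                System.out.println("  " + t);
            }
        }
    }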
17. Voldemort & Hadoop

Benefits
- Scalability & failover
- Updating the database does not affect serving queries
- All data is replaced at each execution
  - Provides agility/flexibility
    - Big development changes are not a pain
  - Makes it easier to survive human errors
    - Fix the code and run again
  - Easy to set up new clusters with different topologies
18. Basic statistics

Easy to implement with Pangool/Hadoop
- One job, grouping by the dimension over which you want to calculate the statistics
- Count, Average, Min, Max, Stdev

Computing several time periods in the same job
- Use the mapper to replicate each datum for each period
- Add a period identifier field to the tuple and include it in the group-by clause (see the sketch below)
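A hedged sketch in plain Java of both ideas, not the actual production job: the map step replicates each datum once per period and tags it with a period identifier, and the reduce step computes count/average/min/max/stdev for one group in a single pass. Shop names, periods, and amounts are illustrative.

    import java.util.*;

    // Minimal sketch: mapper-side replication per period, then one-pass
    // statistics for one (shop, period) group, as a reducer would compute them.
    public class BasicStatsSketch {
        record Datum(String shop, String period, double amount) {}

        public static void main(String[] args) {
            String[] periods = {"daily", "monthly", "yearly"}; // illustrative
            List<Datum> mapped = new ArrayList<>();
            // Map step: one input record becomes one tuple per period, with the
            // period identifier included so it can join the group-by clause.
            for (double amount : new double[]{10, 20, 30}) {
                for (String p : periods) mapped.add(new Datum("Shop 1", p, amount));
            }

            // Reduce step for one group, here (Shop 1, monthly): one-pass stats.
            long count = 0;
            double sum = 0, sumSq = 0;
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (Datum d : mapped) {
                if (!d.period().equals("monthly")) continue;
                count++;
                sum += d.amount();
                sumSq += d.amount() * d.amount();
                min = Math.min(min, d.amount());
                max = Math.max(max, d.amount());
            }
            double avg = sum / count;
            double stdev = Math.sqrt(sumSq / count - avg * avg); // population stdev
            System.out.printf("count=%d avg=%.1f min=%.1f max=%.1f stdev=%.2f%n",
                    count, avg, min, max, stdev);
        }
    }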
19. Distinct count

Possible to compute in a single job
- Using secondary sorting by the field you want to distinct-count on
- Detecting changes on that field

Example
- Group by shop, sort by shop and card (see the sketch below the table)

Shop    Card
Shop 1  1234   <- change: +1
Shop 1  1234
Shop 1  1234
Shop 1  5678   <- change: +1
Shop 1  5678

2 distinct buyers for Shop 1
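A minimal sketch of the change-detection trick, assuming the framework delivers each shop's cards already ordered by the secondary sort; counting distinct buyers then reduces to counting value changes within the group.

    import java.util.List;

    // Sketch: within one group (a shop), the reducer receives cards in
    // sorted order, so counting distinct cards = counting value changes.
    public class DistinctCount {
        public static void main(String[] args) {
            // Cards for Shop 1, already sorted by the sort-by clause.
            List<String> sortedCards = List.of("1234", "1234", "1234", "5678", "5678");
            int distinct = 0;
            String previous = null;
            for (String card : sortedCards) {
                if (!card.equals(previous)) {
                    distinct++;      // value changed: one more distinct buyer
                    previous = card;
                }
            }
            System.out.println(distinct + " distinct buyers"); // prints: 2 distinct buyers
        }
    }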
20. Histograms

Typically a two-pass algorithm
- First pass to detect the minimum and the maximum and determine the bin ranges
- Second pass to count the number of occurrences in each bin

Adaptive histogram
- One pass
- Fixed number of bins
- Bins adapt (see the sketch below)
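The slides do not spell out how the bins adapt, so the following is only one possible one-pass scheme (in the spirit of streaming histograms such as Ben-Haim and Tom-Tov's): keep at most a fixed number of (centroid, count) bins and merge the two closest bins whenever the limit is exceeded.

    import java.util.*;

    // Hedged sketch of a one-pass adaptive histogram: a fixed number of
    // bins whose boundaries adapt to the data as it streams by.
    public class AdaptiveHistogram {
        static final int MAX_BINS = 4;
        // Bin centroid -> count of points absorbed by the bin.
        static final TreeMap<Double, Long> bins = new TreeMap<>();

        static void add(double value) {
            bins.merge(value, 1L, Long::sum);
            if (bins.size() > MAX_BINS) mergeClosest();
        }

        // Merge the two adjacent bins with the smallest gap into their
        // count-weighted centroid, keeping the number of bins fixed.
        static void mergeClosest() {
            Double prev = null, a = null, b = null;
            double best = Double.POSITIVE_INFINITY;
            for (double c : bins.keySet()) {
                if (prev != null && c - prev < best) { best = c - prev; a = prev; b = c; }
                prev = c;
            }
            long ca = bins.remove(a), cb = bins.remove(b);
            bins.merge((a * ca + b * cb) / (ca + cb), ca + cb, Long::sum);
        }

        public static void main(String[] args) {
            for (double v : new double[]{1, 2, 2, 9, 10, 10.5, 30}) add(v);
            System.out.println(bins); // at most MAX_BINS adapted bins
        }
    }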
21. Optimal histogram

Calculate the histogram that best represents the original one, using a limited number of flexible-width bins
- Reduces storage needs
- More representative than fixed-width bins -> better visualization
22. Optimal histogram

Exact algorithm
- Petri Kontkanen, Petri Myllymäki: "MDL Histogram Density Estimation."
  http://eprints.pascal-network.org/archive/00002983/
- Too slow for production use
23. Optimal histogram

Alternative: an approximate algorithm, random-restart hill climbing
- A solution is just a way of grouping the existing bins
- From a solution, you can move to some close solutions
- Some of them are better: they reduce the representation error

Algorithm (see the sketch below)
1. Iterate N times, keeping the best solution:
   1. Generate a random solution
   2. Iterate until no improvement:
      1. Move to the next better possible movement
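A hedged sketch of the random-restart hill climbing just described. The solution encoding (K - 1 cut points that group the original fixed-width bins into K flexible-width ones) and the squared-error objective are illustrative assumptions, not the exact production formulation.

    import java.util.*;

    // Random-restart hill climbing over histogram bin groupings.
    public class HillClimbing {
        static final double[] COUNTS = {5, 7, 6, 40, 42, 41, 3, 2}; // original bins
        static final int K = 3;          // flexible-width bins to keep
        static final int RESTARTS = 20;  // N random restarts
        static final Random RND = new Random(42);

        // Representation error: squared deviation of each original bin
        // from the mean of the group it was merged into.
        static double error(int[] cuts) {
            double err = 0;
            int start = 0;
            for (int g = 0; g <= cuts.length; g++) {
                int end = g < cuts.length ? cuts[g] : COUNTS.length;
                double mean = 0;
                for (int i = start; i < end; i++) mean += COUNTS[i];
                mean /= (end - start);
                for (int i = start; i < end; i++) err += Math.pow(COUNTS[i] - mean, 2);
                start = end;
            }
            return err;
        }

        static boolean valid(int[] cuts) {
            for (int i = 0; i < cuts.length; i++) {
                if (cuts[i] < 1 || cuts[i] >= COUNTS.length) return false;
                if (i > 0 && cuts[i] <= cuts[i - 1]) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            int[] best = null;
            for (int n = 0; n < RESTARTS; n++) {
                // 1. Generate a random solution: K - 1 distinct sorted cut points.
                int[] cuts = RND.ints(1, COUNTS.length).distinct().limit(K - 1)
                                .sorted().toArray();
                // 2. Iterate until no neighboring solution improves the error.
                boolean improved = true;
                while (improved) {
                    improved = false;
                    for (int i = 0; i < cuts.length; i++) {
                        for (int delta : new int[]{-1, 1}) {
                            int[] neighbor = cuts.clone();
                            neighbor[i] += delta;       // move one cut point
                            if (valid(neighbor) && error(neighbor) < error(cuts)) {
                                cuts = neighbor;
                                improved = true;
                            }
                        }
                    }
                }
                if (best == null || error(cuts) < error(best)) best = cuts;
            }
            System.out.println("best cuts: " + Arrays.toString(best)
                    + ", error: " + error(best));
        }
    }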
24. Optimal histogram

Alternative: an approximate algorithm, random-restart hill climbing
- One order of magnitude faster
- 99% accuracy
25. Everything in one job

Basic statistics -> 1 job
Distinct count statistics -> 1 job
One-pass histograms -> 1 job
Several periods & shops -> 1 job

We can put it all together so that computing all statistics for all shops fits into exactly one job.
26. Shop recommendations

Based on co-occurrences
- If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
- Only one co-occurrence is counted, even if a buyer bought several times in A and B
- The top co-occurrences for each shop are its recommendations

Improvements
- The most popular shops are filtered out, because almost everybody buys in them
- Recommendations by category, by location, and by both
- Different calculation periods
27. Shop recommendations

Implemented in Pangool
- Using its counting and joining capabilities
- Several jobs

Challenges
- If somebody bought in many shops, the list of co-occurrences can explode:
  - Co-occurrences = N * (N - 1), where N = number of distinct shops where the person bought
- Alleviated by limiting the total number of distinct shops to consider
- Only the top M shops where the client bought the most are used (see the sketch below)
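A minimal sketch of per-buyer co-occurrence generation with the top-M cap described above; the cap value and shop names are illustrative.

    import java.util.*;

    // Sketch: emit the N * (N - 1) ordered shop pairs for one buyer,
    // capped to the buyer's top M shops to keep the pair count bounded.
    public class CoOccurrences {
        static final int M = 50; // illustrative cap

        // 'shopsByAmount' = the buyer's distinct shops, ordered by amount spent.
        static List<String[]> pairsForBuyer(List<String> shopsByAmount) {
            List<String> top = shopsByAmount.subList(0, Math.min(M, shopsByAmount.size()));
            List<String[]> pairs = new ArrayList<>();
            for (String a : top)
                for (String b : top)
                    if (!a.equals(b)) pairs.add(new String[]{a, b});
            return pairs;
        }

        public static void main(String[] args) {
            // A buyer who bought in three distinct shops -> 3 * 2 = 6 pairs.
            for (String[] p : pairsForBuyer(List.of("A", "B", "C")))
                System.out.println(p[0] + " <-> " + p[1]);
        }
    }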
Future
- Time-aware co-occurrences: the client bought in A and B within a short period of time
28. Some numbers

Estimated resources needed with 1 year of data
- 270 GB of stats to serve
- 24 large instances
- ~11 hours of execution
- $3500/month
  - Optimizations still possible
  - Cost without the use of reserved instances
  - Probably cheaper with an in-house Hadoop cluster
29. Conclusion

It was possible to develop a Big Data solution for a bank
- With low use of resources
- Quickly
- Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases

The solution is
- Scalable
- Flexible/agile: improvements are easy to implement
- Prepared to withstand human failures
- Reasonably priced

Main advantage: always doing everything (all data is recomputed and replaced on each execution)
30. Future: Splout

Key/value datastores have limitations
- They only accept querying by the key
- Aggregations are not possible
- In other words, we are forced to pre-compute everything
  - Not always possible -> the data explodes
  - For this particular case, time ranges are fixed

Splout: like Voldemort, but SQL!
- The idea: replace Voldemort with Splout SQL
- Much richer queries: real-time aggregations, flexible time ranges
- It would make it possible to build some kind of Google Analytics for the statistics discussed in this presentation
- Open sourced!!! https://github.com/datasalt/splout-db