Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/value-extraction-from-bbva-credit-card-transactions/ivan-de-prado
Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Data Spain 2012
1. Iván de Prado Alonso, CEO of Datasalt
www.datasalt.es
@ivanprado @datasalt
www.bigdataspain.org
November 16th, 2012
ETSI Telecomunicación, Madrid, Spain
#BDSpain
Value extraction from BBVA credit card transactions
12. Hadoop
Distributed Filesystem
✓ Files as big as you want
✓ Horizontal scalability
✓ Failover
Distributed Computing
✓ MapReduce
✓ Batch oriented
  • Input files are processed and converted into output files
✓ Horizontal scalability
13. Easier Hadoop Java API
✓ But keeping similar efficiency
Common design patterns covered
✓ Compound records
✓ Secondary sorting
✓ Joins
Other improvements
✓ Instance-based configuration
✓ First-class multiple input/output
Tuple MapReduce implementation for Hadoop
14. Tuple MapReduce
Our evolution of Google's MapReduce
Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: Tuple MapReduce: Beyond classic MapReduce. In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium | December 10-13, 2012
16. Tuple MapReduce
Main constraint
✓ Group by clause must be a subset of sort by clause
Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
  • Pangool -> Tuple MapReduce over Hadoop
19. Voldemort & Hadoop
Benefits
✓ Scalability & failover
✓ Updating the database does not affect serving queries
✓ All data is replaced at each execution
  • Providing agility/flexibility
    ▪ Big development changes are not a pain
  • Easier recovery from human errors
    ▪ Fix code and run again
  • Easy to set up new clusters with different topologies
20. Basic statistics
Easy to implement with Pangool/Hadoop
✓ One job, grouping by the dimension over which you want to calculate the statistics.
Count, Average, Min, Max, Stdev
Computing several time periods in the same job
✓ Use the mapper for replicating each datum for each period
✓ Add a period identifier field in the tuple and include it in the group by clause
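The period-replication idea above can be sketched in plain Python, with no Hadoop or Pangool involved. This is only an illustration under assumptions: the function names, the period identifier format ("month:2012-10", "year:2012") and the use of population standard deviation are hypothetical choices, not taken from the talk.

```python
import math
from collections import defaultdict
from datetime import date


def map_transaction(shop, day, amount):
    """Mapper: emit the same datum once per period it belongs to.

    The period identifier becomes part of the group-by key.
    """
    yield (shop, "month:%04d-%02d" % (day.year, day.month)), amount
    yield (shop, "year:%04d" % day.year), amount


def reduce_stats(amounts):
    """Reducer: basic statistics for one (shop, period) group."""
    n = len(amounts)
    mean = sum(amounts) / n
    # Population variance (an assumption; the slides do not say which one).
    variance = sum((a - mean) ** 2 for a in amounts) / n
    return {
        "count": n,
        "avg": mean,
        "min": min(amounts),
        "max": max(amounts),
        "stdev": math.sqrt(variance),
    }


def run_job(transactions):
    """Simulate one job: map, group by (shop, period), reduce."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for shop, day, amount in transactions:
        for key, value in map_transaction(shop, day, amount):
            groups[key].append(value)
    return {key: reduce_stats(values) for key, values in groups.items()}


transactions = [
    ("shop1", date(2012, 10, 3), 10.0),
    ("shop1", date(2012, 10, 20), 30.0),
    ("shop1", date(2012, 11, 5), 20.0),
]
stats = run_job(transactions)
```

Each transaction is emitted twice, so monthly and yearly statistics come out of the same grouping step.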
21. Distinct count
Possible to compute in a single job
✓ Using secondary sorting by the field you want to distinct count on
✓ Detecting changes on that field
Example
✓ Group by shop, sort by shop and card
Shop     Card
Shop 1   1234
Shop 1   1234
Shop 1   1234   Change: +1
Shop 1   5678
Shop 1   5678   Change: +1
2 distinct buyers for shop 1
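The change-detection trick above can be sketched in plain Python. Here `sorted` stands in for the framework's secondary sort and `itertools.groupby` for the group-by step; the function name `distinct_buyers` is hypothetical, and this is not Pangool code.

```python
from itertools import groupby


def distinct_buyers(records):
    """Count distinct cards per shop from (shop, card) pairs in any order."""
    # Stand-in for the shuffle: sort by (shop, card).
    ordered = sorted(records)
    result = {}
    # Group by shop only; inside a group the cards arrive already sorted.
    for shop, group in groupby(ordered, key=lambda record: record[0]):
        count = 0
        previous = object()  # sentinel that never equals a real card
        for _, card in group:
            if card != previous:  # change detected: one more distinct card
                count += 1
                previous = card
        result[shop] = count
    return result


records = [
    ("Shop 1", "1234"),
    ("Shop 1", "5678"),
    ("Shop 1", "1234"),
    ("Shop 1", "1234"),
    ("Shop 1", "5678"),
]
print(distinct_buyers(records))  # {'Shop 1': 2}
```

Because equal cards are adjacent after sorting, the reducer only needs to remember the previous card, so memory use stays constant per group.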
22. Histograms
Typically a two-pass algorithm
✓ First pass to detect the minimum and the maximum and determine the bin ranges
✓ Second pass to count the number of occurrences in each bin
Adaptive histogram
✓ One pass
✓ Fixed number of bins
✓ Bins adapt
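For reference, the typical two-pass approach described above could look like the following minimal Python sketch. It assumes equal-width bins and an in-memory list of values; the function name is hypothetical, and the one-pass adaptive variant is not shown because the slides do not describe how its bins adapt.

```python
def two_pass_histogram(values, num_bins):
    """Equal-width histogram built with two passes over the data."""
    # First pass: find the minimum and maximum to determine the bin ranges.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width when all values are equal

    # Second pass: count the number of occurrences in each bin.
    counts = [0] * num_bins
    for value in values:
        index = min(int((value - lo) / width), num_bins - 1)  # max goes in last bin
        counts[index] += 1

    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


edges, counts = two_pass_histogram([1, 2, 2, 3, 9, 10], 3)
print(edges)   # [1.0, 4.0, 7.0, 10.0]
print(counts)  # [4, 0, 2]
```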
23. Optimal histogram
Calculate the histogram that best represents the original one using a limited number of flexible-width bins
✓ Reduces storage needs
✓ More representative than fixed-width ones -> better visualization
24. Optimal histogram
Exact algorithm
Petri Kontkanen, Petri Myllymäki: MDL Histogram Density Estimation
http://eprints.pascal-network.org/archive/00002983/
Too slow for production use
25. Optimal histogram
Alternative: approximated algorithm
Random-restart hill climbing
✓ A solution is just a way of grouping existing bins
✓ From a solution, you can move to some close solutions
✓ Some are better: they reduce the representation error
Algorithm
1. Iterate N times, keeping the best solution
   1. Generate a random solution
   2. Iterate until no improvement
      1. Move to the next better possible movement
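The algorithm above can be sketched in Python as follows. Several details are assumptions made for illustration, not taken from the talk: a solution is encoded as a sorted list of cut positions over the existing fine-grained bins, a "close solution" shifts one cut by one position, and the representation error is the squared difference between each fine bin count and the mean of its group.

```python
import random


def error(counts, cuts):
    """Squared error of replacing each group of fine bins by its mean count."""
    total = 0.0
    bounds = [0] + list(cuts) + [len(counts)]
    for start, end in zip(bounds, bounds[1:]):
        group = counts[start:end]
        mean = sum(group) / len(group)
        total += sum((c - mean) ** 2 for c in group)
    return total


def neighbours(cuts, n):
    """Close solutions: shift a single cut one position left or right."""
    for i, cut in enumerate(cuts):
        lower = cuts[i - 1] if i > 0 else 0
        upper = cuts[i + 1] if i + 1 < len(cuts) else n
        for moved in (cut - 1, cut + 1):
            if lower < moved < upper:  # keep every group non-empty
                yield cuts[:i] + [moved] + cuts[i + 1:]


def hill_climb(counts, cuts):
    """Move to the first better neighbour until no neighbour improves."""
    best = error(counts, cuts)
    improved = True
    while improved:
        improved = False
        for candidate in neighbours(cuts, len(counts)):
            candidate_error = error(counts, candidate)
            if candidate_error < best:
                cuts, best = candidate, candidate_error
                improved = True
                break
    return cuts, best


def optimal_histogram(counts, num_bins, restarts=20, seed=0):
    """Random-restart hill climbing: keep the best local optimum found."""
    rng = random.Random(seed)
    best_cuts, best_error = None, float("inf")
    for _ in range(restarts):
        start = sorted(rng.sample(range(1, len(counts)), num_bins - 1))
        cuts, err = hill_climb(counts, start)
        if err < best_error:
            best_cuts, best_error = cuts, err
    return best_cuts, best_error


fine_counts = [5, 5, 5, 50, 50, 50, 1, 1]
cuts, err = optimal_histogram(fine_counts, num_bins=3)
```

A single climb can stop at a local optimum (for example, a cut that would have to cross a worse position before reaching a better one), which is why the search restarts from several random solutions.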
26. Optimal histogram
Alternative: approximated algorithm
Random-restart hill climbing
✓ One order of magnitude faster
✓ 99% accuracy
27. Everything in one job
Basic statistics -> 1 job
Distinct count statistics -> 1 job
One-pass histograms -> 1 job
Several periods & shops -> 1 job
We can put it all together so that computing all statistics for all shops fits into exactly one job
28. Shop recommendations
Based on co-occurrences
✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
✓ Only one co-occurrence is counted even if a buyer bought several times in A and B
✓ The top co-occurrences for each shop are the recommendations
Improvements
✓ The most popular shops are filtered out because almost everybody buys in them
✓ Recommendations by category, by location and by both
✓ Different calculation periods
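The co-occurrence rules above can be illustrated with a small in-memory Python sketch. The function name, the `top_k` and `popular_threshold` parameters, and the tie-breaking by shop name are all hypothetical; the talk only states that the most popular shops are filtered out, not how the cut-off is chosen.

```python
from collections import Counter, defaultdict
from itertools import permutations


def recommend(purchases, top_k=3, popular_threshold=None):
    """Return shop -> recommended shops from (buyer, shop) purchase pairs."""
    # A set per buyer: repeated purchases in a shop count only once.
    shops_by_buyer = defaultdict(set)
    for buyer, shop in purchases:
        shops_by_buyer[buyer].add(shop)

    # Number of distinct buyers per shop, used to filter very popular shops.
    buyers_per_shop = Counter(
        shop for shops in shops_by_buyer.values() for shop in shops
    )

    # One co-occurrence per buyer for each ordered pair of distinct shops.
    cooccurrences = defaultdict(Counter)
    for shops in shops_by_buyer.values():
        for a, b in permutations(sorted(shops), 2):
            cooccurrences[a][b] += 1

    recommendations = {}
    for shop, counter in cooccurrences.items():
        candidates = [
            (other, n)
            for other, n in counter.items()
            if popular_threshold is None
            or buyers_per_shop[other] <= popular_threshold
        ]
        # Highest co-occurrence first; shop name breaks ties deterministically.
        candidates.sort(key=lambda pair: (-pair[1], pair[0]))
        recommendations[shop] = [other for other, _ in candidates[:top_k]]
    return recommendations


purchases = [
    ("u1", "A"), ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "C"),
]
print(recommend(purchases))
# {'A': ['B', 'C'], 'B': ['A'], 'C': ['A']}
```

With `popular_threshold=2`, shop A (three distinct buyers) is dropped from every recommendation list, which mirrors the popularity filter listed under Improvements.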
29. Shop recommendations
Implemented in Pangool
✓ Using its counting and joining capabilities
✓ Several jobs
Challenges
✓ If somebody bought in many shops, the list of co-occurrences can explode:
  • Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
✓ Alleviated by limiting the total number of distinct shops to consider
✓ Only the top M shops where the client bought the most are used
Future
✓ Time-aware co-occurrences: the client bought in A and B within a short period of time.
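To make the explosion described under Challenges concrete, the sketch below computes N * (N - 1) and applies a top-M cap before pairs are generated. It assumes that "bought the most" means the highest number of purchases (it could equally mean the highest amount spent); the function names are hypothetical.

```python
from collections import Counter


def pairs_generated(num_shops):
    """Ordered co-occurrence pairs produced by one buyer: N * (N - 1)."""
    return num_shops * (num_shops - 1)


def top_shops(purchased_shops, max_shops):
    """Keep only the max_shops shops where the client bought most often."""
    counts = Counter(purchased_shops)
    # Most purchases first; shop name breaks ties deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [shop for shop, _ in ranked[:max_shops]]


print(pairs_generated(1000))  # 999000 pairs from a single heavy buyer
print(pairs_generated(20))    # 380 pairs after capping at M = 20
print(top_shops(["A", "B", "A", "C", "B", "A"], 2))  # ['A', 'B']
```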
30. Some numbers
Estimated resources needed with 1 year of data
270 GB of stats to serve
24 large instances
~11 hours of execution
$3500/month
✓ Optimizations still possible
✓ Cost without the use of reserved instances
✓ Probably cheaper with an in-house Hadoop cluster
31. Conclusion
It was possible to develop a Big Data solution for a bank
✓ With low use of resources
✓ Quickly
✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases
The solution is
✓ Scalable
✓ Flexible/agile. Improvements are easy to implement
✓ Prepared to withstand human errors
✓ At a reasonable cost
Main advantage: always recomputing everything
32. Future: Splout
Key/value datastores have limitations
✓ They only accept querying by the key
✓ Aggregations are not possible
✓ In other words, we are forced to pre-compute everything
✓ Not always possible -> data explodes
✓ For this particular case, time ranges are fixed
Splout: like Voldemort but SQL!
✓ The idea: to replace Voldemort with Splout SQL
✓ Much richer queries: real-time aggregations, flexible time ranges
✓ It would allow creating some kind of Google Analytics for the statistics discussed in this presentation
✓ Open sourced!
https://github.com/datasalt/splout-db
33. Iván de Prado Alonso, CEO of Datasalt
www.datasalt.es
@ivanprado @datasalt
Questions?