Processing large data requires new approaches to data mining: low, close-to-linear complexity, and stream processing. In traditional data mining the practitioner is usually presented with a static dataset, perhaps with a timestamp attached, from which to infer a model for predicting future or held-out observations. In stream processing, the problem is instead posed as extracting as much information as possible from the current data and converting it into an actionable model within a limited time window. In this talk I present an approach to mining over streams of data based on HBase counters, which allows for massively distributed processing and data mining. I will consider overall design goals as well as HBase schema design dilemmas that speed up the knowledge extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor, and Bayesian Learning on top of Bayesian Counters.
Bayesian Counters
1. Bayesian Counters, aka In-Memory Data Mining for Large DataSets
Alex Kozlov, Ph.D., Principal Solutions Architect, Cloudera Inc.
@alexvk2009 (Twitter)
June 13th, 2012
5. About Cloudera
Entering its 5th year, Cloudera's mission is to help organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses.
8. State space explosion
• The chess alpha-beta tree has 10^45 nodes
• We can solve only a 10^18 state space
• Go has 10^360 nodes
• Given Moore's law, we'll be there only by 2120
Can we help? Uncertainty rules the world! Or use distributed systems.
9. More zeros
• Most powerful computer (2019): 10^24 ops/sec
• Seconds in a year: 3 × 10^7 seconds
• Sun's expected life: 10^7 years
We can probably be done with chess!
10. Time
Examples of value vs. time:
• Advertising: if you don't figure out what the user wants in 5 minutes, you've lost him
• Intrusion detection: the damage may be significantly bigger a few minutes after the break-in
• Missing/misconfigured pages
(chart: value and precision vs. time)
http://cetas.net
http://www.woopra.com
http://www.wibidata.com/
11. What we've learned so far
• There is a lot of data out there
• The storage capacity of distributed systems today is overwhelming
• We need to admit that some problems will never be solved
• Time is a critical factor
12. Why (not) to Mine from HD?
• L1 cache: 64 bits per CPU clock cycle (10^-9 sec), i.e. 10^10 bytes per second, latency in ns
• And sorted…
• HD – 12 × 100 × 10^6 bytes per second, latency in ms
• Network – 10 GbE switches (depends on distance, topology)
• East–West coast latency 20-40 ms (ms within a datacenter)
Move computation to the data: but ML wants all your data! What if it does not fit in RAM?
• Work on reasonable subsets
13. Push computations to the source
• Collect relevant information at the source (pairwise correlations; can be done in parallel using HBase)
Compare:
• computations -> data = MapReduce
• data -> computations = map-side join
15. Time
What if we want to access more recent data more often?
• Key: subset of variables with their values + timestamp (variable length)
• Value: count (8 bytes)
(diagram: index over Key/Value pairs: Key 1/Value, Key 2/Value, Key 3/Value, Key 4/Value)
Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)
Pr(A|B, last 20 minutes)
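A windowed conditional probability such as Pr(A|B, last 20 minutes) can be read straight off the counters in the matching time bucket. A minimal in-memory sketch, where plain dicts stand in for the HBase counter column families and all bucket and key names are invented for illustration:

```python
from collections import defaultdict

# Dicts stand in for HBase counter column families, one per time bucket;
# bucket names and key strings are invented for illustration.
counters = {"30min": defaultdict(int), "2h": defaultdict(int)}

def observe(event, bucket):
    # Increment the marginal counter for B and the pairwise counter for (A, B).
    counters[bucket][f"B={event['B']}"] += 1
    counters[bucket][f"A={event['A']};B={event['B']}"] += 1

# Five recent events with B=1, three of which also have A=1.
for event in [{"A": 1, "B": 1}] * 3 + [{"A": 0, "B": 1}] * 2:
    observe(event, "30min")

def pr_a_given_b(bucket):
    # Pr(A=1 | B=1, window) = count(A=1;B=1) / count(B=1) inside the bucket
    return counters[bucket]["A=1;B=1"] / counters[bucket]["B=1"]

print(pr_a_given_b("30min"))  # 0.6
```

A real deployment would issue HBase increments instead of dict updates; the read path stays a single ratio of two counters.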
16. Anatomy of a counter
• Counter/Table: Iris, Cars, …
• Region (divided between) / File
• Column family: 30 mins, 2 hours
• Column qualifier: [sepal_width=2;class=0]
• Version: 1321038671, 1321038998
• Value (data): 15
18. HBase schema design
• Push computations into the distributed realm
• Column family for data locality
• Key is a tuple of var=value combinations
• No random salt
• Value is a counter (8 bytes)
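The key layout above can be sketched as a canonical, unsalted string of var=value pairs; the helper name and the ";" separator are assumptions for illustration:

```python
def counter_key(assignment):
    # Key is a tuple of var=value combinations; sorting makes it canonical,
    # and no random salt is prepended (per the schema above).
    return ";".join(f"{var}={val}" for var, val in sorted(assignment.items()))

key = counter_key({"sepal_width": 2, "class": 0})
print(key)  # class=0;sepal_width=2
```

Keeping the key deterministic means the same variable assignment always hits the same row, so increments from many writers accumulate into one counter.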
20. Naïve Bayes
Pr(C|F1, F2, ..., FN) = 1/Z · Pr(C) · Π_i Pr(F_i|C)
Requires only pairwise counters (complexity N^2*)
*Linear if we fix the target node
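The posterior can be assembled from the stored marginal and pairwise counters alone. A minimal in-memory sketch with invented toy observations (a dict stands in for the counter table):

```python
from collections import defaultdict

# A dict stands in for the counter table; the toy observations are invented.
counts = defaultdict(int)
data = [
    ({"f1": 0, "f2": 1}, 0),
    ({"f1": 0, "f2": 0}, 0),
    ({"f1": 1, "f2": 1}, 1),
    ({"f1": 1, "f2": 0}, 1),
    ({"f1": 0, "f2": 1}, 1),
]
for feats, c in data:
    counts[f"class={c}"] += 1                  # marginal counter for Pr(C)
    for f, v in feats.items():
        counts[f"{f}={v};class={c}"] += 1      # pairwise counter for Pr(F_i, C)

def naive_bayes(feats, classes=(0, 1)):
    # Pr(C|F1..FN) = 1/Z * Pr(C) * prod_i Pr(F_i|C), all ratios of counters
    total = sum(counts[f"class={c}"] for c in classes)
    scores = {}
    for c in classes:
        p = counts[f"class={c}"] / total
        for f, v in feats.items():
            p *= counts[f"{f}={v};class={c}"] / counts[f"class={c}"]
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

post = naive_bayes({"f1": 0, "f2": 1})
print(post)  # class 0 is the more likely label here
```

Only the N marginal and N^2 pairwise counters are touched at query time, which is what makes the counter schema sufficient for this model.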
21. k-NN
P(C) for the k nearest neighbors:
count(C|X) = Σ_i count(C|X_i),
where X_1, X_2, ..., X_N are in the vicinity of X
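The aggregation is just a sum of per-neighbor class counters. A small sketch with invented counter values for three neighbors:

```python
# count(C=c | X_i) for three neighbors of X; values invented for illustration.
neighbor_counts = [
    {0: 4, 1: 1},
    {0: 2, 1: 3},
    {0: 1, 1: 0},
]

def knn_class_counts(neighbor_counts):
    # count(C|X) = sum over neighbors X_i of count(C|X_i)
    agg = {}
    for cnts in neighbor_counts:
        for c, n in cnts.items():
            agg[c] = agg.get(c, 0) + n
    return agg

agg = knn_class_counts(neighbor_counts)  # {0: 7, 1: 4}
p_c0 = agg[0] / sum(agg.values())        # P(C=0) over the neighborhood
```

Each per-neighbor dict corresponds to one counter lookup, so the whole estimate costs k reads.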
22. Clique ranking
What is the best structure for a Bayesian network?
I(X;Y) = Σ_x Σ_y p(x,y) log[p(x,y)/(p(x)p(y))],
where x ∈ X and y ∈ Y
Using random projections we can generalize to an abstract subset Z
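The mutual information above is computed entirely from pairwise counts: the joint counts give p(x,y) and the marginals fall out by summation. A sketch with invented toy counts:

```python
from math import log

# Joint counts count(X=x, Y=y), invented toy values; in the counter schema
# these would be reads of pairwise "X=x;Y=y" counters.
joint = {(0, 0): 40, (0, 1): 10, (1, 0): 10, (1, 1): 40}
n = sum(joint.values())

def mutual_information(joint, n):
    # I(X;Y) = sum_x sum_y p(x,y) log[p(x,y) / (p(x) p(y))]
    px, py = {}, {}
    for (x, y), c in joint.items():
        px[x] = px.get(x, 0) + c
        py[y] = py.get(y, 0) + c
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * log(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

mi = mutual_information(joint, n)  # positive: X and Y are dependent
```

Ranking candidate edges or cliques by this score is then a matter of comparing sums over already-collected counters, with no extra pass over the raw data.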
23. Assoc
• Confidence (A -> B): count(A and B)/count(A)
• Lift (A -> B): count(A and B)/[count(A) × count(B)]
• Usually filtered on support: count(A and B)
• Frequent itemset search
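Both metrics are direct ratios of counters; note that the slide's lift ratio equals the standard lift up to a factor of the total transaction count N. A sketch with invented counts:

```python
# Rule metrics for A -> B computed directly from counters;
# all counts below are invented toy values.
n, count_A, count_B, count_AB = 1000, 200, 150, 90

confidence = count_AB / count_A               # 0.45
lift_ratio = count_AB / (count_A * count_B)   # the slide's unnormalized ratio
lift = n * lift_ratio                         # standard lift: 3.0 here
passes_support = count_AB >= 2                # min-support filter on count(A and B)
```

Since all three counters are maintained incrementally, rules can be re-scored on the live stream without rescanning transactions.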
24. Performance
retail.dat – 88K transactions over 14,246 items
• Mahout FPGrowth – 0.5 sec per pattern (58,623 patterns with min support 2)
• < 1 ms per pattern on a 5-node cluster
28. Time
nb iris class=2 sepal_length=5;petal_length=1.4 300
(iris: table; class=2: target variable; sepal_length=5;petal_length=1.4: predictors; 300: time, in seconds from now)
29. Conclusions
• Storing n-wise counts is a powerful data analysis paradigm
• We can implement a number of powerful algorithms on top of counters
• A system that will know more about the world than you would ever dare to admit
30. Future Directions
• Direct extensions:
– Dynamic adjustment of which counters to collect
– Dynamic adjustment of time buckets
– Optimization
• Testing problems:
– Cannot directly compare to static algos
• More general:
– Better data management tools for machine learning