Processing large data requires new approaches to data mining: low, close-to-linear complexity, and stream processing. In traditional data mining the practitioner is usually presented with a static dataset, perhaps with a timestamp attached, from which to infer a model for predicting future or held-out observations. In stream processing, the problem is instead posed as extracting as much information as possible from the current data and converting it into an actionable model within a limited time window. In this talk I present an approach to mining over streams of data based on HBase counters, which allows for massively distributed processing and data mining. I will consider overall design goals as well as HBase schema design dilemmas that speed up the knowledge extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor, and Bayesian Learning on top of Bayesian Counters.
Bayesian Counters
1. Bayesian Counters, aka In-Memory Data Mining for Large DataSets
Alex Kozlov, Ph.D., Principal Solutions Architect, Cloudera Inc.
@alexvk2009 (Twitter)
June 13th, 2012
5. About Cloudera
Entering its 5th year, Cloudera's mission is to help organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses.
8. State space explosion
• The chess alpha-beta tree has 10^45 nodes
• We can solve only a 10^18 state space
• Go has 10^360 nodes
• Given Moore's law, we'll be there only by 2120
Can we help? Uncertainty rules the world! Or use distributed systems.
9. More zeros
• Most powerful computer (2019): 10^24 ops/sec
• Seconds in a year: 3 × 10^7 seconds
• Sun's expected life: 10^7 years
We can probably be done with chess!
10. Time
Examples of value vs. time:
• Advertising: if you don't figure out what the user wants in 5 minutes, you've lost him
• Intrusion detection: the damage may be significantly bigger a few minutes after the break-in
• Missing/misconfigured pages
(chart: value and precision vs. time)
http://cetas.net
http://www.woopra.com
http://www.wibidata.com/
11. What we've learned so far
• There is a lot of data out there
• The storage capacity of distributed systems today is overwhelming
• We need to admit that some problems will never be solved
• Time is a critical factor
12. Why (not) to Mine from HD?
• L1 cache: 64 bits per CPU clock cycle (10^-9 sec), i.e. 10^10 bytes per second, latency in ns
• And sorted…
• HD – 12 × 100 × 10^6 bytes per second, latency in ms
• Network – 10 GbE switches (depends on distance, topology)
• East–West coast latency 20-40 ms (ms within a datacenter)
Move computation to the data: but ML wants all your data! What if it does not fit in RAM?
• Work on reasonable subsets
13. Push computations to the source
• Collect relevant information at the source (pairwise correlations; can be done in parallel using HBase)
Compare:
• computations -> data = MapReduce
• data -> computations = map-side join
15. Time
What if we want to access more recent data more often?
• Key: subset of variables with their values + timestamp (variable length)
• Value: count (8 bytes)
(diagram: index over Key/Value pairs: Key 1/Value, Key 2/Value, Key 3/Value, Key 4/Value)
Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)
Pr(A|B, last 20 minutes)
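A windowed conditional probability such as Pr(A|B, last 20 minutes) can be read straight off the counters in the matching time bucket. A minimal in-memory sketch, where plain dicts stand in for the HBase counter column families and all bucket and key names are invented for illustration:

```python
from collections import defaultdict

# Dicts stand in for HBase counter column families, one per time bucket;
# bucket names and key strings are invented for illustration.
counters = {"30min": defaultdict(int), "2h": defaultdict(int)}

def observe(event, bucket):
    # Increment the marginal counter for B and the pairwise counter for (A, B).
    counters[bucket][f"B={event['B']}"] += 1
    counters[bucket][f"A={event['A']};B={event['B']}"] += 1

# Five recent events with B=1, three of which also have A=1.
for event in [{"A": 1, "B": 1}] * 3 + [{"A": 0, "B": 1}] * 2:
    observe(event, "30min")

def pr_a_given_b(bucket):
    # Pr(A=1 | B=1, window) = count(A=1;B=1) / count(B=1) inside the bucket
    return counters[bucket]["A=1;B=1"] / counters[bucket]["B=1"]

print(pr_a_given_b("30min"))  # 0.6
```

A real deployment would issue HBase increments instead of dict updates; the read path stays a single ratio of two counters.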
16. Anatomy of a counter
• Counter/Table: Iris, Cars, …
• Region (divided between) / File
• Column family: 30 mins, 2 hours
• Column qualifier: [sepal_width=2;class=0]
• Version: 1321038671, 1321038998
• Value (data): 15
18. HBase schema design
• Push computations into the distributed realm
• Column family for data locality
• Key is a tuple of var=value combinations
• No random salt
• Value is a counter (8 bytes)
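The key layout above can be sketched as a canonical, unsalted string of var=value pairs; the helper name and the ";" separator are assumptions for illustration:

```python
def counter_key(assignment):
    # Key is a tuple of var=value combinations; sorting makes it canonical,
    # and no random salt is prepended (per the schema above).
    return ";".join(f"{var}={val}" for var, val in sorted(assignment.items()))

key = counter_key({"sepal_width": 2, "class": 0})
print(key)  # class=0;sepal_width=2
```

Keeping the key deterministic means the same variable assignment always hits the same row, so increments from many writers accumulate into one counter.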
20. Naïve Bayes
Pr(C|F1, F2, ..., FN) = 1/Z · Pr(C) · Π_i Pr(F_i|C)
Requires only pairwise counters (complexity N^2*)
*Linear if we fix the target node
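The posterior can be assembled from the stored marginal and pairwise counters alone. A minimal in-memory sketch with invented toy observations (a dict stands in for the counter table):

```python
from collections import defaultdict

# A dict stands in for the counter table; the toy observations are invented.
counts = defaultdict(int)
data = [
    ({"f1": 0, "f2": 1}, 0),
    ({"f1": 0, "f2": 0}, 0),
    ({"f1": 1, "f2": 1}, 1),
    ({"f1": 1, "f2": 0}, 1),
    ({"f1": 0, "f2": 1}, 1),
]
for feats, c in data:
    counts[f"class={c}"] += 1                  # marginal counter for Pr(C)
    for f, v in feats.items():
        counts[f"{f}={v};class={c}"] += 1      # pairwise counter for Pr(F_i, C)

def naive_bayes(feats, classes=(0, 1)):
    # Pr(C|F1..FN) = 1/Z * Pr(C) * prod_i Pr(F_i|C), all ratios of counters
    total = sum(counts[f"class={c}"] for c in classes)
    scores = {}
    for c in classes:
        p = counts[f"class={c}"] / total
        for f, v in feats.items():
            p *= counts[f"{f}={v};class={c}"] / counts[f"class={c}"]
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

post = naive_bayes({"f1": 0, "f2": 1})
print(post)  # class 0 is the more likely label here
```

Only the N marginal and N^2 pairwise counters are touched at query time, which is what makes the counter schema sufficient for this model.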
21. k-NN
P(C) for the k nearest neighbors:
count(C|X) = Σ_i count(C|X_i),
where X_1, X_2, ..., X_N are in the vicinity of X
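The aggregation is just a sum of per-neighbor class counters. A small sketch with invented counter values for three neighbors:

```python
# count(C=c | X_i) for three neighbors of X; values invented for illustration.
neighbor_counts = [
    {0: 4, 1: 1},
    {0: 2, 1: 3},
    {0: 1, 1: 0},
]

def knn_class_counts(neighbor_counts):
    # count(C|X) = sum over neighbors X_i of count(C|X_i)
    agg = {}
    for cnts in neighbor_counts:
        for c, n in cnts.items():
            agg[c] = agg.get(c, 0) + n
    return agg

agg = knn_class_counts(neighbor_counts)  # {0: 7, 1: 4}
p_c0 = agg[0] / sum(agg.values())        # P(C=0) over the neighborhood
```

Each per-neighbor dict corresponds to one counter lookup, so the whole estimate costs k reads.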
22. Clique ranking
What is the best structure for a Bayesian network?
I(X;Y) = Σ_x Σ_y p(x,y) log[p(x,y)/(p(x)p(y))],
where x ∈ X and y ∈ Y
Using random projections we can generalize to an abstract subset Z
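The mutual information above is computed entirely from pairwise counts: the joint counts give p(x,y) and the marginals fall out by summation. A sketch with invented toy counts:

```python
from math import log

# Joint counts count(X=x, Y=y), invented toy values; in the counter schema
# these would be reads of pairwise "X=x;Y=y" counters.
joint = {(0, 0): 40, (0, 1): 10, (1, 0): 10, (1, 1): 40}
n = sum(joint.values())

def mutual_information(joint, n):
    # I(X;Y) = sum_x sum_y p(x,y) log[p(x,y) / (p(x) p(y))]
    px, py = {}, {}
    for (x, y), c in joint.items():
        px[x] = px.get(x, 0) + c
        py[y] = py.get(y, 0) + c
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * log(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

mi = mutual_information(joint, n)  # positive: X and Y are dependent
```

Ranking candidate edges or cliques by this score is then a matter of comparing sums over already-collected counters, with no extra pass over the raw data.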
23. Assoc
• Confidence (A -> B): count(A and B)/count(A)
• Lift (A -> B): count(A and B)/[count(A) × count(B)]
• Usually filtered on support: count(A and B)
• Frequent itemset search
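Both metrics are direct ratios of counters; note that the slide's lift ratio equals the standard lift up to a factor of the total transaction count N. A sketch with invented counts:

```python
# Rule metrics for A -> B computed directly from counters;
# all counts below are invented toy values.
n, count_A, count_B, count_AB = 1000, 200, 150, 90

confidence = count_AB / count_A               # 0.45
lift_ratio = count_AB / (count_A * count_B)   # the slide's unnormalized ratio
lift = n * lift_ratio                         # standard lift: 3.0 here
passes_support = count_AB >= 2                # min-support filter on count(A and B)
```

Since all three counters are maintained incrementally, rules can be re-scored on the live stream without rescanning transactions.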
24. Performance
retail.dat – 88K transactions over 14,246 items
• Mahout FPGrowth – 0.5 sec per pattern (58,623 patterns with min support 2)
• < 1 ms per pattern on a 5-node cluster
28. Time
nb iris class=2 sepal_length=5;petal_length=1.4 300
(iris: table; class=2: target variable; sepal_length=5;petal_length=1.4: predictors; 300: time, in seconds from now)
29. Conclusions
• Storing n-wise counts is a powerful data analysis paradigm
• We can implement a number of powerful algorithms on top of counters
• A system that will know more about the world than you would ever dare to admit
30. Future Directions
• Direct extensions:
– Dynamic adjustment of which counters to collect
– Dynamic adjustment of time buckets
– Optimization
• Testing problems:
– Cannot directly compare to static algos
• More general:
– Better data management tools for machine learning