Rapid pruning of search space through hierarchical matching

Chandra Mouleeswaran
Machine Learning Scientist, ThreatMetrix Inc.
5/2/13

My Background

•  Machine Learning Scientist at ThreatMetrix Inc.
•  Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA

Career Path

-  Siemens Corporate Research: Learning & Expert Systems
-  Technology division of Donaldson, Lufkin and Jenrette company (Pershing): Artificial Intelligence Group - Network Monitoring
-  Several startups: Classification, Web Crawling, Security, Financial Trading etc.

Outline

•  Task description
•  Approaches
•  Why search paradigm?
•  Hierarchical matching
•  Results
•  Acknowledgments

The Device Identification Task

•  Computationally, it's a CLASSIFICATION problem:

   { a0, a1, a2, a3, ..., an } → { ci }

   ai = ( attribute | field | key ) value
   ci = ( label | signature | class | hash )

•  Returning devices should be correctly identified within certain tolerances
•  New classes may be created if a good match is not found in the repository of known devices
•  Devices age out, based on data retention policy
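The task statement above can be sketched as a small function: match an observed attribute set against known devices, and mint a new class when nothing matches within tolerance. This is only an illustration under stated assumptions. The similarity measure (fraction of agreeing attributes), the `tolerance` default, and the `device-N` label scheme are all hypothetical and not taken from the talk, and a linear scan like this would not meet the latency targets discussed later.

```python
from typing import Dict, Optional, Tuple

Device = Dict[str, str]  # attribute name -> observed value


def similarity(a: Device, b: Device) -> float:
    """Fraction of attributes (over the union of keys) whose values agree."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    same = sum(1 for k in keys if a.get(k) == b.get(k))
    return same / len(keys)


def identify(observed: Device, repository: Dict[str, Device],
             tolerance: float = 0.75) -> Tuple[str, bool]:
    """Return (label, is_new).

    A new label is created (hypothetical naming scheme) when no known
    device matches the observation within the tolerance.
    """
    best_label: Optional[str] = None
    best_score = 0.0
    for label, known in repository.items():
        score = similarity(observed, known)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= tolerance:
        return best_label, False
    new_label = f"device-{len(repository) + 1}"
    repository[new_label] = dict(observed)
    return new_label, True
```

A returning device with one changed attribute out of four still resolves to its existing label at the default tolerance, while a mostly different attribute set produces a new class.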

Task Challenges

•  Extremely volatile attributes
•  There are no pivot attributes to divide and conquer the search space
•  Changing distributions
•  Emphasis on PRECISION
•  Stringent RESPONSE time

Engineering Challenges

•  Precision (accuracy) and latency (response time) are antagonistic constraints
•  Project management

                 Repository Size (millions)   Load (TPS)   Latency (ms)
Project start    28                           200          < 100
Present          280                          300          < 100
Change           10X                          1.5X         None

Approaches

•  Rules engine
•  Learning models
•  Vector space models

Need an enterprise grade solution!

Rules Engine

•  No experts
•  Number of rules?
•  Maintenance?

Not a viable approach!

Learning Models

•  Most machine learning methods deal predominantly with binary classification problems (e.g., fraud / not fraud) or a small number of target classes
•  Few exemplars for each class
•  Attribute values may be unbounded
•  Attributes may not follow a natural progression

Learning Models …

•  Unsupervised learning such as clustering methods would make good models, but not good enough to be of practical use. Any simplification process will compromise on accuracy
•  Ability to explain is critical
•  Tend to ignore domain knowledge

Challenge in providing enterprise solution

Thoughts

•  No comparable application with such requirements
•  Build and deploy a classifier that explains itself easily, scales temporally and offers quick response
•  Use domain knowledge to guide verification
•  Improve the classifier through machine learning methods by analyzing performance in the field

Vector-Space Models

•  Similarity-based search makes the vector-space model a good choice for generating selections
•  Given the volatile nature of data, information retrieval (IR) systems can adapt easily
•  Good at neighborhood search

Sensitive to individual attribute changes!
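To make "similarity-based search" concrete, here is a minimal sketch of cosine similarity over token-count vectors. It is a generic textbook measure, not the scoring that Lucene/Solr or the author's system uses, and the token values in the usage note are invented.

```python
import math
from collections import Counter
from typing import Iterable


def cosine(a: Iterable[str], b: Iterable[str]) -> float:
    """Cosine similarity between two bags of tokens (0.0 if either is empty)."""
    va, vb = Counter(a), Counter(b)
    # Only tokens present in both vectors contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

With four tokens per device, changing a single token (for example `["linux", "firefox", "utc", "en"]` versus `["linux", "firefox", "utc", "fr"]`) drops the similarity from 1.0 to 0.75, which illustrates the sensitivity to individual attribute changes noted above.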

Sources of Inspiration

•  Lucene/Solr features
•  Documentation from (erstwhile) Lucid Imagination
•  Ease with which Lucene/Solr could be installed and explored

Very short learning curve for novices!

Feature Selection

•  Primitive and derived attributes
•  Entropy
•  Distribution
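The slide names entropy as a selection criterion without saying how it was applied. As one plausible reading, the sketch below computes the Shannon entropy of an attribute's observed values: an attribute that takes the same value on every device carries no discriminating information, while one spread evenly over many values carries the most. How the author weighed entropy against volatility is not stated, so treat this as an assumption.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_entropy(values: Iterable[Hashable]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, an attribute observed as four distinct, equally frequent values has 2 bits of entropy, and an attribute that is constant across all devices has 0 bits.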

Domain

•  Devices come with structural information but not much grammar or semantics
•  Bag-of-words (single field) approach is fast but not precise
•  Using all fields is precise but response is slow

Now what?

Disjunction Max

•  Matrix of all possible combinations of user input query and document fields
•  Transforms into a Boolean query of DisjunctionMaxQueries, one for each row
•  The maximum score of the sub-clauses is used by DisjunctionMaxQuery
•  No single term in user input dominates

This is needed!

Src: SearchHub and LucidWorks
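The scoring rule described above can be sketched in a few lines. For each query term, the per-field scores are combined by taking the maximum, plus an optional tie-breaker fraction of the remaining scores (this mirrors Lucene's DisjunctionMaxQuery); the per-term results are then summed as in a Boolean query. The function names and the numeric scores are illustrative only, and real field scores would come from the search engine's similarity function.

```python
from typing import Dict, List


def dismax_score(field_scores: List[float], tie: float = 0.0) -> float:
    """Max of the sub-clause scores plus tie * (sum of the other scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def query_score(per_term_field_scores: Dict[str, List[float]],
                tie: float = 0.0) -> float:
    """Sum the dismax score of each term (one matrix row per term)."""
    return sum(dismax_score(scores, tie)
               for scores in per_term_field_scores.values())
```

Because each term contributes only its best field score, a term that happens to match in many fields cannot dominate the total.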

DisMax Experiments (index size = 60 million)

Scenario 1

mm = 2
Solr fields = { a1, a2, a3 }
Values = { phrase1, phrase2, phrase3 }
Must-Match Clauses
Latency: YES (35 ms)
Precision: NO (20% failure)

Scenario 2

mm = 50%
Solr fields = { a1 }
Values = { term1, term2, term3, ..., termn }
Should-Match Clauses
Latency: NO (> 2 seconds)
Precision: YES (> 98%)
Possible Workaround

•  Look-ahead: Customize Lucene/Solr to do a branch-and-bound search, bail out on some lower bound score
•  Minimize candidates for DisMax search
   -  reduce total number of Solr instances to search
   -  reduce total number of disjunctive terms
      [Empirical estimate: t(n) = 2 * t(n-1), where t = time & n = number of disjunctive terms]
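Unrolling the empirical estimate gives t(n) = 2^(n-1) * t(1), so every disjunctive term removed halves the estimated query time. The sketch below works that arithmetic; the baseline `t1_ms` is a hypothetical single-term latency, not a measured figure from the talk.

```python
def estimated_time(n_terms: int, t1_ms: float = 1.0) -> float:
    """Estimated latency for n disjunctive terms under t(n) = 2 * t(n-1)."""
    if n_terms < 1:
        raise ValueError("n_terms must be at least 1")
    return t1_ms * 2 ** (n_terms - 1)


def speedup(n_before: int, n_after: int) -> float:
    """Ratio of estimated latencies before and after reducing the term count."""
    return estimated_time(n_before) / estimated_time(n_after)
```

Under this estimate, cutting a query from 10 disjunctive terms to 7 would make it about 8 times faster, which is why reducing the term count is worth the effort.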

Phrases over Terms

•  Used collocation (co-occurrence matrix) to determine most common phrases
•  Delete terms covered by phrases
•  Add stop words based on frequency analysis
•  Ensure precision is preserved through regression tests

Reduced the number of DisMax terms by 30%
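The first two steps above can be sketched as follows: count adjacent token pairs across a corpus, keep the pairs that recur, and then collapse those pairs into single phrase tokens so the covered terms disappear from the query. Restricting phrases to two adjacent tokens and the `min_count` threshold are simplifying assumptions; the talk does not specify phrase length or cutoffs.

```python
from collections import Counter
from typing import List, Set, Tuple

Bigram = Tuple[str, str]


def common_bigrams(docs: List[List[str]], min_count: int = 2) -> Set[Bigram]:
    """Adjacent token pairs that occur at least min_count times overall."""
    counts: Counter = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, c in counts.items() if c >= min_count}


def collapse_phrases(tokens: List[str], phrases: Set[Bigram]) -> List[str]:
    """Replace each known bigram with one phrase token, left to right."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(f"{tokens[i]} {tokens[i + 1]}")
            i += 2  # both terms are covered by the phrase
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Each collapsed pair removes one disjunctive term from the resulting query, which is the mechanism behind the reduction in DisMax terms reported on this slide.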

Sources of Inspiration

•  Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)
•  Search Reduction in Hierarchical Problem Solving, Proc. of the 9th IJCAI, AAAI Press, Menlo Park, CA (1991)
•  Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)

Hierarchical Matching

[Diagram labels: CSV/JSON; Filters; Bag of words; Models; Phrases; Domain-specific patterns; DisMax Query Formulator; Solr instances selector; To Solr Servers; Verification]

Conflict Resolution

•  Top n candidates are returned from each Solr instance
•  They are ranked based on custom verification module
•  Ties are broken using recency
•  Top candidate is persisted and returned along with custom score
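The resolution steps above can be sketched as a merge followed by a ranked pick. The `Candidate` fields and the `verify` callback are hypothetical stand-ins for the custom verification module, whose internals the talk does not describe; persistence of the winner is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    device_id: str
    last_seen: float  # epoch seconds; larger means more recent


def resolve(per_instance: Iterable[List[Candidate]],
            verify: Callable[[Candidate], float]
            ) -> Optional[Tuple[Candidate, float]]:
    """Merge the top-n lists from all instances and pick the best candidate.

    Ranking key is (verification score, recency), so ties on score are
    broken in favour of the most recently seen device.
    """
    pool = [c for hits in per_instance for c in hits]
    if not pool:
        return None
    best = max(pool, key=lambda c: (verify(c), c.last_seen))
    return best, verify(best)
```

Returning the score alongside the winner matches the last bullet, where the top candidate is returned together with its custom score.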

Comments

•  DisMax performs multidimensional match
•  Extracted multiple filters and arranged them hierarchically
•  Separation of selection and evaluation
   -  Selection = approximate solution
   -  Evaluation = refinement
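The separation of selection and evaluation can be illustrated with a two-stage sketch: a cheap inverted-index lookup prunes the repository to a short candidate list, and a more expensive scoring function is applied only to those survivors. This is a toy stand-in, not the author's pipeline. The token-overlap filter, `min_overlap`, and the `score` callback are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Optional, Set


def build_postings(docs: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Inverted index: token -> ids of the documents containing it."""
    postings: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)
    return postings


def select(query: Set[str], postings: Dict[str, Set[str]],
           min_overlap: int) -> List[str]:
    """Selection (approximate): ids sharing at least min_overlap tokens."""
    hits: Counter = Counter()
    for token in query:
        for doc_id in postings.get(token, ()):
            hits[doc_id] += 1
    return sorted(d for d, n in hits.items() if n >= min_overlap)


def evaluate(candidates: List[str],
             score: Callable[[str], float]) -> Optional[str]:
    """Evaluation (refinement): best candidate by the expensive score."""
    return max(candidates, key=score, default=None)
```

Only documents that pass `select` are ever scored, so the cost of the refinement step depends on the size of the pruned candidate list rather than on the size of the repository.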

Where time went..

•  Attribute selection
•  Ranking
•  Optimization
•  Index re-generation
•  Regression testing

Sources for Tune Up

•  Scaling Solr, Lucene Revolution, May 2011
•  Practical Search with Solr: Beyond just Looking it Up, Lucid Imagination, May 2010

Testing

•  Precision testing using self and mixed modes
•  Latency tests
   -  custom harness for stand-alone tests
   -  integrated tests with JMeter framework
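A stand-alone latency harness can be as small as a timing loop plus a percentile calculation. The sketch below is a generic example and not the author's harness. It times a callable with `time.perf_counter` and reports nearest-rank percentiles, one common convention among several.

```python
import math
import time
from typing import Callable, List, Sequence


def measure_ms(fn: Callable[[], object], runs: int) -> List[float]:
    """Call fn `runs` times and return each wall-clock duration in ms."""
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

In use, `fn` would wrap one query against the system under test, and the 95th or 99th percentile of the samples would be compared against the latency budget rather than the mean.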

Latency Percentiles

[Chart series: original edismax; Initial solution; Optimization 1: Filters, Focused search, verification; Optimization 2: Domain patterns, Stop words, de-dupe]

Response Times over Time

[Chart]

Project Execution

•  Agile Methodology
•  Risk mitigation through primary and contingency plans
•  Rapid prototyping followed by good software engineering practices
•  Evaluating DSE (DataStax) & Solr Cloud

Gleanings

•  You can classify anything with Lucene/Solr; the lexicon is your own
•  The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it
•  If you run into a problem, someone has solved it or will solve it in the near future

Gleanings …

•  Deal with accuracy before latency
•  If precision, latency and scale are all critical to your domain, expect to invest some time in hierarchical abstractions
•  "Index once, run anytime, anywhere" does not apply during development
•  Throwing all data at Lucene/Solr will not work for mission critical applications
•  Rapid prototyping and willingness to fail

Summary

Simplify and match at multiple levels of abstraction

Contributors

Chandra Mouleeswaran: Research & Prototyping
Fang Chen: Research & Prototyping
Luke Mertens: Productization & Scalability
Brent Pearson: Release Management
Tracy Hsu: Precision Testing & QA
Srinivas Nayani: Deployment & QA
COMMENTS & FEEDBACK:
Chandra Mouleeswaran
cmouleeswaran@threatmetrix.com
