Presented by Chandra Mouleeswaran, Co-Chair at IntelliFest.org, ThreatMetrix
This talk presents our experience using Lucene/Solr for the classification of user and device data. On a daily basis, ThreatMetrix, Inc. handles a huge volume of volatile data. The primary challenge is to classify each incoming transaction rapidly and precisely by searching a huge index within a very strict latency specification. The audience will be taken through the various design choices and the lessons learned. The talk also details a hierarchical search procedure that systematically divides the search space into manageable partitions while maintaining precision.
Rapid pruning of search space through hierarchical matching
1. RAPID PRUNING OF SEARCH SPACE THROUGH HIERARCHICAL MATCHING
Chandra Mouleeswaran
Machine Learning Scientist, ThreatMetrix Inc.
5/2/13
2. My Background
• Machine Learning Scientist at ThreatMetrix Inc.
• Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA
Career Path
- Siemens Corporate Research: Learning & Expert Systems
- Technology division of Donaldson, Lufkin and Jenrette company (Pershing): Artificial Intelligence Group - Network Monitoring
- Several startups: Classification, Web Crawling, Security, Financial Trading etc.
4. The Device Identification Task
• Computationally, it's a CLASSIFICATION problem:
{ a0, a1, a2, a3, ..., an } → { ci }
ai = ( attribute | field | key ) value
ci = ( label | signature | class | hash )
• Returning devices should be correctly identified within certain tolerances
• New classes may be created if a good match is not found in the repository of known devices
• Devices age out, based on data retention policy
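The task on slide 4 can be sketched as a toy classifier over attribute/value maps. This is only an illustration under stated assumptions: the `DeviceRepository` class, the similarity measure (fraction of agreeing attributes), the 0.75 tolerance, and the `device-N` labels are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceRepository:
    """Toy repository of known devices keyed by class label (hypothetical)."""
    threshold: float = 0.75              # assumed tolerance for a "good match"
    devices: dict = field(default_factory=dict)
    _next_id: int = 0

    @staticmethod
    def similarity(a: dict, b: dict) -> float:
        """Fraction of attribute keys (over the union) whose values agree."""
        keys = set(a) | set(b)
        if not keys:
            return 0.0
        same = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
        return same / len(keys)

    def classify(self, attrs: dict) -> str:
        """Return the label of the best match, or create a new class."""
        best_label, best_score = None, 0.0
        for label, known in self.devices.items():
            score = self.similarity(attrs, known)
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= self.threshold:
            # Returning device: remember its latest attribute values.
            self.devices[best_label] = dict(attrs)
            return best_label
        # No good match in the repository: create a new class.
        label = f"device-{self._next_id}"
        self._next_id += 1
        self.devices[label] = dict(attrs)
        return label
```

A device whose attributes change slightly is still recognized, while an unrelated one gets a new label. Ageing out under a retention policy is not shown here.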
5. Task Challenges
• Extremely volatile attributes
• There are no pivot attributes to divide and conquer the search space
• Changing distributions
• Emphasis on PRECISION
• Stringent RESPONSE time
7. Approaches
• Rules engine
• Learning models
• Vector space models
Need an enterprise grade solution!
8. Rules Engine
• No experts
• Number of rules?
• Maintenance?
Not a viable approach!
9. Learning Models
• Most machine learning methods deal predominantly with binary classification problems (e.g. fraud / not fraud) or a small number of target classes
• Few exemplars for each class
• Attribute values may be unbounded
• Attributes may not follow a natural progression
10. Learning Models …
• Unsupervised learning such as clustering methods would make good models, but not good enough to be of practical use. Any simplification process will compromise on accuracy
• Ability to explain is critical
• Tend to ignore domain knowledge
Challenge in providing enterprise solution
11. Thoughts
• No comparable application with such requirements
• Build and deploy a classifier that explains itself easily, scales temporally and offers quick response
• Use domain knowledge to guide verification
• Improve the classifier through machine learning methods by analyzing performance in the field
12. Vector-Space Models
• Similarity based search makes the vector-space model a good choice for generating selections
• Given the volatile nature of data, information retrieval (IR) systems can adapt easily
• Good at neighborhood search
Sensitive to individual attribute changes!
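The sensitivity noted on slide 12 can be seen in a minimal sketch that treats each attribute=value pair as a term and compares devices by cosine similarity. It uses raw term counts rather than Lucene's actual scoring, and the helper names `to_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def to_vector(attrs: dict) -> Counter:
    """Represent a device as a bag of 'key=value' terms (assumed encoding)."""
    return Counter(f"{k}={v}" for k, v in attrs.items())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


base = {"os": "linux", "browser": "firefox", "tz": "UTC", "lang": "en"}
changed = dict(base, lang="fr")   # a single attribute differs

same_score = cosine(to_vector(base), to_vector(base))
changed_score = cosine(to_vector(base), to_vector(changed))
```

With only four attributes, changing one of them moves the similarity from 1.0 to 0.75, which is why volatile attributes make a plain similarity threshold fragile.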
13. Sources of Inspiration
• Lucene/Solr features
• Documentation from (erstwhile) Lucid Imagination
• Ease with which Lucene/Solr could be installed and explored
Very short learning curve for novices!
15. Domain
• Devices come with structural information but not much grammar or semantics
• Bag-of-words (single field) approach is fast but not precise
• Using all fields is precise but response is slow
Now what?
16. Disjunction Max
• Matrix of all possible combinations of user input query and document fields
• Transforms into a Boolean query of DisjunctionMaxQueries of each row
• Maximum score of sub clauses is used by DisjunctionMaxQuery
• No single term in user input dominates
This is needed!
Src: SearchHub and LucidWorks
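The scoring idea on slide 16 can be sketched in a few lines. Each query term has one score per document field; the disjunction keeps the best field score, optionally adding a tie-breaker fraction of the others, and the Boolean query sums the per-term results. The per-field scores below are made-up inputs, and the function names are hypothetical; real Lucene scoring involves further factors.

```python
def dismax_score(field_scores, tie=0.0):
    """Score of one disjunction: best sub-clause plus tie * the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def query_score(matrix, tie=0.0):
    """Sum of per-term dismax scores.

    matrix maps each query term to its list of per-field scores,
    i.e. one row of the term x field matrix per term.
    """
    return sum(dismax_score(row, tie) for row in matrix.values())


# Hypothetical per-field scores for two query terms over three fields.
matrix = {
    "firefox": [0.9, 0.2, 0.0],
    "linux": [0.1, 0.6, 0.3],
}
pure_max = query_score(matrix)             # only the best field counts
with_tie = query_score(matrix, tie=0.1)    # other fields contribute a little
```

Because each term contributes only its best field (plus a small tie-breaker), a term that happens to match in many fields cannot dominate the total.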
18. Possible Workaround
• Look-ahead: Customize Lucene/Solr to do a branch-and-bound search, bail out on some lower bound score
• Minimize candidates for DisMax search
- reduce total number of Solr instances to search
- reduce total number of disjunctive terms
[ Empirical estimate: t_n = 2 * t_{n-1}, where t = time & n = number of disjunctive terms ]
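Taken at face value, the empirical estimate on slide 18 means query time doubles with every added disjunctive term, so removing k terms cuts the estimated time by a factor of 2^k. A small sketch of that arithmetic follows; the `base_time` parameter (time with zero disjunctive terms) and the function names are assumptions for illustration.

```python
def estimated_time(n_terms, base_time=1.0):
    """Apply t_n = 2 * t_{n-1} starting from an assumed t_0 = base_time."""
    t = base_time
    for _ in range(n_terms):
        t *= 2
    return t


def speedup_from_pruning(n_before, n_after):
    """Estimated speed-up when the number of disjunctive terms is reduced."""
    return estimated_time(n_before) / estimated_time(n_after)
```

For example, under this estimate `speedup_from_pruning(10, 7)` evaluates to 8.0: dropping three disjunctive terms would make the query roughly eight times faster.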
19. Phrases over Terms
• Used collocation (co-occurrence matrix) to determine most common phrases
• Delete terms covered by phrases
• Add stop words based on frequency analysis
• Ensure precision is preserved through regression tests
Reduced the number of DisMax terms by 30%
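The first two steps on slide 19 can be sketched as follows. This simplified version counts adjacent token pairs instead of building a full co-occurrence matrix, and the `min_count` cut-off, the function names, and the sample tokens are hypothetical.

```python
from collections import Counter


def common_phrases(docs, min_count=2):
    """Adjacent token pairs that occur in at least min_count documents."""
    pair_counts = Counter()
    for tokens in docs:
        # Use a set so each pair is counted once per document.
        pair_counts.update(set(zip(tokens, tokens[1:])))
    return {pair for pair, count in pair_counts.items() if count >= min_count}


def compress(tokens, phrases):
    """Replace adjacent terms covered by a phrase with one phrase token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


docs = [
    ["mozilla", "firefox", "linux"],
    ["mozilla", "firefox", "windows"],
    ["chrome", "linux"],
]
phrases = common_phrases(docs)
shorter = compress(docs[0], phrases)
```

Here the pair ("mozilla", "firefox") becomes a single phrase, so the first record shrinks from three query terms to two. Frequency-based stop words and the regression tests that guard precision are not shown.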
20. Sources of Inspiration
• Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)
• Search Reduction in Hierarchical Problem Solving, Proc. of the 9th IJCAI, AAAI Press, Menlo Park, CA (1991)
• Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)
21. Hierarchical Matching
[Diagram labels: Bag of words, Models, Phrases, Filters, DisMax; Query Formulator; Domain-specific patterns; CSV/JSON; Solr instances selector; To Solr Servers; Verification]
22. Conflict Resolution
• Top n candidates are returned from each Solr instance
• They are ranked based on custom verification module
• Ties are broken using recency
• Top candidate is persisted and returned along with custom score
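The merge step on slide 22 can be sketched as below. It assumes the custom verification module has already attached a score to each candidate; the `Candidate` fields and the `resolve` function are hypothetical names, and persistence of the winner is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    device_id: str
    verification_score: float   # assumed output of the verification module
    last_seen: int              # assumed recency marker, e.g. epoch seconds


def resolve(per_instance_results):
    """Pick the best candidate across the top-n lists of all instances.

    Candidates are compared by verification score first; ties are
    broken in favour of the most recently seen device.
    """
    merged = [c for results in per_instance_results for c in results]
    if not merged:
        return None
    return max(merged, key=lambda c: (c.verification_score, c.last_seen))


instance_a = [Candidate("dev-1", 0.90, 1000), Candidate("dev-2", 0.70, 5000)]
instance_b = [Candidate("dev-3", 0.90, 2000)]
winner = resolve([instance_a, instance_b])
```

In this example `dev-1` and `dev-3` tie on score, so the more recent `dev-3` is returned.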
23. Comments
• DisMax performs multidimensional match
• Extracted multiple filters and arranged them hierarchically
• Separation of selection and evaluation
- Selection = approximate solution
- Evaluation = refinement
24. Where time went..
• Attribute selection
• Ranking
• Optimization
• Index re-generation
• Regression testing
25. Sources for Tune Up
• Scaling Solr, Lucene Revolution, May 2011
• Practical Search with Solr: Beyond just Looking it Up, Lucid Imagination, May 2010
26. Testing
• Precision testing using self and mixed modes
• Latency tests
- custom harness for stand-alone tests
- integrated tests with JMeter framework
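A stand-alone latency harness of the kind mentioned on slide 26 might look like the sketch below. It is not the harness used in the talk: the function names are hypothetical, the function under test is any callable supplied by the caller, and percentiles use the simple nearest-rank method.

```python
import time


def measure_latencies(fn, inputs):
    """Call fn on each input and return the per-call latency in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n avoids floating-point rounding issues.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]
```

In use, `fn` would wrap a single classification request, and a tail percentile such as `percentile(latencies, 95)` would be compared against the latency specification.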
31. Project Execution
• Agile Methodology
• Risk mitigation through primary and contingency plans
• Rapid prototyping followed by good software engineering practices
• Evaluating DSE (DataStax) & Solr Cloud
32. Gleanings
• You can classify anything with Lucene/Solr, lexicon is your own
• The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it
• If you run into a problem, someone has solved it or will solve it in the near future
33. Gleanings …
• Deal with accuracy before latency
• If precision, latency and scale are all critical to your domain, expect to invest some time in hierarchical abstractions
• "Index once, run anytime, anywhere" does not apply during development
• Throwing all data at Lucene/Solr will not work for mission critical applications
• Rapid prototyping and willingness to fail
34. Summary
Simplify and match at multiple levels of abstraction