This document discusses using facets in Solr to facilitate relevant search. It provides an overview of facet history and how facets represent metadata that provides context about search results. Facets can be used for visualization, analytics, and understanding language semantics from text. The document argues that facets are dynamic context discovery tools that can be leveraged to find similar items and enhance search in various ways such as query autofiltering, typeahead suggestions, and text analytics.
A Multifaceted Look at Faceting for Relevant Search
1. A Multifaceted Look At Faceting - Using Facets “Under the Hood” to Facilitate Relevant Search
Ted Sullivan
Senior Solutions Architect, Lucidworks Professional Services
2. Agenda
• Facet history - why Solr faceting rocks
• Text - unstructured? I hardly think so!
• Using facets as context mining engines
3. Facets are Metadata
Facet: “A particular aspect or feature of something.”
Metadata: “Data about data” - attributes, aspects, descriptors, features, properties, traits
Facets == Metadata
Metadata Semantics: what, where, when, why
name, size, shape, color, material, texture
manufacturer, number of outlets, voltage, is pre-assembled …
address, phone number, birth date, user rating
hand size, likes country music, believes in climate change …
Metadata Dependencies: Some metadata fields depend on “what” the “thing” is
e.g. People have different attributes than Toaster Ovens
4. A Little History: Facet Technology Terminology
Verity K2 Parametric Search
Fast ESP Navigators
MS Fast Refiners
Endeca Dimensions
Solr Facets
Google (& Autonomy too): Facets? Facets? We don’t need no stinkin’ facets!
5. Traditional Uses of Facets
Faceted Navigation - aka “Refinement” / “Drill down”
Allows initial query to be ambiguous without requiring the user to “rethink” what to search for.
Neatly handles the old “No Results Found” trial-and-error bugaboo.
Down sides:
Facets should not be used as a ‘band-aid’ for poorly tuned relevance!!!
• The “need” for faceted navigation forces us to favor recall over precision (maybe this is why Google avoids them!) …
• … because you have to drill-in to something.
If users have to use facets to drill in to what they really want -
Why search in the first place - why not just browse?
Facet 'noise' - false positives due to poor precision / high recall cause weird outliers
(ML techniques like Signal Aggregation to improve relevance do not help here)
6. Visualization
Facets show a high-level or global 'context' of what the result set is “about”
Dashboards - Search Driven BI:
Eye Candy: Pie charts, bar charts, histograms, etc. - make use of the basic statistical nature of facets (i.e. counts - but now lots of other things too: mean, median, std, skewness, etc.).
Data Analytics:
Solr now enhances facet statistics to include many more useful mathematical calculations.
Use basic analytics such as mean, standard deviation and the like to do more complex analytics similar to what is done with Databases in OLAP “cubes”.
Time-series: Range Facets on a Date-Time (Trie)Field
Advantage - analytics are search driven so that the “cube” can change with the query
Facet Analytics - Maturing Rapidly
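As a rough illustration of search-driven analytics, the sketch below builds a request body in the style of Solr's JSON Request / JSON Facet API that asks for a few aggregate statistics and a per-day range facet over whatever the query matches. The field names (`price`, `created_dt`), the query and the function name are hypothetical, and which aggregation functions are available (e.g. `stddev`, `percentile`) depends on the Solr version in use.

```python
import json


def build_stats_request(query, numeric_field, date_field):
    """Build a JSON request body with aggregate stats and a date range facet.

    Field names are supplied by the caller; nothing here talks to a server.
    """
    return {
        "query": query,
        "limit": 0,  # only the facet section is of interest here
        "facet": {
            "mean": f"avg({numeric_field})",
            "median": f"percentile({numeric_field},50)",
            "std": f"stddev({numeric_field})",
            "per_day": {
                "type": "range",
                "field": date_field,
                "start": "NOW/DAY-7DAYS",
                "end": "NOW/DAY",
                "gap": "+1DAY",
            },
        },
    }


# Hypothetical query and fields, for illustration only.
body = build_stats_request("category:toaster", "price", "created_dt")
print(json.dumps(body, indent=2))
```

Because the statistics are computed over the current result set, changing `query` changes the "cube" without any re-indexing.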
7. Solr Facets are Dynamic not Static
In other search engines like Verity, Fast or Endeca:
Facet values are computed at index time - thereby making them Static at query time.
Lucene did not have faceting originally
- Solr added faceting “on the fly” - i.e. at query time (before Lucene added it)
- Solr faceting is thus Dynamic!
The main hurdle to doing it this way is to make it FAST (mission accomplished - thanks Yonik!)
Although this would seem to be what we engineers call a “bolt on” - in hindsight this was a very fortuitous evolutionary path!!
Once you do this, there are serious advantages over index-time faceting!!
One main advantage is flexibility
- In Solr - you can facet on just about anything - even things that weren’t thought about when the collection was designed (function queries - extensible with ValueSource impls!)
- Good, good, good!!!
8. Very Brief Survey of Solr Faceting Methods
Metadata fields - prefer non-tokenized field types (you can facet on tokenized fields too - but why would you want to?)
enum method => cached filter queries (filter cache)
fc method => uses the FieldCache (it now uses DocValues)
facet query
facet prefix
facet range
function queries and Value Sources
pivot facets
excluding and tagging
JSON Facets
Facet performance tuning (gotchas) - I could talk about this at length but … nah! Read the Wikis!
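To make the survey above concrete, here is a minimal sketch that assembles the classic Solr faceting request parameters for several of the listed methods (field facet, facet query, facet range, pivot facet, prefix, method selection, and tagging/excluding). The field names (`brand_s`, `color_s`, `price_f`) and the tag name are hypothetical; the code only builds a query string and does not contact a server.

```python
from urllib.parse import urlencode, parse_qs


def build_facet_params(user_query):
    """Return a URL-encoded query string exercising several facet methods."""
    params = [
        ("q", user_query),
        ("rows", "0"),
        ("facet", "true"),
        # Field facets; the second one ignores the tagged color filter.
        ("facet.field", "brand_s"),
        ("facet.field", "{!ex=colorTag}color_s"),
        ("fq", "{!tag=colorTag}color_s:red"),
        # Facet query: the count for an arbitrary query.
        ("facet.query", "price_f:[0 TO 100]"),
        # Range facet over a numeric field.
        ("facet.range", "price_f"),
        ("f.price_f.facet.range.start", "0"),
        ("f.price_f.facet.range.end", "1000"),
        ("f.price_f.facet.range.gap", "100"),
        # Pivot (nested) facet across two fields.
        ("facet.pivot", "brand_s,color_s"),
        # Per-field method and prefix.
        ("f.brand_s.facet.method", "enum"),
        ("f.brand_s.facet.prefix", "A"),
    ]
    return urlencode(params)


query_string = build_facet_params("sofa")
print(query_string)
```

The tag/exclude pair is what allows a multi-select UI: the `color_s` counts are computed as if the color filter had not been applied.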
9. Language Semantics
Nouns, Verbs, Adjectives, Adverbs, Prepositions, etc.
What type of thing is it? Car
What is its name? Lamborghini Aventador
Where is it made? Italy
How much horsepower does it have? A hellova lot
How fast does it go? Very
How much does it cost? If you have to ask …
My Search Philosophy:
Humans use language to search because that is how we reason about things.
Search engines need to do a better job of understanding language to better help us to find the things we are looking for.
Search index schemas describe a machine oriented / data-centric view of things
- want to translate that to and from language-centric views
- from a search engine perspective - descriptive text is “unstructured” data - but not to us!
10. Metadata and Text Transforms
Metadata: Data-centric view of things <=> Text: Language-centric view of things
Metadata terms are embedded in language
Compose descriptive text about a thing from its attributes or properties
-> Create linguistic expressions from metadata
Deduce attributes or properties of a thing from descriptive text
-> Compute metadata by linguistic analyses of text
Search problem - match terms in query with things in index
-> Knowledge of word meanings is power!!!
-> Facet metadata constitutes knowledge that can be leveraged!!
11. FACETS ARE CONTEXT DISCOVERY TOOLS
Lemma 1: Similar things occur in similar contexts
Lemma 2: Facets are context exploration tools
Assertion: Facets can be used to find similar things
12. Exploiting Facet Metadata
Facets provide a sort of global metadata CONTEXT for a search result set
In addition to faceting, how can we exploit metadata to enhance search?
❖ Turning facet metadata inside-out: Query Autofiltering
❖ Using Facets to build contextual typeahead suggester:
• Pivot facets to construct phrases from structured data.
• Extract related information using facets at index time to enable security trimming and dynamic boosting
❖ Using Facets for text analytics to generate better facets
• Facet ratios of positive and negative queries on key terms -> detects “key term clusters”
• Document clustering using key term cluster vectors -> detects key term categories
13. Example: Detecting User Intent in eCommerce
Separating the ‘What’ from the ‘What about’ (i.e. a Thing vs. a Really BIG Thing)
microwave safe dishes: ‘microwave safe’ - adjective phrase
compact microwave oven: ‘microwave oven’ - noun phrase
microwave: ‘microwave’ - noun - contraction for ‘microwave oven’
coffee filter: ‘coffee filter’ - noun-noun phrase - a filter
coffee: ‘coffee’ - noun - a beverage
coffee table: ‘coffee table’ - noun-noun phrase - table
coffee colored sheets: ‘coffee colored’ - adjective phrase
coffee ice cream: coffee flavored ice cream
milk chocolate: a type of chocolate
chocolate milk: flavored milk
14. Query Autofiltering
Uses field values to generate a “reverse lookup map” that maps values to fields that contain them.
- Inverts the “uninverted map” - ah … another type of inverted map - values -> fields
- Uses the Lucene SynonymMap Finite State Transducer (FST) implementation
Uses this map to parse the query to find terms in the query related to specific metadata fields.
Example: ‘red sofa’ maps red => color, sofa => product_type
Selects the longest contiguous phrase in its lexicon to match against parts of the query
If both ‘coffee’ and ‘coffee filter’ are in the lexicon (i.e. the Solr collection), the query ‘coffee filter’ will only match ‘coffee filter’
Can construct either a Solr filter query (fq) or boost query (bq) using this information.
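The reverse-lookup and longest-phrase matching described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a plain Python dict where the slide mentions Lucene's SynonymMap/FST, it assumes field values are already lower-case, and the sample documents, field names and function names are hypothetical.

```python
def build_reverse_map(docs, fields):
    """Map each field value to the set of fields that contain it."""
    reverse = {}
    for doc in docs:
        for field in fields:
            values = doc.get(field, [])
            if isinstance(values, str):
                values = [values]
            for value in values:
                reverse.setdefault(value.lower(), set()).add(field)
    return reverse


def autofilter(query, reverse, max_phrase_len=4):
    """Greedily match the longest contiguous phrase at each position.

    Returns (filter clauses, leftover tokens that matched no field value).
    """
    tokens = query.lower().split()
    filters, leftover = [], []
    i = 0
    while i < len(tokens):
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in reverse:
                clause = " OR ".join(
                    f'{field}:"{phrase}"' for field in sorted(reverse[phrase])
                )
                filters.append(clause)
                i += n
                break
        else:
            # No phrase starting at this token is a known field value.
            leftover.append(tokens[i])
            i += 1
    return filters, leftover


# Hypothetical collection contents, for illustration only.
docs = [
    {"color": "red", "product_type": "sofa"},
    {"product_type": "coffee filter"},
    {"product_type": "coffee"},
]
reverse = build_reverse_map(docs, ["color", "product_type"])
print(autofilter("red sofa", reverse))
print(autofilter("coffee filter", reverse))
```

The returned clauses could be sent either as `fq` (strict filtering) or as `bq` (boosting) parameters, which is the choice the slide refers to.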
15. Query Autofiltering - Knowledge Mining
Doing this is a way of exploiting the field/value relationships in the collection metadata.
So what it effectively does is extract the knowledge that is built-in to your collection due to the facet metadata that it contains and applies that knowledge to parsing of the query:
• It knows that ‘red’ is a color because ‘red’ is a value in the ‘color’ field.
• It ‘Short circuits’ the search-then-drill-in paradigm -> just search!
• But as the telemarketers say: “Wait! there’s more! …”
The knowledge about what terms mean and the properties of the term field (single valued vs. multi-valued) provide other opportunities that can be exploited!
16. Query Autofiltering - Language Logic
Can provide a semblance of “natural language processing” by breaking a query into semantic parts and applying those appropriately
Natural Language Boolean vs Mathematical Boolean
Language usage of boolean terms like ‘AND’ and ‘OR’ is contextual!!
“show me green or blue shirts” is equivalent to “show me green and blue shirts”
The user means ‘both’ in each case so ‘and’ and ‘or’ are synonyms in this usage context!
but in “show me fast and inexpensive cars” - ‘and’ means AND!
Depends on field cardinality! Here ‘color’ is single-valued and ‘attributes’ is multi-valued.
Users understand this intuitively - Search Engines don’t but Query Autofilter can get this right!
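The cardinality rule above can be sketched as follows: values that land in a single-valued field are ORed (a shirt cannot be both green and blue in a single-valued `color` field), while values in a multi-valued field are ANDed. The function name, field names and the set of multi-valued fields are assumptions for illustration; a real implementation would read cardinality from the schema.

```python
def combine_values(field, values, multi_valued_fields):
    """AND the values only if the field can hold several values per document."""
    operator = " AND " if field in multi_valued_fields else " OR "
    return f"{field}:(" + operator.join(f'"{v}"' for v in values) + ")"


# Hypothetical schema knowledge: only 'attributes' is multi-valued.
MULTI_VALUED = {"attributes"}

print(combine_values("color", ["green", "blue"], MULTI_VALUED))
print(combine_values("attributes", ["fast", "inexpensive"], MULTI_VALUED))
```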
17. Query Autofiltering - Extensions - Query Patterns
Once you know what individual query terms and phrases mean, you can exploit this by creating templates for popular query patterns
Query Pattern: Terms + Facet fields that will be captured by Query AutoFilter
Query Template: Query template with placeholders for field values filled in if user query matches the pattern.
Example: Music Ontology
User Query: Who’s in The Who
Query Pattern: (who's in,was in,were in,member of,members of)|${hasPerformer_ss}
Query Template: memberOfGroup_ss:${hasPerformer_ss}
User Query: Songs Beatles Covered
Query Pattern: (song,songs)|${hasPerformer_ss}|covered
Query Template: hasPerformer_ss:${hasPerformer_ss} AND version_s:Cover
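One way to read the pattern and template syntax shown above is sketched below: each `|`-separated segment is either a parenthesized list of alternative phrases, a `${field}` placeholder that must match a known value of that field, or a literal word. This is an interpretation for illustration, not the actual Query AutoFilter implementation; the `lexicon` contents, the function names and the decision to quote the substituted value are assumptions, and the sketch does not handle a field appearing twice in one pattern.

```python
import re


def compile_pattern(pattern, lexicon):
    """Turn a '|'-separated query pattern into a compiled regular expression."""
    parts = []
    for segment in pattern.split("|"):
        placeholder = re.fullmatch(r"\$\{(\w+)\}", segment)
        if placeholder:
            field = placeholder.group(1)
            # Longest values first so multi-word names win over prefixes.
            values = sorted(lexicon[field], key=len, reverse=True)
            alternatives = "|".join(re.escape(v) for v in values)
            parts.append(f"(?P<{field}>{alternatives})")
        else:
            options = segment.strip("()").split(",")
            parts.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


def apply_template(user_query, pattern, template, lexicon):
    """Fill the template if the whole query matches the pattern, else None."""
    match = compile_pattern(pattern, lexicon).fullmatch(user_query.strip())
    if not match:
        return None
    filled = template
    for field, value in match.groupdict().items():
        filled = filled.replace("${" + field + "}", f'"{value}"')
    return filled


# Hypothetical lexicon of known field values.
lexicon = {"hasPerformer_ss": ["The Who", "Beatles"]}

print(apply_template(
    "Who's in The Who",
    "(who's in,was in,were in,member of,members of)|${hasPerformer_ss}",
    "memberOfGroup_ss:${hasPerformer_ss}",
    lexicon,
))
```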
19. Typeahead - Priming the Pump with Pivot Facet Patterns
Construct semantically meaningful phrases from multiple metadata fields
✦ Inverse of Query AutoFiltering - creates suggestions that we know how to process!!
✦ Uses Solr Pivot Facets to translate field patterns to suggested query phrases
Examples:
${hasPerformer_ss} ${Recording_Type_s}s => Beatles Songs, Led Zeppelin Songs, Billy Joel Songs, Frank Zappa Songs etc.
${genres_ss} ${Musician_Type_ss}s => Classical Pianists, Hard Rock Guitarists, Jazz Drummers
${Recording_Type_s}s ${hasPerformer_ss} Covered (with fq version_s:Cover) => Songs Jimi Hendrix Covered
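The translation from pivot facets to suggestion phrases might look like the sketch below. It walks a two-level pivot result in the shape Solr returns under `facet_pivot` (a list of `field`/`value`/`count` entries with nested `pivot` lists) and fills a phrase template. The sample rows and their counts are made up for illustration, and `pivot_phrases` is a hypothetical helper name.

```python
def pivot_phrases(pivot_rows, template):
    """Fill a phrase template from each outer/inner pair of a pivot facet."""
    phrases = []
    for outer in pivot_rows:
        for inner in outer.get("pivot", []):
            phrase = (
                template
                .replace("${" + outer["field"] + "}", str(outer["value"]))
                .replace("${" + inner["field"] + "}", str(inner["value"]))
            )
            phrases.append((phrase, inner["count"]))
    return phrases


# Illustrative pivot result for facet.pivot=hasPerformer_ss,Recording_Type_s
# (values and counts are invented for this sketch).
sample_pivot = [
    {"field": "hasPerformer_ss", "value": "Beatles", "count": 3,
     "pivot": [{"field": "Recording_Type_s", "value": "Song", "count": 3}]},
    {"field": "hasPerformer_ss", "value": "Led Zeppelin", "count": 2,
     "pivot": [{"field": "Recording_Type_s", "value": "Song", "count": 2}]},
]

suggestions = pivot_phrases(sample_pivot, "${hasPerformer_ss} ${Recording_Type_s}s")
print(suggestions)
```

Keeping the count alongside each phrase leaves room to rank suggestions by how many documents back them.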
20. Building a Suggester with Dynamic Context
Assertion: Facets can be used to find similar things.
Example: John Lennon and Paul McCartney share many attributes, activities and group memberships -> They are closely related entities.
Search Agendas: Users tend to have some high level goal when searching (e.g. Find out information about The Beatles)
Agendas can change in a session, but it is likely that queries issued within a short period of time will have a similar goal.
Conclusion: Meta-information from facets can be used to associate similar things or concepts within a search session.
21. Building a Suggester with Dynamic Context
Suggester Builder Design (Fusion Connector)
Uses Facet Queries against a Content Collection to create additional metadata for the Suggester or Typeahead Collection.
This contextual metadata can then be used for:
• Security Trimming of Typeahead suggestions
• Dynamic boosting of similar suggestions within a user session
22. Building a Suggester with Dynamic Context
Bring back other fields in addition to displayed suggestion text (i.e., the ones that were calculated using faceting)
If a query is used to search, temporarily store its associated metadata in a circular cache on the browser.
When submitting the next typeahead query, add the cached information from the queue as boost queries.
Type ‘j’ - get back:
Jai Johnny Johanson Bands
Jai Johnny Johanson Groups
J.J. Johnson
Jai Johnny Johanson
Juke Joint Jezebel
Juke Joint Jimmy
Just searched for ‘Paul McCartney’ then type ‘j’:
John Lennon
John Lennon Songs
John Lennon Songs Covered
James P Johnson Songs (?)
John Lennon Originals
Hey Jude
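The circular cache idea can be sketched with a bounded queue: only the metadata of the last few searches is kept, and older context falls out automatically as the user's agenda changes. The slide places this cache in the browser; the Python below is just a stand-in to show the data structure, and the class name, cache size and sample metadata are assumptions.

```python
from collections import deque


class SessionContext:
    """Keep the facet metadata of the most recent searches only."""

    def __init__(self, size=3):
        # A deque with maxlen silently drops the oldest entry when full.
        self._recent = deque(maxlen=size)

    def remember(self, metadata):
        """Store the field -> values metadata of a query that was searched."""
        self._recent.append(metadata)

    def merged(self):
        """Merge cached metadata into one field -> unique values mapping."""
        merged = {}
        for metadata in self._recent:
            for field, values in metadata.items():
                bucket = merged.setdefault(field, [])
                for value in values:
                    if value not in bucket:
                        bucket.append(value)
        return merged


context = SessionContext(size=2)
context.remember({"genres_ss": ["Jazz"]})
context.remember({"memberOfGroup_ss": ["Beatles", "Wings"]})
context.remember({"memberOfGroup_ss": ["Beatles"], "genres_ss": ["Rock"]})
print(context.merged())
```

The merged mapping is what would be turned into boost queries for the next typeahead request.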
23. Building a Suggester with Dynamic Context
Paul McCartney’s “Meta-informational Context”:
genres_ss: Rock, Rock & Roll, Soft Rock, Pop Rock
hasPerformer_ss: Beatles, Paul McCartney, José Feliciano, Jimi Hendrix, Joe Cocker, Aretha Franklin, Bon Jovi, Elvis Presley ( … and many more)
composer_ss: Paul McCartney, John Lennon, Ringo Starr, George Harrison, George Jackson, Michael Jackson, Sonny Bono
memberOfGroup_ss: Beatles, Wings
Dynamic Boost Query:
genres_ss:"Rock"^50 genres_ss:"Rock & Roll"^50 genres_ss:"Soft Rock"^50 genres_ss:"Pop Rock"^50
hasPerformer_ss:"Beatles"^50 hasPerformer_ss:"Paul McCartney"^50 hasPerformer_ss:"José Feliciano"^50 hasPerformer_ss:"Jimi Hendrix"^50
composer_ss:"Paul McCartney"^50 composer_ss:"John Lennon"^50 composer_ss:"Ringo Starr"^50 composer_ss:"George Harrison"^50
memberOfGroup_ss:"Beatles"^50 memberOfGroup_ss:"Wings"^50
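A boost query of this shape is straightforward to generate from a field -> values mapping. The sketch below is one possible way to do it; the function name is hypothetical, the default boost of 50 simply mirrors the slide, and only backslashes and double quotes are escaped inside the phrase.

```python
def boost_query(context, boost=50):
    """Build a space-separated boost query from field -> values metadata."""
    clauses = []
    for field, values in context.items():
        for value in values:
            # Escape characters that would break a quoted phrase.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            clauses.append(f'{field}:"{escaped}"^{boost}')
    return " ".join(clauses)


mccartney_context = {
    "genres_ss": ["Rock", "Rock & Roll"],
    "memberOfGroup_ss": ["Beatles", "Wings"],
}
print(boost_query(mccartney_context))
```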
24. Text Mining Analyses
Problem: Metadata needs to be improved for useful application of QAF (Query Autofiltering) (i.e. Real World)
Case 1: Extracting product type and product attributes metadata from short product descriptions in eCommerce data - dealing with precision and recall
Case 2: Large text documents. Want to extract keywords and assign categories to documents.
Interesting properties of facets when directed towards unstructured text:
Facet ratios of positive and negative queries yield “keyword clusters”
Document clustering of keyword cluster vectors gives crisp categories
25. Auto phrasing vs. Auto filtering
Auto Phrasing
- Multi-term phrases that refer to a single entity.
- Used as a workaround to Solr “Multi-term synonym problem”
- That is now fixed (as of 6.4.1 - thanks Steve Rowe!)
- Is Auto phrasing solution now obsolete?
- Answer: NOT!!!, that was exploiting a side effect of what it does!
- Uses knowledge from a phrase list to determine what is an auto phrase
- Works on tokenized text fields (implemented as a Lucene TokenFilter)
Query Auto Filtering
- Utilizes information from non-tokenized text fields
- inherently solves multi-term problem
Strategy for “unstructured text”:
Use auto phrasing to extract phrase metadata (keywords) from unstructured text
This metadata can then be consumed by Query Autofilter at search time.
26. Simple Keyword Analysis
“Unstructured” Text Lucene Analyzer with Auto Phrasing Extensions
Spark Job
Metadata we would like to have but don’t have - requires lots of manual curation == $$$
Have short descriptive text fields that can be mined to glean useful metadata such as product type, material, size.
Special Sauce Ingredients:
➡ Semantically pure lexicons (things, brands, attributes, dimensions, logos, materials) of key terms
➡ Auto phrasing-based Lucene Analysis to extract key terms and “stop phrases” (e.g. Mr Coffee)
➡ Expansions and Relations based on noun phrases in lexicon. Contextually aware management of precision and recall.
➡ Tricks to deal with “leather case for iPhone”, “DSLR camera with 50-mm lens”
27. Expansions and Relations
Motivation: eCommerce Use Case:
Search for ‘iPhone’
- get iPhone cases and iPhone chargers mixed in.
- Want to have both BUT want iPhones at the TOP of the result set.
=> TF/IDF doesn’t always deliver on this (can’t control relevance - you get what you get)
i.e. - want recall for up sell opportunities - so relax precision a bit. Relevance (what I want is on top) is still very important
Search for ‘iPhone case’
Now I want precision - just show me iPhone cases please ‘cause I already got a stinkin’ iPhone!! Why else would I be looking for accessories for it ???
28. Expansions and Relations
Noun phrases have structure:
end table, side table, dining room table, picnic table, coffee table, folding table => Are ALL types of tables
table cloth, table setting, table lamp, table chair => Are table related things.
Expansions - IS-A relationships
Phrases that end in ‘table’ are specific types of tables
classify ‘end table’ as ‘table’ too
=> search for ‘table’ returns all types of tables
=> search for ‘end table’ just returns end tables
Relations - IS-LIKE Relationships
Phrases that start with ‘table’ are table related things
Add table related things to fq for ‘table’ as OR list
Boost search term ‘table’ more than table related things - get both but tables are first
Table related things don’t have relations - search is more specific - just get that thing!
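The IS-A and IS-LIKE rules above can be sketched with simple string tests on a lexicon of noun phrases. This is a simplified reading under stated assumptions: the head noun is taken to be the last word of a phrase, only single-word search terms are given relations, and the `product_type` field, boost value and function names are hypothetical.

```python
def index_types(phrase):
    """IS-A expansion at index time: 'end table' is also indexed as 'table'."""
    words = phrase.split()
    return [phrase] if len(words) == 1 else [phrase, words[-1]]


def search_params(term, lexicon, field="product_type", boost=10):
    """IS-LIKE relation at query time: OR in phrases that start with the term.

    Multi-word terms (e.g. 'table lamp') get no relations, so the search
    stays specific to that thing.
    """
    related = []
    if len(term.split()) == 1:
        related = sorted(p for p in lexicon if p.startswith(term + " "))
    values = [term] + related
    fq = f"{field}:(" + " OR ".join(f'"{v}"' for v in values) + ")"
    # Boost the term itself so real tables rank above table related things.
    bq = f'{field}:"{term}"^{boost}'
    return {"fq": fq, "bq": bq}


# Hypothetical lexicon of product-type noun phrases.
lexicon = {"end table", "coffee table", "table cloth", "table lamp"}

print(index_types("end table"))
print(search_params("table", lexicon))
print(search_params("table lamp", lexicon))
```

Because 'end table' is also indexed as 'table', the `"table"` clause in the filter already covers every type of table, while the OR list pulls in the related accessories at a lower rank.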
29. Unstructured Text - Oh My!
The problem of unstructured text is that it is … well unstructured …. or is it? (Linguists don’t think so!)
We search but don’t typically facet on unstructured text fields (i.e. tokenized fields).
Even though in Solr we can facet on anything
- Get all of the tokenized terms and their counts as facet values -> very high cardinality
- Absolutely useless for UI drill in - so this is basically a no-no at query time
=> But that is not all that facets are good for so … wait a minute (light-bulb moment)! <=
What if we DID facet on the tokens and used their stats to do some text analysis?
=> It turns out we can use facets to detect keywords in documents. <=
Keywords
- Terms that occur in relatively few documents (but not too few).
- Tend to be important words in some subjects but not others - i.e. their usage is highly contextual to a subject!
Keywords for the same subject area tend to occur together because they share the same context!
Facets are a great context mining tool!! Sounds like a FIT!
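The "relatively few documents, but not too few" criterion can be expressed as a document-frequency band. The sketch below filters token facet counts accordingly; the thresholds (`min_docs`, `max_fraction`) and the sample counts are arbitrary illustrative values, not figures from the talk.

```python
def keyword_candidates(doc_freqs, num_docs, min_docs=5, max_fraction=0.1):
    """Keep terms whose document frequency is neither too low nor too high."""
    return sorted(
        term
        for term, df in doc_freqs.items()
        if df >= min_docs and df / num_docs <= max_fraction
    )


# Illustrative facet counts on a tokenized field (made-up numbers).
token_counts = {"the": 980, "guitar": 40, "lamborghini": 12, "xyzzy": 1}
print(keyword_candidates(token_counts, num_docs=1000))
```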
30. Facet Ratios => Keyword Clustering
Method to my Madness:
• Tokenize text with auto phrasing, stop words and synonyms - store tokens in a multi-valued field with DocValues - (yes you can facet on a text field but it tends to hit a wall - 2M word limit on facet values)
• Using the /terms handler, get each term in the text field.
• Submit two queries - one with text_field:[term] (positive Q) - one with -text_field:[term] (negative Q)
• Calculate the following ratio:
$$\text{ratio} = \frac{\text{Facet counts (positive Q)} \,/\, \text{Total counts (positive Q)}}{\text{Facet counts (negative Q)} \,/\, \text{Total counts (negative Q)}}$$
• Take the xlog(x) of this ratio (for better discrimination) - for each term, take the best related terms above some threshold
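The ratio and the x log(x) scoring step can be sketched as below. Two assumptions are made loudly here: "Total counts" is interpreted as the number of documents matched by each query, and terms that never appear in the negative result set are skipped to avoid dividing by zero (the slide does not say how that case is handled). The sample counts and threshold are invented for illustration.

```python
import math


def facet_ratio(pos_count, pos_total, neg_count, neg_total):
    """Rate of a facet value in the positive set over its rate in the negative set."""
    return (pos_count / pos_total) / (neg_count / neg_total)


def related_terms(pos_facets, pos_total, neg_facets, neg_total, threshold):
    """Score co-occurring terms with x*log(x) of the ratio and keep the best."""
    scored = []
    for term, pos_count in pos_facets.items():
        neg_count = neg_facets.get(term, 0)
        if neg_count == 0:
            continue  # assumption: skip terms absent from the negative set
        ratio = facet_ratio(pos_count, pos_total, neg_count, neg_total)
        score = ratio * math.log(ratio)
        if score > threshold:
            scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up facet counts for a positive query on one term and its negation.
pos_facets = {"guitar": 40, "the": 50}
neg_facets = {"guitar": 10, "the": 900}
cluster = related_terms(pos_facets, 50, neg_facets, 1000, threshold=1.0)
print([term for term, _ in cluster])
```

Common words score near zero because they are about as frequent with the term as without it, while terms that share the term's context stand out sharply.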