Faceted search is a powerful technique that lets users easily navigate search results. It can also be used to develop rich user interfaces that give an analyst quick insights into the document space. In this session I will introduce the Facets module, explain how to use it, and cover under-the-hood details as well as optimizations and best practices. I will also describe advanced faceted search capabilities with Lucene Facets.
3. Who Am I
• Working at IBM – Information Retrieval Research
• Lucene/Solr committer and PMC member
• http://shaierera.blogspot.com
• shaie@apache.org
5. Faceted Search
• Technique for accessing documents that were classified into a taxonomy of categories
  – Flat: Author/John Doe, Tags/Lucene, Popularity/High
  – Hierarchical: Computers/Software/Information Retrieval/Fulltext/Apache Lucene (ODP)
• Quick overview of the breakdown of the search results
  – How many documents are in category Committed Paths/lucene/core vs. Committed Paths/lucene/facet
• Simplifies interaction with the search application
  – Drilldown to issues that were updated in Past 2 days by clicking a link
  – No knowledge required about search syntax and index schema
http://jirasearch.mikemccandless.com
6. Lucene Facets
• Contributed by IBM in 2011, released in 3.4.0
• Major changes since 4.1.0
  – NRT support
  – Nearly 400% search speedups
  – Complete API revamp
  – New features (SortedSet, range faceting, drill-sideways)
• Two main indexing-time modes
  – Taxonomy-based: hierarchical facets, managed by a sidecar index, low NRT reopen cost
  – SortedSetDocValues: flat facets only, no sidecar index, higher NRT reopen cost
• Runtime modes
  – Range facets (on NumericDocValues fields)
• Other implementations: Solr, ElasticSearch, Bobo Browse
7. Lucene Facet Components
• TaxonomyWriter/Reader
  – Manage the taxonomy information
• FacetFields
  – Add facets information to documents (DocValues fields, drilldown terms)
• FacetRequest
  – Defines which facets to aggregate and the FacetsAggregator (aggregation function)
• FacetsCollector
  – Collects matching documents and computes the top-K categories for each facet request (invokes FacetsAccumulator)
• DrillDownQuery / DrillSideways
  – Execute drilldown and drill-sideways requests
8. Sample Code – Indexing
// Builds the taxonomy as documents are indexed, multi-threaded, single instance
TaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
// Adds facets information to a document, can be initialized once per thread
FacetFields facetFields = new FacetFields(taxoWriter);
// List of categories to add to the document
List<CategoryPath> cats = new ArrayList<CategoryPath>();
cats.add(new CategoryPath("Author", "Erik Hatcher"));
cats.add(new CategoryPath("Author/Otis Gospodnetić", '/'));
cats.add(new CategoryPath("Pub Date", "2004", "December", "1"));
Document bookDoc = new Document();
bookDoc.add(new TextField("title", "lucene in action", Store.YES));
// add categories fields (DocValues, Postings)
facetFields.addFields(bookDoc, cats);
// index the document
indexWriter.addDocument(bookDoc);
9. Sample Code – Search
// Open an NRT TaxonomyReader
TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);
// Define the facets to aggregate (top-10 categories for each)
FacetSearchParams fsp = new FacetSearchParams();
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Author"), 10));
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Pub Date"), 10));
// Collect both top-K facets and top-N matching documents
TopDocsCollector tdc = TopScoreDocCollector.create(10, true);
FacetsCollector fc = FacetsCollector.create(fsp, indexReader, taxoReader);
Query q = new TermQuery(new Term("title", "lucene"));
searcher.search(q, MultiCollector.wrap(tdc, fc));
// Traverse the top facets
for (FacetResult fres : fc.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  // value is a double, so cast it before formatting with %d
  System.out.println(String.format("%s (%d)", root.label, (int) root.value));
  for (FacetResultNode cat : root.getSubResults()) {
    System.out.println("  " + cat.label.components[0] + " (" + cat.value + ")");
  }
}
10. Drilldown and Drill-Sideways
• Drilldown adds a filter to the search
  – Multiple categories can be OR'd
// Drilldown – filter results to "Component/core/index";
// All other "Component/*" and "Component/core/*" get count 0
Query base = new MatchAllDocsQuery();
DrillDownQuery ddq = new DrillDownQuery(facetIndexingParams, base);
ddq.add(new CategoryPath("Component/core/index", '/'));
• Drill-sideways applies the drilldown, yet still aggregates the "sideways" categories
// Drill-Sideways – drilldown on "Component/core/index";
// Other "Component/*" and "Component/core/*" are counted too
DrillSideways ds = new DrillSideways(searcher, taxoReader);
DrillSidewaysResult sidewaysRes = ds.search(null, ddq, 10, fsp);
http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
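The difference between the two modes can be illustrated without Lucene. The following is a minimal sketch in plain Java, not the DrillSideways implementation: `DrillSidewaysSketch` and `countDim` are hypothetical names, each document is assumed to carry a single value per dimension, and categories that do not match are left out of the result instead of being reported with count 0.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Lucene facet module.
class DrillSidewaysSketch {
    /**
     * Counts the values of one dimension over the documents that pass the drilldowns.
     * Assumption: a document is a map from dimension to a single value.
     */
    static Map<String, Integer> countDim(List<Map<String, String>> docs,
                                         Map<String, String> drillDowns,
                                         String dim, boolean sideways) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : docs) {
            boolean match = true;
            for (Map.Entry<String, String> dd : drillDowns.entrySet()) {
                // Sideways: ignore the filter on the dimension being counted
                if (sideways && dd.getKey().equals(dim)) {
                    continue;
                }
                if (!dd.getValue().equals(doc.get(dd.getKey()))) {
                    match = false;
                    break;
                }
            }
            if (match && doc.containsKey(dim)) {
                counts.merge(doc.get(dim), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With a drilldown on Component=core/index, the plain drilldown count for Component contains only core/index, while the sideways count still reports every other Component value.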
11. Dynamic Facets
• Range facets on NumericDocValues fields
  – Define the buckets of interest at query time
  – Supports any arbitrary ValueSource (Lucene 4.6.0)
// Aggregate matching documents into buckets
RangeAccumulator a = new RangeAccumulator(new RangeFacetRequest<LongRange>("field",
    new LongRange("1-5", 1L, true, 5L, true),
    new LongRange("6-20", 6L, true, 20L, true),
    // bounds are inclusive so that 21 and 100 land in the "21-100" bucket
    new LongRange("21-100", 21L, true, 100L, true),
    new LongRange("over 100", 100L, false, Long.MAX_VALUE, true)));
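To make the inclusive/exclusive flags concrete, here is a small stand-alone sketch of range bucketing in plain Java. `RangeBucketsSketch` and `Bucket` are hypothetical stand-ins rather than Lucene classes, and the sketch assumes the "21-100" bucket is meant to include both of its endpoints.

```java
// Hypothetical illustration only; mimics the (label, min, minInclusive, max, maxInclusive) shape.
class RangeBucketsSketch {
    static final class Bucket {
        final String label;
        final long min, max;
        final boolean minInclusive, maxInclusive;

        Bucket(String label, long min, boolean minInclusive, long max, boolean maxInclusive) {
            this.label = label;
            this.min = min;
            this.minInclusive = minInclusive;
            this.max = max;
            this.maxInclusive = maxInclusive;
        }

        boolean accept(long v) {
            boolean aboveMin = minInclusive ? v >= min : v > min;
            boolean belowMax = maxInclusive ? v <= max : v < max;
            return aboveMin && belowMax;
        }
    }

    /** Returns one count per bucket; a value that fits no bucket is not counted. */
    static int[] count(long[] values, Bucket... buckets) {
        int[] counts = new int[buckets.length];
        for (long v : values) {
            for (int i = 0; i < buckets.length; i++) {
                if (buckets[i].accept(v)) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }
}
```

Note that an exclusive bound on both sides of 100 would leave the value 100 in no bucket at all, which is why the bounds of adjacent buckets need to be chosen together.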
12. Facet Associations
• Not all facets are created equal
  – Categories added by an automatic categorization system, e.g. Category/Apache Lucene (0.74) (confidence level is 0.74)
  – Important metadata about the facet, e.g. Contracts/US ($5M) (total $$$ generated from contracts)
  – Complex structures, e.g. Users/Shai Erera (lastAccess=YYYY/MM/DD, numUpdates=8…)
• Categories can have values associated with them per document
  – They are later aggregated by these values
  – NOTE: ≠ NumericDocValuesFields!
• Facet associations are completely customizable – encoded as a byte[] per document
http://shaierera.blogspot.com/2013/01/facet-associations.html
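Aggregating by association value instead of by count can be sketched as follows. This is a simplified illustration in plain Java with hypothetical names (`AssociationsSketch`, `sumAssociations`); it assumes a float association per category and keeps the values in a map, whereas the module encodes them as a byte[] per document.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the Lucene associations API.
class AssociationsSketch {
    /** Each matching document maps a category to its associated value (e.g. a confidence). */
    static Map<String, Float> sumAssociations(List<Map<String, Float>> matchingDocs) {
        Map<String, Float> sums = new TreeMap<>();
        for (Map<String, Float> doc : matchingDocs) {
            for (Map.Entry<String, Float> e : doc.entrySet()) {
                // Weight the category by its association value instead of adding 1
                sums.merge(e.getKey(), e.getValue(), Float::sum);
            }
        }
        return sums;
    }
}
```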
13. More Features
• Complements
  – Holds the count of each category in-memory, per IndexReader
  – When the number of search results is >50% of the index, count the "complement set" instead
  – Useful for "overview" queries, e.g. MatchAllDocsQuery
• Sampling
  – Aggregate a sampled set of the search results
  – Optionally re-count top-K facets for accurate values
• Partitions
  – Partition the taxonomy space to control memory usage during faceted search
  – Useful for very big taxonomies (10s of millions of categories)
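The complements idea can be sketched in a few lines of plain Java. This is an illustration under simplifying assumptions, not the module's code: `ComplementSketch` is a hypothetical name, every document has exactly one category ordinal, and `totalCounts` stands for the per-reader cache of category counts.

```java
import java.util.BitSet;

// Hypothetical illustration only; assumes one category ordinal per document.
class ComplementSketch {
    /**
     * docCategory[doc] is the category ordinal of each document,
     * totalCounts[ord] is the cached count of each category over the whole index.
     */
    static int[] count(int[] docCategory, int[] totalCounts, BitSet matches) {
        int numDocs = docCategory.length;
        int[] counts = new int[totalCounts.length];
        if (matches.cardinality() * 2 > numDocs) {
            // More than half the index matches: visit the smaller complement set
            // and subtract it from the cached totals.
            System.arraycopy(totalCounts, 0, counts, 0, counts.length);
            for (int doc = matches.nextClearBit(0); doc < numDocs; doc = matches.nextClearBit(doc + 1)) {
                counts[docCategory[doc]]--;
            }
        } else {
            for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
                counts[docCategory[doc]]++;
            }
        }
        return counts;
    }
}
```

Both branches return the same counts; the complement branch simply touches fewer documents when most of the index matches.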
15. The Taxonomy Index
• The taxonomy maps categories to integer codes (referred to as ordinals)
  – Kind of like a Map<CategoryPath,Integer>, with hierarchy support
  – Provides taxonomy browsing services
  – DirectoryTaxonomyWriter is managed as a sidecar Lucene index
• Categories are broken down to their path components, e.g. Date/2012/March/20 becomes:
  – Date, with ordinal=1
  – Date/2012, with ordinal=2
  – Date/2012/March, with ordinal=3
  – Date/2012/March/20, with ordinal=4
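The "map with hierarchy support" view can be sketched as a toy in-memory structure. `ToyTaxonomy` is a hypothetical class for illustration, not DirectoryTaxonomyWriter; it assumes ordinal 0 is reserved for the root, which is why Date receives ordinal 1 as in the example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; an in-memory toy, not the sidecar index.
class ToyTaxonomy {
    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<Integer> parents = new ArrayList<>();

    ToyTaxonomy() {
        ordinals.put("", 0); // assumption: ordinal 0 is the root
        parents.add(-1);     // the root has no parent
    }

    /** Adds a category and all of its ancestors; returns the ordinal of the full path. */
    int addCategory(String... components) {
        int parent = 0;
        StringBuilder path = new StringBuilder();
        for (String component : components) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(component);
            Integer ord = ordinals.get(path.toString());
            if (ord == null) {
                ord = parents.size();
                ordinals.put(path.toString(), ord);
                parents.add(parent);
            }
            parent = ord;
        }
        return parent;
    }

    int getParent(int ordinal) {
        return parents.get(ordinal);
    }

    int size() {
        return parents.size();
    }
}
```

Adding Date/2012/March/20 to an empty `ToyTaxonomy` assigns ordinals 1 through 4 to the four path prefixes, and adding Date/2013 afterwards reuses ordinal 1 as its parent.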
16. The Search Index
• Categories are added as drilldown terms, e.g. for Date/2012/March/20:
  – $facets:Date
  – $facets:Date/2012
  – …
• All category ordinals associated with the document are added as a BinaryDocValuesField
  – The ordinals of all path components are added, not just the leaf's
  – Encoded as VInt + gap for efficient compression and speed
    • Other compression methods attempted, but were slower to decode (LUCENE-4609)
  – Used during faceted search to read all the associated ordinals and aggregate accordingly (e.g. count)
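The "VInt + gap" idea is to sort the ordinals, store the differences between neighbours, and write each difference with a variable number of bytes. The sketch below is an illustration in plain Java with hypothetical names (`DGapVIntSketch`); it uses a generic low-byte-first variable-length integer and assumes non-negative ordinals, so its byte layout is not necessarily the one the module's encoder produces.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration only; byte layout may differ from Lucene's encoder.
class DGapVIntSketch {
    static byte[] encode(int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sorted) {
            int gap = ord - prev; // small gaps need fewer bytes
            prev = ord;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80); // 7 payload bits plus a continuation bit
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes) {
        int[] result = new int[bytes.length];
        int n = 0, prev = 0, value = 0, shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7; // more bytes follow for this gap
            } else {
                prev += value; // undo the gap encoding
                result[n++] = prev;
                value = 0;
                shift = 0;
            }
        }
        return Arrays.copyOf(result, n);
    }
}
```

Because a document carries the ordinals of all path components, consecutive ordinals such as 1, 2, 3, 4 are common, and each of them costs a single byte after gap encoding.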
17. SortedSet Facets
• SortedSetFacetFields add SortedSetDocValuesFields and drilldown terms to documents
• Local-segment SortedSet ordinals are mapped to global ones through SortedSetDocValuesReaderState
• Use SortedSetDocValuesAccumulator to accumulate SortedSet facets
• Advantages:
  – Taxonomy representation requires less RAM (flat taxonomy)
  – No sidecar index
  – Tie-breaks by label-sort order
• Disadvantages:
  – Not full taxonomy
  – Overall uses more RAM (local-to-global ordinal mapping)
  – Adds NRT reopen cost
  – Slower than taxonomy-based facets
18. Global Ordinals
• Per-segment integer codes (as used by the SortedSet approach) are less efficient
  – Different ordinals for same categories across segments
  – Hold in-memory codes map (e.g. local-to-global) – more RAM and less scalable
  – Resolve top-K on the String representation of categories – more CPU
• Global ordinals allow efficient per-segment faceting and aggregation
  – No translation maps required (no extra RAM, highly scalable)
  – Aggregation, top-K computation done on integer codes
• But, do not play well with IndexWriter.addIndexes(Directory…)
  – Must use IndexWriter.addIndexes(IndexReader…), so that the ordinals in the input search index are mapped to the destination's
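The cost of per-segment codes comes from the translation map that must be built and held in memory. A minimal sketch of such a local-to-global mapping in plain Java follows; `GlobalOrdinalsSketch` and its methods are hypothetical names, and each segment is assumed to expose its distinct labels in sorted order with the array index serving as the local ordinal.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration only; not SortedSetDocValuesReaderState.
class GlobalOrdinalsSketch {
    /** Merges the sorted per-segment labels into one sorted global label list. */
    static String[] globalLabels(String[][] segmentLabels) {
        TreeSet<String> all = new TreeSet<>();
        for (String[] labels : segmentLabels) {
            all.addAll(Arrays.asList(labels));
        }
        return all.toArray(new String[0]);
    }

    /** maps[segment][localOrdinal] = global ordinal; this is the extra RAM being paid. */
    static int[][] localToGlobal(String[][] segmentLabels, String[] global) {
        int[][] maps = new int[segmentLabels.length][];
        for (int s = 0; s < segmentLabels.length; s++) {
            maps[s] = new int[segmentLabels[s].length];
            for (int local = 0; local < segmentLabels[s].length; local++) {
                maps[s][local] = Arrays.binarySearch(global, segmentLabels[s][local]);
            }
        }
        return maps;
    }

    /** docOrds[segment] holds the local ordinal of every matching document in that segment. */
    static int[] count(int[][] docOrds, int[][] maps, int numGlobal) {
        int[] counts = new int[numGlobal];
        for (int s = 0; s < docOrds.length; s++) {
            for (int local : docOrds[s]) {
                counts[maps[s][local]]++;
            }
        }
        return counts;
    }
}
```

With taxonomy ordinals the `maps` array disappears entirely, since every segment already stores the same integer for the same category.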
19. Two-Phase Aggregation
• FacetsCollector works in two steps:
  – Collects matching documents (and optionally their scores)
  – Invokes FacetsAccumulator to accumulate the top-K facets
• Performance tests show that this improves faceted search (LUCENE-4600)
  – Locality of reference?
• Useful for Sampling and Complements
  – Hard to do otherwise
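The two steps can be sketched in plain Java as follows. This is a simplified illustration with hypothetical names (`TwoPhaseSketch`, `collect`, `accumulate`, `topK`), not the FacetsCollector code; it assumes the ordinals of each document are available in an in-memory array and breaks ties by the lower ordinal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the FacetsCollector/FacetsAccumulator API.
class TwoPhaseSketch {
    private final List<Integer> matchingDocs = new ArrayList<>();

    /** Phase 1: only remember which documents matched. */
    void collect(int docId) {
        matchingDocs.add(docId);
    }

    /** Phase 2: read the ordinals of the remembered documents and count them. */
    int[] accumulate(int[][] docOrdinals, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int doc : matchingDocs) {
            for (int ord : docOrdinals[doc]) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /** Returns the k ordinals with the highest non-zero counts. */
    static List<Integer> topK(int[] counts, int k) {
        List<Integer> ords = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                ords.add(i);
            }
        }
        ords.sort((a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        return ords.subList(0, Math.min(k, ords.size()));
    }
}
```

Keeping the full set of matching documents until the second phase is what makes sampling and complements straightforward: both need to know the whole result set before deciding which documents to aggregate.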
20. FacetIndexingParams
• Determine how facets are encoded
  – Partition size
  – Facet delimiter character (for drilldown terms, default \u001F)
  – CategoryListParams
• CategoryListParams holds parameters for a category list
  – Encoder/Decoder (default DGapVInt)
  – OrdinalPolicy (how path components are encoded): ALL_PARENTS, NO_PARENTS and ALL_BUT_DIMENSION (default)
• CategoryListParams can be used to group facets together
  – Default: all facets are put in the same "category list" (i.e. one BinaryDocValues field)
  – Expert: separate categories by dimension into different category lists
    • Useful when sets of categories are always aggregated together, but not with other categories
• FacetIndexingParams are currently not recorded per-segment, so be careful if you change them on an existing index!