Payloads have been a powerful aspect of Lucene for a long time, but have only had limited exposure in Solr. The Tika project has only recently finished integrating the powerful Tesseract OCR library, bringing the prospect of OCR to the masses.
1. Payloads and OCR with Solr
OpenSource Connections October 2019
Apache Lucene/Solr - London User Group
2. Introductions
Eric Pugh: Search Relevance Engineer at OpenSource Connections
Daniel Worley: Search Relevance Engineer at OpenSource Connections
3. Searching Text Inside Images
http://pdf-discovery-demo.dev.o19s.com:8080 and search for “HELOC”
4. OCR
● Tesseract and Tika enable us to get text out of images via OCR
● Text can then be indexed into Solr
● Problem solved?
5. Highlighting
● What if we want to highlight the text in an image that matched?
● Regular Solr highlighting is not good enough
○ We can get text snippets, but we can’t see where they came from in the image
○ For images with a lot of information, this makes it hard for users to see why a particular image matched their query
6. The Problem
● Tesseract gives us bounding boxes for all of the OCR’d text
● We need to access this bounding box information within Solr on a per-match basis
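As a minimal sketch of the goal, each OCR'd word could carry its bounding box as a payload in the `term|payload` delimited form. The `x,y,w,h` layout, the `OcrWord` type, and the helper name are assumptions made for illustration; they are not Tesseract's output format or the demo's actual encoding.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """Hypothetical container for one OCR'd word and its pixel bounding box."""
    text: str
    x: int
    y: int
    width: int
    height: int


def to_delimited_tokens(words, delimiter="|"):
    """Render words as 'term|x,y,w,h' tokens separated by spaces (assumed layout)."""
    return " ".join(
        f"{w.text}{delimiter}{w.x},{w.y},{w.width},{w.height}" for w in words
    )


# Example values are made up for illustration.
words = [OcrWord("HELOC", 120, 48, 96, 22), OcrWord("statement", 224, 48, 140, 22)]
field_value = to_delimited_tokens(words)
print(field_value)
```

A field value built this way keeps the box next to the word it belongs to, which is what a per-match lookup needs.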
7. What about Payloads?
● Payloads provide a way of attaching various metadata to each token
● More info
https://www.slideshare.net/lucidworks/payloads-in-solr-erik-hatcher-lucidworks
10. The Challenge
● Payloads are typically used at query time for matching or to affect the score of matching documents
● There is little support for surfacing payload data at query time without manually extracting it again from the stored data
11. Iteration 1 - Idea
Create a highlighter formatter that surfaces payload attributes
12. Iteration 1 - Results
● Required hacking on low-level Lucene internals to include the payload attribute in the token stream
● Suitable for a PoC, not great for any real application
13. Iteration 2 - Idea
Create a component that returns payloads only for the query clauses that matched
14. Iteration 2 - Results
A deployable plugin that doesn’t require hacking on Lucene to work
15. Payload Component - What’s in the box?
● Payload Component
● And some conveniences:
○ Base64Encoder
○ PayloadBufferFilterFactory
● Available at: https://github.com/o19s/payload-component
16. Payload Component
● Similar to the highlighting component but returns matches only
● Currently no scoring of matches
● For each match, add the payload data to the response if available
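The behaviour described above can be pictured with a small conceptual model. This is a sketch only and does not use the plugin's code or Solr APIs: the function name, the `(term, payload)` tuple shape, and the case-insensitive matching are all assumptions.

```python
def payloads_for_matches(indexed_tokens, query_terms):
    """Return {term: [payload, ...]} for indexed tokens whose term is in the query.

    Tokens that match but carry no payload are skipped, mirroring
    "add the payload data to the response if available".
    """
    wanted = {t.lower() for t in query_terms}
    matches = {}
    for term, payload in indexed_tokens:
        key = term.lower()
        if key in wanted and payload is not None:
            matches.setdefault(key, []).append(payload)
    return matches


# Hypothetical indexed tokens: (term, payload) pairs, payload may be None.
tokens = [
    ("heloc", "120,48,96,22"),
    ("statement", "224,48,140,22"),
    ("heloc", "80,300,96,22"),
    ("page", None),
]
result = payloads_for_matches(tokens, ["HELOC", "page"])
print(result)
```

No scoring is involved: every occurrence of a matching term contributes its payload, in stream order.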
17. PayloadBufferFilterFactory
● A filter to work around payload oddities in Solr
● Filters that produce new tokens often remove all attributes, including payloads
● This filter copies the payload data and restores it after the other filters have run
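The buffer-and-restore idea can be simulated outside Solr. The sketch below is an illustration, not the plugin's implementation: tokens are plain dicts, payloads are saved by stream position, and the lossy filter is assumed to emit exactly one token per input token.

```python
def buffer_payloads(tokens):
    """Save each token's payload (or None) keyed by its position in the stream."""
    return {pos: tok.get("payload") for pos, tok in enumerate(tokens)}


def lossy_lowercase_filter(tokens):
    """Stand-in for a filter that emits fresh tokens and keeps only the term."""
    return [{"term": tok["term"].lower()} for tok in tokens]


def restore_payloads(tokens, saved):
    """Reattach saved payloads by position; tokens with nothing saved stay bare."""
    restored = []
    for pos, tok in enumerate(tokens):
        payload = saved.get(pos)
        if payload is not None:
            restored.append({**tok, "payload": payload})
        else:
            restored.append(dict(tok))
    return restored


stream = [{"term": "HELOC", "payload": "120,48,96,22"}, {"term": "Statement"}]
saved = buffer_payloads(stream)            # copy payloads before the lossy step
filtered = lossy_lowercase_filter(stream)  # payloads are gone here
final = restore_payloads(filtered, saved)  # payloads are back
print(final)
```

Keying by position only works while filters preserve the token count; a filter that splits or drops tokens would need a different bookkeeping scheme.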
20. Base64Encoder Cont’d
● To get around this problem, the payload can be encoded in Base64
○ dog|YmFya3Mgd29vZnM=
● The Base64Encoder accepts Base64 data at index time but stores the decoded version
○ YmFya3Mgd29vZnM= -> barks woofs
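The round trip on this slide can be reproduced with Python's standard `base64` module. This sketch only shows the encoding itself; the helper names are hypothetical and nothing here calls the plugin's Base64Encoder.

```python
import base64


def encode_payload(raw: str) -> str:
    """Base64-encode a payload so it contains no spaces or delimiter characters."""
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


def decode_payload(encoded: str) -> str:
    """Recover the original payload text from its Base64 form."""
    return base64.b64decode(encoded).decode("utf-8")


# Build the delimited token from the slide, then split and decode it again.
token = "dog|" + encode_payload("barks woofs")
term, encoded = token.split("|", 1)
decoded = decode_payload(encoded)
print(token)
print(decoded)
```

Because the Base64 alphabet contains neither whitespace nor `|`, the encoded payload survives whitespace tokenization and delimiter splitting intact.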
21. The Future: Matches Component
● Surface which terms/phrases from the query matched
● Surface payload attribute data (already provided by the payload component)
● Surface other data from the index such as offsets
22. Thanks
● PayloadComponent Repo: https://github.com/o19s/payload-component
● Demo Repo: https://github.com/o19s/pdf-discovery-demo
Big thanks to Dan Worley, Andrew Boyd, and a brave client for working with me to make this idea happen!
Interested in Relevance? Join us at www.o19s.com/slack to chat with your peers.