SlideShare ist ein Scribd-Unternehmen logo
1 von 22
Downloaden Sie, um offline zu lesen
Payloads and OCR with Solr
OpenSource Connections October 2019
Apache Lucene/Solr - London User Group
Introductions
Eric Pugh: Search Relevance Engineer at OpenSource Connections
Daniel Worley: Search Relevance Engineer at OpenSource Connections
Searching Text Inside Images
http://pdf-discovery-demo.dev.o19s.com:8080 and search for “HELOC”
OCR
● Tesseract and Tika enable us to get text out of images via OCR
● Text can then be indexed into Solr
● Problem solved?
Highlighting
● What if we want to highlight the text in an image that matched?
● Regular Solr Highlighting not good enough
○ We can get text snippets but we can’t see where they came from in the
image
○ For images with a lot of information this can make it hard for users to
see why a particular image matched their query
The Problem
● Tesseract has provided us bounding boxes for all of the OCR’d text
● We need to access this bounding box information within Solr on a per
match basis
What about Payloads?
● Payloads provide a way of attaching various metadata to each token
● More info
https://www.slideshare.net/lucidworks/payloads-in-solr-erik-hatcher-luci
dworks
But ....
The Challenge
● Payloads are typically used at query time for matching or to affect the
score of matching documents.
● Not much in the area of surfacing payload data at query time without
manually extracting it again from the stored data
Iteration 1 - Idea
Create a highlighter formatter that surfaces payload attributes
Iteration 1 - Results
● Required hacking at low level Lucene internals to include the payload
attribute in the token stream.
● Suitable for a PoC, not great for any real applications
Iteration 2 - Idea
Create a component that only returns payloads for clauses that matched in
the query
Iteration 2 - Results
A deployable plugin that doesn’t require hacking on Lucene to work
Payload Component - What’s in the box?
● Payload Component
● And some conveniences:
○ Base64Encoder
○ PayloadBufferFilterFactory
● Available at: https://github.com/o19s/payload-component
Payload Component
● Similar to the highlighting component but returns matches only
● Currently no scoring of matches
● For each match, add the payload data to the response if available
PayloadBufferFilterFactory
● A filter to work around payload oddities in Solr
● Filters that produce new tokens often remove all attributes, which
includes payloads.
● This filter will copy the Payload data and restore it later on after other
filters have been run.
PayloadBufferFilterFactory
COPY->
PASTE->
Base64Encoder
● The DelimitedPayloadTokenFilterFactory expects data as:
○ [term][delimiter][payload]
● What about dog|barks woofs?
○ Will “woofs” be included as part of the payload?
Base64Encoder Cont’d
● To get around this problem, the payload can be encoded in Base64
○ dog|YmFya3Mgd29vZnM=
● The Base64Encoder will accept Base64 data at index time but store it
out as the decoded version.
○ YmFya3Mgd29vZnM= -> barks woofs
The Future: Matches Component
● Surface which terms/phrases from the query matched
● Surface payload attribute data that’s already included in the payload
component
● Surface other data from the index such as offsets
Thanks
● PayloadComponent Repo: https://github.com/o19s/payload-component
● Demo Repo: https://github.com/o19s/pdf-discovery-demo
Big thanks to Dan Worley and Andrew Boyd and a brave
client for working with me to make this idea happen!
Interested in Relevance? Join us at www.o19s.com/slack to chat with your
peers.

Weitere ähnliche Inhalte

Was ist angesagt?

Language Integrated Query - LINQ
Language Integrated Query - LINQLanguage Integrated Query - LINQ
Language Integrated Query - LINQDoncho Minkov
 
Advanced Reflection in Pharo
Advanced Reflection in PharoAdvanced Reflection in Pharo
Advanced Reflection in PharoMarcus Denker
 
Lerman Vvs13 Entity Framework 4 And Wcf
Lerman Vvs13 Entity Framework 4 And WcfLerman Vvs13 Entity Framework 4 And Wcf
Lerman Vvs13 Entity Framework 4 And WcfJulie Lerman
 
Lerman Adx303 Entity Framework 4 In Aspnet
Lerman Adx303 Entity Framework 4 In AspnetLerman Adx303 Entity Framework 4 In Aspnet
Lerman Adx303 Entity Framework 4 In AspnetJulie Lerman
 
C++ Actor Model - You’ve Got Mail ...
C++ Actor Model - You’ve Got Mail ...C++ Actor Model - You’ve Got Mail ...
C++ Actor Model - You’ve Got Mail ...Gianluca Padovani
 
Java.util.concurrent.concurrent hashmap
Java.util.concurrent.concurrent hashmapJava.util.concurrent.concurrent hashmap
Java.util.concurrent.concurrent hashmapSrinivasan Raghvan
 
Introduction to Web Scraping with Python
Introduction to Web Scraping with PythonIntroduction to Web Scraping with Python
Introduction to Web Scraping with PythonOlga Scrivner
 
Whats new in .NET for 2019
Whats new in .NET for 2019Whats new in .NET for 2019
Whats new in .NET for 2019Rory Preddy
 
Lec 4 06_aug [compatibility mode]
Lec 4 06_aug [compatibility mode]Lec 4 06_aug [compatibility mode]
Lec 4 06_aug [compatibility mode]Palak Sanghani
 
.NET Core, ASP.NET Core Course, Session 17
.NET Core, ASP.NET Core Course, Session 17.NET Core, ASP.NET Core Course, Session 17
.NET Core, ASP.NET Core Course, Session 17aminmesbahi
 
Entity Framework 4 In Microsoft Visual Studio 2010 - ericnel
Entity Framework 4 In Microsoft Visual Studio 2010 - ericnelEntity Framework 4 In Microsoft Visual Studio 2010 - ericnel
Entity Framework 4 In Microsoft Visual Studio 2010 - ericnelukdpe
 
MVC and Entity Framework
MVC and Entity FrameworkMVC and Entity Framework
MVC and Entity FrameworkJames Johnson
 
Stateful patterns in Azure Functions
Stateful patterns in Azure FunctionsStateful patterns in Azure Functions
Stateful patterns in Azure FunctionsMassimo Bonanni
 
Multithreading and concurrency in android
Multithreading and concurrency in androidMultithreading and concurrency in android
Multithreading and concurrency in androidRakesh Jha
 
The .net remote systems
The .net remote systemsThe .net remote systems
The .net remote systemsRaghu nath
 

Was ist angesagt? (20)

Language Integrated Query - LINQ
Language Integrated Query - LINQLanguage Integrated Query - LINQ
Language Integrated Query - LINQ
 
Understanding linq
Understanding linqUnderstanding linq
Understanding linq
 
Advanced Reflection in Pharo
Advanced Reflection in PharoAdvanced Reflection in Pharo
Advanced Reflection in Pharo
 
Lerman Vvs13 Entity Framework 4 And Wcf
Lerman Vvs13 Entity Framework 4 And WcfLerman Vvs13 Entity Framework 4 And Wcf
Lerman Vvs13 Entity Framework 4 And Wcf
 
Lerman Adx303 Entity Framework 4 In Aspnet
Lerman Adx303 Entity Framework 4 In AspnetLerman Adx303 Entity Framework 4 In Aspnet
Lerman Adx303 Entity Framework 4 In Aspnet
 
C++ Actor Model - You’ve Got Mail ...
C++ Actor Model - You’ve Got Mail ...C++ Actor Model - You’ve Got Mail ...
C++ Actor Model - You’ve Got Mail ...
 
Android with kotlin course
Android with kotlin courseAndroid with kotlin course
Android with kotlin course
 
Java.util.concurrent.concurrent hashmap
Java.util.concurrent.concurrent hashmapJava.util.concurrent.concurrent hashmap
Java.util.concurrent.concurrent hashmap
 
Introduction to Web Scraping with Python
Introduction to Web Scraping with PythonIntroduction to Web Scraping with Python
Introduction to Web Scraping with Python
 
The OCLforUML Profile
The OCLforUML ProfileThe OCLforUML Profile
The OCLforUML Profile
 
Java 8 streams
Java 8 streams Java 8 streams
Java 8 streams
 
Whats new in .NET for 2019
Whats new in .NET for 2019Whats new in .NET for 2019
Whats new in .NET for 2019
 
Lec 4 06_aug [compatibility mode]
Lec 4 06_aug [compatibility mode]Lec 4 06_aug [compatibility mode]
Lec 4 06_aug [compatibility mode]
 
.NET Core, ASP.NET Core Course, Session 17
.NET Core, ASP.NET Core Course, Session 17.NET Core, ASP.NET Core Course, Session 17
.NET Core, ASP.NET Core Course, Session 17
 
Entity Framework 4 In Microsoft Visual Studio 2010 - ericnel
Entity Framework 4 In Microsoft Visual Studio 2010 - ericnelEntity Framework 4 In Microsoft Visual Studio 2010 - ericnel
Entity Framework 4 In Microsoft Visual Studio 2010 - ericnel
 
MVC and Entity Framework
MVC and Entity FrameworkMVC and Entity Framework
MVC and Entity Framework
 
Stateful patterns in Azure Functions
Stateful patterns in Azure FunctionsStateful patterns in Azure Functions
Stateful patterns in Azure Functions
 
Multithreading and concurrency in android
Multithreading and concurrency in androidMultithreading and concurrency in android
Multithreading and concurrency in android
 
The .net remote systems
The .net remote systemsThe .net remote systems
The .net remote systems
 
Link quries
Link quriesLink quries
Link quries
 

Ähnlich wie Payloads and OCR with Solr

MuleSoft Manchester Meetup #3 slides 31st March 2020
MuleSoft Manchester Meetup #3 slides 31st March 2020MuleSoft Manchester Meetup #3 slides 31st March 2020
MuleSoft Manchester Meetup #3 slides 31st March 2020Ieva Navickaite
 
The Basic Concept Of IOC
The Basic Concept Of IOCThe Basic Concept Of IOC
The Basic Concept Of IOCCarl Lu
 
Getting Started with React, When You’re an Angular Developer
Getting Started with React, When You’re an Angular DeveloperGetting Started with React, When You’re an Angular Developer
Getting Started with React, When You’re an Angular DeveloperFabrit Global
 
EclipseCon 2016 - OCCIware : one Cloud API to rule them all
EclipseCon 2016 - OCCIware : one Cloud API to rule them allEclipseCon 2016 - OCCIware : one Cloud API to rule them all
EclipseCon 2016 - OCCIware : one Cloud API to rule them allMarc Dutoo
 
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open Wide
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open WideOCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open Wide
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open WideOCCIware
 
Nt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language AnalysisNt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language AnalysisNicole Gomez
 
26 top angular 8 interview questions to know in 2020 [www.full stack.cafe]
26 top angular 8 interview questions to know in 2020   [www.full stack.cafe]26 top angular 8 interview questions to know in 2020   [www.full stack.cafe]
26 top angular 8 interview questions to know in 2020 [www.full stack.cafe]Alex Ershov
 
Building a blog with an Onion Architecture
Building a blog with an Onion ArchitectureBuilding a blog with an Onion Architecture
Building a blog with an Onion ArchitectureBarry O Sullivan
 
Onion Architecture and the Blog
Onion Architecture and the BlogOnion Architecture and the Blog
Onion Architecture and the Blogbarryosull
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionFlink Forward
 
Lightning web components
Lightning web components Lightning web components
Lightning web components Cloud Analogy
 
.NET @ apache.org
 .NET @ apache.org .NET @ apache.org
.NET @ apache.orgTed Husted
 
Oracle developer interview questions(entry level)
Oracle developer interview questions(entry level)Oracle developer interview questions(entry level)
Oracle developer interview questions(entry level)Naveen P
 
Deep Learning for Java Developer - Getting Started
Deep Learning for Java Developer - Getting StartedDeep Learning for Java Developer - Getting Started
Deep Learning for Java Developer - Getting StartedSuyash Joshi
 
Angular interview questions
Angular interview questionsAngular interview questions
Angular interview questionsGoa App
 
OpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonOpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonHostedbyConfluent
 

Ähnlich wie Payloads and OCR with Solr (20)

MuleSoft Manchester Meetup #3 slides 31st March 2020
MuleSoft Manchester Meetup #3 slides 31st March 2020MuleSoft Manchester Meetup #3 slides 31st March 2020
MuleSoft Manchester Meetup #3 slides 31st March 2020
 
COinS (eng version)
COinS (eng version)COinS (eng version)
COinS (eng version)
 
Entity Framework 4
Entity Framework 4Entity Framework 4
Entity Framework 4
 
The Basic Concept Of IOC
The Basic Concept Of IOCThe Basic Concept Of IOC
The Basic Concept Of IOC
 
Getting Started with React, When You’re an Angular Developer
Getting Started with React, When You’re an Angular DeveloperGetting Started with React, When You’re an Angular Developer
Getting Started with React, When You’re an Angular Developer
 
Concurrency and parallel in .net
Concurrency and parallel in .netConcurrency and parallel in .net
Concurrency and parallel in .net
 
Tori ORM for PyCon 2014
Tori ORM for PyCon 2014Tori ORM for PyCon 2014
Tori ORM for PyCon 2014
 
EclipseCon 2016 - OCCIware : one Cloud API to rule them all
EclipseCon 2016 - OCCIware : one Cloud API to rule them allEclipseCon 2016 - OCCIware : one Cloud API to rule them all
EclipseCon 2016 - OCCIware : one Cloud API to rule them all
 
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open Wide
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open WideOCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open Wide
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open Wide
 
Nt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language AnalysisNt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language Analysis
 
26 top angular 8 interview questions to know in 2020 [www.full stack.cafe]
26 top angular 8 interview questions to know in 2020   [www.full stack.cafe]26 top angular 8 interview questions to know in 2020   [www.full stack.cafe]
26 top angular 8 interview questions to know in 2020 [www.full stack.cafe]
 
Building a blog with an Onion Architecture
Building a blog with an Onion ArchitectureBuilding a blog with an Onion Architecture
Building a blog with an Onion Architecture
 
Onion Architecture and the Blog
Onion Architecture and the BlogOnion Architecture and the Blog
Onion Architecture and the Blog
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
Lightning web components
Lightning web components Lightning web components
Lightning web components
 
.NET @ apache.org
 .NET @ apache.org .NET @ apache.org
.NET @ apache.org
 
Oracle developer interview questions(entry level)
Oracle developer interview questions(entry level)Oracle developer interview questions(entry level)
Oracle developer interview questions(entry level)
 
Deep Learning for Java Developer - Getting Started
Deep Learning for Java Developer - Getting StartedDeep Learning for Java Developer - Getting Started
Deep Learning for Java Developer - Getting Started
 
Angular interview questions
Angular interview questionsAngular interview questions
Angular interview questions
 
OpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonOpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit London
 

Mehr von OpenSource Connections

How To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessOpenSource Connections
 
The right path to making search relevant - Taxonomy Bootcamp London 2019
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019OpenSource Connections
 
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullOpenSource Connections
 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonOpenSource Connections
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...OpenSource Connections
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajOpenSource Connections
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...OpenSource Connections
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlOpenSource Connections
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesOpenSource Connections
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerOpenSource Connections
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...OpenSource Connections
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...OpenSource Connections
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...OpenSource Connections
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...OpenSource Connections
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...OpenSource Connections
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...OpenSource Connections
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah ViaOpenSource Connections
 
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...OpenSource Connections
 

Mehr von OpenSource Connections (20)

Encores
EncoresEncores
Encores
 
Test driven relevancy
Test driven relevancyTest driven relevancy
Test driven relevancy
 
How To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for Success
 
The right path to making search relevant - Taxonomy Bootcamp London 2019
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019
 
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
 
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
 

Kürzlich hochgeladen

Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.soniya singh
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Delhi Call girls
 
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.soniya singh
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)Delhi Call girls
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...tanu pandey
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLimonikaupta
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024APNIC
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...Diya Sharma
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Servicegwenoracqe6
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtrahman018755
 
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.soniya singh
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...SUHANI PANDEY
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024APNIC
 
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...singhpriety023
 

Kürzlich hochgeladen (20)

Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
 
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
Russian Call Girls in %(+971524965298  )#  Call Girls in DubaiRussian Call Girls in %(+971524965298  )#  Call Girls in Dubai
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
 
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
 
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
 
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
 

Payloads and OCR with Solr

  • 1. Payloads and OCR with Solr OpenSource Connections October 2019 Apache Lucene/Solr - London User Group
  • 2. Introductions Eric Pugh: Search Relevance Engineer at OpenSource Connections Daniel Worley: Search Relevance Engineer at OpenSource Connections
  • 3. Searching Text Inside Images http://pdf-discovery-demo.dev.o19s.com:8080 and search for “HELOC”
  • 4. OCR ● Tesseract and Tika enable us to get text out of images via OCR ● Text can then be indexed into Solr ● Problem solved?
  • 5. Highlighting ● What if we want to highlight the text in an image that matched? ● Regular Solr Highlighting not good enough ○ We can get text snippets but we can’t see where they came from in the image ○ For images with a lot of information this can make it hard for users to see why a particular image matched their query
  • 6. The Problem ● Tesseract has provided us bounding boxes for all of the OCR’d text ● We need to access this bounding box information within Solr on a per match basis
  • 7. What about Payloads? ● Payloads provide a way of attaching various metadata to each token ● More info https://www.slideshare.net/lucidworks/payloads-in-solr-erik-hatcher-luci dworks
  • 8.
  • 10. The Challenge ● Payloads are typically used at query time for matching or to affect the score of matching documents. ● Not much in the area of surfacing payload data at query time without manually extracting it again from the stored data
  • 11. Iteration 1 - Idea Create a highlighter formatter that surfaces payload attributes
  • 12. Iteration 1 - Results ● Required hacking at low level Lucene internals to include the payload attribute in the token stream. ● Suitable for a PoC, not great for any real applications
  • 13. Iteration 2 - Idea Create a component that only returns payloads for clauses that matched in the query
  • 14. Iteration 2 - Results A deployable plugin that doesn’t require hacking on Lucene to work
  • 15. Payload Component - What’s in the box? ● Payload Component ● And some conveniences: ○ Base64Encoder ○ PayloadBufferFilterFactory ● Available at: https://github.com/o19s/payload-component
  • 16. Payload Component ● Similar to the highlighting component but returns matches only ● Currently no scoring of matches ● For each match, add the payload data to the response if available
  • 17. PayloadBufferFilterFactory ● A filter to work around payload oddities in Solr ● Filters that produce new tokens often remove all attributes, which includes payloads. ● This filter will copy the Payload data and restore it later on after other filters have been run.
  • 19. Base64Encoder ● The DelimitedPayloadTokenFilterFactory expects data as: ○ [term][delimiter][payload] ● What about dog|barks woofs? ○ Will “woofs” be included as part of the payload?
  • 20. Base64Encoder Cont’d ● To get around this problem, the payload can be encoded in Base64 ○ dog|YmFya3Mgd29vZnM= ● The Base64Encoder will accept Base64 data at index time but store it out as the decoded version. ○ YmFya3Mgd29vZnM= -> barks woofs
  • 21. The Future: Matches Component ● Surface which terms/phrases from the query matched ● Surface payload attribute data that’s already included in the payload component ● Surface other data from the index such as offsets
  • 22. Thanks ● PayloadComponent Repo: https://github.com/o19s/payload-component ● Demo Repo: https://github.com/o19s/pdf-discovery-demo Big thanks to Dan Worley and Andrew Boyd and a brave client for working with me to make this idea happen! Interested in Relevance? Join us at www.o19s.com/slack to chat with your peers.