Introduction to Full-text search
About me: Full-time (mostly) Java developer. Part-time general technical/sysadmin/geeky guy. Interested in: hard problems, search, performance, parallelism, scalability
Why should you care?
Because every application needs search
We live in an era of big, complex and connected applications.
That means a lot of data
But it's no use if you can't find anything!
But it's no use if you can't quickly find something relevant
Quick
Relevant
Customized Experience
Deathy's Tip You can't win by being generic, but you can be the best for your specific type of content.
So back to our full-text search...
Some core ideas "index" (or "inverted index") "document"
Deathy’s Tip Don't be too quick in deciding what a "document" is. Put some thought into it or you'll regret it (speaking from a lot of experience)
First we need some documents, more specifically some text samples
Documents Doc1: "The cow says moo" Doc2: "The dog says woof" Doc3: "The cow-dog says moof" ("Stolen" from http://www.slideshare.net/tomdyson/being-google)
Important: individual words are the basis for the index
Individual words index = [ "cow", 	"dog", 	"moo", 	"moof", 	"The", 	"says", 	"woof" ]
For each word we have a list of documents to which it belongs
Words, with appearances index = { 	"cow": ["Doc1", "Doc3"], 	"dog": ["Doc2", "Doc3"], 	"moo": ["Doc1"], 	"moof": ["Doc3"], 	"The": ["Doc1", "Doc2", "Doc3"], 	"says": ["Doc1", "Doc2", "Doc3"], 	"woof": ["Doc2"] }
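As a minimal sketch of how such a structure could be built from the three sample documents: the function name `build_index` and the choice to split "cow-dog" on the hyphen are assumptions made here so the result matches the slide, not something a particular engine prescribes.

```python
from collections import defaultdict

# The three sample documents from the slides.
docs = {
    "Doc1": "The cow says moo",
    "Doc2": "The dog says woof",
    "Doc3": "The cow-dog says moof",
}

def build_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: treat hyphens like spaces so "cow-dog" yields "cow" and "dog".
        for word in text.replace("-", " ").split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

index = build_index(docs)
print(index["cow"])  # ['Doc1', 'Doc3']
```

Storing the postings as sets while building avoids listing a document twice when a word repeats inside it.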
Q1: Find documents which contain "moo" A1: index["moo"]
Q2: Find documents which contain "The" and "dog" A2: set(index["The"]) & set(index["dog"])
Try to think of search as unions/intersections or other filters on sets.
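A small sketch of that idea, using the example index from the slides. The helper `docs_with` is a hypothetical name introduced here; it only wraps the lookup so that unknown terms give an empty set.

```python
# The example index from the slides.
index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"],
}

def docs_with(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return set(index.get(term, []))

and_result = docs_with("The") & docs_with("dog")    # both terms: intersection
or_result = docs_with("moo") | docs_with("woof")    # either term: union
not_result = docs_with("says") - docs_with("cow")   # first but not second: difference

print(sorted(and_result))  # ['Doc2', 'Doc3']
print(sorted(or_result))   # ['Doc1', 'Doc2']
print(sorted(not_result))  # ['Doc2']
```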
Most searches are using simple terms and "boolean" operators.
"Boolean" operators: "word" - word MAY/SHOULD appear in document; "+word" - word MUST appear in document; "-word" - word MUST NOT appear in document
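One way these three prefixes could be evaluated against the example index is sketched below. The `search` function and its parsing are hypothetical. As an assumption, optional (unprefixed) terms are treated as "at least one must match", which is the reading used in this talk; real engines may instead use optional terms only for ranking when required terms are present.

```python
# The example index from the slides.
index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"],
}
ALL_DOCS = {"Doc1", "Doc2", "Doc3"}

def search(query):
    """Evaluate a query made of 'word', '+word' and '-word' terms."""
    must, should, must_not = [], [], []
    for token in query.split():
        if token.startswith("+"):
            must.append(token[1:])
        elif token.startswith("-"):
            must_not.append(token[1:])
        else:
            should.append(token)

    result = set(ALL_DOCS)
    for term in must:                      # MUST: intersect
        result &= set(index.get(term, []))
    if should:                             # SHOULD: at least one matches (assumption)
        any_should = set()
        for term in should:
            any_should |= set(index.get(term, []))
        result &= any_should
    for term in must_not:                  # MUST NOT: subtract
        result -= set(index.get(term, []))
    return sorted(result)

print(search("+says cow dog -moof"))  # ['Doc1', 'Doc2']
```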
Example Query: "+type:book content:java content:python -content:ruby" Find books, with "java" or "python" in content but which don't contain "ruby" in content.
Err...wait...what the hell does "content:java" mean?
Reviewing the "document" concept
An index consists of one or more documents
Each document consists of one or more "field"s. Each field has a name and content.
Field examples: content, title, author, publication date, etc.
So how are fields handled internally? In most cases very simple. A word belongs to a specific field, so it can be stored in the term directly.
New index example index = { 	"content:cow": ["Doc1", "Doc3"], 	"content:dog": ["Doc2", "Doc3"], 	"content:moo": ["Doc1"], 	"content:moof": ["Doc3"], 	"content:The": ["Doc1", "Doc2", "Doc3"], 	"content:says": ["Doc1", "Doc2", "Doc3"], 	"content:woof": ["Doc2"], 	"type:example_documents": ["Doc1", "Doc2", "Doc3"] }
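A rough sketch of how the field name can be folded into the term when indexing. The `field:word` key format follows the slide; the function name `build_fielded_index` and the hyphen splitting are assumptions for illustration.

```python
from collections import defaultdict

# Each document is now a mapping of field name to field content.
documents = {
    "Doc1": {"content": "The cow says moo", "type": "example_documents"},
    "Doc2": {"content": "The dog says woof", "type": "example_documents"},
    "Doc3": {"content": "The cow-dog says moof", "type": "example_documents"},
}

def build_fielded_index(docs):
    """Index every word under a 'field:word' key."""
    postings = defaultdict(set)
    for doc_id, fields in docs.items():
        for field_name, text in fields.items():
            # Assumption: hyphens are treated like spaces, as in the earlier example.
            for word in text.replace("-", " ").split():
                postings[f"{field_name}:{word}"].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

fielded_index = build_fielded_index(documents)
print(fielded_index["content:cow"])             # ['Doc1', 'Doc3']
print(fielded_index["type:example_documents"])  # ['Doc1', 'Doc2', 'Doc3']
```

Because the field is part of the key, a query for "content:cow" is still a plain dictionary lookup, and the same word in a different field never collides with it.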
But enough of that
We missed the most important thing!
We saved the most important thing for last!
Analysis
or for mortals: how you get from a long text to small tokens/words/terms
…borrowing from Lucene naming/API...
(One) Tokenizer
and zero or more Filters
First...
Some more interesting documents Doc1: "The quick brown fox jumps over the lazy dog" Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!" Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"
Tokenizer: Breaks up a single string into smaller tokens.
You define what splitting rules are best for you.
Whitespace Tokenizer Just break into tokens wherever there is some space. So we get something like:
Doc1: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"] Doc2: ["All", "Daleks:", "Exterminate!", "Exterminate!", "EXTERMINATE!!", "EXTERMINATE!!!"] Doc3: ["And", "the", "final", "score", "is:", "no", "TARDIS,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "Who", "da", "man?!"]
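In Python, a whitespace tokenizer of this kind can be sketched with `str.split()`, which splits on any run of whitespace. The function name `whitespace_tokenize` is made up for this example.

```python
def whitespace_tokenize(text):
    """Break a string into tokens wherever there is whitespace."""
    return text.split()

doc2 = "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
doc2_tokens = whitespace_tokenize(doc2)
print(doc2_tokens)
# ['All', 'Daleks:', 'Exterminate!', 'Exterminate!', 'EXTERMINATE!!', 'EXTERMINATE!!!']
```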
But wait, that doesn't look right...
So we apply Filters
Filter: transforms a single token into another single token, multiple tokens, or no token at all. You can apply several of them in a specific order.
Filter 1: lower-case (since we don't want the search to be case-sensitive)
Result Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"] Doc2: ["all", "daleks:", "exterminate!", "exterminate!", "exterminate!!", "exterminate!!!"] Doc3: ["and", "the", "final", "score", "is:", "no", "tardis,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "who", "da", "man?!"]
Filter 2: remove punctuation
Result Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"] Doc2: ["all", "daleks", "exterminate", "exterminate", "exterminate", "exterminate"] Doc3: ["and", "the", "final", "score", "is", "no", "tardis", "no", "screwdriver", "two", "minutes", "to", "spare", "who", "da", "man"]
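The two filters so far can be sketched as plain functions applied in order after the tokenizer. This is an illustrative chain, not Lucene's API: the names `lowercase`, `remove_punctuation` and `analyze` are assumptions, and "punctuation" here means the ASCII characters in Python's `string.punctuation`.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def whitespace_tokenize(text):
    return text.split()

def lowercase(tokens):
    """Filter 1: make every token lower-case."""
    return [t.lower() for t in tokens]

def remove_punctuation(tokens):
    """Filter 2: delete ASCII punctuation; drop tokens that become empty."""
    cleaned = (t.translate(_PUNCT_TABLE) for t in tokens)
    return [t for t in cleaned if t]

def analyze(text, filters=(lowercase, remove_punctuation)):
    """Run one tokenizer, then each filter in the given order."""
    tokens = whitespace_tokenize(text)
    for apply_filter in filters:
        tokens = apply_filter(tokens)
    return tokens

print(analyze("All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"))
# ['all', 'daleks', 'exterminate', 'exterminate', 'exterminate', 'exterminate']
```

Note that deleting every punctuation character would also glue "cow-dog" into "cowdog"; whether that is what you want is exactly the kind of rule you get to decide.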
Add more filter seasoning until it tastes just right.
Lots of things you can do with filters: case normalization, removing unwanted/unneeded characters, transliteration/normalization of special characters, stopwords, synonyms
Possibilities are endless, enjoy experimenting with them!
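Two of the filter types mentioned, stopwords and synonyms, show how a filter can turn one token into no token or into several. The word lists below are hypothetical, chosen only to fit the sample sentence.

```python
# Hypothetical word lists, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "over"}
SYNONYMS = {"quick": ["fast"], "lazy": ["idle"]}

def remove_stopwords(tokens):
    """One token in, zero tokens out for very common words."""
    return [t for t in tokens if t not in STOPWORDS]

def expand_synonyms(tokens):
    """One token in, possibly several out: keep the token and add its synonyms."""
    expanded = []
    for t in tokens:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
result = expand_synonyms(remove_stopwords(tokens))
print(result)
# ['quick', 'fast', 'brown', 'fox', 'jumps', 'lazy', 'idle', 'dog']
```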
Just one warning…
Always use the same analysis rules when indexing and when parsing search text entered by the user!
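A short sketch of why this matters. The `analyze` and `search` functions are hypothetical; the point is only that the same `analyze` runs on documents at indexing time and on the user's text at search time.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def analyze(text):
    """Whitespace tokenizer + lower-case + remove ASCII punctuation."""
    tokens = (t.lower().translate(_PUNCT_TABLE) for t in text.split())
    return [t for t in tokens if t]

docs = {
    "Doc1": "The quick brown fox jumps over the lazy dog",
    "Doc2": "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!",
    "Doc3": "And the final score is: no TARDIS, no screwdriver, "
            "two minutes to spare. Who da man?!",
}

# Indexing: every document goes through analyze().
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def search(user_text):
    """Analyze the query with the SAME rules, then AND all terms together."""
    terms = analyze(user_text)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

raw_lookup = index.get("EXTERMINATE!", set())  # raw user text: nothing found
analyzed_lookup = search("EXTERMINATE!")       # analyzed user text: finds Doc2

print(raw_lookup)       # set()
print(analyzed_lookup)  # {'Doc2'}
```

The index only ever contains analyzed terms such as "exterminate", so a raw query token like "EXTERMINATE!" can never match it.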
I bet you want to start working with this
Implementations: Lucene (Java main, .NET, Python, C ); SOLR if using from other languages; Xapian; Sphinx; OpenFTS; MySQL Full-Text Search (kind of…)
Related Books
The theory Introduction to Information Retrieval http://nlp.stanford.edu/IR-book/information-retrieval-book.html Warning: contains a lot of math.
The practice (for Lucene at least): Lucene in Action, second edition: http://www.manning.com/hatcher3/ Warning: contains a lot of Java
Questions?
Contact me (with interesting problems involving lots of data) @deathy cristian.vat@gmail.com http://blog.deathy.info/ (yeah…I know…)
Fin.
So where’s the Halloween Party? Happy Halloween!

Editor's notes

  1. I won't delve into specifics or actual implementations. I'll try to present main concepts which come from Information Retrieval theory and also essential components you should be aware of when dealing with any full-text search system. If interested, there could be a future presentation on actual implementations (Lucene in my case).
  2. Java Web Developer-ish. Last 4 years worked mostly on electronic publishing applications: processing/searching/displaying various content sets of various sizes. Passion for big data and lots of it. (Last weekend I was parallelizing indexing on an 800K document set so it uses as many cores as possible. On Friday I was indexing a data set of 5.8M documents...)
  3. about fulltext search, or search in general
  4. take your pick: lots of pictures, lots of friends, lots of blog posts
  5. actually, scratch that..
  6. much better..
  7. Full-text search is usually VERY fast. And by adding your own custom one, you can make it faster where your specific application needs it most.
  8. Depending on your content and users you can have very specific relevance criteria. You can surprise your users with the quality of results.
  9. Various needs for various content. - Bitch about imobiliare.ro not having search in text or very dynamic filters. Example: cannot search for apartments to rent with internet access... - Bitch about geekmeet.ro wordpress search not being able to filter based on category (Timisoara in this case)
  10. "index" = where you add items which you want to find and where you search for them. "document" = the basic unit of indexing/searching. Usually one row from the search results list. Could be a book, a chapter, a page, a URL, etc.
  11. Observe the sorting. More on this later...
  12. not quite boolean, but simple enough to understand..
  13. actual implementations vary and it usually shouldn't matter. Just remember that there are fields and documents and each indexed term is indexed for a specific field.
  14. I'm going Lucene here, but any good index/search API will let you customize this process. This is, as many have found, a good way to structure your process.
  15. punctuation and various mixes of upper/lower-case in tokens.
  16. Bitch about tokenizer/filter options (or lack thereof in Sphinx/MySQL)…