Lucene Intro
About Me

•   Cristian Vat

•   Java Developer / Geek / Enthusiast

•   Contact

    •   @deathy

    •   ... or TM JUG mailing list
About YOU


•   Heard of Lucene / Solr?

•   Used Lucene / Solr?
Databases / Text Search
Databases

•   Select/Search on (usually) exact values or
    ranges

•   Group/Summarize Results

•   Sort results by value(s) of certain result
    column(s)
Text Search

•   Search for individual words/tokens

•   Search long text documents

•   More language-aware

•   Results “sorted” by relevance by default
IR Quick Intro
IR = Information Retrieval
IR Quick Intro

•   Doc 1: “I did enact Julius Caesar: I was killed i’
    the Capitol; Brutus killed me.”

•   Doc 2: “So let it be with Caesar. The noble
    Brutus hath told you Caesar was ambitious:”
IR Quick Intro

•   Index

    •   “I” -> Doc 1

    •   “Caesar” -> Doc 1, Doc 2

    •   “enact” -> Doc 1

    •   “noble” -> Doc 2
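
The term-to-document mapping on this slide is an inverted index. A minimal sketch of building one for the two example documents follows. It assumes a crude lowercase, letters-only tokenizer, and the names `tokenize` and `build_index` are illustrative only, not Lucene API.

```python
import re
from collections import defaultdict

# The two example documents from the previous slide.
docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}


def tokenize(text):
    # Lowercase the text and keep runs of letters (a deliberately crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


index = build_index(docs)
print(sorted(index["caesar"]))  # [1, 2]
print(sorted(index["enact"]))   # [1]
print(sorted(index["noble"]))   # [2]
```

Looking up a term is then a single dictionary access, which is why search over an inverted index stays fast as documents grow longer.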
IR Quick Intro
•   Search

    •   caesar

    •   c?es*

    •   caesar AND noble

    •   “Julius Caesar”

    •   Caesar NOT Brutus
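
The query types listed above can be sketched as set operations over an inverted index. This is a toy illustration over the two example documents, not how Lucene implements them: the helper names (`term`, `wildcard`, `phrase`) are made up here, and the phrase check simply rescans the document text instead of using stored positions.

```python
import fnmatch
import re

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}


def tokenize(text):
    # Lowercase the text and keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())


# Inverted index: term -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for token in tokenize(text):
        index.setdefault(token, set()).add(doc_id)


def term(word):
    # Documents containing a single term.
    return index.get(word.lower(), set())


def wildcard(pattern):
    # "?" matches one character, "*" matches any run of characters.
    hits = set()
    for indexed_term, doc_ids in index.items():
        if fnmatch.fnmatchcase(indexed_term, pattern.lower()):
            hits |= doc_ids
    return hits


def phrase(text):
    # Documents where the words appear next to each other, in order.
    words = tokenize(text)
    hits = set()
    for doc_id, body in docs.items():
        tokens = tokenize(body)
        n = len(words)
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.add(doc_id)
    return hits


results = {
    "caesar": term("caesar"),
    "c?es*": wildcard("c?es*"),
    "caesar AND noble": term("caesar") & term("noble"),
    '"Julius Caesar"': phrase("Julius Caesar"),
    "Caesar NOT Brutus": term("caesar") - term("brutus"),
}

for query, hits in results.items():
    print(query, "->", sorted(hits))
```

With these two documents, "Caesar NOT Brutus" comes back empty, because both documents mention Brutus.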
Lucene Ecosystem

             ...and many more
Lucene

•   IR Library

•   Just API for Indexing/Searching

•   No GUI

•   No parsers for different file formats
Lucene

•   Fast

•   Thread-Safe/Multi-Threaded indexing and
    searching

•   No dependencies! (not even logging
    framework)
Solr

•   Search Server / Layer over Lucene

•   Provides REST-like HTTP (JSON/XML) API

•   Client libraries in Java, PHP, Python, Ruby,
    Perl, .NET, ...
Solr

•   More structured indexes

•   Replication / Distribution, Master-Slave, etc.

•   Faceted Search / Filtering

•   Indexing of rich document types (via Tika)
Tika

•   “Content Analysis Toolkit”

•   Text and Metadata extraction from various
    rich document types

•   Used by Solr for indexing rich document
    types
Lucene (in more detail)
Lucene Index Structure


•   Index = One or more Documents

•   Document = one or more Fields with values

•   NO Schema/Structure restrictions
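
A rough sketch of what "documents made of fields, with no schema" can look like. The `TinyIndex` class, its methods, and the sample field values are hypothetical and only mimic the idea; Lucene's real API is Java and differs in detail.

```python
from collections import defaultdict


class TinyIndex:
    """Toy index: each document is a dict of field name -> value."""

    def __init__(self):
        self.documents = []
        # (field, term) -> set of document ids
        self.postings = defaultdict(set)

    def add_document(self, fields):
        # No schema check: any document may carry any set of fields.
        doc_id = len(self.documents)
        self.documents.append(fields)
        for field, value in fields.items():
            for term in str(value).lower().split():
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def search(self, field, term):
        # Return ids of documents whose given field contains the term.
        return sorted(self.postings.get((field, term.lower()), set()))


idx = TinyIndex()
# Hypothetical sample documents with different field sets.
idx.add_document({"title": "Intro to Lucene", "year": 2012})
idx.add_document({"title": "Intro to Solr", "speaker": "someone"})

print(idx.search("title", "intro"))      # [0, 1]
print(idx.search("title", "lucene"))     # [0]
print(idx.search("speaker", "someone"))  # [1]
```

Terms are indexed per field, so the same word in "title" and in "speaker" ends up in separate posting lists.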
Adding documents
Lucene Search
Query Parser
•   AND, OR, NOT ( +/- )

    •   “apache AND lucene NOT solr” ( “+apache
        +lucene -solr” )

•   Range Queries

    •   year:[1994 TO 2011]

•   Wildcard/Fuzzy:

    •   “ap?che”, “apac*”, “appche~0.8”
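
A small sketch of how the AND/NOT form relates to the +/- form shown above. It assumes a simplified reading: a term next to AND is required, a term after NOT is prohibited, and anything else is optional. The functions `to_prefix_form` and `matches` are illustrative names, not part of Lucene's query parser.

```python
def to_prefix_form(query):
    # Rewrite "a AND b NOT c" style queries into "+a +b -c" style.
    words = query.split()
    clauses = []
    for i, word in enumerate(words):
        if word in ("AND", "OR", "NOT"):
            continue
        previous = words[i - 1] if i > 0 else None
        following = words[i + 1] if i + 1 < len(words) else None
        if previous == "NOT":
            clauses.append("-" + word)   # prohibited
        elif previous == "AND" or following == "AND":
            clauses.append("+" + word)   # required
        else:
            clauses.append(word)         # optional
    return " ".join(clauses)


def matches(prefix_query, doc_terms):
    # Evaluate a +/- query against the set of terms in one document.
    required, prohibited, optional = [], [], []
    for clause in prefix_query.split():
        if clause.startswith("+"):
            required.append(clause[1:])
        elif clause.startswith("-"):
            prohibited.append(clause[1:])
        else:
            optional.append(clause)
    if any(t in doc_terms for t in prohibited):
        return False
    if required:
        return all(t in doc_terms for t in required)
    # Without required terms, at least one optional term must match.
    return any(t in doc_terms for t in optional)


prefix = to_prefix_form("apache AND lucene NOT solr")
print(prefix)  # +apache +lucene -solr
print(matches(prefix, {"apache", "lucene", "tika"}))  # True
print(matches(prefix, {"apache", "lucene", "solr"}))  # False
```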
Sorting of Results


•   Default sort by Relevance

•   Possible to use custom sort fields
Relevance


•   Score is calculated for each document based
    on individual document/fields and the current
    search query
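
To make the idea of a per-document score concrete, here is a simplified TF-IDF sketch over the two Caesar example documents. It is an assumption-laden stand-in: the weights (square root of term frequency, a logarithmic inverse document frequency) follow classic TF-IDF, while Lucene's actual scoring includes further factors such as boosts and length normalization.

```python
import math
import re
from collections import Counter

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}


def tokenize(text):
    # Lowercase the text and keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())


# Term frequencies per document, and number of documents containing each term.
term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
doc_freq = Counter(term for counts in term_counts.values() for term in counts)


def score(query, doc_id):
    # Sum tf * idf over the query terms that occur in the document.
    total = 0.0
    for term in tokenize(query):
        freq = term_counts[doc_id][term]
        if freq == 0:
            continue
        tf = math.sqrt(freq)
        idf = 1.0 + math.log(len(docs) / (doc_freq[term] + 1))
        total += tf * idf
    return total


query = "caesar noble"
ranked = sorted(docs, key=lambda doc_id: score(query, doc_id), reverse=True)
print(ranked)  # [2, 1]
```

Doc 2 ranks first because it mentions "caesar" twice and also contains "noble", while Doc 1 matches only "caesar" once.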
For the nerds

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/search/Similarity.html
Analysis


•   From long continuous text to individual
    tokens/words used for indexing
Analysis


•   Text -> Tokenizer -> (TokenFilter)* -> Tokens
Tokenizer

•   Splits main text into words, by whitespace,
    punctuation, other rules

•   Text: “So, it has come to this!”

•   Tokens: [ “So”, “it”, “has”, “come”, “to”, “this” ]
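
A minimal tokenizer sketch that reproduces the example above. It assumes tokens are simply runs of word characters, which is far cruder than the rules a real Lucene tokenizer applies.

```python
import re


def tokenize(text):
    # Split on anything that is not a letter, digit, or underscore.
    return re.findall(r"\w+", text)


tokens = tokenize("So, it has come to this!")
print(tokens)  # ['So', 'it', 'has', 'come', 'to', 'this']
```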
Token Filters

•   Change existing tokens or add new ones

    •   Case-Folding

    •   Synonyms

    •   Stemming
Token Filters
•   Text: “The Pandorica was constructed to
    ensure the safety of the Alliance.”

•   Tokens: [“The”, “Pandorica”, “was”,
    “constructed”, “to”, “ensure”, “the”, “safety”,
    “of”, “the”, “Alliance” ]

•   Filtered: [ “pandorica”, “was”, “construct”,
    “to”, “ensure”, “safe”, “of”, “alliance” ]
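
The Tokenizer -> TokenFilter chain can be sketched as chained generators. Everything here is a toy assumption chosen to reproduce the slide's output: the stop word list contains only "the", and `toy_stem` strips two hard-coded suffixes, which is not a real stemming algorithm such as Porter's.

```python
import re

STOP_WORDS = {"the"}      # assumed stop word list, just for this example
SUFFIXES = ("ed", "ty")   # toy suffix list, not a real stemmer


def tokenize(text):
    # Tokenizer: split text into runs of word characters.
    return re.findall(r"\w+", text)


def lowercase(tokens):
    # Case-folding filter.
    for token in tokens:
        yield token.lower()


def remove_stop_words(tokens):
    # Drop very common words that carry little meaning for search.
    for token in tokens:
        if token not in STOP_WORDS:
            yield token


def toy_stem(tokens):
    # Strip one known suffix if at least four characters remain.
    for token in tokens:
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 4:
                token = token[:-len(suffix)]
                break
        yield token


def analyze(text):
    # Text -> Tokenizer -> (TokenFilter)* -> Tokens
    return list(toy_stem(remove_stop_words(lowercase(tokenize(text)))))


filtered = analyze("The Pandorica was constructed to ensure the safety of the Alliance.")
print(filtered)
# ['pandorica', 'was', 'construct', 'to', 'ensure', 'safe', 'of', 'alliance']
```

The same chain has to run on the query text at search time, otherwise a search for "Constructed" would never meet the indexed token "construct".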
Q&A
Questions?
Thanks

Editor's Notes

  17. Tika: Office (Word, Excel, PowerPoint), OpenOffice, PDF, images (metadata), audio (ID3 for MP3 files), RTF, etc.
  19. Lucene Index Structure: similar to NoSQL databases. Not all documents need to contain the same fields.