SEARCH ME
Using Lucene.Net In Your Apps
About Me
   Zachary Johnson Gramana
   Engineer at Potts Consulting Group
   Proud new father of Rex
Search is...
   A vague term that encompasses multiple
    problems.
   A better term is “information retrieval” (IR); a
    search application is an IR system.
   Interdisciplinary, drawing from:
     computer science (parsing, data structures)
     psychology (query grammar, human-computer
      interaction)
     linguistics (textual analysis)
     information science (scoring/relevancy)
     maths (document retrieval strategy)
Problems Solved
    Information Overload
      Find the information that users want,
       not just the information they asked for.
    Transparently handle all kinds of data:
      structured (hierarchical)
      semi-structured (markup)
      unstructured (plain text)
    Single portal to multiple data types and
     sources.
    Do it fast!
Basic IR System Capabilities
   Collection (importing, crawling)
       Anonymous web page crawling (Google)
       User-uploaded photographs (Flickr)
       Publisher upload of .mp3 files (iTunes)
   Indexing
       Analysis
       Modify index data structure
   Querying
       Input parsing
       Query generation & execution
       Collecting the results
       Filtering the results (optional)
What is Lucene.Net?
   Port of the Apache Software Foundation’s Lucene
    library from Java to C#.
   It’s a search library.
   Lucene was created by Doug Cutting.
   Named after his wife’s middle name.
   First released in 2000 on SourceForge.
   Moved to the Apache Software Foundation in September 2001.
Used By
   StackOverflow
   JIRA
   IBM
   Akamai
   Apple
   Autodesk
   Orchard
   RavenDB
   CouchDB
What Isn’t Lucene.Net
   Not a complete information retrieval system
       Check out Google Search Appliance instead:
        http://www.google.com/enterprise/search/
   Not a web-crawler.
       Check out Arachnode instead
        http://arachnode.net
   Not a query service.
       Check out Solr instead
        http://lucene.apache.org/solr
   Not hard
       Check out Windows Search SDK instead
        http://bit.ly/ImRtMk
Concept and Overview
What’s In an Index?
   Stores a collection of Documents, each of
    which represents a source record.
   Documents contain:
     Metadata about the source record.
     (optionally) actual data from the source record.
     (optionally) derived analytical products.
   Documents store a collection of
    token/frequency pairs (optionally with positions),
    plus a document identifier.
Lucene’s Index Structure
   Documents store a collection of fields.
   Fields are collections of terms, plus an identifier and
    optional term vectors.
   Terms are key-value pairs: a field name and
    a string value.
   Lucene provides special classes to deal with tricky
    data, like the NumericField class.
   Term vectors are terms, along with their frequency
    counts and positions.
   Fields can be indexed, stored, or both.
       Storing allows a field’s original value to be retrieved after indexing.
       Indexing adds the field’s terms to Lucene’s inverted index.
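
The stored/indexed distinction can be sketched in a few lines. This is a language-neutral illustration in Python, not the Lucene.Net API: the `Field` and `Document` classes, their attribute names, and the whitespace "analysis" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    value: str
    stored: bool = True    # original value can be read back from a hit
    indexed: bool = True   # value is analyzed and added to the inverted index


@dataclass
class Document:
    fields: list = field(default_factory=list)

    def add(self, f):
        self.fields.append(f)

    def stored_values(self):
        # Only stored fields can be retrieved after indexing.
        return {f.name: f.value for f in self.fields if f.stored}

    def indexed_terms(self):
        # A term is a (field name, token) pair.
        # Lowercasing and whitespace splitting stand in for real analysis.
        return [(f.name, tok)
                for f in self.fields if f.indexed
                for tok in f.value.lower().split()]


doc = Document()
doc.add(Field("id", "42", stored=True, indexed=False))        # retrievable, not searchable
doc.add(Field("body", "Search me", stored=False, indexed=True))  # searchable, not retrievable

print(doc.stored_values())
print(doc.indexed_terms())
```

A field that is stored but not indexed (like `id` here) comes back with a hit but cannot be searched; a field that is indexed but not stored is searchable but its original text is not kept.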
The Inverted Index
     (figure taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH )
Lucene’s Index Structure
   What is an “inverted index”?
     “verted” (forward) index: each document points to a
      collection of terms
     inverted index: each term points to a collection of
      documents
   An index consists of one or more segments
     Each segment is a self-contained, independent
      partition of the entire index.
     Stores: field names, field values, term dictionary,
      term frequencies, term proximities, normalization
      factors, term vectors, and an (optional) deleted-record
      lookup table.
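
As a rough illustration of the inversion itself, here is a minimal Python sketch. The sample documents and the `invert` helper are made up for the example, and it ignores segments, frequencies, and positions.

```python
from collections import defaultdict

# Forward view: document id -> the terms it contains.
forward = {
    1: ["lucene", "is", "a", "search", "library"],
    2: ["solr", "is", "a", "search", "server"],
}


def invert(forward_index):
    """Build term -> sorted list of document ids (a postings list)."""
    postings = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


inverted = invert(forward)
print(inverted["search"])   # [1, 2]
print(inverted["lucene"])   # [1]
```

Looking up a term is now a single dictionary access instead of a scan over every document, which is what makes the "do it fast" goal feasible.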
Analysis
     (figure taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH )
Tokenization
     (figure taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )
Tokenization
   Normalization: “Gramåna” > “gramana”
   Stemming: “preschooling” > “preschool”
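
A small Python sketch of an analysis chain (tokenize, normalize, stem) may make these steps concrete. It is an assumption-laden toy, not a Lucene.Net analyzer: `normalize`, `naive_stem`, and `analyze` are hypothetical names, and the stemmer only strips a few suffixes, so it reduces “preschooling” to “preschool” and leaves prefixes alone.

```python
import re
import unicodedata


def normalize(token):
    """Lowercase and strip diacritics, e.g. 'Gramåna' -> 'gramana'."""
    decomposed = unicodedata.normalize("NFKD", token)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.lower()


def naive_stem(token):
    """Toy suffix stripper; real stemmers such as Snowball apply many more rules."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize on word characters, then normalize and stem each token."""
    tokens = re.findall(r"\w+", text)
    return [naive_stem(normalize(t)) for t in tokens]


print(analyze("Gramåna was preschooling"))
```

The same chain has to run at index time and at query time; otherwise a query for “Gramåna” would never match the indexed term “gramana”.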
Norms
    (figure taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )
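
The figure's content is not recoverable here, so the following is only a sketch of what a norm typically captures. It assumes the classic TF-IDF similarity from the Lucene 2.x/3.x era, where the length norm is 1/sqrt(number of terms in the field) and is multiplied by any index-time boosts. The function names are illustrative, and the lossy single-byte encoding Lucene applies to norms is left out.

```python
import math


def length_norm(num_terms):
    """Shorter fields score higher for the same match: 1 / sqrt(term count)."""
    return 1.0 / math.sqrt(num_terms)


def field_norm(num_terms, field_boost=1.0, doc_boost=1.0):
    """Combine index-time boosts with the length norm (assumed classic formula)."""
    return doc_boost * field_boost * length_norm(num_terms)


print(field_norm(4))                    # 4-term title
print(field_norm(100))                  # 100-term body
print(field_norm(4, field_boost=2.0))   # boosted title
```

Under these assumptions a match in a 4-term title carries five times the norm of a match in a 100-term body, before any boost is applied.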
Time to Look at Some Code
Getting a Query
   Two options:
     Parse a search string using a QueryParser class.
     Programmatically build a query.
   QueryParser can build very complex queries
    very quickly, but requires the user to provide a
    query string.
   Programmatically building a query requires less
    overhead for simple queries.
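
The two routes can be compared with a toy model. The Python below is a sketch and not Lucene.Net's QueryParser: the `TermQuery` and `BooleanQuery` classes only borrow the names, and `parse` is a hypothetical helper that understands `field:term` clauses joined by a single kind of operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    field: str
    text: str


@dataclass(frozen=True)
class BooleanQuery:
    operator: str      # "AND" or "OR"
    clauses: tuple


def parse(query_string, default_field="body"):
    """Parse 'title:a AND b' style strings; one operator kind per query."""
    operator = "OR" if " OR " in query_string else "AND"
    clauses = []
    for part in query_string.split(f" {operator} "):
        # 'title:Lucene' -> ('title', ':', 'Lucene'); 'search' -> ('', '', 'search')
        field, _, text = part.strip().rpartition(":")
        clauses.append(TermQuery(field or default_field, text.lower()))
    return clauses[0] if len(clauses) == 1 else BooleanQuery(operator, tuple(clauses))


# Option 1: parse what the user typed.
parsed = parse("title:Lucene AND search")

# Option 2: build the same query tree directly in code.
built = BooleanQuery("AND", (TermQuery("title", "lucene"), TermQuery("body", "search")))

print(parsed == built)  # True
```

Both routes end in the same query tree; the parser just adds a string grammar in front of it, which is convenient for user input and unnecessary when the application already knows the fields and terms.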
General Query Types
     (figure taken from the Wikipedia entry “Information Retrieval” at http://bitly.com/T1Qbw)
Some Lucene Query Types
   TermQuery (general purpose)
   BooleanQuery
   MultiPhraseQuery
   SpanQuery
   WildcardQuery
   FilteredQuery
   MoreLikeThisQuery
   BoostingQuery
   FuzzyQuery
   ConstantScoreRangeQuery
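
To give a feel for how a few of these query types behave, here is a Python sketch that runs term, boolean, wildcard, and fuzzy lookups against a tiny in-memory inverted index. The function names mirror the Lucene class names only loosely; the `INDEX` data and the `max_edits` parameter are assumptions for the example, and no scoring is performed.

```python
from fnmatch import fnmatchcase

# term -> set of document ids (a tiny inverted index for one field)
INDEX = {
    "lucene": {1, 3},
    "search": {1, 2},
    "solr": {2},
    "lucent": {4},
}


def term_query(term):
    return set(INDEX.get(term, set()))


def boolean_and(*results):
    return set.intersection(*results)


def boolean_or(*results):
    return set.union(*results)


def wildcard_query(pattern):
    """Expand the pattern against the term dictionary, then union the postings."""
    hits = set()
    for term, docs in INDEX.items():
        if fnmatchcase(term, pattern):
            hits |= docs
    return hits


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_query(term, max_edits=1):
    """Match every indexed term within max_edits of the query term."""
    hits = set()
    for candidate, docs in INDEX.items():
        if edit_distance(term, candidate) <= max_edits:
            hits |= docs
    return hits


print(boolean_and(term_query("lucene"), term_query("search")))  # {1}
print(sorted(wildcard_query("luc*")))                           # [1, 3, 4]
print(sorted(fuzzy_query("lucen")))                             # [1, 3, 4]
```

Wildcard and fuzzy queries both work by expanding into many term lookups, which is why they tend to cost more than a plain term query on a large term dictionary.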
Time to Look at More Code
Lucene.Net Contribs
   Spatial (geo-spatial search)
   Similarity
   SimpleFacetedSearch
   Highlighter
   SpellChecker
   WordNet (synonyms)
   Snowball (stemming library)
   RegEx
That’s All!
Thanks for your time and attention.

twitter: @zgramana
blog: http://www.excitabyte.com/
Email: zgramanaATgee mail dot com
