Illuminating Lucene.Net

Illuminating Lucene.Net:
Bringing Full-Text Search to Light
W. Dean Thrasher
14 May 2013

Agenda
• About the presenter
• About Lucene.Net
– What it is
– What it does
– How it works
– Who uses it
– Why you should care

More Agenda
• Core concepts
– Lucene structure
– Luke
– Terminology
• Code examples
• Things to know
• Recap
• References

W. Dean Thrasher
Dean.thrasher@infovark.com
www.infovark.com
www.linkedin.com/in/deanthrasher
@DThrasher
@infovark

BACKGROUND

What is Lucene.Net?
Lucene.Net is a port of the Lucene search engine
library, written in C# and targeted at .NET
runtime users.

What is Lucene?
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Apache Lucene is an open source project available for free download.

History
1997 – Doug Cutting begins the Lucene project
2000 – First open source release
2002 – First Apache Jakarta release
2005 – Lucene becomes a top-level project
2006 – Lucene.Net gets Apache incubation status
2010 – Lucene.Net orphaned by original committers
2011 – Lucene.Net reaccepted into Apache Incubator
2012 – Lucene.Net graduates from the Incubator

Why you should care
• You want to provide customers with a “Google-like” search experience
• You want to tune incoming queries or results ranking
• You want better performance than SQL “like” searches
• You want to avoid deploying a separate search tool with your website or application

What does it do?
• Allows you to index and search vast amounts
of text quickly
• Provides a powerful query syntax
• Integrates into applications easily

How it works
• Lucene uses an inverted index
– Maps terms to the documents that contain them
• Lucene manages its index
– Stores the index in memory or on disk
– Allows documents to be added or removed
• Makes an index for each document
• Merges the index with a set of other indices
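
The inverted index and the merge step described above can be sketched in a few lines of Python. This is a toy illustration and not Lucene.Net code: the names `index_document` and `merge_segment` are hypothetical, and splitting on whitespace stands in for a real analyzer.

```python
def index_document(doc_id, text):
    """Build a tiny per-document index: term -> set of document ids."""
    return {term: {doc_id} for term in text.lower().split()}


def merge_segment(index, segment):
    """Merge a per-document index into the main index, in place."""
    for term, doc_ids in segment.items():
        index.setdefault(term, set()).update(doc_ids)
    return index


documents = {
    1: "frankly my dear",
    2: "my dear Scarlett",
    3: "Scarlett in Atlanta",
}

index = {}
for doc_id, text in documents.items():
    merge_segment(index, index_document(doc_id, text))

# Looking up a term returns the documents that contain it.
print(sorted(index["scarlett"]))  # [2, 3]
print(sorted(index["dear"]))      # [1, 2]
```

Because the index is keyed by term, a lookup never has to scan the documents themselves, which is where the speed advantage over SQL "like" searches comes from.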

Who uses Lucene.Net?
• Stack Overflow
• RavenDB
• Sitecore
• Orchard
• MindTouch
• Umbraco
• Sitefinity
• SubText

CONCEPTS

Differences between Java and .NET
The Lucene.Net API:
• Lags a few steps behind the Java version of
Lucene
• Takes advantage of advanced .NET features not
found in Java
But it:
• Preserves the core Lucene concepts
• Maintains indexes that are compatible with the
Java version

Logical Index Storage
• Field – a name/value pair
• Document – a sequence of fields
• Index – a collection of documents

Physical Index Storage
• Lucene generates a series of files within a single directory
• Moving an index is a copy-and-paste operation
• You can compress or zip an index to archive it

Luke
• Lucene Index Toolbox
• Built in Java, but can
read Lucene.Net
indexes
• http://code.google.com/p/luke/

Analyzers and Tokens
• Analyzers take strings of text and break them
into tokens
• Tokens are chunks of text and associated
metadata
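
As a rough sketch of what "chunks of text and associated metadata" can mean, the Python below splits a string into tokens that each carry a position and character offsets. The `Token` type and `analyze` function are hypothetical names for illustration only. A real Lucene.Net analyzer is more elaborate.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str      # the normalized chunk of text
    position: int  # ordinal position of the token in the stream
    start: int     # character offset where the token begins
    end: int       # character offset just past the token


def analyze(text):
    """Break text into lowercase word tokens, recording position and offsets."""
    return [
        Token(match.group().lower(), position, match.start(), match.end())
        for position, match in enumerate(re.finditer(r"\w+", text))
    ]


tokens = analyze("Frankly, my dear")
for token in tokens:
    print(token)
```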

Terms, Queries and Hits
• Terms – the basic unit for searching. A field
name and a value to seek.
• Queries – combine terms to form search
criteria
• Hits – a ranked list of pointers to documents

Create documents demo
• IndexWriter
• Directory
• Analyzer
• Document
• Field

Read documents demo
• IndexReader
• Term
• Query
• Hits

Update documents demo
• IndexWriter
• Document
• Term

Delete documents demo
• IndexWriter
• Query
• Term
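
The update and delete demos both identify documents by a term. In Lucene, an update is a delete of every document containing the term followed by an add of the new document. The Python sketch below imitates that behaviour on a plain list. The function names are hypothetical and this is not the Lucene.Net API.

```python
def delete_documents(index, field, value):
    """Remove every document whose field equals value; return the number removed."""
    before = len(index)
    index[:] = [doc for doc in index if doc.get(field) != value]
    return before - len(index)


def update_document(index, field, value, new_document):
    """Replace all documents matching the term with a single new document."""
    delete_documents(index, field, value)
    index.append(new_document)


index = [
    {"id": "1", "title": "Gone with the Wind"},
    {"id": "2", "title": "War and Peace"},
]

update_document(index, "id", "2", {"id": "2", "title": "War and Peace, Volume 1"})
removed = delete_documents(index, "id", "1")

print(removed)  # 1
print(index)    # [{'id': '2', 'title': 'War and Peace, Volume 1'}]
```

This is why a stored, unanalyzed identifier field is useful: it gives each document a term that matches exactly one document.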

Search demo
• IndexSearcher
• QueryParser
• Query
• Term
• TopDocs
• ScoreDoc

THINGS TO KNOW

Transactional Lucene
• Lucene supports ACID commits to its indexes
• Lucene uses the Commit and Rollback syntax,
much like relational databases.
• Source: http://blog.mikemccandless.com/2012/03/transactional-lucene.html
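
The commit and rollback idea can be pictured with a toy writer that buffers changes until they are committed. This Python sketch is an assumption-laden simplification: the class `ToyIndexWriter` and its methods are hypothetical, and it ignores durability, concurrency, and what Lucene does to the writer itself after a rollback.

```python
class ToyIndexWriter:
    """Buffers added documents; readers see them only after commit()."""

    def __init__(self):
        self._committed = []
        self._pending = []

    def add_document(self, document):
        self._pending.append(document)

    def commit(self):
        """Make all pending documents visible."""
        self._committed.extend(self._pending)
        self._pending.clear()

    def rollback(self):
        """Discard everything added since the last commit."""
        self._pending.clear()

    def visible_documents(self):
        return list(self._committed)


writer = ToyIndexWriter()
writer.add_document({"id": 1})
writer.commit()
writer.add_document({"id": 2})
writer.rollback()

print(writer.visible_documents())  # [{'id': 1}]
```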

Lucene index types
FSDirectory (your first choice)
• Stores indexed documents on disk
• Persists data across sessions
• Best choice for most applications
RAMDirectory (useful for unit testing)
• Stores indexed documents in memory
• Entire index must fit into available memory
• Does not persist data
• Faster than FSDirectory

Precalculation
• How you store things in Lucene matters –
choose field options and analyzers carefully
• The way you retrieve information determines
how it should be stored
• Smaller indexes give you better performance

Field.Store
Yes – stores the text in its original form
No – the original text is not preserved

Field.Index
• No – the field is not indexed, so it is not
searchable
• Not analyzed – the text is treated as single
unit and indexed whole
• Analyzed – the text is broken down into
tokens and indexed
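
The difference between the three indexing options is easiest to see by looking at which terms end up in the index. The Python sketch below is illustrative only: `terms_for_field` and its mode strings are hypothetical, and lowercasing plus whitespace splitting stands in for a real analyzer.

```python
def terms_for_field(value, mode):
    """Return the terms that would be indexed for a field value."""
    if mode == "no":
        return []                     # not indexed, so not searchable
    if mode == "not_analyzed":
        return [value]                # the whole value is one term
    if mode == "analyzed":
        return value.lower().split()  # broken into tokens
    raise ValueError("unknown mode: " + mode)


title = "Gone with the Wind"
print(terms_for_field(title, "no"))            # []
print(terms_for_field(title, "not_analyzed"))  # ['Gone with the Wind']
print(terms_for_field(title, "analyzed"))      # ['gone', 'with', 'the', 'wind']
```

An unanalyzed field matches only on the exact original value, which suits identifiers and dates. An analyzed field matches on individual words, which suits body text.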

Field.TermVector
• No – Does not store term vectors
• Yes – Stores the term vectors of each
document (terms and number of occurrences)
• With Positions Offsets – Term vector, token
position and offset information
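
A term vector with positions and offsets can be pictured as a per-document table of terms. The Python sketch below builds one under simple assumptions (word tokens, lowercased). The `term_vector` function and the dictionary layout are hypothetical and do not reflect Lucene's on-disk format.

```python
import re


def term_vector(text):
    """Map each term to its frequency, token positions, and character offsets."""
    vector = {}
    for position, match in enumerate(re.finditer(r"\w+", text)):
        entry = vector.setdefault(
            match.group().lower(),
            {"frequency": 0, "positions": [], "offsets": []},
        )
        entry["frequency"] += 1
        entry["positions"].append(position)
        entry["offsets"].append((match.start(), match.end()))
    return vector


vector = term_vector("war and peace and war")
print(vector["war"])
# {'frequency': 2, 'positions': [0, 4], 'offsets': [(0, 3), (18, 21)]}
```

Positions and offsets are what make features such as hit highlighting possible, at the cost of a larger index.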

Field types indexing options
Field     Stored  Analyzed      Vectored
Id        Yes     Not analyzed  No
Modified  Yes     Not analyzed  No
Path      Yes     Analyzed      No
Content   No      Analyzed      With Positions Offsets
An example of storing fields related to files on your computer.

Analyzers
• Break apart text into tokens; each token gets
indexed separately
• Remove stop words
• Decide how to handle punctuation
• Handle languages and case sensitivity
• You can create your own by building from
scratch or chaining existing analyzers
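
Chaining can be thought of as composing small text-processing stages. The Python sketch below strings together a tokenizer, a lowercase stage, and a stop-word stage. All names are hypothetical and the stop-word list is a made-up sample, so treat it as an illustration of the idea and not of any Lucene.Net class.

```python
import re

# Illustrative stop-word list; real analyzers ship their own.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def tokenize(text):
    """Split on anything that is not a letter or digit, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


def chain(*stages):
    """Compose stages into a single analyzer function."""
    def analyzer(text):
        result = text
        for stage in stages:
            result = stage(result)
        return result
    return analyzer


analyzer = chain(tokenize, lowercase, remove_stop_words)
print(analyzer("The Wind, and the War!"))  # ['wind', 'war']
```

Order matters: the stop-word stage runs after lowercasing so that "The" and "the" are both removed.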

Types of Queries
• TermQuery
• PhraseQuery
• RangeQuery
• PrefixQuery, WildcardQuery
• FuzzyQuery
• Use BooleanQuery to combine them
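
Combining queries boils down to set operations on the documents each term matches. The Python sketch below shows required and prohibited clauses only. The `boolean_query` function, its parameters, and the sample postings are hypothetical.

```python
def boolean_query(postings, must=(), must_not=()):
    """Return ids of documents containing every 'must' term and no 'must_not' term."""
    if not must:
        return set()
    result = set.intersection(*(postings.get(term, set()) for term in must))
    for term in must_not:
        result -= postings.get(term, set())
    return result


# Sample postings: term -> ids of the documents that contain it.
postings = {
    "scarlett": {1, 2, 4},
    "dear": {2, 3, 4},
    "voldemort": {4},
}

matches = boolean_query(postings, must=["scarlett", "dear"], must_not=["voldemort"])
print(matches)  # {2}
```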

Query syntax
Query Type     Purpose                                          Sample
TermQuery      Single word query                                scarlett
PhraseQuery    Matches terms in order                           "frankly my dear"
RangeQuery     Matches documents between the terms              [1861 TO 1865], {1861 TO 1865}
WildcardQuery  Lightweight regex-like term matching             Atl*, D?m?
PrefixQuery    Matches terms that begin with the string         War*
FuzzyQuery     Closeness matching                               cry~
BooleanQuery   Combines other queries into complex expressions  Scarlett AND "frankly my dear" -voldemort
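
The matching rules behind several of these query types can be approximated against a plain list of terms. The Python sketch below is a hedged illustration: the helper names are hypothetical, `fnmatchcase` stands in for wildcard matching, and fuzzy matching is modelled as a maximum edit distance, which is an assumption and not Lucene's exact similarity measure. Square brackets denote an inclusive range and curly braces an exclusive one.

```python
from fnmatch import fnmatchcase


def wildcard(terms, pattern):
    """'*' matches any run of characters, '?' matches exactly one."""
    return [t for t in terms if fnmatchcase(t, pattern)]


def prefix(terms, start):
    return [t for t in terms if t.startswith(start)]


def in_range(terms, low, high, inclusive=True):
    """[low TO high] when inclusive, {low TO high} otherwise."""
    if inclusive:
        return [t for t in terms if low <= t <= high]
    return [t for t in terms if low < t < high]


def edit_distance(a, b):
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def fuzzy(terms, target, max_edits=1):
    return [t for t in terms if edit_distance(t, target) <= max_edits]


terms = ["1861", "1863", "1865", "atlanta", "atlas", "cry", "dame", "dime", "dry", "war", "warfare"]

print(wildcard(terms, "atl*"))                           # ['atlanta', 'atlas']
print(wildcard(terms, "d?m?"))                           # ['dame', 'dime']
print(prefix(terms, "war"))                              # ['war', 'warfare']
print(in_range(terms, "1861", "1865"))                   # ['1861', '1863', '1865']
print(in_range(terms, "1861", "1865", inclusive=False))  # ['1863']
print(fuzzy(terms, "cry"))                               # ['cry', 'dry']
```

The terms are lowercase here because an analyzer would typically lowercase them at index time.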

Query, Filter, and Sort
• Lucene.Net can handle all three
• Default sort is by relevance
• Prefer queries to filters – they perform better

LINQ Providers
• LINQ to Lucene
• http://linqtolucene.codeplex.com/
• Lucene.Net.Linq
• https://github.com/themotleyfool/Lucene.Net.Linq
• Chris Eldredge
• MotleyFool

Recap
• Why would I use a search engine?
• Why would I use Lucene.Net?
• How would I add Lucene.Net to my project?
– Web
– Desktop
• Where could I go to learn more?
• When can I buy Dean a beer?

REFERENCES

Web References
• Lucene.Net – http://lucenenet.apache.org
• Solr – http://lucene.apache.org/solr
• Wikipedia
– http://en.wikipedia.org/wiki/Lucene
– http://en.wikipedia.org/wiki/Search_engine_indexing
• Academic discussions
– http://lucene.sourceforge.net/talks/pisa/
– http://lucene.sourceforge.net/talks/inktomi/

Books
• Lucene in Action, Second Edition
• Michael McCandless, Erik Hatcher, Otis Gospodnetić
• Manning Publications
• July 2010
• http://www.manning.com/hatcher3/

Books
• Taming Text
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris
• Manning Publications
• January 2013
• http://www.manning.com/ingersoll/

Books
• Introduction to Information Retrieval
• Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
• Cambridge University Press
• 2008
• http://www-nlp.stanford.edu/IR-book/

Presentations
• http://www.slideshare.net/nitin_stephens/lucene-basics

Blogs
• http://blog.mikemccandless.com/

Sample Files
All the literature shown in the code samples
comes from Project Gutenberg.
http://www.gutenberg.org/
