6. What is Lucene.Net?
Lucene.Net is a port of the Lucene search engine
library, written in C# and targeted at .NET
runtime users.
7. What is Lucene?
Apache Lucene is a high-performance, full-featured
text search engine library written entirely in Java.
Apache Lucene is an open source project
available for free download.
8. History
1997 – Lucene project begun by Doug Cutting
2000 – First open source release
2002 – First Apache Jakarta release
2005 – Lucene becomes a top-level project
2006 – Lucene.Net gets Apache incubation status
2010 – Lucene.Net orphaned by original committers
2011 – Lucene.Net reaccepted into Apache Incubator
2012 – Lucene.Net graduates from the Incubator
9. Why you should care
• You want to provide customers with a “Google-like” search experience
• You want to tune incoming queries or results ranking
• You want better performance than SQL “like” searches
• You want to avoid deploying a separate search tool with your website or application
10. What does it do?
• Allows you to index and search vast amounts
of text quickly
• Provides a powerful query syntax
• Integrates into applications easily
11. How it works
• Lucene uses an inverted index
– Maps terms to the documents that contain them
• Lucene manages its index
– Stores the index in memory or on disk
– Allows documents to be added or removed
• Makes an index for each document
• Merges the index with a set of other indices
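The inverted index idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the Lucene.Net API: the class name `TinyInvertedIndex`, its methods, and the whitespace-and-lowercase tokenization are all assumptions made for the example.

```python
from collections import defaultdict


class TinyInvertedIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.documents = {}               # document id -> original text

    def add(self, doc_id, text):
        # Assumption: terms are lowercased, whitespace-separated words.
        self.documents[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        # Drop the document from every posting list that mentions it.
        text = self.documents.pop(doc_id)
        for term in set(text.lower().split()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        # Look up a single term; no scan over the documents is needed.
        return sorted(self.postings.get(term.lower(), set()))


index = TinyInvertedIndex()
index.add(1, "frankly my dear")
index.add(2, "my dear Scarlett")
index.add(3, "war and peace")
print(index.search("dear"))
```

A lookup goes straight from the term to its documents, which is why this structure tends to beat scanning every row with a SQL “like” search.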
14. Differences between Java and .Net
The Lucene.Net API:
• Lags a few steps behind the Java version of
Lucene
• Takes advantage of advanced .Net features not
found in Java
But it:
• Preserves the core Lucene concepts
• Maintains indexes that are compatible with the
Java version
15. Logical Index Storage
• Field – a name/value pair
• Document – a sequence of fields
• Index – a collection of documents
16. Physical Index Storage
• Lucene generates a series of files within a single directory
• Moving an index is a copy-and-paste operation
• You can compress or zip an index to archive it
17. Luke
• Lucene Index Toolbox
• Built in Java, but can read Lucene.Net indexes
• http://code.google.com/p/luke/
18. Analyzers and Tokens
• Analyzers take strings of text and break them
into tokens
• Tokens are chunks of text and associated
metadata
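To make "chunks of text and associated metadata" concrete, here is a minimal Python sketch. It is not Lucene.Net code: the `Token` tuple, the `analyze` function, and the choice of position and character offsets as the metadata are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str      # the chunk of text, lowercased here
    position: int  # ordinal position of the token in the input
    start: int     # character offset where the token starts
    end: int       # character offset just past the token


def analyze(text):
    """Break a string into word tokens, recording position and offsets."""
    return [
        Token(match.group().lower(), position, match.start(), match.end())
        for position, match in enumerate(re.finditer(r"\w+", text))
    ]


tokens = analyze("Frankly, my dear")
for token in tokens:
    print(token)
```

Positions are the kind of metadata that makes phrase and proximity matching possible, and offsets let an application point back at the matched text.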
19. Terms, Queries and Hits
• Terms – the basic unit for searching. A field
name and a value to seek.
• Queries – combine terms to form search
criteria
• Hits – a ranked list of pointers to documents
27. Transactional Lucene
• Lucene supports ACID commits to its indexes
• Lucene uses the Commit and Rollback syntax,
much like relational databases.
• Source: http://blog.mikemccandless.com/2012/03/transactional-lucene.html
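The commit and rollback idea can be sketched as a toy in Python. This shows only the visibility rule (searchers see nothing until a commit, and a rollback discards pending changes); it does not model durability or concurrent access, and `TinyTransactionalIndex` and its methods are hypothetical names, not the Lucene.Net API.

```python
class TinyTransactionalIndex:
    """Toy model of commit/rollback: pending changes stay invisible until committed."""

    def __init__(self):
        self._committed = {}  # what searchers can see
        self._pending = {}    # working copy that accumulates changes

    def add(self, doc_id, text):
        self._pending[doc_id] = text

    def commit(self):
        # Publish all pending changes at once.
        self._committed = dict(self._pending)

    def rollback(self):
        # Throw away everything since the last commit.
        self._pending = dict(self._committed)

    def visible(self):
        return dict(self._committed)


index = TinyTransactionalIndex()
index.add(1, "frankly my dear")
index.commit()
index.add(2, "war and peace")  # not visible yet
index.rollback()               # document 2 is discarded
index.commit()
print(index.visible())
```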
28. Lucene index types
FSDirectory
• Stores indexed documents on disk
• Persists data across sessions
• Best choice for most applications
Your first choice
RAMDirectory
• Stores indexed documents in memory
• Entire index must fit into available memory
• Does not persist data
• Faster than FSDirectory
Useful for unit testing
29. Precalculation
• How you store things in Lucene matters –
choose field options and analyzers carefully
• The way you retrieve information determines
how it should be stored
• Smaller indexes give you better performance
31. Field.Index
• No – the field is not indexed, so it is not
searchable
• Not analyzed – the text is treated as a single
unit and indexed whole
• Analyzed – the text is broken down into
tokens and indexed
32. Field.TermVector
• No – Does not store term vectors
• Yes – Stores the term vectors of each
document (terms and number of occurrences)
• With Positions Offsets – Term vector, token
position and offset information
33. Field types indexing options
Field      Stored   Analyzed       Vectored
Id         Yes      Not analyzed   No
Modified   Yes      Not analyzed   No
Path       Yes      Analyzed       No
Content    No       Analyzed       With Positions Offsets
An example of storing fields related to files on your computer.
34. Analyzers
• Break apart text into tokens; each token gets
indexed separately
• Remove stop words
• Decide how to handle punctuation
• Handle languages and case sensitivity
• You can create your own by building from
scratch or chaining existing analyzers
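Chaining can be pictured as a tokenizer followed by a series of filters. The Python sketch below is an illustration only, not the Lucene.Net analyzer classes: the function names, the `chain` helper, and the short stop-word list are all hypothetical.

```python
import re

# Hypothetical stop-word list, kept short for the example.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def tokenize(text):
    """Split on anything that is not a letter, digit, or apostrophe."""
    return re.findall(r"[A-Za-z0-9']+", text)


def lowercase(tokens):
    """Make matching case-insensitive."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Drop very common words that add little to a search."""
    return [token for token in tokens if token not in STOP_WORDS]


def chain(tokenizer, *filters):
    """Build an analyzer that runs the tokenizer, then each filter in order."""
    def analyzer(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyzer


simple_analyzer = chain(tokenize, lowercase, remove_stop_words)
print(simple_analyzer("The Tale of Two Cities"))
```

Order matters in a chain like this: lowercasing runs before stop-word removal so that "The" is recognized as a stop word.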
35. Types of Queries
• TermQuery
• PhraseQuery
• RangeQuery
• PrefixQuery, WildcardQuery
• FuzzyQuery
• Use BooleanQuery to combine them
36. Query syntax
Query Type      Purpose                                            Sample
TermQuery       Single word query                                  scarlett
PhraseQuery     Matches terms in order                             “frankly my dear”
RangeQuery      Matches documents between the terms                [1861 TO 1865]
                                                                   {1861 TO 1865}
WildcardQuery   Lightweight regex-like term matching               Atl*
                                                                   D?m?
PrefixQuery     Matches terms that begin with the string           War*
FuzzyQuery      Closeness matching                                 cry~
BooleanQuery    Combines other queries into complex expressions    Scarlett AND “frankly my dear” -voldemort
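The prefix, wildcard, and range samples above can be imitated against a plain list of terms. This Python sketch only mimics the matching rules, using the standard `fnmatch` module for the `*` and `?` wildcards. It is not how Lucene evaluates queries, and the term list and function names are made up for the example. Terms are lowercased on the assumption that an analyzer has already done so.

```python
from fnmatch import fnmatchcase

# Hypothetical terms, as they might appear in an index after lowercasing.
TERMS = ["1860", "1861", "1863", "1865", "atlanta", "atlas",
         "dame", "dime", "war", "warning"]


def prefix(terms, start):
    """PrefixQuery-style match: terms that begin with the string."""
    return [term for term in terms if term.startswith(start)]


def wildcard(terms, pattern):
    """WildcardQuery-style match: * is any run of characters, ? is exactly one."""
    return [term for term in terms if fnmatchcase(term, pattern)]


def term_range(terms, low, high, inclusive=True):
    """RangeQuery-style match: [low TO high] is inclusive, {low TO high} is exclusive."""
    if inclusive:
        return [term for term in terms if low <= term <= high]
    return [term for term in terms if low < term < high]


print(prefix(TERMS, "war"))
print(wildcard(TERMS, "d?m?"))
print(term_range(TERMS, "1861", "1865"))
print(term_range(TERMS, "1861", "1865", inclusive=False))
```

The comparison in `term_range` is on strings, not numbers, which is why fixed-width values such as four-digit years behave well in this kind of range match.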
37. Query, Filter, and Sort
• Lucene.Net can handle all three
• Default sort is by relevance
• Prefer queries to filters – they perform better
40. Recap
• Why would I use a search engine?
• Why would I use Lucene.Net?
• How would I add Lucene.Net to my project?
– Web
– Desktop
• Where could I go to learn more?
• When can I buy Dean a beer?
43. Books
• Lucene in Action, Second Edition
• Michael McCandless, Erik Hatcher, Otis Gospodnetić
• Manning Publications
• July 2010
• http://www.manning.com/hatcher3/
44. Books
• Taming Text
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris
• Manning Publications
• January 2013
• http://www.manning.com/ingersoll/
45. Books
• Introduction to Information Retrieval
• Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
• Cambridge University Press
• 2008
• http://www-nlp.stanford.edu/IR-book/
48. Sample Files
All the literature shown in the code samples
comes from Project Gutenberg.
http://www.gutenberg.org/
Editor's notes
Egad, the PUNishment! Well, at least I didn’t have a boring “Introduction to Lucene.NET” title.
Oooh, an agenda. Aren’t I organized?
Please send me an email to get in touch with me. Keep up with what I’m doing on the Infovark website or on my LinkedIn profile. I’ve listed my twitter handles – personal and work – but I rarely log into Twitter for any length of time. Send me a private message if you want to get my attention on Twitter.
Doug Cutting had written search engines in other languages, but he wanted to teach himself Java. So the Lucene project began. Although he started building a commercial venture around the project, he decided that he preferred writing code to running a business. He open sourced the code in 2000. Lucene got adopted by the Apache Software Foundation in 2001. Lucene.Net, which began as an independent port of Lucene, was accepted by the ASF in 2006. In 2010, Lucene.Net hit a rough patch, but thanks to the efforts of the Alt.Net community, it was reintroduced to the Apache Incubator. In 2012, it graduated from the Incubator and became a full-fledged Apache project.
Inverted index
Maps terms to the documents that contain them
Terms may include metadata to improve ranking
Terms may include position data for proximity searches
These are a few examples of websites, applications, and platforms that use Lucene.Net. If I included those that use Lucene, the Java version, the list would be huge. Even if you don’t use Lucene.Net directly, chances are good that you use something that does. Lucene has become a foundational technology for many of the tools and sites we use today, but not many folks working on the Microsoft side are familiar with it. Some prominent Java examples include LinkedIn, Twitter, IBM’s OmniFind, and many more.
The .Net version is catching up with the Java version, but it remains nearly a full version behind. The .Net API is much nicer to work with, having good collections and generics support. Tools that interact with a Lucene index will work regardless of the Lucene library that created it.
Although we’ll be working with the Lucene.NET API tonight, many of the concepts you’ll hear will apply to any search engine, though the specific terminology may differ a little. Let’s review some basic definitions we’ll use throughout the rest of the presentation.
Index – a collection of documents
Document – a sequence of fields
Field – a string name/value pair
Luke is one of the ugliest applications I’ve ever seen, but it’s extremely useful. It exposes just about every aspect of the Lucene API, so it makes a great test-bed for trying out different ideas.
Analyzer – breaks field values into tokens
Token – a tuple consisting of a chunk of text and its associated metadata. Tokens are the raw bits that get indexed. (Tokens and terms are closely related.)
Query – a way to ask a question of an index
Term – a tuple containing a field and a value to seek
Here are some of the key classes used to add documents to the index. I really ought to add some details to the slide for folks who can’t see the code sample.
Updating is a fairly new operation in the Lucene.Net API. Under the hood, it’s doing a Delete operation then an Add operation.
Did you know that you can use an IndexReader to update and delete documents, too? Yes, but I don’t recommend it. This is one of the parts of the API that’s getting revised in the near future.
Unlike a relational database, there’s no “normal form” to guide you when structuring a Lucene index. The key thing to remember is that the
Keeping the original text within the Lucene index is convenient, but can vastly increase the size of your indexes.
Term Vector Yes
Just an example of how you might combine the flags when adding fields to a document.
TermQuery – retrieves documents by a key
PrefixQuery – matches the start of a string value
RangeQuery – searches starting at one term and ending at another (useful for date searches)
BooleanQuery – lets you combine other queries using AND, OR, NOT operations
PhraseQuery – finds terms a specified distance from one another
FuzzyQuery – matches terms similar to a specified term
Examples of query syntax.
Some odds and ends on Queries, filters and sorting.
We can finally dispose of our Lucene objects in versions 2.9.4 and later. If you’re using older versions, you must remember to try/finally the FSDirectory and IndexWriter. Remember that it’s much more efficient to add a bunch of documents within a single using statement than to open a new IndexWriter each time.