Introduction to Information Retrieval with Lucene
By Stylianos Gkorilas

Introductions

Presenter
- Architect/Development Team Leader @Trasys Greece
- Java EE projects for European Agencies

IR (Information Retrieval)
- The tracing and recovery of specific information from stored data
- IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.

Lucene
- Open Source - Apache Software License (http://lucene.apache.org)
- Founder: Doug Cutting
- 0.01 release in March 2000 (SourceForge)
- 1.2 release in June 2002 (first Apache Jakarta release)
- Became its own top-level Apache project in 2005
- Current version is 3.1

More Lucene Intro…
- Lucene is a high-performance, scalable IR library (not a ready-to-use application)
- A number of full-featured search applications are built on top of it (more later…)
- Lucene ports and bindings exist in many other programming environments, incl. Perl, Python, Ruby, C/C++, PHP and C# (.NET)
- Lucene "Powered By" apps (a few of many): LinkedIn, Apple, MySpace, Eclipse IDE, MS Outlook, Atlassian (JIRA). See more @ http://wiki.apache.org/lucene-java/PoweredBy

Components of a Search Application (1/4)

Acquire Content
- Gather and scope the content, e.g. from the web with a spider or crawler, from a CMS, a database or a file system
- Projects that help:
  - Solr: handles RDBMS and XML feeds and rich documents through Tika integration
  - Nutch: web crawler, a sister project at Apache
  - Grub: open source web crawler

Components of a Search Application (2/4)

Build Document
- Define the document
  - The unit of the search engine
  - Has fields
  - De-normalization involved
- Projects that help: usually the same frameworks cover both this and the previous step
  - Compass and its evolution, ElasticSearch
  - Hibernate Search
  - DBSight
  - Oracle/Lucene Integration

Components of a Search Application (3/4)

Analyze Document
- Handled by Analyzers
  - Built-in and contributed
  - Built with tokenizers and token filters

Index Document
- Through the Lucene API or your framework of choice

Search User Interface / Render Results
- Application specific

Components of a Search Application (4/4)

Query Builder
- Lucene provides one
- Frameworks provide extensions, but so does the application itself, e.g. advanced search

Run Query
- Retrieve documents by running the query that was built
- Three common theoretical models:
  - Boolean model
  - Vector space model
  - Probabilistic model

Administration
- e.g. tuning options

Analytics
- Reporting

How Lucene models content
- Documents
- Fields
- Denormalization of content
- Flexible schema
- Inverted index

Basic Lucene Classes

Indexing
- IndexWriter
- Directory
- Analyzer
- Document
- Field

Searching
- IndexSearcher
- Query
- TopDocs
- Term
- QueryParser

Basic Indexing

Adding documents:

RAMDirectory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory,
    new WhitespaceAnalyzer(),
    IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("post",
    "the JHUG meeting is on this Saturday",
    Field.Store.YES,
    Field.Index.ANALYZED));
writer.addDocument(doc);

Deleting and updating documents (a sketch follows after this slide)

Field options
- Store
- Analyze
- Norms
- Term vectors
- Boost

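As a sketch of the deletion and update calls mentioned above, assuming the Lucene 3.x IndexWriter API and a hypothetical "id" field used as the update key (not part of the original deck):

// Delete every document whose "post" field contains the term "saturday"
writer.deleteDocuments(new Term("post", "saturday"));

// An update is an atomic delete-then-add, keyed on a term
Document newDoc = new Document();
newDoc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
newDoc.add(new Field("post", "the meeting moved to Sunday",
    Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(new Term("id", "42"), newDoc);

writer.commit();
writer.close();
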
Scoring – The formula
- tf(t in d): Term frequency factor for the term (t) in the document (d), i.e. how many times the term t occurs in the document.
- idf(t): Inverse document frequency of the term: a measure of how "unique" the term is. Very common terms have a low idf; very rare terms have a high idf.
- boost(t.field in d): Field and document boost, as set during indexing. This may be used to statically boost certain fields and certain documents over others.
- lengthNorm(t.field in d): Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index norms. Shorter fields (fewer tokens) get a bigger boost from this factor.
- coord(q, d): Coordination factor, based on the number of query terms the document contains. The coordination factor gives an AND-like boost to documents that contain more of the search terms than other documents.
- queryNorm(q): Normalization value for a query, given the sum of the squared weights of each of the query terms.

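For reference, the factors above combine into the practical scoring formula of Lucene's DefaultSimilarity (reconstructed here from the definitions on this slide; it is not spelled out in the original deck):

score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t in q} [ tf(t in d) · idf(t)² · boost(t.field in d) · lengthNorm(t.field in d) ]
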
Querying – the API

Variety of Query class implementations:
- TermQuery
- PhraseQuery
- TermRangeQuery
- NumericRangeQuery
- PrefixQuery
- BooleanQuery
- WildcardQuery
- FuzzyQuery
- MatchAllDocsQuery
- …

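A small sketch of building such queries programmatically (assuming the Lucene 3.x API; the "contents" field, the terms and the searcher instance are illustrative, not from the original deck):

// Single-term query
Query termQuery = new TermQuery(new Term("contents", "lucene"));

// Boolean combination: MUST contain "lucene", SHOULD contain "search"
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new TermQuery(new Term("contents", "lucene")), BooleanClause.Occur.MUST);
booleanQuery.add(new TermQuery(new Term("contents", "search")), BooleanClause.Occur.SHOULD);

TopDocs hits = searcher.search(booleanQuery, 10);
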
Querying - Example

private void indexSingleFieldDocs(Field[] fields) throws Exception {
  IndexWriter writer = new IndexWriter(directory,
      new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
  for (int i = 0; i < fields.length; i++) {
    Document doc = new Document();
    doc.add(fields[i]);
    writer.addDocument(doc);
  }
  writer.optimize();
  writer.close();
}

public void wildcard() throws Exception {
  indexSingleFieldDocs(new Field[] {
      new Field("contents", "wild", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "child", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "mild", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "mildew", Field.Store.YES, Field.Index.ANALYZED) });
  IndexSearcher searcher = new IndexSearcher(directory, true);
  Query query = new WildcardQuery(new Term("contents", "?ild*"));
  TopDocs matches = searcher.search(query, 10);
}

Querying - QueryParser

Query query = new QueryParser(Version.LUCENE_31, "subject",
    analyzer).parse("(clinical OR ethics) AND methodology");

Query syntax examples:
- trachea AND esophagus
- The default join condition is OR, e.g. trachea esophagus
- cough AND (trachea OR esophagus)
- trachea NOT esophagus
- full_title:trachea
- "trachea disease"
- "trachea disease"~5
- is_gender_male:y
- [2010-01-01 TO 2010-07-01]
- esophaguz~
- trachea^5 esophagus

Analyzers - Internals
- Used both at indexing and at querying time
- Inside an analyzer
  - It operates on a TokenStream
  - A token has a text value and metadata like:
    - Start and end character offsets
    - Token type
    - Position increment
    - Optionally, application-specific bit flags and a byte[] payload
  - TokenStream is abstract; Tokenizer and TokenFilter are the concrete ones
    - A Tokenizer reads chars and produces tokens
    - A TokenFilter ingests tokens and produces new ones
    - The composite pattern is implemented and they form a chain of one another

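A minimal custom analyzer built as such a chain (a sketch assuming the Lucene 3.1 API; the class name is illustrative and imports are omitted, as on the other slides):

public class LowercasingStopAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // The Tokenizer produces the raw tokens...
    TokenStream stream = new StandardTokenizer(Version.LUCENE_31, reader);
    // ...and TokenFilters wrap it, forming a chain
    stream = new LowerCaseFilter(Version.LUCENE_31, stream);
    stream = new StopFilter(Version.LUCENE_31, stream,
        StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return stream;
  }
}
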
Analyzers – building blocks
- Analyzers can be created by combining token streams (order is important)
- Building blocks provided in core:
  - CharTokenizer
  - WhitespaceTokenizer
  - KeywordTokenizer
  - LetterTokenizer
  - LowerCaseTokenizer
  - SinkTokenizer
  - StandardTokenizer
  - LowerCaseFilter
  - StopFilter
  - PorterStemFilter
  - TeeTokenFilter
  - ASCIIFoldingFilter
  - CachingTokenFilter
  - LengthFilter
  - StandardFilter

Analyzers - core
- WhitespaceAnalyzer: splits tokens at whitespace
- SimpleAnalyzer: divides text at non-letter characters and lowercases
- StopAnalyzer: divides text at non-letter characters, lowercases, and removes stop words
- KeywordAnalyzer: treats the entire text as a single token
- StandardAnalyzer: tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases and removes stop words

Analyzers – Example (1/2)

Analyzing "The JHUG meeting is on this Saturday"

WhitespaceAnalyzer:
[The] [JHUG] [meeting] [is] [on] [this] [Saturday]

SimpleAnalyzer:
[the] [jhug] [meeting] [is] [on] [this] [saturday]

StopAnalyzer:
[jhug] [meeting] [saturday]

StandardAnalyzer:
[jhug] [meeting] [saturday]

Analyzers – Example (2/2)

Analyzing "XY&Z Corporation - xyz@example.com"

WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]

SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]

StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]

StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]

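Token listings like the ones above can be produced with a small helper along these lines (a sketch assuming the attribute-based TokenStream API of Lucene 3.1; the helper itself is not part of the original deck, and imports are omitted as on the other slides):

public static void displayTokens(Analyzer analyzer, String text) throws IOException {
  TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
  CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
  while (stream.incrementToken()) {
    System.out.print("[" + term.toString() + "] ");  // print each token's text
  }
  stream.close();
}
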
Analyzers – Beyond the built-in
- Language-specific analyzers, under contrib/analyzers
  - Language-specific stemming and stop-word removal
- Sounds-like analyzers, e.g. a MetaphoneReplacementAnalyzer that transforms terms to their phonetic roots
- SynonymAnalyzer
- Nutch analysis: bigrams for stop words
- Stemming analysis
  - The PorterStemFilter stems words using the Porter stemming algorithm created by Dr. Martin Porter, and it's best defined in his own words:
    "The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems."
  - SnowballAnalyzer: stemming for many European languages

Filters
- Narrow the search space
- Overloaded search methods accept Filter instances (see the sketch after this list)
- Examples:
  - TermRangeFilter
  - NumericRangeFilter
  - PrefixFilter
  - QueryWrapperFilter
  - SpanQueryFilter
  - ChainedFilter

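A sketch of applying a filter at search time (Lucene 3.x API; the "modified" and "contents" fields, the date bounds and the searcher instance are illustrative):

// Only consider documents whose "modified" field falls within the given range
Filter filter = new TermRangeFilter("modified", "2010-01-01", "2010-12-31", true, true);
TopDocs hits = searcher.search(new TermQuery(new Term("contents", "lucene")), filter, 10);
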
Example: Filters for Security
- Constraints known at indexing time
  - Index the constraint as a field
  - Search wrapping a TermQuery on the constraint field with a QueryWrapperFilter
- Factoring in information at search time
  - A custom filter
  - The filter will access an external privilege store that provides some means of identifying documents in the index, e.g. a unique term with regard to permissions
  - Return a DocIdSet to Lucene. Bit positions match the document numbers. Enabled bits mean the document for that position is available to be searched against the query; unset bits mean the documents won't be considered in the search

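A minimal sketch of such a custom security filter (assuming the Lucene 3.x Filter API; the "acl_group" field and the way permissions are obtained are hypothetical):

public class PermissionFilter extends Filter {
  private final String allowedGroup;  // e.g. resolved from an external privilege store

  public PermissionFilter(String allowedGroup) {
    this.allowedGroup = allowedGroup;
  }

  @Override
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    // Enable the bit of every document carrying the permission term
    TermDocs termDocs = reader.termDocs(new Term("acl_group", allowedGroup));
    while (termDocs.next()) {
      bits.set(termDocs.doc());
    }
    termDocs.close();
    return bits;
  }
}
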
Internals - Concurrency
- Any number of IndexReaders may be open
- Only one IndexWriter at a time
  - Locking with a write-lock file
- IndexReaders may be open while the index is being changed by an IndexWriter
  - A reader sees changes only after the writer commits and the reader is reopened
- IndexSearchers use underlying IndexReaders
- Both are thread-safe/thread-friendly classes

Internals - Indexing concepts
- The index is made up of segment files
- Deleting documents does not actually delete, it only marks them for deletion
- Index writes are buffered and flushed periodically
- Segments need to be merged
  - Automatically by the IndexWriter
  - Explicit calls to optimize
- There is the notion of a commit (as you would expect), which has 4 steps:
  1. Flush buffered documents and deletions
  2. Sync files; force the OS to write to stable storage of the underlying I/O system
  3. Write and sync the segments_N file
  4. Remove old commits

Internals - Transactions
- Two-phase commit is supported
  - prepareCommit performs steps 1, 2 and most of 3
- Lucene implements the ACID transactional model
  - Atomicity: all-or-nothing commit
  - Consistency: e.g. an update will mean both a delete and an add
  - Isolation: IndexReaders cannot see what has not been committed
  - Durability: the index is not corrupted and persists in storage

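A sketch of the two-phase commit calls on IndexWriter (Lucene 3.x API):

try {
  writer.prepareCommit();  // phase 1: flush and sync, prepare the new commit point
  // ... coordinate with other resources, e.g. a database transaction ...
  writer.commit();         // phase 2: make the new commit point visible to readers
} catch (Exception e) {
  writer.rollback();       // discard all changes since the last commit (also closes the writer)
}
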
Architectures
- Cluster nodes that share a remote file system index
  - Slower than local
  - Possible limitations due to client-side caching (Samba, NFS, AFP) or stale file handles (NFS)
- Index in a database
  - Much slower
- Separate write and read indexes (replication)
  - Relies on the IndexDeletionPolicy feature of Lucene
  - Out of the box in Solr and ElasticSearch
- Autonomous search servers (e.g. Solr, ElasticSearch)
  - Loose coupling through JSON or XML

Frameworks – Compass Document definition via JPA mapping
<compass-core-mapping package="eu.emea.eudract.model.entity">
<class name="cta.sectiona.CtaIdentification" alias="cta" root="true" support-unmarshall="false">
<id name="ctaIdentificationId">
<meta-data>cta_id</meta-data>
</id>
<dynamic-meta-data name="ncaName" converter="jexl" store="yes">data.submissionOrg.name
</dynamic-meta-data>
<property name="fullTitle">
<meta-data>cta_full_title</meta-data>
</property><property name="sponsorProtocolVersionDate">
<meta-data format="yyyy-MM-dd" store="no">cta_sponsor_protocol_version_date</meta-data>
</property>
<property name="isResubmission">
<meta-data converter="shortToYesNoNaConverter" store="no">cta_is_resubmission</meta-data>
</property>
<component name="eudractNumber" />
</class>
<class name="eudractnumber.EudractNumber" alias="eudract_number" root="false">
<property name="eudractNumberId">
<meta-data converter="dashHandlingConverter" store="no">filteredEudractNumberId</meta-data>
<meta-data>eudract_number</meta-data>
</property>
<property name="paediatricClinicalTrial">
<meta-data converter="shortToYesNoNaConverter" store="no">paediatric_clinical_trial
</meta-data>
</property>
</class>
.....
</compass-core-mapping>
Frameworks – Solr Document definition via DB mapping
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document name="products">
<entity name="item" query="select * from item">
<field column="ID" name="id" />
<field column="NAME" name="name" />
<field column="MANU" name="manu" />
<field column="WEIGHT" name="weight" />
<field column="PRICE" name="price" />
<field column="POPULARITY" name="popularity" />
<field column="INSTOCK" name="inStock" />
<field column="INCLUDES" name="includes" />
<entity name="feature" query="select description from feature where item_id='${item.ID}'">
<field name="features" column="description" />
</entity>
<entity name="item_category" query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
<entity name="category" query="select description from category where id =
'${item_category.CATEGORY_ID}'">
<field column="description" name="cat" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
Frameworks – Compass/Lucene Configuration
<compass name="default">
<setting name="compass.transaction.managerLookup">
org.compass.core.transaction.manager.OC4J</setting>
<setting name="compass.transaction.factory">
org.compass.core.transaction.JTASyncTransactionFactory</setting>
<setting name="compass.transaction.lockPollInterval">400</setting>
<setting name="compass.transaction.lockTimeout">90</setting>
<setting name="compass.engine.connection">file://P:/Tmp/stelinio</setting>
<!--<setting name="compass.engine.connection">
jdbc://jdbc/EudractV8DataSourceSecure</setting>-->
<!--<setting name="compass.engine.store.jdbc.connection.provider.class">-->
<!--org.compass.core.lucene.engine.store.jdbc.JndiDataSourceProvider-->
<!--</setting>-->
<!--<setting name="compass.engine.ramBufferSize">512</setting>-->
<!--<setting name="compass.engine.maxBufferedDocs">-1</setting>-->
<setting name="compass.converter.dashHandlingConverter.type">
eu.emea.eudract.compasssearch.DashHandlingConverter
</setting>
<setting name="compass.converter.shortToYesNoNaConverter.type">
eu.emea.eudract.compasssearch.ShortToYesNoNaConverter
</setting>
<setting name="compass.converter.shortToPerDayOrTotalConverter.type">
eu.emea.eudract.compasssearch.ShortToPerDayOrTotalConverter
</setting>
<setting name="compass.engine.store.jdbc.dialect">
org.apache.lucene.store.jdbc.dialect.OracleDialect
</setting>
<setting name="compass.engine.analyzer.default.type">
org.apache.lucene.analysis.standard.StandardAnalyzer
</setting>
</compass>
Cool extra features - Spellchecking
- You will need a dictionary of valid words
  - You could use the unique terms in your index
- Given the dictionary you could:
  - Use a sounds-like algorithm such as Soundex or Metaphone
  - Or use n-grams
    - E.g. squirrel as 3-grams is squ, qui, uir, irr, rre, rel; as 4-grams squi, quir, uirr, irre, rrel. Mistakenly searching for squirel would match 5 grams, with 2 shared between the 3-grams and 4-grams. This would score high!
  - Maybe use the Levenshtein distance
- To present or not to present (the suggestion): other ideas
  - Use the rest of the terms in the query to bias
  - Maybe combine distance with the frequency of the term
  - Check result numbers in the initial and corrected searches

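Lucene's contrib spell-checker packages much of this up; a sketch of using it (assuming the org.apache.lucene.search.spell contrib module of Lucene 3.x; the directory path, field name and indexReader instance are illustrative):

// Build the spell-check index from the unique terms of an existing field
SpellChecker spellChecker = new SpellChecker(FSDirectory.open(new File("/tmp/spellindex")));
spellChecker.indexDictionary(new LuceneDictionary(indexReader, "contents"));

// Ask for the 5 closest suggestions for a misspelled word
String[] suggestions = spellChecker.suggestSimilar("squirel", 5);
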
Even More features
- Sorting (a sketch follows after this slide)
  - Use a field for sorting instead of relevance, e.g. when you use the MatchAllDocsQuery
  - Beware: it uses the FieldCache, which resides in RAM!
- SpanQueries
  - Distance between terms (span)
  - A family of queries like SpanNearQuery, SpanOrQuery and others
- Synonyms
  - Injection during indexing or during searching?
    - Injection during indexing will allow faster searches
    - A MultiPhraseQuery is appropriate for search-time injection
  - Leverage a synonyms knowledge base
    - A good strategy is to convert it into an index
  - The key thing to understand is that synonyms must be injected at the same position increments
- Spatial Searches
  - Answer to the query "Greek Restaurants Near Me"
  - An efficient technique is to use grids
    - Assign non-unique grid numbers to areas (e.g. in a mercator space)
    - Index documents with a field containing the grid numbers that match their longitude and latitude
- MoreLikeThis
  - One use of term vectors
- Function Queries
  - e.g. add boosts for fields at search time

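A sketch of sorting by a field instead of relevance (Lucene 3.x API; the "date" field is illustrative and should be indexed un-analyzed):

// Sort all documents by the "date" field, descending, instead of by score
Sort sort = new Sort(new SortField("date", SortField.STRING, true));
TopDocs hits = searcher.search(new MatchAllDocsQuery(), null, 10, sort);
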
Some last things to bear in mind
- It would be wise to back up your index
  - You can have hot backups (supported through the SnapshotDeletionPolicy)
- Performance has some trade-offs
  - Search latency
  - Indexing throughput
  - Near-real-time results
  - Index replication
  - Index optimization
- Resource consumption
  - Disk space
  - File descriptors
  - Memory
- 'Luke' is a really handy tool
- You can repair a corrupted index (corrupted segments are just lost… D'oh!)

Resources
- Book: Lucene in Action
- Solr: http://lucene.apache.org/solr/
- Vector Space Model: http://en.wikipedia.org/wiki/Vector_Space_Model
- IR Links: http://wiki.apache.org/lucene-java/InformationRetrieval

That’s it

Questions?
