Information Access with
     Apache Lucene
                              Pierpaolo Basile, PhD
     Research Assistant University Of Bari Aldo Moro
                      co-founder QuestionCube s.r.l.
         pierpaolo.basile@{uniba.it, questioncube.com}


             CODE CAMP: JULY 20 - AUGUST 1, 2012
             S. VITO DEI NORMANNI (BR)
Outline
• Search Engine
  – Vector Space Model
  – Inverted Index
• Apache Lucene
  – Indexing
  – Searching
• Advanced topics
  – Relevance feedback
  – Document similarity
  – Apache Tika
SEARCH ENGINE
Searching

            Collection of
             documents



            User query



             Relevant
            documents
Information Retrieval

Information Retrieval (IR) is the area of study
concerned with searching for documents,
for information within documents, and
for metadata about documents, as well as that of
searching structured storage, relational databases,
and the World Wide Web.

                                         Wikipedia
Information Retrieval Process
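[Figure: the retrieval process.
 Indexing side: Docs → Text → Text Operations → Logic view → Indexing → Inverted Index (Index).
 Query side: User Interface → Query → Text Operations → Logic view → Query Operations → Searching (on the Index) → Retrieved documents → Ranking → Ranked documents → User Interface.
 User feedback flows from the user back into Query Operations.]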
Information Retrieval Model
<D, Q, F, R(qi, dj)>
• D: document representation
• Q: query representation
• F: query/document representation framework
• R(qi, dj): ranking function
Bag-of-words representation
Document/query as unordered collection of words

John likes to watch movies. Mary likes too. John also likes to
watch football games.



 {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6,
 "football": 7, "games": 8, "Mary": 9, "too": 10}
Vector Space Model
Query/Document Framework
• Algebraic model
• Documents/queries are represented as vector
  in a geometric space
• Terms are vector components
Vector Space Model

     $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
     $q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
Each vector component corresponds to a term

The component value $w_{i,j}$ is the weight of the term $t_i$
in the document $d_j$
Vector Space Model
dj=(hit:5, refresh:2)
q1=(refresh:1)
q2=(hit:1)
[Figure: dj, q1 and q2 drawn as vectors in a plane with axes hit and refresh.]
Vector Space Model
Ranking function
• Vector similarity between query and
  document computed by cosine

  $R(d_j, q) = \cos\theta = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}$
Vector Space Model
dj=(hit:5, refresh:2)
q1=(refresh:1)
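q2=(hit:1)
[Figure: dj, q1 and q2 drawn as vectors in a plane with axes hit and refresh; θ1 is the angle marked for q1 and θ2 the angle marked for q2.]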
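The ranking of the two queries against dj can be checked numerically. The following is a minimal sketch in plain Java, not Lucene code; the class name `CosineExample` and the component order (hit, refresh) are assumptions made for illustration.

```java
class CosineExample {
    /** Cosine of the angle between two vectors of equal length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Components are ordered as (hit, refresh).
        double[] dj = {5, 2};
        double[] q1 = {0, 1}; // refresh:1
        double[] q2 = {1, 0}; // hit:1
        System.out.println("R(dj, q1) = " + cosine(dj, q1));
        System.out.println("R(dj, q2) = " + cosine(dj, q2));
    }
}
```

With these vectors R(dj, q1) = 2/√29 ≈ 0.371 and R(dj, q2) = 5/√29 ≈ 0.928, so dj ranks higher for q2 (hit) than for q1 (refresh).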
Term-Document matrix

 D1
 • ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’
 • ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number
   One,’ said Alice.

 D2
 • Alice watched the White King as he slowly struggled up from bar to bar, till at
   last she said, ‘Why, you’ll be hours and hours getting to the table, at that rate.’
 • Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a
   good deal!’ was her first remark.

 D3
 • In the pool of light was a billiards table, with two figures moving around it. Alice
   walked toward them, and as she approached they turned to look at her.
 • Alice lay back, and closed her eyes. There was the Red Queen again, with that
   incessant grin. Or was it the Cheshire cat's grin?
Term-Document matrix
                 D1      D2      D3
Cheshire Cat     1       0       1
Alice            2       2       2
book             1       0       0
King             1       1       0
table            0       1       1
Queen            0       1       1
grin             0       0       2
Term-Document matrix: query
Query: Alice AND Queen
Query vector, in the row order of the matrix (Cheshire Cat, Alice, book, King, table, Queen, grin):
(0, 1, 0, 0, 0, 1, 0)
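A small sketch of how the query vector can be evaluated against the matrix: a document matches when every term flagged with 1 in the query has a non-zero count in that document's column. This is plain Java for illustration; the class name `TermDocumentMatrix` and the method `and` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TermDocumentMatrix {
    static final String[] TERMS =
        {"Cheshire Cat", "Alice", "book", "King", "table", "Queen", "grin"};
    static final String[] DOCS = {"D1", "D2", "D3"};
    // Rows follow TERMS, columns follow DOCS (counts copied from the matrix).
    static final int[][] COUNTS = {
        {1, 0, 1}, {2, 2, 2}, {1, 0, 0}, {1, 1, 0}, {0, 1, 1}, {0, 1, 1}, {0, 0, 2}
    };

    /** Documents in which every term flagged in the query vector occurs. */
    static List<String> and(int[] queryVector) {
        List<String> hits = new ArrayList<>();
        for (int d = 0; d < DOCS.length; d++) {
            boolean allPresent = true;
            for (int t = 0; t < TERMS.length; t++) {
                if (queryVector[t] == 1 && COUNTS[t][d] == 0) {
                    allPresent = false;
                }
            }
            if (allPresent) {
                hits.add(DOCS[d]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] aliceAndQueen = {0, 1, 0, 0, 0, 1, 0};
        System.out.println("Alice AND Queen -> " + and(aliceAndQueen));
    }
}
```

Alice occurs in all three documents and Queen only in D2 and D3, so the query selects D2 and D3. Scanning every column like this is what the inverted index avoids.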
Inverted Index
Dictionary     Posting List
Alice        D1:2 D2:2 D3:2
…
book         D1:1
…
Cheshire Cat D1:1 D3:1
…
grin         D3:2
…
King         D1:1 D2:1
…
Queen        D2:1 D3:1
…
table        D2:1 D3:1
…
Inverted Index
Dictionary     Posting List       Position List
Alice        D1:2 D2:2 D3:2   D1: 14, 49   D2: 1, 40   D3: 18, 34
…
book         D1:1             D1: 33
…
Cheshire Cat D1:1 D3:1
…
grin         D3:2
…
King         D1:1 D2:1
…
Queen        D2:1 D3:1
…
table        D2:1 D3:1
…
(Each group in the position list holds the term positions in one document, e.g. 14 and 49 for Alice in D1.)
Inverted Index: query processing
AND (intersection ∩)
<Alice>   D1:2 D2:2 D3:2
<King>    D1:1 D2:1
Alice AND King -> (D1, D2)
Inverted Index: query processing
OR (union ∪)
<grin>    D3:2
<King>    D1:1 D2:1
grin OR King -> (D1, D2, D3)
Inverted Index: query processing
NOT (complement)
<Alice>   D1:2 D2:2 D3:2
<grin>    D3:2
Alice NOT grin -> (D1, D2)
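The three Boolean operations map directly onto set operations over posting lists. Below is a minimal in-memory sketch in plain Java (not Lucene's index format); the class `InvertedIndexSketch` and its methods are hypothetical names, and the documents are reduced to the terms listed in the dictionary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Dictionary: term -> posting list (document id -> term frequency).
    private final Map<String, Map<String, Integer>> postings = new TreeMap<>();

    /** Adds the already-extracted terms of one document to the index. */
    void add(String docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Document ids in the posting list of a term (empty if the term is unknown). */
    Set<String> docs(String term) {
        Map<String, Integer> list = postings.get(term);
        if (list == null) {
            return new TreeSet<String>();
        }
        return new TreeSet<String>(list.keySet());
    }

    Set<String> and(String a, String b) { // intersection
        Set<String> result = docs(a);
        result.retainAll(docs(b));
        return result;
    }

    Set<String> or(String a, String b) { // union
        Set<String> result = docs(a);
        result.addAll(docs(b));
        return result;
    }

    Set<String> not(String a, String b) { // documents with a but without b
        Set<String> result = docs(a);
        result.removeAll(docs(b));
        return result;
    }

    /** Index over the three example documents, reduced to the dictionary terms. */
    static InvertedIndexSketch sample() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("D1", Arrays.asList("Cheshire Cat", "Alice", "book", "King", "Alice"));
        index.add("D2", Arrays.asList("Alice", "King", "Alice", "table", "Queen"));
        index.add("D3", Arrays.asList("table", "Alice", "Alice", "Queen",
                                      "grin", "Cheshire Cat", "grin"));
        return index;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = sample();
        System.out.println("Alice AND King -> " + index.and("Alice", "King"));
        System.out.println("grin OR King   -> " + index.or("grin", "King"));
        System.out.println("Alice NOT grin -> " + index.not("Alice", "grin"));
    }
}
```

Only the posting lists of the query terms are touched, which is why the inverted index scales better than scanning the whole term-document matrix.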
Term-weight
• Measures the term relevance in a document
  – component value in the document representation
  – TF*IDF
     • TF (term frequency): term occurrences in the document
     • IDF (inverse document frequency): inverse to the
       number of documents in which the term occurs

   $tf \cdot idf(t, d) = tf(t, d) \cdot \log \frac{|D|}{|\{d \in D : t \in d\}|}$

   where |D| is the number of documents in the collection and the logarithmic factor is the idf
TF*IDF insights
• increases proportionally to the term
  frequency in the document
• decreases with the number of documents in
  which the term occurs
  – common words are generally more frequent in the
    collection
• IDF depends on the collection, TF on the
  document
Inverted index/TF*IDF
• TF: computed by term occurrences in the
  posting list
• IDF: computed by the posting list cardinality

  Alice   D1:2 D2:2 D3:2   |D|=100

  $tf \cdot idf(Alice, D1) = 2 \cdot \log \frac{100}{3}$

  TF = 2 is the count stored for D1 in the posting list; IDF = log(100/3), because the posting list has 3 entries.
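The same arithmetic as a short sketch. The slides do not state the base of the logarithm, so the natural logarithm used here is an assumption; the class and method names are hypothetical.

```java
class TfIdfExample {
    /**
     * tf*idf = tf * log(|D| / df), where df is the posting list cardinality.
     * Assumption: natural logarithm (the base is not given in the slides).
     */
    static double tfIdf(int termFrequency, int numDocs, int docFrequency) {
        return termFrequency * Math.log((double) numDocs / docFrequency);
    }

    public static void main(String[] args) {
        // Alice in D1: tf = 2, |D| = 100, posting list with 3 entries.
        System.out.println("tf*idf(Alice, D1) = " + tfIdf(2, 100, 3));
        // A term that occurs in every document gets weight 0, whatever its tf.
        System.out.println("term in all docs  = " + tfIdf(5, 100, 100));
    }
}
```

With the natural logarithm the weight of Alice in D1 is about 7.01; a different base only rescales all weights by a constant.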
http://lucene.apache.org
APACHE LUCENE
Apache Lucene
• Apache Software Foundation project
  – http://lucene.apache.org
• What Lucene is
  – Search library: indexing and searching Application
    Programming Interface (API)
• What Lucene is not
  – Search engine (no crawling, server, user interface,
    etc.)
Lucene
•   Full-text search
•   High performance
•   Scalable
•   Cross-platform
•   100%-pure Java
•   Ports to other languages: .NET, Python
Information Retrieval Process
[Figure: the same retrieval process, with each step labelled by the org.apache.lucene component that implements it.]
• Text → Document
• Text Operations → Analyzer/Tokenizer/Filter
• Query Operations → QueryParser
• Indexing → IndexWriter
• Index access → IndexReader
• Searching → IndexSearcher
• Ranking → Similarity
Lucene Model 1/2
• Based on Vector Space Model
• Features:
  –   multi-field document
  –   tf: term frequency, number of matching terms in field
  –   lengthNorm: number of tokens in field
  –   idf: inverse document frequency
  –   coord: coordination factor, number of matching terms
  –   field boost
  –   query clause boost
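Lucene Model 2/2

$score(q,d) = coord(q,d) \cdot queryNorm(q) \cdot \sum_{t \in q} \left( tf(t \in d) \cdot idf(t)^2 \cdot boost(t) \cdot norm(t,d) \right)$
(scoring formula as given in the Lucene 3.x Similarity documentation)

• coord(q,d): score factor based on how many of the query terms are
  found in the specified document
• queryNorm(q): normalizing factor used to make scores
  between queries comparable
• boost(t): term boost
• norm(t,d): encapsulates a few (indexing time) boost and length
  factors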
Lucene

         TopDocs
Document Indexing
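import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Open the index directory; OpenMode.CREATE overwrites any existing index.
FSDirectory dir = FSDirectory.open(new File("/home/user/index_test"));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, config);
// One document with three stored, analyzed fields.
Document doc = new Document();
doc.add(new Field("super_name", "Sandman", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("name", "William Baker", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("category", "superhero", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();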
Field Options
• Field.Store
  – NO: field value is not stored
  – YES: field value is stored
• Field.Index
  – ANALYZED: field is analyzed by the analyzer
  – NO: not indexed
  – NOT_ANALYZED: indexed but not analyzed
Luke
• Lucene diagnostic tool: http://code.google.com/p/luke/
Document: Fields Representation
Example page: http://www.bbc.co.uk/news/technology-15365207
• URL: Field.Store.YES, Field.Index.NO
• Date: Field.Store.YES, Field.Index.NOT_ANALYZED
• Authors: Field.Store.YES, Field.Index.ANALYZED
• Title: Field.Store.NO, Field.Index.ANALYZED
• Abstract: Field.Store.NO, Field.Index.ANALYZED
• Content: Field.Store.NO, Field.Index.ANALYZED
• Comments: Field.Store.NO, Field.Index.ANALYZED
Text Operations
• Tokenization
  – split text into tokens
• Stop word elimination
  – remove common words and closed-class words
    (e.g. function words)
• Stemming
  – reducing inflected (or sometimes derived) words
    to their stem
Analyzers
• KeywordAnalyzer
   – Returns a single token
• WhitespaceAnalyzer
   – Splits tokens at whitespace
• SimpleAnalyzer
   – Divides text at non-letter characters and lowercases
• StopAnalyzer
   – Divides text at non-letter characters, lowercases, and removes
     stop words
• StandardAnalyzer
   – Tokenizes based on a sophisticated grammar that recognizes
     e-mail addresses, acronyms, etc.; lowercases and removes stop
     words (optional)
Analyzers
    The quick brown fox jumped over the lazy dogs
•   WhitespaceAnalyzer:
    [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
•   SimpleAnalyzer:
    [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
•   StopAnalyzer:
    [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
•   StandardAnalyzer:
    [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
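To make the differences concrete, here is a plain-Java approximation of three of these behaviours. It is a sketch only and does not use or reproduce the Lucene classes: the method names are hypothetical, and the tiny stop list is an assumption, not Lucene's ENGLISH_STOP_WORDS_SET.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerSketch {
    // Assumption: a tiny illustrative stop list, not Lucene's default set.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "to"));

    /** Splits at whitespace only; case and punctuation are kept. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Splits at non-letter characters and lowercases. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Same as simple(), then removes the stop words. */
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dogs";
        System.out.println("whitespace: " + whitespace(text));
        System.out.println("simple:     " + simple(text));
        System.out.println("stop:       " + stop(text));
    }
}
```

The grammar-based tokenization of StandardAnalyzer (e-mail addresses, acronyms) is deliberately left out, since it cannot be reduced to a single split rule.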
Analyzers
  "XY&Z Corporation - xyz@example.com"
• WhitespaceAnalyzer:
  [XY&Z] [Corporation] [-] [xyz@example.com]
• SimpleAnalyzer:
  [xy] [z] [corporation] [xyz] [example] [com]
• StandardAnalyzer:
  [xy&z] [corporation] [xyz@example.com]
Tokenizers
• A Tokenizer is a TokenStream whose input is a Reader
• Tokenizers break field text into tokens
• StandardTokenizer
   – source string: “full-text lucene.apache.org”
   – “full” “text” “lucene.apache.org”
• WhitespaceTokenizer
   – “full-text” “lucene.apache.org”
• LetterTokenizer
   – “full” “text” “lucene” “apache” “org”
TokenFilters
• A TokenFilter is a TokenStream whose input is
  another token stream
• LowerCaseFilter
• StopFilter
• LengthFilter
• PorterStemFilter
   – stemming: reducing words to root form
   – rides, ride, riding => ride
   – country, countries => countri
Analyzer
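import java.io.Reader;
import java.util.ArrayList;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// final: Lucene 3.x asserts that Analyzer implementations are final.
public final class MyAnalyzer2 extends Analyzer {
    // Stop-word set built from Lucene's default English list.
    private final Set<?> sw = StopFilter.makeStopSet(Version.LUCENE_36,
        new ArrayList<Object>(StopAnalyzer.ENGLISH_STOP_WORDS_SET));

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split at whitespace, then lowercase, then drop stop words.
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        ts = new LowerCaseFilter(Version.LUCENE_36, ts);
        ts = new StopFilter(Version.LUCENE_36, ts, sw);
        return ts;
    }
}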
TermVectors
• Store information about term frequency,
  positions and offsets
• Relevance Feedback and “More Like This”
• Clustering
• Similarity between two documents
• Highlighter
  – needs offsets info
Lucene TermVector
• TermFreqVector is a representation of all of the
  terms and term counts in a specific Field of a
  Document instance
• As a tuple:
   – termFreq = <term, termCount_D>
   – <fieldName, <…, termFreq_i, termFreq_{i+1}, …>>
• As Java:
   – public String getField();
   – public String[] getTerms();
   – public int[] getTermFrequencies()
Lucene TermPositionVector
• TermPositionVector is a representation of all
  of the terms, term counts, term position and term
  offsets in a specific Field of a Document instance
• As Java:
   –   public String getField();
   –   public String[] getTerms();
   –   public int[] getTermFrequencies()
   –   public int[] getTermPositions(int index)
   –   public TermVectorOffsetInfo[]
       getOffsets(int index)
Store TermVector
• During indexing, create a Field that stores Term Vectors:
   – new Field("text", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
• Options are:
   –   Field.TermVector.YES
   –   Field.TermVector.NO
   –   Field.TermVector.WITH_POSITIONS
   –   Field.TermVector.WITH_OFFSETS
   –   Field.TermVector.WITH_POSITIONS_OFFSETS
Term Vector: Positions and Offsets

Text: The quick brown fox jumped over the lazy dogs

Position   Token    Offsets (start-end)
0          The      0-3
1          quick    4-9
2          brown    10-15
3          fox      16-19
4          jumped   20-26
5          over     27-31
6          the      32-35
7          lazy     36-40
8          dogs     41-45
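Positions count tokens, while offsets count characters. The sketch below recomputes both for the example sentence in plain Java; it assumes whitespace tokenization and uses hypothetical names (`PositionsAndOffsets`, `Token`), so it illustrates the idea rather than Lucene's term vector API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionsAndOffsets {
    /** One token with its position (token index) and character offsets. */
    static final class Token {
        final String term;
        final int position;
        final int startOffset; // index of the first character
        final int endOffset;   // index just after the last character

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return position + ":" + term + "[" + startOffset + "-" + endOffset + "]";
        }
    }

    /** Assumption: tokens are maximal runs of non-whitespace characters. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (matcher.find()) {
            tokens.add(new Token(matcher.group(), position, matcher.start(), matcher.end()));
            position++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : tokenize("The quick brown fox jumped over the lazy dogs")) {
            System.out.println(token);
        }
    }
}
```

Offsets are what a highlighter needs to mark the matching characters in the original text, while positions are what phrase and proximity queries compare.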
Example
• Main method with three parameters
   – Input_dir: directory to index
   – Index_dir: directory containing the index files
   – Flag (string) corresponding to different
     Field.TermVector... options
      •   t: tokenize
      •   tv: term vectors
      •   tvp: term vectors with positions
      •   tvo: term vectors with positions and offsets
• The constructor of the Index class takes the three
  parameters as input and indexes the directory
   – Filters only “.txt” files
Lucene Search
• IndexReader: interface for accessing an index
• QueryParser: parses the user query
• IndexSearcher: implements search over a
  single IndexReader
  – search(Query query, int numDoc)
  – TopDocs -> result of search
     • TopDocs.scoreDocs returns an array of retrieved
       documents
Lucene Search
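import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// args[0]: index directory, args[1]: query string
Directory dir = FSDirectory.open(new File(args[0]));
IndexReader reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
QueryParser parser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query query = parser.parse(args[1]);
TopDocs topDocs = searcher.search(query, 100);
System.out.println("matches: " + topDocs.totalHits);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    Document doc = reader.document(topDocs.scoreDocs[i].doc);
    System.out.println("name: " + doc.get("name") + ", score: " + topDocs.scoreDocs[i].score);
    System.out.println();
}
searcher.close();
reader.close(); // the searcher does not close a reader it was given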
Lucene QueryParser
• Example:
  – queryParser.parse(“name:SpiderMan”);
• good for human-entered queries and debugging
• does text analysis and constructs appropriate
  queries
• not all query types supported
QUERY SYNTAX 1/2
• title:"The Right Way" AND text:go
    – Phrase query + term query
• Wildcard Searches
    – tes* (test - tests - tester)
    – te?t (test – text)
    – te*t (tempt)
    – tes?
• Fuzzy Searches (Levenshtein Distance) (TERM)
    – roam~ (foam – roams)
    – roam~0.8
• Range Searches
    – mod_date:[20020101 TO 20030101]
    – title:{Aida TO Carmen}
• Proximity Searches (PHRASE)
    – "jakarta apache"~10
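Fuzzy searches are based on the Levenshtein (edit) distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one term into another. The following plain-Java sketch computes only that raw distance; how Lucene turns it into the 0 to 1 similarity threshold used in `roam~0.8` is not shown here, and the class name is hypothetical.

```java
class Levenshtein {
    /** Edit distance between a and b, computed with two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = prev[j - 1] + cost;
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println("roam -> foam:  " + distance("roam", "foam"));
        System.out.println("roam -> roams: " + distance("roam", "roams"));
    }
}
```

Both example terms are one edit away from "roam" (one substitution for "foam", one insertion for "roams"), which is why the fuzzy query matches them.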
QUERY SYNTAX 2/2
• Boosting a Term
    – jakarta^4 apache
    – "jakarta apache"^4 "Apache Lucene"
• Boolean Operator
    – NOT, OR, AND
    – + required operator
    – - prohibit operator
• Grouping by ( )
• Field Grouping
    – title:(+return +"pink panther")
• Escaping Special Characters by \
QUERY (BY CODE) 1/2
TermQuery query =
    new TermQuery(new Term("name", "Spider-Man"));
• explicit, no escaping necessary
• does not do text analysis for you
• Query
  –   TermQuery
  –   BooleanQuery
  –   PhraseQuery
  –   FuzzyQuery / WildcardQuery /
QUERY (BY CODE) 2/2
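import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

TermQuery tq1 = new TermQuery(new Term("name", "pippo"));
TermQuery tq2 = new TermQuery(new Term("name", "pluto"));
BooleanQuery bq = new BooleanQuery();
bq.add(tq1, BooleanClause.Occur.MUST);   // "pippo" must match
bq.add(tq2, BooleanClause.Occur.SHOULD); // "pluto" is optional but raises the score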
DELETING DOCUMENTS
IndexWriter
  – deleteDocuments(Term t)
  – deleteDocuments(Term… t)
  – deleteDocuments(Query t)
  – deleteDocuments(Query… t)
  – updateDocument(Term t, Document d)
  – updateDocument(Term t, Document d,
    Analyzer a)
  – deleting does not immediately reclaim space
ADVANCED TOPICS
Relevance feedback
Document similarity
Apache Tika
Relevance feedback
• Improve retrieval performance using
  information about document relevance
  – Explicit: relevance indicated by the user (assessor)
  – Implicit: from user behavior
  – Blind (or pseudo): using information about the top
    k retrieved documents
Relevance feedback
[Figure: feedback loop. User query → Searcher → Retrieved documents → Relevance feedback → New query → Searcher → New retrieved documents.]
Rocchio Algorithm
$Q_{new} = \alpha Q + \beta \frac{1}{|D_{rel}|} \sum_{D_j \in D_{rel}} D_j - \gamma \frac{1}{|D_{norel}|} \sum_{D_k \in D_{norel}} D_k$

• Q: original query
• D_j: relevant document
• D_k: non-relevant document
Rocchio Algorithm (blind)
 In blind (pseudo) relevance feedback we know only the
 relevant documents (assumed to be the top K
 documents)
                      (generally adopted by search engines)
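The update can be sketched over sparse term-weight vectors. This is plain Java for illustration, not a Lucene API: the names `RocchioSketch`, `expand` and `vec` are hypothetical, and the coefficient values in `main` (α = 1, β = 0.5, γ = 0.25) are assumptions chosen only to make the arithmetic easy to follow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RocchioSketch {
    /** Qnew = alpha*Q + beta*mean(relevant) - gamma*mean(nonRelevant). */
    static Map<String, Double> expand(Map<String, Double> query,
                                      List<Map<String, Double>> relevant,
                                      List<Map<String, Double>> nonRelevant,
                                      double alpha, double beta, double gamma) {
        Map<String, Double> qNew = new HashMap<>();
        addScaled(qNew, query, alpha);
        for (Map<String, Double> doc : relevant) {
            addScaled(qNew, doc, beta / relevant.size());
        }
        // In the blind variant this list is empty, so the term vanishes.
        for (Map<String, Double> doc : nonRelevant) {
            addScaled(qNew, doc, -gamma / nonRelevant.size());
        }
        return qNew;
    }

    private static void addScaled(Map<String, Double> target,
                                  Map<String, Double> vector, double factor) {
        for (Map.Entry<String, Double> entry : vector.entrySet()) {
            target.merge(entry.getKey(), factor * entry.getValue(), Double::sum);
        }
    }

    /** Helper: builds a vector from alternating term, weight arguments. */
    static Map<String, Double> vec(Object... termsAndWeights) {
        Map<String, Double> vector = new HashMap<>();
        for (int i = 0; i < termsAndWeights.length; i += 2) {
            vector.put((String) termsAndWeights[i],
                       ((Number) termsAndWeights[i + 1]).doubleValue());
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Double> query = vec("alice", 1.0);
        List<Map<String, Double>> topK = Arrays.asList(
            vec("alice", 2.0, "king", 1.0),
            vec("alice", 2.0, "queen", 1.0));
        // Blind feedback: the top K documents are treated as relevant.
        List<Map<String, Double>> none = new ArrayList<>();
        System.out.println(expand(query, topK, none, 1.0, 0.5, 0.25));
    }
}
```

With these inputs the new query keeps "alice" with weight 2.0 and gains "king" and "queen" with weight 0.25 each, which is the query expansion effect relevance feedback is after.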
Relevance feedback
   Q = Alice
• ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’
• ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice
• Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark.
• Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?

   Qnew = Alice Cheshire Cat King book
• ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’
• ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice
• Alice book was reprinted and published in 1866.
• …
Relevance feedback in Lucene
• Possible using TermVector
  – Retrieve the top K documents
  – Build the Qnew using TermVector
  – Re-query using Qnew
Document similarity
• Documents are represented by vectors
  – Build the document vector using TermVector
  – Compute the similarity using cosine similarity
• Implement a functionality like “similar to this
  document”
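A "similar to this document" function can be sketched by ranking the other documents by cosine similarity to the target. The example below is plain Java over term-frequency maps; in Lucene the maps would be filled from the stored term vectors, a step that is not shown. All names are hypothetical, and the counts come from the Alice term-document matrix used earlier.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class DocumentSimilarity {
    /** Cosine similarity between two sparse term-frequency vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        return dot / (norm(a) * norm(b));
    }

    private static double norm(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int frequency : vector.values()) {
            sum += (double) frequency * frequency;
        }
        return Math.sqrt(sum);
    }

    /** Id of the document most similar to the target, excluding the target. */
    static String mostSimilar(String targetId, Map<String, Map<String, Integer>> docs) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Map<String, Integer>> entry : docs.entrySet()) {
            if (entry.getKey().equals(targetId)) {
                continue;
            }
            double score = cosine(docs.get(targetId), entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Helper: builds a vector from alternating term, count arguments. */
    static Map<String, Integer> vector(Object... termsAndCounts) {
        Map<String, Integer> vector = new HashMap<>();
        for (int i = 0; i < termsAndCounts.length; i += 2) {
            vector.put((String) termsAndCounts[i], (Integer) termsAndCounts[i + 1]);
        }
        return vector;
    }

    /** The three example documents as term-frequency vectors. */
    static Map<String, Map<String, Integer>> sample() {
        Map<String, Map<String, Integer>> docs = new LinkedHashMap<>();
        docs.put("D1", vector("Cheshire Cat", 1, "Alice", 2, "book", 1, "King", 1));
        docs.put("D2", vector("Alice", 2, "King", 1, "table", 1, "Queen", 1));
        docs.put("D3", vector("Cheshire Cat", 1, "Alice", 2, "table", 1, "Queen", 1, "grin", 2));
        return docs;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> docs = sample();
        System.out.println("most similar to D1: " + mostSimilar("D1", docs));
        System.out.println("cos(D1, D2) = " + cosine(docs.get("D1"), docs.get("D2")));
    }
}
```

Raw term frequencies are used as weights to keep the sketch short; TF*IDF weights would reduce the influence of terms such as "Alice" that occur in every document.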
Apache Tika
• Extract metadata and text from several file
  formats
  – XML, HTML, OpenOffice, Microsoft Office, PDF
  – MBOX format (email)
  – Compressed files
  – EXIF metadata from image
  – Metadata from audio file (MP3)
• An Apache project, like Lucene
Extract metadata and text
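import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

Tika tika = new Tika();
// Detect the media type of the file given as first argument.
String type = tika.detect(new File(args[0]));
System.out.println("File type: " + type);
Metadata metadata = new Metadata();
// Extract the plain text; metadata found while parsing is collected in 'metadata'.
String text = tika.parseToString(new FileInputStream(new File(args[0])),
     metadata);
System.out.println("Metadata");
String[] names = metadata.names();
for (int i = 0; i < names.length; i++) {
  String value = metadata.get(names[i]);
  if (value != null) {
      System.out.println(names[i] + "=" + value);
  }
}
System.out.println("Text: " + text);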
Apache Tika + Lucene
[Figure: Files/URLs → Tika → metadata and text → Lucene.]
CONCLUSIONS
Conclusions
• A popular IR model
  – Vector Space Model
• Lucene
  – provides an API to build a search engine
• Apache Tika
  – extracts metadata and text from files and URLs
• LET’S GO BUILD YOUR SEARCH ENGINE!
Lucene Contrib
• analyzers: additional analyzers
  – Arabic, Chinese, …
• highlighter: a set of classes for highlighting
  matching terms in search results
• Snowball: stemmer analyzer based on
  Snowball library http://snowball.tartarus.org
• spellchecker: tools for spellchecking and
  suggestions with Lucene
Lucene-related projects
• Apache Solr: open source enterprise search
  platform from the Apache Lucene project
  – Full-Text Search Capabilities, High Volume Web Traffic,
    HTML Administration Interfaces




• Apache Nutch: web-search engine based on Solr
  and Lucene
  – crawler, link-graph database, HTML
Thank you for your attention!
         Questions?

 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxnelietumpap1
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 

Recently uploaded (20)

Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptx
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 

[F5 Hit Refresh] Pierpaolo Basile - Information access with Apache Lucene

  • 1. Information Access with Apache Lucene Pierpaolo Basile, PhD Research Assistant University of Bari Aldo Moro co-founder QuestionCube s.r.l. pierpaolo.basile@{uniba.it, questioncube.com} CODE CAMP: 20 JULY - 1 AUGUST, 2012 S. VITO DEI NORMANNI (BR)
  • 2. Outline • Search Engine – Vector Space Model – Inverted Index • Apache Lucene – Indexing – Searching • Advanced topics – Relevance feedback – Document similarity – Apache Tika
  • 4. Searching Collection of documents User query Relevant documents
  • 5. Information Retrieval Information Retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web. Wikipedia
  • 6. Information Retrieval Process [diagram. Components: User Interface, Text Operations, Query Operations, Indexing, Inverted Index (Index), Searching, Ranking. Flows: query and document text, logic view, retrieved documents, ranked documents, user feedback]
  • 7. Information Retrieval Model <D, Q, F, R(qi, dj)> • D: document representation • Q: query representation • F: query/document representation framework • R(qi, dj): ranking function
  • 8. Bag-of-words representation Document/query as unordered collection of words John likes to watch movies. Mary likes too. John also likes to watch football games. {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
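The bag-of-words idea can be sketched in a few lines of plain Java. This is only an illustration under simple assumptions: tokens are runs of ASCII letters, case is kept as-is, and the map stores occurrence counts per word. The class name `BagOfWords` and the method `count` are hypothetical and not part of Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BagOfWords {
    // Count how often each word occurs; the order of words in the text is discarded.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> bag = new LinkedHashMap<>();
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) continue; // skip the empty token produced by a leading delimiter
            bag.merge(token, 1, Integer::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        String text = "John likes to watch movies. Mary likes too. "
                    + "John also likes to watch football games.";
        System.out.println(count(text));
    }
}
```

Two texts with the same words in a different order produce the same map, which is exactly the information the model keeps.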
  • 9. Vector Space Model Query/Document Framework • Algebraic model • Documents/queries are represented as vectors in a geometric space • Terms are vector components
  • 10. Vector Space Model $d_j = (w_{1,j}, w_{2,j}, \dots, w_{t,j})$, $q = (w_{1,q}, w_{2,q}, \dots, w_{t,q})$. Each vector component corresponds to a term. The component value $w_{i,j}$ is the weight of the term $t_i$ in the document $d_j$
  • 11. Vector Space Model dj=(hit:5, refresh:2) q1=(refresh:1) q2=(hit:1) [plot: vectors dj, q1 and q2 in a plane with axes hit and refresh]
  • 12. Vector Space Model Ranking function • Vector similarity between query and document computed by cosine: $R(d_j, q) = \cos\theta = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}$
  • 13. Vector Space Model dj=(hit:5, refresh:2) q1=(refresh:1) q2=(hit:1) [plot: the same vectors in the hit/refresh plane, with the angles θ1 and θ2 between dj and the two queries]
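The cosine ranking for this example can be checked with a small Java sketch. It assumes dense vectors whose components are ordered as (hit, refresh); the class name `CosineRanking` is hypothetical.

```java
class CosineRanking {
    // Cosine of the angle between two weight vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // an all-zero vector has no direction
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] dj = {5, 2}; // hit:5, refresh:2
        double[] q1 = {0, 1}; // refresh:1
        double[] q2 = {1, 0}; // hit:1
        System.out.printf("R(dj, q1) = %.3f%n", cosine(dj, q1));
        System.out.printf("R(dj, q2) = %.3f%n", cosine(dj, q2));
    }
}
```

Since dj is dominated by the term hit, its angle with q2 is smaller and R(dj, q2) = 5/√29 is larger than R(dj, q1) = 2/√29.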
  • 14. Term-Document matrix • ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’ • ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice. • Alice watched the White King as he slowly struggled up from bar to bar, till at last she said, ‘Why, you’ll be hours and hours getting to the table, at that rate. • Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark. • In the pool of light was a billiards table, with two figures moving around it. Alice walked toward them, and as she approached they turned to look at her. • Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?
  • 15. Term-Document matrix
                       D1  D2  D3
      Cheshire Cat      1   0   1
      Alice             2   2   2
      book              1   0   0
      King              1   1   0
      table             0   1   1
      Queen             0   1   1
      grin              0   0   2
  • 16. Term-Document matrix (Query: Alice AND Queen)
                       D1  D2  D3  Query
      Cheshire Cat      1   0   1    0
      Alice             2   2   2    1
      book              1   0   0    0
      King              1   1   0    0
      table             0   1   1    0
      Queen             0   1   1    1
      grin              0   0   2    0
  • 18. Inverted Index
      Dictionary      Posting List
      Alice           D1:2 D2:2 D3:2 …
      book            D1:1 …
      Cheshire Cat    D1:1 D3:1 …
      grin            D3:2 …
      King            D1:1 D2:1 …
      Queen           D2:1 D3:1 …
      table           D2:1 D3:1 …
  • 27. Inverted Index
      Dictionary      Posting List (with Position List in brackets)
      Alice           D1:2 [14, 49]  D2:2 [1, 40]  D3:2 [18, 34] …
      book            D1:1 [33] …
      Cheshire Cat    D1:1 D3:1 …
      grin            D3:2 …
      King            D1:1 D2:1 …
      Queen           D2:1 D3:1 …
      table           D2:1 D3:1 …
      (for example, [14, 49] are the term positions of Alice in D1)
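The dictionary, posting list and position list can be sketched as nested maps in plain Java. This is a simplified illustration, not Lucene's on-disk format: tokens are lowercased runs of ASCII letters, positions start at 0, and the class `PositionalIndex` and the two sample sentences are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class PositionalIndex {
    // dictionary term -> (document id -> positions of the term in that document)
    final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    void add(String docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    // The term frequency is the length of the position list.
    int tf(String term, String docId) {
        Map<String, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(docId)) return 0;
        return docs.get(docId).size();
    }

    public static void main(String[] args) {
        PositionalIndex index = new PositionalIndex();
        index.add("D1", "Alice met the King. The King greeted Alice.");
        index.add("D2", "The Queen saw Alice.");
        System.out.println(index.postings.get("alice")); // {D1=[0, 7], D2=[3]}
        System.out.println(index.tf("king", "D1"));      // 2
    }
}
```

Keeping positions makes phrase and proximity queries possible, at the cost of a larger index.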
  • 28. Inverted Index: query processing AND (intersection): <Alice> ∩ <King>, with <Alice> = D1:2 D2:2 D3:2 and <King> = D1:1 D2:1. Alice AND King -> (D1, D2)
  • 29. Inverted Index: query processing OR (union): <grin> ∪ <King>, with <grin> = D3:2 and <King> = D1:1 D2:1. grin OR King -> (D1, D2, D3)
  • 30. Inverted Index: query processing NOT (complement): <Alice> \ <grin>, with <Alice> = D1:2 D2:2 D3:2 and <grin> = D3:2. Alice NOT grin -> (D1, D2)
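The three Boolean operators reduce to set operations on posting lists. The sketch below uses `TreeSet` for brevity and ignores term frequencies; a real engine would typically merge sorted posting lists instead. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class BooleanRetrieval {
    // AND: documents present in both posting lists (intersection).
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR: documents present in at least one posting list (union).
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // NOT: documents of the first list that are absent from the second (difference).
    static Set<String> andNot(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> alice = new TreeSet<>(Arrays.asList("D1", "D2", "D3"));
        Set<String> king = new TreeSet<>(Arrays.asList("D1", "D2"));
        Set<String> grin = new TreeSet<>(Arrays.asList("D3"));
        System.out.println("Alice AND King -> " + and(alice, king));    // [D1, D2]
        System.out.println("grin OR King   -> " + or(grin, king));      // [D1, D2, D3]
        System.out.println("Alice NOT grin -> " + andNot(alice, grin)); // [D1, D2]
    }
}
```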
  • 31. Term-weight • Measures the term relevance in a document – component value in the document representation – TF*IDF • TF (term frequency): term occurrences in the document • IDF (inverse document frequency): inversely related to the number of documents in which the term occurs. $\mathrm{tf{*}idf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{|D|}{|\{d \in D : t \in d\}|}$, where $|D|$ is the number of documents in the collection and the logarithmic factor is the idf
  • 33. TF*IDF insights • increases proportionally to the term frequency in the document • decreases as the number of documents in which the term occurs grows – common words are generally more frequent in the collection • IDF depends on the collection, TF on the document
  • 34. Inverted index/TF*IDF • TF: computed by term occurrences in the posting list • IDF: computed by the posting list cardinality. Alice D1:2 D2:2 D3:2, |D|=100: $\mathrm{tf{*}idf}(Alice, D1) = 2 \cdot \log \frac{100}{3}$ (TF = 2, the count stored for D1; IDF = log(100/3), because the posting list has 3 entries)
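The worked example can be reproduced with a one-line function. The slides do not state the base of the logarithm, so the natural logarithm is assumed here; the class name `TfIdf` is hypothetical.

```java
class TfIdf {
    // tf: occurrences of the term in the document (the count stored in the posting)
    // totalDocs: |D|, the number of documents in the collection
    // docsWithTerm: the cardinality of the term's posting list
    static double tfIdf(int tf, int totalDocs, int docsWithTerm) {
        return tf * Math.log((double) totalDocs / docsWithTerm); // natural log (assumption)
    }

    public static void main(String[] args) {
        // Alice in D1: tf = 2, |D| = 100, posting list with 3 documents
        System.out.printf("tf*idf(Alice, D1) = %.3f%n", tfIdf(2, 100, 3));
    }
}
```

A term that occurs in every document gets weight 0, since log(|D|/|D|) = 0, which matches the intuition that such a term does not discriminate between documents.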
  • 37. Apache Lucene • Apache Software Foundation project – http://lucene.apache.org • What Lucene is – Search library: indexing and searching Application Programming Interface (API) • What Lucene is not – Search engine (no crawling, server, user interface, etc.)
  • 38. Lucene • Full-text search • High performance • Scalable • Cross-platform • 100%-pure Java • Ports to other languages: .NET, Python
  • 39. Information Retrieval Process [diagram repeated from slide 6: User Interface, Text Operations, Query Operations, Indexing, Inverted Index (Index), Searching, Ranking, with user feedback]
  • 40. Information Retrieval Process [diagram: the same process with org.apache.lucene classes in place of the generic components: Document (text), Analyzer/Tokenizer/Filter (text operations), QueryParser (query operations), IndexWriter (indexing), IndexReader and IndexSearcher (searching the inverted index), Similarity (ranking)]
  • 41. Lucene Model 1/2 • Based on Vector Space Model • Features: – multi-field document – tf: term frequency, number of matching terms in field – lengthNorm: number of tokens in field – idf: inverse document frequency – coord: coordination factor, number of matching terms – field boost – query clause boost
  • 42. Lucene Model 2/2 • coord(q,d): score factor based on how many of the query terms are found in the specified document • queryNorm(q): a normalizing factor used to make scores between queries comparable • boost(t): term boost • norm(t,d): encapsulates a few (indexing time) boost and length factors
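For reference, these factors combine as follows in the practical scoring function documented for the default TF-IDF similarity of Lucene 3.x. The formula is quoted from that documentation, not from the slide, and the default tf and idf definitions shown are those of the stock implementation; a custom Similarity may differ.

```latex
\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)

% defaults of the stock implementation:
\mathrm{tf}(t, d) = \sqrt{\mathrm{freq}(t, d)}, \qquad
\mathrm{idf}(t) = 1 + \log \frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}
```

The idf factor appears squared because it contributes once through the query weight and once through the document weight.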
  • 43. Lucene TopDocs
  • 44. Document Indexing
      // Open the index directory and configure the writer
      FSDirectory dir = FSDirectory.open(new File("/home/user/index_test"));
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
      IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
      config.setOpenMode(IndexWriterConfig.OpenMode.CREATE); // overwrite any existing index
      IndexWriter writer = new IndexWriter(dir, config);
      // Build a document with three stored, analyzed fields
      Document doc = new Document();
      doc.add(new Field("super_name", "Sandman", Field.Store.YES, Field.Index.ANALYZED));
      doc.add(new Field("name", "William Baker", Field.Store.YES, Field.Index.ANALYZED));
      doc.add(new Field("category", "superhero", Field.Store.YES, Field.Index.ANALYZED));
      writer.addDocument(doc);
      writer.close(); // commits the changes and releases the write lock
  • 45. Field Options • Field.Store – NO: field value is not stored – YES: field value is stored • Field.Index – ANALYZED: field is analyzed by the analyzer – NO: not indexed – NOT_ANALYZED: indexed but not analyzed
  • 46. Luke • Lucene diagnostic tool: http://code.google.com/p/luke/
  • 48. Document: Fields Representation http://www.bbc.co.uk/news/technology-15365207 • URL: Field.Store.YES, Field.Index.NO • Date: Field.Store.YES, Field.Index.NOT_ANALYZED • Authors: Field.Store.YES, Field.Index.ANALYZED • Title: Field.Store.NO, Field.Index.ANALYZED • Abstract: Field.Store.NO, Field.Index.ANALYZED • Content: Field.Store.NO, Field.Index.ANALYZED • Comments: Field.Store.NO, Field.Index.ANALYZED
  • 49. Text Operations • Tokenization – split text into tokens • Stop word elimination – remove common words and closed-class words (e.g. function words) • Stemming – reduce inflected (or sometimes derived) words to their stem
  • 50. Analyzers • KeywordAnalyzer – Returns a single token • WhitespaceAnalyzer – Splits tokens at whitespace • SimpleAnalyzer – Divides text at nonletter characters and lowercases • StopAnalyzer – Divides text at nonletter characters, lowercases, and removes stop words • StandardAnalyzer – Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, etc.; lowercases and removes stop words (optional)
  • 51. Analyzers The quick brown fox jumped over the lazy dogs • WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs] • SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs] • StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs] • StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
  • 52. Analyzers "XY&Z Corporation - xyz@example.com" • WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com] • SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com] • StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
  • 53. Tokenizers • A Tokenizer is a TokenStream whose input is a Reader • Tokenizers break field text into tokens • StandardTokenizer – source string: “full-text lucene.apache.org” – “full” “text” “lucene.apache.org” • WhitespaceTokenizer – “full-text” “lucene.apache.org” • LetterTokenizer – “full” “text” “lucene” “apache” “org”
  • 54. TokenFilters • A TokenFilter is a TokenStream whose input is another token stream • LowerCaseFilter • StopFilter • LengthFilter • PorterStemFilter – stemming: reducing words to root form – rides, ride, riding => ride – country, countries => countri
  • 55. Analyzer
      public class MyAnalyzer2 extends Analyzer {
          // Stop set built from Lucene's default English stop words
          private Set sw = StopFilter.makeStopSet(Version.LUCENE_36,
                  new ArrayList(StopAnalyzer.ENGLISH_STOP_WORDS_SET));

          @Override
          public TokenStream tokenStream(String fieldName, Reader reader) {
              // Chain: whitespace tokenizer -> lowercase filter -> stop word filter
              TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
              ts = new LowerCaseFilter(Version.LUCENE_36, ts);
              ts = new StopFilter(Version.LUCENE_36, ts, sw);
              return ts;
          }
      }
  • 56. TermVectors • Store information about term frequency, positions and offsets • Relevance Feedback and “More Like This” • Clustering • Similarity between two documents • Highlighter – needs offsets info
  • 57. Lucene TermVector • TermFreqVector is a representation of all of the terms and term counts in a specific Field of a Document instance • As a tuple: – termFreq = <term, termCount_D> – <fieldName, <…, termFreq_i, termFreq_i+1, …>> • As Java: – public String getField(); – public String[] getTerms(); – public int[] getTermFrequencies();
  • 58. Lucene TermPositionVector • TermPositionVector is a representation of all of the terms, term counts, term position and term offsets in a specific Field of a Document instance • As Java: – public String getField(); – public String[] getTerms(); – public int[] getTermFrequencies() – public int[] getTermPositions(int index) – public TermVectorOffsetInfo[] getOffsets(int index)
  • 59. Store TermVector • During indexing, create a Field that stores Term Vectors: – new Field("text", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES); • Options are: – Field.TermVector.YES – Field.TermVector.NO – Field.TermVector.WITH_POSITIONS – Field.TermVector.WITH_OFFSETS – Field.TermVector.WITH_POSITIONS_OFFSETS
  • 60. Term Vector: Positions and Offsets
      Position  Token    Offsets (start, end)
      0         The       0,  3
      1         quick     4,  9
      2         brown    10, 15
      3         fox      16, 19
      4         jumped   20, 26
      5         over     27, 31
      6         the      32, 35
      7         lazy     36, 40
      8         dogs     41, 45
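The positions and offsets on this slide can be recomputed with a small whitespace tokenizer. This is a plain-Java sketch, not Lucene's tokenizer: the `TokenOffsets` class and its `Token` type are hypothetical, and the end offset is exclusive, as in the figures above.

```java
import java.util.ArrayList;
import java.util.List;

class TokenOffsets {
    static final class Token {
        final String term;
        final int position; // index of the token in the token stream
        final int start;    // offset of the first character
        final int end;      // offset just past the last character

        Token(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    // Split on whitespace, recording the position and character offsets of every token.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0, position = 0, n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) tokens.add(new Token(text.substring(start, i), position++, start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("The quick brown fox jumped over the lazy dogs")) {
            System.out.println(t.position + " " + t.term + " " + t.start + "-" + t.end);
        }
    }
}
```

Offsets are what a highlighter needs to mark matches in the original text, while positions are what phrase and proximity queries compare.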
  • 61. Example • Main method with three parameters – Input_dir: directory to index – Index_dir: directory containing the index files – Flag (string) corresponding to different Field.TermVector... options • t: tokenize • tv: term vectors • tvp: term vectors with positions • tvo: term vectors with positions and offsets • The constructor of the Index class takes the three parameters as input and indexes the directory – Filters only ".txt" files
  • 62. Lucene Search • IndexReader: interface for accessing an index • QueryParser: parses the user query • IndexSearcher: implements search over a single IndexReader – search(Query query, int numDoc) – TopDocs -> result of search • TopDocs.scoreDocs returns an array of retrieved documents
  • 63. Lucene Search
      Directory dir = FSDirectory.open(new File(args[0]));
      IndexReader reader = IndexReader.open(dir);
      IndexSearcher searcher = new IndexSearcher(reader);
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
      // Parse the user query against the default field "text"
      QueryParser parser = new QueryParser(Version.LUCENE_36, "text", analyzer);
      Query query = parser.parse(args[1]);
      // Retrieve the 100 best-scoring documents
      TopDocs topDocs = searcher.search(query, 100);
      System.out.println("matches: " + topDocs.totalHits);
      for (int i = 0; i < topDocs.scoreDocs.length; i++) {
          Document doc = reader.document(topDocs.scoreDocs[i].doc);
          System.out.println("name: " + doc.get("name") + ", score: " + topDocs.scoreDocs[i].score);
          System.out.println();
      }
      searcher.close();
      reader.close(); // the searcher does not close a reader that was passed to it
  • 64. Lucene QueryParser • Example: – queryParser.parse("name:SpiderMan"); • good for human-entered queries and for debugging • does text analysis and constructs appropriate queries • not all query types are supported
  • 65. QUERY SYNTAX 1/2 • title:"The Right Way" AND text:go – Phrase query + term query • Wildcard Searches – tes* (test - tests - tester) – te?t (test – text) – te*t (tempt) – tes? • Fuzzy Searches (Levenshtein Distance) (TERM) – roam~ (foam – roams) – roam~0.8 • Range Searches – mod_date:[20020101 TO 20030101] – title:{Aida TO Carmen} • Proximity Searches (PHRASE) – "jakarta apache"~10
  • 66. QUERY SYNTAX 2/2 • Boosting a Term – jakarta^4 apache – "jakarta apache"^4 "Apache Lucene" • Boolean Operator – NOT, OR, AND – + required operator – - prohibit operator • Grouping by ( ) • Field Grouping – title:(+return +"pink panther") • Escaping Special Characters by \
  • 67. QUERY (BY CODE) 1/2 TermQuery query = new TermQuery(new Term("name", "Spider-Man")); • explicit, no escaping necessary • does not do text analysis for you • Query – TermQuery – BooleanQuery – PhraseQuery – FuzzyQuery / WildcardQuery /
  • 68. QUERY (BY CODE) 2/2
      TermQuery tq1 = new TermQuery(new Term("name", "pippo"));
      TermQuery tq2 = new TermQuery(new Term("name", "pluto"));
      BooleanQuery bq = new BooleanQuery();
      bq.add(tq1, BooleanClause.Occur.MUST);   // "pippo" is required
      bq.add(tq2, BooleanClause.Occur.SHOULD); // "pluto" is optional but raises the score
  • 69. DELETING DOCUMENTS IndexWriter – deleteDocuments(Term t) – deleteDocuments(Term… t) – deleteDocuments(Query t) – deleteDocuments(Query… t) – updateDocument(Term t, Document d) – updateDocument(Term t, Document d, Analyzer a) – deleting does not immediately reclaim space
  • 71. Relevance feedback • Improve retrieval performance using information about document relevance – Explicit: relevance indicated by the user (assessor) – Implicit: from user behavior – Blind (or pseudo): using information about the top k retrieved documents
  • 72. Relevance feedback [diagram: User query -> Searcher -> Retrieved documents -> Relevance feedback -> New query -> Searcher -> New retrieved documents]
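  • 73. Rocchio Algorithm $Q_{new} = \alpha \cdot Q + \beta \cdot \frac{1}{|D_{rel}|} \sum_{D_j \in D_{rel}} D_j - \gamma \cdot \frac{1}{|D_{norel}|} \sum_{D_k \in D_{norel}} D_k$, where $Q$ is the original query, $D_{rel}$ the set of relevant documents and $D_{norel}$ the set of non-relevant documents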
  • 74. Rocchio Algorithm (blind) $Q_{new} = \alpha \cdot Q + \beta \cdot \frac{1}{|D_{rel}|} \sum_{D_j \in D_{rel}} D_j - \gamma \cdot \frac{1}{|D_{norel}|} \sum_{D_k \in D_{norel}} D_k$. In blind (pseudo) relevance feedback only the relevant documents are known (assumed to be the top K retrieved documents); this variant is the one generally adopted by search engines
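A minimal Java sketch of the Rocchio update over sparse term-weight maps follows. Several details are assumptions made for the illustration: the class and method names are hypothetical, documents are plain term-frequency maps, and terms whose weight ends up at zero or below are dropped, a common simplification. For blind feedback the non-relevant list is simply left empty.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Rocchio {
    // Qnew = alpha * Q + beta * centroid(relevant) - gamma * centroid(nonRelevant)
    static Map<String, Double> expand(Map<String, Double> query,
                                      List<Map<String, Double>> relevant,
                                      List<Map<String, Double>> nonRelevant,
                                      double alpha, double beta, double gamma) {
        Map<String, Double> result = new HashMap<>();
        addScaled(result, query, alpha);
        for (Map<String, Double> d : relevant) addScaled(result, d, beta / relevant.size());
        for (Map<String, Double> d : nonRelevant) addScaled(result, d, -gamma / nonRelevant.size());
        result.values().removeIf(w -> w <= 0); // drop non-positive weights (assumption)
        return result;
    }

    static void addScaled(Map<String, Double> target, Map<String, Double> source, double factor) {
        for (Map.Entry<String, Double> e : source.entrySet()) {
            target.merge(e.getKey(), factor * e.getValue(), Double::sum);
        }
    }

    // Helper that builds a sparse vector from parallel arrays.
    static Map<String, Double> vector(String[] terms, double[] weights) {
        Map<String, Double> v = new HashMap<>();
        for (int i = 0; i < terms.length; i++) v.put(terms[i], weights[i]);
        return v;
    }

    public static void main(String[] args) {
        Map<String, Double> q = vector(new String[] {"alice"}, new double[] {1});
        Map<String, Double> d1 = vector(new String[] {"alice", "cheshire", "cat"}, new double[] {2, 1, 1});
        Map<String, Double> d2 = vector(new String[] {"alice", "king", "book"}, new double[] {2, 1, 1});
        // Blind feedback: the top documents are assumed relevant, no non-relevant set.
        System.out.println(expand(q, Arrays.asList(d1, d2),
                Collections.<Map<String, Double>>emptyList(), 1.0, 0.5, 0.0));
    }
}
```

With alpha = 1 and beta = 0.5 the new query keeps alice as its strongest term and gains cheshire, cat, king and book with smaller weights.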
  • 75. Relevance feedback
      Q = Alice, retrieved documents:
        – It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.
        – It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice
        – Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark.
        – …
        – Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?
      Qnew = Alice Cheshire Cat King book, retrieved documents:
        – It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.
        – It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice
        – Alice book was reprinted and published in 1866.
  • 76. Relevance feedback in Lucene • Possible using TermVector – Retrieve the top K documents – Build the Qnew using TermVector – Re-query using Qnew
  • 77. Document similarity • Documents are represented by vectors – Build the document vector using TermVector – Compute the similarity using cosine similarity • Implement a functionality like “similar to this document”
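The similarity step can be sketched independently of Lucene. The code below assumes each document vector is given as parallel `String[]` terms and `int[]` frequencies, the shape that the `getTerms()` and `getTermFrequencies()` methods listed earlier return; reading the vectors from an index is left out, and the class name `TermVectorSimilarity` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class TermVectorSimilarity {
    static Map<String, Integer> toMap(String[] terms, int[] freqs) {
        Map<String, Integer> m = new HashMap<>();
        for (int i = 0; i < terms.length; i++) m.put(terms[i], freqs[i]);
        return m;
    }

    static double norm(int[] freqs) {
        double sum = 0;
        for (int f : freqs) sum += (double) f * f;
        return Math.sqrt(sum);
    }

    // Cosine similarity between two sparse term-frequency vectors.
    static double cosine(String[] termsA, int[] freqsA, String[] termsB, int[] freqsB) {
        Map<String, Integer> b = toMap(termsB, freqsB);
        double dot = 0;
        for (int i = 0; i < termsA.length; i++) {
            Integer other = b.get(termsA[i]);
            if (other != null) dot += (double) freqsA[i] * other; // only shared terms contribute
        }
        double na = norm(freqsA), nb = norm(freqsB);
        return (na == 0 || nb == 0) ? 0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        // Columns D1 and D3 of the term-document matrix shown earlier
        String[] t1 = {"alice", "book", "cheshire cat", "king"};
        int[] f1 = {2, 1, 1, 1};
        String[] t3 = {"alice", "cheshire cat", "grin", "queen", "table"};
        int[] f3 = {2, 1, 2, 1, 1};
        System.out.printf("sim(D1, D3) = %.3f%n", cosine(t1, f1, t3, f3));
    }
}
```

Raw frequencies are used here for simplicity; weighting each component by TF*IDF before taking the cosine would reduce the influence of terms such as Alice that occur in every document.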
  • 78. Apache Tika • Extracts metadata and text from several file formats – XML, HTML, OpenOffice, Microsoft Office, PDF – MBOX format (email) – Compressed files – EXIF metadata from images – Metadata from audio files (MP3) • An Apache project, like Lucene
  • 79. Extract metadata and text
      Tika tika = new Tika();
      String type = tika.detect(new File(args[0]));
      System.out.println("File type: " + type);
      Metadata metadata = new Metadata();
      String text = tika.parseToString(new FileInputStream(new File(args[0])), metadata);
      System.out.println("Metadata");
      String[] names = metadata.names();
      for (int i = 0; i < names.length; i++) {
          String value = metadata.get(names[i]);
          if (value != null) {
              System.out.println(names[i] + "=" + value);
          }
      }
      System.out.println("Text: " + text);
  • 80. Apache Tika + Lucene [diagram: Files/URLs -> Tika -> metadata and text -> Lucene]
  • 82. Conclusions • A popular IR model – Vector Space Model • Lucene – provides an API to build search engines • Apache Tika – extracts metadata and text from files and URLs • LET’S GO BUILD YOUR SEARCH ENGINE!
  • 83. Lucene Contrib • analyzers: additional analyzers – Arabic, Chinese, … • highlighter: a set of classes for highlighting matching terms in search results • Snowball: stemming analyzer based on the Snowball library http://snowball.tartarus.org • spellchecker: tools for spellchecking and suggestions with Lucene
  • 84. Lucene-related projects • Apache Solr: open source enterprise search platform from the Apache Lucene project – Full-Text Search Capabilities, High Volume Web Traffic, HTML Administration Interfaces • Apache Nutch: web-search engine based on Solr and Lucene – crawler, link-graph database, HTML
  • 85. Thank you for your attention! Questions?