Sandhan (CLIA) - Nutch and Lucene Framework
Gaurav Arora
IRLAB, DA-IICT

Outline
  Introduction
  Behavior of Nutch (Offline and Online)
  Lucene Features
  Sandhan Demo
  RJ Interface

Introduction
  Nutch is an open-source search engine
  Implemented in Java
  Nutch is built on Lucene, Solr, Hadoop, etc.
  Lucene is an implementation of indexing and searching of crawled data
  Both Nutch and Lucene are developed using a plugin framework
  Easy to customize

Where do they fit in IR?
  [Figure: Where do they fit in IR?]

Nutch – complete search engine
  [Figure: Nutch – complete search engine]

Nutch – offline processing
  Crawling
    Starts with a set of seed URLs
    Goes deeper into the web and starts fetching the content
    Content needs to be analyzed before storing
    Storing the content
    Makes it suitable for searching
  Issues
    Time-consuming process
    Freshness of the crawl (How often should I crawl?)
    Coverage of content

Nutch – online processing
  Searching
    Analysis of the query
    Processing of the few words (tokens) in the query
    Query tokens are matched against stored tokens (index)
  Fast and accurate
  Involves ordering the matching results
  Ranking affects user satisfaction directly
  Supports distributed searching

Nutch – Data structures
  Web Database or WebDB
    Mirrors the properties/structure of the web graph being crawled
  Segment
    Intermediate index
    Contains pages fetched in a single run
  Index
    Final inverted index obtained by "merging" segments (Lucene)

Nutch – Data
  Web Database or WebDB
    CrawlDB: contains information about every URL known to Nutch, including whether it was fetched.
    LinkDB: contains the list of known links to each URL, including both the source URL and the anchor text of the link.
  Index
    Inverted index: posting lists that map each word to the documents containing it.
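
The inverted index mentioned above can be illustrated with a minimal sketch. It is not Nutch or Lucene code: the class name TinyInvertedIndex and its methods are hypothetical, and the tokenization (lowercasing, splitting on non-alphanumeric characters) is an assumption chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of document ids that
// contain it (its posting list). Illustrative only, not Nutch or Lucene API.
final class TinyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Lowercase the text, split on non-letter/digit characters, and record
    // docId in the posting list of every resulting term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Posting list for one term; empty if the term was never indexed.
    List<Integer> lookup(String term) {
        TreeSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<Integer>emptyList() : new ArrayList<>(docs);
    }
}
```

Looking up a query term is then a single map access, which is why matching query tokens against stored tokens can be fast at search time.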
Nutch Data - Segment
  Each segment is a set of URLs that are fetched as a unit.
  A segment contains:
    crawl_generate: names a set of URLs to be fetched
    crawl_fetch: contains the status of fetching each URL
    content: contains the raw content retrieved from each URL
    parse_text: contains the parsed text of each URL
    parse_data: contains outlinks and metadata parsed from each URL
    crawl_parse: contains the outlink URLs, used to update the CrawlDB


Nutch – Crawling
  Inject: initial creation of CrawlDB
    Insert seed URLs
    Initial LinkDB is empty
  Generate new shard's fetchlist
  Fetch raw content
  Parse content (discovers outlinks)
  Update CrawlDB from shards
  Update LinkDB from shards
  Index shards
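
The inject / generate / fetch / parse / update cycle listed above can be sketched over an in-memory "web". This is only a toy under stated assumptions: ToyCrawl and its parameters are hypothetical names, the web is a map from a page to its outlinks, and real Nutch runs each step as a separate job over data on disk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy crawl loop. The "web" maps a URL to the outlinks found on that page.
final class ToyCrawl {
    // Returns the crawl db after at most `depth` rounds: URL -> true if fetched.
    static Map<String, Boolean> crawl(Map<String, List<String>> web,
                                      List<String> seeds, int depth) {
        Map<String, Boolean> crawlDb = new LinkedHashMap<>();
        for (String seed : seeds) crawlDb.put(seed, false);        // inject

        for (int round = 0; round < depth; round++) {
            List<String> fetchList = new ArrayList<>();             // generate
            for (Map.Entry<String, Boolean> e : crawlDb.entrySet()) {
                if (!e.getValue()) fetchList.add(e.getKey());
            }
            if (fetchList.isEmpty()) break;                         // nothing left to fetch

            for (String url : fetchList) {
                List<String> outlinks = web.get(url);               // fetch + parse
                if (outlinks == null) outlinks = Collections.emptyList();
                crawlDb.put(url, true);                             // update: mark as fetched
                for (String link : outlinks) {
                    crawlDb.putIfAbsent(link, false);               // update: newly discovered URLs
                }
            }
        }
        return crawlDb;
    }
}
```

Each round plays the role of one shard (segment): URLs discovered in round n are only fetched in round n+1, which is why the crawl depth bounds how far from the seeds the crawler gets.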


Wide Crawling vs. Focused Crawling
  Differences:
    Little technical difference in configuration
    Big difference in operations, maintenance and quality
  Wide crawling:
    (Almost) unlimited crawling frontier
    High risk of spamming and junk content
    "Politeness" a very important limiting factor
    Bandwidth & DNS considerations
  Focused (vertical or enterprise) crawling:
    Limited crawling frontier
    Bandwidth or politeness is often not an issue
    Low risk of spamming and junk content

Crawling Architecture
  Step 1: Injector injects the list of seed URLs into the CrawlDB.
  Step 2: Generator takes the list of seed URLs from the CrawlDB, forms the fetch list, and adds a crawl_generate folder to the segments.
  Step 3: Fetchers use these fetch lists to fetch the raw content of the documents, which is then stored in segments.
  Step 4: Parser is called to parse the content of the documents, and the parsed content is stored back in segments.
  Step 5: The links are inverted in the link graph and stored in LinkDB.
  Step 6: The terms present in the segments are indexed, and the indices are updated in the segments.
  Step 7: Information on the newly fetched documents is updated in the CrawlDB.
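
The link inversion in Step 5 can be pictured as turning "page -> outlinks" into "page -> inlinks". The sketch below assumes that shape of data; LinkInverter is a hypothetical name, and anchor text (which the slides say LinkDB also keeps) is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Inverts a map of "source page -> outlink targets" into
// "target page -> pages that link to it". Illustrative only.
final class LinkInverter {
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            for (String target : e.getValue()) {
                // Record the source page under every page it points to.
                inlinks.computeIfAbsent(target, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return inlinks;
    }
}
```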

Crawling: 10-stage process
  bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
    1. admin db -create: Create a new WebDB.
    2. inject: Inject root URLs into the WebDB.
    3. generate: Generate a fetchlist from the WebDB in a new segment.
    4. fetch: Fetch content from URLs in the fetchlist.
    5. updatedb: Update the WebDB with links from fetched pages.
    6. Repeat steps 3-5 until the required depth is reached.
    7. updatesegs: Update segments with scores and links from the WebDB.
    8. index: Index the fetched pages.
    9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
    10. merge: Merge the indexes into a single index for searching.

De-duplication Algorithm
  (MD5 hash, float score, int indexID, int docID, int urlLen) for each page
  To eliminate URL duplicates from a segmentsDir:
    open a temporary file
    for each segment:
      for each document in its index:
        append a tuple for the document to the temporary file with hash=MD5(URL)
    close the temporary file
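
A minimal sketch of the hashing idea behind the pseudocode above, using only the Java standard library. UrlDedup is a hypothetical name. The slide stops after the tuples are written, so the policy used here (keep the first document seen for each hash and drop later ones) is an assumption, as is holding everything in memory instead of a temporary file.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UrlDedup {
    // Hex-encoded MD5 of the URL, used as the duplicate-detection key.
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumed policy: keep the first URL seen for each hash, drop the rest.
    static List<String> dedup(List<String> urls) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (seen.add(md5Hex(url))) kept.add(url);
        }
        return kept;
    }
}
```

The tuple on the slide also carries a score, so a fuller version could use it to decide which duplicate to keep; this sketch does not attempt that.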

URL Filtering
  URL filters (text file: conf/crawl-urlfilter.txt)
    Regular expressions to filter URLs during crawling
    E.g.
      To ignore files with certain suffixes:
        -\.(gif|exe|zip|ico)$
      To accept hosts in a certain domain:
        +^http://([a-z0-9]*\.)*apache.org/
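
How such +/- rules might be evaluated can be sketched as follows. ToyUrlFilter is a hypothetical class, not the Nutch implementation. It assumes that rules are tried in order, that the first matching rule decides, and that a URL matching no rule is rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ToyUrlFilter {
    private final List<Boolean> accept = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    // Each rule is '+' (accept) or '-' (reject) followed by a regular expression.
    ToyUrlFilter(String... rules) {
        for (String rule : rules) {
            accept.add(rule.charAt(0) == '+');
            patterns.add(Pattern.compile(rule.substring(1)));
        }
    }

    // Assumed semantics: first matching rule wins; no match means reject.
    boolean accepts(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) return accept.get(i);
        }
        return false;
    }
}
```

With the two example rules from the slide, a URL such as http://lucene.apache.org/logo.gif is rejected by the suffix rule before the domain rule is ever consulted, so rule order matters.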

A Few APIs
  Site we would crawl: http://www.iitb.ac.in
    bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
  Analyze the database:
    bin/nutch readdb <db dir> -stats
    bin/nutch readdb <db dir> -dumppageurl
    bin/nutch readdb <db dir> -dumplinks
    s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s

Map-Reduce Function
  Works in a distributed environment
  map() and reduce() functions are implemented in most of the modules
  Both map() and reduce() functions use <key, value> pairs
  Useful when processing large data (e.g., indexing)
  Some applications need a sequence of map-reduce steps
    Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
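
The <key, value> flow described above can be shown with a single-process sketch. It does not use Hadoop: ToyMapReduce and its methods are hypothetical, the grouping step stands in for the distributed shuffle, and term counting is chosen only because it is the smallest useful example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToyMapReduce {
    // map(): emit a <term, 1> pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.trim().split("\\s+")) {
            if (!token.isEmpty()) pairs.add(new SimpleEntry<String, Integer>(token, 1));
        }
        return pairs;
    }

    // reduce(): sum all values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Driver: map every input, group the pairs by key (the "shuffle"), then reduce.
    static Map<String, Integer> run(List<String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String doc : documents) {
            for (Map.Entry<String, Integer> pair : map(doc)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

Chaining jobs, as in Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n, amounts to feeding the output pairs of one run into the map step of the next.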

Map-Reduce Architecture
  [Figure: Map-Reduce Architecture]

Nutch – Map-Reduce Indexing
  map() just assembles all parts of documents
  reduce() performs text analysis + indexing:
    Adds to a local Lucene index
  Other possible MR indexing models:
    Hadoop contrib/indexing model:
      Analysis and indexing on map() side
      Index merging on reduce() side
    Modified Nutch model:
      Analysis on map() side
      Indexing on reduce() side

Nutch - Ranking
  Nutch Ranking
    queryNorm(): the normalization factor for the query
    coord(): how many query terms are present in the given document
    norm(): score indicating the field-based normalization factor
    tf: term frequency; idf: inverse document frequency
    t.boost(): score indicating the importance of a term's occurrence in a particular field
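
The formula that these terms belong to did not survive in the slide text. Assuming the slide showed Lucene's classic practical scoring function (an assumption, although the term names above match it), the factors combine as:

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \Big( \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
t.\mathrm{boost}()\cdot \mathrm{norm}(t,d) \Big)
\]
```

Here q is the query, d a candidate document, and the sum runs over the query terms t. coord and queryNorm apply once per query and document, while tf, idf, boost and norm are evaluated per term.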

Lucene - Features
  Field-based indexing and searching
  Different fields of a webpage are
    Title
    URL
    Anchor text
    Content, etc.
  Different boost factors to give importance to fields
  Uses an inverted index to store the content of crawled documents
  Open-source Apache project

Lucene - Index
  Concepts
    Index: sequence of documents (a.k.a. Directory)
    Document: sequence of fields
    Field: named sequence of terms
    Term: a text string (e.g., a word)
  Statistics
    Term frequencies and positions
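
The statistics mentioned above (term frequencies and positions) can be sketched for a single field. ToyTermStats is a hypothetical helper, not Lucene API, and it assumes the field text has already been split into terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToyTermStats {
    // For one field's already-tokenized text, record the positions at which
    // each term occurs. A term's frequency is the length of its position list.
    static Map<String, List<Integer>> positions(List<String> terms) {
        Map<String, List<Integer>> byTerm = new TreeMap<>();
        for (int pos = 0; pos < terms.size(); pos++) {
            byTerm.computeIfAbsent(terms.get(pos), k -> new ArrayList<>()).add(pos);
        }
        return byTerm;
    }
}
```

Frequencies feed the tf part of ranking, while positions are what make phrase matching possible.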
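
Writing to Index
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;

  // third argument true = create a new index instead of appending to an existing one
  IndexWriter writer = new IndexWriter(directory, analyzer, true);

  Document doc = new Document();
  // add fields to document (next slide)
  writer.addDocument(doc);
  writer.close();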

Adding Fields
  import org.apache.lucene.document.DateField;
  import org.apache.lucene.document.Field;

  doc.add(Field.Keyword("isbn", isbn));
  doc.add(Field.Keyword("category", category));
  doc.add(Field.Text("title", title));
  doc.add(Field.Text("author", author));
  doc.add(Field.UnIndexed("url", url));
  doc.add(Field.UnStored("subjects", subjects, true));
  doc.add(Field.Keyword("pubmonth", pubmonth));
  doc.add(Field.UnStored("contents", author + " " + subjects));
  doc.add(Field.Keyword("modified",
      DateField.timeToString(file.lastModified())));

Fields Description
  Attributes
    Stored: original content retrievable
    Indexed: inverted, searchable
    Tokenized: analyzed, split into tokens
  Factory methods
    Keyword: stored and indexed as a single term
    Text: indexed, tokenized, and stored if String
    UnIndexed: stored
    UnStored: indexed, tokenized
  Terms are what matters for searching
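
Searching an Index
  import org.apache.lucene.document.Document;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  IndexSearcher searcher = new IndexSearcher(directory);

  // parse the query text against the "contents" field
  Query query = QueryParser.parse(queryExpression, "contents", analyzer);

  Hits hits = searcher.search(query);
  for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
  }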

Analyzer
  Analysis occurs
    For each tokenized field during indexing
    For each term or phrase in QueryParser
  Several analyzers built-in
    Many more in the sandbox
    Straightforward to create your own
  Choosing the right analyzer is important!

WhiteSpace Analyzer
  The quick brown fox jumps over the lazy dog.
  [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]

Simple Analyzer
  The quick brown fox jumps over the lazy dog.
  [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]

Stop Analyzer
  The quick brown fox jumps over the lazy dog.
  [quick] [brown] [fox] [jumps] [over] [lazy] [dog]

Snowball Analyzer
  The quick brown fox jumps over the lazy dog.
  [the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
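
The first three analyzer behaviours above can be imitated with plain string handling. This sketch does not use Lucene's analyzers: ToyAnalyzers is a hypothetical class, the stop-word list is a small assumed sample, and stemming (the Snowball case) is left out because it needs a real stemming algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ToyAnalyzers {
    // Assumed stop-word sample for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    // Whitespace-style: split on whitespace only; case and punctuation are kept.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Simple-style: split on non-letters and lowercase every token.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Stop-style: simple analysis followed by stop-word removal.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }
}
```

Because each function yields different terms for the same sentence, a query analyzed one way may not match an index built another way.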

Query Creation
  Searching by a term – TermQuery
  Searching within a range – RangeQuery
  Searching on a string – PrefixQuery
  Combining queries – BooleanQuery
  Searching by phrase – PhraseQuery
  Searching by wildcard – WildcardQuery
  Searching for similar terms – FuzzyQuery
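
What several of these query types test for can be sketched against a single indexed term. These are not the Lucene classes: ToyTermMatchers is hypothetical, the fuzzy check uses a plain edit-distance threshold (an assumption; Lucene's own similarity measure differs), and BooleanQuery and PhraseQuery are omitted because they combine terms rather than match one.

```java
import java.util.regex.Pattern;

final class ToyTermMatchers {
    // TermQuery-style: exact term match.
    static boolean term(String indexed, String query) {
        return indexed.equals(query);
    }

    // PrefixQuery-style: indexed term starts with the given prefix.
    static boolean prefix(String indexed, String prefix) {
        return indexed.startsWith(prefix);
    }

    // RangeQuery-style: inclusive lexicographic range.
    static boolean range(String indexed, String lower, String upper) {
        return indexed.compareTo(lower) >= 0 && indexed.compareTo(upper) <= 0;
    }

    // WildcardQuery-style: '*' matches any run of characters, '?' exactly one.
    static boolean wildcard(String indexed, String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') regex.append(".*");
            else if (c == '?') regex.append('.');
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.matches(regex.toString(), indexed);
    }

    // FuzzyQuery-style (assumed rule): within maxEdits Levenshtein edits.
    static boolean fuzzy(String indexed, String query, int maxEdits) {
        int[] prev = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++) prev[j] = j;
        for (int i = 1; i <= indexed.length(); i++) {
            int[] cur = new int[query.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= query.length(); j++) {
                int cost = indexed.charAt(i - 1) == query.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[query.length()] <= maxEdits;
    }
}
```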

Lucene Queries
  [Figure: Lucene Queries]

Conclusions
  Nutch as a starting point
  Crawling in Nutch
  Detailed map-reduce architecture
  Different query formats in Lucene
  Built-in analyzers in Lucene
  The same analyzer needs to be used both while indexing and searching

Resources Used
  Gospodnetic, Otis; Erik Hatcher (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. pp. 456. ISBN 978-1-932394-28-3.
  Nutch Wiki: http://wiki.apache.org/nutch/

Thanks
  Questions?

 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Nutch and lucene_framework

  • 1. Sandhan (CLIA) - Nutch and Lucene Framework. Gaurav Arora, IRLAB, DA-IICT
  • 2. Outline
      • Introduction
      • Behavior of Nutch (offline and online)
      • Lucene features
      • Sandhan demo
      • RJ interface
  • 3. Introduction
      • Nutch is an open-source search engine
      • Implemented in Java
      • Nutch builds on Lucene, Solr, Hadoop, etc.
      • Lucene is an implementation of indexing and searching crawled data
      • Both Nutch and Lucene are developed using a plugin framework
      • Easy to customize
  • 4. Where do they fit in IR?
  • 5. Nutch - complete search engine
  • 6. Nutch - offline processing
      • Crawling
          • Starts with a set of seed URLs
          • Goes deeper into the web and fetches the content
          • Content needs to be analyzed before storing
          • Storing the content
          • Makes it suitable for searching
      • Issues
          • Time-consuming process
          • Freshness of the crawl (how often should I crawl?)
          • Coverage of content
  • 7. Nutch - online processing
      • Searching
          • Analysis of the query
          • Processing of the few words (tokens) in the query
          • Query tokens matched against stored tokens (index)
      • Fast and accurate
      • Involves ordering the matching results
      • Ranking affects the user's satisfaction directly
      • Supports distributed searching
  • 8.
  • 9. Nutch - Data structures
      • Web Database or WebDB
          • Mirrors the properties/structure of the web graph being crawled
      • Segment
          • Intermediate index
          • Contains pages fetched in a single run
      • Index
          • Final inverted index obtained by "merging" segments (Lucene)
  • 10. Nutch - Data
      • Web Database or WebDB
          • CrawlDB: contains information about every URL known to Nutch, including whether it was fetched.
          • LinkDB: contains the list of known links to each URL, including both the source URL and the anchor text of the link.
      • Index
          • Inverted index: posting lists, a mapping from each word to the documents that contain it.
  • 11. Nutch Data - Segment
      Each segment is a set of URLs that are fetched as a unit. A segment contains:
      • crawl_generate: names a set of URLs to be fetched
      • crawl_fetch: contains the status of fetching each URL
      • content: contains the raw content retrieved from each URL
      • parse_text: contains the parsed text of each URL
      • parse_data: contains outlinks and metadata parsed from each URL
      • crawl_parse: contains the outlink URLs, used to update the CrawlDB
  • 12. Nutch - Crawling
      • Inject: initial creation of CrawlDB
          • Insert seed URLs
          • Initial LinkDB is empty
      • Generate new shard's fetchlist
      • Fetch raw content
      • Parse content (discovers outlinks)
      • Update CrawlDB from shards
      • Update LinkDB from shards
      • Index shards
  • 13. Wide Crawling vs. Focused Crawling
      • Differences:
          • Little technical difference in configuration
          • Big difference in operations, maintenance and quality
      • Wide crawling:
          • (Almost) unlimited crawling frontier
          • High risk of spamming and junk content
          • "Politeness" is a very important limiting factor
          • Bandwidth and DNS considerations
      • Focused (vertical or enterprise) crawling:
          • Limited crawling frontier
          • Bandwidth or politeness is often not an issue
          • Low risk of spamming and junk content
  • 15. Step 1: The Injector injects the list of seed URLs into the CrawlDB.
  • 16. Step 2: The Generator takes the list of seed URLs from the CrawlDB, forms a fetch list, and adds a crawl_generate folder to the segments.
  • 17. Step 3: The fetch lists are used by the fetchers to fetch the raw content of the documents, which is then stored in segments.
  • 18. Step 4: The Parser is called to parse the content of each document, and the parsed content is stored back in segments.
  • 19. Step 5: The links are inverted in the link graph and stored in the LinkDB.
  • 20. Step 6: The terms present in the segments are indexed, and the indices are updated in the segments.
  • 21. Step 7: Information on the newly fetched documents is updated in the CrawlDB.
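The seven steps above can be sketched as a toy crawl loop. This is only an illustration under stated assumptions, not Nutch code: the class name CrawlCycleSketch, the in-memory WEB map (which stands in for real fetching and parsing) and the example.* URLs are all hypothetical, and the indexing step is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CrawlCycleSketch {
    // Hypothetical "web": URL -> outlinks. Replaces real HTTP fetching and HTML parsing.
    static final Map<String, List<String>> WEB = Map.of(
            "http://a.example/", List.of("http://b.example/", "http://c.example/"),
            "http://b.example/", List.of("http://c.example/"),
            "http://c.example/", List.of());

    final Map<String, Boolean> crawlDb = new LinkedHashMap<>(); // URL -> already fetched?
    final Map<String, Set<String>> linkDb = new TreeMap<>();    // URL -> URLs linking to it

    // Step 1: inject seed URLs as "not yet fetched".
    void inject(List<String> seeds) {
        for (String url : seeds) crawlDb.putIfAbsent(url, false);
    }

    // Step 2: the fetch list is every known URL that has not been fetched.
    List<String> generate() {
        List<String> fetchList = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : crawlDb.entrySet()) {
            if (!e.getValue()) fetchList.add(e.getKey());
        }
        return fetchList;
    }

    // Steps 3 and 4: "fetch" and "parse" each URL; the segment maps URL -> outlinks.
    Map<String, List<String>> fetchAndParse(List<String> fetchList) {
        Map<String, List<String>> segment = new LinkedHashMap<>();
        for (String url : fetchList) segment.put(url, WEB.getOrDefault(url, List.of()));
        return segment;
    }

    // Step 5: invert outlinks into inlinks.
    void invertLinks(Map<String, List<String>> segment) {
        for (Map.Entry<String, List<String>> page : segment.entrySet()) {
            for (String target : page.getValue()) {
                linkDb.computeIfAbsent(target, k -> new TreeSet<>()).add(page.getKey());
            }
        }
    }

    // Step 7: mark fetched pages and add newly discovered URLs.
    void updateDb(Map<String, List<String>> segment) {
        for (Map.Entry<String, List<String>> page : segment.entrySet()) {
            crawlDb.put(page.getKey(), true);
            for (String target : page.getValue()) crawlDb.putIfAbsent(target, false);
        }
    }

    void crawl(List<String> seeds, int depth) {
        inject(seeds);
        for (int round = 0; round < depth; round++) {
            List<String> fetchList = generate();
            if (fetchList.isEmpty()) break;
            Map<String, List<String>> segment = fetchAndParse(fetchList);
            invertLinks(segment);
            updateDb(segment);
        }
    }

    public static void main(String[] args) {
        CrawlCycleSketch c = new CrawlCycleSketch();
        c.crawl(List.of("http://a.example/"), 2);
        System.out.println(c.crawlDb);
        System.out.println(c.linkDb);
    }
}
```

Each pass of the loop corresponds to one segment; the depth argument plays the role of -depth in the crawl command.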
  • 22. Crawling: 10-stage process
      bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
      1. admin db -create: Create a new WebDB.
      2. inject: Inject root URLs into the WebDB.
      3. generate: Generate a fetchlist from the WebDB in a new segment.
      4. fetch: Fetch content from URLs in the fetchlist.
      5. updatedb: Update the WebDB with links from fetched pages.
      6. Repeat steps 3-5 until the required depth is reached.
      7. updatesegs: Update segments with scores and links from the WebDB.
      8. index: Index the fetched pages.
      9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
      10. merge: Merge the indexes into a single index for searching.
  • 23. De-duplication Algorithm
      Tuple built for each page: (MD5 hash, float score, int indexID, int docID, int urlLen)
      To eliminate URL duplicates from a segmentsDir:
          open a temporary file
          for each segment:
              for each document in its index:
                  append a tuple for the document to the temporary file with hash=MD5(URL)
          close the temporary file
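The tuple-building part of the algorithm above can be sketched in plain Java. This is a minimal illustration, not the Nutch implementation: the class UrlDedupSketch and its methods are hypothetical, an in-memory list replaces the temporary file, and the policy of keeping the highest-scoring tuple per hash is an assumption, since the slide stops before saying what is done with the tuples.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UrlDedupSketch {
    /** One tuple per indexed page, with the fields listed on the slide. */
    static final class PageTuple {
        final String hash;
        final float score;
        final int indexId;
        final int docId;
        final int urlLen;

        PageTuple(String hash, float score, int indexId, int docId, int urlLen) {
            this.hash = hash;
            this.score = score;
            this.indexId = indexId;
            this.docId = docId;
            this.urlLen = urlLen;
        }
    }

    /** Hex-encoded MD5 of a string. */
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 ships with every standard JDK
        }
    }

    /** hash = MD5(URL), as in the URL-duplicate pass on the slide. */
    static PageTuple tupleFor(String url, float score, int indexId, int docId) {
        return new PageTuple(md5Hex(url), score, indexId, docId, url.length());
    }

    /** Assumed policy: among tuples with the same hash, keep the highest score. */
    static Collection<PageTuple> keepBest(List<PageTuple> tuples) {
        Map<String, PageTuple> best = new LinkedHashMap<>();
        for (PageTuple t : tuples) {
            best.merge(t.hash, t, (old, cur) -> cur.score > old.score ? cur : old);
        }
        return best.values();
    }
}
```

Hashing the page content instead of the URL in tupleFor would turn the same sketch into a content-duplicate pass.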
  • 24. URL Filtering
      • URL filters (text file: conf/crawl-urlfilter.txt)
          • Regular expressions to filter URLs during crawling
          • E.g.
              • To ignore files with certain suffixes:
                  -\.(gif|exe|zip|ico)$
              • To accept hosts in a certain domain:
                  +^http://([a-z0-9]*\.)*apache.org/
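A small sketch of how such +/- rules could be evaluated. It assumes that the first matching rule decides and that a URL matching no rule is dropped; the class UrlFilterSketch and its methods are hypothetical and not part of Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilterSketch {
    /** One line of the filter file: '+' accepts, '-' rejects, the rest is a regex. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Parses filter lines, skipping blank lines and '#' comments. */
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    /** Assumed semantics: first matching rule wins; no match means reject. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) return r.accept;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse(List.of(
                "# skip some binary suffixes",
                "-\\.(gif|exe|zip|ico)$",
                "+^http://([a-z0-9]*\\.)*apache.org/"));
        System.out.println(accepts(rules, "http://lucene.apache.org/nutch/"));
        System.out.println(accepts(rules, "http://www.apache.org/logo.gif"));
    }
}
```

Because the first match decides, the suffix rule has to come before the domain rule for the .gif URL to be rejected.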
  • 25. A Few APIs
      • Site we would crawl: http://www.iitb.ac.in
          • bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
      • Analyze the database:
          • bin/nutch readdb <db dir> -stats
          • bin/nutch readdb <db dir> -dumppageurl
          • bin/nutch readdb <db dir> -dumplinks
          • s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s
  • 26. Map-Reduce Function
      • Works in a distributed environment
      • map() and reduce() functions are implemented in most of the modules
      • Both map() and reduce() functions use <key, value> pairs
      • Useful when processing large data (e.g. indexing)
      • Some applications need a sequence of map-reduce steps
          • Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
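To make the <key, value> flow concrete, here is a single-process sketch of one map-reduce round that builds term -> document lists. It does not use Hadoop; the class MapReduceSketch and the run() driver are hypothetical and only imitate the map, group-by-key and reduce phases in memory.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class MapReduceSketch {
    /** map(): emits one <term, docId> pair per token of the document. */
    static List<Map.Entry<String, String>> map(String docId, String text) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(token, docId));
        }
        return out;
    }

    /** reduce(): receives all docIds emitted for one term, returns them sorted and distinct. */
    static List<String> reduce(String term, List<String> docIds) {
        return new ArrayList<>(new TreeSet<>(docIds));
    }

    /** Driver: runs map over every document, groups by key, then reduces each group. */
    static Map<String, List<String>> run(Map<String, String> docs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, String> kv : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Map.of("d1", "quick brown fox", "d2", "lazy brown dog")));
    }
}
```

In a real cluster the grouping step is the shuffle, and the map and reduce calls run on different machines.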
  • 27. Map-Reduce Architecture
  • 28. Nutch - Map-Reduce Indexing
      • map() just assembles all parts of documents
      • reduce() performs text analysis + indexing:
          • Adds to a local Lucene index
      • Other possible MR indexing models:
          • Hadoop contrib/indexing model:
              • Analysis and indexing on map() side
              • Index merging on reduce() side
          • Modified Nutch model:
              • Analysis on map() side
              • Indexing on reduce() side
  • 29. Nutch - Ranking
      • Nutch ranking
          • queryNorm(): the normalization factor for the query
          • coord(): how many of the query terms are present in the given document
          • norm(): the field-based normalization factor
          • tf: term frequency; idf: inverse document frequency
          • t.boost(): the importance of a term's occurrence in a particular field
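The factors listed above match the practical scoring function documented for Lucene's classic TF-IDF Similarity. The formula below is that documented form, given on the assumption that it is what this slide showed; the transcript itself contains only the factor descriptions.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \Big( \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
t.\mathrm{boost}()\cdot \mathrm{norm}(t,d) \Big)
```

Here q is the query, d a document and t a query term; idf is squared because it enters both the query weight and the document weight.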
  • 30. Lucene - Features
      • Field-based indexing and searching
      • Different fields of a web page are
          • Title
          • URL
          • Anchor text
          • Content, etc.
      • Different boost factors give importance to fields
      • Uses an inverted index to store the content of crawled documents
      • Open-source Apache project
  • 31. Lucene - Index
      • Concepts
          • Index: sequence of documents (a.k.a. Directory)
          • Document: sequence of fields
          • Field: named sequence of terms
          • Term: a text string (e.g., a word)
      • Statistics
          • Term frequencies and positions
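As a rough illustration of the statistics above, the following toy structure records, for each term, the documents it occurs in and the positions within each document. It is not Lucene's on-disk format; the class PositionalIndexSketch, its methods, and the lowercase whitespace tokenization are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class PositionalIndexSketch {
    // term -> (docId -> token positions of the term in that document)
    final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Indexes the text of a single field for one document. */
    void addDocument(int docId, String fieldText) {
        String[] terms = fieldText.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int pos = 0; pos < terms.length; pos++) {
            if (terms[pos].isEmpty()) continue; // only happens for empty input
            postings.computeIfAbsent(terms[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Positions of a term in a document (empty if the term does not occur). */
    List<Integer> positions(String term, int docId) {
        return postings.getOrDefault(term, Map.of()).getOrDefault(docId, List.of());
    }

    /** Term frequency is the number of recorded positions. */
    int termFrequency(String term, int docId) {
        return positions(term, docId).size();
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.addDocument(0, "the quick fox and the lazy dog");
        System.out.println(index.positions("the", 0));
        System.out.println(index.termFrequency("the", 0));
    }
}
```

Storing positions is what makes phrase queries possible: two terms form a phrase when their positions in the same document are adjacent.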
  • 32. Writing to Index
      IndexWriter writer =
          new IndexWriter(directory, analyzer, true);  // true: create a new index

      Document doc = new Document();
      // add fields to document (next slide)
      writer.addDocument(doc);
      writer.close();
  • 33. Adding Fields
      doc.add(Field.Keyword("isbn", isbn));
      doc.add(Field.Keyword("category", category));
      doc.add(Field.Text("title", title));
      doc.add(Field.Text("author", author));
      doc.add(Field.UnIndexed("url", url));
      doc.add(Field.UnStored("subjects", subjects, true));
      doc.add(Field.Keyword("pubmonth", pubmonth));
      doc.add(Field.UnStored("contents", author + " " + subjects));
      doc.add(Field.Keyword("modified",
          DateField.timeToString(file.lastModified())));
  • 34. Fields Description
      • Attributes
          • Stored: original content retrievable
          • Indexed: inverted, searchable
          • Tokenized: analyzed, split into tokens
      • Factory methods
          • Keyword: stored and indexed as a single term
          • Text: indexed, tokenized, and stored if String
          • UnIndexed: stored
          • UnStored: indexed, tokenized
      • Terms are what matter for searching
  • 35. Searching an Index
      IndexSearcher searcher =
          new IndexSearcher(directory);

      Query query =
          QueryParser.parse(queryExpression, "contents", analyzer);

      Hits hits = searcher.search(query);
      for (int i = 0; i < hits.length(); i++) {
          Document doc = hits.doc(i);
          System.out.println(doc.get("title"));
      }
  • 36. Analyzer
      • Analysis occurs
          • For each tokenized field during indexing
          • For each term or phrase in QueryParser
      • Several analyzers built-in
          • Many more in the sandbox
          • Straightforward to create your own
      • Choosing the right analyzer is important!
  • 37. WhiteSpace Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
  • 38. Simple Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
  • 39. Stop Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [quick] [brown] [fox] [jumps] [over] [lazy] [dog]
  • 40. Snowball Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
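The first three token streams above can be reproduced with a few lines of plain Java. This is a sketch of the behaviour shown on the slides, not Lucene's analyzer classes: AnalyzerSketch, its method names and the short STOP_WORDS set are assumptions, and stemming (the Snowball case) is not attempted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerSketch {
    // Assumed stop list for the example; a real analyzer ships its own list.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of", "to", "in");

    /** Whitespace-style: split on whitespace only, keep case and punctuation. */
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Simple-style: lowercase, then split on anything that is not a letter. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    /** Stop-style: the simple tokens with stop words removed. */
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>(simple(text));
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog.";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

Running the same function on both documents and queries is what keeps "Dog." in a page and "dog" in a query pointing at the same term.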
  • 41. Query Creation
      • Searching by a term: TermQuery
      • Searching within a range: RangeQuery
      • Searching on a string prefix: PrefixQuery
      • Combining queries: BooleanQuery
      • Searching by phrase: PhraseQuery
      • Searching by wildcard: WildcardQuery
      • Searching for similar terms: FuzzyQuery
  • 42. Lucene Queries
  • 43. Conclusions
      • Nutch as a starting point
      • Crawling in Nutch
      • Detailed map-reduce architecture
      • Different query formats in Lucene
      • Built-in analyzers in Lucene
      • The same analyzer needs to be used both while indexing and while searching
  • 44. Resources Used
      • Gospodnetic, Otis; Hatcher, Erik (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. 456 pp. ISBN 978-1-932394-28-3.
      • Nutch Wiki: http://wiki.apache.org/nutch/
  • 45. Thanks
      • Questions?