2. Outline
Introduction
Behavior of Nutch (Offline and Online)
Lucene Features
Sandhan Demo
RJ Interface
3. Introduction
Nutch is an open-source search engine
Implemented in Java
Nutch comprises Lucene, Solr, Hadoop, etc.
Lucene implements the indexing and searching of crawled data
Both Nutch and Lucene are developed using a plugin framework
Easy to customize
4. Where do they fit in IR?
5. Nutch – complete search engine
6. Nutch – offline processing
Crawling
Starts with a set of seed URLs
Goes deeper into the web, fetching content as it goes
Content needs to be analyzed before storing
Storing the content
Makes it suitable for searching
Issues
Time-consuming process
Freshness of the crawl (how often should I crawl?)
Coverage of content
7. Nutch – online processing
Searching
Analysis of the query
Processing of the few words (tokens) in the query
Query tokens are matched against stored tokens (the index)
Must be fast and accurate
Involves ordering the matching results
Ranking directly affects user satisfaction
Supports distributed searching
8.
9. Nutch – Data structures
Web Database or WebDB
Mirrors the properties/structure of the web graph being crawled
Segment
Intermediate index
Contains pages fetched in a single run
Index
Final inverted index obtained by "merging" segments (Lucene)
10. Nutch – Data
Web Database or WebDB
Crawldb - This contains information about every URL
known to Nutch, including whether it was fetched.
Linkdb - This contains the list of known links to each URL,
including both the source URL and anchor text of the link.
Index
Inverted index: posting lists mapping each word to the
documents that contain it.
11. Nutch Data - Segment
Each segment is a set of URLs that are fetched as a unit.
Each segment contains:
crawl_generate - names a set of URLs to be fetched
crawl_fetch - contains the status of fetching each URL
content - contains the raw content retrieved from each URL
parse_text - contains the parsed text of each URL
parse_data - contains outlinks and metadata parsed from each URL
crawl_parse - contains the outlink URLs, used to update the crawldb
12. Nutch – Crawling
Inject: initial creation of CrawlDB
Insert seed URLs
Initial LinkDB is empty
Generate new shard's fetchlist
Fetch raw content
Parse content (discovers outlinks)
Update CrawlDB from shards
Update LinkDB from shards
Index shards
13. Wide Crawling vs. Focused Crawling
Differences:
Little technical difference in configuration
Big difference in operations, maintenance and quality
Wide crawling:
(Almost) Unlimited crawling frontier
High risk of spamming and junk content
“Politeness” a very important limiting factor
Bandwidth & DNS considerations
Focused (vertical or enterprise) crawling:
Limited crawling frontier
Bandwidth or politeness is often not an issue
Low risk of spamming and junk content
22. Crawling: 10 stage process
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
1. admin db -create: Create a new WebDB.
2. inject: Inject root URLs into the WebDB.
3. generate: Generate a fetchlist from the WebDB in a new segment.
4. fetch: Fetch content from URLs in the fetchlist.
5. updatedb: Update the WebDB with links from fetched pages.
6. Repeat steps 3-5 until the required depth is reached.
7. updatesegs: Update segments with scores and links from the WebDB.
8. index: Index the fetched pages.
9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
10. merge: Merge the indexes into a single index for searching.
23. De-duplication Algorithm
Each page is represented by the tuple:
(MD5 hash, float score, int indexID, int docID, int urlLen)
To eliminate URL duplicates from a segmentsDir:
  open a temporary file
  for each segment:
    for each document in its index:
      append a tuple for the document to the temporary file, with hash = MD5(URL)
  close the temporary file
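
The same idea can be sketched in plain Java. This is a minimal illustration of MD5-based de-duplication, not Nutch's actual DeleteDuplicates code: the PageEntry record and the rule of keeping the highest-scoring entry per hash are assumptions for the sketch (Nutch's tie-breaking differs between its URL-based and content-based passes).

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.*;

// Minimal sketch of MD5-based URL de-duplication (not Nutch's actual code).
public class UrlDedup {
    // Hypothetical stand-in for the (hash, score, indexID, docID, urlLen) tuple.
    record PageEntry(String url, float score) {}

    static String md5(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
        return new BigInteger(1, digest).toString(16);
    }

    // Keep only the highest-scoring page for each distinct URL hash.
    static List<PageEntry> dedup(List<PageEntry> pages) throws Exception {
        Map<String, PageEntry> best = new HashMap<>();
        for (PageEntry p : pages) {
            String h = md5(p.url());
            PageEntry cur = best.get(h);
            if (cur == null || p.score() > cur.score()) best.put(h, p);
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) throws Exception {
        List<PageEntry> pages = List.of(
            new PageEntry("http://example.org/a", 0.8f),
            new PageEntry("http://example.org/a", 0.3f),  // duplicate URL, lower score
            new PageEntry("http://example.org/b", 0.5f));
        System.out.println(dedup(pages)); // two entries survive
    }
}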
24. URL Filtering
URL filters (text file: conf/crawl-urlfilter.txt)
Regular expressions to filter URLs during crawling
E.g.
To ignore files with certain suffixes:
-\.(gif|exe|zip|ico)$
To accept hosts in a certain domain:
+^http://([a-z0-9]*\.)*apache.org/
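
These prefix-coded rules are easy to emulate. Below is an illustrative filter in plain Java that mimics the first-matching-rule-wins behavior of Nutch's regex URL filter plugin; the class and method names here are made up for the sketch.

import java.util.List;
import java.util.regex.Pattern;

// Illustrative regex URL filter in the spirit of conf/crawl-urlfilter.txt:
// each rule is '+' (accept) or '-' (reject) followed by a regex;
// the first rule that matches a URL decides its fate.
public class SimpleUrlFilter {
    record Rule(boolean accept, Pattern pattern) {}

    private final List<Rule> rules;

    SimpleUrlFilter(List<String> lines) {
        rules = lines.stream()
            .map(l -> new Rule(l.charAt(0) == '+', Pattern.compile(l.substring(1))))
            .toList();
    }

    // Returns true if the URL passes the filter; unmatched URLs are rejected.
    boolean accepts(String url) {
        for (Rule r : rules) {
            if (r.pattern().matcher(url).find()) return r.accept();
        }
        return false;
    }

    public static void main(String[] args) {
        SimpleUrlFilter f = new SimpleUrlFilter(List.of(
            "-\\.(gif|exe|zip|ico)$",
            "+^http://([a-z0-9]*\\.)*apache.org/"));
        System.out.println(f.accepts("http://lucene.apache.org/docs.html")); // true
        System.out.println(f.accepts("http://lucene.apache.org/logo.gif"));  // false
        System.out.println(f.accepts("http://example.com/"));                // false
    }
}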
25. A Few APIs
Site we would crawl: http://www.iitb.ac.in
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
Analyze the database:
bin/nutch readdb <db dir> -stats
bin/nutch readdb <db dir> -dumppageurl
bin/nutch readdb <db dir> -dumplinks
s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s
26. Map-Reduce Function
Works in a distributed environment
map() and reduce() functions are implemented in most of the modules
Both map() and reduce() functions use <key, value> pairs (see the word-count sketch below)
Useful for processing large data (e.g. indexing)
Some applications need a sequence of map-reduce jobs:
Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
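
To make the <key, value> flow concrete, here is the classic word-count job in Hadoop's Java MapReduce API: map() emits a (word, 1) pair per token and reduce() sums the counts per word. It is the standard textbook example, not Nutch code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // map(): one input line in, a (word, 1) pair out for every token.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE);
            }
        }
    }

    // reduce(): all counts for one word in, the summed total out.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}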
27. Map-Reduce Architecture
28. Nutch – Map-Reduce Indexing
map() just assembles all parts of documents
reduce() performs text analysis + indexing (see the sketch below):
Adds to a local Lucene index
Other possible MR indexing models:
Hadoop contrib/indexing model:
Analysis and indexing on map() side
Index merging on reduce() side
Modified Nutch model:
Analysis on map() side
Indexing on reduce() side
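
A minimal sketch of the "indexing on the reduce() side" model, assuming the map side emits (url, parsed-text) pairs. It mixes the modern Hadoop API with the old Lucene API used elsewhere in these slides, and the field names are illustrative; this is not Nutch's actual indexing code.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch: each reducer builds one local Lucene index shard.
public class IndexReducer extends Reducer<Text, Text, Text, Text> {
    private IndexWriter writer;

    @Override
    protected void setup(Context ctx) throws IOException {
        // One index directory per reduce task (old Lucene-style constructor).
        writer = new IndexWriter("shard-" + ctx.getTaskAttemptID().getTaskID().getId(),
                                 new StandardAnalyzer(), true);
    }

    @Override
    protected void reduce(Text url, Iterable<Text> texts, Context ctx)
            throws IOException {
        Document doc = new Document();
        doc.add(Field.UnIndexed("url", url.toString()));       // stored only
        for (Text t : texts)
            doc.add(Field.UnStored("contents", t.toString())); // analyzed + indexed here
        writer.addDocument(doc);
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        writer.close(); // the per-reducer shards can be merged afterwards
    }
}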
29. Nutch – Ranking
Nutch Ranking
queryNorm(): indicates the normalization factor for the query
coord(): indicates how many query terms are present in the given document
norm(): score indicating field-based normalization factor
tf: term frequency, and idf: inverse document frequency
t.boost(): score indicating the importance of a term's occurrence in a particular field
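
Assembled, these factors form Lucene's classic practical scoring function, which is presumably the formula pictured on the original slide:

\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{boost}() \cdot \mathrm{norm}(t,d) \Big)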
30. Lucene – Features
Field-based indexing and searching
Different fields of a webpage:
Title
URL
Anchor text
Content, etc.
Different boost factors to give importance to fields
Uses an inverted index to store the content of crawled documents
Open-source Apache project
31. Lucene – Index
Concepts
Index: sequence of documents (a.k.a. Directory)
Document: sequence of fields
Field: named sequence of terms
Term: a text string (e.g., a word)
Statistics
Term frequencies and positions
32. Writing to Index
// open an index in `directory`; true means create it from scratch
IndexWriter writer = new IndexWriter(directory, analyzer, true);

Document doc = new Document();
// add fields to document (next slide)
writer.addDocument(doc);
writer.close();
33. Adding Fields
doc.add(Field.Keyword("isbn", isbn));
doc.add(Field.Keyword("category", category));
doc.add(Field.Text("title", title));
doc.add(Field.Text("author", author));
doc.add(Field.UnIndexed("url", url));
doc.add(Field.UnStored("subjects", subjects, true));
doc.add(Field.Keyword("pubmonth", pubmonth));
doc.add(Field.UnStored("contents", author + " " + subjects));
doc.add(Field.Keyword("modified",
    DateField.timeToString(file.lastModified())));
34. Fields Description
Attributes
Stored: original content retrievable
Indexed: inverted, searchable
Tokenized: analyzed, split into tokens
Factory methods
Keyword: stored and indexed as a single term
Text: indexed, tokenized, and stored if a String
UnIndexed: stored only
UnStored: indexed and tokenized, not stored
Terms are what matters for searching
35. Searching an Index
IndexSearcher searcher = new IndexSearcher(directory);

// parse the expression against the "contents" field, using the
// same analyzer that was used at indexing time
Query query = QueryParser.parse(queryExpression, "contents", analyzer);

Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}
36. Analyzer
Analysis occurs
For each tokenized field during indexing
For each term or phrase in QueryParser
Several analyzers built in
Many more in the sandbox
Straightforward to create your own
Choosing the right analyzer is important!
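
The bracketed token lists on the next four slides can be reproduced with a few lines of the Lucene 1.4-era analysis API (the era these slides use; later Lucene versions replaced TokenStream.next() with an attribute-based API). A minimal sketch:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

// Prints each token the analyzer produces, in [brackets].
public class AnalyzerDemo {
    static void display(Analyzer analyzer, String text) throws Exception {
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        for (Token t = stream.next(); t != null; t = stream.next())
            System.out.print("[" + t.termText() + "] ");
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        display(new WhitespaceAnalyzer(),
                "The quick brown fox jumps over the lazy dog.");
    }
}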
37. Whitespace Analyzer
The quick brown fox jumps over the lazy dog.

[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
38. Simple Analyzer
The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
39. Stop Analyzer
The quick brown fox jumps over the lazy dog.

[quick] [brown] [fox] [jumps] [over] [lazy] [dog]
40. Snowball Analyzer
The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
41. Query Creation
Searching by a term – TermQuery (see the sketch after this list)
Searching within a range – RangeQuery
Searching on a prefix – PrefixQuery
Combining queries – BooleanQuery
Searching by phrase – PhraseQuery
Searching by wildcard – WildcardQuery
Searching for similar terms – FuzzyQuery
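
A short sketch of constructing several of these query types directly, again against the Lucene 1.4-era API (the three-argument BooleanQuery.add() shown here was later replaced by BooleanClause.Occur):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

public class QueryExamples {
    public static void main(String[] args) {
        // single term
        TermQuery term = new TermQuery(new Term("contents", "lucene"));

        // terms starting with a prefix
        PrefixQuery prefix = new PrefixQuery(new Term("contents", "luc"));

        // exact phrase: terms added in order
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("contents", "open"));
        phrase.add(new Term("contents", "source"));

        // wildcard and fuzzy matching
        WildcardQuery wild = new WildcardQuery(new Term("contents", "nut*"));
        FuzzyQuery fuzzy = new FuzzyQuery(new Term("contents", "lucine"));

        // combining: term is required, phrase is optional (old add() signature)
        BooleanQuery bool = new BooleanQuery();
        bool.add(term, true, false);
        bool.add(phrase, false, false);

        System.out.println(bool); // queries print their Lucene syntax
    }
}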
42. Lucene Queries
43. Conclusions
Nutch as a starting point
Crawling in Nutch
Detailed map-reduce architecture
Different query formats in Lucene
Built-in analyzers in Lucene
The same analyzer needs to be used both while indexing and searching
44. Resources Used
Gospodnetic, Otis; Hatcher, Erik (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. 456 pp. ISBN 978-1-932394-28-3.
Nutch Wiki: http://wiki.apache.org/nutch/
45. Thanks
Questions ??