2. Outline
Introduction
Behavior of Nutch (Offline and Online)
Lucene Features
Sandhan Demo
RJ Interface
3. Introduction
Nutch is an open-source search engine
Implemented in Java
Nutch comprises Lucene, Solr, Hadoop, etc.
Lucene implements the indexing and searching of crawled data
Both Nutch and Lucene are developed using a plugin framework
Easy to customize
4. Where do they fit in IR?
5. Nutch – complete search engine
6. Nutch – offline processing
Crawling
Starts with a set of seed URLs
Goes deeper into the web, fetching content as it goes
Content needs to be analyzed before storing
Storing the content
Makes it suitable for searching
Issues
Time-consuming process
Freshness of the crawl (how often should I crawl?)
Coverage of content
7. Nutch – online processing
Searching
Analysis of the query
Processing of the few words (tokens) in the query
Query tokens are matched against stored tokens (the index)
Must be fast and accurate
Involves ordering the matching results
Ranking directly affects user satisfaction
Supports distributed searching
8.
9. Nutch – Data structures
Web Database or WebDB
Mirrors the properties/structure of the web graph being crawled
Segment
Intermediate index
Contains pages fetched in a single run
Index
Final inverted index obtained by "merging" segments (Lucene)
10. Nutch – Data
Web Database or WebDB
Crawldb - This contains information about every URL
known to Nutch, including whether it was fetched.
Linkdb - This contains the list of known links to each URL,
including both the source URL and anchor text of the link.
Index
Inverted index: posting lists mapping each word to the
documents that contain it.
11. Nutch Data - Segment
Each segment is a set of URLs that are fetched as a unit.
Each segment contains:
crawl_generate - names a set of URLs to be fetched
crawl_fetch - contains the status of fetching each URL
content - contains the raw content retrieved from each URL
parse_text - contains the parsed text of each URL
parse_data - contains outlinks and metadata parsed from each URL
crawl_parse - contains the outlink URLs, used to update the crawldb
12. Nutch – Crawling
Inject: initial creation of CrawlDB
Insert seed URLs
Initial LinkDB is empty
Generate new shard's fetchlist
Fetch raw content
Parse content (discovers outlinks)
Update CrawlDB from shards
Update LinkDB from shards
Index shards
13. Wide Crawling vs. Focused Crawling
Differences:
Little technical difference in configuration
Big difference in operations, maintenance and quality
Wide crawling:
(Almost) Unlimited crawling frontier
High risk of spamming and junk content
“Politeness” a very important limiting factor
Bandwidth & DNS considerations
Focused (vertical or enterprise) crawling:
Limited crawling frontier
Bandwidth or politeness is often not an issue
Low risk of spamming and junk content
22. Crawling: 10 stage process
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
1. admin db -create: Create a new WebDB.
2. inject: Inject root URLs into the WebDB.
3. generate: Generate a fetchlist from the WebDB in a new segment.
4. fetch: Fetch content from URLs in the fetchlist.
5. updatedb: Update the WebDB with links from fetched pages.
6. Repeat steps 3-5 until the required depth is reached.
7. updatesegs: Update segments with scores and links from the WebDB.
8. index: Index the fetched pages.
9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
10. merge: Merge the indexes into a single index for searching.
23. De-duplication Algorithm
Each page is represented by the tuple:
(MD5 hash, float score, int indexID, int docID, int urlLen)
To eliminate URL duplicates from a segmentsDir:
  open a temporary file
  for each segment:
    for each document in its index:
      append a tuple for the document to the temporary file, with hash = MD5(URL)
  close the temporary file
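
The same idea can be sketched in plain Java. This is a minimal illustration of MD5-based de-duplication, not Nutch's actual DeleteDuplicates code: the PageEntry record and the rule of keeping the highest-scoring entry per hash are assumptions for the sketch (Nutch's tie-breaking differs between its URL-based and content-based passes).

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.*;

// Minimal sketch of MD5-based URL de-duplication (not Nutch's actual code).
public class UrlDedup {
    // Hypothetical stand-in for the (hash, score, indexID, docID, urlLen) tuple.
    record PageEntry(String url, float score) {}

    static String md5(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
        return new BigInteger(1, digest).toString(16);
    }

    // Keep only the highest-scoring page for each distinct URL hash.
    static List<PageEntry> dedup(List<PageEntry> pages) throws Exception {
        Map<String, PageEntry> best = new HashMap<>();
        for (PageEntry p : pages) {
            String h = md5(p.url());
            PageEntry cur = best.get(h);
            if (cur == null || p.score() > cur.score()) best.put(h, p);
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) throws Exception {
        List<PageEntry> pages = List.of(
            new PageEntry("http://example.org/a", 0.8f),
            new PageEntry("http://example.org/a", 0.3f),  // duplicate URL, lower score
            new PageEntry("http://example.org/b", 0.5f));
        System.out.println(dedup(pages)); // two entries survive
    }
}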
24. URL Filtering
URL filters (text file: conf/crawl-urlfilter.txt)
Regular expressions to filter URLs during crawling
E.g.
To ignore files with certain suffixes:
-\.(gif|exe|zip|ico)$
To accept hosts in a certain domain:
+^http://([a-z0-9]*\.)*apache.org/
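
These prefix-coded rules are easy to emulate. Below is an illustrative filter in plain Java that mimics the first-matching-rule-wins behavior of Nutch's regex URL filter plugin; the class and method names here are made up for the sketch.

import java.util.List;
import java.util.regex.Pattern;

// Illustrative regex URL filter in the spirit of conf/crawl-urlfilter.txt:
// each rule is '+' (accept) or '-' (reject) followed by a regex;
// the first rule that matches a URL decides its fate.
public class SimpleUrlFilter {
    record Rule(boolean accept, Pattern pattern) {}

    private final List<Rule> rules;

    SimpleUrlFilter(List<String> lines) {
        rules = lines.stream()
            .map(l -> new Rule(l.charAt(0) == '+', Pattern.compile(l.substring(1))))
            .toList();
    }

    // Returns true if the URL passes the filter; unmatched URLs are rejected.
    boolean accepts(String url) {
        for (Rule r : rules) {
            if (r.pattern().matcher(url).find()) return r.accept();
        }
        return false;
    }

    public static void main(String[] args) {
        SimpleUrlFilter f = new SimpleUrlFilter(List.of(
            "-\\.(gif|exe|zip|ico)$",
            "+^http://([a-z0-9]*\\.)*apache.org/"));
        System.out.println(f.accepts("http://lucene.apache.org/docs.html")); // true
        System.out.println(f.accepts("http://lucene.apache.org/logo.gif"));  // false
        System.out.println(f.accepts("http://example.com/"));                // false
    }
}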
25. A Few APIs
Site we would crawl: http://www.iitb.ac.in
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
Analyze the database:
bin/nutch readdb <db dir> -stats
bin/nutch readdb <db dir> -dumppageurl
bin/nutch readdb <db dir> -dumplinks
s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s
26. Map-Reduce Function
Works in a distributed environment
map() and reduce() functions are implemented in most of the modules
Both map() and reduce() functions use <key, value> pairs (see the word-count sketch below)
Useful for processing large data (e.g. indexing)
Some applications need a sequence of map-reduce jobs:
Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
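
To make the <key, value> flow concrete, here is the classic word-count job in Hadoop's Java MapReduce API: map() emits a (word, 1) pair per token and reduce() sums the counts per word. It is the standard textbook example, not Nutch code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // map(): one input line in, a (word, 1) pair out for every token.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE);
            }
        }
    }

    // reduce(): all counts for one word in, the summed total out.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}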
27. Map-Reduce Architecture
28. Nutch – Map-Reduce Indexing
map() just assembles all parts of documents
reduce() performs text analysis + indexing (see the sketch below):
Adds to a local Lucene index
Other possible MR indexing models:
Hadoop contrib/indexing model:
Analysis and indexing on map() side
Index merging on reduce() side
Modified Nutch model:
Analysis on map() side
Indexing on reduce() side
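
A minimal sketch of the "indexing on the reduce() side" model, assuming the map side emits (url, parsed-text) pairs. It mixes the modern Hadoop API with the old Lucene API used elsewhere in these slides, and the field names are illustrative; this is not Nutch's actual indexing code.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch: each reducer builds one local Lucene index shard.
public class IndexReducer extends Reducer<Text, Text, Text, Text> {
    private IndexWriter writer;

    @Override
    protected void setup(Context ctx) throws IOException {
        // One index directory per reduce task (old Lucene-style constructor).
        writer = new IndexWriter("shard-" + ctx.getTaskAttemptID().getTaskID().getId(),
                                 new StandardAnalyzer(), true);
    }

    @Override
    protected void reduce(Text url, Iterable<Text> texts, Context ctx)
            throws IOException {
        Document doc = new Document();
        doc.add(Field.UnIndexed("url", url.toString()));       // stored only
        for (Text t : texts)
            doc.add(Field.UnStored("contents", t.toString())); // analyzed + indexed here
        writer.addDocument(doc);
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        writer.close(); // the per-reducer shards can be merged afterwards
    }
}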
29. Nutch – Ranking
Nutch Ranking
queryNorm(): indicates the normalization factor for the query
coord(): indicates how many query terms are present in the given document
norm(): score indicating field-based normalization factor
tf: term frequency, and idf: inverse document frequency
t.boost(): score indicating the importance of a term's occurrence in a particular field
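
Assembled, these factors form Lucene's classic practical scoring function, which is presumably the formula pictured on the original slide:

\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{boost}() \cdot \mathrm{norm}(t,d) \Big)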
30. Lucene – Features
Field-based indexing and searching
Different fields of a webpage:
Title
URL
Anchor text
Content, etc.
Different boost factors to give importance to fields
Uses an inverted index to store the content of crawled documents
Open-source Apache project
31. Lucene – Index
Concepts
Index: sequence of documents (a.k.a. Directory)
Document: sequence of fields
Field: named sequence of terms
Term: a text string (e.g., a word)
Statistics
Term frequencies and positions
32. Writing to Index
// open an index in `directory`; true means create it from scratch
IndexWriter writer = new IndexWriter(directory, analyzer, true);

Document doc = new Document();
// add fields to document (next slide)
writer.addDocument(doc);
writer.close();
33. Adding Fields
doc.add(Field.Keyword("isbn", isbn));
doc.add(Field.Keyword("category", category));
doc.add(Field.Text("title", title));
doc.add(Field.Text("author", author));
doc.add(Field.UnIndexed("url", url));
doc.add(Field.UnStored("subjects", subjects, true));
doc.add(Field.Keyword("pubmonth", pubmonth));
doc.add(Field.UnStored("contents", author + " " + subjects));
doc.add(Field.Keyword("modified",
    DateField.timeToString(file.lastModified())));
34. Fields Description
Attributes
Stored: original content retrievable
Indexed: inverted, searchable
Tokenized: analyzed, split into tokens
Factory methods
Keyword: stored and indexed as a single term
Text: indexed, tokenized, and stored if a String
UnIndexed: stored only
UnStored: indexed and tokenized, not stored
Terms are what matters for searching
35. Searching an Index
IndexSearcher searcher = new IndexSearcher(directory);

// parse the expression against the "contents" field, using the
// same analyzer that was used at indexing time
Query query = QueryParser.parse(queryExpression, "contents", analyzer);

Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}
36. Analyzer
Analysis occurs
For each tokenized field during indexing
For each term or phrase in QueryParser
Several analyzers built in
Many more in the sandbox
Straightforward to create your own
Choosing the right analyzer is important!
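
The bracketed token lists on the next four slides can be reproduced with a few lines of the Lucene 1.4-era analysis API (the era these slides use; later Lucene versions replaced TokenStream.next() with an attribute-based API). A minimal sketch:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

// Prints each token the analyzer produces, in [brackets].
public class AnalyzerDemo {
    static void display(Analyzer analyzer, String text) throws Exception {
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        for (Token t = stream.next(); t != null; t = stream.next())
            System.out.print("[" + t.termText() + "] ");
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        display(new WhitespaceAnalyzer(),
                "The quick brown fox jumps over the lazy dog.");
    }
}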
37. Whitespace Analyzer
The quick brown fox jumps over the lazy dog.

[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
38. Simple Analyzer
The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
39. Stop Analyzer
The quick brown fox jumps over the lazy dog.

[quick] [brown] [fox] [jumps] [over] [lazy] [dog]
40. Snowball Analyzer
The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
41. Query Creation
Searching by a term – TermQuery (see the sketch after this list)
Searching within a range – RangeQuery
Searching on a prefix – PrefixQuery
Combining queries – BooleanQuery
Searching by phrase – PhraseQuery
Searching by wildcard – WildcardQuery
Searching for similar terms – FuzzyQuery
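
A short sketch of constructing several of these query types directly, again against the Lucene 1.4-era API (the three-argument BooleanQuery.add() shown here was later replaced by BooleanClause.Occur):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

public class QueryExamples {
    public static void main(String[] args) {
        // single term
        TermQuery term = new TermQuery(new Term("contents", "lucene"));

        // terms starting with a prefix
        PrefixQuery prefix = new PrefixQuery(new Term("contents", "luc"));

        // exact phrase: terms added in order
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("contents", "open"));
        phrase.add(new Term("contents", "source"));

        // wildcard and fuzzy matching
        WildcardQuery wild = new WildcardQuery(new Term("contents", "nut*"));
        FuzzyQuery fuzzy = new FuzzyQuery(new Term("contents", "lucine"));

        // combining: term is required, phrase is optional (old add() signature)
        BooleanQuery bool = new BooleanQuery();
        bool.add(term, true, false);
        bool.add(phrase, false, false);

        System.out.println(bool); // queries print their Lucene syntax
    }
}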
42. Lucene Queries
43. Conclusions
Nutch as a starting point
Crawling in Nutch
Detailed map-reduce architecture
Different query formats in Lucene
Built-in analyzers in Lucene
The same analyzer needs to be used both while indexing and searching
44. Resources Used
Gospodnetic, Otis; Hatcher, Erik (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. 456 pp. ISBN 978-1-932394-28-3.
Nutch Wiki: http://wiki.apache.org/nutch/
45. Thanks
Questions ??