Generating Links by Mining Quotations
1. Generating Links by Mining Quotations
OKAN KOLAK AND BILL N. SCHILIT
PRESENTATION BY DUSTIN SMITH
THE UNIVERSITY OF TEXAS AT AUSTIN
SCHOOL OF INFORMATION
3. Introduction
What is the goal and why?
Engaging user interface in Google Books
Richer hypertext for scanned books
Achieving these goals at scale for large sets of books
Via MapReduce
INF384H 10/24/2011
4. Challenges
Mining quality quotations from millions of books in a
scalable and efficient manner.
Filtering out misleading quotations and ranking the
good quotations based on quality.
Incorporating the proposed link structure online in a
clear and effective way for users.
5. Algorithm: Phase 1
Generation of shingle tables
Pass text through shingler
Text is parsed, normalized, and output as a stream of overlapping shingles
Generate a shingle table
6. Algorithm: Phase 1 (cont)
Each book is passed through the shingler
A k-shingle is a sequence of k consecutive words of the text.
Ex.
The 2-shingles of the text “a lucky dog” are “a lucky” and
“lucky dog”.
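A minimal sketch of the shingler, assuming the text is already split into word tokens. The function name `shingles` is illustrative and not taken from the paper.

```python
def shingles(tokens, k):
    """Return every run of k consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


tokens = "a lucky dog".split()
two_shingles = [" ".join(s) for s in shingles(tokens, 2)]
print(two_shingles)  # ['a lucky', 'lucky dog']
```

A text with fewer than k tokens yields no shingles.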
7. Algorithm: Phase 1 (cont)
Prior to shingling, the text is parsed and normalized.
Possible normalizations:
Lowercasing
Removing punctuation and accents
Stemming
Removing stop-words
Collapsing numbers to single tokens
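A rough sketch of how some of these normalizations might be chained, using only the Python standard library. The stop-word list and the `NUM` placeholder are assumptions made for illustration; stemming is left out because it would need an external stemmer.

```python
import re
import unicodedata

STOP_WORDS = {"a", "an", "the", "of", "and"}  # tiny illustrative list (assumption)


def normalize(text, remove_stop_words=False):
    """Lowercase, strip accents and punctuation, collapse numbers, tokenize."""
    text = text.lower()
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse each number (including 1,984 or 3.14) to one placeholder token.
    text = re.sub(r"\d+(?:[.,]\d+)*", " NUM ", text)
    # Replace the remaining punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(normalize("The Café opened in 1,984!"))
```

Which of these steps the authors actually apply is not stated on the slide, so each one is optional here.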
8. Algorithm: Phase 1 (cont)
Shingle Tables
Key                Shingle info    Shingle info
Shingle key (1)    <B,i>           <B,i>
Shingle key (2)    <B,i>           <B,i>
Shingle key: a unique shingle footprint
B: Book ID where the shingle exists
i: index (position) of the shingle within book B
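A small in-memory sketch of such a table, assuming each book is already a list of normalized tokens. The truncated SHA-1 footprint and the names `shingle_key` and `build_shingle_table` are assumptions for illustration. At the scale of millions of books the grouping would come from a large sort rather than a dictionary.

```python
import hashlib
from collections import defaultdict


def shingle_key(shingle):
    """Fixed-size footprint of a shingle (truncated SHA-1; an assumption)."""
    return hashlib.sha1(" ".join(shingle).encode("utf-8")).hexdigest()[:16]


def build_shingle_table(books, k):
    """Map each shingle key to the list of <B, i> entries where it occurs."""
    table = defaultdict(list)
    for book_id, tokens in books.items():
        for i in range(len(tokens) - k + 1):
            table[shingle_key(tokens[i:i + k])].append((book_id, i))
    return table


books = {
    "B1": "all human beings are born free".split(),
    "B2": "he wrote that all human beings are equal".split(),
}
table = build_shingle_table(books, k=3)
print(table[shingle_key(["all", "human", "beings"])])  # [('B1', 0), ('B2', 3)]
```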
9. Algorithm: Phase 1 (cont)
Shingle Tables
Requires a single linear pass and a very large sorting phase
The authors observe that quotes shorter than 8 words are not significant
quotations, so they set the shingle length to 8 words.
10. Algorithm: Phase 2
Involves extracting shingles that are shared between
books
Books are processed one at a time
Current book = “Source book”
All other books = “Target books”
11. Algorithm: Phase 2 (cont)
Process for a single book:
Generate a list of shingles in the order that they appear
Take each shingle and use the shingle table to find all other occurrences
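The per-book process can be sketched as follows, assuming a shingle table that maps each shingle to its (book ID, index) entries. Raw token tuples are used as keys here to keep the example short, and the function names are illustrative.

```python
from collections import defaultdict


def build_table(books, k):
    """Toy shingle table: shingle tuple -> list of (book_id, index)."""
    table = defaultdict(list)
    for book_id, tokens in books.items():
        for i in range(len(tokens) - k + 1):
            table[tuple(tokens[i:i + k])].append((book_id, i))
    return table


def shared_shingles(source_id, source_tokens, table, k):
    """Walk the source book's shingles in order; collect occurrences in target books."""
    matches = []
    for i in range(len(source_tokens) - k + 1):
        key = tuple(source_tokens[i:i + k])
        others = [(b, j) for (b, j) in table.get(key, []) if b != source_id]
        if others:
            matches.append((i, others))
    return matches


books = {
    "B1": "all human beings are born free".split(),
    "B2": "he wrote that all human beings are equal".split(),
}
table = build_table(books, k=3)
matches = shared_shingles("B1", books["B1"], table, k=3)
print(matches)  # [(0, [('B2', 3)]), (1, [('B2', 4)])]
```

Each match pairs a position in the source book with the positions of the same shingle in target books.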
13. Algorithm: Phase 2 (cont)
MapReduce adaptation:
Mapper:
Start with shingle table as input into the Mapper
Use the equivalent method for looking up all shingle buckets for
a given book’s shingles
Emit (source book ID, relevant shingle bucket)
Reducer:
Input (source book ID, list of relevant shingle buckets)
Use the algorithm from previous slide (Figure 1) with a few
modifications
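A toy, single-process simulation of this mapper/reducer split, under the assumption that each table row is a shingle key with its bucket of (book ID, index) entries. The reducer below only pairs source positions with target occurrences; it is a stand-in for illustration, not the Figure 1 algorithm.

```python
from collections import defaultdict


def mapper(shingle_key, bucket):
    """Emit the bucket once for every book that appears in it."""
    for book_id in sorted({b for b, _ in bucket}):
        yield book_id, bucket


def reducer(source_id, buckets):
    """Pair each source position with the occurrences in target books."""
    links = []
    for bucket in buckets:
        targets = [(b, j) for b, j in bucket if b != source_id]
        for b, i in bucket:
            if b == source_id and targets:
                links.append((i, targets))
    return source_id, sorted(links)


def run(table):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for key, bucket in table.items():
        for book_id, value in mapper(key, bucket):
            grouped[book_id].append(value)
    return dict(reducer(b, v) for b, v in grouped.items())


table = {
    "k1": [("B1", 0), ("B2", 3)],
    "k2": [("B1", 1), ("B2", 4)],
    "k3": [("B1", 2)],
}
result = run(table)
print(result["B1"])  # [(0, [('B2', 3)]), (1, [('B2', 4)])]
```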
14. Algorithm: Phase 2 (cont)
One notable issue:
Common shingles that are shared by many books will greatly
increase overhead.
These are often insignificant quotes and should be discarded.
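One way to discard such shingles is to cap the number of distinct books a bucket may span. The `max_books` threshold below is a hypothetical parameter; the slide does not give the cutoff the authors use.

```python
def drop_common_shingles(table, max_books):
    """Keep only buckets whose shingle appears in at most max_books distinct books."""
    return {
        key: bucket
        for key, bucket in table.items()
        if len({book_id for book_id, _ in bucket}) <= max_books
    }


table = {
    "rare": [("B1", 0), ("B2", 7)],
    "common": [("B1", 5), ("B2", 1), ("B3", 9), ("B4", 2)],
}
filtered = drop_common_shingles(table, max_books=3)
print(sorted(filtered))  # ['rare']
```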
16. Algorithm: Phase 3 (cont)
Sequence Grouping:
How does it work?
17. Filtering and Ranking
The authors identify certain repeated passages as copyright sentences,
legal boilerplate, publisher addresses, bibliography
citations, and titles of other books by the author or publisher.
These are not desirable, quality quotations and
need to be filtered out.
18. Filtering and Ranking (cont)
Filtering:
• Quotations on “low content” pages
• Unusual characteristic filtering
• Too many digits or special characters, repeated tokens, etc.
• Book edition filtering
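A hedged sketch of what an unusual-characteristic filter could look like. The three ratios and their thresholds are assumptions chosen for illustration; the slide names only the kinds of signals (digits, special characters, repeated tokens).

```python
def looks_unusual(passage, max_digit_ratio=0.2, max_special_ratio=0.2,
                  max_repeat_ratio=0.5):
    """Flag passages dominated by digits, special characters, or repeated tokens."""
    tokens = passage.split()
    if not tokens:
        return True
    chars = [c for c in passage if not c.isspace()]
    digit_ratio = sum(c.isdigit() for c in chars) / len(chars)
    special_ratio = sum(not c.isalnum() for c in chars) / len(chars)
    # Share of tokens that repeat an earlier token (case-insensitive).
    repeat_ratio = 1 - len({t.lower() for t in tokens}) / len(tokens)
    return (digit_ratio > max_digit_ratio
            or special_ratio > max_special_ratio
            or repeat_ratio > max_repeat_ratio)


print(looks_unusual("All human beings are born free and equal in dignity and rights"))
print(looks_unusual("ISBN 0-19-852663-6 1998 2001 2003"))
```

The first call prints False and the second prints True under these thresholds.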
19. Filtering and Ranking (cont)
Ranking:
Some quotes are more interesting than others, e.g.:
“The unemployment rate is the percentage of the
labor force that is unemployed” vs. “All human
beings are born free and equal in dignity and
rights…”
• This is difficult to distinguish automatically
20. Filtering and Ranking (cont)
Scoring method for ranking
Basically:
Passages that are too short or too long receive low scores.
The optimal length lies in the middle ground; a
piecewise function represents this scoring.
• What defines “too short” and “too long” is
determined by “experimental tuning”
• Same scoring method for frequency
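A sketch of a piecewise score of this kind, assuming a trapezoid shape: zero outside the outer bounds, one inside the ideal range, and linear ramps between. The breakpoints and the choice to multiply the length and frequency scores are hypothetical; the authors set their values by experimental tuning.

```python
def piecewise_score(x, low, ideal_min, ideal_max, high):
    """Trapezoid score in [0, 1]: 0 outside (low, high), 1 inside the ideal range."""
    if x <= low or x >= high:
        return 0.0
    if x < ideal_min:
        return (x - low) / (ideal_min - low)
    if x <= ideal_max:
        return 1.0
    return (high - x) / (high - ideal_max)


def passage_score(length, frequency):
    """Combine length and frequency scores (product is an assumption)."""
    length_score = piecewise_score(length, 8, 15, 40, 100)
    frequency_score = piecewise_score(frequency, 1, 5, 200, 5000)
    return length_score * frequency_score


print(piecewise_score(70, 8, 15, 40, 100))  # 0.5
print(passage_score(length=20, frequency=50))  # 1.0
```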
21. User Interface
How to present this concept of general links between
books?
“Popular Passages” not “Quotations”
Display issues:
Long quotes containing shorter, more familiar quotes
Quote order variations
Skyline vectors are used to address these issues and
do so effectively.
• Basically the “best” quotes are chosen for presentation to the
user
22. User Interface (cont)
Navigation within books
Goals:
Provide a general feel for the book
Provide an interface in which the user can quickly navigate to
important passages within the book
24. Evaluation
Manual labeling to determine accuracy
User study (passive) over a 30-day period
Analysis of distribution of link types within Google’s
scanned books.
25. Evaluation (cont)
Manual labeling:
• Sampled 120 passages from low scores and 120 from
high scores (to avoid precision bias).
• Use a Likert scale of 1 to 5 with 1-2 meaning good, 3
meaning neutral, and 4-5 meaning bad.
• Inter-annotator agreement was 88.5% (± 3.5% to
account for neutral labels)
• 88% marked good
26. Evaluation (cont)
User study:
• Consisted of monitoring user activity in Google
Books.
• Specifically, whether they navigated via popular passages
(Quotations); other book edition links (Editions); to other
similar books within a cluster (Related); or to books that cite
the current book (Cited By)
• Results