Generating Links by Mining Quotations

OKAN KOLAK AND BILL N. SCHILIT

PRESENTATION BY DUSTIN SMITH
THE UNIVERSITY OF TEXAS AT AUSTIN
SCHOOL OF INFORMATION

Outline

• Introduction
• Challenges
• Algorithm
    • Phase 1: Generating the Shingle Table
    • Phase 2: Extracting Shared Sequences
    • Phase 3: Sequence Grouping
    • Filtering and Ranking
• User Interface
• Evaluation

INF384H 10/24/2011

Introduction

• What is the goal and why?
    • Engaging user interface in Google Books
    • Richer hypertext for scanned books
• Achieving these goals at scale for large sets of books
    • Via MapReduce

Challenges

• Mining quality quotations from millions of books in a scalable and efficient manner.
• Filtering out misleading quotations and ranking the good quotations by quality.
• Incorporating the proposed link structure online in a clear and effective way for users.

Algorithm: Phase 1

• Generation of shingle tables

Pass text through shingler
→ Text is parsed, normalized, and output as a stream of overlapping shingles
→ Generate a shingle table

Algorithm: Phase 1 (cont)

• Each book is passed through the shingler
• A shingle is a sequence of k consecutive tokens (here, words) from the text.
• Ex.
    • The 2-shingles of the text “a lucky dog” are “a lucky” and “lucky dog”.
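
The shingling step can be sketched in a few lines of Python. This is a minimal illustration, assuming the text has already been split into word tokens; the function name `shingles` is made up for this example and is not from the paper.

```python
def shingles(tokens, k):
    """Return the overlapping k-shingles of a token list, in order of appearance."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A text with fewer than k tokens yields no shingles.
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


print(shingles("a lucky dog".split(), 2))  # ['a lucky', 'lucky dog']
```

Because consecutive shingles overlap by k - 1 tokens, a text of n tokens produces n - k + 1 shingles.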

• Prior to shingling, the text is parsed and normalized.
• Possible normalizations:
    • Lowercasing
    • Removing punctuation and accents
    • Stemming
    • Removing stop-words
    • Collapsing numbers to single tokens
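
A rough sketch of such a normalizer follows. It covers lowercasing, accent and punctuation removal, number collapsing, and stop-word removal; stemming is left out to keep the example dependency-free. The stop-word list, the `<num>` placeholder, and the restriction to ASCII letters are assumptions made for illustration, not choices reported in the slides.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "and"}


def normalize(text, stop_words=STOP_WORDS):
    """Lowercase, strip accents and punctuation, collapse numbers, drop stop-words."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse each number (including forms like 1,984 or 3.14) to one token.
    text = re.sub(r"\d+(?:[.,]\d+)*", " <num> ", text)
    # Keep only the number placeholder and runs of ASCII letters.
    tokens = re.findall(r"<num>|[a-z]+", text)
    return [t for t in tokens if t not in stop_words]


print(normalize("The Café opened in 1,984!"))  # ['cafe', 'opened', 'in', '<num>']
```

The more aggressive the normalization, the more OCR and formatting variation two copies of the same quotation can tolerate and still produce identical shingles.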

• Shingle Tables

Key               Shingle info    Shingle info
Shingle key(1)    <B,i>           <B,i>
Shingle key(2)    <B,i>           <B,i>

• Shingle key: a unique shingle footprint
• B: ID of the book in which the shingle occurs
• i: index of the shingle within book B
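
As an in-memory illustration of this layout, the sketch below builds a table that maps each shingle key to its list of <B,i> entries. The footprint function is an assumption (a truncated SHA-1 digest stands in for whatever the authors use), and a Python dictionary replaces the large sort that a real run over millions of books would need.

```python
from collections import defaultdict
import hashlib


def shingle_key(shingle):
    """Footprint of a shingle; the truncated SHA-1 is an assumed stand-in."""
    return hashlib.sha1(shingle.encode("utf-8")).hexdigest()[:16]


def build_shingle_table(books, k):
    """Map shingle key -> list of (book_id, index) entries.

    `books` maps a book ID to its list of normalized tokens.
    """
    table = defaultdict(list)
    for book_id, tokens in books.items():
        for i in range(len(tokens) - k + 1):
            key = shingle_key(" ".join(tokens[i:i + k]))
            table[key].append((book_id, i))
    return table


books = {
    "B1": "all human beings are born free".split(),
    "B2": "we hold that all human beings are equal".split(),
}
table = build_shingle_table(books, 3)
print(table[shingle_key("all human beings")])  # [('B1', 0), ('B2', 3)]
```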

• Shingle Tables
    • Building the table requires a single linear pass and a very large sorting phase
    • The authors observe that quotes shorter than 8 words are not significant quotations, so they set the shingle length to 8 words.

Algorithm: Phase 2

• Involves extracting shingle sequences that are shared between books
• Books are processed one at a time
• Current book = “Source book”
• All other books = “Target books”

• Process for a single book:

Generate a list of shingles in the order that they appear
→ Take each shingle and use the shingle table to find all other occurrences
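
One way to picture this process in code is shown below. It is a simplified sketch, not the authors' pseudo-code: it walks the source book's shingles in order, looks each one up in the table, and merges matches that continue at the next index in the same target book into a single run. The function name and the tuple layout of the result are assumptions.

```python
def shared_sequences(source_id, source_keys, table):
    """Find runs of consecutive shingles that the source shares with targets.

    `source_keys` lists the source book's shingle keys in order of appearance;
    `table` maps a shingle key to a list of (book_id, index) entries.
    Returns (source_start, target_id, target_start, run_length) tuples, with
    run_length counted in shingles.
    """
    open_runs = {}  # (target_id, next expected target index) -> (src_start, tgt_start)
    finished = []
    for i, key in enumerate(source_keys):
        still_open = {}
        for target_id, j in table.get(key, []):
            if target_id == source_id:
                continue
            # Extend a run that ended at (i - 1, j - 1), or start a new one.
            start = open_runs.pop((target_id, j), (i, j))
            still_open[(target_id, j + 1)] = start
        # Runs that this shingle did not extend are complete.
        for (target_id, next_j), (src_start, tgt_start) in open_runs.items():
            finished.append((src_start, target_id, tgt_start, next_j - tgt_start))
        open_runs = still_open
    for (target_id, next_j), (src_start, tgt_start) in open_runs.items():
        finished.append((src_start, target_id, tgt_start, next_j - tgt_start))
    return sorted(finished)


demo_table = {
    "x": [("S", 0), ("T", 5)],
    "y": [("S", 1), ("T", 6)],
    "z": [("S", 2)],
    "w": [("S", 3), ("T", 9)],
}
print(shared_sequences("S", ["x", "y", "z", "w"], demo_table))
# [(0, 'T', 5, 2), (3, 'T', 9, 1)]
```

With shingles of k words, a run of r consecutive shingles corresponds to a shared passage of r + k - 1 words.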

• Pseudo-code for Phase 2: [Figure 1]

• MapReduce adaptation:

Mapper:
• Takes the shingle table as input
• Uses the equivalent method for looking up all shingle buckets for a given book’s shingles
• Emits (source book ID, relevant shingle bucket)

Reducer:
• Receives (source book ID, list of relevant shingle buckets)
• Uses the algorithm from the previous slide (Figure 1) with a few modifications
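
The data flow can be mimicked in plain Python without a MapReduce framework. In the sketch below, `mapper`, `reducer`, and `run` are illustrative names, `run` stands in for the shuffle phase, and the reducer only collects matching index pairs per target book rather than running the full Figure 1 algorithm.

```python
from collections import defaultdict


def mapper(shingle_key, bucket):
    """For one shingle-table row, emit the bucket once per book that contains it."""
    for book_id in sorted({b for b, _ in bucket}):
        yield book_id, (shingle_key, bucket)


def reducer(source_id, buckets):
    """Collect, per target book, the (source index, target index) pairs it shares."""
    matches = defaultdict(list)
    for _, bucket in buckets:
        source_positions = [i for b, i in bucket if b == source_id]
        for target_id, j in bucket:
            if target_id != source_id:
                matches[target_id].extend((i, j) for i in source_positions)
    return source_id, {t: sorted(p) for t, p in matches.items()}


def run(table):
    """Tiny in-memory stand-in for the shuffle between map and reduce."""
    grouped = defaultdict(list)
    for key, bucket in table.items():
        for book_id, value in mapper(key, bucket):
            grouped[book_id].append(value)
    return dict(reducer(book_id, values) for book_id, values in grouped.items())


mr_table = {
    "k1": [("A", 0), ("B", 4)],
    "k2": [("A", 1), ("B", 5)],
    "k3": [("A", 2)],
}
print(run(mr_table))
# {'A': {'B': [(0, 4), (1, 5)]}, 'B': {'A': [(4, 0), (5, 1)]}}
```

Keying the map output by source book ID means each reducer call sees everything it needs for one source book, so books can be processed independently and in parallel.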

• One notable issue:
    • Common shingles that are shared by many books greatly increase overhead.
    • These are often insignificant quotes and should be discarded.
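
Discarding such shingles amounts to a filter on bucket size. The sketch below assumes a cutoff parameter, `max_books`, whose name and value are hypothetical; the slides do not give the threshold the authors use.

```python
def drop_common_shingles(table, max_books):
    """Keep only buckets whose shingle occurs in at most `max_books` distinct books."""
    return {
        key: bucket
        for key, bucket in table.items()
        if len({book_id for book_id, _ in bucket}) <= max_books
    }


common_demo = {
    "rare": [("A", 0), ("B", 7)],
    "boilerplate": [("A", 3), ("B", 1), ("C", 9), ("D", 2)],
}
print(sorted(drop_common_shingles(common_demo, max_books=3)))  # ['rare']
```

Applying the filter before Phase 2 keeps the largest buckets out of the lookup step, which is where they would cost the most.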

Algorithm: Phase 3

• Sequence Grouping:
    • Why?
    • How does it work?

Filtering and Ranking

• The authors identify certain shared phrases as copyright sentences, legal boilerplate, publisher addresses, bibliography citations, or titles of other books by the author or publisher
• These are not desirable or quality quotations.
• They need to be filtered out

Filtering and Ranking (cont)

• Filtering:
    • Quotations on “low content” pages
    • Unusual characteristic filtering
        • Too many digits or special characters, repeated tokens, etc.
    • Book edition filtering
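
The unusual-characteristic filter lends itself to a small heuristic. The sketch below is one possible form, with assumed thresholds (`max_odd_ratio`, `max_repeat_ratio`) that would need tuning; it is not the authors' filter, and it does not cover the low-content-page or edition filters.

```python
def looks_unusual(quote, max_odd_ratio=0.2, max_repeat_ratio=0.5):
    """Flag quotes with many digits/special characters or heavily repeated tokens."""
    chars = [c for c in quote if not c.isspace()]
    tokens = quote.lower().split()
    if not chars or not tokens:
        return True
    # Every non-letter character (digit, punctuation, symbol) counts as "odd".
    odd = sum(1 for c in chars if not c.isalpha())
    # Number of tokens that repeat an earlier token.
    repeats = len(tokens) - len(set(tokens))
    return odd / len(chars) > max_odd_ratio or repeats / len(tokens) > max_repeat_ratio


print(looks_unusual("All human beings are born free and equal in dignity and rights"))  # False
print(looks_unusual("ISBN 0-19-852663-6 (c) 1998 2001 2004"))  # True
```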

• Ranking:
    • Some quotes are more interesting than others, e.g.: “The unemployment rate is the percentage of the labor force that is unemployed” vs. “All human beings are born free and equal in dignity and rights…”
    • This is difficult to distinguish automatically

• Scoring method for ranking
    • Basically: quotations that are too short or too long receive low scores
    • The optimal length lies in the middle ground, and a piecewise function is used to represent this scoring.
    • What defines “too short” and “too long” is determined by “experimental tuning”
    • The same scoring method is used for frequency
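
A piecewise function of this kind might look like the trapezoid below. The shape (linear ramps around a flat optimum) and all four breakpoints are assumptions chosen for illustration; the slides say only that the function is piecewise and that its parameters were tuned experimentally.

```python
def piecewise_score(x, low, best_low, best_high, high):
    """Trapezoid score: 0 outside (low, high), 1 on [best_low, best_high], linear between.

    Assumes low < best_low <= best_high < high.
    """
    if x <= low or x >= high:
        return 0.0
    if x < best_low:
        return (x - low) / (best_low - low)
    if x > best_high:
        return (high - x) / (high - best_high)
    return 1.0


# Hypothetical breakpoints for quotation length in words.
for words in (5, 14, 40, 130):
    print(words, piecewise_score(words, low=8, best_low=20, best_high=60, high=200))
```

The same function can score frequency by passing the number of books that contain the quotation, with its own set of breakpoints.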

User Interface

• How to present this concept of general links between books?
    • “Popular Passages”, not “Quotations”
• Display issues:
    • Long quotes containing shorter, more familiar quotes
    • Quote order variations
• Skyline vectors are used to address these issues, and they do so effectively.
    • Basically, the “best” quotes are chosen for presentation to the user
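
The slides do not define skyline vectors, so the following is only one plausible reading: for every token position on a page, keep the highest-scoring quote that covers it, and show only quotes that come out on top somewhere. The representation of a quote as `(start, end, score)` and both function names are assumptions.

```python
def skyline(quotes, length):
    """For each position 0..length-1, the index of the best-scoring covering quote.

    `quotes` is a list of (start, end, score) with `end` exclusive.
    Positions covered by no quote hold None.
    """
    best = [None] * length
    for q, (start, end, score) in enumerate(quotes):
        for pos in range(max(start, 0), min(end, length)):
            if best[pos] is None or score > quotes[best[pos]][2]:
                best[pos] = q
    return best


def visible_quotes(quotes, length):
    """Quotes that are on top of the skyline for at least one position."""
    return sorted({q for q in skyline(quotes, length) if q is not None})


page_quotes = [(0, 10, 0.3), (3, 6, 0.9), (20, 25, 0.5), (21, 24, 0.2)]
print(visible_quotes(page_quotes, 30))  # [0, 1, 2]
```

Under this reading, a short, familiar quote embedded in a longer one (quote 1 inside quote 0) stays visible, while a weaker quote fully covered by a stronger one (quote 3 inside quote 2) is hidden.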

User Interface (cont)

• Navigation within books
    • Goals:
        • Provide a general feel for the book
        • Provide an interface in which the user can quickly navigate to important passages within the book

User Interface (cont)

• Navigation between books

Evaluation

• Manual labeling to determine accuracy
• Passive user study over a 30-day period
• Analysis of the distribution of link types within Google’s scanned books

Evaluation (cont)

• Manual labeling:
    • Sampled 120 passages from low scores and 120 from high scores (to avoid precision bias).
    • Used a Likert scale of 1 to 5, with 1-2 meaning good, 3 meaning neutral, and 4-5 meaning bad.
    • Inter-annotator agreement was 88.5% (± 3.5% to account for neutral labels)
    • 88% marked good
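
To make the labeling scheme concrete, the sketch below maps Likert ratings to the three classes and computes two agreement figures for a pair of annotators. The slides do not say how the ± 3.5% was derived; treating neutral labels as either disagreement (strict) or agreement (lenient) is an assumption, and the ratings in the example are made up.

```python
def label(rating):
    """Map a 1-5 Likert rating to the slide's three classes."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be an integer from 1 to 5")
    return "good" if rating <= 2 else "neutral" if rating == 3 else "bad"


def agreement_bounds(ratings_a, ratings_b):
    """Fraction of items on which two annotators agree (inputs must be non-empty).

    Returns (strict, lenient): `strict` requires identical classes;
    `lenient` also counts a pair as agreeing when either label is neutral.
    """
    pairs = [(label(a), label(b)) for a, b in zip(ratings_a, ratings_b)]
    strict = sum(a == b for a, b in pairs)
    lenient = sum(a == b or "neutral" in (a, b) for a, b in pairs)
    return strict / len(pairs), lenient / len(pairs)


# Made-up ratings for four passages, for illustration only.
print(agreement_bounds([1, 2, 3, 5], [2, 4, 1, 5]))  # (0.5, 0.75)
```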

Evaluation (cont)

• User study:
    • Consisted of monitoring user activity in Google Books.
    • Specifically, whether users navigated via popular passages (Quotations), links to other editions of the book (Editions), links to similar books within a cluster (Related), or links to books that cite the current book (Cited By)
    • Results [figure]

Evaluation (cont)

• Coverage:
    • What is the distribution of these link types in scanned books?

Related Work & Future Work

• Related Work
    • Automatic Hypertext
    • Plagiarism Detection
• Future Work
    • Improved Ranking
    • Incremental Processing
    • Primary Source Identification
    • Attribution

Questions + Discussion

The End.

Questions & discussion.

….Go Rangers!
