Generating Links by Mining Quotations

   OKAN KOLAK AND BILL N. SCHILIT

    PRESENTATION BY DUSTIN SMITH
  THE UNIVERSITY OF TEXAS AT AUSTIN
       SCHOOL OF INFORMATION
Outline

• Introduction
• Challenges
• Algorithm
  • Phase 1: Generating the Shingle Table
  • Phase 2: Extracting Shared Sequences
  • Phase 3: Sequence Grouping
  • Filtering and Ranking
• User Interface
• Evaluation

INF384H                                     10/24/2011
Introduction

• What is the goal and why?
  • Engaging user interface in Google Books
  • Richer hypertext for scanned books
  • Achieving these goals at scale for large sets of books
    • Via MapReduce

Challenges

• Mining quality quotations from millions of books in a scalable and efficient manner.
• Filtering out misleading quotations and ranking the good quotations by quality.
• Incorporating the proposed link structure online in a clear and effective way for users.

Algorithm: Phase 1

• Generation of shingle tables:
  1. Pass the text through the shingler.
  2. The text is parsed, normalized, and output as a stream of overlapping shingles.
  3. Generate a shingle table.

Algorithm: Phase 1 (cont)

• Each book is passed through the shingler.
• A k-shingle is a sequence of k consecutive tokens (here, words) of the text.
• Example:
  • The 2-shingles of the text “a lucky dog” are “a lucky” and “lucky dog”.

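A minimal sketch of the shingling step, assuming whitespace tokenization; the function name `shingles` is hypothetical and this is not the authors' implementation:

```python
def shingles(tokens, k):
    """Return the overlapping k-shingles of a token list, in reading order."""
    # A text with fewer than k tokens yields no shingles at all.
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


print(shingles("a lucky dog".split(), 2))  # ['a lucky', 'lucky dog']
```

Each shingle overlaps its neighbor by k-1 tokens, so a text of n tokens produces n-k+1 shingles.
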
Algorithm: Phase 1 (cont)

• Prior to shingling, the text is parsed and normalized.
• Possible normalizations:
  • Lowercasing
  • Removing punctuation and accents
  • Stemming
  • Removing stop words
  • Collapsing numbers to single tokens

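The normalizations listed above could be sketched roughly as follows. This is only an illustration: the stop-word list and the `NUM` placeholder token are assumptions, stemming is left out because it would need an external stemmer, and the slides do not say which of these steps the authors actually apply.

```python
import re
import unicodedata

STOP_WORDS = {"a", "an", "the", "of", "and"}  # illustrative list only


def normalize(text, drop_stop_words=False):
    """Turn raw text into a list of normalized tokens."""
    text = text.lower()
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse each number (e.g. "1984" or "1,234") into one placeholder token.
    text = re.sub(r"\d+(?:[.,]\d+)*", " NUM ", text)
    # Replace remaining punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(normalize("Café, in 1984!"))  # ['cafe', 'in', 'NUM']
```
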
Algorithm: Phase 1 (cont)

• Shingle tables

  Key              Shingle info   Shingle info
  Shingle key(1)   <B,i>          <B,i>
  Shingle key(2)   <B,i>          <B,i>

• Shingle key: a unique shingle footprint
• B: ID of a book in which the shingle occurs
• i: index of the shingle within book B

Algorithm: Phase 1 (cont)

• Shingle tables
  • Building the table requires a single linear pass over the text and a very large sorting phase.
  • The authors observe that quotes shorter than 8 words are not significant quotations, so they set the shingle length to 8 words.

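A small in-memory sketch of a shingle table, mapping each shingle key to its list of <B,i> entries. The names `shingle_key` and `build_shingle_table` are hypothetical, a truncated SHA-256 digest stands in for the shingle key, and a Python dictionary stands in for the large sorting phase that a real run over millions of books would need.

```python
import hashlib
from collections import defaultdict


def shingle_key(shingle_tokens):
    """Hash a shingle into a short key (an assumed stand-in for the real key)."""
    text = " ".join(shingle_tokens).encode("utf-8")
    return hashlib.sha256(text).hexdigest()[:16]


def build_shingle_table(books, k=8):
    """books: {book_id: list of normalized tokens}. Returns key -> [(B, i), ...]."""
    table = defaultdict(list)
    for book_id, tokens in books.items():
        # One linear pass per book; i is the shingle's index within the book.
        for i in range(len(tokens) - k + 1):
            table[shingle_key(tokens[i:i + k])].append((book_id, i))
    return table


books = {"A": "x y z".split(), "B": "y z w".split()}
table = build_shingle_table(books, k=2)  # k=2 only to keep the example tiny
print(table[shingle_key(["y", "z"])])  # [('A', 1), ('B', 0)]
```
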
Algorithm: Phase 2

• Involves extracting shingles that are shared between books
• Books are processed one at a time
  • Current book = “source book”
  • All other books = “target books”

Algorithm: Phase 2 (cont)

• Process for a single book:
  1. Generate a list of shingles in the order that they appear.
  2. Take each shingle and use the shingle table to find all other occurrences.

Algorithm: Phase 2 (cont)

• Pseudo-code for Phase 2 (Figure 1; the figure is not reproduced in this transcript)

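The following is a rough Python sketch of one way to carry out the per-book process described for Phase 2: walk the source book's shingles in order, look each one up in the shingle table, and merge consecutive matches in the same target book into longer shared sequences. It is an assumption-laden illustration, not the authors' Figure 1; the function name and the output tuple layout are hypothetical.

```python
def shared_sequences(source_id, source_keys, table):
    """Find runs of consecutive shingles shared with other books.

    source_keys: the source book's shingle keys, in order of appearance.
    table: shingle key -> list of (book_id, index) entries.
    Returns tuples (target_book, source_start, target_start, n_shingles).
    """
    open_runs = {}  # (target_book, index of last matched shingle) -> run
    finished = []
    for i, key in enumerate(source_keys):
        next_open = {}
        for book, j in table.get(key, []):
            if book == source_id:
                continue
            # Extend a run that ended at the previous shingle of this target.
            run = open_runs.pop((book, j - 1), None)
            if run is None:
                run = [i, j, 0]  # [source_start, target_start, length]
            run[2] += 1
            next_open[(book, j)] = run
        # Runs that were not extended at this position are complete.
        finished.extend((b, *r) for (b, _), r in open_runs.items())
        open_runs = next_open
    finished.extend((b, *r) for (b, _), r in open_runs.items())
    return finished


table = {
    "k1": [("S", 0), ("T", 5)],
    "k2": [("S", 1), ("T", 6)],
    "k3": [("S", 2), ("U", 0)],
}
print(shared_sequences("S", ["k1", "k2", "k3"], table))
# [('T', 0, 5, 2), ('U', 2, 0, 1)]
```
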
Algorithm: Phase 2 (cont)

• MapReduce adaptation:
  • Mapper:
    • Takes the shingle table as input
    • Uses the equivalent method for looking up all shingle buckets for a given book’s shingles
    • Emits (source book ID, relevant shingle bucket)
  • Reducer:
    • Receives (source book ID, list of relevant shingle buckets)
    • Uses the algorithm from the previous slide (Figure 1) with a few modifications

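To make the mapper/reducer split concrete, here is a toy single-process simulation in plain Python. It uses no MapReduce framework, and every function name is hypothetical. The mapper emits each shared bucket once per book that appears in it, and the reducer only orders a source book's buckets by position, where the real reducer would run the sequence-extraction algorithm.

```python
from collections import defaultdict


def mapper(key, bucket):
    """Emit (source book ID, bucket) for every book in a shared bucket."""
    book_ids = sorted({book for book, _ in bucket})
    if len(book_ids) < 2:
        return  # a shingle found in a single book links nothing
    for book in book_ids:
        yield book, (key, bucket)


def reducer(source_id, buckets):
    """Order the source book's shared shingles by their position in the book."""
    rows = []
    for _key, bucket in buckets:
        targets = [(b, j) for b, j in bucket if b != source_id]
        for b, i in bucket:
            if b == source_id:
                rows.append((i, targets))
    return sorted(rows)


def run_job(table):
    """Simulate map, shuffle (group by source book ID), and reduce."""
    grouped = defaultdict(list)
    for key, bucket in table.items():
        for book, value in mapper(key, bucket):
            grouped[book].append(value)
    return {book: reducer(book, values) for book, values in grouped.items()}


table = {
    "k1": [("A", 0), ("B", 3)],
    "k2": [("A", 1), ("B", 4)],
    "k3": [("A", 2)],
}
print(run_job(table)["A"])  # [(0, [('B', 3)]), (1, [('B', 4)])]
```
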
Algorithm: Phase 2 (cont)

• One notable issue:
  • Common shingles that are shared by many books greatly increase overhead.
  • These are often insignificant quotes and should be discarded.

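One simple way to discard overly common shingles is to drop any bucket that spans more than some number of distinct books. The sketch below assumes this approach; the function name and the `max_books` threshold are hypothetical, and the slides do not state the actual cutoff.

```python
def prune_common(table, max_books=100):
    """Drop shingle buckets that occur in more than max_books distinct books."""
    return {
        key: bucket
        for key, bucket in table.items()
        if len({book for book, _ in bucket}) <= max_books
    }


table = {
    "rare": [("A", 0), ("B", 7)],
    "common": [("A", 1), ("B", 2), ("C", 3)],
}
print(sorted(prune_common(table, max_books=2)))  # ['rare']
```
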
Algorithm: Phase 3

• Sequence Grouping:
  • Why?

Algorithm: Phase 3 (cont)

• Sequence Grouping:
  • How does it work?

Filtering and Ranking

• The authors identify certain kinds of shared passages: copyright sentences, legal boilerplate, publisher addresses, bibliography citations, and titles of other books by the author or publisher.
  • These are not desirable or quality quotations.
  • They need to be filtered out.

Filtering and Ranking (cont)

• Filtering:
  • Quotations on “low content” pages
  • Unusual characteristic filtering
    • Too many digits or special characters, repeated tokens, etc.
  • Book edition filtering

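The "unusual characteristic" filter could look something like the sketch below, which flags a passage when digits, special characters, or repeated tokens exceed a ratio. The function name and all three thresholds are assumptions chosen for illustration, not values from the paper.

```python
def looks_unusual(tokens, max_digit_ratio=0.3, max_special_ratio=0.2,
                  max_repeat_ratio=0.5):
    """Return True if a candidate passage should be filtered out."""
    text = "".join(tokens)
    if not text:
        return True  # nothing to show
    digit_ratio = sum(ch.isdigit() for ch in text) / len(text)
    special_ratio = sum(not ch.isalnum() for ch in text) / len(text)
    # Share of tokens that merely repeat an earlier token.
    repeat_ratio = 1 - len(set(tokens)) / len(tokens)
    return (digit_ratio > max_digit_ratio
            or special_ratio > max_special_ratio
            or repeat_ratio > max_repeat_ratio)


print(looks_unusual("all human beings are born free and equal".split()))  # False
print(looks_unusual("1 2 3 4 5 6 7 8".split()))  # True
```
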
Filtering and Ranking (cont)

• Ranking:
  • Some quotes are more interesting than others, e.g., “The unemployment rate is the percentage of the labor force that is unemployed” vs. “All human beings are born free and equal in dignity and rights…”
  • This is difficult to distinguish automatically.

Filtering and Ranking (cont)

• Scoring method for ranking:
  • Quotations that are too short or too long receive low scores.
  • The optimal length lies in the middle ground; a piecewise function represents this scoring.
  • What counts as “too short” and “too long” is determined by “experimental tuning”.
  • The same scoring method is used for frequency.

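The slides say only that a piecewise function is used. As an illustration, the sketch below assumes a piecewise-linear (trapezoid) shape that is zero outside a range, rises to 1, stays flat over the preferred range, and falls back to zero. The function name and all breakpoints are hypothetical; the real values come from the authors' experimental tuning. The same function could be applied to either length or frequency.

```python
def piecewise_score(x, low, best_lo, best_hi, high):
    """Trapezoid score in [0, 1]; assumes low < best_lo <= best_hi < high."""
    if x <= low or x >= high:
        return 0.0  # too short or too long (or too rare / too frequent)
    if x < best_lo:
        return (x - low) / (best_lo - low)  # rising edge
    if x <= best_hi:
        return 1.0  # preferred middle range
    return (high - x) / (high - best_hi)  # falling edge


# Hypothetical breakpoints, in words, for a length score.
for length in (5, 25, 70):
    print(length, piecewise_score(length, low=8, best_lo=15, best_hi=40, high=100))
# 5 0.0
# 25 1.0
# 70 0.5
```
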
User Interface

• How to present this concept of general links between books?
• “Popular Passages”, not “Quotations”
• Display issues:
  • Long quotes containing shorter, more familiar quotes
  • Quote order variations
• Skyline vectors are used to address these issues, and they do so effectively.
  • Basically, the “best” quotes are chosen for presentation to the user.

User Interface (cont)

• Navigation within books
  • Goals:
    • Provide a general feel for the book
    • Provide an interface in which the user can quickly navigate to important passages within the book

User Interface (cont)

• Navigation between books

Evaluation

• Manual labeling to determine accuracy
• Passive user study over a 30-day period
• Analysis of the distribution of link types within Google’s scanned books

Evaluation (cont)

• Manual labeling:
  • Sampled 120 passages from low scores and 120 from high scores (to avoid precision bias).
  • Used a Likert scale of 1 to 5, with 1-2 meaning good, 3 meaning neutral, and 4-5 meaning bad.
  • Inter-annotator agreement was 88.5% (± 3.5% to account for neutral labels).
  • 88% were marked good.

Evaluation (cont)

• User study:
  • Consisted of monitoring user activity in Google Books.
    • Specifically, whether users navigated via popular passages (Quotations), links to other editions of the book (Editions), links to similar books within a cluster (Related), or links to books that cite the current book (Cited By).
    • Results →

Evaluation (cont)
Evaluation (cont)

• Coverage:
  • What is the distribution of these link types in scanned books?

Related Work & Future Work

• Related Work
  • Automatic Hypertext
  • Plagiarism Detection
• Future Work
  • Improved Ranking
  • Incremental Processing
  • Primary Source Identification
  • Attribution

Questions + Discussion

The End.

Questions & discussion.

…Go Rangers!
