Experiences in Mass Digitization: Examining OCR Quality at Scale
Paul Fogel, California Digital Library, University of California
HathiTrust
 
The Partnership
Consortia: Committee on Institutional Cooperation; Triangle Research Libraries Network; University of California
Individual Institutions: Arizona State University; Baylor University; Boston University; California Digital Library; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Harvard University Library; Indiana University; Johns Hopkins University; Lafayette College; Library of Congress; Mass. Inst. of Technology; Michigan State University; New York University; New York Public Library; North Carolina Central University; North Carolina State University; Northwestern University; The Ohio State University; The Pennsylvania State University; Princeton University; Purdue University; Stanford University; Texas A&M University; Universidad Complutense de Madrid; University of California Berkeley; University of California Davis; University of California Irvine; University of California Los Angeles; University of California Merced; University of California Riverside; University of California San Diego; University of California San Francisco; University of California Santa Barbara; University of California Santa Cruz; The University of Chicago; University of Connecticut; University of Florida; University of Illinois; University of Illinois at Chicago; The University of Iowa; University of Maryland; University of Michigan; University of Minnesota; University of Nebraska-Lincoln; The University of North Carolina; University of Notre Dame; University of Pennsylvania; University of Pittsburgh; University of Utah; University of Virginia; University of Washington; University of Wisconsin-Madison; Utah State University; Yale University Library
The Corpus
 
Visualizations
 
 
Issues of Scale: The Index
Size
Performance or
Issues of Scale: Languages
Lots of them
A sad example: the page image / the extracted text
Complexity
Issues of Scale: OCR Failures and Correction
OCR Failures
Corrections
Corrections
Corrections
Thank you. [email_address] http://www.cdlib.org/services/collections/massdig/

IMPACT Final Conference - Paul Fogel


Editor's Notes

  1. Thank you; grateful for the invitation to attend this conference and learn about the work that is being done in this project. Before I begin, I’d like to thank TBW at the Univ. of Michigan, who has led the majority of the work that I’ll be describing (and continues to focus on improving the services we offer).
  2. I’ve bolded the only non-US institution to join thus far. There are another 13 or so partners that should be announced soon, including several Canadian institutions.
  3. Almost 10 million books. Number of words in the corpus: between 800 billion and 1 trillion.
  4. I’ll get back to languages in a bit.
  5. Hathi uses Solr/Lucene for indexing and search. Size of the index: 6 TB total, at least 2/3 the size of the text files used to produce it. Split into 12 shards across 6 servers; 2 servers used for indexing. Nightly incremental indexing; complete reindexing has been done only twice (at current size, it would take 10-14 days to run; with some tweaks we should be able to get it to 4-5 days). Each index shard contains ~3 billion unique terms (but the total is not 3 billion x 12, because the same terms recur across shards).
  6. Performance: a great deal of work has gone into memory management; constant monitoring of query response times; wildcarding is not possible because of query latency (resulting from the number of tokens/terms in the index).
  7. Over 400 languages in Hathi. Google’s OCR engine (formerly Abbyy, recently switched to Tesseract) can handle about 20 well and only about 60 in total; there is very limited “ground truth” text for all but the “top” 20 languages. When the OCR engine is unable to determine the language, gibberish or empty OCR may be the result. We recently found that only 0.069% (6730) of books have empty OCR; the rest of the 340-odd languages probably just have gibberish.
  8. Example of a book in Mongolian. Page image: http://babel.hathitrust.org/cgi/pt?id=mdp.39015025118004;seq=51;size=125;view=image Extracted text: http://babel.hathitrust.org/cgi/pt?seq=51;id=mdp.39015025118004;page=root;view=plaintext;size=100;orient=0
  9. Complexity. Image of...? A key that fits all locks? Filtering or pre-processing of dirty OCR (prior to indexing) will have to work across all languages. Content is from a variety of disciplines and spans a wide range of time periods; improving the OCR training or cleanup tasks would require different kinds of dictionaries: technical, academic, etc. What about multi-language texts? Ground truth collection: Google is trying to collect as much as possible; in the academic setting, would it be possible to establish an open text ground truth center? A lot of work and difficult to achieve, but worth thinking about.
  10. Obvious that it is much more effective to make changes to the OCR engine than to make improvements after OCR.
  11. Mis-identification of page sections: images and figures interpreted as text regions. Also variations in book design, fonts, physical and condition problems (smudges, etc.), page decorations, etc. According to a UNLV study, bad OCR increases the number of words per document by about one third.
  12. Not indexing words that mix letters and numbers. When doing a few searches in the HT full-text index, I discovered that “76 trombones” not only is in a musical but was also a term used to describe McNamara’s requests for position papers during his tenure in the Department of Defense.
  13. Removing words with only one occurrence (hapax): ~50% of the unique terms/tokens in the index occur only once. “If the word occurs in a query it would bring the document containing the word to the top, so removing these really hurts retrieval for those queries”; these have high IDF (inverse document frequency). Martin Reynaert (Tilburg U, Netherlands) actually did some estimates on the Reuters corpus (not OCR, just typos) and discovered that by removing the hapax, over 32% of the unique words that are legitimate words would be removed and 35% of the unique words that are errors would remain.
  14. Corrections at Google (majority of corpus): Tesseract instead of Abbyy9; reCaptcha voting; layout definition (GooDr.); phrase correction by language & image model. Quality working group: just beginning to dig into OCR issues; in part the reason I wanted to attend this conference. Need more information from G about how effective the above are; also how broad the coverage is (what percentage is being improved); what other ideas does G have?
  15. As difficult as our problems are, I am optimistic that in the end they will not prove to be intractable. It is my hope that through hearing about the work being done here in IMPACT, we can put to use some of the ideas and techniques that are being developed and begin chipping away at some of the problems we face. It is also my hope that we can forge new collaborations to help everyone working in this space and to provide better services to our users, making our books easier to find and use. Again, thank you.
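
The two index-pruning heuristics mentioned in notes 12 and 13 (skipping tokens that mix letters and digits, and dropping terms that occur only once) can be illustrated with a small Python sketch. This is not the HathiTrust or Solr implementation; the function names, the toy documents, and the plain log(N / df) form of IDF are all assumptions made for illustration.

```python
import math
from collections import Counter

# All names below are hypothetical; nothing here comes from HathiTrust code.

def is_mixed_alnum(token):
    """True when a token contains both letters and digits, e.g. 'l1brary'."""
    return any(c.isalpha() for c in token) and any(c.isdigit() for c in token)

def drop_mixed(tokens):
    """Keep only tokens that do not mix letters and digits."""
    return [t for t in tokens if not is_mixed_alnum(t)]

def hapax_terms(docs):
    """Terms that occur exactly once across the whole collection."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return {term for term, n in counts.items() if n == 1}

def idf(term, docs):
    """Plain inverse document frequency, log(N / df); 0.0 if the term is absent."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

# Toy collection: 'tbe' and 'l1brary' stand in for OCR errors.
docs = [
    "the 76 trombones led the parade".split(),
    "the parade passed tbe l1brary".split(),
    "the library holds the score".split(),
]

hapax = hapax_terms(docs)
print(drop_mixed(docs[1]))               # the mixed token 'l1brary' is dropped
print(sorted(hapax))                     # errors and legitimate rare words together
print(idf("tbe", docs), idf("the", docs))
```

In this toy collection the hapax set contains the error "tbe" but also legitimate words such as "trombones" and "library", and each hapax has the highest possible IDF, which is the trade-off described in note 13: pruning them removes noise and rare but valid query terms alike. A purely numeric token such as "76" passes the mixed-token filter, so a query like "76 trombones" would still be indexable under this heuristic.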