Consortia: Committee on Institutional Cooperation; Triangle Research Libraries Network; University of California.

Individual institutions: Arizona State University; Baylor University; Boston University; California Digital Library; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Harvard University Library; Indiana University; Johns Hopkins University; Lafayette College; Library of Congress; Mass. Inst. of Technology; Michigan State University; New York University; New York Public Library; North Carolina Central University; North Carolina State University; Northwestern University; The Ohio State University; The Pennsylvania State University; Princeton University; Purdue University; Stanford University; Texas A&M University; Universidad Complutense de Madrid; University of California Berkeley; University of California Davis; University of California Irvine; University of California Los Angeles; University of California Merced; University of California Riverside; University of California San Diego; University of California San Francisco; University of California Santa Barbara; University of California Santa Cruz; The University of Chicago; University of Connecticut; University of Florida; University of Illinois; University of Illinois at Chicago; The University of Iowa; University of Maryland; University of Michigan; University of Minnesota; University of Nebraska-Lincoln; The University of North Carolina; University of Notre Dame; University of Pennsylvania; University of Pittsburgh; University of Utah; University of Virginia; University of Washington; University of Wisconsin-Madison; Utah State University; Yale University Library.
Thank you. I am grateful for the invitation to attend this conference and to learn about the work being done in this project. Before I begin, I'd like to thank TBW at the University of Michigan, who has led the majority of the work I'll be describing (and continues to focus on improving the services we offer).
I've bolded the only non-US institution to join thus far. Another 13 or so partners should be announced soon, including several Canadian institutions.
Almost 10 million books; the number of words in the corpus is between 800 billion and 1 trillion.
I'll get back to languages in a bit.
HathiTrust uses Solr/Lucene for indexing and search.
- Size of the index: 6 TB total, at least 2/3 the size of the text files used to produce it
- Split into 12 shards across 6 servers; 2 additional servers are used for indexing
- Nightly incremental indexing; we have done a complete reindex only twice (at the current size it would take 10-14 days to run; with some tweaks we should be able to get it down to 4-5 days)
- Each index shard contains ~3 billion unique terms (but the total is not 3 billion x 12, since vocabularies overlap across shards)
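The last point can be illustrated with a toy sketch (the shard vocabularies below are entirely made up): common terms appear in every shard, so the size of the global vocabulary is much smaller than the per-shard counts summed.

```python
# Toy illustration of why total unique terms != (terms per shard) x (shard count):
# common terms like "the" appear in every shard's vocabulary.
shard_vocabs = [
    {"the", "library", "ocr", "solr"},      # shard 1
    {"the", "library", "index", "lucene"},  # shard 2
    {"the", "ocr", "index", "tesseract"},   # shard 3
]

per_shard_total = sum(len(v) for v in shard_vocabs)  # 4 + 4 + 4 = 12
global_vocab = set().union(*shard_vocabs)            # only 7 distinct terms

print(per_shard_total, len(global_vocab))  # 12 7
```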
Performance: a great deal of work has gone into memory management, and we constantly monitor query response times. Wildcarding is not possible because of query latency (a consequence of the number of tokens/terms in the index).
Over 400 languages in HathiTrust. Google's OCR engine (formerly ABBYY, recently switched to Tesseract) can handle about 20 well and only about 60 total; there is very limited "ground truth" text for all but the "top" 20 languages. When the OCR engine is unable to determine the language, gibberish or empty OCR may be the result. We recently found that only 0.069% (6,730) of books have empty OCR; the rest of the 340-odd languages probably just have gibberish.
Example of a book in Mongolian:
- page image: http://babel.hathitrust.org/cgi/pt?id=mdp.39015025118004;seq=51;size=125;view=image
- extracted text: http://babel.hathitrust.org/cgi/pt?seq=51;id=mdp.39015025118004;page=root;view=plaintext;size=100;orient=0
Complexity. (Image of...? A key that fits all locks?) Filtering or pre-processing of dirty OCR (prior to indexing) will have to work across all languages. Content comes from a variety of disciplines and spans a wide range of time periods; improving OCR training or cleanup tasks would require different kinds of dictionaries: technical, academic, etc. And what about multi-language texts? Ground truth collection: Google is trying to collect as much as possible; in the academic setting, would it be possible to establish an open text ground truth center? A lot of work and difficult to achieve, but worth thinking about.
It is obvious that it is much more effective to make changes to the OCR engine than to make improvements after OCR'ing.
Mis-identification of page sections: images and figures interpreted as text regions. Also variations in book design, fonts, physical and condition problems (smudges, etc.), page decorations, etc. According to a UNLV study, bad OCR increases the number of words per document by about one third.
Not indexing words with alphabetic and numeric characters mixed. When doing a few searches in the HT full-text index, I discovered that "76 trombones" not only appears in a musical but was also a term used to describe McNamara's requests for position papers during his tenure in the Department of Defense.
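A filter of the kind described above might look like this minimal sketch (the function name and sample tokens are my own, purely for illustration): tokens mixing letters and digits are dropped, while pure words and pure numbers both survive, so "76" and "trombones" each remain searchable.

```python
import re

# Drop tokens that mix letters and digits (often OCR noise like "l0ve"),
# while keeping pure-alphabetic and pure-numeric tokens.
MIXED = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)")

def keep_token(token: str) -> bool:
    return not MIXED.match(token)

tokens = ["76", "trombones", "l0ve", "page", "B4d"]
print([t for t in tokens if keep_token(t)])  # ['76', 'trombones', 'page']
```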
Removing words with only one occurrence (hapax legomena): ~50% of the unique terms/tokens in the index occur only once. "If the word occurs in a query it would bring the document containing the word to the top, so removing these really hurts retrieval for those queries"; these terms have high IDF (inverse document frequency). Martin Reynaert (Tilburg University, Netherlands) did some estimates on the Reuters corpus (not OCR, just typos) and discovered that by removing the hapax, over 32% of the unique words that are legitimate would be removed, while 35% of the unique words that are errors would remain.
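The point about IDF can be sketched with a toy corpus (the documents below are invented for illustration): a hapax term appears in only one document, so it gets the maximum IDF score, which is exactly why removing hapax hurts the queries that contain them.

```python
import math
from collections import Counter

# Toy corpus: "trombones", "clean", and "noisy" are hapax legomena here.
docs = [
    "the ocr output was clean",
    "the ocr output was noisy",
    "trombones",
]

term_counts = Counter(w for d in docs for w in d.split())
hapax = {w for w, c in term_counts.items() if c == 1}

def idf(term: str) -> float:
    # Standard IDF: log(N / document frequency).
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

print(sorted(hapax))                                      # ['clean', 'noisy', 'trombones']
print(round(idf("trombones"), 3), round(idf("the"), 3))   # 1.099 0.405
```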
Corrections at Google (the majority of the corpus):
- Tesseract instead of ABBYY 9
- reCAPTCHA
- voting
- layout definition (GooDr.)
- phrase correction by language and image model

Quality working group: just beginning to dig into OCR issues; this is in part the reason I wanted to attend this conference. We need more information from Google: how effective are the approaches above? How broad is the coverage (what percentage is being improved)? What other ideas does Google have?
As difficult as our problems are, I am optimistic that in the end they will not prove to be intractable. It is my hope that through hearing about the work being done here in IMPACT, we can put to use some of the ideas and techniques being developed and begin chipping away at some of the problems we face. It is also my hope that we can forge new collaborations to help everyone working in this space and to provide better services to our users, making our books easier to find and use. Again, thank you.