Building a Public Research Center for the HathiTrust Digital Library
1. Building a Public Research Center for the HathiTrust Digital Library @hathitresearch | @hathitrust http://www.hathitrust-research.org Robert H. McDonald Associate Dean for Library Technologies and Digital Libraries Associate Director-Data to Insight Center, Pervasive Technology Institute Indiana University June 14, 2011 JCDL 2011: Big Data! Big Deal? Panel
2. HathiTrust Research Center (HTRC) Team Indiana University Beth Plale – Director Robert McDonald – Executive Committee University of Illinois Scott Poole – Co-Director John Unsworth – Executive Committee
3. HathiTrust Digital Library History To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge. Launched in October 2008 University of Michigan Indiana University Used Google Books Repository at Michigan as Model Expanded to include content from CIC Member Libraries UC System Libraries University of Virginia Now includes more than 50 partner institutions and more than 8 million volumes
4.
5. Worked to identify key stakeholders from HT institutions to collaborate and write RFP
6. Google Settlement in early 2011 did not stop the center. Developed specific RFP for HathiTrust to solicit proposals – Summer/Fall 2009. HTRC RFP Working Group. RFP Released – Winter 2010
7. Our Collaboration HTRC is founded as a joint venture between Indiana University and the University of Illinois Urbana-Champaign, aimed at solving the difficult challenges of increasing computational access to the public domain and copyrighted material in HathiTrust.
8. Our Mission Phase I : starting Apr 2011 and going for 18 mos. Phase II : starting Fall 2012 and going for … Goal: enable strong computational research and education on a collection that has not been amenable to computational exploration EVER before!
9. Our Goals Maintain repository of text mining algorithms and retrieval tools available on-line for human and programmatic discovery. Also register derived data sets, indexes, and versions in registry repository. Be a user-driven resource, with an active advisory board, and a community model that allows users to share algorithms and tools. Support interoperability across collections and institutions, through use of InCommon SAML identity.
10. Our Future Support innovation in cyberinfrastructure to deliver optimal access and use of the HathiTrust corpus. Implement “Non-consumptive” research: a technical and intellectual challenge. Identify and host existing data analysis, text mining and retrieval tools that are of interest to the community. Stimulate development of new analytical methods and tools. We hope that the scale of the HTRC will promote new levels of collaboration in tool development.
11. HathiTrust Research Center Today HTRC is dedicated to the provision of access to a comprehensive body of published works for scholarship and education for computational research purposes. Lightweight Organization Executive Committee Beth Plale, Indiana Scott Poole, Illinois Robert H. McDonald, Indiana John Unsworth, Illinois Advisory Board TBD HathiTrust Executive Committee Liaison Laine Farley, California Digital Library
12. HathiTrust Research Center Today $250K in funding for initial 18 month startup Creating Themed Collections for early Use Cases Astronomy – Victorian Literature - Influenza Ingest and Replication Mechanisms Between HT and HTRC Full-text SOLR indexes Data Capsule integration Karma integration Integration with SEASR/MEANDRE SOA services at NCSA Alignment with Bamboo Technology Project Alignment with international Google Books Research Centers Establishing long-term non-consumptive research methodologies
13. HTRC Proposed Technical Architecture Courtesy IU Data to Insight Center – Beth Plale/Yiming Sun
14. Current SEASR Integration Demo. Courtesy IU Data to Insight Center – Felix Terkhorn/Yiming Sun. Tag Cloud Viewer Data Flow: (1) User enters Author name or Volume title; (2) Query RIS for Author Name or Volume Title; (3) Volume ID; (4) Invoke Tag Cloud service with URL; (5) Use URL to Retrieve Volume; (6) OCR for volume; (7) Tag Cloud returned to user. Components: Book Search Interface by Author or Title; JS/PHP Auto-completer; Sample Collection Bibliography Database (Converted from MARC to RIS); Public-domain OCR Web Access Servlet (A persistent RESTful Web Service); Sample Public Domain Collection (Organized as pairtree for demo only); Meandre Workbench; SEASR Infrastructure. Administrator creates tag cloud viewer in advance through SEASR.
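The seven-step data flow on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the HTRC or SEASR implementation: the in-memory `BIBLIOGRAPHY` and `OCR_TEXT` tables, the volume IDs, the stopword list, and the function names `find_volume_ids`, `fetch_ocr`, and `tag_cloud` are all hypothetical stand-ins for the bibliography database, the OCR web access servlet, and the tag cloud service named in the diagram.

```python
import re
from collections import Counter

# Hypothetical stand-in for the sample bibliography database (RIS records).
BIBLIOGRAPHY = [
    {"volume_id": "demo.001", "author": "Author A", "title": "Notes on Species"},
    {"volume_id": "demo.002", "author": "Author B", "title": "A River Journal"},
]

# Hypothetical stand-in for the public-domain OCR store, keyed by volume ID.
OCR_TEXT = {
    "demo.001": "species vary and species adapt; selection acts on species",
    "demo.002": "fog everywhere, fog up the river, fog down the river",
}

# Small illustrative stopword list; a real service would use a fuller one.
STOPWORDS = {"and", "on", "the", "up", "down"}


def find_volume_ids(query):
    """Steps 1-3: match an author name or volume title, return volume IDs."""
    q = query.lower()
    return [
        record["volume_id"]
        for record in BIBLIOGRAPHY
        if q in record["author"].lower() or q in record["title"].lower()
    ]


def fetch_ocr(volume_id):
    """Steps 5-6: retrieve the OCR text for one volume (assumed lookup)."""
    return OCR_TEXT[volume_id]


def tag_cloud(text, top_n=3):
    """Steps 4 and 7: count non-stopword terms and return the top_n terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return counts.most_common(top_n)


if __name__ == "__main__":
    for volume_id in find_volume_ids("river"):
        print(volume_id, tag_cloud(fetch_ocr(volume_id)))
```

The tag cloud returned to the user contains only term counts, not the page text itself, which is why a viewer of this kind fits naturally alongside the non-consumptive research goal described in the deck.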
15. Non-Consumptive Research Track. No action or set of actions on the part of HathiTrust Research Center users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information gathered from the HathiTrust collection to reassemble pages from the collection. Beth Plale (Indiana University), Atul Prakash (University of Michigan), Geoffrey Fox (Indiana University), Robert H. McDonald (Indiana University)
19. HathiTrust Research Center Events. HTRC Kickoff Event at Digital Humanities Conference 2011, Stanford University - June 20, 2011. Working on models for collaborative research. AHRC/ESRC/IMLS/JISC/NEH/NSF/NWO/SSHRC Digging into Data Round 2 http://www.diggingintodata.org/ Working on early advanced user case studies for the HathiTrust Corpus
20. Support and Acknowledgements IU UITS Research Technologies National Center for Supercomputing Applications IU Data to Insight Center iCHASS Illinois Informatics Institute Lilly Endowment, Inc. The Alfred P. Sloan Foundation
21. For More on HathiTrust Research Center See – http://www.hathitrust-research.org Follow us @hathitresearch on twitter Robert H. McDonald @mcdonald on twitter robert@indiana.edu
Editor's Notes
State Core Team Names. Talk about Partnership between IU and UIUC
Basic History of HathiTrust Digital Library – Digital Public Library of America - LAC