SlideShare ist ein Scribd-Unternehmen logo
1 von 30
HATHI TRUST RESEARCH
CENTER

Opportunities and Challenges of Text
Mining HathiTrust Digital Library
Koninklijke Bibiotheek | 15.Nov.13

Beth Plale – @bplale
Professor, School of Informatics and Computing
Director, Data To Insight Center
Indiana University

Tweet us - @HathiTrust #HTRC
Thanks to sponsors
HathiTrust Digital Library
• HathiTrust is a partnership of academic &
research institutions, offering a collection of
millions of titles digitized from libraries around
the world.
– Founding members of HathiTrust along with
University of Michigan are Indiana University,
University of California, and University of Virginia
http://www.hathitrust.org

http://www.hathitrust.org/htrc
#HTRC @HathiTrust
 HathiTrust repository is a latent
goldmine for text mining analysis,
analysis of large-scale corpi
through computational tools, and
time-based analysis
Restricted nature of HT content
suggests need for new forms of
access that preserve intimate
nature of research investigation
while honoring restrictions
 Paradigm: computation takes
place close to the data
#HTRC @HathiTrust
Mission of HT Research Center
• Research arm of HathiTrust
• Goal: enable researchers world-wide to carry out
computational investigation of HT repository through
– Develop model for access: the ‘workset’
– Develop tools that facilitate research by digital humanities
and informatics communities
– Develop secure cyberinfrastructure that allows
computational investigation of entire copyrighted and
public domain HathiTrust repository

• Established: July, 2011
• Collaborative effort of Indiana University and
University of Illinois
#HTRC @HathiTrust
HTRC
Request

Spatial plots
All the complexity

Statistical plots

Complexity hiding interface

Tabular info
Complexity hiding interface
Workset builder
HTRC Timeline
• Phase I: development 01 Jul 2011 – 31 Mar 2013
– HTRC software and services release v1.0
http://sourceforge.net/p/htrc/code/
• Phase II: outreach, 01 Apr 2013 - present
– 2nd HTRC UnCamp Sep ‘13
Attendees of UnCamp’13

#HTRC @HathiTrust
•
•
•
•
•
•
•
•

Philosophy: computation moves to data
Web services (REST) architecture and protocols
WS02 Registry for worksets and results
Solr Indexes: full text, MARC, and new metadata
noSQL (Cassandra) store as volume store
Authentication using WS02 Identity Server
Portal front-end, programmatic access
Mining tools: currently SEASR
#HTRC @HathiTrust
Portal

Secure
Capsule
Instance
Manager

Secure
Capsule
Cluster

Sigiri
job
deployment

Hadoop
Cluster
(MapReduce/
HDFS)

IU compute resources

Agent
instance
Agent

Registry
Services, worksets

Agent
instance

instance

SEASR analytics
service
Solr index
Meandre
Orchestration

SEASR service
execution

HTRC Data API v0.1

Secure
Capsule
Service

User session
management

Blacklight

Identity Server

ssh client

Volume store
Volume store
(Cassandra)
Volume store
(Cassandra)
(Cassandra)

rsync

HathiTrust
corpus

Page/volume
tree (file system)
University of
Michigan
HTRC’s guiding principle to
computational access
• No computational action or set of actions on part of
users, either acting alone or in cooperation with other
users over duration of one or multiple sessions can
result in sufficient information gathered from the HT
repository to reassemble pages from collection for
reading
• Definition disallows collusion between users, or
accumulation of material over time.
• Defining “sufficient information”: research has shown
need to interact directly with select texts. How much
of a text to show? Google withholds from showing to
reader every 10th page of a book (Int’l NYTimes Nov 16-17, 2013)
#HTRC @HathiTrust
HTRC Secure
Capsule
Architecture
Researcher
requests
new VM

VM
Image
Store
VM
Image
Manager
VM
Manager

VM
Image
Builder

Secure
Capsule:
VM
controls I/O
instance
behind
scenes

Researcher

SSH

Research install tools onto
VM through window on her
desktop

Research
results

Registry
Services,
worksets

Final location
of results is
registry
Extensions to HT Use
1. Enhanced metadata
Miao Chen, i-school, Indiana Univ

2. Large scale text mining using HPC compute
resources
Guangchen Ruan, computer science, Indiana Univ

3. Enhanced metadata to include gender author
information
Stacy Kowalczyk, library science, Dominican Univ
#HTRC @HathiTrust
Miao Chen, PhD, Indiana University

METADATA ENHANCEMENT:
PRELIMINARY STUDY
Metadata Enhancement
• Preliminary study of limits of HT Repository
metadata
• Current metadata fields are MARC-based
– E.g. publication date, authors, title, subject

• MARC fields are fundamental
• Needed: more fields of users’ interest for
granular analytics (Metadata Enhancement)
• Solicit user requirements and prioritize for
implementation
Thanks to Miao Chen
#HTRC @HathiTrust
Top Metadata Enhancement Items
• 1st round user survey, top requested items
– Word frequency count and document length. At
volume level ✔
– Author gender identification ✔
– Metadata de-duplication
– Word frequency count at page level
– Word frequency count for full 10.8 M volume
repository

#HTRC @HathiTrust
Other Metadata Enhancement Items
•
•
•
•
•
•
•
•

Stats analysis: TF-IDF
Readability score
Language identification
Topic modeling (e.g. LDA probability)
Genre
Era of compilation
Book length (e.g. short or long)
Concordance index (indexing with context)
#HTRC @HathiTrust
Guangchen Ruan, PhD candidate, IU

LARGE SCALE DATA ANALYSIS ON XSEDE
Study Objective
• Study how large scale text analysis of the HT
repository can be carried out using public
compute resources
• Task: compute word frequency over entire
public domain portion of HT corpus
• Uses Blacklight, an HPC machine at Pittsburgh
supercomputer center

#HTRC @HathiTrust
Experimental Environment and Results
• Dataset
2,592,210 volumes, in total 2.1 TB, divided into 1024
partitions of 2GB each

• Computation platform
XSEDE Blacklight, 1024-core, each 2.27 GHz, 8.2
TB memory. Each core processes one partition

• Results
Whole corpus word count completed in 1,454
seconds or 24.23 minutes
#HTRC @HathiTrust
Computation Time Distribution

#HTRC @HathiTrust
Word Frequency Distribution

#HTRC @HathiTrust
Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University

GENDER IDENTIFICATION OF HTRC
AUTHORS BY NAMES
Ref talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
Gender Identification of Text
• Can we use author names in bibliographic records to
identify gender?
• 2.6 million bibliographic records
– Extracted personal author data
– Marc 100 abcd and 700 abcd

• 606,437 unique personal author strings
• Bibliographic data is not fielded like patent names
• Relying on Standard cataloging practice
– Last name, first name middle name, titles/honorifics,
dates
#HTRC @HathiTrust
Authors vs Names
• Methuen, Algernon Methuen Marshall, Sir bart., 18561924
• Methuem, Algernon
• Methuen Algernon
• Methuen Marshall, Sir, bart., 1856• Methuen, A. Sir, 1856-1924
• Methuen, A. Sir, bart., 1856-1924
• Methuen Marshall, Sir bart 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, bart., 18561924
• Methuen, Algernon, 1856-1924
#HTRC @HathiTrust
Sources of Data
• The Virtual International Authority File
– Hosted by OCLC

• Harvested names from multiple data sources
– Census bureau
– Baby name sites

• EU Patent Research names list (Frietsch et al, 2009; Naldi
et al. 2005)

– Developed an extensive list of European names

• Titles and honorifics
– Multiple web resources
– Sir, Baron, Count, Duke, Father, Cardinal, etc
– Lady, Mrs. Miss, Countess, Duchess, Sister, etc
#HTRC @HathiTrust
Initial Gender Results
• Approximately 80% of name strings have initial
gender identification
– Female
• 59,365
• 10%

– Male
• 425,994
• 70%

– Unknown
• 114,204
• 19%

– Ambiguous
• 5,965
• Less than 1%
#HTRC @HathiTrust
Results by Data Source
Against the whole set of name strings
• VIAF
– 19% hit rate

• Web Names
– 54% hit rate

• Patents Names
– 8%

#HTRC @HathiTrust
• For details http://www.hathitrust.org/htrc/faq
• General contact info
– Beth Plale, Director HTRC, plale@indiana.edu
• Requests for capability, interest
– Miao Chen, Asst. Director HTRC
miaochen@indiana.edu
#HTRC @HathiTrust

Weitere ähnliche Inhalte

Was ist angesagt?

New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
New tasks, new roles: Libraries in the tension between Digital Humanities, Re...New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
New tasks, new roles: Libraries in the tension between Digital Humanities, Re...Stefan Schmunk
 
Text mining and data mining
Text mining and data mining Text mining and data mining
Text mining and data mining Bhawi247
 
Introduction to Text Mining and Semantics
Introduction to Text Mining and SemanticsIntroduction to Text Mining and Semantics
Introduction to Text Mining and SemanticsSeth Grimes
 
Open Data - Principles and Techniques
Open Data - Principles and TechniquesOpen Data - Principles and Techniques
Open Data - Principles and TechniquesBernhard Haslhofer
 
Data Sharing via Globus in the NIH Intramural Program
Data Sharing via Globus in the NIH Intramural ProgramData Sharing via Globus in the NIH Intramural Program
Data Sharing via Globus in the NIH Intramural ProgramGlobus
 
Open hpi semweb-06-part4
Open hpi semweb-06-part4Open hpi semweb-06-part4
Open hpi semweb-06-part4Nadine Ludwig
 
LOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked DataLOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked DataFilip Ilievski
 
Health Sciences Research Informatics, Powered by Globus
Health Sciences Research Informatics, Powered by GlobusHealth Sciences Research Informatics, Powered by Globus
Health Sciences Research Informatics, Powered by GlobusGlobus
 
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15Filip Ilievski
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked dataLaura Po
 
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...Micah Altman
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalMauro Dragoni
 
Cultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data CollectionsCultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data Collectionslljohnston
 
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joinedKeystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joinedJoel Azzopardi
 

Was ist angesagt? (20)

New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
New tasks, new roles: Libraries in the tension between Digital Humanities, Re...New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
 
Text mining
Text miningText mining
Text mining
 
Text mining and data mining
Text mining and data mining Text mining and data mining
Text mining and data mining
 
co:op-READ-Convention Marburg - Basilis Gatos
co:op-READ-Convention Marburg - Basilis Gatosco:op-READ-Convention Marburg - Basilis Gatos
co:op-READ-Convention Marburg - Basilis Gatos
 
Introduction to Text Mining and Semantics
Introduction to Text Mining and SemanticsIntroduction to Text Mining and Semantics
Introduction to Text Mining and Semantics
 
Open Data - Principles and Techniques
Open Data - Principles and TechniquesOpen Data - Principles and Techniques
Open Data - Principles and Techniques
 
Washington Linked Data Authority Service at University of Houston
Washington Linked Data Authority Service at University of HoustonWashington Linked Data Authority Service at University of Houston
Washington Linked Data Authority Service at University of Houston
 
Data Sharing via Globus in the NIH Intramural Program
Data Sharing via Globus in the NIH Intramural ProgramData Sharing via Globus in the NIH Intramural Program
Data Sharing via Globus in the NIH Intramural Program
 
Open hpi semweb-06-part4
Open hpi semweb-06-part4Open hpi semweb-06-part4
Open hpi semweb-06-part4
 
Webmining (1)
Webmining (1)Webmining (1)
Webmining (1)
 
LOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked DataLOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked Data
 
Health Sciences Research Informatics, Powered by Globus
Health Sciences Research Informatics, Powered by GlobusHealth Sciences Research Informatics, Powered by Globus
Health Sciences Research Informatics, Powered by Globus
 
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
Ziegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections LibrariesZiegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections Libraries
 
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
 
Sanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUDSanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUD
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
Cultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data CollectionsCultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data Collections
 
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joinedKeystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
 

Ähnlich wie HathiTrust Reserach Center Nov2013

Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Beth Plale
 
The HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational ServicesThe HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational ServicesRobert H. McDonald
 
Building a Public Research Center for the HathiTrust Digital Library
Building a Public Research Center for the HathiTrust Digital LibraryBuilding a Public Research Center for the HathiTrust Digital Library
Building a Public Research Center for the HathiTrust Digital LibraryRobert H. McDonald
 
HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?Brian Vetruba
 
HathiTrust Research Center Data Capsule Overview 09.10.14
HathiTrust Research Center Data Capsule Overview 09.10.14HathiTrust Research Center Data Capsule Overview 09.10.14
HathiTrust Research Center Data Capsule Overview 09.10.14Robert H. McDonald
 
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital TextsCase Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital TextsBeth Plale
 
Change Management for Libraries
Change Management for LibrariesChange Management for Libraries
Change Management for LibrariesThomas King
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceGlobus
 
HathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsHathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsBeth Plale
 
JCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening SlidesJCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening SlidesRobert H. McDonald
 
Research into Practice case study 2: Library linked data implementations an...
	Research into Practice case study 2:  Library linked data implementations an...	Research into Practice case study 2:  Library linked data implementations an...
Research into Practice case study 2: Library linked data implementations an...Hazel Hall
 
The Reach of Crossref metadata - Crossref LIVE South Africa
The Reach of Crossref metadata - Crossref LIVE South AfricaThe Reach of Crossref metadata - Crossref LIVE South Africa
The Reach of Crossref metadata - Crossref LIVE South AfricaCrossref
 
Bridging Digital Humanities Research and Big Data Repositories of Digital Text
Bridging Digital Humanities Research and Big Data Repositories of Digital TextBridging Digital Humanities Research and Big Data Repositories of Digital Text
Bridging Digital Humanities Research and Big Data Repositories of Digital TextBeth Plale
 
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...petrknoth
 
23 things for Research Data - LIBER webinar 23 Feb 2017
23 things for Research Data - LIBER webinar 23 Feb 201723 things for Research Data - LIBER webinar 23 Feb 2017
23 things for Research Data - LIBER webinar 23 Feb 2017ARDC
 
The HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and DemoThe HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and DemoRobert H. McDonald
 

Ähnlich wie HathiTrust Reserach Center Nov2013 (20)

Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014
 
The HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational ServicesThe HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational Services
 
Building a Public Research Center for the HathiTrust Digital Library
Building a Public Research Center for the HathiTrust Digital LibraryBuilding a Public Research Center for the HathiTrust Digital Library
Building a Public Research Center for the HathiTrust Digital Library
 
HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?
 
HathiTrust Research Center Data Capsule Overview 09.10.14
HathiTrust Research Center Data Capsule Overview 09.10.14HathiTrust Research Center Data Capsule Overview 09.10.14
HathiTrust Research Center Data Capsule Overview 09.10.14
 
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital TextsCase Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
 
Walsh "Text Data Mining with HTRC"
Walsh "Text Data Mining with HTRC"Walsh "Text Data Mining with HTRC"
Walsh "Text Data Mining with HTRC"
 
Change Management for Libraries
Change Management for LibrariesChange Management for Libraries
Change Management for Libraries
 
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at ScaleFull Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
 
HathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsHathiTrust Research Center Secure Commons
HathiTrust Research Center Secure Commons
 
JCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening SlidesJCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening Slides
 
Research into Practice case study 2: Library linked data implementations an...
	Research into Practice case study 2:  Library linked data implementations an...	Research into Practice case study 2:  Library linked data implementations an...
Research into Practice case study 2: Library linked data implementations an...
 
The Reach of Crossref metadata - Crossref LIVE South Africa
The Reach of Crossref metadata - Crossref LIVE South AfricaThe Reach of Crossref metadata - Crossref LIVE South Africa
The Reach of Crossref metadata - Crossref LIVE South Africa
 
Bridging Digital Humanities Research and Big Data Repositories of Digital Text
Bridging Digital Humanities Research and Big Data Repositories of Digital TextBridging Digital Humanities Research and Big Data Repositories of Digital Text
Bridging Digital Humanities Research and Big Data Repositories of Digital Text
 
C N I20080404
C N I20080404C N I20080404
C N I20080404
 
Torsten Reimer
Torsten ReimerTorsten Reimer
Torsten Reimer
 
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
 
23 things for Research Data - LIBER webinar 23 Feb 2017
23 things for Research Data - LIBER webinar 23 Feb 201723 things for Research Data - LIBER webinar 23 Feb 2017
23 things for Research Data - LIBER webinar 23 Feb 2017
 
The HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and DemoThe HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and Demo
 

Mehr von Beth Plale

Trustworthy AI and Open Science
Trustworthy AI and Open ScienceTrustworthy AI and Open Science
Trustworthy AI and Open ScienceBeth Plale
 
Open science as roadmap to better data science research
Open science as roadmap to better data science researchOpen science as roadmap to better data science research
Open science as roadmap to better data science researchBeth Plale
 
Capsule Computing: Safe Open Science
Capsule Computing: Safe Open Science Capsule Computing: Safe Open Science
Capsule Computing: Safe Open Science Beth Plale
 
Towards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID TestbedTowards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID TestbedBeth Plale
 
Trust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADTrust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADBeth Plale
 
Trust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceTrust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceBeth Plale
 
Big data and open access: a collision course for science
Big data and open access: a collision course for scienceBig data and open access: a collision course for science
Big data and open access: a collision course for scienceBeth Plale
 

Mehr von Beth Plale (7)

Trustworthy AI and Open Science
Trustworthy AI and Open ScienceTrustworthy AI and Open Science
Trustworthy AI and Open Science
 
Open science as roadmap to better data science research
Open science as roadmap to better data science researchOpen science as roadmap to better data science research
Open science as roadmap to better data science research
 
Capsule Computing: Safe Open Science
Capsule Computing: Safe Open Science Capsule Computing: Safe Open Science
Capsule Computing: Safe Open Science
 
Towards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID TestbedTowards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID Testbed
 
Trust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADTrust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEAD
 
Trust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceTrust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail Science
 
Big data and open access: a collision course for science
Big data and open access: a collision course for scienceBig data and open access: a collision course for science
Big data and open access: a collision course for science
 

Kürzlich hochgeladen

Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 

Kürzlich hochgeladen (20)

Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 

HathiTrust Reserach Center Nov2013

  • 1. HATHI TRUST RESEARCH CENTER Opportunities and Challenges of Text Mining HathiTrust Digital Library Koninklijke Bibiotheek | 15.Nov.13 Beth Plale – @bplale Professor, School of Informatics and Computing Director, Data To Insight Center Indiana University Tweet us - @HathiTrust #HTRC
  • 3. HathiTrust Digital Library • HathiTrust is a partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world. – Founding members of HathiTrust along with University of Michigan are Indiana University, University of California, and University of Virginia http://www.hathitrust.org http://www.hathitrust.org/htrc #HTRC @HathiTrust
  • 4.  HathiTrust repository is a latent goldmine for text mining analysis, analysis of large-scale corpi through computational tools, and time-based analysis Restricted nature of HT content suggests need for new forms of access that preserve intimate nature of research investigation while honoring restrictions  Paradigm: computation takes place close to the data #HTRC @HathiTrust
  • 5. Mission of HT Research Center • Research arm of HathiTrust • Goal: enable researchers world-wide to carry out computational investigation of HT repository through – Develop model for access: the ‘workset’ – Develop tools that facilitate research by digital humanities and informatics communities – Develop secure cyberinfrastructure that allows computational investigation of entire copyrighted and public domain HathiTrust repository • Established: July, 2011 • Collaborative effort of Indiana University and University of Illinois #HTRC @HathiTrust
  • 6. HTRC Request Spatial plots All the complexity Statistical plots Complexity hiding interface Tabular info
  • 9. HTRC Timeline • Phase I: development 01 Jul 2011 – 31 Mar 2013 – HTRC software and services release v1.0 http://sourceforge.net/p/htrc/code/ • Phase II: outreach, 01 Apr 2013 - present – 2nd HTRC UnCamp Sep ‘13 Attendees of UnCamp’13 #HTRC @HathiTrust
  • 10. • • • • • • • • Philosophy: computation moves to data Web services (REST) architecture and protocols WS02 Registry for worksets and results Solr Indexes: full text, MARC, and new metadata noSQL (Cassandra) store as volume store Authentication using WS02 Identity Server Portal front-end, programmatic access Mining tools: currently SEASR #HTRC @HathiTrust
  • 11. Portal Secure Capsule Instance Manager Secure Capsule Cluster Sigiri job deployment Hadoop Cluster (MapReduce/ HDFS) IU compute resources Agent instance Agent Registry Services, worksets Agent instance instance SEASR analytics service Solr index Meandre Orchestration SEASR service execution HTRC Data API v0.1 Secure Capsule Service User session management Blacklight Identity Server ssh client Volume store Volume store (Cassandra) Volume store (Cassandra) (Cassandra) rsync HathiTrust corpus Page/volume tree (file system) University of Michigan
  • 12. HTRC’s guiding principle to computational access • No computational action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions can result in sufficient information gathered from the HT repository to reassemble pages from collection for reading • Definition disallows collusion between users, or accumulation of material over time. • Defining “sufficient information”: research has shown need to interact directly with select texts. How much of a text to show? Google withholds from showing to reader every 10th page of a book (Int’l NYTimes Nov 16-17, 2013) #HTRC @HathiTrust
  • 13. HTRC Secure Capsule Architecture Researcher requests new VM VM Image Store VM Image Manager VM Manager VM Image Builder Secure Capsule: VM controls I/O instance behind scenes Researcher SSH Research install tools onto VM through window on her desktop Research results Registry Services, worksets Final location of results is registry
  • 14. Extensions to HT Use 1. Enhanced metadata Miao Chen, i-school, Indiana Univ 2. Large scale text mining using HPC compute resources Guangchen Ruan, computer science, Indiana Univ 3. Enhanced metadata to include gender author information Stacy Kowalczyk, library science, Dominican Univ #HTRC @HathiTrust
  • 15. Miao Chen, PhD, Indiana University METADATA ENHANCEMENT: PRELIMINARY STUDY
  • 16. Metadata Enhancement • Preliminary study of limits of HT Repository metadata • Current metadata fields are MARC-based – E.g. publication date, authors, title, subject • MARC fields are fundamental • Needed: more fields of users’ interest for granular analytics (Metadata Enhancement) • Solicit user requirements and prioritize for implementation Thanks to Miao Chen #HTRC @HathiTrust
  • 17. Top Metadata Enhancement Items • 1st round user survey, top requested items – Word frequency count and document length. At volume level ✔ – Author gender identification ✔ – Metadata de-duplication – Word frequency count at page level – Word frequency count for full 10.8 M volume repository #HTRC @HathiTrust
  • 18. Other Metadata Enhancement Items • • • • • • • • Stats analysis: TF-IDF Readability score Language identification Topic modeling (e.g. LDA probability) Genre Era of compilation Book length (e.g. short or long) Concordance index (indexing with context) #HTRC @HathiTrust
  • 19. Guangchen Ruan, PhD candidate, IU LARGE SCALE DATA ANALYSIS ON XSEDE
  • 20. Study Objective • Study how large scale text analysis of the HT repository can be carried out using public compute resources • Task: compute word frequency over entire public domain portion of HT corpus • Uses Blacklight, an HPC machine at Pittsburgh supercomputer center #HTRC @HathiTrust
  • 21. Experimental Environment and Results • Dataset 2,592,210 volumes, in total 2.1 TB, divided into 1024 partitions of 2GB each • Computation platform XSEDE Blacklight, 1024-core, each 2.27 GHz, 8.2 TB memory. Each core processes one partition • Results Whole corpus word count completed in 1,454 seconds or 24.23 minutes #HTRC @HathiTrust
  • 24. Stacy Kowalczyk, Asst. Professor, Dominican University Zong Peng, HTRC, Indiana University GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES Ref talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
  • 25. Gender Identification of Text • Can we use author names in bibliographic records to identify gender? • 2.6 million bibliographic records – Extracted personal author data – Marc 100 abcd and 700 abcd • 606,437 unique personal author strings • Bibliographic data is not fielded like patent names • Relying on Standard cataloging practice – Last name, first name middle name, titles/honorifics, dates #HTRC @HathiTrust
  • 26. Authors vs Names • Methuen, Algernon Methuen Marshall, Sir bart., 18561924 • Methuem, Algernon • Methuen Algernon • Methuen Marshall, Sir, bart., 1856• Methuen, A. Sir, 1856-1924 • Methuen, A. Sir, bart., 1856-1924 • Methuen Marshall, Sir bart 1856-1924 • Methuen, Algernon Methuen Marshall, Sir, 1856-1924 • Methuen, Algernon Methuen Marshall, Sir, bart., 18561924 • Methuen, Algernon, 1856-1924 #HTRC @HathiTrust
  • 27. Sources of Data • The Virtual International Authority File – Hosted by OCLC • Harvested names from multiple data sources – Census bureau – Baby name sites • EU Patent Research names list (Frietsch et al, 2009; Naldi et al. 2005) – Developed an extensive list of European names • Titles and honorifics – Multiple web resources – Sir, Baron, Count, Duke, Father, Cardinal, etc – Lady, Mrs. Miss, Countess, Duchess, Sister, etc #HTRC @HathiTrust
  • 28. Initial Gender Results • Approximately 80% of name strings have initial gender identification – Female • 59,365 • 10% – Male • 425,994 • 70% – Unknown • 114,204 • 19% – Ambiguous • 5,965 • Less than 1% #HTRC @HathiTrust
  • 29. Results by Data Source Against the whole set of name strings • VIAF – 19% hit rate • Web Names – 54% hit rate • Patents Names – 8% #HTRC @HathiTrust
  • 30. • For details http://www.hathitrust.org/htrc/faq • General contact info – Beth Plale, Director HTRC, plale@indiana.edu • Requests for capability, interest – Miao Chen, Asst. Director HTRC miaochen@indiana.edu #HTRC @HathiTrust

Hinweis der Redaktion

  1. Need to add slide that combines images of the blacklight view and the portal showing a workset. Building a query needs to go after architecture diagram.
  2. HTRC hides complexity of analytics. In this sense, it is like Google search, which is a simple interface that hides complexity to search billions of pages. The kinds of things returned from HTRC interaction are spatial relationship of words (and their frequency obviously), statistical plots of information or tabular information.
  3. Shifting the complexity hiding interface to the right, we open up the cloud to see what’s inside. HTRC at it simplest has 1) algorithms – these are drawn from SEASR and from other analysis tool suites including Mahout and mapreduce, the 2) HT corpus (and subsets of the corpus that users either have personally as part of a workset, or are publically available, and 3) other data sets that are used. HTRC brokers the bringing together of these pieces so that computation can take place on a resource like Big Red II (or XSEDE). Note that there is an arrow from the compute engine to the complexity hiding interface. This is because researcher interaction with the texts isn’t an automated workflow; it is one requiring levels of interaction with the computation as it is running.
  4. Workset is created using this page then gets connected with algorithms, other resources
  5. Add WS02 LogAdd logo brii and quarry
  6. tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.[1] One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan in 2003.[1]
  7. In the experiment we use 1024 cores and each core processes the same amount of data, i.e. 2T/1024, the histogram distribution (using 100-bin) shows the time consumed by processes. So the y-axis indicate the # of processes that finish the processing within the time indicated by the x-axis. From the distribution we can see that the variance is small, most processes are able to finish within 15-16 mins.