HathiTrust Research Center Nov2013
1. HATHITRUST RESEARCH CENTER
Opportunities and Challenges of Text
Mining HathiTrust Digital Library
Koninklijke Bibliotheek | 15.Nov.13
Beth Plale – @bplale
Professor, School of Informatics and Computing
Director, Data To Insight Center
Indiana University
Tweet us - @HathiTrust #HTRC
3. HathiTrust Digital Library
• HathiTrust is a partnership of academic &
research institutions, offering a collection of
millions of titles digitized from libraries around
the world.
– Founding members of HathiTrust along with
University of Michigan are Indiana University,
University of California, and University of Virginia
http://www.hathitrust.org
http://www.hathitrust.org/htrc
#HTRC @HathiTrust
4. The HathiTrust repository is a latent
goldmine for text mining: analysis of
large-scale corpora through
computational tools, and
time-based analysis
Restricted nature of HT content
suggests need for new forms of
access that preserve intimate
nature of research investigation
while honoring restrictions
Paradigm: computation takes
place close to the data
5. Mission of HT Research Center
• Research arm of HathiTrust
• Goal: enable researchers worldwide to carry out
computational investigation of the HT repository by
– Developing a model for access: the ‘workset’
– Developing tools that facilitate research by the digital
humanities and informatics communities
– Developing secure cyberinfrastructure that allows
computational investigation of the entire copyrighted and
public-domain HathiTrust repository
• Established: July 2011
• Collaborative effort of Indiana University and
University of Illinois
10. • Philosophy: computation moves to data
• Web services (REST) architecture and protocols
• WSO2 Registry for worksets and results
• Solr indexes: full text, MARC, and new metadata
• NoSQL (Cassandra) store as volume store
• Authentication using WSO2 Identity Server
• Portal front-end, programmatic access
• Mining tools: currently SEASR
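The Solr full-text index in the list above is queried over HTTP. As a rough illustration of what programmatic access to such an index looks like, the sketch below only builds the request URL; the core name (`htrc-fulltext`) and field name (`ocr`) are hypothetical, since the slides do not show HTRC's actual index schema.

```python
from urllib.parse import urlencode

def build_solr_query(base_url, field, term, rows=10):
    """Build a Solr /select URL for a full-text search.

    The core and field names used by callers are illustrative;
    HTRC's real schema is not described in the slides.
    """
    params = urlencode({"q": f"{field}:{term}", "rows": rows, "wt": "json"})
    return f"{base_url}/select?{params}"

# Hypothetical local core name, for illustration only
url = build_solr_query("http://localhost:8983/solr/htrc-fulltext", "ocr", "whale")
print(url)
```

A client would then fetch this URL and parse the JSON response; the same pattern applies to the MARC and enhanced-metadata indexes.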
12. HTRC’s guiding principle for
computational access
• No computational action or set of actions on the part of
users, whether acting alone or in cooperation with other
users over the duration of one or multiple sessions, can
result in sufficient information being gathered from the HT
repository to reassemble pages from the collection for
reading
• The definition disallows collusion between users and
accumulation of material over time.
• Defining “sufficient information”: research has shown the
need to interact directly with select texts. How much of a
text to show? Google withholds every 10th page of a book
from readers (Int’l NYTimes, Nov 16-17, 2013)
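The Google heuristic mentioned above can be sketched as a simple page filter. This is only an illustration of that one withholding rule, not HTRC's access policy, which the slide says is still being defined.

```python
def withhold_pages(pages, n=10):
    """Return the pages that may be shown, withholding every n-th
    page (the Google-Books-style heuristic cited in the slide).
    A real policy would also have to bound accumulation across
    sessions and users, per the guiding principle above."""
    return [p for p in pages if p % n != 0]

# A 20-page request loses pages 10 and 20
print(withhold_pages(list(range(1, 21))))
```

Note that this per-request rule alone does not satisfy the guiding principle: a user could request the withheld pages in a later session, which is exactly the accumulation-over-time case the definition rules out.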
14. Extensions to HT Use
1. Enhanced metadata
Miao Chen, i-school, Indiana Univ
2. Large scale text mining using HPC compute
resources
Guangchen Ruan, computer science, Indiana Univ
3. Enhanced metadata to include author gender
information
Stacy Kowalczyk, library science, Dominican Univ
15. Miao Chen, PhD, Indiana University
METADATA ENHANCEMENT:
PRELIMINARY STUDY
16. Metadata Enhancement
• Preliminary study of limits of HT Repository
metadata
• Current metadata fields are MARC-based
– E.g. publication date, authors, title, subject
• MARC fields are fundamental
• Needed: more fields of interest to users for
granular analytics (metadata enhancement)
• Solicit user requirements and prioritize for
implementation
Thanks to Miao Chen
17. Top Metadata Enhancement Items
• 1st round user survey, top requested items
– Word frequency count and document length. At
volume level ✔
– Author gender identification ✔
– Metadata de-duplication
– Word frequency count at page level
– Word frequency count for full 10.8 M volume
repository
18. Other Metadata Enhancement Items
• Stats analysis: TF-IDF
• Readability score
• Language identification
• Topic modeling (e.g. LDA probability)
• Genre
• Era of compilation
• Book length (e.g. short or long)
• Concordance index (indexing with context)
20. Study Objective
• Study how large scale text analysis of the HT
repository can be carried out using public
compute resources
• Task: compute word frequency over entire
public domain portion of HT corpus
• Uses Blacklight, an HPC machine at the Pittsburgh
Supercomputing Center
21. Experimental Environment and Results
• Dataset
2,592,210 volumes totaling 2.1 TB, divided into 1024
partitions of roughly 2 GB each
• Computation platform
XSEDE Blacklight, 1024-core, each 2.27 GHz, 8.2
TB memory. Each core processes one partition
• Results
Whole corpus word count completed in 1,454
seconds or 24.23 minutes
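The one-partition-per-core layout above is a classic map-reduce word count: each core counts its own partition, then the partial counts are merged. The sketch below mirrors that structure at toy scale with Python's `multiprocessing`; the slides do not say which parallel framework was actually used on Blacklight, and file I/O, OCR cleanup, and tokenization are simplified away here.

```python
from collections import Counter
from multiprocessing import Pool

def count_partition(texts):
    """Map step: word-frequency count for one partition
    (here, a partition is just a list of volume texts)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def corpus_word_count(partitions, workers=2):
    """Reduce step: run one count per partition in parallel,
    then merge the partial Counters into a corpus-wide total."""
    with Pool(workers) as pool:
        partials = pool.map(count_partition, partitions)
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    partitions = [["the whale the sea"], ["the ship sails"]]
    print(corpus_word_count(partitions)["the"])  # 3
```

On Blacklight the same shape scales to 1024 partitions on 1024 cores, with each partition read from disk rather than held in memory.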
24. Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University
GENDER IDENTIFICATION OF HTRC
AUTHORS BY NAMES
Ref talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
25. Gender Identification of Text
• Can we use author names in bibliographic records to
identify gender?
• 2.6 million bibliographic records
– Extracted personal author data
– MARC fields 100 and 700, subfields a–d
• 606,437 unique personal author strings
• Bibliographic data is not fielded like patent names
• Relying on standard cataloging practice
– Last name, first name middle name, titles/honorifics,
dates
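The cataloging convention above (surname, forename(s), honorifics, life dates) makes a rule-based gender guess possible: strip the dates, skip honorifics, and look the first forename up in gendered name lists. The sketch below uses tiny illustrative name sets; the actual study drew on VIAF, census data, baby-name sites, and the EU patent-research name lists.

```python
import re

# Tiny illustrative name lists; the real study used far larger sources.
FEMALE = {"mary", "elizabeth", "stacy", "jane"}
MALE = {"john", "william", "george"}
HONORIFICS = {"sir", "baron", "count", "duke", "lady", "mrs.", "miss", "sister"}

def gender_from_heading(heading):
    """Guess gender from a MARC-style personal-name heading such as
    'Shelley, Mary Wollstonecraft, 1797-1851'. Returns 'female',
    'male', 'ambiguous', or 'unknown'."""
    # Drop trailing life dates, e.g. ', 1797-1851'
    heading = re.sub(r",?\s*\d{3,4}-(\d{3,4})?\.?$", "", heading)
    parts = [p.strip() for p in heading.split(",")]
    if len(parts) < 2:
        return "unknown"  # no forename portion to inspect
    # First token of the forename portion that is not an honorific
    tokens = [t for t in parts[1].split() if t.lower() not in HONORIFICS]
    if not tokens:
        return "unknown"
    first = tokens[0].lower().strip(".")
    in_f, in_m = first in FEMALE, first in MALE
    if in_f and in_m:
        return "ambiguous"
    if in_f:
        return "female"
    if in_m:
        return "male"
    return "unknown"

print(gender_from_heading("Shelley, Mary Wollstonecraft, 1797-1851"))  # female
```

A known limitation of any name-based approach: pen names mislead it, e.g. "Eliot, George" (Mary Ann Evans) resolves to male.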
27. Sources of Data
• The Virtual International Authority File
– Hosted by OCLC
• Harvested names from multiple data sources
– Census bureau
– Baby name sites
• EU Patent Research names list (Frietsch et al, 2009; Naldi
et al. 2005)
– Developed an extensive list of European names
• Titles and honorifics
– Multiple web resources
– Sir, Baron, Count, Duke, Father, Cardinal, etc.
– Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
28. Initial Gender Results
• Approximately 80% of name strings have initial
gender identification
– Female
• 59,365
• 10%
– Male
• 425,994
• 70%
– Unknown
• 114,204
• 19%
– Ambiguous
• 5,965
• Less than 1%
29. Results by Data Source
Against the whole set of name strings
• VIAF
– 19% hit rate
• Web Names
– 54% hit rate
• Patent names
– 8% hit rate
30. • For details http://www.hathitrust.org/htrc/faq
• General contact info
– Beth Plale, Director HTRC, plale@indiana.edu
• Requests for capability, interest
– Miao Chen, Asst. Director HTRC
miaochen@indiana.edu
Speaker notes
Need to add slide that combines images of the blacklight view and the portal showing a workset. Building a query needs to go after architecture diagram.
HTRC hides the complexity of analytics. In this sense it is like Google search: a simple interface that hides the complexity of searching billions of pages. The kinds of things returned from HTRC interaction are spatial relationships of words (and their frequency, obviously), statistical plots, or tabular information.
Shifting the complexity-hiding interface to the right, we open up the cloud to see what’s inside. HTRC at its simplest has 1) algorithms, drawn from SEASR and from other analysis tool suites including Mahout and MapReduce; 2) the HT corpus (and subsets of the corpus that users either hold personally as part of a workset or that are publicly available); and 3) other data sets that are used. HTRC brokers the bringing together of these pieces so that computation can take place on a resource like Big Red II (or XSEDE). Note that there is an arrow from the compute engine to the complexity-hiding interface. This is because researcher interaction with the texts isn’t an automated workflow; it is one requiring levels of interaction with the computation as it is running.
A workset is created using this page, then gets connected with algorithms and other resources.
Add WSO2 logo. Add Big Red II and Quarry logos.
tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf–idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-word filtering in various subject fields including text summarization and classification.[1] One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan in 2003.[1]
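The tf–idf definition above can be made concrete in a few lines. This sketch uses raw term frequency and idf = log(N / df); as the note says, many weighting variants exist, and this is just one of the simplest.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf scores for a list of tokenized documents.

    tf is the raw count of a term in a document; idf is
    log(N / df), where df is the number of documents
    containing the term. Returns one {term: score} dict
    per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

docs = [["the", "whale", "the", "sea"],
        ["the", "ship"],
        ["whale", "oil"]]
scores = tf_idf(docs)
# 'the' appears in 2 of 3 docs, so its idf is low; 'oil' appears
# in only one, so it scores highest within its document.
```

Note that a term appearing in every document gets idf = log(1) = 0, which is how tf–idf performs the stop-word filtering mentioned above.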
In the experiment we use 1024 cores, and each core processes the same amount of data, i.e. 2.1 TB / 1024. The histogram distribution (using 100 bins) shows the time consumed by the processes, so the y-axis indicates the number of processes that finish within the time indicated on the x-axis. The distribution shows that the variance is small; most processes finish within 15-16 minutes.
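The 100-bin histogram described in that note is a simple binning of per-process completion times. A minimal sketch, using synthetic timings since the real per-process numbers are not in the slides:

```python
def histogram(times, bins=100):
    """Bin completion times: returns a list where entry i is the
    number of processes finishing within the i-th time interval
    (the y-axis of the histogram described above)."""
    lo, hi = min(times), max(times)
    width = (hi - lo) / bins or 1.0  # guard against identical times
    counts = [0] * bins
    for t in times:
        idx = min(int((t - lo) / width), bins - 1)  # clamp max into last bin
        counts[idx] += 1
    return counts

# Synthetic per-core seconds for 1024 processes, for illustration only
times = [900 + (i * 37) % 560 for i in range(1024)]
counts = histogram(times, bins=10)
print(sum(counts))  # 1024 -- every process lands in exactly one bin
```

A narrow distribution (most mass in a few adjacent bins) is what indicates the small variance reported for the Blacklight run.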