Remarrying research and collection services around access to corpora and text mining: are new technical literacy skills needed? Presented by Ingrid Mason (Deployment Strategist, AARNet) at the Research Support Community Day 2018.
1. Are new technical literacy skills needed?
Remarrying research and collection services around access to corpora and text mining.
INGRID MASON
DEPLOYMENT STRATEGIST
3. Text & data mining (TDM) is being used by a range of researchers to target relevant literature and in HASS research.
More research support will need to be provided.
Should TDM services be coordinated nationally?
4. Many many questions
What library technical skills are needed (if there is a growing research support need)?
Where do researchers go if they want to find, use, move, store or create a corpus?
How do researchers learn to build, evaluate, and text mine a corpus?
Where can/does/should this specialist service sit (in Library Research Support or in
eResearch or in Faculty or in national research infrastructure services)?
Psst. I don’t have answers, just the questions at this point. Sorry!
5. O M G O M G O M G
OK, what’s a corpus? Find a definition, somewhere reliable [searches the web].
What does a corpus look like? Linguists will know this [searches the web].
How on earth do you “make that blob of stuff accessible”? [compute/storage?]
How big is that text blob and what’s it made of? Corpus analyst? [new job title?]
Who do I know that knows how to build a corpus? Ah, Steve Cassidy from Alveo VL.
What makes for a well balanced/formed corpus? Breathe, reach for library skills.
What about commercially hosted text blobs? Read: Kylie Poulton’s VALA 2018 paper.
6. I’m a corpus building & TDM novice - I feel like an imposter.
I’m old style but I’d like to give this a go. Would you?
Schonfeld, Roger C & Christine Wolff-Eisenberg (2017). Taking a Closer Look at Talent Management: Findings from the US Library Survey, 10 April 2017. Ithaka S+R Blog. http://www.sr.ithaka.org/blog/taking-a-closer-look-at-talent-management/ Last accessed: 18/04/2017
12. Alan Liu’s DH Toychest
Data Collections and Datasets
Question: How does this arrangement of resources in Liu’s DH Toychest change my
understanding of collecting resources for research and supporting research?
Answer: Quite a lot, I feel out of my depth, but also very intrigued and my fingers are
tingling. Why?
Challenge: I need to start looking into corpora and have a go at constructing a corpus
(hint: two projects this year).
14. Library Technical Skills
Research support in:
Research Data Management / Digital Scholarship / Digital Curation / Research Techniques
Using:
IPython (now Jupyter) notebook - Natural Language Toolkit / Library Carpentry or Data Carpentry or Software Carpentry / Text Mining with R (O’Reilly)
Psst. We aim for Jupyter notebooks connected to CloudStor (one notebook per person to play with).
15. A Trend
Expertise lies in the university to support text mining for research and scholarly literature searches.
Biomedical Text Mining
An important problem that text mining attempts to address is
information overload and overlook. Examples of solutions to this
problem include Information Extraction, Document Summarisation,
and Document Classification. In the following example we
demonstrate the use of Text Mining to classify sentences in
biomedical articles and extract key units of information. This
provides a way for busy professionals to reduce the amount of
information to which they are exposed and focus only on salient
aspects in which they are interested.
From Text Mining Collaboration - UNSW
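To make sentence classification a little more concrete, here is a minimal sketch of one common approach, a naive Bayes classifier with add-one smoothing, written with the Python standard library only. It is not the UNSW group's method. The labels ("method", "result"), the class name and the training sentences are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesSentenceClassifier:
    """Tiny multinomial naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()            # sentences seen per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocabulary = set()                  # every word seen in training

    def train(self, labelled_sentences):
        for sentence, label in labelled_sentences:
            self.label_counts[label] += 1
            for token in tokenize(sentence):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)

    def classify(self, sentence):
        total_sentences = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            # Start from the log prior, then add each word's log likelihood.
            score = math.log(count / total_sentences)
            label_total = sum(self.word_counts[label].values())
            for token in tokenize(sentence):
                # Add-one smoothing so unseen words never give zero probability.
                seen = self.word_counts[label][token]
                score += math.log((seen + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training sentences; a real project needs far more labelled data.
TRAINING_SENTENCES = [
    ("We recruited 120 patients from two hospitals.", "method"),
    ("Samples were analysed using mass spectrometry.", "method"),
    ("The treatment reduced symptoms significantly.", "result"),
    ("Mortality was lower in the treated group.", "result"),
]

classifier = NaiveBayesSentenceClassifier()
classifier.train(TRAINING_SENTENCES)
print(classifier.classify("Patients were recruited from hospitals."))
print(classifier.classify("Symptoms were reduced in the treated group."))
```

A classifier like this lets a reader filter an article down to, say, only the sentences that report results. The hard part in practice is the labelled training data, which is where domain and collection expertise comes in.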
16. Learn More
Some history and definition of the terms (and more) is offered.
Text mining & Text analysis - what is the difference?
Text mining began with the computational and information
management fields (e.g. database searching and information
retrieval), whereas Text analysis began in the humanities with the
manual analysis of text (e.g. Bible concordances and newspaper
indexes). More recently, the two terms have become synonymous,
and now generally refer to the use of computational methods to
search, retrieve, and analyse text data.
"Text mining or text analytics is an umbrella term describing a
range of techniques that seek to extract useful information
from document collections through the identification and
exploration of interesting patterns in the unstructured textual
data of various types of documents – such as books, web pages,
emails, reports or product descriptions." (Truyens & van Eecke,
2014)
From: Text Mining and Text Analysis - UQ (Research Techniques)
17. Digital Scholarship
How can research support for corpus building and text mining be scaled up?
Text and data mining
Analyse large scale datasets in your research
Data mining is the process of applying open-ended
computational methods to large scale datasets to
discover new insights that may not be revealed through
targeted smaller scale analyses. When the datasets used
are bodies of text, this process is often termed text
mining and can provide a complementary approach to
traditional close readings of texts. Text and data mining
(TDM) approaches can open up new areas of scholarly
enquiry.
Research Data Management - USYD (RDM)
18. Institutional vs National Services for Corpus Building & TDM?
More library minds and coordination are needed in this space.
What overlap is there with CAUL/CEIRC & NCRIS?
19. Sydney Stock Exchange Records - Institutional
Digitisation for research. AARNet partnership with ANU Library and Noel Butlin Archive.
Stock and Share Lists include ~199 registers of printed and written (copperplate)
information that requires format conversion and automated translation. Records
include company names, stock prices, and share transactions from 1901-1950.
An archival series that can be delivered for search and browse via an interface.
A corpus that can be built and text mined and analysed via an interface.
20. HASS DEVL - National
The Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab (DEVL) will
bring together fragmented data, tools and services into a shared workspace.
Key outcomes from the project will be:
● Lowering barriers to entry for HASS infrastructure
● Increased interoperability between existing HASS platforms
● More joined up data landscape
● Data curation for better reuse, reproduction, and publishing of research data sets
● Game-changing skills and training activities
Funding and co-investment via NCRIS and institutional partners. https://www.ands-nectar-rds.org.au/
22. HASS DEVL
Data curation package
- Datasets sourced from Prosecution Project, NLA/TROVE, SLQ and APO
- Datasets processed via Alveo and AURIN
- Data curation framework between UoM, Alveo, AURIN, and NLA/TROVE
Will these composites of digital objects be a digital collection, a dataset, a data collection,
a series, a demo corpus, a text corpus, or a linguistic corpus?
We will need to explore this question together [please all don your curator’s hat].
23. Digital Collections
● AU government gazettes (NLA)
● QLD records of railway workers / publicans / government workers (SLQ)
● Court records from various states and territories (PP)
● Historical census data (ADA)
● Grey literature (APO)
Trick question: which of these collections could be text mined and/or become a corpus?
27. Text Mining
Identifying linguistic patterns in text (as data)
Categorising, clustering, or identifying named entities
Abstracting, analysing and summarising (the textual content)
Constrained by the extent and scope of the textual data
Using programming languages like R or tools like Voyant
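For anyone wondering what treating text as data looks like in practice, here is a minimal sketch in Python (standard library only) that counts raw term frequencies across a tiny corpus. The sample documents and the short stop-word list are made up for illustration; tools like Voyant or R packages do the same kind of counting with much more refinement.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real projects use a fuller one.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "for", "is", "are", "on"}


def tokenize(text):
    """Lower-case the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(documents):
    """Count raw term frequencies across all documents, skipping stop words."""
    counts = Counter()
    for document in documents:
        counts.update(t for t in tokenize(document) if t not in STOP_WORDS)
    return counts


# Made-up sample documents, not taken from any real corpus.
sample_docs = [
    "Digital collections in libraries and archives.",
    "Libraries support digital scholarship and digital curation.",
]

frequencies = term_frequencies(sample_docs)
print(frequencies.most_common(3))
```

The results are only as good as the text that goes in, which is the point of the line above about being constrained by the extent and scope of the textual data.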
28. Text Corpora
The selection, extraction and processing of the text may involve linguistic methods but
may not be for the purpose of studying language, rather to investigate the nature of text
as semantic content.
Take a look at Visualising Raynal - three editions of Guillaume-Thomas Raynal’s Histoire des
deux Indes (1770, 1774, 1780).
Part of the ANU Digitizing Raynal project led by Glenn Roe (working with Centre for
Literary and Linguistic Computing (UoN)).
PDFs from BNF (1770 + 1780) and Bodleian (1774).
29. Corpus (Corpora)
If in doubt - dictionary time!
a : all the writings or works of a particular kind or on a particular subject; especially : the
complete works of an author
b : a collection or body of knowledge or evidence; especially : a collection of recorded
utterances used as a basis for the descriptive analysis of a language
https://www.merriam-webster.com/dictionary/corpus
Some further questions…
I investigated library involvement/participation in the bi-annual aaDH conference over the last few years.
Has there been a history of collaboration? Yes.
Is there mutual benefit through collaboration? Yes.
Let’s take a quick look at some rudimentary results from treating a corpus of text #asData
https://voyant-tools.org/
2016 aaDH DHA Conference Programme
https://voyant-tools.org/?corpus=0fb9b2d0a2a09dede1f2cc0609c52f69
https://voyant-tools.org/?corpus=0fb9b2d0a2a09dede1f2cc0609c52f69&mode=corpus&view=CollocatesGraph
https://voyant-tools.org/?corpus=0fb9b2d0a2a09dede1f2cc0609c52f69&query=data&view=Contexts
Context (11 words) - raw frequencies
Top term in the corpus: “digital”
Addition of the term “librar*”
The word “collections” is highly connected to: “digital” “librar*” “archives”. There is clustering here… common work terrain around collections … of digital stuff.
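The kind of lookup behind these observations can be sketched in a few lines of Python (standard library only). This is not how Voyant is implemented. It is a rough illustration of keyword-in-context and collocate counting with a trailing wildcard such as “librar*”. It assumes the 11-word context means the keyword plus five words either side, and the sample sentence is made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def matches(token, pattern):
    """Match a token against a term; a trailing * matches any ending."""
    if pattern.endswith("*"):
        return token.startswith(pattern[:-1])
    return token == pattern


def contexts(tokens, pattern, window=5):
    """Return (left words, keyword, right words) for every match."""
    hits = []
    for i, token in enumerate(tokens):
        if matches(token, pattern):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits


def collocates(tokens, pattern, window=5):
    """Count raw frequencies of the words found near each match."""
    counts = Counter()
    for left, _, right in contexts(tokens, pattern, window):
        counts.update(left + right)
    return counts


# Made-up sample text, not the conference programme itself.
sample_tokens = tokenize(
    "Digital collections in libraries and archives support research. "
    "Librarians curate digital collections."
)

for left, keyword, right in contexts(sample_tokens, "librar*", window=2):
    print(" ".join(left), "[" + keyword + "]", " ".join(right))

print(collocates(sample_tokens, "collections", window=2).most_common(1))
```

With window=5 each hit spans up to 11 words. A smaller window is used in the example only because the sample text is so short.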
Core community strengths lie across the GLAMS; librarians and archivists emerge in association with discussions about collections and domain expertise.
A distinction was made between demo corpora, linguistic corpora, datasets and collections.
Sydney Stock Exchange Stock and Share Lists include ~199 registers of records written in copperplate that require format conversion and automated translation. Records include company names, stock prices, and share transactions from 1901-1950. Deposit N193 in Noel Butlin Archive.