These are slides from a workshop held during the RLUK2017 Conference http://rlukconference.com/ presented by Dr Danny Kingsley, Dr Deborah Hansen and Anna Vernon.
The Abstract:
"The library community has been almost silent on the issue of text and data mining (T&DM) partly due to concerns about the risk of having institutions ‘cut off’ from subscriptions due to large downloads of research articles for the purpose of mining. This workshop is intended to identify where the information rests about T&DM - including looking at the details as they appear in Jisc negotiated licenses - consider some case studies and develop together a set of principles that identify the position of research libraries on the issue of T&DM."
Developing a research Library position statement on Text and Data Mining in the UK
1. OS
C
Office of Scholarly Communication
Developing a research library
position statement on
Text and Data Mining
in the UK
RLUK 2017
Dr Danny Kingsley, Dr Debbie Hansen - University of Cambridge
Anna Vernon - Jisc
British Library - 9th March 2017
7. What is TDM?
“the use of large online text collections to
discover new facts and trends about the world
itself” (Hearst, 1999†)
†Hearst, M. A., Untangling Text Data Mining, Proceedings of ACL'99:
the 37th Annual Meeting of the Association for Computational
Linguistics, University of Maryland, June 20-26, 1999 (invited paper)
8. Why do you do it?
Fast literature review
Extract new facts
Answer research questions
Access wide range of sources for a topic
Saves time
More bang for same cost
Not achievable through manual searches
Research
Innovation
11. What is the legal situation?
Hargreaves exception
Licensing
12. Hargreaves Review
• Independent review of UK Intellectual
Property system, focus UK Copyright law
• Recommendations from Hargreaves Review:
– introduce a copyright exception
– allow TDM for non-commercial purposes
– prohibit exclusion of TDM through
contract
• Government introduction of exception June
2014:
– if user has lawful access, works can be
copied for TDM for non-commercial
research
14. Comment from OASPA
• Open Access Scholarly Publishers
Association
• Expressed their support of TDM efforts
– By signing Hague Declaration in 2016
• Support the adoption of best practice and
behaviours regarding TDM
• Statement:
– ‘reasonable best practice for those
engaged in TDM to inform publishers …
that such content mining is planned’
OASPA, December 2016, http://oaspa.org/oaspa-comment-text-data-mining-proposed-eu-copyright-reform/
15. University College London study
£500K per annum
• Estimate for TDM activity compliance
– Before TDM exception
• Figure from OA staffing costs
– OA staffing costs to monitor TDM activity
– TDM clearance fees with rightsholders
16. Do the publishers have statements?
Oxford University Press
● No permission needed for non-commercial TDM
○ But can contact for consultation on TDM (e.g. to avoid
technical safeguard triggers)
● Can contact to request TDM for commercial purposes
https://academic.oup.com/journals/pages/help/third_party_data_mining
Royal Society
● Support use of computers to extract from scholarly
publications
● Members of subscribing institutions have permission to mine
○ For non-commercial and commercial purposes
○ Respect copyright and cite where possible
● Let them know when you intend to do TDM
○ To prevent automatic lock-out
https://royalsociety.org/journals/ethics-policies/data-sharing-mining/
Now also: Cambridge University Press, International Union of Crystallography
17. Do the publishers have statements?
Elsevier
Different licenses, different rules, e.g.:
● CC BY - yes to TDM
● CC BY-NC-SA - yes to TDM for non-commercial purposes
● CC BY-NC-ND - no to TDM
● Open Archives (content made available after an embargo)
○ Yes to TDM for non-commercial purposes and cite
authors and source
https://www.elsevier.com/connect/what-changes-when-publishing-open-access-understanding-the-fine-print
18. Hindawi
• Hindawi facilitate the use of their content
for data mining purposes:
https://www.hindawi.com/corpus/
• Full XML content available for download
– as single .zip file
– .zip file updated daily
– (XML files adhere to the US National Library
of Medicine Document Type Definition)
• Not advertised widely
– over last 12 months, 1,770 unique visits
– between 60 and 90 downloads per month
– roughly 720-1,080 downloads for the year
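As a rough illustration of working with a corpus delivered this way, the following Python sketch lists article titles from a zip archive of XML files. It assumes that each article is a separate .xml member of the archive and that the title sits in an `article-title` element, as in the NLM journal DTDs. The function name `article_titles` is hypothetical, and the layout of the actual Hindawi archive has not been checked here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def article_titles(archive):
    """Yield (member name, title) for each .xml file in a zip archive.

    `archive` may be a file path or a binary file-like object.
    The `article-title` element name is an assumption based on the NLM DTDs.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if not name.lower().endswith(".xml"):
                continue  # skip anything that is not an XML article
            with zf.open(name) as fh:
                root = ET.parse(fh).getroot()
            node = root.find(".//article-title")
            # Join nested text (e.g. italic spans) into one plain string.
            title = "".join(node.itertext()).strip() if node is not None else None
            yield name, title
```

A call such as `for name, title in article_titles("corpus.zip"): print(name, title)` would then walk the archive, where `corpus.zip` stands for whatever file was downloaded. Files that rely on entities defined only in an external DTD may need a different parser.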
19. Example situation
Negotiations between a publisher and
Cambridge University in May 2015 over
TDM.
– Original contract would have been binding
for whole University
– Data only available on a hard drive and not
downloadable onto a server
– Charge of £1,100 for the cost of the hard
drive
– Substantial number of limitations and
restrictions
20. Group discussion - about your experiences
Talk about any experiences you have had with TDM. Feedback into
the group:
* Challenges encountered?
* Concerns?
* Successes?
21. Feedback from discussion
Situation
• Hard drive provided.
• Do not know what is being asked for - who is
responsible?
• What is the IT responsibility here?
• Copyright and compliance officer needed to do a lot
of work.
Solutions:
• Clearer understanding of the licensing situation.
• Mechanism of where to go for advice.
• Procedures of what to do with it - policy issue
22. Feedback from discussion 2
• Issue:
– Researcher behaviour - academics not concerned by copyright
• Library implications:
– Librarians are not always aware of TDM taking place.
– It would help to have a better understanding.
– New legislation, so we are currently reactive to it
– Change of role of the library - traditionally to preserve access to
items.
– TDM could threaten access, so internal disquiet
– Would like to be enabling this activity rather than saying no you
can’t
• Solutions?
– It would help if publishers delivered material in different ways - not on a hard
drive. Could this be part of a platform?
– Good if material was produced in a format that allowed TDM (at
no extra cost)
23. International activity in this area
There are several large initiatives looking at Text and Data Mining
24. Work in this area - FutureTDM
• Background: America and Asia lead activity in
TDM
• FutureTDM seek to increase TDM activity in EU
• Engagement with stakeholders (e.g. researchers,
developers, publishers)
–Why is uptake lower in EU?
–Raise awareness
–Develop solutions
25. Work in this area - European Commission
Proposal:
New copyright exception for research
organisations carrying out research in public
interest
– to carry out TDM of copyright protected
content
– if they have lawful access (e.g.
subscription)
– without prior authorisation
26. You can have your say on the EU reform
• Sign the Hague Declaration and ask your researchers to sign it
http://thehaguedeclaration.com
– (not just about copyright reform but about advancing research more
generally)
• Ask your institutions to support this joint letter for LIBER, LERU etc.
http://libereurope.eu/blog/2017/01/10/eu-copyright-reform-liber-joins-leading-research-groups-call-change/
• Write to your local MEP saying why you support a European exception on
TDM. Mary Honeyball and Catherine Stihler are two key UK MEPs.
• Collect examples of TDM projects, problems, solutions, share and
promote them. Make the UK Intellectual Property Office aware of issues
that you have with the UK legislation.
• Once the report goes through the European Parliament it will go to the
European Council (EU heads of state) so contacting your national
representatives (ministers for research etc., IP Office) will be key at this
point.
European Commission MEMO (MEMO/16/3011)
27. UK-based TDM activity
British Library
ContentMine - Wikimedia project
ChemDataExtractor
NaCTeM
28. British Library EThOS
• E-Thesis On-line Service
• British Library opportunity for PhD student
placement*
• TDM on 150,000 theses held in EThOS
–Extract new metadata information
–E.g. Names of supervisors from
Acknowledgements, funding information
–Outputs feed into future initiatives
British Library, 2017, https://www.bl.uk/news/2016/november/british-library-phd-placements-call-for-applications
*Applications closed 20 February 2017
29. ContentMine and WikiFactMine
ContentMine supplies open
source TDM software to
access and analyse
documents
Project grant to develop WikiFactMine
– ContentMine partnering with Wikimedia Foundation
– Project aims to make scientific data available to editors of Wikidata
and Wikipedia
http://contentmine.org/
30. ChemDataExtractor
• Molecular Engineering Group, University of
Cambridge
• Chemical information from scientific
documentation (e.g. text, tables)
• Open source software package
• Extracted data for onward analysis
Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904, doi:10.1021/acs.jcim.6b00207
31. Biomedical Text Mining
• Manchester Institute of Biotechnology -
National Centre for Text Mining (NaCTeM)
• Text mining tools and services in the
biomedical field
http://www.nactem.ac.uk/index.php
32. The problem we are trying to solve
Libraries are worried about getting cut off from their subscription by
publishers due to large downloads of papers through TDM activity
33. Being cut off - how it works
• Publishers’ systems pre-programmed to react to
suspicious activity
• TDM may invoke automated investigation, may
cause access block
• Universities need to maintain a support mechanism
to ensure continuity of access
–Require workflows for swift resolution, fast
communication, team of communicators
• Also requires education of researchers of
potential issues
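To make the mechanism concrete when educating researchers, here is a minimal sketch of the kind of rule an automated system might apply: count requests within a sliding time window and flag the client once a threshold is exceeded. The class name, the threshold and the window length are all hypothetical. Real publisher systems are not described in these slides and will differ.

```python
from collections import deque


class DownloadMonitor:
    """Flag a client whose request rate exceeds a threshold (illustrative only)."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests      # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window length
        self._times = deque()

    def record(self, timestamp):
        """Record one request at `timestamp` (seconds).

        Returns True if the number of requests inside the window now
        exceeds the threshold, i.e. the activity would look suspicious.
        """
        self._times.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        while self._times and timestamp - self._times[0] >= self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_requests
```

With a limit of 3 requests per 60 seconds, a fourth request inside the same minute would be flagged, while a request arriving after a long pause would not. A mining script that paces its requests stays under such a limit, which is one reason to talk to the publisher about acceptable rates before starting.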
34. Discussion
Write on three separate post-it notes your top three reasons
why your organisation is not actively supporting TDM (yellow
post-its)
If your organisation is supporting TDM, write the top three
challenges you face (green post-its)
35. Discussion feedback: Why not supporting?
● Practical
○ Challenges of handling physical media
○ Risk of lockout
● Lack of demand
○ We are not getting enquiries. Perhaps they are not coming to the Library. Someone
in IT supporting research computing may not even pass on the queries.
Internal discussion needed.
○ Not much call
● Who is responsible?
○ No institutional view on TDM because the issues are not raised at academic
level. - POLICY NEEDED
○ How can a library provide a service - responding to individual queries, how
do we scale it up?
○ Not joined up - assumption in the discussion that the Library is at the
centre of all this and we are not joined up as organisations
36. Discussion feedback: Challenges?
● When doing research within a specific environment it
should be relatively straightforward if it remains within the
environment.
● Complicated
○ In order to provide access to the data, there are
requirements at the content owner level - everyone
needs to understand the need.
○ Intrusive on the researcher process.
○ Need to ensure it is not commercial use, and ensuring
people know their responsibilities
● Time
○ A contract with a particular publisher to allow our
researchers to TDM took two years to finalise.
37. Proposal
To draft a statement for a Service Level
Agreement for publishers to assure us that
if the activity is legal we will be reinstated
within 1 hour (or something like that).
Discuss - What are the issues if we did
this?
38. Discussion feedback 1
Expectation of publishers?
• Publishers contact the library to give a grace period to
investigate rather than being cut off
• Way publisher platforms operate -
– LOCKSS crawls publisher software without getting
trapped.
– This could work in the same way with a bank of IP
addresses that is secured for this purpose.
– Avoid some of the manual work. Third party IP registry.
• Basis for the conversation over the SLA
– The law is on the subscriber’s side if everyone is doing it
legally.
– We need an understanding of the extent of infringing
activity going on with University networks
(understanding that people can ‘mask’ themselves).
– Useful for thinking of thresholds.
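The idea of a bank of registered IP addresses can be sketched as a simple membership check. The example below is an assumption about how such a registry might be consulted, not a description of LOCKSS or of any publisher platform. The ranges are reserved documentation addresses and the function name is hypothetical.

```python
import ipaddress

# Hypothetical registry of ranges an institution has declared for TDM.
# These are documentation-only ranges, not real university addresses.
REGISTERED_TDM_RANGES = [
    ipaddress.ip_network("192.0.2.0/28"),
    ipaddress.ip_network("2001:db8:10::/48"),
]


def is_registered_tdm_address(address, ranges=REGISTERED_TDM_RANGES):
    """Return True if `address` falls inside any registered TDM range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ranges)
```

A platform holding such a list could route requests from registered addresses past its automated blocking and review them against agreed thresholds instead. Who would maintain the list, and whether a third party would hold it, is exactly the open question raised in the discussion.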
39. Discussion feedback 2
Expectation of libraries?
• Would not like to keep a register.
• Range of IP addresses to be part of the license
agreement
• Create a safe space for TDM? Or is this a barrier?
• Trying to design something which is bolted onto a
different use of content. Large scale computational
reading is something totally different.
• Two issues
– How do we manage the licenses we are
currently signed up to?
– How do we manage licensing into the future so
we separate the different uses?
40. Discussion feedback 3
Time frames?
– Being cut off for a week or two weeks with no
redress is unusual at best!
41. Agreed Expectations:
1. Don’t cut us off! Have a conversation first
(and if you want to cut us off - prove
there are all these activities happening in
the UK)
2. If you do cut us off and it turns out to be
legitimate then we expect compensation
for the time we were cut off
3. Mechanisms for TDM where certain
behaviours are expected - built into
separate licensing agreements for TDM