Every year 500 billion USD of public funding is spent on research, but much of the resulting knowledge lies hidden in papers that are never read. I describe how machines can help us read the literature. However, there is massive opposition from publishers, who are trying to prevent open scholarship and who build walled gardens that they control.
Digital Scholarship
1. Digital Scholarship: Enlightenment or
Devastated Landscape?
Peter Murray-Rust,
University of Cambridge
IT Future Conference, Informatics Forum, Edinburgh, UK 2015-12-17
(Glen Feshie, remains of forest, CC-BY-SA 2.0 Ian Shiell http://www.geograph.org/uk/photo/3944612.jpg )
2. University of Stirling 1972
student occupations and sit-ins
University of Stirling
Used without permission but with thanks and Love
Liverpool , Warwick, Emmanuel Coll Camb., UCL, Glasgow, Middlesex, …
Peter Murray-Rust,
Lecturer
3. Output of scholarly publishing
586,364 Crossref DOIs per month (2015-07) [1]
>2.5 million (papers + supplemental data) /year*
4500 m high per year [2]
Representing ? 500 Billion USD public funding
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
6. Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
13. Systematic reviews of the
Neuroscience literature:
• 30,000 papers in 1 year
• Extraction of data from graphs
Malcolm Macleod, Professor of Neurology and
Translational Neuroscience at the Centre for
Clinical Brain Sciences, University of Edinburgh,
with ContentMine 2015
17. Polly has 20 seconds to read this paper…
…and 10,000 more
18. ContentMine software can cut the effort by 50%
Polly: “there were 10,000 abstracts and due
to time pressures, we split this between 6
researchers. It took about 2-3 days of work
(working only on this) to get through
~1,600 papers each. So, at a minimum this
equates to 12 days of full-time work (and
would normally be done over several weeks
under normal time pressures).”
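The arithmetic behind Polly's estimate can be checked in a few lines. This is only a sketch: the function name `screening_effort` is made up for illustration, and the 50% saving is the slide's claim taken at face value, not a measured result.

```python
def screening_effort(abstracts, researchers, days_per_researcher):
    """Return (abstracts per researcher, total person-days of screening)."""
    per_researcher = abstracts / researchers
    person_days = researchers * days_per_researcher
    return per_researcher, person_days


# Figures quoted by Polly: 10,000 abstracts, 6 researchers, 2-3 days each.
per_researcher, min_days = screening_effort(10_000, 6, 2)
_, max_days = screening_effort(10_000, 6, 3)

# Assumption: the "cut the effort by 50%" claim applies to the whole task.
saving = 0.5

print(round(per_researcher))   # abstracts per researcher (quoted as ~1,600)
print(min_days, max_days)      # person-days at 2 and at 3 days each
print(min_days * (1 - saving), max_days * (1 - saving))
```

At two days per researcher the total is the 12 person-days Polly quotes as a minimum; at three days it is 18.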
19. ContentMine Tools*
http://iucn.contentmine.org (endangered species)
http://fotd.contentmine.org (fact of the day)
http://bubbles.contentmine.org (network analysis of
papers)
*Dr. Mark MacGillivray, Informatics Forum, University of Edinburgh
20. Fact of the Day
• http://fotd.contentmine.co/?s=daily20151209
(images from https://en.wikipedia.org/wiki/Caenorhabditis_elegans CC-BY-SA)
22. http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [digital
scholarship] by all scientists, scholars, teachers,
students, and other curious minds. …
…share the learning of the rich with the poor and the
poor with the rich, … and lay the foundation for
uniting humanity in a common intellectual
conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)
23. DNADigest + ContentMine looking for DNA datasets in the literature
European Bioinformatics Institute, 2015-12-11
24. C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark
27. Chris Hartgerink, University of Tilburg
I am a statistician interested in detecting potentially
problematic research such as data fabrication, which
results in unreliable findings and can harm policy-making,
confound funding decisions, and hampers research
progress.
…I am content mining results reported in the psychology
literature
28. I am a statistician interested in detecting potentially problematic research such as data fabrication,
which results in unreliable findings and can harm policy-making, confound funding decisions, and
hampers research progress.
To this end, I am content mining results reported in the psychology literature. Content mining the
literature is a valuable avenue of investigating research questions with innovative methods. For
example, our research group has written an automated program to mine research papers for errors in
the reported results and found that 1/8 papers (of 30,000) contains at least one result that could
directly influence the substantive conclusion [1].
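To give a flavour of what such a check involves, here is a minimal sketch. It is not the program described in [1]: it only handles z-tests (so the standard library suffices), and the regular expression, the names `Z_RESULT` and `check_reported_results`, the rounding tolerance and the 0.05 threshold are all assumptions made for illustration.

```python
import math
import re

# Hypothetical pattern: matches results written like "z = 2.20, p = .03".
Z_RESULT = re.compile(r"\bz\s*=\s*(-?\d*\.?\d+)\s*,\s*p\s*=\s*(0?\.\d+)")
ALPHA = 0.05  # assumed significance threshold


def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def check_reported_results(text):
    """Recompute p for every z-test found and compare with the reported p."""
    checks = []
    for match in Z_RESULT.finditer(text):
        z = float(match.group(1))
        p_text = match.group(2)
        reported = float(p_text)
        decimals = len(p_text.split(".")[1])
        recomputed = two_tailed_p(z)
        checks.append({
            "z": z,
            "reported_p": reported,
            "recomputed_p": recomputed,
            # Consistent if the reported value could be a rounding of the
            # recomputed one at the reported precision.
            "consistent": abs(recomputed - reported) <= 0.5 * 10 ** -decimals,
            # Flags cases where the error changes the significance decision.
            "changes_conclusion": (reported < ALPHA) != (recomputed < ALPHA),
        })
    return checks


sample = "Group A differed (z = 2.20, p = .03); so did group B (z = 1.50, p = .04)."
for check in check_reported_results(sample):
    print(check)
```

In the sample text the first result is consistent with its test statistic, while the second is not and would also change the significance decision at the assumed threshold.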
In new research, I am trying to extract test results, figures, tables, and other information reported in
papers throughout the majority of the psychology literature. As such, I need the research papers
published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research
papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account
potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention
to redistribute the downloaded materials, had legal access to them because my university pays a
subscription, and I only wanted to extract facts from these papers.
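A limit of 9 papers per minute amounts to spacing requests at least 60/9 seconds apart. The sketch below shows one way such a throttle could be written; it is not Chris's actual script, and the `Throttle` class and its parameters are hypothetical. The clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time


class Throttle:
    """Space actions evenly so that at most `per_minute` happen per minute."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next action may start

    def wait(self):
        """Block until the next action is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.interval


# Seconds between downloads at the 9-per-minute limit mentioned in the post.
print(round(Throttle(9).interval, 2))
```

A download loop would call `wait()` before each request, so the cap holds however quickly individual requests complete.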
Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days.
This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.
Approximately two weeks after I started downloading psychology research papers, Elsevier notified
my university that this was a violation of the access contract, that this could be considered stealing of
content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading
(which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly
hampering me in my research.
[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The
prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22.
doi: 10.3758/s13428-015-0664-2
Chris Hartgerink’s blog post
“Elsevier stopped me doing my research”
29. The Right to Read
is
The Right to Roam
The Right to Mine
Kinder Mass Trespass
used without permission but with love and thanks
30. The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org
34. STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf
(Summary: we have NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or
website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
Licences destroy Content Mining
35. Julia Reda MEP
The current copyright regime is undermining our ability
to produce evidence. It is time that academics in large
numbers … speak up about this issue. Decreasing the very
substantial burdens and transaction costs for research and
education is one of the declared goals of the Commission’s
copyright reform proposal, and the European Parliament has
echoed that sentiment in my report.
Prof Ian Hargreaves:
…make sure that the voices of the digital many
are not drowned out in policy discussions by
the digitally self-interested few.
http://www.create.ac.uk/blog/2015/09/16/epip2015-opening-keynote-response-transcript/
there’s a serious risk of Europe digging itself deeper into a digital black hole on copyright,
36. http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
We were stunned recently when we stumbled across an article by European
researchers in Annals of Virology [1982]*: “The results seem to indicate that
Liberia has to be included in the Ebola virus endemic zone.” In the future,
the authors asserted, “medical personnel in Liberian health centers should be
aware of the possibility that they may come across active cases and thus be
prepared to avoid nosocomial epidemics,” referring to hospital-acquired
infection.
*Still behind a 35USD paywall
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
37. The Publisher-Academic complex[1]
(Diagram. Nodes: Publishers, Academia, Taxpayer, Student, Researcher. Flow labels: “$$”, “$$, MS”, “review”, “in-kind”, “Glory+?”)
[1] The Military-Industrial-Academic complex (1961)
(Dwight D Eisenhower, US President)
38. Panton Principles for Open Scientific Data
Jenny Molloy
Ross Mounce
Sam Moore
Peter Kraker
Rosie Gray
Sophie Kay
PANTON ARMS
Panton Fellows
CC0 2010
http://pantonprinciples.org/about/
41. Thanks to some Children
of the Digital Enlightenment
• David Carroll & Joe McArthur: OAButton
• Rayna Stamboliyska & Pierre-Carl Langlais
• Jon Tennant
• Ross Mounce
• Jenny Molloy
• Erin McKiernan
• Jack Andraka
• Michelle Brook
• Heather Piwowar
• TheContentMine Team
• Mark MacGillivray
• Rufus Pollock
• Jonathan Gray
• Sophie Kay
• Aaron Swartz
• Chris Hartgerink
Jean-Claude Bradley [1] a chemist
developed Open notebook science;
making the entire primary record of a
research project publicly available
online as it is recorded. (WP)
J-C promoted these ideas with
UNDERGRADUATE scientists.
[1] Unfortunately J-C died in 2014;
we held a memorial meeting in
Cambridge
Sophie Kay
42. http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [digital
scholarship] by all scientists, scholars, teachers,
students, and other curious minds. …
…share the learning of the rich with the poor and the
poor with the rich, … and lay the foundation for
uniting humanity in a common intellectual
conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)
43. Discussion
• Let’s concentrate on what we can do to create
positive change, rather than explain why we
can’t do anything.*
* “It’s not our fault, it’s (a) librarians (b) researchers (c) publishers (d) funders (e) governments (f) scholarly societies (g) principals/Vice-chancellors …”
Editor's notes
Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture.
In this talk, I’m going to stress the importance of data in a specific format and its utility for automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process data that is not specific to chemistry. Finally, I will demonstrate some uses of the ChemVisitor in the areas of validation and metabolism.
ChemBark
Elsevier stopped me doing my research
0000-0003-1050-6809
I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress.
To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].
In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.
Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 35KB/s, 0.0021GB/min, 0.125GB/h, 3GB/day.
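The quoted rates follow directly from 30GB over 10 days, as the short check below shows. It assumes decimal units (1GB = 10^9 bytes, 1KB = 1000 bytes), which the post does not state; the function name `download_rates` is made up for illustration.

```python
def download_rates(total_gb, days):
    """Average transfer rates for `total_gb` gigabytes spread over `days` days."""
    per_day = total_gb / days
    per_hour = per_day / 24
    per_minute = per_hour / 60
    # Assumption: decimal units, so 1 GB = 1,000,000 KB.
    kb_per_second = per_minute / 60 * 1_000_000
    return {
        "GB/day": per_day,
        "GB/h": per_hour,
        "GB/min": per_minute,
        "KB/s": kb_per_second,
    }


rates = download_rates(30, 10)
for unit, value in rates.items():
    print(f"{value:.4f} {unit}")
```

Rounded, these match the figures in the post: 3GB/day, 0.125GB/h, 0.0021GB/min and roughly 35KB/s.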
Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.
[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2
[MINOR EDITS: the link to the article was broken, should be fixed now. Also, I made the mistake of using "0.0021GB/s" which is now changed into "0.0021GB/min"; I also added "35KB/s" for completeness. One last thing: I am aware of Elsevier's TDM License agreement, and I nonetheless thank those who directed me towards it.]