This document discusses content mining of scientific literature in Europe. It describes what content mining is and why it is useful, particularly for tasks like mapping clinical trials to related papers. However, copyright restrictions and technical obstacles imposed by publishers currently limit widespread content mining. The document advocates for policies and technologies that enable open content mining of facts and data from the complete scientific literature for reproducible research.
Content Mining of Science in Europe
1. Content Mining of Science in Europe
Peter Murray-Rust,
ContentMine.org, University of Cambridge & Open Forum Europe
OFA, Brussels, BE 2015-10-22
What is mining?
Why is it useful?
How YOU can do it without using publishers’ APIs
Copyright and restrictive practices are still a major problem
2. The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org
4. Use Cases of ContentMining
• Epidemiology of obesity (Cambridge U)
• (OKF, OpenTrials) Mapping clinical trials
repositories to reports in scientific literature
• Mining chemical reactions from patents
• Creating a bacterial supertree-of-life from
4500 papers
5. Polly has 20 seconds to read this paper…
…and 10,000 more
6. ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due
to time pressures, we split this between 6
researchers. It took about 2-3 days of work
(working only on this) to get through
~1,600 papers each. So, at a minimum this
equates to 12 days of full-time work (and
would normally be done over several weeks
under normal time pressures).”
7. 400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s
happened in the last 6 years?
Search the whole scientific literature
For “2009-0100068-41”
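A minimal sketch of such a search, assuming the literature is already available locally as plain text. The function name `find_identifier` and the three toy documents are hypothetical and are not part of the ContentMine software; only the identifier string comes from the slide.

```python
import re
from typing import Dict, List


def find_identifier(corpus: Dict[str, str], identifier: str) -> List[str]:
    """Return the keys of all documents whose text mentions the identifier."""
    # Match the identifier literally, and refuse matches that are embedded
    # in a longer run of word characters or hyphens (e.g. a longer number).
    pattern = re.compile(r"(?<![\w-])" + re.escape(identifier) + r"(?![\w-])")
    return [doc_id for doc_id, text in corpus.items() if pattern.search(text)]


# Toy corpus (invented text, for illustration only).
corpus = {
    "paper-a": "Registered trial 2009-0100068-41 enrolled adults.",
    "paper-b": "No registry number is reported in this paper.",
    "paper-c": "See 2009-0100068-410 for a different record.",
}

print(find_identifier(corpus, "2009-0100068-41"))
```

Only `paper-a` is reported: `paper-c` contains the same digits as part of a longer number, which the boundary checks reject. The hard part in practice is the step this sketch assumes away, namely getting the full text of the whole literature in the first place.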
12. Open Content Mining of FACTs
Machines can interpret chemical reactions
We have mined 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B EUR.
20. Copyright and Mining
• PMR-premise: You cannot do reproducible
scientific mining and avoid violating copyright.
• UK (“Hargreaves”) 2014 legislation:
– “personal” “non-commercial*” “research” “data
analytics”
– legitimizes copying (?to disk), but not publishing
*teaching, textbooks, etc. may be “commercial”
21. Publishing and ICT
Trust these: Elsevier, Mendeley (Elsevier), Digital Science/Macmillan, Wiley, etc.
as much as you trust these: Microsoft, Facebook, Apple, etc.
22. STM Publishers prevent Mining
• FUD & disinformation about legality (Elsevier)
• Monopolies on infrastructure (“API”s, CCC
Rightfind)
• Technical obstruction (Wiley Captcha,
Macmillan Readcube)
• Restrictive contracts with libraries (ALL) [1]
• Wasting my/our time (ALL)
[1] [You may not] utilize the TDM Output to enhance … subject repositories
in a way that would [… ] have the potential to substitute and/or replicate
any other existing Elsevier products, services and/or solutions.
23. WILEY … “new security feature… to prevent systematic download of content”
“[limit of] 100 papers per day”
“essential security feature … to protect both parties (sic)”
CAPTCHA
User has to type words
24. ContentMine working with Libraries
• Cambridge: Library, Plant Sciences,
Epidemiology, Chemistry
• Cochrane Collaboration on Systematic Reviews
of Clinical Trials
• FutureTDM (H2020, LIBER)
• Running workshops and training
Editor’s notes
Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture.
In this talk, I’m going to stress the importance of data in a specific format and its utility for automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor within the realm of validation and metabolism.
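The idea of pluggable, context-specific extractors can be sketched roughly as follows. This is not AMI’s actual API: the registry `VISITORS`, the `visitor` decorator, `run_visitors`, and the two toy extractors are all hypothetical, and they work on plain text rather than on SVG.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry mapping a visitor name to an extractor function.
VISITORS: Dict[str, Callable[[str], List[str]]] = {}


def visitor(name: str):
    """Decorator that plugs an extractor into the registry under a name."""
    def register(func: Callable[[str], List[str]]):
        VISITORS[name] = func
        return func
    return register


@visitor("year")
def year_visitor(text: str) -> List[str]:
    # Toy rule: four-digit years starting with 19 or 20.
    return re.findall(r"\b(?:19|20)\d{2}\b", text)


@visitor("percent")
def percent_visitor(text: str) -> List[str]:
    # Toy rule: integer or decimal numbers followed by a percent sign.
    return re.findall(r"\d+(?:\.\d+)?%", text)


def run_visitors(text: str, names: List[str]) -> Dict[str, List[str]]:
    """Apply the selected visitors to one document and collect their output."""
    return {name: VISITORS[name](text) for name in names}


print(run_visitors("In 2015 the yield rose to 87.5%.", ["year", "percent"]))
```

The point of the design is that a new domain (chemistry, species names, and so on) is added by registering one more extractor, without touching the code that feeds documents through the pipeline.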