Amanuens.is HUmans and machines annotating scholarly literature

Amanuens.is
ContentMine
IAnnotate!, Berlin, DE, 2016-
05-18
Peter Murray-Rust
[1]University of Cambridge [2]TheContentMine
ContentMine + Hypothesis annotate the scientific literature!
100, 000 + per day.
Live demos!

Scholarly publishing
• 10, 000 articles per day
• 20 Billion USD / year [1]
• Totally and scandalously broken. Primary revenue
comes from throttling the flow of knowledge
• Massive disruption likely (Sci-Hub)
• Mining and annotation liberation tools.
[1] (2x digital music industry!)

• Science can be read and understood by
human-machine Amanuensis-symbionts.
• Amanuenses based on Wikipedia Wikidata,
software (ContentMine’s AMI)
• Results are fed back into WP and WikiData
• Annotation through Hypothes.is
http://en.wikipedia.org/wiki/Symbiosishttp://en.wikipedia.org/wiki/Eric_Fenby

What plants produce Carvone?
https://en.wikipedia.org/wiki/Carvone

WIKIDATA

catalogue
getpapers
query
Daily
Crawl
EuPMC, arXiv
CORE , HAL,
(UNIV repos)
ToC
services
PDF HTML
DOC ePUB
TeX XML
PNG
EPS CSV
XLSURLs
DOIs
crawl
quickscrape
norma
Normalizer
Structurer
Semantic
Tagger
Text
Data
Figures
ami
UNIV
Repos
search
Lookup
CONTENT
MINING
Chem
Phylo
Trials
Crystal
Plants
COMMUNITY
plugins
Visualization
and Analysis
PloSONE, BMC,
peerJ… Nature, IEEE,
Elsevier…
Publisher Sites
scrapers
queries
taggers
abstract
methods
references
Captioned
Figures
Fig. 1
HTML tables
30, 000 pages/day
Semantic ScholarlyHTML
Facts
Latest 20150908

Mining for phytochemicals
• getpapers –q carvone –o carvone –x –k 100
Search “carvone”, output to carvone/, fmt XML, limit 100 hits
• cmine carvone
Normalize papers;
search locally for species, sequences, diseases, drugs
Results in dataTables.html
and results/…/results.xml (includes W3C annotation)
• python cmhypy.py carvone/ -u petermr <key>
send annotations -> hypothes.is

Annotation (entity in context)
prefix
surface
label
location
suffix

articles facets
gene disease drug Phyto
chem
species genus words

Remote &
Local papers
Disease
ICD-10
phytochemicals
species
Commonest
words

Annotation sent to hypothes.is
prefix
suffix
source
user
text
uri
maybe 100+ annotations per paper
text

@Senficon (Julia Reda) :Text & Data mining in times of
#copyright maximalism:
"Elsevier stopped me doing my research"
http://onsnetwork.org/chartgerink/2015/11/16/elsevi
er-stopped-me-doing-my-research/ … #opencon #TDM
Elsevier stopped me doing my research
Chris Hartgerink

I am a statistician interested in detecting potentially problematic research such as data fabrication,
which results in unreliable findings and can harm policy-making, confound funding decisions, and
hampers research progress.
To this end, I am content mining results reported in the psychology literature. Content mining the
literature is a valuable avenue of investigating research questions with innovative methods. For
example, our research group has written an automated program to mine research papers for errors in
the reported results and found that 1/8 papers (of 30,000) contains at least one result that could
directly influence the substantive conclusion [1].
In new research, I am trying to extract test results, figures, tables, and other information reported in
papers throughout the majority of the psychology literature. As such, I need the research papers
published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research
papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account
potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention
to redistribute the downloaded materials, had legal access to them because my university pays a
subscription, and I only wanted to extract facts from these papers.
Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days.
This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.
Approximately two weeks after I started downloading psychology research papers, Elsevier notified my
university that this was a violation of the access contract, that this could be considered stealing of
content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading
(which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly
hampering me in my research.
[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The
prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22.
doi: 10.3758/s13428-015-0664-2
Chris Hartgerink’s blog post

Amanuens.is HUmans and machines annotating scholarly literature

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (12)

Ähnlich wie Amanuens.is HUmans and machines annotating scholarly literature

Ähnlich wie Amanuens.is HUmans and machines annotating scholarly literature (20)

Mehr von petermurrayrust

Mehr von petermurrayrust (18)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Amanuens.is HUmans and machines annotating scholarly literature