ContentMine: Open Data and Social Machines
Peter Murray-Rust
Computation Lab, Univ of Chicago, 2014-11-12
ContentMine: We use machines to 
liberate[1] 100 million facts /yr from 
the scientific scholarly literature and 
make them free for everyone 
(WikiData) 
WikiData and ContentMines are social 
machines 
There are no longer any technical 
obstacles, only people. 
[1] Friday workshop: build your own social machine: scraping XML,
Liberation Software
http://en.wikipedia.org/wiki/Tim_Berners-Lee 
Everything in this presentation is ODOSOS 
(Open Data, Open Standards, Open Source) 
CC0, CC-BY, W3C etc., Apache2, etc. 
Open = “Free to use, re-use and redistribute”
http://contentmine.org 
http://bitbucket.org/petermr 
http://wwmm.ch.cam.ac.uk 
A promise: I (Petermr) will never sell out to non-transparent organizations.
http://www.budapestopenaccessinitiative.org/read 
… an unprecedented public good. … 
… completely free and unrestricted access to [peer-reviewed 
literature] by all scientists, scholars, teachers, 
students, and other curious minds. … 
…Removing access barriers to this literature will 
accelerate research, enrich education, share the 
learning of the rich with the poor and the poor with 
the rich, make this literature as useful as it can be, and 
lay the foundation for uniting humanity in a common 
intellectual conversation and quest for knowledge. 
(Budapest Open Access Initiative, 2002)
Scientific and Medical publication (STM)[+] 
• World Citizens pay $400,000,000,000… 
• … for research in 1,500,000 articles … 
• … cost $300,000 each to create … 
• … $7000 each to “publish” [*]… 
• … $10,000,000,000 from academic libraries … 
• … to “publishers” who forbid access to 99.9% of 
citizens of the world … 
[+] Figures probably ±50%
[*] arXiv preprint server costs $7 USD per paper
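The per-article figures above appear to follow from the totals by simple division. A quick check, using only the numbers quoted on the slide (which are themselves stated to be good to roughly ±50%); the variable names are illustrative:

```python
# Figures quoted on the slide (all approximate, in USD).
research_spend = 400_000_000_000   # paid by world citizens for research
articles = 1_500_000               # articles produced
library_spend = 10_000_000_000     # paid by academic libraries to publishers
arxiv_cost_per_paper = 7           # arXiv preprint server cost per paper

cost_to_create = research_spend / articles    # research cost per article
cost_to_publish = library_spend / articles    # publishing cost per article
ratio = cost_to_publish / arxiv_cost_per_paper

print(f"create:  ${cost_to_create:,.0f} per article")
print(f"publish: ${cost_to_publish:,.0f} per article")
print(f"publishing costs about {ratio:,.0f}x the arXiv figure")
```

The results (about $267,000 and $6,700 per article) round to the $300,000 and $7000 shown on the slide.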
petermr: I believe in Wikipedia 
• 2006 http://en.wikipedia.org/wiki/User:Petermr 
• 2006 started Open Data (term unknown then!) 
• 2009: “the bit of Wikipedia that I wrote is correct” [challenging the 
idea of “WP is junk”] 
• 2009: “Wikipedia is the digital library of this century” 
• 2012: I alert WP that Springer has copyrighted > 1000 of our 
images [Springergate] 
• 2014: “For facts in maths, physical and biological sciences I trust 
Wikipedia.” (Wikimania2014)
A meritocratic 
critical 
volunteer 
community
Volunteer community in chemistry: Open Data/Source/Standards
4 Billion USD on human genome 
yielded 800 Billion USD and 4 M job-years
Gloom Warning
Bad publication wastes science 
…three problems—flawed design, non-publication, 
and poor reporting—together 
meant >85% of research funds were wasted, a 
global total loss >100 billion USD per year. 
[Lancet 2009] 
[Even more] waste clearly occurs after 
publication: from poor access, poor 
dissemination, and poor uptake of the findings 
of research. [PLOS Medicine 2014-05-27]
Publishers’ PDFs destroy science 
PDFs do not contain words 
or subscripts! 
PDFs do not contain tables 
and do not have columns 
SVG is turned into JPEG because it’s easier to process
Elsevier wants to control Open Data 
[asked by Michelle Brook]
Licences destroy Content Mining 
WE WALKED OUT 
• Brit Library 
• JISC 
• RLUK 
• OKFN 
• … 
• Ross Mounce 
• PM-R 
STM Publishers Licence 
2012_03_15_Sample_Licence_Text_Data_Mining.pdf 
(Summary: PMR has NO rights) 
• [cannot publish to: ] “libraries, repositories, or archives” 
• [cannot] “Make the results of any TDM Output available on an externally facing server or 
website” 
• “Subscriber shall pay a […] fee” 
Heather Piwowar: “negotiating with publishers [made me physically ill]”
CLOSED ACCESS MEANS PEOPLE DIE 
CLOSED DATA MEANS PEOPLE DIE
Happiness Restored
The scientist’s amanuensis 
• "The bane of my life is doing things I know computers could do 
for me" (Dan Connolly, W3C) 
Example: A semantic amanuensis could 
• Give me a daily digest of mineralogy papers 
• Extract all the crystal structures from them 
• Compute physical properties with GULP and NWChem 
• Compare the results statistically 
• Preserve and distribute the complete operation 
• Prepare the results for publication 
The semantic web is like having a personal amanuensis
Artificial Intelligence in science 
In 1970 chess and chemistry were the sandboxes for AI. Some 
approaches: 
• Lookup (Knowledge) 
• Natural Language Processing (NLP) 
• Brute force calculation (inc. physical methods) 
• Tree-pruning and heuristics 
• Logic (cf. OWL-DL) 
• Human-machine integration (crowdsourcing) 
• Computer Vision 
Domain-specific Turing test: Can a machine pass a first-year 
chemistry exam?
The Semantic Web 
"The Semantic Web is an extension of the 
current web in which information is given well-defined 
meaning, better enabling computers 
and people to work in cooperation." 
Tim Berners-Lee, James Hendler, Ora Lassila, The 
Semantic Web, Scientific American, May 2001 
CC-BY-SA Images from Wikipedia
Linked Open data from Wikipedia 
“Which Rivers flow into the Rhine and are longer 
than 50 kilometers?” or “Which Skyscrapers 
in China have more than 50 floors and have 
been constructed before the year 2000?” 
Open Crystallography? 
“Which countries where tropical diseases are 
endemic have published structures of chiral 
natural products?” 
CC-BY-SA from Wikipedia
The Right to Read is the Right to Mine 
http://contentmine.org
• Science can be read and understood by 
human-machine Amanuensis-symbionts. 
• Amanuenses are based on Wikipedia, 
databases and software (e.g. ContentMine’s 
AMI) 
• The results are fed back into WP and WikiData 
http://en.wikipedia.org/wiki/Eric_Fenby http://en.wikipedia.org/wiki/Symbiosis
Machine Extraction of scientific facts 
• Crawl scientific literature 
(Open Bibliography) 
• Scrape each scientific article 
(ContentMine-quickscrape) 
• Extract the facts (ContentMine-AMI) 
• Index (Wikipedia) 
• Republish (WikiData)
RSU: Richard Smith-Unna 
PMR: Peter Murray-Rust 
CL: CottageLabs 
Queues 
Repos 
Scientific 
literature 
Science 
Plugins 
Science 
Volunteers
Linked Open Data – the world’s knowledge 
GOV.uk 
very little physical science  
DBPedia 
BIO 
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png 
Lib 
Comp 
PDB 
Ontologies 
GOV 
Music, 
Art 
Literature 
Social 
Knowledge 
bases 
RDF 
triples
Part of a COD RDF entry 
The Semantic Web understands this
Mathematics Markup Language 
Energy of c.c.p lattice of argon 
Human-friendly 4 pages clipped 
Machine-friendly 
Many editors and tools exist 
We used MathWeaver 
Automatic! 
MathML
CML (Chemical Markup Language) 
Automatic! 
Human-friendly Machine-friendly
Innovation with Componentisation 
Individual, manual, 
unreusable, flaky 
Commodity, standard, 
reliable, re-usable
Current scientific information flow 
… is broken for data-rich science 
Non-semantic 
data 
Human input 
Data extraction 
difficult and 
incomplete 
Human 
readers 
PDF 
Lineprinter output 
Text files
Semantic network closes the loop 
Data mined from 
document 
Computation 
Measurement 
Semantic 
Authoring 
Community 
Data available for 
e-science and re-use 
Analysis
The network grows autonomously 
Machine-machine 
Human-machine 
Machine-human 
Human-human
Humans and machines use different 
languages
How a machine reads a chemical thesis 
nodes are compounds; arrows are reactions
Human-machine symbionts can read science! 
WP_Lion 
WP_Aspergillus_oryzae 
WP_Soybean
With Wikipedia everyone can be a scientist 
Facts Marked by “non-scientists” in ContentMine workshops
“nuggets” in a scientific paper 
project places 
quantity 
units 
Value ranges 
chemical 
Humans aren’t designed to mine this … 
Parsing chemical sentences 
A FACT, uncopyrightable, and representable by triples
http://wwmm.ch.cam.ac.uk/chemicaltagger 
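As a minimal sketch of what "representable by triples" can mean in practice: each fact becomes a (subject, predicate, object) record that a machine can filter and join. The `ex:` identifiers and the facts themselves are hypothetical illustrations, not output from ChemicalTagger.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical facts of the kind a chemical sentence parser might emit.
facts = [
    Triple("ex:reaction1", "ex:hasSolvent", "ex:ethanol"),
    Triple("ex:reaction1", "ex:hasTemperature", "78 Cel"),
    Triple("ex:reaction1", "ex:hasProduct", "ex:compound2"),
    Triple("ex:reaction2", "ex:hasSolvent", "ex:water"),
]


def objects_of(triples, subject, predicate):
    """Return every object recorded for a given subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


def subjects_with(triples, predicate, obj):
    """Return every subject that has the given predicate/object pair."""
    return [t.subject for t in triples
            if t.predicate == predicate and t.obj == obj]


print(objects_of(facts, "ex:reaction1", "ex:hasSolvent"))
print(subjects_with(facts, "ex:hasSolvent", "ex:water"))
```

Because every fact has the same three-part shape, facts from different papers can be pooled and queried together.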
Typical chemical synthesis
Open Content Mining of FACTs 
Machines can interpret chemical reactions 
We have done 500,000 patents. There are > 
3,000,000 reactions/year. Added value > 1B Eur.
We can’t turn a hamburger into a cow 
But we can now 
turn PDFs into 
Science
UNITS 
TICKS 
QUANTITY 
SCALE 
TITLES 
DATA!! 
2000+ points
Dumb PDF 
CSV 
Semantic 
Spectrum 
Automatic 
extraction 
Takes < 1 second 
Gaussian 
Filter 
2nd Derivative
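The slide names a Gaussian filter and a second derivative without saying how they are combined. One common arrangement, sketched here as an assumption rather than as the ContentMine implementation, is to smooth the extracted points and then keep local maxima where the second derivative is negative. The synthetic spectrum, `sigma` and `min_height` are illustrative choices.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=2.0):
    """Gaussian-filter a list of intensities, clamping at the edges."""
    radius = int(3 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values):
    """Central-difference second derivative; end points are left at 0."""
    d2 = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        d2[i] = values[i - 1] - 2 * values[i] + values[i + 1]
    return d2


def find_peaks(values, sigma=2.0, min_height=0.1):
    """Indices of smoothed local maxima with negative curvature."""
    s = smooth(values, sigma)
    d2 = second_derivative(s)
    return [i for i in range(1, len(s) - 1)
            if s[i] >= min_height and s[i] > s[i - 1]
            and s[i] >= s[i + 1] and d2[i] < 0]


# Synthetic spectrum: two peaks plus a small high-frequency ripple.
spectrum = [
    math.exp(-((i - 30) ** 2) / 32) + 0.6 * math.exp(-((i - 70) ** 2) / 32)
    + 0.01 * math.sin(2.5 * i)
    for i in range(100)
]
peaks = find_peaks(spectrum)
print(peaks)
```

The filter suppresses the ripple so that only the two broad peaks survive the curvature test.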
Chemical Computer Vision 
Raw Mobile photo; problems: 
Shadows, contrast, noise, skew, clipping
Binarization (pixels = 0,1) 
Irregular edges
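A minimal sketch of binarization, assuming a single global threshold halfway between the darkest and lightest pixel. The slide does not say which thresholding method is used, and a real photo with shadows would likely need a locally adaptive threshold instead.

```python
def binarize(gray, threshold=None):
    """Map a grayscale image (rows of 0-255 values) to 0/1 pixels.

    Dark ink on light paper is assumed, so dark pixels become 1.
    """
    flat = [p for row in gray for p in row]
    if threshold is None:
        # Assumed default: midpoint between darkest and lightest pixel.
        threshold = (min(flat) + max(flat)) / 2
    return [[1 if p < threshold else 0 for p in row] for row in gray]


# Tiny illustrative image: a dark L-shaped stroke on a light background.
image = [
    [250, 248, 245, 240],
    [251,  40,  35, 238],
    [249,  42, 200, 236],
    [247, 244, 241, 230],
]
binary = binarize(image)
for row in binary:
    print(row)
```

Pixels close to the threshold are the ones that flip unpredictably, which is one source of the irregular edges mentioned above.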
Thinning: thick lines to 1-pixel
Chemical Optical Character Recognition 
Small alphabet, clean typefaces, clear boundaries make 
this relatively tractable. Problems are “I” “O” etc.
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home 
AMI reads the complete diagram, 
recognizes the paths and 
generates the molecules. Then 
she creates a stop-frame animation 
showing how the 12 reactions 
lead into each other 
CLICK HERE FOR ANIMATION 
(may be browser dependent) 
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI Demo 
http://www.mdpi.com/2218-1989/2/1/39/pdf 
https://bitbucket.org/AndyHowlett/ami2-poc 
ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor 
May take time to start if not connected to web 
Output: ./target/output/reactionsexample/ 
SVG: ./page1annotated.svg 
CML: image.g.1.4.svg.reaction0.cml 
Viewer: Avogadro
Bacterial WP_phylogenetic tree 
Genbank ID 
American Type 
Culture Collection 
WP: Clostridium_butyricum 
Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy 
Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
http://en.wikipedia.org/wiki/Digital_image_processing 
http://en.wikipedia.org/wiki/Newick_format http://en.wikipedia.org/wiki/Phylogenetics 
((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));
(http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0036933 – 
“Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada”)
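A Newick string like the one above is just nested parentheses, so a short recursive parser can turn it into a tree a program can walk. This is a sketch for the name-only form shown on the slide (no branch lengths, quoted labels or internal node names); the function names are illustrative.

```python
def parse_newick(text):
    """Parse a name-only Newick string into nested lists of leaf names."""
    # This sketch drops all whitespace, so a tree wrapped across
    # several lines still parses.
    s = "".join(text.split())
    if not s.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while s[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # consume ")"
            return children
        start = pos
        while s[pos] not in "(),;":
            pos += 1
        return s[start:pos]               # a leaf name

    tree = node()
    if pos != len(s) - 1:
        raise ValueError(f"unexpected text at position {pos}")
    return tree


def leaves(tree):
    """List leaf names in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


small = parse_newick("((n1,n2),(n3,(n4,\n n5)));")
print(small)
print(leaves(small))
```

Counting the leaves of a parsed tree is one simple consistency check against the number of tip labels read from the original diagram.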
Open notebook science is the practice of 
making the entire primary record of a research 
project publicly available online as it is 
recorded. (WP) 
Jean-Claude Bradley was a chemist who 
actively promoted Open Science in 
chemistry,… He coined the term Open 
Notebook Science. … A memorial 
symposium was held July 14, 2014 at 
Cambridge University, UK.
Thanks 
• Shuttleworth Foundation and Fellowship 
• Contentmine.org: Michelle Brook, Jenny Molloy, 
Ross Mounce, Richard Smith-Unna, 
CottageLabs, Charles Oppenheim 
• Open Knowledge Foundation Community 
• Wikimedia Community 
• Blue Obelisk Community
My/our Dream 
• An Open Bibliography of science, updated 
daily 
• An interface for ContentMine to feed new 
facts into WikiData 
• Domain-specific enthusiasts to create and run 
fact extraction and validation 
• Wikipedia to become a C21 publisher of 
reference science

Weitere ähnliche Inhalte

Was ist angesagt?

Copyright Reform and Open Data
Copyright Reform and Open DataCopyright Reform and Open Data
Copyright Reform and Open Datapetermurrayrust
 
Content Mining for Machines and Humans
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humanspetermurrayrust
 
The Content Mine (presented at UKSG)
The Content Mine (presented at UKSG)The Content Mine (presented at UKSG)
The Content Mine (presented at UKSG)petermurrayrust
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trustpetermurrayrust
 
Disruptive Communities and Technology
Disruptive Communities and TechnologyDisruptive Communities and Technology
Disruptive Communities and Technologypetermurrayrust
 
Principles and practice of Open Science
Principles and practice of Open SciencePrinciples and practice of Open Science
Principles and practice of Open Sciencepetermurrayrust
 
Embrace the Open Revolution
Embrace the Open RevolutionEmbrace the Open Revolution
Embrace the Open Revolutionpetermurrayrust
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)petermurrayrust
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesTheContentMine
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neurosciencepetermurrayrust
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData TheContentMine
 
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome CampusBibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome CampusDuncan Hull
 
Open Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics InstituteOpen Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics InstituteTheContentMine
 
Dagstuhl "Future" sesssion intro slides
Dagstuhl "Future" sesssion intro slidesDagstuhl "Future" sesssion intro slides
Dagstuhl "Future" sesssion intro slidesTim Clark
 
Open Data and Open Science
Open Data and Open ScienceOpen Data and Open Science
Open Data and Open ScienceTheContentMine
 
Improving the troubled relationship between Scientists and Wikipedia
Improving the troubled relationship between Scientists and Wikipedia Improving the troubled relationship between Scientists and Wikipedia
Improving the troubled relationship between Scientists and Wikipedia Duncan Hull
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in NeuroscienceTheContentMine
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 TheContentMine
 
Disrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic ComplexDisrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic Complexpetermurrayrust
 

Was ist angesagt? (20)

Copyright Reform and Open Data
Copyright Reform and Open DataCopyright Reform and Open Data
Copyright Reform and Open Data
 
Content Mining for Machines and Humans
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humans
 
The Content Mine (presented at UKSG)
The Content Mine (presented at UKSG)The Content Mine (presented at UKSG)
The Content Mine (presented at UKSG)
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
Disruptive Communities and Technology
Disruptive Communities and TechnologyDisruptive Communities and Technology
Disruptive Communities and Technology
 
Csvconf
CsvconfCsvconf
Csvconf
 
Principles and practice of Open Science
Principles and practice of Open SciencePrinciples and practice of Open Science
Principles and practice of Open Science
 
Embrace the Open Revolution
Embrace the Open RevolutionEmbrace the Open Revolution
Embrace the Open Revolution
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
 
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome CampusBibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus
 
Open Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics InstituteOpen Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics Institute
 
Dagstuhl "Future" sesssion intro slides
Dagstuhl "Future" sesssion intro slidesDagstuhl "Future" sesssion intro slides
Dagstuhl "Future" sesssion intro slides
 
Open Data and Open Science
Open Data and Open ScienceOpen Data and Open Science
Open Data and Open Science
 
Improving the troubled relationship between Scientists and Wikipedia
Improving the troubled relationship between Scientists and Wikipedia Improving the troubled relationship between Scientists and Wikipedia
Improving the troubled relationship between Scientists and Wikipedia
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016
 
Disrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic ComplexDisrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic Complex
 

Andere mochten auch (19)

Hospice letter
Hospice letterHospice letter
Hospice letter
 
The engineer’s licensing guidance document ELGD 2007
The engineer’s licensing guidance document ELGD 2007The engineer’s licensing guidance document ELGD 2007
The engineer’s licensing guidance document ELGD 2007
 
1200 j lipman
1200 j lipman1200 j lipman
1200 j lipman
 
Question 3 – what have you learnt from
Question 3 – what have you learnt fromQuestion 3 – what have you learnt from
Question 3 – what have you learnt from
 
Storytime updated ppt
Storytime updated pptStorytime updated ppt
Storytime updated ppt
 
Comm skills1
Comm skills1Comm skills1
Comm skills1
 
nullcon 2011 - Buffer UnderRun Exploits
nullcon 2011 - Buffer UnderRun Exploitsnullcon 2011 - Buffer UnderRun Exploits
nullcon 2011 - Buffer UnderRun Exploits
 
米羅
米羅米羅
米羅
 
Hugps138
Hugps138Hugps138
Hugps138
 
Characteristics of narration
Characteristics of  narrationCharacteristics of  narration
Characteristics of narration
 
Caldwell recognition-2012
Caldwell recognition-2012Caldwell recognition-2012
Caldwell recognition-2012
 
ShareThis Auto Study
ShareThis Auto Study ShareThis Auto Study
ShareThis Auto Study
 
Spinal cord trauma
Spinal cord traumaSpinal cord trauma
Spinal cord trauma
 
Jft 13-desktop-optical-power-meter-jfopt
Jft 13-desktop-optical-power-meter-jfoptJft 13-desktop-optical-power-meter-jfopt
Jft 13-desktop-optical-power-meter-jfopt
 
Bab 5 9d
Bab 5 9dBab 5 9d
Bab 5 9d
 
Design for Social Sharing Workshop
Design for Social Sharing WorkshopDesign for Social Sharing Workshop
Design for Social Sharing Workshop
 
The Praying Indians of Megunko
The Praying Indians of MegunkoThe Praying Indians of Megunko
The Praying Indians of Megunko
 
Transactional learning and simulations: how far can we go in professional leg...
Transactional learning and simulations: how far can we go in professional leg...Transactional learning and simulations: how far can we go in professional leg...
Transactional learning and simulations: how far can we go in professional leg...
 
rgl test
rgl testrgl test
rgl test
 

Ähnlich wie ContentMine: Open Data and Social Machines

ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiDataTheContentMine
 
Paradise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to MineParadise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to Minepetermurrayrust
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migrationpetermurrayrust
 
Semantic Web in Physical Science
Semantic Web in Physical ScienceSemantic Web in Physical Science
Semantic Web in Physical Sciencepetermurrayrust
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic BiologyTheContentMine
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biologypetermurrayrust
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in NeuroscienceTheContentMine
 
Content Mining for Machines and Humans
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and HumansTheContentMine
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and  Medicine from the scholarly literatureAutomatic Extraction of Science and  Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literaturepetermurrayrust
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureAutomatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureTheContentMine
 
ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesTheContentMine
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchDatapetermurrayrust
 
Social Machines of Scholarly Collaboration
Social Machines of Scholarly CollaborationSocial Machines of Scholarly Collaboration
Social Machines of Scholarly CollaborationDavid De Roure
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical Trialspetermurrayrust
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical TrialsTheContentMine
 
Autonomous Agents on the Web: Beyond Linking and Meaning Mike Amundsen Keynot...
Autonomous Agents on the Web: Beyond Linking and Meaning Mike Amundsen Keynot...Autonomous Agents on the Web: Beyond Linking and Meaning Mike Amundsen Keynot...
Autonomous Agents on the Web: Beyond Linking and Meaning Mike Amundsen Keynot...CA API Management
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustLEARN Project
 
The wider environment of open scholarship – Jisc and CNI conference 10 July ...
The wider environment of open scholarship – Jisc and CNI conference 10 July ...The wider environment of open scholarship – Jisc and CNI conference 10 July ...
The wider environment of open scholarship – Jisc and CNI conference 10 July ...Jisc
 
Emerging Forms of Data and Analytics
Emerging Forms of Data and AnalyticsEmerging Forms of Data and Analytics
Emerging Forms of Data and AnalyticsDavid De Roure
 
Automatic mining of data from materials science literature
Automatic mining of data from materials science literatureAutomatic mining of data from materials science literature
Automatic mining of data from materials science literaturepetermurrayrust
 

Ähnlich wie ContentMine: Open Data and Social Machines (20)

ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiData
 
Paradise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to MineParadise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to Mine
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
 
Semantic Web in Physical Science
Semantic Web in Physical ScienceSemantic Web in Physical Science
Semantic Web in Physical Science
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Content Mining for Machines and Humans
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humans
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and  Medicine from the scholarly literatureAutomatic Extraction of Science and  Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureAutomatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
 
ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and theses
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchData
 
Social Machines of Scholarly Collaboration
Social Machines of Scholarly CollaborationSocial Machines of Scholarly Collaboration
Social Machines of Scholarly Collaboration
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical Trials
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical Trials
 
Autonomous Agents on the Web: Beyond Linking and Meaning Mike Amundsen Keynot...
Autonomous Agents on the Web: Beyond Linking and Meaning Mike Amundsen Keynot...Autonomous Agents on the Web: Beyond Linking and Meaning Mike Amundsen Keynot...
Autonomous Agents on the Web: Beyond Linking and Meaning Mike Amundsen Keynot...
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-Rust
 
The wider environment of open scholarship – Jisc and CNI conference 10 July ...
The wider environment of open scholarship – Jisc and CNI conference 10 July ...The wider environment of open scholarship – Jisc and CNI conference 10 July ...
The wider environment of open scholarship – Jisc and CNI conference 10 July ...
 
Emerging Forms of Data and Analytics
Emerging Forms of Data and AnalyticsEmerging Forms of Data and Analytics
Emerging Forms of Data and Analytics
 
Automatic mining of data from materials science literature
Automatic mining of data from materials science literatureAutomatic mining of data from materials science literature
Automatic mining of data from materials science literature
 

Mehr von petermurrayrust

Omdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital AgeOmdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital Agepetermurrayrust
 
Open Science Principles and Practice
Open Science Principles and PracticeOpen Science Principles and Practice
Open Science Principles and Practicepetermurrayrust
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentationpetermurrayrust
 
Can machines understand the scientific literature?
Can machines understand the scientific literature?Can machines understand the scientific literature?
Can machines understand the scientific literature?petermurrayrust
 
OpenVirus at OpenPublishingFest

ContentMine: Open Data and Social Machines

  • 1. ContentMine: Open Data and Social Machines. Peter Murray-Rust, Computation Lab, Univ of Chicago, 2014-11-12
  • 2. ContentMine: We use machines to liberate[1] 100 million facts /yr from the scientific scholarly literature and make them free for everyone (WikiData) WikiData and ContentMines are social machines There are no longer any technical obstacles, only people. [1] Friday workshop: build your own social machine: scraping XML,
  • 4. http://en.wikipedia.org/wiki/Tim_Berners-Lee Everything in this presentation is ODOSOS (Open Data, Open Standards, Open Source) CC0, CC-BY, W3C etc., Apache2, etc. Open = “Free to use, re-use and redistribute” http://contentmine.org http://bitbucket.org/petermr http://wwmm.ch.cam.ac.uk A promise: I (Petermr) will never sell out to non-transparent organizations.
  • 5. http://www.budapestopenaccessinitiative.org/read … an unprecedented public good. … … completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. … …Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2003)
  • 6. Scientific and Medical publication (STM)[+] • World Citizens pay $400,000,000,000… • … for research in 1,500,000 articles … • … cost $300,000 each to create … • … $7000 each to “publish” [*]… • … $10,000,000,000 from academic libraries … • … to “publishers” who forbid access to 99.9% of citizens of the world … [+] Figures probably ±50% [*] arXiv preprint server costs $7 USD per paper
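The per-article figures on slide 6 follow from dividing the quoted totals by the quoted article count. A minimal sketch of that arithmetic, using only the numbers on the slide (which the slide itself hedges as roughly ±50%); the variable and function names are illustrative, and the link between the library spend and the "publish" figure is an assumption about how the slide's numbers relate:

```python
# Figures quoted on slide 6; the slide hedges them as roughly +-50%.
TOTAL_RESEARCH_SPEND_USD = 400_000_000_000
ARTICLES_PER_YEAR = 1_500_000
LIBRARY_SPEND_USD = 10_000_000_000
ARXIV_COST_PER_PAPER_USD = 7


def per_article(total_usd: int, articles: int) -> float:
    """Average spend per article for a given yearly total."""
    return total_usd / articles


# Roughly what the slide rounds to "$300,000 each to create".
cost_to_create = per_article(TOTAL_RESEARCH_SPEND_USD, ARTICLES_PER_YEAR)

# Assumption: the "$7000 each to publish" figure is library spend per article.
cost_to_publish = per_article(LIBRARY_SPEND_USD, ARTICLES_PER_YEAR)

# How many times more than the quoted arXiv cost per paper.
publish_vs_arxiv = cost_to_publish / ARXIV_COST_PER_PAPER_USD

if __name__ == "__main__":
    print(f"create:  ${cost_to_create:,.0f} per article")
    print(f"publish: ${cost_to_publish:,.0f} per article")
    print(f"publish / arXiv: {publish_vs_arxiv:,.0f}x")
```

The exact quotients are somewhat below the rounded slide values, which is consistent with the slide's own error bar.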
  • 7. petermr: I believe in Wikipedia • 2006 http://en.wikipedia.org/wiki/User:Petermr • 2006 started Open Data (term unknown then!) • 2009: “the bit of Wikipedia that I wrote is correct” [challenging the idea of “WP is junk”] • 2009: “Wikipedia is the digital library of this century” • 2012: I alert WP that Springer has copyrighted > 1000 of our images [Springergate] • 2014: “For facts in maths, physical and biological sciences I trust Wikipedia.” (Wikimania2014)
  • 8. A meritocratic critical volunteer community
  • 9. Volunteer community in chemistry: Open Data/Source/Standards
  • 10. 4 Billion USD on human genome yielded 800 Billion USD and 4 M job-years
  • 12. Bad publication wastes science …three problems—flawed design, non-publication, and poor reporting—together meant >85% of research funds were wasted, a global total loss >100 billion USD per year. [Lancet 2009] [Even more] waste clearly occurs after publication: from poor access, poor dissemination, and poor uptake of the findings of research. [PLOS Medicine 2014-05-27]
  • 13. Publishers’ PDFs destroy science PDFs do not contain words or subscripts! PDFs do not contain tables and do not have columns SVG is turned into JPEG because it’s easier to process
  • 14. Elsevier wants to control Open Data [asked by Michelle Brook]
  • 15. Licences destroy Content Mining WE WALKED OUT • Brit Library • JISC • RLUK • OKFN • … • Ross Mounce • PM-R STM Publishers Licence 2012_03_15_Sample_Licence_Text_Data_Mining.pdf (Summary: PMR has NO rights) • [cannot publish to: ] “libraries, repositories, or archives” • [cannot] “Make the results of any TDM Output available on an externally facing server or website” • “Subscriber shall pay a […] fee” Heather Piwowar: “negotiating with publishers [made me physically ill]”
  • 16. CLOSED ACCESS MEANS PEOPLE DIE CLOSED DATA MEANS PEOPLE DIE
  • 18. The scientist’s amanuensis • "The bane of my life is doing things I know computers could do for me" (Dan Connolly, W3C) Example: A semantic amanuensis could • Give me a daily digest of mineralogy papers • Extract all the crystal structures from them • Compute physical properties with GULP and NWChem • Compare the results statistically • Preserve and distribute the complete operation • Prepare the results for publication The semantic web is having a personal amanuensis
  • 19. Artificial Intelligence in science In 1970 chess and chemistry were the sandboxes for AI. Some approaches: • Lookup (Knowledge) • Natural Language Processing (NLP) • Brute force calculation (inc. physical methods) • Tree-pruning and heuristics • Logic (cf. OWL-DL) • Human-machine integration (crowdsourcing) • Computer Vision Domain-specific Turing test: Can a machine pass a first-year chemistry exam?
  • 20. The Semantic Web "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001 CC-BY-SA Images from Wikipedia
  • 21. Linked Open data from Wikipedia “Which Rivers flow into the Rhine and are longer than 50 kilometers?” or “Which Skyscrapers in China have more than 50 floors and have been constructed before the year 2000?” Open Crystallography? “Which countries where tropical diseases are endemic have published structures of chiral natural products?” CC-BY-SA from Wikipedia
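The river question on slide 21 becomes mechanical once the facts are stored as subject-predicate-object triples. The sketch below is not DBpedia or WikiData code: the triple layout, the predicate names `flows_into` and `length_km`, and the placeholder rivers with their lengths are all hypothetical, chosen only to show the shape of such a query.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: object


# Hypothetical placeholder data; names and lengths are invented for illustration.
TRIPLES = [
    Triple("RiverA", "flows_into", "Rhine"),
    Triple("RiverA", "length_km", 120),
    Triple("RiverB", "flows_into", "Rhine"),
    Triple("RiverB", "length_km", 30),
    Triple("RiverC", "flows_into", "Danube"),
    Triple("RiverC", "length_km", 200),
]


def objects(triples, subject, predicate):
    """All objects recorded for a given subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


def tributaries_longer_than(triples, river, min_km):
    """Subjects that flow into `river` and have a recorded length above `min_km`."""
    result = []
    for t in triples:
        if t.predicate == "flows_into" and t.obj == river:
            if any(km > min_km for km in objects(triples, t.subject, "length_km")):
                result.append(t.subject)
    return sorted(result)


if __name__ == "__main__":
    print(tributaries_longer_than(TRIPLES, "Rhine", 50))
```

A real linked-data store would express the same join in a query language such as SPARQL; the point here is only that the question reduces to matching two triple patterns on a shared subject.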
  • 22. The Right to Read is the Right to Mine http://contentmine.org
  • 23. • Science can be read and understood by human-machine Amanuensis-symbionts. • Amanuenses are based on Wikipedia, databases and software (e.g. ContentMine’s AMI) • The results are fed back into WP and WikiData http://en.wikipedia.org/wiki/Eric_Fenby http://en.wikipedia.org/wiki/Symbiosis
  • 24. Machine Extraction of scientific facts • Crawl scientific literature (Open Bibliography) • Scrape each scientific article (ContentMine-quickscrape) • Extract the facts (ContentMine-AMI) • Index (Wikipedia) • Republish (WikiData)
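The five steps on slide 24 (crawl, scrape, extract, index, republish) can be sketched as a chain of small functions. This is a toy under stated assumptions, not the quickscrape or AMI API: every function name, the `Fact` record, the in-memory `FAKE_LITERATURE` and the temperature regex are hypothetical stand-ins.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    source: str  # identifier of the article the fact came from
    kind: str    # what sort of fact it is
    value: str   # the extracted text


# Hypothetical stand-in for the scientific literature.
FAKE_LITERATURE = {
    "article-1": "<p>The sample melted at 42 K.</p>",
    "article-2": "<p>No measurements reported.</p>",
}


def crawl():
    """Step 1: list the articles to process."""
    return sorted(FAKE_LITERATURE)


def scrape(article_id):
    """Step 2: fetch one article and strip its markup."""
    return re.sub(r"<[^>]+>", "", FAKE_LITERATURE[article_id])


def extract(article_id, text):
    """Step 3: pull out facts; here only temperatures written like '42 K'."""
    return [Fact(article_id, "temperature", m.group(0))
            for m in re.finditer(r"\d+(?:\.\d+)? K\b", text)]


def index(facts):
    """Step 4: group facts by kind."""
    by_kind = {}
    for fact in facts:
        by_kind.setdefault(fact.kind, []).append(fact)
    return by_kind


def republish(indexed):
    """Step 5: emit one tab-separated line per fact."""
    return [f"{f.source}\t{f.kind}\t{f.value}"
            for kind in sorted(indexed) for f in indexed[kind]]


def run_pipeline():
    facts = []
    for article_id in crawl():
        facts.extend(extract(article_id, scrape(article_id)))
    return republish(index(facts))


if __name__ == "__main__":
    print("\n".join(run_pipeline()))
```

Keeping each stage a separate function mirrors the slide's layout, where each stage is handled by a different component.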
  • 25. RSU: Richard Smith-Unna PMR: Peter Murray-Rust CL: CottageLabs Queues Repos Scientific literature Science Plugins Science Volunteers
  • 26. Linked Open Data – the world’s knowledge GOV.uk very little physical science  DBPedia BIO http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png Lib Comp PDB Ontologies GOV Music, Art Literature Social Knowledge bases RDF triples
  • 27. Part of a COD RDF entry The Semantic Web understands this
  • 28. Mathematics Markup Language Energy of c.c.p lattice of argon Human-friendly 4 pages clipped Machine-friendly Many editors and tools exist We used MathWeaver Automatic! MathML
  • 29. CML (Chemical Markup Language) Automatic! Human-friendly Machine-friendly
  • 30. Innovation with Componentisation Individual, manual, unreusable, flaky Commodity, standard, reliable, re-usable
  • 31. Current scientific information flow … is broken for data-rich science Non-semantic data Human input Data extraction difficult and incomplete Human readers PDF Lineprinter output Text files
  • 32. Semantic network closes the loop Data mined from document Computation Measurement Semantic Authoring Community Data available for e-science and re-use Analysis
  • 33. The network grows autonomously Machine-machine Human-machine Machine-human Human-human
  • 34. Humans and machines use different languages
  • 35. How a machine reads a chemical thesis nodes are compounds; arrows are reactions
  • 36. Human-machine symbionts can read science! WP_Lion WP_Aspergillus_oryzae WP_Soybean
  • 37. With Wikipedia everyone can be a scientist Facts Marked by “non-scientists” in ContentMine workshops
  • 38. “nuggets” in a scientific paper project places quantity units Value ranges chemical Humans aren’t designed to mine this … 
  • 39. Parsing chemical sentences A FACT, uncopyrightable, and representable by triples
  • 41. Open Content Mining of FACTs Machines can interpret chemical reactions We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
  • 42. We can’t turn a hamburger into a cow But we can now turn PDFs into Science
  • 43. UNITS TICKS QUANTITY SCALE TITLES DATA!! 2000+ points
  • 44. Dumb PDF CSV Semantic Spectrum Automatic extraction Takes < 1 second Gaussian Filter 2nd Derivative
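Slide 44 names two operations, a Gaussian filter and a second derivative, without saying how they are combined. One common use is peak finding: smooth the extracted points, then look for strongly negative curvature. The sketch below assumes evenly spaced points and unit spacing; the function names, kernel radius and peak rule are illustrative choices, not the tool's actual method.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Convolve with a Gaussian, clamping indices at both ends."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values):
    """Central finite difference with unit spacing; the two endpoints stay 0.0."""
    d2 = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        d2[i] = values[i - 1] - 2 * values[i] + values[i + 1]
    return d2


def find_peaks(values, sigma=1.0):
    """Indices where the smoothed second derivative has a negative local minimum."""
    d2 = second_derivative(smooth(values, sigma))
    return [i for i in range(1, len(d2) - 1)
            if d2[i] < 0 and d2[i] <= d2[i - 1] and d2[i] <= d2[i + 1]]


if __name__ == "__main__":
    # Synthetic single-peak spectrum centred on index 20.
    spectrum = [math.exp(-((i - 20) ** 2) / 18.0) for i in range(41)]
    print(find_peaks(spectrum))
```

Smoothing first matters because a second difference amplifies point-to-point noise; real spectra would also need a threshold on how negative the curvature must be.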
  • 45. Chemical Computer Vision Raw Mobile photo; problems: Shadows, contrast, noise, skew, clipping
  • 46. Binarization (pixels = 0,1) Irregular edges
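Binarization as described on slide 46 maps every pixel to 0 or 1. The slide does not say which method is used, so the sketch below assumes the simplest one, a single global threshold on a greyscale grid. Treating dark pixels as ink (1) and light pixels as paper (0) is also an assumption.

```python
def mean_threshold(gray):
    """A crude global threshold: the mean grey level of the whole image."""
    pixels = [px for row in gray for px in row]
    return sum(pixels) / len(pixels)


def binarize(gray, threshold=128):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


if __name__ == "__main__":
    gray = [
        [250, 30],
        [200, 10],
    ]
    print(binarize(gray, mean_threshold(gray)))
```

A single global threshold struggles with the shadows and uneven contrast listed on slide 45; locally adaptive thresholds are the usual remedy, at the cost of more tuning.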
  • 47. Thinning: thick lines to 1-pixel
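Slide 47 does not name a thinning algorithm. A widely used one is Zhang-Suen thinning, sketched here as an assumption about the general technique rather than a description of the code behind the slides. It expects a binary grid (1 = ink) whose outermost rows and columns are 0; border pixels are never examined.

```python
def neighbours(img, r, c):
    """The 8 neighbours P2..P9, clockwise starting from north."""
    return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
            img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]


def transitions(n):
    """Count 0 -> 1 transitions in the circular sequence P2, P3, ..., P9, P2."""
    ring = n + n[:1]
    return sum(1 for a, b in zip(ring, ring[1:]) if a == 0 and b == 1)


def thin(image):
    """Zhang-Suen thinning; returns a new grid and leaves the input untouched."""
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    n = neighbours(img, r, c)
                    p2, _, p4, _, p6, _, p8, _ = n
                    if not (2 <= sum(n) <= 6 and transitions(n) == 1):
                        continue
                    if step == 0:
                        removable = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        removable = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if removable:
                        to_clear.append((r, c))
            # Clear only after the full scan so each sub-step acts on one snapshot.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


if __name__ == "__main__":
    # A 3-pixel-thick horizontal bar with a 1-pixel border of background.
    bar = [[0] * 9] + [[0] + [1] * 7 + [0] for _ in range(3)] + [[0] * 9]
    for row in thin(bar):
        print("".join("#" if px else "." for px in row))
```

The two alternating sub-steps peel pixels from opposite sides of a stroke, which is what keeps the surviving line roughly centred.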
  • 48. Chemical Optical Character Recognition Small alphabet, clean typefaces, clear boundaries make this relatively tractable. Problems are “I” “O” etc.
  • 49. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other CLICK HERE FOR ANIMATION (may be browser dependent) Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
  • 50. AMI Demo http://www.mdpi.com/2218-1989/2/1/39/pdf https://bitbucket.org/AndyHowlett/ami2-poc ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor May take time to start if not connected to web Output:./target/output/reactionsexample/ SVG: ./page1annotated.svg CML: image.g.1.4.svg.reaction0.cml Avogadro Viewer:
  • 51. Bacterial WP_phylogenetic tree Genbank ID American Type Culture Collection WP: Clostridium_butyricum Our machines have read and interpreted 4300 in an hour with > 95% accuracy Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
  • 52. http://en.wikipedia.org/wiki/Digital_image_processing http://en.wikipedia.org/wiki/Newick_format http://en.wikipedia.org/wiki/Phylogenetics ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),( (((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n 215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187), n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n 102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),((( n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n1 60))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139 ,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))) )))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,( n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),(( n53,n131),n159))))))); (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0036933 – “Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada” .
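The parenthesised string on slide 52 is a tree in Newick format: nested parentheses group sibling nodes, commas separate them, and a semicolon ends the tree. A minimal recursive-descent reader for that subset is sketched below. It assumes leaf names only, with no branch lengths and no labels on internal nodes, and it drops all whitespace first because the wrapped slide text contains stray spaces. The function names are illustrative.

```python
def parse_newick(text):
    """Parse a names-only Newick string into nested lists of leaf names."""
    s = "".join(text.split())  # remove spaces and line breaks
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return children
        start = pos
        while peek() and peek() not in "(),;":
            pos += 1
        if pos == start:
            raise ValueError(f"expected a leaf name at position {pos}")
        return s[start:pos]

    tree = parse_node()
    if s[pos:] != ";":
        raise ValueError("expected a single trailing ';'")
    return tree


def leaves(tree):
    """Flatten a parsed tree into its leaf names, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


if __name__ == "__main__":
    tree = parse_newick("((n1,n2),(n3,(n4,n5)));")
    print(tree)
    print(leaves(tree))
```

Trees with branch lengths (`name:0.1`) or labelled internal nodes would need a fuller grammar than this subset handles.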
  • 53. Open notebook science is the practice of making the entire primary record of a research project publicly available online as it is recorded. (WP) Jean-Claude Bradley was a chemist who actively promoted Open Science in chemistry,… He coined the term Open Notebook Science. … A memorial symposium was held July 14, 2014 at Cambridge University, UK.
  • 54. RSU: Richard Smith-Unna PMR: Peter Murray-Rust CL: CottageLabs Queues Repos Scientific literature Science Plugins Science Volunteers
  • 55. Thanks • Shuttleworth Foundation and Fellowship • Contentmine.org: Michelle Brook, Jenny Molloy, Ross Mounce, Richard Smith-Unna, CottageLabs, Charles Oppenheim • Open Knowledge Foundation Community • Wikimedia Community • Blue Obelisk Community
  • 56. My/our Dream • An Open Bibliography of science, updated daily • An interface for ContentMine to feed new facts into WikiData • Domain-specific enthusiasts to create and run fact extraction and validation • Wikipedia to become a C21 publisher of reference science