Content Mining at Wellcome Trust

Presentation for researchers and policy makers at a two-day meeting sponsored by and held at the Wellcome Trust, UK.

Published in: Science

  1. CONTENT-MINING IN SCIENCE. TheContentMine: progress since “Hargreaves” legislation; opportunities for UK and Europe. Peter Murray-Rust, 2015-04-14. Workshop sponsored by Wellcome Trust.
  2. OUR TEAM
     • Jenny Molloy @jenny_molloy
     • Ross Mounce @rmounce
     • Richard Smith-Unna @blahah404
     • Stephanie Smith-Unna @treblesteph
     • Mark MacGillivray @cottagelabs
     • Peter Murray-Rust @petermurrayrust
     • Charles Oppenheim @CharlesOppenh
     • Graham Steel @McDawg
  3. OUR MISSION: “make 100,000,000 facts from the STEM literature open, accessible and reusable”
  4. WHY? http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection. Adage in public health: “The road to inaction is paved with research papers.” Bernice Dahn is the chief medical officer of Liberia’s Ministry of Health, where Vera Mussah is the director of county health services. Cameron Nutt is the Ebola response adviser to Partners in Health.
  5. THE RIGHT TO READ IS THE RIGHT TO MINE. The Hargreaves report (UK), made law in 2014, allows limitations and exceptions for non-commercial content mining for research. The Hague decal
  6. THE SCALE OF THE TASK
     • ~27,000 peer-reviewed journals*
     • >5,000 publishers
     • ~3,000 new papers per day
     • “costing” 15 billion USD to publish
     • representing 500 billion USD of research
     *Ulrich’s database: http://ulrichsweb.serialssolutions.com/login
  7. OUR WORKSHOPS
     • Shuttleworth Foundation
     • Leicester Univ
     • Electronic Theses and Dissertations
     • Austrian Science Fund AT
     • OKFest DE
     • Eur. Bioinformatics Institute (x2)
     • Open Science Rio de Janeiro BR
     • SciDataCon, Delhi IN
     • Univ of Chicago US
     • OpenCon 2014, Wash DC, US
     • JISC, London
     • LIBER
     • Cochrane UK
     • British Library
     • Wellcome Trust
     • WHO
     OUR COLLABORATORS
     • Shuttleworth Foundation
     • Wikimedia/Wikidata
     • Mozilla
     • Open Knowledge
     • LIBER
     • British Library
     • Wellcome Trust
     • EBI (Eur. Bioinf. Inst.)
     • JISC
     • BBSRC
     • Cochrane UK
     • Open Access Button
     • SPARC
     • Creative Commons
     • CORE
     • EuropePubmedCentral
     • Cambridge University Library
  8. STRUCTURED INFORMATION
     • chemical names and structures
     • species
     • metabolism
     • phylogenetic trees
     • …
  9. INTERACTIVE DEMO of content mining: http://chemicaltagger.ch.cam.ac.uk/
  10. ContentMine at Cochrane UK, 2015-03-16
  11. CLINICAL TRIALS
     • How do we find (mentions of) clinical trials?
     • Is a document a (clinical) trial?
     • What is the subject of the trial?
     • What is the methodology used? How many/long?
     • Does the design and practice conform to CONSORT?
     • What are the outcomes?
     • Can we extract specific re-usable information?
     • Who is involved? (researchers, sponsors, patients?)
     • Has a proposed trial been completed and reported?
  12. COMMUNITY PROJECTS
     • Clinical Trials (with Cochrane UK)
     • Phyloinformatic Literature Unlocking Tools (PLUTo/BBSRC)
     • EBI – MetaboLights
     • Plant Sciences and farming (Cambridge, TGAC, OpenFarm)
     • Crystallography Open Database (COD)
     • OpenOil / OpenCorporates
  13. METABOLIGHTS
     • European Bioinformatics Institute
     • database for metabolomics experiments and derived information
     • cross-species, cross-technique, structures, biological roles, locations, concentrations
     • http://www.ebi.ac.uk/metabolights/
  14. CONTENTMINE WORKSHOPS AND HACKDAYS (Open Science Brazil, 2014-08)
     • Easily distributed software
     • Get started in 30 mins
     • Build application in a day
     • Start simple: bagOfWords, Stemming, Regex, templates
  15. What is “Content”? Example article (CC-BY): http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF Content types labelled on the slide: SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH. contentmine.org tackles these.
  16. What is “Content”? Emily Sena (neuroscience.ed.ac.uk) spends half a day digitising a diagram like this. ContentMine will soon be able to do it in 1 second.
  17. [Figure: phylogenetic tree diagram. Annotations: “Note jaggy and broken pixels”; “NEW: Bacteria must have a phylogenetic tree”. Labels: Length, Weight, Binomial Name, Culture/Strain, GENBANK ID, Evolution Rate.]
  18. contentmine.org Infrastructure (…Open semantic science…)
     • CRAWL the web for scientific documents (articles, grey literature, repositories)
     • quickSCRAPE pages (text, graphics, images, data)
     • NORMA-lize page to semantic form
     • MINE pages with your methods and tools (AMI)
     • CAT-alogue results in searchable index
     • Automate daily process (CANARY)
  19. [Diagram: CANARY pipeline. Component labels: Crawl, Feed, quickscrape (bespoke per-journal scrapers, XPath), Norma (Index & Transform; OCR, diagrams, bad HTML), AMI plugins (regex, sequences, species, per-journal taggers, per-journal metadata, chemistry, phylogenetics, farming), CAT-alogue index. Input labels: scientific literature, repositories, PDF, XML, URL, DOI, DOC, CSV, sHTML. Output label: Open NORMA-lized Scientific Literature + Facts.]
  20. POSSIBLE USES
     • Indexing/searching the literature; G***** for science
     • Current awareness; alerts and practices
     • Extraction and re-use of facts; re-computation
     • Multidisciplinary integration; co-occurrence
     • Compliance with funder/institution policies
     • Managing your Research Data!
     • Finding similar and complementary colleagues
     • Reproducibility, checking data and avoiding fraud
  21. How to leverage Content Mining for benefit of UK/EU
     • Create UK showcase of successes in mining
     • Graduate training by 3rd year UK graduate students
     • Develop EuropePMC as world resource for bio-mining
     • Training/support for UK/EU libraries about Hargreaves
     • Central collection of born-digital UK theses
     • Collect pre-copyright author manuscripts
     • Integrate CM into Research Data Management tools
     • Promote mining in all aspects of healthcare information
     • Open collection of extracted scientific facts for the world
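
Slides 14 and 18 suggest starting simple (bag of words, regex) and describe a MINE step followed by a CAT-alogue index. A minimal Python sketch of that idea is given here. It is not ContentMine's AMI or CAT code. The function names, the stopword list, the sample documents and both patterns are assumptions made for illustration: `TRIAL_ID` matches ClinicalTrials.gov-style registry identifiers ("NCT" plus eight digits), and `ABBREV_BINOMIAL` is a crude guess at abbreviated species names such as "E. coli". Stemming is left out to keep the sketch to the standard library.

```python
import re
from collections import Counter

# Hypothetical patterns, for illustration only.
TRIAL_ID = re.compile(r"\bNCT\d{8}\b")                # e.g. NCT01234567
ABBREV_BINOMIAL = re.compile(r"\b[A-Z]\. [a-z]{3,}\b")  # e.g. "E. coli" (crude)

# Small, arbitrary stopword list (an assumption, not a standard one).
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "in", "and", "to", "is", "was", "for", "with"}
)

# Invented sample texts standing in for normalized documents.
SAMPLE = {
    "doc1": "The trial NCT01234567 studied E. coli infection in adults.",
    "doc2": "E. coli strains were cultured; see also NCT01234567 and NCT07654321.",
}


def bag_of_words(text):
    """Count lowercase alphabetic tokens, skipping the stopwords above."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def mine(text):
    """MINE step: return the distinct regex matches for each fact type."""
    return {
        "trial_ids": sorted(set(TRIAL_ID.findall(text))),
        "species": sorted(set(ABBREV_BINOMIAL.findall(text))),
    }


def catalogue(documents):
    """CAT-alogue step: map (fact type, fact) to the documents mentioning it."""
    index = {}
    for doc_id, text in documents.items():
        for kind, facts in mine(text).items():
            for fact in facts:
                index.setdefault((kind, fact), set()).add(doc_id)
    return {key: sorted(ids) for key, ids in index.items()}
```

Calling `catalogue(SAMPLE)` gives a small inverted index, so a lookup such as `catalogue(SAMPLE)[("species", "E. coli")]` lists the documents that mention that term. Real literature would need far more careful patterns; the regexes here will both miss and over-match.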