SlideShare a Scribd company logo
1 of 21
Overview of Practical Content Mining 
Peter Murray-Rust 
JISC, London, 2014-12-01
What is Content Mining 
• Mining Text, Tables and Lists, Diagrams, Images 
• Born-digital documents 
• High-throughput (millions of items/year) 
• Formal and Informal Collaboration 
• Role of UK 
• Hands-on 
• Everything is OPEN (OSI , CC-BY, CC0)
The Right to Read is the Right to Mine 
http://contentmine.org
ContentMine 
• 1-2 year Shuttleworth Funding from 2014-03 
• Free to everyone, Open Source, updated daily 
• Structured Text, and Image/Diagram Mining 
• Workshops for training and training trainers 
• Bottom-up community development 
– Bioscience (EuropePMC, BBSRC) 
– Disease Ebola 
– Astrophysics (Stray Toaster) 
– Chemistry (TSB, EBI, PennState - Citeseer) 
• We fight for Justice and Freedom
ContentMine People 
• Jenny Molloy 
• Ross Mounce 
• Peter Murray-Rust + volunteers (Bioscience, disease) 
• Richard Smith-Unna + 20 quickscrape volunteers 
• Steph Unna 
• Cottage Labs (Mark MacGillivray, Emanuil Tolev, 
Richard Jones) 
• Prof Charles Oppenheim 
• Karien Bezuidenhout (Shuttleworth) 
• Advisory Board RSN
ContentMine Workshops 
(1-hour -> full day or more) 
2014-May->Nov 
• Budapest/Shuttleworth 
• Leicester Univ 
• Electronic Theses and Dissertations 
• Austrian Science Fund AT 
• OKFest DE 
• Eur. Bioinformatics Institute 
• Open Science Rio de Janeiro BR 
• Sci DataCon , Delhi IN 
• Univ of Chicago US 
• OpenCon 2014, Wash DC. US 
Upcoming 
• JISC 
• LIBER 
• BL 
• Wellcome Trust 
• WHO
Ebola Collaborators (Atlanta) 
Roxanne Further Moore, Jessie 
Gunter, April Clyburne-Sherin
Regular Expressions 
(Easier than Crosswords or Sudoku) 
Ebola Ebola 
Mali (not 
Malicious) 
MaliW (end of word) 
Bat or bat [Bb]at (alternatives) 
bat or bats bats? (optional letter) 
Bat or Bats or bat 
[Bb]ats? 
or bats 
Sudden onset [Ss]uddens+onset (space/s) 
Panthera leo or 
[A-Z][a-z]+s+[a-z]+ 
Gorilla gorilla 
(ranges of letters)
Ebola regex 
• <compoundRegex title="ebola"> 
• <regex weight="1.0" fields="ebola" case="">(Ebola)</regex> 
• <regex weight="1.0" fields="marburg">(Marburg)</regex> 
• <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagics+fever)</regex> 
• <regex weight="0.8" fields="sudden_onset">([Ss]uddens+onset)</regex> 
• <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omitings+diarrho?ea)</regex> 
• <regex weight="0.5" fields="guinea">(Guinea)</regex> 
• <regex weight="0.5" fields="sierra_leone">(Sierras+Leone)</regex> 
• <regex weight="0.5" fields="liberia">(Liberia)</regex> 
• <regex weight="0.5" fields="mali">(Mali)W</regex> 
• <regex weight="0.6" fields="contact_tracing">([Cc]ontacts+tracing)</regex> 
• <regex weight="0.5" fields="bat">W([Bb]ats?W)</regex> 
• <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex> 
• <regex weight="0.5" fields="drc">(Democratic Republics*(s*of)?(s*the)?s*Congo)(DRC)</regex> 
• <regex weight="0.6" fields="safe_burial">([Ss]afes+burials+practice?s)</regex> 
• <regex weight="1.0" fields="etu">([Ee]bolas+treatments+units?)(ETU)</regex> 
• </compoundRegex> 
I 
15 mins to create, 15 mins to install and test 
Or run online at CottageLabs
Results of Regex on Ebola 
• <resultsList xmlns="http://www.xml-cml.org/ami"> 
• <results xmlns=""> 
• <source xmlns="http://www.xml-cml.org/ami" 
• name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" /> 
• <result> 
• <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7" 
• lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak "> 
• <regex xmlns="" weight="1.0" fields="[ebola]"> 
• <pattern>(Ebola)</pattern> 
• </regex> 
• <hits xmlns=""> 
• <hit ebola="Ebola" /> 
• </hits> 
• </regex> 
• </result> 
• <result> 
• <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9" 
• lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains "> 
• <regex xmlns="" weight="0.5" fields="[sierra_leone]"> 
• <pattern>(Sierras+Leone)</pattern> 
• </regex> 
• <hits xmlns=""> 
• <hit sierra_leone="Sierra Leone" /> 
• </hits> 
• </regex> 
• </result>
Demo of Content Mining 
ChemicalTagger (Lezan Hawizy) a shallow, 
domain-specific, semantic parser for un/natural 
language.
Bacterial WP_phylogenetic tree 
Genbank ID 
American Type 
Culture Collection 
WP: Clostridium_butyricum 
Our machines have read and interpreted 4300 in an hour with > 95% accuracy 
Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
RSU: Richard Smith-Unna 
PMR: Peter Murray-Rust 
CL: CottageLabs 
Queues 
Repos 
Scientific 
literature 
Science 
Plugins 
Science 
Volunteers 
Collaboration with 
Open Access Button
AMI (extraction) architecture 
PDF2SVG 
SVG2XML 
Image 
analysis 
sections tables 
AMI 
captioned 
diagrams 
Regex Species Phylo Chem
Immediate Stakeholders 
– Researchers (bio, EBI, chem, materials, astro) 
– Funders WT, FWF (Austria), RCUK, 
– Libraries (repositories, theses) 
– Service providers (EuropePMC) 
– knowledge-based SMEs 
– Library organisations (JISC, RLUK, LIBER, SPARC) 
– Non-profits (Wikimedia, WHO, Mozilla)
Content production 
• Scholarly articles 
• Theses 
• Repositories 
• Grey scientific literature 
• Grey politico-socio-legal literature 
• Company output (reports, accounts, contracts) 
(e.g. OpenOil)
Licences destroy Content Mining 
WE WALKED OUT 
• Brit Library 
• JISC 
• RLUK 
• OKFN 
• … 
• Ross Mounce 
• PM-R 
STM Publishers Licence 
2012_03_15_Sample_Licence_Text_Data_Mining.pdf 
(Summary: PMR has NO rights) 
• [cannot publish to: ] “libraries, repositories, or archives” 
• [cannot] “Make the results of any TDM Output available on an externally facing server or 
website” 
• “Subscriber shall pay a […] fee” 
Heather Piwowar: “negotiating with publishers [made me physically ill]”
Challenges 
• Active opposition from content “owners” 
including serious lobbying and FUD 
• Ignorance and apathy from universities; 
inappropriate reward system 
• Sub-optimal technology of publishers 
• Lack of common infrastructure, technology, 
APIs 
• And it’s objectively messy anyway
Technical problems 
• PDF: lacks words, tables, diagrams 
• Non-Unicode character sets (or worse) 
• Graphics objects largely destroyed (converted 
to PNG or worse) 
• No communal ontology for document 
structure. 
• HTML carries PublisherJunk and Javascript
Goals of Mining 
• Classification of resources 
• Entity extraction and indexing 
• Aggregation within discipline 
• Inter-disciplinary, e.g. biodiversity, 
phytochemistry 
• Repurposing (twitter, ePub, annotation) 
• Semantification/intelligent documents 
• Detection of error and fraud
What we need 
• Inter/national commitment to infrastructure 
• Common ontologies and APIs 
• Development of community 
• Go beyond academia; non-academic reward 
system

More Related Content

What's hot

Copyright Reform and Open Data
Copyright Reform and Open DataCopyright Reform and Open Data
Copyright Reform and Open Datapetermurrayrust
 
Principles and practice of Open Science
Principles and practice of Open SciencePrinciples and practice of Open Science
Principles and practice of Open Sciencepetermurrayrust
 
Content Mining for Machines and Humans
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humanspetermurrayrust
 
Disruptive Communities and Technology
Disruptive Communities and TechnologyDisruptive Communities and Technology
Disruptive Communities and Technologypetermurrayrust
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neurosciencepetermurrayrust
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData TheContentMine
 
Embrace the Open Revolution
Embrace the Open RevolutionEmbrace the Open Revolution
Embrace the Open Revolutionpetermurrayrust
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)petermurrayrust
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesTheContentMine
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome TrustTheContentMine
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 TheContentMine
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in NeuroscienceTheContentMine
 
Open Data and Open Science
Open Data and Open ScienceOpen Data and Open Science
Open Data and Open ScienceTheContentMine
 
ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiDataTheContentMine
 
Disrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic ComplexDisrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic Complexpetermurrayrust
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migrationpetermurrayrust
 
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome CampusBibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome CampusDuncan Hull
 

What's hot (20)

Ebi
EbiEbi
Ebi
 
Copyright Reform and Open Data
Copyright Reform and Open DataCopyright Reform and Open Data
Copyright Reform and Open Data
 
Principles and practice of Open Science
Principles and practice of Open SciencePrinciples and practice of Open Science
Principles and practice of Open Science
 
Content Mining for Machines and Humans
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humans
 
Making Theses USEFUL
Making Theses USEFULMaking Theses USEFUL
Making Theses USEFUL
 
Disruptive Communities and Technology
Disruptive Communities and TechnologyDisruptive Communities and Technology
Disruptive Communities and Technology
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Csvconf
CsvconfCsvconf
Csvconf
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
 
Embrace the Open Revolution
Embrace the Open RevolutionEmbrace the Open Revolution
Embrace the Open Revolution
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Open Data and Open Science
Open Data and Open ScienceOpen Data and Open Science
Open Data and Open Science
 
ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiData
 
Disrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic ComplexDisrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic Complex
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
 
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome CampusBibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus
Bibliography 2.0: A citeulike case study from the Wellcome Trust Genome Campus
 

Similar to Petermrjisc20141201

Overview of Practical Content Mining
Overview of Practical Content Mining Overview of Practical Content Mining
Overview of Practical Content Mining TheContentMine
 
We've Got Issues: Issue Tracking and Workflow in the Digital Library
We've Got Issues: Issue Tracking and Workflow in the Digital LibraryWe've Got Issues: Issue Tracking and Workflow in the Digital Library
We've Got Issues: Issue Tracking and Workflow in the Digital LibraryElectronic Resources & Libraries
 
TheContentMine: Mining for Everyone
TheContentMine: Mining for EveryoneTheContentMine: Mining for Everyone
TheContentMine: Mining for EveryoneTheContentMine
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is usefulTheContentMine
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is usefulpetermurrayrust
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and  Medicine from the scholarly literatureAutomatic Extraction of Science and  Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literaturepetermurrayrust
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureAutomatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureTheContentMine
 
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsJon Voss
 
Open interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBIOpen interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBIPistoia Alliance
 
Knowing what AI Systems Don't know and Why it matters
Knowing what AI  Systems Don't know and Why it mattersKnowing what AI  Systems Don't know and Why it matters
Knowing what AI Systems Don't know and Why it mattersJames Hendler
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biologypetermurrayrust
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic BiologyTheContentMine
 
Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Yunyao Li
 
Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Davood Rafiei
 

Similar to Petermrjisc20141201 (20)

Overview of Practical Content Mining
Overview of Practical Content Mining Overview of Practical Content Mining
Overview of Practical Content Mining
 
We've Got Issues: Issue Tracking and Workflow in the Digital Library
We've Got Issues: Issue Tracking and Workflow in the Digital LibraryWe've Got Issues: Issue Tracking and Workflow in the Digital Library
We've Got Issues: Issue Tracking and Workflow in the Digital Library
 
TheContentMine: Mining for Everyone
TheContentMine: Mining for EveryoneTheContentMine: Mining for Everyone
TheContentMine: Mining for Everyone
 
Petermrbl20141127
Petermrbl20141127Petermrbl20141127
Petermrbl20141127
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is useful
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is useful
 
How One Monkey on a Typewriter Made a Difference to Online Chemistry
How One Monkey on a Typewriter Made a Difference to Online ChemistryHow One Monkey on a Typewriter Made a Difference to Online Chemistry
How One Monkey on a Typewriter Made a Difference to Online Chemistry
 
Ir1
Ir1Ir1
Ir1
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and  Medicine from the scholarly literatureAutomatic Extraction of Science and  Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureAutomatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
 
Wolven, Hickey, and Henderson, "Identifiers: New Problems, New Solutions, Par...
Wolven, Hickey, and Henderson, "Identifiers: New Problems, New Solutions, Par...Wolven, Hickey, and Henderson, "Identifiers: New Problems, New Solutions, Par...
Wolven, Hickey, and Henderson, "Identifiers: New Problems, New Solutions, Par...
 
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
 
Groeling, Tim: NewsScape: Preserving TV News
Groeling, Tim: NewsScape: Preserving TV NewsGroeling, Tim: NewsScape: Preserving TV News
Groeling, Tim: NewsScape: Preserving TV News
 
Open interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBIOpen interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBI
 
Knowing what AI Systems Don't know and Why it matters
Knowing what AI  Systems Don't know and Why it mattersKnowing what AI  Systems Don't know and Why it matters
Knowing what AI Systems Don't know and Why it matters
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
PHP - Introduction to PHP MySQL Joins and SQL Functions
PHP -  Introduction to PHP MySQL Joins and SQL FunctionsPHP -  Introduction to PHP MySQL Joins and SQL Functions
PHP - Introduction to PHP MySQL Joins and SQL Functions
 
Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...
 
Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...
 

More from petermurrayrust

Omdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital AgeOmdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital Agepetermurrayrust
 
Open Science Principles and Practice
Open Science Principles and PracticeOpen Science Principles and Practice
Open Science Principles and Practicepetermurrayrust
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentationpetermurrayrust
 
Can machines understand the scientific literature?
Can machines understand the scientific literature?Can machines understand the scientific literature?
Can machines understand the scientific literature?petermurrayrust
 
OpenVirus at OpenPublishingFest
OpenVirus at OpenPublishingFestOpenVirus at OpenPublishingFest
OpenVirus at OpenPublishingFestpetermurrayrust
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentationpetermurrayrust
 
Automatic mining of data from materials science literature
Automatic mining of data from materials science literatureAutomatic mining of data from materials science literature
Automatic mining of data from materials science literaturepetermurrayrust
 
openVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on virusesopenVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on virusespetermurrayrust
 
XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?petermurrayrust
 
Early Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be BraveEarly Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be Bravepetermurrayrust
 
Early Career Reseachers and Open Healthcare
Early Career Reseachers and Open HealthcareEarly Career Reseachers and Open Healthcare
Early Career Reseachers and Open Healthcarepetermurrayrust
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search petermurrayrust
 
Scientific search for everyone
Scientific search for everyoneScientific search for everyone
Scientific search for everyonepetermurrayrust
 
Openplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searchingOpenplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searchingpetermurrayrust
 
Extracting science from the archive
Extracting science from the archiveExtracting science from the archive
Extracting science from the archivepetermurrayrust
 
WikiFactMine: Ontology for Everybody and Everything
WikiFactMine: Ontology for Everybody and EverythingWikiFactMine: Ontology for Everybody and Everything
WikiFactMine: Ontology for Everybody and Everythingpetermurrayrust
 
Paradise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to MineParadise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to Minepetermurrayrust
 
Young people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge NeocolonialismYoung people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge Neocolonialismpetermurrayrust
 
WikiFactMine: Science for Everyone
WikiFactMine: Science for EveryoneWikiFactMine: Science for Everyone
WikiFactMine: Science for Everyonepetermurrayrust
 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017petermurrayrust
 

More from petermurrayrust (20)

Omdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital AgeOmdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital Age
 
Open Science Principles and Practice
Open Science Principles and PracticeOpen Science Principles and Practice
Open Science Principles and Practice
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentation
 
Can machines understand the scientific literature?
Can machines understand the scientific literature?Can machines understand the scientific literature?
Can machines understand the scientific literature?
 
OpenVirus at OpenPublishingFest
OpenVirus at OpenPublishingFestOpenVirus at OpenPublishingFest
OpenVirus at OpenPublishingFest
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentation
 
Automatic mining of data from materials science literature
Automatic mining of data from materials science literatureAutomatic mining of data from materials science literature
Automatic mining of data from materials science literature
 
openVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on virusesopenVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on viruses
 
XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?
 
Early Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be BraveEarly Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be Brave
 
Early Career Reseachers and Open Healthcare
Early Career Reseachers and Open HealthcareEarly Career Reseachers and Open Healthcare
Early Career Reseachers and Open Healthcare
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
Scientific search for everyone
Scientific search for everyoneScientific search for everyone
Scientific search for everyone
 
Openplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searchingOpenplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searching
 
Extracting science from the archive
Extracting science from the archiveExtracting science from the archive
Extracting science from the archive
 
WikiFactMine: Ontology for Everybody and Everything
WikiFactMine: Ontology for Everybody and EverythingWikiFactMine: Ontology for Everybody and Everything
WikiFactMine: Ontology for Everybody and Everything
 
Paradise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to MineParadise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to Mine
 
Young people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge NeocolonialismYoung people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge Neocolonialism
 
WikiFactMine: Science for Everyone
WikiFactMine: Science for EveryoneWikiFactMine: Science for Everyone
WikiFactMine: Science for Everyone
 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017
 

Recently uploaded

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxcallscotland1987
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 

Recently uploaded (20)

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 

Petermrjisc20141201

  • 1. Overview of Practical Content Mining Peter Murray-Rust JISC, London, 2014-12-01
  • 2. What is Content Mining • Mining Text, Tables and Lists, Diagrams, Images • Born-digital documents • High-throughput (millions of items/year) • Formal and Informal Collaboration • Role of UK • Hands-on • Everything is OPEN (OSI , CC-BY, CC0)
  • 3. The Right to Read is the Right to Mine http://contentmine.org
  • 4. ContentMine • 1-2 year Shuttleworth Funding from 2014-03 • Free to everyone, Open Source, updated daily • Structured Text, and Image/Diagram Mining • Workshops for training and training trainers • Bottom-up community development – Bioscience (EuropePMC, BBSRC) – Disease Ebola – Astrophysics (Stray Toaster) – Chemistry (TSB, EBI, PennState - Citeseer) • We fight for Justice and Freedom
  • 5. ContentMine People • Jenny Molloy • Ross Mounce • Peter Murray-Rust + volunteers (Bioscience, disease) • Richard Smith-Unna + 20 quickscrape volunteers • Steph Unna • Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones) • Prof Charles Oppenheim • Karien Bezuidenhout (Shuttleworth) • Advisory Board RSN
  • 6. ContentMine Workshops (1-hour -> full day or more) 2014-May->Nov • Budapest/Shuttleworth • Leicester Univ • Electronic Theses and Dissertations • Austrian Science Fund AT • OKFest DE • Eur. Bioinformatics Institute • Open Science Rio de Janeiro BR • Sci DataCon , Delhi IN • Univ of Chicago US • OpenCon 2014, Wash DC. US Upcoming • JISC • LIBER • BL • Wellcome Trust • WHO
  • 7. Ebola Collaborators (Atlanta) Roxanne Further Moore, Jessie Gunter, April Clyburne-Sherin
  • 8. Regular Expressions (Easier than Crosswords or Sudoku) Ebola Ebola Mali (not Malicious) MaliW (end of word) Bat or bat [Bb]at (alternatives) bat or bats bats? (optional letter) Bat or Bats or bat [Bb]ats? or bats Sudden onset [Ss]uddens+onset (space/s) Panthera leo or [A-Z][a-z]+s+[a-z]+ Gorilla gorilla (ranges of letters)
  • 9. Ebola regex • <compoundRegex title="ebola"> • <regex weight="1.0" fields="ebola" case="">(Ebola)</regex> • <regex weight="1.0" fields="marburg">(Marburg)</regex> • <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagics+fever)</regex> • <regex weight="0.8" fields="sudden_onset">([Ss]uddens+onset)</regex> • <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omitings+diarrho?ea)</regex> • <regex weight="0.5" fields="guinea">(Guinea)</regex> • <regex weight="0.5" fields="sierra_leone">(Sierras+Leone)</regex> • <regex weight="0.5" fields="liberia">(Liberia)</regex> • <regex weight="0.5" fields="mali">(Mali)W</regex> • <regex weight="0.6" fields="contact_tracing">([Cc]ontacts+tracing)</regex> • <regex weight="0.5" fields="bat">W([Bb]ats?W)</regex> • <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex> • <regex weight="0.5" fields="drc">(Democratic Republics*(s*of)?(s*the)?s*Congo)(DRC)</regex> • <regex weight="0.6" fields="safe_burial">([Ss]afes+burials+practice?s)</regex> • <regex weight="1.0" fields="etu">([Ee]bolas+treatments+units?)(ETU)</regex> • </compoundRegex> I 15 mins to create, 15 mins to install and test Or run online at CottageLabs
  • 10. Results of Regex on Ebola • <resultsList xmlns="http://www.xml-cml.org/ami"> • <results xmlns=""> • <source xmlns="http://www.xml-cml.org/ami" • name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" /> • <result> • <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7" • lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak "> • <regex xmlns="" weight="1.0" fields="[ebola]"> • <pattern>(Ebola)</pattern> • </regex> • <hits xmlns=""> • <hit ebola="Ebola" /> • </hits> • </regex> • </result> • <result> • <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9" • lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains "> • <regex xmlns="" weight="0.5" fields="[sierra_leone]"> • <pattern>(Sierras+Leone)</pattern> • </regex> • <hits xmlns=""> • <hit sierra_leone="Sierra Leone" /> • </hits> • </regex> • </result>
  • 11. Demo of Content Mining ChemicalTagger (Lezan Hawizy) a shallow, domain-specific, semantic parser for un/natural language.
  • 12. Bacterial WP_phylogenetic tree Genbank ID American Type Culture Collection WP: Clostridium_butyricum Our machines have read and interpreted 4300 in an hour with > 95% accuracy Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
  • 13. RSU: Richard Smith-Unna PMR: Peter Murray-Rust CL: CottageLabs Queues Repos Scientific literature Science Plugins Science Volunteers Collaboration with Open Access Button
  • 14. AMI (extraction) architecture PDF2SVG SVG2XML Image analysis sections tables AMI captioned diagrams Regex Species Phylo Chem
  • 15. Immediate Stakeholders – Researchers (bio, EBI, chem, materials, astro) – Funders WT, FWF (Austria), RCUK, – Libraries (repositories, theses) – Service providers (EuropePMC) – knowledge-based SMEs – Library organisations (JISC, RLUK, LIBER, SPARC) – Non-profits (Wikimedia, WHO, Mozilla)
  • 16. Content production • Scholarly articles • Theses • Repositories • Grey scientific literature • Grey politico-socio-legal literature • Company output (reports, accounts, contracts) (e.g. OpenOil)
  • 17. Licences destroy Content Mining WE WALKED OUT • Brit Library • JISC • RLUK • OKFN • … • Ross Mounce • PM-R STM Publishers Licence 2012_03_15_Sample_Licence_Text_Data_Mining.pdf (Summary: PMR has NO rights) • [cannot publish to: ] “libraries, repositories, or archives” • [cannot] “Make the results of any TDM Output available on an externally facing server or website” • “Subscriber shall pay a […] fee” Heather Piwowar: “negotiating with publishers [made me physically ill]”
  • 18. Challenges • Active opposition from content “owners” including serious lobbying and FUD • Ignorance and apathy from universities; inappropriate reward system • Sub-optimal technology of publishers • Lack of common infrastructure, technology, APIs • And it’s objectively messy anyway
  • 19. Technical problems • PDF: lacks words, tables, diagrams • Non-Unicode character sets (or worse) • Graphics objects largely destroyed (converted to PNG or worse) • No communal ontology for document structure. • HTML carries PublisherJunk and Javascript
  • 20. Goals of Mining • Classification of resources • Entity extraction and indexing • Aggregation within discipline • Inter-disciplinary, e.g. biodiversity, phytochemistry • Repurposing (twitter, ePub, annotation) • Semantification/intelligent documents • Detection of error and fraud
  • 21. What we need • Inter/national commitment to infrastructure • Common ontologies and APIs • Development of community • Go beyond academia; non-academic reward system