SlideShare ist ein Scribd-Unternehmen logo
1 von 1
Downloaden Sie, um offline zu lesen
WikiFactMine	for	Phytochemistry	
Tom	Arrow1,	Charles	Ma;hews1,	Jenny	Molloy1,2,	Ross	Mounce1,2	Peter	Murray-Rust1,3,	Richard	Smith-Unna1,2,	Lars	Willighagen1	
1The	ContentMine,	Cambridge,	CB4	2HY,	2Dept	of	Plant	Sciences,	University	of	Cambridge,	3Dept	of	Chemistry,	University	of	Cambridge	
Mining	the	scienGfic	literature	for	facts	
All	so&ware	(Apache2)	and	Data	(CC0)	are	Open.	h9p://github.com/ContentMine		.	ContentMine.org	is	a	not-for-profit	UK	company.		
We	thank	The	Shu9leworth	FoundaMon	for	a	Fellowship	tp	PMR	and	The	Wikimedia	FoundaMon	for	funding	for	TA	and	CM.	Contact	peter@contentmine.org			
ContentMine	and	Wikidata	
Wikidata		is	“Wikipedia	for	machines”	and	supports	
ContentMine’s	FullContent	search	of	the	Bioscience	literature.	
We	go	beyond	keywords	to	automaGcally	generated	
structured	dicGonaries	with	thousands	of	terms	and	aliases.	
FullContent	means	not	just	words,	but	structured	documents,	
tables	and	diagrams.	We	(and	you)	can	search	the	whole	
literature	(via	EuropePMC	or	Crossref)	every	day	automaMcally	
or	retrospecMvely	for	your	sub-areas	of	interest.		
Example:		
Find	facts	about	terpenes	emi;ed	by	conifers	in	Indonesia.		
We	autogenerate	3	large	dicMonaries	for	all	terpenes,	conifers	
and	Indonesian	place/island	names	in	Wikidata.		
		
IntroducGon	
Understanding	phytochemical	diversity	and	metabolism	can	
answer	many	important	scienMfic	quesMons	and	provide	
economically	important	informaMon;	forming	the	foundaMon	
for	metabolic	engineering	of	plant	compounds.	Phytochemical	
database	resources	exist	but	much	informaMon	on	their	
associaMon	with	species,	enzymes	and	places	without	the	
standardised	format	and	metadata	required	to	enable	machine	
analysis.	In	some	cases	it	is	painstakingly	extracted	manually,	
but	this	approach	is	not	scalable.		
Semi-automated	extracMon	of	phytochemical	data	across	the	
full-text	open	access	literature	is	anMcipated	to	significantly	
extend	previous	abstract-only	coverage.	Here	we	present	an	
open	source	pipeline	and	preliminary	results	for	terpene	data	
mining.	
Reusable	WikiFactMine	DicGonaries.We	expand	the	Wikidata	term	terpene	automaMcally	to	~450	items	(such	as	carvone)	giving	>1000	precise	search	terms	and	data.	Similarly	in	
a	few	seconds	we	can	generate	dicMonaries	of		conifers	(1899);		and	Indonesian	islands	(6344)	making	broad	queries	precise.	
Search	Strategies.	
(A)	Daily	search.	All	new	Open	publicaMons	(300-1000)	on	EuropePMC	are	downloaded	to	WikimediaLabs,	indexed	by	dicMonaries,	and	the	extracted	facts	(dicMonary	
hits)	stored	in	Zenodo	(CERN’s	Open	repository)	.	Each	paper	may	have	hundreds	or	thousands	of	facts.	
(B)	On-Demand.	A	researcher,	especially	those	doing	systemaGc	reviews.		creates	a	fairly	general	query	in	her	field	with	a	range	of	dates,	journals,	etc.	and	downloads	
papers	(getpapers	and	quickscrape)	.	The	papers	are	filtered	locally	with	a	much	more	precise	query	(norma/ami).	
Researcher	FileStore	
Publisher	Sites	
Tidying	(PDF)	
Tagging	
Science	Search	
Data	Search	
AutomaBcally	Extracted	Indexed		Facts		
getpapers	 quickscrape	
DicBonary	Search	
Diseases	 Drugs	
Phytochem	
Species	
Norma/	
ami	
Text	
Figure
s	
Genes	
Dat
a	
Researcher	FileStore	
B	
All	daily	
30,	000	pages/day	
A	
A.	All	EPMC	papers	are	
downloaded	every	day	
and	the	facts	are	
extracted	into	Zenodo	
and	made	publicly	
available.	
B.	Researcher	searches	
repositories	and	also	scrapes	
publisher	sites	for	whatever	
chunk	of	the	literature	she	
wants.	She	runs	local	
dicMonaries		and	saves	the	
results	to	disk	where	they	can	
be	further	analyzed.	She	can	
add	any	papers	she	has	legal	
access	to	and	re-run	
whenever	required.	E.g.	Bag	
Of	Words	is	a	powerful	tool	
for	classifying	papers	
(Bio)chemical	transformaGons	 PhylogeneGcs	
A.	Diagrams	of	Chemical	and	biochemical	
reacMons	can	be	automaMcally	extracted	from	
PDFs	into	the	Researcher’s	filestore.	
B.	PhylogeneMc	trees	can	be	automaMcally	extracted	from	bitmap	
diagrams	or	PDFs,	and	species	names	verified.	Mounce,	Murray-
Rust,	Wills:	h9p://doi.org/10.3897/rio.3.e13589		
Tables	and	graphs	
C.	Tables	and	graphs	can	be	automaMcally	extracted	into	
researcher’s	filestore	and	turned	into	CSV	tables	or	spectra.	
Designed	for	re-use	with	your	favourite	tools	(R,	Python,	etc.)	
INTELLIGENT	QUERIES	
INTELLIGENT	CONTENT

Weitere ähnliche Inhalte

Ähnlich wie WikiFactMine for Plant Chemistry

Ähnlich wie WikiFactMine for Plant Chemistry (20)

ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiData
 
ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiData
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
 
Scientific search for everyone
Scientific search for everyoneScientific search for everyone
Scientific search for everyone
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
Mining facts from the plant science iterature
Mining facts from the plant science iteratureMining facts from the plant science iterature
Mining facts from the plant science iterature
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
ContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific LiteratureContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific Literature
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
Can machines understand the scientific literature?
Can machines understand the scientific literature?Can machines understand the scientific literature?
Can machines understand the scientific literature?
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
Museum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themMuseum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on them
 
openVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on virusesopenVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on viruses
 
Automatic mining of data from materials science literature
Automatic mining of data from materials science literatureAutomatic mining of data from materials science literature
Automatic mining of data from materials science literature
 
Can machines understand the scientific literature
Can machines understand the scientific literatureCan machines understand the scientific literature
Can machines understand the scientific literature
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is useful
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is useful
 

Mehr von petermurrayrust

Mehr von petermurrayrust (18)

Omdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital AgeOmdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital Age
 
Open Science Principles and Practice
Open Science Principles and PracticeOpen Science Principles and Practice
Open Science Principles and Practice
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentation
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentation
 
XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?
 
Early Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be BraveEarly Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be Brave
 
Early Career Reseachers and Open Healthcare
Early Career Reseachers and Open HealthcareEarly Career Reseachers and Open Healthcare
Early Career Reseachers and Open Healthcare
 
Openplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searchingOpenplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searching
 
Extracting science from the archive
Extracting science from the archiveExtracting science from the archive
Extracting science from the archive
 
Disrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic ComplexDisrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic Complex
 
Paradise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to MineParadise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to Mine
 
Young people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge NeocolonialismYoung people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge Neocolonialism
 
WikiFactMine: Science for Everyone
WikiFactMine: Science for EveryoneWikiFactMine: Science for Everyone
WikiFactMine: Science for Everyone
 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017
 
Biovision2017 Accessing the scientific literature
Biovision2017 Accessing the scientific literatureBiovision2017 Accessing the scientific literature
Biovision2017 Accessing the scientific literature
 
Asking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolismAsking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolism
 
Asking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolismAsking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolism
 
Towards Responsible Content Mining: A Cambridge perspective
Towards Responsible Content Mining: A Cambridge perspectiveTowards Responsible Content Mining: A Cambridge perspective
Towards Responsible Content Mining: A Cambridge perspective
 

Kürzlich hochgeladen

Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
AlMamun560346
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Sérgio Sacani
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
Sérgio Sacani
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 

Kürzlich hochgeladen (20)

Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening Designs
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 

WikiFactMine for Plant Chemistry