Sharing reusable phylogenetic data:
we're not there yet

Ross Mounce
@rmounce
http://orcid.org/0000-0002-3520-2046
A talk of
two halves
1.) Outlining the extent of the problem
(lack of) sharing, standards, care (?)
2.) What I'm trying to do about it:
Digging data out of PDFs
Re-releasing as
Where's the data?
Just ~4% of published phylogenetic studies in 2010
publicly archived their supporting phylo data in

Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie E, Kumar S, Rosauer D, & Vos R. 2012

Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis
BMC Research Notes 10.1186/1756-0500-5-574

Check our data yourself on Dryad here: 10.5061/dryad.h6pf365t
Scientists cannot be relied upon to
share published data upon request
This has been known for a while now
e.g. (in Psychology) Wicherts et al 2006
But has been confirmed to be true for phylogenetics too:
Drew et al 2013 'Lost Branches in the Tree of Life'
report that just ~16% of researchers contacted supplied
the requested ('published') phylo data.
My own experience tallies with this – I soon stopped bothering to try and
ask people via email for a copy of their published data. It's a waste of time.
The (Single) Supplementary Data File
was a Y2K solution – a dump
Many legacy journal supplementary data systems
bury data and leave it there to decompose
Often not in a re-usable form, e.g. a lazy PDF
Sometimes 'typeset', corrupting the data
A jumble of words & data where the bit you
want is on page 92 (no programmatic access)

Research
BURIED and really not very discoverable
Data

Do reviewers even look at it? I think not tbh
I wasted too much of my PhD
trying to get usable data to re-analyze
This is what I felt like...

So I tried to do something
about it...

An open letter in support of
palaeontology data archiving
www.supportpalaeodatarchiving.co.uk

Which was picked up by Nature News
Which, in turn, got me in touch with:
Part 2
Since few will help you to re-use their data
You've got to dig it out
and
make it re-usable yourself
AND
re-release it openly
so no-one else wastes their time doing this
It's not just phylogenetics.
I learned from the Open Knowledge Conference (Berlin 2011)
that a lot of different academic fields also seem to struggle to
make re-usable published data available.

If it's a common, shared problem...
why not seek a shared, cross-disciplinary solution?
AMI (Amanuensis)
Building upon tools first developed
in computational chemistry by the Murray-Rust lab
e.g.
ChemicalTagger → PhyloTagger (Entity tagging)
(Chem)PubCrawler → (Phylo)PubCrawler
(to get 10,000+ PDFs to work on)

https://bitbucket.org/nickday/pub-crawler
http://www-ucc.ch.cam.ac.uk/products/software/chemicaltagger
Open Source
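
As a rough illustration of what entity tagging for taxon names could look like, here is a minimal sketch. It is not how PhyloTagger or ChemicalTagger actually work: the regular expression, the `KNOWN_GENERA` lexicon and the `tag_binomials` function are all hypothetical, and the genus names are simply borrowed from the example slide further on. A pattern for "Capitalised word + lowercase word" alone over-matches ordinary prose, so candidates are filtered against the lexicon.

```python
import re

# Hypothetical pattern: a capitalised word followed by a lowercase word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

# Hypothetical mini-lexicon; a real tagger would need a full taxonomic name list.
KNOWN_GENERA = {"Turdus", "Taeniopygia", "Serinus", "Lanius", "Pavo", "Falco"}


def tag_binomials(text):
    """Return (genus, epithet) pairs whose genus is in the lexicon."""
    return [
        (m.group(1), m.group(2))
        for m in BINOMIAL.finditer(text)
        if m.group(1) in KNOWN_GENERA
    ]


sample = "Opsin sequences from Turdus iliacus and Pavo cristatus were compared."
print(tag_binomials(sample))
```

Without the lexicon filter, "Opsin sequences" would also be tagged, which is why pattern matching alone is not enough for this job.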
BBSRC grant approved
“PLUTo: Phyloinformatic Literature Unlocking Tools”
Software for making published phyloinformatic
data discoverable, open, and reusable
...I just need to get my PhD viva done & rubber-stamped

Instructions for getting the current working setup here:
(multiple repositories, dependencies & requirements!)
http://rossmounce.co.uk/2013/10/06/setting-up-ami2-on-windows/
PDF 
HTML


AMI

Evolution of ultraviolet vision in the largest avian radiation - the passerines
Anders Ödeen 1*, Olle Håstad 2,3 and Per Alström 4

Styles, superscripts and diåcritics preserved!
PDF 
Turdus iliacus
Taeniopygia guttata
Serinus canaria
Lanius excubitor
Melopsittacus undulatus
Pavo cristatus
Sturnus vulgaris
Dolichonyx oryzivorus
Ficedula hypoleuca
Vaccinium myrtillus
Falco tinnunculus

Turdus
Pomatostomus
Leothrix
Amytornis
Acanthisitta
Orthonyx x 2
Malurus
Cnemophilus x 4
Philesturnus x 2
Motacilla x 2
Toxorhampus x 2
Typical phylo tree: 60 nodes, complex and minuscule annotation,
vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
AMI
Posterior probability: 0.84, 0.91, 0.93, 0.95
Branch lengths: 23.12, 34.54, 37.21, 38.55

NexML
HTML

Genus: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma
Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae
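
To show why extracted branch lengths and support values only become re-usable once they are attached to a tree in a standard format, here is a minimal sketch. It writes Newick rather than NexML, purely because Newick fits in a few lines; it is not AMI's output code. The `to_newick` function and the toy topology are hypothetical: the taxon names and numbers are taken from the slides, but how they are paired here is an assumption made only for illustration.

```python
def to_newick(node):
    """Serialise a nested-dict tree to Newick text (without the trailing ';')."""
    children = node.get("children", [])
    if children:
        inner = ",".join(to_newick(child) for child in children)
        # For internal nodes the label slot carries the support value, if any.
        text = f"({inner}){node.get('support', '')}"
    else:
        text = node["name"]
    if "length" in node:
        text += f":{node['length']}"
    return text


# Toy topology (assumed, not taken from the paper shown on the slides).
tree = {
    "children": [
        {"name": "Acanthisitta", "length": 38.55},
        {
            "support": 0.95,
            "length": 23.12,
            "children": [
                {"name": "Amytornis", "length": 34.54},
                {"name": "Malurus", "length": 37.21},
            ],
        },
    ],
}

newick = to_newick(tree) + ";"
print(newick)
```

A string like this can be read by standard phylogenetics software, which is not the case for numbers locked inside a figure in a PDF.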
Acknowledgements & Thanks

For the Panton Fellowship,
inspiration and support

To the organisers
of both the session:
Nico, Hilmar, Rutger
and the conference
as a whole!

For travel & accommodation
support, without which I couldn't
possibly attend TDWG

My main collaborators on PLUTo: Matthew Wills and Peter Murray-Rust
