SlideShare ist ein Scribd-Unternehmen logo
1 von 18
Downloaden Sie, um offline zu lesen
The (young) Researcher POV

           Text & Data Mining
             Content Mining

               Ross Mounce
    University of Bath, PhD Candidate
Open Knowledge Foundation Panton Fellow




                   #Licences4Europe @RMounce
What is content mining?

●   Text
●   Pictures
●   Videos
●   Audio
●   Metadata
●   ...not just text or data!


( I am making all my text on all these slides open and
re-usable under the Creative Commons Zero Waiver )
The importance of scale


●   The indexed WWW has >13 billion pages
●   >630,000,000 sites (Netcraft, Feb 2013)
●   2 million scholarly articles published per year c. 4%
    growth rate each year
●   >50,000,000 scholarly articles so far (Jinha, 2010)
●   ~72,000 PDFs from PloS are just ~15GB
●   >4,000,000 English-language Wikipedia pages
    downloaded as text are just ~9GB
My interest in Content Mining

●   Facts cannot be copyrighted
●   Billions of facts are unfortunately trapped only in
    copyright-protected scholarly research articles
●   My collaborators & I simply want to liberate these
    facts to make them openly available for
    everyone with no restrictions on access or use
●   Within this, my particular interest in evolutionary
    history; e.g. phylogenetic tree data
I want to help
reconstruct the
'Tree of Life'.
Thousands of other
scientists also want
to do this.

Phylogenetic
Research gets
published
piecemeal across
>100,000 papers.
Hard to synthesise,
much of it not in
PMC
76,523 articles
           out of a total of 91,788 for this period


                                                      34




My content of interest is scattered across 1000+ journals
Missing data,
~15,000 articles




                   1 or 2 journals not
                   shown (off-scale)
Lots of journals, lots of publishers

 Just those top 500 most phylogeny 'rich' journals
 are published by ~120 different publishers

       ~ 17%      Elsevier (subscription journals)
       ~ 13%      Springer (subscription journals)
       ~ 15%      Wiley (subscription journals)

   The biggest 3 publishers combined still only have
         less than 50% of what I want to mine


   The long tail of content (15,000 articles) in journals
  outside the 'top 500' is likely to come from many more
                    different publishers.
Standard license agreements

 ●   Many explicitly do not allow mining

e.g. Nature, JSTOR, AIP, ACS, Elsevier, Taylor & Francis...


“This licence does not include any derivative use of the Site or
the Materials, any collection and use of any product listings,
descriptions, or prices; any downloading or copying of account
information for the benefit of another merchant; or any use of
data mining, robots or similar data gathering and extraction
tools...”
                                    InformaWorld
     Source   See also: http://www.cdlib.org/services/collections/redactions/
                      http://www.mpdl.mpg.de/services/ezb-readme_en.htm
Asking permission doesn't scale
●   There are 90,000 different publishers in 215 different countries listed
    in Ulrich's Periodicals Directory & >336,000 periodicals.

         “I had a phone call on Friday with my university librarian
                     and six (!) Elsevier employees.”
                                                 Heather Piwowar


5 March → 16 April just to get permission/access to start work on
                 just one publisher's content

      Could have done all of the analysis in time period.
Hugely intimidating & patronizing process, an utter waste of time
Why aren't more people mining?
A) Many disciplines not computationally-savvy frankly


B) Lack of awareness that this technology & capability exist & what it can do


C) Lack of teaching/training about content mining


D) Futility of even trying – how can one get permission from 1000's of
publishers in the context of short research funding time-scales.
Choose another research project – it's less risky.


E) Live in the wrong jurisdiction (e.g. Europe) and you're at a massive
disadvantage relative to S. Korea, Japan, Israel, US... due to local laws.
(Probably more reasons too, this list is not exhaustive)
The Right to Read is The Right to Mine
●   Through my institutional affiliation (U. of Bath) I have legitimate paid
    access to many thousands of paywalled journals
●   If I visit the BL or NHM library I have legitimate access to even more
●   Our POV: “The Right to Read is The Right to Mine”
●   We do not see any distinction between computer-assisted 'human'
    reading aka 'ocular access' (e.g. viewing a PDF), and computer-
    assisted computer reading – aka content mining.
    The only difference is the latter is clearly more time & resource
    efficient – it's much quicker and it scales well.
Blocking & Criminalizing Research
●   I have had my access to at least one publisher (BioOne) cut-off
    before. My 'crime' – downloading more than 25 PDFs in 5mins.
●   Elsevier once blocked access to the entire U. of Cambridge campus
    for a week(!) because Peter Murray-Rust through legitimate access
    downloaded 'too many' PDFs in the course of his research
●   Countless other cases, large majority not widely reported
●   ...Aaron Swartz – need I say more?


    We have legitimate access, we do not seek to redistribute content
    wholesale. What's the problem? Why are we being criminalized?
OA publishers make it easy


One can easily
download the entire
content of many OA
publishers e.g. PloS,
BMC, Hindawi...

They actively facilitate
& encourage corpus
downloads

All of PLoS as PDF up
to mid-2010 is just 15GB
(~72,000 articles)
It's largely only OA publishers
   allowing content mining to occur

“When permission is requested [by researchers], 35% of publisher
respondents allow mining in the majority or all of cases”


[tellingly, 85% of those “35% of publishers” that allow it are Open
       Access publishers, not subscription-access publishers]


      from the Publishing Research Consortium's own report
                   (Smit & van der Graaf, 2011)
This is also the reason why only 13% of PMC is 'safe' to mine,
despite 100% of PMC being 'free to read' (ocular access only)
For-profit publishers have incentives
  to actively block content mining

 “53 % of publisher respondents will decline mining requests if the
results can replace or compete with their own products and services”


      from the Publishing Research Consortium's own report
                   (Smit & van der Graaf, 2011)


   My POV: some publishers are clearly blocking the liberation of
uncopyrightable facts from their content so they can continue making
         money from access to, or services around them.
           N.B. >80% of research is public sector funded
The Hargreaves Report (UK)

       Happily for me, the legal situation is about to change in the UK
                            (from October, 2013)

  The UK government has accepted and is enforcing many of the changes
                          recommended in:
      'Digital Opportunity: A Review of Intellectual Property and Growth'
            An independent report by Prof. Ian Hargreaves (2011)

Limitations and exceptions for non-commercial content mining research usage
                   of copyright material will soon be legal.


I, the Open Knowledge Foundation, and many other groups think this is clearly
    the optimal, well-tested solution to the mining access/permission problem
Thanks

      “But keep your minds open: maybe in some cases
                     licensing won’t be the solution”

        Neelie Kroes, Vice-President of the European Commission
             responsible for the Digital Agenda, 4th Feb 2013


         Content mining for research is clearly one of those cases


      Sincere thanks to the Open Knowledge Foundation for travel funding without which
                                     I wouldn't be here.

( Reminder: I am making all my text on all these slides open
 and re-usable under the Creative Commons Zero Waiver )

Weitere ähnliche Inhalte

Was ist angesagt?

The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open DataRoss Mounce
 
Open Access: What it is and why it is required for scholarly community?
Open Access: What it is and why it is required for scholarly community?Open Access: What it is and why it is required for scholarly community?
Open Access: What it is and why it is required for scholarly community?Sukhdev Singh
 
Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsJon Voss
 
Open access resources
Open access resourcesOpen access resources
Open access resourcesAkshay Kumar
 
Tonta World Is Flat Yet Not Open Oslo Workshop 10 May 2006 Final Revised
Tonta World Is Flat Yet Not Open Oslo Workshop 10 May 2006 Final RevisedTonta World Is Flat Yet Not Open Oslo Workshop 10 May 2006 Final Revised
Tonta World Is Flat Yet Not Open Oslo Workshop 10 May 2006 Final RevisedYasar Tonta
 
Museum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themMuseum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themRoss Mounce
 
Modern Tools & Rationales for 21st Century Research
Modern Tools & Rationales  for 21st Century ResearchModern Tools & Rationales  for 21st Century Research
Modern Tools & Rationales for 21st Century ResearchRoss Mounce
 
The State of Open Research Data
The State of Open Research DataThe State of Open Research Data
The State of Open Research DataRoss Mounce
 
Open data and Open Science
Open data and Open ScienceOpen data and Open Science
Open data and Open Sciencepetermurrayrust
 
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDTDBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDTHerbert Van de Sompel
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureRoss Mounce
 
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...Brian Hole
 
Open Science @ Instituto Gukbenkian de Ciência
Open Science @ Instituto Gukbenkian de CiênciaOpen Science @ Instituto Gukbenkian de Ciência
Open Science @ Instituto Gukbenkian de CiênciaMaria Antónia Correia
 
Brian Hole - The Shift to Open Access Publishing, UCL DH 2013
Brian Hole - The Shift to Open Access Publishing, UCL DH 2013Brian Hole - The Shift to Open Access Publishing, UCL DH 2013
Brian Hole - The Shift to Open Access Publishing, UCL DH 2013Brian Hole
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureAutomatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureTheContentMine
 
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...Ross Mounce
 
Everyday digital scholarship: Using web-based tools for research
Everyday digital scholarship: Using web-based tools for researchEveryday digital scholarship: Using web-based tools for research
Everyday digital scholarship: Using web-based tools for researchFrancesca Di Donato
 

Was ist angesagt? (20)

The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open Data
 
Open Access: What it is and why it is required for scholarly community?
Open Access: What it is and why it is required for scholarly community?Open Access: What it is and why it is required for scholarly community?
Open Access: What it is and why it is required for scholarly community?
 
Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & Museums
 
Open access resources
Open access resourcesOpen access resources
Open access resources
 
Tonta World Is Flat Yet Not Open Oslo Workshop 10 May 2006 Final Revised
Tonta World Is Flat Yet Not Open Oslo Workshop 10 May 2006 Final RevisedTonta World Is Flat Yet Not Open Oslo Workshop 10 May 2006 Final Revised
Tonta World Is Flat Yet Not Open Oslo Workshop 10 May 2006 Final Revised
 
Museum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themMuseum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on them
 
Modern Tools & Rationales for 21st Century Research
Modern Tools & Rationales  for 21st Century ResearchModern Tools & Rationales  for 21st Century Research
Modern Tools & Rationales for 21st Century Research
 
The State of Open Research Data
The State of Open Research DataThe State of Open Research Data
The State of Open Research Data
 
Open data and Open Science
Open data and Open ScienceOpen data and Open Science
Open data and Open Science
 
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDTDBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | Future
 
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...
 
Open Science @ Instituto Gukbenkian de Ciência
Open Science @ Instituto Gukbenkian de CiênciaOpen Science @ Instituto Gukbenkian de Ciência
Open Science @ Instituto Gukbenkian de Ciência
 
Brian Hole - The Shift to Open Access Publishing, UCL DH 2013
Brian Hole - The Shift to Open Access Publishing, UCL DH 2013Brian Hole - The Shift to Open Access Publishing, UCL DH 2013
Brian Hole - The Shift to Open Access Publishing, UCL DH 2013
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureAutomatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
 
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
 
Open Notebook Science
Open Notebook ScienceOpen Notebook Science
Open Notebook Science
 
Basic aspects of Open Access
Basic aspects of Open AccessBasic aspects of Open Access
Basic aspects of Open Access
 
Implementing Linked Data in Low-Resource Conditions
Implementing Linked Data in Low-Resource ConditionsImplementing Linked Data in Low-Resource Conditions
Implementing Linked Data in Low-Resource Conditions
 
Everyday digital scholarship: Using web-based tools for research
Everyday digital scholarship: Using web-based tools for researchEveryday digital scholarship: Using web-based tools for research
Everyday digital scholarship: Using web-based tools for research
 

Ähnlich wie Content Mining

Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and  Medicine from the scholarly literatureAutomatic Extraction of Science and  Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literaturepetermurrayrust
 
Publishing your research: Open Access (introduction & overview)
Publishing your research: Open Access (introduction & overview)Publishing your research: Open Access (introduction & overview)
Publishing your research: Open Access (introduction & overview)Jamie Bisset
 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017petermurrayrust
 
Libraries at the centre of the debate on copyright and text and data mining: ...
Libraries at the centre of the debate on copyright and text and data mining: ...Libraries at the centre of the debate on copyright and text and data mining: ...
Libraries at the centre of the debate on copyright and text and data mining: ...LIBER Europe
 
Be open: what funders want you to do with your publications and research data
Be open: what funders want you to do with your publications and research dataBe open: what funders want you to do with your publications and research data
Be open: what funders want you to do with your publications and research dataLeon Osinski
 
Open sciencerefresher2019
Open sciencerefresher2019Open sciencerefresher2019
Open sciencerefresher2019heila1
 
Open Licensing for BioMed Central
Open Licensing for BioMed CentralOpen Licensing for BioMed Central
Open Licensing for BioMed CentralMattMcGregor
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData TheContentMine
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustLEARN Project
 
Creative Commons for Tertiary Education
Creative Commons for Tertiary EducationCreative Commons for Tertiary Education
Creative Commons for Tertiary EducationMattMcGregor
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchDatapetermurrayrust
 
La Ciencia Abierta en la práctica: infraestructuras y políticas en Europa, se...
La Ciencia Abierta en la práctica: infraestructuras y políticas en Europa, se...La Ciencia Abierta en la práctica: infraestructuras y políticas en Europa, se...
La Ciencia Abierta en la práctica: infraestructuras y políticas en Europa, se...Pedro Príncipe
 
Getting Started with Open Access
Getting Started with Open AccessGetting Started with Open Access
Getting Started with Open AccessALATechSource
 
Making ‘Everything Available’ – Transforming the (online) services and experi...
Making ‘Everything Available’ – Transforming the (online) services and experi...Making ‘Everything Available’ – Transforming the (online) services and experi...
Making ‘Everything Available’ – Transforming the (online) services and experi...Torsten Reimer
 
Open Access for Early Career Researchers
Open Access for Early Career ResearchersOpen Access for Early Career Researchers
Open Access for Early Career ResearchersRoss Mounce
 
Humanities Crowdsourcing on the Zooniverse Platform
Humanities Crowdsourcing on the Zooniverse PlatformHumanities Crowdsourcing on the Zooniverse Platform
Humanities Crowdsourcing on the Zooniverse PlatformUCLDH
 

Ähnlich wie Content Mining (20)

Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and  Medicine from the scholarly literatureAutomatic Extraction of Science and  Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
 
Publishing your research: Open Access (introduction & overview)
Publishing your research: Open Access (introduction & overview)Publishing your research: Open Access (introduction & overview)
Publishing your research: Open Access (introduction & overview)
 
PLOS slides
PLOS slidesPLOS slides
PLOS slides
 
Plosslides
PlosslidesPlosslides
Plosslides
 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017
 
Libraries at the centre of the debate on copyright and text and data mining: ...
Libraries at the centre of the debate on copyright and text and data mining: ...Libraries at the centre of the debate on copyright and text and data mining: ...
Libraries at the centre of the debate on copyright and text and data mining: ...
 
Be open: what funders want you to do with your publications and research data
Be open: what funders want you to do with your publications and research dataBe open: what funders want you to do with your publications and research data
Be open: what funders want you to do with your publications and research data
 
Open sciencerefresher2019
Open sciencerefresher2019Open sciencerefresher2019
Open sciencerefresher2019
 
Open Licensing for BioMed Central
Open Licensing for BioMed CentralOpen Licensing for BioMed Central
Open Licensing for BioMed Central
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-Rust
 
Creative Commons for Tertiary Education
Creative Commons for Tertiary EducationCreative Commons for Tertiary Education
Creative Commons for Tertiary Education
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchData
 
La Ciencia Abierta en la práctica: infraestructuras y políticas en Europa, se...
La Ciencia Abierta en la práctica: infraestructuras y políticas en Europa, se...La Ciencia Abierta en la práctica: infraestructuras y políticas en Europa, se...
La Ciencia Abierta en la práctica: infraestructuras y políticas en Europa, se...
 
Getting Started with Open Access
Getting Started with Open AccessGetting Started with Open Access
Getting Started with Open Access
 
Making ‘Everything Available’ – Transforming the (online) services and experi...
Making ‘Everything Available’ – Transforming the (online) services and experi...Making ‘Everything Available’ – Transforming the (online) services and experi...
Making ‘Everything Available’ – Transforming the (online) services and experi...
 
UKSG 2018 Breakout - Setting your cites to open I4OC - Maccallum
UKSG 2018 Breakout - Setting your cites to open I4OC - MaccallumUKSG 2018 Breakout - Setting your cites to open I4OC - Maccallum
UKSG 2018 Breakout - Setting your cites to open I4OC - Maccallum
 
Open Access for Early Career Researchers
Open Access for Early Career ResearchersOpen Access for Early Career Researchers
Open Access for Early Career Researchers
 
Open Access explained
Open Access explainedOpen Access explained
Open Access explained
 
Humanities Crowdsourcing on the Zooniverse Platform
Humanities Crowdsourcing on the Zooniverse PlatformHumanities Crowdsourcing on the Zooniverse Platform
Humanities Crowdsourcing on the Zooniverse Platform
 

Mehr von Ross Mounce

Workshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningWorkshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningRoss Mounce
 
Open scholarship [a FOSTER open science talk]
Open scholarship [a FOSTER open science talk]Open scholarship [a FOSTER open science talk]
Open scholarship [a FOSTER open science talk]Ross Mounce
 
The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014Ross Mounce
 
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Ross Mounce
 
Social Media For Researchers
Social Media For ResearchersSocial Media For Researchers
Social Media For ResearchersRoss Mounce
 
Social Media for Science
Social Media for ScienceSocial Media for Science
Social Media for ScienceRoss Mounce
 
Sharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetSharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetRoss Mounce
 
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Ross Mounce
 

Mehr von Ross Mounce (10)

Workshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningWorkshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data mining
 
Open scholarship [a FOSTER open science talk]
Open scholarship [a FOSTER open science talk]Open scholarship [a FOSTER open science talk]
Open scholarship [a FOSTER open science talk]
 
The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014
 
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
 
Social Media For Researchers
Social Media For ResearchersSocial Media For Researchers
Social Media For Researchers
 
Social Media for Science
Social Media for ScienceSocial Media for Science
Social Media for Science
 
Sharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetSharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yet
 
Herding Cats
Herding CatsHerding Cats
Herding Cats
 
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
 
ProgPal2011
ProgPal2011ProgPal2011
ProgPal2011
 

Kürzlich hochgeladen

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 

Kürzlich hochgeladen (20)

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 

Content Mining

  • 1. The (young) Researcher POV Text & Data Mining Content Mining Ross Mounce University of Bath, PhD Candidate Open Knowledge Foundation Panton Fellow #Licences4Europe @RMounce
  • 2. What is content mining? ● Text ● Pictures ● Videos ● Audio ● Metadata ● ...not just text or data! ( I am making all my text on all these slides open and re-usable under the Creative Commons Zero Waiver )
  • 3. The importance of scale ● The indexed WWW has >13 billion pages ● >630,000,000 sites (Netcraft, Feb 2013) ● 2 million scholarly articles published per year c. 4% growth rate each year ● >50,000,000 scholarly articles so far (Jinha, 2010) ● ~72,000 PDFs from PloS are just ~15GB ● >4,000,000 English-language Wikipedia pages downloaded as text are just ~9GB
  • 4. My interest in Content Mining ● Facts cannot be copyrighted ● Billions of facts are unfortunately trapped only in copyright-protected scholarly research articles ● My collaborators & I simply want to liberate these facts to make them openly available for everyone with no restrictions on access or use ● Within this, my particular interest in evolutionary history; e.g. phylogenetic tree data
  • 5. I want to help reconstruct the 'Tree of Life'. Thousands of other scientists also want to do this. Phylogenetic Research gets published piecemeal across >100,000 papers. Hard to synthesise, much of it not in PMC
  • 6. 76,523 articles out of a total of 91,788 for this period 34 My content of interest is scattered across 1000+ journals
  • 7. Missing data, ~15,000 articles 1 or 2 journals not shown (off-scale)
  • 8. Lots of journals, lots of publishers Just those top 500 most phylogeny 'rich' journals are published by ~120 different publishers ~ 17% Elsevier (subscription journals) ~ 13% Springer (subscription journals) ~ 15% Wiley (subscription journals) The biggest 3 publishers combined still only have less than 50% of what I want to mine The long tail of content (15,000 articles) in journals outside the 'top 500' is likely to come from many more different publishers.
  • 9. Standard license agreements ● Many explicitly do not allow mining e.g. Nature, JSTOR, AIP, ACS, Elsevier, Taylor & Francis... “This licence does not include any derivative use of the Site or the Materials, any collection and use of any product listings, descriptions, or prices; any downloading or copying of account information for the benefit of another merchant; or any use of data mining, robots or similar data gathering and extraction tools...” InformaWorld Source See also: http://www.cdlib.org/services/collections/redactions/ http://www.mpdl.mpg.de/services/ezb-readme_en.htm
  • 10. Asking permission doesn't scale ● There are 90,000 different publishers in 215 different countries listed in Ulrich's Periodicals Directory & >336,000 periodicals. “I had a phone call on Friday with my university librarian and six (!) Elsevier employees.” Heather Piwowar 5 March → 16 April just to get permission/access to start work on just one publisher's content Could have done all of the analysis in time period. Hugely intimidating & patronizing process, an utter waste of time
  • 11. Why aren't more people mining? A) Many disciplines not computationally-savvy frankly B) Lack of awareness that this technology & capability exist & what it can do C) Lack of teaching/training about content mining D) Futility of even trying – how can one get permission from 1000's of publishers in the context of short research funding time-scales. Choose another research project – it's less risky. E) Live in the wrong jurisdiction (e.g. Europe) and you're at a massive disadvantage relative to S. Korea, Japan, Israel, US... due to local laws. (Probably more reasons too, this list is not exhaustive)
  • 12. The Right to Read is The Right to Mine ● Through my institutional affiliation (U. of Bath) I have legitimate paid access to many thousands of paywalled journals ● If I visit the BL or NHM library I have legitimate access to even more ● Our POV: “The Right to Read is The Right to Mine” ● We do not see any distinction between computer-assisted 'human' reading aka 'ocular access' (e.g. viewing a PDF), and computer- assisted computer reading – aka content mining. The only difference is the latter is clearly more time & resource efficient – it's much quicker and it scales well.
  • 13. Blocking & Criminalizing Research ● I have had my access to at least one publisher (BioOne) cut-off before. My 'crime' – downloading more than 25 PDFs in 5mins. ● Elsevier once blocked access to the entire U. of Cambridge campus for a week(!) because Peter Murray-Rust through legitimate access downloaded 'too many' PDFs in the course of his research ● Countless other cases, large majority not widely reported ● ...Aaron Swartz – need I say more? We have legitimate access, we do not seek to redistribute content wholesale. What's the problem? Why are we being criminalized?
  • 14. OA publishers make it easy One can easily download the entire content of many OA publishers e.g. PloS, BMC, Hindawi... They actively facilitate & encourage corpus downloads All of PLoS as PDF up to mid-2010 is just 15GB (~72,000 articles)
  • 15. It's largely only OA publishers allowing content mining to occur “When permission is requested [by researchers], 35% of publisher respondents allow mining in the majority or all of cases” [tellingly, 85% of those “35% of publishers” that allow it are Open Access publishers, not subscription-access publishers] from the Publishing Research Consortium's own report (Smit & van der Graaf, 2011) This is also the reason why only 13% of PMC is 'safe' to mine, despite 100% of PMC being 'free to read' (ocular access only)
  • 16. For-profit publishers have incentives to actively block content mining “53 % of publisher respondents will decline mining requests if the results can replace or compete with their own products and services” from the Publishing Research Consortium's own report (Smit & van der Graaf, 2011) My POV: some publishers are clearly blocking the liberation of uncopyrightable facts from their content so they can continue making money from access to, or services around them. N.B. >80% of research is public sector funded
  • 17. The Hargreaves Report (UK) Happily for me, the legal situation is about to change in the UK (from October, 2013) The UK government has accepted and is enforcing many of the changes recommended in: 'Digital Opportunity: A Review of Intellectual Property and Growth' An independent report by Prof. Ian Hargreaves (2011) Limitations and exceptions for non-commercial content mining research usage of copyright material will soon be legal. I, the Open Knowledge Foundation, and many other groups think this is clearly the optimal, well-tested solution to the mining access/permission problem
  • 18. Thanks “But keep your minds open: maybe in some cases licensing won’t be the solution” Neelie Kroes, Vice-President of the European Commission responsible for the Digital Agenda, 4th Feb 2013 Content mining for research is clearly one of those cases Sincere thanks to the Open Knowledge Foundation for travel funding without which I wouldn't be here. ( Reminder: I am making all my text on all these slides open and re-usable under the Creative Commons Zero Waiver )