SlideShare ist ein Scribd-Unternehmen logo
1 von 17
Text Mining From Three
Perspectives - Publisher
Judson Dunham
Sr. Product Manager, Elsevier
Charleston, 2012
2010




http://vignettebricks.blogspot.com/2007/07/miss-librarian-in-library-with-slippery.html
Elsevier content and APIs for research community




  ARTICLE         AUTHOR
 CONTENT           DATA




 ABSTRACTS        AFFILIATION
                     DATA




             Collaboration at a different level
2011
2012
What researchers want from content mining
 Assurance of right to mine subscribed or
  freely available content across publishers
 Clear understanding of what they can do with
  the outputs
 Standardized, mineable formatting of data
 Stable, reliable systems for retrieving and
  storing content
How Elsevier is working to address this
Simple process to add text mining rights
 Researchers/Librarians can request access via
  tdmsupport@elsevier.com
 We welcome requests, we need to keep learning!
Provide efficient methods for content retrieval
 Access to APIs for individual researchers via click
  through agreements on the web
 Slimmed down content formats optimized for text
  mining
How Elsevier is working to address this
Clear policy on reuse and redistribution
 Simple CC-BY-NC license on output when
  redistributing results of text mining in a research
  support or other non-commercial tool
 DOI link back to mined article whenever feasible
  when displaying extracted content
 Clear guidelines and permissions on snippets
  around extracted entities to allow context when
  presenting results
USE CASE: Studying patterns of data sharing and reuse in
the published literature
Text mining small subset of content
 Using literature as a data source to explore a research issue
  (in this case, how is data in repositories shared, referenced
  and reused in the literature)
 Requires 1,000’s of documents (not 1,000,000’s)
 Will result in papers published in journals, as well as a
  publicly available database providing information about
  reuse of specific datasets (mined from the literature)

Solution: Access to subscribed content via API’s, permission to
reuse results in non-commercial tool


                                Heather Piwowar - Postdoc at Duke University
USE CASE: Extraction of DNA sequences from millions of
papers to facilitate biocuration and discovery
Text mining large volume of content
Identify all words in full-text scientific articles resembling DNA sequences, which are
extracted and then mapped to public genome sequences. They can then be displayed on
genome browser websites and used in data-mining applications.




                                                      Max Haeussler – Post Doc, UCSC
USE CASE: Extraction of DNA sequences from millions of
papers to facilitate biocuration and discovery




                                                             Link to paper with
                                                             Metadata and abstract




                      Mined Sequence (bold) with flanking sentence




                                            Max Haeussler – Post Doc, UCSC
USE CASE: Extraction of DNA sequences from millions of
papers to facilitate biocuration and discovery




                                 Max Haeussler – Post Doc, UCSC
USE CASE: Extraction of DNA sequences from millions of
papers to facilitate biocuration and discovery




                                 Max Haeussler – Post Doc, UCSC
Still some unsolved problems
Decentralized content aggregation:
 Retrieving and storing data locally = shipping loads
  of content around the web/world
 Still tons of formatting and preparation work for
  every single text miner
 Significant computational infrastructure required to
  do large scale text mining
A cloud-based service for running large-scale text mining
without having to build infrastructure or worry about the
            “plumbing” would be valuable.
Still some unsolved problems
Preservation of knowledge:
 Storing and redistributing (valuable) text mining
  results requires long term infrastructure, hosting
  and other services
 Many academic research efforts end “when the
  postdoc moves on” – where does the mined
  knowledge go?
 A service offering a reliable, persistent, managed,
  open venue for the hosting of content mining
  results would be valuable
END
• j.dunham@elsevier.com/@judso
• tdmsupport@elsevier.com
Thanks:
• Max Haeussler (UCSC)
• Bradley Allen (Elsevier Labs)

Weitere ähnliche Inhalte

Was ist angesagt?

20140327 rda plazi_final
20140327 rda plazi_final20140327 rda plazi_final
20140327 rda plazi_finalagosti
 
Library resources for Natural Sciences
Library resources for Natural SciencesLibrary resources for Natural Sciences
Library resources for Natural SciencesLynne Meehan
 
RDAP 16 Poster: Diving into Data: Implementing a Data Repository at the Texas...
RDAP 16 Poster: Diving into Data: Implementing a Data Repository at the Texas...RDAP 16 Poster: Diving into Data: Implementing a Data Repository at the Texas...
RDAP 16 Poster: Diving into Data: Implementing a Data Repository at the Texas...ASIS&T
 
Composite protein databases
Composite protein databasesComposite protein databases
Composite protein databasesShritilekhaDash
 
Ways for researchers to store, share, discover, and use data_Cousijn
Ways for researchers to store, share, discover, and use data_CousijnWays for researchers to store, share, discover, and use data_Cousijn
Ways for researchers to store, share, discover, and use data_CousijnPlatforma Otwartej Nauki
 
RDAP 16 Poster: A Proposed Course Model for Integrating RDM with Research Rep...
RDAP 16 Poster: A Proposed Course Model for Integrating RDM with Research Rep...RDAP 16 Poster: A Proposed Course Model for Integrating RDM with Research Rep...
RDAP 16 Poster: A Proposed Course Model for Integrating RDM with Research Rep...ASIS&T
 
SEAD Prototype: Data Curation and Preservation for Sustainability Science
SEAD Prototype: Data Curation and Preservation for Sustainability ScienceSEAD Prototype: Data Curation and Preservation for Sustainability Science
SEAD Prototype: Data Curation and Preservation for Sustainability ScienceSEAD
 
Met soc15 roccaserra-biocrates-datasharing
Met soc15 roccaserra-biocrates-datasharingMet soc15 roccaserra-biocrates-datasharing
Met soc15 roccaserra-biocrates-datasharingPhilippe Rocca-Serra
 
Networked Science, And Integrating with Dataverse
Networked Science, And Integrating with DataverseNetworked Science, And Integrating with Dataverse
Networked Science, And Integrating with DataverseAnita de Waard
 
W3C HCLS Scientific Discourse Task Autumn 2010
W3C HCLS Scientific Discourse Task Autumn 2010W3C HCLS Scientific Discourse Task Autumn 2010
W3C HCLS Scientific Discourse Task Autumn 2010tim.clark
 
A research passport: library requirements
A research passport: library requirementsA research passport: library requirements
A research passport: library requirementsLIBER Europe
 
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...marcosmartinezromero
 
Data editors meeting at SEFS
Data editors meeting at SEFSData editors meeting at SEFS
Data editors meeting at SEFSAaike De Wever
 
Biol 4278: Finding Primary Literature in Biology
Biol 4278: Finding Primary Literature in BiologyBiol 4278: Finding Primary Literature in Biology
Biol 4278: Finding Primary Literature in BiologyKatherine Magner
 
Collaborative Thesaurus Tagging The Wikipedia Way
Collaborative Thesaurus Tagging The Wikipedia WayCollaborative Thesaurus Tagging The Wikipedia Way
Collaborative Thesaurus Tagging The Wikipedia Wayflyingsheep
 

Was ist angesagt? (20)

20140327 rda plazi_final
20140327 rda plazi_final20140327 rda plazi_final
20140327 rda plazi_final
 
An Open Repository Model for Acquiring Knowledge About Scientific Experiments
An Open Repository Model for Acquiring Knowledge About Scientific ExperimentsAn Open Repository Model for Acquiring Knowledge About Scientific Experiments
An Open Repository Model for Acquiring Knowledge About Scientific Experiments
 
Library resources for Natural Sciences
Library resources for Natural SciencesLibrary resources for Natural Sciences
Library resources for Natural Sciences
 
The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...
The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...
The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...
 
RDAP 16 Poster: Diving into Data: Implementing a Data Repository at the Texas...
RDAP 16 Poster: Diving into Data: Implementing a Data Repository at the Texas...RDAP 16 Poster: Diving into Data: Implementing a Data Repository at the Texas...
RDAP 16 Poster: Diving into Data: Implementing a Data Repository at the Texas...
 
Composite protein databases
Composite protein databasesComposite protein databases
Composite protein databases
 
Ways for researchers to store, share, discover, and use data_Cousijn
Ways for researchers to store, share, discover, and use data_CousijnWays for researchers to store, share, discover, and use data_Cousijn
Ways for researchers to store, share, discover, and use data_Cousijn
 
RDAP 16 Poster: A Proposed Course Model for Integrating RDM with Research Rep...
RDAP 16 Poster: A Proposed Course Model for Integrating RDM with Research Rep...RDAP 16 Poster: A Proposed Course Model for Integrating RDM with Research Rep...
RDAP 16 Poster: A Proposed Course Model for Integrating RDM with Research Rep...
 
SEAD Prototype: Data Curation and Preservation for Sustainability Science
SEAD Prototype: Data Curation and Preservation for Sustainability ScienceSEAD Prototype: Data Curation and Preservation for Sustainability Science
SEAD Prototype: Data Curation and Preservation for Sustainability Science
 
Met soc15 roccaserra-biocrates-datasharing
Met soc15 roccaserra-biocrates-datasharingMet soc15 roccaserra-biocrates-datasharing
Met soc15 roccaserra-biocrates-datasharing
 
Entrez databases
Entrez databasesEntrez databases
Entrez databases
 
Networked Science, And Integrating with Dataverse
Networked Science, And Integrating with DataverseNetworked Science, And Integrating with Dataverse
Networked Science, And Integrating with Dataverse
 
W3C HCLS Scientific Discourse Task Autumn 2010
W3C HCLS Scientific Discourse Task Autumn 2010W3C HCLS Scientific Discourse Task Autumn 2010
W3C HCLS Scientific Discourse Task Autumn 2010
 
A research passport: library requirements
A research passport: library requirementsA research passport: library requirements
A research passport: library requirements
 
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...
 
Journal Data Requirements
Journal Data Requirements Journal Data Requirements
Journal Data Requirements
 
Data editors meeting at SEFS
Data editors meeting at SEFSData editors meeting at SEFS
Data editors meeting at SEFS
 
Biol 4278: Finding Primary Literature in Biology
Biol 4278: Finding Primary Literature in BiologyBiol 4278: Finding Primary Literature in Biology
Biol 4278: Finding Primary Literature in Biology
 
Collaborative Thesaurus Tagging The Wikipedia Way
Collaborative Thesaurus Tagging The Wikipedia WayCollaborative Thesaurus Tagging The Wikipedia Way
Collaborative Thesaurus Tagging The Wikipedia Way
 
The CATE Project
The CATE ProjectThe CATE Project
The CATE Project
 

Andere mochten auch

ELSS use cases and strategy
ELSS use cases and strategyELSS use cases and strategy
ELSS use cases and strategyAnton Yuryev
 
USUGM 2014 - Zhengwei Peng (Merck): Construction of a vast virtual chemical s...
USUGM 2014 - Zhengwei Peng (Merck): Construction of a vast virtual chemical s...USUGM 2014 - Zhengwei Peng (Merck): Construction of a vast virtual chemical s...
USUGM 2014 - Zhengwei Peng (Merck): Construction of a vast virtual chemical s...ChemAxon
 
Parkinson drug sinemet-LIFE CYCLE STRATEGIC PLAN
Parkinson drug sinemet-LIFE CYCLE STRATEGIC PLANParkinson drug sinemet-LIFE CYCLE STRATEGIC PLAN
Parkinson drug sinemet-LIFE CYCLE STRATEGIC PLANChrysopigi Vardikou
 
Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Matthew Clark
 
Merck group 7 strategic_mgmt
Merck group 7 strategic_mgmtMerck group 7 strategic_mgmt
Merck group 7 strategic_mgmtkirtishankar075
 
Merck & Co Presentation
Merck & Co PresentationMerck & Co Presentation
Merck & Co Presentationtylerrives
 
Elsevier Author Workshop – How to write a scientific paper… and get it published
Elsevier Author Workshop – How to write a scientific paper… and get it publishedElsevier Author Workshop – How to write a scientific paper… and get it published
Elsevier Author Workshop – How to write a scientific paper… and get it publishedGlobal Risk Forum GRFDavos
 
Merck presentation.
Merck presentation.Merck presentation.
Merck presentation.Bakryk
 
Merck & co., inc. barclays final 3-10-2015
Merck & co., inc.   barclays final 3-10-2015Merck & co., inc.   barclays final 3-10-2015
Merck & co., inc. barclays final 3-10-2015Marcel Brussee
 

Andere mochten auch (11)

ELSS use cases and strategy
ELSS use cases and strategyELSS use cases and strategy
ELSS use cases and strategy
 
USUGM 2014 - Zhengwei Peng (Merck): Construction of a vast virtual chemical s...
USUGM 2014 - Zhengwei Peng (Merck): Construction of a vast virtual chemical s...USUGM 2014 - Zhengwei Peng (Merck): Construction of a vast virtual chemical s...
USUGM 2014 - Zhengwei Peng (Merck): Construction of a vast virtual chemical s...
 
Parkinson drug sinemet-LIFE CYCLE STRATEGIC PLAN
Parkinson drug sinemet-LIFE CYCLE STRATEGIC PLANParkinson drug sinemet-LIFE CYCLE STRATEGIC PLAN
Parkinson drug sinemet-LIFE CYCLE STRATEGIC PLAN
 
Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016
 
Merck powerpoint
Merck powerpointMerck powerpoint
Merck powerpoint
 
Merck group 7 strategic_mgmt
Merck group 7 strategic_mgmtMerck group 7 strategic_mgmt
Merck group 7 strategic_mgmt
 
Merck & Co Presentation
Merck & Co PresentationMerck & Co Presentation
Merck & Co Presentation
 
Elsevier Author Workshop – How to write a scientific paper… and get it published
Elsevier Author Workshop – How to write a scientific paper… and get it publishedElsevier Author Workshop – How to write a scientific paper… and get it published
Elsevier Author Workshop – How to write a scientific paper… and get it published
 
merck and co. inc
merck and co. incmerck and co. inc
merck and co. inc
 
Merck presentation.
Merck presentation.Merck presentation.
Merck presentation.
 
Merck & co., inc. barclays final 3-10-2015
Merck & co., inc.   barclays final 3-10-2015Merck & co., inc.   barclays final 3-10-2015
Merck & co., inc. barclays final 3-10-2015
 

Ähnlich wie Text Mining from Three Perspectives - Publisher

December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...
December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...
December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...DeVonne Parks, CEM
 
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Deborah McGuinness
 
Addressing the New Challenges in Data Sharing: Large-Scale Data and Sensitive...
Addressing the New Challenges in Data Sharing: Large-Scale Data and Sensitive...Addressing the New Challenges in Data Sharing: Large-Scale Data and Sensitive...
Addressing the New Challenges in Data Sharing: Large-Scale Data and Sensitive...Merce Crosas
 
Usage-Based vs. Citation-Based Recommenders in a Digital Library
Usage-Based vs. Citation-Based Recommenders in a Digital LibraryUsage-Based vs. Citation-Based Recommenders in a Digital Library
Usage-Based vs. Citation-Based Recommenders in a Digital LibraryAndre Vellino
 
Data Publishing Workflows with Dataverse
Data Publishing Workflows with DataverseData Publishing Workflows with Dataverse
Data Publishing Workflows with DataverseMicah Altman
 
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...Susanna-Assunta Sansone
 
Research Objects @ HARMONY 2014
Research Objects @ HARMONY 2014Research Objects @ HARMONY 2014
Research Objects @ HARMONY 2014seanb
 
Using Word Clouds to Assist with Collection Development
Using Word Clouds to Assist with Collection DevelopmentUsing Word Clouds to Assist with Collection Development
Using Word Clouds to Assist with Collection DevelopmentSunghae Ress
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureRoss Mounce
 
An Open Context for Archaeology
An Open Context for ArchaeologyAn Open Context for Archaeology
An Open Context for Archaeologyguest756e05
 
Enriching Scholarship 2014 Beyond the Journal Article: Publishing and Citing ...
Enriching Scholarship 2014 Beyond the Journal Article: Publishing and Citing ...Enriching Scholarship 2014 Beyond the Journal Article: Publishing and Citing ...
Enriching Scholarship 2014 Beyond the Journal Article: Publishing and Citing ...Natsuko Nicholls
 
Establishing a UQ Research Data Management Service
Establishing a UQ Research Data Management Service Establishing a UQ Research Data Management Service
Establishing a UQ Research Data Management Service ARDC
 
2011 03-provenance-workshop-edingurgh
2011 03-provenance-workshop-edingurgh2011 03-provenance-workshop-edingurgh
2011 03-provenance-workshop-edingurghJun Zhao
 
Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...Susanna-Assunta Sansone
 
RDA - Long Tail Data Interest Group - NPG Scientitic Data oveview
RDA - Long Tail Data Interest Group - NPG Scientitic Data oveviewRDA - Long Tail Data Interest Group - NPG Scientitic Data oveview
RDA - Long Tail Data Interest Group - NPG Scientitic Data oveviewSusanna-Assunta Sansone
 
The Future of Research (Science and Technology)
The Future of Research (Science and Technology)The Future of Research (Science and Technology)
The Future of Research (Science and Technology)Duncan Hull
 

Ähnlich wie Text Mining from Three Perspectives - Publisher (20)

December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...
December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...
December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...
 
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
 
Addressing the New Challenges in Data Sharing: Large-Scale Data and Sensitive...
Addressing the New Challenges in Data Sharing: Large-Scale Data and Sensitive...Addressing the New Challenges in Data Sharing: Large-Scale Data and Sensitive...
Addressing the New Challenges in Data Sharing: Large-Scale Data and Sensitive...
 
Usage-Based vs. Citation-Based Recommenders in a Digital Library
Usage-Based vs. Citation-Based Recommenders in a Digital LibraryUsage-Based vs. Citation-Based Recommenders in a Digital Library
Usage-Based vs. Citation-Based Recommenders in a Digital Library
 
A Clean Slate?
A Clean Slate?A Clean Slate?
A Clean Slate?
 
Data Publishing Workflows with Dataverse
Data Publishing Workflows with DataverseData Publishing Workflows with Dataverse
Data Publishing Workflows with Dataverse
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
 
Research Objects @ HARMONY 2014
Research Objects @ HARMONY 2014Research Objects @ HARMONY 2014
Research Objects @ HARMONY 2014
 
Using Word Clouds to Assist with Collection Development
Using Word Clouds to Assist with Collection DevelopmentUsing Word Clouds to Assist with Collection Development
Using Word Clouds to Assist with Collection Development
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | Future
 
An Open Context for Archaeology
An Open Context for ArchaeologyAn Open Context for Archaeology
An Open Context for Archaeology
 
Data Publishing in Archaeozoology
Data Publishing in ArchaeozoologyData Publishing in Archaeozoology
Data Publishing in Archaeozoology
 
Chemspider Presentation at the ACS Meeting in New orleans
Chemspider Presentation at the ACS Meeting in New orleansChemspider Presentation at the ACS Meeting in New orleans
Chemspider Presentation at the ACS Meeting in New orleans
 
Enriching Scholarship 2014 Beyond the Journal Article: Publishing and Citing ...
Enriching Scholarship 2014 Beyond the Journal Article: Publishing and Citing ...Enriching Scholarship 2014 Beyond the Journal Article: Publishing and Citing ...
Enriching Scholarship 2014 Beyond the Journal Article: Publishing and Citing ...
 
Establishing a UQ Research Data Management Service
Establishing a UQ Research Data Management Service Establishing a UQ Research Data Management Service
Establishing a UQ Research Data Management Service
 
2011 03-provenance-workshop-edingurgh
2011 03-provenance-workshop-edingurgh2011 03-provenance-workshop-edingurgh
2011 03-provenance-workshop-edingurgh
 
Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...
 
RDA - Long Tail Data Interest Group - NPG Scientitic Data oveview
RDA - Long Tail Data Interest Group - NPG Scientitic Data oveviewRDA - Long Tail Data Interest Group - NPG Scientitic Data oveview
RDA - Long Tail Data Interest Group - NPG Scientitic Data oveview
 
The Future of Research (Science and Technology)
The Future of Research (Science and Technology)The Future of Research (Science and Technology)
The Future of Research (Science and Technology)
 

Text Mining from Three Perspectives - Publisher

  • 1. Text Mining From Three Perspectives - Publisher Judson Dunham Sr. Product Manager, Elsevier Charleston, 2012
  • 3. Elsevier content and APIs for research community ARTICLE AUTHOR CONTENT DATA ABSTRACTS AFFILIATION DATA Collaboration at a different level
  • 6. What researchers want from content mining  Assurance of right to mine subscribed or freely available content across publishers  Clear understanding of what they can do with the outputs  Standardized, mineable formatting of data  Stable, reliable systems for retrieving and storing content
  • 7. How Elsevier is working to address this Simple process to add text mining rights  Researchers/Librarians can request access via tdmsupport@elsevier.com  We welcome requests, we need to keep learning! Provide efficient methods for content retrieval  Access to APIs for individual researchers via click through agreements on the web  Slimmed down content formats optimized for text mining
  • 8. How Elsevier is working to address this Clear policy on reuse and redistribution  Simple CC-BY-NC license on output when redistributing results of text mining in a research support or other non-commercial tool  DOI link back to mined article whenever feasible when displaying extracted content  Clear guidelines and permissions on snippets around extracted entities to allow context when presenting results
  • 9. USE CASE: Studying patterns of data sharing and reuse in the published literature Text mining small subset of content  Using literature as a data source to explore a research issue (in this case, how is data in repositories shared, referenced and reused in the literature)  Requires 1,000’s of documents (not 1,000,000’s)  Will result in papers published in journals, as well as a publicly available database providing information about reuse of specific datasets (mined from the literature) Solution: Access to subscribed content via API’s, permission to reuse results in non-commercial tool Heather Piwowar - Postdoc at Duke University
  • 10. USE CASE: Extraction of DNA sequences from millions of papers to facilitate biocuration and discovery Text mining large volume of content Identify all words in full-text scientific articles resembling DNA sequences, which are extracted and then mapped to public genome sequences. They can then be displayed on genome browser websites and used in data-mining applications. Max Haeussler – Post Doc, UCSC
  • 11. USE CASE: Extraction of DNA sequences from millions of papers to facilitate biocuration and discovery Link to paper with Metadata and abstract Mined Sequence (bold) with flanking sentence Max Haeussler – Post Doc, UCSC
  • 12. USE CASE: Extraction of DNA sequences from millions of papers to facilitate biocuration and discovery Max Haeussler – Post Doc, UCSC
  • 13. USE CASE: Extraction of DNA sequences from millions of papers to facilitate biocuration and discovery Max Haeussler – Post Doc, UCSC
  • 14. Still some unsolved problems Decentralized content aggregation:  Retrieving and storing data locally = shipping loads of content around the web/world  Still tons of formatting and preparation work for every single text miner  Significant computational infrastructure required to do large scale text mining
  • 15. A cloud-based service for running large-scale text mining without having to build infrastructure or worry about the “plumbing” would be valuable.
  • 16. Still some unsolved problems Preservation of knowledge:  Storing and redistributing (valuable) text mining results requires long term infrastructure, hosting and other services  Many academic research efforts end “when the postdoc moves on” – where does the mined knowledge go?  A service offering a reliable, persistent, managed, open venue for the hosting of content mining results would be valuable
  • 17. END • j.dunham@elsevier.com/@judso • tdmsupport@elsevier.com Thanks: • Max Haeussler (UCSC) • Bradley Allen (Elsevier Labs)