Text Mining from Three Perspectives - Publisher


judsondunham

Part of talk with Heather Piwowar and Teresa Lee at Charleston 2012

2010

http://vignettebricks.blogspot.com/2007/07/miss-librarian-in-library-with-slippery.html

Elsevier content and APIs for research community

• Article content
• Author data
• Abstracts
• Affiliation data

Collaboration at a different level

What researchers want from content mining
• Assurance of the right to mine subscribed or freely available content across publishers
• Clear understanding of what they can do with the outputs
• Standardized, mineable formatting of data
• Stable, reliable systems for retrieving and storing content

How Elsevier is working to address this
Simple process to add text mining rights
• Researchers/librarians can request access via tdmsupport@elsevier.com
• We welcome requests; we need to keep learning!
Provide efficient methods for content retrieval
• Access to APIs for individual researchers via click-through agreements on the web
• Slimmed-down content formats optimized for text mining

How Elsevier is working to address this
Clear policy on reuse and redistribution
• Simple CC-BY-NC license on output when redistributing results of text mining in a research support or other non-commercial tool
• DOI link back to the mined article whenever feasible when displaying extracted content
• Clear guidelines and permissions on snippets around extracted entities to allow context when presenting results
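
As a rough illustration of the last two points, the sketch below pairs an extracted entity with a short context snippet and a DOI link back to the source article. The function name `snippet_with_doi`, the `window` size, the sample sentence, the accession `GSE12345`, and the DOI `10.1234/example` are all hypothetical placeholders; the slides do not state how long a permitted snippet may be.

```python
def snippet_with_doi(text, start, end, doi, window=40):
    """Return an extracted entity, its surrounding context, and a DOI link.

    `window` is an assumed number of context characters on each side;
    the actual permitted snippet length is not given in the slides.
    """
    left = max(0, start - window)          # clamp at the start of the text
    right = min(len(text), end + window)   # clamp at the end of the text
    return {
        "entity": text[start:end],
        "snippet": text[left:right],
        "link": "https://doi.org/" + doi,  # standard DOI resolver prefix
    }


# Illustrative input only: the sentence, accession and DOI are made up.
article_text = "Expression data were deposited in GEO under accession GSE12345 for reuse."
start = article_text.index("GSE12345")
result = snippet_with_doi(
    article_text, start, start + len("GSE12345"), "10.1234/example", window=20
)
print(result["snippet"])
print(result["link"])
```

Keeping the entity, snippet and link in one record makes it easy to show the context and the link back to the article together wherever a result is displayed.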

USE CASE: Studying patterns of data sharing and reuse in the published literature
Text mining a small subset of content
• Using literature as a data source to explore a research issue (in this case, how data in repositories is shared, referenced and reused in the literature)
• Requires thousands of documents (not millions)
• Will result in papers published in journals, as well as a publicly available database providing information about reuse of specific datasets (mined from the literature)

Solution: Access to subscribed content via APIs, permission to reuse results in a non-commercial tool

Heather Piwowar - Postdoc at Duke University

USE CASE: Extraction of DNA sequences from millions of papers to facilitate biocuration and discovery
Text mining a large volume of content
Identify all words in full-text scientific articles that resemble DNA sequences, extract them, and map them to public genome sequences. The mapped sequences can then be displayed on genome browser websites and used in data-mining applications.

Max Haeussler – Postdoc, UCSC
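
The first step of this use case, spotting words that resemble DNA sequences, can be sketched in a few lines. This is a minimal illustration and not the method used in the project described here: it assumes a "DNA-like word" is an unbroken run of at least `MIN_LEN` characters drawn only from A, C, G and T. The threshold, the function name `find_dna_like_words`, and the sample sentence are assumptions. Mapping hits to public genome sequences is out of scope for the sketch.

```python
import re

# Assumption: a DNA-like word is a run of >= MIN_LEN letters from A, C, G, T.
# The threshold is illustrative; short English words such as "cat" or "tag"
# use only these letters, so a minimum length filters them out.
MIN_LEN = 12
DNA_WORD = re.compile(r"\b[ACGTacgt]{%d,}\b" % MIN_LEN)


def find_dna_like_words(text):
    """Return (sequence, start, end) for each DNA-like token found in text."""
    return [
        (match.group(0).upper(), match.start(), match.end())
        for match in DNA_WORD.finditer(text)
    ]


# Made-up sentence for illustration.
sample = "We used the primer ACGTTGCAAGGCTTAACG to amplify the target region of the cat gene."
hits = find_dna_like_words(sample)
for sequence, start, end in hits:
    print(sequence, start, end)
```

The character offsets returned with each hit are what would let a display layer show the mined sequence together with its flanking sentence.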

Screenshot callouts for this use case:
• Link to paper with metadata and abstract
• Mined sequence (bold) with flanking sentence

Still some unsolved problems
Decentralized content aggregation:
• Retrieving and storing data locally = shipping loads of content around the web/world
• Still tons of formatting and preparation work for every single text miner
• Significant computational infrastructure required to do large-scale text mining

A cloud-based service for running large-scale text mining without having to build infrastructure or worry about the “plumbing” would be valuable.

Still some unsolved problems
Preservation of knowledge:
• Storing and redistributing (valuable) text mining results requires long-term infrastructure, hosting and other services
• Many academic research efforts end “when the postdoc moves on”: where does the mined knowledge go?
• A service offering a reliable, persistent, managed, open venue for the hosting of content mining results would be valuable

END
• j.dunham@elsevier.com / @judso
• tdmsupport@elsevier.com
Thanks:
• Max Haeussler (UCSC)
• Bradley Allen (Elsevier Labs)

Title slide: Text Mining From Three Perspectives - Publisher. Judson Dunham, Sr. Product Manager, Elsevier. Charleston, 2012.
