SlideShare a Scribd company logo
1 of 4
Download to read offline
Text and Data Mining (TDM) 
SciDataCon 2014 Workshop 
Jenny Molloy (@jenny_molloy) | Puneet Kishor (@punkish) 
https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014
What is MINING? 
1982 
“Automatically generating logical representations of text passages... by means of an 
analysis of the coherence structure of the passages.” 
Jerry R. Hobbs, Donald E. Walker, and Robert A. Amsler. 1982. Natural language access to structured text. In Proceedings of the 9th conference on Computational linguistics - 
Volume 1(COLING '82), Ján Horecký (Ed.), Vol. 1. Academia Praha, , Czechoslovakia, 127-132. DOI=10.3115/991813.991833 http://dx.doi.org/10.3115/991813.991833 
1999 
“(semi)automated discovery of trends and patterns across very large datasets” 
“Use of large online text collections to discover new facts and trends...” 
“(Automating) the tedious parts of the text manipulation process and (integrating) 
underlying computationally-driven text analysis with human-guided decision making within 
exploratory data analysis over text” 
Marti A. Hearst. 1999. Untangling text data mining. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL 
'99). Association for Computational Linguistics, Stroudsburg, PA, USA, 3-10. DOI=10.3115/1034678.1034679 http://dx.doi.org/10.3115/1034678.1034679 
2008 
“The use of automated methods for exploiting the enormous amount of knowledge 
available in the biomedical literature.” 
Cohen, K. Bretonnel; Hunter, Lawrence (2008). "Getting Started in Text Mining". PLoS Computational Biology 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579.PMID 
18225946. 
https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014
What is CONTENT? 
● Images 
● Photos 
● Graphs 
● Figures 
● Captions 
● Sound 
● Video 
● Tables 
● Datasets 
● Supplementary information 
● Metadata 
● Text 
https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014
101 uses for content mining (nearly)... 
Which universities in SE Asia do scientists from Cambridge work with? (We get asked this 
sort of thing regularly by ViceChancellors). By examining the list of authors of papers from Cambridge and the affiliations of 
their co-authors we can get a very good approximation. (Feasible now). 
Which papers contain grayscale images which could be interpreted as Gels? A 
http://en.wikipedia.org/wiki/Polyacrylamide_gel is a universal method of identifying proteins and other biomolecules. A 
typical gel (Wikipedia CC-BY-SA) looks like 
Find me papers in subjects which are (not) editorials, news, corrections, retractions, 
reviews, etc. Slightly journal/publisher-dependent but otherwise very simple. 
Find papers about chemistry in the German language. Highly tractable. Typical approach would be 
to find the 50 commonest words (e.g. “ein”, “das”,…) in a paper and show the frequency is very different from English 
(“one”, “the” …) 
Find references to papers by a given author. This is metadata and therefore FACTual. It is usually trivial 
to extract references and authors. More difficult, of course to disambiguate. 
Find uses of the term “Open Data” before 2006. Remarkably the term was almost unknown before 
2006 when I started a Wikipedia article on it. 
Find papers where authors come from chemistry department(s) and a linguistics 
department. Easyish (assuming the departments have reasonable names and you have some aliases (“Molecular 
Sciences”, “Biochemistry”)…) 
Find papers acknowledging support from the Wellcome Trust . (So we can check for OA 
compliance…). 
Find papers with supplemental data files. Journal-specific but easily scalable. 
Find papers with embedded mathematics. Lots of possible approaches. Equations are often whitespaced, 
text contains non-ASCII characters (e.g. greeks, scripts, aleph, etc.) Heavy use of sub- and superscripts. A fun project for an 
enthusiast 
https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014

More Related Content

What's hot

How Much to Semanticize? Looking at the future of Library Data and the Semant...
How Much to Semanticize? Looking at the future of Library Data and the Semant...How Much to Semanticize? Looking at the future of Library Data and the Semant...
How Much to Semanticize? Looking at the future of Library Data and the Semant...Jenn Riley
 
Converting Metadata to Linked Data
Converting Metadata to Linked DataConverting Metadata to Linked Data
Converting Metadata to Linked DataKaren Estlund
 
Modern Tools & Rationales for 21st Century Research
Modern Tools & Rationales  for 21st Century ResearchModern Tools & Rationales  for 21st Century Research
Modern Tools & Rationales for 21st Century ResearchRoss Mounce
 
Changing The Way We Discover Research
Changing The Way We Discover ResearchChanging The Way We Discover Research
Changing The Way We Discover ResearchOpen Knowledge Maps
 
601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio Edition601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio EditionJordan Chapman
 
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisTAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisrobertstevens65
 
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar   Intro to Linked Data and SemanticsINSPIRE Hackathon Webinar   Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar Intro to Linked Data and Semanticsplan4all
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSSOpen software and knowledge for MIOSS
Open software and knowledge for MIOSSpetermurrayrust
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literatureAmanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literaturepetermurrayrust
 
Altmetrics & visualizations for discovery
Altmetrics & visualizations for discoveryAltmetrics & visualizations for discovery
Altmetrics & visualizations for discoveryOpen Knowledge Maps
 
2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinalDeborah McGuinness
 
Data publication and Citation for CLIR postdoc seminar
Data publication and Citation for CLIR postdoc seminarData publication and Citation for CLIR postdoc seminar
Data publication and Citation for CLIR postdoc seminarCarly Strasser
 
How the Web can change social science research (including yours)
How the Web can change social science research (including yours)How the Web can change social science research (including yours)
How the Web can change social science research (including yours)Frank van Harmelen
 
Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...David Milward
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureTheContentMine
 
Crossing the streams: Social and technical interfaces between Wikimedia and O...
Crossing the streams: Social and technical interfaces between Wikimedia and O...Crossing the streams: Social and technical interfaces between Wikimedia and O...
Crossing the streams: Social and technical interfaces between Wikimedia and O...Dario Taraborelli
 

What's hot (20)

How Much to Semanticize? Looking at the future of Library Data and the Semant...
How Much to Semanticize? Looking at the future of Library Data and the Semant...How Much to Semanticize? Looking at the future of Library Data and the Semant...
How Much to Semanticize? Looking at the future of Library Data and the Semant...
 
Converting Metadata to Linked Data
Converting Metadata to Linked DataConverting Metadata to Linked Data
Converting Metadata to Linked Data
 
Modern Tools & Rationales for 21st Century Research
Modern Tools & Rationales  for 21st Century ResearchModern Tools & Rationales  for 21st Century Research
Modern Tools & Rationales for 21st Century Research
 
Changing The Way We Discover Research
Changing The Way We Discover ResearchChanging The Way We Discover Research
Changing The Way We Discover Research
 
601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio Edition601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio Edition
 
Islt doctoral day may2018_marwa
Islt doctoral day may2018_marwaIslt doctoral day may2018_marwa
Islt doctoral day may2018_marwa
 
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisTAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
 
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar   Intro to Linked Data and SemanticsINSPIRE Hackathon Webinar   Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSSOpen software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literatureAmanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
Altmetrics & visualizations for discovery
Altmetrics & visualizations for discoveryAltmetrics & visualizations for discovery
Altmetrics & visualizations for discovery
 
2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal
 
Stack queue
Stack queueStack queue
Stack queue
 
Data publication and Citation for CLIR postdoc seminar
Data publication and Citation for CLIR postdoc seminarData publication and Citation for CLIR postdoc seminar
Data publication and Citation for CLIR postdoc seminar
 
When the world beats a path to your door. Collaboration in the era of big data
When the world beats a path to your door. Collaboration in the era of big dataWhen the world beats a path to your door. Collaboration in the era of big data
When the world beats a path to your door. Collaboration in the era of big data
 
How the Web can change social science research (including yours)
How the Web can change social science research (including yours)How the Web can change social science research (including yours)
How the Web can change social science research (including yours)
 
Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...Healthcare Data Management using Domain Specific Languages for Metadata Manag...
Healthcare Data Management using Domain Specific Languages for Metadata Manag...
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
 
Open science platforms
Open science platformsOpen science platforms
Open science platforms
 
Crossing the streams: Social and technical interfaces between Wikimedia and O...
Crossing the streams: Social and technical interfaces between Wikimedia and O...Crossing the streams: Social and technical interfaces between Wikimedia and O...
Crossing the streams: Social and technical interfaces between Wikimedia and O...
 

Similar to SciDataCon 2014 TDM Workshop Intro Slides

Workshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningWorkshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningRoss Mounce
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Sciencedrnigam
 
Riding the wave - Paradigm shifts in information access
Riding the wave - Paradigm shifts in information accessRiding the wave - Paradigm shifts in information access
Riding the wave - Paradigm shifts in information accessdatacite
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is usefulTheContentMine
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is usefulpetermurrayrust
 
Dutch Book Trade 1660-1750: using the STCN to gain insight in publishers’ str...
Dutch Book Trade 1660-1750: using the STCN to gain insight in publishers’ str...Dutch Book Trade 1660-1750: using the STCN to gain insight in publishers’ str...
Dutch Book Trade 1660-1750: using the STCN to gain insight in publishers’ str...Wouter Beek
 
text_mining.doc
text_mining.doctext_mining.doc
text_mining.docbutest
 
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingAn-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingTheodore J. LaGrow
 
An Open Context for Archaeology
An Open Context for ArchaeologyAn Open Context for Archaeology
An Open Context for Archaeologyguest756e05
 
ContentMine Presentation for WHO Health Data Seminar
ContentMine Presentation for WHO Health Data SeminarContentMine Presentation for WHO Health Data Seminar
ContentMine Presentation for WHO Health Data SeminarJenny Molloy
 
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATIONONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATIONIJwest
 
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR  ONTOLOGY APPLICATION ONTOLOGY SERVICE CENTER: A DATAHUB FOR  ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION dannyijwest
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasAngelo Salatino
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKpetermurrayrust
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology:  A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology:  A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasAngelo Salatino
 

Similar to SciDataCon 2014 TDM Workshop Intro Slides (20)

Rudi
RudiRudi
Rudi
 
Rudi
RudiRudi
Rudi
 
Resources, resources, resources: the three rs of the Web
Resources, resources, resources: the three rs of the WebResources, resources, resources: the three rs of the Web
Resources, resources, resources: the three rs of the Web
 
Workshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningWorkshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data mining
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Science
 
Riding the wave - Paradigm shifts in information access
Riding the wave - Paradigm shifts in information accessRiding the wave - Paradigm shifts in information access
Riding the wave - Paradigm shifts in information access
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is useful
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is useful
 
Data Publishing in Archaeozoology
Data Publishing in ArchaeozoologyData Publishing in Archaeozoology
Data Publishing in Archaeozoology
 
Dutch Book Trade 1660-1750: using the STCN to gain insight in publishers’ str...
Dutch Book Trade 1660-1750: using the STCN to gain insight in publishers’ str...Dutch Book Trade 1660-1750: using the STCN to gain insight in publishers’ str...
Dutch Book Trade 1660-1750: using the STCN to gain insight in publishers’ str...
 
text_mining.doc
text_mining.doctext_mining.doc
text_mining.doc
 
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingAn-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
 
An Open Context for Archaeology
An Open Context for ArchaeologyAn Open Context for Archaeology
An Open Context for Archaeology
 
Reading avoidance
Reading avoidanceReading avoidance
Reading avoidance
 
ContentMine Presentation for WHO Health Data Seminar
ContentMine Presentation for WHO Health Data SeminarContentMine Presentation for WHO Health Data Seminar
ContentMine Presentation for WHO Health Data Seminar
 
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATIONONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
 
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR  ONTOLOGY APPLICATION ONTOLOGY SERVICE CENTER: A DATAHUB FOR  ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology:  A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology:  A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
 

More from Jenny Molloy

Engineering Life with Synthetic Biology
Engineering Life with Synthetic BiologyEngineering Life with Synthetic Biology
Engineering Life with Synthetic BiologyJenny Molloy
 
ContentMine (EMBL-EBI Industry Programme)
ContentMine (EMBL-EBI Industry Programme)ContentMine (EMBL-EBI Industry Programme)
ContentMine (EMBL-EBI Industry Programme)Jenny Molloy
 
YEAR Conference 2015 - How to share our research data
YEAR Conference 2015 - How to share our research dataYEAR Conference 2015 - How to share our research data
YEAR Conference 2015 - How to share our research dataJenny Molloy
 
Legal Framework for TDM
Legal Framework for TDMLegal Framework for TDM
Legal Framework for TDMJenny Molloy
 
Introducing Open Science
Introducing Open ScienceIntroducing Open Science
Introducing Open ScienceJenny Molloy
 
ContentMine at EuropePMC AGM
ContentMine at EuropePMC AGMContentMine at EuropePMC AGM
ContentMine at EuropePMC AGMJenny Molloy
 

More from Jenny Molloy (7)

Engineering Life with Synthetic Biology
Engineering Life with Synthetic BiologyEngineering Life with Synthetic Biology
Engineering Life with Synthetic Biology
 
ContentMine (EMBL-EBI Industry Programme)
ContentMine (EMBL-EBI Industry Programme)ContentMine (EMBL-EBI Industry Programme)
ContentMine (EMBL-EBI Industry Programme)
 
YEAR Conference 2015 - How to share our research data
YEAR Conference 2015 - How to share our research dataYEAR Conference 2015 - How to share our research data
YEAR Conference 2015 - How to share our research data
 
Legal Framework for TDM
Legal Framework for TDMLegal Framework for TDM
Legal Framework for TDM
 
Introducing Open Science
Introducing Open ScienceIntroducing Open Science
Introducing Open Science
 
ContentMine at EuropePMC AGM
ContentMine at EuropePMC AGMContentMine at EuropePMC AGM
ContentMine at EuropePMC AGM
 
Id2 presentation
Id2 presentationId2 presentation
Id2 presentation
 

Recently uploaded

Detectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureDetectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureSérgio Sacani
 
Isolation of AMF by wet sieving and decantation method pptx
Isolation of AMF by wet sieving and decantation method pptxIsolation of AMF by wet sieving and decantation method pptx
Isolation of AMF by wet sieving and decantation method pptxGOWTHAMIM22
 
RACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxRACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxArunLakshmiMeenakshi
 
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptxPlasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptxmuralinath2
 
GBSN - Microbiology Lab (Compound Microscope)
GBSN - Microbiology Lab (Compound Microscope)GBSN - Microbiology Lab (Compound Microscope)
GBSN - Microbiology Lab (Compound Microscope)Areesha Ahmad
 
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 RpWASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 RpSérgio Sacani
 
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday LifeGBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday LifeAreesha Ahmad
 
-case selection and treatment planing.pptx
-case selection and treatment planing.pptx-case selection and treatment planing.pptx
-case selection and treatment planing.pptxmohamedturki866
 
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxSaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxPat (JS) Heslop-Harrison
 
GBSN - Microbiology Lab (Microbiology Lab Safety Procedures)
GBSN -  Microbiology Lab (Microbiology Lab Safety Procedures)GBSN -  Microbiology Lab (Microbiology Lab Safety Procedures)
GBSN - Microbiology Lab (Microbiology Lab Safety Procedures)Areesha Ahmad
 
mixotrophy in cyanobacteria: a dual nutritional strategy
mixotrophy in cyanobacteria: a dual nutritional strategymixotrophy in cyanobacteria: a dual nutritional strategy
mixotrophy in cyanobacteria: a dual nutritional strategyMansiBishnoi1
 
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Ansari Aashif Raza Mohd Imtiyaz
 
Continuum emission from within the plunging region of black hole discs
Continuum emission from within the plunging region of black hole discsContinuum emission from within the plunging region of black hole discs
Continuum emission from within the plunging region of black hole discsSérgio Sacani
 
IISc Bangalore M.E./M.Tech. courses and fees 2024
IISc Bangalore M.E./M.Tech. courses and fees 2024IISc Bangalore M.E./M.Tech. courses and fees 2024
IISc Bangalore M.E./M.Tech. courses and fees 2024SciAstra
 
GBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interactionGBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interactionAreesha Ahmad
 
Erythropoiesis- Dr.E. Muralinath-C Kalyan
Erythropoiesis- Dr.E. Muralinath-C KalyanErythropoiesis- Dr.E. Muralinath-C Kalyan
Erythropoiesis- Dr.E. Muralinath-C Kalyanmuralinath2
 
family therapy psychotherapy types .pdf
family therapy psychotherapy types  .pdffamily therapy psychotherapy types  .pdf
family therapy psychotherapy types .pdfhaseebahmeddrama
 
NuGOweek 2024 full programme - hosted by Ghent University
NuGOweek 2024 full programme - hosted by Ghent UniversityNuGOweek 2024 full programme - hosted by Ghent University
NuGOweek 2024 full programme - hosted by Ghent Universitypablovgd
 
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...Sahil Suleman
 

Recently uploaded (20)

Detectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureDetectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a Technosignature
 
Isolation of AMF by wet sieving and decantation method pptx
Isolation of AMF by wet sieving and decantation method pptxIsolation of AMF by wet sieving and decantation method pptx
Isolation of AMF by wet sieving and decantation method pptx
 
RACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptxRACEMIzATION AND ISOMERISATION completed.pptx
RACEMIzATION AND ISOMERISATION completed.pptx
 
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptxPlasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
 
GBSN - Microbiology Lab (Compound Microscope)
GBSN - Microbiology Lab (Compound Microscope)GBSN - Microbiology Lab (Compound Microscope)
GBSN - Microbiology Lab (Compound Microscope)
 
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 RpWASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
WASP-69b’s Escaping Envelope Is Confined to a Tail Extending at Least 7 Rp
 
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday LifeGBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
 
-case selection and treatment planing.pptx
-case selection and treatment planing.pptx-case selection and treatment planing.pptx
-case selection and treatment planing.pptx
 
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxSaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
 
GBSN - Microbiology Lab (Microbiology Lab Safety Procedures)
GBSN -  Microbiology Lab (Microbiology Lab Safety Procedures)GBSN -  Microbiology Lab (Microbiology Lab Safety Procedures)
GBSN - Microbiology Lab (Microbiology Lab Safety Procedures)
 
mixotrophy in cyanobacteria: a dual nutritional strategy
mixotrophy in cyanobacteria: a dual nutritional strategymixotrophy in cyanobacteria: a dual nutritional strategy
mixotrophy in cyanobacteria: a dual nutritional strategy
 
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
 
Continuum emission from within the plunging region of black hole discs
Continuum emission from within the plunging region of black hole discsContinuum emission from within the plunging region of black hole discs
Continuum emission from within the plunging region of black hole discs
 
IISc Bangalore M.E./M.Tech. courses and fees 2024
IISc Bangalore M.E./M.Tech. courses and fees 2024IISc Bangalore M.E./M.Tech. courses and fees 2024
IISc Bangalore M.E./M.Tech. courses and fees 2024
 
GBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interactionGBSN - Microbiology (Unit 6) Human and Microbial interaction
GBSN - Microbiology (Unit 6) Human and Microbial interaction
 
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
 
Erythropoiesis- Dr.E. Muralinath-C Kalyan
Erythropoiesis- Dr.E. Muralinath-C KalyanErythropoiesis- Dr.E. Muralinath-C Kalyan
Erythropoiesis- Dr.E. Muralinath-C Kalyan
 
family therapy psychotherapy types .pdf
family therapy psychotherapy types  .pdffamily therapy psychotherapy types  .pdf
family therapy psychotherapy types .pdf
 
NuGOweek 2024 full programme - hosted by Ghent University
NuGOweek 2024 full programme - hosted by Ghent UniversityNuGOweek 2024 full programme - hosted by Ghent University
NuGOweek 2024 full programme - hosted by Ghent University
 
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...
Alternative method of dissolution in-vitro in-vivo correlation and dissolutio...
 

SciDataCon 2014 TDM Workshop Intro Slides

  • 1. Text and Data Mining (TDM) SciDataCon 2014 Workshop Jenny Molloy (@jenny_molloy) | Puneet Kishor (@punkish) https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014
  • 2. What is MINING? 1982 “Automatically generating logical representations of text passages... by means of an analysis of the coherence structure of the passages.” Jerry R. Hobbs, Donald E. Walker, and Robert A. Amsler. 1982. Natural language access to structured text. In Proceedings of the 9th conference on Computational linguistics - Volume 1(COLING '82), Ján Horecký (Ed.), Vol. 1. Academia Praha, , Czechoslovakia, 127-132. DOI=10.3115/991813.991833 http://dx.doi.org/10.3115/991813.991833 1999 “(semi)automated discovery of trends and patterns across very large datasets” “Use of large online text collections to discover new facts and trends...” “(Automating) the tedious parts of the text manipulation process and (integrating) underlying computationally-driven text analysis with human-guided decision making within exploratory data analysis over text” Marti A. Hearst. 1999. Untangling text data mining. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL '99). Association for Computational Linguistics, Stroudsburg, PA, USA, 3-10. DOI=10.3115/1034678.1034679 http://dx.doi.org/10.3115/1034678.1034679 2008 “The use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature.” Cohen, K. Bretonnel; Hunter, Lawrence (2008). "Getting Started in Text Mining". PLoS Computational Biology 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579.PMID 18225946. https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014
  • 3. What is CONTENT? ● Images ● Photos ● Graphs ● Figures ● Captions ● Sound ● Video ● Tables ● Datasets ● Supplementary information ● Metadata ● Text https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014
  • 4. 101 uses for content mining (nearly)... Which universities in SE Asia do scientists from Cambridge work with? (We get asked this sort of thing regularly by ViceChancellors). By examining the list of authors of papers from Cambridge and the affiliations of their co-authors we can get a very good approximation. (Feasible now). Which papers contain grayscale images which could be interpreted as Gels? A http://en.wikipedia.org/wiki/Polyacrylamide_gel is a universal method of identifying proteins and other biomolecules. A typical gel (Wikipedia CC-BY-SA) looks like Find me papers in subjects which are (not) editorials, news, corrections, retractions, reviews, etc. Slightly journal/publisher-dependent but otherwise very simple. Find papers about chemistry in the German language. Highly tractable. Typical approach would be to find the 50 commonest words (e.g. “ein”, “das”,…) in a paper and show the frequency is very different from English (“one”, “the” …) Find references to papers by a given author. This is metadata and therefore FACTual. It is usually trivial to extract references and authors. More difficult, of course to disambiguate. Find uses of the term “Open Data” before 2006. Remarkably the term was almost unknown before 2006 when I started a Wikipedia article on it. Find papers where authors come from chemistry department(s) and a linguistics department. Easyish (assuming the departments have reasonable names and you have some aliases (“Molecular Sciences”, “Biochemistry”)…) Find papers acknowledging support from the Wellcome Trust . (So we can check for OA compliance…). Find papers with supplemental data files. Journal-specific but easily scalable. Find papers with embedded mathematics. Lots of possible approaches. Equations are often whitespaced, text contains non-ASCII characters (e.g. greeks, scripts, aleph, etc.) Heavy use of sub- and superscripts. A fun project for an enthusiast https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014