Text and Data Mining (TDM) 
SciDataCon 2014 Workshop 
Jenny Molloy (@jenny_molloy) | Puneet Kishor (@punkish) 
https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014
What is MINING? 
1982 
“Automatically generating logical representations of text passages... by means of an 
analysis of the coherence structure of the passages.” 
Jerry R. Hobbs, Donald E. Walker, and Robert A. Amsler. 1982. Natural language access to structured text. In Proceedings of the 9th conference on Computational linguistics - 
Volume 1 (COLING '82), Ján Horecký (Ed.), Vol. 1. Academia Praha, Czechoslovakia, 127-132. DOI=10.3115/991813.991833 http://dx.doi.org/10.3115/991813.991833 
1999 
“(semi)automated discovery of trends and patterns across very large datasets” 
“Use of large online text collections to discover new facts and trends...” 
“(Automating) the tedious parts of the text manipulation process and (integrating) 
underlying computationally-driven text analysis with human-guided decision making within 
exploratory data analysis over text” 
Marti A. Hearst. 1999. Untangling text data mining. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL 
'99). Association for Computational Linguistics, Stroudsburg, PA, USA, 3-10. DOI=10.3115/1034678.1034679 http://dx.doi.org/10.3115/1034678.1034679 
2008 
“The use of automated methods for exploiting the enormous amount of knowledge 
available in the biomedical literature.” 
Cohen, K. Bretonnel; Hunter, Lawrence (2008). "Getting Started in Text Mining". PLoS Computational Biology 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579. PMID 18225946. 
What is CONTENT? 
● Images 
● Photos 
● Graphs 
● Figures 
● Captions 
● Sound 
● Video 
● Tables 
● Datasets 
● Supplementary information 
● Metadata 
● Text 
101 uses for content mining (nearly)... 
Which universities in SE Asia do scientists from Cambridge work with? (We get asked this 
sort of thing regularly by Vice-Chancellors.) By examining the list of authors of papers from Cambridge and the affiliations of 
their co-authors we can get a very good approximation. (Feasible now.) 
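A minimal sketch of the co-author affiliation count described above. The record layout, the sample papers, and the country-code set are all assumptions made for illustration; real metadata would need affiliation strings normalised first.

```python
from collections import Counter

# Hypothetical records: each paper lists (author, affiliation, country code).
papers = [
    {"authors": [("A. Smith", "University of Cambridge", "GB"),
                 ("B. Tan", "National University of Singapore", "SG")]},
    {"authors": [("C. Jones", "University of Cambridge", "GB"),
                 ("D. Nguyen", "Vietnam National University", "VN"),
                 ("E. Lim", "National University of Singapore", "SG")]},
    {"authors": [("F. Brown", "University of Oxford", "GB"),
                 ("G. Wong", "Chulalongkorn University", "TH")]},
]

# ISO 3166-1 alpha-2 codes for South-East Asian countries.
SE_ASIA = {"BN", "KH", "ID", "LA", "MM", "MY", "PH", "SG", "TH", "TL", "VN"}


def partner_institutions(papers, home="University of Cambridge", region=SE_ASIA):
    """Count papers that link `home` to each institution inside `region`."""
    counts = Counter()
    for paper in papers:
        # Deduplicate so an institution counts once per paper.
        affiliations = {(aff, country) for _, aff, country in paper["authors"]}
        if not any(aff == home for aff, _ in affiliations):
            continue  # no author from the home institution
        for aff, country in affiliations:
            if country in region and aff != home:
                counts[aff] += 1
    return counts


partners = partner_institutions(papers)
```

Sorting `partners.most_common()` would give the ranked list of collaborating universities.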
Which papers contain grayscale images which could be interpreted as gels? A 
polyacrylamide gel (http://en.wikipedia.org/wiki/Polyacrylamide_gel) is a universal method of identifying proteins and other biomolecules. 
[Image: a typical gel (Wikipedia, CC-BY-SA)] 
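The first filter for this question, keeping only grayscale images, can be sketched as below. This assumes pixels have already been decoded into (r, g, b) tuples by some image library; deciding whether a grayscale image is actually a gel is a separate and harder classification step that this sketch does not attempt.

```python
def is_grayscale(pixels, tolerance=0):
    """Return True if every (r, g, b) pixel has near-equal channels.

    `tolerance` allows for small colour casts from scanning or compression.
    """
    return all(max(p) - min(p) <= tolerance for p in pixels)


# Hypothetical pixel data for illustration.
gel_like = [(10, 10, 10), (200, 200, 200), (128, 128, 128)]
colour_chart = [(255, 0, 0), (0, 255, 0), (128, 128, 128)]
```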
Find me papers in subjects which are (not) editorials, news, corrections, retractions, 
reviews, etc. Slightly journal/publisher-dependent but otherwise very simple. 
Find papers about chemistry in the German language. Highly tractable. Typical approach would be 
to find the 50 commonest words (e.g. “ein”, “das”, …) in a paper and show the frequency is very different from English 
(“one”, “the”, …). 
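A rough sketch of the common-word idea. The two ten-word lists are illustrative stand-ins, not the 50 commonest words of either language, and counting hits is a simplification of a proper frequency comparison.

```python
import re

# Small illustrative function-word lists (assumed, not exhaustive).
GERMAN = {"der", "die", "das", "und", "ein", "eine", "ist", "nicht", "mit", "von"}
ENGLISH = {"the", "and", "a", "of", "is", "not", "with", "to", "in", "one"}


def guess_language(text):
    """Guess 'de' or 'en' from which function-word list matches more often."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    german_hits = sum(w in GERMAN for w in words)
    english_hits = sum(w in ENGLISH for w in words)
    if german_hits == english_hits:
        return "unknown"
    return "de" if german_hits > english_hits else "en"
```

Short texts such as titles carry few function words, so the guess is more trustworthy on abstracts or full text.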
Find references to papers by a given author. This is metadata and therefore FACTual. It is usually trivial 
to extract references and authors. More difficult, of course, to disambiguate. 
Find uses of the term “Open Data” before 2006. Remarkably the term was almost unknown before 
2006 when I started a Wikipedia article on it. 
Find papers where authors come from chemistry department(s) and a linguistics 
department. Easyish (assuming the departments have reasonable names and you have some aliases (“Molecular 
Sciences”, “Biochemistry”)…) 
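The alias matching mentioned above might look like this. The alias table is hypothetical: "Molecular Sciences" and "Biochemistry" come from the slide, while "language sciences" is an invented example of a linguistics alias.

```python
# Hypothetical alias table mapping a discipline to department-name fragments.
ALIASES = {
    "chemistry": ("chemistry", "molecular sciences", "biochemistry"),
    "linguistics": ("linguistics", "language sciences"),
}


def disciplines(affiliation):
    """Return the set of disciplines whose aliases occur in an affiliation."""
    text = affiliation.lower()
    return {d for d, names in ALIASES.items() if any(n in text for n in names)}


def spans_both(affiliations, needed=("chemistry", "linguistics")):
    """True if the paper's affiliations cover every discipline in `needed`."""
    found = set()
    for aff in affiliations:
        found |= disciplines(aff)
    return set(needed) <= found
```

Substring matching is deliberately loose; departments with unusual names would need extra aliases.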
Find papers acknowledging support from the Wellcome Trust. (So we can check for OA 
compliance…). 
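One possible way to spot such acknowledgements, assuming the full text is available as a plain string. The list of cue words is a guess at typical funding phrasing and would miss other wordings.

```python
import re

FUNDER = re.compile(r"\bwellcome\s+trust\b", re.IGNORECASE)
# Assumed cue words that usually signal a funding statement.
CUES = re.compile(r"\b(fund(?:ed|ing)|support(?:ed)?|grant|acknowledge\w*)\b",
                  re.IGNORECASE)


def acknowledges_wellcome(text):
    """True if any sentence names the Wellcome Trust alongside a funding cue."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if FUNDER.search(sentence) and CUES.search(sentence):
            return True
    return False
```

Requiring a cue word in the same sentence filters out papers that merely mention the Trust.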
Find papers with supplemental data files. Journal-specific but easily scalable. 
Find papers with embedded mathematics. Lots of possible approaches. Equations are often whitespaced, 
text contains non-ASCII characters (e.g. Greek letters, scripts, aleph, etc.), and there is heavy use of sub- and superscripts. A fun project for an 
enthusiast. 
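One of the many possible approaches, sketched with the standard `unicodedata` module: score a passage by the share of characters that are math symbols, Greek letters, sub- or superscripts, or aleph. Any threshold for calling a passage "mathematical" is an assumption to be tuned on real papers.

```python
import unicodedata


def is_mathy(char):
    """True for math symbols, Greek letters, sub/superscripts, and aleph."""
    if unicodedata.category(char) == "Sm":  # e.g. +, =, <, ∑, ∫
        return True
    name = unicodedata.name(char, "")
    return name.startswith(("GREEK", "SUPERSCRIPT", "SUBSCRIPT", "ALEF SYMBOL"))


def math_score(text):
    """Fraction of non-whitespace characters that look mathematical."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_mathy(c) for c in chars) / len(chars)
```

This only sees Unicode text; equations embedded as images or laid out with whitespace would need other cues.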
