1. Text and Data Mining (TDM)
SciDataCon 2014 Workshop
Jenny Molloy (@jenny_molloy) | Puneet Kishor (@punkish)
https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014
2. What is MINING?
1982
“Automatically generating logical representations of text passages... by means of an
analysis of the coherence structure of the passages.”
Jerry R. Hobbs, Donald E. Walker, and Robert A. Amsler. 1982. Natural language access to structured text. In Proceedings of the 9th Conference on Computational Linguistics - Volume 1 (COLING '82), Ján Horecký (Ed.), Vol. 1. Academia Praha, Czechoslovakia, 127-132. DOI=10.3115/991813.991833 http://dx.doi.org/10.3115/991813.991833
1999
“(semi)automated discovery of trends and patterns across very large datasets”
“Use of large online text collections to discover new facts and trends...”
“(Automating) the tedious parts of the text manipulation process and (integrating)
underlying computationally-driven text analysis with human-guided decision making within
exploratory data analysis over text”
Marti A. Hearst. 1999. Untangling text data mining. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL
'99). Association for Computational Linguistics, Stroudsburg, PA, USA, 3-10. DOI=10.3115/1034678.1034679 http://dx.doi.org/10.3115/1034678.1034679
2008
“The use of automated methods for exploiting the enormous amount of knowledge
available in the biomedical literature.”
K. Bretonnel Cohen and Lawrence Hunter. 2008. Getting started in text mining. PLoS Computational Biology 4(1): e20. DOI=10.1371/journal.pcbi.0040020. PMC 2217579. PMID 18225946.
3. What is CONTENT?
● Images
● Photos
● Graphs
● Figures
● Captions
● Sound
● Video
● Tables
● Datasets
● Supplementary information
● Metadata
● Text
4. 101 uses for content mining (nearly)...
Which universities in SE Asia do scientists from Cambridge work with? (We get asked this
sort of thing regularly by Vice-Chancellors.) By examining the list of authors of papers from Cambridge and the affiliations of
their co-authors we can get a very good approximation. (Feasible now.)
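The co-author-affiliation approximation could be sketched like this (the paper records and the country list are hypothetical illustrations, not real metadata):

```python
# Sketch: approximate Cambridge's SE Asian collaborators from paper metadata.
# The records and the country list below are invented for illustration.

SE_ASIAN_COUNTRIES = {"Singapore", "Malaysia", "Thailand", "Vietnam",
                      "Indonesia", "Philippines"}

papers = [
    {"affiliations": ["University of Cambridge, UK",
                      "National University of Singapore, Singapore"]},
    {"affiliations": ["University of Oxford, UK",
                      "Chulalongkorn University, Thailand"]},
]

def se_asian_partners(papers):
    """Collect SE Asian co-author affiliations on papers with a Cambridge author."""
    partners = set()
    for paper in papers:
        affs = paper["affiliations"]
        if any("Cambridge" in a for a in affs):
            for a in affs:
                # Crude country match on the affiliation string.
                if any(country in a for country in SE_ASIAN_COUNTRIES):
                    partners.add(a)
    return partners
```

The country match here is deliberately crude; a real pipeline would normalise affiliation strings first.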
Which papers contain grayscale images which could be interpreted as gels? Polyacrylamide gel
electrophoresis (http://en.wikipedia.org/wiki/Polyacrylamide_gel) is a universal method of identifying proteins and other biomolecules.
[The slide shows a typical gel image (Wikipedia, CC BY-SA).]
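A cheap first-pass filter for gel-like figures is to test whether an image is (near-)grayscale. A minimal sketch, assuming pixels arrive as (R, G, B) tuples; a real pipeline would decode them with an imaging library such as Pillow (an assumption, not part of the talk):

```python
# Sketch: flag (near-)grayscale images as candidate gel photographs.
# Pixels are (R, G, B) tuples; the tolerance allows for JPEG noise.

def is_grayscale(pixels, tolerance=8):
    """True if every pixel's channels agree to within `tolerance` levels."""
    return all(max(r, g, b) - min(r, g, b) <= tolerance for r, g, b in pixels)

gel_like = [(30, 30, 30), (200, 198, 199), (120, 121, 120)]
colour = [(255, 0, 0), (0, 255, 0)]
```

Grayscale is only a necessary condition; distinguishing gels from other grayscale figures would need further features (band-like dark rows, lane structure).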
Find me papers in a given subject which are (not) editorials, news items, corrections, retractions,
reviews, etc. Slightly journal/publisher-dependent but otherwise very simple.
Find papers about chemistry in the German language. Highly tractable. A typical approach would be
to find the 50 commonest words (e.g. “ein”, “das”, …) in a paper and show that their frequencies are very different from English
(“one”, “the”, …).
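The function-word approach could be sketched as follows (the short word lists stand in for the 50 commonest words):

```python
# Sketch: guess German vs English by counting a few very common function
# words. The word lists are illustrative stand-ins, not tuned stopword sets.
import re
from collections import Counter

GERMAN = {"der", "die", "das", "und", "ein", "eine", "ist", "nicht"}
ENGLISH = {"the", "and", "of", "a", "is", "to", "in", "not"}

def guess_language(text):
    """Return 'de' or 'en' depending on which function words dominate."""
    words = Counter(re.findall(r"[a-zäöüß]+", text.lower()))
    de_score = sum(words[w] for w in GERMAN)
    en_score = sum(words[w] for w in ENGLISH)
    return "de" if de_score > en_score else "en"
```

On paper-length texts the two scores separate very cleanly, which is why the talk calls this highly tractable.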
Find references to papers by a given author. This is metadata and therefore FACTual. It is usually trivial
to extract references and authors. It is more difficult, of course, to disambiguate them.
Find uses of the term “Open Data” before 2006. Remarkably, the term was almost unknown before
2006, when I started a Wikipedia article on it.
Find papers where authors come from chemistry department(s) and a linguistics
department. Easyish, assuming the departments have reasonable names and you have some aliases (“Molecular
Sciences”, “Biochemistry”, …).
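The alias matching could be sketched like this (the alias lists are assumptions for illustration):

```python
# Sketch: match affiliation strings against a department name plus aliases.
# The alias sets below are examples, not a curated list.

CHEMISTRY = {"chemistry", "molecular sciences", "biochemistry"}
LINGUISTICS = {"linguistics"}

def department_present(affiliations, aliases):
    """True if any affiliation string mentions any alias (case-insensitive)."""
    return any(alias in aff.lower() for aff in affiliations for alias in aliases)

def chem_and_ling(affiliations):
    """True if the author list spans both a chemistry and a linguistics department."""
    return (department_present(affiliations, CHEMISTRY)
            and department_present(affiliations, LINGUISTICS))
```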
Find papers acknowledging support from the Wellcome Trust. (So we can check for OA
compliance…)
Find papers with supplemental data files. Journal-specific but easily scalable.
Find papers with embedded mathematics. Lots of possible approaches. Equations are often set off by whitespace,
the text contains non-ASCII characters (e.g. Greek letters, script characters, aleph, etc.), and there is heavy use of sub- and superscripts. A fun project for an
enthusiast.
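One of the many possible approaches is to score a text span on Greek letters, sub/superscripts, and mathematical-symbol code points; a minimal sketch, with a threshold that is a guess rather than a tuned value:

```python
# Sketch: flag spans with many mathematical code points as likely equations.
# The threshold is an illustrative guess, not a calibrated parameter.
import unicodedata

def looks_mathematical(text, threshold=3):
    """Count Greek letters, sub/superscripts and maths symbols in `text`."""
    hits = 0
    for ch in text:
        name = unicodedata.name(ch, "")
        if ("GREEK" in name or "SUBSCRIPT" in name or "SUPERSCRIPT" in name
                or unicodedata.category(ch) == "Sm"):  # Sm = Symbol, math
            hits += 1
    return hits >= threshold
```

Combining this with the whitespace and sub/superscript cues mentioned above would sharpen the classifier.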
https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014