Published on Dec 01, 2014 by PMR
An overview of ContentMining for JISC (the infrastructure provider of UK academia). Examples, details leading to hands-on exercise (http://contentmine.org/workflow
2. What is Content Mining
• Mining Text, Tables and Lists, Diagrams, Images
• Born-digital documents
• High-throughput (millions of items/year)
• Formal and Informal Collaboration
• Role of UK
• Hands-on
• Everything is OPEN (OSI, CC-BY, CC0)
3. The Right to Read is the Right to Mine
http://contentmine.org
4. ContentMine
• 1-2 year Shuttleworth Funding from 2014-03
• Free to everyone, Open Source, updated daily
• Structured Text, and Image/Diagram Mining
• Workshops for training and training trainers
• Bottom-up community development
– Bioscience (EuropePMC, BBSRC)
– Disease (Ebola)
– Astrophysics (Stray Toaster)
– Chemistry (TSB, EBI, PennState - Citeseer)
• We fight for Justice and Freedom
5. ContentMine People
• Jenny Molloy
• Ross Mounce
• Peter Murray-Rust + volunteers (Bioscience, disease)
• Richard Smith-Unna + 20 quickscrape volunteers
• Steph Unna
• Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones)
• Prof Charles Oppenheim
• Karien Bezuidenhout (Shuttleworth)
• Advisory Board RSN
6. ContentMine Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
Upcoming
• JISC
• LIBER
• BL
• Wellcome Trust
• WHO
8. Regular Expressions
(Easier than Crosswords or Sudoku)
• Ebola: Ebola
• Mali (not Malicious): Mali\W (end of word)
• Bat or bat: [Bb]at (alternatives)
• bat or bats: bats? (optional letter)
• Bat or Bats or bat or bats: [Bb]ats?
• Sudden onset: [Ss]udden\s+onset (space/\s)
• Panthera leo or Gorilla gorilla: [A-Z][a-z]+\s+[a-z]+ (ranges of letters)
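The patterns on this slide can be tried directly with Python's `re` module, written with their backslash escapes (`\W`, `\s`). This is a minimal sketch: the `PATTERNS` dictionary, the `find_all` helper and the sample sentences are invented for illustration and are not part of the talk.

```python
import re

# Patterns from the slide; raw strings keep the backslashes intact.
PATTERNS = {
    "mali": r"Mali\W",                     # "Mali" followed by a non-word character
    "bat": r"[Bb]ats?",                    # Bat, Bats, bat or bats
    "sudden_onset": r"[Ss]udden\s+onset",  # any amount of whitespace between words
    "binomial": r"[A-Z][a-z]+\s+[a-z]+",   # e.g. Panthera leo
}

def find_all(name, text):
    """Return every match of the named pattern in text."""
    return re.findall(PATTERNS[name], text)

# Illustrative sentences (invented for this example).
print(find_all("mali", "Malicious rumours spread in Mali."))   # ['Mali.']
print(find_all("bat", "Bats and a bat"))                       # ['Bats', 'bat']
print(find_all("sudden_onset", "sudden   onset of fever"))     # ['sudden   onset']
print(find_all("binomial", "Panthera leo and Gorilla gorilla"))
# ['Panthera leo', 'Gorilla gorilla']
```

Note that the binomial pattern is deliberately broad: it matches any capitalised word followed by a lowercase word, so a phrase such as "We saw" would also be reported and needs filtering downstream.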
9. Ebola regex
• <compoundRegex title="ebola">
• <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
• <regex weight="1.0" fields="marburg">(Marburg)</regex>
• <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
• <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
• <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
• <regex weight="0.5" fields="guinea">(Guinea)</regex>
• <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
• <regex weight="0.5" fields="liberia">(Liberia)</regex>
• <regex weight="0.5" fields="mali">(Mali)\W</regex>
• <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
• <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
• <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
• <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex>
• <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
• <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex>
• </compoundRegex>
15 mins to create, 15 mins to install and test
Or run online at CottageLabs
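A definition like this one can be loaded and applied with a few lines of standard-library Python. The sketch below is not the ContentMine implementation: the shortened XML, the `load_rules` and `score` helpers, and in particular the rule "add each weight once per match" are assumptions made for illustration, since the slides do not say how the weights are combined.

```python
import re
import xml.etree.ElementTree as ET

# A shortened copy of the slide's definition (three of the entries).
COMPOUND_REGEX_XML = r"""
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola">(Ebola)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
</compoundRegex>
"""

def load_rules(xml_text):
    """Parse <regex> children into (field, weight, compiled pattern) tuples."""
    root = ET.fromstring(xml_text.strip())
    return [
        (el.get("fields"), float(el.get("weight")), re.compile(el.text))
        for el in root.findall("regex")
    ]

def score(text, rules):
    """Assumed scoring rule: add each rule's weight once per match."""
    total = 0.0
    hits = {}
    for field, weight, pattern in rules:
        count = len(pattern.findall(text))
        if count:
            hits[field] = count
            total += weight * count
    return total, hits

rules = load_rules(COMPOUND_REGEX_XML)
# Invented sample sentence.
print(score("Ebola cases in Sierra Leone; Ebola spreads.", rules))
```

Keeping the patterns in XML rather than in code means a domain expert can edit the list without touching the program that runs it.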
10. Results of Regex on Ebola
• <resultsList xmlns="http://www.xml-cml.org/ami">
• <results xmlns="">
• <source xmlns="http://www.xml-cml.org/ami"
• name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
• <result>
• <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
• lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
• <regex xmlns="" weight="1.0" fields="[ebola]">
• <pattern>(Ebola)</pattern>
• </regex>
• <hits xmlns="">
• <hit ebola="Ebola" />
• </hits>
• </regex>
• </result>
• <result>
• <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
• lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
• <regex xmlns="" weight="0.5" fields="[sierra_leone]">
• <pattern>(Sierra\s+Leone)</pattern>
• </regex>
• <hits xmlns="">
• <hit sierra_leone="Sierra Leone" />
• </hits>
• </regex>
• </result>
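The results above record, for each matching line, its number, its text, the field and the hits. A minimal Python sketch of that line-by-line scan follows. The `RULES` dictionary, the `scan_lines` helper, the dictionary record layout and the 1-based line numbering are assumptions for illustration; the real tool writes XML as shown on the slide.

```python
import re

# Two of the rules from the Ebola definition.
RULES = {
    "ebola": re.compile(r"(Ebola)"),
    "sierra_leone": re.compile(r"(Sierra\s+Leone)"),
}

def scan_lines(lines, rules):
    """Yield one record per (line, rule) pair that has at least one hit."""
    for number, line in enumerate(lines, start=1):  # 1-based numbering is assumed
        for field, pattern in rules.items():
            hits = pattern.findall(line)
            if hits:
                yield {
                    "lineNumber": number,
                    "lineValue": line,
                    "field": field,
                    "hits": hits,
                }

# Sample lines shortened from the slide's lineValue attributes.
sample = [
    "There have been 14 413 reported Ebola cases in eight countries",
    "Case incidence continues to increase in Sierra Leone",
]
for record in scan_lines(sample, RULES):
    print(record["lineNumber"], record["field"], record["hits"])
```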
11. Demo of Content Mining
ChemicalTagger (Lezan Hawizy): a shallow, domain-specific, semantic parser for un/natural language.
12. Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy
Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
Figure labels: WP: Clostridium_butyricum; Genbank ID; American Type Culture Collection
13. RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Diagram labels: Queues; Repos; Scientific literature; Science; Plugins; Science; Volunteers
Collaboration with
Open Access Button
16. Content production
• Scholarly articles
• Theses
• Repositories
• Grey scientific literature
• Grey politico-socio-legal literature
• Company output (reports, accounts, contracts), e.g. OpenOil
17. STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf
(Summary: PMR has NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R
Licences destroy Content Mining
18. Challenges
• Active opposition from content “owners”, including serious lobbying and FUD
• Ignorance and apathy from universities; inappropriate reward system
• Sub-optimal technology of publishers
• Lack of common infrastructure, technology, APIs
• And it’s objectively messy anyway
19. Technical problems
• PDF: lacks words, tables, diagrams
• Non-Unicode character sets (or worse)
• Graphics objects largely destroyed (converted to PNG or worse)
• No communal ontology for document structure
• HTML carries PublisherJunk and Javascript
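As an illustration of the last point, text has to be separated from scripts and styling before any mining can start. The sketch below uses only Python's standard `html.parser`. The `TextExtractor` class, the `visible_text` helper and the sample page are invented for this example, and real publisher pages usually need site-specific rules on top of this.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def visible_text(html):
    """Return the page text with script and style content removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Invented sample page.
page = (
    "<html><head><script>var x = 1;</script><style>p{}</style></head>"
    "<body><p>Ebola cases</p><script>track()</script><p>in Mali</p></body></html>"
)
print(visible_text(page))
```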
20. Goals of Mining
• Classification of resources
• Entity extraction and indexing
• Aggregation within discipline
• Inter-disciplinary, e.g. biodiversity, phytochemistry
• Repurposing (twitter, ePub, annotation)
• Semantification/intelligent documents
• Detection of error and fraud
21. What we need
• Inter/national commitment to infrastructure
• Common ontologies and APIs
• Development of community
• Go beyond academia; non-academic reward system