This document provides an overview of practical content mining. It discusses what content mining is, including mining text, tables, lists, diagrams and images from born-digital and high-throughput documents. It describes the ContentMine project, workshops held to train people in content mining, and collaborations with various scientific communities. Challenges to content mining include opposition from content owners and a lack of common infrastructure and technology.
2. What is Content Mining
• Mining Text, Tables and Lists, Diagrams, Images
• Born-digital documents
• High-throughput (millions of items/year)
• Formal and Informal Collaboration
• Role of UK
• Hands-on
• Everything is OPEN (OSI, CC-BY, CC0)
3. The Right to Read is the Right to Mine
http://contentmine.org
4. ContentMine
• 1-2 year Shuttleworth Funding from 2014-03
• Free to everyone, Open Source, updated daily
• Structured Text, and Image/Diagram Mining
• Workshops for training and training trainers
• Bottom-up community development
– Bioscience (EuropePMC, BBSRC)
– Disease (Ebola)
– Astrophysics (Stray Toaster)
– Chemistry (TSB, EBI, PennState - Citeseer)
• We fight for Justice and Freedom
5. ContentMine People
• Jenny Molloy
• Ross Mounce
• Peter Murray-Rust + volunteers (Bioscience, disease)
• Richard Smith-Unna + 20 quickscrape volunteers
• Steph Unna
• Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones)
• Prof Charles Oppenheim
• Karien Bezuidenhout (Shuttleworth)
• Advisory Board RSN
6. ContentMine Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
Upcoming
• JISC
• LIBER
• BL
• Wellcome Trust
• WHO
8. Regular Expressions
(Easier than Crosswords or Sudoku)
Text to match : regex (note)
• Ebola : Ebola
• Mali (not Malicious) : Mali\W (end of word)
• Bat or bat : [Bb]at (alternatives)
• bat or bats : bats? (optional letter)
• Bat or Bats or bat or bats : [Bb]ats?
• Sudden onset : [Ss]udden\s+onset (\s+ = one or more whitespace characters)
• Panthera leo or Gorilla gorilla : [A-Z][a-z]+\s+[a-z]+ (ranges of letters)
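The patterns on this slide can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the dictionary keys and the `find_all` helper are illustrative names, not part of ContentMine.

```python
import re

# Patterns from the slide, written as raw strings so that \W and \s
# reach the regex engine unchanged. The keys are illustrative labels.
PATTERNS = {
    "ebola": r"Ebola",
    "mali": r"Mali\W",                    # "Mali" followed by a non-word character
    "bat": r"[Bb]ats?",                   # Bat, bat, Bats or bats
    "sudden_onset": r"[Ss]udden\s+onset", # any amount of whitespace between words
    "binomial": r"[A-Z][a-z]+\s+[a-z]+",  # Capitalised word + lower-case word
}

def find_all(name, text):
    """Return every non-overlapping match of the named pattern in text."""
    return re.findall(PATTERNS[name], text)

if __name__ == "__main__":
    print(find_all("mali", "Malicious rumours in Mali."))
    print(find_all("bat", "Bats and a bat"))
    print(find_all("binomial", "Panthera leo and Gorilla gorilla"))
```

The binomial pattern is deliberately loose: it also matches ordinary phrases such as "The cat", so in practice its hits would need checking against a species list.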
9. Ebola regex
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="1.0" fields="marburg">(Marburg)</regex>
  <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
  <regex weight="0.5" fields="guinea">(Guinea)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="liberia">(Liberia)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
  <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
  <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex>
  <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
  <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex>
</compoundRegex>
15 mins to create, 15 mins to install and test
Or run online at CottageLabs
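To make the idea of a weighted compound regex concrete, here is a minimal sketch that loads a few of the rules above and applies them to a piece of text. It is not the AMI implementation: the slides do not say how AMI combines weights, so summing the weight of every field that matches at least once is an assumption, and `load_rules` and `score` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# A subset of the slide's compoundRegex, kept as a raw string so the
# backslashes in \s and \W survive.
COMPOUND_REGEX = r"""
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola">(Ebola)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
</compoundRegex>
"""

def load_rules(xml_text):
    """Parse <regex> children into (field, weight, compiled pattern) tuples."""
    root = ET.fromstring(xml_text.strip())
    return [
        (el.get("fields"), float(el.get("weight")), re.compile(el.text))
        for el in root.findall("regex")
    ]

def score(text, rules):
    """Return (total, hits). ASSUMPTION: total is the sum of the weights
    of all fields with at least one match; AMI may score differently."""
    hits = {}
    total = 0.0
    for field, weight, pattern in rules:
        found = [m.group(1) for m in pattern.finditer(text)]
        if found:
            hits[field] = found
            total += weight
    return total, hits

if __name__ == "__main__":
    rules = load_rules(COMPOUND_REGEX)
    print(score("Sudden onset of Ebola in Sierra Leone.", rules))
```

Keeping the rules in XML and the matching logic in code means a domain expert can add or reweight a pattern without touching the program.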
10. Results of Regex on Ebola
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
            name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
             lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
             lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
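Results in this form can be turned into a simple table of (line number, field, matched text). The sketch below uses only Python's standard library. The slide shows a truncated document, so the closing `</results>` and `</resultsList>` tags in the sample are added here as an assumption, and `extract_hits` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the outer <regex> elements, as declared on the slide.
AMI = "{http://www.xml-cml.org/ami}"

# The two results from the slide. The last two closing tags are an
# ASSUMPTION: the slide cuts off before the end of the document.
SAMPLE = r"""<resultsList xmlns="http://www.xml-cml.org/ami">
 <results xmlns="">
  <result>
   <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
     lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
    <regex xmlns="" weight="1.0" fields="[ebola]"><pattern>(Ebola)</pattern></regex>
    <hits xmlns=""><hit ebola="Ebola" /></hits>
   </regex>
  </result>
  <result>
   <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
     lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
    <regex xmlns="" weight="0.5" fields="[sierra_leone]"><pattern>(Sierra\s+Leone)</pattern></regex>
    <hits xmlns=""><hit sierra_leone="Sierra Leone" /></hits>
   </regex>
  </result>
 </results>
</resultsList>"""

def extract_hits(xml_text):
    """Return a list of (line_number, field, matched_text) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    # Only the outer <regex> elements are in the AMI namespace; they
    # carry the line number of the source text.
    for outer in root.iter(AMI + "regex"):
        line = int(outer.get("lineNumber"))
        # <hit> elements are in no namespace (xmlns="").
        for hit in outer.iter("hit"):
            for field, value in hit.attrib.items():
                rows.append((line, field, value))
    return rows

if __name__ == "__main__":
    for row in extract_hits(SAMPLE):
        print(row)
```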
11. Demo of Content Mining
ChemicalTagger (Lezan Hawizy): a shallow, domain-specific, semantic parser for un/natural language.
12. Bacterial WP_phylogenetic tree
[Figure labels: Genbank ID; American Type Culture Collection; WP: Clostridium_butyricum]
Our machines have read and interpreted 4300 in an hour with > 95% accuracy
Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
13. RSU: Richard Smith-Unna; PMR: Peter Murray-Rust; CL: CottageLabs
[Diagram labels: Queues; Repos; Scientific literature; Science; Plugins; Science; Volunteers]
Collaboration with Open Access Button
14. AMI (extraction) architecture
[Diagram labels: PDF2SVG; SVG2XML; Image analysis; sections; tables; AMI; captioned diagrams; Regex; Species; Phylo; Chem]
16. Content production
• Scholarly articles
• Theses
• Repositories
• Grey scientific literature
• Grey politico-socio-legal literature
• Company output (reports, accounts, contracts) (e.g. OpenOil)
17. Licences destroy Content Mining
WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R
STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf
(Summary: PMR has NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
18. Challenges
• Active opposition from content “owners”, including serious lobbying and FUD
• Ignorance and apathy from universities; inappropriate reward system
• Sub-optimal technology of publishers
• Lack of common infrastructure, technology, APIs
• And it’s objectively messy anyway
19. Technical problems
• PDF: lacks words, tables, diagrams
• Non-Unicode character sets (or worse)
• Graphics objects largely destroyed (converted to PNG or worse)
• No communal ontology for document structure
• HTML carries PublisherJunk and Javascript
20. Goals of Mining
• Classification of resources
• Entity extraction and indexing
• Aggregation within discipline
• Inter-disciplinary, e.g. biodiversity, phytochemistry
• Repurposing (twitter, ePub, annotation)
• Semantification/intelligent documents
• Detection of error and fraud
21. What we need
• Inter/national commitment to infrastructure
• Common ontologies and APIs
• Development of community
• Go beyond academia; non-academic reward system