Petermrjisc20141201

Overview of Practical Content Mining
Peter Murray-Rust
JISC, London, 2014-12-01

What is Content Mining
• Mining Text, Tables and Lists, Diagrams, Images
• Born-digital documents
• High-throughput (millions of items/year)
• Formal and Informal Collaboration
• Role of UK
• Hands-on
• Everything is OPEN (OSI, CC-BY, CC0)

The Right to Read is the Right to Mine
http://contentmine.org

ContentMine
• 1-2 year Shuttleworth Funding from 2014-03
• Free to everyone, Open Source, updated daily
• Structured Text, and Image/Diagram Mining
• Workshops for training and training trainers
• Bottom-up community development
– Bioscience (EuropePMC, BBSRC)
– Disease (Ebola)
– Astrophysics (Stray Toaster)
– Chemistry (TSB, EBI, PennState - Citeseer)
• We fight for Justice and Freedom

ContentMine People
• Jenny Molloy
• Ross Mounce
• Peter Murray-Rust + volunteers (Bioscience, disease)
• Richard Smith-Unna + 20 quickscrape volunteers
• Steph Unna
• Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones)
• Prof Charles Oppenheim
• Karien Bezuidenhout (Shuttleworth)
• Advisory Board RSN

ContentMine Workshops
(from 1 hour to a full day or more)
May to November 2014
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
Upcoming
• JISC
• LIBER
• BL
• Wellcome Trust
• WHO

Ebola Collaborators (Atlanta)
Roxanne Further Moore, Jessie Gunter, April Clyburne-Sherin

Regular Expressions
(Easier than Crosswords or Sudoku)
To match                          Regex                  Note
Ebola                             Ebola
Mali (not Malicious)              Mali\W                 \W (end of word)
Bat or bat                        [Bb]at                 alternatives
bat or bats                       bats?                  optional letter
Bat or Bats or bat or bats        [Bb]ats?
Sudden onset                      [Ss]udden\s+onset      \s+ (spaces)
Panthera leo or Gorilla gorilla   [A-Z][a-z]+\s+[a-z]+   ranges of letters
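
The patterns on this slide can be tried directly in any regex engine. The following is a minimal sketch in Python's `re` module, with the backslash escapes (`\W`, `\s`) written out; the helper name `matches`, the dictionary `SLIDE_PATTERNS` and the sample sentence are illustrative assumptions, not part of the ContentMine tools.

```python
import re

# Patterns from the slide; labels describe what each is meant to match.
SLIDE_PATTERNS = {
    "Ebola": r"Ebola",
    "Mali (not Malicious)": r"Mali\W",
    "Bat, bat, Bats or bats": r"[Bb]ats?",
    "Sudden onset": r"[Ss]udden\s+onset",
    "Genus species": r"[A-Z][a-z]+\s+[a-z]+",
}


def matches(pattern, text):
    """Return every non-overlapping match of pattern in text, in order."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Made-up sample text, only for trying the patterns out.
SAMPLE = "Malicious rumours aside, bats in Mali carry risk, as do Gorilla gorilla."

if __name__ == "__main__":
    for label, pattern in SLIDE_PATTERNS.items():
        print(label, "->", matches(pattern, SAMPLE))
```

Note that the genus/species pattern is deliberately simple: it also matches any capitalised word followed by a lower-case word (for example "Malicious rumours"), so real use would need a further check such as a species dictionary.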

Ebola regex
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="1.0" fields="marburg">(Marburg)</regex>
  <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
  <regex weight="0.5" fields="guinea">(Guinea)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="liberia">(Liberia)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
  <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
  <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex>
  <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
  <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex>
</compoundRegex>
15 mins to create, 15 mins to install and test
Or run online at CottageLabs
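
To give a feel for how such a compound regex could be applied, here is a minimal sketch that loads a few of the rules above and scans text line by line. It is not the AMI implementation: the function names `load_rules` and `scan` are hypothetical, only four of the rules are copied in, and the `weight` attribute is simply carried along because the slides do not say how AMI uses it.

```python
import re
import xml.etree.ElementTree as ET

# A subset of the compoundRegex from the slide (raw string keeps the backslashes).
COMPOUND_REGEX_XML = r"""
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola">(Ebola)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
</compoundRegex>
"""


def load_rules(xml_text):
    """Parse <regex> children into (field, weight, compiled pattern) tuples."""
    root = ET.fromstring(xml_text.strip())
    return [
        (rule.get("fields"), float(rule.get("weight")), re.compile(rule.text))
        for rule in root.findall("regex")
    ]


def scan(lines, rules):
    """Return (line number, field, weight, captured text) for every match."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for field, weight, pattern in rules:
            for match in pattern.finditer(line):
                hits.append((number, field, weight, match.group(1)))
    return hits


if __name__ == "__main__":
    text = [
        "There have been 14 413 reported Ebola cases in eight countries",
        "Case incidence continues to increase in Sierra Leone",
    ]
    for hit in scan(text, load_rules(COMPOUND_REGEX_XML)):
        print(hit)
```

Keeping the rules in XML and the scanning logic in code means a domain expert can add or reweight terms without touching the program.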

Results of Regex on Ebola
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
        name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
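
Output in this shape can be post-processed with any XML library. The sketch below reads an abbreviated copy of the excerpt with Python's `xml.etree.ElementTree` and lists the hits per line; the closing `</results>` and `</resultsList>` tags, the shortened `lineValue` text and the function name `read_hits` are assumptions added so that the sample parses, since the slide shows only the start of the file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

AMI_NS = "{http://www.xml-cml.org/ami}"

# Abbreviated, closed-off copy of the excerpt (closing tags are assumed).
RESULTS_XML = r"""
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue="There have been 14 413 reported Ebola cases">
        <regex xmlns="" weight="1.0" fields="[ebola]"><pattern>(Ebola)</pattern></regex>
        <hits xmlns=""><hit ebola="Ebola" /></hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="Case incidence continues to increase in Sierra Leone">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]"><pattern>(Sierra\s+Leone)</pattern></regex>
        <hits xmlns=""><hit sierra_leone="Sierra Leone" /></hits>
      </regex>
    </result>
  </results>
</resultsList>
"""


def read_hits(xml_text):
    """Return (line number, field, matched text) for every <hit> attribute."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    # Only the outer, namespaced <regex> elements carry lineNumber.
    for outer in root.iter(AMI_NS + "regex"):
        line_number = int(outer.get("lineNumber"))
        for hit in outer.iter("hit"):
            for field, value in hit.attrib.items():
                hits.append((line_number, field, value))
    return hits


if __name__ == "__main__":
    found = read_hits(RESULTS_XML)
    print(found)
    print(Counter(field for _, field, _ in found))
```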

Demo of Content Mining
ChemicalTagger (Lezan Hawizy): a shallow, domain-specific, semantic parser for un/natural language.

[Figure: bacterial phylogenetic tree. Labels: Genbank ID; American Type Culture Collection; WP: Clostridium_butyricum]
Our machines have read and interpreted 4300 in an hour with > 95% accuracy
Trees from http://ijs.sgmjournals.org/, used under new UK legislation (Hargreaves)

[Diagram labels: Scientific literature; Queues; Repos; Science Plugins; Science Volunteers; Collaboration with Open Access Button]
Key: RSU = Richard Smith-Unna, PMR = Peter Murray-Rust, CL = CottageLabs

AMI (extraction) architecture
[Diagram labels: PDF2SVG; SVG2XML; Image analysis; sections; tables; captioned diagrams; AMI; Regex, Species, Phylo, Chem]
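
One way to picture the extraction stage is as a set of named extractors applied to the same text. The sketch below assumes that labels such as Regex and Species stand for pluggable extractors; every name in it (`PLUGINS`, `run_plugins`, the two plugin functions) is hypothetical and it does not reproduce AMI's actual interfaces.

```python
import re
from typing import Callable, Dict, List

# A plugin takes plain text and returns the strings it extracted.
Plugin = Callable[[str], List[str]]


def regex_plugin(text: str) -> List[str]:
    """Illustrative term matcher using two terms from the Ebola regex slide."""
    return re.findall(r"Ebola|Marburg", text)


def species_plugin(text: str) -> List[str]:
    """Naive binomial-name matcher using the pattern from the regex slide."""
    return re.findall(r"[A-Z][a-z]+\s+[a-z]+", text)


# Hypothetical registry; real extractors (Phylo, Chem) would be far richer.
PLUGINS: Dict[str, Plugin] = {
    "regex": regex_plugin,
    "species": species_plugin,
}


def run_plugins(text: str, names: List[str]) -> Dict[str, List[str]]:
    """Apply each named plugin to the text and collect results by name."""
    return {name: PLUGINS[name](text) for name in names}


if __name__ == "__main__":
    sentence = "samples from Gorilla gorilla tested positive for Ebola."
    print(run_plugins(sentence, ["regex", "species"]))
```

The registry keeps each extractor independent, so a new discipline can be supported by adding one function rather than changing the pipeline.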

Immediate Stakeholders
– Researchers (bio, EBI, chem, materials, astro)
– Funders (WT, FWF (Austria), RCUK)
– Libraries (repositories, theses)
– Service providers (EuropePMC)
– knowledge-based SMEs
– Library organisations (JISC, RLUK, LIBER, SPARC)
– Non-profits (Wikimedia, WHO, Mozilla)

Content production
• Scholarly articles
• Theses
• Repositories
• Grey scientific literature
• Grey politico-socio-legal literature
• Company output (reports, accounts, contracts), e.g. OpenOil

Licences destroy Content Mining
WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R
STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf
(Summary: PMR has NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”

Challenges
• Active opposition from content “owners”, including serious lobbying and FUD
• Ignorance and apathy from universities; inappropriate reward system
• Sub-optimal technology of publishers
• Lack of common infrastructure, technology, APIs
• And it’s objectively messy anyway

Technical problems
• PDF: lacks words, tables, diagrams
• Non-Unicode character sets (or worse)
• Graphics objects largely destroyed (converted to PNG or worse)
• No communal ontology for document structure
• HTML carries PublisherJunk and JavaScript
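
As an illustration of the last point, scripts and stylesheets have to be stripped before any text mining can start. The following is a minimal sketch using Python's standard `html.parser`; the class name `TextOnly` and function `visible_text` are hypothetical, and the sketch does not attempt to remove publisher navigation, adverts or other page furniture, which needs site-specific rules.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of an HTML page with script and style content removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><head><script>var tracker = 1;</script>"
        "<style>p { color: red; }</style></head>"
        "<body><p>Ebola cases rose.</p></body></html>"
    )
    print(visible_text(page))
```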

Goals of Mining
• Classification of resources
• Entity extraction and indexing
• Aggregation within discipline
• Inter-disciplinary, e.g. biodiversity, phytochemistry
• Repurposing (twitter, ePub, annotation)
• Semantification/intelligent documents
• Detection of error and fraud

What we need
• Inter/national commitment to infrastructure
• Common ontologies and APIs
• Development of community
• Go beyond academia; non-academic reward system
