Overview of Practical Content Mining
Peter Murray-Rust
JISC, London, 2014-12-01
What is Content Mining
• Mining Text, Tables and Lists, Diagrams, Images
• Born-digital documents
• High-throughput (millions of items/year)
• Formal and Informal Collaboration
• Role of UK
• Hands-on
• Everything is OPEN (OSI, CC-BY, CC0)
The Right to Read is the Right to Mine
http://contentmine.org
ContentMine
• 1-2 year Shuttleworth Funding from 2014-03
• Free to everyone, Open Source, updated daily
• Structured Text, and Image/Diagram Mining
• Workshops for training and training trainers
• Bottom-up community development
– Bioscience (EuropePMC, BBSRC)
– Disease Ebola
– Astrophysics (Stray Toaster)
– Chemistry (TSB, EBI, PennState - Citeseer)
• We fight for Justice and Freedom
ContentMine People
• Jenny Molloy
• Ross Mounce
• Peter Murray-Rust + volunteers (Bioscience, disease)
• Richard Smith-Unna + 20 quickscrape volunteers
• Steph Unna
• Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones)
• Prof Charles Oppenheim
• Karien Bezuidenhout (Shuttleworth)
• Advisory Board RSN
ContentMine Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
Upcoming
• JISC
• LIBER
• BL
• Wellcome Trust
• WHO
Ebola Collaborators (Atlanta)
Roxanne Further Moore, Jessie Gunter, April Clyburne-Sherin
Regular Expressions
(Easier than Crosswords or Sudoku)
To match                             Regex                   Notes
Ebola                                Ebola
Mali (not Malicious)                 Mali\W                  (end of word)
Bat or bat                           [Bb]at                  (alternatives)
bat or bats                          bats?                   (optional letter)
Bat or Bats or bat or bats           [Bb]ats?
Sudden onset                         [Ss]udden\s+onset       (space/s)
Panthera leo or Gorilla gorilla      [A-Z][a-z]+\s+[a-z]+    (ranges of letters)
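The patterns on this slide can be tried out directly. The following is a minimal sketch using Python's standard `re` module, with the patterns written with the backslash escapes (`\W`, `\s`) that the character classes need. The dictionary keys and the helper name `find_all` are illustrative choices, not part of ContentMine.

```python
import re

# Patterns from the slide; raw strings keep the backslashes intact.
PATTERNS = {
    "mali": r"Mali\W",                    # "Mali" followed by a non-word character
    "bat": r"[Bb]ats?",                   # Bat, bat, Bats or bats
    "sudden_onset": r"[Ss]udden\s+onset", # one or more whitespace characters
    "binomial": r"[A-Z][a-z]+\s+[a-z]+",  # capitalised word, then a lowercase word
}


def find_all(name, text):
    """Return every non-overlapping match of the named pattern in text."""
    return re.findall(PATTERNS[name], text)


if __name__ == "__main__":
    print(find_all("mali", "Mali, not Malicious"))
    print(find_all("binomial", "Panthera leo and Gorilla gorilla"))
```

Note that the binomial pattern also matches any capitalised word followed by a lowercase word (for example "The outbreak"), so its hits usually need further filtering.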
Ebola regex
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="1.0" fields="marburg">(Marburg)</regex>
  <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
  <regex weight="0.5" fields="guinea">(Guinea)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="liberia">(Liberia)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
  <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
  <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex>
  <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
  <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex>
</compoundRegex>
15 mins to create, 15 mins to install and test
Or run online at CottageLabs
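To show how a rule file of this shape might be applied to text, here is a minimal Python sketch using only the standard library. It is not the AMI implementation: `SAMPLE_XML` is a three-rule excerpt, `load_rules` and `score_line` are hypothetical helper names, and summing the weights of matching rules is an assumption made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# A three-rule excerpt in the same shape as the slide's <compoundRegex>.
SAMPLE_XML = r"""
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola">(Ebola)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
</compoundRegex>
"""


def load_rules(xml_text):
    """Parse a <compoundRegex> document into (field, weight, compiled pattern) tuples."""
    root = ET.fromstring(xml_text)
    return [
        (el.get("fields"), float(el.get("weight")), re.compile(el.text))
        for el in root.findall("regex")
    ]


def score_line(line, rules):
    """Return (hits, score) for one line of text.

    hits maps each matching field to its first captured group.
    score is the sum of the matching rules' weights (an assumption,
    not documented AMI behaviour).
    """
    hits = {}
    score = 0.0
    for field, weight, pattern in rules:
        match = pattern.search(line)
        if match:
            hits[field] = match.group(1)
            score += weight
    return hits, score


if __name__ == "__main__":
    rules = load_rules(SAMPLE_XML)
    print(score_line("Case incidence continues to increase in Sierra Leone", rules))
```

Keeping the rules in XML and the matching logic in code means new terms can be added without touching the program, which fits the slide's point that a rule set takes minutes to create.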
Results of Regex on Ebola
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
        name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
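Output in this shape can be read back with any XML parser. The sketch below uses Python's `xml.etree.ElementTree`; `SAMPLE` is a shortened, closed-off excerpt written for illustration (the closing tags and the abbreviated `lineValue` are assumptions), and `list_hits` is a hypothetical helper name. The one subtlety is the mix of namespaces: the outer `regex` element is in the `http://www.xml-cml.org/ami` namespace, while `hits` and `hit` are in no namespace.

```python
import xml.etree.ElementTree as ET

# ElementTree writes namespaced tags as "{namespace}localname".
AMI_NS = "{http://www.xml-cml.org/ami}"

# Shortened excerpt with closing tags added so that it parses on its own.
SAMPLE = """
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7" lineValue="reported Ebola cases">
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
  </results>
</resultsList>
"""


def list_hits(xml_text):
    """Return (line number, field, matched text) for every hit in the results."""
    root = ET.fromstring(xml_text)
    rows = []
    # Only the outer <regex> elements (in the AMI namespace) carry lineNumber.
    for outer in root.iter(AMI_NS + "regex"):
        line_number = int(outer.get("lineNumber"))
        for hit in outer.iter("hit"):
            for field, value in hit.attrib.items():
                rows.append((line_number, field, value))
    return rows


if __name__ == "__main__":
    for row in list_hits(SAMPLE):
        print(row)
```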
Demo of Content Mining
ChemicalTagger (Lezan Hawizy): a shallow, domain-specific, semantic parser for un/natural language.
Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 in an hour with > 95% accuracy
Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
[Figure annotations: WP: Clostridium_butyricum; Genbank ID; American Type Culture Collection]
[Diagram. Legend: RSU: Richard Smith-Unna; PMR: Peter Murray-Rust; CL: CottageLabs. Labels: Queues; Repos; Scientific literature; Science Plugins; Science Volunteers]
Collaboration with Open Access Button
AMI (extraction) architecture
[Diagram labels: PDF2SVG; Image analysis; SVG2XML; AMI: Regex, Species, Phylo, Chem; tables; sections; captioned diagrams]
Immediate Stakeholders
– Researchers (bio, EBI, chem, materials, astro)
– Funders (WT, FWF (Austria), RCUK)
– Libraries (repositories, theses)
– Service providers (EuropePMC)
– knowledge-based SMEs
– Library organisations (JISC, RLUK, LIBER, SPARC)
– Non-profits (Wikimedia, WHO, Mozilla)
Content production
• Scholarly articles
• Theses
• Repositories
• Grey scientific literature
• Grey politico-socio-legal literature
• Company output (reports, accounts, contracts) (e.g. OpenOil)
STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf
(Summary: PMR has NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R
Licences destroy Content Mining
Challenges
• Active opposition from content “owners”, including serious lobbying and FUD
• Ignorance and apathy from universities; inappropriate reward system
• Sub-optimal technology of publishers
• Lack of common infrastructure, technology, APIs
• And it’s objectively messy anyway
Technical problems
• PDF: lacks words, tables, diagrams
• Non-Unicode character sets (or worse)
• Graphics objects largely destroyed (converted to PNG or worse)
• No communal ontology for document structure
• HTML carries PublisherJunk and Javascript
Goals of Mining
• Classification of resources
• Entity extraction and indexing
• Aggregation within discipline
• Inter-disciplinary, e.g. biodiversity, phytochemistry
• Repurposing (twitter, ePub, annotation)
• Semantification/intelligent documents
• Detection of error and fraud
What we need
• Inter/national commitment to infrastructure
• Common ontologies and APIs
• Development of community
• Go beyond academia; non-academic reward system

Overview of Practical Content Mining

  • 1. Overview of Practical Content Mining Peter Murray-Rust JISC, London, 2014-12-01
  • 2. What is Content Mining • Mining Text, Tables and Lists, Diagrams, Images • Born-digital documents • High-throughput (millions of items/year) • Formal and Informal Collaboration • Role of UK • Hands-on • Everything is OPEN (OSI , CC-BY, CC0)
  • 3. The Right to Read is the Right to Mine http://contentmine.org
  • 4. ContentMine • 1-2 year Shuttleworth Funding from 2014-03 • Free to everyone, Open Source, updated daily • Structured Text, and Image/Diagram Mining • Workshops for training and training trainers • Bottom-up community development – Bioscience (EuropePMC, BBSRC) – Disease Ebola – Astrophysics (Stray Toaster) – Chemistry (TSB, EBI, PennState - Citeseer) • We fight for Justice and Freedom
  • 5. ContentMine People • Jenny Molloy • Ross Mounce • Peter Murray-Rust + volunteers (Bioscience, disease) • Richard Smith-Unna + 20 quickscrape volunteers • Steph Unna • Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones) • Prof Charles Oppenheim • Karien Bezuidenhout (Shuttleworth) • Advisory Board RSN
  • 6. ContentMine Workshops (1-hour -> full day or more) 2014-May->Nov • Budapest/Shuttleworth • Leicester Univ • Electronic Theses and Dissertations • Austrian Science Fund AT • OKFest DE • Eur. Bioinformatics Institute • Open Science Rio de Janeiro BR • Sci DataCon , Delhi IN • Univ of Chicago US • OpenCon 2014, Wash DC. US Upcoming • JISC • LIBER • BL • Wellcome Trust • WHO
  • 7. Ebola Collaborators (Atlanta) Roxanne Further Moore, Jessie Gunter, April Clyburne-Sherin
  • 8. Regular Expressions (Easier than Crosswords or Sudoku) Ebola Ebola Mali (not Malicious) MaliW (end of word) Bat or bat [Bb]at (alternatives) bat or bats bats? (optional letter) Bat or Bats or bat or bats [Bb]ats? Sudden onset [Ss]uddens+onset (space/s) Panthera leoor Gorilla gorilla [A-Z][a-z]+s+[a-z]+ (ranges of letters)
  • 9. Ebola regex • <compoundRegex title="ebola"> • <regex weight="1.0" fields="ebola" case="">(Ebola)</regex> • <regex weight="1.0" fields="marburg">(Marburg)</regex> • <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex> • <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex> • <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex> • <regex weight="0.5" fields="guinea">(Guinea)</regex> • <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex> • <regex weight="0.5" fields="liberia">(Liberia)</regex> • <regex weight="0.5" fields="mali">(Mali)\W</regex> • <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex> • <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex> • <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex> • <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex> • <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex> • <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex> • </compoundRegex> 15 mins to create, 15 mins to install and test. Or run online at CottageLabs
  • 10. Results of Regex on Ebola • <resultsList xmlns="http://www.xml-cml.org/ami"> • <results xmlns=""> • <source xmlns="http://www.xml-cml.org/ami" • name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" /> • <result> • <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7" • lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak "> • <regex xmlns="" weight="1.0" fields="[ebola]"> • <pattern>(Ebola)</pattern> • </regex> • <hits xmlns=""> • <hit ebola="Ebola" /> • </hits> • </regex> • </result> • <result> • <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9" • lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains "> • <regex xmlns="" weight="0.5" fields="[sierra_leone]"> • <pattern>(Sierra\s+Leone)</pattern> • </regex> • <hits xmlns=""> • <hit sierra_leone="Sierra Leone" /> • </hits> • </regex> • </result>
  • 11. Demo of Content Mining ChemicalTagger (Lezan Hawizy) a shallow, domain-specific, semantic parser for un/natural language.
  • 12. Bacterial WP_phylogenetic tree Our machines have read and interpreted 4300 in an hour with > 95% accuracy Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves) WP: Clostridium_butyricum Genbank ID American Type Culture Collection
  • 13. RSU: Richard Smith-Unna PMR: Peter Murray-Rust CL: CottageLabs Queues Repos Scientific literature Science Plugins Science Volunteers Collaboration with Open Access Button
  • 14. AMI (extraction) architecture PDF2SVG Image analysis SVG2XML Regex Species Phylo Chem AMI tables sections captioned diagrams
  • 15. Immediate Stakeholders – Researchers (bio, EBI, chem, materials, astro) – Funders (WT, FWF (Austria), RCUK) – Libraries (repositories, theses) – Service providers (EuropePMC) – knowledge-based SMEs – Library organisations (JISC, RLUK, LIBER, SPARC) – Non-profits (Wikimedia, WHO, Mozilla)
  • 16. Content production • Scholarly articles • Theses • Repositories • Grey scientific literature • Grey politico-socio-legal literature • Company output (reports, accounts, contracts) (e.g. OpenOil)
  • 17. STM Publishers Licence 2012_03_15_Sample_Licence_Text_Data_Mining.pdf (Summary: PMR has NO rights) • [cannot publish to: ] “libraries, repositories, or archives” • [cannot] “Make the results of any TDM Output available on an externally facing server or website” • “Subscriber shall pay a […] fee” Heather Piwowar: “negotiating with publishers [made me physically ill]” WE WALKED OUT • Brit Library • JISC • RLUK • OKFN • … • Ross Mounce • PM-R Licences destroy Content Mining
  • 18. Challenges • Active opposition from content “owners” including serious lobbying and FUD • Ignorance and apathy from universities; inappropriate reward system • Sub-optimal technology of publishers • Lack of common infrastructure, technology, APIs • And it’s objectively messy anyway
  • 19. Technical problems • PDF: lacks words, tables, diagrams • Non-Unicode character sets (or worse) • Graphics objects largely destroyed (converted to PNG or worse) • No communal ontology for document structure. • HTML carries PublisherJunk and Javascript
  • 20. Goals of Mining • Classification of resources • Entity extraction and indexing • Aggregation within discipline • Inter-disciplinary, e.g. biodiversity, phytochemistry • Repurposing (twitter, ePub, annotation) • Semantification/intelligent documents • Detection of error and fraud
  • 21. What we need • Inter/national commitment to infrastructure • Common ontologies and APIs • Development of community • Go beyond academia; non-academic reward system
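
The regular-expression slides (8 to 10) can be tried out with a few lines of Python. What follows is a minimal sketch, not the AMI implementation: the names `SLIDE8_PATTERNS`, `EBOLA_REGEXES`, `find_hits` and `score` are hypothetical, only a subset of the Ebola patterns is included, and summing the weights is an assumed scoring rule, since the slides do not say how AMI combines them. The patterns are written with the backslashes (`\s` for whitespace, `\W` for a non-word character) that regex syntax requires.

```python
import re

# Single patterns in the style of slide 8.
SLIDE8_PATTERNS = {
    "mali": r"Mali\W",                   # "Mali" followed by a non-word character
    "bat": r"[Bb]ats?",                  # Bat, bat, Bats or bats
    "sudden_onset": r"[Ss]udden\s+onset",
    "binomial": r"[A-Z][a-z]+\s+[a-z]+", # e.g. "Panthera leo"
}

# (weight, field, pattern) triples modelled on a subset of the slide 9 compound regex.
# The tuple layout is an assumption made for this sketch.
EBOLA_REGEXES = [
    (1.0, "ebola", r"(Ebola)"),
    (0.8, "sudden_onset", r"([Ss]udden\s+onset)"),
    (0.5, "sierra_leone", r"(Sierra\s+Leone)"),
    (0.5, "liberia", r"(Liberia)"),
    (0.5, "mali", r"(Mali)\W"),
]


def find_hits(text):
    """Scan text line by line and return one dict per regex match."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for weight, field, pattern in EBOLA_REGEXES:
            for match in re.finditer(pattern, line):
                hits.append({
                    "line": line_number,
                    "field": field,
                    "weight": weight,
                    "value": match.group(1),  # text captured by the first group
                })
    return hits


def score(hits):
    """Sum the weights of all hits (assumed rule, for illustration only)."""
    return sum(hit["weight"] for hit in hits)


if __name__ == "__main__":
    sample = (
        "There have been 14 413 reported Ebola cases in eight countries\n"
        "Case incidence continues to increase in Sierra Leone"
    )
    for hit in find_hits(sample):
        print(hit)
    print("score:", score(find_hits(sample)))
```

One caveat with `Mali\W`: it needs a character after "Mali", so it will not match when "Mali" is the very last word of the input with nothing following it; `Mali\b` is a common alternative if that case matters.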