The document discusses using text mining and ontologies to improve knowledge discovery of microbial diversity from scientific literature and databases. It describes extracting habitat information for microorganisms from unstructured text sources and mapping it to structured classifications like the OntoBiotope ontology. This allows habitat descriptions to be standardized and enables better search, analysis and sharing of microbial isolation site data.
Text-mining and ontologies - new approaches to knowledge discovery of microbial diversity
1.
4th International Conference on Microbial Diversity, 2017, Bari
4th International Microbial Diversity Conference, Bari - Nov. 2017
Text-mining and ontologies: new approaches to knowledge discovery of microbial diversity
Claire Nédellec, Bibliome MaIAGE
2.
Microbial diversity, information sources
Where do micro-organisms live?
Critical information that is collected and stored in many public databases
Huge amount of isolation site information on micro-organisms
• Data sources: organism collections, sequence databases, ...
• Documents: scientific papers, reports
7 million PubMed references on micro-organism habitats [Deléger et al., 2016]
Often available for automatic pipelines: on-line access, programming interface
But under-exploited because expressed in unstructured free text
• 24,150 "isolated from" entries in BacDive (DSMZ)
• 18,000 "isolation" entries in ATCC
• 25,000 "isolation site" entries for bacteria & archaea in the Genomes OnLine Database
[Charts: number of articles about "bacteria" in PubMed; number of complete genome sequences at JGI]
3.
From free text to knowledge
Isolation site, always in free text
Unified representation of habitat descriptions: a major challenge for data access and curation
⇒ Facilitate information access by reference keywords
⇒ Enable interoperability among databases
⇒ Enhance databases by scientific published knowledge
GenBank example
Species TaxID Isolation site
Acetobacter lovaniensis 104100 fermented dairy products
Acetobacter lovaniensis 104100 fermented rice flour
Acetobacter lovaniensis 104100 vinegar
Acetobacter lovaniensis 104100 water kefir
Habitat class for these isolation sites: fermented food
Needs
1. A classification of habitats relevant to microorganism studies
2. Information extraction method for mapping free text entities to the classes
OntoBiotope Ontology
Alvis text-mining Suite
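The GenBank example can be sketched as a lookup from free-text isolation sites to a reference habitat class. This is only a minimal illustration: the table `ISOLATION_SITE_TO_CLASS` and the function `group_by_habitat_class` are hypothetical names, and a hand-written dictionary stands in for the ontology and the extraction method named above.

```python
from collections import defaultdict

# Hypothetical lookup table: free-text isolation site -> reference habitat class.
# The four isolation sites and the class "fermented food" come from the slide.
ISOLATION_SITE_TO_CLASS = {
    "fermented dairy products": "fermented food",
    "fermented rice flour": "fermented food",
    "vinegar": "fermented food",
    "water kefir": "fermented food",
}

# (species, TaxID, isolation site) rows of the GenBank example.
RECORDS = [
    ("Acetobacter lovaniensis", 104100, "fermented dairy products"),
    ("Acetobacter lovaniensis", 104100, "fermented rice flour"),
    ("Acetobacter lovaniensis", 104100, "vinegar"),
    ("Acetobacter lovaniensis", 104100, "water kefir"),
]


def group_by_habitat_class(records, mapping):
    """Collect, for each (species, TaxID), the set of habitat classes it maps to."""
    grouped = defaultdict(set)
    for species, taxid, site in records:
        # Unknown sites are kept visible instead of being silently dropped.
        habitat_class = mapping.get(site.strip().lower(), "unclassified")
        grouped[(species, taxid)].add(habitat_class)
    return dict(grouped)


HABITATS = group_by_habitat_class(RECORDS, ISOLATION_SITE_TO_CLASS)
```

With a shared class, the four GenBank rows collapse to a single searchable habitat for the species.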
4.
Copyright Inra
Alvis pipeline - Florilège database
Mapping various terms to a habitat classification
PubMed document (PMID): 21549046, 21247298, 16204502, 15992268, 2116711, 2116712, 15992260, 1348242, 11530195, 23042180, 23208291, 10458115, 11456331, 21669068, 17954748, 8867607, 23433372, 26325149, 8977904, 23880504, 8227616, 16156701, 15553633, 20494189, 24715203, 21441322, 19114514, 2125110, 19254151, 22980010
Taxon: Listeria monocytogenes
Habitat (reference class): dairy farm
Habitat terms (term variation): Dairy farm, dairy farm environments, dairy farms, dairy farm environmental samples, environment of dairy farms, potential dairy farm, Dairy farm environmental samples, single dairy farm, Irish dairy farms, high-prevalence dairy farm, dairy farm environment, dairy farms of different size, local dairy farm, second Northwest dairy farm, dairy cattle farms, selected dairy farms, dairy farm, Dairy farms
10,000 habitats of Listeria monocytogenes in PubMed
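To make the term-variation problem concrete, here is a deliberately naive sketch that maps the listed variants to the reference class "dairy farm". It is a plain string heuristic (lowercasing, crude plural stripping, in-order token matching), not the method used by the Alvis pipeline; `normalize` and `matches_class` are hypothetical helper names.

```python
import re

REFERENCE_CLASS = "dairy farm"

# Habitat terms listed on the slide for Listeria monocytogenes.
TERMS = [
    "Dairy farm", "dairy farm environments", "dairy farms",
    "dairy farm environmental samples", "environment of dairy farms",
    "potential dairy farm", "Dairy farm environmental samples",
    "single dairy farm", "Irish dairy farms", "high-prevalence dairy farm",
    "dairy farm environment", "dairy farms of different size",
    "local dairy farm", "second Northwest dairy farm", "dairy cattle farms",
    "selected dairy farms", "dairy farm", "Dairy farms",
]


def normalize(term):
    """Lowercase, keep alphabetic tokens, and strip a trailing plural 's'."""
    tokens = re.findall(r"[a-z]+", term.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def matches_class(term, reference_class):
    """True if the class tokens occur, in order, among the term tokens."""
    remaining = iter(normalize(term))
    # `token in remaining` consumes the iterator, which enforces the ordering.
    return all(token in remaining for token in normalize(reference_class))


MATCHED = [term for term in TERMS if matches_class(term, REFERENCE_CLASS)]
```

A heuristic like this breaks down quickly on real text (synonyms, paraphrases, nested mentions), which is why a trained classifier and an ontology are needed in practice.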
5.
A classification with a hierarchical structure
Higher habitat classes needed for ecology & evolution studies
10,000 habitats of Listeria monocytogenes in PubMed
Alvis IR semantic search engine
Scientific paper extracts and their habitat classes:
• Listeria monocytogenes contamination in Chinese beef processing plants.
• Listeria monocytogenes isolated from artisanal Portuguese cheese-making dairy.
• the presence of L. monocytogenes in samples collected from crab processing plant
• L. monocytogenes persisting in a cold-smoked fish processing plant.
• two L. monocytogenes cheese dairy isolates
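The benefit of a hierarchy is that a query on a higher class also retrieves mentions classified under its descendants. The sketch below illustrates this with a toy child-to-parent table. Except for "artificial environment", the class names and every edge in `PARENT` are assumptions made for illustration, not the actual OntoBiotope structure.

```python
# Hypothetical child -> parent edges (not the real OntoBiotope hierarchy).
PARENT = {
    "beef processing plant": "food processing plant",
    "crab processing plant": "food processing plant",
    "fish processing plant": "food processing plant",
    "cheese dairy": "dairy",
    "dairy": "food processing plant",
    "food processing plant": "artificial environment",
}

# (text mention, assigned habitat class) pairs based on the extracts above.
MENTIONS = [
    ("Chinese beef processing plants", "beef processing plant"),
    ("crab processing plant", "crab processing plant"),
    ("cold-smoked fish processing plant", "fish processing plant"),
    ("cheese dairy", "cheese dairy"),
]


def ancestors(habitat_class, parent=PARENT):
    """Return the chain of classes above `habitat_class`, nearest first."""
    chain = []
    while habitat_class in parent:
        habitat_class = parent[habitat_class]
        chain.append(habitat_class)
    return chain


def search(query_class, mentions=MENTIONS):
    """Return mentions whose class is the query class or one of its descendants."""
    return [
        text
        for text, cls in mentions
        if cls == query_class or query_class in ancestors(cls)
    ]
```

Here `search("dairy")` returns only the cheese dairy mention, while `search("food processing plant")` returns all four.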
6.
OntoBiotope ontology
A large ontology dedicated to microorganism biotopes
What structure for the habitat classification
• Microbiology research domains
• Reuse of existing habitat classifications (ATCC, GOLD, FoodEx2)
• Gather habitats with similar physico-chemical properties
Ontology scope
• Extensive study of habitat terminology in text (databases and papers), e.g. paper mill sludge / anaerobic sludge of paper mill waste water
• Collaborations with microbiologists in focused projects (phytobiome, food microbiome)
Evaluation
• Text-mining benchmarks: Bacteria Biotope in BioNLP Shared Tasks
• Through its use in applications (e.g. food positive flora)
2329 habitat classes
492 synonyms
13 levels
7.
Habitats in OntoBiotope ontology
Distributed since 2012, http://agroportal.lirmm.fr/ontologies/ONTOBIOTOPE
The largest classes (bar chart)
Values: 14, 19, 21, 43, 55, 120, 281, 352, 369, 480, 801
Labels: experimental medium, aquaculture habitat, bacteria associated habitat, medical environment, agricultural habitat, habitat wrt chemico-physical property, artificial environment, living organism, natural environment, habitat part of living organism, food
49 classes in the gastrointestinal tract subtree
35 classes in the waste subtree
51 classes in the soil subtree
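Subtree sizes such as those quoted above can be obtained by traversing the class hierarchy from a chosen root. The sketch below assumes a parent-to-children table; the edges in `CHILDREN` are invented for illustration, and counting the root itself as part of the subtree is also an assumption.

```python
# Hypothetical parent -> children edges; the real ontology is far larger.
CHILDREN = {
    "natural environment": ["soil", "aquatic environment"],
    "soil": ["forest soil", "agricultural soil"],
    "agricultural soil": ["rice paddy soil"],
}


def subtree_classes(root, children):
    """Return the set of classes in the subtree rooted at `root`, root included."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in seen:  # guards against classes reachable by several paths
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen


SOIL_SUBTREE = subtree_classes("soil", CHILDREN)
```

Using a set keeps the count correct when a class has more than one parent.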
9.
Information extraction from text and mapping to the habitat classes
Information Extraction: from the text of articles to a formal representation of the information
Article text
Bifidobacterium longum subsp. longum is found in newborn infant as a normal component of gut flora.
Information
Bacteria: Bifidobacterium longum subsp. longum [taxid: 1679]
hosted by: newborn infant [baby]
lives_in: gut [intestine]
Ontology (simplified view): lives_in, newborn, gut
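One possible way to hold such a formal representation in code is a pair of small record types, one for normalized entities and one for relations between them. This is a sketch under assumptions: the class and field names (`Entity`, `Relation`, `normalized`) are hypothetical and do not reflect the actual Alvis data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str        # mention as it appears in the article
    kind: str        # broad type, e.g. "Bacteria" or "Habitat"
    normalized: str  # taxon identifier or ontology class label


@dataclass(frozen=True)
class Relation:
    label: str
    subject: Entity
    target: Entity


# Values taken from the Bifidobacterium longum example.
bacterium = Entity("Bifidobacterium longum subsp. longum", "Bacteria", "taxid:1679")
host = Entity("newborn infant", "Habitat", "baby")
site = Entity("gut", "Habitat", "intestine")

FACTS = [
    Relation("hosted by", bacterium, host),
    Relation("lives_in", bacterium, site),
]


def as_triples(relations):
    """Flatten relations into (subject, label, target) tuples of normalized values."""
    return [(r.subject.normalized, r.label, r.target.normalized) for r in relations]
```

Keeping both the original mention and its normalized form makes it possible to trace each fact back to the article text.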
10.
Information extraction and classification - Process
Example text: "... virulence of aquatic pathogen Vibrio anguillarum towards sea bass larvae ..."
Artificial Intelligence methods (machine learning and natural language processing)
Implemented in several components (more than one hundred) of the Alvis text-mining pipeline.
1. Entity recognition = identification (text boundaries) and broad type assignment
2. Entity classification = assignment of an OntoBiotope class
3. Relationship prediction = links microorganism mentions to their habitats in the text
[Figure: annotated example. Labels: Microbial species, Habitat, Habitat, aquatic environment, marine farm fish, Dicentrarchus labrax, larvae, Lives in, TaxID5560]
Ratkovic et al., BMC Bioinformatics, 2012
Nédellec et al., Handbook on Ontology, 2009
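The three steps can be mimicked on the example sentence with a toy, dictionary-based stand-in. This sketch does not reproduce the machine-learning components of the Alvis pipeline: the lexicons `TAXA` and `HABITAT_CLASSES` are hypothetical, and the pairing of each mention with a class label from the figure is an assumption.

```python
import re

SENTENCE = "virulence of aquatic pathogen Vibrio anguillarum towards sea bass larvae"

# Hypothetical lexicons standing in for trained models.
TAXA = {"Vibrio anguillarum"}
HABITAT_CLASSES = {
    "aquatic": "aquatic environment",
    "sea bass larvae": "marine farm fish",  # assumed pairing of mention and class
}


def recognize_entities(text):
    """Step 1: find mention boundaries and assign a broad type."""
    entities = []
    for kind, lexicon in (("Microbial species", TAXA), ("Habitat", HABITAT_CLASSES)):
        for term in lexicon:
            for m in re.finditer(re.escape(term), text):
                entities.append(
                    {"start": m.start(), "end": m.end(), "text": term, "type": kind}
                )
    return sorted(entities, key=lambda e: e["start"])


def classify_entities(entities):
    """Step 2: attach a habitat class to each habitat mention."""
    for e in entities:
        if e["type"] == "Habitat":
            e["class"] = HABITAT_CLASSES[e["text"]]
    return entities


def predict_relations(entities):
    """Step 3: link each microbe mention to each habitat mention in the sentence."""
    microbes = [e for e in entities if e["type"] == "Microbial species"]
    habitats = [e for e in entities if e["type"] == "Habitat"]
    return [(m["text"], "Lives in", h["class"]) for m in microbes for h in habitats]


RELATIONS = predict_relations(classify_entities(recognize_entities(SENTENCE)))
```

Linking every microbe to every habitat in the same sentence is the crudest possible relation rule; a real relation predictor has to decide which co-occurring pairs are actually related.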
11.
[Architecture diagram. Components: bibliographic sources; semantic resources (ontologies); information extraction; full-text data and metadata; services]
http://bibliome.jouy.inra.fr/demo/ontobiotope/alvisir2/webapi/search
Ba & Bossy, LREC 2016
13.
OntoBiotope pipeline, applied to PubMed
Text source: BioNLP-ST (data of the international competition on bacteria information extraction)
• Entity detection: recall 65%, precision 81%
• Detection and classification: recall 50%, precision 62%
• Relation (lives in): recall 70%, precision 51.4%
Text source: PubMed
• Documents: 2.3 million
• Habitats: 18.5 million
• Taxa: 8.4 million
• Relations: 7.2 million
Nédellec et al., BMC Bioinformatics, 2015
Ratkovic et al., BMC Bioinformatics, 2012
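Recall and precision are often summarized by their harmonic mean, the F1 score. The slide does not report F-scores, so the sketch below only shows how one could derive them from the figures given, assuming the relation column's 70 and 51,4 are percentages like the others.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per task, as fractions, taken from the BioNLP-ST figures.
SCORES = {
    "Entity detection": (0.81, 0.65),
    "Detection and classification": (0.62, 0.50),
    "Relation (lives in)": (0.514, 0.70),
}

F1 = {task: f1_score(p, r) for task, (p, r) in SCORES.items()}
```

Because the harmonic mean is pulled toward the lower of the two values, a task with high recall but modest precision still ends up with a moderate F1.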
14.
From research lab to infrastructure, a European Open Science perspective
Deployment on OpenMinTeD, the European text-mining infrastructure, offers the scientific communities:
• Fully open access in a unified framework
• Reproducibility and flexibility
• Full-text paper collection, and database aggregation and standardisation
Przybyła et al., Database, 2016
15.
On-line services
• Treemap visualization for biodiversity analytics
• Semantic relational search through all PubMed references
• Data integration
http://genome.jouy.inra.fr/Florilege/
16.
On-going projects, examples of application
Food positive flora (Florilège), MD Poster S2-23
• Characterization of biodiversity, phenotypes, uses and molecules produced/degraded
• Food innovation (nutrient production, biopreservation)
• 1 million phenotypes, 1.1 million taxon-phenotype relationships
Tracing the origin (FoodMicrobiome Transfert)
• Cheese ingredients and cheese processing bring unexpected strains
• Text-mining helps formulate plausible hypotheses on the source
Likelihood of organism identification (metagenomics), consistency with previous results (Visa TM project)
• Has this microorganism already been identified in this place? Of the same family? In a similar place? In a similar ecosystem?
[INRA - CNIEL]
[INRA Food WG]
[INRA, AgroPortal, Inist]
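The cascade of questions above (same place, same family, similar place) can be read as progressively weaker lookups against previously extracted taxon-habitat pairs. The following sketch is hypothetical throughout: `OBSERVED`, `FAMILY`, `HABITAT_PARENT`, and `prior_evidence` are illustrative names and toy data, not part of any project described here, and the "similar ecosystem" question is left out.

```python
# Hypothetical (taxon, habitat class) pairs previously extracted from text.
OBSERVED = {
    ("Streptococcus thermophilus", "yogurt"),
    ("Lactococcus lactis", "milk"),
}
# Toy taxonomy and habitat hierarchy used to relax the query.
FAMILY = {
    "Streptococcus thermophilus": "Streptococcaceae",
    "Lactococcus lactis": "Streptococcaceae",
}
HABITAT_PARENT = {"yogurt": "dairy product", "milk": "dairy product", "cheese": "dairy product"}


def prior_evidence(taxon, habitat):
    """Return the most specific kind of prior support found, or 'none'."""
    if (taxon, habitat) in OBSERVED:
        return "same taxon, same place"
    family = FAMILY.get(taxon)
    if family is not None and any(
        FAMILY.get(t) == family and h == habitat for t, h in OBSERVED
    ):
        return "same family, same place"
    parent = HABITAT_PARENT.get(habitat)
    if parent is not None and any(
        t == taxon and HABITAT_PARENT.get(h) == parent for t, h in OBSERVED
    ):
        return "same taxon, similar place"
    return "none"
```

For example, `prior_evidence("Lactococcus lactis", "cheese")` falls through to the third check, because the taxon is only known from milk, a sibling of cheese under the same parent class.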
17.
Conclusion
Millions of microorganism habitat descriptions, exponentially increasing
• Invaluable information for fundamental research and applications
• Largely underused because mostly expressed in free text
The OntoBiotope ontology and information extraction from text provide a formal representation of microorganism biotopes
They open up new research opportunities
• Not only for data curation and indexing in information systems
• But also for analysis in combination with experimental data, for integrative and predictive biology
A prime example is metagenomics & biodiversity in OpenMinTeD
18.
Acknowledgements and funding
Mouhamadou Ba, Baptiste Bohuon, Robert Bossy, Philippe Bessières, Estelle Chaix, Louise Deléger, Sandra Dérozier, Arnaud Ferré, Wiktoria Golik, Julien Jourde, Valentin Loux, Frédéric Papazian, Jean- Zorana Ratkovic, Dialekti Valsamou
MEM (Méta-omiques des Ecosystèmes Microbiens)