1. BIODIVERSITY OCCURRENCES AND
PATTERNS FROM THE ANGLE OF
SYSTEMATICS
Julien TROUDET
Thesis supervisors: Régine Vignes-Lebbe, Frédéric Legendre
Institut de Systématique, Evolution, Biodiversité
ISYEB - UMR 7205 - CNRS, MNHN, UPMC, EPHE, Sorbonne Université
Equipe Evolution fonctionnelle et Systématique (EVOFONC)
Laboratoire Informatique & Systématique (LIS)
Jury:
■ ANTONELLI Alexandre, University of Gothenburg, Reviewer
■ LESSARD Jean-Philippe, Concordia University, Reviewer
■ ARCHAMBEAU Anne-Sophie, GBIF France, Examiner
■ PAGE Rod, University of Glasgow, Examiner
■ VIGNES-LEBBE Régine, UMR 7205 ISYEB MNHN, Thesis supervisor
■ LEGENDRE Frédéric, UMR 7205 ISYEB MNHN, Thesis supervisor
3. Introduction
Biodiversity is eroding at an accelerating pace.
Figure: remaining populations of indigenous species as a percentage of their original populations (Newbold et al. 2016).
This sense of urgency has been seized by ecologists and conservationists:
▪ Has land use pushed terrestrial biodiversity beyond the planetary boundary? A global assessment (Newbold et al. 2016)
▪ Biodiversity loss and its impact on humanity (Cardinale et al. 2012)
▪ Biodiversity hotspots for conservation priorities (Myers et al. 2000)
4. Introduction
Systematics also takes up the challenge and gives itself the means to respond to this urgency.
▪ Post-molecular systematics and the future of phylogenetics (Pyron 2015)
▪ How Many Kinds of Birds Are There and Why Does It Matter? (Barrowclough et al. 2016)
▪ Assessing data quality in citizen science (Kosmala et al. 2016)
Systematics produces data essential to biodiversity sciences: ecology, phylogenetics, conservation, taxonomy, pest control.
5. Introduction
The angle of systematics, a complementary perspective
Systematists are the largest producers of biodiversity occurrences, with 1.2 to 2.1 billion specimens in museum collections (Ariño 2010).
Systematists have a unique point of view on biodiversity, giving all taxa the same significance and adding a historical context to the study of biodiversity.
Figure credit: Ciccarelli et al. 2006
6. Introduction
The angle of systematics, a complementary perspective
As data producers, systematists have a special position to characterize biodiversity data before using it.
Two framings: data producers and users; considering the largest taxonomic scale possible to produce generalizable outputs.
▪ How is the practice of biodiversity data gathering evolving?
▪ Can biological diversity be investigated in its entirety?
▪ Global biodiversity patterns at large taxonomic scale: which factors shape them?
8. Plan
▪ Biodiversity occurrences
▪ Methods for big data
▪ Fewer specimens, more observations
▪ Taxonomic bias and societal preferences
▪ Productivity shapes the latitudinal diversity gradient
▪ Conclusion
10. Occurrences = Primary Biodiversity Data
What? Where? When?
Scientists, especially systematists, are the leading producers of biodiversity data.
Citizen science projects produce biodiversity data for specific needs, under scientific supervision.
Networks of amateur naturalists are important producers of data, especially for birds and other vertebrate taxa.
11. Integration of occurrences in databases
Primary biodiversity data are created by many producers. Most of it comes from one of two routes: digitizing existing data (collection digitization) or producing new data (data production).
Figure: a non-exhaustive map of the global and European biodiversity informatics landscape (Bingham et al. 2017).
14. From occurrences to databases
What is GBIF?
GBIF (the Global Biodiversity Information Facility) is an open-data research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere access to data about all types of life on Earth.
■ 1,118 Publishers
■ 36,825 Datasets
■ 856,055,455 occurrences
Figure: the Global Biodiversity Information Facility (GBIF) connections (Bingham et al. 2017).
17. A dataset of 626 million occurrences
Exponential growth
Figure: accumulation of occurrences in the GBIF.
The number of occurrences mediated by the GBIF is growing exponentially. 57 million occurrences were recorded in 2014, more than 5 times the amount recorded in 2004 (11 million).
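As a rough check of these two figures, the short sketch below computes their ratio and the average annual growth rate they would imply. The constant year-over-year growth rate between 2004 and 2014 is an assumption made here for illustration; it is not a result from the thesis.

```python
# Yearly counts quoted on the slide (millions of occurrences recorded per year).
recorded_2004 = 11.0
recorded_2014 = 57.0
years = 2014 - 2004

# How many times more data were recorded in 2014 than in 2004.
ratio = recorded_2014 / recorded_2004

# Assumption (illustrative only): constant exponential growth between the two years.
annual_growth = ratio ** (1 / years) - 1

print(f"ratio 2014/2004: {ratio:.2f}")
print(f"implied average annual growth: {annual_growth:.1%}")
```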
The uncompressed volume of the GBIF data is approximately 500 gigabytes, 400 times smaller than the volume of data that the Gaia mission will produce (200 terabytes).
Handling so many occurrences is a methodological challenge.
18. DwCSP, a fast biodiversity occurrence curator (in preparation for Bioinformatics)
A custom software tool
Darwin Core Spatial Processor
A tool to manipulate large amounts of primary biodiversity data:
▪ Data enrichment using spatial files (shapefiles and raster files)
▪ Spatial outlier detection
▪ Environmental outlier detection
Manipulating the GBIF data required setting up multiple systems, scripts and databases to clean and filter the data, compute statistics, visualize results on a map, etc.
Some tools used during the PhD: Java, R, PostgreSQL, QGIS
620 million occurrences from more than 60,000 species were processed.
23. Working with the GBIF data
For each species:
1. Put the occurrences on a grid
2. Keep species with 20 or more occurrences
3. Detection of spatial outliers
4. Detection of climatic outliers
5. Extrapolating species distribution using niche modelling
Example: Lacerta bilineata
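Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the DwCSP implementation. The function names, the 1° cell size and the choice to count raw records rather than occupied cells for the 20-occurrence threshold are all assumptions.

```python
import math
from collections import defaultdict

def grid_cell(lat, lon, cell_size_deg=1.0):
    """Return the (row, col) index of the grid cell containing a point.

    Rows are counted from latitude -90, columns from longitude -180.
    """
    return (math.floor((lat + 90.0) / cell_size_deg),
            math.floor((lon + 180.0) / cell_size_deg))

def grid_and_filter(occurrences, min_occurrences=20, cell_size_deg=1.0):
    """Map each well-sampled species to the set of grid cells it occupies.

    `occurrences` is an iterable of (species, latitude, longitude) tuples.
    Assumption: the threshold applies to the number of raw records.
    """
    by_species = defaultdict(list)
    for species, lat, lon in occurrences:
        by_species[species].append((lat, lon))

    kept = {}
    for species, points in by_species.items():
        if len(points) >= min_occurrences:
            kept[species] = {grid_cell(lat, lon, cell_size_deg)
                             for lat, lon in points}
    return kept

# Synthetic example: 25 records for one species, 3 for another.
records = [("Lacerta bilineata", 45.0 + 0.1 * i, 1.0) for i in range(25)]
records += [("Rare species", 10.0, 10.0)] * 3
print(sorted(grid_and_filter(records)))
```

The coordinates above are made up; only the species with at least 20 records is kept.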
25. A large and heterogeneous data set
The amount of GBIF-mediated data is increasing exponentially.
GBIF-mediated data are very heterogeneous because of the numerous data producers.
Figure: accumulation of occurrences in the GBIF.
How is the practice of biodiversity data gathering evolving?
▪ How do recent and old biodiversity occurrences differ?
▪ Does the increase in data quantity come with an increase in data quality?
26. Two types of occurrences
Specimen-based and observation-based occurrences are not identical.
The possible uses for an observational occurrence are limited by the ancillary data collected during the observation, whereas a specimen can be analyzed in various ways at a later stage.
27. The increasing disconnection of primary biodiversity data from specimens: How does it happen and how to handle it? (Systematic Biology, under review)
A general shift
The number of observation-based occurrences added to the GBIF is growing at an exponential rate, while the number of specimen-based occurrences stays stable.
In proportion, a clear shift is visible for the 24 taxonomic classes studied.
Figure: proportion of occurrences by year (panels: Actinopterygii, Aves, Insecta, Magnoliopsida, Reptilia).
28. Ancillary data as the solution?
Observation-based occurrences can be complemented with ancillary data.
Ancillary data, such as DNA sequences or multimedia files (photo, video, recordings…), are more and more affordable to collect.
Still, very few GBIF-mediated occurrences are linked to digital or molecular data, and most ancillary data are linked to specimen-based occurrences.
Yet ancillary data would be most useful to check or update observation-based occurrences.
29. Recent data are of better quality
▪ Spatial precision is improving (spatial issues decrease)
▪ Taxonomic precision is improving
Overall, there is an improvement in the quality of biodiversity occurrences.
30. Conclusion & recommendations
More and more primary biodiversity data are observations, i.e. they are not linked to voucher specimens. In addition, a very small proportion of these observations have ancillary data. This situation weakens the possibilities for future biodiversity studies relying on these data.
Data producers have adopted the habit of providing precise GPS coordinates and taxonomic identification.
In the age of smartphones and global internet access, it should be a priority to encourage data producers to link pictures and other additional data to any occurrence they create.
We recommend prioritising the production of ancillary data in the following order:
1. Specimens
2. Material samples (DNA)
3. Multimedia files
4. Detailed observation
32. Biodiversity occurrences: a biased dataset
Biodiversity occurrences are not collected evenly. A well-known bias is the spatial bias (Meyer et al. 2015): some areas of the world are far more sampled than others. Similarly, some taxa are more studied than others.
This taxonomic bias has been studied at small scales:
▪ for a single field such as conservation (Di Marco et al. 2017)
▪ for specific taxa (Ford et al. 2017)
33. Biodiversity occurrences: a biased dataset
First recommendation of Faith et al. (2013): biases must be recognised in biodiversity sciences and efforts made to bridge them.
Why do only 17% of bird species in the GBIF have fewer than 20 occurrences, while 79% of insect species do?
We used 24 classes (large taxonomic scale).
We analyzed 626 million occurrences.
We tested the ‘societal preferences’ and ‘taxonomic research’ hypotheses:
▪ Public preferences influence and bias the choice of study organisms (Stahlschmidt 2011). Proxy: number of web pages.
▪ Scientific reasons and limitations lead and orient biodiversity data gathering. Proxy: number of scientific publications.
34. Taxonomic bias in biodiversity data and societal preferences (published in Scientific Reports, 2017)34
A bias affecting data quantity
In the GBIF, some groups have far more occurrences than others even if those groups are less speciose.

Class               Occurrences in the GBIF   Species in the GBIF   Median number of
                    (millions)                (thousands)           occurrences per species
Aves                345.11                    12.82                 371
Magnoliopsida       118.21                    261.01                19
Insecta             46.78                     352.78                3
Mammalia            10.78                     11.53                 15
Reptilia            4.98                      11.30                 24
Lecanoromycetes     4.97                      17.79                 8
Amphibia            3.94                      5.89                  54
Total in the GBIF   649.79                    1200.38               6
35. A bias that increases over time
Looking at the quantity of occurrences produced through time, the taxonomic bias is only worsening.
Figure: millions of occurrences through time.
36. A bias affecting data quality
The GBIF-mediated data also exhibit a taxonomic bias in data quality.
The proportion of occurrences identified at the species level varies across classes.
Values shown: 99%, 92%, 77%, 69%.
37. Societal preferences influence
Analysing more than 40,000 species, 39 of 47 generalized linear models showed a positive correlation between the quantity of data per species and the number of web pages for the species (societal influence).
▪ Vanessa atalanta: 462,000 Google results, 528,227 occurrences
▪ Misumena vatia: 137,000 Google results, 5,556 occurrences
38. Conclusion
▪ Public preferences influence the choice of study organisms.
▪ Scientific reasons and limitations orient biodiversity data gathering.
These results should encourage biodiversity researchers to communicate even more with the public. If taxonomic bias is linked to societal preferences, the rise of citizen science could further exacerbate this bias.
Hypotheses related to specific characteristics, such as species size or range, could also explain this bias and should be explored.
40. “Thus, the nearer we approach the tropics, the greater the increase in the variety of structure, grace of form, and mixture of colours, as also in perpetual youth and vigour of organic life.”
Alexander von Humboldt, 1807
41. Latitudinal Diversity Gradient (LDG)
The diversity of organisms tends to be highest near the equator and diminishes as we move towards the poles. This pattern is called the latitudinal diversity gradient.
Here biodiversity is quantified using species richness.
The next results focus on terrestrial taxa.
42. A multitude of hypotheses
Latitudinal Diversity Gradient: Geometric hypotheses revisited using massive biodiversity occurrences in plants and animals of the New World (in preparation)
More than 30 hypotheses have been formulated to explain the formation of the LDG (Willig et al. 2003).
▪ Historical hypotheses propose diversification mechanisms that were not tackled in this study.
▪ In situ hypotheses propose that environmental factors shape the LDG.
▪ Geometric hypotheses see the LDG as a geometrical artifact caused by random species distributions.
In situ hypotheses examples:
▪ Productivity hypothesis: there is a positive correlation between actual evapotranspiration (productivity) and the species richness of terrestrial birds (Hawkins et al. 2003).
▪ Ambient energy hypothesis: the species richness of terrestrial birds in western Palearctic areas is related to the annual temperature (Hawkins et al. 2003).
43. The geometric hypotheses
The first geometric hypothesis: Colwell & Hurtt (1994), Colwell & Lees (2000). This hypothesis is also called the mid-domain effect.
An updated and untested geometric hypothesis: Gross & Snyder-Beattie (2016). Considering that species have different latitudinal range sizes, the random location of those ranges on the globe would result in more species near the equator.
44. The dataset structure
Because of taxonomic and spatial biases (some areas are better sampled than others), we tested the LDG on 8 taxonomic classes in the New World:
▪ Amphibia
▪ Aves
▪ Liliopsida
▪ Magnoliopsida
▪ Mammalia
▪ Pinopsida
▪ Polypodiopsida
▪ Reptilia
208 million occurrences
62,099 species
For each class, a list of geographic cells covering the New World, with species richness and explanatory variable values, is computed.
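The species richness part of that per-cell list can be sketched as below. This is a hedged illustration only: the function name and the 1° cell size are assumptions, and the explanatory variables (which would be joined to each cell from raster data) are left out.

```python
import math
from collections import defaultdict

def richness_per_cell(occurrences, cell_size_deg=1.0):
    """Count distinct species per grid cell.

    `occurrences` is an iterable of (species, latitude, longitude) tuples.
    Cells are indexed by (floor(lat / size), floor(lon / size)).
    """
    species_in_cell = defaultdict(set)
    for species, lat, lon in occurrences:
        cell = (math.floor(lat / cell_size_deg),
                math.floor(lon / cell_size_deg))
        species_in_cell[cell].add(species)
    # Species richness is the number of distinct species recorded in the cell.
    return {cell: len(names) for cell, names in species_in_cell.items()}

# Synthetic records: two species share one cell, a third sits elsewhere.
records = [("A", 0.5, 0.5), ("B", 0.7, 0.2), ("A", 0.9, 0.9), ("C", 10.2, -3.5)]
print(richness_per_cell(records))
```

Repeated records of the same species in a cell count once, so richness does not grow with sampling intensity alone.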
45. Characterizing the LDG
A ‘classic’ LDG pattern was found for 7 of the 8 tested classes. For these 7 classes, species richness is higher between the -30° and 30° lines of latitude (dotted lines).
Pinopsida showed an atypical pattern.
46. Testing the hypotheses
Workflow:
1. Non-spatial stepwise regression
2. Moran’s I test for spatial autocorrelation
3. Spatial regression (spatial lag and error models)
4. Best model selection
5. Selection of significant variables
Resulting models:
▪ Productivity and ambient energy: Species richness ~ Evapotranspiration + Annual mean temperature
▪ Productivity: Species richness ~ Evapotranspiration
The supported hypotheses were either both the productivity and ambient energy hypotheses (Amphibia and Mammalia) or only the productivity hypothesis (Aves, Liliopsida, Magnoliopsida, Polypodiopsida, Reptilia). The Gross & Snyder-Beattie hypothesis was rejected.
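The Moran’s I test for spatial autocorrelation listed on this slide relies on the statistic sketched below. This is a minimal illustration, not the thesis code. It assumes values on a regular grid and binary rook-adjacency weights (cells sharing an edge), and it computes only the statistic, not its significance test.

```python
def morans_i(grid):
    """Moran's I for values on a regular grid.

    `grid` maps (row, col) to a numeric value (for example the residual of
    a non-spatial regression in that cell). Weights are 1 for cells sharing
    an edge and 0 otherwise (an assumption made for this sketch).
    """
    n = len(grid)
    mean = sum(grid.values()) / n
    deviation = {cell: value - mean for cell, value in grid.items()}

    cross_products = 0.0
    total_weight = 0
    for (row, col) in grid:
        for neighbor in ((row + 1, col), (row - 1, col),
                         (row, col + 1), (row, col - 1)):
            if neighbor in grid:
                cross_products += deviation[(row, col)] * deviation[neighbor]
                total_weight += 1

    variance_sum = sum(d * d for d in deviation.values())
    if total_weight == 0 or variance_sum == 0:
        raise ValueError("Moran's I is undefined for this grid")
    return (n / total_weight) * (cross_products / variance_sum)

# A smooth north-south gradient is positively autocorrelated;
# a checkerboard is perfectly negatively autocorrelated.
gradient = {(r, c): float(r) for r in range(5) for c in range(5)}
checkerboard = {(r, c): float((r + c) % 2) for r in range(4) for c in range(4)}
print(morans_i(gradient), morans_i(checkerboard))
```

Values near zero suggest no spatial autocorrelation. Clearly positive values, as for the gradient, are the situation in which a spatial lag or error model is preferred to the non-spatial regression.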
47. Conclusion on the LDG
▪ There is a clear LDG for 7 of the 8 classes studied.
▪ The null hypothesis proposed by Gross & Snyder-Beattie does not fit in our model once spatial autocorrelation is taken into account.
▪ The relative contribution of the productivity hypothesis seems to be the highest.
▪ More factors should be tested, especially those included in historical hypotheses.
49. How is the practice of biodiversity data gathering evolving?
The proportion of specimen-based occurrences is rapidly falling behind the proportion of observation-based occurrences in the GBIF dataset.
This situation is worrying because it jeopardizes the feasibility and reliability of some studies based on biodiversity occurrences.
We recommend encouraging the production of ancillary data (samples, photos…) along with biodiversity occurrences.
The role of citizen science in the evolution of this new paradigm should also be investigated.
Figure: proportion of occurrences by year.
50. Can biological diversity be investigated in its entirety?
Some taxa are less known and studied than others, and this situation worsens with time.
This taxonomic bias seems to be linked to societal preferences.
What can be said when looking at the species level? Do the most sampled species in the GBIF also suggest a link with societal preferences? Changing the study scale might reveal additional influences (species characteristics, data providers, etc.).

The 1000 best sampled species in the GBIF:
Class             Number of species
Aves              611
Magnoliopsida     228
Liliopsida        70
Insecta           28
Actinopterygii    18
Mammalia          12
Polypodiopsida    7
14 next classes   26
51. Global biodiversity patterns at large taxonomic scale: which factors shape them?
The Latitudinal Diversity Gradient pattern is clear for 7 classes.
The geometric hypothesis proposed by Gross & Snyder-Beattie has been rejected, while the productivity hypothesis seems to be confirmed.
▪ More studies should be done, this time including the historical hypotheses.
▪ Other geographic regions and biodiversity patterns could be studied.
▪ The spatial non-stationarity of the models should be explored.
52. THANKS!
To the jury members: Alexandre ANTONELLI, Jean-Philippe LESSARD, Anne-Sophie
ARCHAMBEAU and Rod PAGE
To the members of my PhD committee: Roseli Pellens, Philippe Grandcolas, Wilfried
Thuiller, Samy Gaiji and Jérôme Sueur.
To my thesis supervisors: Régine VIGNES-LEBBE and Frédéric LEGENDRE
Thanks to the people at GBIF France and the International GBIF Node in Copenhagen for their help and advice.
Many thanks to all the amazing people I got to interact with during this PhD, in particular all the people from the Institut de Systématique, Évolution, Biodiversité who welcomed me during those three years.
Many thanks again to all the colleagues, friends and family who supported me and helped me! Words can’t express all my gratitude!
53. References
▪ Ariño, Arturo H. "Approaches to Estimating the Universe of Natural History Collections Data". Biodiversity Informatics 7, no. 2 (2010). bi.v7i2.3991.
▪ Barrowclough, George F., Joel Cracraft, John Klicka, and Robert M. Zink. "How Many Kinds of Birds Are There and Why Does It Matter?" PLOS ONE 11, no. 11 (2016): e0166307.
▪ Bingham, Heather, Lauren Weatherdon, Katherine Despot-Belmonte, Florian Wetzel, and Corinne Martin. "The Biodiversity Informatics Landscape: Elements, Connections and Opportunities". Research Ideas and Outcomes 3 (2017): e14059.
▪ Cardinale, Bradley J., J. Emmett Duffy, Andrew Gonzalez, David U. Hooper, Charles Perrings, Patrick Venail, Anita Narwani, et al. "Biodiversity Loss and Its Impact on Humanity". Nature 486, no. 7401 (2012): 59-67.
▪ Ciccarelli, Francesca D., Tobias Doerks, Christian von Mering, Christopher J. Creevey, Berend Snel, and Peer Bork. "Toward Automatic Reconstruction of a Highly Resolved Tree of Life". Science (New York, N.Y.) 311, no. 5765 (2006): 1283-87.
▪ Colwell, Robert K., and George C. Hurtt. "Nonbiological Gradients in Species Richness and a Spurious Rapoport Effect". The American Naturalist 144, no. 4 (1994): 570-95.
▪ Colwell, Robert K., and David C. Lees. "The mid-domain effect: geometric constraints on the geography of species richness". Trends in Ecology & Evolution 15, no. 2 (2000): 70-76.
▪ Di Marco, Moreno, Sarah Chapman, Glenn Althor, Stephen Kearney, Charles Besancon, Nathalie Butt, Joseph M. Maina, et al. "Changing trends and persisting biases in three decades of conservation science". Global Ecology and Conservation 10 (April 2017): 32-42.
▪ Faith, Dan, Ben Collen, Arturo Ariño, Patricia Koleff, John Guinotte, Jeremy Kerr, and Vishwas Chavan. "Bridging the Biodiversity Data Gaps: Recommendations to Meet Users’ Data Needs". Biodiversity Informatics 8, no. 2 (2013).
▪ Ford, Adam T., Steven J. Cooke, Jacob R. Goheen, and Truman P. Young. "Conserving Megafauna or Sacrificing Biodiversity?" BioScience 67, no. 3 (2017): 193-96.
▪ Gross, Kevin, and Andrew Snyder-Beattie. "A General, Synthetic Model for Predicting Biodiversity Gradients from Environmental Geometry". The American Naturalist 188, no. 4 (2016): E85-97.
▪ Hawkins, Bradford A., Eric E. Porter, and José Alexandre Felizola Diniz-Filho. "Productivity and History as Predictors of the Latitudinal Diversity Gradient of Terrestrial Birds". Ecology 84, no. 6 (2003): 1608-23.
▪ Humboldt, Alexander. "Views of nature, or, Contemplations on the sublime phenomena of creation" (1807).
▪ Kosmala, Margaret, Andrea Wiggins, Alexandra Swanson, and Brooke Simmons. "Assessing Data Quality in Citizen Science". Frontiers in Ecology and the Environment 14, no. 10 (2016): 551-60.
▪ Meyer, Carsten, Holger Kreft, Robert Guralnick, and Walter Jetz. "Global Priorities for an Effective Information Basis of Biodiversity Distributions". Nature Communications 6 (2015): 8221.
▪ Myers, Norman, Russell A. Mittermeier, Cristina G. Mittermeier, Gustavo A. B. da Fonseca, and Jennifer Kent. "Biodiversity Hotspots for Conservation Priorities". Nature 403, no. 6772 (2000): 853-58.
▪ Newbold, Tim, Lawrence N. Hudson, Andrew P. Arnell, Sara Contu, Adriana De Palma, Simon Ferrier, Samantha L. L. Hill, et al. "Has Land Use Pushed Terrestrial Biodiversity beyond the Planetary Boundary? A Global Assessment". Science 353, no. 6296 (2016): 288-91.
▪ Pyron, R. Alexander. "Post-Molecular Systematics and the Future of Phylogenetics". Trends in Ecology & Evolution 30, no. 7 (2015): 384-89.
▪ Ripple, William J., Christopher Wolf, Thomas M. Newsome, Mauro Galetti, Mohammed Alamgir, Eileen Crist, Mahmoud I. Mahmoud, and William F. Laurance. "World Scientists’ Warning to Humanity: A Second Notice". BioScience (2017).
▪ Stahlschmidt, Zachary R. "Taxonomic Chauvinism Revisited: Insight from Parental Care Research". PLOS ONE 6, no. 8 (2011): e24192. journal.pone.0024192.
▪ Willig, M.R., D.M. Kaufman, and R.D. Stevens. "Latitudinal Gradients of Biodiversity: Pattern, Process, Scale, and Synthesis". Annual Review of Ecology, Evolution, and Systematics 34, no. 1 (2003): 273-309.
▪ World Scientists' Warning to Humanity (1992).
55. Jump in the number of occurrences
▪ Magnoliopsida, 1945: 890,000 occurrences from "Occurrence Data of Vascular Plants collected or compiled for the Flora of Bavaria"
▪ Reptilia, 2012: 470,000 occurrences from "Geographically tagged INSDC sequences", published by the European Molecular Biology Laboratory (EMBL)
56. A bias affecting data quality
A Multiple Correspondence Analysis shows that old occurrences tend to have more issues.
The proportions of spatial and temporal issues differ between classes, showing once again a bias in quality.
57. A bias affecting data quality
Some classes have far more specimen-based occurrences than others.
58. The geometric hypotheses
The first geometric hypothesis: Colwell & Hurtt (1994), Colwell & Lees (2000). This hypothesis is also called the mid-domain effect.
Rejected by Currie & Kerr 2007, 2008
Zapata et al. 2003
Figure: thirty species with a uniform frequency distribution of range sizes are placed randomly within a finite domain (bottom panels). Horizontal lines indicate range sizes and midpoints indicate the position of ranges along the 1-D domain. The overlap among ranges on the bottom panels produces the pattern of species richness observed on the upper panels (dotted line). The mean value of species richness that results from repositioning the ranges randomly within the domain 50 times is indicated by a solid line.
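The simulation described in this caption can be sketched as follows. It is a minimal re-creation under stated assumptions, not the code behind the figure: the domain is the interval [0, 1], range sizes are drawn uniformly, each range is placed uniformly among the positions where it fits entirely inside the domain, and the function name, the number of sampling points and the seed are illustrative choices.

```python
import random

def simulate_mid_domain(n_species=30, n_points=101, n_repeats=50, seed=0):
    """Mean species richness along a 1-D domain under random range placement."""
    rng = random.Random(seed)
    positions = [i / (n_points - 1) for i in range(n_points)]
    mean_richness = [0.0] * n_points

    for _ in range(n_repeats):
        for _ in range(n_species):
            size = rng.uniform(0.0, 1.0)          # range size, uniform
            start = rng.uniform(0.0, 1.0 - size)  # range must fit in the domain
            end = start + size
            for i, x in enumerate(positions):
                if start <= x <= end:
                    # Each overlap adds one species, averaged over the repeats.
                    mean_richness[i] += 1.0 / n_repeats
    return positions, mean_richness

positions, richness = simulate_mid_domain()
middle = len(positions) // 2
print(f"edge: {richness[0]:.1f}, middle: {richness[middle]:.1f}")
```

Even with no environmental gradient at all, richness peaks in the middle of the domain, because large ranges cannot avoid it. This is the geometric artifact that the mid-domain effect refers to.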
60. The complex relationship between the public and the scientists
Figure (Martín-López et al. 2009): conceptual model representing the primary connections among scientific information, scientific activity and policy, nongovernmental organizations (NGOs), society and environmental administrations (or departments) at different governmental scales that establish and allocate funds for species conservation.
62. Other hypotheses behind taxonomic bias

                      Detectability                                 Ease of
                      Encounter probability   Ease of observation   identification
Body size increase    -                       +                     +
Reachable habitat     +                       NA                    NA
High abundance        +                       NA                    NA
Discreet behaviour    NA                      -                     NA
Large range           +                       NA                    NA
Diurnal taxon         +                       +                     NA
Species similarity    NA                      -                     -
High speciosity       NA                      NA                    -

Martín-López et al. 2007: quadratic relation between the attitudes toward species (preference score) and the willingness to pay (WTP) for conservation of the species.
63. Testing the model spatial stationarity
Geographically Weighted Regression (GWR) allows us to test the models we kept across space. It serves as a test of the model's spatial stationarity (i.e. the assumption that the relationship between variables does not vary across space).
The GWR model shows that there is no spatial stationarity in the model we kept: the explanatory power as well as the influence of the covariates vary across space.
Figure: GWR results for Mammalia.
65. Detecting spatial outliers
Figure: global distribution of the common black ant (Lasius niger).
To detect spatial outliers:
• for each point
• find the 5 nearest points (neighbors)
• compute the orthodromic distance to each neighbor
• sum those distances
• flag the 1% of points with the highest sum as outliers
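The steps above can be sketched as follows. This is a brute-force illustration, not the DwCSP implementation. The haversine formula on a spherical Earth of radius 6371 km, the function names and the rule of always flagging at least one point are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def great_circle_km(p, q):
    """Orthodromic (great-circle) distance between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def flag_spatial_outliers(points, k=5, fraction=0.01):
    """Return the indices of the points with the largest summed distance
    to their k nearest neighbors.

    Brute force, O(n^2): fine for a sketch, too slow for millions of points.
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(great_circle_km(p, q)
                           for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]))

    # Assumption: flag at least one point, even for small samples.
    n_flagged = max(1, int(fraction * len(points)))
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_flagged])

# Synthetic example: 99 clustered points and one far away.
cluster = [(46.0 + 0.01 * (i % 10), 2.0 + 0.01 * (i // 10)) for i in range(99)]
print(flag_spatial_outliers(cluster + [(-30.0, 150.0)]))
```

At the scale of the GBIF data, a spatial index (for example a k-d tree or ball tree) would replace the inner loop, but the scoring rule stays the same.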
66. Detecting environmental outliers

class     order       species                    annual mean   min temperature    annual
                                                 temperature   of coldest month   precipitation
Insecta   Blattodea   Tryonicus parvus           15.49         7.34               1370
Insecta   Blattodea   Arenivaga floridensis      22.35         9.91               1225
Insecta   Blattodea   Ectobius lapponicus        6.89          -6.94              772
Insecta   Blattodea   Phyllodromica subaptera    13.83         -0.74              425
Insecta   Blattodea   Periplaneta americana      17.63         6.61               440
Insecta   Blattodea   Tryonicus parvus           15.49         7.34               1370

To detect environmental outliers:
• for each point
• compute the Mahalanobis distance of the point using the environmental variables
• flag the 1% of points with the highest Mahalanobis distance as outliers
The Mahalanobis distance is a measure of the distance between a point P and a distribution D.
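A minimal sketch of this procedure, using only the Python standard library, is given below. It is not the DwCSP implementation. The function names, the use of the sample covariance of the points themselves as the distribution D, and the rule of always flagging at least one point are assumptions. The distance computed is D(x) = sqrt((x - m)ᵀ S⁻¹ (x - m)), with m the mean vector and S the covariance matrix.

```python
import math

def mean_and_covariance(rows):
    """Mean vector and sample covariance matrix of a list of equal-length tuples."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    return mean, cov

def solve(matrix, vector):
    """Solve matrix @ x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def mahalanobis_distances(rows):
    """Mahalanobis distance of each row to the distribution of all rows."""
    mean, cov = mean_and_covariance(rows)
    distances = []
    for row in rows:
        diff = [v - m for v, m in zip(row, mean)]
        # Solving S y = diff avoids forming the inverse of S explicitly.
        y = solve(cov, diff)
        distances.append(math.sqrt(max(0.0, sum(a * b for a, b in zip(diff, y)))))
    return distances

def flag_environmental_outliers(rows, fraction=0.01):
    """Indices of the rows with the highest Mahalanobis distance."""
    distances = mahalanobis_distances(rows)
    n_flagged = max(1, int(fraction * len(rows)))  # assumption: at least one
    ranked = sorted(range(len(rows)), key=lambda i: distances[i], reverse=True)
    return set(ranked[:n_flagged])
```

Each row would hold the environmental values of one occurrence, for example (annual mean temperature, min temperature of coldest month, annual precipitation). Because the distance is scaled by the covariance matrix, variables with very different units can be combined without manual rescaling.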