The variety, distinctiveness and complexity of life – biodiversity, in other words, and by implication the ecosystems in which it is situated – is our life support system. It is absolutely essential and more important than almost everything else, but it is typically taken for granted. Today’s big societal challenges – food and water security, coping with environmental change and aspects of human health – are beyond the abilities of any one individual or research group to solve. Solving them depends not only on collaboration to deliver the appropriate scientific evidence but increasingly on vast amounts of data from multiple sources (environmental, taxonomic, genomic and ecological) gathered by manual observation and automated sensors, digitisation, remote sensing, and genetic sequencing. In April 2012 we called the biodiversity and ecosystems research communities to arms to formulate a consensus view on establishing an infrastructure to improve the accessibility of the ever-increasing volumes of biological data. We published the whitepaper “A decadal view of biodiversity informatics: challenges and priorities”, which has since been viewed more than 24,000 times. We envisage a shared and maintained multi-purpose network of computationally-based processing services sitting on top of an open data domain. By open data domain we mean data that is accessible, i.e., published, registered and linked. BioVeL, pro-iBiosphere, ViBRANT and other FP7-funded projects have all explored aspects of this vision.
Data accessibility and the role of informatics in predicting the biosphere
1. Data accessibility and the role of informatics in predicting the biosphere
Alex Hardisty
Director of Informatics Projects,
School of Computer Science & Informatics
Coordinator, FP7 BioVeL project www.biovel.eu
email: hardistyar@cardiff.ac.uk
/alexhardisty (occasionally!)
2. Structuring the biodiversity informatics community at the European level and beyond
Biodiversity Informatics Horizons 2013
180 experts conclude that there is
“a growing need for predictive biosphere modelling”
• Integration: Make better use of what we have
• Cooperation: Data from the whole world is needed
• Promotion: Europe is well placed to offer leadership
3. What if …?
Imagine if we could …
… Predict community level dynamics of ecosystems (i.e., behaviours) at scales from local to global, based on the ecology and biology of all individual organisms …
e.g., Ecosystems: Time to model all life on Earth. Purves et al., Nature 493 (2013)
Image: StuartMiles / FreeDigitalPhotos.net
4. Imagine if we could …
… Measure and calculate “Essential Biodiversity Variables” …
… for any geographic area (continental, regional, local), by any person anywhere, using data for that area that may be held by any (research) infrastructure. Not only that, but also learn how to forecast EBVs
6. From an informatics perspective, how close are we to that?
[Figure: three radar charts, each axis scored from 0% to 100%]
Data: Topical coverage; Data sharing and QC; Data types; Data source tracking; Data citation tracking; Data integration; User applications & interfaces; Funding; Access policy; Technology; GIS; Standards
Technology: Software architecture; Programming languages; Authentication; Authorization; Middleware; Computing infrastructure; Standards
General: Service logic; Geographical coverage; Infrastructure topology; Native interoperability and enablers; Merging of science & policy needs; Merging of science & industry needs; Engagement of citizens; Licensing and business model
9 research infrastructures from around the world exhibit “a satisfactory level of potential interoperability”
7. A computational challenge: Greater than that of weather forecasting; greater than that of climate prediction?
Image from climateprediction.net
Harfoot MBJ, Newbold T, Tittensor DP, Emmott S, et al. (2014) Emergent Global Patterns of Ecosystem Structure and Function from a Mechanistic General Ecosystem Model. PLoS Biol 12(4): e1001841. doi:10.1371/journal.pbio.1001841
For 1 km resolution, “… 3 to 6 orders of magnitude larger, … an exascale problem”
Jack K. Horner, Independent consultant & Adviser to KU Biodiversity Institute
8. The situation today can be likened to meteorology in the 1950s, 60s and 70s (and later in climatology), when the emergence of numerical weather prediction drove demand for:
• New observations
• A global infrastructure for acquiring, mobilising and normalising data, and
• Better models of global atmospheric behaviour
9. Accessible data is useful data, not just for research
Global policies/reports
Regional
policies/reports
National
policies/reports
Data and information
Direct provision of data/information
Indirect provision through reports
Assessment processes
Green accounting etc
Diagram courtesy of EC FP7 EU BON project
10. To be able to predict the biosphere we need to mobilise data and make it accessible
11. It’s a journey towards
• Global data, covering the whole planet. There are significant gaps everywhere today
• Making all our small-scale, local data – which often characterises the current-day practice of field ecology – global
That is to say, we have to mobilise, clean, normalise and quality assure many small sets of data that together can give us the global data we need to calibrate models
We are achieving that for certain classes of data but it is not without its difficulties
12. Issues arise in each of the 4 stages of mobilising data for synthesis
• Data acquisition
– Standardised measurement protocols
• Data curation
– Assigning right metadata and persistent identifiers
– Finding a home for the data – and putting it there
• Data discovery and access
– Finding relevant data
– Machine-readable access to data, i.e., via a web service (WS) front-end
• Data processing / analysis, including re-use
– Owners want attribution
– Tracking provenance and following licensing conditions
– Problems at every step, on every workflow run
http://envri.eu/rm
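The curation issue of "assigning right metadata and persistent identifiers" can be illustrated with a minimal completeness check. This is only a sketch: the field names in `REQUIRED_FIELDS`, the function `missing_metadata` and the example record (including its placeholder identifier) are assumptions made for illustration and do not come from any particular infrastructure or standard.

```python
# Hypothetical minimum set of metadata fields a dataset needs before it
# can be discovered, cited and re-used. Not taken from a real standard.
REQUIRED_FIELDS = ("identifier", "title", "creator", "license")


def missing_metadata(record):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Treat None, empty strings and whitespace-only strings as missing.
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Illustrative record with a placeholder identifier and two gaps.
record = {
    "identifier": "doi:10.1234/example",
    "title": "Pond survey 2012",
    "creator": "",
}

problems = missing_metadata(record)
print(problems)
```

A check of this kind would run at curation time, before a dataset is given a home, so that gaps are caught while the data owner can still fill them.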
13. See also:
“Showing you this map of aggregated bullfrog occurrences would be illegal”
http://peterdesmet.com/posts/illegal-bullfrogs.html
“Our analysis of the licenses of all 11.000+ GBIF registered datasets shows a bleak picture. Very few GBIF registered datasets can be easily and legally used, let alone without restrictions. This is mainly due to data being published with no or a non-standard license.”
Peter Desmet and Bart Aelterman, 22nd Nov 2013, peterdesmet.com
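The kind of license audit described in the quote can be sketched in a few lines. This is a hedged illustration, not the method used by Desmet and Aelterman: the license labels, the two sets `OPEN_LICENSES` and `RESTRICTED_STANDARD`, and the five example datasets are all assumptions invented for the example.

```python
from collections import Counter

# Assumption: illustrative license labels, not GBIF's actual vocabulary.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0"}
RESTRICTED_STANDARD = {"CC-BY-NC-4.0"}


def classify_license(license_label):
    """Sort a dataset's license statement into one of four categories."""
    if license_label is None or not license_label.strip():
        return "missing"
    label = license_label.strip()
    if label in OPEN_LICENSES:
        return "standard-open"
    if label in RESTRICTED_STANDARD:
        return "standard-restricted"
    # Free-text or home-made terms cannot be interpreted by a machine.
    return "non-standard"


# Invented example datasets.
datasets = [
    {"id": "ds-1", "license": "CC0-1.0"},
    {"id": "ds-2", "license": None},
    {"id": "ds-3", "license": "Free for academic use, ask first"},
    {"id": "ds-4", "license": "CC-BY-NC-4.0"},
    {"id": "ds-5", "license": "  "},
]

summary = Counter(classify_license(d["license"]) for d in datasets)
for category, count in sorted(summary.items()):
    print(category, count)
```

Only the "standard-open" category can be aggregated and redistributed without further checks; every other category needs a human to read the terms or contact the owner.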
15. Data re-use: Owners want attribution
Example 1) Taxonomic data refinement Workflow
BioSTIF
CoL: 3 levels of attribution
• complete work
• contributing database of the record
• expert who provides taxonomic scrutiny of the individual record
Tool license(s)
GBIF data use agreement
• Respect restrictions of access to sensitive data.
• Identifier of ownership of data must be retained with every data record (through the workflow)
• Publicly acknowledge the Data Publishers whose biodiversity data they have used.
• Any additional terms and conditions of use set by the Data Publisher.
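The requirement that ownership identifiers be retained with every record through the workflow can be sketched as follows. This is a minimal illustration under stated assumptions: the `OccurrenceRecord` type, its field names, the `normalise_name` step and the example publishers are hypothetical and are not part of the BioVeL workflow or the GBIF agreement.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OccurrenceRecord:
    """Hypothetical occurrence record that carries its owner with it."""
    scientific_name: str
    latitude: float
    longitude: float
    publisher: str    # identifier of ownership, never dropped
    dataset_id: str


def normalise_name(record):
    """A refinement step: tidy the name, leave the attribution untouched."""
    cleaned = " ".join(record.scientific_name.split()).capitalize()
    # replace() copies every other field, so publisher and dataset_id survive.
    return replace(record, scientific_name=cleaned)


def acknowledgements(records):
    """List every publisher whose data reached the end of the workflow."""
    return sorted({r.publisher for r in records})


raw = [
    OccurrenceRecord("lithobates  CATESBEIANUS", 51.0, 4.4, "Publisher A", "ds-1"),
    OccurrenceRecord("Lithobates catesbeianus", 50.8, 4.3, "Publisher B", "ds-2"),
]
refined = [normalise_name(r) for r in raw]
print(acknowledgements(refined))
```

Making the record immutable means a workflow step can only produce a modified copy, which makes it harder to lose the ownership fields by accident.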
16. More problems at every step, on every run
Example 2) Niche Modelling Workflow
Create model
Model test
Model projection
High quality occurrence data set
Select algorithm
Select parameter values for the chosen algorithm
Assemble the model on openModeller service
Test the performance of the parameter in the model
Test performance of the distribution prediction on the model
Project Model with prediction layers
Changing algorithm, parameter values, and set of layers
Project Model with original layers
Visualize and publish results
Select layers with environmental factors that are likely to influence the distribution of the species
Select prediction layers
• License on algorithm
• License on software
Licenses on environmental data layers
• Permissions to use
• AuthN/AuthZ
Moving data from one service to another
• 3rd party software
• All issues associated with publication
17. In a recent EU BON study
Only 35% of surveyed datasets (wider scope than just GBIF) are accessible under an open license or waiver, without restriction on use
For 29 scientific questions relating to needs of European environmental policy, the availability of datasets to answer the questions is in the range ‘satisfactory’ (3) to ‘poor’ (2)
18. Multiple initiatives to make data more accessible; some are general purpose
https://rd-alliance.org/
… builds the social and technical bridges that enable open sharing of data … researchers and innovators openly sharing data across technologies, disciplines, and countries to address the grand challenges of society.
http://www.datafairport.org/
… successful community supported conventions, policies and practices for data identifiers, formats, checklists and vocabularies that enable data interoperability, citation and stewardship.
ORCID and DataCite initiatives to uniquely identify (respectively) scientists and data sets
19. Some are more domain specific
Promoting free and open access to biodiversity information
A framework to focus effort and investment to deliver biodiversity knowledge more effectively
www.biodiversityinformatics.org/
www.bouchout-declaration.org
20. A shared and maintained multi-purpose network of computationally-based processing services in an open data domain
Image: CoolDesign / FreeDigitalPhotos.net
With 78 contributors, we published the whitepaper, April 2013 - since viewed more than 34,000 times.
21. Building a heterogeneous Service Network
Users’ workflows and applications
Sustained Service and Data Providers
GBIF, CoL, OBIS, WoRMS, EMBL-EBI, BGBM, CRIA, EoL, BHL, ALA, LTER, etc. & more.
www.biodiversitycatalogue.org
Recognised and stable Infrastructure Providers
National, EGI.eu, PRACE, commercial, EUDAT, etc.
22. Preparing the next, coordinated steps
Diagram from LinkD Concept Note, September 2014
23. LinkD
Develop the highly responsive digital framework required to enable high throughput research and support science of scale towards the long term vision of modelling Life on Earth
LinkD: Science of Scale for Life on Earth
What do we want to do in LinkD?
ELODINS ENVRI+
From slides by Vince Smith, LinkD proposal coordinator, Natural History Museum, London
Inspired by roadmap publications such as GBIO and the White paper. Mandated by European and global societal challenges. Supported by the maturity of the available foundational e-Infrastructures.
Science of Scale: To maximize the efficiency of the available data, services and tools. This is what the Commission calls Science 2.0. In short, it means using economies of scale in data collection and associated infrastructure to do big things.