Data accessibility and the role of informatics in predicting the biosphere

Alex Hardisty
Director of Informatics Projects,
School of Computer Science & Informatics
Coordinator, FP7 BioVeL project www.biovel.eu
email: hardistyar@cardiff.ac.uk
/alexhardisty (occasionally!)

Structuring the biodiversity informatics community at the European level and beyond
Biodiversity Informatics Horizons 2013
180 experts conclude that there is
“a growing need for predictive biosphere modelling”
• Integration: Make better use of what we have
• Cooperation: Data from the whole world is needed
• Promotion: Europe is well placed to offer leadership

What if …?
Imagine if we could …
… Predict community level dynamics of
ecosystems (i.e., behaviours) at scales
from local to global, based on the
ecology and biology of all individual
organisms …
e.g., Ecosystems: Time to model all life on Earth. Purves et al.,
Nature 493 (2013)
Image: StuartMiles / FreeDigitalPhotos.net

Imagine if we could …
… Measure and calculate “Essential Biodiversity Variables” …
… for any geographic area (continental, regional, local), by any person
anywhere, using data for that area that may be held by any (research)
infrastructure. Not only that, but also learn how to forecast EBVs.

These ambitions depend on collaboration to deliver the evidence,
i.e., on synthesis and modelling of
• Increasingly large amounts of data from multiple sources
(environmental, taxonomic, genomic and ecological)
• Gathered by manual observation and automated sensors,
digitisation, next-generation sequencing and remote sensing
Beyond the abilities of any one individual or any single
research community to collect, observe or generate.
Variety, Velocity and Volume of “Big Data”
Photo: Smokestacks against skyline and sunset, Estonia. © Curt Carnemark / World Bank Photo Collection

From an informatics perspective, how close are we to that?
[Figure: radar charts with axes running from 0% to 100%. Apparent panel titles: Data, Technology, General. Axis labels as extracted: Topical coverage; Data sharing and QC; Data types; Data source tracking; Data citation tracking; Data integration; User applications & interfaces; Funding; Access policy; GIS; Standards; Software architecture; Programming languages; Authentication; Authorization; Middleware; Computing infrastructure; Service logic; Geographical coverage; Infrastructure topology; Native interoperability and enablers; Merging of science & policy needs; Merging of science & industry needs; Engagement of citizens; Licensing and business model.]
9 research infrastructures from around the world exhibit “a satisfactory level of potential interoperability”

A computational challenge: Greater than that of weather
forecasting; greater than that of climate prediction?
Image from climateprediction.net
Harfoot MBJ, Newbold T, Tittensor DP, Emmott S, et al. (2014) Emergent Global
Patterns of Ecosystem Structure and Function from a Mechanistic General
Ecosystem Model. PLoS Biol 12(4): e1001841. doi:10.1371/journal.pbio.1001841
For 1 km resolution, “… 3 to 6 orders of magnitude larger, … an exascale problem”
Jack K. Horner, independent consultant and adviser to KU Biodiversity Institute
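
As a rough back-of-envelope illustration of why spatial resolution drives the computational cost, the sketch below counts the square grid cells needed to tile the Earth's surface (about 5.1 × 10^8 km², a standard approximate figure) at different cell sizes. The function names, the coarse cell sizes, and the assumption that cost grows with cell count alone are ours for illustration; they are not taken from Horner's estimate or from the Harfoot et al. model.

```python
import math

EARTH_SURFACE_KM2 = 5.1e8  # approximate total surface area of the Earth


def cell_count(cell_size_km: float) -> float:
    """Number of square cells of the given side length needed to tile the surface."""
    return EARTH_SURFACE_KM2 / (cell_size_km ** 2)


def orders_of_magnitude(coarse_km: float, fine_km: float) -> float:
    """log10 of how many times more cells the fine grid has than the coarse grid."""
    return math.log10(cell_count(fine_km) / cell_count(coarse_km))


if __name__ == "__main__":
    # Hypothetical coarse resolutions, compared against a 1 km grid.
    for coarse in (10, 100):
        extra = orders_of_magnitude(coarse, 1)
        print(f"{coarse} km -> 1 km: about {extra:.1f} orders of magnitude more cells")
```

Cell count is only one factor. Time step, number of modelled organisms per cell and data movement would all add to this, which is why the sketch should be read as a lower bound on the growth, not as a reproduction of the quoted estimate.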

The situation today can be
likened to meteorology in the
1950s, 60s and 70s (and
later in climatology) when
the emergence of numerical
weather prediction drove
demand for:
• New observations
• The emergence of a global
infrastructure for acquiring,
mobilising and normalising
data, and
• Better models of global
atmospheric behaviour

Accessible data is useful data, not just for research
Global policies/reports
Regional
policies/reports
National
policies/reports
Data and information
Direct provision of data/information
Indirect provision through reports
Assessment processes
Green accounting etc
Diagram courtesy of EC FP7 EU BON project

To be able to predict the biosphere we need to
mobilise data and make it accessible

It’s a journey towards
• Global data, covering the whole planet. There are
significant gaps everywhere today
• Making all our small-scale, local data – which often
characterises the current day practice of field
ecology – global
That is to say, we have to mobilise, clean, normalise
and quality assure many small sets of data that
together can give us the global data we need to
calibrate models
We are achieving that for certain classes of data but
it is not without its difficulties

Issues arise in each of the 4 stages
of mobilising data for synthesis
• Data acquisition
– Standardised measurement protocols
• Data curation
– Assigning the right metadata and persistent identifiers
– Finding a home for the data – and putting it there
• Data discovery and access
– Finding relevant data
– Machine-readable access to data, i.e., a web service (WS) front-end
• Data processing / analysis, including re-use
– Owners want attribution
– Tracking provenance and following licensing conditions
– Problems at every step, on every workflow run
http://envri.eu/rm
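
The four stages above can be sketched as a minimal pipeline in which each dataset carries its measurement protocol, persistent identifier, license, owner and a provenance trail from acquisition onwards. Everything named here (`Dataset`, `acquire`, `curate`, `discover`, `process`, and the example identifier) is hypothetical and written only for this sketch; it does not correspond to any ENVRI or BioVeL API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Dataset:
    records: List[dict]
    protocol: str                      # acquisition: measurement protocol used
    identifier: Optional[str] = None   # curation: persistent identifier
    license: Optional[str] = None
    owner: Optional[str] = None
    provenance: List[str] = field(default_factory=list)


def acquire(records: List[dict], protocol: str) -> Dataset:
    """Stage 1: record the data together with the protocol that produced it."""
    ds = Dataset(records=[dict(r) for r in records], protocol=protocol)
    ds.provenance.append(f"acquired using protocol '{protocol}'")
    return ds


def curate(ds: Dataset, identifier: str, license: str, owner: str) -> Dataset:
    """Stage 2: attach metadata and a persistent identifier."""
    ds.identifier, ds.license, ds.owner = identifier, license, owner
    ds.provenance.append(f"curated as {identifier}")
    return ds


def discover(catalogue: List[Dataset], predicate: Callable[[Dataset], bool]) -> List[Dataset]:
    """Stage 3: only curated datasets (those with an identifier) are findable."""
    return [ds for ds in catalogue if ds.identifier and predicate(ds)]


def process(ds: Dataset, step_name: str, func: Callable[[dict], dict]) -> Dataset:
    """Stage 4: derive new data while keeping owner, license and provenance."""
    return Dataset(
        records=[func(dict(r)) for r in ds.records],
        protocol=ds.protocol,
        identifier=ds.identifier,
        license=ds.license,
        owner=ds.owner,
        provenance=ds.provenance + [f"processed by '{step_name}'"],
    )


if __name__ == "__main__":
    raw = acquire([{"species": "Rana catesbeiana", "count": 3}], "point count")
    catalogue = [curate(raw, "example-pid-001", "CC0-1.0", "Example Survey")]
    found = discover(catalogue, lambda ds: ds.license == "CC0-1.0")
    result = process(found[0], "double counts", lambda r: {**r, "count": r["count"] * 2})
    print(result.owner, result.provenance)
```

The design choice worth noting is that `process` returns a new `Dataset` instead of mutating its input, so the attribution and provenance of the source remain intact on every workflow run.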

See also: “Showing you this map of aggregated bullfrog occurrences would be illegal”
http://peterdesmet.com/posts/illegal-bullfrogs.html
“Our analysis of the licenses of all 11.000+ GBIF registered datasets shows a
bleak picture. Very few GBIF registered datasets can be easily and legally
used, let alone without restrictions. This is mainly due to data being
published with no or a non-standard license.”
Peter Desmet and Bart Aelterman, 22nd Nov 2013, peterdesmet.com
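
A check of the kind quoted above can be sketched as a small classifier over dataset license fields. The set of accepted license identifiers (written as SPDX-style strings), the category names and the sample records are assumptions made for this sketch; they are not the criteria or data that Desmet and Aelterman used.

```python
from collections import Counter
from typing import Iterable, Optional

# Illustrative set of standard open licenses/waivers (an assumption for this sketch).
STANDARD_OPEN = {"CC0-1.0", "CC-BY-4.0"}


def classify(license_text: Optional[str]) -> str:
    """Put a dataset's license field into one of three coarse categories."""
    if license_text is None or not license_text.strip():
        return "no license"
    if license_text.strip() in STANDARD_OPEN:
        return "standard open license"
    return "non-standard or restrictive"


def summarise(datasets: Iterable[dict]) -> Counter:
    """Count datasets per license category."""
    return Counter(classify(d.get("license")) for d in datasets)


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"name": "A", "license": "CC0-1.0"},
        {"name": "B", "license": "Free for research use, ask the author first"},
        {"name": "C", "license": ""},
        {"name": "D"},
    ]
    for category, n in summarise(sample).items():
        print(category, n)
```

Free-text terms such as dataset B's are the hard case: a machine cannot decide whether aggregation is permitted, so they end up in the same bucket as restrictive licenses.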

Data re-use: Owners want attribution
Example 1) Taxonomic data refinement Workflow
BioSTIF
CoL 3 levels of attribution
• complete work
• contributing database of the record
• expert who provides taxonomic
scrutiny of the individual record.
Tool license(s)
GBIF data use agreement
• Respect restrictions of access to sensitive data.
• Identifier of ownership of data must be retained with every data record (through the workflow)
• Publicly acknowledge the Data Publishers whose biodiversity data they have used.
• Any additional terms and conditions of use set by the Data Publisher.
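
The requirement that ownership information travels with every record through the workflow can be sketched as follows. The field name `publisher` and the helper functions are hypothetical, chosen for this sketch; they are not GBIF's actual record schema or any BioVeL tool interface.

```python
from typing import Callable, List


def filter_records(records: List[dict], keep: Callable[[dict], bool]) -> List[dict]:
    """A workflow step that drops records but copies every field of those it keeps."""
    return [dict(r) for r in records if keep(r)]


def check_ownership_retained(records: List[dict]) -> None:
    """Fail loudly if any record has lost its ownership field."""
    missing = [r for r in records if not r.get("publisher")]
    if missing:
        raise ValueError(f"{len(missing)} record(s) lost their publisher field")


def acknowledgements(records: List[dict]) -> List[str]:
    """Data publishers to acknowledge publicly for the records actually used."""
    return sorted({r["publisher"] for r in records})


if __name__ == "__main__":
    # Made-up occurrence records, for illustration only.
    records = [
        {"id": 1, "publisher": "Museum A", "year": 1990},
        {"id": 2, "publisher": "Survey B", "year": 2010},
        {"id": 3, "publisher": "Museum A", "year": 2012},
    ]
    recent = filter_records(records, lambda r: r["year"] >= 2000)
    check_ownership_retained(recent)
    print("Data publishers used:", ", ".join(acknowledgements(recent)))
```

Running the ownership check after each step, not only at the end, makes it obvious which tool in the workflow stripped the attribution.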

More problems at every step, on every run
Example 2) Niche Modelling Workflow
Create model
Model test
Model projection
High quality occurrence data
set
Select algorithm
Select parameter values for
the chosen algorithm
Assemble the model on
openModeller service
Test the performance of the
parameter in the model
Test performance of the
distribution prediction on the
model
Project Model with prediction
layers
Changing algorithm, parameter
values, and set of layers
Project Model with original
layers
Visualize and publish results
Select layers with environmental
factors that are likely to influence the
distribution of the species
Select prediction layers
• License on algorithm
• License on software
Licenses on
environmental data layers
• Permissions to use
• AuthN/AuthZ
Moving data from one
service to another
• 3rd party software
• All issues associated
with publication
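
The create / test / project sequence in this workflow can be illustrated with a deliberately tiny environmental envelope model, in the spirit of rectilinear-envelope algorithms such as BIOCLIM. This is a toy written for this sketch: it is not the openModeller service or its API, and the layer names (`temp`, `rain`) and all values are made up.

```python
from typing import Dict, List, Tuple

Envelope = Dict[str, Tuple[float, float]]


def create_model(occurrences: List[dict], layers: List[str]) -> Envelope:
    """Create model: min/max of each selected layer at the occurrence points."""
    return {
        name: (min(o[name] for o in occurrences), max(o[name] for o in occurrences))
        for name in layers
    }


def predict(model: Envelope, site: dict) -> bool:
    """A site is suitable if every layer value falls inside the envelope."""
    return all(lo <= site[name] <= hi for name, (lo, hi) in model.items())


def evaluate_model(model: Envelope, presences: List[dict], absences: List[dict]) -> float:
    """Model test: fraction of held-out points classified correctly."""
    correct = sum(predict(model, p) for p in presences)
    correct += sum(not predict(model, a) for a in absences)
    return correct / (len(presences) + len(absences))


def project(model: Envelope, sites: Dict[str, dict]) -> Dict[str, bool]:
    """Model projection: apply the envelope to a set of prediction layers."""
    return {site_id: predict(model, env) for site_id, env in sites.items()}


if __name__ == "__main__":
    occurrences = [
        {"temp": 10, "rain": 800},
        {"temp": 14, "rain": 1200},
        {"temp": 12, "rain": 1000},
    ]
    model = create_model(occurrences, ["temp", "rain"])
    accuracy = evaluate_model(model, [{"temp": 11, "rain": 900}], [{"temp": 25, "rain": 300}])
    future = {"site A": {"temp": 13, "rain": 1100}, "site B": {"temp": 16, "rain": 900}}
    print(model, accuracy, project(model, future))
```

Even in this toy, each input (occurrences, environmental layers, prediction layers) and the algorithm itself would come with its own license or permission, which is the point the slide makes about problems at every step.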

In a recent EU BON study
Only 35% of surveyed datasets
(wider scope than just GBIF) are
accessible under an open license or
waiver, without restriction on use
For 29 scientific questions relating to
needs of European environmental
policy, the availability of datasets to
answer the questions is in the range
‘satisfactory’ (3) to ‘poor’ (2)

Multiple initiatives to make data more accessible;
some are general purpose
https://rd-alliance.org/
… builds the social and technical bridges that enable open sharing of data …
researchers and innovators openly sharing data across technologies, disciplines,
and countries to address the grand challenges of society.
http://www.datafairport.org/
… successful community supported conventions, policies and practices for data
identifiers, formats, checklists and vocabularies that enable data interoperability,
citation and stewardship.
ORCID and DataCite initiatives to uniquely identify (respectively) scientists and data sets
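
As one concrete example of what a unique identifier buys a workflow, an ORCID iD ends in a check character computed with the ISO 7064 MOD 11-2 algorithm, so a tool can reject a mistyped iD before attributing data to the wrong person. The function names below are ours; the sample iD 0000-0002-1825-0097 is the fictitious "Josiah Carberry" record that ORCID uses in its own documentation.

```python
def orcid_check_char(base15: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, digits and check character of a hyphenated ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_char(compact[:15]) == compact[15].upper()


if __name__ == "__main__":
    for candidate in ("0000-0002-1825-0097", "0000-0002-1825-0098"):
        print(candidate, is_valid_orcid(candidate))
```

A valid checksum only shows that the string is well formed; confirming that the iD belongs to a particular person still requires a lookup against the ORCID registry.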

Some are more domain specific
Promoting free and open access
to biodiversity information
A framework to focus
effort and investment
to deliver biodiversity
knowledge more
effectively
www.biodiversityinformatics.org/
www.bouchout-declaration.org

A shared and maintained multi-purpose network of
computationally-based processing services in an open
data domain
Image: CoolDesign / FreeDigitalPhotos.net
With 78 contributors, we
published the whitepaper in
April 2013; it has since been viewed
more than 34,000 times.

Building a heterogeneous Service Network
Users’ workflows and
applications
Sustained Service and
Data Providers
GBIF, CoL, OBIS, WoRMS,
EMBL-EBI, BGBM, CRIA, EoL,
BHL, ALA, LTER and more.
www.biodiversitycatalogue.org
Recognised and stable
Infrastructure Providers
National, EGI.eu, PRACE,
commercial, EUDAT, etc.

Preparing the next, coordinated steps
Diagram from LinkD Concept Note, September 2014

LinkD
Develop the highly responsive digital framework required to enable high
throughput research and support science of scale towards the long term vision of
modelling Life on Earth
LinkD: Science of Scale for Life on Earth
What do we want to do in LinkD?
ELODINS ENVRI+
From slides by Vince Smith, LinkD proposal coordinator, Natural History Museum, London

Take home message: “It’s a journey”
• Accessible data is the enabler of “in-silico” science
that leads towards predicting the biosphere
• A shared multi-purpose network of processing
services, sitting on top of open data is the route to
interoperability
• Working together as a community is essential
Photo: A lone farmer walks among rice paddies. © DFATD-MAECD/Tick Collins
