Introduction to Open Science and EOSC
1. www.geant.org
01/04/2022 1
Sarah Jones
EOSC Engagement Manager
sarah.jones@geant.org
Twitter: @sarahroams
Predictive Epigenetics PEP-NET training network
1st April 2020
“science carried out and communicated in a manner which
allows others to contribute, collaborate and add to the research
effort, with all kinds of data, results and protocols made freely
available at different stages of the research process.”
Research Information Network, Open Science case studies
www.rin.ac.uk/our-work/data-management-and-curation/open-science-case-studies
Defining Open Science
• Free, immediate, online access to the results of research
• Two routes to make sure anyone can access your papers
– Gold route: paying article processing charges (APCs) so that the publisher makes the copy open
– Green route: self-archiving Open Access copy in repository
• Find out what your publisher allows on SHERPA RoMEO
– www.sherpa.ac.uk/romeo
Open access to publications
Open data
make your stuff available on the Web (whatever format) under an open licence
make it available as structured data (e.g. Excel instead of a scan of a table)
use non-proprietary formats (e.g. CSV instead of Excel)
use URIs to denote things, so that people can point at your stuff
link your data to other data to provide context
Tim Berners-Lee’s proposal for five star open data - http://5stardata.info
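The five levels above can be sketched as a simple rating function. This is only an illustration: the `Dataset` class, its field names and the `star_rating` function are hypothetical, and the sketch assumes the stars are cumulative (each star requires all earlier ones).

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # Hypothetical flags, one per level of the five star scheme.
    open_licence: bool     # on the Web (any format) under an open licence
    structured: bool       # structured data, e.g. a spreadsheet rather than a scan
    non_proprietary: bool  # open format, e.g. CSV instead of Excel
    uses_uris: bool        # URIs denote things, so others can point at them
    linked: bool           # linked to other data to provide context


def star_rating(ds: Dataset) -> int:
    """Count stars in order, stopping at the first unmet criterion."""
    stars = 0
    for met in (ds.open_licence, ds.structured, ds.non_proprietary,
                ds.uses_uris, ds.linked):
        if not met:
            break
        stars += 1
    return stars


# A CSV file published under an open licence, without URIs or links.
csv_on_web = Dataset(True, True, True, False, False)
print(star_rating(csv_on_web))  # 3
```

Under this assumption, a dataset without an open licence scores zero stars however well structured it is.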
“Open data and content can be freely used, modified
and shared by anyone for any purpose”
http://opendefinition.org
• Documenting and sharing workflows and methods
• Sharing code and tools to allow others to reproduce work
• Using web based tools to facilitate collaboration and interaction from the
outside world in your research
• Using tools like MyExperiment and Taverna
Open methods
Reliance on specialist research software
Slide from Neil Chue-Hong, Software Sustainability Institute
Do you use research software?
What would happen to your research without software?
56% develop their own software
71% have no formal software training
Survey of researchers from 15 UK Russell Group universities conducted by SSI between August and October 2014. DOI: 10.5281/zenodo.14809
Degrees of openness
Open: content that can be freely used, modified and shared by anyone for any purpose (five star open data)
Restricted: limits on who can use the data, how or for what purpose
- Charges for use
- Data sharing agreements
- Restrictive licences
- Peer-to-peer exchange
- …
Closed
- Unable to share
- Under embargo
• FAIR ≠ Open
• FAIR ensures data can be found, understood and reused
• Data can be shared under restrictions & still be FAIR
"As open as possible, as closed as necessary"
And what is FAIR?
Image CC-BY-SA by SangyaPundir Image CC-BY by European Commission FAIR data expert group
What FAIR means: 15 principles
Findable
F1. (meta)data are assigned a globally unique and eternally
persistent identifier.
F2. data are described with rich metadata.
F3. (meta)data are registered or indexed in a searchable resource.
F4. metadata specify the data identifier.
Interoperable
I1. (meta)data use a formal, accessible, shared, and broadly
applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles.
I3. (meta)data include qualified references to other (meta)data.
Accessible
A1 (meta)data are retrievable by their identifier using a
standardized communications protocol.
A1.1 the protocol is open, free, and universally implementable.
A1.2 the protocol allows for an authentication and authorization
procedure, where necessary.
A2 metadata are accessible, even when the data are no longer
available.
Reusable
R1. (meta)data have a plurality of accurate and relevant attributes.
R1.1. (meta)data are released with a clear and accessible data
usage license.
R1.2. (meta)data are associated with their provenance.
R1.3. (meta)data meet domain-relevant community standards.
Slide CC-BY by Erik Schultes, Leiden UMC
doi: 10.1038/sdata.2016.18
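Principle A1 (retrievable by identifier using a standardized protocol) can be illustrated with a DOI resolved over HTTPS. The sketch below only builds the request and does not send it. The helper name `metadata_request` is hypothetical, and the `Accept` media type assumes the DOI's registration agency supports content negotiation for CSL JSON.

```python
from urllib.request import Request


def metadata_request(doi: str) -> Request:
    """Build (but do not send) an HTTPS request for a DOI's metadata."""
    return Request(
        f"https://doi.org/{doi}",
        # Assumption: the registration agency serves CSL JSON on request.
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )


# The DOI of the FAIR principles paper cited on this slide.
req = metadata_request("10.1038/sdata.2016.18")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the retrieval. The identifier and the open protocol are all a client needs, which is the point of A1 and A1.1.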
The FAIR data principles explained
• Clarifications from GO FAIR
• Each principle is a link to
further clarification,
examples and context
https://www.go-fair.org/fair-principles
R1. (Meta)data are richly described with a plurality of accurate and relevant
attributes
• By giving data many ‘labels’, it will be much easier to find and reuse the data.
• Provide not just metadata that allows discovery, but also metadata that richly
describes the context under which that data was generated
• “plurality” indicates that metadata should be as generous as possible, even to the
point of providing information that may seem irrelevant.
• Findable
- Persistent Identifier
- Metadata online
• Accessible
- Data online
- Restrictions where needed
• Interoperable
- Use standards, controlled vocabs
- Common (open) formats
• Reusable
- Rich documentation
- Clear usage licence
FAIR data checklist
https://doi.org/10.5281/zenodo.5111307
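The checklist above could be applied to a metadata record along these lines. This is a minimal sketch: the field names in `CHECKLIST`, the `missing_items` function and the example record (with placeholder example.org addresses) are all hypothetical and do not come from any particular metadata standard.

```python
# Hypothetical mapping from each FAIR heading to the record fields it needs.
CHECKLIST = {
    "Findable": ("identifier", "metadata_url"),
    "Accessible": ("data_url", "access_conditions"),
    "Interoperable": ("standards", "formats"),
    "Reusable": ("documentation", "licence"),
}


def missing_items(record: dict) -> dict:
    """Return, per FAIR heading, the checklist fields that are absent or empty."""
    gaps = {}
    for principle, fields in CHECKLIST.items():
        absent = [name for name in fields if not record.get(name)]
        if absent:
            gaps[principle] = absent
    return gaps


# Illustrative record with no usage licence stated.
record = {
    "identifier": "https://example.org/id/dataset-1",
    "metadata_url": "https://example.org/catalogue/dataset-1",
    "data_url": "https://example.org/files/dataset-1.csv",
    "access_conditions": "open",
    "standards": ["hypothetical community vocabulary"],
    "formats": ["CSV"],
    "documentation": "README.txt",
}
print(missing_items(record))  # {'Reusable': ['licence']}
```

A check like this only confirms that a field is filled in. Whether the documentation is rich enough or the licence is suitable still needs human judgement.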
• Various research communities have been sharing their
data in a ‘FAIR’ way long before the term emerged
• Meaningful and memorable articulation of concepts
• Natural desire to want to be ‘fair’
• FAIR is gaining significant international traction
FAIR is nothing new
A study that analysed the citation counts of 10,555 papers on gene expression
studies that created microarray data, showed:
“studies that made data available in a public repository
received 9% more citations than similar studies for
which the data was not made available”
Data reuse and the open data citation advantage,
Piwowar, H. & Vision, T. https://peerj.com/articles/175
Get a citation advantage
Increased use and economic benefit
The case of NASA Landsat satellite imagery of the Earth’s surface:
http://earthobservatory.nasa.gov/IOTD/view.php?id=83394&src=ve
Up to 2008
• Sold through the US Geological Survey for US$600 per scene
• Sales of 19,000 scenes per year
• Annual revenue of $11.4 million
Since 2009
• Freely available over the internet
• Google Earth now uses the images
• Transmission of 2,100,000 scenes per year
• Estimated to have created value for the environmental management industry of $935 million, with direct benefit of more than $100 million per year to the US economy
• Has stimulated the development of applications from a large number of companies worldwide
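The Landsat figures can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones quoted on this slide.

```python
price_per_scene_usd = 600                # sale price up to 2008
scenes_sold_per_year = 19_000            # sales up to 2008
scenes_transmitted_per_year = 2_100_000  # free distribution since 2009

# Revenue under the paid model: price times volume.
annual_revenue_usd = price_per_scene_usd * scenes_sold_per_year

# How much usage grew once the imagery became free.
usage_growth = scenes_transmitted_per_year / scenes_sold_per_year

print(f"Annual revenue up to 2008: ${annual_revenue_usd:,}")       # $11,400,000
print(f"Scenes per year grew roughly {usage_growth:.0f}-fold")     # 111-fold
```

The estimated direct benefit of more than $100 million per year is therefore at least about nine times the revenue that was given up.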
“Open Research Europe requires open
access to research data supporting
articles under the principle ‘as open
as possible, as closed as necessary’,
according to the policy of Horizon
Europe. Data should be deposited in
trusted data repositories.”
Funder imperatives...
https://open-research-europe.ec.europa.eu/for-authors/data-guidelines#opendata
But there are also opportunity costs
By Emilio Bruna
http://brunalab.org/blog/2014/09/04/the-opportunity-cost-of-my-openscience-was-35-hours-690
For his paper he calculated the following:
1. Double checking the main dataset and
reformatting to submit to Dryad: 5 hours
2. Creating complementary file and preparing
metadata: 3 hours
3. Submission of these two files and the
metadata to Dryad: 45 minutes
4. Preparing a map of the locations: 1 hour
5. Submission of map to Figshare: 15 minutes
6. Cleaning up and documenting the code,
uploading it to GitHub: 25 hours
7. Cost of archiving in Dryad: US$90
8. Page Charges: $600
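The totals in the blog post title (35 hours and $690) follow directly from the items listed. A short sketch of the sum, with illustrative labels for each item:

```python
# Time spent per task, in minutes (figures from the list above).
task_minutes = {
    "check and reformat main dataset for Dryad": 5 * 60,
    "complementary file and metadata": 3 * 60,
    "submission to Dryad": 45,
    "map of locations": 60,
    "submission of map to Figshare": 15,
    "clean up, document and upload code to GitHub": 25 * 60,
}

# Direct charges in US dollars.
charges_usd = {"Dryad archiving": 90, "page charges": 600}

total_hours = sum(task_minutes.values()) / 60
total_usd = sum(charges_usd.values())
print(f"{total_hours:g} hours and ${total_usd}")  # 35 hours and $690
```

Cleaning and documenting the code accounts for 25 of the 35 hours, so most of the time cost comes from the code rather than the data.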
• EC and Member States committed to FAIR and Open
• Pursue this in research policy and grant conditions
• Lots of investment in infrastructure to support data sharing
• Ultimately supports the science ecosystem and ensures
greater return on investment
FAIR and Open both central to EOSC
• Collaboration between European
Commission and Member States to
“make Open Science the new normal”
• Established EOSC Association as legal
entity to govern and oversee the
implementation
• Huge investment in infrastructure – €350 million in initial development phase and at least €1 billion co-investment foreseen for next 7 years
Large EC initiative
Diagram labels: EOSC Association, Steering Board, European Commission
28. Long history of political agreements and activity
Lots of groundwork since 2015
• Council Conclusions
• Expert Group reports
• EC documents
• Major investment in EOSC
related projects to develop the
infrastructure and platform
• A web of FAIR data and services
• Federation of eInfra and Research
Infrastructures (RIs)
• Environment in which data can be
brought together with services to
perform analyses and address
societal challenges
The EOSC platform
FAIR is central to principles in EOSC
• Is the glue that connects data & services
• Requirement for FAIR to support reuse
• Use community standards
• Share all types of output (openly)
• Currently the primary resource for
navigating EOSC
• https://eosc-portal.eu
• Includes a virtual tour for new users
• Catalogue and marketplace is how
you discover, access and compose
resources
EOSC Portal
37. Access to free storage, compute and support services
C-SCALE will federate compute and data resources from the Copernicus DIAS, the national Collaborative Ground Segments and the European Open Science Cloud (EOSC) towards a European open source Big (Copernicus) Data Analytics platform:
- Storage services: up to 12 PB
- Cloud services: up to 17,728,500 CPU hours
- HPC/HTC services: up to 3,100,000 CPU hours
- GPU services: up to 6,000 GPU hours
DICE makes available a set of data management services (and associated resources) for researchers and research communities from any scientific domain including:
- Data archives (up to 25 PB)
- Policies based data archives (up to 17 PB)
- Personal and project workspaces (up to 5 PB)
- Data repository services for data sharing (up to 8 PB)
- Data discovery services (with PID and DOI services and metadata harvesting)
EGI-ACE will deliver the EOSC Compute Platform and will contribute to the EOSC Data Commons. Services offered include: compute and storage resources, compute platform services, data management services and related user support and training.
The total capacity that EGI-ACE makes available through the call between 2021-2023 is:
- 80,000,000 CPU hours
- 250,000 GPU hours
- 20 PB storage
- support to Argos DMP service by drafting discipline specific DMPs, Horizon Europe DMP support
- set your own community research gateway (connect.openaire.eu) and Zenodo communities
- access open science metrics for your projects, institution, community
- service to anonymise your data and comply with GDPR
- support and mentoring on Horizon Europe open access mandates
Provides three core services for Research Lifecycle Management:
- ROHub: tool to facilitate the exchange of information across the scientific community.
- Text Enrichment and Mining: service which automatically extracts valuable information and metadata from bibliographic sources and other text documents
- Datacube technology for Earth Observation (EO) data management: efficient access to extensive collections of multi-temporal and multi-dimensional EO imagery, also allowing interoperability among the different information layers.
https://marketplace.eosc-portal.eu
EOSC Future is using AI techniques to make recommendations to users:
• relevant projects, data, publications, training materials
• potential collaborators (people, task forces, communities)
Recommendations based on
• viewing history
• order history
• general popularity
• popularity among users with
a similar background/interests
Recommendations for users
• Federated identity management – ease of single sign on
• Access to a greater number of services
• Funding provided to pay for compute e.g. EGI-ACE, DICE
• Discovery of related data from other disciplines / sectors
• Greater ability to collaborate and address key research
questions
Benefits of EOSC for researchers
1. Choose your dataset(s)
– What can you make open? You may need to revisit this step if you
encounter problems later.
2. Apply an open license
– Determine what IP exists. Apply a suitable licence e.g. CC-BY
3. Make the data available
– Provide the data in a suitable format. Use repositories.
4. Make it discoverable
– Post on the web, register in catalogues…
How to make data open?
https://okfn.org
• Look for provision from your community, university, publisher, funder etc
• Check they match your particular data needs: e.g. formats accepted;
mixture of Open and Restricted Access.
• See if they provide guidance on how to cite the deposited data.
• Do they assign a persistent & globally unique identifier for sustainable
citations and to link back to particular researchers and grants?
• Look for certification as a ‘Trustworthy Digital Repository’ with an explicit
ambition to keep the data available in long term.
How to select a repository?
Metadata Standards Directory
Broad, disciplinary listing of standards
and tools. Maintained by RDA group
http://rd-alliance.github.io/metadata-directory
Use metadata standards
FAIRsharing
• A portal of data standards,
databases, and policies
• Focused on life, environmental
and biomedical sciences
https://fairsharing.org
If you want your data to be re-used and sustainable in the long term, you typically want to opt for open, non-proprietary formats.
Choose appropriate file formats
Type | Recommended | Avoid for data sharing
Tabular data | CSV, TSV, SPSS portable | Excel
Text | Plain text, HTML, RTF; PDF/A only if layout matters | Word
Media | Container: MP4, Ogg; Codec: Theora, Dirac, FLAC | Quicktime, H264
Images | TIFF, JPEG2000, PNG | GIF, JPG
Structured data | XML, RDF | RDBMS
Further examples:
https://ukdataservice.ac.uk/learning-hub/research-data-management/format-your-data/recommended-formats
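The format recommendations above could be turned into a quick check on files before deposit. This is a rough sketch: the function `format_advice` is hypothetical, and the mapping from formats to file extensions (for example `.por` for SPSS portable, `.jp2` for JPEG2000, `.mov` for Quicktime) is an assumption made for illustration. Codecs and PDF/A cannot be judged from an extension alone, so they are left out.

```python
from pathlib import Path

# Assumed extensions for the formats in the "Recommended" column.
RECOMMENDED = {".csv", ".tsv", ".por", ".txt", ".html", ".rtf", ".mp4",
               ".ogg", ".flac", ".tif", ".tiff", ".jp2", ".png", ".xml", ".rdf"}

# Assumed extensions for the formats in the "Avoid for data sharing" column.
AVOID = {".xls", ".xlsx", ".doc", ".docx", ".mov", ".gif", ".jpg", ".jpeg"}


def format_advice(filename: str) -> str:
    """Classify a file by extension as 'recommended', 'avoid' or 'unknown'."""
    suffix = Path(filename).suffix.lower()
    if suffix in RECOMMENDED:
        return "recommended"
    if suffix in AVOID:
        return "avoid"
    return "unknown"


for name in ("results.csv", "results.XLSX", "notes.pdf"):
    print(name, "->", format_advice(name))
```

A file flagged as "avoid" is a prompt to export an open copy alongside it (for example CSV next to Excel), not necessarily to discard the original.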
More on life science tools
and infrastructure coming
up in Susanna’s talk
Image: Sangharsh Lohakare https://unsplash.com/photos/Iy7QyzOs1bo
Journal prices have outpaced inflation by more than 250% over the past 30 years
There are 15 entire disciplines where the average price of one journal for one year is over £1000 (chemistry £4227, physics £3229). One journal, Tetrahedron, costs over £40,000.
It is irrational that scientists are paid by government to do research and the papers are then locked away behind paywalls. Journals don’t do the research, employ the people or pay the reviewers.
In the last four years, we have investigated and understood the challenges of the UK research community.
Anecdotally, we had a lot of evidence from people working in this area that researchers relied on software, but no studies had been conducted. So we did this ourselves.
There were two areas of interest: do you use software and, possibly more important, what would happen to your research without software? This is 170,000 researchers in the UK who could not conduct their research without software.
This is more than just a reliance on Word or web browsers – specialist software is written into the research workflows of people from psychology to physics, from the life sciences to literature. The reliance isn’t confined to the “traditionally” computationally intensive subjects, it’s a feature of all disciplines.
This means that 140,000 researchers are relying on their own coding skills.
Certain research communities have also seen the benefit of sharing data as it speeds up the process of discovery. This article shows how researchers in the field of Alzheimer’s research have agreed as a community to share data immediately to make scientific breakthroughs.
There’s also a citation advantage for individual researchers. This study by Heather Piwowar and Todd Vision looked at 10,555 papers on gene expression studies that created microarray data. Those studies that shared the data received 9% more citations.
There’s also an economic benefit, as seen in the case of the NASA Landsat satellite images. These were sold until 2008 for $600 a scene. Now they’re freely available and used by Google Earth. Previously 19,000 scenes were sold a year, whereas now 2.1 million are transmitted. The value has risen sharply too, from $11.4 million in annual revenue to an estimated value of $935 million, with direct benefit of more than $100 million per year. The release has also stimulated the development of applications from companies worldwide.
This case study comes from the Royal Society Report on Science as an Open Enterprise.
The background to this is about making the most of the data that has been created through publicly funded research. The guidelines speak of:
Improved quality of results
Greater efficiency
Faster to market = faster growth
Improved transparency of the scientific process
It’s not all positive though, otherwise why isn’t everyone already doing this? There is a certain amount of effort and cost to open science, which this blog post by Emilio Bruna highlights. He calculated the cost of sharing his data for one paper and came to a total of 35 hours and $690. He breaks this down into the cost of preparing the dataset, creating complementary metadata and associated files, cleaning up and documenting the code (which involves a big mental leap), and the charges applied.
Still a question we are asking ourselves but some commonality of vision is sticking.
I like this picture as it represents some of that for me:
Federation of services
Interconnecting / interoperable
User in the centre
Greenfield site? Open to ideas / creativity?
Guidance from the DCC can also help researchers to understand data licensing. This guide outlines the pros and cons of each approach e.g. the limitations of some CC options
The OA guidelines under Horizon 2020 point to CC-0 or CC-BY as a straightforward and effective way to make it possible for others to mine, exploit and reproduce the data. See p11 at: http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf