2. The State of Open Research Data
@AmyeKenall
Work at the Open Access publisher BioMed Central…
Spearheading Open Data initiatives and policy
As well as new forms of credit
3. The State of Open Research Data
Open data is data that can be freely
used, shared and built-on by anyone,
anywhere, for any purpose.
4. The State of Open Research Data
Open
Data
Community
Funding
People
Tools
• Training and
mentorship
• Dialogue
• Professional Dev
• Incentives
• Best practice
• Behavior
• Open tools and
standards
• Repositories
• Policies
• Recognition
6. The State of Open Research Data
Open
Data
Community
Funding
People
Tools
Funding
NIH expects all funded investigators to adhere to the GDS
Policy... Failure to comply with the terms and conditions of
the funding agreement could lead to enforcement actions,
including the withholding of funding, consistent with 45
CFR 74.62 and/or other authorities, as appropriate.
7. The State of Open Research Data
Table. Funder policy and support for published research outputs and
data
8. The State of Open Research Data
Limited funding
Incentives
Sustainability of infrastructure
What’s still needed?
9. The State of Open Research Data
Open
Data
Community
Funding
People
Tools
Community
10. The State of Open Research Data
Helping researchers leverage the power of the Open Web;
Fostering a sustainable community of practitioners through
learning around Open Source, Open Data; Empowering
others to lead in their communities
2013
13. The State of Open Research Data
More resource for training
Integration of data/software training into institutes
Participation
What’s still needed?
14. The State of Open Research Data
Open
Data
Community
Funding
People
Tools
Tools
15. The State of Open Research Data
Infrastructure
2009
16. The State of Open Research Data
The bioCADDIE team will develop a data discovery index (DDI)
prototype which will index data that are stored elsewhere. The
DDI will play an important role in promoting data integration
through the adoption of content standards and alignment to
common data elements and high-level schema.
Better discovery
17. The State of Open Research Data
Data Journals
Getting credit for your data
18. The State of Open Research Data
Searchable index of supplementary data
Tagged data citations
Better use of metadata for discoverability by tools
Linked infrastructure through unique identifiers (eg ORCiDs, DOIs)
Convenience
Linked Open Data as standard
What’s still needed?
19. The State of Open Research Data
Open
Data
Community
Funding
People
Tools
People
20. The State of Open Research Data
• 3,000 Rice Genomes Project
• Neurodata without Borders
• Alzheimer’s Disease Neuroimaging Initiative
Increase in Consortia
More community buy-in – OpenCon, open
science sessions
Institutional involvement
21. The State of Open Research Data
“Commons” Buy-in
Trust
Reuse
Data citation
Use of metadata standards
What’s still needed?
22. The State of Open Research Data
• Need trust, not as easy as creating new
incentives or mandating policy
• Needs to be community driven
• Tools can change behavior
• If funders want open data, they need to
fund behavior change – training, etc.
Open
Data
Community
Funding
People
Tools
What we’ve learned…
23. The State of Open Research Data
@AmyeKenall
amye.kenall@biomedcentral.com
Questions?
What drives open data in your work
and how can this be improved?
Speaker notes
I’m going to talk to you about the State of Open Data. To do that, I’m going to touch upon the infrastructure that drives open data—asking where we’re at and what’s still needed in the key areas of that infrastructure. To finish, I’d like to turn the question to you: What drives open data in your own work? How do you feel you could be better supported and better support open data?
Before we get started, I wanted to tell you a bit about myself. I work at the open access publisher BioMed Central. We started publishing open access 15 years ago and were the first open access publisher at the time. We publish journals across the life sciences and medicine. At BioMed Central I work across our journals to spearhead initiatives around open data and open science generally. This includes partnering with repositories to make open data easy for our authors, as well as exploring new avenues for credit for those who do more of the technical work on data and might never make it to first or last author.
One more caveat: When I say “open data” in this presentation, I’m using the Open Knowledge Foundation definition.
“Open data is data that can be freely used, shared and built-on by anyone, anywhere, for any purpose.”
Open data can span all the sciences, though when it includes medical or personal data, it might mean restricted access or anonymised data.
For more of an introduction to open data, I would suggest you explore the Open Knowledge Foundation and Ross Mounce’s Intro to Open Data webcast.
What do we mean when we talk about the state of open data? What we’re really asking about is the infrastructure that drives the open data commons. By commons I don’t just mean individual data sets and data repositories, but also metadata standards. I mean understanding data as infrastructure to scientific research.
In this slide, I’ve identified four modules to that infrastructure. Tools, Community, Funding and People.
Although I’ll get to people last, it’s important to note that people, i.e. researchers, have the most power to effect change in the state of open data. But that change has to come from a shift to a “commons” mentality – a mentality we saw happen in the open source movement: I’ll put in what I make so I can also get things out. It’s a perspective that values the commons as a structure for research objects. Up until now, that has not been the perspective driving individual researchers.
This infrastructure is not set in stone but evolving and malleable.
There are some important dates and events that have to be considered as precursors to the current state of open data.
The Internet and the Open Source movement are really the foundation. These provide not just the digital resources to make open data possible but also an example of a successful “commons” at work.
I’ve also put on here the establishment of the Bermuda Principles in 1996 and the move to Open Access for research articles. The Bermuda Accord was the first and largest community agreement within the life sciences to establish prepublication, rapid release of DNA sequence data. Meanwhile, open access, out of all the open science movements, has really fuelled a new model for the communication of science and acts as an example of what a commons resource can be.
For this talk, I’m looking back from 2009 – I’ve chosen this date because 2009 is when Obama issued a statement making all information produced by a federal entity open and machine readable. This is when governments, not just the USA but also Europe and others, started looking back at the open source movement and what it had done and had a eureka moment, seeing the transformative power of the “commons” in the world of the Internet. It’s this idea of a “commons” that begins to transform research into open research. This is important because it says to me that the decision has been made. Government has chosen an open future for research. It is now waiting on researchers, publishers, and federal agencies to catch up.
[2004—OECD Declaration on Access to Research Data from Public Funding]
Starting with those agencies—the funders that first began implementing data policies were those that also hosted data centres, such as the Natural Environment Research Council (NERC), and they began to do so in the late 1990s.
In May 2013 Obama issued an executive order saying all data/information coming from government needs to be machine readable and open for reuse. The statement from 2009 would finally be put into action. All federal agencies had to report on exactly how they would do this. As a result the NIH has brought on “data tsar” Phil Bourne, and the White House has brought in its first Chief Data Scientist as Deputy Chief Technology Officer for Data Policy.
Several quite strong funder data policies have followed, such as the NIH Genomic Data Sharing Policy (Aug 2014). The NIH policy explicitly says failure to comply with the data deposition policy can lead to enforcement actions, such as funds being withheld. This strong stance is not surprising given the $965 billion in economic impact the investment in the Human Genome Project has fuelled. More recently, the Gates Foundation has also delivered a strong data deposition policy.
With data policy being nearly ubiquitous across funders, it’s worth looking at what is missing here—as we’re clearly far from living in an open data commons.
This table shows funder policy and support for data outputs. Looking at a few UK funders, though the picture is quite similar globally, we can see that while policy coverage is well established, as is monitoring, support and enforcement are lacking.
This points to the issues that aren’t addressed in funder policy alone.
Funding has limits (training competes with priorities like cancer research; cleaning data is costly, and including all its costs in a proposal would not get funded), incentives still aren’t there (whether you made your data open is never a deciding factor in whether you get a grant), and there is a lack of ownership and sustainability.
Some of these have been addressed by a few key community projects that have come about in the past three years.
Mozilla Science lab was launched in 2013. It aims to make the infrastructure of the Web work for science. That means helping researchers leverage the power of the Web to do research. Importantly, it also means training and fostering a sustainable community around open science to let others lead.
In practical terms, this has led to the launch of a few key projects and outreach activities. Sometimes this means building tools, but importantly these tools are built by the community—enabling community ownership.
There is a problem with author contributorship and credit for data work.
You click on an article and can see the open badge flag. This takes you to a list of all the badges claimed by the authors. Clicking on an author will also take you to his/her ORCiD, where the badge data also live.
These badges can not only spur collaboration and smarter hiring, using this data to make it easier to find people with the skills you need, but also take back credit and recognition from broken mechanisms like the Impact Factor and put them back in the hands of researchers.
We can use this same infrastructure to create badges around Open Data and Open Code and feed this data to funders to help incentivise researchers.
Two other community organisations with wide impact have been Software Carpentry and its sister organisation Data Carpentry. These teach researchers the computational and data analysis skills they need for data-driven research. An instructor comes to your university and hosts a session, usually two days long, around these skills. The organisations use a train-the-trainer model, ensuring it’s really the community that owns the initiative. As with Mozilla Science Lab, the goal is to foster a sustainable open science community.
Another organisation to highlight is Force11, which has been truly community driven and spearheaded data citation over the past two years.
Training is not necessarily at the top of funder priority lists, but it needs to be.
Likewise, institutions must take more ownership of this, not just through their core curriculum, which might be slow to develop, but by working with organisations like Data Carpentry to ensure learning around these skills is agile.
And finally, participation, especially from more senior researchers.
Tools.
When we look at tools for open data, in 2009 the landscape was full of subject-specific repositories, not all of them built sustainably. When “open data” began to be discussed more, many researchers lamented the lack of data infrastructure. Now, however, we have several well-established, sustainable general data repositories – like Dryad and Figshare – and new players are coming to the market, with Mendeley recently releasing a beta version of its data repository.
In my opinion the biggest game changer in terms of tools in the open data space would be the NIH bioCADDIE project to make data discoverable: a PubMed for data.
The bioCADDIE team will develop a data discovery index (DDI) prototype which will index data that are stored elsewhere. The DDI will play an important role in promoting data integration through the adoption of content standards and alignment to common data elements and high-level schema.
At present, the two biggest problems for open data have been discoverability and reuse (clearly quite linked), and bioCADDIE will be able to show us what research looks like in an open data commons.
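The discovery-index idea in the quote above can be sketched in miniature: the index holds only standardised metadata plus pointers to data held elsewhere, so one query spans many repositories. The field names, records, and URLs below are hypothetical illustrations, not the actual bioCADDIE schema.

```python
# Minimal sketch of a data discovery index (DDI): it indexes metadata
# about datasets while the data themselves stay in their home
# repositories. All field names and records here are invented.

INDEX = [
    {
        "id": "dataset-001",
        "title": "Rice genome variant calls",           # hypothetical record
        "keywords": ["genomics", "rice"],
        "repository": "https://example.org/repo/001",   # data live elsewhere
    },
    {
        "id": "dataset-002",
        "title": "Alzheimer's neuroimaging scans",      # hypothetical record
        "keywords": ["neuroimaging", "alzheimers"],
        "repository": "https://example.org/repo/002",
    },
]

def discover(keyword):
    """Return repository links for records whose metadata match a keyword."""
    kw = keyword.lower()
    return [
        rec["repository"]
        for rec in INDEX
        if kw in rec["keywords"] or kw in rec["title"].lower()
    ]
```

Because every record shares the same high-level elements, a single search reaches datasets scattered across repositories – the integration benefit that adopting common data elements is meant to deliver.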
Finally, we’ve seen a response from publishers to funder mandates to deposit data in the form of data journals. These journals publish “data notes” or “data descriptors” – usually 3–6 pages long, focussing just on the validation and potential use of the data, not the analysis and conclusions of a study. This enables authors both to ensure the rapid release of their data and to get credit for it, even before the publication of their full findings.
But several things are still needed in this space:
Searchable index of supplementary data
Tagged data citations
Better use of metadata for discoverability by tools
Linked infrastructure through unique identifiers (eg ORCiDs, DOIs)
Convenience
Linked Open Data as standard
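As a sketch of what a “tagged” data citation linked through unique identifiers could look like, the snippet below keeps the citation as structured metadata (fields in the general shape of a DataCite record) and renders both a human-readable reference and a machine-readable tag keyed by DOI. The dataset, authors, and DOI are invented for illustration.

```python
# Sketch: a data citation carried as structured metadata rather than
# free text, so tools can index it. The record below is made up;
# the fields loosely follow the shape of a DataCite citation.

citation = {
    "creators": ["Smith, J.", "Jones, A."],
    "year": 2014,
    "title": "Example rice genome dataset",
    "publisher": "Example Data Repository",
    "doi": "10.1234/example.5678",  # hypothetical DOI
}

def format_citation(c):
    """Render the structured record as a human-readable reference."""
    authors = "; ".join(c["creators"])
    return (f"{authors} ({c['year']}): {c['title']}. "
            f"{c['publisher']}. https://doi.org/{c['doi']}")

def citation_tag(c):
    """Machine-readable tag: the persistent identifier plus a type."""
    return {"@type": "Dataset", "@id": f"https://doi.org/{c['doi']}"}
```

Because the DOI is a persistent unique identifier, the same tag can link the dataset to its creators’ ORCiD records and to the articles that cite it – the linked infrastructure the list above asks for.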
What do I mean by people? I mean researchers. I also mean institutions and libraries. The people working in research. The people are key to embracing an open data commons.
If we look at People – from the open data perspective, they’ve changed a lot since 2009.
More consortia projects.
More people leading open science meetings and sessions at conferences.
More libraries involved in data management for researchers.
But unlike these other spaces, a lot more is still needed from people.
But how do you effect change in individuals?
The three key things are a commons mentality, trust, and reuse. On a practical level, we need more researchers to get into the practice of citing data and producing “cleaner” data.
In conclusion, what have we learned since 2009 when Obama first said all government outputs, including the NIH, need to be open and machine readable, including data?
We need trust; it is not as easy as creating new incentives or mandating policy (the problem is in implementation)
It needs to be community driven
Tools can change behavior
If funders want open data, they need to fund behavior change – training, etc