24. Piwowar et al., “Sharing Detailed Research Data Is
Associated with Increased Citation Rate”
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0000308
25. Climate Archæology
de la Mare, William K., 1997, "Abrupt mid-twentieth-century decline in Antarctic sea-ice
extent from whaling records", Nature, vol. 389, pp. 87-90, 4 Sept 1997
28. Australian National Data Service (ANDS)
An initiative of the Australian Government being
conducted as part of the National Collaborative
Research Infrastructure Strategy ($A24M) and the
Super Science Initiative ($A48M)
A collaboration between Monash University, the
Australian National University and CSIRO
Nearly 50 staff, funded to mid 2013
More researchers re-using more data more often
Data as a first-class object
ands.org.au
46. Identify: Journal Demo
• http://dx.doi.org/10.1016/j.yqres.2010.04.004
• “Elsevier and PANGAEA (Publishing Network for
Geoscientific & Environmental Data) announced their
next step in interconnecting the diverse elements of
scientific research. Elsevier articles at ScienceDirect are
now enriched with graphical information linking to
associated research data sets that are deposited at
PANGAEA. This enrichment functionality offers a
blueprint of how Elsevier would like to work with data
set repositories all over the world [emphasis added].”
http://newsbreaks.infotoday.com/Digest/Elsevier-Enriches-
Articles-With-Research-Data-Sets-69148.asp
Start with a question.
What is the difference between these?
And these?
Thanks to machines like these, we now know that at the genetic level
It’s only 1% of this.
But that’s just genetics. What about culture?
We now know that a range of species (including crows!) are tool users, and they pass on particular techniques
Think of this as a chimpanzee tutorial…
But this sort of transmission of culture doesn’t transcend either time or space. You need to be in the same time and place to learn.
For our species one of the big breakthroughs was the development of language. This now allowed for easier transmission than show and tell, but still didn’t address the time and space problem.
So, where am I going with this? To data of course…
These are data from 7,000 BCE
Each token is a particular value
Initially they were used on their own (a bit like coins today)
Then around 4,000 BCE we see the emergence of these: bullae
Explain: Seal (identify), signs for what is traded, contents as tokens.
Essentially the first written contracts
To avoid having to literally break the contract to see what numbers it contained, the next step was to provide a representation on the outside.
Then in 2900 BCE some genius made the crucial conceptual leap: if we have the numbers in symbolic form on the outside, do we need them in physical form on the inside? Answer: No, and so we get clay tablets. And those strange marks next to the numbers? The very beginnings of pictographic writing…
And then very quickly, the first libraries.
I don’t have time to cover the entire history of writing, but just want to make the point that writing came from the need to capture and manage data. Or to put it another way, much of what we regard as civilisation started with accounting. Any accounting graduates in the audience?
So, let’s fast forward about 45 centuries to the present and look at the state of data in scholarly communication. Unfortunately, it’s inconvenient, imprisoned, invisible, inaccessible, and ignored
Need to retype
Near impossible to liberate. Talk about ChemXSeer example and DataThief Java application
Too transformed
Discipline scientist may know how to get these data but I don’t
Only journal like this I know. Anecdotal evidence that it is hard to get negative papers published
All of the above problems are really about difficulties in getting to the data so it can be re-used.
But why would you want to re-use data?
NOTE: Some of these arguments are at individual, national, global level
Efficiency – don’t reinvent wheel
Validation – repeatability of research
Integrity – of scholarly record
Value for Money – public money funded it, it should be available to public (ClimateGate!)
Self-interest – sharing with a future self, greater visibility
So, what are some good stories around data sharing?
Hubble Space Telescope (HST) operating since 1990
Observations are proposed, and if accepted, data is collected and made available to the proposers – who then write a research paper
Each year around 1,000 proposals are reviewed and approximately 200 are selected, for a total of 20,000 individual observations
Data is stored at the Space Telescope Science Institute and made available after embargo period
GO = General Observation program
AR = Archival Reuse
From Wikipedia: “A DNA microarray is a multiplex technology used in molecular biology. It consists of an arrayed series of thousands of microscopic spots of DNA oligonucleotides, called features, each containing picomoles (10⁻¹² moles) of a specific DNA sequence, known as probes (or reporters). These can be a short section of a gene or other DNA element that are used to hybridize a cDNA or cRNA sample (called target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labeled targets to determine relative abundance of nucleic acid sequences in the target. Since an array can contain tens of thousands of probes, a microarray experiment can accomplish many genetic tests in parallel. Therefore arrays have dramatically accelerated many types of investigation.”
Heather Piwowar looked at the citation history of cancer microarray clinical trial publications
Found that publicly available data was associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin
Climate researchers need to be able to run their models forward (forecasting) and backwards (backcasting) to check they are correct.
The southern limit of whaling is constrained by sea ice, and since 1931 whaling records have been collected for every whale caught. This paper took these records and used them as a proxy for the position of the sea-ice edge.
His analysis indicates that the Antarctic summer sea-ice edge moved southwards by 2.8° of latitude between the mid 1950s and early 1970s
This suggests a decline in the area covered by sea ice of some 25%
Number of initiatives around the world working to do a better job on data: NSF DataNet (Bill at end of conference), JISC Managing Research Data, NL SURF/DANS
I want to talk about one from New Zealand’s West Island…
So, how are we doing this? We’ve got a whole series of programs of activity, but one way to visualise the infrastructure that is needed is to distinguish…
The current picture for Australian (and other) research data
From…
The components that ANDS is adding to produce the ARDC
So, if that is a partial view of the present (Bill will tell you more tomorrow, I’m sure), what about the future?
Talk about how ANDS was a founding member of DataCite. TIB in Germany was another and is providing the data DOIs for this example
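As a rough illustration of what a DOI buys you: the article link on the journal demo slide is just a resolver prefix plus the identifier, and data DOIs are built the same way. A minimal Python sketch, assuming the public doi.org proxy as the resolver (the helper names are hypothetical, and nothing here touches the network):

```python
from urllib.parse import quote, urlsplit

# Assumption: DOIs are resolved through the public doi.org proxy.
RESOLVER = "https://doi.org/"


def doi_from_url(url: str) -> str:
    """Extract the DOI (everything after the host) from a resolver URL."""
    path = urlsplit(url).path.lstrip("/")
    if not path.startswith("10."):
        raise ValueError(f"no DOI found in {url!r}")
    return path


def resolver_url(doi: str) -> str:
    """Build a resolver URL for a DOI, percent-encoding unsafe characters."""
    return RESOLVER + quote(doi, safe="/")


# The article DOI shown on the journal demo slide.
article_doi = doi_from_url("http://dx.doi.org/10.1016/j.yqres.2010.04.004")
print(article_doi)
print(resolver_url(article_doi))
```

The point of the sketch is that the identifier stays the same even if the resolver address or the landing page changes, which is what makes a DOI usable for citing a data set.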
So, to conclude:
The need to manage data is not just a modern problem – it drove crucial developments in Western civilisation nearly 9,000 years ago
For most of the last two hundred years, data has largely been the neglected stepdaughter in scholarly communication, eclipsed by its more glamorous sister the journal article. And I’ve reviewed some of the attendant problems arising from this
Two things are driving a change in this approach: the shift to more data-intensive research and growth in information systems that can better manage and make available the underlying data
I showed you some of the bits of the future that are starting to appear – forerunners of the way the research world might look for many disciplines in the next 10-20 years
Or to put it another way, data is what helped to make it possible to go from this [click].
Thanks to all those who made their images available under CC licensing for re-use
[click]
And thank you for the opportunity to speak to you this morning.