Presentation given at UQ Winterschool 2014. The advent of the Internet is bringing about fundamental changes in the ways that research is performed and communicated. These have been particularly driven by the growing importance of data, as well as the tools available to work with this data. This presentation will examine this shift, drawing on examples from the life-sciences, and try to make some predictions about the next five years.
The life-sciences as a pathfinder in data-intensive research practice
1. The life-sciences as a pathfinder in data-intensive research practice
Dr Andrew Treloar, Director of Technology
11 July 2014 CC-BY-SA, @atreloar 1
2. Structure of the presentation
• Research Lifecycles
• Functions of Scholarly Communication
• Pointers to the future
• Characterising the future
• Pathfinder problems
• Conclusions
5. Sharing: Scholarly Communication System and its Functions
• Registration
• Certification
• Awareness
• Archiving
(Roosendaal and Geurts, 1997)
11 July 2014 CC-BY-SA, @hvdsomp and @atreloar 5
6. System of Journals
• Registration
  - submission of manuscript
• Certification
  - peer-review (pre-publication)
  - commentary (post-publication)
• Awareness
  - discovery services
• Archiving
  - libraries (print)
  - publishers (electronic)
  - special purpose organisations (e.g. Portico)
7. Pointers to the future
“the future is already here - it’s just not very evenly distributed”
William Gibson, NPR interview
13. Registration: some observations
• Decoupling registration from certification
• Timestamping, versioning
• Registration of various types of objects
• Machines as creators and contributors
17. Certification: some observations
• Peer-review decoupled from publication process
• Certification of various types of objects
• Machines validating form
• Social endorsement
21. Awareness: some observations
• Awareness for various types of objects
• Real time awareness
• Awareness support targeted at machines
• Awareness through social media
24. Characterising the future
• Research process: Hidden → Visible
• Nature of object: Fixed → Varying
• Process of making public: Discrete → Continuous
• Speed of communication: Delayed → Instant
• Atomicity of object: Atomic → Compound
• Communicated object: Publication + data proxies → Publication + linked data + linked models
• Nature of process: Formal → Informal
25. Fundamental changes
• The research process (objects, social dimension) is becoming more exposed
• Articles, books are no longer the only relevant objects for research communication
• Objects are no longer static
• Machines are joining humans as (co-)creators and consumers of research objects
26. Pathfinder problems
• Integrity of the scholarly record
• The three obsolescences
  - hardware
  - file format
  - software
27. System of Journals: Archiving
28. Web of Objects: Archiving?
29. Not just citation relationships
30. The problem of obsolescence
• The life sciences research environment can be viewed as undergoing a process of accelerated evolution
• Other disciplines will hit these problems in time
34. Abandonware
• “Last summer, a member of the biology department of the University of Udine in Italy approached Nicola Vitacolonna with an intriguing project. The ANREP program, which annotates structural motifs in gene or protein sequences, was out of date having been written more than a decade ago. Although still used by molecular biologists, its slow computing ability meant a straightforward multiple search could take all night on a desktop PC. The Udine biologist wanted Vitacolonna, a postdoctoral fellow in computational biology, to write a program that could do the job more quickly.”
• Sam Jaffe, Scientists Abandon their Software, The Scientist, Feb 16, 2004
35. File format obsolescence: Illumina
• The probability of a base-calling error is encoded as an ASCII character to reduce file size
• The meaning of the ASCII code changed over the life cycle, so data generated at different times may have its quality encoded differently
• “If you get an error like "Invalid quality score value", your fastq file probably has Sanger (offset 33) instead of Illumina (ASCII offset 64) quality scores. You'll need to add the option "-Q33" to your FASTX Toolkit arguments”. Obviously…
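To make the offset problem on this slide concrete, here is a minimal Python sketch, not taken from the presentation. It assumes only the two offsets quoted above (33 for Sanger, 64 for older Illumina) and the standard Phred relation between score and error probability; the function names are hypothetical, and the offset guess is a rough heuristic, not what FASTX Toolkit does.

```python
# Illustrative sketch only: function names are hypothetical, and the two
# offsets are the ones quoted on the slide (Sanger = 33, older Illumina = 64).

SANGER_OFFSET = 33
ILLUMINA_OFFSET = 64


def decode_quality(quality_string, offset):
    """Convert a FASTQ quality string into a list of Phred scores."""
    scores = [ord(ch) - offset for ch in quality_string]
    if any(q < 0 for q in scores):
        # A negative score means the wrong offset was assumed.
        raise ValueError(f"Invalid quality score value for offset {offset}")
    return scores


def error_probability(phred):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)


def guess_offset(quality_string):
    """Rough heuristic: characters below '@' (ASCII 64) cannot be offset-64
    scores. A string with only high characters is ambiguous; this sketch
    then falls back to 64, which may be wrong for high-quality Sanger data."""
    return SANGER_OFFSET if min(quality_string) < "@" else ILLUMINA_OFFSET


if __name__ == "__main__":
    line = "IIII"  # the same four bytes, read under both conventions
    for offset in (SANGER_OFFSET, ILLUMINA_OFFSET):
        scores = decode_quality(line, offset)
        print(offset, scores, error_probability(scores[0]))
```

The same bytes decode to a score of 40 under offset 33 and a score of 9 under offset 64, which is why a file read with the wrong convention either fails outright or silently misreports quality.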
37. Conclusions
• Need to move to a smaller number of standard file formats
• Need to move to a more sustainable model of software development and maintenance
• Need to encourage platform manufacturers to innovate around the hardware, not the software
• NOTE: other disciplines are looking to the life sciences to work out how to solve some of these problems
38. On best practices in the development of bioinformatics software, Front. Genet., 02 Jul 14
• Source code available to reviewers
• Software indexed, citable, available
• Source code documented
• Source code managed
• Test libraries, sample data and dataset repositories available
The story being told here might initially seem to be in pieces, but there is a common thread.
The point of the first section is to give broad context for the two case studies.
Increasingly, Share is bleeding into Do, so let’s zoom in on this.
I want to provide a series of snapshots of the future drawn from the life sciences.
SourceForge is another example.
DNA variant of NG_000007.3 (hemoglobin)
Sardinian population
Provenance: authors of the article from which the nanopub was mined
Content: Post-publication peer review of pubs
Publons aims to change all that. Members of the site can import papers, rate them, and discuss them. In ongoing discussions, members can endorse reviews. When the endorsements reach a certain threshold, the review gains a digital object identifier (DOI), turning it into an object that can be cited in more traditional academic literature.
Content: Multiple sources checking the validity/classification of data
Could also have had this for Registration, of course
Problem of reproducibility is just part of the problem
Integrity used to be based on reliable archives
Accelerated evolution (again, like Cambrian explosion)
Not supported after 2016
OMICtools, SEQanswers
I am reminded a bit of the early days of computing and the proliferation of word processors
One way to think about this problem is in terms of diffusion of innovation