An internal presentation to the SRI AI Center, to get people up to speed on current goings-on in open science. Tries to cover far too many things, and slides probably aren't very comprehensible by themselves.
1. Directions in Open Science
Mike Travers
SRI Bioinformatics Research Group
For AIC Lunch and Learn, 30 Jan 2012
2. About this talk
• Partly a trip report from Open Science
Summit 2011
• Partly an attempt to define open science
and explore its impact
• Partly an excuse to talk about some of my
own vaguely related work
• And partly some semi-crazy speculation
about future projects in this space
3. The Open Science Summit unites researchers, life science
industry professionals, students, patients and other
stakeholders to discuss the future of collaborative science
and innovation.
…in-depth sessions on new models for drug discovery and
clinical trials, personal genomics, the patent system, the
future of scientific publications, and more.
4. What is Open Science?
• Many different things, but boils down to:
• Removing barriers to scientific
communication and collaboration:
– Social
– Technical
– Legal
– Economic
– Bureaucratic
• To accelerate scientific progress
• Utilizing modern technology
5. Driven by technological change
• The Internet has radically reduced
communication costs
• So old institutions of scientific
communication are now obstacles
– Closed academic publishers, notably:
• The Internet will transform scientific media just
as it has newspapers, TV, social life….
• The difference is: science is more
important than sharing cat pictures
7. Open Access
• Most visible and successful branch
of open science
• Articles are free to read; authors pay
to publish
• Funders are starting to require
some form of public access
8. Gold: OA journal, Green: OA self-archiving
Open Access to the Scientific Journal Literature: Situation 2009, PLoS ONE, Bo-Christer Björk et al
9. Research Works Act
• H.R.3699 – “A bill to ensure the continued
publication and integrity of the peer-reviewed
research works by the private sector.”
No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any
policy, program, or other activity that--
(1) causes, permits, or authorizes network dissemination of any private-sector research
work without the prior consent of the publisher of such work; or
(2) requires that any actual or prospective author, or the employer of such an actual or
prospective author, assent to network dissemination of a private-sector research work.
10. Myth 1: American consumers have a right to free access to articles their tax
dollars fund.
Fact
American taxpayers do not fund peer reviewed research articles; they fund some
of the research that is used in those articles…
11. Beyond Open Access
• Not going to say a whole lot about OA, because:
• It’s easy to understand
• It’s pretty clearly going to win in the long term
• By itself, not a very radical change to how science is
done:
– Knowledge is still in paper-sized chunks
– Papers are peer-reviewed prior to publication
– Once something is published, it’s static
• All these parameters are being challenged in some
way by other efforts
• George Whitesides (Harvard chemist): “The concept
of the scientific paper is eroding before our very
eyes”
12. Variations on publishing
• “Peer review is broken”
– Too slow
– Too biased
– Too rigid
– May be “the worst system except for all the others”
• Pre-peer-review publication
– E.g. arXiv.org
• Micropublication
– Crowdsourcing, blogs, wikis….
• Open-notebook science
– No gap at all between bench and publication
• Database-linked publications
• Dynamic Review Papers
13. Biggest sequencing operation
in the world
Generating 6 terabytes/day of
genomic data
Open-Source Genomic Analysis of Shiga-Toxin–Producing E. coli O104:H4, Rohde et al. 2011 (NEJM)
Toxic E. coli outbreak in Germany May 2011:
We released these data into the public domain… which elicited a burst of
crowd-sourced, curiosity-driven analyses carried out by bioinformaticians on four
continents. Twenty-four hours after the release of the genome, it had been
assembled; … Five days after the release of the sequence data, we had designed and
released strain-specific diagnostic primer sequences, and within a week, two dozen
reports had been filed on an open-source wiki …dedicated to analysis of the strain
https://github.com/ehec-outbreak-crowdsourced
14. GigaScience is a new integrated database and journal
co-published in collaboration between BGI Shenzhen
and BioMed Central, to meet the needs of a new
generation of biological and biomedical research as it
enters the era of "big-data."
23. Somewhat less garage-
• Independent research institute, started
from data released by Merck
• Repository of experimental data (Sage
Commons)
• Network of cooperating institutions
• Starting to build a computational platform
(Synapse)
25. And some individual researchers
• Peter Murray-Rust
Chemist, Cambridge,
promoter of Chemical Markup Language and
semantic web
“Closed science makes people die!”
• Victoria Stodden
Statistician, Columbia,
reproducibility of computational science
(cf ClimateGate)
26. Some open science success stories
• Galaxy Zoo
• FoldIt
• Nutrient Network (NutNet)
• Praziquantel synthesis
28. Galaxy Zoo
• Citizen science (loosely)
• Image classification task
• Mechanical Turk-like approach (but
unpaid)
• About 200K participants
• Discovered a whole new class of galaxies
(“green pea”) and a quasar mirror
• 22 published papers in 3 years
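The "Mechanical Turk-like" idea on this slide can be sketched in a few lines: many volunteers label the same image, and a label is accepted only when enough of them agree. This is an illustrative sketch, not Galaxy Zoo's actual pipeline; the function name `aggregate_votes`, the `min_agreement` threshold, and the sample labels are all assumptions made for the example.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.75):
    """Combine volunteer labels into one label per image.

    votes: iterable of (image_id, label) pairs, one per volunteer classification.
    Returns {image_id: label}, with None where agreement is below the
    (hypothetical) min_agreement threshold, flagging the image for expert review.
    """
    counts_by_image = {}
    for image_id, label in votes:
        counts_by_image.setdefault(image_id, Counter())[label] += 1

    consensus = {}
    for image_id, counts in counts_by_image.items():
        top_label, top_count = counts.most_common(1)[0]
        share = top_count / sum(counts.values())
        consensus[image_id] = top_label if share >= min_agreement else None
    return consensus


if __name__ == "__main__":
    # Made-up classifications: g1 has a clear majority, g2 is split.
    sample = [("g1", "spiral")] * 4 + [("g1", "elliptical")]
    sample += [("g2", "spiral"), ("g2", "elliptical")]
    print(aggregate_votes(sample))
```

Images that come back as None are the interesting ones: they are either hard cases or, occasionally, something nobody had a category for.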
33. Matthew Todd, chemist at University of Sydney
Schistosomiasis
Looking for synthesis for known drug Praziquantel (PZQ) in enantiopure form
Open-notebook science (LabTrove)
37. What paper has the most authors?
• NutNet paper:
40 authors, 41 institutions
• This one from SLAC and elsewhere:
407 authors, but only 35 institutions
38. Three variations on the scientific process
• Automated Science
• Distributed Science
• Web-scale Intelligent Science
• Open Science as the lubrication /
accelerant that makes these feasible
39. Afferent: Automation for Drug Discovery
• Combinatorial Chemistry
• Planning software to drive lab robots
41. Distributed Science
• Some science (e.g. evaluation of drug
candidates) is highly parallelizable
• Hence distributable
• CollabRx was initially supposed to support
“virtual pharma companies” that would tie
disparate academic research efforts into
focused teams
43. Web-scale Intelligent Science
• Imagine all of science as a giant distributed
computational process
• Individual scientists are agents
– Working on a small part of the problem
– Sharing their results
– Getting feedback and funding dependent on
success
• Centralized data integration and decision
tools are used to help determine the next useful
experiment
44. Steps towards distributed intelligence
• Adaptive clinical trials
– Rather than a classical trial with two arms run to
completion
– Change the distribution of test cases based on
ongoing results
• Now imagine this strategy applied more globally
across all treatments for a disease
• Credit for this slightly mad vision goes mainly to
Marty Tenenbaum:
– AI Meets Web 2.0 (2006)
– Shrager, Tenenbaum, Travers, Cancer Commons:
Biomedicine in the Internet Age (2011)
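The adaptive-trial bullet above ("change the distribution of test cases based on ongoing results") can be made concrete with a small simulation. The sketch below uses Thompson sampling, which is one common allocation rule and is my choice for illustration, not something the slides or the cited papers prescribe. The response rates, the function name `allocate_patients`, and the uniform Beta(1, 1) prior are all assumptions; a real trial would add stopping rules, safety monitoring, and regulatory constraints.

```python
import random


def allocate_patients(true_response_rates, n_patients, seed=0):
    """Simulate an adaptive trial using Thompson sampling.

    true_response_rates: hypothetical per-arm probabilities of a good outcome
    (unknown in a real trial; used here only to simulate patient responses).
    Returns the number of patients assigned to each arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_response_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms

    for _ in range(n_patients):
        # Draw a plausible response rate for each arm from its Beta posterior
        # (uniform prior), then treat the next patient with the best draw.
        draws = [
            rng.betavariate(1 + successes[arm], 1 + failures[arm])
            for arm in range(n_arms)
        ]
        chosen = max(range(n_arms), key=lambda arm: draws[arm])

        # Simulate the patient's outcome and update that arm's record.
        if rng.random() < true_response_rates[chosen]:
            successes[chosen] += 1
        else:
            failures[chosen] += 1

    return [s + f for s, f in zip(successes, failures)]


if __name__ == "__main__":
    # Two arms with made-up response rates of 20% and 60%.
    print(allocate_patients([0.2, 0.6], n_patients=500))
```

Unlike a classical two-arm trial with a fixed 50/50 split, the allocation drifts toward whichever arm is doing better as evidence accumulates. Applying the same rule across all treatments for a disease just means more arms.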
45. What does all that have to do with Open Science?
• Open Science is lowering barriers to
collaboration
• So it’s a necessary but not sufficient step
towards this new kind of science
• CollabRx may just have been too early:
– the groundwork hasn’t been laid yet
– we are still working on basics (e.g. standards for representation)
• Reducing friction (or transaction costs) can
be incredibly important
46. “Changing the cost of innovation fundamentally changes the nature of innovation”
– Joichi Ito
TCP, HTTP etc. are the containerization of data.
So what’s the analog for scientific knowledge?
48. A mix of technical, institutional, and legal standardization:
-Standard licenses (parameterizable)
-RDF representation for licenses.
-Web Tools to generate these
-Sites that collect and “market” available materials.
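To give a feel for what a "parameterizable" license with an RDF representation might look like, here is a minimal sketch that emits Turtle from a set of chosen permissions. It borrows the Creative Commons rights vocabulary (the `cc:` namespace with `cc:permits`, `cc:requires`, `cc:prohibits`) as a stand-in; whether that vocabulary fits materials licenses is an assumption, and the function name `license_to_turtle` and the example.org license URI are hypothetical.

```python
CC_NS = "http://creativecommons.org/ns#"


def license_to_turtle(license_uri, permits=(), requires=(), prohibits=()):
    """Render a parameterized license as RDF in Turtle syntax.

    Each parameter is a sequence of term names from the Creative Commons
    vocabulary, e.g. "Reproduction", "Attribution", "CommercialUse".
    """
    lines = [
        f"@prefix cc: <{CC_NS}> .",
        "",
        f"<{license_uri}> a cc:License",
    ]
    clauses = (
        ("permits", permits),
        ("requires", requires),
        ("prohibits", prohibits),
    )
    for predicate, terms in clauses:
        for term in terms:
            # Each clause becomes one more predicate-object pair on the license.
            lines.append(f"    ; cc:{predicate} cc:{term}")
    lines.append("    .")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical license: copy and share freely, credit required, no resale.
    print(
        license_to_turtle(
            "http://example.org/licenses/materials-demo",
            permits=["Reproduction", "Distribution"],
            requires=["Attribution"],
            prohibits=["CommercialUse"],
        )
    )
```

A web tool of the kind the slide mentions would essentially be a form in front of a function like this, producing both the machine-readable description and a link to the human-readable legal text.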
49. BioBike, a platform for open science
• Conceived of as a vehicle for getting
biologists to do their own
knowledge-based biocomputing.
• Lisp + Frame system + Bioinformatics
Tools
– Through-the-web programmability
– Community sharing of code and data
– Visual Programming Language
• Open Source
Jeff Elhai, Arnaud Taton, J. P. Massar, John K. Myers, Michael Travers, Johnny Casey, Mark
Slupesky, Jeff Shrager. BioBIKE: A Web-based, programmable, integrated biological knowledge
base. Nucleic Acids Research, 2009
52. BioBike and Open Science
• BioBike wasn’t for Open Science per se
• But it did explore some ideas in
web-based biocomputation
• The next-generation BioBike platform:
– Data: Big data, Open data, semantic web
integrated
– Programming: Able to deal with large scale
and distributed workflows with human
elements
– Collaboration: Integrating different
communities in a “trading zone”
KnowOS: The (Re)Birth of the Knowledge Operating
System. Mike Travers, JP Massar, and Jeff Shrager,
International Lisp Conference 2005
53. What is a platform?
• The economic meaning of “platform” is interesting
• Something that:
– Supports two-sided network effects
– Stands in the middle and extracts a toll
• Examples:
– Credit cards
(merchants ↔ consumers)
– Operating systems
(application developers ↔ users)
• Science has more complicated networks and relations
– Data providers
– Data consumers
– Service providers
– Analysts (e.g. statisticians)
– Patients
• A science platform is not going to make anyone rich like Facebook,
but it would be nice to have a powerful and standard way for all
these groups to collaborate.
54. Open Data is outstripping analysis capacity
• Or in other words:
– data is cheap,
– attention, knowledge, & expertise are
expensive
• A platform for collaborative computational
interpretation of biological data
• To better leverage the expensive
resources
55. identifies advancing new computational infrastructure as a
priority for driving innovation in science and engineering.
Scientific discovery and innovation are advancing
along fundamentally new pathways opened by the
development of increasingly sophisticated software.
the overarching goal of transforming
innovations in research and education into
sustained software resources that are an
integral part of the cyberinfrastructure
56. Anti-open arguments
• Peer-review is an essential filter; without it
too much nonsense gets out
• Electronic availability of articles actually leads
to narrowing of science (Evans, 2008)
• Privacy, HIPAA, etc.
• Need to retain IP for economic motivation
• The problem isn’t availability of data; it’s
making sense of what we do have
• See PRISM for more
57. Opener Science
• Science is already
pretty open!
• Institutions of openness
played a role in the
foundation of science,
including the first
scientific journals
58. Historical Origins of Open Science
• Before the invention of science,
knowledge of the natural world was closely
guarded, passed down from master to
apprentice.
• The development of institutions of
openness was a key factor in the scientific
revolution (Paul David, Stanford
economist)
• …and the printing press was a key factor
in that.
59. So…
• The printing press is almost 600 years old
• The scientific journal is almost 350 years old
• There’s been some advancement in
communication technology since then…
• Science will eventually change, and the change will be:
– Either a modest acceleration of the scientific
process,
– Or something as significant and discontinuous as the first
scientific revolution
• Which one? An open question.
Datacite: working with data centres to assign persistent identifiers to datasets, we are developing an infrastructure that supports simple and effective methods of data citation, discovery, and access.