Open Data - Where Do We Stand from a Researcher's Perspective?
1. Open Data – Where Do We
Stand from A Researcher's
Perspective?
Philip E. Bourne
University of California San Diego
pbourne@ucsd.edu
2. My Perspective …
• Mine is a biomedical sciences perspective
• My lab. distributes for free data equivalent to ¼ the
Library of Congress every month
• I am a supporter of open access (provided there is a
business/sustainability model) and founding editor in
chief of PLOS Computational Biology
• I am Co-founder of SciVee Inc. and believe
innovation comes from open access to knowledge
• Recently became UCSD’s AVC of Innovation which is
giving me a more institutional perspective
I Readily Acknowledge Each Discipline is Different
3. My General Opinion:
Where Does the Open Access Debate
Stand Today?
• It's not a question of “if” but a question of
“when” and “how” for most disciplines
• We are at the tip of the iceberg in our
ability to use OA content
• OA will gain momentum in an increasingly
knowledge-based economy
4. The State of Play:
UC Open Access Policy Debate:
Opt Out vs Opt in
• For
– Publicly funded research should be public
– Institutional perspective: the open provision of data and
knowledge derived from these data appears to be an
unidentified asset at this time
• Against
– Cost to some disciplines
– Impact on societies
– Journal quality re promotion
– Extra work
– Administration
– UC as “Big Brother”
5. We will come back to this, but
first let us explore why open
knowledge is so important (to
me at least)
6. Open Data May Save Lives? *
[Figure: Structure Summary page activity for H1N1 Influenza related
structures, Jan. 2008 to Jul. 2010]
3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir
1RUZ: 1918 H1 Hemagglutinin
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
7. Open Science Can Accelerate
the Scientific Process…
For some people the change may
be too slow to save their life
8. Josh Sommer – A Remarkable Young Man
Co-founder & Executive Director of the Chordoma Foundation
http://sagecongress.org/Presentations/Sommer.pdf
9. Chordoma
• A rare form of brain
cancer
• No known drugs
• Treatment – surgical
resection followed by
intense radiation
therapy
http://upload.wikimedia.org/wikipedia/commons/2/2b/Chordoma.JPG
13. If I have seen further it is only by
standing on the shoulders of giants
Isaac Newton
From Josh’s point of view the climb
up just takes too long
> 15 years and > $850M to be
more precise
Adapted: http://sagecongress.org/Presentations/Sommer.pdf
18. What Does Meredith Tell Us?
• The Wikipedia / Khan Academy / YouTube
generation knows no bounds
• Bounds are too often imposed by tradition
rather than what makes the most sense
• Another example of an underexploited
asset at this time?
19. Another Way of Thinking About
the Implications of What Josh
and Meredith Represent Is the
Need for New Forms of
Knowledge Management and
Access
Let's Explore this Notion with
An Emphasis on Data
20. The Silos of Data & Knowledge Are
Starting to Coalesce
Is a Biological Database Really Different than a Biological Journal?
PLoS Comp. Biol. 2005 1(3) e34
21. The Silos of Data & Knowledge Are
Starting to Coalesce
• Supplemental information has exploded
• Data journals are emerging
• The use of rich media is increasing
• Software and other processes are becoming available
• Databases are now knowledgebases
• Science can be done on the fly
• Biocuration is a respectable career
PLoS Comp. Biol. 2008. 4(7): e1000136
22. Where Does That Take Us?
• A paper is an artifact of a previous era
• It is not the logical end product of eScience,
hence:
– Work is omitted
– Article vs supplement is a mess
– Visualization may be limited
– Interaction and enquiry are non-existent
– Rich media can help, but barriers remain
23. Where Does That Take Us?
Data Sharing Policies
• From the NSF:
• Investigators are expected to share with other researchers, at
no more than incremental cost and within a reasonable time,
the primary data, samples, physical collections and other
supporting materials created or gathered in the course of
work under NSF grants. Grantees are expected to encourage
and facilitate such sharing. See Award & Administration Guide
(AAG) Chapter VI.D.4.
24. Big Data is Off…
• March 2012 OSTP
commits $200M to
Big Data
• NSF, DOD, NIH all
announce programs
• GBMF think tank
leads to soon-to-be-
announced
institutional awards
25. Where Does That Take Us?
Add into the Mix:
• Reproducibility: It really is a myth!
• Maintainability: DNA doubles in 5 months
• Usability: Go ahead and try!
• Reward: Tenure for data – no way
Notwithstanding dreams do emerge …
Here is mine
26. Here is What I Want
The Knowledge and Data Cycle
0. Full text of PLoS papers stored in a database
1. User clicks on thumbnail
2. Metadata and a webservices call provide a renderable image that
can be annotated
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature
text and back to the PDB
[Figure: cycle diagram]
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is
analyzed
3. Selecting a feature provides a database/literature mashup
4. That leads to new papers
PLoS Comp. Biol. 2005 1(3) e34
28. Simultaneously Discovery
Informatics Emerges
• Google will not
suffice as a scientific
knowledge discovery
tool
• Google is broad but
shallow
• Science is cross-disciplinary,
narrower and deeper
29. NSF Discovery Informatics
Workshop
• Discoveries surpass
an individual's ability -
need intelligent tools
• Need to increase
connections between
knowledge and data
• Need to combine
diverse human
abilities
Discovery informatics - computer scientists, domain scientists,
social scientists -
http://www.isi.edu/~gil/diw2012/NSFDiscoveryInformatics2012-FinalReport.pdf
30. This is Just the Beginning of
Discovery Informatics
• Each evening the lab's “Evernote”
notebooks are scanned for commonalities
from the day's activities. These are seeds
in a deep search of the web for knowledge
and data that has become available since
last searched. Results are ranked and
presented for consideration over coffee
the next morning
http://www.discoveryinformaticsinitiative.org/diw2012
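The nightly scenario above can be sketched in a few lines. This is only an illustration under stated assumptions: notebooks and newly available items are plain strings, "commonalities" are approximated as terms that occur in more than one notebook, and all names (`terms`, `common_seeds`, `rank_new_items`, the sample data) are hypothetical. It does not use any Evernote or search-engine API.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real vocabulary).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "with", "for", "on", "is"}


def terms(text):
    """Return the set of lower-cased word tokens, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def common_seeds(notebooks, min_notebooks=2):
    """Terms that appear in at least `min_notebooks` different notebooks."""
    counts = Counter()
    for text in notebooks.values():
        counts.update(terms(text))  # each notebook counts a term once
    return {t for t, n in counts.items() if n >= min_notebooks}


def rank_new_items(seeds, items):
    """Rank items by number of seed terms mentioned; drop items with none."""
    scored = [(len(seeds & terms(text)), title) for title, text in items.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [title for score, title in ranked if score > 0]


# Hypothetical day of lab notes and newly available items.
notebooks = {
    "alice": "Docked zanamivir into neuraminidase structure 3B7E",
    "bob": "Neuraminidase mutants and zanamivir resistance assay",
}
new_items = {
    "flu-structure": "New neuraminidase structure released in complex with zanamivir",
    "resistance-survey": "Survey of zanamivir resistance in clinical isolates",
    "chordoma-line": "Chordoma cell line made available",
}

seeds = common_seeds(notebooks)
morning_list = rank_new_items(seeds, new_items)
print(morning_list)
```

A real system would need far richer term extraction and a search over external sources; the sketch only shows the scan, seed, rank sequence.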
31. Unimaginable Connections Made Automatically
Through RDF Descriptions
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2010-09-22_colored.html
32. Before We Get Too Heady Let's
Look at the Realities of the
Situation from My Perspective
• Data repositories are broken
• There is a “high noon” effect
• NCBI has been a wonderful model to
date…
33. Data/Institutional Repositories
• “Build it and they will come” fails most of the
time
• Institutional repository is an oxymoron
• NCBI works because:
– It is an act of the US Congress
– It has strong leadership
– It has a monopoly on the literature
– It has IT thought out over many years
Innkeeper at the Roach Motel, D. Salo 2008
http://muse.jhu.edu/journals/library_trends/v057/57.2.salo.html
34. Data/Institutional Repositories
• “High Noon” Effect
– Publishers make getting knowledge in very difficult,
but at least getting knowledge out, albeit limited, is
consistent, intuitive and easy to use
– Data repositories make data in and data out
very difficult – they strive to be different when
in fact users want them to be the same
35. Data and Journals
• That journals are thinking about data is
good
• Dryad etc. are welcome but a stopgap
measure
• Fully functional data journals will not occur
without a change to the reward system
• Data papers can help shift the reward
system
• Are PLoS Topic Pages a sign?
36. Interim Solution:
Use the Traditional Reward System
The Wikipedia Experiment – Topic Pages
Identify areas of Wikipedia that
relate to the journal that are
missing or stubs
Develop a Wikipedia page in the
sandbox
Have a Topic Page Editor Review
the page
Publish the copy of record with
associated rewards
Release the living version into
Wikipedia
37. Think Globally Act Locally:
What Can Our Institutions Do
Now To Move Us in The Right
Direction?
38. Institutional Response
• Have repositories that are useful
– Use common standards
– Are vetted by the community
– Are fully open and searchable
• Reward all forms of scholarship
• Leverage the asset …
39. Most Laboratories
• We are the long tail
• Goodbye to the
student is goodbye to
the data
• Very few of us have
complied (or will
comply) with the data
management plans
we write into grants
40. UCSD Dropbox
• Simple!!!!
• Can drop large files easily
• Asks for limited metadata and permissions to
“discover”
• Has guaranteed quality of service and
security not available in the cloud
• Is the data management plan and is charged
against grants
• Is a rich campus corpus open to discovery
informatics
41. The UCSD Dropbox
Discovery Environment
• Scenarios:
– Fosters known collaborations through
simplified data exchange
– Discovers new collaborators through the
same or related data elements
– A corpus whose intrinsic value is as yet
unknown
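The second scenario, discovering new collaborators through shared data elements, might look like the following minimal sketch. It assumes each lab's deposits can be reduced to a set of data element identifiers; the function name, lab names and data layout are hypothetical, and the PDB IDs are borrowed from the H1N1 slide purely as sample values.

```python
from collections import defaultdict
from itertools import combinations


def suggest_collaborations(deposits):
    """Map each pair of labs to the data elements they have both deposited.

    `deposits` maps a lab name to the set of data element IDs it has dropped.
    """
    # Inverted index: data element -> labs that deposited it.
    labs_by_element = defaultdict(set)
    for lab, elements in deposits.items():
        for element in elements:
            labs_by_element[element].add(lab)

    # Every pair of labs sharing an element is a candidate collaboration.
    shared = defaultdict(set)
    for element, labs in labs_by_element.items():
        for lab_a, lab_b in combinations(sorted(labs), 2):
            shared[(lab_a, lab_b)].add(element)
    return dict(shared)


# Hypothetical campus deposits.
deposits = {
    "lab_a": {"3B7E", "1RUZ"},
    "lab_b": {"3B7E"},
    "lab_c": {"unrelated-dataset"},
}
print(suggest_collaborations(deposits))
```

Matching on "related" rather than identical elements would need a similarity measure, which is left out here.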
42. What Do I Want by 2020 or
Earlier as a Researcher?
• Answer biological questions not just
retrieve data
• Understand all there is to know about the
availability and quality of a unit of
biological data
• Operate on data in a way that is simpler,
more productive, and reproducible
43. What Do We Need to Do to Get
There? A Data Registry?
• Individual repositories register their
metadata which includes access
statistics, commentary etc. – DataCite
is a beginning
• Identify identical data objects and their
respective metadata for comparative
analysis
• Funders support registration
• Publishers support registration
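One way to picture such a registry is the minimal sketch below. It assumes repositories submit a small metadata record per data object and that "identical data objects" can be detected by a content checksum. The `Record` fields and the `Registry` class are hypothetical and are not the DataCite schema or API.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Metadata a repository registers for one data object (illustrative fields)."""
    repository: str
    identifier: str
    checksum: str
    downloads: int


class Registry:
    """Toy registry that groups registered records by content checksum."""

    def __init__(self):
        self._by_checksum = defaultdict(list)

    def register(self, repository, identifier, data, downloads=0):
        # SHA-256 of the raw bytes stands in for "same data object".
        digest = hashlib.sha256(data).hexdigest()
        record = Record(repository, identifier, digest, downloads)
        self._by_checksum[digest].append(record)
        return record

    def duplicates(self):
        """Groups of records, from any repository, holding identical data."""
        return [recs for recs in self._by_checksum.values() if len(recs) > 1]


registry = Registry()
registry.register("repo_a", "a:1", b"ATOM 1 N MET", downloads=10)
registry.register("repo_b", "b:9", b"ATOM 1 N MET", downloads=3)
registry.register("repo_b", "b:10", b"ATOM 2 CA MET")

for group in registry.duplicates():
    # Side-by-side metadata for comparative analysis.
    print([(r.repository, r.identifier, r.downloads) for r in group])
```

Byte-level checksums only catch exact copies; recognising the same data in different formats would need a normalisation step first.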
44. What Do We Need to Do to
Get There? An App+ Store?
• The App model
– Think of it operating on a content base
rather than a mobile device
– Simple and consistent user interface
– Needs to pass some quality control
– Has a reward
• The App+ Model
– Apps interoperate through a generic
workflow interface
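The App+ idea of apps interoperating through a generic workflow interface can be illustrated with a minimal sketch. It assumes every app takes a content record and returns an enriched one, so apps can be chained in any order that satisfies their inputs. The `Content` and `App` types and both sample apps are hypothetical.

```python
from typing import Any, Callable, Dict, List

# The generic interface: an app maps a content record to a new content record.
Content = Dict[str, Any]
App = Callable[[Content], Content]


def run_workflow(apps: List[App], content: Content) -> Content:
    """Apply each app in turn, feeding the output of one into the next."""
    for app in apps:
        content = app(content)
    return content


def count_residues(content: Content) -> Content:
    """Sample app: add the sequence length to the record."""
    return {**content, "length": len(content["sequence"])}


def flag_short(content: Content) -> Content:
    """Sample app: flag sequences shorter than 10 residues (arbitrary cutoff)."""
    return {**content, "short": content["length"] < 10}


result = run_workflow([count_residues, flag_short], {"sequence": "MKTAYIAK"})
print(result)
```

Because each app returns a new record rather than modifying its input, apps written by different people can be combined without knowing about one another, which is the point of the shared interface.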
45. In Summary
• We have at hand the means to accelerate
the rate of discovery
• To do so we need to place more value on
the data, the individuals that produce it
and the institutions that maintain it
• We are all stakeholders in this endeavor
• Here is one way to get involved….
46. Get Involved: FORCE11
• Tools and Resource
catalog
• Article database in
Mendeley
• Discussion Forum via
Google
• Blogs courtesy of blog
sites and RSS feeds
• Web site via Drupal
• Announcements via
Twitter
http://force11.org
47. General References
• Force11 Manifesto
• Fourth Paradigm: Data Intensive Scientific
Discovery
http://research.microsoft.com/enus/collaboration/fourthparadigm/