Scott Edmunds' slides for class 8 of the HKU Data Curation course (module MLIM7350, Faculty of Education), covering open science and data publishing
Standing on Giants: Open Science and Data Sharing
1. Class 10…open science
'if I have seen further it is by
standing on the shoulders of
giants'.
Scott Edmunds, HKU Data Curation MLIM7350
2. Communicating in-class
• Chat channel:
• http://backchannelchat.com/chat/dw131
• Feel free to ask questions, or to ask me to speed up/slow down
Also feel free to email: scott@gigasciencejournal.com
3. Reflection: how fair is FAIR?
Read the FAIR principles paper.
Do you think they are applicable and
feasible for HK? If it is feasible, what is
needed to implement them?
http://www.nature.com/articles/sdata201618
4. Reflection: how fair is FAIR?
• Lots of summarizing the principles, with stress on
importance of “machine-actionability” (important)
• Not everybody mentioned HK context, but a few said
more on this, including:
– Meifeng Chen looked at how FAIR data.gov.hk was (not very)
– Tak Hei Lam thought scholarhub could be a good testbed (more
accessible formats & experiment with RDF)
– Jiafeng Zhou thinks government (?) & ODHK (yes) can help advocate
for implementation
http://www.nature.com/articles/sdata201618
5. HKU Repeatability in HK
Research Experiment
What have we found?
Emailed all controlled access examples for access.
Only 2 responses so far. One bounced. One postal address only.
Data available (by standards of the field) 20 (39.2%)
Data sort of available (raw would have been better) 12 (23.5%)
Need to request access (ethics) 17 (33.3%)
Data not accessible 2 (4%)
Google Doc: https://docs.google.com/spreadsheets/d/15BszEhUodygyu4eGckR2b5p153nyeYmB3Uh4U23HX-o/edit?usp=sharing
CSV Archive: http://publicdatahk.com/dataset/hku-repeatability-in-hk-research-experiment
Figshare: https://figshare.com/s/0e0906894fc934f8c7b2
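The percentages on this slide follow directly from the four counts (51 examples in total). A minimal Python sketch of that arithmetic; the category labels are copied from the slide, while the variable names are chosen here purely for illustration:

```python
# Counts per availability category, as reported on the slide.
counts = {
    "Data available (by standards of the field)": 20,
    "Data sort of available (raw would have been better)": 12,
    "Need to request access (ethics)": 17,
    "Data not accessible": 2,
}

# Total number of examples checked.
total = sum(counts.values())

# Share of each category as a percentage of the total.
shares = {label: 100 * n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n} ({shares[label]:.1f}%)")
```

To one decimal place the last category works out at 3.9%, which the slide rounds to 4%.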
6. Final Project
• Presentation Time – 5 mins inc. feedback
• Who has slides to give me?
Ernest Tak-Hei Lam: https://docs.google.com/presentation/d/1lYDxCCtumuUBbMwcd5XfgqER2z9yfU_uTo6DGD3sdpw/edit?usp=sharing
Mika Qiao Zi: https://drive.google.com/file/d/0B-If8LKUjDK6eVdOaWNVeHdEMTg/view?usp=sharing
11. Biggest Challenge: Closed Access
Handful of closed access STM publishers control market
Force libraries to buy “bundles”
Revenue >$9B
Average cost /article >$5000 USD
Publishers retain copyright
Prevent data mining of content
Withhold information from the 99.9% who need it!
12. Publishing: better than a gold mine
See: http://alexholcombe.wordpress.com/2013/01/09/scholarly-publishers-and-their-high-profits/
13. Increasing strain on library budgets
[Chart: MIT library purchases v inflation 1986-2006. Y axis: percentage change (-50% to 400%); X axis: year. Series: Consumer Price Index, serial expenditures, number of serials purchased, number of books purchased, book expenditures. Annotated lines: journal expenditure and inflation.]
15. The good news: the fightback has started…
http://thecostofknowledge.com/
16. The Solution: Open Access
“By “open access” to [peer-reviewed research literature], we mean its
free availability on the public internet, permitting any users to read,
download, copy, distribute, print, search, or link to the full texts of
these articles, crawl them for indexing, pass them as data to
software, or use them for any other lawful purpose, without financial,
legal, or technical barriers other than those inseparable from gaining
access to the internet itself. The only constraint on reproduction and
distribution, and the only role for copyright in this domain, should be
to give authors control over the integrity of their work and the right to
be properly acknowledged and cited.”
Budapest Open Access Initiative:
• Maximizes reuse and access
• Gives authors control over the integrity of their work and the right
to be properly acknowledged and cited.
• “Real” OA asks for no restrictions/limitations = CC-BY
18. The Solution: Pre-prints
Finally taking off across the globe
http://ec.europa.eu/research/openscience/index.cfm?section=monitor&pg=access#3
>50K pre-prints in China
Now has ChinaXiv:
http://chinaxiv.org/
19. The Solution: OER
Open Education Resources: democratising education
https://www.oercommons.org/
(Hint, can be useful for one of the project options…)
22. Data platforms: easy to build
https://ckan.org/
Open source, from OKI. Used by Governments (inc. HK),
Universities (Bristol), even hospital registries.
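Because every CKAN portal exposes the same Action API, datasets can be searched programmatically as well as through the web page. The sketch below builds a `package_search` request URL and reads dataset titles out of the JSON reply. The portal address `https://ckan.example.org/` and the sample response are invented for illustration, and no network call is made; a real portal's reply will carry many more fields.

```python
import json
from urllib.parse import urlencode


def build_search_url(portal: str, query: str, rows: int = 5) -> str:
    """Return the package_search URL for a CKAN portal (Action API v3)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> list[str]:
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [pkg["title"] for pkg in payload["result"]["results"]]


# Hypothetical portal and a hand-written, heavily trimmed sample response.
url = build_search_url("https://ckan.example.org/", "air quality")
sample_response = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "air-quality-2016", "title": "Air quality readings 2016"},
            {"name": "roadside-air", "title": "Roadside air monitoring"},
        ],
    },
})
titles = dataset_titles(sample_response)
print(url)
print(titles)
```

To query a live portal, the URL could be fetched with `urllib.request.urlopen` and the body passed to `dataset_titles`.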
23. Pragmatic/Infrastructure:
Wiki science:
• 10,000 distinct gene pages.
• 2.07 million words and 82MB data.
• 50 million views & 15,000 edits per year.
Crowdsourcing, wisdom of the masses
GeneWiki
GitHub science:
A hypothetical Git workflow for a scientific collaboration involving 3 authors.
Karthik Ram: http://www.scfbm.org/content/8/1/7
http://en.wikipedia.org/wiki/Portal:Gene_Wiki
52. • Review
• Data
• Software
• Models
• Pipelines
• Re-use…
All of the above = Credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with
a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set
would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
New incentives/credit
53. Step 1: Archived data
Step 2: Landing Page
Step 3: Persistent Identifier
Step 4: Clear metadata
Step 5: Advertisement (cite our data like this…)
Data Citation Principles
https://www.force11.org/group/joint-declaration-data-citation-principles-final
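Step 5 ("cite our data like this…") can be generated from the metadata already held on the landing page. A minimal sketch, assuming a citation pattern along the lines of the one DataCite recommends (creators, year, title, publisher, identifier); the `DatasetRecord` class and all the example values are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Minimal landing-page metadata needed to build a data citation."""
    creators: list[str]
    year: int
    title: str
    publisher: str
    doi: str  # the persistent identifier from Step 3


def format_citation(rec: DatasetRecord) -> str:
    """Render: Creators (Year). Title. Publisher. Resolvable DOI link."""
    authors = "; ".join(rec.creators)
    return (
        f"{authors} ({rec.year}). {rec.title}. "
        f"{rec.publisher}. https://doi.org/{rec.doi}"
    )


# Invented example record, not a real dataset.
record = DatasetRecord(
    creators=["Chan, A.", "Lee, B."],
    year=2017,
    title="Example survey dataset",
    publisher="Example Repository",
    doi="10.1234/example",
)
citation = format_citation(record)
print(citation)
```

Writing the DOI as a full `https://doi.org/` link keeps the citation both human-readable and resolvable by machines.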
56. Not just carrots…
“The data discovery index (DDI) enabled through
bioCADDIE is to do for data what PubMed (and
PubMed Central) did for the literature.”
https://datamed.org/
57. How do we find datasets?
Commercial products: Data Citation Index
http://wokinfo.com/products_tools/multidisciplinary/dci/
58. How do we publish datasets?
http://dashboard101innovations.silk.co/page/Archive-share-data-%26-code
6156 respondents to survey answered (preset answers or “other”):
62. Inc Hong Kong: DataSpace@HKUST
https://dataspace.ust.hk/
63. How do we find datasets?
https://search.datacite.org/
https://datamed.org/
64. How do we find dataset citations?
Workarounds: search Europe PMC & Google Scholar for the dataset's DOI string
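The DOI-string workaround can be scripted against the Europe PMC REST search service: search the full text for the dataset DOI as an exact phrase and read the number of hits. The sketch below only builds the request URL and parses a reply; the DOI `10.1234/example-dataset` and the sample response are invented, and the hit count is a rough proxy for citations rather than a curated figure.

```python
import json
from urllib.parse import urlencode

EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def doi_search_url(doi: str) -> str:
    """Build a Europe PMC search URL that looks for the DOI as an exact phrase."""
    params = urlencode({"query": f'"{doi}"', "format": "json"})
    return f"{EUROPE_PMC_SEARCH}?{params}"


def hit_count(response_text: str) -> int:
    """Read the number of matching articles from a Europe PMC JSON reply."""
    return int(json.loads(response_text)["hitCount"])


# Placeholder DOI and a hand-written, trimmed sample reply (no network call).
url = doi_search_url("10.1234/example-dataset")
sample_reply = json.dumps({"hitCount": 3, "resultList": {"result": []}})
mentions = hit_count(sample_reply)
print(url)
print(mentions)
```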
65. Exercise: find the data citations
How many citations does the Darwin (ground) finch genome data
in this paper have?
Can find it by searching through the comparative genomic datasets.
Should be in https://search.datacite.org/ or https://datamed.org/
http://science.sciencemag.org/content/346/6215/1311
66. Data Publishing: nothing new…
Data & Metadata Collection/Experiments
Analysis/Hypothesis/Analysis
Conclusions
+ Area of Interest/Question
1839
1859
20 Yrs.
75. Submitting to CKAN: basic version
Need to install
Very Basic Metadata:
Title,
Description,
Source,
Organisation,
Maintainer & Author
(names and emails),
Source (URL),
License
& custom tags.
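The metadata fields listed above map onto the dataset dictionary that CKAN's `package_create` action accepts. A hedged sketch of building that payload: the helper `ckan_package`, the organisation id, and the names and emails are all hypothetical, and the accepted `license_id` values depend on how the portal is configured.

```python
import json
import re


def ckan_package(title, description, source_url, organisation,
                 author, author_email, maintainer, maintainer_email,
                 license_id, tags):
    """Build a dataset dictionary using CKAN's standard package field names."""
    # CKAN dataset names are URL slugs: lowercase letters, digits, '-' and '_'.
    name = re.sub(r"[^a-z0-9_-]+", "-", title.lower()).strip("-")
    return {
        "name": name,
        "title": title,
        "notes": description,       # CKAN stores the description as "notes"
        "url": source_url,          # source URL
        "owner_org": organisation,  # id or name of the owning organisation
        "author": author,
        "author_email": author_email,
        "maintainer": maintainer,
        "maintainer_email": maintainer_email,
        "license_id": license_id,
        "tags": [{"name": tag} for tag in tags],
    }


# Illustrative values only; people, emails and organisation are placeholders.
payload = ckan_package(
    title="HKU Repeatability in HK Research Experiment",
    description="Data availability checks carried out as a class exercise.",
    source_url="https://example.org/source",
    organisation="example-org",
    author="A. Student",
    author_email="author@example.org",
    maintainer="A. Maintainer",
    maintainer_email="maintainer@example.org",
    license_id="cc-by",
    tags=["reproducibility", "hong-kong"],
)
print(json.dumps(payload, indent=2))
```

On a real portal this dictionary would be sent as JSON in a POST request to `/api/3/action/package_create`, authenticated with the user's API key.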
76. Submitting to OSF
Very Very Basic Metadata (all optional):
Title, Authors, Keywords, Description, License (CC0,
CCBY). Free.
https://osf.io/8zfty/
77. Submitting to Mendeley data
Basic Metadata:
Contributors,
Institutions (optional),
Categories,
Description,
Steps to reproduce (optional),
Related Links (optional)
Lots of license options (inc NC).
Free.
https://data.mendeley.com/
78. Submitting to figshare
Very Basic Metadata:
Title,
Authors,
Categories/File Type,
Keywords,
Description.
License (CC0, CCBY, OSI).
Free if <5GB.
https://figshare.com/s/0e0906894fc934f8c7b2
79. Looking ahead…
• Final project due 15th May
• Any interest in a summer project (writing up the
Reproducibility Exercise into a short paper)? Let
me know…
• Questionnaire to fill out for Dr Chu (we’ll hand
this out)…