4. Let's Move Beyond Open Data
to Open Development?
In July this year, the Sunday Business section of the New
York Times featured a story about the Bank’s Open Data
initiative and claimed that datasets and information will
ultimately become more valuable than Bank lending.
This is not about the World Bank as the central repository
of knowledge sharing its knowledge and wisdom with
clients from the South.
It is about “democratizing development economics” in
that it levels the playing field on knowledge creation and
dissemination and opens the development paradigm to
participation from researchers and practitioners, software
developers and students, from north and south.
5. SRF
The CGIAR is unique in having the capacity to collect
experimental, monitoring, and survey data on
agricultural systems throughout the developing world.
Most data collected by CRPs, whether broad-scale data
used to describe and monitor farming system changes, or
focused data collected to examine specific processes and
hypotheses, should be of such potential value that the
cost of archiving and sharing is justified by the value
added in terms of expanded research results from the
use of that data by a wider research community.
6. “This is one clear and consistent message
from the last CGIAR science forum: data
archiving of the CG Centers is overall
abysmally poor”
Robert Nasi, Director,
Forests, Trees and Agroforestry (CRP6)
8. Research Data Repository
What should be deposited
1. All research data belonging to publications
2. High value data sets of interest to ICRAF, other CG
centers & Partners
Research Data Management Policy
9. The Policy
All of the Centre’s data needs to be:
a) derived from research relevant to our agenda, to the
development challenges in our strategy, to the
Strategy and Results Framework (SRF) of the CGIAR
and to the CGIAR Research Programs (CRPs);
b) of high quality (well designed, well collected, well
verified, well documented);
c) protected and archived;
d) available (people know that it exists) and easily
accessible (people can gain access to the data) to all;
e) adaptable, so that it can be well utilized and
transformed where possible into actionable
knowledge.
10. Whose responsibility?
The Centre
• setting up clear protocols, conducting peer reviews, using robust
and well-documented methods and appropriate statistical
analyses, and producing meta-analysis and syntheses of results
• providing a stable, reliable data repository system that can handle
both document-centric and data-centric objects.
• ensuring that all necessary raw data will be made public to
reproduce or replicate every scientific publication that is based on
research data
11. Whose responsibility?
The project/scientist
• comply with explicit quality standards
• submit the necessary raw, verified data for every scientific publication
in standard file formats
• ensure that research data produced for the Centre is described by
appropriate metadata throughout its lifecycle
12. How do we achieve highest
scientific standards?
RMG: quality control throughout the data
lifecycle (collecting, verifying,
managing, analyzing, storing)
Beyond RMG: ensuring that all staff follow the
institutional standards and
guidelines.
The ultimate benchmark for all scientists, however, is the
consensus of peers.
13. Research Data Repository
Challenge:
Move data from scientist laptops to institutional server
and
Have the data described by sufficient metadata
without
Increasing transaction costs or
Creating an auditing issue
15. Dataverse Network
• The Dataverse Network is an application to
publish, share, reference, extract and analyse
research data.
• It facilitates making data available to
others, and enables replication of work.
• Researchers and data authors get
credit, publishers and distributors get
credit, affiliated institutions get credit.
16. Dataverse Network
• A Dataverse Network hosts multiple
Dataverses.
• Each Dataverse contains studies or collections
of studies.
• Each study contains cataloguing information
that describes the data plus the actual data
files and complementary files.
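The containment described above (a Network hosts Dataverses, a Dataverse contains studies or collections of studies, a study holds cataloguing information plus files) can be pictured as a small data model. This is only an illustrative sketch: the class and field names below are assumptions chosen for this example and are not the Dataverse Network's own schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Study:
    """One study: cataloguing information plus its files (hypothetical fields)."""
    title: str
    cataloguing: Dict[str, str] = field(default_factory=dict)
    data_files: List[str] = field(default_factory=list)
    complementary_files: List[str] = field(default_factory=list)


@dataclass
class Collection:
    """A named group of studies inside a Dataverse."""
    name: str
    studies: List[Study] = field(default_factory=list)


@dataclass
class Dataverse:
    """A Dataverse holds studies directly and/or grouped into collections."""
    name: str
    studies: List[Study] = field(default_factory=list)
    collections: List[Collection] = field(default_factory=list)

    def all_studies(self) -> List[Study]:
        """Return direct studies followed by the studies of every collection."""
        found = list(self.studies)
        for collection in self.collections:
            found.extend(collection.studies)
        return found


@dataclass
class DataverseNetwork:
    """A Network hosts multiple Dataverses."""
    name: str
    dataverses: List[Dataverse] = field(default_factory=list)


# Illustrative content only; the study and file names are made up.
survey = Study("Household survey", {"author": "A. Researcher"}, ["survey.csv"], ["codebook.pdf"])
trial = Study("Field trial", {"author": "B. Researcher"}, ["trial.tab"])
icraf = Dataverse("ICRAF", studies=[survey], collections=[Collection("Trials", [trial])])
network = DataverseNetwork("Example network", [icraf])
```

Calling `icraf.all_studies()` returns both studies, whether they sit directly in the Dataverse or inside a collection.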
17. Data Backup & Preservation
• The IQSS Dataverse Network maintains a full backup of all data and
directories on the Network for 6 months, in the Harvard Depository.
This means that there always is a full, offsite copy of the Network
that is less than 7 months old.
• IQSS will maintain on-line storage, backup, and media migration
sufficient for all studies it accepts (in addition to storage provided
for the IQSS DVN).
• The Henry A. Murray Archive, through its endowment, supports
permanent bit-level preservation of all social science research
studies directly deposited in the IQSS Dataverse Network.
http://thedata.org/book/data-backup-terms
18. Hosting
• There are two approaches:
1. You can download and install the Dataverse Network
Application and effectively become a host; or
2. You can create a Dataverse on the IQSS* Dataverse Network at
Harvard University. This Network is open to all
researchers, publishers and data distributors.
• Option 1 gives you more control but includes added
responsibility & cost
*Institute for Quantitative Social Science
19. Hosting – IQSS Option
• Advantages
– Dataverse software is installed, hosted and managed for you by IQSS
– Dataverse is hosted on Harvard’s infrastructure, which is very good
– IQSS offers great support in helping you set up your dataverse and
provides great help if you run into any problems
• Disadvantages
– Network-level administrative tasks cannot be done. These include:
• Creating user groups based on IP address or IQSS network user names
• Creating harvesting dataverses, which allow you to share metadata with other
systems, e.g. DSpace. Sharing includes exporting and importing metadata.
• Complete deletion of studies not just deaccession
• Accessing web statistics
– Cannot use alias URLs to point to your dataverse e.g. we cannot have
the url http://data.worldagroforestry.org pointing to the ICRAF IQSS
dataverse http://dvn.iq.harvard.edu/dvn/dv/icraf
20. Hosting – Self Hosting
• Advantages
– Full access to Network level Administrative tasks including:
• Ability to import and export studies to and from other systems
• Ability to create user groups based on IP address and your dataverse users
• Ability to use software supplied utilities e.g. complete deletion of studies and
locking of studies
• Greater flexibility in user management and “Terms of use” management
• Greater flexibility in dataverse branding
– Ability to use organization URLs to point to the dataverse e.g.
http://data.worldagroforestry.org
• Disadvantages
– Need an IT expert to install and manage the dataverse
software, including things like upgrading, applying security
patches, backups etc.
– Need good server infrastructure for hosting the application, especially
storage space.
23. Data Citation for each study
• Dataverse allows digital research data to be cited
from published printed work
• A citation is automatically generated when a study
is created.
• Data Citation format:
Author, Date, “Title”, Persistent Identifier
Universal Numerical Fingerprint (UNF)
Distributor or other optional fields [ …]
24. Unique Citation Components
1. Persistent Identifiers – Offer permanent and
reliable links to digital objects. Dataverse uses the
Handle System, e.g. hdl:1902.1/15673
2. Universal Numerical Fingerprint (UNF)
– Applied to quantitative data
– Used to uniquely identify and verify data,
e.g. 5:G22I+TtPQPAyFcRT6SrUfA==
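A handle such as hdl:1902.1/15673 becomes a clickable link when it is appended to the Handle System proxy address, which is the form the citation example in this deck uses (http://hdl.handle.net/1902.1/15673). A minimal sketch of that conversion follows; the function name `handle_to_url` is an assumption made for this example, not part of Dataverse.

```python
HANDLE_PROXY = "http://hdl.handle.net/"  # proxy address as it appears in the deck's citation


def handle_to_url(identifier: str) -> str:
    """Turn 'hdl:<prefix>/<suffix>' into a resolvable proxy URL."""
    scheme = "hdl:"
    if not identifier.startswith(scheme):
        raise ValueError(f"not a handle identifier: {identifier!r}")
    handle = identifier[len(scheme):]
    # A handle has a naming-authority prefix and a local suffix separated by '/'.
    prefix, sep, suffix = handle.partition("/")
    if not (prefix and sep and suffix):
        raise ValueError(f"handle must look like prefix/suffix: {identifier!r}")
    return HANDLE_PROXY + handle


url = handle_to_url("hdl:1902.1/15673")
```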
25. Example of Citation
Frank Place; Patti Kristjanson; Steve Staal; Russ
Kruska; Tineke deWolff; Robert Zomer; E C
Njuguna, 2005, "Replication data for:
Development pathways in medium-high
potential Kenya: a meso-level analysis of
agricultural patterns and determinants.",
http://hdl.handle.net/1902.1/15673
UNF:5:G22I+TtPQPAyFcRT6SrUfA==
World Agroforestry Centre [Distributor] V1 [Version]
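To show how the citation components fit together, the sketch below assembles a citation string in the layout of the example on this slide. It is not the code Dataverse itself runs; `format_citation` and its parameters are hypothetical names, and the separators are simply copied from the example.

```python
from typing import Sequence


def format_citation(authors: Sequence[str], year: int, title: str,
                    handle_url: str, unf: str, distributor: str,
                    version: int) -> str:
    """Join the citation components in the order used by the example citation."""
    author_part = "; ".join(authors)
    return (f'{author_part}, {year}, "{title}", {handle_url} '
            f"UNF:{unf} {distributor} [Distributor] V{version} [Version]")


# Components taken from the example citation on this slide.
example = format_citation(
    authors=["Frank Place", "Patti Kristjanson", "Steve Staal", "Russ Kruska",
             "Tineke deWolff", "Robert Zomer", "E C Njuguna"],
    year=2005,
    title=("Replication data for: Development pathways in medium-high "
           "potential Kenya: a meso-level analysis of agricultural "
           "patterns and determinants."),
    handle_url="http://hdl.handle.net/1902.1/15673",
    unf="5:G22I+TtPQPAyFcRT6SrUfA==",
    distributor="World Agroforestry Centre",
    version=1,
)
```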
26. Designed for Research Data
Data-format aware
• Input formats: CSV, TAB, SPSS, STATA, GraphML
• Export: reformat, subset, analyze
• Preservation-reformatting
• Semantic fingerprints
Research data workflows
• Researcher can enter deposit directly
• Multiple workflows: closed, review-and-release, wiki
• Versioned
Find distributed resources
• Can provide a portal to distributed resources (OAI-PMH harvesting
client)
• Data can also include metadata for harvesting
Flexible licensing
• Access control for research groups
• Layered usage terms
• Data request workflow
Robust
• Supports any file type; the only restriction is a maximum size of 2 GB per file
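Because the only stated restriction is the 2 GB per-file limit, a scientist could check file sizes before attempting a deposit. The sketch below is one way to do that under stated assumptions: it reads "2GB" as 2 × 1024³ bytes (the slide does not say whether decimal or binary units are meant), and the function names are hypothetical.

```python
import os
from typing import Dict, Iterable, List

# Assumption: "2GB" is interpreted as 2 * 1024**3 bytes.
MAX_FILE_BYTES = 2 * 1024 ** 3


def sizes_on_disk(paths: Iterable[str]) -> Dict[str, int]:
    """Map each path to its size in bytes as reported by the file system."""
    return {path: os.path.getsize(path) for path in paths}


def oversized(sizes: Dict[str, int], limit: int = MAX_FILE_BYTES) -> List[str]:
    """Return the names of files whose size exceeds the per-file limit."""
    return sorted(name for name, size in sizes.items() if size > limit)


# Illustrative sizes only; these file names are made up.
candidate_sizes = {
    "plots.csv": 48_000_000,
    "imagery.zip": 3 * 1024 ** 3,
    "codebook.pdf": 250_000,
}
too_large = oversized(candidate_sizes)
```

In practice the two helpers would be combined as `oversized(sizes_on_disk(paths))`, and any file listed would need to be split or compressed before upload.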
29. Publication and Data Submission
Proposed Workflow
[Flowchart; the arrows and exact step order are not recoverable from the extracted text. Steps shown: Start; Request data from scientists; GRP/Region submit publication; Publication has data? (Yes/No); Data submitted to RMG data manager; Publication submitted into DSpace; Publication or data received? (Yes/No); Upload data to Dataverse; DSpace editors receive data link; Update DSpace (unreleased) publication with data link; DSpace editors approval; Publication approval (Yes/No); Request publication changes; Publication published in DSpace; Publication published to the web.]