This lecture looks at the past of infrastructure development for research data, drawing on general patterns of infrastructure development and on experiences from the evolution of the IEDA data facility, to inform future pathways and developments. A major focus of the lecture is on the FAIR principles and the issues surrounding reusability of data.
EGU 2018 Ian McHarg Lecture
1. Data Infrastructure for the
Earth & Space Science
How Far Have We Come,
Where Are We Heading?
Kerstin Lehnert
Lamont-Doherty Earth Observatory, Columbia University
April 10, 2018
Ian McHarg Lecture 2018
2. Before I start, a short detour ...
The Kaiserstuhl, Germany
4. My goal
study the past
if you would
define the future
Confucius
5. Learning from the past:
(1) The Big Picture
2007
2018
https://www.rd-alliance.org/sites/default/files/Common_Patterns_in_Revolutionising_Infrastructures-final.pdf
6. Learning from the past:
(2) The Real World
The story of IEDA
(Interdisciplinary Earth Data Alliance)
www.iedadata.org
... there was a database named PetDB
7. A biased perspective
I am a geoscientist who
directs a US data facility for
primarily investigator-based
data (“long tail”) funded by
the National Science
Foundation.
www.iedadata.org
8. Defining the Topic
Data infrastructure is a digital
infrastructure promoting data sharing and
consumption.
Its goal is to enable researchers to make the best use of the
world’s growing wealth of data for the advancement of
science and the benefit of society.
9. Data drive Earth science:
A new way of understanding the world
Data:
The 4th Paradigm
The 5th Dimension
10. We have been talking about it for a
while ...
2006
12. Growth of Earth & Space Science Informatics
63 ESSI session proposals – an increase of 40%
729 ESSI abstracts – an increase of ~18.7 %
35 ESSI oral sessions - an increase of ~40%
4 Data Fair Town Halls
Machine Learning/Deep Learning: biggest increase in any theme
big increases also in FAIR, Repositories & Data Storage, and Adoption & Adaption
Credit: Lesley Wyborn
AGU FM Program Committee Member
AGU Fall Meeting 2017:
14. Learning from the past: The Big Picture
Insights into the development of infrastructures
15. Revolutionary!
Roman water supply system
Railroad systems
Global electrification
Internet
16. Patterns of Infrastructure Development
Edwards et al. 2007
1. Deliberate and successful design of
‘local’ systems.
2. Technology transfer across domains
and locations
3. Infrastructures form via gateways
that allow dissimilar systems to be
linked into networks
Wittenburg & Strawn 2018
1. Inventions and development of
start-up systems
2. Technology transfer between
regions and also society
(creolization)
3. Planning for system growth where
"reverse salients" need to be
tackled
4. Substantial momentum (mass,
velocity, direction)
System Building
Growth
Consolidation
18. Creolization
New components are continuously introduced
trying to solve specific challenges
Capabilities grow unevenly (e.g. big vs small data)
Fragmentation
Leads to
Inefficiencies in use and costs
Winners & losers: some solutions are more
promising and get more attraction
Better understanding the underlying rules,
principles and limitations.
(After Wittenburg & Strawn, 2018)
19. Attraction via “Universals”
“Simple” principles, broadly supported
Only influence directly a specific part of the
overall infrastructure, enable efficiency at the top
layers
Form stable basis for new developments
(After Wittenburg & Strawn, 2018)
“Universals are ... essential to create a
momentum by overcoming fragmentation and
achieving economies of scale.”
20. Attraction is happening!
Relevance of community organizations that
define principles, procedures, and component
specifications
RDA: global & cross-disciplinary
ESIP: Earth Science & US (others coming?)
New: RDA Interest Group “ESIP/RDA Earth,
Space, and Environmental Sciences”
21. Universal: FAIR principles
Represent a guideline for data providers to
enhance the reusability of their data holdings:
Data can be found on the Internet.
Data are accessible in a usable format with clear rights
and licenses.
Data access is reliable & persistent.
Data are identified in a unique and persistent way so
that they can be referred to and cited.
Data are documented with rich metadata.
22. Universal:
Standards for data repositories
Cooperative effort between Data Seal of Approval (DSA) and the World Data
System (WDS) under the umbrella of the Research Data Alliance (RDA)
Harmonized requirements & procedures for certification of repositories
Confidence for publishers and funders about which repositories to trust
Basis for development of new repositories
23. “Enabling FAIR Data” project @ AGU
Develop & implement standards that will connect researchers, publishers, and
data repositories in the Earth and space sciences to enable FAIR data
Grant from the Laura and John Arnold Foundation (LJAF) to the AGU
FAIR-compliant data repositories (CoreTrustSeal certified, preferred domain
specific)
FAIR-compliant Earth and space science publishers
Align their policies for data to be deposited in certified repositories
Gives similar experience for researchers.
Slide after S. Stall et al., presentation at RDA P11
Berlin, March 2018
24. All publishers who are part of the
Coalition on Publishing Data in the Earth
and Space Sciences (COPDESS) support
the efforts of trusted repositories that
aggregate research data, software, and
physical samples for the use of the
scientific community.
“These Data Guidelines align the
Author’s instructions for the submission
of data sets in the Earth and Space
Sciences, for all affiliated publishers.”
25. Universal:
Persistent Identifiers
Founded 2009
Founded 2011
Founded 2012
“The intention of this cross-
disciplinary report is to overcome still
existing confusions about PIDs and the
lack of detail knowledge in many
disciplines. ...to identify agreements
across documents that have been
suggested to be included by experts.”
From: “Common Patterns in Revolutionary Infrastructures and Data”,
P. Wittenburg & G. Strawn, February 2018
26. Learning from the past:
(2) The Real World
The story of IEDA
(Interdisciplinary Earth Data Alliance)
...there was a database named PetDB
27. Once upon a time ...
PetDB web site in 1999
28. Note:
PetDB is a database that allows access to
data at the level of individual data
points, not files!
29. Success: New data-driven science
in geochemistry
Meyzen et al. (2007): “Isotopic portrayal
of the Earth's upper mantle flow field.”
Putirka et al. (2007)
Stracke & Hofmann (2005)
Class & Goldstein (2007)
2018: 740 citations
30. An analysis in 2007
T. Plank, 1999: “Within about 5 minutes of logging on for the first
time, I was staring at an EXCEL file that had all the REE on
basalt glasses from the EPR from 10°N to 20°S. And the answer
to my La/Sm question. I am very impressed, we are looking at
the future of geochemistry.”
GSA 2007 talk: “My Data, Your Data, Our Data!”
32. Another failed network attempt
PaleoStrat not funded
Development of interoperability
with CoreWall not funded
Too many political obstacles
“Promises, Achievements, and Challenges of
Networking Global Geoinformatics Resources”
EGU General Assembly 2008
33. Growth of data systems at Lamont
34. Consolidation
“This Cooperative Agreement converts a series of proposal/award-driven
activities into a community-based facility that serves to support, sustain,
and advance the geosciences by providing a centralized location for the
registry of and access to data essential for research in the solid-earth and
polar sciences.”
- Continue operating & maintaining existing systems
- Develop tools for investigators to comply with NSF data policies (IEDA Data
Management Plan Tool & Data Compliance Reporting Tool)
- Develop tools and modify architecture to provide integrated access to holdings
36. IEDA Today: Data Holdings & Growth
> 70 TeraBytes of marine geophysical sensor data in the MGDS
> 20 million analytical measurements for >1 million samples in
EarthChem
> 4.2 million samples registered and searchable in SESAR (System
for Sample Registration)
37. IEDA Today
Thousands of download requests per
month
>2,000 citations in the literature
~ 10,000 start-ups of GeoMapApp per
month
>2,700 GeoPass users*
Demonstrated impact on science
*GeoPass accounts are required to submit data to EarthChem/
Geochron, SESAR, & USAP-DC, and to use the DMP Tool
[Figure: Citations of IEDA Systems in the Scientific Literature. Y-axis: number of citations per year (0 to 250). Series: EarthChem/PetDB/SedDB and MGDS/GMRT/GMA.]
38. IEDA is “attracting”
👍
Certification: Member of World Data System since 2011 (CoreTrustSeal
certification underway)
Use of Persistent Identifiers
Publication agent of DataCite since 2011
DOI registration of datasets since 2009 via TIB Hannover
The International Geo Sample Number: A PID for physical samples
FAIR data
Findable/accessible: DOIs, landing pages, GUIs, APIs
Interoperable: CSW, DataONE member node, schema.org (EarthCube project P418)
Reusable: disciplinary expertise for data curation, rich provenance metadata
40. Merger of EarthChem & MGDS created
tensions
Partner system needs versus overarching IEDA level needs
Budget
Staff expertise
Staff allocations
Distribution among different funding sources (3 different NSF programs)
Scientific utility versus trustworthiness of operations
Operation & maintenance versus innovation
41. Merger did not lead to the expected
‘economies of scale’
Disciplinary data curation continues as the most relevant component.
Additional resources/effort needed for coordination and alignment of
activities and practices across partners.
More project management required due to budget level and status as facility.
Building useful data search and discovery across multi-disciplinary systems is a
challenging problem.
[Figure: cost per system]
43. Access to all IEDA repositories in one place
Free text, map, and facet-based search
options
ISO metadata available for other catalogs to
harvest
Major work to align concepts and
vocabularies in the different repositories
Challenge to agree on facets
Relevance to different data types
Availability of metadata
Granularity of datasets
Achievements:
IEDA Integrated Catalog
44. A changing ecosystem
“IEDA’s cross-disciplinary services for data discovery (IEDA Data Browser)
and data access (IEDA Integrated Catalog) across all IEDA systems are
increasingly superseded by tools developed with substantially larger
resources as part of EarthCube, Google (Google’s new Research Data
Search based on schema.org), or perhaps DataONE. These recent
developments aim to provide researchers with the tools to find and use
data in a highly distributed and fragmented data infrastructure based on
new approaches for interoperability, metadata registries, and hubs such
as SCHOLIX to link data and literature.”
IEDA: Future Scope and Structure
(IEDA internal report, K. Lehnert & S. Carbotte, January 2018)
45. We need to adapt
- Reduce complexity of operations
- Adjust to and better leverage external CI developments (e.g. EarthCube)
- Enhance opportunities to grow partnerships relevant to the disciplinary
systems to target needs of the disciplinary communities
Systems and/or services that serve broader audiences should be funded
independently (SESAR, GeoMapApp, GMRT)
Create a new management/governance structure
more independence for IEDA partners and funders to allow growth
rely on external developments for cross-disciplinary services
46. Where are we heading from here?
47. Oh no, that diagram again ...
A Digital Object has a structured bit sequence
stored in a trustworthy repository.
A Digital Object has a PID and metadata.
The PID is associated with all relevant kernel
information that allows humans and machines
to enable FAIR.
Kernel information and Digital Object have types
allowing humans and machines to associate
operations with them.
According to Wittenburg & Strawn (2018), the
implementation of data infrastructure can be
guided by 4 statements:
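The four statements on this slide can be read as a small data model. The following is a minimal sketch in Python, not the specification from Wittenburg & Strawn (2018): the class names, field names, the `hdl:00000/...` identifiers, and the operations registry are all hypothetical and serve only to show how a PID, kernel information, and a type can let a machine decide what to do with an object.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KernelInfo:
    """Hypothetical minimal kernel record that a PID would resolve to."""
    pid: str            # persistent identifier of the object
    object_type: str    # type that lets machines pick an operation
    repository: str     # where the bit sequence is stored
    checksum: str       # SHA-256 of the bit sequence
    metadata_pid: str   # pointer to the descriptive metadata


@dataclass
class DigitalObject:
    """A structured bit sequence plus its kernel information and metadata."""
    bits: bytes
    kernel: KernelInfo
    metadata: dict = field(default_factory=dict)

    def is_intact(self) -> bool:
        # Compare the stored checksum with one computed from the bits.
        return hashlib.sha256(self.bits).hexdigest() == self.kernel.checksum


# Registry mapping an object type to an operation (illustrative only).
OPERATIONS = {}


def register(object_type):
    def wrap(fn):
        OPERATIONS[object_type] = fn
        return fn
    return wrap


@register("text/csv")
def count_rows(obj: DigitalObject) -> int:
    # Number of data rows, excluding the header line.
    return len(obj.bits.decode("utf-8").splitlines()) - 1


def apply_operation(obj: DigitalObject):
    """Select an operation from the object's type, after an integrity check."""
    if not obj.is_intact():
        raise ValueError("checksum mismatch")
    return OPERATIONS[obj.kernel.object_type](obj)


def make_example() -> DigitalObject:
    # All values below are invented for illustration.
    bits = b"sample,La,Sm\nA1,3.1,2.4\nA2,2.8,2.2\n"
    kernel = KernelInfo(
        pid="hdl:00000/example-0001",
        object_type="text/csv",
        repository="example-repository",
        checksum=hashlib.sha256(bits).hexdigest(),
        metadata_pid="hdl:00000/example-0001-md",
    )
    return DigitalObject(bits=bits, kernel=kernel, metadata={"title": "Illustrative table"})
```

Calling `apply_operation(make_example())` looks up the operation registered for the object's type and runs it. The point of the design is that the caller needs only the kernel information, not prior knowledge of the file.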
48. My take on priorities
Reusability
Impact on Science
Sustainability
Data type specific best practices
Metadata quality
Granularity of access, data fusion
Metrics
Data Science Education
Business models
Consolidation
The impact of data
infrastructure on science
& society depends on the
reusability of data and
will ultimately justify its
continued funding.
49. Reusability problem: Metadata quality
Discipline-specific and data type
specific metadata not well defined
and enforced
Lack of consistent vocabularies
Automated metadata enrichment
(e.g. CINERGI) has not yet had
convincing results
Manual data curation still best,
but too costly
“The Geochemical Data(base) Factory: From Heterogeneous Input to
Homogeneous Output.” AGU FM 2009
50. Reusability problem: data wrangling
Surveys in recent years show that data scientists still spend 75-80% of their time
‘data wrangling’.
RDA EU survey 2013 (75%)
Brodie 2015 (80%)
CrowdFlower 2017 (80%)
Source:
Crowdflower
51. Reusability solution: Data Fusion
Harmonize & integrate data so that
disparate pieces of information form a
picture that can be explored to reveal
patterns in space, time, and properties.
52. Structure data so they can be accessed and
understood at a more granular level
Approaches are available and improving
ISO/OGC Observations & Measurements
Observation Data Model ODM2 (Horsburgh et al. 2017)
Schema.org
Open Core Data
Reusability solution:
Data Fusion
S. Cox et al. “Mainstream web standards now
support science data too”; AGU FM 2017
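To make "harmonize & integrate" and "granular access" more tangible, here is a minimal sketch in Python. It does not implement ISO/OGC Observations & Measurements, ODM2, or Schema.org; it only borrows the general idea of storing one record per observation. The synonym table, the source names, and the sample values are invented for illustration. The only factual conversions assumed are 1 wt% = 10,000 ppm and 1 ppb = 0.001 ppm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One measured value for one sample: the granular unit of access."""
    sample_id: str
    observed_property: str
    value: float
    unit: str
    source: str


# Hypothetical vocabulary mapping: lower-cased source labels -> one agreed term.
PROPERTY_SYNONYMS = {
    "sio2": "SiO2",
    "sio2 (wt%)": "SiO2",
    "la": "La",
    "la_ppm": "La",
    "lanthanum": "La",
}

# Conversion factors to a single reporting unit (ppm).
TO_PPM = {"ppm": 1.0, "wt%": 10000.0, "ppb": 0.001}


def harmonize(sample_id, label, value, unit, source):
    """Map a source-specific label and unit onto the shared vocabulary and ppm."""
    prop = PROPERTY_SYNONYMS[label.strip().lower()]
    factor = TO_PPM[unit.strip().lower()]
    return Observation(sample_id, prop, value * factor, "ppm", source)


def fuse(observations):
    """Group harmonized observations by property so they can be explored together."""
    by_property = {}
    for obs in observations:
        by_property.setdefault(obs.observed_property, []).append(obs)
    return by_property
```

Once every value is an `Observation` with an agreed property name and unit, records from different contributors can be queried together at the level of individual data points rather than files.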
53. Reusability problem: The Long Tail
Small data volumes, but big potential
Culture is not open to sharing
Data fragmented and highly heterogeneous
Lots of .xls files
Many data never see the light of day
ESIP Winter Meeting, January 2016
54. Reusability hope: Generation change
“A new scientific truth does not triumph by
convincing its opponents and making them see
the light, but rather because its opponents
eventually die, and a new generation grows up
that is familiar with it.”
Max Planck
55. Credit: Jon Stelling, Lehigh University
56. Steps in the data life cycle are siloed in many
communities and disciplines
Recommendation: focus on the full data life
cycle
Final Report from the NSF Computer and Information Science and
Engineering Advisory Committee, Data Science Working Group
Communications of the ACM, Vol. 61 No. 4,
Pages 67-72, April 2018
57. A trend toward large facilities
58. Education in Data Science or
Data Science in Education
Data Science as a new field in academia
Different organizational models emerging at academic
institutions to integrate with domain sciences
59. I’ll leave the funding question to the
experts.
Trust of the science community
60. Funding
“Funding research data management and related infrastructures”, May 2016
Authors: Knowledge Exchange Research Data Expert Group and Science Europe Working Group
on Research Data.
61. Did we move at all?
2007
62. Success!
The International Geo Sample Number
Grew from a local, centralized system started in 2004 to
an international organization founded in 2011
Now has 24 members on 5 continents
currently 5 active Allocating Agents
Adoption by researchers, collection curators, publishers,
and funding agencies growing
Adoption spreading to other disciplines
Biology, archeology, material sciences
# of IGSNs issued by active IGSN Allocating Agents:
IEDA: 4,261,436
Geoscience Australia: 2,100,273
MARUM: 100,342
CSIRO: 30,925
GFZ: 4,809
Newest members since 2017:
USGS (USA)
BGS (UK)
CNRS (France)
IFREMER (France)
ANDS (Australia)
63. The final message: Let’s work together!
It is important that we leverage existing
capabilities and expertise.
We do not have the luxury of duplicating
effort.
We need to break down barriers between
communities and stakeholders that compete
for their piece of the pie.
NSF Workshop Cyberinfrastructure for Large Facilities, Nov 2015
64. Back to the beginning:
“Do what excites you. Follow your passion.
Don't necessarily worry about what obstacles
might be there, because there are always ways
to overcome them. But the most exciting thing
is to be able to do what you love, and just don't
let anything stand in the way of that.”
Carol Greider, 2009 Nobel Prize winner
I am incredibly honored and humbled by this medal, and I really would like you to know how much this means to me. So before I start getting into the topic of RDI, I would like to take a brief detour and talk a little bit about how I got here and what the significance of this honor is in my life.
In 1982 I was about ready to finish my dissertation in petrology when I got pregnant, married, and became a housewife. The scientific work that I was doing came to an end and my career seemed to be over before it had even started. Two years after my son was born, I took a half-time position as a lab technician at the Max-Planck-Institute for Chemistry in town, and even though it did not pay any real money, it brought me back into the research environment. I had amazing colleagues, who encouraged me to finish my PhD and supported me through a rough couple of years, when I tried to be a mom during the day and catch up with science at night. But it was the best thing I have done, and I am so grateful to all those colleagues. Without that PhD, I would not have been able to get the position as Staff Associate at the Lamont-Doherty Earth Observatory when I moved to the US in 1996. In that position I had two main duties: to run a geochemistry lab and to build a database for volcanic rock geochemistry. And that was the beginning.
A lecture like this is a great opportunity to reflect on the past, where we started off and where we got to, and use the experiences that we collected ourselves in our work and the insights gained through broader developments – be they good or bad – to inform decisions regarding the future.
I will take two different looks at the past:
One uses the work of historians, economists, social scientists, and information scientists to understand the development of infrastructures and how those insights can inform the development of data and cyberinfrastructure. In 2007, while preparing a presentation for an NSF workshop that was convened to envision the future of Geoinformatics in the US and globally, I found a report written by Paul Edwards and colleagues that was a real eye-opener and helped me, and I think many others, to put ongoing activities aimed at building cyberinfrastructure into context. Just last month, while preparing for this lecture, I ran into a paper by Peter Wittenburg and George Strawn that builds on the same classic book by Thomas Hughes to define the path of data infrastructure for the future.
The other is based on my own experiences along the path of building data infrastructure for the solid earth sciences, especially the experiences gained in the creation and operation of the Interdisciplinary Earth Data Alliance, which I direct.
A word of caution first: The data universe is highly complex and diverse. I cannot possibly aspire to cover all topics and address every aspect. I am a geoscientist ...
Vision:
Enable an open, extensible, and evolvable digital science ecosystem.
Facilitate research data, information, knowledge, and data tools discovery.
Enhance problem-solving processes.
Move and connect scientific data across scientific disciplines
Manage scientific workflows
Interoperation between scientific data and literature
Integrated science policy framework
Networked digital data systems & libraries that interoperate
There are a number of drivers behind building data infrastructure:
There is an ever growing, and maybe exponentially growing volume of data acquired in the sciences in general, and specifically in the Earth sciences where new data acquisition technologies and computing capabilities are used to gather observations from space, in the oceans, and on land, to simulate earth processes and to generate models that predict future paths.
And the data, together with the technologies to mine, analyze, and visualize them, are giving us new insights into the way the Earth works and ...
Lots of reports have come out.
There is no doubt that infrastructures have a profound effect on the nature of modern human societies.
Roman water supply system
Opened the way to building the largest capital in ancient times,
Railroad systems
Allowed the exchange of people & goods at previously unknown speeds and facilitated the first industrial revolution,
Global electrification
Changed the availability of power and facilitated the second industrial revolution.
The Internet with its web applications
Changed the availability of information and facilitated new kinds of businesses.
Infrastructures start with test installations, followed by small installations, which are then extended stepwise into interconnected systems.
“Attraction and convergence are driven mainly by efficiency and economic concerns.
The benefit of convergence is the belief of stakeholders that a stable fundament has been built, on top of which new investments and developments can be made to fully exploit the new technologies and infrastructures.”
FAIR principles are a major milestone that represents an ‘attractor’ in the solution space. But FAIR principles express policy goals. They need to be translated into actions.
When businesses merge, it is often to achieve economies of scale. Larger organizations are typically able to produce goods and services more efficiently and at a lower per-unit cost than smaller businesses because fixed costs are spread out over a larger number of units. This is not always the case, however. Sometimes when two firms merge, being larger will actually create dis-economies of scale, where per unit production costs increase because of increased coordination costs.
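The trade-off described above can be made concrete with a little arithmetic. The sketch below is a simplified cost model in Python, not an analysis of IEDA's actual budget: the function name, the cost terms, and every number are assumptions chosen only to show how coordination costs can cancel the benefit of spreading fixed costs over more units.

```python
def per_unit_cost(fixed_cost, variable_cost_per_unit, units, coordination_cost=0.0):
    """Average cost per unit: fixed and coordination costs are spread over all units."""
    if units <= 0:
        raise ValueError("units must be positive")
    return (fixed_cost + coordination_cost) / units + variable_cost_per_unit


# Invented numbers in arbitrary cost units.
# One stand-alone system: fixed cost 100, 50 units served.
standalone = per_unit_cost(fixed_cost=100, variable_cost_per_unit=2, units=50)

# Two systems merged: some fixed costs are shared (150 instead of 200), 100 units served.
merged_ideal = per_unit_cost(fixed_cost=150, variable_cost_per_unit=2, units=100)

# Same merger, but coordination and alignment across partners adds 100.
merged_with_coordination = per_unit_cost(
    fixed_cost=150, variable_cost_per_unit=2, units=100, coordination_cost=100
)
```

With these made-up inputs the idealized merger lowers the per-unit cost below the stand-alone value, while the merger with coordination overhead raises it above the stand-alone value, which is the dis-economy of scale described in the paragraph.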
Reusability
Domain standards
Business models
Workforce
Quality
Communities need to define disciplinary and data type specific best practices (documentation of provenance, uncertainties, etc.)
Readiness for data mining & analysis
Improve granularity of access
Data fusion (the ‘data lake’)
There are more lessons to be learned from the IGSN development, but that is for another talk.