The document discusses the evolution of science paradigms and scholarly communication towards data-driven approaches. It describes how digital libraries and repositories are becoming central to the new model of open access scholarly communication. Infrastructures such as DRIVER and OpenAIRE integrate existing repositories and enable the deposition, discovery, and access of publications and data. Significant computational and data challenges remain in supporting data-intensive science at scale.
3. Outline
Science Paradigms
New Scholarly Communication & Open Access
Digital Libraries & Repositories
– DRIVER → OpenAIRE
Computational & Data Challenges
eInfrastructures
– D4Science (I & II)
– GRDI2020
Conclusions
4. Science Paradigms
1st - Thousand years ago:
science was empirical
describing natural phenomena
w/ some models, generalizations
2nd - Last few hundred years:
theoretical branch
using models, generalizations
5. Really Early Times
One scientist
One location
One discipline
One phenomenon
One pencil (… carver …)
One paper (… stone …)
Street announcements, e.g., Εύρηκα!
6. Science Paradigms
3rd - Last few decades:
a computational branch
simulating complex phenomena
7. Recent Times
One small group of scientists
One location
One discipline
One phenomenon
One file system
One local disk with custom files
Publications at refereed forums
8. Science Paradigms
4th - Today:
data exploration (eScience)
unify theory, experiment, and simulation
9. Current Times
Many/large teams of scientists
Many locations
Many disciplines
Many phenomena
Many data management systems
Many data forms
Web uploads for publications, data, processes, …
10. Current Times
Web uploads for publications, data, processes, …
New order in scholarly communication
Open access
Creator, author, publisher, curator, preserver
roles mixed up
Digital libraries & repositories at centre stage
11. eInfrastructure Layers
Communities
Users
Functionality
Data / Info / Pubs
Processing
Network
12. Scholarly Communication
Imperatives
1. Comprehensive, global access to any type of
scientific information
2. Minimum time and resource effort to access
and use this information
3. Easy search/navigation, handling, manipulation,
and re-dissemination of information
4. Maximum visibility to and communication with
the research community, research impact
5. Long-term access and preservation of research
results
13. Open Access
“Our mission of disseminating knowledge is only half
complete if the information is not made widely and
readily available to society. New possibilities of
knowledge dissemination not only through the classical
form but also and increasingly through the open access
paradigm via the Internet have to be supported.”
Berlin Declaration on Open Access to Knowledge
in the Sciences and Humanities, 2003
16. DRIVER High-Level Objectives
Develop an environment for integrating existing
national, regional, or thematic repositories
Create a production-quality European Digital Repository (DR)
infrastructure
Prepare the future expansion and upgrade of the DR
infrastructure across Europe
Identify and promote the use of a relevant set of
standards
Raise awareness among user communities
17. D-NET eInfrastructure Software
Service-Oriented Architecture
Web Services, dynamic service registration, ...
Distributed environment
Services executed on a network of machines
D-NET components (Lego approach)
Enabling services: infrastructure middleware
Data Management services: aggregation systems
End-user Functionality services: search,
community support, portals
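The "dynamic service registration" idea above can be illustrated with a toy in-memory registry. This is only a sketch of the general pattern, not D-NET's actual API: the class name, method names, service kinds, and example.org endpoints are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServiceRegistry:
    """Toy registry (hypothetical, not D-NET code): services announce themselves at run time."""
    _services: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, kind: str, endpoint: str) -> None:
        # A service instance on some machine registers its endpoint under a kind.
        self._services.setdefault(kind, []).append(endpoint)

    def unregister(self, kind: str, endpoint: str) -> None:
        # A service that goes away is removed, so clients stop discovering it.
        endpoints = self._services.get(kind, [])
        if endpoint in endpoints:
            endpoints.remove(endpoint)

    def lookup(self, kind: str) -> List[str]:
        # Clients discover endpoints by kind instead of hard-coding addresses.
        return list(self._services.get(kind, []))


registry = ServiceRegistry()
registry.register("aggregation", "http://node-a.example.org/aggregator")
registry.register("search", "http://node-b.example.org/search")
registry.register("search", "http://node-c.example.org/search")
search_endpoints = registry.lookup("search")
```

Because clients look services up by kind, components can be added to or removed from the network of machines without reconfiguring the others, which is the point of the "Lego approach".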
18. DRIVER production infrastructure
D-NET release v1.1
[Diagram: layered architecture. Layers: Functionality Layer, Data Layer, Enabling Layer. Interfaces: light user interfaces, advanced user interfaces. Actors: end users, administrators. Data source: EU Open Access repositories. Other labels: PO, RO.]
21. Repository Landscape
DRIVER activity
254 repositories – 31 countries
220+ harvested
1.2M documents
European repositories: approx. 500
World repositories: approx. 1100
22. Story – Tales from
Repository managers
Initially I just used the Validation tool to see if
our repository is more or less on track and was
reassured when the results looked good,
which gave me confidence to register.
- Louw Venter,
Boloka Research Repository of the North-West University
South Africa
23. COAR
Confederation of Open Access Repositories
Permanent organisational backbone for
European (and world) repository infrastructure
– Geographic and thematic extension
– Diffusion of DRIVER technology
– Connect established communities of practice
– Promote Open Access
– Fill repositories with Open Access
publications
26. D-NET’s current uptake
DRIVER European Information Space
– www.driver-community.eu
OpenAIRE EC pilot
– www.openaire.eu
European Film Gateway and other EC projects
– www.europeanfilmgateway.eu
Experimentation of deployment of new infrastructure
instances
– China, India, Portugal, Belgium, Spain, Slovenia
28. OpenAIRE High-Level Objectives
Implement European policy on Open Access
“Every publication resulting from European funding
under FP7 or from the ERC should be stored in a
repository and be openly available”
Promote above policy to researchers
Pilot project for full-scale implementation in the
future
29. OpenAIRE - factsheet
Open Access Infrastructure for Research in Europe
Programme: FP7 – Research Infrastructures
Starting date: December 1, 2009
Duration: 36 months
Budget: 4.1 Million
38 partners covering all European member states
To be reached at www.openaire.eu
30. Partners
University of Athens (coordinator)
University of Goettingen Library (scientific coordinator)
CNR-ISTI (technical coordinator)
University of Bielefeld
Spanish National Research Council (CSIC)
CERN
SURF
ICM – University of Warsaw
University of Minho
University of Gent Library
eIFL
Technical University of Denmark
Scientific Communities
Health (Life Sciences)
– EMBL-EBI
Environment
– World Data Center for Climate
– Consultative Group on International Agricultural Research (CGIAR)
Information & Communication Science
– Cognitive Interaction Technology (CITEC)
Socio-economic Sciences and Humanities
– Data Archiving and Networked Services (DANS)
Liaison Offices
31. Liaison Offices
Region 1 North (DTU)
– Denmark (Danish Technical University)
– Finland (University of Helsinki)
– Sweden (National Library of Sweden)
Region 2 South (UMINHO)
– Cyprus (University of Cyprus)
– Greece (National Documentation Center)
– Italy (CASPAR)
– Malta (Malta Council for Science & Technology)
– Portugal (University of Minho)
– Spain (Spanish Foundation for Science & Technology)
Region 3 East (eIFL)
– Czech Republic (Technical University of Ostrava)
– Bulgaria (Bulgarian Academy of Sciences)
– Estonia (University of Tartu)
– Hungary (HUNOR)
– Latvia (University of Latvia)
– Lithuania (Kaunas Technical University)
– Poland (ICM – University of Warsaw)
– Romania (Kosson)
– Slovakia (University Library of Bratislava)
– Slovenia (University of Ljubljana)
Region 4 West (UGENT)
– Austria (University of Vienna)
– Belgium (University of Gent)
– France (Couperin)
– Germany (University of Konstanz)
– Ireland (Trinity College)
– Netherlands (Utrecht University)
– UK (SHERPA)
32. European Helpdesk
National Open Access Liaison Offices (27 countries)
Provide OA “toolkits” for
– Researchers
– Institutions
Set up a 24/7 portal for depositing and searching OA publications
Liaise with
– Other European OA initiatives
– Publishers
– CRIS systems
33. Supporting repository
eInfrastructure
OpenAIRE portal built on D-NET
Access to scientific publications
– Search, browse
– Visualization tools
Deposition of articles
– Set up a repository for “orphan” (better, “homeless”)
researchers (CERN’s INVENIO)
– Harvest OA publications from existing repositories
Provide monitoring tools for
– Document/depositing statistics
– Usage statistics from repository infrastructure
Interoperation with other infrastructures
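Harvesting from existing repositories, as listed above, is commonly done over OAI-PMH. The slides do not name the protocol, so treating it as the harvesting mechanism here is an assumption. The sketch below builds a standard `ListRecords` request URL and extracts identifiers and Dublin Core titles from a response; the base URL, the set name, and the sample record are made up for illustration, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"   # OAI-PMH 2.0 namespace
DC_NS = "http://purl.org/dc/elements/1.1/"        # Dublin Core elements namespace


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting by set
    return base_url + "?" + urlencode(params)


def extract_records(xml_text):
    """Return (identifiers, titles) found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in root.iter("{%s}identifier" % OAI_NS)]
    titles = [el.text for el in root.iter("{%s}title" % DC_NS)]
    return identifiers, titles


# Hypothetical response, trimmed to the elements this sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example open access article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://repo.example.org/oai", set_spec="open_access")
identifiers, titles = extract_records(SAMPLE_RESPONSE)
```

A real harvester would also follow `resumptionToken` elements to page through large result sets; that step is omitted here.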
35. DRIVER-2-OpenAIRE Take Away
Changing the culture in research publications
Open accessibility to research results
Metrics of research output vs. funding
Technology + info + people infrastructures
37. Data in 4th Science Paradigm
Captured by instruments or generated by simulators
Processed by software
Stored in computer as Information/Knowledge
Analyzed while in scientist’s database / files
using data management and statistics
38. Overall Data Flow
Data acquisition, reduction,
analysis, visualization, storage
[Diagram components: data acquisition system, raw data, metadata, local storage, supercomputers, high-speed network, remote storage, remote users (some with local computing and storage).]
39. Pan-STARRS
PS1
– detect ‘killer asteroids’,
starting in November 2008
– Hawaii + JHU + Harvard +
Edinburgh + Max Planck Society
Data Volume
– >1 petabyte/year raw data
– Over 5B celestial objects
plus 250B detections in DB
– 100TB database
– PS4: 4 identical telescopes in 2012, generating 4PB/yr
40. Cosmological Simulations
Cosmological simulations have 10^9 particles and
produce over 30TB of data (Millennium)
Build up dark matter halos
Track merging history of halos
Use it to assign star formation history
Combination with spectral synthesis
Realistic distribution of galaxy types
Hard to analyze the data afterwards: need a DB
Optimize comparison to real data
41. Immersive Turbulence
Unique turbulence database
– Consecutive snapshots of a
1,024^3 simulation of turbulence:
now 30 Terabytes
– Soon 6K^3 and 300 Terabytes
– Hilbert-curve spatial index
and massive mining
– Treat it as an experiment, observe
the database!
– Throw test particles in from your laptop,
immerse yourself into the simulation,
like in the movie Twister
New paradigm for analyzing
HPC simulations!
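A Hilbert-curve spatial index, mentioned above, orders grid cells along a space-filling curve so that cells close in space tend to be close in storage. The sketch below uses the standard 2D coordinate-to-index conversion as a simplified illustration; the turbulence database indexes a 3D grid, and its actual implementation is not shown in the slides. The function name is hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n-by-n grid.

    n must be a power of two. 2D illustration only; a 3D index follows
    the same idea with one more coordinate.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)   # offset of the quadrant at this scale
        if ry == 0:                    # rotate/flip so the sub-curve lines up
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


n = 4
# Cells of a 4x4 grid sorted in Hilbert order: the order a store would use.
cells = sorted(((x, y) for x in range(n) for y in range(n)),
               key=lambda c: hilbert_index(n, c[0], c[1]))
```

Consecutive cells in `cells` are always grid neighbours, which is why a range scan over the index touches spatially compact regions.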
42. LHC and other HEP data
Very complex data model
Will generate 1GB/s, 10 PB/y
Data: raw → calibrated → skimmed → high-level objects → physics analyses → results
Duplicated for in-silico experiments to
interpret data
Dependence on grey literature: calibration
constants, algorithms ... oral tradition!
[Figure: height comparison. Balloon (30 km); CD stack with 1 year LHC data (~20 km); Concorde (15 km); Mt. Blanc (4.8 km)]
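The two LHC figures quoted on this slide, 1 GB/s and 10 PB/y, can be related with simple arithmetic. The sketch below assumes decimal units (1 PB = 10^6 GB) and a 365.25-day year; reading 1 GB/s as the rate while data is being recorded, rather than a year-round average, is an interpretation and is not stated in the slides.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # about 3.16e7 seconds
GB_PER_PB = 1e6                         # decimal units (assumption)


def yearly_volume_pb(rate_gb_per_s, live_seconds):
    """Data volume in PB for a given rate and recording time."""
    return rate_gb_per_s * live_seconds / GB_PER_PB


# If 1 GB/s were sustained all year, the volume would be about 31.6 PB.
continuous_pb = yearly_volume_pb(1.0, SECONDS_PER_YEAR)

# Recording time needed to accumulate 10 PB at 1 GB/s: 1e7 seconds.
live_seconds_for_10pb = 10 * GB_PER_PB / 1.0

# Fraction of the year spent recording under this reading: about 0.32.
duty_cycle = live_seconds_for_10pb / SECONDS_PER_YEAR
```

Under these assumptions the two numbers agree if data is recorded roughly one third of the time.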
43. Other Reference Applications
SDSS: 10TB total, 3TB in DB, soon 10TB, 6 years old
SkyQuery: fast spatial joins on largest astronomy catalogs / replicate
multi-TB datasets 20x for performance (1Bx1B in 3 mins)
OncoSpace: 350TB of radiation oncology images today, 1PB in two years,
to be analyzed on the fly
BaBar: Grows 1TB/day
2/3 simulation Information
1/3 observational Information
VLBA (NRAO): generates 1GB/s today
NCBI: “only ½ TB” but 2X each year
very rich dataset
Pixar: 100 TB/Movie
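The growth figures above can be turned into rough projections. This is a back-of-the-envelope sketch that assumes steady exponential or linear growth and decimal units (1 PB = 1000 TB); the helper names are hypothetical and the results are illustrative, not forecasts from the slides.

```python
import math


def years_to_reach(start_tb, target_tb, yearly_factor):
    """Years of steady exponential growth needed to go from start to target."""
    return math.log(target_tb / start_tb) / math.log(yearly_factor)


def implied_yearly_factor(start_tb, end_tb, years):
    """Constant yearly growth factor implied by two sizes a few years apart."""
    return (end_tb / start_tb) ** (1.0 / years)


# NCBI: 0.5 TB doubling each year reaches 1 PB in about 11 years.
ncbi_years_to_1pb = years_to_reach(0.5, 1000.0, 2.0)

# OncoSpace: 350 TB today, 1 PB in two years, i.e. about 1.7x per year.
oncospace_factor = implied_yearly_factor(350.0, 1000.0, 2)

# BaBar: 1 TB/day of linear growth is 365 TB in a (non-leap) year.
babar_tb_per_year = 1.0 * 365
```

The contrast is the point: a small dataset that doubles yearly overtakes a fixed daily increment within about a decade.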
45. D4Science:
Environmental Monitoring
European Space Agency
Global environmental issues: marine environment,
forest ecosystem, air quality
Sensor data analysis, integration and correlation of
data sources; reasoning, information/knowledge mgmt
Large amount of information ( 1TB), added-value
applications and services
Seamless workflow definition &
on-demand data processing
46. D4Science: Fishery Resources Mgmt
Fishery@FAO and WorldFish Center
Researchers spread worldwide, from many
disciplines (biologists, climatologists, GIS
experts, socio-economists, fishery managers,
etc.)
Continuous assessment for sustainable
development & use of the ecosystem of world’s
fisheries and aquaculture, e.g., species, aquatic
resources, hydrological changes
Extreme data diversity
47. Conclusions
Digital Libraries & Repositories: The new way
for scholarly communication (final product)
Data Infrastructures: The new libraries for all
scientific documentation (intermediate and
final products)
Huge technological and organizational
challenges
LONG way to go
FUN way to go