Dr. Ross King, AIT Austrian Institute of Technology GmbH, gave an invited talk about the FP7 project SCAPE at the eSciDoc Days in Berlin, October 27, 2011, https://www.escidoc.org/JSPWiki/en/ESciDocDays.
2. SCAPE
Digital Preservation
• For the first time, the rate of increase of information creation is beginning to exceed the rate of increase in storage capacity.
• This massive volume of digital material raises a number of issues:
• What is worth preserving?
• How to preserve so much?
• How to access preserved data?
• How to create incentives to preserve?
http://arstechnica.com/business/consumerization-of-it/2011/09/information-explosion-how-rapidly-expanding-storage-spurs-innovation.ars
07.11.2011
3. SCAPE
Digital Preservation
• Standards, best practices, and technologies used to ensure access to digital information over time
• How long?
“Digital documents last forever – or five years, whichever comes first.”
http://www.clir.org/pubs/reports/rothenberg/introduction.html
• Generally we mean decades or centuries
4. SCAPE
SCAPE – what is it about?
• Planning and managing computing-intensive (digital) preservation processes such as the large-scale ingestion or migration of large (multi-Terabyte) data sets
SCAPE is a follow-up to the highly successful FP6 IP Planets.
5. SCAPE
SCAPE Project Data
• Project instrument: FP7 Integrated Project
• Call: 6
• Objective ICT-2009.4.1: Digital Libraries and Digital Preservation
• Target outcome (a): Scalable systems and services for preserving digital content
• Duration: 42 months
• February 2011 – July 2014
• Budget: 11.3 Million Euro
• Funded: 8.6 Million Euro
6. SCAPE
SCAPE Consortium
Number           Partner name                               Partner short name  Country
1 (coordinator)  AIT Austrian Institute of Technology GmbH  AIT                 AT
2                British Library                            BL                  UK
3                Internet Memory Foundation                 IMF                 NL
4                Ex Libris Ltd                              EXL                 IL
5                Fachinformationszentrum Karlsruhe          FIZ                 DE
6                Koninklijke Bibliotheek                    KB                  NL
7                KEEP Solutions                             KEEPS               PT
8                Microsoft Research                         MSR                 UK
9                Österreichische Nationalbibliothek         ONB                 AT
10               Open Planets Foundation                    OPF                 UK
11               Statsbiblioteket Aarhus                    SB                  DK
12               Science and Technology Facilities Council  STFC                UK
13               Technische Universität Berlin              TUB                 DE
14               Technische Universität Wien                TUW                 AT
15               University of Manchester                   UNIMAN              UK
16               Pierre & Marie Curie Université Paris 6    UPMC                FR
7. SCAPE
SCAPE Project Overview
SCAPE will enhance the state of the art in digital preservation in three ways:
• Infrastructure and tools for scalable preservation actions
• A framework for automated, quality-assured preservation workflows
• Integration of these components with policy-based automated preservation planning and watch
SCAPE results will be validated in three large-scale testbeds:
• Digital Repositories
• Web Content
• Research Data Sets
The SCAPE Consortium brings together a broad spectrum of expertise from
• Memory institutions
• Data centres
• Research labs
• Universities
• Industrial firms
[Figure: SCAPE project structure diagram. Labels: Takeup (Stakeholders, Communities, Dissemination, Training Activities, Sustainability); Testbeds (Corpora, Integration, Benchmarking, Validation); Cross-project Activities (Project Management, Technical Coordination, Research Roadmap); Platform (Automation, Workflows, Parallelization, Virtualization); Planning and Watch (Institutional Policies, Technical Watch, Automated Planning); Preservation Components (Quality Assurance, Scalable Components, Automation-ready Tools)]
8. SCAPE
Selected SCAPE Testbed Scenarios
• Characterise large video files
• The master MPEG2 files are so large that it is difficult to apply JHOVE, and the detail it provides is insufficient. A detailed characterisation of the MPEG2 streams is needed in order to identify technical dependencies for extracting from or rendering the MPEG2 stream. This would enable preservation risks related to current access services to be monitored and action taken as necessary to ensure continued access and preservation.
• Carry out large scale migrations
• Migrating from one format to another introduces the possibility of damaging the content or failing to capture significant properties of the original in the resulting destination format.
• Specific requirements include:
• Solution tools that operate reliably at scale (80TB, 2 million pages)
• Automated QA, ideally with no manual intervention on a file by file basis
• QA performed by a process independent of the migration process
• QA demonstrates strong evidence of significant properties being captured in the destination format
• Quality assurance in web harvesting
• For large scale crawls, automation of the quality control processes is a necessary requirement. Currently, this process relies on random sampling and very basic quantitative checks.
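The migration QA requirements above (automated, independent of the migration step, evidence that significant properties survive) can be illustrated with a minimal Python sketch. It assumes that some separate characterisation step has already produced a dictionary of properties for each source file and each migrated file; the function names and the property names used in the usage note are hypothetical and are not taken from SCAPE tooling.

```python
from typing import Any, Dict, Iterable, List, Tuple

def compare_properties(source: Dict[str, Any],
                       migrated: Dict[str, Any],
                       significant: List[str]) -> List[str]:
    """Return one message per significant property that was lost or changed."""
    failures = []
    for name in significant:
        if name not in migrated:
            failures.append(f"{name}: missing in destination")
        elif source.get(name) != migrated[name]:
            failures.append(f"{name}: {source.get(name)!r} != {migrated[name]!r}")
    return failures

def qa_batch(pairs: Iterable[Tuple[str, Dict[str, Any], Dict[str, Any]]],
             significant: List[str]) -> Dict[str, List[str]]:
    """Check many (file_id, source_props, migrated_props) tuples.

    Only files with at least one failure appear in the report, so an
    empty report means no file needs manual inspection.
    """
    report = {}
    for file_id, src, dst in pairs:
        failures = compare_properties(src, dst, significant)
        if failures:
            report[file_id] = failures
    return report
```

For example, calling `qa_batch` with the hypothetical significant properties `["width", "height"]` flags any file whose migrated copy lacks one of them or reports a different value. Because the check only reads extracted properties, it can run as a process separate from the migration itself.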
9. SCAPE
Selected SCAPE Challenges
• Bridging the gap between test workflows and scalable workflows
• Applying Map/Reduce to binary data
• Locality of data
• Bring the data to the computation, or bring the computation to the data?
• Repository Integration
• Repository Consistency
• Scalable Ingest
• Preservation Planning
• How to scale?
• How to automate?
• Research data sets
• How to preserve contextual information?
10. SCAPE
SCAPE Solutions
• SCAPE Platform
• HADOOP, Stratosphere
• Virtualized cluster
• Repository integration
• HBASE, HDFS - Fedora
• Three levels of parallelization
• Distribution of files
• Splitting binary files
• Parallelisation of algorithms
• Mapping Taverna to HADOOP
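The "splitting binary files" level of parallelization can be sketched in a few lines of Python. This is only an illustration of the map/reduce pattern on raw bytes, not SCAPE's Hadoop implementation: it uses a local thread pool in place of a cluster, and the function names and chunk size are assumptions chosen for the example. A byte histogram is used because it can be computed per chunk and merged without loss; real file formats generally need format-aware split points.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List

def split_bytes(data: bytes, chunk_size: int) -> List[bytes]:
    """Split binary data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def map_chunk(chunk: bytes) -> Counter:
    """Map step: count how often each byte value occurs in one chunk."""
    return Counter(chunk)

def byte_histogram(data: bytes, chunk_size: int = 4, workers: int = 2) -> Counter:
    """Split the data, map over the chunks in parallel, then reduce the partial counts."""
    chunks = split_bytes(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce step: merge the per-chunk histograms into one.
    return reduce(lambda a, b: a + b, partials, Counter())
```

The result of `byte_histogram(data)` equals `Counter(data)` computed in one pass, which is what makes this particular analysis safe to split at arbitrary byte offsets.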
11. SCAPE
SCAPE Solutions
• Automated Planning and Watch
• Building on the Planets PLATO tool
• Automated watch based on
• Results Evaluation Framework (REF) database
• Monitoring trends in web harvests
• Automated planning based on semantically formalized policies
• Automated Quality Assurance
• QA in web harvesting through automated comparison of rendered pages – combined structural and image analysis
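The image-analysis half of the rendered-page comparison can be illustrated with a minimal Python sketch. It assumes two renderings of the same page are already available as equally sized grayscale pixel grids; the mean-absolute-difference metric, the function names, and the threshold are assumptions for illustration and are not the method used in SCAPE. The structural comparison is not covered here.

```python
from typing import List

Image = List[List[int]]  # grayscale pixel rows, values 0..255

def mean_abs_diff(a: Image, b: Image) -> float:
    """Average absolute per-pixel difference between two same-sized images."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count if count else 0.0

def pages_match(a: Image, b: Image, threshold: float = 5.0) -> bool:
    """Treat two renderings as equivalent if their difference is within the threshold."""
    return mean_abs_diff(a, b) <= threshold
```

In an automated crawl QA step, pages whose renderings fall outside the threshold would be the ones queued for closer inspection, rather than relying on random sampling alone.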
12. SCAPE
SCAPE Achievements
• Public Website
• http://www.scape-project.eu/
• Development Infrastructure
• Hosted by the Open Planets Foundation and GitHub
• Development Wiki
• http://wiki.opf-labs.org/display/SP/Home
• Deliverables
• First Deliverables available for download
• Publications
• 13 in the first nine months, including 6 at iPres next week
• Report: comparative analysis of identification tools
• Platform
• 10-node, 20 TB experimental cluster hosted by AIT
13. SCAPE
SCAPE Contact Information
• http://www.scape-project.eu/
• office@list.scape-project.eu
• Dr. Ross King
AIT Austrian Institute of Technology GmbH
Donau-City-Strasse 1
A-1220 Wien