Scalable Services for Digital Preservation – Ross King
1. The Planets Interoperability Framework
Scalable Services for Digital Preservation
DPE, Planets and CASPAR Third Annual Conference:
Costs, Benefits and Motivations for Digital Preservation
30 October 2008
Ross King, Christian Sadilek, Rainer Schmidt
Austrian Research Centers GmbH – ARC
3. The Planets Interoperability Framework
Motivation
There are a number of functions that all (or nearly all) software
applications commonly need. These include functions such as
• Data persistence
• User management
• Authentication and Authorization
• Monitoring, Logging, and Notification
The Interoperability Framework (IF) software components
provide these commonly required functions.
4. The Planets Interoperability Framework
Defines a Service-Oriented Architecture for Digital Preservation
A set of Services, Interfaces, and a common Data Model
Implements Common Services
Authentication and Authorization, Monitoring, Logging, Notification, …
Service Registration and Lookup
Provides APIs for Applications that use Planets (see the sketch below)
Testbed Experiments, Executing Preservation Plans
Provides a Workflow Enactment Service and Engine
Component-based, with XML serialization
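To make the service-oriented style concrete, the sketch below shows what a minimal JAX-WS migration service interface could look like. This is a hypothetical illustration: the interface name, method, and parameters are invented here and do not reproduce the actual Planets IF API.

    import javax.jws.WebMethod;
    import javax.jws.WebService;

    // Hypothetical sketch of a Planets-style preservation service
    // exposed as a SOAP web service. Names and signatures are
    // illustrative only, not the actual Planets IF API.
    @WebService
    public interface MigrationService {

        // Migrate a digital object from one format to another.
        // Formats would be identified by registry URIs.
        @WebMethod
        byte[] migrate(byte[] digitalObject,
                       String sourceFormatUri,
                       String targetFormatUri);
    }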
5. The Problem of Scalability
Planets is a preservation architecture based on Web Services
Supports interoperability and a distributed environment
Sufficient for controlled experiments (Testbed)
Not sufficient for handling a production environment
Massive, uncontrolled user requests
Mass migration of hundreds of TBytes of data
Content Holders are faced with losing vast amounts of data
Sufficient computational resources in-house?
There is a clear demand for incorporating Grid or Cloud Technology
6. Integrating Virtual Clusters and Clouds
Basic Idea: Extending Planets SOA with Grid Services
The Planets IF Job Submission Services
Allow Job Submission to a PC cluster (e.g. Hadoop, Condor)
Grid approach/standards (SOAP, HPC-BP, JSDL; see the job description sketch below)
Cluster nodes are instantiated from specific system images
Most Preservation Tools are 3rd party applications
Software needs to be preinstalled on the cluster nodes
Cluster and JSS can be instantiated in-house (e.g. in a PC lab) or
on top of (leased) cloud resources (e.g. AWS EC2).
Computation can be moved to the data, or vice versa
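As an illustration of the Grid-standards approach, the sketch below constructs a minimal JSDL 1.0 job description of the kind a JSS client might submit. The tool path and arguments are hypothetical; only the element structure and namespaces follow the OGF JSDL specification.

    // Minimal sketch: building a JSDL 1.0 job description for a
    // migration job. The executable and arguments are hypothetical.
    public class JsdlExample {
        public static void main(String[] args) {
            String jsdl =
                "<jsdl:JobDefinition xmlns:jsdl='http://schemas.ggf.org/jsdl/2005/11/jsdl'\n" +
                "                    xmlns:posix='http://schemas.ggf.org/jsdl/2005/11/jsdl-posix'>\n" +
                "  <jsdl:JobDescription>\n" +
                "    <jsdl:Application>\n" +
                "      <posix:POSIXApplication>\n" +
                "        <posix:Executable>/usr/bin/convert</posix:Executable>\n" +
                "        <posix:Argument>input.tif</posix:Argument>\n" +
                "        <posix:Argument>output.jp2</posix:Argument>\n" +
                "      </posix:POSIXApplication>\n" +
                "    </jsdl:Application>\n" +
                "  </jsdl:JobDescription>\n" +
                "</jsdl:JobDefinition>\n";
            // In a real setup this document would be sent to the Job
            // Submission Service over SOAP (e.g. via an HPC Basic
            // Profile compliant endpoint).
            System.out.println(jsdl);
        }
    }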
7. Integration – Planets Tiered Architecture
[Architecture diagram. Experimental environment (qualitative): the Workbench for Testbed Experiments maintains workflow definitions and performs lookups against the Tool and Service Registry, which Preservation Planning and Workflow Generation reference alongside the Planets Format Registry Services; Execution Services wrap 3rd-party tools. Production environment: a Web Portal / Browser over the Metadata Store and common Data Model; the Execution Engine dispatches work to Grid/Cloud Execution Services. Web/Grid clients access Planets Services running in the Planets Service Context on top of Resources.]
10. Experimental Setup
[Setup diagram: a JSDL job description file is submitted via the JSS to a virtual cluster (Apache Hadoop) whose virtual nodes (Xen) run on cloud infrastructure (EC2); a Data Transfer Service moves the raw data to and from the storage infrastructure (S3).]
12. Experimental Results 2 – Scaling #Nodes
[Plot: migration time [min] versus number of nodes (0–150) for a workload of 1000 x 0.07 MB objects, series EC2 and SLE. Recovered data points, with speedup s = t_local / t_n and efficiency e = s / n:]

nodes n      t [min]    speedup s    efficiency e
1 (local)    26.00      1.00         –
1 (EC2)      36.00      0.72         72%
5            8.00       3.25         65%
10           4.50       5.8          58%
50           1.68       15.5         31%
100          1.03       25.2         25%
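The speedup and efficiency values in the table follow the usual definitions, shown here with the 10-node run as a worked example:

    \[ s_n = \frac{t_{\mathrm{local}}}{t_n}, \qquad e_n = \frac{s_n}{n} \]
    \[ s_{10} = \frac{26}{4.5} \approx 5.8, \qquad e_{10} = \frac{5.8}{10} = 58\% \]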
13. Conclusions
Preservation systems will need to employ Grid/Cloud resources
Therefore there is a need to bridge communities in the areas of digital libraries
and e-science.
Cloud and virtual infrastructures provide a powerful solution for obtaining
on-demand access to computational resources.
The Planets IF Job Submission Service provides a first step
Submission to a virtual cluster of preservation nodes using Grid protocols.
Performance scales roughly with the number of nodes, accounting for expected
overheads
Many open issues remain! Security, reliability, standardization, legal aspects...
14. Thank you for your attention!
Planets Project
http://www.planets-project.eu
Contacts
Ross King ross.king@arcs.ac.at
Rainer Schmidt rainer.schmidt@arcs.ac.at
16. Map-Reduce for Migrating Digital Objects
Map-Reduce provides a framework and programming model for processing
large data sets (sorting, searching, indexing) on multiple nodes.
Automated decomposition of the input (split)
Mapping to intermediate key/value pairs (map), optionally combined locally (combine)
Merging of the output (reduce)
Provides an implementation for data-parallel, I/O-intensive problems
Example: converting a digital object (e.g. a website, folder, or archive), as sketched below
Decompose into atomic pieces (e.g. file, image, movie)
On each node, convert each piece to the target format
Merge the pieces and create a new data object
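A minimal, framework-free sketch of this pattern follows. The pieces, the identity "conversion", and the in-process parallel streams are all stand-ins: in the Planets setup the map step would run as Hadoop tasks invoking a preinstalled 3rd-party tool on each node.

    import java.util.List;
    import java.util.stream.Collectors;

    // Minimal sketch of the split/map/reduce pattern for migrating a
    // composite digital object. Parallel streams stand in for cluster
    // nodes; the conversion is a placeholder for a 3rd-party tool.
    public class MigrationMapReduce {

        // map: convert one atomic piece to the target format
        // (placeholder: identity conversion for illustration).
        static byte[] convertPiece(byte[] piece) {
            return piece;
        }

        public static void main(String[] args) {
            // split: the object decomposed into atomic pieces
            List<byte[]> pieces = List.of("page1".getBytes(), "page2".getBytes());

            // map: convert each piece in parallel
            List<byte[]> converted = pieces.parallelStream()
                    .map(MigrationMapReduce::convertPiece)
                    .collect(Collectors.toList());

            // reduce: merge converted pieces into a new data object
            System.out.println("Merged " + converted.size() + " converted pieces");
        }
    }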