Grid Middleware – Principles, Practice and Potential

(Or: What do Wombats and Grid have in common?)
UK OGSA Evaluation Project
(UCL, Imperial, Newcastle,
Edinburgh)
UCL Project Members: Paul Brebner,
Wolfgang Emmerich
University College London
P.Brebner@cs.ucl.ac.uk
What do Wombats and Grid have in common?
A They are secretive and misunderstood creatures?
B They live in complex underground burrows?
C You wouldn’t want to meet one in a confined space in the dark?
D All of the above?

Grid – Abstract
• Principles
– What are the principles of Grid middleware?
• Practice (and pitfalls)
– How easy is it to use in practice? What are the pitfalls?
• Potential
– What potential does Grid middleware have to
• (1) provide insight into different ways of using Service
Oriented Architectures, and
• (2) support automatic deployment and debugging?

Grid – Principles
• Principles
• Practice
• Potential

Grid Principles – cluster, enterprise, internet

Grid Principles – Grid vs Enterprise
• What’s the difference between Grid and
Enterprise? (Typical generalisations…)
• Grid
– Crosses firewalls and organisational boundaries
– Resource and code focussed
• scientist has some code, and wants to execute it on as many
resources as possible, to solve ever bigger problems
– Developer, deployer and user may be the same person

Grid Principles – Grid
[Figure: the user's new code and data are run on resources owned by other organisations. User wants: infinite resources, scalability, monitoring. Organisations want: fair sharing, ease of maintenance?]

• Enterprise
– Code developed, deployed and maintained by
enterprises behind firewall
– Exposed as web services for intra- and inter-organisational interoperability
– Users don’t develop or deploy code

Grid Principles – Enterprise
[Figure: a user sends a query or transaction to services written by a service developer, and receives a response. User wants: response time, availability. Enterprise wants: interoperability, scalability, security.]

• Grid (User view)
– I have some code, make it run fast for me.
– Concerns: Finding resources, platform portability,
deploying, running and monitoring “jobs”, security,
data management.
• Enterprise (Enterprise owner view)
– I have some business logic exposed as Web service –
ensure internal and external users get required QoS.
– Concerns: QoS, interoperability, transactions,
performance/scalability, security, multiple applications
sharing services.

Grid Principles – Just another component model?
• In spite of these differences, they have something
in common
• OGSI has J2EE origins
– “What does it mean to ship a J2EE-based Grid environment,
something that can deliver OGSI-compliant services? It means that
you provide a server programming environment that makes it very
easy for service writers to implement services that conform to the
set of standards that are OGSI.”
– Containers, lifecycle management
– Goal: Easy to write services and interoperability at
interface level

Grid Principles – OGSA vs OGSI

Grid Principles – OGSA without OGSI

Grid Principles – OGSA and ?

Grid Principles - Architecture
J2EE – n-tiered architecture

OGSA – semi-layered, or “sum of services”

GT3 – (core) server side components

Grid Principles – OGSA Services
• Infrastructure services
• Execution Management services
• Data Services
• Resource Management Services
• Security Services
• Self-Management Services
• Information Services

Grid Principles – J2EE cf OGSI
Feature | J2EE | OGSI
Containers | Multiple (4) | One
Components | Multiple | One (+inheritance)
Roles | Explicit | Implicit
Implementations | Many | 1-2 (sort of)
Component purpose | Presentation/business logic/persistence | High-level grid services

Grid Principles - State
• Treatment of stateful instances?
– J2EE has stateful session and entity beans
• CMP Entity beans: lifecycle management
(passivation/activation/pooling), caching, and automatic
persistence support
• Typically accessed via Stateless Session Beans or MDBs
– GT3 has stateful instances (created by Factories)
• Accessed via SOAP and handles
• No automatic passivation/activation or persistence
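The GT3 pattern above (a Factory creates a stateful instance, and clients reach it through a handle) can be sketched in-process as follows. This is a minimal Python model under our own assumptions: `Factory`, `create_service`, `resolve` and the handle URL format are hypothetical illustrations, not the GT3 API, and no SOAP is involved. As described for GT3, nothing is passivated or persisted automatically; an instance lives in memory until it is explicitly destroyed.

```python
import itertools
from dataclasses import dataclass


@dataclass
class CounterInstance:
    """Stand-in for a stateful Grid service instance (hypothetical)."""
    total: int = 0

    def add(self, n: int) -> int:
        self.total += n
        return self.total


class Factory:
    """Hypothetical factory: creates instances and hands out handles to them."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url
        self._ids = itertools.count(1)
        # All instances stay in memory: no passivation, activation or persistence.
        self._instances: dict[str, CounterInstance] = {}

    def create_service(self) -> str:
        """Create a new instance and return a URL-style handle for it."""
        handle = f"{self.base_url}/CounterService/instance-{next(self._ids)}"
        self._instances[handle] = CounterInstance()
        return handle

    def resolve(self, handle: str) -> CounterInstance:
        """Map a handle to its live instance (KeyError once destroyed)."""
        return self._instances[handle]

    def destroy(self, handle: str) -> None:
        """Explicit lifecycle management: the client must clean up."""
        del self._instances[handle]

    def live_instances(self) -> int:
        return len(self._instances)


if __name__ == "__main__":
    factory = Factory("http://node1.example.org:8080/ogsa/services")
    handle = factory.create_service()
    factory.resolve(handle).add(5)
    print(handle, factory.resolve(handle).add(2))
```

Because every instance holds memory until `destroy` is called, a client that never cleans up leaks container resources, which is one reason automatic passivation matters.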

Grid Principles - Roles
• J2EE
– Component developer
– Application assembler
– Deployer
– System Administrator
• Not to mention product and tool providers, system architects,
database designers and administrators, etc.
• Many products provide distributed/remote tool
support

Grid Principles - Roles
• Grid?
– Increasing number of roles in practice
– But, no explicit definition of Grid roles, and
– Poor tool support for roles that span organisations

Grid Principles - Deployment
• Treatment of deployment?
– J2EE has explicit deployment role, and
typically good tool support for remote
deployment
– Support for product independent deployment
(JSR-88 since J2EE 1.4)
– GT3 has built-in support for remote
“code/executable” deployment (staging), but
none for remote “service” deployment

Grid Principles – Confusion/alternatives
• How is Globus intended to be used?
– 1: Science as first-order services
• Middleware for building and hosting Grid
Applications, by exposing science code as Grid
services.
– 2: High-level grid services
• Middleware for building a set of high level Grid
services, composed to provide new Grid
functionality. Science isn’t first-order service, but
executed and managed by Grid services.

Grid Principles – Science services or Grid services
[Figure: (1) Science services: science code (E = mc², D = A + 2B + C²) is exposed to the client as services that are directly callable, described and discoverable. (2) Grid services: the client uses generic Data and Execution services, which run the science code; the science is indirectly callable, and not directly described or discoverable.]
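The difference between the two approaches shows up in what a client can discover. In the sketch below (all names hypothetical; the formulas come from the figure), approach 1 registers each piece of science code as a named service, so a directory lists `energy` and `formula_d`. In approach 2 the only visible operations are generic `stage` and `submit` calls on an execution service, and the science code is an opaque payload.

```python
from typing import Callable, Dict, List, Sequence

# Approach 1: science as first-order services. Each formula is a named,
# directly callable entry, so a registry of services describes the science.
SCIENCE_SERVICES: Dict[str, Callable[..., float]] = {
    "energy": lambda m, c: m * c ** 2,                 # E = mc^2
    "formula_d": lambda a, b, c: a + 2 * b + c ** 2,   # D = A + 2B + C^2
}


def discover_science_services() -> List[str]:
    return sorted(SCIENCE_SERVICES)


def call_science_service(name: str, *args: float) -> float:
    return SCIENCE_SERVICES[name](*args)


# Approach 2: high-level Grid services. The client stages code to a generic
# execution service and submits jobs; the interface never names the science.
class ExecutionService:
    def __init__(self) -> None:
        self._staged: Dict[str, Callable[..., float]] = {}

    def stage(self, executable_id: str, code: Callable[..., float]) -> None:
        self._staged[executable_id] = code

    def submit(self, executable_id: str, args: Sequence[float]) -> float:
        return self._staged[executable_id](*args)


def discover_execution_operations() -> List[str]:
    # Only the generic operations are visible to a directory service.
    return sorted(op for op in vars(ExecutionService) if not op.startswith("_"))
```

Both paths compute the same result; what changes is whether the science itself is described and discoverable, or hidden behind generic execution operations.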

Grid – Practice
• Principles
• Practice
• Potential

Grid Practice – What to evaluate?
• OGSA > OGSI > GT3.2 – Grid SOA exemplar
– Initially evaluate installation, configuration, and
security
– Then performance and scalability, deployment,
architectural choices, etc.
• What’s the point? What are we trying to learn?
– What are some of the s/w engineering and architectural
issues surrounding Grid infrastructure? Across
organisational boundaries?
– What improvements are required before it is suitable
for production environments?

Grid Practice – “Realistic” test-bed
• Heterogeneous platforms
– Linux, Solaris, Windows
• Cross-organisational
– Four nodes
– Independently administered
– Firewalls and access restrictions
• Security
– UK e-Science CA

Grid Practice – Incremental
• Start with Core Package (Just container and basic
services – e.g. container registry service)
• Add Security
• Then try “All Services”
• Simple enough – in theory
– Relationship between packages not well understood
– Java and non-Java components
– Poor integration between some parts

Grid Practice – single node
[Figure: the single-node stack, built up in stages: OS/HW, then GT3 Install, then Install, Configure, Deploy and Run.]

Grid Practice – Multiple sites
[Figure: the same stack replicated across multiple sites, each running GT3; the concerns build up from installation to Interoperate, then Secure, then Manage.]

Grid Practice – What we found
• Port number management (conflicts, discovery)
• Host access (requirements and site policies)
• Remote visibility of installation, container,
services (what, configuration, version)
• Installation by System Administrators (role
division, extra effort)
• Tomcat or Test container (different configuration)
• Linux is the only well supported platform
• Exponential increase in testing complexity as
number of nodes increases.
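One way to see why testing effort grows so quickly: pairwise interoperability tests grow quadratically with the number of nodes, n(n-1)/2, while the number of joint configurations grows exponentially, k^n for k options per node. The sketch below counts both for the four test-bed sites, assuming (our illustrative choice) that each node can run either Tomcat or the test container, with or without security.

```python
from itertools import combinations, product

SITES = ["UCL", "Imperial", "Newcastle", "Edinburgh"]

# Illustrative per-node options (an assumption): container x security setting.
NODE_OPTIONS = list(product(["tomcat", "test-container"], ["secure", "insecure"]))


def interop_pairs(nodes):
    """Unordered node pairs whose interoperation should be tested: n(n-1)/2."""
    return list(combinations(nodes, 2))


def joint_configurations(nodes, options=NODE_OPTIONS):
    """Every combination of per-node options across all nodes: k**n."""
    return list(product(options, repeat=len(nodes)))


if __name__ == "__main__":
    for n in range(2, len(SITES) + 1):
        nodes = SITES[:n]
        print(n, len(interop_pairs(nodes)), len(joint_configurations(nodes)))
```

With four nodes and four options per node, there are 6 node pairs but 4^4 = 256 joint configurations.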

Grid Practice – Security
• Grid Security Infrastructure (GSI)
– X.509 certificates
– Mutual authentication (client/host)
– Proxy certificates (delegation and single sign-on)
• Authentication (Who are you?)
– Secure Message (Basic)
– Secure Conversation
• Signing or Encryption (prevent unauthorised altering/reading)
• Authorisation (Who is authorised to use container,
factory, service, method)
– Gridmap file (Access Control List – maps Grid identities to local
identities)
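A gridmap file holds one entry per line: a quoted certificate distinguished name (DN) followed by one or more local account names, comma-separated. The reader below is a simplified sketch of that mapping step (it ignores escaping and other edge cases that the Globus tools handle), and the DNs and account names in `SAMPLE` are made up.

```python
import re
from typing import Dict, List, Optional

# "<certificate DN>" account[,account...]
_ENTRY = re.compile(r'^"(?P<dn>[^"]+)"\s+(?P<accounts>\S+)$')

SAMPLE = """
# Hypothetical entries
"/C=UK/O=eScience/OU=ExampleOrg/CN=alice example" alice
"/C=UK/O=eScience/OU=ExampleOrg/CN=bob example" bob,gridshared
"""


def parse_gridmap(text: str) -> Dict[str, List[str]]:
    """Map each certificate DN to its list of local accounts."""
    mapping: Dict[str, List[str]] = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _ENTRY.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: cannot parse {raw!r}")
        mapping[match.group("dn")] = match.group("accounts").split(",")
    return mapping


def local_account(mapping: Dict[str, List[str]], client_dn: str) -> Optional[str]:
    """Default local identity for an authenticated client, or None if unmapped."""
    accounts = mapping.get(client_dn)
    return accounts[0] if accounts else None
```

An unmapped DN yields `None`: the client may have authenticated successfully, but is not authorised on this container.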

• In theory just have to
– obtain (and update) host, client, and CA certificates
– convert
– install
– configure (server, client side, container, services, etc)
– generate (and update) proxies.
• However, parts of “All Services” package also
needed.

• Interactions between security for multiple
installations
• Essential to test non-secure interoperability first
• Windows client-side security
• Testing and viewing security configuration
• Debugging secure calls
• Client side security is programmatic
• Security management scalability
– Construction and maintenance of user accounts and
grid-map file entries.

• Interactions between security for multiple
installations
– For testing may want
• multiple versions, or duplicates (with different
configurations) of same versions.
• One container with no security, and another
container with security
– May want test/production environments

• Essential to test non-secure interoperability
first
– Trying to test interoperability and security
simultaneously wasn’t fun

• Windows client-side security
– Not obvious exactly what parts of Globus are
needed for client side code with security (no
“client side + security” package).

• Testing and viewing security configuration
– View/edit and check security configuration for
containers and services
– Confusion about hierarchical security settings
• Virtual Organisations, clusters, servers, containers,
factories, services, methods, and instances.
– Remotely
– Validate security deployment before run-time

• Debugging secure calls (or any stateful service)
– Proxy interceptor approach (e.g. TCPMON) won’t
work with stateful services
• Because the grid handle returned to the client contains the port number of
the instance, not that of the proxy
– But proxies are an important design pattern for SOAs…
– GT4/WS-RF may be different
• Handle resolvers, WS-Addressing and WS-RenewableReferences
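The failure mode above can be illustrated with handle URLs. A TCPMON-style interceptor only sees calls addressed to it, so once the container returns a handle naming its own host and port, the client's later calls to that instance bypass the proxy. A proxy could rewrite handles in responses and remember where to forward them, as in this hypothetical sketch (the class, URLs and ports are illustrative). It assumes the proxy can modify responses at all: with message-level signing, rewriting a handle inside a signed message would likely invalidate the signature.

```python
from typing import Dict
from urllib.parse import urlsplit, urlunsplit


class HandleRewritingProxy:
    """Hypothetical proxy that keeps stateful-instance traffic flowing through itself."""

    def __init__(self, host: str, port: int) -> None:
        self.netloc = f"{host}:{port}"
        self.routes: Dict[str, str] = {}  # instance path -> real container host:port

    def outbound_handle(self, handle: str) -> str:
        """Rewrite a handle returned by the container so it points at the proxy."""
        parts = urlsplit(handle)
        self.routes[parts.path] = parts.netloc
        return urlunsplit(parts._replace(netloc=self.netloc))

    def forward_target(self, request_url: str) -> str:
        """Recover the real instance URL for a call the client sent to the proxy."""
        parts = urlsplit(request_url)
        return urlunsplit(parts._replace(netloc=self.routes[parts.path]))
```

The client only ever sees proxy URLs, and the proxy's routing table restores the container's host and port when forwarding.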

• Client side security is programmatic
– Client side code modifications required to call
services/methods with required protocols
– Should be declarative
– Sensitive to server side security credentials
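A declarative alternative would keep security requirements in configuration rather than in client code. The sketch below is an assumption, not something GT3 provides: an ordered policy table of service and method patterns that a client stub would consult for each call instead of hard-coding the protocol. The mechanism and protection labels echo the GSI options listed earlier but are only strings here.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Tuple


@dataclass(frozen=True)
class Requirement:
    mechanism: str   # e.g. "none", "secure-message", "secure-conversation"
    protection: str  # e.g. "none", "signature", "encryption"


# Hypothetical policy: the first matching (service pattern, method pattern) wins.
POLICY: List[Tuple[str, str, Requirement]] = [
    ("*/CounterService", "add", Requirement("secure-conversation", "encryption")),
    ("*/CounterService", "*", Requirement("secure-message", "signature")),
    ("*", "*", Requirement("none", "none")),
]


def requirement_for(service_url: str, method: str, policy=POLICY) -> Requirement:
    """Look up the security requirement for one call from the declarative table."""
    for service_pattern, method_pattern, requirement in policy:
        if fnmatchcase(service_url, service_pattern) and fnmatchcase(method, method_pattern):
            return requirement
    raise LookupError(f"no security policy for {service_url}#{method}")
```

Changing what a method requires then means editing the table rather than recompiling every client.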

• Security management scalability
– Construction and maintenance of user accounts and grid-map file
entries.
– For each server, each user needs an account, and an entry in the
container gridmap file (mapping client certificate to account)
– May also need service specific gridmap files
– Not scalable for large numbers of users, servers, services.
– Revocation of certificates, host certificate expiry problem
• Alternatives?
– Tool support
– Role-based authorisation
– Shared accounts or certificates (probably evil)
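A rough count shows why per-user setup does not scale. Under the simple model below (our assumption: every user needs access to every server), each user needs an account and a container gridmap entry on each server, plus an entry in each service-specific gridmap. A hypothetical role-based scheme instead assigns each user to a role once and maps each role to an account on each server (ignoring service-level policy).

```python
def per_user_entries(users: int, servers: int, service_gridmaps_per_server: int = 0) -> int:
    """Accounts + container gridmap entries + service-specific gridmap entries."""
    accounts = users * servers
    container_entries = users * servers
    service_entries = users * servers * service_gridmaps_per_server
    return accounts + container_entries + service_entries


def role_based_entries(users: int, roles: int, servers: int) -> int:
    """One user-to-role assignment per user, plus one role account per server."""
    return users + roles * servers


if __name__ == "__main__":
    print(per_user_entries(100, 4, service_gridmaps_per_server=3))
    print(role_based_entries(100, roles=3, servers=4))
```

For 100 users and 4 servers with 3 service gridmaps each, the per-user model needs 2000 entries; three roles need 112.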

Grid Practice - Performance
• First approach (initial results)
– Scientific benchmark (SciMark2.0) modified to
measure throughput, and invoked as a Stateful Grid
Service
– Metric is Calls Per Minute (CPM), where one call is one unit of work.
– No large-scale data movement, just SOAP parameters
and result, and computation/memory load.
• Good performance and scalability
– Minimal overhead cf. standalone benchmark
– Security has minimal overhead
– Sustained throughput of 4200 “jobs” an hour
– Problem with client side timeouts as response times
increase
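The headline numbers are consistent: 4200 jobs an hour is 4200 / 60 = 70 calls per minute. The timeouts also follow from basic queueing reasoning. For a closed system of N client threads, the interactive response time law gives R = N / X - Z (X the throughput, Z the think time), so once X has saturated, each extra thread adds directly to response time. The sketch below assumes zero think time and, for illustration only, treats the sustained 70 CPM as the saturated throughput.

```python
def cpm_from_per_hour(jobs_per_hour: float) -> float:
    """Convert jobs per hour to calls per minute (CPM)."""
    return jobs_per_hour / 60.0


def response_time_s(threads: int, throughput_cpm: float, think_time_s: float = 0.0) -> float:
    """Interactive response time law, R = N / X - Z, with X converted to calls per second."""
    return threads * 60.0 / throughput_cpm - think_time_s


if __name__ == "__main__":
    saturated_cpm = cpm_from_per_hour(4200)
    for threads in (10, 35, 70):
        print(threads, response_time_s(threads, saturated_cpm))
```

Under these assumptions, doubling the threads from 35 to 70 doubles response time from 30 s to 60 s without any gain in throughput, which is where a fixed client timeout starts to fire.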

[Figure: average response time, ART (s), against number of client threads (0–70) for UCL (4 cpu Sun), Newcastle (2 cpu Intel), Imperial (2 cpu Intel), Edinburgh (4 hyperthread cpu Intel), All, and Tomcat. Fastest: 3.6s (Edinburgh). Slowest: 25s (UCL).]

[Figure: throughput (CPM) against number of client threads (0–80) for UCL (4 cpu Sun), Newcastle (2 cpu Intel), Imperial (2 cpu Intel), Edinburgh (4 hyperthread cpu Intel), All (12 cpus), and the theoretical maximum. Annotation: 95% of predicted maximum throughput.]
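The slides do not say how the theoretical maximum was predicted. One common approach, assumed here, treats each node as CPU-bound: a node with c CPUs and a single-call service time of s seconds completes at most 60c / s calls per minute, and the Grid maximum is the sum over nodes (perfect load balancing, no middleware overhead). The node figures below are hypothetical, chosen only to show the arithmetic.

```python
from typing import Iterable, Tuple


def node_max_cpm(cpus: int, service_time_s: float) -> float:
    """Upper bound for one node: each CPU completes one call per service time."""
    return cpus * 60.0 / service_time_s


def grid_max_cpm(nodes: Iterable[Tuple[int, float]]) -> float:
    """Predicted maximum for the whole Grid: the sum of per-node bounds."""
    return sum(node_max_cpm(cpus, service_time) for cpus, service_time in nodes)


def efficiency(measured_cpm: float, predicted_max_cpm: float) -> float:
    return measured_cpm / predicted_max_cpm


if __name__ == "__main__":
    # Hypothetical (cpus, service time in seconds) for four nodes.
    nodes = [(4, 12.0), (2, 6.0), (2, 6.0), (4, 8.0)]
    predicted = grid_max_cpm(nodes)
    print(predicted, efficiency(85.5, predicted))
```

Hyperthreaded CPUs complicate the value of c, so a bound like this is only a first approximation.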

• Tomcat vs Test container
– No difference on 3 out of 4 nodes
– But 67% faster on one node (Newcastle, slowest Intel
box)
• Attachments will work with GT3 and Tomcat
– But not with security
– Limit of 1GB (DIME)
– Bug in Axis – doesn’t clean up temporary files.

• Stateful instances visible externally can be
problematic
– Intermittent unreliability
• On some runs, 1 exception in 300 calls (reliability of .9967)
– But non-repeatable, SOAP/network related?
• What is the safe response to exceptions? Can’t just retry.
– Possible to kill container (relies on clients being well
behaved):
• By invoking same instance/method more than once.
• By consuming container resources
– But instances can be passivated/activated in theory
– Could be used to enable fine-grain (per instance) control over
resource usage.
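Retrying is unsafe because a call on a stateful instance may already have executed when the exception reaches the client (for example, if only the reply was lost), so a blind retry applies the operation twice. One standard remedy, not a GT3 feature, is to tag each logical call with a request identifier and have the instance remember completed identifiers, which makes retries idempotent. The sketch below simulates a lost reply. All names are hypothetical, and the completed-request cache grows without bound, which a real design would have to limit.

```python
import uuid
from typing import Callable, Dict


class LostReply(Exception):
    """Simulated failure: the call executed, but the reply never arrived."""


class StatefulInstance:
    """Hypothetical stateful service instance with request-id de-duplication."""

    def __init__(self) -> None:
        self.total = 0
        self._completed: Dict[str, int] = {}  # request id -> result already computed

    def add(self, n: int, request_id: str) -> int:
        if request_id in self._completed:
            return self._completed[request_id]  # repeat of a call that already ran
        self.total += n
        self._completed[request_id] = self.total
        return self.total


def call_with_retry(invoke: Callable[[str], int], attempts: int = 3) -> int:
    """Retry invoke(), reusing the SAME request id on every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    request_id = str(uuid.uuid4())
    for attempt in range(attempts):
        try:
            return invoke(request_id)
        except LostReply:
            if attempt == attempts - 1:
                raise
    raise AssertionError("unreachable")


def simulate(dedup: bool, lost_replies: int = 1, n: int = 5) -> int:
    """Run one logical add(n) whose first reply(s) are lost; return the final total."""
    instance = StatefulInstance()
    remaining = lost_replies

    def flaky(request_id: str) -> int:
        nonlocal remaining
        # Without de-duplication every attempt looks like a brand-new call.
        rid = request_id if dedup else str(uuid.uuid4())
        result = instance.add(n, rid)
        if remaining > 0:
            remaining -= 1
            raise LostReply()
        return result

    call_with_retry(flaky)
    return instance.total
```

With de-duplication, one logical `add(5)` leaves the total at 5 however many replies are lost (within the retry limit); without it, each retry adds another 5.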

Grid Practice - Pitfalls
• Production quality Grid middleware needs
(“What this bike needs is …”)
• Support for
– Remote
– location independent
– cross-organisational
– multiple role scenarios
– Such as…

Grid Practice - Pitfalls (continued)
– Platform independent, automatic, installation.
– Tool support for configuration and deployment
creation, validation, viewing and editing.
– Management console for grid, nodes, globus packages,
containers and services.
– Remote deployment and management of services.
– Remote distributed debugging of grid installations,
services, and applications.
– Tool support, and more scalable processes for security.

Grid – Potential
• Principles
• Practice
• Potential

Grid Potential – Architectural alternatives
• Evaluate the two approaches in more detail
– Science exposed as services, vs science code managed
by higher level grid services.
• Explore alternative mechanisms for:
– Executing science code
– Load balancing and scheduling/resource management
– Directory services (service and resource discovery)
– Data movement (e.g. SOAP Attachments vs GridFTP)

Grid Potential – Architectural evaluation
• Evaluation approach
– Loosely based on ATAM + mechanisms
– Clarify the role of different GT3 mechanisms,
and quantify pros/cons
– Two versions of application
– Evaluate with
• Architecture
• Roles
• Scenarios (to quantify quality attributes)

• Pick a number of roles of interest
– Define attributes of interest, and scenarios to exercise
and measure them
• Deployment
– Consistency of deployment, and time to deploy
• Debugging
– Ability to locate root cause of problem and rectify
• Security admin
– Cost/time to secure increasing number of clients/nodes
• Grid owner
– Scalability and ease of management

• Hypothesis
– Both approaches to using Grid are identical
– But won’t be surprised by some differences – e.g.
scalability, discovery, deployment
• Problems with
– MDS3 (Directory and resource discovery service)
working with aggregated service data across sites
– GridFTP
– Wrapping Science code with MMJFS

Grid Potential - Deployment
• How to install and configure Grid infrastructure
and services - scalably and securely?
• Install GT3 infrastructure and security manually
– MMJFS allows executable code to be staged
automatically (But not services - could provide a
deployment service).
• Install bootstrapping code, and then install and
deploy all other code and security automatically.
– Using SmartFrog (HP) in the lab, and then test-bed.
– Firewalls, platform specific configurations, user sandboxing,
configuring GT3 security remotely, and “trust” with System
Administrators are open issues.

Grid Potential – Deployment Speculation
• Explicit deployment-flows?
– In the Enterprise, applications are increasingly represented
as work-flows.
• Good for distributed execution, and comprehensibility.
– What if deployment plans are also represented
explicitly as flows (deployment-flows)?
– Some work on work-flow aware resource management
(for Grid).
– Deployment-flows could even be auto-magically
generated from work-flows, and executed to ensure
resources are deployed correctly JIT for work-flow
execution.

• For example:
– Work-flow with two tasks
• 1st task requires 10 nodes, 2nd task 100 nodes.
– Produce deployment-flow which is interleaved
with work-flow to:
• Deploy 1st service to 10 nodes for the first task, and start
execution
• Deploy 2nd service to 100 nodes concurrent with
execution of 1st task, and ready for execution of 2nd.

[Figure: timeline with the work-flow (Execute T1 x 10, then Execute T2 x 100) above the deployment-flow (Deploy S1 x 10, then Deploy S2 x 100 while T1 executes). Could also include un-deployment.]
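The interleaving in this example can be generated mechanically from the work-flow. The sketch below is a hypothetical illustration (the `Task` type and step strings are ours, and it assumes tasks run one after another and use distinct services). It emits a plan where each step lists actions that run concurrently: the next task's service is deployed while the current task executes, and optionally a service is un-deployed once its task has finished.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    name: str      # work-flow task, e.g. "T1"
    service: str   # service the task needs, e.g. "S1"
    nodes: int     # number of nodes the service must be deployed to


def deployment_plan(workflow: List[Task], undeploy: bool = False) -> List[List[str]]:
    """Interleave a deployment-flow with a sequential work-flow.

    Each inner list is one step; its actions run concurrently.
    """
    if not workflow:
        return []
    first = workflow[0]
    plan = [[f"deploy {first.service} x {first.nodes}"]]  # nothing can run before this
    for i, task in enumerate(workflow):
        step = [f"execute {task.name} x {task.nodes}"]
        if i + 1 < len(workflow):  # deploy the next service just in time
            nxt = workflow[i + 1]
            step.append(f"deploy {nxt.service} x {nxt.nodes}")
        if undeploy and i > 0:  # the previous task is done with its service
            prev = workflow[i - 1]
            step.append(f"undeploy {prev.service} x {prev.nodes}")
        plan.append(step)
    if undeploy:
        last = workflow[-1]
        plan.append([f"undeploy {last.service} x {last.nodes}"])
    return plan
```

For the two-task example, the plan is: deploy S1 to 10 nodes; execute T1 while deploying S2 to 100 nodes; execute T2.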

Grid Potential - Deployment + Debugging
• Debugging distributed systems is tricky
– Need better support for cross-cutting non-functional concerns such
as deployment and debugging.
– (One) problem with debugging services is not knowing the context
of errors (to aid diagnosis or cure) – a service is just a black box
with an interface.
• Deployment aware debugging:
– Starting from functional work-flows, generate deployment-flows,
which are executed prior to, or concurrent with, functional work-flows.
• This ensures that deployment is done consistently and automatically
with respect to application execution.
– If a functional work-flow fails, the corresponding deployment-flow
is examined to determine likely causes, and parts of it are re-executed.
– Failure in deployment-flow can also possibly be managed.

• Three phases of Debugging
• Debug deployment
– Relies on deployment infrastructure and deployment-flows
– What works locally or on one node may not work remotely, or identically
on all nodes without modification, and deployment framework itself may
be an extra cause of failure
• Debug/trace application + infrastructure to get working initially
– Relies on visibility/transparency of deployed and running infrastructure
and application
– Ideally want integrated (active), or at least proxy/sniffer (passive),
debugging (profiling, tracing, stepping) support.
• Debug working application upon failure
– But multiple failure modes
– Has application + infrastructure been analysed and/or tested for them all?
– Can diagnosis and rectification be done anyway?

• Backtrack through deployment steps (Like peeling an onion)
– Some steps will need to be reversed, and then redone correctly
– Manage dependent, redundant, and inconsistent operations
• This approach may fix an (interesting) sub-class of problems:
• Those which can be fixed by simply redoing (or replicating) (part of) the
installation, E.g.
– Intermittent failure of container or services
– Resource starvation or overload – deploy services to more resources
• Security problems that can be fixed with reconfiguration or refresh of
certificates/proxies.
– But not:
• network, or all configuration and security/access problems.
• Or “Enterprise Web services” (from a user perspective, as users can’t
deploy)
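The backtracking idea can be sketched as a per-node log of applied deployment steps. When a failure is traced to a step, the steps from the most recent back to that one are undone in reverse order and then redone in order; causes the slide lists as out of reach, such as network problems, get no plan. The step names and cause labels are hypothetical, and real steps would also need the dependency and consistency management noted above.

```python
from typing import Dict, List, Optional

# Cause labels (ours) for problems that redoing part of the installation may fix.
FIXABLE_BY_REDOING = {"intermittent-failure", "resource-overload", "stale-credentials"}


class DeploymentLog:
    """Hypothetical per-node record of deployment steps, in the order applied."""

    def __init__(self) -> None:
        self.steps: Dict[str, List[str]] = {}

    def record(self, node: str, step: str) -> None:
        self.steps.setdefault(node, []).append(step)

    def backtrack(self, node: str, suspect_step: str) -> List[str]:
        """Undo back to suspect_step (newest first), then redo in original order."""
        applied = self.steps.get(node, [])
        if suspect_step not in applied:
            raise ValueError(f"{suspect_step!r} was never applied on {node}")
        to_redo = applied[applied.index(suspect_step):]
        return [f"undo {s}" for s in reversed(to_redo)] + [f"redo {s}" for s in to_redo]


def plan_repair(log: DeploymentLog, node: str, suspect_step: str, cause: str) -> Optional[List[str]]:
    """Return a repair plan, or None when redeployment is unlikely to help."""
    if cause not in FIXABLE_BY_REDOING:
        return None  # e.g. network problems
    return log.backtrack(node, suspect_step)
```

Peeling back only as far as the suspect step keeps earlier, presumably sound, steps (such as the base GT3 install) untouched.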

[Figure: the same timeline (Execute T1 x 10, Execute T2 x 100; Deploy S1 x 10, Deploy S2 x 100) with a failure marked; proposed response: redeploy S2 on the failed node?]

• What’s still needed?
– Connection between executing client code and
deployment infrastructure
– Ability to reason about relationship between work-
flow/client failures, deployment-flows and grid
infrastructure, diagnose failure causes, and plan solutions
– Ideally want applications and deployment represented
explicitly as flows – work and deployment flows.
– Could possibly infer work-flow and therefore
deployment-flow from running system in the absence of
explicit information?
– Justification – is the problem significant, and how far does
this solution go?

• Thank you
• Email: P.Brebner@cs.ucl.ac.uk
– After November: Paul.Brebner@csiro.au

• Not (quite) the End…

Postscript – The Secret Life of Grid?
Our experience evaluating Grid technology reminds me of an
Australian book (“The Secret Life of Wombats”) about a school boy
who used to sneak out of his dormitory after everyone was asleep to go
“wombatting”. He spent his nights secretly crawling down Wombat
burrows with a flashlight – a potentially lethal activity (not just from
cave-ins, as wombats are ferocious when cornered!) – and wrote
copious notes resulting in a substantial increase in knowledge of these
“mysterious and often misunderstood creatures”.

UK OGSA Evaluation Project Report 1.0: Evaluation of Globus Toolkit 3.2 (GT3.2) Installation
http://sse.cs.ucl.ac.uk/UK-OGSA/Report1.doc
