A presentation I gave at the UCL Computer Science Department in 2004, while I was on leave from CSIRO and managing the UK OGSA Evaluation Project, working with Wolfgang Emmerich.
Paul Brebner, University College London, Computer Science Department Seminar: "Grid Middleware - Principles, Practice, and Potential", 1 November 2004.
The project page was still here (2020): http://sse.cs.ucl.ac.uk/UK-OGSA/
Grid Middleware – Principles, Practice and Potential
1. (Or: What do Wombats and Grid have
in common?)
UK OGSA Evaluation Project
(UCL, Imperial, Newcastle,
Edinburgh)
UCL Project Members: Paul Brebner,
Wolfgang Emmerich
University College London
P.Brebner@cs.ucl.ac.uk
2. What do Wombats and Grid have in common?
A They are secretive and misunderstood creatures?
B They live in complex underground burrows?
C You wouldn’t want to meet one in a confined
space in the dark?
D All of the above?
3. Grid – Abstract
• Principles
– What are the principles of Grid middleware?
• Practice (and pitfalls)
– How easy is it to use in practice? What are the pitfalls?
• Potential
– What potential does Grid middleware have to
• (1) provide insight into different ways of using Service
Oriented Architectures, and
• (2) support automatic deployment and debugging?
4. Grid – Principles
6. Grid Principles – Grid vs Enterprise
• What’s the difference between Grid and
Enterprise? (Typical generalisations…)
• Grid
– Crosses firewalls and organisational boundaries
– Resource and code focussed
• scientist has some code, and wants to execute it on as many
resources as possible, to solve ever bigger problems
– Developer, deployer and user may be the same person
7. Grid Principles – Grid
[Diagram; labels: Code, Data, New]
• User wants: infinite resources, scalability, monitoring
• Organisations want: fair sharing, ease of maintenance?
8. Grid Principles – Grid vs Enterprise
• Enterprise
– Code developed, deployed and maintained by
enterprises behind firewall
– Exposed as web services for intra and inter
organisational interoperability
– Users don’t develop or deploy code
9. Grid Principles – Enterprise
[Diagram; labels: Query or Transaction, Response, Service developer]
• User wants: response time, availability
• Enterprise wants: interoperability, scalability, security
10. Grid Principles – Grid vs Enterprise
• Grid (User view)
– I have some code, make it run fast for me.
– Concerns: Finding resources, platform portability,
deploying, running and monitoring “jobs”, security,
data management.
• Enterprise (Enterprise owner view)
– I have some business logic exposed as Web service –
ensure internal and external users get required QoS.
– Concerns: QoS, interoperability, transactional,
performance/scalability, security, multiple applications
sharing services.
11. Grid Principles – Just another component model?
• In spite of these differences, they have something
in common
• OGSI has J2EE origins
– “What does it mean to ship a J2EE-based Grid environment,
something that can deliver OGSI-compliant services? It means that
you provide a server programming environment that makes it very
easy for service writers to implement services that conform to the
set of standards that are OGSI.”
– Containers, lifecycle management
– Goal: Easy to write services and interoperability at
interface level
20. Grid Principles - State
• Treatment of stateful instances?
– J2EE has stateful session and entity beans
• CMP Entity beans: lifecycle management
(passivation/activation/pooling), caching, and automatic
persistence support
• Typically accessed via Stateless Session Beans or MDBs
– GT3 has stateful instances (created by Factories)
• Accessed via SOAP and handles
• No automatic passivation/activation or persistence
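A toy model of the factory and handle pattern described above may help. It is a minimal sketch, not GT3 code: the class names, the handle format, and the example URL are all hypothetical. It only shows that every created instance stays in memory until it is explicitly destroyed, because nothing passivates or persists it.

```python
import itertools


class CounterInstance:
    """A stateful service instance: it remembers a running total."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n
        return self.value


class Factory:
    """Creates instances and hands back a handle (here just a URL-like string)."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self._ids = itertools.count(1)
        self.instances = {}  # handle -> live instance, never passivated

    def create(self):
        handle = f"{self.base_url}/instance-{next(self._ids)}"
        self.instances[handle] = CounterInstance()
        return handle

    def call_add(self, handle, n):
        # The client must present the handle to reach its own state.
        return self.instances[handle].add(n)

    def destroy(self, handle):
        del self.instances[handle]


factory = Factory("http://host.example:8080/counter")  # hypothetical address
h1 = factory.create()
h2 = factory.create()
factory.call_add(h1, 2)
factory.call_add(h1, 3)
factory.call_add(h2, 1)
```

In this model the number of live instances only goes down when a client calls destroy, which is one way to picture the container resource concerns raised later in the talk.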
21. Grid Principles - Roles
• J2EE
– Component developer
– Application assembler
– Deployer
– System Administrator
• Not to mention product and tool providers, system architect,
and database designer and administrator, etc
• Many products provide distributed/remote tool
support
22. Grid Principles - Roles
• Grid?
– Increasing number of roles in practice
– But, no explicit definition of Grid roles, and
– Poor tool support for cross-organisational
support of roles
23. Grid Principles - Deployment
• Treatment of deployment?
– J2EE has explicit deployment role, and
typically good tool support for remote
deployment
– Support for product independent deployment
(JSR-88 since J2EE 1.4)
– GT3 has built-in support for remote
“code/executable” deployment (staging), but
none for remote “service” deployment
24. Grid Principles – Confusion/alternatives
• How is Globus intended to be used?
– 1: Science as first-order services
• Middleware for building and hosting Grid
Applications, by exposing science code as Grid
services.
– 2: High-level grid services
• Middleware for building a set of high level Grid
services, composed to provide new Grid
functionality. Science isn’t first-order service, but
executed and managed by Grid services.
27. Grid Principles – Science services or Grid services
[Diagram: a Client and two alternatives, labelled 1 and 2. Example science code: E=mc2 and D=A+2B+C2. Other labels: Data, Execution]
• 1: Science services: directly callable, described, discoverable
• 2: Science: indirectly callable, not directly described or discoverable
28. Grid – Practice
29. Grid Practice – What to evaluate?
• OGSA > OGSI > GT3.2 – Grid SOA exemplar
– Initially evaluate installation, configuration, and
security
– Then performance and scalability, deployment,
architectural choices, etc.
• What’s the point? What are we trying to learn?
– What are some of the s/w engineering and architectural
issues surrounding Grid infrastructure? Across
organisational boundaries?
– What improvements are required before it is suitable
for production environments?
30. Grid Practice – “Realistic” test-bed
• Heterogeneous platforms
– Linux, Solaris, Windows
• Cross-organisational
– Four nodes
– Independently administered
– Firewalls and access restrictions
• Security
– UK e-Science CA
31. Grid Practice – Incremental
• Start with Core Package (Just container and basic
services – e.g. container registry service)
• Add Security
• Then try “All Services”
• Simple enough – in theory
– Relationship between packages not well understood
– Java and non-Java components
– Poor integration between some parts
41. Grid Practice – What we found
• Port number management (conflicts, discovery)
• Host access (requirements and site policies)
• Remote visibility of installation, container,
services (what, configuration, version)
• Installation by System Administrators (role
division, extra effort)
• Tomcat or Test container (different configuration)
• Linux is the only well supported platform
• Exponential increase in testing complexity as
number of nodes increases.
42. Grid Practice – Security
• Grid Security Infrastructure (GSI)
– X.509 certificates
– Mutual authentication (client/host)
– Proxy certificates (delegation and single sign-on)
• Authentication (Who are you?)
– Secure Message (Basic)
– Secure Conversation
• Signing or Encryption (prevent unauthorised altering/reading)
• Authorisation (Who is authorised to use container,
factory, service, method)
– Gridmap file (Access Control List – maps Grid to Local
identities)
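To make the gridmap idea concrete, here is a minimal sketch that parses entries of the usual form, a quoted certificate subject followed by a local account name. The subjects and account names in the sample are invented for illustration, and the parser ignores features such as multiple accounts per subject.

```python
import shlex

# Hypothetical entries; real subjects come from certificates issued by the CA.
SAMPLE = '''
# comment lines and blank lines are skipped
"/C=UK/O=Example/OU=Lab/CN=Alice Example" alice
"/C=UK/O=Example/OU=Lab/CN=Bob Example" bob
'''


def parse_gridmap(text):
    """Map each certificate subject (Grid identity) to a local account name."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # honours the quotes around the subject
        if len(parts) != 2:
            raise ValueError(f"unexpected gridmap line: {line!r}")
        subject, account = parts
        mapping[subject] = account
    return mapping


def local_account(mapping, subject):
    """Return the local account for a subject, or None if it is not authorised."""
    return mapping.get(subject)


gridmap = parse_gridmap(SAMPLE)
print(local_account(gridmap, "/C=UK/O=Example/OU=Lab/CN=Alice Example"))
```

A subject with no entry simply has no local identity to run as, which is what makes the file act as an access control list.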
43. Grid Practice – Security
• In theory just have to
– obtain (and update) host, client, and CA certificates
– convert
– install
– configure (server, client side, container, services, etc)
– generate (and update) proxies.
• However, parts of “All Services” package also
needed.
44. Grid Practice – Security
• Interactions between security for multiple
installations
• Essential to test non-secure interoperability first
• Windows client-side security
• Testing and viewing security configuration
• Debugging secure calls
• Client side security is programmatic
• Security management scalability
– Construction and maintenance of user accounts and
grid-map file entries.
45. Grid Practice – Security
• Interactions between security for multiple
installations
– For testing may want
• multiple versions, or duplicates (with different
configurations) of same versions.
• One container with no security, and another
container with security
– May want test/production environments
46. Grid Practice – Security
• Essential to test non-secure interoperability
first
– Trying to test interoperability and security
simultaneously wasn’t fun
47. Grid Practice – Security
• Windows client-side security
– Not obvious exactly what parts of Globus are
needed for client side code with security (no
“client side + security” package).
48. Grid Practice – Security
• Testing and viewing security configuration
– View/edit and check security configuration for
containers and services
– Confusion about hierarchical security settings
• Virtual Organisations, clusters, servers, containers,
factories, services, methods, and instances.
– Remotely
– Validate security deployment before run-time
49. Grid Practice – Security
• Debugging secure calls (or any stateful service)
– Proxy interceptor approach (e.g. TCPMON) won’t
work with stateful services
• As grid handle returned to client contains the port number of
the instance, not the proxy
– But proxies are an important design pattern for SOAs…
– GT4/WS-RF may be different
• Handle resolvers, WS-Addressing and WS-
RenewableReferences
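The handle problem can be sketched with plain URL handling. The assumption here is that a handle is a URL naming the host and port of the instance; the host names, ports, and path are hypothetical. A client that follows the handle as returned talks to the instance directly, so a proxy placed in front of the factory never sees the call unless the handle is rewritten.

```python
from urllib.parse import urlsplit, urlunsplit


def bypasses_proxy(handle, proxy_host, proxy_port):
    """True if a call made with this handle would not go through the proxy."""
    parts = urlsplit(handle)
    return (parts.hostname, parts.port) != (proxy_host, proxy_port)


def via_proxy(handle, proxy_host, proxy_port):
    """Rewrite the handle so the call is addressed to the proxy instead."""
    parts = urlsplit(handle)
    return urlunsplit(parts._replace(netloc=f"{proxy_host}:{proxy_port}"))


# Hypothetical handle returned by a factory that was itself called via the proxy.
handle = "http://gridnode.example.org:8080/ogsa/services/example/instance-1"
print(bypasses_proxy(handle, "localhost", 9090))
print(via_proxy(handle, "localhost", 9090))
```

Rewriting on the client side is only a workaround for tracing; it assumes the proxy forwards to the original host and port, and it says nothing about whether secured messages survive the detour.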
50. Grid Practice – Security
• Client side security is programmatic
– Client side code modifications required to call
services/methods with required protocols
– Should be declarative
– Sensitive to server side security credentials
51. Grid Practice – Security
• Security management scalability
– Construction and maintenance of user accounts and grid-map file
entries.
– For each server, each user needs an account, and an entry in the
container gridmap file (mapping client certificate to account)
– May also need service specific gridmap files
– Not scalable for large numbers of users, servers, services.
– Revocation of certificates, host certificate expiry problem
• Alternatives?
– Tool support
– Role based authentication
– Shared accounts or certificates (probably evil)
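The scaling concern can be put in numbers with a back-of-envelope count. This sketch assumes every user needs access to every server, and that each service with its own gridmap file lists every user; both are simplifications, and the function names are illustrative.

```python
def accounts_needed(users, servers):
    """One local account per user on each server."""
    return users * servers


def gridmap_entries(users, servers, services_with_own_gridmap=0):
    """Container-level entries plus per-service entries on every server."""
    container_entries = users * servers
    service_entries = users * servers * services_with_own_gridmap
    return container_entries + service_entries


# Four nodes as in the test-bed; user and service counts are made up.
print(accounts_needed(10, 4), gridmap_entries(10, 4))
print(accounts_needed(100, 4), gridmap_entries(100, 4, services_with_own_gridmap=3))
```

Every added user, server, or separately secured service multiplies the entries someone has to create, keep consistent, and eventually revoke.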
52. Grid Practice - Performance
• First approach (initial results)
– Scientific benchmark (SciMark2.0) modified to
measure throughput, and invoked as a Stateful Grid
Service
– Metric is Calls Per Minute (CPM) – one unit of work.
– No large-scale data movement, just SOAP parameters
and result, and computation/memory load.
• Good performance and scalability
– Minimal overhead cf standalone benchmark
– Security has minimal overhead
– Sustained 4200 “jobs” an hour throughput
– Problem with client side timeouts as response times
increase
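As a rough check on these figures, 4200 jobs an hour is 70 CPM. The sketch below does that conversion and, as an added assumption not taken from the slides, applies Little's Law to a closed loop of client threads with no think time: once throughput has levelled off, each extra thread only adds to the response time, which is one plausible reading of the client-side timeout problem.

```python
def jobs_per_hour_to_cpm(jobs_per_hour):
    """Convert jobs per hour to calls per minute (one call is one job)."""
    return jobs_per_hour / 60.0


def saturated_response_time_s(threads, cpm):
    """Little's Law for a closed system with no think time: R = N / X."""
    calls_per_second = cpm / 60.0
    return threads / calls_per_second


cpm = jobs_per_hour_to_cpm(4200)
print(cpm)
print(round(saturated_response_time_s(60, cpm), 1))
```

Under these assumptions 60 client threads at 70 CPM would each wait about 51 seconds per call, so any client timeout shorter than that would start to fire.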
53. Grid Practice - Performance
[Chart: ART (s) against number of client threads (0 to 70), Tomcat container. Series: UCL (4 cpu Sun), Newcastle (2 cpu Intel), Imperial (2 cpu Intel), Edinburgh (4 hyperthread cpu Intel), All]
Fastest: 3.6s (Edinburgh)
Slowest: 25s (UCL)
54. Grid Practice - Performance
[Chart: Throughput (CPM) against number of client threads (0 to 80). Series: UCL (4 cpu Sun), Newcastle (2 cpu Intel), Imperial (2 cpu Intel), Edinburgh (4 hyperthread cpu Intel), All (12 cpus), Theoretical Maximum]
95% of predicted maximum throughput
55. Grid Practice - Performance
• Tomcat vs Test container
– No difference on 3 out of 4 nodes
– But 67% faster on one node (Newcastle, slowest Intel
box)
• Attachments will work with GT3 and Tomcat
– But not with security
– Limit of 1GB (DIME)
– Bug in Axis – doesn’t clean up temporary files.
56. Grid Practice - Performance
• Stateful instances visible externally can be
problematic
– Intermittent unreliability
• On some runs, 1 exception in 300 calls (reliability of 0.9967)
– But non-repeatable, SOAP/network related?
• What is the safe response to exceptions? Can’t just retry.
– Possible to kill container (relies on clients being well
behaved):
• By invoking same instance/method more than once.
• By consuming container resources
– But instances can be passivated/activated in theory
– Could be used to enable fine-grain (per instance) control over
resource usage.
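The reliability figure above is simply 1 minus 1/300. The sketch below reproduces it and then asks what it would mean for a longer run, under the added assumption that failures are independent from call to call (the slides note the failures were non-repeatable, so this is only an approximation).

```python
def observed_reliability(failures, calls):
    """Fraction of calls that completed without an exception."""
    return 1.0 - failures / calls


def prob_all_succeed(reliability, n_calls):
    """Chance that n independent calls all succeed."""
    return reliability ** n_calls


r = observed_reliability(1, 300)
print(round(r, 4))
print(round(prob_all_succeed(r, 300), 2))
```

On that assumption, a client making 300 calls would see a completely clean run only about a third of the time, which is why the question of a safe response to exceptions matters for stateful instances.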
57. Grid Practice - Pitfalls
• Production quality Grid middleware needs
(“What this bike needs is …”)
• Support for
– Remote
– location independent
– cross-organisational
– multiple role scenarios
– Such as…
58. Grid Practice - Pitfalls (continued)
– Platform independent, automatic, installation.
– Tool support for configuration and deployment
creation, validation, viewing and editing.
– Management console for grid, nodes, globus packages,
containers and services.
– Remote deployment and management of services.
– Remote distributed debugging of grid installations,
services, and applications.
– Tool support, and more scalable processes for security.
59. Grid – Potential
60. Grid Potential – Architectural alternatives
• Evaluate the two approaches in more detail
– Science exposed as services, vs science code managed
by higher level grid services.
• Explore alternative mechanisms for:
– Executing science code
– Load balancing and scheduling/resource management
– Directory services (service and resource discovery)
– Data movement (e.g. SOAP Attachments vs GridFTP)
61. Grid Potential – Architectural evaluation
• Evaluation approach
– Loosely based on ATAM + mechanisms
– Clarify the role of different GT3 mechanisms,
and quantify pros/cons
– Two versions of application
– Evaluate with
• Architecture
• Roles
• Scenarios (to quantify quality attributes)
62. Grid Potential – Architectural evaluation
• Pick a number of roles of interest
– Define attributes of interest, and scenarios to exercise
and measure them
• Deployment
– Consistency of deployment, and time to deploy
• Debugging
– Ability to locate root cause of problem and rectify
• Security admin
– Cost/time to secure increasing number of clients/nodes
• Grid owner
– Scalability and ease of management
63. Grid Potential – Architectural evaluation
• Hypothesis
– Both approaches to using Grid are identical
– But won’t be surprised by some differences – e.g.
scalability, discovery, deployment
• Problems with
– MDS3 (Directory and resource discovery service)
working with aggregated service data across sites
– GridFTP
– Wrapping Science code with MMJFS
64. Grid Potential - Deployment
• How to install and configure Grid infrastructure
and services - scalably and securely?
• Install GT3 infrastructure and security manually
– MMJFS allows executable code to be staged
automatically (But not services - could provide a
deployment service).
• Install bootstrapping code, and then install and
deploy all other code and security automatically.
– Using SmartFrog (HP) in the lab, and then test-bed.
– Firewalls, platform specific configurations, user sand-
boxing, configuring GT3 security remotely, and “trust”
with System Administrators are open issues.
65. Grid Potential – Deployment Speculation
• Explicit deployment-flows?
– In Enterprise, applications are increasingly represented
as work-flows.
• Good for distributed execution, and comprehensibility.
– What if deployment plans are also represented
explicitly as flows (deployment-flows)?
– Some work on work-flow aware resource management
(for Grid).
– Deployment-flows could even be auto-magically
generated from work-flows, and executed to ensure
resources are deployed correctly JIT for work-flow
execution.
66. Grid Potential – Deployment Speculation
• For example:
– Work-flow with two tasks
• 1st task requires 10 nodes, 2nd task 100 nodes.
– Produce deployment-flow which is interleaved
with work-flow to:
• Deploy 1st service for 1st task to 10 nodes, and start
execution
• Deploy 2nd service to 100 nodes concurrent with
execution of 1st task, and ready for execution of 2nd.
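The two-task example can be sketched as a small generator that derives a deployment-flow from a work-flow. This is only an illustration of the speculation above: the Task type, the step strings, and the rule of deploying the next service while the current task runs are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str      # work-flow task, e.g. "T1"
    service: str   # service the task needs, e.g. "S1"
    nodes: int     # number of nodes the task requires


def interleaved_flow(workflow):
    """Return a list of stages; steps within one stage may run concurrently."""
    stages = []
    if not workflow:
        return stages
    first = workflow[0]
    stages.append([f"deploy {first.service} x {first.nodes}"])
    for i, task in enumerate(workflow):
        stage = [f"execute {task.name}"]
        if i + 1 < len(workflow):
            nxt = workflow[i + 1]
            # Deploy the next task's service while this task is executing.
            stage.append(f"deploy {nxt.service} x {nxt.nodes}")
        stages.append(stage)
    return stages


workflow = [Task("T1", "S1", 10), Task("T2", "S2", 100)]
for stage in interleaved_flow(workflow):
    print(" || ".join(stage))
```

For the example work-flow this gives three stages: deploy S1 to 10 nodes, then execute T1 while S2 is deployed to 100 nodes, then execute T2.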
67. Grid Potential – Deployment Speculation
[Diagram: timeline of the work-flow (T1 x 10, then T2 x 100) with the deployment-flow interleaved: Deploy S1 x 10, Execute T1, Deploy S2 x 100, Execute T2. Could also include un-deployment.]
68. Grid Potential - Deployment + Debugging
• Debugging distributed systems is tricky
– Need better support for cross-cutting non-functional concerns such
as deployment and debugging.
– (One) problem with debugging services is not knowing the context
of errors (to aid diagnosis or cure) – a service is just a black box
with an interface.
• Deployment aware debugging:
– Starting from functional work-flows, generate deployment-flows,
which are executed prior to, or concurrent with, functional work-
flows.
• This ensures that deployment is done consistently and automatically
with respect to application execution.
– If failure in functional work-flow, then corresponding deployment-
flow is examined to determine likely causes, and parts are re-
executed.
– Failure in deployment-flow can also possibly be managed.
69. Grid Potential - Deployment + Debugging
• Three phases of Debugging
• Debug deployment
– Relies on deployment infrastructure and deployment-flows
– What works locally or on one node may not work remotely, or identically
on all nodes without modification, and deployment framework itself may
be an extra cause of failure
• Debug/trace application + infrastructure to get working initially
– Relies on visibility/transparency of deployed and running infrastructure
and application
– Ideally want integrated (active), or at least proxy/sniffer (passive),
debugging (profiling, tracing, stepping) support.
• Debug working application upon failure
– But multiple failure modes
– Has application + infrastructure been analysed and/or tested for them all?
– Can diagnosis and rectification be done anyway?
70. Grid Potential - Deployment + Debugging
• Backtrack through deployment steps (Like peeling an onion)
– Some steps will need to be reversed, and then redone correctly
– Manage dependent, redundant, and inconsistent operations
• This approach may fix an (interesting) sub-class of problems:
• Those which can be fixed by simply redoing (or replicating) (part of) the
installation, E.g.
– Intermittent failure of container or services
– Resource starvation or overload – deploy services to more resources
• Security problems that can be fixed with reconfiguration or refresh of
certificates/proxies.
– But not:
• network, or all configuration and security/access problems.
• Or “Enterprise Web services” (from a user perspective, as users can’t
deploy)
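The onion-peeling idea can be sketched as a plan builder: given the ordered deployment steps and the step suspected of causing the failure, undo everything from the last step back to the suspect, then redo the same steps in order. The step names are hypothetical, and the sketch leaves out the harder parts named above, such as dependent, redundant, and inconsistent operations.

```python
def backtrack_plan(steps, suspect):
    """Build an undo/redo plan that re-runs deployment from `suspect` onward."""
    if suspect not in steps:
        raise ValueError(f"unknown deployment step: {suspect!r}")
    tail = steps[steps.index(suspect):]
    undo = [("undo", step) for step in reversed(tail)]  # peel back, last step first
    redo = [("redo", step) for step in tail]            # then rebuild in order
    return undo + redo


# Hypothetical deployment-flow for one node.
steps = ["install container", "configure security", "deploy S2"]
for action, step in backtrack_plan(steps, "configure security"):
    print(action, step)
```

Steps before the suspect are left alone, so the plan stays as small as the diagnosis allows; a poor diagnosis just means peeling back further.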
71. Grid Potential - Deployment + Debugging
[Diagram: the same timeline (Deploy S1 x 10, Execute T1, Deploy S2 x 100, Execute T2), now marked “Failure!”, with the response: Redeploy S2 on failed node?]
72. Grid Potential - Deployment + Debugging
• What’s still needed?
– Connection between executing client code and
deployment infrastructure
– Ability to reason about relationship between work-
flow/client failures, deployment-flows and grid
infrastructure, diagnose failure causes, and plan solutions
– Ideally want applications and deployment represented
explicitly as flows – work and deployment flows.
– Could possibly infer work-flow and therefore
deployment-flow from running system in the absence of
explicit information?
– Justification – is the problem significant, and how far does
this solution go?
78. UK OGSA Evaluation Project
• Thank you :-)
• Email: P.Brebner@cs.ucl.ac.uk
– After November: Paul.Brebner@csiro.au
• Not (quite) the End…
81. Postscript – The Secret Life of Grid?
Our experiences evaluating Grid technology remind me of an
Australian book (“The Secret Life of Wombats”) about a schoolboy
who used to sneak out of his dormitory after everyone was asleep to go
“wombatting”. He spent his nights secretly crawling down wombat
burrows with a flashlight – a potentially lethal activity (not just from
cave-ins, as wombats are ferocious when cornered!) – and wrote
copious notes resulting in a substantial increase in knowledge of these
“mysterious and often misunderstood creatures”.
UK OGSA Evaluation Project Report 1.0
Evaluation of Globus Toolkit 3.2 (GT3.2)
Installation
http://sse.cs.ucl.ac.uk/UK-OGSA/Report1.doc