1. Holistic Aggregate Resource Environment
Execution Model
FastOS USENIX 2010 Workshop
Eric Van Hensbergen (bergevan@us.ibm.com)
http://hare.fastos2.org
2. Research Objectives
• Look at ways of scaling general-purpose operating systems
and runtimes to leadership-class supercomputers (thousands
to millions of cores)
• Alternative approaches to systems software support, runtime,
and communications subsystems
• Exploration built on top of the Plan 9 distributed operating system
because of its portability, built-in facilities for distributed systems,
and flexible communication model
• Plan 9 support for BG/P and the HARE runtime have been open-sourced
and are available via http://wiki.bg.anl-external.org
• A public profile is available on the ANL Surveyor BG/P machine and
should be usable by anyone
3. Roadmap
[Timeline: years 0–3, with phases Hardware Support; Systems Infrastructure; and Evaluation, Scaling, & Tuning]
Year 2 Accomplishments:
• Improved tracing infrastructure
• Currying framework
• Execution model
• Scaling infrastructure to 1000 nodes
• Plan 9 for Blue Gene/P open sourced
• Kittyhawk open sourced
• Default profiles for Kittyhawk and Plan 9 installed at ANL on Surveyor
4. New Publications (since Supercomputing 2009)
• Using Currying and process-private system calls to break the
one-microsecond system call barrier. Ronald G. Minnich,
John Floren, Jim McKie; 2009 International Workshop on
Plan 9.
• Measuring kernel throughput on Blue Gene/P with the Plan 9
research operating system. Ronald G. Minnich, John Floren,
Aki Nyrhinen; 2009 International Workshop on Plan 9.
• XCPU3. Pravin Shinde, Eric Van Hensbergen; EuroSys 2010.
• PUSH, a Dataflow Shell. Noah Evans, Eric Van Hensbergen;
EuroSys 2010.
5. Ongoing Work
• File system and Cache Studies
• simple cachefs deployable on I/O nodes and compute
nodes
• experiments with direct-attached storage using CORAID
• MPI Support (ROMPI)
• Enhanced Allocator
• lower-overhead allocator
• working toward an easier approach to multiple page sizes
• working toward schemes capable of supporting hybrid
communication models
• Scaling beyond 1000 nodes (runs on Intrepid at ANL)
• Application and Runtime Integration
7. Core Concept: BRASIL
Basic Resource Aggregate System Inferno Layer
• Stripped-down Inferno: no GUI or anything else we can live
without; minimal footprint
• Runs as a daemon (no console); all interaction is via 9P mounts
of its namespace (see the mount sketch after this list)
• Different modes
• default (exports via /srv/brasil or on tcp!127.0.0.1!5670)
• gateway (exports over standard I/O - to be used by ssh initialization)
• terminal (initiates ssh connection and starts a gateway)
• Runs EVERYWHERE
• User’s workstation
• Surveyor Login Nodes
• I/O Nodes
• Compute Nodes
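A minimal sketch of mounting the default export from a Linux client; the dial string and port come from the default mode above, while the mount point and the choice between 9pfuse and v9fs are illustrative:

    # Minimal sketch, assuming brasild is listening in its default mode.
    # Option 1: plan9port's 9pfuse
    mkdir -p $HOME/n/brasil
    9pfuse 'tcp!127.0.0.1!5670' $HOME/n/brasil
    # Option 2: the Linux in-kernel v9fs client
    sudo mount -t 9p -o trans=tcp,port=5670 127.0.0.1 $HOME/n/brasil
    ls $HOME/n/brasil    # browse the exported namespace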
8. nompirun: legacy friendly job launch
• user initiates execution from the login node using the nompirun script
• e.g., nompirun -n 64 ronsapp (see the sketch after this list)
• setup/boot/exec
• the script submits the job using Cobalt
• when the I/O node boots, it connects to the user's brasild via 9P over Ethernet
• when the CPU nodes boot, they connect to the I/O node via 9P over the collective network
• after boilerplate initialization, $HOME/lib/profile is run on every node for
additional namespace initialization and environment setup
• the user-specified application runs with the specified arguments on all compute nodes;
the application (and supporting data and configuration) can come from the user's home
directory on the login nodes or from any available file server in the namespace
• standard-I/O output from all compute nodes is aggregated at the I/O nodes
and sent over the miniciod channel (thanks to some sample code from the ZeptoOS
team) to the service nodes for standard reporting
• Nodes boot and application execution begins in under 2 minutes
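A concrete (hedged) illustration of the flow above; only the -n flag and application name come from the slide, and the profile fragment is purely hypothetical:

    # Launch 64 compute nodes running ronsapp from the login node
    nompirun -n 64 ronsapp
    # $HOME/lib/profile runs on every node after the boilerplate; an
    # illustrative (hypothetical) fragment might bind extra binaries:
    #   bind -a /n/login/$user/bin/power /bin
    # Aggregated standard output comes back through the normal job output.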
9. Our Approach: Workload Optimized Distribution
[Diagram: local, proxy, and aggregate services composed through aggregation via dynamic namespace; shown as a desktop extension model and a PUSH pipeline model, with remote services organized into a distributed service model for scaling and reliability]
11. Preferred Embodiment: BRASIL Desktop Extension Model
[Diagram: the user's workstation linked to a login node by an ssh-duct, the login node to I/O nodes, and the I/O nodes to CPU nodes]
• Setup
• User starts brasild on the workstation
• brasild ssh's to the login node and starts another brasil, hooking the
two together with a 27b-6 duct and mounting its resources in /csrv
• User mounts brasild on the workstation into the namespace using 9pfuse or
v9fs (or can mount from a Plan 9 peer node, 9vx, p9p, or ACME-sac)
• Boot
• User runs the anl/run script on the workstation
• the script interacts with taskfs on the login node to start a Cobalt qsub
• when an I/O node boots, it connects its csrv to the login node's csrv
• when the CPU nodes boot, they connect to the csrv on their I/O node
• Task Execution (restated as a transcript after this list)
• User runs the anl/exec script on the workstation to run the app
• the script reserves x nodes for the app using taskfs
• taskfs on the workstation aggregates execution by using the taskfs
instances running on the I/O nodes
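The same setup/boot/exec sequence as a shell transcript; only the script and daemon names come from the slide, while the flags, arguments, and mount point are assumptions:

    # Setup (on the user's workstation)
    brasild                                      # start the local daemon and its ssh duct
    9pfuse 'tcp!127.0.0.1!5670' $HOME/n/brasil   # or mount with v9fs / from a Plan 9 peer
    # Boot: submit the Cobalt job via taskfs on the login node
    anl/run
    # Task execution: reserve nodes and run the application
    anl/exec -n 64 ronsapp                       # argument style assumed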
12. Core Concept: Central Services
• Establish a hierarchical namespace of cluster services under /csrv
• Automount remote servers based on reference (e.g., cd criswell);
see the example after the diagram below
[Diagram: example /csrv trees as seen from different nodes in a topology with a terminal (t), a login node (L), I/O nodes (l1, l2), and compute nodes (c1–c4); each tree has a /local entry for the node's own services plus entries for the other nodes it can reach]
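A small illustration of automount-by-reference in /csrv; criswell is the host name from the bullet above, c1 is a compute node from the diagram, and the files under local/ follow the layout shown on the taskfs slide:

    cd /csrv/criswell           # the reference alone triggers the automount
    ls local                    # that node's locally exported services
    cat /csrv/c1/local/status   # read a compute node's status (load/jobs/etc.)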
13. Core Concept: Taskfs
• Provide an xcpu2-like interface for starting tasks on a node
(a sample session follows the layout below)
• Hybrid model for multitask (aggregate ctl & I/O as well as
granular)

/local - exported by each csrv node
  /fs - local (host) file system
  /net - local network interfaces
  /brasil - local (brasil) namespace
  /arch - architecture and platform
  /status - status (load/jobs/etc.)
  /env - default environment for host
  /ns - default namespace for host
  /clone - establish a new task
  /# - task sessions

/0 - an individual task session
  /ctl
  /status
  /args
  /env
  /stdin
  /stdout
  /stderr
  /stdio
  /ns
  /wait
  /# - component session(s)
    /ctl
    ...
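A hypothetical xcpu2-style session against the layout above; the control strings written to ctl are illustrative only, since the slides do not specify the actual protocol:

    cd /csrv/local
    t=$(cat clone)               # establish a new task; returns its session id
    echo res 4 > $t/ctl          # reserve nodes for the task (syntax assumed)
    echo exec ronsapp > $t/ctl   # start the application (syntax assumed)
    cat $t/stdout                # aggregated standard output
    cat $t/wait                  # block until the task completes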
15. What’s Still Missing from Execution Model?
• File system back-mounts are still being developed
• Can work around this by mounting the login node or the user's
workstation at a known place no matter where you are in
the system
• When we get file system back-mounts, we'll need a way to
get to the user's desired file system no matter where in the csrv
topology we are ($MNTTERM)
• The taskfs scheduling model is still top-down; it needs to be able to
propagate information back up to allow efficient scheduling from leaf
nodes
• Performance
• Reworking workload distribution to go bottom-up to improve
scalability and lower per-task overhead
• A Plan 9 native version of the task model to improve performance
16. New Model Breaks Up Implementation
• mpipefs provides base I/O and control aggregation
• execfs provides a layer on top of the system procfs for additional
application control and for initiating remote execution, and uses
mpipefs as its interface to standard I/O
• gangfs provides group process operations and aggregation
as well as the core distributed scheduling interfaces; it
builds upon execfs and uses mpipes for ctl aggregation
• statusfs will provide bottom-up aggregation of system status
through the csrv hierarchy and feed metrics to the gangfs scheduler
using mpipes
• the csrv component provides membership management and
hierarchical links between nodes, as well as failure
detection, avoidance, and recovery
17. Future Work: Generalized Multipipe System Component
• Challenges
• Record separation for large data sets
• Determinism for hash distributions
• Support for multiple models
• Our Approach
• Single synthetic file per multipipe; configuration specified
during pipe creation and initial write
• Readers and writers are tracked and isolated
• “Multipipe” mode uses headers for data flowing over pipes
• Provides record separation via a size prefix (illustrated after this list)
• Can be used by filters to specify a deterministic destination or to allow
for type-specific destinations
• Can also send control messages in header blocks to control splicing
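The size-prefix idea in miniature; the actual multipipe header format is not specified in these slides, so both the framing below and the target path are assumptions:

    # A writer frames each record with its length so readers can split
    # the stream deterministically (format and path are hypothetical).
    rec='one record of output'
    printf '%d\n%s' "${#rec}" "$rec" > /n/brasil/mpipe/data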
18. Future Work on Execution Model
• Caches will be necessary for the desktop extension model to
perform well
• Linux target support (using private namespaces and back-mounts
within taskfs execs)
• Attribute-based file system queries/views and operations
• Probably best implemented as a secondary file system
layer on top of central services
• Language bindings for taskfs interactions (C, C++, Python,
etc.)
• Plug-in scheduling policies
• Failure and Reliability Model
19. Questions?
• This work has been supported in part by the Department of
Energy Office of Science Operating and Runtime Systems for
Extreme Scale Scientific Computation project under contract
#DE-FG02-
• More Info & Publications: http://hare.fastos2.org
21. Core Concept: Ducts
• Ducts are bi-directional 9P connections
• They can be instantiated over any pipe
• e.g., a TCP/IP or ssh connection
[Diagram: two endpoints joined over the transport; each side exports its namespace and mounts the other's over the same pipe]
22. Core Concept: 27b-6 Ducts
• Just like Ducts
• Before the export/mount, each side writes a size-prefixed canonical
name
[Diagram: the same export/mount exchange as a plain duct, carried over ssh or TCP/IP]