libHPC: Software sustainability and reuse through metadata preservation
1. libHPC: Software sustainability and reuse through metadata preservation
Jeremy Cohen, John Darlington, Brian Fuchs
London e-Science Centre / Department of Computing, Imperial College London
David Moxey, Chris Cantwell, Pavel Burovskiy, Spencer Sherwin
Department of Aeronautics, Imperial College London
Neil Chue Hong
Software Sustainability Institute, University of Edinburgh
First Workshop on Maintainable Software Practices in e-Science, Chicago
Tuesday 9th October 2012
2. Introduction
• Decision making – building scientific software can be hard
• Abstraction – hide the complexity
• Efficiency – achieve the performance
• Aim for a universal technology that spans all application domains, machines, metrics
• Coordination forms – a different approach to task specification
• Components – encapsulated building blocks
[Figure: three axes. Applications: Num. Intensive, Data Intensive, Bioinformatics, CFD. Machines: Cluster, Cloud, Multi-core, GPU, FPGA. Metrics: Cost, Time, Energy.]
3. Information and decisions
Why are software development and re-use hard?
• A particular piece of code is the result of many development decisions
• Developers invest significant knowledge about the task to be solved
…however…
• Decisions made by developers cannot be reconstructed from the code
• Loss of original information and structure invested by developer(s)
4. Information and decisions
Understanding the code structure, the options available and the decisions
made during development is important for:
• Portability; optimisation on different architectures
• Long-term sustainability
Need an explicit representation of decisions and alternatives:
• Decision tree used to represent this (structure)
• Metadata used to annotate decision tree (information)
• Modifications can be made to the decision tree (based on metadata
analysis), which can then be mapped to modified code
5. Information and decisions
e.g. code that uses a solver:
• Many options to select suitable solver – abstract components
• Choice dependent on problem being addressed, parameters, etc.
• Represent solver choice on a tree of component alternatives: leaf
nodes are concrete implementations, higher-level nodes are abstract
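[Figure: tree of component alternatives, each node labelled with Matrix and Vector inputs and a Vector output. Root: Linear Solver. Abstract children: LU, Jacobi. Leaves under LU: Sequential LU, Parallel LU (OpenMP), Parallel LU (MPI). Leaves under Jacobi: Sequential Jacobi, Parallel Jacobi (UPC).]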
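The tree of solver alternatives and its metadata annotations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `select` helper and the `parallelism` metadata key are hypothetical names chosen here, not part of the libHPC framework.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the decision tree; a node without children is concrete."""
    name: str
    metadata: dict = field(default_factory=dict)   # hypothetical annotations
    children: list = field(default_factory=list)

    def leaves(self):
        """Yield every concrete implementation below (or at) this node."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()


# Abstract nodes (Linear Solver, LU, Jacobi) and the concrete leaves from the slide.
tree = Node("Linear Solver", children=[
    Node("LU", children=[
        Node("Sequential LU", {"parallelism": "sequential"}),
        Node("Parallel LU (OpenMP)", {"parallelism": "OpenMP"}),
        Node("Parallel LU (MPI)", {"parallelism": "MPI"}),
    ]),
    Node("Jacobi", children=[
        Node("Sequential Jacobi", {"parallelism": "sequential"}),
        Node("Parallel Jacobi (UPC)", {"parallelism": "UPC"}),
    ]),
])


def select(root, **required):
    """Return names of concrete leaves whose metadata matches every requirement."""
    return [leaf.name for leaf in root.leaves()
            if all(leaf.metadata.get(key) == value for key, value in required.items())]


mpi_choices = select(tree, parallelism="MPI")
sequential_choices = select(tree, parallelism="sequential")
```

Keeping the alternatives in an explicit structure like this is what would allow a different leaf to be chosen later, for example when porting to another architecture, without reverse-engineering the original decision from the code.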
8. Application flow and specification
We represent application elements using two techniques:
• Data processing – core code that forms application building blocks
  → Components (first-order functions)
• Control flow, orchestration
  → Higher-order functions
  → Coordination Forms, e.g. Pipe, Parallel, Map / Reduce, …
9. Coordination Forms
• A functional/mathematical approach to job specification
• Based on work by Darlington, et al.
J. Darlington, Y. Guo, H. W. To and J. Yang. Functional skeletons for parallel coordination.
In proceedings of EURO-PAR ’95 Parallel Processing, LNCS 966/1995, p. 55-66, 1995.
Springer Berlin/Heidelberg
• Applied to components – define application flow
• May be:
• General – applicable to most applications – e.g. PIPE, PAR
• Iterative patterns – e.g. FARM, ITERATE
• Domain-specific higher-level forms – e.g. Monte Carlo
• Extensible – new patterns can be introduced
10. Coordination Forms
• A given form may have multiple underlying implementations
• E.g. PAR may provide sequential, multi-threaded and MPI parallel
implementations
• Forms aim to be as lightweight as possible
• They result in code that can be run
• They intelligently glue together component building blocks
• PIPE as an example – functions f1 to fn with initial input a:
PIPE [f1, f2, …, fn] a = (f1 ∘ f2 ∘ … ∘ fn) a
= f1(f2(…(fn(a))))
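As a rough illustration of the two points above, the sketch below writes PIPE as plain function composition and gives PAR two interchangeable implementations (sequential and multi-threaded) behind one interface. The function names, the `implementation` argument and the toy components `inc`, `double` and `square` are assumptions made for this example; they are not the libHPC API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def pipe(functions, a):
    """PIPE [f1, ..., fn] a = f1(f2(...fn(a))): the last function is applied first."""
    return reduce(lambda acc, f: f(acc), reversed(functions), a)


def par_sequential(functions, inputs):
    """Apply each function to its own input, one after another."""
    return [f(x) for f, x in zip(functions, inputs)]


def par_threaded(functions, inputs):
    """Same contract as par_sequential, but each call runs in a worker thread."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, x) for f, x in zip(functions, inputs)]
        return [future.result() for future in futures]


# One form, several underlying implementations, selected by name.
PAR_IMPLEMENTATIONS = {"sequential": par_sequential, "threads": par_threaded}


def par(functions, inputs, implementation="sequential"):
    return PAR_IMPLEMENTATIONS[implementation](functions, inputs)


def inc(x):
    return x + 1


def double(x):
    return x * 2


def square(x):
    return x * x


piped = pipe([inc, double, square], 3)           # inc(double(square(3)))
seq = par([inc, double], [1, 2], "sequential")
thr = par([inc, double], [1, 2], "threads")
```

Because both PAR variants return results in input order, the caller cannot tell which implementation ran, which is the property that lets the choice be deferred.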
11. Coordination Forms – Implementation
• Prototype implementation in Python
• Class wrappers for component and parameter metadata –
concrete implementation code selectable
PIPE – Compose a series of components in the order specified
PIPE ([component list], initial input)
Additional parameters can be added in component list
PAR – Run a series of components independently (perhaps in parallel)
PAR ([component list], [(input1), (input2), …, (inputn)])
E.g. for components add, multiply, divide:
2 * ( (245+34) / (6+8) )
PIPE([(multiply, 2), divide, PAR([add,add],[(245,34),(6,8)])])
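The arithmetic example can be reproduced with a minimal sketch of the two forms. It is written under assumptions and is not the prototype's code: a `(function, extra)` tuple is taken to bind extra leading arguments, a tuple or list flowing between components is unpacked into positional arguments, and the PAR result is passed as PIPE's initial input, following the `PIPE([component list], initial input)` signature rather than nesting it in the component list.

```python
from operator import add, mul as multiply, truediv as divide


def _call(component, value):
    """Apply one component; a (function, extra, ...) tuple binds leading arguments."""
    if isinstance(component, tuple):
        func, *bound = component
    else:
        func, bound = component, []
    # Tuples and lists are unpacked so a PAR result can feed a two-argument component.
    args = value if isinstance(value, (tuple, list)) else (value,)
    return func(*bound, *args)


def PIPE(components, initial_input):
    """Compose components so the first listed is outermost (applied last)."""
    result = initial_input
    for component in reversed(components):
        result = _call(component, result)
    return result


def PAR(components, inputs):
    """Run each component on its own input; here simply one after another."""
    return tuple(_call(c, i) for c, i in zip(components, inputs))


# 2 * ((245 + 34) / (6 + 8))
sums = PAR([add, add], [(245, 34), (6, 8)])
value = PIPE([(multiply, 2), divide], sums)
```

Here `sums` holds the two independent additions, `divide` consumes both, and `(multiply, 2)` doubles the quotient.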
12. Bioinformatics: Genome Read Pre-Processing/Mapping
Input files:
• Reference Genome – FASTA file
• Reads from sequencing machine – FASTQ (Short Read Set (Paired), single FASTQ file)
[Figure: workflow. Reference genome FASTA file → bwa index → FASTA file + index file. Short-read FASTQ file → FASTQ split → SR_1, SR_2 → bwa aln (on each of SR_1 and SR_2) → bwa sampe – generate alignment (paired ended) → SAM file → samtools import → BAM file → samtools sort → sorted BAM file → samtools index → OUTPUT]
((sr1, sr2), u) = PAR([fastq_split, bwa_index],
                      [(short_read_file, None, None), (ref_genome_file,)])
(v, w) = PAR([bwa_aln, bwa_aln],
             [(ref_genome_file, sr1, None),
              (ref_genome_file, sr2, None)])
result = PIPE([samtools_index, samtools_sort,
               (samtools_import, ref_genome_file),
               bwa_sampe],
              [ref_genome_file, [v, w], [sr1, sr2], None])
13. LibHPC Project
• LibHPC
• Two year project under EPSRC HPC Software Programme
• Imperial College London (Computing (LeSC), Aeronautics, ICT)
• SSI, Edinburgh
• Implementing/demonstrating framework with main supporting application (Nektar++) + other exemplars
15. Nektar++ - Hybrid Assembly
• Nektar++ operates on matrices based on input mesh
• Each element of input mesh is mapped to an (elemental) matrix
• There are two matrix assembly strategies:
  • Local
  • Global
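The slides do not define the two strategies further. One common reading in finite element codes, assumed here, is that a global strategy assembles the elemental matrices into a single matrix before applying it, while a local strategy applies each elemental matrix to its own degrees of freedom and sums the contributions. The toy 1D mesh, the elemental matrix and all names below are hypothetical and are not taken from Nektar++.

```python
# Hypothetical 1D mesh: 3 nodes, 2 two-node elements sharing node 1.
ELEMENT_NODES = [(0, 1), (1, 2)]
NUM_NODES = 3
# Toy elemental matrix, identical for every element in this sketch.
ELEMENTAL = [[1.0, -1.0], [-1.0, 1.0]]


def assemble_global():
    """Global strategy: add every elemental matrix into one global matrix."""
    matrix = [[0.0] * NUM_NODES for _ in range(NUM_NODES)]
    for nodes in ELEMENT_NODES:
        for i, row in enumerate(nodes):
            for j, col in enumerate(nodes):
                matrix[row][col] += ELEMENTAL[i][j]
    return matrix


def apply_global(matrix, x):
    """Matrix-vector product with the assembled global matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def apply_local(x):
    """Local strategy: gather, apply the elemental matrix, scatter-add."""
    y = [0.0] * NUM_NODES
    for nodes in ELEMENT_NODES:
        local_x = [x[n] for n in nodes]
        for i, node in enumerate(nodes):
            y[node] += sum(ELEMENTAL[i][j] * local_x[j] for j in range(len(nodes)))
    return y


x = [1.0, 2.0, 4.0]
y_global = apply_global(assemble_global(), x)
y_local = apply_local(x)
```

Both routes give the same vector, so the choice between them is a performance decision of the kind the decision tree is meant to record.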