2. What is e-Infrastructure?
The integration of digitally-based technology, resources, facilities, and
services combined with people and organizational structures needed to
support modern, collaborative research (and teaching).
1. Data and Storage
2. Software (and Algorithms)
3. Hardware (Compute)
4. Networks
5. Security and Authentication
6. People (Collaboration, Skills, Capacity)
7. The Digital Library
3. Bioinformatics software challenges
• This brings an onslaught of new challenges for bioinformatics:
– projects that used to require teams of 500 are now accessible to small teams
– but biology curricula (i.e. biologists) still lack computational skills
– thus biologists are overwhelmed by large amounts of data
– furthermore, data types are young, so software is young, thus:
• software may be badly built (by biologists with no formal software development training/experience)
• software needs frequent updates (bugfixes, algorithmic improvements to sensitivity/specificity, support for new data types)
4. ARCHER
• UK National Supercomputing
Service
• Replacement for HECToR
• LINPACK = 1.359 Pflop/s
• EPSRC is the managing partner
on behalf of RCUK
• NERC are the other partner
research council
• Cray XC30 Hardware
• Nodes based on 2× Intel Ivy Bridge
12-core processors
• 64GB (or 128GB) memory per
node
• 3008 nodes in total (72,192 cores)
• Linked by Cray Aries interconnect
(dragonfly topology)
5. JASMIN Cloud Architecture
[Architecture diagram: the JASMIN internal network holds the Panasas storage and the LOTUS batch compute cluster, behind a firewall. An external network inside JASMIN hosts two cloud tiers. The managed cloud (PaaS, SaaS) runs JASMIN Analysis Platform VMs with direct file system access and direct access to the batch processing cluster. The unmanaged cloud (IaaS, PaaS, SaaS), reached via standard remote access protocols (ftp, http, …) and the JASMIN cloud management interfaces, hosts tenant organisations behind firewall + NAT, each with its own appliance catalogue and science analysis VMs. Examples: Project1-org (science analysis VMs); optirad-org (an IPython JupyterHub VM with IPython slave VMs and a file server VM — an IPython Notebook VM with access to the cluster through IPython.parallel); eos-cloud-org (EOS Cloud VMs, a file server VM and an EOS Cloud fat node — Desktop as a Service with dynamic RAM boost).]
Thanks to Phil Kershaw
8. Bio-Linux: A scalable solution
• Comprehensive, free bioinformatics workstation based on Ubuntu Linux and Debian Med
• 11 years & 8 major releases
• Around 8,000 users from 1,600 locations
• 200+ bioinformatics packages, including big integrative tools: QIIME, Galaxy Server, PredictProtein, EMBOSS, …
• Incorporates all the software whichever way you run it: dual boot, Linux Live, local servers, cloud
10. EOS Cloud
• A tenancy in the JASMIN Unmanaged Cloud (& QMUL RCC)
• Reusing JASMIN web interfaces and user management to
provide custom IaaS software platform
• Each user receives two VMs
– Bio-Linux
– Ubuntu Docker hosting environment
• Users have total responsibility for instantiated system
• Accessible through standard remote desktop tools
• Scalability limited by support available
11. Why Cloud?
• Data sets can be too big or restricted to easily move
– move the compute to the data
– Researcher work patterns are maintained
• Tools such as Bio-Linux/Docker etc are community enablers
• More efficient use of shared resources
• Central maintenance of infrastructure
• Central Management of data sharing agreements possible
• Lower barrier to entry (Compared to traditional HPC and
Grid)
• What type of cloud?
• What role for traditional HPC?
TRAINING IS KEY TO MAKING INFORMED CHOICES
12. EOS Cloud next?
• Expand currently available resource beyond
current limitations?
• Create deployable machine image for other
cloud marketplaces
• EOS/institutional badging to give users
confidence in quality
13. Pilot Users
• CEH bioinformaticians using the EOS Cloud to study patterns in microbial biodiversity
• Genomic and transcriptomic data from fish
toxicogenomics studies at Exeter
14. Pilot Users
• Creating compute pipelines and containers for each OSD in silico
analysis
– HPC, Cloud (IaaS & PaaS)
– Portable
• Run same analysis on different laptops/grids/clouds
– Repeatable/Reproducible
• Same input gives same output, provided the reference databases have not changed
– Preservation
• All analysis tools and dependencies are in one image
• Images are simple tar.gz archives
• Preserving Docker and base images is preserving all analysis
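The repeatability claim above — same input plus an unchanged toolchain gives the same output — can be checked mechanically by checksumming results from repeated runs. The sketch below is illustrative only: the `analyse` function is a stand-in for a real pinned, containerized pipeline step, and the input file is fabricated for the demonstration.

```shell
#!/bin/sh
# Sketch: verify run-to-run reproducibility by comparing output checksums.
# "analyse" is a placeholder for a containerized analysis step whose
# software versions and reference databases are fixed.
set -e
workdir=$(mktemp -d)
printf '>seq1\nACGTACGT\n' > "$workdir/input.fa"

analyse() {
    # Stand-in for the real pipeline (e.g. a tool run inside a Docker image)
    tr 'a-z' 'A-Z' < "$1"
}

analyse "$workdir/input.fa" > "$workdir/run1.out"
analyse "$workdir/input.fa" > "$workdir/run2.out"

sum1=$(sha256sum "$workdir/run1.out" | cut -d' ' -f1)
sum2=$(sha256sum "$workdir/run2.out" | cut -d' ' -f1)
[ "$sum1" = "$sum2" ] && echo "reproducible: $sum1"
```

In practice the recorded checksum would be published alongside the image identifier, so anyone reloading the preserved image can confirm they reproduce the original result.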
15. Software Sustainability Institute
www.software.ac.uk
The Software Sustainability Institute
A national facility for cultivating better, more sustainable research software to enable world-class research
• Software reaches boundaries in its development cycle that prevent improvement, growth and adoption
• Providing the expertise and services needed to negotiate the move to the next stage
• Developing the policy and tools to support the community developing and using research software
Supported by RCUK
17. The end of the beginning, not the beginning of the end!
• A holistic approach is required, with all parts of the e-infrastructure supported, from the Hard to the Soft to the Wet!
• Good start up investments need continuity to ensure impact
– Certain tools are foundations upon which large swathes of the community depend
– Putting tools next to immovable data ensures value!
• Integrating with larger activities ensures the benefits of scale
– you can’t steer something you’re not involved with…
• Abstract underpinning e-infrastructure services away from the users, as they're not interested!
– Something run on one resource should be movable to others through the use of standards etc.!
– I have ignored the institutional resources…
****WARNING****
Institute for Environmental Analytics Summer School on e-infrastructure for the environment
19th – 22nd Sept ’16, Oxford.
Speaker notes
Hosted through the Centre for Environmental Data Analysis (CEDA) at STFC, the JASMIN service[13] is intended to be a multi-user and multi-use-case facility able to support a wide range of user communities. The system is built in a modular fashion, with a hardware layer deliberately designed to support multiple different distributed computing paradigms, from HPC-type workloads to cloud hosting. Utilising commercial off-the-shelf private cloud software, which supports API-level access, the cloud service operates in two modes: first, a highly controlled Platform as a Service system in the managed cloud, and second, a more flexible, community- or user-controlled Infrastructure as a Service system, the unmanaged cloud.
Users may opt to run in the managed mode, where VMs are standardised to JASMIN approved templates and configured by Puppet[14] scripts that apply centralised access controls, or in the unmanaged mode which we use for the EOS Cloud. Here each user can have full root access rights on their VM once it is deployed and handed over to them, and we as clients of JASMIN can make use of the full vCloud Director API[15] to assign resources to VMs according to our own policy.
A further advantage of working on JASMIN is that, as a community cloud, the service already hosts and synchronises data from relevant providers. These include large reference datasets which our target user community finds useful. Users can rely on this resource and only need to concern themselves with the transfer of their own data into and out of the cloud.
Bio-Linux[1] is a major success with widespread adoption across multiple communities. It provides a method by which personal bioinformatics workstations can be set up and maintained by novice users, removing the need for individuals to compile and install packages themselves. The project has an active user and developer community and engages with other open-source community efforts like Debian Med[2], Galaxy[3], Open Bioinformatics Foundation[4], etc.
Workstations running Bio-Linux connect to a package repository where hundreds of tools have been built into Debian (.deb) packages and are tested and continually updated by the community of maintainers. Installation or removal of even complex tools is trivial for the end user, and updates are automatic.
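To make the packaging model above concrete, here is a minimal sketch of what a Debian (.deb) package looks like from the maintainer's side. The package name `demo-tool` and its metadata are invented for illustration; real Bio-Linux packages carry full dependency and maintainer information and are served from the project repository.

```shell
#!/bin/sh
# Sketch: build a trivial .deb package and read back its metadata.
# "demo-tool" is a hypothetical package, not a real Bio-Linux tool.
set -e
mkdir -p demo-tool/DEBIAN
cat > demo-tool/DEBIAN/control <<'EOF'
Package: demo-tool
Version: 1.0
Architecture: all
Maintainer: Example Maintainer <demo@example.org>
Description: Placeholder package illustrating the .deb format
EOF
dpkg-deb --build demo-tool          # produces demo-tool.deb
dpkg-deb --field demo-tool.deb Package
```

Because every tool is wrapped this way, end users never see this step: the package manager resolves dependencies, installs files into standard locations, and pulls updates automatically.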
Bio-Linux has traditionally been installed onto bare metal, possibly on a dual-boot workstation, or run directly from a USB stick for training and demo purposes. In recent times, use on local Virtual Machines (VMs) and cloud systems has increased, with a community CloudBioLinux[5] effort building and releasing machine images on the Amazon EC2[6] public cloud. Also, partly through this current project, support for local VM installations (on VirtualBox[7], Parallels[8], VMWare[9]) has been made a core part of each Bio-Linux release. Users can obtain an Open Virtualization Archive (OVA)[10] file which loads directly into these products to instantly provide a functional desktop system. It is this same pre-packaged virtualised workstation that users now access on the EOS Cloud.
Docker[11] allows a uniform, platform-independent application to be built and encapsulated in a format known as a container. As the research community, and people in general, use a large number of different operating systems, the complexity of installing applications on new systems can be high. Sometimes there may be incompatibilities between the libraries and helper tools needed by an application and those on the target platform.
The Docker approach aims to get round this by providing a single application hosting platform, irrespective of the host system or operating system. We have recently seen within the bioinformatics community an increase in the sharing of applications and tools in this manner.
One problem faced when using the Docker hosting environment is that the applications installed and running within it look and feel like remotely hosted applications; they do not behave as if they had been installed on the user's machine. A tool that bridges this usability gap would therefore greatly increase the acceptability of tools and services installed and distributed using Docker.
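One common way to bridge the usability gap just described is a thin shell wrapper that mounts the current directory into the container, so a containerized tool behaves like a local command. This is a generic sketch, not the project's actual bridging tool; the image name `biolinux/emboss` and the `DOCKER` override are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: make a containerized tool feel locally installed by wrapping
# "docker run" in a function that mounts the working directory.
# The image name "biolinux/emboss" is hypothetical.
run_in_container() {
    # DOCKER defaults to the real docker binary; unquoted on purpose so a
    # multi-word stub (e.g. "echo docker") can be substituted for testing.
    ${DOCKER:-docker} run --rm -v "$PWD":/data -w /data biolinux/emboss "$@"
}

# Real usage would look like:
#   run_in_container seqret -sequence input.fa -outseq out.fa
# Here we substitute an echo stub so the sketch runs without a Docker daemon:
DOCKER="echo docker"
run_in_container seqret -sequence input.fa
```

The user types `run_in_container seqret …` in their own directory and gets results back as local files, which is exactly the "installed on my machine" feel the text identifies as missing.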
Hyun Soon Gweon and Katja Lehmann, two bioinformaticians at CEH, are familiar with using Bio-Linux on the departmental server and on personal VirtualBox VMs. They are trialling the system as part of their regular work and to provide a shared computing environment with project collaborators in microbial community profiling.
Ranny van Aerle at University of Exeter is unable to run a Bio-Linux workstation at his home institute due to local IT policies and budget constraints. He needs a basic workstation plus occasionally some extra computing power. While he is not at present being charged for use of the pilot system, he represents exactly the type of user for whom a paid subscription to the service would make economic sense. Since being given access to a Bio-Linux instance on EOS Cloud in October 2014 he has used it to assemble bacterial genomes, annotate large metagenomics datasets using Blast and Diamond, visualise Blast/Diamond results in MEGAN and generally to try out new software and scripts.
The Software Sustainability Institute can help with: software reviews and refactoring, collaborations to develop your project, guidance and best practice on software development, project management, community building, publicity and more…
Drawing on a pool of specialists to drive the continued improvement and impact of research software developed by and for researchers
Providing services for research software users and developers
Developing research community interactions and capacity
Promoting research software best practice and capability
KEY POINT: Get across what the Institute does
This is a conceptual breakdown to make the Institute’s activities easier to manage. In reality the Institute staff work across themes, contributing to many activities, as software sustainability cannot be addressed just through individual activities working alone.
Specific activities:
Software Carpentry courses teach researchers key computational skills
Helping to spin up ELIXIR-UK training activities
DIRAC Drivers License to get best use out of HPC resources
Advising Wellcome Trust on software policy
Guides on wide range of subjects
We understand the linkage between all of these requirements, issues and activities because we have investigated the research community's reliance on software in depth.