Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned

Best Practices & Lessons Learned
Life Science Informatics & The Cloud
Tuesday, May 28, 13

I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
Twitter: @chris_dag

Who, what & why
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced to
learn IT, SW & HPC to get our
own research done
‣ 12+ years bridging the “gap”
between science, IT & high
performance computing
‣ www.bioteam.net

Seriously.
Listen to me at your own risk
‣ Clever people find multiple
solutions to common issues
‣ I’m fairly blunt, burnt-out and
cynical in my advanced age
‣ Significant portion of my work
has been done in demanding
production Biotech & Pharma
environments
‣ Filter my words accordingly

Other 2013 Presentations ...
Bio-IT World Boston

Bio-IT World Boston: “Multi-Tenant Research Clusters”
http://slideshare.net/chrisdag/

Bio-IT World Boston: “HPC Trends from the trenches.”

1. Intro
2. Meta: Why Cloud?
3. What the sales & marketing folks won’t tell you
4. Getting Practical
5. HPC Case Study

The big picture
Why we need IaaS clouds ...

Why life science needs infrastructure clouds
Big Picture
‣ HUGE revolution in the rate at which lab platforms are
being redesigned, improved & refreshed
• Example: CCD sensor upgrade on that confocal
microscopy rig just doubled your storage requirements
• Example: That 2D ultrasound imager is now a 3D imager
• Example: Illumina HiSeq upgrade just doubled the rate at
which you can acquire genomes. Massive downstream
increase in storage, compute & data movement needs

The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR FASTER
than we can refresh our Research-IT & Scientific
Computing infrastructure
• The science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7
years
‣ We have to design systems TODAY that can support
unknown research requirements & workflows over many
years (gulp ...)

The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago you could toss inexpensive storage and
servers at the problem; even in a nearby closet or under
a lab bench if necessary
‣ That does not work any more; real solutions required

And a related problem ...
‣ It has never been easier or cheaper to acquire vast
amounts of data
‣ Growth rate of data creation/ingest exceeds rate at
which the storage industry is improving disk capacity
‣ Not just a storage lifecycle problem. This data *moves*
and often needs to be shared among multiple entities
and providers
• ... ideally without punching holes in your firewall or
consuming all available internet bandwidth

If you get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Problems in recruiting, retention,
publication & product development

IaaS to the Rescue

IaaS solves the current critical “Research IT” dilemma
Why Cloud?
‣ IaaS clouds let us react and
respond to scientific
requirements that change far
faster than we can refresh
local datacenters and
enterprise IT platforms
Image: shanelin via Flickr

Beyond capability and agility gains ...
Why Cloud?
‣ The economic benefits are real, inescapable and
trending in the proper direction
‣ Internet-scale providers with millions of cores and
exabytes of spinning disk spanning the globe
leverage operational efficiencies you will never come
close to matching internally
‣ ... be suspicious of people who claim otherwise

Also ...
Why Cloud?
‣ Clouds becoming a natural
place for data exchange &
access
‣ “scriptable everything”
enables entirely new
capabilities not possible
internally*
‣ Finance people love converting
CapEx to OpEx


What the salesfolk won’t tell you ...
‣ There is no one-size-fits-all research
design pattern ...
‣ You are not going to toss everything
and replace it with “Big Data”
‣ Very few of us have a single pipeline or
workﬂow that we can devote endless
engineering effort to
‣ We are not going to toss out hundreds
of legacy codes and rewrite everything
for GPUs or MapReduce
‣ For research HPC it’s all about the
building blocks { and how we can
effectively use/deploy them }

What the salesfolk won’t tell you
‣ Your organization actually needs THREE tested cloud
design patterns:
‣ (1) To handle ‘legacy’ scientific apps & workflows
‣ (2) The special stuff that is worth re-architecting
‣ (3) Hadoop & big data analytics

Legacy HPC on the Cloud
Design Pattern #1 - Legacy
‣ There are many hundreds of
existing algorithms and
applications in the life science
informatics space
‣ We’ll be running/using these
codes for years to come
‣ Many can’t or will never be
refactored or rewritten
‣ I call this the “legacy” design
pattern

One Easy Solution.

StarCluster
Design Pattern #1 - Legacy
‣ MIT StarCluster
• http://web.mit.edu/star/cluster/
‣ Infinite Awesomeness. Worth a talk by itself.
‣ This is your baseline
‣ Extend as needed

Design Pattern #2 - “Cloudy”
‣ Some of our research workflows are important enough to
be rewritten for “the cloud” and the advantages that a
truly elastic & API-driven infrastructure can deliver
‣ This is where you have the most freedom
‣ Many published best practices you can borrow
‣ Warning: Cloud vendor lock-in potential is strongest here

Design Pattern #3 - Hadoop/BigData
‣ Hadoop and “big data” need to be on your radar
‣ Be careful though, you’ll need a gas mask to avoid the
smog of marketing and vapid hype
‣ The utility is real and this does represent one “future
path” for analysis of large data sets

‣ It’s going to be a MapReduce world, get used to it
‣ Little need to roll your own Hadoop in 2013
‣ ISV & commercial ecosystem already healthy
‣ Multiple providers today; both onsite & cloud-based
‣ Often a slam-dunk cloud use case

What you need to know
‣ “Hadoop” and “Big Data” are now general terms
‣ You need to drill down to find out what people actually
mean
‣ We are still in the period where senior leadership may
demand “Hadoop” or “BigData” capability without any
actual business or scientific need

Hadoop & “Big Data”
‣ In broad terms you can break “Big Data” down into two very
basic use cases:
1. Compute: Hadoop can be used as a very powerful platform for
the analysis of very large data sets. The Google search term
here is “map reduce”
2. Data Stores: Hadoop is driving the development of very
sophisticated “no-SQL” “non-Relational” databases and data
query engines. The Google search terms include “nosql”,
“couchdb”, “hive”, “pig” & “mongodb”, etc.
‣ Your job is to figure out which type applies for the groups
requesting “Hadoop” or “BigData” capability
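
For anyone who has not seen the "map reduce" pattern behind the compute use case, here is a minimal single-process sketch. It is not Hadoop and uses no Hadoop API; the function names, the k-mer counting task and the sample reads are all made up for illustration. It only shows the three phases (map, shuffle, reduce) that a real framework would spread across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce phase: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical toy input; a real job would read from a distributed filesystem.
reads = ["ACGTAC", "GTACGT"]
mapped = chain.from_iterable(map_kmers(read) for read in reads)
kmer_counts = reduce_counts(shuffle(mapped))
print(kmer_counts)
```

Each read is mapped independently, which is why this style of analysis parallelizes well when the compute sits next to the data.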

Hadoop vs traditional Linux Clusters
High Throughput Science
‣ Hadoop is a very complex beast
‣ It’s also the way of the future so you can’t ignore it
‣ Very tight dependency on moving the ‘compute’ as close
as possible to the ‘data’
‣ Hadoop clusters are just different enough that they do
not integrate cleanly with traditional Linux HPC systems
‣ Often treated as separate silo or punted to the cloud

Hadoop & “Big Data”
‣ Hadoop is being driven by a small group of academics
writing and releasing open source life science Hadoop
applications
‣ Your people will want to run these codes
‣ In some academic environments you may find people
wanting to develop on this platform


Strategy
Practical Advice
‣ Research oriented IT organizations need a cloud strategy
today; or risk being bypassed by employees

Design Patterns
Practical Advice
‣ Remember the three design patterns on the cloud:
• Legacy HPC systems
(replicate traditional clusters in the cloud)
• Hadoop
• Cloudy
(when you rewrite something to fully leverage cloud
capability)

Policies and Procedures
Practical Advice
‣ Cloud technology bits are easy. Cloud Process and Policy
discussions take forever
‣ Start these conversations sooner rather than later!

Core services that take time and advance planning
Practical Advice
‣ A few key foundational cloud services take time and
advance planning to deploy properly:
‣ VPNs & subnet schemes
‣ Identity Management & Access Control
‣ Data Movement

Data Movement
Practical Advice
‣ A few words & pictures on data movement ...

Physical data movement station 1

Physical data movement station 2

“Naked” Data Movement

“Naked” Data Archive

Cloud Data Movement
‣ Things changed pretty definitively in 2012
‣ And the next image shows why ...

March 2012

Network vs. Physical
Cloud Data Movement
‣ With a 1GbE internet connection ...
‣ and using Aspera software ....
‣ We sustained 700 Mb/sec for more than 7 hours
freighting genomes into Amazon Web Services
‣ This is fast enough for many use cases, including
genome sequencing core facilities*
‣ Chris Dwan’s webinar on this topic:
http://biote.am/7e
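
A back-of-the-envelope check on what that transfer amounts to. This sketch assumes the sustained rate is 700 megabits per second (a 1GbE link cannot carry more than 1000 megabits per second) and uses the 7-hour floor from the slide; the variable names are illustrative only.

```python
LINK_MBIT_PER_S = 1000       # nominal 1GbE line rate
SUSTAINED_MBIT_PER_S = 700   # assumed unit: megabits per second
HOURS = 7                    # "more than 7 hours", so this is a lower bound

seconds = HOURS * 3600
bytes_moved = SUSTAINED_MBIT_PER_S * 1_000_000 / 8 * seconds
terabytes_moved = bytes_moved / 1e12
link_utilization = SUSTAINED_MBIT_PER_S / LINK_MBIT_PER_S

print(f"{terabytes_moved:.3f} TB moved at {link_utilization:.0%} of line rate")
```

Under those assumptions the run moved a bit over 2 TB in a single working day without anyone touching a disk.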

Network vs. Physical
Cloud Data Movement
‣ Results like this mean we now favor network-based data
movement over physical media movement
‣ Large-scale physical data movement carries a high
operational burden and consumes non-trivial staff time &
resources

There are three ways to do network data movement ...
Cloud Data Movement
‣ Buy software from Aspera and be done with it
‣ Attend the annual SuperComputing conference & see
which student group wins the bandwidth challenge
contest; use their code
‣ Get GridFTP from the Globus folks

SysAdmin vs Programmer
Practical Advice
‣ Recognize the blurring line between
IT / Informatics / SW Engineer
‣ ... and how it may mix up your org chart

Very blurry lines in 2013 for all of these roles
Scientist/SysAdmin/Programmer
‣ Radical change in last ~2 years
for how IT is provisioned,
delivered, managed & supported
‣ Root cause (Technology)
Virtualization & Cloud
‣ Root Cause (Operations)
Configuration Mgmt, Systems
Orchestration & Infrastructure
Automation
‣ SysAdmins & IT staff need to
re-skill and retrain to stay relevant

‣ When everything has an API ...
‣ ... anything can be
‘orchestrated’ or ‘automated’
remotely
‣ And by the way ...
‣ The APIs (‘knobs & buttons’)
are accessible to all

‣ IT jobs, roles and
responsibilities are
undergoing rapid
upheaval
‣ SysAdmins must learn to
program in order to
harness automation tools
‣ Programmers & Scientists
can now self-provision
and control sophisticated
IT resources

‣ My take on the future ...
‣ Far more control is going into the
hands of the research end user
‣ IT support roles will radically
change -- no longer owners or
gatekeepers
‣ IT will handle policies,
procedures, reference patterns,
security & best practices
‣ Researchers will control the
“what”, “when” and “how big”

Thanks! Email: chris@bioteam.net

Cloud HPC Case Study
Time Permitting ...

Next Generation Nuclear Magnetic Resonance
NMR Probehead Simulation on AWS
‣ CAE Simulation Project
‣ via www.hpcexperiment.com
‣ Software: CST Studio 2012
‣ My role: Volunteer HPC Mentor

Simulating next-generation NMR probeheads
Why this was an interesting project
‣ Frontend interface is graphics
heavy and requires Windows
‣ Studio ‘solvers’ run Linux or
Windows; support GPUs and MPI
task distribution
‣ Simultaneous use of local and
cloud-based solvers actually works
‣ FLEXlm license server involved
‣ Non-trivial security and
geo-location requirements

When we ran at modest scale ...
16 large compute nodes + 22 GPU nodes
$30/hour on AWS Spot Market.
HPC on the cloud is real.
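
To put that hourly figure in per-node terms, a quick sketch of the arithmetic. The node counts and the $30/hour total come from the slide; the per-node number is only a blended average across both node types (Spot prices differ by instance type and fluctuate), and `run_cost` is a hypothetical helper name.

```python
COMPUTE_NODES = 16
GPU_NODES = 22
HOURLY_COST_USD = 30.0  # total Spot Market spend per hour, from the slide

total_nodes = COMPUTE_NODES + GPU_NODES
# Blended average only: it hides the price gap between node types.
cost_per_node_hour = HOURLY_COST_USD / total_nodes


def run_cost(hours):
    """Total spend for a run of the given length at the quoted hourly rate."""
    return HOURLY_COST_USD * hours


print(f"{total_nodes} nodes, about ${cost_per_node_hour:.2f} per node-hour")
print(f"24-hour run: ${run_cost(24):.2f}")
```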

Design Attempt #1
‣ Hybrid Linux/Windows cloud running in AWS EU Region
‣ Failure:
• No GPU nodes in EU at the time
• No cc2.4xlarge at the time

Design Attempt #2
‣ Move Hybrid Linux/Windows system to US-EAST
‣ ... with synthetic test data
‣ Best-practices VPC isolation & VPN access
‣ It looked like this ...

Architecture #2

Design Attempt #2
‣ Attempt #2 Failed:
‣ CST FrontEnd Controller running at end-user site could
not tolerate the NAT used by solvers
‣ No GPU nodes available within VPC at that time

Design Attempt #3
‣ Design #3 Finally works
‣ VPC shrunk to single license server running in US EAST
‣ All Windows/Linux/GPU solver nodes running in EU
‣ NO NAT, NO VPC For Solvers
‣ Extensive use of AWS spot instance servers

At experiment end it looked like this ...

Non Trivial HPC on the Cloud
16 large compute nodes + 22 GPU nodes
$30/hour on AWS Spot Market.

Why this work was ‘easy’ on Amazon AWS ...
Nightmare on any other cloud
‣ Let’s discuss why this simulation workload would be
much, much harder to do on some other cloud
platform ...

Nightmare on any other cloud
‘Brand X’ Cloud:
1. Virtual Servers
2. Block Storage
3. Object Storage
4. ... and maybe some other stuff if I’m lucky
AWS:
‣ EC2, S3, EBS, RDS, SNS, SQS, SWS, GPUs, SSDs, CloudFormation, VPC, ENIs, SecurityGroups, 10GbE DirectConnect, Reserved Instances, ImportExport, Spot Market
‣ And ~25 other products and service features with more added monthly

Easy on AWS; much harder elsewhere
One very specific example
‣ The widely used FLEXlm
license server uses NIC
MAC addresses when
generating license keys
‣ Different MAC? Science
stops. Screwed.
‣ VPC ENIs decouple the MAC
address from the server instance;
the MAC moves with the ENI.
Badass.
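
A toy model of why a MAC-locked license breaks when the server changes, and why carrying the MAC along on an ENI fixes it. This is not FLEXlm's actual key format or algorithm; the HMAC scheme, the secret, the MAC addresses and the function names are all invented for illustration.

```python
import hashlib
import hmac

VENDOR_SECRET = b"hypothetical-vendor-secret"  # stand-in, not a real FLEXlm key


def issue_license(feature, host_mac):
    """Toy node-locked key: a signature over the feature name and host MAC."""
    message = f"{feature}|{host_mac.lower()}".encode()
    return hmac.new(VENDOR_SECRET, message, hashlib.sha256).hexdigest()


def license_is_valid(feature, current_mac, key):
    """Re-derive the key from the MAC the server sees now and compare."""
    return hmac.compare_digest(issue_license(feature, current_mac), key)


ENI_MAC = "0a:1b:2c:3d:4e:5f"  # made-up MAC carried by the ENI
key = issue_license("solver", ENI_MAC)

same_host = license_is_valid("solver", ENI_MAC, key)
# Replacement server with a fresh NIC: different MAC, license check fails.
replacement_new_nic = license_is_valid("solver", "0a:ff:ee:dd:cc:bb", key)
# Replacement server with the original ENI attached: same MAC, check passes.
replacement_with_eni = license_is_valid("solver", ENI_MAC.upper(), key)

print(same_host, replacement_new_nic, replacement_with_eni)
```

The practical upshot is that the license server can be rebuilt or replaced without asking the vendor for a new key, as long as the same ENI is attached to the new instance.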

A few other examples ...
‣ VPC: Incredibly powerful. Actually useful. Approachable even if you are not an IPSEC or BGP routing god.
‣ Spot Market: Compelling economics. Once you start you’ll likely never run anywhere else. The competition can’t compete.
‣ cc* & cg* EC2 instance types: Fat nodes with bidirectional 10GbE bandwidth. And don’t get me started on SSD or Provisioned-performance EBS volumes.

Thanks!
Email: chris@Bioteam.net
