This is a massive slide deck I used as the starting point for a 1.5-hour talk at the 2012 www.nerlscd.org conference. It is a mixture of old and (some) new slides from my usual material.
Bio-IT for Core Facility Managers
1. Bio-IT For Core Facility Leaders
Tips, Tricks & Trends
2012 NERLSCD Meeting - www.nerlscd.org
2. Intro 1
Meta-Issues (The Big Picture) 2
Infrastructure Tour 3
Compute & HPC 4
Storage 5
Cloud & Big Data 6
3. I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
@chris_dag
4. BioTeam
Who, what & why
‣ Independent consulting shop
‣ Staffed by scientists forced to
learn IT, SW & HPC to get our
own research done
‣ 12+ years bridging the “gap”
between science, IT & high
performance computing
‣ www.bioteam.net
5. Listen to me at your own risk
Seriously.
‣ Clever people find multiple
solutions to common issues
‣ I’m fairly blunt, burnt-out and
cynical in my advanced age
‣ Significant portion of my work
has been done in demanding
production Biotech & Pharma
environments
‣ Filter my words accordingly
6. Intro 1
Meta-Issues (The Big Picture) 2
Infrastructure Tour 3
Compute & HPC 4
Storage 5
Cloud & Big Data 6
7. Meta-Issues
Why you need to track this stuff ...
8. Big Picture
Why this stuff matters ...
‣ HUGE revolution in the rate at which lab instruments are
being redesigned, improved & refreshed
• Example: CCD sensor upgrade on that confocal
microscopy rig just doubled your storage requirements
• Example: That 2D ultrasound imager is now a 3D imager
• Example: Illumina HiSeq upgrade just doubled the rate at
which you can acquire genomes. Massive downstream
increase in storage, compute & data movement needs
9. The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR FASTER
than we can refresh our Research-IT & Scientific
Computing infrastructure
• The science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7
years
‣ We have to design systems TODAY that can support
unknown research requirements & workflows over many
years (gulp ...)
10. The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago you could toss inexpensive storage and
servers at the problem; even in a nearby closet or under
a lab bench if necessary
‣ That does not work any more; IT needs are too extreme
‣ 1000-CPU Linux clusters and petascale storage are the
new normal; try fitting THAT in a closet!
11. The Take Home Lesson
What core facility leadership needs to understand
‣ The incredible rate of cost decreases & capability gains
seen in the lab instrumentation space is not mirrored
everywhere
‣ As gear gets cheaper/faster, scientists will simply do
more work and ask more questions. Nobody simply
banks the financial savings when an instrument gets
50% cheaper -- they just buy two of them!
‣ IT technology is not improving at the same rate; we also
can’t change our IT infrastructures all that rapidly
12. If you get it wrong ...
‣ Lost opportunity
‣ Frustrated & very vocal researchers
‣ Problems in recruiting
‣ Publication problems
13. Intro 1
Meta-Issues (The Big Picture) 2
Infrastructure Tour 3
Compute & HPC 4
Storage 5
Cloud & Big Data 6
14. Infrastructure Tour
What does this stuff look like?
39. Real world screenshot from earlier this month
16 monster compute nodes + 22 GPU nodes
Cost? 30 bucks an hour via AWS Spot Market
Yep. This counts.
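For a sense of how that works mechanically, here is a minimal sketch (my addition, not from the deck) of bidding for spot capacity with the boto library of that era; the region, AMI ID, bid price and instance type are placeholders.

    import boto.ec2

    # Connect to EC2 (region is a placeholder).
    conn = boto.ec2.connect_to_region("us-east-1")

    # Peek at recent spot pricing before deciding on a bid.
    history = conn.get_spot_price_history(instance_type="cc2.8xlarge",
                                          product_description="Linux/UNIX")
    print(history[:5])

    # Bid for 16 compute nodes; you pay the fluctuating market rate
    # for as long as it stays under your maximum bid.
    requests = conn.request_spot_instances(price="0.50",            # max bid, USD/hour
                                           image_id="ami-xxxxxxxx", # placeholder AMI
                                           count=16,
                                           instance_type="cc2.8xlarge",
                                           key_name="my-keypair")

The instances vanish if the market price climbs past your bid, which is why spot capacity suits batch pipelines rather than interactive services.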
46. Intro 1
Meta-Issues (The Big Picture) 2
Infrastructure Tour 3
Compute & HPC 4
Storage 5
Cloud & Big Data 6
47. Compute
Actually the easy bit ...
48. Compute Power
Not a big deal in 2012 ...
‣ Compute power is largely a solved problem
‣ It’s just a commodity
‣ Cheap, simple & very easy to acquire
‣ Let's talk about what you need to know ...
49. Compute Trends
Things you should be tracking ...
‣ Facility Issues
‣ “Fat Nodes” replacing Linux Clusters
‣ Increasing presence of serious “lab-local” IT
50. Facility Stuff
‣ Compute & storage
requirements are getting
larger and larger
‣ We are packing more “stuff”
into smaller spaces
‣ This increases (radically)
electrical and cooling
requirements
51. Facility Stuff - Core issue
‣ Facility & power issues can
take many months or years to
address
‣ Sometimes it may be
impossible to address (new
building required ...)
‣ If your research IT footprint is
growing fast, you must be well
versed in your facility
planning/upgrade process
52. Facility Stuff - One more thing
‣ Sometimes central IT will begin
facility upgrade efforts without
consulting with research users
• This was the reason behind one of
our more ‘interesting’ projects in
2012
‣ ... a client was weeks away from
signing off on a $MM datacenter
which would not have had enough
electricity to support current
research & faculty recruiting
commitments
54. Fat Nodes - 1 box replacing a cluster
‣ This server has 64 CPU Cores
‣ .. and up to 1TB of RAM
‣ Fantastic Genomics/Chemistry
system
• A 256GB RAM version only
costs $13,000
‣ These single systems are
replacing small clusters in
some environments
55. Fat Nodes - Clever Scale-out Packaging
‣ This 2U chassis contains 4
individual servers
‣ Systems like this get near
“blade” density without
the price premium seen
with proprietary blade
packaging
‣ These “shrink” clusters in
a major way or replace
small ones
57. “Serious” IT now in your wet lab ...
‣ Instruments used to ship with a
Windows PC “instrument
control workstation”
‣ As instruments get more
powerful the “companion”
hardware is starting to scale-up
‣ End result: very significant stuff
that used to live in your
datacenter is now being rolled
into lab environments
58. “Serious” IT now in your wet lab ...
‣ You may be surprised what
you find in your labs in ’12
‣ ... can be problematic for a
few reasons ...
1. IT support & backup
2. Power & cooling
3. Noise
4. Security
59. Networking
Also not particularly worrisome ...
60. Networking
‣ Networking is also not super complicated
‣ It’s also fairly cheap & commoditized in ’12
‣ There are three core uses for networks:
1. Communication between servers & services
2. Message passing within a single application
3. Sharing files and data between many clients
61. Networking 1 - Servers & Services
‣ Ethernet. Period. Enough said.
‣ Your only decision is between 10-Gig and 1-Gig ethernet
‣ 1-Gig Ethernet is pervasive and dirt cheap
‣ 10-Gig Ethernet is getting cheaper and on its way to
becoming pervasive
62. Networking 1 - Ethernet
‣ Everything speaks ethernet
‣ 1-Gig is still the common interconnect for most things
‣ 10-Gig is the standard now for the “core”
‣ 10-Gig is the standard for top-of-rack and “aggregation”
‣ 10-Gig connections to “special” servers is the norm
63. Networking 2 - Message Passing
‣ Parallel applications can span many servers at once
‣ Communicate/coordinate via “message passing”
‣ Ethernet is fine for this but has a somewhat high latency
between message packets
‣ Many apps can tolerate Ethernet-level latency; some
applications clearly benefit from a message passing
network with lower latency
‣ There used to be many competing alternatives
‣ Clear 2012 winner is “Infiniband”
64. Networking 2 - Message Passing
‣ The only things you need to know ...
‣ Infiniband is an expensive networking alternative that
offers much lower latency than Ethernet
‣ You would only pay for and deploy an IB fabric if you had
an application or use case that requires it.
‣ No big deal. It’s just “another” network.
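To make "message passing" concrete, here is a minimal mpi4py sketch (my illustration, not from the deck) of two processes exchanging a message; the identical code runs over Ethernet or Infiniband, the fabric only changes the latency.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's ID within the parallel job

    if rank == 0:
        data = {"sample": "NA12878", "chunk": 42}
        comm.send(data, dest=1, tag=11)      # pass a message to rank 1
    elif rank == 1:
        data = comm.recv(source=0, tag=11)   # block until the message arrives
        print("rank 1 received:", data)

Launched with something like "mpirun -np 2 python hello_mpi.py"; which interconnect carries the bytes is decided by the MPI library and cluster configuration, not the application code.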
65. Networking 3 - File Sharing
‣ For ‘Omics this is the primary focus area
‣ Overwhelming need for shared read/write access to files
and data between instruments, HPC environment and
researcher desktops
‣ In HPC environments you will often have a separate
network just for file sharing traffic
66. Networking 3 - File Sharing
‣ Generic file sharing uses familiar NFS or Windows fileshare
protocols. No big deal
‣ Always implemented over Ethernet although often a mixture
of 10-Gig and 1-Gig connections
• 10-Gig connections to the file servers, storage and edge switches;
1-gig connections to cluster nodes and user desktops
‣ Infiniband also has a presence here
• Many “parallel” or “cluster” filesystems may talk to the clients
via NFS-over-ethernet but internally the distributed components
may use a private Infiniband network for metadata and
coordination.
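As a concrete (hypothetical) illustration of the generic NFS case, the server export and client mount boil down to a couple of configuration lines; the hostnames, paths and tuning values below are placeholders, not recommendations.

    # On the file server: /etc/exports
    /export/genomics   10.10.0.0/16(rw,async,no_subtree_check)

    # On a cluster node or desktop: /etc/fstab
    storage01:/export/genomics  /mnt/genomics  nfs  rw,hard,intr,rsize=32768,wsize=32768  0 0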
67. Storage.
(the hard bit ...)
68. Storage
Setting the stage ...
‣ Life science is generating torrents of data
‣ Size and volume often dwarf all other research areas -
particularly with Bioinformatics & Genomics work
‣ Big/Fast storage is not cheap and is not commodity
‣ There are many vendors and many ways to spectacularly
waste tons of money
‣ And we still have an overwhelming need for storage that
can be shared concurrently between many different
users, systems and clients
69. Life Science “Data Deluge”
‣ Scare stories and shocking graphs getting tiresome
‣ We’ve been dealing with terabyte-scale lab instruments
& data movement issues since 2004
• And somehow we’ve managed to survive ...
‣ Next few slides
• Try to explain why storage does not stress me out all that
much in 2012 ...
70. The sky is not falling.
1. You are not the Broad Institute or Sanger Center
‣ Overwhelming majority of us do not operate at Broad/
Sanger levels
• These folks add 200+ TB a week in primary storage
‣ We still face challenges but the scale/scope is well
within the bounds of what traditional IT technologies can
handle
‣ We’ve been doing this for years
• Many vendors, best practices, “war stories”, proven methods
and just plain “people to talk to…”
71. The sky is not falling.
2. Instrument Sanity Beckons
‣ Yesteryear: Terascale .TIFF Tsunami
‣ Yesterday: RTA, in-instrument data reduction
‣ Today: Basecalls, BAMs & Outsourcing
‣ Tomorrow: Write directly to the cloud
72. The sky is not falling.
3. Peta-scale storage is not really exotic or unusual any more.
‣ Peta-scale storage has not been a risky exotic technology
gamble for years now
• A few years ago you’d be betting your career
‣ Today it’s just an engineering & budget exercise
• Multiple vendors don’t find petascale requirements particularly
troublesome and can deliver proven systems within weeks
• $1M (or less in ’12) will get you 1PB from several top vendors
‣ However, still HARD to do BIG, FAST & SAFE
• Hard but solvable; many resources & solutions out there
73. On the other hand ...
74. OMG! The Sky Is Falling!
Maybe a little panic is appropriate ...
75. The sky IS falling!
1. Those @!*#&^@ Scientists ...
‣ As instrument output declines …
‣ Downstream storage consumption by
end-user researchers is increasing
rapidly
‣ Each new genome generates new
data mashups, experiments, data
interchange conversions, etc.
‣ MUCH harder to do capacity planning
against human beings vs.
instruments
76. The sky IS falling!
2. @!*#&^@ Scientific Leadership ...
‣ Sequencing is already a
commodity
‣ NOBODY simply banks the
savings
‣ EVERYBODY buys or does
more
77. The sky IS falling!
Gigabases vs. Moore's Law
OMG!!
BIG SCARY GRAPH
[Chart: gigabases of sequence produced vs. Moore's Law, 2007-2012]
78. The sky IS falling!
3. Uncomfortable truths
‣ Cost of acquiring data (genomes)
falling faster than rate at which
industry is increasing drive capacity
‣ Human researchers downstream of
these datasets are also consuming
more storage (and less predictably)
‣ High-scale labs must react or
potentially have catastrophic issues
in 2012-2013
79. The sky IS falling!
5. Something will have to break ...
‣ This is not sustainable
• Downstream consumption
exceeding instrument data
reduction
• Commoditization yielding
more platforms
• Chemistry moving faster
than IT infrastructure
• What the heck are we
doing with all this
sequence?
81. The sky IS falling!
CRAM it in 2012 ...
‣ Minor improvements are useless; order-of-magnitude needed
‣ Some people are talking about radical new methods –
compressing against reference sequences and only storing the
diffs
• With a variable compression “quality budget” to spend on
lossless techniques in the areas you care about
‣ http://biote.am/5v - Ewan Birney on “Compressing DNA”
‣ http://biote.am/5w - The actual CRAM paper
‣ If CRAM takes off, storage landscape will change
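The core idea behind reference-based compression is simple enough to sketch in a few lines of toy Python (purely illustrative; real CRAM also handles quality scores, read pairing and lossy quality budgets):

    # Toy illustration: store only the positions where a read differs
    # from the reference instead of storing the read itself.
    reference = "ACGTACGTACGTACGT"
    read      = "ACGTACGAACGTACGT"   # one mismatch

    diffs = [(i, b) for i, (r, b) in enumerate(zip(reference, read)) if r != b]
    print(diffs)   # [(7, 'A')] -- a few bytes instead of the whole read

    # Reconstruction is just the reference plus the stored differences.
    restored = list(reference)
    for pos, base in diffs:
        restored[pos] = base
    assert "".join(restored) == read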
82. What comes next?
Next 18 months will be really fun...
83. What comes next.
The same rules apply for 2012 and beyond ...
‣ Accept that science changes faster than IT infrastructure
‣ Be glad you are not Broad/Sanger
‣ Flexibility, scalability and agility become the key
requirements of research informatics platforms
• Tiered storage is in your future ...
‣ Shared/concurrent access is still the overwhelming
storage use case
• We’ll still continue to use clustered, parallel and scale-out
NAS solutions
84. What comes next.
In the following year ...
‣ Many peta-scale capable systems deployed
• Most will operate in the hundreds-of-TBs range
‣ Far more aggressive “data triage”
• “.BAM only!”
‣ Genome compression via CRAM
‣ Even more data will sit untouched & unloved
‣ Growing need for tiers, HSM & even tape
85. What comes next.
In the following year ...
‣ Broad, Sanger and others will pave the way with respect
to metadata-aware & policy driven storage frameworks
• And we’ll shamelessly copy a year or two later
‣ I’m still on my cloud storage kick
• Economics are inescapable; Will be built into storage
platforms, gateways & VMs
• Amazon S3 is only a HTTP RESTful call away
• Cloud will become “just another tier”
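"Only a HTTP RESTful call away" is not an exaggeration; with the boto library of the era, an upload to S3 is a handful of lines (bucket and object names below are placeholders):

    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    # Credentials come from the environment or ~/.boto
    conn = S3Connection()
    bucket = conn.create_bucket("my-lab-archive-tier")   # placeholder bucket name

    k = Key(bucket)
    k.key = "runs/2012-10/run042.bam"                    # placeholder object name
    k.set_contents_from_filename("run042.bam")           # the actual upload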
86. What comes next.
Expect your storage to be smarter & more capable ...
‣ What do DDN, Panasas, Isilon,
BlueArc, etc. have in common?
• Under the hood they all run
Unix or Unix-like OS’s on
x86_64 architectures
‣ Some storage arrays can
already run applications natively
• More will follow
• Likely a big trend for 2012
88. Still trying to avoid this.
(100TB of scientific data, no RAID, unsecured on lab benchtops)
89. Flops, Failures & Freakouts
Common storage mistakes ...
90. Flops, Failures & Freakouts
#1 - Unchecked Enterprise Storage Architects
‣ Scientist: “My work is priceless,
I must be able to access it at all times”
‣ Corporate/Enterprise Storage Guru:
“Hmmm …you want high availability, huh?”
‣ System delivered:
• 40TB Enterprise SAN
• Asynchronous replication to remote site
• Can’t scale, can’t do NFS easily
• ~$500K per year in operational & maintenance costs
91. Flops, Failures & Freakouts
#2 - Unchecked User Requirements
‣ Scientist:
“I do bioinformatics, I am rate limited by the speed of file
IO operations. Faster disk means faster science.”
‣ System delivered:
• Budget blown on top tier fastest-possible ‘Cadillac’ system
‣ Outcome:
• System fills to capacity in 9 months; zero budget left.
92. Flops, Failures & Freakouts
#3 - D.I.Y Cluster & Parallel Filesystems
‣ Common source of storage unhappiness
‣ Root cause:
• Not enough pre-sales time spent on design and engineering
• Choosing Open Source over Common Sense
‣ System as built:
• Not enough metadata controllers
• Issues with interconnect fabric
• Poor selection & configuration of key components
‣ End result:
• Poor performance or availability
• High administrative/operational burden
94. Flops, Failures & Freakouts
Hard Lessons Learned
‣ End-users are not precise with storage terms
• “Extremely reliable” means no data loss;
not millions spent on 99.99999% high availability
‣ When true costs are explained:
• Many research users will trade a small amount of uptime or
availability for more capacity or capabilities
• … will also often trade some level of performance in
exchange for a huge win in capacity or capability
95. Flops, Failures & Freakouts
Hard Lessons Learned
‣ End-users demand the world but are willing to
compromise
• Necessary for IT staff to really talk to them and understand
work, needs and priorities
• Also essential to explain true costs involved
‣ People demanding the “fastest” storage often don’t have
actual metrics to back their assertions
96. Flops, Failures & Freakouts
Hard Lessons Learned
‣ Software-based parallel or clustered file systems are
non-trivial to correctly implement
• Essential to involve experts in the initial design phase
• Even if using ‘open source’ version …
‣ Commercial support is essential
• And I say this as an open source zealot …
97. The road ahead
My $.02 for 2012...
98. The Road Ahead
Storage Trends & Tips for 2012
‣ Peta-capable platforms required
‣ Scale-out NAS still the best fit
‣ Customers will no longer build one
big scale-out NAS tier
‣ My ‘hack’ of using nearline spec
storage as primary science tier is
probably obsolete in ’12
‣ Not everything is worth backing up
‣ Expect disruptive stuff
99. The Road Ahead
Trends & Tips for 2012
‣ Monolithic tiers no longer cut it
• Changing science & instrument
output patterns are to blame
• We can’t get away with biasing
towards capacity over
performance any more
‣ pNFS should go mainstream in ’12
• { fantastic news }
‣ Tiered storage IS in your future
• Multiple vendors & types
100. The Road Ahead
Trends & Tips for 2012
‣ Your storage will be able to run apps
• Dedupe, cloud gateways &
replication
• ‘CRAM’ or similar compression
• Storage Resource Brokers
(iRODS) & metadata servers
• HDFS/Hadoop hooks?
• Lab, Data management & LIMS applications
[Photo: Drobo appliance running BioTeam MiniLIMS internally...]
101. The Road Ahead
Trends & Tips for 2012
‣ Hadoop / MapReduce / BigData
• Just like GRID and CLOUD back
in the day you’ll need a gas mask
to survive the smog of hype and
vendor press releases.
• You still need to think about it
• ... and have a roadmap for doing it
• Deep, deep ties to your storage
• Your users want/need it
• My $.02? Fantastic cloud use case
105. Intro 1
Meta-Issues (The Big Picture) 2
Infrastructure Tour 3
Compute & HPC 4
Storage 5
Cloud & Big Data 6
106. The ‘C’ word
Does a Bio-IT talk exist if it does not mention “the cloud”?
107. Defining the “C-word”
‣ Just like “Grid Computing” the “cloud” word has been
diluted to almost uselessness thanks to hype, vendor
FUD and lunatic marketing minions
‣ Helpful to define terms before talking seriously
‣ There are three types of cloud
‣ “IAAS”, “SAAS” & “PAAS”
108. Cloud Stuff
‣ Before I get nasty ...
‣ I am not an Amazon shill
‣ I am a jaded, cynical, zero-loyalty consumer of IT
services and products that let me get #%$^ done
‣ Because I only get paid when my #%$^ works, I am
picky about what tools I keep in my toolkit
‣ Amazon AWS is an infinitely cool tool
109. Cloud Stuff - SAAS
‣ SAAS = “Software as a Service”
‣ Think:
‣ gmail.com
110. Cloud Stuff - PAAS
‣ PAAS = “Platform as a Service”
‣ Think:
‣ https://basespace.illumina.com/
‣ salesforce.com
‣ MS office365.com, Apple iCloud, etc.
111. Cloud Stuff - IAAS
‣ IAAS = “Infrastructure as a Service”
‣ Think:
‣ Amazon Web Services
‣ Microsoft Azure
112. Cloud Stuff - IAAS
‣ When I talk “cloud” I mean IAAS
‣ And right now in 2012 Amazon IS the IAAS cloud
‣ ... everyone else is a pretender
113. Cloud Stuff - Why IAAS
‣ IAAS clouds are the focal point for life science
informatics
• Although some vendors are now offering PAAS and SAAS
options ...
‣ The “infrastructure” clouds give us the “building blocks”
we can assemble into useful stuff
‣ Right now Amazon has the best & most powerful
collection of “building blocks”
‣ The competition is years behind ...
114. A message for the
cloud pretenders…
115. No APIs?
Not a cloud.
117. Installing VMware
& excreting a press release?
Not a cloud.
118. I have to email a human?
Not a cloud.
119. ~50% failure rate when launching
new servers?
Stupid cloud.
120. Block storage
and virtual servers only?
(barely) a cloud.
121. Private Clouds
My $.02
122. Private Clouds in 2012:
‣ I’m no longer dismissing them as “utter crap”
‣ Usable & useful in certain situations
‣ Hype vs. Reality ratio still wacky
‣ Sensible only for certain shops
• Have you seen what you have to do
to your networks & gear?
‣ There are easier ways
123. Private Clouds: My Advice for ‘12
‣ Remain cynical (test vendor claims)
‣ Due Diligence still essential
‣ I personally would not deploy/buy anything that does not
explicitly provide Amazon API compatibility
124. Private Clouds: My Advice for ‘12
Most people are better off:
1. Adding VM platforms to existing HPC clusters &
environments
2. Extending enterprise VM platforms to allow user self-service & server catalogs
125. Cloud Advice
My $.02
126. Cloud Advice
Don’t get left behind
‣ Research IT Organizations need a cloud strategy today
‣ Those that don’t will be bypassed by frustrated users
‣ IaaS cloud services are only a departmental credit card
away ... and some senior scientists are too big to be fired
for violating IT policy :)
127. Cloud Advice
Design Patterns
‣ You actually need three tested cloud design patterns:
‣ (1) To handle ‘legacy’ scientific apps & workflows
‣ (2) The special stuff that is worth re-architecting
‣ (3) Hadoop & big data analytics
128. Cloud Advice
Legacy HPC on the Cloud
‣ MIT StarCluster
• http://web.mit.edu/star/cluster/
‣ This is your baseline
‣ Extend as needed
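For reference, a minimal StarCluster setup (option names roughly as in the StarCluster documentation of the time; the AMI, key name and sizes are placeholders) looks something like this:

    # ~/.starcluster/config (excerpt)
    [cluster smallcluster]
    KEYNAME = mykey
    CLUSTER_SIZE = 8
    NODE_IMAGE_ID = ami-xxxxxxxx        # a StarCluster-provided AMI
    NODE_INSTANCE_TYPE = m1.large

A cluster is then booted with "starcluster start -c smallcluster mycluster", used like any SGE cluster (qsub, qstat), and torn down with "starcluster terminate mycluster" when the queue drains; check the current StarCluster docs for exact commands and options.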
129. Cloud Advice
“Cloudy” HPC
‣ Some of our research workflows are important enough to
be rewritten for “the cloud” and the advantages that a
truly elastic & API-driven infrastructure can deliver
‣ This is where you have the most freedom
‣ Many published best practices you can borrow
‣ Amazon Simple Workflow Service (SWF) looks sweet
‣ Good commercial options: Cycle Computing, etc.
130. Hadoop & “Big Data”
‣ Hadoop and “big data” need to be on your radar
‣ Be careful though, you’ll need a gas mask to avoid the
smog of marketing and vapid hype
‣ The utility is real and this does represent the “future
path” for analysis of large data sets
131. Cloud Advice - Hadoop & Big Data
Big Data HPC
‣ It’s gonna be a MapReduce world, get used to it
‣ Little need to roll your own Hadoop in 2012
‣ ISV & commercial ecosystem already healthy
‣ Multiple providers today; both onsite & cloud-based
‣ Often a slam-dunk cloud use case
132. Hadoop & “Big Data”
What you need to know
‣ “Hadoop” and “Big Data” are now general terms
‣ You need to drill down to find out what people actually
mean
‣ We are still in the period where senior mgmt. may
demand “hadoop” or “big data” capability without any
actual business or scientific need
133. Hadoop & “Big Data”
What you need to know
‣ In broad terms you can break “Big Data” down into two very
basic use cases:
1. Compute: Hadoop can be used as a very powerful platform for
the analysis of very large data sets. The google search term
here is “map reduce” (see the sketch after this list)
2. Data Stores: Hadoop is driving the development of very
sophisticated “no-SQL” “non-Relational” databases and data
query engines. The google search terms include “nosql”,
“couchdb”, “hive”, “pig” & “mongodb”, etc.
‣ Your job is to figure out which type applies for the groups
requesting “hadoop” or “big data” capability
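For the "Compute" use case, the lowest barrier to entry is Hadoop Streaming, which lets the map and reduce steps be ordinary scripts. A minimal, illustrative word-count-style pair in Python (file names and the streaming jar path vary by installation):

    # mapper.py: read records on stdin, emit "key<TAB>1" per token of interest.
    import sys
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

    # reducer.py: Hadoop sorts mapper output by key before the reduce step,
    # so a simple running total per key is enough.
    import sys
    current, count = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

Submitted with something along the lines of "hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py".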
134. High Throughput Science
Hadoop vs traditional Linux Clusters
‣ Hadoop is a very complex beast
‣ It’s also the way of the future so you can’t ignore it
‣ Very tight dependency on moving the ‘compute’ as close
as possible to the ‘data’
‣ Hadoop clusters are just different enough that they do
not integrate cleanly with traditional Linux HPC system
‣ Often treated as separate silo or punted to the cloud
135. Hadoop & “Big Data”
What you need to know
‣ Life science Hadoop adoption is being driven by a small
group of academics writing and releasing open source
hadoop applications
‣ Your people will want to run these codes
‣ In some academic environments you may find people
wanting to develop on this platform
137. Cloud Data Movement
‣ We’ve slung a ton of data in and out of the cloud
‣ We used to be big fans of physical media movement
‣ Remember these pictures?
‣ ...
144. Cloud Data Movement
Wow!
‣ With a 1GbE internet connection ...
‣ and using Aspera software ...
‣ We sustained 700 Mb/sec for more than 7 hours
freighting genomes into Amazon Web Services
‣ This is fast enough for many use cases, including
genome sequencing core facilities*
‣ Chris Dwan’s webinar on this topic:
http://biote.am/7e
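Back-of-envelope check (my arithmetic, not a slide): 700 Mb/sec is roughly 87.5 MB/sec, so seven hours at that rate moves about 87.5 MB/s × 25,200 s, or roughly 2.2 TB of genome data.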
145. Cloud Data Movement
Wow!
‣ Results like this mean we now favor network-based data
movement over physical media movement
‣ Large-scale physical data movement carries a high
operational burden and consumes non-trivial staff time &
resources
146. Cloud Data Movement
There are three ways to do network data movement ...
‣ Buy software from Aspera and be done with it
‣ Attend the annual SuperComputing conference & see
which student group wins the bandwidth challenge
contest; use their code
‣ Get GridFTP from the Globus folks
• Trend: At every single “data movement” talk I’ve been to in
2011 it seemed that any speaker who was NOT using Aspera
was a very happy user of GridFTP. #notCoincidence
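For the GridFTP route, the workhorse client is globus-url-copy; an illustrative transfer (host names and tuning values are placeholders, flags per the Globus documentation) looks something like:

    # Push a run tarball to a GridFTP endpoint with 8 parallel TCP streams
    # and a larger TCP buffer; tune the values for your own network path.
    globus-url-copy -vb -p 8 -tcp-bs 16M \
        file:///data/runs/run042.tar \
        gsiftp://gridftp.example.org/archive/run042.tar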
148. Wrapping up
IT may just be a means to an end but you need to get
your head wrapped around it
‣ (1) So you use/buy/request the correct ‘stuff’
‣ (2) So you don’t get cheated by a vendor
‣ (3) Because you need to understand your tools
‣ (4) Because trends in automation and orchestration
are blurring the line between scientist & sysadmin
149. Wrapping up - Compute & Servers
‣ Servers and compute power are pretty straightforward
‣ You just need to know roughly what your preferred
compute building blocks look like
‣ ... and what special purpose resources you require (GPUs,
Large Memory, High Core Count, etc.)
‣ Some of you may also have to deal with sizing, cost and
facility (power, cooling, space) issues as well
150. Wrapping up - Networking
‣ Networking is also not a hugely painful thing
‣ Ethernet rules the land; you might have to pick and choose
between 1-Gig and 10-Gig Ethernet
‣ Understand that special networking technologies like
Infiniband offer advantages but they are expensive and need
to be applied carefully (if at all)
‣ Knowing if your MPI apps are latency sensitive will help
‣ And remember that networking is used for multiple things
(server communication, application message passing & file
and data sharing)
151. Wrapping up - Storage
‣ If you are going to focus on one IT area, this is it
‣ It’s incredibly important for genomics and also incredibly
complicated. Many ways to waste money or buy the ‘wrong’ stuff
‣ You may only have one chance to get it correct and may have to
live with your decision for years
‣ Budget is finite. You have to balance “speed” vs “size” vs
“expansion capacity” vs “high availability” and more ...
‣ “Petabyte-capable Scale-out NAS” is usually the best starting
point. You deviate away from NAS when scientific or technical
requirements demand “something else”.
152. Wrapping up - Hadoop / Big Data
‣ Probably the way of the future for big-data analytics. It’s
worth spending time to study; especially if you intend to
develop software in the future
‣ Popular target for current and emerging high-scale
genomics tools. If you want to use those tools you need to
deploy Hadoop
‣ It’s complicated and still changing rapidly. It can be
difficult to integrate into existing setups
‣ Be cynical about hype & test vendor claims
153. Wrapping up - Cloud
‣ Cloud is the future. The economics are inescapable and the
advantages are compelling.
‣ The main obstacle holding back genomics is terabyte
scale data movement. The cloud is horrible if you have to
move 2TB of data before you can run 2Hrs of compute!
‣ Your future core facility may involve a comp bio lab
without a datacenter at all. Some organizations are
already 100% virtual and 100% cloud-based
154. The NGS cloud clincher.
700 Mb/sec sustained for ~7 hours
West Coast to East Coast USA
155. Wrapping up - Cloud, continued
‣ Understand that for the foreseeable future there are THREE distinct
cloud architectures and design patterns.
‣ Vendors who push “100% hadoop” or “legacy free” solutions are
idiots and should be shoved out the door. We will be running legacy
codes and workflows for many years to come
‣ Your three design patterns on the cloud:
• Legacy HPC systems
(replicate traditional clusters in the cloud)
• Hadoop
• Cloudy
(when you rewrite something to fully leverage cloud capability)