Talk slides as delivered at the 2012 Bio-IT World Conference in Boston, MA
This is my annual "state of the state" address that has become somewhat popular.
3. BioTeam
Who, what & why
‣ Independent consulting shop
‣ Staffed by scientists forced to
learn IT, SW & HPC to get our
own research done
‣ 10+ years bridging the “gap”
between science, IT & high
performance computing
‣ PS: We are hiring.
4. BioTeam
Why we get invited to these sorts of talks ...
‣ Lots of people hire us across
wide range of project types
• Pharma, Biotech, EDU,
Nonprofit, .Gov, .Mil, etc.
‣ We get to see how groups of
smart people approach similar
problems
‣ We can speak honestly &
objectively about what we see
“in the real world”
6. Listen to me at your own risk
Seriously.
‣ I’m not an expert, pundit,
visionary or “thought leader”
‣ All career success entirely due
to shamelessly copying what
actual smart people do
‣ I’m biased, burnt-out & cynical
‣ Filter my words accordingly
7. Introduction
1. Business & Marketplace
2. Datacenter, Facility & Infrastructure
3. Storage
4. Cloud
5. Hot for ’12 ...
6.
9. Business & Meta Observations
More of the same in ’12 ...
‣ ~4 staff full time on issues involving data handling, data
management and multi-instrument Next-Gen
sequencing/analysis
‣ ~2 staff full time on infrastructure, storage and facility
related projects
• Dwan: Big infrastructure & facility projects for Fortune 20
companies, research consortia & .GOV customers
• Dag: 40% infrastructure, 20% storage, 20% cloud
‣ ~1 staff full time on Amazon Cloud projects
10. What that tells us
‣ Same problem(s) as last year
‣ Next-gen sequencing still
causing a lot of pain when it
comes to data handling,
storage, organization &
integration
‣ As sequencing continues to be
commoditized, this will likely
only get worse
11. Business & Meta Observations
‣ Companies are still spending
• On people, software, infrastructure, facility & cloud
‣ Pharma may be contracting
• ... but more and more startups are popping up and other
companies are simply continuing sane & sensible growth
‣ .GOV is of some concern
• Stimulus funding winding down or already gone; same with
‘BioDefense’ funding & project efforts
• Grant funding organizations tightening belts
12. Introduction
1. Business & Marketplace
2. Datacenter, Facility & Infrastructure
3. Storage
4. Cloud
5. Hot for ’12 ...
6.
14. Facility & Infrastructure
Less frenetic this year
‣ No clients breaking ground on major new
datacenters this year
• Slight change from 2011
• A few electrical/cooling refresh projects in the works
‣ Multiple clients of all sizes securing additional colo
• Often for power density reasons
• Small shops & startups are going with colo+cloud
• Large shops expanding into Tier-1 colos
15. Facility & Infrastructure
Power problems are fading or less critical ...
‣ Last year we had serious power density problems
‣ Friction between facility & research staff
‣ Arguments over density vs. power envelope vs. rack
space & physical footprint
‣ No such issues (so far) in 2012
16. Facility & Infrastructure
HPC + Virtualization
‣ Still deploying HPC Linux Clusters w/ Scale-out NAS
‣ However, every HPC system since 2011 has also
intentionally included a VM environment integrated into
the HPC cluster
17. Facility & Infrastructure
HPC + Virtualization
‣ HPC + Virtualization solves a lot of problems
‣ Deals with valid biz/scientific need for researchers to
run/own/manage their own servers ‘near’ HPC stack
‣ Solves a ton of research IT support issues
• Or at least leaves us a clear boundary line
‣ Lets us obtain useful “cloud” features without choking
on endless BS shoveled at us by “private cloud” vendors
• Example: Server Catalogs + Self-service Provisioning
20. Science-centric Storage
Why I’m not worried
‣ Peta-capable storage is trivial to acquire in 2012
‣ Scale-out NAS has won the battle
‣ It’s simply not as hard/risky as it used to be
22. OMG! The Sky Is Falling!
Maybe a little panic is appropriate ...
23. The sky IS falling!
OMG!!
[Figure: “BIG SCARY GRAPH” of data growth, plotted over 2007–2012]
24. The sky IS falling!
Uncomfortable truths
‣ Cost of acquiring data (genomes)
falling faster than rate at which
industry is increasing drive capacity
‣ Human researchers downstream of
these datasets are also consuming
more storage (and less predictably)
‣ High-scale labs must react or
potentially have catastrophic issues
in 2012-2013
25. The sky IS falling!
Current Practices Are Not Sustainable
‣ FACT: Chemistry changing faster than we can refresh our
datacenters and research IT infrastructure
‣ FACT: Rate at which we can cheaply acquire interesting data
exceeds rate at which storage companies can increase the
capacity of their products
‣ FACT: We suck at managing, tagging, valuing & curating our
data. Few scientists really understand true cost/complexity
involved with keeping data safe, online & accessible
‣ FACT: In 2012 people still think “keep everything online, forever”
is a viable demand to be making of IT staff
‣ FACT: Something is going to break. Soon.
27. The sky IS falling!
CRAM it in 2012 ...
‣ Minor improvements are useless; order-of-magnitude needed
‣ Some people are talking about radical new methods –
compressing against reference sequences and only storing the
diffs
• With a variable compression “quality budget” to spend on
lossless techniques in the areas you care about
‣ http://biote.am/5v - Ewan Birney on “Compressing DNA”
‣ http://biote.am/5w - The actual CRAM paper
‣ If CRAM takes off, storage landscape will change
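The reference-plus-diffs idea is easy to sketch. The toy below illustrates the concept only; it is not the actual CRAM format, and the reference and read strings are invented:

```python
# Toy illustration of reference-based compression: store only the
# positions where a read differs from the reference, not the read
# itself. This shows the concept behind CRAM, not its on-disk format.
def compress(reference, read, offset):
    """Return (offset, length, diffs) where diffs lists (pos, base)."""
    diffs = [(i, b) for i, b in enumerate(read)
             if reference[offset + i] != b]
    return (offset, len(read), diffs)

def decompress(reference, record):
    """Rebuild the original read from the reference plus the diffs."""
    offset, length, diffs = record
    bases = list(reference[offset:offset + length])
    for pos, base in diffs:
        bases[pos] = base
    return "".join(bases)

ref  = "GATTACAGATTACA"
read = "TACTGAT"            # matches ref[3:10] except one base
rec  = compress(ref, read, 3)
print(rec)                  # only the single mismatch is stored
```

A lossy “quality budget” would extend this by also deciding how many quality scores to keep per record.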
29. What comes next.
The same rules apply for 2012 and beyond ...
‣ Accept that science changes faster than IT infrastructure
‣ Be glad you are not Broad/Sanger/BGI/NCBI
‣ Flexibility, scalability and agility become the key
requirements of research informatics platforms
• Tiered storage is in your future ...
‣ Shared/concurrent access is still the overwhelming
storage use case
30. What comes next.
In the following year ...
‣ Many peta-scale capable systems deployed
• Most will operate in the hundreds-of-TBs range
‣ Far more aggressive “data triage”
‣ Genome compression via CRAM
‣ Even more data will sit untouched & unloved
‣ Growing need for tiers, HSM & even tape
31. What comes next.
In the following year ...
‣ Broad and others are paving the way with respect to
metadata-aware & policy driven storage frameworks
• And we’ll shamelessly copy a year or two later
‣ I’m still on my cloud storage kick
• Economics are inescapable; Will be built into storage
platforms, gateways & VMs
• Amazon S3 is only an HTTP REST call away
• Cloud will become “just another tier”
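To make “only an HTTP call away” concrete, here is a sketch of the request signing S3 expects (AWS Signature Version 2, the scheme in use in 2012). The access key, secret key, bucket, and object names are all placeholders:

```python
# Sketch: building the Authorization signature for an S3 GET request
# using AWS Signature Version 2 (base64 of an HMAC-SHA1 digest).
# All key and bucket names below are made up.
import base64
import hashlib
import hmac

def s3_v2_signature(secret_key, verb, date, resource,
                    content_md5="", content_type=""):
    """Return the base64 HMAC-SHA1 signature S3 expects in the
    'Authorization: AWS <access-key>:<signature>' header."""
    string_to_sign = "\n".join(
        [verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

sig = s3_v2_signature("EXAMPLE-SECRET-KEY", "GET",
                      "Tue, 24 Apr 2012 12:00:00 GMT",
                      "/my-genome-bucket/sample.bam")
print("Authorization: AWS EXAMPLE-ACCESS-KEY:" + sig)
```

In practice a library such as boto wraps this for you; the point is that any storage gateway or VM can speak this protocol with nothing but an HTTP stack.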
32. What comes next.
Expect your storage to be smarter & more capable ...
‣ What do DDN, Panasas, Isilon,
BlueArc, etc. have in common?
• Under the hood they all run
Unix or Unix-like OS’s on
x86_64 architectures
‣ Some storage arrays can
already run applications natively
• More will follow
• Likely a big trend for 2012
34. The Road Ahead
Trends & Tips for 2012
‣ Peta-capable platforms required
‣ Scale-out NAS still the best fit
‣ Customers will no longer build one
big scale-out NAS tier
‣ My ‘hack’ of using nearline-spec
storage as the primary science tier is
obsolete in ’12
‣ Not everything is worth backing up
‣ Expect disruptive stuff
35. The Road Ahead
Trends & Tips for 2012
‣ Monolithic tiers no longer cut it
• Changing science & instrument
output patterns are to blame
• We can’t get away with biasing
towards capacity over
performance any more
‣ pNFS should go mainstream in ’12
• { fantastic news }
‣ Tiered storage IS in your future
• Multiple vendors & types
36. The Road Ahead
Trends & Tips for 2012
‣ Your storage will be able to run apps
• Dedupe, cloud gateways &
replication
• ‘CRAM’ or similar compression
• Storage Resource Brokers
(iRODS) & metadata servers
• HDFS/Hadoop hooks?
• Lab, data management & LIMS applications
[Image: Drobo appliance running BioTeam MiniLIMS internally]
37. The Road Ahead
Trends & Tips for 2012
‣ Hadoop / MapReduce / BigData
• Just like GRID and CLOUD back
in the day you’ll need a gas mask
to survive the smog of hype and
vendor press releases.
• You still need to think about it
• ... and have a roadmap for doing it
• Deep, deep ties to your storage
• Your users want/need it
• My $.02? Fantastic cloud use case
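The programming model behind the hype is tiny. A hedged, in-process sketch: a real Hadoop Streaming job would run the mapper and reducer as separate scripts over stdin/stdout, and the toy sequencing reads below are invented:

```python
# Sketch of the MapReduce model Hadoop implements, chained in-process
# to show the data flow: map -> shuffle/sort -> reduce.
from itertools import groupby
from operator import itemgetter

def mapper(read):
    # Emit a (base, 1) pair for every base in a sequencing read.
    for base in read:
        yield (base, 1)

def reducer(key, values):
    # Sum the counts emitted for one key.
    return (key, sum(values))

reads = ["GATTACA", "TTAGGC"]
# The "shuffle/sort" phase: gather all pairs and sort by key.
pairs = sorted(kv for read in reads for kv in mapper(read))
counts = dict(reducer(k, (v for _, v in grp))
              for k, grp in groupby(pairs, key=itemgetter(0)))
print(counts)   # {'A': 4, 'C': 2, 'G': 3, 'T': 4}
```

The deep tie to storage comes from the shuffle/sort phase: Hadoop moves the computation to where the data blocks already live, which is why HDFS hooks on storage arrays matter.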
41. Storage Future Feels Like This ...
Multiple Tiers, Multiple Vendors, Multiple Products
42. The ‘C’ word
Does a Bio-IT talk exist if it does not mention “the cloud”?
43. Cloud Stuff
‣ Before I get nasty ...
‣ I am not an Amazon shill
‣ I am a jaded, cynical, zero-loyalty consumer of IT
services and products that let me get #%$^ done
‣ Because I only get paid when my #%$^ works, I am
picky about what tools I keep in my toolkit
‣ Amazon AWS is an infinitely cool tool
52. Private Clouds in 2012:
‣ I’m no longer dismissing them as “utter crap”
‣ Usable & useful in certain situations
‣ Hype vs. Reality ratio still wacky
‣ Sensible only for certain shops
• Have you seen what you have to do
to your networks & gear?
‣ There are easier ways
53. Private Clouds: My Advice for ‘12
‣ Remain cynical (test vendor claims)
‣ Due Diligence still essential
‣ I personally would not deploy/buy anything that does not
explicitly provide Amazon API compatibility
54. Private Clouds: My Advice for ‘12
Most people are better off:
1. Adding VM platforms to existing HPC clusters &
environments
2. Extending enterprise VM platforms to allow user self-service
& server catalogs
56. Cloud Advice
Don’t get left behind
‣ Research IT Organizations need a cloud strategy today
‣ Those that don’t will be bypassed by frustrated users
‣ IaaS cloud services are only a departmental credit card
away ... and some senior scientists are too big to be fired
for violating IT policy :)
57. Cloud Advice
Design Patterns
‣ You actually need three tested cloud design patterns:
‣ (1) To handle ‘legacy’ scientific apps & workflows
‣ (2) The special stuff that is worth re-architecting
‣ (3) Hadoop & big data analytics
58. Cloud Advice
Legacy HPC on the Cloud
‣ MIT StarCluster
• http://web.mit.edu/star/cluster/
‣ This is your baseline
‣ Extend as needed
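For reference, a minimal StarCluster config sketch. The key name, cluster size, and instance type below are placeholders, not recommendations:

```ini
# ~/.starcluster/config -- all values here are placeholders
[aws info]
AWS_ACCESS_KEY_ID = <your-access-key>
AWS_SECRET_ACCESS_KEY = <your-secret-key>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 4
NODE_INSTANCE_TYPE = m1.large
```

With that in place, something like `starcluster start -c smallcluster mycluster` boots an SGE-ready cluster on EC2 and `starcluster sshmaster mycluster` drops you onto the head node.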
59. Cloud Advice
“Cloudy” HPC
‣ Some of our research workflows are important enough to
be rewritten for “the cloud” and the advantages that a
truly elastic & API-driven infrastructure can deliver
‣ This is where you have the most freedom
‣ Many published best practices you can borrow
‣ Amazon Simple Workflow Service (SWF) looks sweet
‣ Good commercial options: Cycle Computing, etc.
60. Cloud Advice
Big Data HPC
‣ It’s gonna be a MapReduce world, get used to it
‣ Little need to roll your own Hadoop in 2012
‣ ISV & commercial ecosystem already healthy
‣ Multiple providers today; both onsite & cloud-based
‣ Often a slam-dunk cloud use case
62. Cloud Data Movement
‣ We’ve slung a ton of data in and out of the cloud
‣ We used to be big fans of physical media movement
‣ Remember these pictures?
‣ ...
69. Cloud Data Movement
Wow!
‣ With a 1 GbE internet connection ...
‣ and using Aspera software ...
‣ We sustained 700 Mb/sec for more than 7 hours
freighting genomes into Amazon Web Services
‣ This is fast enough for many use cases, including
genome sequencing core facilities*
‣ Chris Dwan’s webinar on this topic:
http://biote.am/7e
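A quick sanity check on those numbers, assuming the sustained rate is 700 megabits/s (a 1 GbE link tops out at 1,000 Mb/s, so megabytes/s is not physically possible here):

```python
# Back-of-the-envelope math on the transfer described above.
rate_bps = 700e6             # 700 Mb/s in bits per second
seconds = 7 * 3600           # the 7-hour sustained run
total_bytes = rate_bps * seconds / 8
print("%.1f TB moved" % (total_bytes / 1e12))          # ~2.2 TB

# Time to ship one hypothetical 100 GB genome dataset at that rate:
genome_bytes = 100e9
minutes = genome_bytes * 8 / rate_bps / 60
print("%.0f minutes per 100 GB" % minutes)             # ~19 minutes
```

Roughly 2.2 TB in a working day is why this pace is fast enough for many sequencing core facilities.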
70. Cloud Data Movement
Wow!
‣ Results like this mean we now favor network-based data
movement over physical media movement
‣ Large-scale physical data movement carries a high
operational burden and consumes non-trivial staff time &
resources
71. Cloud Data Movement
There are three ways to do network data movement ...
‣ Buy software from Aspera and be done with it
‣ Attend the annual SuperComputing conference & see
which student group wins the bandwidth challenge
contest; use their code
‣ Get GridFTP from the Globus folks
• Trend: At every single “data movement” talk I’ve been to in
2011 it seemed that any speaker who was NOT using Aspera
was a very happy user of GridFTP. #notCoincidence
72. Cloud Data Movement
Final thoughts
‣ GridFTP has a booth on the show floor; pay them a visit
‣ Michelle Munson from Aspera speaking today in Track 2
on “High-Speed Data Movement for Effective Global
Collaboration in Genomic Research”
74. Hot for ’12
BioTeam side projects & research interests
‣ Like to wrap up with some topics we think are
interesting
‣ Who knows? These might be trends for 2013!
75. Siri Voice Control of Instruments/Pipelines
‣ BioTeam revealed our work
with BT and Accelrys
yesterday morning @ BioIT
‣ We demonstrated Siri voice
control of a Pipeline Pilot
experiment running in the BT
Compute Cloud
‣ http://biote.am/7h
‣ We expect to continue doing
cool things with Siri in ’12
76. Smart Storage & Lab-local Appliances
‣ I firmly expect the “storage
arrays running apps & VMs”
trend to go mainstream
‣ This has beneficial implications
for life science informatics
‣ We’ll be hitting this topic hard
on systems ranging from Drobo
to DataDirect
‣ Also working with the Intel
Modular Server concept
77. Lab Local Appliances
Intel Modular Server
‣ Interesting hardware
combination; storage +
servers + native
hypervisor
‣ VM Pool 1: MiniLIMS +
other useful lab software
‣ VM Pool 2: Amazon
Storage Gateway
Appliance
http://biote.am/7i
‣ Server Blade 3:
Bright Cluster Manager HPC stack
80. Cloud, Community & Orchestration
‣ We love Opscode & Chef
‣ We’ll be doing more with systems orchestration in ’12
‣ And hopefully expanding our community collection of
useful Chef cookbooks for life science informatics
‣ We also still love MIT StarCluster and will hopefully be
contributing plugins and enhancements back to Justin