Talk slides as delivered at the 2012 Bio-IT World Conference in Boston, MA
This is my annual "state of the state" address that has become somewhat popular.
3. BioTeam
Who, what & why
‣ Independent consulting shop
‣ Staffed by scientists forced to
learn IT, SW & HPC to get our
own research done
‣ 10+ years bridging the “gap”
between science, IT & high
performance computing
‣ PS: We are hiring.
4. BioTeam
Why we get invited to these sorts of talks ...
‣ Lots of people hire us across
wide range of project types
• Pharma, Biotech, EDU,
Nonprofit, .Gov, .Mil, etc.
‣ We get to see how groups of
smart people approach similar
problems
‣ We can speak honestly &
objectively about what we see
“in the real world”
6. Listen to me at your own risk
Seriously.
‣ I’m not an expert, pundit,
visionary or “thought leader”
‣ All career success entirely due
to shamelessly copying what
actual smart people do
‣ I’m biased, burnt-out & cynical
‣ Filter my words accordingly
7. Introduction
1. Business & Marketplace
2. Datacenter, Facility & Infrastructure
3. Storage
4. Cloud
5. Hot for ’12 ...
6.
9. Business & Meta Observations
More of the same in ’12 ...
‣ ~4 staff full time on issues involving data handling, data
management and multi-instrument Next-Gen
sequencing/analysis
‣ ~2 staff full time on infrastructure, storage and facility
related projects
• Dwan: Big infrastructure & facility projects for Fortune 20
companies, research consortia & .GOV customers
• Dag: 40% infrastructure, 20% storage, 20% cloud
‣ ~1 staff full time on Amazon Cloud projects
10. What that tells us
‣ Same problem(s) as last year
‣ Next-gen sequencing still
causing a lot of pain when it
comes to data handling,
storage, organization &
integration
‣ As sequencing continues to be
commoditized, this will likely
only get worse
11. Business & Meta Observations
‣ Companies are still spending
• On people, software, infrastructure, facility & cloud
‣ Pharma may be contracting
• ... but more and more startups are popping up and other
companies are simply continuing sane & sensible growth
‣ .GOV is of some concern
• Stimulus funding winding down or already gone; same with
‘BioDefense’ funding & project efforts
• Grant funding organizations tightening belts
12. Introduction
1. Business & Marketplace
2. Datacenter, Facility & Infrastructure
3. Storage
4. Cloud
5. Hot for ’12 ...
6.
14. Facility & Infrastructure
Less frenetic this year
‣ No clients breaking ground on major new
datacenters this year
• Slight change from 2011
• A few electrical/cooling refresh projects in the works
‣ Multiple clients of all sizes securing additional colo
• Often for power density reasons
• Small shops & startups are going with colo+cloud
• Large shops expanding into Tier-1 colos
15. Facility & Infrastructure
Power problems are fading or less critical ...
‣ Last year we had serious power density problems
‣ Friction between facility & research staff
‣ Arguments over density vs. power envelope vs. rack
space & physical footprint
‣ No such issues (so far) in 2012
16. Facility & Infrastructure
HPC + Virtualization
‣ Still deploying HPC Linux Clusters w/ Scale-out NAS
‣ However, every HPC system since 2011 has also
intentionally included a VM environment integrated into
the HPC cluster
17. Facility & Infrastructure
HPC + Virtualization
‣ HPC + Virtualization solves a lot of problems
‣ Deals with valid biz/scientific need for researchers to
run/own/manage their own servers ‘near’ HPC stack
‣ Solves a ton of research IT support issues
• Or at least leaves us a clear boundary line
‣ Lets us obtain useful “cloud” features without choking
on endless BS shoveled at us by “private cloud” vendors
• Example: Server Catalogs + Self-service Provisioning
20. Science-centric Storage
Why I’m not worried
‣ Peta-capable storage is trivial to acquire in 2012
‣ Scale-out NAS has won the battle
‣ It’s simply not as hard/risky as it used to be
22. OMG! The Sky Is Falling!
Maybe a little panic is appropriate ...
23. The sky IS falling!
OMG!!
[Figure: “BIG SCARY GRAPH” of data growth, plotted over 2007–2012]
24. The sky IS falling!
Uncomfortable truths
‣ Cost of acquiring data (genomes)
falling faster than rate at which
industry is increasing drive capacity
‣ Human researchers downstream of
these datasets are also consuming
more storage (and less predictably)
‣ High-scale labs must react or
potentially have catastrophic issues
in 2012-2013
25. The sky IS falling!
Current Practices Are Not Sustainable
‣ FACT: Chemistry changing faster than we can refresh our
datacenters and research IT infrastructure
‣ FACT: Rate at which we can cheaply acquire interesting data
exceeds rate at which storage companies can increase the
capacity of their products
‣ FACT: We suck at managing, tagging, valuing & curating our
data. Few scientists really understand true cost/complexity
involved with keeping data safe, online & accessible
‣ FACT: In 2012 people still think “keep everything online, forever”
is a viable demand to be making of IT staff
‣ FACT: Something is going to break. Soon.
27. The sky IS falling!
CRAM it in 2012 ...
‣ Minor improvements are useless; order-of-magnitude needed
‣ Some people are talking about radical new methods –
compressing against reference sequences and only storing the
diffs
• With a variable compression “quality budget” to spend on
lossless techniques in the areas you care about
‣ http://biote.am/5v - Ewan Birney on “Compressing DNA”
‣ http://biote.am/5w - The actual CRAM paper
‣ If CRAM takes off, storage landscape will change
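The reference-plus-diffs idea is easy to sketch. The toy below illustrates the concept only; it is not the actual CRAM format, and the reference and read strings are invented:

```python
# Toy illustration of reference-based compression: store only the
# positions where a read differs from the reference, not the read
# itself. This shows the concept behind CRAM, not its on-disk format.
def compress(reference, read, offset):
    """Return (offset, length, diffs) where diffs lists (pos, base)."""
    diffs = [(i, b) for i, b in enumerate(read)
             if reference[offset + i] != b]
    return (offset, len(read), diffs)

def decompress(reference, record):
    """Rebuild the original read from the reference plus the diffs."""
    offset, length, diffs = record
    bases = list(reference[offset:offset + length])
    for pos, base in diffs:
        bases[pos] = base
    return "".join(bases)

ref  = "GATTACAGATTACA"
read = "TACTGAT"            # matches ref[3:10] except one base
rec  = compress(ref, read, 3)
print(rec)                  # only the single mismatch is stored
```

A lossy “quality budget” would extend this by also deciding how many quality scores to keep per record.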
29. What comes next.
The same rules apply for 2012 and beyond ...
‣ Accept that science changes faster than IT infrastructure
‣ Be glad you are not Broad/Sanger/BGI/NCBI
‣ Flexibility, scalability and agility become the key
requirements of research informatics platforms
• Tiered storage is in your future ...
‣ Shared/concurrent access is still the overwhelming
storage use case
30. What comes next.
In the following year ...
‣ Many peta-scale capable systems deployed
• Most will operate in the hundreds-of-TBs range
‣ Far more aggressive “data triage”
‣ Genome compression via CRAM
‣ Even more data will sit untouched & unloved
‣ Growing need for tiers, HSM & even tape
31. What comes next.
In the following year ...
‣ Broad and others are paving the way with respect to
metadata-aware & policy driven storage frameworks
• And we’ll shamelessly copy a year or two later
‣ I’m still on my cloud storage kick
• Economics are inescapable; Will be built into storage
platforms, gateways & VMs
• Amazon S3 is only an HTTP REST call away
• Cloud will become “just another tier”
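To make “only an HTTP call away” concrete, here is a sketch of the request signing S3 expects (AWS Signature Version 2, the scheme in use in 2012). The access key, secret key, bucket, and object names are all placeholders:

```python
# Sketch: building the Authorization signature for an S3 GET request
# using AWS Signature Version 2 (base64 of an HMAC-SHA1 digest).
# All key and bucket names below are made up.
import base64
import hashlib
import hmac

def s3_v2_signature(secret_key, verb, date, resource,
                    content_md5="", content_type=""):
    """Return the base64 HMAC-SHA1 signature S3 expects in the
    'Authorization: AWS <access-key>:<signature>' header."""
    string_to_sign = "\n".join(
        [verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

sig = s3_v2_signature("EXAMPLE-SECRET-KEY", "GET",
                      "Tue, 24 Apr 2012 12:00:00 GMT",
                      "/my-genome-bucket/sample.bam")
print("Authorization: AWS EXAMPLE-ACCESS-KEY:" + sig)
```

In practice a library such as boto wraps this for you; the point is that any storage gateway or VM can speak this protocol with nothing but an HTTP stack.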
32. What comes next.
Expect your storage to be smarter & more capable ...
‣ What do DDN, Panasas, Isilon,
BlueArc, etc. have in common?
• Under the hood they all run
Unix or Unix-like OS’s on
x86_64 architectures
‣ Some storage arrays can
already run applications natively
• More will follow
• Likely a big trend for 2012
34. The Road Ahead
Trends & Tips for 2012
‣ Peta-capable platforms required
‣ Scale-out NAS still the best fit
‣ Customers will no longer build one
big scale-out NAS tier
‣ My ‘hack’ of using nearline-spec
storage as the primary science tier is
obsolete in ’12
‣ Not everything is worth backing up
‣ Expect disruptive stuff
35. The Road Ahead
Trends & Tips for 2012
‣ Monolithic tiers no longer cut it
• Changing science & instrument
output patterns are to blame
• We can’t get away with biasing
towards capacity over
performance any more
‣ pNFS should go mainstream in ’12
• { fantastic news }
‣ Tiered storage IS in your future
• Multiple vendors & types
36. The Road Ahead
Trends & Tips for 2012
‣ Your storage will be able to run apps
• Dedupe, cloud gateways &
replication
• ‘CRAM’ or similar compression
• Storage Resource Brokers
(iRODS) & metadata servers
• HDFS/Hadoop hooks?
• Lab, data management & LIMS applications
[Image: Drobo appliance running BioTeam MiniLIMS internally]
37. The Road Ahead
Trends & Tips for 2012
‣ Hadoop / MapReduce / BigData
• Just like GRID and CLOUD back
in the day you’ll need a gas mask
to survive the smog of hype and
vendor press releases.
• You still need to think about it
• ... and have a roadmap for doing it
• Deep, deep ties to your storage
• Your users want/need it
• My $.02? Fantastic cloud use case
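The programming model behind the hype is tiny. A hedged, in-process sketch: a real Hadoop Streaming job would run the mapper and reducer as separate scripts over stdin/stdout, and the toy sequencing reads below are invented:

```python
# Sketch of the MapReduce model Hadoop implements, chained in-process
# to show the data flow: map -> shuffle/sort -> reduce.
from itertools import groupby
from operator import itemgetter

def mapper(read):
    # Emit a (base, 1) pair for every base in a sequencing read.
    for base in read:
        yield (base, 1)

def reducer(key, values):
    # Sum the counts emitted for one key.
    return (key, sum(values))

reads = ["GATTACA", "TTAGGC"]
# The "shuffle/sort" phase: gather all pairs and sort by key.
pairs = sorted(kv for read in reads for kv in mapper(read))
counts = dict(reducer(k, (v for _, v in grp))
              for k, grp in groupby(pairs, key=itemgetter(0)))
print(counts)   # {'A': 4, 'C': 2, 'G': 3, 'T': 4}
```

The deep tie to storage comes from the shuffle/sort phase: Hadoop moves the computation to where the data blocks already live, which is why HDFS hooks on storage arrays matter.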
41. Storage Future Feels Like This ...
Multiple Tiers, Multiple Vendors, Multiple Products
42. The ‘C’ word
Does a Bio-IT talk exist if it does not mention “the cloud”?
43. Cloud Stuff
‣ Before I get nasty ...
‣ I am not an Amazon shill
‣ I am a jaded, cynical, zero-loyalty consumer of IT
services and products that let me get #%$^ done
‣ Because I only get paid when my #%$^ works, I am
picky about what tools I keep in my toolkit
‣ Amazon AWS is an infinitely cool tool
52. Private Clouds in 2012:
‣ I’m no longer dismissing them as “utter crap”
‣ Usable & useful in certain situations
‣ Hype vs. Reality ratio still wacky
‣ Sensible only for certain shops
• Have you seen what you have to do
to your networks & gear?
‣ There are easier ways
53. Private Clouds: My Advice for ‘12
‣ Remain cynical (test vendor claims)
‣ Due Diligence still essential
‣ I personally would not deploy/buy anything that does not
explicitly provide Amazon API compatibility
54. Private Clouds: My Advice for ‘12
Most people are better off:
1. Adding VM platforms to existing HPC clusters &
environments
2. Extending enterprise VM platforms to allow user self-service
& server catalogs
56. Cloud Advice
Don’t get left behind
‣ Research IT Organizations need a cloud strategy today
‣ Those that don’t will be bypassed by frustrated users
‣ IaaS cloud services are only a departmental credit card
away ... and some senior scientists are too big to be fired
for violating IT policy :)
57. Cloud Advice
Design Patterns
‣ You actually need three tested cloud design patterns:
‣ (1) To handle ‘legacy’ scientific apps & workflows
‣ (2) The special stuff that is worth re-architecting
‣ (3) Hadoop & big data analytics
58. Cloud Advice
Legacy HPC on the Cloud
‣ MIT StarCluster
• http://web.mit.edu/star/cluster/
‣ This is your baseline
‣ Extend as needed
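For reference, a minimal StarCluster config sketch. The key name, cluster size, and instance type below are placeholders, not recommendations:

```ini
# ~/.starcluster/config -- all values here are placeholders
[aws info]
AWS_ACCESS_KEY_ID = <your-access-key>
AWS_SECRET_ACCESS_KEY = <your-secret-key>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 4
NODE_INSTANCE_TYPE = m1.large
```

With that in place, something like `starcluster start -c smallcluster mycluster` boots an SGE-ready cluster on EC2 and `starcluster sshmaster mycluster` drops you onto the head node.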
59. Cloud Advice
“Cloudy” HPC
‣ Some of our research workflows are important enough to
be rewritten for “the cloud” and the advantages that a
truly elastic & API-driven infrastructure can deliver
‣ This is where you have the most freedom
‣ Many published best practices you can borrow
‣ Amazon Simple Workflow Service (SWF) looks sweet
‣ Good commercial options: Cycle Computing, etc.
60. Cloud Advice
Big Data HPC
‣ It’s gonna be a MapReduce world, get used to it
‣ Little need to roll your own Hadoop in 2012
‣ ISV & commercial ecosystem already healthy
‣ Multiple providers today; both onsite & cloud-based
‣ Often a slam-dunk cloud use case
62. Cloud Data Movement
‣ We’ve slung a ton of data in and out of the cloud
‣ We used to be big fans of physical media movement
‣ Remember these pictures?
‣ ...
69. Cloud Data Movement
Wow!
‣ With a 1 GbE internet connection ...
‣ and using Aspera software ...
‣ We sustained 700 Mb/sec for more than 7 hours
freighting genomes into Amazon Web Services
‣ This is fast enough for many use cases, including
genome sequencing core facilities*
‣ Chris Dwan’s webinar on this topic:
http://biote.am/7e
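A quick sanity check on those numbers, assuming the sustained rate is 700 megabits/s (a 1 GbE link tops out at 1,000 Mb/s, so megabytes/s is not physically possible here):

```python
# Back-of-the-envelope math on the transfer described above.
rate_bps = 700e6             # 700 Mb/s in bits per second
seconds = 7 * 3600           # the 7-hour sustained run
total_bytes = rate_bps * seconds / 8
print("%.1f TB moved" % (total_bytes / 1e12))          # ~2.2 TB

# Time to ship one hypothetical 100 GB genome dataset at that rate:
genome_bytes = 100e9
minutes = genome_bytes * 8 / rate_bps / 60
print("%.0f minutes per 100 GB" % minutes)             # ~19 minutes
```

Roughly 2.2 TB in a working day is why this pace is fast enough for many sequencing core facilities.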
70. Cloud Data Movement
Wow!
‣ Results like this mean we now favor network-based data
movement over physical media movement
‣ Large-scale physical data movement carries a high
operational burden and consumes non-trivial staff time &
resources
71. Cloud Data Movement
There are three ways to do network data movement ...
‣ Buy software from Aspera and be done with it
‣ Attend the annual SuperComputing conference & see
which student group wins the bandwidth challenge
contest; use their code
‣ Get GridFTP from the Globus folks
• Trend: At every single “data movement” talk I’ve been to in
2011 it seemed that any speaker who was NOT using Aspera
was a very happy user of GridFTP. #notCoincidence
72. Cloud Data Movement
Final thoughts
‣ GridFTP has a booth on the show floor; pay them a visit
‣ Michelle Munson from Aspera speaking today in Track 2
on “High-Speed Data Movement for Effective Global
Collaboration in Genomic Research”
74. Hot for ’12
BioTeam side projects & research interests
‣ Like to wrap up with some topics we think are
interesting
‣ Who knows? These might be trends for 2013!
75. Siri Voice Control of Instruments/Pipelines
‣ BioTeam revealed our work
with BT and Accelrys
yesterday morning @ BioIT
‣ We demonstrated Siri voice
control of a Pipeline Pilot
experiment running in the BT
Compute Cloud
‣ http://biote.am/7h
‣ We expect to continue doing
cool things with Siri in ’12
76. Smart Storage & Lab-local Appliances
‣ I firmly expect the “storage
arrays running apps & VMs”
trend to go mainstream
‣ This has beneficial implications
for life science informatics
‣ We’ll be hitting this topic hard
on systems ranging from Drobo
to DataDirect
‣ Also working with the Intel
Modular Server concept
77. Lab Local Appliances
Intel Modular Server
‣ Interesting hardware
combination; storage +
servers + native
hypervisor
‣ VM Pool 1: MiniLIMS +
other useful lab software
‣ VM Pool 2: Amazon
Storage Gateway
Appliance
http://biote.am/7i
‣ Server Blade 3:
Bright Cluster Manager HPC stack
80. Cloud, Community & Orchestration
‣ We love Opscode & Chef
‣ We’ll be doing more with systems orchestration in ’12
‣ And hopefully expanding our community collection of
useful Chef cookbooks for life science informatics
‣ We also still love MIT StarCluster and will hopefully be
contributing plugins and enhancements back to Justin