The document discusses trends in bioinformatics infrastructure and data management from 2012. It notes that next-generation sequencing continues to cause issues with data handling and storage. Storage capabilities are increasing rapidly but the rate of data acquisition is even faster. Compression techniques like CRAM may help address this challenge. Cloud computing is becoming more widely adopted, with Amazon Web Services being the dominant platform currently. Private clouds have limited utility. Skills in data movement, tiered storage systems, and new approaches like Hadoop will be important in the coming years.
3. BioTeam
Who, what & why
‣ Independent consulting shop
‣ Staffed by scientists forced to
learn IT, SW & HPC to get our
own research done
‣ 10+ years bridging the “gap”
between science, IT & high
performance computing
4. BioTeam
Why we get invited to these sorts of talks ...
‣ Lots of people hire us across
wide range of project types
• Pharma, Biotech, EDU,
Nonprofit, .Gov, .Mil, etc.
‣ We get to see how groups of
smart people approach similar
problems
‣ We can speak honestly &
objectively about what we see
“in the real world”
6. Listen to me at your own risk
Seriously.
‣ I’m not an expert, pundit,
visionary or “thought leader”
‣ All career success entirely due
to shamelessly copying what
actual smart people do
‣ I’m biased, burnt-out & cynical
‣ Filter my words accordingly
7.
‣ 1 Introduction
‣ 2 Business & Marketplace
‣ 3 Storage
‣ 4 Cloud
‣ 5 Hot for ’12 ...
9. Business & Meta Observations
More of the same in ’12 ...
‣ ~4 staff full time on issues involving data handling, data
management and multi-instrument Next-Gen
sequencing/analysis
‣ ~2 staff full time on infrastructure, storage and facility
related projects
• Dwan: Big infrastructure & facility projects for Fortune 20
companies, research consortia & .GOV customers
• Dag: 40% infrastructure, 20% storage, 20% cloud
‣ ~1 staff full time on Amazon Cloud projects
10. What that tells us
‣ Same problem(s) as last year
‣ Next-gen sequencing still
causing a lot of pain when it
comes to data handling,
storage, organization &
integration
‣ As sequencing continues to be
commoditized, this will likely
only get worse
13. Science-centric Storage
Why I’m not worried
‣ Peta-capable storage is trivial to acquire in 2012
‣ Scale-out NAS has won the battle
‣ It’s simply not as hard/risky as it used to be
15. OMG! The Sky Is Falling!
Maybe a little panic is appropriate ...
16. The sky IS falling!
Uncomfortable truths
‣ Cost of acquiring data (genomes)
falling faster than rate at which
industry is increasing drive capacity
‣ Human researchers downstream of
these datasets are also consuming
more storage (and less predictably)
‣ High-scale labs must react or
potentially have catastrophic issues
in 2012-2013
17. The sky IS falling!
Current Practices Are Not Sustainable
‣ FACT: Chemistry changing faster than we can refresh our
datacenters and research IT infrastructure
‣ FACT: Rate at which we can cheaply acquire interesting data
exceeds rate at which storage companies can increase the
capacity of their products
‣ FACT: We are poor at managing, tagging, valuing & curating our
data. Few scientists really understand true cost/complexity
involved with keeping data safe, online & accessible
‣ FACT: In 2012 people still think “keep everything online, forever”
is a viable demand to be making of IT staff
‣ FACT: Something is going to break. Soon.
19. The sky IS falling!
CRAM it in 2012 ...
‣ Minor improvements are useless; order-of-magnitude needed
‣ Some people are talking about radical new methods –
compressing against reference sequences and only storing the
diffs
• With a variable compression “quality budget” to spend on
lossless techniques in the areas you care about
‣ http://biote.am/5v - Ewan Birney on “Compressing DNA”
‣ http://biote.am/5w - The actual CRAM paper
‣ If CRAM takes off, storage landscape will change
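As a rough illustration of the "compress against a reference and only store the diffs" idea, here is a toy Python sketch. It is not the CRAM format or its codec: the function names `encode_read` and `decode_read` are hypothetical, only substitutions are handled (no indels, no quality scores), and the read's alignment position is assumed to be known already.

```python
def encode_read(reference, read, pos):
    """Store a read as (position, length, substitutions) relative to a reference.

    Toy model only: assumes the read aligns at `pos` with no insertions or deletions.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    # Keep only the offsets where the read disagrees with the reference.
    diffs = [(i, base)
             for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return pos, len(read), diffs


def decode_read(reference, record):
    """Rebuild the original read from the reference plus the stored diffs."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Made-up sequences for illustration; the read differs from the
# reference at a single base.
reference = "ACGTACGTTAGCCGATAGGCTTAACG"
read = "TAGCCGTTAGG"
record = encode_read(reference, read, 8)
restored = decode_read(reference, record)
```

The point of the sketch is only that a read that mostly matches the reference shrinks to a position, a length, and a short list of differences; the real savings depend on read quality, the reference used, and how much of the "quality budget" is spent on keeping base qualities.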
21. What comes next.
The same rules apply for 2012 and beyond ...
‣ Accept that science changes faster than IT infrastructure
‣ Be glad you are not Broad/Sanger/BGI/NCBI
‣ Flexibility, scalability and agility become the key
requirements of research informatics platforms
• Tiered storage is in your future ...
‣ Shared/concurrent access is still the overwhelming
storage use case
22. What comes next.
In the following year ...
‣ Many peta-scale capable systems deployed
• Most will operate in the hundreds-of-TBs range
‣ Far more aggressive “data triage”
‣ Genome compression via CRAM
‣ Even more data will sit untouched & unloved
‣ Growing need for tiers, HSM & even tape
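To make "data triage" and tiering a little more concrete, here is a minimal Python sketch of a placement policy. Everything in it is an assumption for illustration: the `Dataset` fields, the `choose_tier` function, the tier names, and the 90-day and 365-day thresholds are hypothetical and would need to come from your own site policy.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    days_since_access: int
    reproducible: bool  # could the data be regenerated cheaply if lost?


def choose_tier(ds, warm_after_days=90, cold_after_days=365):
    """Pick a storage tier from access age and reproducibility.

    Thresholds and tier names are placeholders, not recommendations.
    """
    if ds.days_since_access < warm_after_days:
        return "primary scale-out NAS"
    if ds.reproducible:
        # Untouched data that can be regenerated is a triage candidate.
        return "triage: review for deletion"
    if ds.days_since_access < cold_after_days:
        return "nearline or cloud object store"
    return "tape or archive"


# Hypothetical datasets, purely for illustration.
examples = [
    Dataset("active-run", 3, False),
    Dataset("old-intermediate-files", 200, True),
    Dataset("published-raw-data", 200, False),
    Dataset("legacy-project", 900, False),
]
placements = {ds.name: choose_tier(ds) for ds in examples}
```

In practice the inputs would come from file system metadata or a metadata-aware storage framework rather than a hand-built list.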
23. What comes next.
In the following year ...
‣ Broad and others are paving the way with respect to
metadata-aware & policy driven storage frameworks
• And we’ll shamelessly copy a year or two later
‣ I’m still on my cloud storage kick
• Economics are inescapable; Will be built into storage
platforms, gateways & VMs
• Cloud object stores are only an HTTP RESTful call away
• Cloud will become “just another tier”
24. What comes next.
Expect your storage to be smarter & more capable ...
‣ What do DDN, Panasas, Isilon,
BlueArc, etc. have in common?
• Under the hood they all run
Unix or Unix-like OSes on
x86_64 architectures
‣ Some storage arrays can
already run applications natively
• More will follow
• Likely a big trend for 2012
26. The Road Ahead
Trends & Tips for 2012
‣ Peta-capable platforms required
‣ Scale-out NAS still the best fit
‣ Customers will no longer build one
big scale-out NAS tier
‣ My ‘hack’ of using nearline spec
storage as primary science tier is
obsolete in ’12
‣ pNFS mainstream in 2012?
‣ Not everything is worth backing up
‣ Expect disruptive stuff
27. The Road Ahead
Trends & Tips for 2012
‣ Your storage will be able to run apps
• Dedupe, cloud gateways &
replication
• ‘CRAM’ or similar compression
• Storage Resource Brokers
(iRODS) & metadata servers
• HDFS/Hadoop hooks?
• Lab, Data management & LIMS
applications
[Image: Drobo appliance running BioTeam MiniLIMS internally]
28. The Road Ahead
Trends & Tips for 2012
‣ Hadoop / MapReduce / BigData
• Just like GRID and CLOUD the
space is being over-hyped
• You still need to think about it
• ... and have a roadmap for doing it
• Deep, deep ties to your storage
• Your users want/need it
• My $.02? Fantastic cloud use case
32. Storage Future Feels Like This ...
Multiple Tiers, Multiple Vendors, Multiple Products
33. The ‘C’ word
Does a Bio-IT talk exist if it does not mention “the cloud”?
34. Cloud Stuff
‣ Before I make some blunt comments ...
‣ I am not an Amazon Cloud shill
‣ I am a jaded, cynical, zero-loyalty consumer of IT
services and products that let me get work done
‣ Because I only get paid when my solutions work, I am
picky about what tools I keep in my toolkit
‣ Amazon Web Services is a fantastic tool
40. Amazon is the IaaS Cloud Leader
‣ Why Amazon is attractive for infrastructure clouds:
• Anyone can do virtual servers and block/object storage
• Bio-IT needs “more stuff” in order to get real work done
• AWS product & service stack (“the glue”) is far more
comprehensive than that of any other cloud competitor
- Need some examples?
- ElasticIP, VPC, IAM, SQS, SNS, SES, SimpleDB,
DynamoDB, CloudFormation, ElasticBeanstalk, SWF,
DirectConnect, etc.
41. Amazon Cloud Dominance Could Be A Good Thing
‣ Amazon Cloud Dominance May Be Good For Bio-IT
‣ The competition must innovate in really interesting ways
in order to compete. This is already happening.
• Purpose-built platforms for regulated/compliant operation
• “Hands-on” Managed Services for Healthcare/Pharma
• Hybrid on-premise/off-premise solutions
• Full life science solution & software service stacks
• Bespoke Service Level Agreements (SLAs)
• ...
43. Private Clouds in 2012:
‣ I’m no longer dismissing them as “useless”
‣ Usable & useful in certain situations
‣ Hype vs. Reality ratio still unbalanced
‣ Sensible only for certain environments
• Have you seen what you have to do
to your networks & gear?
‣ There are easier ways
44. Private Clouds: My Advice for ’12
‣ Remain cynical (test vendor claims)
‣ Due Diligence still essential
‣ I personally would not deploy anything that does not
explicitly provide Amazon API compatibility
45. Private Clouds: My Advice for ’12
Most people are better off:
1. Adding VM platforms to existing HPC clusters &
environments
2. Extending enterprise VM platforms to allow user self-
service & server catalogs
47. Cloud Advice
Don’t get left behind
‣ Research IT Organizations need a cloud strategy today
‣ Those that don’t will be bypassed by frustrated users
‣ IaaS cloud services are only a departmental credit card
away ... and some senior scientists are too big to be fired
for violating IT policy
48. Cloud Advice
Design Patterns
‣ You will need three tested cloud design patterns:
‣ (1) To handle ‘legacy’ scientific apps & workflows
‣ (2) The special stuff that is worth re-architecting
‣ (3) Hadoop & big data analytics
49. Cloud Advice
(1) Legacy HPC on the Cloud
‣ MIT StarCluster
• http://web.mit.edu/star/cluster/
‣ This is your baseline for legacy apps on ‘the cloud’
‣ Extend as needed
50. Cloud Advice
(2) “Cloudy” HPC
‣ Some of our research workflows are important enough to
be rewritten for “the cloud” and the advantages that a
truly elastic & API-driven infrastructure can deliver
‣ This is where you have the most freedom
‣ Many published best practices you can borrow
‣ Good commercial options: Cycle Computing, BT, etc.
51. Cloud Advice
(3) Big Data HPC
‣ It will be a MapReduce world, get used to it
‣ Little need to roll your own Hadoop in 2012
‣ ISV & commercial ecosystem already healthy
‣ Multiple providers today; both onsite & cloud-based
‣ Often an excellent cloud use case
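For anyone who has not yet seen the MapReduce pattern, here is a toy Python sketch that counts k-mers across reads using explicit map, shuffle, and reduce steps. It runs in a single process and does not use Hadoop or any Hadoop API; the function names and the sample reads are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def count_kmers(reads, k):
    pairs = (pair for read in reads for pair in map_phase(read, k))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two made-up reads; each contains the same four 3-mers.
counts = count_kmers(["ACGTAC", "GTACGT"], k=3)
```

A real framework distributes the map and reduce calls across many nodes and handles the shuffle, scheduling, and failures for you, which is why the ties to storage run so deep.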
53. Cloud Data Movement
‣ Over several years we have participated in a number of
large “cloud data movement” efforts
‣ We used to be big fans of physical media movement
‣ However ...
55. Cloud Data Movement
‣ At first glance, physical data movement “seems easy”
‣ It’s not. It is hard to do correctly and requires significant
human effort and operational resources
‣ This has been a hard lesson learned over several years
‣ We have a new strategy for 2012 and the next image
shows why ...
57. Cloud Data Movement
Wow!
‣ With a 1GbE internet connection ...
‣ and using Aspera software ...
‣ We sustained 700 Mb/sec for more than 7 hours
freighting genomes into Amazon Web Services
‣ This is fast enough for many use cases, including
genome sequencing core facilities*
‣ Chris Dwan’s webinar on this topic:
http://biote.am/7e
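For a sense of scale, the sustained rate quoted above can be turned into a data volume with a few lines of Python. This is back-of-the-envelope arithmetic only: it assumes decimal units (1 Mb = 10^6 bits, 1 TB = 10^12 bytes), treats the transfer as exactly 7 hours at exactly 700 Mb/sec, and the helper name `terabytes_moved` is hypothetical.

```python
def terabytes_moved(rate_megabits_per_s, hours):
    """Convert a sustained rate in Mb/sec over a duration into decimal terabytes."""
    bytes_per_s = rate_megabits_per_s * 1_000_000 / 8
    return bytes_per_s * hours * 3600 / 1e12


moved_tb = terabytes_moved(700, 7)   # about 2.2 TB in 7 hours
per_day_tb = terabytes_moved(700, 24)  # about 7.6 TB if sustained for a full day
link_utilization = 700 / 1000        # fraction of a 1GbE link's nominal rate
```

Under those assumptions the transfer moved roughly 2.2 TB, using about 70% of the nominal 1GbE line rate.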
58. Cloud Data Movement
Wow!
‣ Results like this mean we now favor network-based data
movement over physical media movement
‣ Large-scale physical data movement carries a high
operational burden and consumes non-trivial staff time &
resources
‣ *Unclear if our experience holds true for Asia or
Asia-EU-Americas data transfers
59. Cloud Data Movement
There are three ways to do network data movement ...
‣ (1) Buy software from Aspera and be done with it
‣ (2) Attend the annual SuperComputing conference & see
which student group wins the bandwidth challenge
contest; use their code
‣ (3) Get GridFTP from the Globus folks
• Trend: At every single “data movement” talk I’ve been to in
2011 it seemed that any speaker who was NOT using Aspera
was a very happy user of GridFTP. #notCoincidence
61. Hot for ’12
BioTeam side projects & research interests
‣ Like to wrap up with some topics we think are
interesting
‣ Who knows? These might be trends for 2013!
62. Siri Voice Control of Instruments/Pipelines
‣ BioTeam recently revealed
work with BT and Accelrys
‣ Demonstrated Siri voice
control of a Pipeline Pilot
experiment running in the BT
Compute Cloud
‣ http://biote.am/7h
‣ We expect to continue doing
cool things with Siri in ’12
63. Smart Storage & Lab-local Appliances
‣ I firmly expect the “storage
arrays running apps & VMs”
trend to go mainstream
‣ This has beneficial implications
for life science informatics
‣ We’ll be hitting this topic hard
on systems ranging from Drobo
to DataDirect
‣ Also working with the Intel
Modular Server concept
64. Lab Local Appliances
Intel Modular Server
‣ Interesting hardware
combination; storage +
servers + native
hypervisor
‣ VM Pool 1: MiniLIMS +
other useful lab software
‣ VM Pool 2: Amazon
Storage Gateway
Appliance
http://biote.am/7i
‣ Server Blade 3:
BrightCluster HPC Stack
65. Cloud, Community & Orchestration
‣ The emerging class of “DevOps” and “Infrastructure
Automation” methods are incredibly interesting
• We love Opscode & Chef (http://opscode.com)
‣ We’ll be doing more with systems orchestration in ’12
• And hopefully expanding our community collection of
useful Chef cookbooks for life science informatics
‣ We also still love MIT StarCluster and will hopefully be
contributing plugins and enhancements