Trends from the Trenches (Singapore Edition)

Trends from the Trenches
2012 Bio-IT World Asia, Singapore

1

I’m Chris.

I’m an infrastructure geek.

I work for the BioTeam.

2

BioTeam
Who, what & why

‣ Independent consulting shop
‣ Staffed by scientists forced to
learn IT, SW & HPC to get our
own research done
‣ 10+ years bridging the “gap”
between science, IT & high
performance computing

3

BioTeam
Why we get invited to these sorts of talks ...

‣ Lots of people hire us across
a wide range of project types
• Pharma, Biotech, EDU,
Nonprofit, .Gov, .Mil, etc.
‣ We get to see how groups of
smart people approach similar
problems
‣ We can speak honestly &
objectively about what we see
“in the real world”
4

Listen to me at your own risk
Seriously.

‣ I’m not an expert, pundit,
visionary or “thought leader”
‣ All career success entirely due
to shamelessly copying what
actual smart people do
‣ I’m biased, burnt-out & cynical
‣ Filter my words accordingly

6

1. Introduction
2. Business & Marketplace
3. Storage
4. Cloud
5. Hot for ’12 ...

7

Business Landscape
So far 2012 feels a lot like 2011 ...

8

Business & Meta Observations
More of the same in ’12 ...

‣ ~4 staff full time on issues involving data handling, data
management and multi-instrument Next-Gen
sequencing/analysis
‣ ~2 staff full time on infrastructure, storage and facility
related projects
• Dwan: Big infrastructure & facility projects for Fortune 20
companies, research consortia & .GOV customers
• Dag: 40% infrastructure, 20% storage, 20% cloud

‣ ~1 staff full time on Amazon Cloud projects
9

What that tells us

‣ Same problem(s) as last year
‣ Next-gen sequencing still
causing a lot of pain when it
comes to data handling,
storage, organization &
integration
‣ As sequencing continues to be
commoditized, this will likely
only get worse
10

Science-centric Storage
Current State Assessment

‣ Storage still making me crazy in ’12

12

Science-centric Storage
Why I’m not worried

‣ Peta-capable storage is trivial to acquire in 2012
‣ Scale-out NAS has won the battle
‣ It’s simply not as hard/risky as it used to be

13

On the other hand ...

14

OMG! The Sky Is Falling!
Maybe a little panic is appropriate ...

15

The sky IS falling!
Uncomfortable truths

‣ Cost of acquiring data (genomes)
falling faster than rate at which
industry is increasing drive capacity
‣ Human researchers downstream of
these datasets are also consuming
more storage (and less predictably)
‣ High-scale labs must react or
potentially have catastrophic issues
in 2012-2013

16

The sky IS falling!
Current Practices Are Not Sustainable

‣ FACT: Chemistry changing faster than we can refresh our
datacenters and research IT infrastructure
‣ FACT: Rate at which we can cheaply acquire interesting data
exceeds rate at which storage companies can increase the
capacity of their products
‣ FACT: We are poor at managing, tagging, valuing & curating our
data. Few scientists really understand true cost/complexity
involved with keeping data safe, online & accessible
‣ FACT: In 2012 people still think “keep everything online, forever”
is a viable demand to be making of IT staff
‣ FACT: Something is going to break. Soon.
17

The sky IS falling!
CRAM it in 2012 ...

‣ Minor improvements are useless; order-of-magnitude needed
‣ Some people are talking about radical new methods –
compressing against reference sequences and only storing the
diffs
• With a variable compression “quality budget” to spend on
lossless techniques in the areas you care about
‣ http://biote.am/5v - Ewan Birney on “Compressing DNA”
‣ http://biote.am/5w - The actual CRAM paper
‣ If CRAM takes off, storage landscape will change
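
A toy sketch of the "compress against a reference and store only the diffs" idea above. This is not the CRAM format itself: the function names and sample sequences are made up for illustration, and it assumes an ungapped read whose alignment position is already known.

```python
def encode_read(reference, read, pos):
    """Return (pos, length, diffs); diffs lists (offset, base) mismatches only."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), diffs


def decode_read(reference, record):
    """Rebuild the read from the reference plus the stored mismatches."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Hypothetical sequences, for illustration only.
REFERENCE = "ACGTACGTTAGCCGATAGGCTTAACG"
READ = "TAGCCGTTAGG"

record = encode_read(REFERENCE, READ, 8)
print(record)                          # position, length and one mismatch
print(decode_read(REFERENCE, record))  # round-trips to the original read
```

An 11-base read collapses to a position, a length and a single mismatch. The savings grow with how closely the reads match the reference.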
19

Storage: What comes next?
Next 18 months will be really fun...
20

What comes next.
The same rules apply for 2012 and beyond ...

‣ Accept that science changes faster than IT infrastructure
‣ Be glad you are not Broad/Sanger/BGI/NCBI
‣ Flexibility, scalability and agility become the key
requirements of research informatics platforms
• Tiered storage is in your future ...
‣ Shared/concurrent access is still the overwhelming
storage use case

21

What comes next.
In the following year ...

‣ Many peta-scale capable systems deployed
• Most will operate in the hundreds-of-TBs range
‣ Far more aggressive “data triage”
‣ Genome compression via CRAM
‣ Even more data will sit untouched & unloved
‣ Growing need for tiers, HSM & even tape
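
"Data triage" and tiering eventually have to be written down as policy. A minimal sketch, assuming last-access age is the only signal; the tier names, the 30/180-day thresholds and the file names are hypothetical placeholders, not recommendations from this talk.

```python
from datetime import datetime, timedelta

# Hypothetical policy: (maximum idle time, tier name), checked in order.
TIER_POLICY = [
    (timedelta(days=30), "scale-out NAS"),
    (timedelta(days=180), "nearline disk"),
]
COLD_TIER = "tape/HSM"


def choose_tier(last_access, now):
    """Pick the first tier whose idle-time limit the file still satisfies."""
    idle = now - last_access
    for max_idle, tier in TIER_POLICY:
        if idle <= max_idle:
            return tier
    return COLD_TIER


now = datetime(2012, 6, 1)
examples = {
    "run_042.bam": datetime(2012, 5, 20),
    "run_017.bam": datetime(2012, 2, 1),
    "pilot_2011.bam": datetime(2011, 6, 1),
}
for name, last_access in examples.items():
    print(f"{name}: {choose_tier(last_access, now)}")
```

A real policy would also weigh metadata such as project, owner and the cost of regenerating the data.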

22

What comes next.
In the following year ...

‣ Broad and others are paving the way with respect to
metadata-aware & policy driven storage frameworks
• And we’ll shamelessly copy a year or two later
‣ I’m still on my cloud storage kick
• Economics are inescapable; Will be built into storage
platforms, gateways & VMs
• Cloud object stores are only a RESTful HTTP call away
• Cloud will become “just another tier”
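
To make "only an HTTP call away" concrete, here is a sketch that builds (but does not send) a PUT request using only the Python standard library. The URL is a hypothetical placeholder, and authentication is left out; real object stores need signed headers or a pre-signed URL.

```python
import urllib.request


def build_upload_request(url, payload):
    """Build an HTTP PUT that would store `payload` at `url`."""
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


# Hypothetical object URL and payload, for illustration only.
request = build_upload_request(
    "https://objects.example.com/archive/run_042.cram",
    b"example bytes",
)
print(request.get_method(), request.full_url)
# Sending it would be one more call: urllib.request.urlopen(request)
```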

23

What comes next.
Expect your storage to be smarter & more capable ...

‣ What do DDN, Panasas, Isilon,
BlueArc, etc. have in common?
• Under the hood they all run
Unix or Unix-like OS’s on
x86_64 architectures
‣ Some storage arrays can
already run applications natively
• More will follow
• Likely a big trend for 2012
24

Storage: The road ahead
My $.02 for 2012...
25

The Road Ahead
Trends & Tips for 2012

‣ Peta-capable platforms required
‣ Scale-out NAS still the best fit
‣ Customers will no longer build one
big scale-out NAS tier
‣ My ‘hack’ of using nearline spec
storage as primary science tier is
obsolete in ’12
‣ pNFS mainstream in 2012?
‣ Not everything is worth backing up
‣ Expect disruptive stuff
26

The Road Ahead

‣ Your storage will be able to run apps
• Dedupe, cloud gateways &
replication
• ‘CRAM’ or similar compression
• Storage Resource Brokers
(iRODS) & metadata servers
• HDFS/Hadoop hooks?
• Lab, Data management & LIMS
applications

Drobo Appliance running
BioTeam MiniLIMS internally...

27

The Road Ahead

‣ Hadoop / MapReduce / BigData
• Just like GRID and CLOUD the
space is being over-hyped
• You still need to think about it
• ... and have a roadmap for doing it
• Deep, deep ties to your storage
• Your users want/need it
• My $.02? Fantastic cloud use case

28

Disruptive Storage Example

29

Backblaze Pod For Biotech

30

100 Terabytes for $12,000 USD
http://bioteam.net/tag/backblaze/
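
The headline figure as unit cost, using decimal units (1 TB = 1000 GB). It covers only the quoted purchase price; power, space and staff time are not included.

```python
CAPACITY_TB = 100    # from the slide
COST_USD = 12_000    # from the slide

cost_per_tb = COST_USD / CAPACITY_TB
cost_per_gb = cost_per_tb / 1000  # decimal units: 1 TB = 1000 GB

print(f"${cost_per_tb:.0f} per TB, ${cost_per_gb:.2f} per GB")
```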

31

Storage Future Feels Like This ...
Multiple Tiers, Multiple Vendors, Multiple Products

32

The ‘C’ word
Does a Bio-IT talk exist if it does not mention “the cloud”?
33

Cloud Stuff

‣ Before I make some blunt comments ...
‣ I am not an Amazon Cloud shill
‣ I am a jaded, cynical, zero-loyalty consumer of IT
services and products that let me get work done
‣ Because I only get paid when my solutions work, I am
picky about what tools I keep in my toolkit
‣ Amazon Web Services is a fantastic tool

34

So you think
you have a cloud?

No self-service?
Not a cloud.

Installing VMware
& issuing a press release?
Not a cloud.

Block storage
and virtual servers only?

(Barely) a cloud.

Amazon is the IaaS Cloud Leader

‣ Why Amazon is attractive for infrastructure clouds:
• Anyone can do virtual servers and block/object storage
• Bio-IT needs “more stuff” in order to get real work done
• AWS product & service stack (“the glue”) is far more
comprehensive than any other cloud competitors
- Need some examples?
- ElasticIP, VPC, IAM, SQS, SNS, SES, SimpleDB,
DynamoDB, CloudFormation, ElasticBeanstalk, SWF,
DirectConnect, etc.

40

Amazon Cloud Dominance Could Be A Good Thing

‣ Amazon Cloud Dominance May Be Good For Bio-IT
‣ The competition must innovate in really interesting ways
in order to compete. This is already happening.
• Purpose-built platforms for regulated/compliant operation
• “Hands-on” Managed Services for Healthcare/Pharma
• Hybrid on-premise/off-premise solutions
• Full life science solution & software service stacks
• Bespoke Service Level Agreements (SLAs)
• ...
41

Private Clouds
My $.02
42

Private Clouds in 2012:

‣ I’m no longer dismissing them as “useless”
‣ Usable & useful in certain situations
‣ Hype vs. Reality ratio still unbalanced
‣ Sensible only for certain environments
• Have you seen what you have to do
to your networks & gear?
‣ There are easier ways

Private Clouds: My Advice for ’12

‣ Remain cynical (test vendor claims)
‣ Due Diligence still essential
‣ I personally would not deploy anything that does not
explicitly provide Amazon API compatibility

Private Clouds: My Advice for ’12

Most people are better off:
1. Adding VM platforms to existing HPC clusters &
environments
2. Extending enterprise VM platforms to allow user self-
service & server catalogs

Cloud Advice
My $.02
46

Cloud Advice
Don’t get left behind

‣ Research IT Organizations need a cloud strategy today
‣ Those that don’t will be bypassed by frustrated users
‣ IaaS cloud services are only a departmental credit card
away ... and some senior scientists are too big to be fired
for violating IT policy

47

Cloud Advice
Design Patterns

‣ You will need three tested cloud design patterns:

‣ (1) To handle ‘legacy’ scientific apps & workflows
‣ (2) The special stuff that is worth re-architecting
‣ (3) Hadoop & big data analytics

48

Cloud Advice
(1) Legacy HPC on the Cloud

‣ MIT StarCluster
• http://web.mit.edu/star/cluster/
‣ This is your baseline for legacy apps on ‘the cloud’
‣ Extend as needed

49

Cloud Advice
(2) “Cloudy” HPC

‣ Some of our research workflows are important enough to
be rewritten for “the cloud” and the advantages that a
truly elastic & API-driven infrastructure can deliver
‣ This is where you have the most freedom
‣ Many published best practices you can borrow
‣ Good commercial options: Cycle Computing, BT, etc.

50

Cloud Advice
(3) Big Data HPC

‣ It will be a MapReduce world, get used to it
‣ Little need to roll your own Hadoop in 2012
‣ ISV & commercial ecosystem already healthy
‣ Multiple providers today; both onsite & cloud-based
‣ Often an excellent cloud use case
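
For anyone who has not yet seen the MapReduce pattern, a toy single-process sketch: counting k-mers across reads with explicit map, shuffle and reduce steps. This is plain Python, not Hadoop; the function names and reads are made up for illustration.

```python
from collections import defaultdict


def map_kmers(read, k):
    """Map step: emit (k-mer, 1) for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical reads, for illustration only.
READS = ["ACGTAC", "GTACGT"]
pairs = (pair for read in READS for pair in map_kmers(read, 3))
counts = reduce_counts(shuffle(pairs))
print(counts)
```

In a real framework the map and reduce functions run on many nodes and the shuffle moves data between them, which is why the storage ties run so deep.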

51

Cloud Data Movement
My $.02
52

Cloud Data Movement

‣ Over several years we have participated in a number of
large “cloud data movement” efforts
‣ We used to be big fans of physical media movement
‣ However ...

53

Physical Data Movement Is Not Easy.

54

Cloud Data Movement

‣ At first glance, physical data movement “seems easy”
‣ It’s not. It is hard to do correctly and requires significant
human effort and operational resources
‣ This has been a hard lesson learned over several years
‣ We have a new strategy for 2012 and the next image
shows why ...

55

Cloud Data Movement
Wow!

‣ With a 1GbE internet connection ...
‣ and using Aspera software ....
‣ We sustained 700 Mb/sec for more than 7 hours
freighting genomes into Amazon Web Services
‣ This is fast enough for many use cases, including
genome sequencing core facilities*
‣ Chris Dwan’s webinar on this topic:
http://biote.am/7e
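
What the sustained rate above works out to, assuming "Mb" means megabits and using decimal units. The slide says "more than 7 hours", so treat the total as a lower bound.

```python
RATE_MEGABITS_PER_SEC = 700  # from the slide
DURATION_HOURS = 7           # the slide says "more than 7 hours"

bytes_per_sec = RATE_MEGABITS_PER_SEC * 1_000_000 / 8
total_bytes = bytes_per_sec * DURATION_HOURS * 3600
total_tb = total_bytes / 1e12  # decimal terabytes

print(f"{bytes_per_sec / 1e6:.1f} MB/s, about {total_tb:.1f} TB in {DURATION_HOURS} hours")
```

That is roughly 70% of the nominal 1GbE line rate, held for the whole transfer.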

57

Cloud Data Movement
Wow!

‣ Results like this mean we now favor network-based data
movement over physical media movement
‣ Large-scale physical data movement carries a high
operational burden and consumes non-trivial staff time &
resources
‣ *Unclear if our experience holds true for Asia or
Asia-EU-Americas data transfers

58

Cloud Data Movement
There are three ways to do network data movement ...

‣ (1) Buy software from Aspera and be done with it
‣ (2) Attend the annual SuperComputing conference & see
which student group wins the bandwidth challenge
contest; use their code
‣ (3) Get GridFTP from the Globus folks
• Trend: At every single “data movement” talk I’ve been to in
2011 it seemed that any speaker who was NOT using Aspera
was a very happy user of GridFTP. #notCoincidence

59

Hot topics for 2012 ...
60

Hot for ’12
BioTeam side projects & research interests

‣ Like to wrap up with some topics we think are
interesting
‣ Who knows? These might be trends for 2013!

61

Siri Voice Control of Instruments/Pipelines

‣ BioTeam recently revealed
work with BT and Accelrys
‣ Demonstrated Siri voice
control of a Pipeline Pilot
experiment running in the BT
Compute Cloud
‣ http://biote.am/7h
‣ We expect to continue doing
cool things with Siri in ’12
62

Smart Storage & Lab-local Appliances

‣ I firmly expect the “storage
arrays running apps & VMs”
trend to go mainstream
‣ This has beneficial implications
for life science informatics
‣ We’ll be hitting this topic hard
on systems ranging from Drobo
to DataDirect
‣ Also working with the Intel
Modular Server concept
63

Lab Local Appliances
Intel Modular Server

‣ Interesting hardware
combination; storage +
servers + native
hypervisor
‣ VM Pool 1: MiniLIMS +
other useful lab software
‣ VM Pool 2: Amazon
Storage Gateway
Appliance
http://biote.am/7i
‣ Server Blade 3:
BrightCluster HPC Stack
64

Cloud, Community & Orchestration

‣ The emerging class of “DevOps” and “Infrastructure
Automation” methods are incredibly interesting
• We love Opscode & Chef (http://opscode.com)
‣ We’ll be doing more with systems orchestration in ’12
• And hopefully expanding our community collection of
useful Chef cookbooks for life science informatics
‣ We also still love MIT StarCluster and will hopefully be
contributing plugins and enhancements
65

Thanks!
Slides online at: http://slideshare.net/chrisdag/

66
