13. Private Clouds in 2012:
• Hype vs. Reality ratio still wacky
• Sensible only for certain shops
• Have you seen what you have to do to your networks & gear?
• There are easier ways
14. Private Clouds: My Advice for '12
• Remain cynical (test vendor claims)
• Due Diligence still essential
• I personally would not deploy/buy anything that does not explicitly provide Amazon API compatibility
15. Private Clouds: My Advice for '12
• Most people are better off:
• Adding VM platforms to existing HPC clusters & environments
• Extending enterprise VM platforms to allow user self-service & server catalogs
23. • Lots of aggressive marketing
• Lots of carefully constructed “case studies” and prototypes
• The truth?
• Less usable than you've been told
• Possible? Heck yeah.
• Practical? Only sometimes.
24. • Advice
• Be cynical
• Demand proof
• Test carefully
25. • Still want to do it?
• Buy it, don't build it
• Cycle Computing
• Univa
• Bright Computing
• …
26. • Follow the crowd
• In the real world we see:
• Separation between local and cloud HPC resources
• Send your work to the system most suitable
31. • In life science informatics we have hundreds of codes that will never be rewritten.
• We'll be needing them for years to come.
32. • Advice:
• MapReduce-ish methods are the future for big-data informatics
• It will take years to get there
• We still have to deal with legacy algorithms and codes
33. • You will need:
• A process for figuring out when it's worthwhile to rewrite/re-architect
• Tested cloud strategies for handling three use cases
34. You need 3 cloud architectures:
1. Legacy HPC
2. “Cloudy” HPC
3. Big Data HPC (Hadoop)
35. Legacy HPC on the cloud
• MIT StarCluster
• http://web.mit.edu/star/cluster/
• This is your baseline
• Extend as needed
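As a starting point, StarCluster is driven by a single config file plus a one-line launch command. The sketch below is illustrative only; the section names follow StarCluster's documented config format, but the key paths, AMI ID, and cluster name are placeholder values you would replace with your own:

```ini
; ~/.starcluster/config -- minimal illustrative cluster definition
[aws info]
AWS_ACCESS_KEY_ID = <your-access-key>
AWS_SECRET_ACCESS_KEY = <your-secret-key>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 2
NODE_IMAGE_ID = <starcluster-ami-id>
NODE_INSTANCE_TYPE = m1.small
```

With that in place, `starcluster start smallcluster` brings up the cluster; extend the template (more nodes, EBS volumes, plugins) as needed.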
36. “Cloudy” HPC
• Use this method when …
• It makes sense to rewrite or rearchitect an HPC workflow to better leverage modern cloud capabilities
37. “Cloudy” HPC, continued
• Ditch the legacy compute farm model
• Leverage elastic scale-out tools (***)
• Spot Instances for elastic & cheap compute
• SimpleDB for job statekeeping
• SQS for job queues & workflow “glue”
• SNS for message passing & monitoring
• S3 for input & output data
• Etc.
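The division of labor above can be sketched in plain Python using local stand-ins for the AWS pieces (a queue in place of SQS, a dict in place of SimpleDB); all names here are hypothetical, and the point is only the pattern, not the real boto calls:

```python
import queue

# Local stand-ins for the AWS services named above (illustrative only):
job_queue = queue.Queue()   # plays the role of SQS: job queue & workflow "glue"
job_state = {}              # plays the role of SimpleDB: per-job statekeeping

def submit_job(job_id, payload):
    """Producer side: record initial state, then enqueue the work."""
    job_state[job_id] = "queued"
    job_queue.put((job_id, payload))

def worker():
    """Consumer side: an elastic (e.g. spot-instance) worker polls the queue."""
    while not job_queue.empty():
        job_id, payload = job_queue.get()
        job_state[job_id] = "running"
        result = payload.upper()   # stand-in for the real compute step
        # In the real pattern, results land in S3 and a completion
        # notification goes out via SNS.
        job_state[job_id] = "done"
        yield job_id, result

submit_job("job-1", "acgt")
results = dict(worker())
print(job_state)   # {'job-1': 'done'}
print(results)     # {'job-1': 'ACGT'}
```

Because state lives in a shared store rather than on any one node, workers can be added or removed freely, which is what makes the elastic scale-out model work.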
38. Big Data HPC
• It's gonna be a MapReduce world
• Little need to roll your own
• Ecosystem already healthy
• Multiple providers today
• Often a slam-dunk cloud use case
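For readers new to the model, a minimal word-count sketch shows the two phases Hadoop and its relatives industrialize; this is a toy illustration of MapReduce semantics, not Hadoop code:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input record.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle/reduce: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["gene variant gene", "variant calling"]
word_counts = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(word_counts)  # {'gene': 2, 'variant': 2, 'calling': 1}
```

A framework like Hadoop runs the map calls in parallel across a cluster and handles the shuffle between phases, which is why there is little need to roll your own.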
46. • Consistently getting easier
• Amazon is not a bottleneck
• AWS Import/Export
• AWS Direct Connect
• Aspera has some amazing stuff out right now
47. • Advice
• AWS Import/Export works well
• Size of pipe is not everything
• Sweat the small stuff
• Tracking, checksums, disk speed
• Dedicated workstations
• Secure media storage
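"Sweat the small stuff" mostly means tracking and verification. One minimal sketch of the idea, using Python's standard hashlib (the function and manifest names are our own, not from any transfer tool):

```python
import hashlib
import os

def file_checksum(path, algo="md5", chunk_size=1 << 20):
    """Stream a file through a hash in chunks so large files don't fill RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def manifest(paths):
    """Simple tracking manifest: path -> (size in bytes, checksum)."""
    return {p: (os.path.getsize(p), file_checksum(p)) for p in paths}
```

Build a manifest before the data leaves, rebuild it after arrival, and compare; any mismatch flags a file to re-send.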
51. • Advice for 2012
• BioTeam is dialing down our advocacy of physical data ingestion into the cloud
• Why?
• Operationally hard, expensive, and no longer strictly needed
54. • People trying to move data via physical media quickly realize the operational difficulties
• Bandwidth is cheaper than hiring another body to manage physical data ingestion & movement
• In 2012 we strongly recommend network-based data movement when at all possible
60. • Not much we can do except engineer around it
• AWS compute cluster instances are a huge step forward
• AWS competitors take note
61. • We are not database nerds
• We care about more than just random I/O performance
• We need it all
• Random I/O
• Long sequential read/write
62. • Faster Storage Options
• Software RAID on EBS
• Various GlusterFS options
• Even if you optimize everything, the virtual NICs are still a bottleneck
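The software-RAID-on-EBS option boils down to striping several attached volumes into one device with mdadm. A rough how-to fragment, where the device names and volume count are examples only and will differ on your instances:

```shell
# Illustrative only -- stripe four attached EBS volumes into RAID 0
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
mkfs.ext4 /dev/md0
mkdir -p /data && mount /dev/md0 /data
```

RAID 0 here buys aggregate throughput, not durability; and as noted above, even a well-tuned stripe set is ultimately capped by the virtual NIC.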
63. • Big Shared Storage
• 10GbE nodes and NFS
• Software RAID sets
• GlusterFS or similar
• 2012: pNFS finally?