Research Computing@Broad
An Update: Bio-IT World Expo
April, 2015
Chris Dwan (cdwan@broadinstitute.org)
Director, Research Computing and Data Services
Take Home Messages
• Go ahead and do the legwork to federate your
environment to at least one public cloud.
– It’s “just work” at this point.
• Spend a lot of time understanding your data lifecycle, then
stuff the overly bulky 95% of it in an object store fronted
by a middleware application.
– The latency sensitive, constantly used bits fit in RAM
• Human issues of consent, data privacy, and ownership
are still the hardest part of the picture.
– We must learn to work together from a shared, standards based framework
– The time is now.
The world is quite ruthless in selecting between
the dream and the reality, even where we will not.
Cormac McCarthy, All the Pretty Horses
• The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Fifty core faculty members and hundreds of associate
members from MIT and Harvard
• ~1000 research and administrative personnel, plus
~2,400+ associated researchers
Programs and Initiatives
focused on specific disease or biology areas
Cancer
Genome Biology
Cell Circuits
Psychiatric Disease
Metabolism
Medical and Population Genetics
Infectious Disease
Epigenomics
Platforms
focused on technological innovation and application
Genomics
Therapeutics
Imaging
Metabolite Profiling
Proteomics
Genetic Perturbation
The Broad Institute
• The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Twelve core faculty members and more than 200
associate members from MIT and Harvard
• ~1000 research and administrative personnel, plus
~1000 associated researchers
Programs and Initiatives
focused on specific disease or biology areas
Cancer
Genome Biology
Cell Circuits
Psychiatric Disease
Metabolism
Medical and Population Genetics
Infectious Disease
Epigenomics
Platforms
focused on technological innovation and application
Genomics
Therapeutics
Imaging
Metabolite Profiling
Proteomics
Genetic Perturbation
The Broad Institute
60+ Illumina HiSeq instruments, including 14 ‘X’ sequencers
700,000+ genotyped samples
~18PB unique data / ~30PB usable file storage
The HPC Environment
Shared Everything: A reasonable architecture
Network
• ~10,000 Cores of Linux
servers
• 10Gb/sec ethernet
backplane.
• All storage is available as
files (NAS) from all
servers.
Monitoring and metrics
Matt Nicholson
joins the Broad
Chris Dwan
joins the Broad
Monitoring and metrics
Matt Nicholson
joins the Broad
Gradual puppetization
Chris Dwan joins
the Broad
Monitoring and metrics
Matt Nicholson
joins the Broad
Gradual puppetization:
increased visibility
We’re pretty sure
that we actually
have ~15,000 cores
Chris Dwan joins
the Broad
Metal
Boot Image Provisioning (PXE / Cobbler, Kickstart)
Hardware Provisioning (UCS, Xcat)
Broad specific system configuration (Puppet)
User or execution environment (Dotkit, docker, JVM, Tomcat)
Bare Metal
OS and vendor patches (Red Hat / yum, plus satellite)
Private Cloud Public Cloud
Containerized
Wonderland
Management Stack: Bare Metal
Network topology (VLANS, et al)
Many specific technical
decisions do not matter, so long
as you choose something and
make it work (Dagdigian, 2015)
Shared Everything: Ugly reality
10 Gb/sec Network
• At least six discrete
compute farms running
at least five versions of
batch schedulers (LSF
and SGE)
• Nodes “shared” by mix-
and-match between
owners.
• Nine Isilon clusters
• Five Infinidat filers
• ~19 distinct storage
technologies.
Genomics
Platform
Cancer
Program
Shared
“Farm”
Overlapping usage = Potential I/O
bottleneck when multiple groups are
doing heavy analysis
Metal
Boot Image Provisioning (PXE / Cobbler, Kickstart)
Hardware Provisioning (UCS, Xcat)
Broad specific configuration (Puppet)
User or execution environment (Dotkit, docker, JVM, Tomcat)
Hypervisor OS
Instance Provisioning
(Openstack)
Bare Metal
OS and vendor patches (Red Hat / yum, plus satellite)
Private Cloud Public Cloud
Containerized
Wonderland
Configuration Stack: Now with Private Cloud!
Network topology (VLANS, et al)
Re-use everything possible
from the bare metal
environment.
While inserting things that make
our life easier.
Openstack (RHEL, Icehouse)
Openstack@Broad: The least cloudy cloud.
10 Gb/sec Network
Genomics
Platform
Cancer
Program
Shared
“Farm”
Least “cloudy” implementation
possible.
IT / DevOps staff as users
Simply virtualizing and
abstracting away hardware
from user facing OS.
Note that most former
problems remain intact
Incrementally easier to manage
with very limited staff (3 FTE
Linux admins).
Openstack monitoring / UI is primitive at best
Openstack: open issues
Excluded from our project:
– Software defined networking (Neutron)
– “Cloud” storage (Cinder / Swift)
– Monitoring / Billing (Ceilometer, Heat)
– High Availability on Controllers
Custom:
– Most deployment infrastructure / scripting, including DNS
– Network encapsulation
– Active Directory integration
– All core systems administration functions
Core Message:
– Do not change both what you do and how you do it at the same time.
– Openstack could have been a catastrophe without rather extreme project
scoping.
I need “telemetry,” rather than logs*.
Jisto: Software startup with smart, smart monitoring and potential for containerized
cycle harvesting a-la condor.
*Logs let you know why you crashed. Telemetry lets you steer.
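For the curious, here is a minimal sketch of that distinction (standard-library Python only, not Jisto's product): a log line explains a failure after the fact, while telemetry is a steady stream of measurements you can steer by while the system is still running.

```python
# A minimal sketch of "telemetry vs. logs": logs capture discrete events
# after the fact, telemetry is a continuous stream you can act on.
import logging
import os
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("farm")

def emit_telemetry(interval_sec: float = 5.0, samples: int = 3) -> None:
    """Periodically sample the 1-minute load average and emit it as a metric line."""
    for _ in range(samples):
        load1, _, _ = os.getloadavg()  # Unix/Linux load average
        # In practice this would go to a time-series store; printing keeps
        # the sketch self-contained.
        print(f"metric farm.loadavg.1min {load1:.2f} {int(time.time())}")
        time.sleep(interval_sec)

if __name__ == "__main__":
    try:
        emit_telemetry()
    except OSError as exc:
        # A log line tells you *why* something failed -- after the fact.
        log.error("telemetry sampling failed: %s", exc)
```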
Trust no one:
This machine was not actually the
hypervisor of anything.
Broad Institute
Firewall
NAS Filers Compute
Internet
Internet 2
Edge
Router
Router
We need elastic computing, a more
cloudy cloud.
Broad Institute
Firewall
NAS Filers Compute
Internet
Internet 2
Edge
Router
Router
Subnet: 10.200.x.x
Domain: openstack.broadinstitute.org
Hostname: tenant-x-x
Starting at the bottom of the stack
There is no good answer to the question of
“DNS in your private cloud”
Private BIND domains are your friend
The particular naming scheme does not
matter. Just pick a scheme
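As a concrete illustration of "just pick a scheme," the sketch below generates private BIND A records following the tenant-x-x / 10.200.x.x convention shown above. The helper is illustrative only, not Broad's actual tooling.

```python
# A minimal sketch: generate private BIND A records for OpenStack tenant
# instances under openstack.broadinstitute.org. The tenant/host numbering
# and 10.200.x.x addressing follow the slide; the code is hypothetical.

def tenant_record(tenant_id: int, host_id: int) -> str:
    """Return one BIND zone-file A record for a tenant instance."""
    hostname = f"tenant-{tenant_id}-{host_id}"   # e.g. tenant-3-1
    address = f"10.200.{tenant_id}.{host_id}"    # private subnet per slide
    return f"{hostname}  IN  A  {address}"

if __name__ == "__main__":
    # Emit a small zone fragment for tenant 3.
    for host in range(1, 4):
        print(tenant_record(3, host))
    # tenant-3-1  IN  A  10.200.3.1
    # tenant-3-2  IN  A  10.200.3.2
    # tenant-3-3  IN  A  10.200.3.3
```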
Broad Institute
Firewall
NAS Filers Compute
Internet
Internet 2
Edge
Router
Router
Amazon VPC
VPN
Endpoint
VPN
Endpoint
More Compute!
Subnet: 10.200.x.x
Domain: openstack.broadinstitute.org
Hostname: tenant-x-x
Subnet: 10.199.x.x
Domain: aws.broadinstitute.org
Network Engineering: You don’t
have to replace everything.
Broad Institute
Firewall
NAS Filers Compute
Internet
Internet 2
Edge
Router
Router
Amazon VPC
VPN
Endpoint
VPN
Endpoint
More Compute!
Subnet: 10.200.x.x
Domain: openstack.broadinstitute.org
Hostname: tenant-x-x
Subnet: 10.199.x.x
Domain: aws.broadinstitute.org
Ignore issues of latency, network transport costs,
and data locality for the moment. We’ll get to those
later.
Differentiate Layer 2 from Layer 3 connectivity.
We are not using Amazon Direct Connect. We don’t
need to, because AWS is routable via Internet 2.
Network Engineering Makes
Everything Simpler
Physical Network Layout: More bits!
Markley Data
Centers (Boston)
Internap Data
Centers
(Somerville)
Broad: Main St.
Broad: Charles St.
Between data centers:
• 80 Gb/sec dark fiber
Internet:
1 Gb/sec
10 Gb/sec
Internet 2
10 Gb/sec
100 Gb/sec
Failover Internet:
1 Gb/sec
Metro Ring:
• 20 Gb/sec
dark fiber
My Metal
Boot Image Provisioning (PXE / Cobbler, Kickstart)
Hardware Provisioning (UCS, Xcat)
Broad specific configuration (Puppet)
User or execution environment (Dotkit, docker, JVM, Tomcat)
Hypervisor OS
Instance Provisioning
(Openstack)
Bare Metal
OS and vendor patches (Red Hat / yum, plus satellite)
Private Cloud Public Cloud
Containerized
Wonderland
Configuration Stack: Private Hybrid Cloud!
Network topology (VLANS, et al)
Public Cloud
Infrastructure
Instance Provisioning
(CycleCloud)
CycleCloud provides straightforward, recognizable cluster
functionality with autoscaling and a clean management UI.
Do not be fooled by the 85 page “quick start guide,” it’s just a
cluster.
A social digression on cloud
resources
• Researchers are generally:
– Remarkably hardworking
– Responsible, good stewards of resources
– Not terribly engaged with IT strategy
These good character traits present social barriers to
cloud adoption
Researchers Need
– Guidance and guard rails.
– Confidence that they are not “wasting” resources
– A sufficiently familiar environment to get started
Multiple Public Clouds
Openstack
Batch Compute Farm: 2015 Edition
Production Farm Shared Research Farm
Two clusters, running the same batch
scheduler (Univa’s Grid Engine).
Production: Small number of humans
operating several production systems for
business critical data delivery.
Research: Many humans running ad-hoc tasks.
Multiple Public Clouds
Openstack
End State: Compute Clusters
Production Farm Shared Research Farm
A financially in-elastic portion of the
clusters is governed by traditional
fairshare scheduling.
Fairshare allocations change slowly
(month to month) based on
conversation, investment, and
discussion of both business and
emotional factors
This allows consistent budgeting,
dynamic exploration, and ad-hoc use
without fear or guilt.
Multiple public clouds support auto-scaling
queues for projects with funding and urgency
Openstack plus public clouds provides a
consistent capacity
End State: Compute Clusters
Production Farm Shared Research Farm
On a project basis, funds can be allocated
for truly elastic burst computing.
This allows business logic to drive delivery
based on investment
A financially in-elastic portion of the clusters
is governed by traditional fairshare
scheduling.
Fairshare allocations are changed slowly
(month to month, perhaps) based on
substantial conversation, investment, and
discussion of both business logic and
feelings.
This allows consistent budgeting, dynamic
exploration, and ad-hoc use without fear or
guilt.
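For readers unfamiliar with fairshare, here is a conceptual sketch of the idea (not Univa Grid Engine's actual algorithm): each group holds a slow-changing share of the inelastic capacity, and its dispatch priority falls as its recent usage exceeds that share.

```python
# A conceptual sketch of fairshare scheduling: groups hold share weights,
# and priority drops for groups that have over-consumed recently.

def fairshare_priority(shares: dict[str, float], usage: dict[str, float]) -> dict[str, float]:
    """Return a priority per group: entitled fraction divided by used fraction."""
    total_shares = sum(shares.values())
    total_usage = sum(usage.values()) or 1.0
    priority = {}
    for group, share in shares.items():
        entitled = share / total_shares
        used = usage.get(group, 0.0) / total_usage
        # Groups under their entitlement get priority > 1, over-users < 1.
        priority[group] = entitled / used if used > 0 else float("inf")
    return priority

if __name__ == "__main__":
    shares = {"genomics": 50, "cancer": 30, "shared": 20}  # slow-changing allocations
    usage = {"genomics": 70, "cancer": 10, "shared": 20}   # recent core-hours
    print(fairshare_priority(shares, usage))
    # genomics has over-consumed its share, so its priority falls below 1.
```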
Broad Institute
Amazon VPC
The term I’ve heard for this is
“intercloud.”
End State: Multiple Interconnected
Public Clouds for collaboration
Google Cloud
Sibling
Institutions
Long term goal:
• Seamless collaboration inside and outside of Broad
• With elastic compute and storage
• With little or no copying of files or ad-hoc, one-off hacks
My Metal
Boot Image Provisioning (PXE / Cobbler, Kickstart)
Hardware Provisioning (UCS, Xcat)
Broad configuration (Puppet)
User or execution environment (Dotkit, docker, JVM, Tomcat)
Hypervisor OS
Instance Provisioning
(Openstack)
Bare Metal
End User visible OS and vendor patches (Red Hat, plus satellite)
Private Cloud Public Cloud
Containerized
Wonderland
Configuration Stack: Now with containers!
Network topology (VLANS, et al)
Public Cloud
Infrastructure
Instance Provisioning
(CycleCloud)
???
… Docker / Mesos
Kubernetes / Cloud
Foundry / Common
Workflow Language
/ …
What about the data?
Scratch Space: “Pod local,” SSD filers
10 Gb/sec Network
80+ Gb/sec Network
Scratch Space:
• 3 x 70TB filers from Scalable Informatics.
• Running a relative of Lustre
• Over multiple 40Gb/sec interfaces
• Managed using hostgroups, workload affinities,
and an attentive operations team
Openstack
Production Farm
For small data: Lots of SSD / Flash
• Unreasonable requirement: Make it impossible for
spindles to be my bottleneck
– 8 GByte per second throughput
• Multiple quotes with fewer than 8 x 10Gb/sec ports
– ~100 TByte usable capacity
• I am not asking about large volume storage
• Give me sustainable pricing.
– On a NAS style file share
• Raw SAN / block device / iSCSI is not the deal.
• Solution: Scalable Informatics “Unison” filers
– A lot of vendors failed basic sanity checks on this one.
– Please listen carefully when I state requirements. I do not believe in single monolithic
solutions anymore.
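The arithmetic behind that port-count complaint is worth spelling out: 8 GByte/sec is 64 Gbit/sec on the wire, so a filer quoted with only a few 10 Gb/sec ports cannot meet the requirement even at line rate. (The 80% usable-throughput figure below is an assumption, not a vendor spec.)

```python
# The sanity check behind the port-count complaint on this slide.

required_gbyte_per_sec = 8
required_gbit_per_sec = required_gbyte_per_sec * 8            # 64 Gbit/sec of payload
port_speed_gbit = 10.0

ports_at_line_rate = required_gbit_per_sec / port_speed_gbit  # ideal world, no overhead
# NFS/TCP overhead and imbalance across links mean ports never run at 100%;
# assume ~80% usable throughput per port (an assumption, not a spec):
ports_realistic = required_gbit_per_sec / (port_speed_gbit * 0.8)

print(f"ideal minimum ports:     {ports_at_line_rate:.1f}")   # 6.4
print(f"realistic minimum ports: {ports_realistic:.1f}")      # 8.0
```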
Caching edge filers for shared references
10 Gb/sec Network
80+ Gb/sec Network
Scratch Space:
• 3 x 70TB filers from Scalable
Informatics.
• Workload managed by
hostgroups, workload
affinities, and an attentive
operations team
Openstack
Production Farm
Avere Edge Filer
(physical)
On premise data
stores
Shared Research Farm
Coherence on small volumes
of files provided by a
combination of clever network
routing and Avere’s caching
algorithms.
Plus caching edge filers for shared references
10 Gb/sec Network
80+ Gb/sec Network
Scratch Space:
• 3 x 70TB filers from Scalable
Informatics.
• Workload managed by
hostgroups, workload
affinities, and an attentive
operations team
Openstack
Production Farm
Multiple Public Clouds
Avere Edge Filer
(physical)
On premise data
stores
Cloud backed data stores
Shared Research Farm
Coherence on small volumes
of files provided by a
combination of clever network
routing and Avere’s caching
algorithms.
Plus caching edge filers for shared references
10 Gb/sec Network
80+ Gb/sec Network
Scratch Space:
• 3 x 70TB filers from Scalable
Informatics.
• Workload managed by
hostgroups, workload
affinities, and an attentive
operations team
Openstack
Production Farm
Multiple Public Clouds
Avere Edge Filer
(physical)
On premise data
stores
Avere Edge Filer
(virtual)
Cloud backed
data stores
Shared Research Farm
Coherence on small volumes
of files provided by a
combination of clever network
routing and Avere’s caching
algorithms.
Geek Cred: My First Petabyte, 2008
Cool thing: Avere
• Avere sells software with optional hardware:
• NFS front end whose block-store is an S3 bucket.
• It was born as a caching accelerator, and it does that well,
so the network considerations are in the right place.
• Since the hardware is optional …
Cool thing: Avere
• Avere sells software with optional hardware:
• NFS front end whose block-store is an S3 bucket.
• It was born as a caching accelerator, and it does that well,
so the network considerations are in the right place.
• Since the hardware is optional …
• An NFS share that bridges on-premise and cloud.
But what about the big data?
Broad Data Production, 2015: ~100TB /wk
Data production will continue to grow year over year
We can easily keep up with it, if we adopt appropriate
technologies.
100TB/wk ~= 1.3Gb/sec, but 1PB @ 1GB/sec ~= 12 days.
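The back-of-envelope numbers above, spelled out:

```python
# Sustained data production in Gb/sec, and how long a petabyte takes
# to move at 1 GByte/sec.

TB = 1e12          # bytes (decimal, as storage vendors count)
PB = 1e15
WEEK_SECONDS = 7 * 24 * 3600

production_bytes_per_sec = 100 * TB / WEEK_SECONDS
production_gbit_per_sec = production_bytes_per_sec * 8 / 1e9
print(f"100 TB/week ~= {production_gbit_per_sec:.1f} Gb/sec sustained")  # ~1.3

transfer_days = (1 * PB) / 1e9 / 86400   # 1 PB at 1 GByte/sec
print(f"1 PB at 1 GByte/sec ~= {transfer_days:.0f} days")                # ~12
```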
Broad Data Production, 2015:
~100TB /wk of unique information
“Data is heavy: It goes to the cheapest, closest place, and it stays
there”
Jeff Hammerbacher
Data Sizes for one 30x Whole
Genome
Base Calls from a single lane of an Illumina HiSeq X
• Approximately the coverage required for 30x on a
whole human genome.
• Record of a laboratory event
• Totally immutable
• Almost never directly used
Aligned reads from that same lane:
• Substantially better compression because of putting like
with like.
95 GB
60 GB
Aggregated, topped up, and re-normalized BAM:
• Near doubling in file size because of multiple quality scores per base.
145 GB
Variant File (VCF) and other directly usable formats
• Even smaller when we cast the distilled information into a
database of some sort
Tiny
Under the hood: ~1TB of MongoDB
And now for something completely different
File based storage: The
Information Limits
• Single namespace filers hit real-world limits at:
– ~5PB (restriping times, operational hotspots, MTBF headaches)
– ~10⁹ files: Directories must either be wider or deeper than human
brains can handle.
• Filesystem paths are presumed to persist forever
– Leads inevitably to forests of symbolic links
• Access semantics are inadequate for the federated world.
– We need complex, dynamic, context sensitive semantics including
consent for research use.
We’re all familiar with this
Limits of File Based Organization
Limits of File Based Organization
• The fact that whatever.bam and whatever.log are in the same
directory implies a vast amount about their relationship.
• The suffixes “bam” and “log” are also laden with meaning
• That implicit organization and metadata must be made explicit
in order to transcend the boundaries of file based storage
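A hedged illustration of that point, using a made-up directory convention: the path silently encodes project, sample, and file role, and escaping file-based storage means turning that convention into an explicit record.

```python
# The directory layout below is hypothetical; the point is that project,
# sample, and role live implicitly in the path and suffix, and must be
# made explicit to move beyond file-based storage.
from pathlib import Path

def implicit_to_explicit(path_str: str) -> dict:
    """Turn a conventional path like /seq/projX/sampleA/whatever.bam
    into an explicit metadata record."""
    p = Path(path_str)
    return {
        "project": p.parts[2],         # "projX"   -- implied by directory
        "sample": p.parts[3],          # "sampleA" -- implied by directory
        "role": p.suffix.lstrip("."),  # "bam" vs "log" -- implied by suffix
        "source_path": str(p),         # kept only for provenance
    }

if __name__ == "__main__":
    for f in ("/seq/projX/sampleA/whatever.bam", "/seq/projX/sampleA/whatever.log"):
        print(implicit_to_explicit(f))
```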
Limits of File Based Organization
• Broad hosts genotypes derived from perhaps 700,000
individuals
• These genotypes are organized according to a variety of
standards (~1,000 cohorts), and are spread across a variety of
filesystems
• Metadata about consent, phenotype, etc. is scattered across
dozens to hundreds of “databases.”
Limits of File Based Organization
• This lack of organization is holding us back from:
• Collaboration and Federation between sibling organizations
• Substantial cost savings using policy based data motion
• Integrative research efforts
• Large scale discoveries that are currently in reach
We’re all familiar with this
Early 2014: Conversations about object storage with:
• Amazon, Google
• EMC, Cleversafe
• Avere, Amplidata
• Data Direct Networks, Infinidat
• …
My object storage opinions
• The S3 standard defines object storage
– Any application that uses any special / proprietary features is a
nonstarter – including clever metadata stuff.
• All object storage must be durable to the loss of an entire
data center
– Conversations about sizing / usage need to be incredibly simple
• Must be cost effective at scale
– Throughput and latency are considerations, not requirements
– This breaks the data question into stewardship and usage
• Must not merely re-iterate the failure modes of filesystems
The dashboard should look opaque
The dashboard should look opaque
• Object “names” should be a bag of UUIDs
• Object storage should be basically unusable without the
metadata index.
• Anything else recapitulates the failure mode of file based
storage.
The dashboard should look opaque
• Object “names” should be a bag of UUIDs
• Object storage should be basically unusable without the
metadata index.
• Anything else recapitulates the failure mode of file based
storage.
• This should scare you.
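A minimal sketch of what "opaque" means in practice (illustrative Python, not Broad's implementation): object keys are random UUIDs and everything meaningful lives in a separate metadata index, so the bucket alone is deliberately useless.

```python
# "The dashboard should look opaque": object keys are opaque UUIDs, and
# the only sane way in is through the metadata index.
import uuid

object_store: dict[str, bytes] = {}   # stand-in for an S3-style bucket
metadata_index: dict[str, dict] = {}  # stand-in for the metadata service

def put_object(payload: bytes, metadata: dict) -> str:
    """Store payload under a random UUID key and index its metadata."""
    key = str(uuid.uuid4())            # a "bag of UUIDs"
    object_store[key] = payload
    metadata_index[key] = metadata
    return key

def find(**query) -> list[str]:
    """Look up object keys by metadata attributes."""
    return [k for k, m in metadata_index.items()
            if all(m.get(attr) == val for attr, val in query.items())]

if __name__ == "__main__":
    put_object(b"...reads...", {"sample": "NA12878", "type": "bam", "consented": True})
    print(find(sample="NA12878", consented=True))
```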
Current Object Storage Architecture
“Boss” Middleware
Consent for
Research Use
Phenotype
LIMS
Legacy File
(file://)
Cloud Providers
(AWS / Google)
• Domain specific middleware (“BOSS”) objects
• Mediates access by issuing pre-signed URLs
• Provides secured, time limited links
• A work in progress.
On Premise
Object Store
(2.6PB of EMC)
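The pre-signed URL pattern itself is standard S3 machinery. The sketch below shows it with boto3; the bucket, key, and single-boolean consent check are hypothetical stand-ins, not BOSS's actual code.

```python
# Illustrative boto3 usage for middleware-mediated access: check policy,
# then hand out a secured, time-limited link instead of raw bucket access.
import boto3

s3 = boto3.client("s3")

def issue_link(bucket: str, key: str, user_is_consented: bool, ttl_seconds: int = 900) -> str:
    """Return a time-limited download URL, but only after the policy check."""
    if not user_is_consented:
        raise PermissionError("consent for research use not established")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=ttl_seconds,   # link expires on its own
    )

# url = issue_link("example-genomics-bucket", "8f2c2d1e-...", user_is_consented=True)
```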
Current Object Storage
Architecture
“Boss” Middleware
Consent for
Research Use
Phenotype
LIMS
Legacy File
(file://)
Cloud Providers
(AWS / Google)
On Premise
Object Store
(2.6PB of EMC)
• Broad is currently decanting our two iRODS
archives into 2.6PB of on-premise object storage.
• This will free up 4PB of NAS filer (enough for a year
of data production).
• Have pushed at petabyte scale to Google’s cloud
storage
• At every point: Challenging but possible.
Data Deletion @ Scale
Me: “Blah Blah … I think we’re cool to delete about
600TB of data from a cloud bucket. What do you
think?”
Data Deletion @ Scale
Blah Blah … I think we’re cool to delete about 600TB of
data from a cloud bucket Ray: “BOOM!”
Data Deletion @ Scale
Blah Blah … I think we’re cool to delete about 600TB of
data from a cloud bucket
• This was my first deliberate data deletion at this scale.
• It scared me how fast / easy it was.
• Considering a “pull request” model for large scale deletions.
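A sketch of what a "pull request" model for bulk deletion might look like: one person proposes a manifest of object keys, a different person approves, and only then does anything get deleted. This is a hypothetical workflow, not an existing Broad system.

```python
# Hypothetical two-person workflow for large-scale deletions.
from dataclasses import dataclass, field

@dataclass
class DeletionRequest:
    proposer: str
    keys: list[str]
    approvals: set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        if reviewer == self.proposer:
            raise ValueError("proposer cannot approve their own deletion")
        self.approvals.add(reviewer)

    def execute(self, delete_fn) -> int:
        """Run delete_fn(key) for every key, but only with an independent approval."""
        if not self.approvals:
            raise PermissionError("refusing to delete without a second reviewer")
        for key in self.keys:
            delete_fn(key)
        return len(self.keys)

# req = DeletionRequest(proposer="cdwan", keys=["8f2c2d1e-...", "..."])
# req.approve("ray")
# req.execute(lambda key: print("deleting", key))
```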
Files must give way to APIs
At large scale, the file/folder model for managing
data on computers becomes ineffective as a
human interface, and eventually a hindrance to
programmatic access. The solution: object storage
+ metadata.
Regulatory Issues
Ethical Issues
Technical Issues
Federated Identity Management
• This one is not solved.
• I have the names of various technologies that I think are
involved: OpenAM, Shibboleth, NIH Commons, …
• It is up to us to understand the requirements and build a
system that meets them.
• Requirements are:
– Regulatory / legal
– Organizational
– Ethical.
Genomic data is not de-identifiable
Regulatory Issues
Ethical Issues
Technical Issues
This stuff is important
We have an opportunity to change lives and health
outcomes, and to realize the gains of genomic medicine, this
year.
We also have an opportunity to waste vast amounts of
money and still not really help the world.
I would like to work together with you to build a better future,
sooner.
cdwan@broadinstitute.org
Standards are needed for genomic data
“The mission of the Global Alliance for Genomics
and Health is to accelerate progress in human
health by helping to establish a common framework
of harmonized approaches to enable effective and
responsible sharing of genomic and clinical data,
and by catalyzing data sharing projects that drive
and demonstrate the value of data sharing.”
Regulatory Issues
Ethical Issues
Technical Issues
Thank You
Research Computing Ops:
Katie Shakun, David Altschuler, Dave Gregoire, Steve Kaplan, Kirill Lozinskiy, Paul McMartin,
Zach Shulte, Brett Stogryn, Elsa Tsao
Scientific Computing Services:
Eric Jones, Jean Chang, Peter Ragone, Vince Ryan
DevOps:
Lukas Karlsson, Marc Monnar, Matt Nicholson, Ray Pete, Andrew Teixeira
DSDE Ops:
Kathleen Tibbetts, Sam Novod, Jason Rose, Charlotte Tolonen, Ellen Winchester
Emeritus:
Tim Fennell, Cope Frazier, Eric Golin, Jay Weatherell, Ken Streck
BITS: Matthew Trunnell, Rob Damian, Cathleen Bonner, Kathy Dooley, Katey Falvey, Eugene
Opredelennov, Ian Poynter, (and many more)
DSDE: Eric Banks, David An, Kristian Cibulskis, Gabrielle Franceschelli, Adam Kiezun, Nils
Homer, Doug Voet, (and many more)
KDUX: Scott Sutherland, May Carmichael, Andrew Zimmer (and many more)
Partner Thank Yous
• Accunet (Nick Brown), Amazon
• Avere, Cisco (Skip Giles)
• Cycle Computing, EMC (Melissa Crichton, Patrick Combes)
• Google (Will Brockman), Infinidat
• Intel (Mark Bagley), Internet 2
• Red Hat, Scalable Informatics (Joe Landman)
• Solina, Violin Memory
• …
Take Home Messages
• Go ahead and do the legwork to federate your
environment to at least one public cloud.
– It’s “just work” at this point.
• Spend a lot of time understanding your data lifecycle, then
stuff the overly bulky 95% of it in an object store fronted
by a middleware application.
– The latency sensitive, constantly used bits fit in RAM
• Human issues of consent, data privacy, and ownership
are still the hardest part of the picture.
– We must learn to work together from a shared, standards based framework
– The time is now.
The opposite of play is not work, it’s depression
Jane McGonigal, Reality Is Broken

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data and Bad Analogies
Big Data and Bad AnalogiesBig Data and Bad Analogies
Big Data and Bad Analogiesmark madsen
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except usmark madsen
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)mark madsen
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
 
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...i_scienceEU
 
SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries Lenovo Data Center
 
Cloud Sobriety for Life Science IT Leadership (2018 Edition)
Cloud Sobriety for Life Science IT Leadership (2018 Edition)Cloud Sobriety for Life Science IT Leadership (2018 Edition)
Cloud Sobriety for Life Science IT Leadership (2018 Edition)Chris Dagdigian
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011Ian Foster
 
Practical Petabyte Pushing
Practical Petabyte PushingPractical Petabyte Pushing
Practical Petabyte PushingChris Dagdigian
 
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...Cloudera, Inc.
 
The rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computingThe rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computingMinhazul Arefin
 
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...i_scienceEU
 
Rethink Server Backup and Regain Control
Rethink Server Backup and Regain ControlRethink Server Backup and Regain Control
Rethink Server Backup and Regain ControlDruva
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudOla Spjuth
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Ola Spjuth
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataHaluan Irsad
 
IRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET Journal
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUBAhmed Salman
 
Big data and cloud computing 9 sep-2017
Big data and cloud computing 9 sep-2017Big data and cloud computing 9 sep-2017
Big data and cloud computing 9 sep-2017Dr. Anita Goel
 

Was ist angesagt? (20)

Big Data and Bad Analogies
Big Data and Bad AnalogiesBig Data and Bad Analogies
Big Data and Bad Analogies
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except us
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
 
SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries
 
Cloud Sobriety for Life Science IT Leadership (2018 Edition)
Cloud Sobriety for Life Science IT Leadership (2018 Edition)Cloud Sobriety for Life Science IT Leadership (2018 Edition)
Cloud Sobriety for Life Science IT Leadership (2018 Edition)
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
Practical Petabyte Pushing
Practical Petabyte PushingPractical Petabyte Pushing
Practical Petabyte Pushing
 
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...
 
The rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computingThe rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computing
 
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
 
Rethink Server Backup and Regain Control
Rethink Server Backup and Regain ControlRethink Server Backup and Regain Control
Rethink Server Backup and Regain Control
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and Cloud
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
IRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articles
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUB
 
Big data and cloud computing 9 sep-2017
Big data and cloud computing 9 sep-2017Big data and cloud computing 9 sep-2017
Big data and cloud computing 9 sep-2017
 

Andere mochten auch

Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Chris Dagdigian
 
2013: Trends from the Trenches
2013: Trends from the Trenches2013: Trends from the Trenches
2013: Trends from the TrenchesChris Dagdigian
 
2015 CDC Workshop on ScienceDMZ
2015 CDC Workshop on ScienceDMZ2015 CDC Workshop on ScienceDMZ
2015 CDC Workshop on ScienceDMZChris Dagdigian
 
BioIT Trends - 2014 Internet2 Technology Exchange
BioIT Trends - 2014 Internet2 Technology ExchangeBioIT Trends - 2014 Internet2 Technology Exchange
BioIT Trends - 2014 Internet2 Technology ExchangeChris Dagdigian
 
2015 Bio-IT Trends From the Trenches
2015 Bio-IT Trends From the Trenches2015 Bio-IT Trends From the Trenches
2015 Bio-IT Trends From the TrenchesChris Dagdigian
 
BioIT World 2016 - HPC Trends from the Trenches
BioIT World 2016 - HPC Trends from the TrenchesBioIT World 2016 - HPC Trends from the Trenches
BioIT World 2016 - HPC Trends from the TrenchesChris Dagdigian
 
(CMP303) ResearchCloud: CfnCluster and Internet2 for Enterprise HPC
(CMP303) ResearchCloud: CfnCluster and Internet2 for Enterprise HPC(CMP303) ResearchCloud: CfnCluster and Internet2 for Enterprise HPC
(CMP303) ResearchCloud: CfnCluster and Internet2 for Enterprise HPCAmazon Web Services
 

Andere mochten auch (7)

Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
 
2013: Trends from the Trenches
2013: Trends from the Trenches2013: Trends from the Trenches
2013: Trends from the Trenches
 
2015 CDC Workshop on ScienceDMZ
2015 CDC Workshop on ScienceDMZ2015 CDC Workshop on ScienceDMZ
2015 CDC Workshop on ScienceDMZ
 
BioIT Trends - 2014 Internet2 Technology Exchange
BioIT Trends - 2014 Internet2 Technology ExchangeBioIT Trends - 2014 Internet2 Technology Exchange
BioIT Trends - 2014 Internet2 Technology Exchange
 
2015 Bio-IT Trends From the Trenches
2015 Bio-IT Trends From the Trenches2015 Bio-IT Trends From the Trenches
2015 Bio-IT Trends From the Trenches
 
BioIT World 2016 - HPC Trends from the Trenches
BioIT World 2016 - HPC Trends from the TrenchesBioIT World 2016 - HPC Trends from the Trenches
BioIT World 2016 - HPC Trends from the Trenches
 
(CMP303) ResearchCloud: CfnCluster and Internet2 for Enterprise HPC
(CMP303) ResearchCloud: CfnCluster and Internet2 for Enterprise HPC(CMP303) ResearchCloud: CfnCluster and Internet2 for Enterprise HPC
(CMP303) ResearchCloud: CfnCluster and Internet2 for Enterprise HPC
 

Ähnlich wie 2015 04 bio it world

Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchRobert Grossman
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?Robert Grossman
 
Cloud computing in biomedicine intel talk
Cloud computing in biomedicine intel talkCloud computing in biomedicine intel talk
Cloud computing in biomedicine intel talkKetan Paranjape
 
Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)Robert Grossman
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceRobert Grossman
 
Other distributed systems
Other distributed systemsOther distributed systems
Other distributed systemsSri Prasanna
 
Open Cloud Consortium: An Update (04-23-10, v9)
Open Cloud Consortium: An Update (04-23-10, v9)Open Cloud Consortium: An Update (04-23-10, v9)
Open Cloud Consortium: An Update (04-23-10, v9)Robert Grossman
 
Bionimbus - An Overview (2010-v6)
Bionimbus - An Overview (2010-v6)Bionimbus - An Overview (2010-v6)
Bionimbus - An Overview (2010-v6)Robert Grossman
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
Introduction to Globus - XSEDE14 Tutorial
Introduction to Globus - XSEDE14 TutorialIntroduction to Globus - XSEDE14 Tutorial
Introduction to Globus - XSEDE14 TutorialGlobus
 
M0339_v1_6977127809 (1).pptx
M0339_v1_6977127809 (1).pptxM0339_v1_6977127809 (1).pptx
M0339_v1_6977127809 (1).pptxviveknagle4
 
Introduction and Overview of OpenStack for IaaS
Introduction and Overview of OpenStack for IaaSIntroduction and Overview of OpenStack for IaaS
Introduction and Overview of OpenStack for IaaSKeith Basil
 
Desktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omicsDesktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omicsDavid Wallom
 
Bodleian Library's DAMS system
Bodleian Library's DAMS systemBodleian Library's DAMS system
Bodleian Library's DAMS systembenosteen
 
Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...David Wallom
 
Modern apps with dcos
Modern apps with dcosModern apps with dcos
Modern apps with dcosSam Chen
 
Declare Victory with Big Data
Declare Victory with Big DataDeclare Victory with Big Data
Declare Victory with Big DataJ On The Beach
 
CLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB LaunchCLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB LaunchTom Connor
 
Setting up a private cloud for academic environment with OSS by Zoran Pantic ...
Setting up a private cloud for academic environment with OSS by Zoran Pantic ...Setting up a private cloud for academic environment with OSS by Zoran Pantic ...
Setting up a private cloud for academic environment with OSS by Zoran Pantic ...José Ferreiro
 

Ähnlich wie 2015 04 bio it world (20)

Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
 
Cloud computing in biomedicine intel talk
Cloud computing in biomedicine intel talkCloud computing in biomedicine intel talk
Cloud computing in biomedicine intel talk
 
Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of Science
 
Other distributed systems
Other distributed systemsOther distributed systems
Other distributed systems
 
Open Cloud Consortium: An Update (04-23-10, v9)
Open Cloud Consortium: An Update (04-23-10, v9)Open Cloud Consortium: An Update (04-23-10, v9)
Open Cloud Consortium: An Update (04-23-10, v9)
 
Bionimbus - An Overview (2010-v6)
Bionimbus - An Overview (2010-v6)Bionimbus - An Overview (2010-v6)
Bionimbus - An Overview (2010-v6)
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Introduction to Globus - XSEDE14 Tutorial
Introduction to Globus - XSEDE14 TutorialIntroduction to Globus - XSEDE14 Tutorial
Introduction to Globus - XSEDE14 Tutorial
 
M0339_v1_6977127809 (1).pptx
M0339_v1_6977127809 (1).pptxM0339_v1_6977127809 (1).pptx
M0339_v1_6977127809 (1).pptx
 
Introduction and Overview of OpenStack for IaaS
Introduction and Overview of OpenStack for IaaSIntroduction and Overview of OpenStack for IaaS
Introduction and Overview of OpenStack for IaaS
 
Desktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omicsDesktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omics
 
Bodleian Library's DAMS system
Bodleian Library's DAMS systemBodleian Library's DAMS system
Bodleian Library's DAMS system
 
Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...
 
Modern apps with dcos
Modern apps with dcosModern apps with dcos
Modern apps with dcos
 
Declare Victory with Big Data
Declare Victory with Big DataDeclare Victory with Big Data
Declare Victory with Big Data
 
CLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB LaunchCLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB Launch
 
Climb bath
Climb bathClimb bath
Climb bath
 
Setting up a private cloud for academic environment with OSS by Zoran Pantic ...
Setting up a private cloud for academic environment with OSS by Zoran Pantic ...Setting up a private cloud for academic environment with OSS by Zoran Pantic ...
Setting up a private cloud for academic environment with OSS by Zoran Pantic ...
 

Mehr von Chris Dwan

Somerville Police Staffing Final Report.pdf
Somerville Police Staffing Final Report.pdfSomerville Police Staffing Final Report.pdf
Somerville Police Staffing Final Report.pdfChris Dwan
 
2023 Ward 2 community meeting.pdf
2023 Ward 2 community meeting.pdf2023 Ward 2 community meeting.pdf
2023 Ward 2 community meeting.pdfChris Dwan
 
One Size Does Not Fit All
One Size Does Not Fit AllOne Size Does Not Fit All
One Size Does Not Fit AllChris Dwan
 
Somerville FY23 Proposed Budget
Somerville FY23 Proposed BudgetSomerville FY23 Proposed Budget
Somerville FY23 Proposed BudgetChris Dwan
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionChris Dwan
 
#Defund thepolice
#Defund thepolice#Defund thepolice
#Defund thepoliceChris Dwan
 
2009 cluster user training
2009 cluster user training2009 cluster user training
2009 cluster user trainingChris Dwan
 
No Free Lunch: Metadata in the life sciences
No Free Lunch:  Metadata in the life sciencesNo Free Lunch:  Metadata in the life sciences
No Free Lunch: Metadata in the life sciencesChris Dwan
 
Somerville ufc memo tree hearing
Somerville ufc memo   tree hearingSomerville ufc memo   tree hearing
Somerville ufc memo tree hearingChris Dwan
 
2011 career-fair
2011 career-fair2011 career-fair
2011 career-fairChris Dwan
 
Advocacy in the Enterprise (what works, what doesn't)
Advocacy in the Enterprise (what works, what doesn't)Advocacy in the Enterprise (what works, what doesn't)
Advocacy in the Enterprise (what works, what doesn't)Chris Dwan
 
"The Cutting Edge Can Hurt You"
"The Cutting Edge Can Hurt You""The Cutting Edge Can Hurt You"
"The Cutting Edge Can Hurt You"Chris Dwan
 
Introduction to HPC
Introduction to HPCIntroduction to HPC
Introduction to HPCChris Dwan
 
Intro bioinformatics
Intro bioinformaticsIntro bioinformatics
Intro bioinformaticsChris Dwan
 
Proposed tree protection ordinance
Proposed tree protection ordinanceProposed tree protection ordinance
Proposed tree protection ordinanceChris Dwan
 
Tree Ordinance Change Matrix
Tree Ordinance Change MatrixTree Ordinance Change Matrix
Tree Ordinance Change MatrixChris Dwan
 
Tree protection overhaul
Tree protection overhaulTree protection overhaul
Tree protection overhaulChris Dwan
 
Response from newport
Response from newportResponse from newport
Response from newportChris Dwan
 
Sacramento underpass bid_docs
Sacramento underpass bid_docsSacramento underpass bid_docs
Sacramento underpass bid_docsChris Dwan
 
2019 BioIt World - Post cloud legacy edition
2019 BioIt World - Post cloud legacy edition2019 BioIt World - Post cloud legacy edition
2019 BioIt World - Post cloud legacy editionChris Dwan
 

Mehr von Chris Dwan (20)

Somerville Police Staffing Final Report.pdf
Somerville Police Staffing Final Report.pdfSomerville Police Staffing Final Report.pdf
Somerville Police Staffing Final Report.pdf
 
2023 Ward 2 community meeting.pdf
2023 Ward 2 community meeting.pdf2023 Ward 2 community meeting.pdf
2023 Ward 2 community meeting.pdf
 
One Size Does Not Fit All
One Size Does Not Fit AllOne Size Does Not Fit All
One Size Does Not Fit All
 
Somerville FY23 Proposed Budget
Somerville FY23 Proposed BudgetSomerville FY23 Proposed Budget
Somerville FY23 Proposed Budget
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on Production
 
#Defund thepolice
#Defund thepolice#Defund thepolice
#Defund thepolice
 
2009 cluster user training
2009 cluster user training2009 cluster user training
2009 cluster user training
 
No Free Lunch: Metadata in the life sciences
No Free Lunch:  Metadata in the life sciencesNo Free Lunch:  Metadata in the life sciences
No Free Lunch: Metadata in the life sciences
 
Somerville ufc memo tree hearing
Somerville ufc memo   tree hearingSomerville ufc memo   tree hearing
Somerville ufc memo tree hearing
 
2011 career-fair
2011 career-fair2011 career-fair
2011 career-fair
 
Advocacy in the Enterprise (what works, what doesn't)
Advocacy in the Enterprise (what works, what doesn't)Advocacy in the Enterprise (what works, what doesn't)
Advocacy in the Enterprise (what works, what doesn't)
 
"The Cutting Edge Can Hurt You"
"The Cutting Edge Can Hurt You""The Cutting Edge Can Hurt You"
"The Cutting Edge Can Hurt You"
 
Introduction to HPC
Introduction to HPCIntroduction to HPC
Introduction to HPC
 
Intro bioinformatics
Intro bioinformaticsIntro bioinformatics
Intro bioinformatics
 
Proposed tree protection ordinance
Proposed tree protection ordinanceProposed tree protection ordinance
Proposed tree protection ordinance
 
Tree Ordinance Change Matrix
Tree Ordinance Change MatrixTree Ordinance Change Matrix
Tree Ordinance Change Matrix
 
Tree protection overhaul
Tree protection overhaulTree protection overhaul
Tree protection overhaul
 
Response from newport
Response from newportResponse from newport
Response from newport
 
Sacramento underpass bid_docs
Sacramento underpass bid_docsSacramento underpass bid_docs
Sacramento underpass bid_docs
 
2019 BioIt World - Post cloud legacy edition
2019 BioIt World - Post cloud legacy edition2019 BioIt World - Post cloud legacy edition
2019 BioIt World - Post cloud legacy edition
 

2015 04 bio it world

  • 1. Research Computing@Broad An Update: Bio-IT World Expo April, 2015 Chris Dwan (cdwan@broadinstitute.org) Director, Research Computing and Data Services
  • 2. Take Home Messages • Go ahead and do the legwork to federate your environment to at least one public cloud. – It’s “just work” at this point. • Spend a lot of time understanding your data lifecycle, then stuff the overly bulky 95% of it in an object store fronted by a middleware application. – The latency sensitive, constantly used bits fit in RAM • Human issues of consent, data privacy, and ownership are still the hardest part of the picture. – We must learn to work together from a shared, standards based framework – The time is now.
  • 3. The world is quite ruthless in selecting between the dream and the reality, even where we will not. Cormac McCarthy, All the Pretty Horses
  • 4. • The Broad Institute is a non-profit biomedical research institute founded in 2004 • Fifty core faculty members and hundreds of associate members from MIT and Harvard • ~1000 research and administrative personnel, plus ~2,400+ associated researchers Programs and Initiatives focused on specific disease or biology areas Cancer Genome Biology Cell Circuits Psychiatric Disease Metabolism Medical and Population Genetics Infectious Disease Epigenomics Platforms focused technological innovation and application Genomics Therapeutics Imaging Metabolite Profiling Proteomics Genetic Perturbation The Broad Institute
  • 5. • The Broad Institute is a non-profit biomedical research institute founded in 2004 • Twelve core faculty members and more than 200 associate members from MIT and Harvard • ~1000 research and administrative personnel, plus ~1000 associated researchers Programs and Initiatives focused on specific disease or biology areas Cancer Genome Biology Cell Circuits Psychiatric Disease Metabolism Medical and Population Genetics Infectious Disease Epigenomics Platforms focused technological innovation and application Genomics Therapeutics Imaging Metabolite Profiling Proteomics Genetic Perturbation The Broad Institute 60+ Illumina HiSeq instruments, including 14 ‘X’ sequencers 700,000+ genotyped samples ~18PB unique data / ~30PB usable file storage
  • 6. The HPC Environment Shared Everything: A reasonable architecture Network • ~10,000 Cores of Linux server • 10Gb/sec ethernet backplane. • All storage is available as files (NAS) from all servers.
  • 7. Monitoring and metrics Matt Nicholson joins the Broad Chris Dwan joins the Broad
  • 8. Monitoring and metrics Matt Nicholson joins the Broad Gradual puppetization Chris Dwan joins the Broad
  • 9. Monitoring and metrics Matt Nicholson joins the Broad Gradual puppetization: increased visibility We’re pretty sure that we actually have ~15,000 cores Chris Dwan joins the Broad
  • 10. Metal Boot Image Provisioning (PXE / Cobbler, Kickstart) Hardware Provisioning (UCS, Xcat) Broad specific system configuration (Puppet) User or execution environment (Dotkit, docker, JVM, Tomcat) Bare Metal OS and vendor patches (Red Hat / yum, plus satellite) Private Cloud Public Cloud Containerized Wonderland Management Stack: Bare Metal Network topology (VLANS, et al) Many specific technical decisions do not matter, so long as you choose something and make it work (Dagdigian, 2015)
  • 11. Shared Everything: Ugly reality 10 Gb/sec Network • At least six discrete compute farms running at least five versions of batch schedulers (LSF and SGE) • Nodes “shared” by mix- and-match between owners. • Nine Isilon clusters • Five Infinidat filers • ~19 distinct storage technologies. Genomics Platform Cancer Program Shared “Farm” Overlapping usage = Potential I/O bottleneck when multiple groups are doing heavy analysis
  • 12.
  • 13. Metal Boot Image Provisioning (PXE / Cobbler, Kickstart) Hardware Provisioning (UCS, Xcat) Broad specific configuration (Puppet) User or execution environment (Dotkit, docker, JVM, Tomcat) Hypervisor OS Instance Provisioning (Openstack) Bare Metal OS and vendor patches (Red Hat / yum, plus satellite) Private Cloud Public Cloud Containerized Wonderland Configuration Stack: Now with Private Cloud! Network topology (VLANS, et al) Re-use everything possible from the bare metal environment. While inserting things that make our life easier.
  • 14. Openstack (RHEL, Icehouse) Openstack@Broad: The least cloudy cloud. 10 Gb/sec Network Genomics Platform Cancer Program Shared “Farm” Least “cloudy” implementation possible. IT / DevOps staff as users Simply virtualizing and abstracting away hardware from user facing OS. Note that most former problems remain intact Incrementally easier to manage with very limited staff (3 FTE Linux admins).
  • 15. Openstack monitoring / UI is primitive at best
  • 16. Openstack: open issues Excluded from our project: – Software defined networking (Neutron) – “Cloud” storage (Cinder / Swift) – Monitoring / Billing (Ceilometer, Heat) – High Availability on Controllers Custom: – Most deployment infrastructure / scripting, including DNS – Network encapsulation – Active Directory integration – All core systems administration functions Core Message: – Do not change both what you do and how you do it at the same time. – Openstack could have been a catastrophe without rather extreme project scoping.
  • 17. I need “telemetry,” rather than logs*. Jisto: Software startup with smart, smart monitoring and potential for containerized cycle harvesting a-la condor. *Logs let you know why you crashed. Telemetry lets you steer.
  • 18. Trust no one: This machine was not actually the hypervisor of anything.
  • 19. Broad Institute Firewall NAS Filers Compute Internet Internet 2 Edge Router Router We need elastic computing, a more cloudy cloud.
  • 20. Broad Institute Firewall NAS Filers Compute Internet Internet 2 Edge Router Router Subnet: 10.200.x.x Domain: openstack.broadinstitute.org Hostname: tenant-x-x Starting at the bottom of the stack There is no good answer to the question of “DNS in your private cloud” Private BIND domains are your friend The particular naming scheme does not matter. Just pick a scheme
  • 21. Broad Institute Firewall NAS Filers Compute Internet Internet 2 Edge Router Router Amazon VPC VPN Endpoint VPN Endpoint More Compute! Subnet: 10.200.x.x Domain: openstack.broadinstitute.org Hostname: tenant-x-x Subnet: 10.199.x.x Domain: aws.broadinstitute.org Network Engineering: You don’t have to replace everything.
  • 22. Broad Institute Firewall NAS Filers Compute Internet Internet 2 Edge Router Router Amazon VPC VPN Endpoint VPN Endpoint More Compute! Subnet: 10.200.x.x Domain: openstack.broadinstitute.org Hostname: tenant-x-x Subnet: 10.199.x.x Domain: aws.broadinstitute.org Ignore issues of latency, network transport costs, and data locality for the moment. We’ll get to those later. Differentiate Layer 2 from Layer 3 connectivity. We are not using Amazon Direct Connect. We don’t need to, because AWS is routable via Internet 2. Network Engineering Makes Everything Simpler
  • 23. Physical Network Layout: More bits! Markley Data Centers (Boston) Internap Data Centers (Somerville) Broad: Main St. Broad: Charles St. Between data centers: • 80 Gb/sec dark fiber Internet: 1 Gb/sec 10 Gb/sec Internet 2 10 Gb/sec 100 Gb/sec Failover Internet: 1 Gb/sec Metro Ring: • 20 Gb/sec dark fiber
  • 24. My Metal Boot Image Provisioning (PXE / Cobbler, Kickstart) Hardware Provisioning (UCS, Xcat) Broad specific configuration (Puppet) User or execution environment (Dotkit, docker, JVM, Tomcat) Hypervisor OS Instance Provisioning (Openstack) Bare Metal OS and vendor patches (Red Hat / yum, plus satellite) Private Cloud Public Cloud Containerized Wonderland Configuration Stack: Private Hybrid Cloud! Network topology (VLANS, et al) Public Cloud Infrastructure Instance Provisioning (CycleCloud)
  • 25. CycleCloud provides straightforward, recognizable cluster functionality with autoscaling and a clean management UI. Do not be fooled by the 85 page “quick start guide,” it’s just a cluster.
  • 26. A social digression on cloud resources • Researchers are generally: – Remarkably hardworking – Responsible, good stewards of resources – Not terribly engaged with IT strategy These good character traits present social barriers to cloud adoption Researchs Need – Guidance and guard rails. – Confidence that they are not “wasting” resources – A sufficiently familiar environment to get started
  • 27. Multiple Public Clouds Openstack Batch Compute Farm: 2015 Edition Production Farm Shared Research FarmTwo clusters, running the same batch scheduler (Univa’s Grid Engine). Production: Small number of humans operating several production systems for business critical data delivery. Resarch: Many humans running ad- hoc tasks.
  • 28. Multiple Public Clouds Openstack End State: Compute Clusters Production Farm Shared Research Farm A financially in-elastic portion of the clusters is governed by traditional fairshare scheduling. Fairshare allocations change slowly (month to month) based on conversation, investment, and discussion of both business and emotional factors This allows consistent budgeting, dynamic exploration, and ad-hoc use without fear or guilt.
  • 29. Multiple public clouds support auto-scaling queues for projects with funding and urgency Openstack plus public clouds provides a consistent capacity End State: Compute Clusters Production Farm Shared Research Farm On a project basis, funds can be allocated for truly elastic burst computing. This allows business logic to drive delivery based on investment A financially in-elastic portion of the clusters is governed by traditional fairshare scheduling. Fairshare allocations are changed slowly (month to month, perhaps) based on substantial conversation, investment, and discussion of both business logic and feelings. This allows consistent budgeting, dynamic exploration, and ad-hoc use without fear or guilt.
  • 30. Broad Institute Amazon VPC The term I’ve heard for this is “intercloud.” End State: Multiple Interconnected Public Clouds for collaboration Google Cloud Sibling Institutions Long term goal: • Seamless collaboration inside and outside of Broad • With elastic compute and storage • With little or no copying of files or ad-hoc, one-off hacks
  • 31. My Metal Boot Image Provisioning (PXE / Cobbler, Kickstart) Hardware Provisioning (UCS, Xcat) Broad configuration (Puppet) User or execution environment (Dotkit, docker, JVM, Tomcat) Hypervisor OS Instance Provisioning (Openstack) Bare Metal End User visible OS and vendor patches (Red Hat, plus satellite) Private Cloud Public Cloud Containerized Wonderland Configuration Stack: Now with containers! Network topology (VLANS, et al) Public Cloud Infrastructure Instance Provisioning (CycleCloud) ??? … Docker / Mesos Kubernetes / Cloud Foundry / Common Workflow Language / …
  • 32. What about the data?
  • 33. Scratch Space: “Pod local,” SSD filers 10 Gb/sec Network 80+ Gb/sec Network Scratch Space: • 3 x 70TB filers from Scalable Informatics. • Running a relative of Lustre • Over multiple 40Gb/sec interfaces • Managed using hostgroups, workload affinities, and an attentive operations team Openstack Production Farm
  • 34. For small data: Lots of SSD / Flash • Unreasonable requirement: Make it impossible for spindles to be my bottleneck – 8 GByte per second throughput • Multiple quotes with fewer than 8 x 10Gb/sec ports – ~100 TByte usable capacity • I am not asking about large volume storage • Give me sustainable pricing. – On a NAS style file share • Raw SAN / block device / iSCSI is not the deal. • Solution: Scalable Informatics “Unison” filers – A lot of vendors failed basic sanity checks on this one. – Please listen carefully when I state requirements. I do not believe in single monolithic solutions anymore.
• 35. Caching edge filers for shared references Scratch space remains 3 x 70TB filers from Scalable Informatics on the 80+ Gb/sec network, workload managed by hostgroups, workload affinities, and an attentive operations team. An Avere Edge Filer (physical) connects Openstack, the Production Farm, and the Shared Research Farm (10 Gb/sec network) to the on-premise data stores. Coherence on small volumes of files is provided by a combination of clever network routing and Avere’s caching algorithms.
• 36. Plus caching edge filers for shared references The same layout, with multiple public clouds and cloud-backed data stores now added behind the physical Avere Edge Filer. Coherence on small volumes of files is provided by a combination of clever network routing and Avere’s caching algorithms.
• 37. Plus caching edge filers for shared references The same layout again, with an additional Avere Edge Filer (virtual) running in the public clouds in front of the cloud-backed data stores. Coherence on small volumes of files is provided by a combination of clever network routing and Avere’s caching algorithms.
  • 38. Geek Cred: My First Petabyte, 2008
  • 39. Cool thing: Avere • Avere sells software with optional hardware: • NFS front end whose block-store is an S3 bucket. • It was born as a caching accelerator, and it does that well, so the network considerations are in the right place. • Since the hardware is optional …
• 40. Cool thing: Avere • Avere sells software with optional hardware: • NFS front end whose block-store is an S3 bucket. • It was born as a caching accelerator, and it does that well, so the network considerations are in the right place. • Since the hardware is optional … • An NFS share that bridges on-premise and cloud.
  • 41. But what about the big data?
• 42. Broad Data Production, 2015: ~100TB /wk Data production will continue to grow year over year. We can easily keep up with it, if we adopt appropriate technologies. 100TB/wk ~= 1.3Gb/sec sustained, but 1PB @ 1GB/sec (~8Gb/sec) ~= 12 days.
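A quick back-of-the-envelope check of those figures, as a minimal Python sketch (the rates and sizes are the slide’s round numbers, not measurements):

    # Back-of-the-envelope check of the slide's bandwidth arithmetic.
    SECONDS_PER_WEEK = 7 * 24 * 3600

    weekly_bytes = 100e12                                   # ~100 TB of unique data per week
    sustained_gbit = weekly_bytes * 8 / SECONDS_PER_WEEK / 1e9
    print(f"Sustained rate: {sustained_gbit:.2f} Gb/sec")   # ~1.32 Gb/sec

    petabyte_bytes = 1e15
    for bytes_per_sec, label in [(1e9 / 8, "1 Gb/sec"), (1e9, "1 GB/sec")]:
        days = petabyte_bytes / bytes_per_sec / 86400
        print(f"Moving 1 PB at {label}: {days:.0f} days")   # ~93 days vs ~12 days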
  • 43. Broad Data Production, 2015: ~100TB /wk of unique information “Data is heavy: It goes to the cheapest, closest place, and it stays there” Jeff Hammerbacher
• 44. Data Sizes for one 30x Whole Genome • Base calls from a single lane of an Illumina HiSeq X: 95 GB – Approximately the coverage required for 30x on a whole human genome – Record of a laboratory event – Totally immutable – Almost never directly used • Aligned reads from that same lane: 60 GB – Substantially better compression because of putting like with like • Aggregated, topped up, and re-normalized BAM: 145 GB – Near doubling in file size because of multiple quality scores per base • Variant file (VCF) and other directly usable formats: tiny – Even smaller when we cast the distilled information into a database of some sort
  • 45. Under the hood: ~1TB of MongoDB
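The slide does not show what is inside that MongoDB, but as an illustration of why distilled variant calls fit comfortably in a database, here is a minimal pymongo sketch; the database, collection, and field names are hypothetical, not the Broad’s actual schema:

    # Minimal sketch, assuming a reachable MongoDB and an illustrative schema.
    # Field names ("sample_id", "chrom", "pos", ...) are hypothetical.
    from pymongo import ASCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    variants = client["genomes"]["variants"]

    # Index by genomic position so region queries avoid scanning the collection.
    variants.create_index([("chrom", ASCENDING), ("pos", ASCENDING)])

    variants.insert_one({
        "sample_id": "NA12878",   # a public example sample
        "chrom": "chr1",
        "pos": 861349,
        "ref": "C",
        "alt": "T",
        "genotype": "0/1",
        "qual": 99,
    })

    # Region query: all variants this sample carries in a 1 Mb window.
    hits = variants.find({"sample_id": "NA12878", "chrom": "chr1",
                          "pos": {"$gte": 1_000_000, "$lt": 2_000_000}})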
  • 46. And now for something completely different
• 47. File based storage: The Information Limits • Single namespace filers hit real-world limits at: – ~5PB (restriping times, operational hotspots, MTBF headaches) – ~10^9 files: directories must either be wider or deeper than human brains can handle • Filesystem paths are presumed to persist forever – Leads inevitably to forests of symbolic links • Access semantics are inadequate for the federated world – We need complex, dynamic, context sensitive semantics, including consent for research use
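To make the ~10^9-file point concrete, a tiny worked example; the branching factor of 1,000 entries per directory is an arbitrary stand-in for “the most a person can skim in one listing”:

    total_files = 10**9
    entries_per_dir = 1000   # arbitrary "human-skimmable" directory width

    # How deep must a balanced tree go if no directory exceeds that width?
    depth, capacity = 0, 1
    while capacity < total_files:
        depth += 1
        capacity *= entries_per_dir
    print(depth)   # 3 levels of 1,000-entry directories just to reach a billion leaves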
  • 48. We’re all familiar with this
  • 49. Limits of File Based Organization
  • 50. Limits of File Based Organization • The fact that whatever.bam and whatever.log are in the same directory implies a vast amount about their relationship. • The suffixes “bam” and “log” are also laden with meaning • That implicit organization and metadata must be made explicit in order to transcend the boundaries of file based storage
• 51. Limits of File Based Organization • Broad hosts genotypes derived from perhaps 700,000 individuals • These genotypes are organized according to a variety of standards (~1,000 cohorts), and are spread across a variety of filesystems • Metadata about consent, phenotype, etc. is scattered across dozens to hundreds of “databases.”
  • 52. Limits of File Based Organization • This lack of organization is holding us back from: • Collaboration and Federation between sibling organizations • Substantial cost savings using policy based data motion • Integrative research efforts • Large scale discoveries that are currently in reach
• 53. We’re all familiar with this Early 2014: conversations about object storage with: • Amazon • Google • EMC • Cleversafe • Avere • Amplidata • Data Direct Networks • Infinidat • …
  • 54. My object storage opinions • The S3 standard defines object storage – Any application that uses any special / proprietary features is a nonstarter – including clever metadata stuff. • All object storage must be durable to the loss of an entire data center – Conversations about sizing / usage need to be incredibly simple • Must be cost effective at scale – Throughput and latency are considerations, not requirements – This breaks the data question into stewardship and usage • Must not merely re-iterate the failure modes of filesystems
  • 55. The dashboard should look opaque
  • 56. The dashboard should look opaque • Object “names” should be a bag of UUIDs • Object storage should be basically unusable without the metadata index. • Anything else recapitulates the failure mode of file based storage.
  • 57. The dashboard should look opaque • Object “names” should be a bag of UUIDs • Object storage should be basically unusable without the metadata index. • Anything else recapitulates the failure mode of file based storage. • This should scare you.
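As a sketch of what “a bag of UUIDs” can mean in practice, here is a minimal Python illustration; the in-memory dictionary stands in for a real metadata service, and the bucket name, helper, and metadata fields are invented for this example, not BOSS’s actual layout:

    # Minimal sketch: opaque UUID object names, with all meaning in a metadata index.
    import uuid

    metadata_index = {}   # stands in for a real metadata service or database

    def store_object(payload: bytes, metadata: dict) -> str:
        """Write the payload under an opaque name; record meaning only in the index."""
        object_id = str(uuid.uuid4())            # e.g. "3fa85f64-5717-4562-..."
        # put_in_bucket("example-objects", object_id, payload)   # storage call elided
        metadata_index[object_id] = metadata     # consent, cohort, data type, ...
        return object_id

    oid = store_object(b"...BAM bytes...",
                       {"cohort": "example-cohort", "type": "bam",
                        "consent": "general research use"})

    # Without the index the store is deliberately unusable: the name tells you nothing.
    print(oid, metadata_index[oid]["type"])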
• 58. Current Object Storage Architecture The “BOSS” middleware sits between metadata sources (Consent for Research Use, Phenotype, LIMS) and the storage back ends: legacy files (file://), an on-premise object store (2.6PB of EMC), and cloud providers (AWS / Google). • Domain specific middleware (“BOSS”) manages the objects • Mediates access by issuing pre-signed URLs • Provides secured, time limited links • A work in progress
• 59. Current Object Storage Architecture Same picture: “BOSS” middleware, metadata sources (Consent for Research Use, Phenotype, LIMS), legacy files (file://), the on-premise object store (2.6PB of EMC), and cloud providers (AWS / Google). • Broad is currently decanting our two iRODS archives into 2.6PB of on-premise object storage • This will free up 4PB of NAS filer (enough for a year of data production) • We have pushed at petabyte scale to Google’s cloud storage • At every point: challenging but possible
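The slides say BOSS mediates access by issuing pre-signed, time-limited links. As an illustration of that general mechanism (not BOSS’s actual code), here is how such a link can be generated against an S3-compatible store with boto3; the bucket and key names are hypothetical, and credentials are assumed to come from the environment:

    # Illustrative only: generate a time-limited, pre-signed link to one object.
    import boto3

    s3 = boto3.client("s3")

    url = s3.generate_presigned_url(
        ClientMethod="get_object",
        Params={"Bucket": "example-objects",
                "Key": "3fa85f64-5717-4562-b3fc-2c963f66afa6"},
        ExpiresIn=3600,   # the link stops working after an hour
    )
    # A middleware layer would hand this URL to a caller only after checking
    # consent-for-research-use and other policy against its metadata index.
    print(url)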
  • 60. Data Deletion @ Scale Me: “Blah Blah … I think we’re cool to delete about 600TB of data from a cloud bucket. What do you think?”
  • 61. Data Deletion @ Scale Blah Blah … I think we’re cool to delete about 600TB of data from a cloud bucket Ray: “BOOM!”
  • 62. Data Deletion @ Scale Blah Blah … I think we’re cool to delete about 600TB of data from a cloud bucket • This was my first deliberate data deletion at this scale. • It scared me how fast / easy it was. • Considering a “pull request” model for large scale deletions.
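One way such a “pull request” model could look, as a hedged sketch: the deletion is first computed as a reviewable manifest (a dry run), and nothing is removed until the manifest is explicitly approved. The bucket, prefix, and function names below are assumptions for illustration:

    # Sketch of a two-step, reviewable bulk deletion (illustrative, not production code).
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-archive"    # hypothetical bucket
    PREFIX = "old-cohort/"        # hypothetical prefix slated for deletion

    def build_manifest(bucket, prefix):
        """Dry run: list what would be deleted, and how much data that is."""
        paginator = s3.get_paginator("list_objects_v2")
        keys, total_bytes = [], 0
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                keys.append({"Key": obj["Key"]})
                total_bytes += obj["Size"]
        print(f"{len(keys)} objects, {total_bytes / 1e12:.1f} TB would be deleted")
        return keys

    def apply_manifest(bucket, keys, approved=False):
        """Delete only after a second person has reviewed and approved the manifest."""
        if not approved:
            raise RuntimeError("Manifest not approved; refusing to delete.")
        for i in range(0, len(keys), 1000):   # delete_objects accepts at most 1,000 keys
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys[i:i + 1000]})

    manifest = build_manifest(BUCKET, PREFIX)            # step 1: review this output
    # apply_manifest(BUCKET, manifest, approved=True)    # step 2: run only after sign-off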
  • 63. Files must give way to APIs At large scale, the file/folder model for managing data on computers becomes ineffective as a human interface, and eventually a hindrance to programmatic access. The solution: object storage + metadata. Regulatory Issues Ethical Issues Technical Issues
  • 64. Federated Identity Management • This one is not solved. • I have the names of various technologies that I think are involved: OpenAM, Shibboleth, NIH Commons, … • It is up to us to understand the requirements and build a system that meets them. • Requirements are: – Regulatory / legal – Organizational – Ethical.
  • 65. Genomic data is not de-identifiable Regulatory Issues Ethical Issues Technical Issues
  • 66. This stuff is important We have an opportunity to change lives and health outcomes, and to realize the gains of genomic medicine, this year. We also have an opportunity to waste vast amounts of money and still not really help the world. I would like to work together with you to build a better future, sooner. cdwan@broadinstitute.org
  • 67.
  • 68. Standards are needed for genomic data “The mission of the Global Alliance for Genomics and Health is to accelerate progress in human health by helping to establish a common framework of harmonized approaches to enable effective and responsible sharing of genomic and clinical data, and by catalyzing data sharing projects that drive and demonstrate the value of data sharing.” Regulatory Issues Ethical Issues Technical Issues
• 69. Thank You Research Computing Ops: Katie Shakun, David Altschuler, Dave Gregoire, Steve Kaplan, Kirill Lozinskiy, Paul McMartin, Zach Shulte, Brett Stogryn, Elsa Tsao Scientific Computing Services: Eric Jones, Jean Chang, Peter Ragone, Vince Ryan DevOps: Lukas Karlsson, Marc Monnar, Matt Nicholson, Ray Pete, Andrew Teixeira DSDE Ops: Kathleen Tibbetts, Sam Novod, Jason Rose, Charlotte Tolonen, Ellen Winchester Emeritus: Tim Fennell, Cope Frazier, Eric Golin, Jay Weatherell, Ken Streck BITS: Matthew Trunnell, Rob Damian, Cathleen Bonner, Kathy Dooley, Katey Falvey, Eugene Opredelennov, Ian Poynter (and many more) DSDE: Eric Banks, David An, Kristian Cibulskis, Gabrielle Franceschelli, Adam Kiezun, Nils Homer, Doug Voet (and many more) KDUX: Scott Sutherland, May Carmichael, Andrew Zimmer (and many more)
• 70. Partner Thank Yous • Accunet (Nick Brown) • Amazon • Avere • Cisco (Skip Giles) • Cycle Computing • EMC (Melissa Crichton, Patrick Combes) • Google (Will Brockman) • Infinidat • Intel (Mark Bagley) • Internet2 • Red Hat • Scalable Informatics (Joe Landman) • Solina • Violin Memory • …
  • 71. Take Home Messages • Go ahead and do the legwork to federate your environment to at least one public cloud. – It’s “just work” at this point. • Spend a lot of time understanding your data lifecycle, then stuff the overly bulky 95% of it in an object store fronted by a middleware application. – The latency sensitive, constantly used bits fit in RAM • Human issues of consent, data privacy, and ownership are still the hardest part of the picture. – We must learn to work together from a shared, standards based framework – The time is now.
• 72. The opposite of play is not work, it’s depression Jane McGonigal, Reality is Broken