Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP)
Robert Grossman
Institute for Genomics & Systems Biology
Center for Research Informatics
Computation Institute
Department of Medicine
University of Chicago
&
Open Data Group
September 11, 2012
  
The OSDC & Bionimbus Teams
•  Open Science Data Cloud (OSDC) Team
    –  Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez.
    –  Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.
•  Bionimbus Team
    –  Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White.
    –  Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part the OSDC infrastructure.
  
Let’s Step Back 20 Years
•  1992-96: Petabyte Access & Storage Solutions (PASS) Project for SSC.
•  It developed & benchmarked federated relational, OO DB, object stores, & column-oriented data warehouse solutions at the TB-scale.
A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
  
Part 1.
Genomics as a Big Data Science
Source: Lincoln Stein
  
One Million Genomes
•  Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
•  The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
•  One million genomes is about 1000 PB or 1 EB
•  With compression, it may be about 100 PB
•  At $1000/genome, the sequencing would cost about $1B
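The arithmetic behind these bullets can be checked with a short sketch. The per-patient size, genome count, and per-genome cost come from the slide; the 10x compression ratio is an assumption inferred from the "1000 PB to about 100 PB" figures, and the function names are illustrative only.

```python
# Back-of-the-envelope sizing for one million genomes (decimal units).
TB_PER_PB = 1_000
PB_PER_EB = 1_000


def raw_storage_pb(genomes: int, tb_per_patient: float) -> float:
    """Uncompressed storage in PB for the given number of patients."""
    return genomes * tb_per_patient / TB_PER_PB


def compressed_storage_pb(raw_pb: float, ratio: float) -> float:
    """Storage in PB after compression; the ratio is an assumed parameter."""
    return raw_pb / ratio


def sequencing_cost_usd(genomes: int, usd_per_genome: float) -> float:
    """Total sequencing cost in US dollars."""
    return genomes * usd_per_genome


if __name__ == "__main__":
    raw = raw_storage_pb(1_000_000, 1)           # slide: ~1 TB per patient
    small = compressed_storage_pb(raw, 10)       # assumed 10x compression
    cost = sequencing_cost_usd(1_000_000, 1_000)
    print(f"raw: {raw:.0f} PB ({raw / PB_PER_EB:.0f} EB)")
    print(f"compressed: {small:.0f} PB")
    print(f"sequencing cost: ${cost / 1e9:.0f}B")
```

Changing the per-patient size or the compression ratio shows how sensitive the storage estimate is to those two inputs.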
  
Big data driven discovery on 1,000,000 genomes and 1 EB of data.
•  Genomic-driven diagnosis
•  Improved understanding of genomic science
•  Genomic-driven drug development
Precision diagnosis and treatment. Preventive health care.
  
[Figure with groups labeled ER+ and TNBC.]
With genomics, we can stratify diseases and treat each stratum differently. Source: White Lab, University of Chicago.
  
Clonal Evolution of Tumors
Tumors evolve temporally and spatially.
Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.
  
Combinations of Rare Alleles
[Figure: penetrance (Low, Modest, Intermediate, High) plotted against allele frequency (0.001, 0.01, 0.1; Very rare, Rare, Uncommon, Common). Regions labeled in the figure:]
•  Rare alleles causing Mendelian disease (high penetrance, very rare)
•  Rare examples of high-penetrance common variants influencing common disease
•  Low-frequency variants with intermediate penetrance
•  Rare variants of small effect, very hard to identify by genetic means
•  Most common variants implicated in common disease by GWA (low penetrance, common)
Source: Mark McCarthy
  
TCGA Analysis of Lung Cancer
•  178 cases of SQCC (lung cancer)
•  Matched tumor & normal
•  Mean of 360 exonic mutations, 323 CNV, & 165 rearrangements per tumor
Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.
  
Some Examples of Big Data Science

Discipline          Duration     Size             # Devices
HEP - LHC           10 years     15 PB/year*      One
Astronomy - LSST    10 years     12 PB/year**     One
Genomics - NGS      2-4 years    0.5 TB/genome    1000’s

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
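To put the yearly volumes in the table on a common footing, the following sketch converts PB/year into a sustained data rate and asks how many genomes at 0.5 TB each would match a given yearly volume. It assumes decimal units (1 PB = 10^15 bytes) and a 365-day year; the function names are illustrative, not from the talk.

```python
# Convert yearly data volumes into sustained rates and genome counts.
SECONDS_PER_YEAR = 365 * 24 * 3600
BYTES_PER_PB = 1e15   # decimal petabyte (assumption)
TB_PER_PB = 1_000


def sustained_gbps(petabytes_per_year: float) -> float:
    """Average rate in Gbit/s needed to move this much data in one year."""
    bits = petabytes_per_year * BYTES_PER_PB * 8
    return bits / SECONDS_PER_YEAR / 1e9


def genomes_to_match(petabytes_per_year: float, tb_per_genome: float) -> float:
    """Number of genomes per year that add up to the same volume."""
    return petabytes_per_year * TB_PER_PB / tb_per_genome


if __name__ == "__main__":
    for name, pb in [("HEP - LHC", 15), ("Astronomy - LSST", 12)]:
        print(f"{name}: {pb} PB/year is about {sustained_gbps(pb):.1f} Gbps sustained")
    print(f"genomes/year equal to 15 PB: {genomes_to_match(15, 0.5):.0f}")
```

On these assumptions, a few tens of thousands of genomes per year already put genomics in the same range as the single-instrument sciences.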
  
One large instrument vs. many smaller instruments
  
Part 2.
What Instrument Do We Use to Make Big Data Discoveries?
How do we build a “datascope?”
  
What is big data?
TB? PB? EB? ZB?
  
Another way:
Think of data as big if you measure it in MW; for example, Facebook’s Prineville Data Center is 30 MW.
(Image: opencompute.org)
  
An algorithm and computing infrastructure is “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
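One way to make this definition testable is to treat it as a weak-scaling check: keep the data per rack constant, add racks, and see whether the runtime stays roughly flat. The sketch below does this under stated assumptions: the `Run` record, the 10% tolerance, and the sample measurements are all hypothetical, not taken from the talk.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    racks: int        # racks of data (with their processors) used in the run
    data_tb: float    # total data processed, in TB
    seconds: float    # wall-clock time for the computation


def is_big_data_scalable(runs, tolerance=0.10):
    """Return True if runtime stays within `tolerance` of the smallest run
    while racks (and proportionally more data) are added."""
    if not runs:
        raise ValueError("need at least one run")
    ordered = sorted(runs, key=lambda r: r.racks)
    base = ordered[0]
    base_density = base.data_tb / base.racks
    for run in ordered[1:]:
        # The definition only applies when data grows with the racks.
        if not math.isclose(run.data_tb / run.racks, base_density, rel_tol=1e-6):
            raise ValueError("data per rack must stay constant across runs")
        if run.seconds > base.seconds * (1 + tolerance):
            return False
    return True


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    flat = [Run(1, 950, 3600), Run(2, 1900, 3650), Run(4, 3800, 3700)]
    slow = [Run(1, 950, 3600), Run(4, 3800, 7200)]
    print(is_big_data_scalable(flat), is_big_data_scalable(slow))
```

The tolerance is a judgment call; a stricter value flags smaller coordination overheads as a failure to scale.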
  
Commercial Cloud Service Provider (CSP)
15 MW Data Center
•  Monitoring, network security and forensics
•  Accounting and billing
•  Customer Facing Portal
•  Automatic provisioning and infrastructure management
•  100,000 servers, 1 PB DRAM, 100’s of PB of disk
•  ~1 Tbps egress bandwidth
•  Data center network
•  25 operators for 15 MW Commercial Cloud
  
What are some of the important differences between commercial and research-focused CSPs?
  
Science Clouds

                  Science CSP                        Commercial CSP
POV               Democratize access to data.        As long as you pay the bill;
                  Integrate data to make             as long as the business
                  discoveries. Long term archive.    model holds.
Data & Storage    Data intensive computing &         Internet style scale out
                  HP storage                         and object-based storage
Flows             Large data flows in and out        Lots of small web flows
Streams           Streaming processing required      NA
Accounting        Essential                          Essential
Lock in           Moving environment between         Lock in is good
                  CSPs essential
  
Part 3.
The Open Cloud Consortium’s Open Science Data Cloud
  
•  U.S.-based not-for-profit corporation.
•  Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
•  Manages cloud computing testbeds: Open Cloud Testbed.
www.opencloudconsortium.org
  
Cloud Services Operations Centers (CSOC)
•  The OSDC operates a Cloud Services Operations Center (CSOC).
•  It is a CSOC focused on supporting Science Clouds for researchers.
•  Compare to a Network Operations Center (NOC).
•  Both are an important part of cyber infrastructure for big data science.
  
Different Styles of OSDC Racks
•  Design 1: Put cores over spindles.
•  Higher cost but easy to compute over all the data.
•  Design 2: separate (some of the) storage from the compute.
2012 OSDC rack design (draft)
•  950 TB / rack
•  600 cores / rack
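The draft rack figures above lend themselves to a quick capacity estimate. The sketch below is a minimal illustration: the 950 TB and 600 cores per rack come from the slide, while the target capacities passed in (10 PB and 100 PB) are example inputs and the function names are hypothetical.

```python
import math

RACK_TB = 950       # from the 2012 draft rack design
RACK_CORES = 600    # from the 2012 draft rack design
TB_PER_PB = 1_000   # decimal units (assumption)


def racks_needed(target_pb: float) -> int:
    """Whole racks required to reach a target raw capacity in PB."""
    return math.ceil(target_pb * TB_PER_PB / RACK_TB)


def cores_for(racks: int) -> int:
    """Cores that come along with that many racks."""
    return racks * RACK_CORES


if __name__ == "__main__":
    for target in (10, 100):
        n = racks_needed(target)
        print(f"{target} PB -> {n} racks, {cores_for(n)} cores")
    print(f"storage per core: {RACK_TB / RACK_CORES:.2f} TB")
```

The estimate ignores replication and file system overhead, so usable capacity per rack would be lower in practice.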
  
Open Science Data Cloud
•  Monitoring, compliance, & security
•  Accounting and billing (OSDC)
•  Customer Facing Portal (Tukey)
•  Science Cloud SW & Services
•  Automatic provisioning and infrastructure management
•  3 PB 2011, 10 PB 2012, able to scale to 100 PB?
•  ~100 Gbps bandwidth
•  Data center network
•  5-12 operators to operate 1-5 MW Science Cloud
OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
  
OSDC Philosophy
•  We try to automate as much as possible (we automate the setup & operations of a rack).
•  We try to write as little software as possible.
•  Each project is a bit different, but in general:
•  We assign (permanent) IDs to data managed by the OSDC and manage associated metadata.
•  We assign and enforce permissions for users & groups of users and for files/objects, collections of files/objects, and collections of collections.
•  We support RESTful interfaces.
•  We do accounting for storage and core-hours.
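The ID and permission model in the list above can be illustrated with a small sketch. Everything here is hypothetical: the talk does not describe the OSDC implementation, so the UUID-based ID scheme, the `Node` type, and the rule that a grant on a collection is inherited by everything inside it are assumptions made only to show how permissions on files, collections, and collections of collections might compose.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """A file/object or a collection; collections may contain collections."""
    name: str
    parent: Node | None = None
    # Permanent ID assigned at creation (UUIDs are an illustrative choice).
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    # Users or groups granted read access directly on this node.
    readers: set[str] = field(default_factory=set)


def can_read(user: str, groups: set[str], node: Node) -> bool:
    """Allow access if the user or one of their groups is granted on the
    node itself or on any enclosing collection (assumed inheritance)."""
    principals = {user} | groups
    current = node
    while current is not None:
        if principals & current.readers:
            return True
        current = current.parent
    return False


if __name__ == "__main__":
    project = Node("project")                    # collection of collections
    runs = Node("runs", parent=project)          # collection of files
    bam = Node("sample.bam", parent=runs)        # a single file/object
    project.readers.add("lab-a")                 # grant a group on the top level
    print(can_read("alice", {"lab-a"}, bam))     # access inherited from project
    print(can_read("bob", set(), bam))           # no grant anywhere in the chain
```

Granting at the collection level keeps the number of individual permission records small, which fits the stated aim of writing as little software as possible.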
  
Some Of Our Biggest Mistakes
•  Not charging those who were the largest users of our services. This resulted in a lot of bad behavior.
•  Trying to support donated equipment without adequate staff.
•  Being too optimistic about when big data software would be ready for prime time.
•  Some problems with big data software don’t show up at less than the full scale of the OSDC, but we have only one OSDC and it is difficult to test at this scale.
  
Essential Services for a Science CSP
•  Support for data intensive computing
•  Support for big data flows
•  Account management, authentication and authorization services
•  Health and status monitoring
•  Billing and accounting
•  Ability to rapidly provision infrastructure
•  Security services, logging, event reporting
•  Access to large amounts of public data
•  High performance storage
•  Simple data export and import services
  
[Figure: number of projects plotted against data size.]

Number    Projects                                             Data Size          Infrastructure
1000’s    Individual scientists & small projects               Small              Public infrastructure
100’s     Community based science via Science as a Service     Medium to Large    Shared community infrastructure
10’s      Very large projects                                  Very Large         Dedicated infrastructure
  
Part 4. Bionimbus
Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.
  
Step 1. Prepare a sample.
Step 2. Log in to Bionimbus and get a Bionimbus Key.
Step 3. Send your sample to the sequencing center.
Step 4. Log in to Bionimbus and view your data.
Step 5. Use Bionimbus to perform standard and custom pipelines.
Bionimbus can launch multiple virtual machines.
  
Bionimbus Virtual Machine Releases

Peak Calling              MAT, MA2C, PeakSeq, MACS, SPP
Quality Control           Various
Alignment & Genotyping    Bowtie, TopHat, Samtools, Picard
  
Software Tools: Moving Genomes
  
Bionimbus Community Genomic Cloud
[Diagram: a researcher connected to the following components.]
•  Cloud for Public Data (1K genomes, PubMed, etc.)
•  Personal “dropbox” + compute
  
Bionimbus Private Genomic Cloud
[Diagram: a researcher connected to the following components.]
•  Cloud for Public Data (1K genomes, PubMed, etc.)
•  Personal “dropbox” & compute
•  Cloud for Controlled Data (TCGA, dbGaP)
  
Bionimbus Private Biomedical Cloud
[Diagram: a researcher connected to the following components.]
•  Cloud for Public Data (1K genomes, PubMed, etc.)
•  Personal “dropbox” plus compute
•  Cloud for Controlled Data (TCGA, dbGaP)
•  Scatter, gather queries
•  Clinical Research Data Warehouse
•  Cloud for PHI data
  
[Diagram: Bionimbus sample and data workflow.]
Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc. (BID Generator)
Step 2. Send sample to be sequenced. (Internal sequencers or external sequencing partner)
Step 3a. Return raw reads.
Step 3b. Return variant calls, CNV, annotation…
Step 4. Secure data routing to appropriate cloud based upon BID.
Step 5. Cloud based analysis using IGSB and 3rd party tools and applications.
Destinations shown: Bionimbus Private Cloud UC, Bionimbus Community Cloud, Bionimbus Private Cloud XY, dbGaP, Amazon.
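Steps 1 and 4 of the workflow above suggest a simple pattern: record the destination when the BID is issued, then route returned data by looking up the BID. The sketch below illustrates that pattern only. The `BidRegistry` class, the `BID-000001` format, and the `Cloud` names are hypothetical and are not the Bionimbus implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Cloud(Enum):
    PRIVATE = "Bionimbus Private Cloud"
    COMMUNITY = "Bionimbus Community Cloud"
    PUBLIC = "public cloud"


@dataclass(frozen=True)
class BidRecord:
    bid: str
    project: str
    cloud: Cloud


class BidRegistry:
    """Issues IDs (step 1) and routes returned data by ID (step 4)."""

    def __init__(self) -> None:
        self._records: dict[str, BidRecord] = {}
        self._next = 1

    def issue(self, project: str, cloud: Cloud) -> BidRecord:
        # Illustrative ID format; the real BID scheme is not described here.
        record = BidRecord(f"BID-{self._next:06d}", project, cloud)
        self._next += 1
        self._records[record.bid] = record
        return record

    def route(self, bid: str) -> Cloud:
        """Return the destination chosen when the BID was issued."""
        try:
            return self._records[bid].cloud
        except KeyError:
            raise LookupError(f"unknown BID: {bid}") from None


if __name__ == "__main__":
    registry = BidRegistry()
    sample = registry.issue("example-project", Cloud.PRIVATE)
    print(sample.bid, "->", registry.route(sample.bid).value)
```

Fixing the destination at issue time means the sequencing center never has to know the access policy; it only needs to attach the BID to what it returns.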
  
[Diagram: Bionimbus architecture.]
•  web2py-based Front End
•  Utility Cloud Services (Eucalyptus, OpenStack)
•  Database Services (PostgreSQL)
•  Analysis Pipelines & Re-analysis Services
•  Intercloud Services (IDs, etc.) (UDT, replication)
•  Data Ingestion Services
•  Data Cloud Services (Hadoop, Sector/Sphere)

>300 ChIP datasets
- Chromatin/RNA timecourse
- CBP
- PolII
- Pho/silencers
- HDACs
- Insulators
- TFs
Predictions
537 silencers
2,307 new promoters
12,285 enhancers
14,145 insulators
www.modencode.org
Negre et al. Nature 2011
  
Part 5.
Managing One Million Genomes
  
Storage technology              Data
Relational databases            Summary level (10-100 TB). Enrich with clinical data.
NoSql & scientific databases    Variation (VCF) Files (1-10 PB) (Genomic variation)
NoSql, DFS, file overlays?      Sequence (BAM) Files (100-1000 PB) (Sequence data in binary form)
  
Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities.

Additional funding for the OSDC has been provided by the following sponsors:

•  The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
•  Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
•  NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
•  OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
•  The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.

The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.
  
For more information
•  You can find some more information on my blog: rgrossman.com.
•  Some of my technical papers are also available there.
•  My email address is robert.grossman at uchicago dot edu
•  I recently wrote a popular book about computing called: The Structure of Digital Computing: From Mainframes to Big Data, which you can buy from Amazon.
  


	
  
Sources for images
•  The image of the hard disk is from Norlando Pobre, Creative Commons.
•  The image of the Facebook Prineville Data Center is from the Intel Free Press, www.flickr.com/photos/intelfreepress/6722296855/, Creative Commons BY 2.0.
•  The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
  

GASCAN: A Novel Database for Gastric Cancer Genes and Primersijdmtaiir
 

Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)

  • 1. Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP)
       Robert Grossman
       Institute for Genomics & Systems Biology
       Center for Research Informatics
       Computation Institute
       Department of Medicine
       University of Chicago
       & Open Data Group
       September 11, 2012
  • 2. The OSDC & Bionimbus Teams
       • Open Science Data Cloud (OSDC) Team
         – Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez.
         – Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.
       • Bionimbus Team
         – Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White.
         – Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part the OSDC infrastructure.
  • 3. Let's Step Back 20 Years
       • 1992-96: Petabyte Access & Storage Solutions (PASS) Project for SSC.
       • It developed & benchmarked federated relational, OO DB, object stores, & column-oriented data warehouse solutions at the TB-scale.
  • 4. A picture of CERN's Large Hadron Collider (LHC). The LHC took about a decade to construct and cost about $4.75 billion.
       Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
  • 5. Part 1. Genomics as a Big Data Science
  • 7. One Million Genomes
       • Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
       • The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
       • One million genomes is about 1000 PB or 1 EB.
       • With compression, it may be about 100 PB.
       • At $1000/genome, the sequencing would cost about $1B.
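The estimates on slide 7 follow from simple multiplication. The sketch below reproduces them under stated assumptions: decimal units (1 PB = 1000 TB, 1 EB = 1000 PB) and a 10x compression ratio, which is not stated on the slide but is implied by the step from 1 EB to about 100 PB. The function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the slide 7 figures.
# Assumptions (not from the slide): decimal units and a 10x compression ratio.
TB_PER_PATIENT = 1            # tumor + normal samples, per the slide
GENOMES = 1_000_000
COST_PER_GENOME_USD = 1_000
COMPRESSION_RATIO = 10        # assumed; implied by 1 EB -> ~100 PB


def estimate(genomes=GENOMES):
    """Return raw/compressed storage and sequencing cost for `genomes` patients."""
    raw_tb = genomes * TB_PER_PATIENT
    raw_pb = raw_tb / 1000            # 1 PB = 1000 TB
    return {
        "raw_pb": raw_pb,
        "raw_eb": raw_pb / 1000,      # 1 EB = 1000 PB
        "compressed_pb": raw_pb / COMPRESSION_RATIO,
        "sequencing_cost_usd": genomes * COST_PER_GENOME_USD,
    }


result = estimate()
print(result)
```

Calling `estimate()` with a smaller cohort size shows how quickly the storage requirement shrinks for pilot projects.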
  • 8. Big data driven discovery on 1,000,000 genomes and 1 EB of data.
       [Diagram: genomic-driven diagnosis; improved understanding of genomic science; genomic-driven drug development.]
       Precision diagnosis and treatment. Preventive health care.
  • 9. [Figure panels: ER+, TNBC]
       With genomics, we can stratify diseases and treat each stratum differently.
       Source: White Lab, University of Chicago.
  • 10. Clonal Evolution of Tumors
       Tumors evolve temporally and spatially.
       Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.
  • 11. Combinations of Rare Alleles
      [Figure: penetrance (low, modest, intermediate, high) plotted against allele frequency (very rare, rare, uncommon, common; tick marks at 0.001, 0.01, 0.1). Labeled regions:]
      •  Rare alleles causing Mendelian disease
      •  Examples of high-penetrance common variants influencing common disease
      •  Low-frequency variants with intermediate penetrance
      •  Rare variants of small effect, very hard to identify by genetic means
      •  Most common variants implicated in common disease by GWA
      Source: Mark McCarthy
  • 12. TCGA Analysis of Lung Cancer
      •  178 cases of SQCC (lung cancer)
      •  Matched tumor & normal
      •  Mean of 360 exonic mutations, 323 CNV, & 165 rearrangements per tumor
      Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.
  • 13. Some Examples of Big Data Science
      Discipline          Duration     Size             # Devices
      HEP - LHC           10 years     15 PB/year*      One
      Astronomy - LSST    10 years     12 PB/year**     One
      Genomics - NGS      2-4 years    0.5 TB/genome    1000’s
      *At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
      **As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
  • 14. One  large  instrument   Many  smaller  instruments  
  • 15. Part 2. What Instrument Do We Use to Make Big Data Discoveries? How do we build a “datascope”?
  • 16. TB?   PB?   EB?   ZB?   What  is  big  data?  
  • 17. Another way: think of data as big if you measure it in MW. For example, Facebook’s Prineville Data Center is 30 MW. [Image source: opencompute.org]
  • 18. An algorithm and computing infrastructure are “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
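The definition on this slide can be illustrated with an idealized model. This is a sketch under strong assumptions: work divides evenly across racks, there is no communication overhead, and the per-rack scan rate of 95 TB/hour is a made-up number. The 950 TB per rack figure is taken from the draft 2012 OSDC rack design quoted later in the talk.

```python
TB_PER_RACK = 950        # from the draft 2012 OSDC rack design in this talk
TB_PER_RACK_HOUR = 95    # hypothetical scan rate for one rack's processors


def job_hours(total_tb, racks, tb_per_rack_hour=TB_PER_RACK_HOUR):
    """Idealized wall-clock hours when work divides evenly across racks."""
    if racks < 1:
        raise ValueError("need at least one rack")
    return total_tb / (racks * tb_per_rack_hour)


def scaling_table(max_racks):
    """Each added rack brings its own data and its own processors."""
    rows = []
    for racks in range(1, max_racks + 1):
        total_tb = racks * TB_PER_RACK
        rows.append((racks, total_tb, job_hours(total_tb, racks)))
    return rows


for racks, total_tb, hours in scaling_table(4):
    print(f"{racks} rack(s), {total_tb} TB: {hours:.1f} hours")
```

In this model the run time stays flat while the data grows, which is the property the slide calls big-data scalable. Real systems fall short of it to the degree that racks must exchange data.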
  • 19. Commercial Cloud Service Provider (CSP): 15 MW Data Center
      •  Customer Facing Portal
      •  Accounting and billing
      •  Monitoring, network security and forensics
      •  Automatic provisioning and infrastructure management
      •  Data center network
      •  100,000 servers, 1 PB DRAM, 100’s of PB of disk
      •  ~1 Tbps egress bandwidth
      •  25 operators for 15 MW Commercial Cloud
  • 20. What  are  some  of  the  important   differences  between  commercial   and  research-­‐focused  CSPs?    
  • 21. Science Clouds
                      Science CSP                                   Commercial CSP
      POV             Democratize access to data. Integrate         As long as you pay the bill; as long as
                      data to make discoveries. Long term archive.  the business model holds.
      Data & Storage  Data intensive computing & HP storage         Internet style scale out and object-based storage
      Flows           Large data flows in and out                   Lots of small web flows
      Streams         Streaming processing required                 NA
      Accounting      Essential                                     Essential
      Lock in         Moving environment between CSPs essential     Lock in is good
  • 22. Part 3. The Open Cloud Consortium’s Open Science Data Cloud
  • 23. •  U.S.-based not-for-profit corporation.
      •  Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
      •  Manages cloud computing testbeds: Open Cloud Testbed.
      www.opencloudconsortium.org
  • 24. Cloud Services Operations Centers (CSOC)
      •  The OSDC operates a Cloud Services Operations Center (or CSOC).
      •  It is a CSOC focused on supporting Science Clouds for researchers.
      •  Compare to a Network Operations Center or NOC.
      •  Both are an important part of cyber infrastructure for big data science.
  • 25. Different Styles of OSDC Racks
      •  Design 1: Put cores over spindles.
      •  Higher cost but easy to compute over all the data.
      •  Design 2: separate (some of the) storage from the compute.
      2012 OSDC rack design (draft)
      •  950 TB / rack
      •  600 cores / rack
  • 26. Open Science Data Cloud (OSDC)
      •  Customer Facing Portal (Tukey)
      •  Accounting and billing
      •  Monitoring, compliance, & security
      •  Science Cloud SW & Services
      •  Automatic provisioning and infrastructure management
      •  Data center network
      •  3 PB 2011, 10 PB 2012, able to scale to 100 PB?
      •  ~100 Gbps bandwidth
      •  5-12 operators to operate 1-5 MW Science Cloud
      OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
  • 27. OSDC Philosophy
      •  We try to automate as much as possible (we automate the setup & operations of a rack).
      •  We try to write as little software as possible.
      •  Each project is a bit different, but in general:
      •  We assign (permanent) IDs to data managed by the OSDC and manage associated metadata.
      •  We assign and enforce permissions for users & groups of users and for files/objects, collections of files/objects, and collections of collections.
      •  We support RESTful interfaces.
      •  We do accounting for storage and core-hours.
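The last point, accounting for storage and core-hours, can be sketched as a simple roll-up of usage records. This is an illustration only: the record fields, the TB-hours storage measure, and all names are assumptions, not a description of the OSDC's actual accounting system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One metered interval for one user (hypothetical schema)."""
    user: str
    cores: int
    hours: float
    storage_tb: float


def summarize(records):
    """Total core-hours and TB-hours per user."""
    totals = defaultdict(lambda: {"core_hours": 0.0, "tb_hours": 0.0})
    for record in records:
        totals[record.user]["core_hours"] += record.cores * record.hours
        totals[record.user]["tb_hours"] += record.storage_tb * record.hours
    return dict(totals)


records = [
    UsageRecord("alice", cores=8, hours=2.5, storage_tb=1.0),
    UsageRecord("alice", cores=4, hours=1.0, storage_tb=1.0),
    UsageRecord("bob", cores=16, hours=0.5, storage_tb=2.0),
]
usage = summarize(records)
for user, totals in sorted(usage.items()):
    print(user, totals)
```

Per-user totals of this kind are what would make it possible to charge the largest users, the gap named as the first mistake on the next slide.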
  • 28. Some Of Our Biggest Mistakes
      •  Not charging those who were the largest users of our services. This resulted in a lot of bad behavior.
      •  Trying to support donated equipment without adequate staff.
      •  Being too optimistic about when big data software would be ready for prime time.
      •  Some problems with big data software don’t show up at less than the full scale of the OSDC, but we have only one OSDC and it is difficult to test at this scale.
  • 29. Essential Services for a Science CSP
      •  Support for data intensive computing
      •  Support for big data flows
      •  Account management, authentication and authorization services
      •  Health and status monitoring
      •  Billing and accounting
      •  Ability to rapidly provision infrastructure
      •  Security services, logging, event reporting
      •  Access to large amounts of public data
      •  High performance storage
      •  Simple data export and import services
  • 30. Number    Who                                                 Data Size         Infrastructure
      1000’s    Individual scientists & small projects              Small             Public infrastructure
      100’s     Community based science via Science as a Service    Medium to Large   Shared community infrastructure
      10’s      Very large projects                                 Very Large        Dedicated infrastructure
  • 31. Part 4. Bionimbus
      Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.
  • 32. Step  1.  Prepare  a  Sample  
  • 33. Step 2. Log in to Bionimbus and get a Bionimbus Key.
  • 34. Step  3.    Send  your  sample  to  the   sequencing  center.    
  • 35. Step 4. Log in to Bionimbus and view your data.
  • 36. Step 5. Use Bionimbus to perform standard and custom pipelines. Bionimbus can launch multiple virtual machines.
  • 37. Bionimbus Virtual Machine Releases
      Peak Calling:            MAT, MA2C, PeakSeq, MACS, SPP
      Quality Control:         Various
      Alignment & Genotyping:  Bowtie, TopHat, Samtools, Picard
  • 39. Bionimbus Community Genomic Cloud
      •  Researcher: personal “dropbox” + compute
      •  Cloud for Public Data: 1K genomes, PubMed, etc.
  • 40. Bionimbus Private Genomic Cloud
      •  Researcher: personal “dropbox” & compute
      •  Cloud for Public Data: 1K genomes, PubMed, etc.
      •  Cloud for Controlled Data: TCGA, dbGaP
  • 41. Bionimbus Private Biomedical Cloud
      •  Researcher: personal “dropbox” plus compute
      •  Cloud for Public Data: 1K genomes, PubMed, etc.
      •  Cloud for Controlled Data: TCGA, dbGaP
      •  Cloud for PHI data: Clinical Research Data Warehouse
      •  Scatter, gather queries
  • 42. Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc. [BID Generator]
      Step 2. Send sample to be sequenced. [Internal sequencers; external sequencing partner]
      Step 3a. Return raw reads.
      Step 3b. Return variant calls, CNV, annotation…
      Step 4. Secure data routing to appropriate cloud based upon BID.
      Step 5. Cloud based analysis using IGSB and 3rd party tools and applications.
      [Clouds shown: Bionimbus Private Cloud UC, Bionimbus Community Cloud, Bionimbus Private Cloud XY, dbGaP, Amazon.]
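Steps 1 and 4 of this workflow, issuing a BID and later routing data by it, can be sketched as a small registry. Everything here is hypothetical: the `BID-000001` ID format, the `Cloud` categories, and the class and project names are illustrations, since the slides do not describe the real BID scheme.

```python
from enum import Enum
from itertools import count


class Cloud(Enum):
    """Destination categories named in Step 1 (illustrative values)."""
    COMMUNITY = "community"
    PRIVATE = "private"
    PUBLIC = "public"


class BidRegistry:
    """Issues Bionimbus IDs and remembers which cloud each one routes to."""

    def __init__(self):
        self._sequence = count(1)
        self._records = {}

    def issue(self, project, cloud):
        # Step 1: the ID format below is made up for this sketch.
        bid = f"BID-{next(self._sequence):06d}"
        self._records[bid] = (project, cloud)
        return bid

    def route(self, bid):
        # Step 4: the destination depends only on the BID attached to the data.
        if bid not in self._records:
            raise KeyError(f"unknown BID: {bid}")
        return self._records[bid][1]


registry = BidRegistry()
chip_bid = registry.issue("chip-seq-pilot", Cloud.COMMUNITY)
tumor_bid = registry.issue("tumor-normal-pairs", Cloud.PRIVATE)
print(chip_bid, "->", registry.route(chip_bid).value)
print(tumor_bid, "->", registry.route(tumor_bid).value)
```

Because the destination is fixed when the BID is issued, the sequencing center only needs to return data tagged with the BID and does not need to know which cloud a project belongs to.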
  • 43. web2py-based Front End
      •  Utility Cloud Services (Eucalyptus, OpenStack)
      •  Database Services (PostgreSQL)
      •  Analysis Pipelines & Re-analysis Services
      •  Intercloud Services (UDT, replication)
      •  Data Ingestion Services (IDs, etc.)
      •  Data Cloud Services (Hadoop, Sector/Sphere)
  • 44. >300 ChIP datasets
      - Chromatin/RNA timecourse
      - CBP
      - PolII
      - Pho/silencers
      - HDACs
      - Insulators
      - TFs
      Predictions: 537 silencers, 2,307 new promoters, 12,285 enhancers, 14,145 insulators
      www.modencode.org
      Negre et al. Nature 2011
  • 45. Part  5.       Managing  One  Million  Genomes  
  • 46. •  Summary level (10-100 TB): relational databases. Enrich with clinical data.
      •  Variation (VCF) files (1-10 PB), genomic variation: NoSQL & scientific databases.
      •  Sequence (BAM) files (100-1000 PB), sequence data in binary form: NoSQL, DFS, file overlays?
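Dividing each tier on this slide by one million genomes gives a rough per-genome footprint. This is a back-of-the-envelope sketch that assumes the size ranges refer to the full one-million-genome collection and that data is spread evenly across genomes. The names are illustrative.

```python
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15

# (low, high) total size of each tier, in bytes, as given on the slide.
TIERS = {
    "summary level": (10 * TB, 100 * TB),
    "variation (VCF) files": (1 * PB, 10 * PB),
    "sequence (BAM) files": (100 * PB, 1000 * PB),
}


def per_genome(tiers, genomes=1_000_000):
    """Average bytes per genome for each tier, as (low, high) pairs."""
    return {name: (low // genomes, high // genomes)
            for name, (low, high) in tiers.items()}


footprint = per_genome(TIERS)
for name, (low, high) in footprint.items():
    print(f"{name}: {low:,} to {high:,} bytes per genome")
```

Under these assumptions the tiers work out to roughly 10-100 MB of summary data, 1-10 GB of variation data, and 100 GB to 1 TB of sequence data per genome. The top of the sequence range matches the 1 TB per patient figure from earlier in the talk.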
  • 47. Acknowledgements
      Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities.
      Additional funding for the OSDC has been provided by the following sponsors:
      •  The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
      •  Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
      •  NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
      •  OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
      •  The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
      The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.
  • 48. For more information
      •  You can find some more information on my blog: rgrossman.com.
      •  Some of my technical papers are also available there.
      •  My email address is robert.grossman at uchicago dot edu
      •  I recently wrote a popular book about computing called The Structure of Digital Computing: From Mainframes to Big Data, which you can buy from Amazon.
      Center for Research Informatics
  • 49. Sources for images
      •  The image of the hard disk is from Norlando Pobre, Creative Commons.
      •  The image of the Facebook Prineville Data Center is from the Intel Free Press, www.flickr.com/photos/intelfreepress/6722296855/, Creative Commons BY 2.0.
      •  The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732