Clouds and Commons for the Data Intensive Science Community

Robert Grossman
University of Chicago
Open Cloud Consortium

June 8, 2015
2015 NSF Open Science Data Cloud PIRE Workshop
Amsterdam
2000: Collect data and distribute files via ftp and apply data mining.
(Grids and federated computation.)
2010-2015: Make data available via open APIs and apply data science.
2020-2025: ???
1. Data Commons
We have a problem …

The commoditization of sensors is creating an explosive growth of data.
It can take weeks to download large geospatial datasets.
Analyzing the data is more expensive than producing it.
There is not enough funding for every researcher to house all the data they need.
Data Commons

Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data, to create a resource for the research community.

Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
The Tragedy of the Commons

Individuals, when they act independently following their self-interests, can deplete a common resource, contrary to the whole group’s long-term best interests.
(Garrett Hardin)

Source: Garrett Hardin, The Tragedy of the Commons, Science, Volume 162, Number 3859, pages 1243-1248, 13 December 1968.
www.opencloudconsortium.org

•  A U.S.-based not-for-profit corporation with international partners.
•  Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud, OCC/NASA Project Matsu, & OCC/NOAA Data Commons.
•  Manages cloud computing infrastructure to support medical and health care research: Biomedical Data Commons.
•  Manages cloud computing testbeds: Open Cloud Testbed.
What Scale?

•  New data centers are sometimes divided into “pods,” which can be built out as needed.
•  A reasonable scale for what is needed for a commons is one of these pods (a “cyberpod”).
•  Let’s use the term “datapod” for the analytic infrastructure that scales to a cyberpod.
•  Think of this as the scale-out of a database.
•  Think of this as 5-40+ racks.

[Figure: Pod A, Pod B]
[Chart: experimental science, simulation science, data science; 1609: 30x, 1670: 250x, 1976: 10x-100x, 2004: 10x-100x]
Core Data Commons Services

•  Digital IDs
•  Metadata services
•  High performance transport
•  Data export
•  Pay for compute, with images/containers containing commonly used tools, applications and services, specialized for each research community
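As a concrete illustration of the first two services, here is a minimal sketch of a registry that assigns digital IDs and records metadata. This is a toy, not OCC code; the class and method names are hypothetical.

```python
import uuid


class MetadataService:
    """Toy in-memory digital ID + metadata registry (illustration only)."""

    def __init__(self):
        self._store = {}

    def register(self, metadata):
        """Assign a digital ID to a data object and record its metadata."""
        digital_id = str(uuid.uuid4())
        self._store[digital_id] = dict(metadata)
        return digital_id

    def lookup(self, digital_id):
        """Return the metadata recorded for a digital ID."""
        return self._store[digital_id]


svc = MetadataService()
obj_id = svc.register({"name": "landsat_scene_001", "format": "GeoTIFF"})
assert svc.lookup(obj_id)["format"] == "GeoTIFF"
```

A production service would of course persist the registry, expose it over an API, and enforce access controls on the controlled-access portion of the metadata.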
  
[Diagram: Data Commons 1 and Data Commons 2, connected to Cloud 1, Cloud 2 and Cloud 3. Commons provide data to other commons and to clouds. Research projects produce data; research scientists at research centers A, B and C download data; the community develops open source software stacks for commons and clouds.]
Complex statistical models over small data that are highly manual and updated infrequently, versus simpler statistical models over large data that are highly automated and updated frequently.

[Chart: scale axis from memory (GB, W) to databases (TB, KW) to datapods and cyber pods (PB, MW)]
Is More Different? Do New Phenomena Emerge at Scale in Biomedical Data?

Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.
2. OCC Data Commons

matsu.opensciencedatacloud.org
OCC-NASA Collaboration, 2009-present
•  Public-private data collaborative announced April 21, 2015 by US Secretary of Commerce Pritzker.
•  AWS, Google, IBM, Microsoft and the Open Cloud Consortium will form five collaborations.
•  We will develop an OCC/NOAA Data Commons.
University of Chicago biomedical data commons developed in collaboration with the OCC.
Data Commons Architecture

•  Object storage (permanent)
•  Scalable lightweight workflow
•  Community data products (data harmonization)
•  Data submission portal and APIs
•  Data portal and open APIs for data access
•  Co-located “pay for compute”
•  Digital ID Service & Metadata Service
•  Devops supporting virtual machines and containers
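The “scalable lightweight workflow” layer can be illustrated with a toy dependency-ordered task runner. This is a hypothetical sketch, not the commons software; it only shows the idea of running harmonization steps in dependency order, using Python’s standard graphlib.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, deps):
    """Run tasks in dependency order.

    tasks maps a task name to a callable taking the results dict so far;
    deps maps a task name to the set of its prerequisite task names.
    """
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return results


# Toy pipeline: ingest -> harmonize -> publish
tasks = {
    "ingest": lambda r: ["raw1", "raw2"],
    "harmonize": lambda r: [x.upper() for x in r["ingest"]],
    "publish": lambda r: len(r["harmonize"]),
}
deps = {"harmonize": {"ingest"}, "publish": {"harmonize"}}
print(run_workflow(tasks, deps)["publish"])  # prints 2
```

Scaling this pattern to a pod means distributing the tasks across workers, but the dependency-ordering core is the same.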
  
3. Scanning Queries over Commons and the Matsu Wheel
What is Project Matsu?

Matsu is an open source project for processing satellite imagery to support earth sciences researchers using a data commons.

Matsu is a joint project between the Open Cloud Consortium and NASA’s EO-1 Mission (Dan Mandl, Lead).
NASA’s Matsu Mashup: all available L1G images (2010-now).
1.  The Open Science Data Cloud (OSDC) stores Level 0 data from EO-1 and uses an OpenStack-based cloud to create Level 1 data.
2.  The OSDC also provides OpenStack resources for the Namibia Flood Dashboard developed by Dan Mandl’s team.
3.  Project Matsu uses Hadoop applications to run analytics nightly and to create tiles with an OGC-compliant WMTS.
[Chart: amount of data retrieved vs. number of queries, positioning mashup, re-analysis and the “wheel”; row-oriented vs. column-oriented; done by staff vs. self-service by the community]
Spectral anomaly detected: Nishinoshima active volcano, December 2014.
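The wheel idea can be sketched as follows: instead of each analytic scanning the archive separately, all registered analytics ride a single nightly scan, so N analytics cost one pass over the data rather than N passes. This is an illustrative toy, not Project Matsu’s Hadoop code; the scene records and detectors are made up.

```python
def wheel_scan(dataset, analytics):
    """One 'turn of the wheel': every registered analytic sees each record
    during a single pass over the dataset."""
    results = {name: [] for name in analytics}
    for record in dataset:
        for name, detect in analytics.items():
            hit = detect(record)
            if hit is not None:
                results[name].append(hit)
    return results


scenes = [
    {"id": "s1", "max_radiance": 0.91, "cloud_cover": 0.10},
    {"id": "s2", "max_radiance": 0.20, "cloud_cover": 0.95},
]
analytics = {
    "spectral_anomaly": lambda s: s["id"] if s["max_radiance"] > 0.8 else None,
    "cloudy": lambda s: s["id"] if s["cloud_cover"] > 0.9 else None,
}
out = wheel_scan(scenes, analytics)
assert out["spectral_anomaly"] == ["s1"] and out["cloudy"] == ["s2"]
```

In the real system the pass is a distributed MapReduce job over the image archive, but the amortization argument is the same.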
4. Data Peering for Research Data

Tier 1 ISPs “Created” the Internet
[Chart: amount of data retrieved, number of queries, number of sites: downloading data vs. data peering]
Data Peering

[Diagram: Cloud 1, Data Commons 1, Data Commons 2]

•  Tier 1 Commons exchange data for the research community at no charge.
Three Requirements for Data Peering Between Data Commons

Two Research Data Commons with a Tier 1 data peering relationship agree as follows:

1.  To transfer research data between them at no cost beyond the fixed cost of a cross-connect.
2.  To peer with at least two other Tier 1 Research Data Commons at 10 Gbps or higher.
3.  To support Digital IDs (of a form to be determined by mutual agreement) so that a researcher using infrastructure associated with one Tier 1 Research Data Commons can access data transparently from any of the Tier 1 Research Data Commons that holds the desired data.
5. Requirements and Challenges for Data Commons
Cyber Pods

•  New data centers are sometimes divided into “pods,” which can be built out as needed.
•  A reasonable scale for what is needed for biomedical clouds and commons is one (or more) of these pods.
•  Let’s use the term “cyber pod” for a portion of a data center whose cyber infrastructure is dedicated to a particular project.

[Figure: Pod A, Pod B]
The 5P Requirements

•  Permanent objects
•  Software stacks that scale to cyber pods
•  Data peering
•  Portable data
•  Support for pay for compute
Requirement 1: Permanent Secure Objects

•  How do I assign Digital IDs and key metadata to open access and “controlled access” data objects, and to collections of data objects, to support distributed computation over large datasets by communities of researchers?
–  Metadata may be both public and controlled access.
–  Objects must be secure.
•  Think of this as a “DNS for data.”
•  The test: one commons serving the cancer community can transfer 1 PB of BAM files to another commons, and no bioinformaticians need to change their code.
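The “DNS for data” idea can be sketched as a resolver that maps a permanent digital ID to the object’s current location, so researcher code references only the ID and keeps working when data moves between commons. This is a hypothetical illustration; the resolver API and the IDs and URLs are invented.

```python
class DataResolver:
    """Toy 'DNS for data' (illustration only): maps a permanent digital ID
    to its current location, so analysis code never hard-codes where an
    object lives."""

    def __init__(self):
        self._locations = {}

    def publish(self, digital_id, url):
        self._locations[digital_id] = url

    def move(self, digital_id, url):
        # The object migrates to another commons; the ID is unchanged.
        self._locations[digital_id] = url

    def resolve(self, digital_id):
        return self._locations[digital_id]


resolver = DataResolver()
resolver.publish("id-bam-0001", "https://commons-a.example.org/objects/bam-0001")


def open_bam(digital_id):
    # Researcher code references only the digital ID, never a location.
    return resolver.resolve(digital_id)


assert "commons-a" in open_bam("id-bam-0001")
resolver.move("id-bam-0001", "https://commons-b.example.org/objects/bam-0001")
assert "commons-b" in open_bam("id-bam-0001")  # same code, new home
```

This is exactly the property behind the 1 PB test above: the BAM files change commons, the IDs and the bioinformatics code do not.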
  
Requirement 2: Software Stacks that Scale to Cyber Pods

•  How can I add a rack of computing/storage/networking equipment to a cyber pod (that has a manifest) so that:
–  after attaching to power, and
–  after attaching to the network,
–  no other manual configuration is required,
–  the data services can make use of the additional infrastructure, and
–  the compute services can make use of the additional infrastructure?
•  In other words, we need an open source software stack that scales to cyber pods.
•  Think of data services that scale to cyber pods as “datapods.”
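A toy sketch of manifest-driven scale-out: once a rack is attached to power and the network, the pod’s services grow from the manifest alone, with no per-node manual configuration. This is hypothetical; the manifest format and class names are invented for illustration.

```python
class CyberPod:
    """Toy pod whose capacity grows from rack manifests (illustration only)."""

    def __init__(self):
        self.storage_tb = 0
        self.cores = 0

    def add_rack(self, manifest):
        """Power and network are attached; everything else comes from
        the manifest, with no other manual configuration."""
        for node in manifest["nodes"]:
            self.storage_tb += node.get("storage_tb", 0)
            self.cores += node.get("cores", 0)


pod = CyberPod()
rack_manifest = {
    "rack": "R17",
    "nodes": [
        {"role": "storage", "storage_tb": 96},
        {"role": "compute", "cores": 64},
        {"role": "compute", "cores": 64},
    ],
}
pod.add_rack(rack_manifest)
assert (pod.storage_tb, pod.cores) == (96, 128)
```

A real stack would also rebalance the data services (replication, placement) onto the new nodes, which is the hard part the slide is pointing at.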
  
Core Services for a Biomedical Cloud

•  On-demand compute, either virtual machines or containers.
•  Access to data from a commons or another cloud.
Core Services for a Biomedical Data Commons

•  Digital ID Service
•  Metadata Service
•  Object-based storage (e.g. S3-compliant)
•  Lightweight workflow that scales to a pod
•  Pay-as-you-go compute environments
Common Services

•  Authentication that uses InCommon or a similar federation
•  Authorization from a third party (DACO, dbGaP)
•  Access controls
•  Infrastructure monitoring
•  Infrastructure automation framework
•  Security and compliance that scales
•  Accounting and billing
Requirement 3: Data Peering

•  How can a critical mass of data commons support data peering, so that a researcher at one of the commons can transparently access data managed by one of the other commons?
–  We need to access data independent of where it is stored.
–  “Tier 1 data commons” need to pass research data and other community data at no cost.
–  We need to be able to transport large data efficiently “end to end” between commons.
Data Peering

[Diagram: Cloud 1, Data Commons 1, Data Commons 2]

•  Tier 1 Data Commons exchange data for the research community at no charge.
Requirement 4: Data Portability

•  We need a simple button that can export our data from one data commons and import it into another one that peers with it.
•  We also need this to work for controlled access biomedical data.
•  Think of this as an “Indigo Button” which safely and compliantly moves biomedical data between commons, similar to the HHS “Blue Button.”
Requirement 5: Support Pay for Compute

•  The final requirement is to support “pay for compute” over the data in the commons. Payments can be through:
–  allocations
–  “chits”
–  credit cards
–  data commons “condos”
–  joint grants
–  etc.
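A minimal sketch of allocation-based “pay for compute” accounting: an allocation of core-hours is drawn down as compute is consumed next to the data. This is illustrative only; the API is invented.

```python
class ComputeAccount:
    """Toy accounting for 'pay for compute' against a commons:
    an allocation is drawn down as core-hours are consumed."""

    def __init__(self, allocation_core_hours):
        self.balance = allocation_core_hours

    def charge(self, cores, hours):
        """Debit a job's core-hours; refuse the job if the allocation
        cannot cover it. Returns the remaining balance."""
        cost = cores * hours
        if cost > self.balance:
            raise RuntimeError("allocation exhausted")
        self.balance -= cost
        return self.balance


acct = ComputeAccount(allocation_core_hours=1000)
acct.charge(cores=16, hours=10)  # a 160 core-hour job
assert acct.balance == 840
```

Chits, credit cards and condos would plug into the same debit point with different settlement behind it.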
  
6. OCC Global Distributed Data Commons
The Open Cloud Consortium is prototyping interoperating and peering data commons throughout the world (Chicago, Toronto, Cambridge and Asia) using 10 and 100 Gbps research networks.
2000: Collect data and distribute files via ftp and apply data mining.
2010-2015: Make data available via open APIs and apply data science.
2020-2025: Interoperate data commons, support data peering and apply ???
Questions?

For more information:
rgrossman.com
@bobgrossman

Weitere ähnliche Inhalte

Andere mochten auch

My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?Robert Grossman
 
Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Robert Grossman
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataRobert Grossman
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchRobert Grossman
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Robert Grossman
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Robert Grossman
 
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)Robert Grossman
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceRobert Grossman
 
Gestion Creativa: I+D+i en Software Libre como transformador económico social
Gestion Creativa: I+D+i en Software Libre como transformador económico socialGestion Creativa: I+D+i en Software Libre como transformador económico social
Gestion Creativa: I+D+i en Software Libre como transformador económico socialGuadalinfo Red Social
 
Formato 500 cartas jimenez 10a
Formato 500 cartas jimenez 10aFormato 500 cartas jimenez 10a
Formato 500 cartas jimenez 10agmariajimenez
 
Codigo Procesal Penal Comentado - Francisco D´Albora
Codigo Procesal Penal Comentado - Francisco D´AlboraCodigo Procesal Penal Comentado - Francisco D´Albora
Codigo Procesal Penal Comentado - Francisco D´AlboraEscuelaDeFiscales
 
Snickers Workwear Novedades Otoño-Invierno 2014
Snickers Workwear Novedades Otoño-Invierno 2014Snickers Workwear Novedades Otoño-Invierno 2014
Snickers Workwear Novedades Otoño-Invierno 2014Suministros Herco
 
Local Commerce – Wird der stationäre Handel wieder eine Chance verpassen?
Local Commerce – Wird der stationäre Handel wieder eine Chance verpassen?Local Commerce – Wird der stationäre Handel wieder eine Chance verpassen?
Local Commerce – Wird der stationäre Handel wieder eine Chance verpassen?Brack.ch
 
IANA Update September 2015
IANA Update September 2015IANA Update September 2015
IANA Update September 2015APNIC
 
Los conocemos mejor que tu
Los conocemos mejor que tuLos conocemos mejor que tu
Los conocemos mejor que tuDavid Morales
 
Spice world london 2012 Ben Snape
Spice world london 2012 Ben SnapeSpice world london 2012 Ben Snape
Spice world london 2012 Ben SnapeSpiceworks
 
Open source delivers great digital experiences
Open source delivers great digital experiencesOpen source delivers great digital experiences
Open source delivers great digital experiencesJeffrey McGuire
 

Andere mochten auch (19)

My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
 
Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
 
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of Science
 
Gestion Creativa: I+D+i en Software Libre como transformador económico social
Gestion Creativa: I+D+i en Software Libre como transformador económico socialGestion Creativa: I+D+i en Software Libre como transformador económico social
Gestion Creativa: I+D+i en Software Libre como transformador económico social
 
Formato 500 cartas jimenez 10a
Formato 500 cartas jimenez 10aFormato 500 cartas jimenez 10a
Formato 500 cartas jimenez 10a
 
Codigo Procesal Penal Comentado - Francisco D´Albora
Codigo Procesal Penal Comentado - Francisco D´AlboraCodigo Procesal Penal Comentado - Francisco D´Albora
Codigo Procesal Penal Comentado - Francisco D´Albora
 
Snickers Workwear Novedades Otoño-Invierno 2014
Snickers Workwear Novedades Otoño-Invierno 2014Snickers Workwear Novedades Otoño-Invierno 2014
Snickers Workwear Novedades Otoño-Invierno 2014
 
Local Commerce – Wird der stationäre Handel wieder eine Chance verpassen?
Local Commerce – Wird der stationäre Handel wieder eine Chance verpassen?Local Commerce – Wird der stationäre Handel wieder eine Chance verpassen?
Local Commerce – Wird der stationäre Handel wieder eine Chance verpassen?
 
Cataleg globaldis
Cataleg globaldisCataleg globaldis
Cataleg globaldis
 
IANA Update September 2015
IANA Update September 2015IANA Update September 2015
IANA Update September 2015
 
Los conocemos mejor que tu
Los conocemos mejor que tuLos conocemos mejor que tu
Los conocemos mejor que tu
 
Spice world london 2012 Ben Snape
Spice world london 2012 Ben SnapeSpice world london 2012 Ben Snape
Spice world london 2012 Ben Snape
 
Open source delivers great digital experiences
Open source delivers great digital experiencesOpen source delivers great digital experiences
Open source delivers great digital experiences
 

Mehr von Robert Grossman

Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanyRobert Grossman
 
Some Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data PlatformsSome Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data PlatformsRobert Grossman
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchRobert Grossman
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?Robert Grossman
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkRobert Grossman
 
Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Robert Grossman
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Robert Grossman
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Robert Grossman
 
Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011Robert Grossman
 

Mehr von Robert Grossman (14)

Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your Company
 
Some Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data PlatformsSome Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data Platforms
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical Research
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
 
Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 
Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011
 

Clouds and Commons for the Data Intensive Science Community (June 8, 2015)

  • 1. Clouds  and  Commons  for  the  Data  Intensive   Science  Community   Robert  Grossman   University  of  Chicago   Open  Cloud  Consor>um   June  8,  2015   2015  NSF  Open  Science  Data  Cloud  PIRE  Workshop   Amsterdam  
  • 2. Collect  data  and   distribute  files  via  Np   and  apply  data  mining     Make  data  available  via   open  APIs  and   apply  data  science   2000   2010-­‐2015   2020-­‐2025   ???   Grids  and   federated   computa>on  
  • 4. We  have  a  problem  …   The  commodi>za>on  of   sensors  is  crea>ng  an   explosive  growth  of  data   It  can  take  weeks  to   download  large  geo-­‐ spa>al  datasets   Analyzing  the  data  is  more   expensive  than  producing  it   There  is  not  enough  funding  for   every  researcher  to  house  all   the  data  they  need  
  • 5. Data  Commons   Data  commons  co-­‐locate  data,  storage  and  compu>ng   infrastructure,  and  commonly  used  tools  for  analyzing   and  sharing  data  to  create  a  resource  for  the  research   community.   Source:  Interior  of  one  of  Google’s  Data  Center,  www.google.com/about/datacenters/  
  • 6. The  Tragedy  of  the  Commons   Source:  Garre[  Hardin,  The  Tragedy  of  the  Commons,  Science,  Volume  162,  Number  3859,  pages   1243-­‐1248,  13  December  1968.   Individuals  when  they  act  independently  following  their   self  interests  can  deplete  a  deplete  a  common  resource,   contrary  to  a  whole  group's  long-­‐term  best  interests.   Garre[  Hardin  
  • 7. 7   www.opencloudconsor>um.org   •  U.S  based  not-­‐for-­‐profit  corpora>on  with   interna>onal  partners.   •  Manages  cloud  compu>ng  infrastructure  to  support   scien>fic  research:  Open  Science  Data  Cloud,  OCC/ NASA  Project  Matsu,  &  OCC/NOAA  Data  Commons.   •  Manages  cloud  compu>ng  infrastructure  to  support   medical  and  health  care  research:  Biomedical  Data   Commons.       •  Manages  cloud  compu>ng  testbeds:  Open  Cloud   Testbed.    
  • 8. What  Scale?   •  New  data  centers  are   some>mes  divided  into   “pods,”  which  can  be  built   out  as  needed.   •  A  reasonable  scale  for  what  is   needed  for  a  commons  is  one   of  these  pods  (“cyberpod”)   •  Let’s  use  the  term  “datapod”   for  the  analy>c  infrastructure   that  scales  to  a  cyberpod.   •  Think  of  as  the  scale  out  of  a   database.   •  Think  of  this  as  5-­‐40+  racks.   Pod  A   Pod  B  
  • 9. experimental   science   simula>on   science   data  science   1609   30x   1670   250x   1976   10x-­‐100x   2004   10x-­‐100x  
  • 10. Core  Data  Commons  Services   •  Digital  IDs   •  Metadata  services   •  High  performance  transport   •  Data  export   •  Pay  for  compute  with  images/containers  containing   commonly  used  tools,  applica>ons  and  services,   specialized  for  each  research  community  
• 11. [Diagram: research projects produce data for Data Commons 1 and 2; the commons provide data to other commons and to Clouds 1-3; research scientists at research centers A, B, and C download data; the community develops open source software stacks for commons and clouds.]
• 12. [Chart: complex statistical models over small data, highly manual and updated infrequently (memory, databases; GB-TB, W-KW) versus simpler statistical models over large data, highly automated and updated frequently (datapods, cyberpods; PB, MW).]
• 13. Is More Different? Do New Phenomena Emerge at Scale in Biomedical Data?
Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.
  • 14. 2.  OCC  Data  Commons  
• 16. • Public-private data collaborative announced April 21, 2015 by US Secretary of Commerce Pritzker.
• AWS, Google, IBM, Microsoft and the Open Cloud Consortium will form five collaborations.
• We will develop an OCC/NOAA Data Commons.
• 17. University of Chicago biomedical data commons developed in collaboration with the OCC.
• 18. Data Commons Architecture
[Diagram: object storage (permanent); scalable lightweight workflow; community data products (data harmonization); data submission portal and APIs; data portal and open APIs for data access; co-located "pay for compute"; Digital ID Service and Metadata Service; DevOps supporting virtual machines and containers.]
• 19. 3. Scanning Queries over Commons and the Matsu Wheel
• 20. What is Project Matsu?
Matsu is an open source project for processing satellite imagery to support earth science researchers using a data commons.
Matsu is a joint project between the Open Cloud Consortium and NASA's EO-1 Mission (Dan Mandl, Lead).
• 21. All available L1G images (2010-now)
• 23. 1. The Open Science Data Cloud (OSDC) stores Level 0 data from EO-1 and uses an OpenStack-based cloud to create Level 1 data.
2. The OSDC also provides OpenStack resources for the Namibia Flood Dashboard developed by Dan Mandl's team.
3. Project Matsu uses Hadoop applications to run analytics nightly and to create tiles with an OGC-compliant WMTS.
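The nightly analytics can be pictured as the "Matsu wheel": every registered analytic visits each scene during a single pass over the archive, so N analytics cost one scan rather than N. A hedged Python sketch (scene fields, thresholds, and names are all illustrative):

```python
# Illustrative "wheel" scan: all registered analytics ride one pass
# over the data. The scene structure and analytic are assumptions.

def wheel_scan(scenes, analytics):
    """Apply every analytic to every scene in one pass; collect hits."""
    results = {name: [] for name in analytics}
    for scene in scenes:                      # one pass over the archive
        for name, fn in analytics.items():    # every analytic rides along
            hit = fn(scene)
            if hit is not None:
                results[name].append(hit)
    return results

scenes = [{"id": "EO1-001", "max_radiance": 0.91},
          {"id": "EO1-002", "max_radiance": 0.35}]
analytics = {
    "spectral_anomaly": lambda s: s["id"] if s["max_radiance"] > 0.8 else None,
}
print(wheel_scan(scenes, analytics))
# {'spectral_anomaly': ['EO1-001']}
```

The real system runs the scan as Hadoop jobs over satellite imagery; the point here is only the access pattern.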
• 24. [Chart: number of queries versus amount of data retrieved for different access patterns — mashup, re-analysis, "wheel", row-oriented, column-oriented — ranging from done by staff to self-service by the community.]
• 25. Spectral anomaly detected: Nishinoshima active volcano, December 2014
  • 26. 4.  Data  Peering  for  Research  Data  
  • 27. Tier  1  ISPs  “Created”  the  Internet  
• 28. [Chart: number of sites that download data versus amount of data retrieved and number of queries, motivating data peering.]
• 29. Data Peering
[Diagram: Cloud 1, Data Commons 1, Data Commons 2.]
• Tier 1 Commons exchange data for the research community at no charge.
• 30. Three Requirements for Data Peering Between Data Commons
Two Research Data Commons with a Tier 1 data peering relationship agree as follows:
1. To transfer research data between them at no cost beyond the fixed cost of a cross-connect.
2. To peer with at least two other Tier 1 Research Data Commons at 10 Gbps or higher.
3. To support Digital IDs (of a form to be determined by mutual agreement) so that a researcher using infrastructure associated with one Tier 1 Research Data Commons can access data transparently from any of the Tier 1 Research Data Commons that holds the desired data.
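Assuming a commons can be described by a simple record, the three conditions above could be checked mechanically. The record fields and function name below are hypothetical, introduced only to make the agreement concrete:

```python
# Hypothetical compliance check for the three Tier 1 peering
# conditions; the record structure mirrors the slide, not a real API.

def is_tier1_compliant(commons):
    """A commons qualifies as Tier 1 only if all three conditions hold."""
    free_transfer = commons["transfer_cost_per_byte"] == 0       # condition 1
    enough_peers = sum(1 for p in commons["peers"]
                       if p["gbps"] >= 10) >= 2                   # condition 2
    digital_ids = commons["supports_digital_ids"]                 # condition 3
    return free_transfer and enough_peers and digital_ids

commons = {"transfer_cost_per_byte": 0,
           "peers": [{"name": "A", "gbps": 10}, {"name": "B", "gbps": 100}],
           "supports_digital_ids": True}
print(is_tier1_compliant(commons))  # True
```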
• 31. 5. Requirements and Challenges for Data Commons
• 32. Cyber Pods
• New data centers are sometimes divided into "pods," which can be built out as needed.
• A reasonable scale for what is needed for biomedical clouds and commons is one (or more) of these pods.
• Let's use the term "cyber pod" for a portion of a data center whose cyber infrastructure is dedicated to a particular project.
• 33. The 5P Requirements
• Permanent objects
• Software stacks that scale to cyber Pods
• Data Peering
• Portable data
• Support for Pay-for-compute
• 34. Requirement 1: Permanent Secure Objects
• How do I assign Digital IDs and key metadata to open access and "controlled access" data objects and collections of data objects to support distributed computation over large datasets by communities of researchers?
- Metadata may be both public and controlled access
- Objects must be secure
• Think of this as a "DNS for data."
• The test: one commons serving the cancer community can transfer 1 PB of BAM files to another commons and no bioinformaticians need to change their code.
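A "DNS for data" can be sketched as a resolver that maps a stable digital ID to its current locations; analysis code keyed by the ID keeps working when replicas move between commons. The resolver class and ID/URL forms below are illustrative assumptions, not a specification:

```python
# Minimal "DNS for data" sketch: a digital ID resolves to one or more
# current locations, so code keyed by the ID is unaffected when the
# underlying objects move. All names here are illustrative.

class DigitalIDResolver:
    def __init__(self):
        self._locations = {}   # digital ID -> list of replica URLs

    def register(self, digital_id, url):
        self._locations.setdefault(digital_id, []).append(url)

    def resolve(self, digital_id):
        """Return every known location; callers pick the nearest."""
        return list(self._locations[digital_id])

resolver = DigitalIDResolver()
resolver.register("id://tcga/sample-001.bam", "s3://commons-1/sample-001.bam")
# After a bulk transfer, the same ID simply gains a second location:
resolver.register("id://tcga/sample-001.bam", "s3://commons-2/sample-001.bam")
print(len(resolver.resolve("id://tcga/sample-001.bam")))  # 2
```

This is the sense in which a 1 PB transfer should require no code changes: the IDs in the bioinformaticians' scripts never change, only the resolver's answers.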
• 35. Requirement 2: Software Stacks that Scale to Cyber Pods
• How can I add a rack of computing/storage/networking equipment to a cyber pod (that has a manifest) so that
- After attaching to power
- After attaching to network
- No other manual configuration is required
- The data services can make use of the additional infrastructure
- The compute services can make use of the additional infrastructure
• In other words, we need an open source software stack that scales to cyber pods.
• Think of data services that scale to cyber pods as "datapods."
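One way to picture zero-touch rack addition: the rack's manifest is read once, and each node is routed into the matching service pool with no per-node configuration. The manifest schema and pool structure here are hypothetical:

```python
# Hypothetical sketch of zero-touch rack addition: a rack ships with a
# manifest, and absorbing it registers every node with the matching
# service pool; no manual per-node configuration is modeled.

def absorb_rack(manifest, pools):
    """Route each node in the rack manifest into its service pool."""
    for node in manifest["nodes"]:
        pools.setdefault(node["role"], []).append(node["hostname"])
    return pools

manifest = {"rack": "r42",
            "nodes": [{"hostname": "r42-n1", "role": "storage"},
                      {"hostname": "r42-n2", "role": "storage"},
                      {"hostname": "r42-n3", "role": "compute"}]}
pools = absorb_rack(manifest, {"storage": [], "compute": []})
print(pools["storage"])  # ['r42-n1', 'r42-n2']
```

A real stack would also rebalance data and schedule work onto the new nodes; the manifest-driven registration is the part the slide calls out.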
• 36. Core Services for a Biomedical Cloud
• On-demand compute, either virtual machines or containers
• Access to data from a commons or other cloud
Core Services for a Biomedical Data Commons
• Digital ID Service
• Metadata Service
• Object-based storage (e.g. S3-compliant)
• Lightweight workflow that scales to a pod
• Pay-as-you-go compute environments
• 37. Common Services
• Authentication that uses InCommon or a similar federation
• Authorization from a third party (DACO, dbGaP)
• Access controls
• Infrastructure monitoring
• Infrastructure automation framework
• Security and compliance that scales
• Accounting and billing
• 38. Requirement 3: Data Peering
• How can a critical mass of data commons support data peering so that a researcher at one of the commons can transparently access data managed by one of the other commons?
- We need to access data independent of where it is stored
- "Tier 1 data commons" need to pass research data and other community data at no cost
- We need to be able to transport large data efficiently "end to end" between commons
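Transparent access might be sketched as a fetch that falls back from the local commons to its peers, so the caller never names a site. The data layout and IDs below are assumptions for illustration:

```python
# Illustrative sketch of transparent access across peered commons:
# the caller asks for a digital ID and the fetch falls back from the
# local commons to its Tier 1 peers. Structures are hypothetical.

def fetch(digital_id, local, peers):
    """Return (site, object) from the first commons holding the ID."""
    for site, holdings in [("local", local)] + list(peers.items()):
        if digital_id in holdings:
            return site, holdings[digital_id]
    raise KeyError(digital_id)

local = {"id://noaa/ghcn-2015": b"..."}
peers = {"commons-2": {"id://eo1/scene-88": b"..."}}
print(fetch("id://eo1/scene-88", local, peers)[0])  # commons-2
```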
• 39. Data Peering
[Diagram: Cloud 1, Data Commons 1, Data Commons 2.]
• Tier 1 Data Commons exchange data for the research community at no charge.
• 40. Requirement 4: Data Portability
• We need a simple button that can export our data from one data commons and import it into another one that peers with it.
• We also need this to work for controlled access biomedical data.
• Think of this as an "Indigo Button" which safely and compliantly moves biomedical data between commons, similar to the HHS "Blue Button."
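At its simplest, the export/import "button" reduces to bundling objects together with their digital IDs and metadata, then replaying the bundle at the receiving commons so the IDs keep resolving after the move. A hedged sketch with hypothetical structures:

```python
# Hypothetical sketch of the portability "button": export bundles
# objects with their digital IDs and metadata; import replays the
# bundle at the receiving commons.

def export_bundle(commons):
    return {"objects": dict(commons["objects"]),
            "metadata": dict(commons["metadata"])}

def import_bundle(bundle, commons):
    commons["objects"].update(bundle["objects"])
    commons["metadata"].update(bundle["metadata"])
    return commons

src = {"objects": {"id://x/1": b"data"},
       "metadata": {"id://x/1": {"type": "BAM"}}}
dst = {"objects": {}, "metadata": {}}
import_bundle(export_bundle(src), dst)
print("id://x/1" in dst["objects"])  # True
```

For controlled-access data the bundle would also have to carry authorizations and an audit trail, which is what makes the real button hard.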
• 41. Requirement 5: Support Pay for Compute
• The final requirement is to support "pay for compute" over the data in the commons. Payments can be through:
- Allocations
- "Chits"
- Credit cards
- Data commons "condos"
- Joint grants
- etc.
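Whatever the instrument, each of these reduces to crediting an account that compute jobs then draw down. A minimal illustrative sketch (class name and rate are assumptions):

```python
# Illustrative pay-for-compute accounting: allocations, chits, card
# payments, etc. all top up a balance that compute jobs draw down.

class ComputeAccount:
    def __init__(self, balance=0.0):
        self.balance = balance

    def credit(self, amount):
        # an allocation, a "chit", a credit card payment, ...
        self.balance += amount

    def charge(self, core_hours, rate=0.05):
        cost = core_hours * rate
        if cost > self.balance:
            raise RuntimeError("insufficient allocation")
        self.balance -= cost
        return cost

acct = ComputeAccount()
acct.credit(100.0)            # e.g. a quarterly allocation
acct.charge(200, rate=0.05)   # 200 core-hours at an assumed $0.05/hour
print(round(acct.balance, 2))  # 90.0
```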
  • 42. 6.    OCC  Global  Distributed  Data  Commons  
• 43. The Open Cloud Consortium is prototyping interoperating and peering data commons throughout the world (Chicago, Toronto, Cambridge and Asia) using 10 and 100 Gbps research networks.
• 44. 2000: Collect data, distribute files via ftp, and apply data mining.
2010-2015: Make data available via open APIs and apply data science.
2020-2025: Interoperate data commons, support data peering, and apply ???
• 45. Questions?
For more information: rgrossman.com, @bobgrossman