Using the Open Science Data Cloud for Data Science Research
Robert Grossman
University of Chicago
Open Cloud Consortium
June 17, 2013
  
Discoveries
Data: 1 PB of OSDC data across several disciplines
+ Instrument: 3000 cores / 5 PB OSDC science cloud
+ Team: you and your colleagues
+ correlation algorithms
  
Part 1
What Instrument Do We Use to Make Big Data Discoveries?
How do we build a “datascope?”
  
What is big data?
TB? PB? EB?
W? KW? MW?
  
An algorithm and computing infrastructure is “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
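One way to read this definition is as a constant-time check under growth. The sketch below is a deliberately idealized model, not an OSDC measurement: the per-rack capacity, the per-rack throughput, and the function names are all hypothetical, and it assumes each rack scans only its own local data with no coordination cost.

```python
def run_time_hours(racks, rack_tb=100.0, rack_tb_per_hour=10.0):
    """Idealized run time when every rack processes only its own data.

    rack_tb and rack_tb_per_hour are illustrative assumptions.
    """
    total_data_tb = racks * rack_tb
    total_throughput_tb_per_hour = racks * rack_tb_per_hour
    return total_data_tb / total_throughput_tb_per_hour


def is_big_data_scalable(times, tolerance=0.05):
    """True if every run time stays within `tolerance` of the first one."""
    baseline = times[0]
    return all(abs(t - baseline) / baseline <= tolerance for t in times)


rack_counts = (1, 2, 4, 8)
times = [run_time_hours(n) for n in rack_counts]
for n, t in zip(rack_counts, times):
    print(f"{n} rack(s): {n * 100.0:.0f} TB in {t:.1f} h")
print("big-data scalable:", is_big_data_scalable(times))
```

In a real system the run time would drift upward as racks are added (shuffles, stragglers, metadata), so the tolerance is the interesting parameter: it states how much drift one is willing to accept and still call the system scalable.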
  
Commercial Cloud Service Provider (CSP)
15 MW Data Center
•  100,000 servers
•  1 PB DRAM
•  100’s of PB of disk
•  Automatic provisioning and infrastructure management
•  Monitoring, network security and forensics
•  Accounting and billing
•  Customer Facing Portal
•  Data center network
•  ~1 Tbps egress bandwidth
25 operators for 15 MW Commercial Cloud
  
OSDC’s vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure.
  
Some Examples of Big Data Science
| Discipline | Duration | Size | # Devices |
| HEP - LHC | 10 years | 15 PB/year* | One |
| Astronomy - LSST | 10 years | 12 PB/year** | One |
| Genomics - NGS | 2-4 years | 0.5 TB/genome | 1000’s |
One large instrument / Many smaller instruments

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
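The yearly rates quoted for the LHC and LSST can be sanity-checked with a few lines of arithmetic. This is only an order-of-magnitude sketch: it assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB) and, as a simplification that does not come from the slides, 365 observing nights per year for LSST.

```python
GB_PER_PB = 1_000_000  # decimal units (assumption)
TB_PER_PB = 1_000


def gb_to_pb(gigabytes):
    """Convert gigabytes to petabytes using decimal units."""
    return gigabytes / GB_PER_PB


def nightly_tb_to_yearly_pb(tb_per_night, nights_per_year=365):
    """Scale a nightly rate in TB to a yearly volume in PB.

    nights_per_year=365 is a simplifying assumption.
    """
    return tb_per_night * nights_per_year / TB_PER_PB


lhc_pb_per_year = gb_to_pb(15_000_000)            # "15 million Gigabytes" per year
lsst_raw_pb_per_year = nightly_tb_to_yearly_pb(15)        # raw data
lsst_processed_pb_per_year = nightly_tb_to_yearly_pb(30)  # processed data

print(f"LHC: {lhc_pb_per_year:.1f} PB/year")
print(f"LSST raw: {lsst_raw_pb_per_year:.2f} PB/year")
print(f"LSST processed: {lsst_processed_pb_per_year:.2f} PB/year")
```

Under these assumptions the LHC figure reproduces the 15 PB/year in the table, and the LSST nightly rates land within the same order of magnitude as the 12 PB/year entry.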
  
Part 2.
What is a Cloud and Why Do We Care?
  
There Are Two Essential Characteristics of a Cloud
1.  Self service
2.  Scale
•  Clouds enable you to compute over large amounts of data without the necessity of first downloading the data.
•  Clouds can be designed to be secure and compliant.
  
Self Service
Scale
  
Types of Clouds
•  Public Clouds
– Amazon
•  Private Clouds
– Run internally by universities or companies
•  Community Clouds
– Run by organizations (either formally or informally), such as the Open Cloud Consortium
  
Amazon Web Services (AWS)?
•  Scale
•  Simplicity of a credit card
•  Wide variety of offerings.
vs.
Community clouds, science clouds, etc.
•  Lower cost (at medium scale)
•  Data too important for commercial cloud
•  Computing over scientific data is a core competency
•  Can support any required governance / security
OCC supports AWS interop and bursting when permissible.
  
Science Clouds
|  | NFP Science Clouds | Commercial Clouds |
| POV | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds. |
| Data & Storage | Data intensive computing & HP storage | Internet style scale out and object-based storage |
| Flows | Large & small data flows | Lots of small web flows |
| Streams | Streaming processing required | NA |
| Accounting | Essential | Essential |
| Lock in | Moving environment between CSPs essential | Lock in is good |
| Interop | Critical, but difficult | Customers will drive to some degree |
  
Essential Services for a Science CSP
•  Support for data intensive computing
•  Support for big data flows
•  Account management, authentication and authorization services
•  Health and status monitoring
•  Billing and accounting
•  Ability to rapidly provision infrastructure
•  Security services, logging, event reporting
•  Access to large amounts of public data
•  High performance storage
•  Simple data export and import services
  
Datascope – Science Cloud Service Provider (Sci CSP)
Data scientist
Sci CSP services
  
Cloud Services Operations Centers (CSOC)
•  The OSDC operates a Cloud Services Operations Center (or CSOC).
•  It is a CSOC focused on supporting Science Clouds for researchers.
•  Compare to a Network Operations Center or NOC.
•  Both are an important part of cyber infrastructure for big data science.
  
Datascope – Science Cloud Service Provider (Sci CSP)
Data scientist
Sci CSP services
Cloud Service Operations Center (CSOC)
  
Part 3
Data Science
  
What are the foundations for data science?
•  Foundations of data science
•  General and discipline specific software applications and tools
•  Models and algorithms
•  Establish best practices, strategies for data science in general and discipline specific data science in particular
•  Analytic infrastructure
•  Data
  
Theory to Big Data Spectrum
•  No data: Mathematical theorems
•  Small data: Traditional statistical modeling
•  Medium data: (Semi-)Automating statistical modeling
•  Big data: Simple counts and statistics over big data
GB, TB, PB
OSDC Datascope: 0.5-2.0 MW
  
Part 4
The Open Science Data Cloud
www.opensciencedatacloud.org
  
2013 Open Science Data Cloud (IaaS)
•  5 PB 2013 (OpenStack & GlusterFS)
•  Infrastructure automation & management (Yates)
•  Compliance & security (OpenFISMA)
•  Accounting & billing (Salesforce.com)
•  Customer Facing Portal (Tukey)
•  Data center network
•  ~10-100 Gbps bandwidth
5 engineers to operate 0.5 MW Science Cloud
  
Science Cloud SW & Services
•  Virtual Machine (VM) containing common applications & pipelines
•  Tukey (OSDC portal & middleware v0.3)
•  Yates (infrastructure automation and management v0.1)
  
Tukey
•  Tukey (based in part on Horizon).
•  We have factored out digital ID service, file sharing, and transport from Bionimbus and Matsu.
  
Yates
•  Automated installation of the OSDC software stack on a rack of computers.
•  Based upon Chef
•  Version 0.1
  
UDR
•  UDT is a high performance network transport protocol
•  UDR = rsync + UDT
•  It is easy for an average systems administrator to keep 100’s of TB of distributed data synchronized.
•  We are using it to distribute c. 1 PB from the OSDC
  
Open Science Data Cloud Services
•  Digital ID services
•  Data sharing services
•  Data transport services (UDR)
•  What other core services are essential?
•  Of course, working groups and applications always add their own services
•  These core services will hopefully make the OSDC attractive as a platform (PaaS) for scientific discovery.
  
www.opencloudconsortium.org
•  U.S.-based not-for-profit corporation.
•  Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
•  Manages cloud computing infrastructure to support medical and health care research: Biomedical Commons Cloud
•  Manages cloud computing testbeds: Open Cloud Testbed.
  
OCC Members & Partners
•  Companies: Cisco, Yahoo!, Intel, …
•  Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
•  Federal agencies and labs: NASA
•  International Partners: Univ. Edinburgh, AIST (Japan), Univ. Amsterdam, …
•  Partners: National Lambda Rail
  
Third party open source software
+ Open source software developed by the OCC (Tukey, Yates) and open standards
+ Data center
+ Data with permissions
+ Authorization of users' access to data
+ Policies, procedures, controls, etc.
+ Governance, legal agreements
+ Sustainability model
  
Part 5
OSDC Data
  
OSDC Public Data Sets
•  Over 800 TB of open access data in the OSDC
•  Earth sciences data
•  Biological sciences data
•  Social sciences data
•  Digital humanities
  
Part 6
OSDC Working Groups
Just look around you
  
Matsu Working Group: Clouds to Support Earth Science
matsu.opensciencedatacloud.org
  
Matsu Architecture
•  Hadoop HDFS
•  Matsu Web Map Tile Service (WMTS)
•  Matsu MR-based Tiling Service
•  NoSQL Database
•  Images at different zoom layers suitable for OGC Web Mapping Server
•  Level 0, Level 1 and Level 2 images
•  MapReduce used to process Level n to Level n+1 data and to partition images for different zoom levels
•  Analytic Services
– NoSQL-based Analytic Services
– Streaming Analytic Services
– MR-based Analytic Services
•  Storage for WMS tiles and derived data products
•  Presentation Services
•  Web Coverage Processing Service (WCPS)
•  Workflow Services
•  Hadoop-Based Re-Analysis
  
Zoom Level 1: 4 images
Zoom Level 2: 16 images
Zoom Level 3: 64 images
Zoom Level 4: 256 images
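The image counts above (4, 16, 64, 256) are consistent with a tile pyramid in which every tile splits into a 2 x 2 block at the next zoom level. The sketch below illustrates that pattern only; the `(zoom, x, y)` addressing and the function names are assumptions for illustration and are not taken from the Matsu code.

```python
def tiles_at_zoom(zoom):
    """Number of tiles at a zoom level when each tile splits into 2 x 2 children."""
    return 4 ** zoom


def tiles_for_level(zoom):
    """All hypothetical (zoom, x, y) tile addresses at one zoom level."""
    side = 2 ** zoom
    return [(zoom, x, y) for y in range(side) for x in range(side)]


def child_tiles(zoom, x, y):
    """The four tiles at zoom + 1 that cover tile (x, y) at `zoom`."""
    return [(zoom + 1, 2 * x + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]


for zoom in range(1, 5):
    print(f"Zoom Level {zoom}: {tiles_at_zoom(zoom)} images")
```

Because each tile depends only on its own region of the source image, a partitioning step of this shape maps naturally onto independent map tasks, which is one reason a MapReduce-based tiling service is a plausible fit.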
  
Bionimbus Working Group
bionimbus.opensciencedatacloud.org (biological data)
Bionimbus Protected Data Cloud
  
Analyzing Data From The Cancer Genome Atlas (TCGA)
Current Practice
1.  Apply to dbGaP for access to data.
2.  Hire staff, set up and operate a secure compliant computing environment to manage 10 – 100+ TB of data.
3.  Get environment approved by your research center.
4.  Set up analysis pipelines.
5.  Download data from CG-Hub (takes days to weeks).
6.  Begin analysis.
With Protected Data Cloud (PDC)
1.  Apply to dbGaP for access to data.
2.  Use your eRA commons credentials to log in to the PDC, select the data that you want to analyze, and the pipelines that you want to use.
3.  Begin analysis.
  
One Million Genomes
•  Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
•  The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
•  One million genomes is about 1000 PB or 1 EB
•  With compression, it may be about 100 PB
•  At $1000/genome, the sequencing would cost about $1B
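The estimates in this list follow from simple multiplication, sketched below. Decimal units are assumed (1 PB = 1,000 TB, 1 EB = 1,000 PB), and the 10x compression ratio is not stated on the slide; it is inferred from the step from about 1000 PB to about 100 PB.

```python
TB_PER_PB = 1_000
PB_PER_EB = 1_000

genomes = 1_000_000
tb_per_patient = 1            # tumor + normal tissue, per the slide
cost_per_genome_usd = 1_000
compression_ratio = 10        # assumption inferred from 1000 PB -> ~100 PB

raw_pb = genomes * tb_per_patient / TB_PER_PB
raw_eb = raw_pb / PB_PER_EB
compressed_pb = raw_pb / compression_ratio
sequencing_cost_usd = genomes * cost_per_genome_usd

print(f"raw: {raw_pb:.0f} PB ({raw_eb:.0f} EB)")
print(f"compressed: {compressed_pb:.0f} PB")
print(f"sequencing cost: ${sequencing_cost_usd / 1e9:.0f}B")
```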
  
Big data driven discovery on 1,000,000 genomes and 1 EB of data.
•  Genomic-driven diagnosis
•  Improved understanding of genomic science
•  Genomic-driven drug development
Precision diagnosis and treatment. Preventive health care.
  
Biomedical Commons Cloud (BCC) Working Group
•  Cloud for Public Data
•  Cloud for Controlled Genomic Data
•  Cloud for EMR, PHI, data
Example: Open Cloud Consortium’s Biomedical Commons Cloud (BCC)
•  Medical Research Center A
•  Medical Research Center B
•  Medical Research Center C
•  Hospital D
  
| Resource | Who uses | Who operates |
| Open Science Data Cloud (OSDC) | Pan science data for researchers | Open Cloud Consortium (OCC) supported by University OCC members |
| Biomedical Commons Clouds (BCC) | (International) biomedical researchers | OCC Biomedical Commons Cloud Working Group supported by OCC University members |
| Bionimbus Protected Data Cloud | Genomics researchers | University of Chicago supported by the OCC |
  
OpenFlow-Enabled Hadoop WG
•  When running Hadoop, some map and reduce jobs take significantly longer than others.
•  These are stragglers and can significantly slow down a MapReduce computation.
•  Stragglers are common (dirty secret about Hadoop)
•  Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
•  We have a testbed for a wide area version of this project.
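Why a single straggler matters can be seen in a toy model: a phase finishes only when its slowest task finishes. The sketch below is an illustration under stated assumptions, not the working group's method. The task durations are made up, and `speed_up_stragglers` crudely models "additional bandwidth" as a fixed speedup for any task slower than a multiple of the median; it says nothing about how OpenFlow would actually allocate bandwidth.

```python
import statistics


def job_time(task_seconds):
    """A map or reduce phase finishes only when its slowest task finishes."""
    return max(task_seconds)


def speed_up_stragglers(task_seconds, threshold=1.5, speedup=2.0):
    """Hypothetical mitigation: tasks slower than threshold x median run `speedup` x faster."""
    median = statistics.median(task_seconds)
    return [t / speedup if t > threshold * median else t for t in task_seconds]


tasks = [60, 62, 58, 61, 180]  # made-up durations in seconds; the last is a straggler
print("phase time with straggler:", job_time(tasks))
print("phase time after mitigation:", job_time(speed_up_stragglers(tasks)))
```

In this toy example four of the five tasks finish in about a minute, yet the phase takes three minutes, so helping only the one slow task halves the phase time.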
  
OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
•  We give them tutorials and training on big data science.
•  We provide them fellowships to work with OSDC international partners.
•  We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow.
www.opensciencedatacloud.org (look for PIRE)
  
Part 7
Key Questions for This Workshop
•  Question 1. How can we add partner sites at other locations that extend the OSDC? In particular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?
•  Question 2. What data can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?
•  Question 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communities?
•  Question 4. What tools and applications can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?
•  Question 5. How can we better integrate digital IDs and file sharing services into the OSDC?
•  Question 6. What are 3-5 grand challenge questions that leverage the OSDC?
  
Questions

Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informatics Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computation Institute and the Institute for Genomics and Systems Biology, and a Professor of Medicine in the Section of Genetic Medicine. His research group focuses on big data, biomedical informatics, data science, cloud computing, and related areas.

He is also the Founder and a Partner of Open Data Group, which has been building predictive models over big data for companies for over ten years.

He recently wrote a book for the general reader that discusses big data (among other topics) called the Structure of Digital Computing: From Mainframes to Big Data, which can be purchased from Amazon.

He blogs occasionally about big data at rgrossman.com.

Weitere ähnliche Inhalte

Was ist angesagt?

Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
 
Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Robert Grossman
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkRobert Grossman
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...Geoffrey Fox
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Geoffrey Fox
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudAmazon Web Services
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...Ian Foster
 
Big Data HPC Convergence
Big Data HPC ConvergenceBig Data HPC Convergence
Big Data HPC ConvergenceGeoffrey Fox
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Robert Grossman
 
Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)Robert Grossman
 
Multipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendationMultipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendationKan Yuenyong
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Ian Foster
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the ContinuumIan Foster
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light SourcesIan Foster
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitwarebigdataviz_bay
 

Was ist angesagt? (20)

Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
Big Data
Big Data Big Data
Big Data
 
Cri big data
Cri big dataCri big data
Cri big data
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
Big Data HPC Convergence
Big Data HPC ConvergenceBig Data HPC Convergence
Big Data HPC Convergence
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)
 
Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)
 
Multipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendationMultipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendation
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitware
 

Andere mochten auch

Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Robert Grossman
 
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)Robert Grossman
 
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)Robert Grossman
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
Practical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large DatasetsPractical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large DatasetsRobert Grossman
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016Robert Grossman
 

Using the Open Science Data Cloud for Data Science Research

  • 1. Using the Open Science Data Cloud for Data Science Research. Robert Grossman, University of Chicago, Open Cloud Consortium. June 17, 2013.
  • 2. Discoveries: Data (1 PB of OSDC data across several disciplines) + Instrument (3000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms.
  • 3. Part 1: What Instrument Do We Use to Make Big Data Discoveries? How do we build a "datascope"?
  • 4. What is big data? TB? PB? EB? W? KW? MW?
  • 5. An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
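The definition on slide 5 can be illustrated with a toy model. This is a minimal sketch, assuming work is spread evenly over racks with no coordination overhead; the per-rack capacity and scan rate are hypothetical numbers chosen for illustration, not OSDC measurements.

```python
def runtime_hours(total_data_tb: float, racks: int, tb_per_rack_hour: float) -> float:
    """Idealized runtime: data is split evenly and every rack scans its share in parallel."""
    return total_data_tb / (racks * tb_per_rack_hour)


# Hypothetical figures: each rack stores 100 TB and scans 10 TB per hour.
TB_PER_RACK = 100.0
TB_PER_RACK_HOUR = 10.0

for racks in (1, 2, 4, 8):
    data_tb = racks * TB_PER_RACK  # every added rack brings its own data
    hours = runtime_hours(data_tb, racks, TB_PER_RACK_HOUR)
    print(f"{racks} rack(s), {data_tb:.0f} TB: {hours:.1f} h")
```

In this idealized model the runtime stays flat as racks and data grow together, which is the property the slide calls "big-data scalable". Real systems would add communication and scheduling costs that the sketch ignores.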
  • 6. Commercial Cloud Service Provider (CSP): 15 MW data center; 100,000 servers; 1 PB DRAM; 100's of PB of disk. Services: automatic provisioning and infrastructure management; monitoring, network security and forensics; accounting and billing; customer-facing portal. Data center network: ~1 Tbps egress bandwidth. 25 operators for 15 MW commercial cloud.
  • 7. OSDC's vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure.
  • 8. Discoveries: Data (1 PB of OSDC data across several disciplines) + Instrument (3000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms.
  • 9. Some Examples of Big Data Science
      Discipline | Duration | Size | # Devices
      HEP - LHC | 10 years | 15 PB/year* | One
      Astronomy - LSST | 10 years | 12 PB/year** | One
      Genomics - NGS | 2-4 years | 0.5 TB/genome | 1000's
      *At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
      **As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
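The sizes quoted on slide 9 can be put on a common scale with a few lines of arithmetic. This is a minimal sketch using decimal units; the assumption that LSST observes 365 nights a year and the cohort of 1000 genomes are illustrative choices, not figures from the slide.

```python
GB = 10**9
TB = 10**12
PB = 10**15


def to_pb(n_bytes: int) -> float:
    """Convert a byte count to decimal petabytes."""
    return n_bytes / PB


# LHC: "more than 15 million Gigabytes of data each year" (slide footnote).
lhc_per_year = 15_000_000 * GB

# LSST: 30 TB processed per night; 365 observing nights is an assumption.
lsst_per_year = 30 * TB * 365

# Genomics: 0.5 TB per genome; a 1000-genome cohort is a hypothetical example.
genomes_1000 = 1000 * TB // 2

print(f"LHC:  {to_pb(lhc_per_year):.2f} PB/year")
print(f"LSST: {to_pb(lsst_per_year):.2f} PB/year (processed, upper bound)")
print(f"1000 genomes: {to_pb(genomes_1000):.2f} PB")
```

The LHC figure reproduces the 15 PB/year in the table directly. The LSST estimate lands in the same range as the table's 12 PB/year, but it is not a derivation of that number.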
  • 10. One large instrument. Many smaller instruments.
  • 11. Part 2: What is a Cloud and Why Do We Care?
  • 12. There Are Two Essential Characteristics of a Cloud
      1. Self service
      2. Scale
      - Clouds enable you to compute over large amounts of data without the necessity of first downloading the data.
      - Clouds can be designed to be secure and compliant.
  • 13. Self Service
  • 15. Types of Clouds
      - Public clouds: Amazon
      - Private clouds: run internally by universities or companies
      - Community clouds: run by organizations (either formally or informally), such as the Open Cloud Consortium
  • 16. Amazon Web Services (AWS) vs. community clouds, science clouds, etc.
      Community and science clouds:
      - Lower cost (at medium scale)
      - Data too important for commercial cloud
      - Computing over scientific data is a core competency
      - Can support any required governance / security
      AWS:
      - Scale
      - Simplicity of a credit card
      - Wide variety of offerings
      OCC supports AWS interop and bursting when permissible.
  • 17. Science Clouds
      Topic | NFP Science Clouds | Commercial Clouds
      POV | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds.
      Data & Storage | Data intensive computing & HP storage | Internet style scale out and object-based storage
      Flows | Large & small data flows | Lots of small web flows
      Streams | Streaming processing required | NA
      Accounting | Essential | Essential
      Lock in | Moving environment between CSPs essential | Lock in is good
      Interop | Critical, but difficult | Customers will drive to some degree
  • 18. Essential Services for a Science CSP
      - Support for data intensive computing
      - Support for big data flows
      - Account management, authentication and authorization services
      - Health and status monitoring
      - Billing and accounting
      - Ability to rapidly provision infrastructure
      - Security services, logging, event reporting
      - Access to large amounts of public data
      - High performance storage
      - Simple data export and import services
  • 19. Datascope: Science Cloud Service Provider (Sci CSP). Data scientist; Sci CSP services.
  • 20. Cloud Services Operations Centers (CSOC)
      - The OSDC operates a Cloud Services Operations Center (or CSOC).
      - It is a CSOC focused on supporting Science Clouds for researchers.
      - Compare to a Network Operations Center or NOC.
      - Both are an important part of cyber infrastructure for big data science.
  • 21. Datascope: Science Cloud Service Provider (Sci CSP). Data scientist; Sci CSP services; Cloud Service Operations Center (CSOC).
  • 22. Part 3: Data Science
  • 23. Data. Foundations of data science. General and discipline-specific software applications and tools. Models and algorithms. Establish best practices and strategies for data science in general and discipline-specific data science in particular. Analytic infrastructure. Data.
  • 24. What are the foundations for data science?
  • 25. Theory to Big Data Spectrum: mathematical theorems (no data); traditional statistical modeling (small data); (semi-)automating statistical modeling (medium data); simple counts and statistics over big data (big data). Scale markers: GB, TB, PB. OSDC Datascope: 0.5-2.0 MW.
  • 26. Part 4: The Open Science Data Cloud. www.opensciencedatacloud.org
  • 27. Discoveries: Data (1 PB of OSDC data across several disciplines) + Instrument (3000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms.
  • 28. 2013 Open Science Data Cloud (IaaS): 5 PB 2013 (OpenStack & GlusterFS); infrastructure automation & management (Yates); compliance & security (OpenFISMA); accounting & billing (Salesforce.com); customer-facing portal (Tukey). Data center network: ~10-100 Gbps bandwidth. 5 engineers to operate 0.5 MW Science Cloud.
      Science Cloud SW & Services:
      - Virtual Machine (VM) containing common applications & pipelines
      - Tukey (OSDC portal & middleware v0.3)
      - Yates (infrastructure automation and management v0.1)
  • 29. Tukey
      - Tukey (based in part on Horizon).
      - We have factored out digital ID service, file sharing, and transport from Bionimbus and Matsu.
  • 30. Yates
      - Automated installation of the OSDC software stack on a rack of computers.
      - Based upon Chef
      - Version 0.1
  • 31. UDR
      - UDT is a high performance network transport protocol
      - UDR = rsync + UDT
      - It is easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
      - We are using it to distribute c. 1 PB from the OSDC
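To give a sense of what distributing roughly 1 PB involves, the sketch below estimates transfer time over the 10-100 Gbps links mentioned earlier in the deck. It assumes the link is fully used unless an `efficiency` factor is given. The `udr_command` helper is hypothetical: it assumes UDR is invoked as a wrapper in front of an ordinary rsync command line, which should be checked against the local UDR installation.

```python
from typing import List


def transfer_days(n_bytes: int, gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move n_bytes over a link of `gbps` gigabits per second."""
    bits = n_bytes * 8
    seconds = bits / (gbps * 1e9 * efficiency)
    return seconds / 86_400


def udr_command(src: str, dest: str) -> List[str]:
    """Hypothetical helper: build a 'udr rsync ...' command line (assumed wrapper form)."""
    return ["udr", "rsync", "-av", src, dest]


PB = 10**15
for gbps in (10, 100):
    print(f"1 PB at {gbps} Gbps: {transfer_days(PB, gbps):.1f} days")

# The host and paths below are placeholders, not real OSDC endpoints.
print(" ".join(udr_command("/data/public/", "remote.example.org:/data/public/")))
```

Even at full line rate, a petabyte takes on the order of days at 10 Gbps, which is why a transport that sustains high throughput over long-distance links matters for this kind of distribution.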
  • 32. Open  Science  Data  Cloud  Services   •  Digital  ID  services   •  Data  sharing  services   •  Data  transport  services  (UDR)   •  What  other  core  services  are  essen&al?   •  Of  course,  working  groups  and  applica=ons   always  add  their  own  services   •  These  core  services  will  hopefully  make  the   OSDC  ahrac=ve  as  a  plaxorm  (PaaS)  for   scien=fic  discovery.  
  • 33. 33   www.opencloudconsor=um.org   •  U.S  based  not-­‐for-­‐profit  corpora=on.   •  Manages  cloud  compu=ng  infrastructure  to   support  scien=fic  research:  Open  Science  Data   Cloud.   •  Manages  cloud  compu=ng  infrastructure  to   support  medical  and  health  care  research:   Biomedical  Commons  Cloud   •  Manages  cloud  compu=ng  testbeds:  Open  Cloud   Testbed.    
  • 34. OCC  Members  &  Partners   •  Companies:  Cisco,  Yahoo!,  Intel,  …   •  Universi=es:    University  of  Chicago,   Northwestern  Univ.,  Johns  Hopkins,  Calit2,   ORNL,  University  of  Illinois  at  Chicago,  …   •  Federal  agencies  and  labs:  NASA   •  Interna=onal  Partners:  Univ.  Edinburgh,  AIST   (Japan),  Univ.  Amsterdam,  …   •  Partners:  Na=onal  Lambda  Rail   34  
  • 35. Third  party  open   source  souware   +   Tukey   Yates   Open  source  souware   developed  by  the  OCC  and   open  standards   +   Data  center   +   Data  with  permissions   +   Authoriza=on  of  users   access  to  data   +   Policies,  procedures,   controls,  etc.   +   Governance,  legal  agreements   +   Sustainability  model   35  
  • 36. Part  5   OSDC  Data  
  • 37. Data: 1 PB of OSDC data across several disciplines   +   Instrument: 3000 cores / 5 PB OSDC science cloud   +   Team: you and your colleagues   +   Correlation algorithms   •  Discoveries
  • 39. OSDC Public Data Sets   •  Over 800 TB of open access data in the OSDC   •  Earth sciences data   •  Biological sciences data   •  Social sciences data   •  Digital humanities
  • 40. Part 6: OSDC Working Groups   •  Just look around you
  • 41. Matsu Working Group: Clouds to Support Earth Science   •  matsu.opensciencedatacloud.org
  • 42. Matsu Architecture   •  Hadoop HDFS   •  Matsu Web Map Tile Service (WMTS)   •  Matsu MR-based Tiling Service   •  NoSQL Database   •  Images at different zoom layers suitable for OGC Web Mapping Server   •  Level 0, Level 1 and Level 2 images   •  MapReduce used to process Level n to Level n+1 data and to partition images for different zoom levels   •  Analytic Services: NoSQL-based Analytic Services, Streaming Analytic Services, MR-based Analytic Services   •  Storage for WMS tiles and derived data products   •  Presentation Services   •  Web Coverage Processing Service (WCPS)   •  Workflow Services
  • 43. Hadoop-Based Re-Analysis   •  Zoom Level 1: 4 images   •  Zoom Level 2: 16 images   •  Zoom Level 3: 64 images   •  Zoom Level 4: 256 images
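The image counts on this slide follow a 4^n pattern, which is what you would get if each zoom level splits every image into a 2 x 2 grid. The sketch below works under that assumption; the function names and the (level, row, col) key layout are hypothetical illustrations, not the Matsu tiling service's actual code.

```python
def tiles_at_zoom(level: int) -> int:
    """Images at a zoom level, assuming each level quadruples the count."""
    if level < 0:
        raise ValueError("zoom level must be non-negative")
    return 4 ** level


def tile_keys(level: int) -> list[tuple[int, int, int]]:
    """Enumerate hypothetical (level, row, col) keys for one zoom level.

    A MapReduce tiling job could emit keys like these so that each
    reducer assembles a single tile.
    """
    side = 2 ** level  # tiles along each edge of the grid
    return [(level, row, col) for row in range(side) for col in range(side)]


if __name__ == "__main__":
    for level in range(1, 5):
        print(f"Zoom Level {level}: {tiles_at_zoom(level)} images")
```

Because the count grows geometrically, each extra zoom level costs four times the storage and compute of the previous one, which is one reason a partitioned, MapReduce-style job suits the re-analysis.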
  • 44. Bionimbus Working Group   •  bionimbus.opensciencedatacloud.org (biological data)
  • 45. Bionimbus Protected Data Cloud
  • 46. Analyzing Data From The Cancer Genome Atlas (TCGA)   •  Current Practice:   1. Apply to dbGaP for access to data.   2. Hire staff, then set up and operate a secure, compliant computing environment to manage 10 to 100+ TB of data.   3. Get the environment approved by your research center.   4. Set up analysis pipelines.   5. Download data from CGHub (takes days to weeks).   6. Begin analysis.   •  With Protected Data Cloud (PDC):   1. Apply to dbGaP for access to data.   2. Use your eRA Commons credentials to log in to the PDC, then select the data that you want to analyze and the pipelines that you want to use.   3. Begin analysis.
  • 47. One Million Genomes   •  Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.   •  The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).   •  One million genomes is about 1000 PB or 1 EB.   •  With compression, it may be about 100 PB.   •  At $1000/genome, the sequencing would cost about $1B.
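The back-of-the-envelope figures on this slide can be reproduced in a few lines. The sketch below uses decimal units (1 PB = 1000 TB, 1 EB = 1000 PB). The 10x compression ratio is an assumption inferred from the slide's own numbers (1000 PB down to about 100 PB), not a measured value.

```python
def million_genome_estimate(
    genomes: int = 1_000_000,
    tb_per_patient: float = 1.0,       # from the slide: about 1 TB per patient
    compression_ratio: float = 10.0,   # assumption: implied by 1000 PB -> 100 PB
    cost_per_genome_usd: float = 1000.0,
) -> dict[str, float]:
    """Return rough storage and sequencing-cost totals for a genome cohort."""
    raw_tb = genomes * tb_per_patient
    raw_pb = raw_tb / 1000             # decimal units: 1 PB = 1000 TB
    return {
        "raw_pb": raw_pb,
        "raw_eb": raw_pb / 1000,       # 1 EB = 1000 PB
        "compressed_pb": raw_pb / compression_ratio,
        "sequencing_cost_usd": genomes * cost_per_genome_usd,
    }


if __name__ == "__main__":
    for name, value in million_genome_estimate().items():
        print(f"{name}: {value:,.0f}")
```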
  • 48. Big data driven discovery on 1,000,000 genomes and 1 EB of data.   •  Genomic-driven diagnosis   •  Improved understanding of genomic science   •  Genomic-driven drug development   •  Precision diagnosis and treatment. Preventive health care.
  • 49. Biomedical Commons Cloud (BCC) Working Group   •  Cloud for Public Data   •  Cloud for Controlled Genomic Data   •  Cloud for EMR, PHI, data   •  Example: Open Cloud Consortium's Biomedical Commons Cloud (BCC)   •  Medical Research Center A   •  Medical Research Center B   •  Medical Research Center C   •  Hospital D
  • 50. Resource | Who uses | Who operates   •  Open Science Data Cloud (OSDC) | Pan-science data for researchers | Open Cloud Consortium (OCC), supported by university OCC members   •  Biomedical Commons Clouds (BCC) | (International) biomedical researchers | OCC Biomedical Commons Cloud Working Group, supported by OCC university members   •  Bionimbus Protected Data Cloud | Genomics researchers | University of Chicago, supported by the OCC
  • 51. OpenFlow-Enabled Hadoop WG   •  When running Hadoop, some map and reduce jobs take significantly longer than others.   •  These are stragglers, and they can significantly slow down a MapReduce computation.   •  Stragglers are common (a dirty secret about Hadoop).   •  Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.   •  We have a testbed for a wide-area version of this project.
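To give stragglers extra bandwidth, something first has to identify them. The sketch below is one simple way to do that, offered as an illustration only: it flags tasks whose running time exceeds a multiple of the median. The function name, the task IDs, and the 1.5x threshold are all hypothetical and are not taken from Hadoop or from the working group's implementation.

```python
from statistics import median


def find_stragglers(durations: dict[str, float], factor: float = 1.5) -> list[str]:
    """Return IDs of tasks running longer than factor x the median duration.

    durations maps a task ID to its elapsed time in seconds.
    The 1.5x default is an illustrative assumption, not a Hadoop setting.
    """
    if not durations:
        return []
    cutoff = factor * median(durations.values())
    return sorted(task for task, secs in durations.items() if secs > cutoff)


if __name__ == "__main__":
    elapsed = {"map-0": 40.0, "map-1": 42.0, "map-2": 41.0, "map-3": 95.0}
    print(find_stragglers(elapsed))
```

A controller with a view of the network could use a list like this to decide which flows deserve additional bandwidth.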
  • 52. OSDC PIRE Project   •  We select OSDC PIRE Fellows (US citizens or permanent residents):   •  We give them tutorials and training on big data science.   •  We provide them fellowships to work with OSDC international partners.   •  We give them preferred access to the OSDC.   •  Nominate your favorite scientist as an OSDC PIRE Fellow: www.opensciencedatacloud.org (look for PIRE)
  • 53. Part 7: Key Questions for This Workshop
  • 54. •  Question 1. How can we add partner sites at other locations that extend the OSDC? In particular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?   •  Question 2. What data can we add to the OSDC to facilitate data-intensive cross-disciplinary discoveries?   •  Question 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communities?   •  Question 4. What tools and applications can we add to the OSDC to facilitate data-intensive cross-disciplinary discoveries?   •  Question 5. How can we better integrate digital IDs and file sharing services into the OSDC?   •  Question 6. What are 3-5 grand challenge questions that leverage the OSDC?
  • 56. Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informatics Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computation Institute and the Institute for Genomics and Systems Biology, and a Professor of Medicine in the Section of Genetic Medicine. His research group focuses on big data, biomedical informatics, data science, cloud computing, and related areas. He is also the Founder and a Partner of Open Data Group, which has been building predictive models over big data for companies for over ten years. He recently wrote a book for the general reader that discusses big data (among other topics) called The Structure of Digital Computing: From Mainframes to Big Data, which can be purchased from Amazon. He blogs occasionally about big data at rgrossman.com.