Leveraging Tools and Components from OODT and Apache within Climate Science and the Earth System Grid Federation

Luca Cinquini, Dan Crichton, Chris Mattmann
NASA Jet Propulsion Laboratory,
California Institute of Technology

Copyright 2013 California Institute of Technology. Government Sponsorship Acknowledged.

Outline

This talk will present the Earth System Grid Federation (ESGF) as an example of a scientific project that is leveraging tools from the ASF to successfully manage and distribute large amounts of scientific data.

•The Earth System Grid Federation (ESGF)
 ‣Science motivations
 ‣High level architecture
 ‣Major software components: Node Manager, Search Service, Web Front End
•Apache Solr
 ‣General functionality
 ‣Scalability options
 ‣Usage within ESGF
•Apache Object Oriented Data Technology (OODT)
 ‣Description of OODT components
 ‣Current usage for ESGF publishing
 ‣Next generation ESGF data processing
  
Introduction

The Earth System Grid Federation (ESGF) is a multi-agency, international collaboration of people and institutions working together to build an open source software infrastructure for the management and analysis of Earth Science data on a global scale.

•Historically started with the DoE Earth System Grid project, expanded to include groups and institutions around the world that are involved in management of climate data
•Recently evolved into a next generation architecture: the Peer-To-Peer (P2P) system (higher performance, more reliable and scalable, focused on federation services)

Acknowledgments
•Software development and project management: ANL, ANU, BADC, CMCC, DKRZ, ESRL, GFDL, GSFC, JPL, IPSL, ORNL, LLNL (lead), PMEL, ...
•Operations: tens of data centers across Asia, Australia, Europe and North America
•Funding: DoE, NASA, NOAA, NSF, IS-ENES, ...

Support for Climate Change Research

•ESGF's primary goal is to support research on Climate Change - one of the most critical and complex scientific problems of our time
•IPCC (Intergovernmental Panel on Climate Change): international body of scientists responsible for compiling reports on Climate Change for the global community and political leaders
•Over the years, IPCC reports have increasingly supported evidence for climate change and identified human causes for this change
•The next IPCC-AR5 report, due in 2014, will be based on data produced by the CMIP5 modeling project (and delivered by ESGF)

CMIP5: a "Big Data" problem

CMIP5: Coupled Model Intercomparison Project - phase 5
•Arguably the largest coordinated modeling effort in human history
•Coordinated by WCRP (World Climate Research Program)
•Includes 40+ climate models in 20+ countries
  ‣Models are run to simulate the future state of the Earth under different emission scenarios
  ‣A common set of physical fields (temperature, humidity, precipitation, etc.) is generated
  ‣Predictions from different models are compared, averaged, scored

"Big Data" characteristics
•Volume: approx. 2.5 PB of data distributed around the world
•Variety: must combine model output with observations of different kinds (satellite, in-situ, radar, etc.) and reanalysis
•Veracity: data and algorithms must be tracked, validated
•Velocity: N/A, as immediate availability of model output is not important

System Architecture

ESGF is a system of distributed and federated Nodes that interact dynamically through a Peer-To-Peer (P2P) paradigm.

•Distributed: data and metadata are published, stored and served from multiple centers ("Nodes")
•Federated: Nodes interoperate because of the adoption of common services, protocols and APIs, and the establishment of mutual trust relationships
•Dynamic: Nodes can join/leave the federation dynamically - global data and services will change accordingly

A client (browser or program) can start from any Node in the federation and discover, download and analyze data from multiple locations as if they were stored in a single central archive.

Node Architecture

•Internally, each ESGF Node is composed of services and applications that collectively enable data and metadata access, and user management.
•Software components are grouped into 4 areas of functionality (aka "flavors"):
  •Data Node: secure data publication and access
  •Index Node:
     ‣metadata indexing and searching (w/ Solr)
     ‣web portal UI to drive human interaction
     ‣dashboard suite of admin applications
     ‣model metadata viewer plugin
  •Identity Provider: user registration, authentication and group membership
  •Compute Node: analysis and visualization

•Node flavors can be installed in various combinations depending on site needs, or to achieve higher performance and scalability
•The Node Manager is installed for all Node flavors to exchange service and state information

The Node Manager and P2P Protocol

•Enables continuous exchange of service and state information among Nodes
•Internally, it collects Node health information and metrics (cpu, disk usage, etc.)

Peer-To-Peer (P2P) protocol
•Gossip protocol: information is exchanged randomly among peers (see the sketch after this list)
 ‣Each Node receives information from one Node, merges it with its own information, and propagates it to two other Nodes at random
 ‣No central coordination, no single point of failure
•Nodes can join/leave the federation dynamically
•Each Node is bootstrapped with knowledge of one default peer
•Each Node can belong to one or more peer groups within which information is exchanged

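To make the gossip round concrete, here is a minimal Python sketch of the merge-and-forward step, assuming each registry entry carries a timestamp; this is an illustration only, not the actual Node Manager implementation:

import random

def merge(local, incoming):
    """Keep the newest entry per node; assumes each entry has a 'timestamp'."""
    merged = dict(local)
    for node_id, entry in incoming.items():
        if node_id not in merged or entry["timestamp"] > merged[node_id]["timestamp"]:
            merged[node_id] = entry
    return merged

def on_registry_received(my_state, incoming, peers, send):
    # Merge the received registry with local state...
    my_state = merge(my_state, incoming)
    # ...then propagate to two other peers chosen at random (fan-out = 2).
    for peer in random.sample(peers, min(2, len(peers))):
        send(peer, my_state)
    return my_state

Because each round multiplies the number of informed peers, fresh information spreads exponentially fast, while the disappearance of a dead Node is only noticed by timeout - which is exactly the challenge noted below.
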
XML Registry
•XML document that is the payload of the P2P protocol
•Contains service endpoints and SSL public keys for all Nodes in the federation
•Derived products (list of search shards, trusted IdPs, location of Attribute Services, ...) are used by federation-wide services

Challenge: good news travels fast, bad news travels slow...

Apache Solr

Solr is a high-performance web-enabled search engine built on top of Lucene:
•Used in many e-commerce web sites
•Free text searches (w/ stemming, stop words, ...)
•Faceted searches (w/ facet counts)
•Other features: scoring, highlighting, word completion, ...
•Generic flat metadata model: (key, value+) pairs

[Diagram: metadata (XML) is ingested into Apache Solr, running in Tomcat/Jetty on top of a Lucene index; clients query via HTTP GET and receive XML/JSON responses]

Because of its high performance and domain-agnostic metadata model, Solr is well suited to fulfill the indexing and searching needs of most data-intensive science applications.

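As an example of the HTTP GET interface shown in the diagram, the following Python snippet issues a free text query with facet counts; the host, port and "experiment" facet field are placeholders to adapt to an actual deployment:

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": "temperature",           # free text query (stemming, stop words handled by Solr)
    "facet": "true",
    "facet.field": "experiment",  # request facet counts for this field
    "wt": "json",                 # ask for JSON instead of the default XML
    "rows": "10",
})
url = "http://localhost:8983/solr/select?" + params
with urllib.request.urlopen(url) as response:
    result = json.load(response)

print(result["response"]["numFound"], "matching records")
print(result["facet_counts"]["facet_fields"]["experiment"])
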
Solr Scalability

Solr has advanced features that allow it to scale to tens of millions of records:
•Cores: separate indexes containing different record types that must be queried independently. Cores run within the same Solr instance.
•Shards: separate indexes containing records of the same type: a query is distributed across shards, and the results merged together. Shards run on separate Solr instances/servers (see the example below).
•Replicas: identical copies of the same index, for redundancy and load balancing. Replicas run on separate Solr instances/servers.

[Diagram: cores A/B/C within a single Solr instance; shards A/B/C and replicas A/B/C on separate instances]

ESGF utilizes all the above techniques to attain satisfactory performance over large, distributed data holdings.
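
A sketch of a sharded query using Solr's standard "shards" request parameter; the hostnames and the "datasets" core are hypothetical (in ESGF the shard list is derived from the federation's XML Registry):

import urllib.parse
import urllib.request

# Solr fans the query out to every listed shard and merges the results.
shards = ",".join([
    "node1.example.org:8983/solr/datasets",
    "node2.example.org:8983/solr/datasets",
])
params = urllib.parse.urlencode({"q": "*:*", "shards": shards, "wt": "json"})
url = "http://node1.example.org:8983/solr/datasets/select?" + params
print(urllib.request.urlopen(url).read()[:200])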
  
ESGF Search Architecture

•Each Index Node runs a Master Solr (:8984) and a Slave Solr (:8983)
  ‣Metadata is written to the Master Solr, replicated to the Slave Solr (see the example below)
  ‣Publishing operations ("write") do not affect user queries ("read")
  ‣The Master Solr is secured within the firewall, enabled for writing (and reading)
  ‣The Slave Solr is open to the world for reading but not enabled for writing
•A query initiated at any Slave Solr is distributed to all other Slave Solrs in the federation
•Internally, each Master/Slave Solr runs multiple cores:
  ‣the "datasets" core contains high level descriptive metadata for search and discovery
  ‣the "files" and "aggregations" cores contain inventory-level metadata for data retrieval

[Diagram: multiple Index Nodes, each running a Master Solr (write) replicated to a Slave Solr (read), with datasets/files/aggregations cores]

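A quick way to observe the master-to-slave replication described above is Solr's standard /replication handler; the host and the "datasets" core are placeholders (on a slave, the response reports the master URL being polled and the replicated index version):

import urllib.request

url = ("http://localhost:8983/solr/datasets/replication"
       "?command=details&wt=json")
print(urllib.request.urlopen(url).read()[:300])
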
ESGF Search Architecture

•Each Index Node may create local replicas of one or more remote Solr Slaves
 ‣The distributed query operates on local indexes only - no network latency!
 ‣A local query gives complete results even if the remote Index Node is unavailable
 ‣Additional resources are needed to run multiple Slave Solrs at one institution

[Diagram: an Index Node hosting read-only Solr Replicas of remote Index Nodes alongside its own Master/Slave pair]

ESGF Search Web Service

ESGF clients do not query Solr directly; rather, they query the ESGF Search Web Service that front-ends the Solr slave:
•Exposes a RESTful query API simpler than the Solr query syntax
  ‣Query by free text
  ‣Query by domain-specific facets: /search?model=...&experiment=...&time_frequency=... (see the example below)
  ‣Extremely popular with client developers
•Provides higher-level functionality:
  ‣automatic query distribution to all known shards (dynamically!)
  ‣pruning of unresponsive shards
  ‣generation of wget scripts with the same query syntax: /wget?model=...&experiment=...
  ‣generation of RSS feeds for the latest data across the federation
•Insulates clients against possible (unlikely) future changes of the back-end engine technology
•The ESGF Web Portal is a client to the ESGF Search Web Service

[Diagram: Index Node running the Solr Master/Slave pair and Replicas behind the ESGF Search Web Service, which in turn serves the ESGF Web Portal]

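A sketch of a faceted query against the RESTful API; the hostname, service path and facet values are hypothetical placeholders following the /search syntax above:

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "model": "SomeModel",          # hypothetical facet values, for illustration
    "experiment": "historical",
    "time_frequency": "mon",
})
url = "http://esgf-node.example.org/esg-search/search?" + params
print(urllib.request.urlopen(url).read()[:500])

# The same constraints become a download script via the /wget endpoint:
# http://esgf-node.example.org/esg-search/wget?model=SomeModel&experiment=historical
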
Web Portal Front End

ESGF Web Portal: web application to allow human users to interact with the system
•User registration, authentication and group membership
•Data search, access, analysis and visualization

The Web Portal Search UI allows users to formulate search requests and access data:
•Search by free text, facets, geo-spatial and temporal constraints
•Datasets added to a "data cart"
•Files downloaded via single HTTP links, wget scripts, Globus Online (experimental)
•Hyperlinks to higher services for data analysis, visualization and metadata inspection

ESGF Global Distributed Search

•A query from any ESGF web portal is distributed to all other web portals
•Users can query in real time a world-wide database containing tens of millions of records

Query Pattern
•A "Dataset"-level query is distributed to multiple Index Nodes
•A "File"-level query is issued against a single specific Index Node

NASA Obs4MIPs Datasets

•ESGF allows access not only to model output, but also to observational and reanalysis datasets
•NASA Obs4MIPs: selected satellite observations especially prepared for usage in IPCC-AR5, formatted as output from global climate models:
  ‣Formatted as NetCDF/CF
  ‣Level 3 products: global lat/lon grids (not satellite swaths)
  ‣CMIP5 metadata conventions for search and discovery
  ‣Include detailed Technical Notes to instruct on proper use of observations
•Goal: facilitate comparison of observations with models and therefore quantify the reliability of model predictions for climate change ("scoring")
•Preparation and publication of Obs4MIPs datasets into ESGF was accomplished by establishing a data processing pipeline that leveraged OODT components (CAS File Manager and CAS Curator)

Apache OODT

OODT (Object Oriented Data Technology) is a framework for management, discovery and access of distributed data resources. Main features:
•Modularity: a framework of standalone components that can be deployed in various configurations to fulfill a project's specific requirements
•Configurability: each component can be easily configured to invoke alternate out-of-the-box functionality or deployment options
•Extensibility: components can be extended by providing alternate implementations of their core APIs (expressed as Java interfaces) or by configuring custom plugins

Apache OODT home: http://oodt.apache.org/
Apache OODT wiki: https://cwiki.apache.org/confluence/display/OODT/Home
Apache OODT Jira: https://issues.apache.org/jira/browse/OODT

Apache OODT Components

OODT is an "eco-system" of "data/metadata-centric" software components that altogether provide extensive functionality for science data management needs:

•Data Movement: push or pull data from remote locations, crawling of staging areas
•Data Processing: job submission, execution, monitoring, coordination
•Data Archiving: storage of data resources and associated metadata
•Web Applications: web UI and RESTful endpoints for job submission, metadata management, product access
•Product Delivery: retrieval and transformation of data/metadata products

Apache OODT Adoption

OODT is used operationally to manage scientific data by several projects in disparate scientific domains:
•Earth Sciences:
  ‣NASA satellite missions (SMAP, ...) are using OODT components as the base for their data processing pipelines for generation and archiving of products from raw observations
  ‣ESGF (Earth System Grid Federation) used OODT to build and publish observational data products in support of climate change research
•Health Sciences: EDRN (Early Detection Research Network) uses OODT to collect, tag and distribute data products to support research in early cancer detection
•Planetary Science: PDS (Planetary Data System) is developing data transformation and delivery services based on OODT as part of its world-wide product access infrastructure
•Radio Astronomy: several projects (VLBA, ALMA, Haystack, ...) are adopting OODT based on the successful VASTR example

OODT CAS File Manager

•Purpose: service for cataloging, archiving and delivery of data products (files and directories) and associated metadata
•Example Use Case: a science mission or climate model generates files in a staging area. The files are submitted to the File Manager to:
  ‣Extract/generate project specific metadata
  ‣Move files to their final archive location
  ‣Store metadata and file location references into the metadata store
  ‣Later, clients may query product metadata and retrieve data products via XML-RPC (see the sketch below)
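
A minimal sketch of an XML-RPC client talking to a running File Manager; the port and the "filemgr.isAlive" method name reflect common OODT setups, but treat them as assumptions and consult the File Manager's XML-RPC interface for the exact methods and signatures:

import xmlrpc.client

# Connect to the File Manager's XML-RPC endpoint (port is deployment-specific).
server = xmlrpc.client.ServerProxy("http://localhost:9000")
try:
    print("File Manager alive:", server.filemgr.isAlive())
except xmlrpc.client.Fault as err:
    print("XML-RPC fault:", err)
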
OODT CAS File Manager

The CAS File Manager provides many configuration opportunities ("extension points") that allow customization to project specific needs:
•Policy Files: XML schema instances that specify the supported Product Types and their associated metadata elements
  ‣Can ingest files of any type: music, videos, images, binary files (e.g. NetCDF), etc.
  ‣Products can be Flat (all files in one dir) or Hierarchical (span multiple dirs)
•Metadata Extractors: each ProductType can be associated with a specific Metadata Extractor which is run when the product is ingested
  ‣Users can write their own metadata extractor in Java...
  ‣...or use the ExternMetExtractor to run a metadata extractor in any other language (a sketch follows below)

Note: check out Apache Tika for flexible metadata extraction
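
A sketch of the kind of external script the ExternMetExtractor can invoke in place of a Java extractor: it reads a product file path and writes a CAS-style key/value metadata XML file next to it. The command-line contract (arguments, output location) and the metadata keys shown are assumptions configured per deployment:

import sys
import xml.etree.ElementTree as ET

CAS_NS = "http://oodt.jpl.nasa.gov/1.0/cas"
ET.register_namespace("cas", CAS_NS)

def extract(product_path, output_path):
    # Build a <cas:metadata> document of <keyval><key>/<val> pairs.
    root = ET.Element("{%s}metadata" % CAS_NS)
    for key, val in [("ProductType", "Obs4MIPsFile"),   # hypothetical values
                     ("SourceFile", product_path)]:
        keyval = ET.SubElement(root, "keyval")
        ET.SubElement(keyval, "key").text = key
        ET.SubElement(keyval, "val").text = val
    ET.ElementTree(root).write(output_path, xml_declaration=True, encoding="UTF-8")

if __name__ == "__main__":
    extract(sys.argv[1], sys.argv[1] + ".met")
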
OODT CAS File Manager

•Metadata Catalog: extracted metadata can be stored in various back-end stores
 ‣Out-of-the-box implementations for Lucene, MySQL and Oracle
 ‣A Solr back-end implementation is forthcoming
•Data Transfer Protocol: files may be retrieved from the local file system or over HTTP, stored in-place or moved to a different target destination for archiving
•Validation Layer: inspects products and metadata before ingestion for required metadata elements
•Versioner: specifies how data products are organized within the archive

OODT CAS Curator

•Purpose: web application for interacting with the File Manager (i.e. a web-based client to the File Manager service):
  ‣Submit data ingestion jobs
  ‣Generate, inspect, update and add metadata ("curation")
•Features:
  ‣Web user interface (human-driven interaction) + REST API (machine-machine interaction)
  ‣Runs within a generic servlet container (Tomcat, Jetty, etc.)
  ‣Pluggable security layer for authentication and authorization

OODT CAS Curator

•Web User Interface: used by humans to manually interact with the system
 ‣Drag-and-drop selection of files from the staging area
 ‣Selection of a metadata extractor from the available pool
 ‣Submission of jobs for bulk ingestion to the File Manager

OODT CAS Curator

•Web REST API: used by programs and scripts for automatic interaction
 ‣Based on the Apache JAX-RS project (a project for RESTful web services)
 ‣Retrieve, list and start ingestion tasks
 ‣Annotate existing products with enhanced metadata
•Example HTTP/POST invocation:
 ‣curl --data "id=<product_id>&metadata.<name>=<value>" http://<hostname>/curator/services/metadata/update
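
The same invocation expressed in Python, for scripted use; the hostname, product id and metadata field are placeholders exactly as in the curl example:

import urllib.parse
import urllib.request

data = urllib.parse.urlencode({
    "id": "some_product_id",               # placeholder for <product_id>
    "metadata.institution": "NASA-JPL",    # placeholder <name>=<value> pair
}).encode("ascii")
url = "http://localhost:8080/curator/services/metadata/update"
# urlopen() with a data payload issues an HTTP POST, like curl --data.
print(urllib.request.urlopen(url, data).read())
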
Obs4MIPs Publishing Pipeline for IPCC-AR5

OODT File Manager and Curator components were used to set up a data processing pipeline to publish Obs4MIPs datasets into ESGF for use by IPCC-AR5 climate studies:
•Obs4MIPs datasets were prepared by NASA mission+data teams and transferred to a staging directory at JPL
•Files were post-processed by CMOR
  ‣CMOR: climate package to add required metadata such as NetCDF global attributes
•The Curator web interface was used to submit ingestion jobs to the File Manager
  ‣A custom Versioner was used to re-organize files according to the CMIP5 directory structure: /<project>/<institution>/<time_frequency>/<variable>/..../<filename> (see the sketch below)
  ‣Custom Obs4MIPs policies defined the required metadata elements
  ‣A custom metadata extractor was used to parse file content and directory path to generate metadata content
  ‣The Validation Layer enforced the existence of required metadata elements
•ESGF publishing software was used to publish files and metadata from the archive
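
An illustration of the custom Versioner's job: mapping extracted metadata onto the CMIP5 directory template above. The metadata keys and values are hypothetical; the real Versioner is a Java class plugged into the File Manager:

import os

def cmip5_archive_path(archive_root, met, filename):
    # Follow /<project>/<institution>/<time_frequency>/<variable>/.../<filename>
    return os.path.join(archive_root, met["project"], met["institution"],
                        met["time_frequency"], met["variable"], filename)

met = {"project": "obs4MIPs", "institution": "NASA-JPL",
       "time_frequency": "mon", "variable": "ta"}
print(cmip5_archive_path("/archive", met, "ta_example.nc"))
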
CMAC "next generation" Obs4MIPs

•The current Obs4MIPs publishing infrastructure has several limitations:
 ‣Manual: creation of Obs4MIPs datasets was a very labor intensive process executed by DAAC teams
 ‣Not generic: each DAAC developed a one-off software solution not conforming to any specification or API
 ‣Not scalable: publication of Obs4MIPs datasets was accomplished manually via the Curator web UI

•NASA centers (JPL, GSFC, LaRC) are working on a next generation Obs4MIPs infrastructure that will facilitate preparation and publication of Obs4MIPs datasets and comparison with model output. Goals:
 ‣Generalize the preparation of Obs4MIPs datasets by defining an API that can be implemented by each DAAC for each specific observing mission
 ‣Provide tools for validation of generated products
 ‣Automatically publish products to ESGF
•Funding: NASA CMAC (Computational Modeling Algorithms and Cyber-infrastructure)
•Looking at building an automatic Obs4MIPs data processing and publication pipeline, to be installed at each DAAC, using OODT components: Crawler and Workflow Manager

CAS Crawler

The CAS Crawler is an OODT component that can be used to list the contents of a staging area and submit products for ingestion to the CAS File Manager. It is typically used for automatic detection of new products transferred from a remote source (a toy sketch of this loop follows the feature list).

Features:
•Runs as a single scan or as a time-interval daemon
•Wide range of configuration options:
 ‣Ingestion pre-conditions
 ‣Client-side data transfer
 ‣Client-side metadata extractors
 ‣Post-ingest actions (on success or failure)
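
A toy sketch of what the Crawler daemon automates: periodically scan a staging area, apply a pre-condition, and hand new products to an ingest callable (in OODT, the submission to the File Manager). The ".nc" pre-condition and the interval are hypothetical configuration choices:

import time
from pathlib import Path

def crawl_forever(staging_dir, ingest, interval_secs=60):
    seen = set()
    while True:                               # time-interval daemon behavior
        for path in Path(staging_dir).iterdir():
            # Pre-condition: only ingest NetCDF files not seen before.
            if path.suffix == ".nc" and path not in seen:
                try:
                    ingest(path)              # submit product for ingestion
                    seen.add(path)            # post-ingest action on success
                except Exception as err:
                    print("ingest failed:", path, err)  # action on failure
        time.sleep(interval_secs)
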
CAS Workflow Manager

The CAS Workflow Manager (WM) is an OODT component for the definition, execution and monitoring of workflows. A workflow is an ordered sequence of processing tasks, where each task reads/processes/writes data. It is typically used to establish processing pipelines for scientific data, for example data streaming from NASA observing systems.

Client-Server Architecture
•The WM client submits an "event" to the server via the XML-RPC protocol (see the sketch below)
•The WM server matches the event to one or more workflow definitions
•The WM submits a workflow request to the Workflow Engine
•The Workflow Engine starts a workflow instance on one of many Processing Nodes
•Workflow instance execution metadata is saved to a repository
•The WM client monitors execution, retrieves final metadata
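
A sketch of the client side of this flow: submit a named event plus metadata to the Workflow Manager over XML-RPC. The port and the "workflowmgr.handleEvent" method name are assumptions based on common OODT setups; check your WM's XML-RPC interface for the exact method:

import xmlrpc.client

wm = xmlrpc.client.ServerProxy("http://localhost:9001")
metadata = {"ProductPath": "/staging/ta_example.nc"}   # hypothetical payload
ok = wm.workflowmgr.handleEvent("publish", metadata)   # triggers matching workflows
print("event accepted:", ok)
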
CAS Workflow Manager

Features
•Separate standard interfaces for the workflow engine and workflow repository
•Out-of-the-box implementations, which may be overridden by custom components
•XML-based workflow definition language
•Optional definition of pre/post conditions for workflow execution
•Shared metadata context for all tasks within a workflow instance
•Scalability:
  ‣A client can submit workflow requests to any number of workflow managers
  ‣A workflow manager can instantiate workflow instances on any number of processing nodes, on local or remote servers

NASA lingo:
•"Process Control System" or PCS: a data processing pipeline expressed as a workflow
•"Product Generation Executable" or PGE: a single task within a PCS

CMAC draft Obs4MIPs pipeline

CMAC is working on an automatic, deployable data processing pipeline for Obs4MIPs:
•OODT Crawler set up to automatically detect new data products:
 ‣Level-2 HDF files, images, or NcML descriptors of data stored on OpenDAP servers
•OODT Crawler sends a "publish" event to the Workflow Manager
•OODT Workflow Manager executes a workflow instance composed of the following tasks:
 ‣Data extraction through a mission specific algorithm that conforms to the general API
 ‣Conversion of data to CMIP format (NetCDF/CF)
 ‣Validation via the CMOR-light checker
 ‣Metadata harvesting
 ‣Publication to ESGF
•OODT File Manager archives products into the final CMIP standard directory structure

Summary & Conclusions

•Science projects are increasingly facing "Big Data" challenges similar to e-commerce:
 ‣Climate science: exabytes (XB) expected from next generation sensors and models
 ‣Radio astronomy: next generation telescopes will collect hundreds of TB per sec
•These challenges cannot be met by building expensive in-house solutions in a timely fashion; rather, science projects should leverage proven, open source frameworks that can be customized and extended to a project's specific needs

•The Apache SF includes many freely available technologies that can be used in eScience:
 ‣Solr provides general, powerful, scalable search capabilities
 ‣OODT is a framework of software components that can be combined to build automatic, flexible data processing pipelines including data movement, metadata extraction, data archiving & product delivery

•The Earth System Grid Federation is currently using Solr and OODT to serve the climate change community world-wide, and is working on even more extensive usage in the future to automate the distribution of large amounts of NASA observations.

•In the future, ESGF is looking at leveraging Solr and OODT to expand to other science domains (biology, radio-astronomy, health sciences, ...)

Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Lambda Data Grid: An Agile Optical Platform for Grid Computing and Data-inten...
Lambda Data Grid: An Agile Optical Platform for Grid Computing and Data-inten...Lambda Data Grid: An Agile Optical Platform for Grid Computing and Data-inten...
Lambda Data Grid: An Agile Optical Platform for Grid Computing and Data-inten...Tal Lavian Ph.D.
 
Scientific
Scientific Scientific
Scientific marpierc
 
Utilising Cloud Computing for Research through Infrastructure, Software and D...
Utilising Cloud Computing for Research through Infrastructure, Software and D...Utilising Cloud Computing for Research through Infrastructure, Software and D...
Utilising Cloud Computing for Research through Infrastructure, Software and D...David Wallom
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Geoffrey Fox
 
Scaling People, Not Just Systems, to Take On Big Data Challenges
Scaling People, Not Just Systems, to Take On Big Data ChallengesScaling People, Not Just Systems, to Take On Big Data Challenges
Scaling People, Not Just Systems, to Take On Big Data ChallengesMatthew Vaughn
 
Challenges and Issues of Next Cloud Computing Platforms
Challenges and Issues of Next Cloud Computing PlatformsChallenges and Issues of Next Cloud Computing Platforms
Challenges and Issues of Next Cloud Computing PlatformsFrederic Desprez
 
Advanced Research Computing at York
Advanced Research Computing at YorkAdvanced Research Computing at York
Advanced Research Computing at YorkMing Li
 
Update on the Exascale Computing Project (ECP)
Update on the Exascale Computing Project (ECP)Update on the Exascale Computing Project (ECP)
Update on the Exascale Computing Project (ECP)inside-BigData.com
 
g-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionalityg-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking FunctionalityNicholas Loulloudes
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3Hortonworks
 
grid mining
grid mininggrid mining
grid miningARNOLD
 
OpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspectiveOpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspectiveTim Bell
 
Federation and Interoperability in the Nectar Research Cloud
Federation and Interoperability in the Nectar Research CloudFederation and Interoperability in the Nectar Research Cloud
Federation and Interoperability in the Nectar Research CloudOpenStack
 

Was ist angesagt? (14)

Lambda Data Grid: An Agile Optical Platform for Grid Computing and Data-inten...
Lambda Data Grid: An Agile Optical Platform for Grid Computing and Data-inten...Lambda Data Grid: An Agile Optical Platform for Grid Computing and Data-inten...
Lambda Data Grid: An Agile Optical Platform for Grid Computing and Data-inten...
 
Scientific
Scientific Scientific
Scientific
 
Utilising Cloud Computing for Research through Infrastructure, Software and D...
Utilising Cloud Computing for Research through Infrastructure, Software and D...Utilising Cloud Computing for Research through Infrastructure, Software and D...
Utilising Cloud Computing for Research through Infrastructure, Software and D...
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...
 
Scaling People, Not Just Systems, to Take On Big Data Challenges
Scaling People, Not Just Systems, to Take On Big Data ChallengesScaling People, Not Just Systems, to Take On Big Data Challenges
Scaling People, Not Just Systems, to Take On Big Data Challenges
 
Challenges and Issues of Next Cloud Computing Platforms
Challenges and Issues of Next Cloud Computing PlatformsChallenges and Issues of Next Cloud Computing Platforms
Challenges and Issues of Next Cloud Computing Platforms
 
Advanced Research Computing at York
Advanced Research Computing at YorkAdvanced Research Computing at York
Advanced Research Computing at York
 
Update on the Exascale Computing Project (ECP)
Update on the Exascale Computing Project (ECP)Update on the Exascale Computing Project (ECP)
Update on the Exascale Computing Project (ECP)
 
g-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionalityg-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionality
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3
 
grid mining
grid mininggrid mining
grid mining
 
OpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspectiveOpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspective
 
Federation and Interoperability in the Nectar Research Cloud
Federation and Interoperability in the Nectar Research CloudFederation and Interoperability in the Nectar Research Cloud
Federation and Interoperability in the Nectar Research Cloud
 

Andere mochten auch

A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemChris Mattmann
 
ApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTRApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTRLucaCinquini
 
A look at Apache OODT Balance framework
A look at Apache OODT Balance frameworkA look at Apache OODT Balance framework
A look at Apache OODT Balance frameworkskhudiky
 
NSF Software @ ApacheConNA
NSF Software @ ApacheConNANSF Software @ ApacheConNA
NSF Software @ ApacheConNADaniel S. Katz
 
Apache Airavata ApacheCon2013
Apache Airavata ApacheCon2013Apache Airavata ApacheCon2013
Apache Airavata ApacheCon2013smarru
 
Wengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODTWengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODTChris Mattmann
 

Andere mochten auch (6)

A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT Ecosystem
 
ApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTRApacheCon NA 2013 VFASTR
ApacheCon NA 2013 VFASTR
 
A look at Apache OODT Balance framework
A look at Apache OODT Balance frameworkA look at Apache OODT Balance framework
A look at Apache OODT Balance framework
 
NSF Software @ ApacheConNA
NSF Software @ ApacheConNANSF Software @ ApacheConNA
NSF Software @ ApacheConNA
 
Apache Airavata ApacheCon2013
Apache Airavata ApacheCon2013Apache Airavata ApacheCon2013
Apache Airavata ApacheCon2013
 
Wengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODTWengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODT
 

Ähnlich wie ApacheCon NA 2013

ACES QuakeSim 2011
ACES QuakeSim 2011ACES QuakeSim 2011
ACES QuakeSim 2011marpierc
 
Research in the Cloud
Research in the CloudResearch in the Cloud
Research in the CloudDavid Wallom
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionIan Foster
 
Exxon - SplunkLive! São Paulo 2015
Exxon - SplunkLive! São Paulo 2015Exxon - SplunkLive! São Paulo 2015
Exxon - SplunkLive! São Paulo 2015Splunk
 
Grid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudGrid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudAdianto Wibisono
 
Federated Cloud Computing
Federated Cloud ComputingFederated Cloud Computing
Federated Cloud ComputingDavid Wallom
 
Adoption of Cloud Computing in Scientific Research
Adoption of Cloud Computing in Scientific ResearchAdoption of Cloud Computing in Scientific Research
Adoption of Cloud Computing in Scientific ResearchYehia El-khatib
 
MapR 5.2: Getting More Value from the MapR Converged Data Platform
MapR 5.2: Getting More Value from the MapR Converged Data PlatformMapR 5.2: Getting More Value from the MapR Converged Data Platform
MapR 5.2: Getting More Value from the MapR Converged Data PlatformMapR Technologies
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGGeoffrey Fox
 
big_data_casestudies_2.ppt
big_data_casestudies_2.pptbig_data_casestudies_2.ppt
big_data_casestudies_2.pptvishal choudhary
 
Frank Würthwein - NRP and the Path forward
Frank Würthwein - NRP and the Path forwardFrank Würthwein - NRP and the Path forward
Frank Würthwein - NRP and the Path forwardLarry Smarr
 
OGC Interfaces in Thematic Exploitation Platforms
OGC Interfaces in Thematic Exploitation PlatformsOGC Interfaces in Thematic Exploitation Platforms
OGC Interfaces in Thematic Exploitation Platformsterradue
 
DI4R 2018 - Ellip: a collaborative workplace for EO Open Science
DI4R 2018 - Ellip: a collaborative workplace for EO Open ScienceDI4R 2018 - Ellip: a collaborative workplace for EO Open Science
DI4R 2018 - Ellip: a collaborative workplace for EO Open Scienceterradue
 
Cloud Standards in the Real World: Cloud Standards Testing for Developers
Cloud Standards in the Real World: Cloud Standards Testing for DevelopersCloud Standards in the Real World: Cloud Standards Testing for Developers
Cloud Standards in the Real World: Cloud Standards Testing for DevelopersAlan Sill
 
Kerry Taylor - Semantics & sensors
Kerry Taylor - Semantics & sensorsKerry Taylor - Semantics & sensors
Kerry Taylor - Semantics & sensorsWeb Directions
 
Federating Infrastructure as a Service cloud computing systems to create a un...
Federating Infrastructure as a Service cloud computing systems to create a un...Federating Infrastructure as a Service cloud computing systems to create a un...
Federating Infrastructure as a Service cloud computing systems to create a un...David Wallom
 

Ähnlich wie ApacheCon NA 2013 (20)

ACES QuakeSim 2011
ACES QuakeSim 2011ACES QuakeSim 2011
ACES QuakeSim 2011
 
Research in the Cloud
Research in the CloudResearch in the Cloud
Research in the Cloud
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, Evolution
 
Exxon - SplunkLive! São Paulo 2015
Exxon - SplunkLive! São Paulo 2015Exxon - SplunkLive! São Paulo 2015
Exxon - SplunkLive! São Paulo 2015
 
Grid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudGrid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the Cloud
 
RFCs for HDF5 and HDF-EOS5 Status Update
RFCs for HDF5 and HDF-EOS5 Status UpdateRFCs for HDF5 and HDF-EOS5 Status Update
RFCs for HDF5 and HDF-EOS5 Status Update
 
Federated Cloud Computing
Federated Cloud ComputingFederated Cloud Computing
Federated Cloud Computing
 
Adoption of Cloud Computing in Scientific Research
Adoption of Cloud Computing in Scientific ResearchAdoption of Cloud Computing in Scientific Research
Adoption of Cloud Computing in Scientific Research
 
MapR 5.2: Getting More Value from the MapR Converged Data Platform
MapR 5.2: Getting More Value from the MapR Converged Data PlatformMapR 5.2: Getting More Value from the MapR Converged Data Platform
MapR 5.2: Getting More Value from the MapR Converged Data Platform
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWG
 
big_data_casestudies_2.ppt
big_data_casestudies_2.pptbig_data_casestudies_2.ppt
big_data_casestudies_2.ppt
 
Frank Würthwein - NRP and the Path forward
Frank Würthwein - NRP and the Path forwardFrank Würthwein - NRP and the Path forward
Frank Würthwein - NRP and the Path forward
 
2016 open-source-network-softwarization
2016 open-source-network-softwarization2016 open-source-network-softwarization
2016 open-source-network-softwarization
 
2016 open-source-network-softwarization
2016 open-source-network-softwarization2016 open-source-network-softwarization
2016 open-source-network-softwarization
 
OGC Interfaces in Thematic Exploitation Platforms
OGC Interfaces in Thematic Exploitation PlatformsOGC Interfaces in Thematic Exploitation Platforms
OGC Interfaces in Thematic Exploitation Platforms
 
DI4R 2018 - Ellip: a collaborative workplace for EO Open Science
DI4R 2018 - Ellip: a collaborative workplace for EO Open ScienceDI4R 2018 - Ellip: a collaborative workplace for EO Open Science
DI4R 2018 - Ellip: a collaborative workplace for EO Open Science
 
Cloud Standards in the Real World: Cloud Standards Testing for Developers
Cloud Standards in the Real World: Cloud Standards Testing for DevelopersCloud Standards in the Real World: Cloud Standards Testing for Developers
Cloud Standards in the Real World: Cloud Standards Testing for Developers
 
Kerry Taylor - Semantics & sensors
Kerry Taylor - Semantics & sensorsKerry Taylor - Semantics & sensors
Kerry Taylor - Semantics & sensors
 
Federating Infrastructure as a Service cloud computing systems to create a un...
Federating Infrastructure as a Service cloud computing systems to create a un...Federating Infrastructure as a Service cloud computing systems to create a un...
Federating Infrastructure as a Service cloud computing systems to create a un...
 
DGterzo
DGterzoDGterzo
DGterzo
 

ApacheCon NA 2013

  • 1. Leveraging  Tools  and  Components from  OODT  and  Apache within  Climate  Science and  the  Earth  System  Grid  Federa9on Luca  Cinquini,  Dan  Crichton,  Chris  Ma2mann NASA  Jet  Propulsion  Laboratory,   California  Ins9tute  of  Technology Copyright  2013  California  Ins>tute  of  Technology.  Government  Sponsorship  Acknowledged.
  • 2. Outline This  talk  will  present  the  Earth  System  Grid  Federa9on  (ESGF)  as  an  example  of  a   scien9fic  project  that  is  leveraging  tools  from  the  ASF  to  successfully  manage  and   distributed  large  amounts  of  scien9fic  data. •The  Earth  System  Grid  Federa9on  (ESGF) ‣Science  mo9va9ons ‣High  level  architecture ‣Major  soSware  components:  Node  Manager,  Search  Service,  Web  Front  End •Apache  Solr ‣General  func9onality ‣Scalability  op9ons ‣Usage  within  ESGF •Apache  Object  Oriented  Data  Technology  (OODT) ‣Descrip9on  of  OODT  components ‣Current  usage  for  ESGF  publishing ‣Next  genera9on  ESGF  data  processing October  2012 ExArch  Mee9ng,  
  • 3. Introduc.on The  Earth  System  Grid  Federa9on  (ESGF)  is  a  mul9-­‐agency,  interna9onal  collabora9on   of  people  and  ins9tu9ons  working  together  to  build  an  open  source  soSware   infrastructure  for  the  management  and  analysis  of  Earth  Science  data  on  a  global  scale •Historically  started  with  DoE  Earth  System  Grid  project,  expanded  to  include  groups   and  ins9tu9ons  around  the  world  that  are  involved  in  management  of  climate  data   •Recently  evolved  into  a  next  genera9on  architecture:  the  Peer-­‐To-­‐Peer  (P2P)  system   (higher  performance,  more  reliable  and  scalable,  focused  on  federa9on  services) Acknowledgments •SoSware  development  and  project  management:   ANL,  ANU,  BADC,  CMCC,  DKRZ,  ESRL,  GFDL,  GSFC,   JPL,  IPSL,  ORNL,  LLNL  (lead),  PMEL,  ... •Opera9ons:  tens  of  data  centers  across  Asia,   Australia,  Europe  and  North  America •Funding:  DoE,  NASA,  NOAA,  NSF,  IS-­‐ENES,... ExArch  Mee9ng,  October  2012
  • 4. Support  for  Climate  Change  Research •ESGF  primary  goal  is  to  support  research  on  Climate  Change  -­‐  one  of  the  most  cri9cal  and   complex  scien9fic  problems  of  our  9me •IPCC  (Interna9onal  Panel  on  Climate  Change):  interna9onal  body  of  scien9sts  responsible  for   compiling  reports  on  Climate  Change  to  the  global  community  and  poli9cal  leaders •Over  the  years,  IPCC  reports  have  increasingly  supported  evidence  for  climate  change  and   iden9fied  human  causes  for  this  change •  Next  IPCC-­‐AR5  report  due  in  2014,  will  be  based  on  data  produced  by  CMIP5  modeling  project   (and  delivered  by  ESGF) ExArch  Mee9ng,  October  2012
  • 5. CMIP5:  a  “Big  Data”  problem CMIP5:  Coupled  Model  Intercomparison  Project  -­‐  phase  5 •Arguably  the  largest  coordinated  modeling  effort  in  human  history •Coordinated  by  WCRP  (World  Climate  Research  Program) •Includes  40+  climate  models  in  20+  countries ‣Models  are  run  to  simulate  future  state  of  the  Earth  under  different  emission  scenarios ‣Common  set  of  physical  fields  (temperature,  humidity,  precipita9on  etc.)  are  generated ‣Predic9ons  from  different  models  are  compared,  averaged,  scored “Big  Data”  characteris9cs •Volume:  approx.  2.5  PB  of  data  distributed  around  the  world •Variety:  must  combine  model  output  with  observa9ons  of  different  kind  (satellite,  in-­‐situ,   radar,  etc.)  and  reanalysis •Veracity:  data  and  algorithms  must  be  tracked,  validated •Velocity:  N/A  as  immediate  aExArch  Mee9ng,  October  2012ot  important vailability  of  model  output  is  n
• 6. System Architecture
ESGF is a system of distributed and federated Nodes that interact dynamically through a Peer-To-Peer (P2P) paradigm
•Distributed: data and metadata are published, stored and served from multiple centers ("Nodes")
•Federated: Nodes interoperate through the adoption of common services, protocols and APIs, and the establishment of mutual trust relationships
•Dynamic: Nodes can join/leave the federation dynamically - global data and services change accordingly
A client (browser or program) can start from any Node in the federation and discover, download and analyze data from multiple locations as if they were stored in a single central archive.
• 7. Node Architecture
•Internally, each ESGF Node is composed of services and applications that collectively enable data and metadata access, and user management.
•Software components are grouped into 4 areas of functionality (aka "flavors"):
•Data Node: secure data publication and access
•Index Node:
‣metadata indexing and searching (w/ Solr)
‣web portal UI to drive human interaction
‣dashboard suite of admin applications
‣model metadata viewer plugin
•Identity Provider: user registration, authentication and group membership
•Compute Node: analysis and visualization
•Node flavors can be installed in various combinations depending on site needs, or to achieve higher performance and scalability
•The Node Manager is installed for all Node flavors to exchange service and state information
• 8. The Node Manager and P2P Protocol
•Enables continuous exchange of service and state information among Nodes
•Internally, it collects Node health information and metrics (CPU, disk usage, etc.)
Peer-To-Peer (P2P) protocol
•Gossip protocol: information is exchanged randomly among peers (see the sketch below)
‣Each Node receives information from one Node, merges it with its own information, and propagates it to two other Nodes at random
‣No central coordination, no single point of failure
•Nodes can join/leave the federation dynamically
•Each Node is bootstrapped with knowledge of one default peer
•Each Node can belong to one or more peer groups within which information is exchanged
XML Registry
•XML document that is the payload of the P2P protocol
•Contains service endpoints and SSL public keys for all Nodes in the federation
•Derived products (list of search shards, trusted IdPs, location of Attribute Services, ...) are used by federation-wide services
Challenge: good news travels fast, bad news travels slow...
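A minimal sketch of the gossip exchange described above - merge incoming registry state, then propagate to two peers chosen at random. All class, field and method names here are invented for illustration; the actual Node Manager implementation differs:

```python
import random

class NodeRegistry:
    """Toy model of the registry gossip on this slide; not ESGF code."""

    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers        # other NodeRegistry instances we know about
        self.registry = {}        # node_id -> (version, service_info)

    def merge(self, remote_registry):
        # Keep whichever entry is newer for each node.
        for node, (version, info) in remote_registry.items():
            if node not in self.registry or version > self.registry[node][0]:
                self.registry[node] = (version, info)

    def receive(self, remote_registry):
        self.merge(remote_registry)

    def gossip_round(self):
        # Propagate the merged state to two random peers, as the slide says.
        for peer in random.sample(self.peers, min(2, len(self.peers))):
            peer.receive(self.registry)
```

With versioned entries, stale state is eventually overwritten everywhere, which is also why "bad news travels slow": a departed Node stops refreshing its entry, but the old entry keeps circulating for a while.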
• 9. Apache Solr
Solr is a high-performance, web-enabled search engine built on top of Lucene
•Used in many e-commerce web sites
•Free-text searches (w/ stemming, stop words, ...)
•Faceted searches (w/ facet counts)
•Other features: scoring, highlighting, word completion, ...
•Generic flat metadata model: (key, value+) pairs
[Diagram: a client sends HTTP GET requests to Solr (running in Tomcat/Jetty), which answers with XML/JSON results backed by a Lucene index]
Because of its high performance and domain-agnostic metadata model, Solr is well suited to fulfill the indexing and searching needs of most data-intensive science applications.
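As an illustration of the flat metadata model and HTTP interface, a hedged sketch of a faceted Solr query using only the Python standard library; the host, core path and the "model" field are placeholders:

```python
import json
import urllib.parse
import urllib.request

# Free-text query plus facet counts; field name "model" is illustrative.
params = urllib.parse.urlencode({
    "q": "temperature",       # free-text search term
    "facet": "true",
    "facet.field": "model",   # ask Solr to count records per model
    "wt": "json",             # response format (XML is the default)
})
url = "http://localhost:8983/solr/select?" + params

with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

print(result["response"]["numFound"])                       # total hits
print(result["facet_counts"]["facet_fields"]["model"])      # facet counts
```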
• 10. Solr Scalability
Solr has advanced features that allow it to scale to tens of millions of records:
•Cores: separate indexes containing different record types that must be queried independently. Cores run within the same Solr instance.
•Shards: separate indexes containing records of the same type: a query is distributed across the shards and the results merged together. Shards run on separate Solr instances/servers.
•Replicas: identical copies of the same index, for redundancy and load balancing. Replicas run on separate Solr instances/servers.
[Diagram: three shards, each with a replica, each hosting cores A, B, C]
ESGF utilizes all of the above techniques to attain satisfactory performance over large, distributed data holdings (see the distributed-query sketch below)
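A sketch of what a distributed query looks like at the HTTP level, using Solr's standard `shards` request parameter (hostnames are placeholders); the instance that receives the request fans the query out to the listed shards and merges the partial results:

```python
import urllib.parse

# Placeholder shard hosts; each entry is a host:port/path to a Solr index.
shards = ",".join([
    "node1.example.org:8983/solr",
    "node2.example.org:8983/solr",
    "node3.example.org:8983/solr",
])

params = urllib.parse.urlencode({
    "q": "*:*",          # match everything
    "shards": shards,    # distribute the query across these indexes
    "wt": "json",
})
url = "http://node1.example.org:8983/solr/select?" + params
print(url)  # fetch with urllib.request.urlopen(url) as in the previous sketch
```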
• 11. ESGF Search Architecture
•Each Index Node runs a Master Solr (:8984) and a Slave Solr (:8983)
‣Metadata is written to the Master Solr, replicated to the Slave Solr
‣Publishing operations ("write") do not affect user queries ("read")
‣Master Solr is secured within the firewall, enabled for writing (and reading)
‣Slave Solr is open to the world for reading but not enabled for writing
•A query initiated at any Slave Solr is distributed to all other Slave Solrs in the federation
•Internally, each Master/Slave Solr runs multiple cores:
‣the "datasets" core contains high-level descriptive metadata for search and discovery
‣the "files" and "aggregations" cores contain inventory-level metadata for data retrieval
[Diagram: three Index Nodes, each with a Master Solr (write) replicating to a Slave Solr (read); each Solr runs the "datasets", "files" and "aggregations" cores]
• 12. ESGF Search Architecture
•Each Index Node may create local replicas of one or more remote Solr Slaves
‣The distributed query operates on local indexes only - no network latency!
‣A local query gives complete results even if a remote Index Node is unavailable
‣Additional resources are needed to run multiple Slave Solrs at one institution
[Diagram: an Index Node running its own Master/Slave Solr plus local read-only Replicas of the Slave Solrs of two remote Index Nodes]
• 13. ESGF Search Web Service
ESGF clients do not query Solr directly; rather, they query the ESGF Search Web Service that front-ends the Solr slave:
•Exposes a RESTful query API simpler than the Solr query syntax (see the example below)
‣Query by free text
‣Query by domain-specific facets: /search?model=...&experiment=...&time_frequency=...
‣Extremely popular with client developers
•Provides higher-level functionality:
‣automatic query distribution to all known shards (dynamically!)
‣pruning of unresponsive shards
‣generation of wget scripts with the same query syntax: /wget?model=...&experiment=...
‣generation of RSS feeds for the latest data across the federation
•Insulates clients from possible (unlikely) future changes of the back-end engine technology
•The ESGF Web Portal is a client of the ESGF Search Web Service
[Diagram: the ESGF Web Portal queries the ESGF Search Web Service, which fronts the local Slave Solr and Replicas within the Index Node]
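A hedged sketch of a client call against the RESTful query API; the index node hostname is a placeholder and the constraint values are illustrative:

```python
import json
import urllib.parse
import urllib.request

# Facet constraints map directly onto query parameters, as on the slide.
params = urllib.parse.urlencode({
    "project": "CMIP5",
    "experiment": "historical",
    "time_frequency": "mon",
    "facets": "model",                    # return model facet counts too
    "limit": "10",
    "format": "application/solr+json",    # Solr-style JSON response
})
url = "https://esgf-node.example.org/esg-search/search?" + params

with urllib.request.urlopen(url) as resp:
    results = json.load(resp)
print(results["response"]["numFound"])

# The same constraints against the /wget endpoint return a download script:
#   https://esgf-node.example.org/esg-search/wget?project=CMIP5&experiment=historical
```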
• 14. Web Portal Front End
ESGF Web Portal: web application that allows human users to interact with the system
•User registration, authentication and group membership
•Data search, access, analysis and visualization
The Web Portal Search UI allows users to formulate search requests and access data:
•Search by free text, facets, geo-spatial and temporal constraints
•Datasets added to a "data cart"
•Files downloaded via single HTTP links, wget scripts, Globus Online (experimental)
•Hyperlinks to higher-level services for data analysis, visualization and metadata inspection
• 15. ESGF Global Distributed Search
•A query from any ESGF web portal is distributed to all other web portals
•Users can query in real time a world-wide database containing tens of millions of records
Query Pattern (sketched below)
•A "Dataset"-level query is distributed to multiple Index Nodes
•A "File"-level query is issued against a single, specific Index Node
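The two-level query pattern, sketched as two REST calls; the hostname and dataset identifier are illustrative, and the `distrib` parameter controls whether the query is fanned out across the federation:

```python
import urllib.parse

BASE = "https://esgf-node.example.org/esg-search/search?"  # placeholder host

# Step 1: dataset-level discovery, distributed across all Index Nodes.
dataset_query = BASE + urllib.parse.urlencode({
    "type": "Dataset",
    "project": "CMIP5",
    "variable": "tas",
    "distrib": "true",                   # fan out to the whole federation
    "format": "application/solr+json",
})

# Step 2: file-level listing for one chosen dataset, sent only to the
# Index Node that hosts it (the dataset id below is made up).
file_query = BASE + urllib.parse.urlencode({
    "type": "File",
    "dataset_id": "cmip5.output1.EXAMPLE.dataset.id|esgf-node.example.org",
    "distrib": "false",                  # single-node query
    "format": "application/solr+json",
})
print(dataset_query)
print(file_query)
```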
• 16. NASA Obs4MIPs Datasets
•ESGF provides access not only to model output, but also to observational and reanalysis datasets
•NASA Obs4MIPs: selected satellite observations especially prepared for use in IPCC-AR5, formatted like output from global climate models:
‣Formatted as NetCDF/CF
‣Level 3 products: global lat/lon grids (not satellite swaths)
‣CMIP5 metadata conventions for search and discovery
‣Include detailed Technical Notes to instruct on proper use of the observations
•Goal: facilitate comparison of observations with models and thereby quantify the reliability of model predictions for climate change ("scoring")
•Preparation and publication of Obs4MIPs datasets into ESGF was accomplished by establishing a data processing pipeline that leveraged OODT components (CAS File Manager and CAS Curator)
• 17. Apache OODT
OODT (Object Oriented Data Technology): a framework for the management, discovery and access of distributed data resources. Main features:
•Modularity: a framework of standalone components that can be deployed in various configurations to fulfill a project's specific requirements
•Configurability: each component can be easily configured to invoke alternate out-of-the-box functionality or deployment options
•Extensibility: components can be extended by providing alternate implementations of their core APIs (expressed as Java interfaces) or by configuring custom plugins
Apache OODT home: http://oodt.apache.org/
Apache OODT wiki: https://cwiki.apache.org/confluence/display/OODT/Home
Apache OODT Jira: https://issues.apache.org/jira/browse/OODT
• 18. Apache OODT Components
OODT is an "eco-system" of data/metadata-centric software components that together provide extensive functionality for science data management needs:
•Data Movement: push or pull data from remote locations, crawling of staging areas
•Data Processing: job submission, execution, monitoring, coordination
•Data Archiving: storage of data resources and associated metadata
•Product Delivery: retrieval and transformation of data/metadata products
•Web Applications: web UI and RESTful endpoints for job submission, metadata management, product access
• 19. Apache OODT Adoption
OODT is used operationally to manage scientific data by several projects in disparate scientific domains:
•Earth Sciences:
‣NASA satellite missions (SMAP, ...) are using OODT components as the base of their data processing pipelines for the generation and archiving of products from raw observations
‣ESGF (Earth System Grid Federation) used OODT to build and publish observational data products in support of climate change research
•Health Sciences: EDRN (Early Detection Research Network) uses OODT to collect, tag and distribute data products to support research in early cancer detection
•Planetary Science: PDS (Planetary Data System) is developing data transformation and delivery services based on OODT as part of its world-wide product access infrastructure
•Radio Astronomy: several projects (VLBA, ALMA, Haystack, ...) are adopting OODT based on the successful VASTR example
• 20. OODT CAS File Manager
•Purpose: service for the cataloging, archiving and delivery of data products (files and directories) and associated metadata
•Example Use Case: a science mission or climate model generates files in a staging area. Files are submitted to the File Manager to:
‣Extract/generate project-specific metadata
‣Move files to their final archive location
‣Store metadata and file location references in the metadata store
‣Later, clients may query product metadata and retrieve data products via XML-RPC
A toy sketch of this flow follows.
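A self-contained toy outline of that ingestion flow; none of these functions belong to the OODT API - they simply mirror the steps listed above:

```python
import shutil
from pathlib import Path

CATALOG = {}                    # toy stand-in for the metadata catalog
ARCHIVE = Path("/tmp/archive")  # toy archive root

def extract_metadata(path: Path) -> dict:
    # Stand-in for a project-specific metadata extractor.
    return {"filename": path.name, "size_bytes": path.stat().st_size}

def ingest(staging_file: Path) -> None:
    """Mirror the use case: extract metadata, move the file into the
    archive, record metadata plus the file location reference."""
    metadata = extract_metadata(staging_file)
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    target = ARCHIVE / staging_file.name  # a Versioner decides this layout
    shutil.move(str(staging_file), str(target))
    CATALOG[str(target)] = metadata       # catalog keeps the reference
```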
• 21. OODT CAS File Manager
The CAS File Manager provides many configuration opportunities ("extension points") that allow customization to project-specific needs:
•Policy Files: XML schema instances that specify the supported Product Types and their associated metadata elements
‣Can ingest files of any type: music, videos, images, binary files (e.g. NetCDF), etc.
‣Products can be Flat (all files in one dir) or Hierarchical (spanning multiple dirs)
•Metadata Extractors: each ProductType can be associated with a specific Metadata Extractor, which is run when the product is ingested
‣Users can write their own metadata extractor in Java...
‣...or use the ExternMetExtractor to run a metadata extractor written in any other language (see the sketch below)
Note: check out Apache Tika for flexible metadata extraction
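For instance, an external metadata extractor can be a small script in any language. The sketch below assumes the CAS metadata XML serialization uses <keyval>/<key>/<val> elements; verify the exact schema against the OODT docs before relying on it:

```python
import sys
from pathlib import Path

def main(product_path: str) -> None:
    """Read a product file, write a CAS-style metadata XML file next to
    it - the shape of script that ExternMetExtractor could invoke."""
    path = Path(product_path)
    pairs = {"Filename": path.name, "FileSize": str(path.stat().st_size)}
    keyvals = "".join(
        f"<keyval><key>{k}</key><val>{v}</val></keyval>"
        for k, v in pairs.items()
    )
    # Namespace and element layout are our assumption of the CAS format.
    met = (f'<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">'
           f"{keyvals}</cas:metadata>")
    Path(str(path) + ".met").write_text(met)

if __name__ == "__main__":
    main(sys.argv[1])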
• 22. OODT CAS File Manager
•Metadata Catalog: extracted metadata can be stored in various back-end stores
‣Out-of-the-box implementations for Lucene, MySQL and Oracle
‣A Solr back-end implementation is forthcoming
•Data Transfer Protocol: files may be retrieved from the local file system or over HTTP, stored in place or moved to a different target destination for archiving
•Validation Layer: inspects products and metadata before ingestion for required metadata elements
•Versioner: specifies how data products are organized within the archive
• 23. OODT CAS Curator
•Purpose: web application for interacting with the File Manager (i.e. a web-based client to the File Manager service):
‣Submit data ingestion jobs
‣Generate, inspect, update and add metadata ("curation")
•Features:
‣Web user interface (human-driven interaction) + REST API (machine-machine interaction)
‣Runs within a generic servlet container (Tomcat, Jetty, etc.)
‣Pluggable security layer for authentication and authorization
• 24. OODT CAS Curator
•Web User Interface: used by humans to manually interact with the system
‣Drag-and-drop selection of files from the staging area
‣Selection of a metadata extractor from the available pool
‣Submission of jobs for bulk ingestion to the File Manager
• 25. OODT CAS Curator
•Web REST API: used by programs and scripts for automatic interaction
‣Based on an Apache JAX-RS implementation (JAX-RS: the Java API for RESTful web services)
‣Retrieve, list and start ingestion tasks
‣Annotate existing products with enhanced metadata
•Example HTTP POST invocation:
‣curl --data "id=<product_id>&metadata.<name>=<value>" http://<hostname>/curator/services/metadata/update
• 26. Obs4MIPs Publishing Pipeline for IPCC-AR5
The OODT File Manager and Curator components were used to set up a data processing pipeline to publish Obs4MIPs datasets into ESGF for use in IPCC-AR5 climate studies
•Obs4MIPs datasets were prepared by NASA mission and data teams and transferred to a staging directory at JPL
•Files were post-processed by CMOR
‣CMOR: climate package to add required metadata such as NetCDF global attributes
•The Curator web interface was used to submit ingestion jobs to the File Manager
‣A custom Versioner re-organized files according to the CMIP5 directory structure: /<project>/<institution>/<time_frequency>/<variable>/..../<filename> (see the sketch below)
‣Custom Obs4MIPs policies defined the required metadata elements
‣A custom metadata extractor parsed file content and directory paths to generate metadata content
‣The Validation Layer enforced the existence of the required metadata elements
•ESGF publishing software was used to publish files and metadata from the archive
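A sketch of the role a custom Versioner plays here, building the archive path from product metadata; the metadata keys and example values are illustrative, not the actual Obs4MIPs policy:

```python
from pathlib import Path

def cmip5_archive_path(root: str, md: dict) -> Path:
    """Build an archive path following the CMIP5-style layout quoted above:
    /<project>/<institution>/<time_frequency>/<variable>/<filename>."""
    return Path(root, md["project"], md["institution"],
                md["time_frequency"], md["variable"], md["filename"])

# Example metadata (values made up for illustration):
md = {"project": "obs4MIPs", "institution": "NASA-JPL",
      "time_frequency": "mon", "variable": "ta",
      "filename": "ta_EXAMPLE_200001-200912.nc"}
print(cmip5_archive_path("/archive", md))
# -> /archive/obs4MIPs/NASA-JPL/mon/ta/ta_EXAMPLE_200001-200912.nc
```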
• 27. CMAC "next generation" Obs4MIPs
•The current Obs4MIPs publishing infrastructure has several limitations:
‣Manual: creation of Obs4MIPs datasets was a very labor-intensive process executed by DAAC teams
‣Not generic: each DAAC developed a one-off software solution not conforming to any specification or API
‣Not scalable: publication of Obs4MIPs datasets was accomplished manually via the Curator web UI
•NASA centers (JPL, GSFC, LaRC) are working on a next generation Obs4MIPs infrastructure that will facilitate the preparation and publication of Obs4MIPs datasets and their comparison with model output. Goals:
‣Generalize the preparation of Obs4MIPs datasets by defining an API that can be implemented by each DAAC for each specific observing mission
‣Provide tools for validation of the generated products
‣Automatically publish products to ESGF
•Funding: NASA CMAC (Computational Modeling Algorithms and Cyber-infrastructure)
•Looking at building an automatic Obs4MIPs data processing and publication pipeline, to be installed at each DAAC, using OODT components: Crawler and Workflow Manager
• 28. CAS Crawler
The CAS Crawler is an OODT component that can be used to list the contents of a staging area and submit products for ingestion to the CAS File Manager. Typically used for automatic detection of new products transferred from a remote source.
Features:
•Runs as a single scan or as a time-interval daemon (see the toy loop below)
•Wide range of configuration options:
‣Ingestion pre-conditions
‣Client-side data transfer
‣Client-side metadata extractors
‣Post-ingest actions (on success or failure)
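A toy loop illustrating the time-interval daemon behavior; the real Crawler is configured rather than coded this way, and the staging path, file pattern and precondition below are examples:

```python
import time
from pathlib import Path

def crawl(staging: Path, seen: set) -> None:
    """One scan of the staging area: apply a precondition, hand new
    products to ingestion, remember what was already processed."""
    for f in sorted(staging.glob("*.nc")):    # example file pattern
        if f in seen:
            continue
        if f.stat().st_size == 0:             # example ingestion precondition
            continue
        print("would submit to File Manager:", f)  # stand-in for ingestion
        seen.add(f)                           # stand-in post-ingest action

# Time-interval daemon mode, as on the slide:
seen: set = set()
while True:
    crawl(Path("/data/staging"), seen)
    time.sleep(60)
```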
• 29. CAS Workflow Manager
The CAS Workflow Manager (WM) is an OODT component for the definition, execution and monitoring of workflows. A workflow is an ordered sequence of processing tasks, where each task reads/processes/writes data. Typically used to establish processing pipelines for scientific data, for example data streaming from NASA observing systems.
Client-Server Architecture (a client sketch follows)
•The WM client submits an "event" to the server via the XML-RPC protocol
•The WM server matches the event to one or more workflow definitions
•The WM submits a workflow request to the Workflow Engine
•The Workflow Engine starts a workflow instance on one of many Processing Nodes
•Workflow instance execution metadata is saved to a repository
•The WM client monitors execution and retrieves the final metadata
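A hedged sketch of the client side of this exchange. The port, the XML-RPC method name and its argument layout are assumptions for illustration; real clients normally go through the Java XmlRpcWorkflowManagerClient rather than raw RPC calls:

```python
import xmlrpc.client

# Endpoint and method name are assumed, not taken from the OODT docs.
wm = xmlrpc.client.ServerProxy("http://localhost:9001")

# The event name selects the workflow definition(s); the metadata dict
# becomes the shared context available to every task in the instance.
metadata = {"ProductId": "example-product-123"}   # illustrative key/value
wm.workflowmgr.handleEvent("publish", metadata)
```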
• 30. CAS Workflow Manager
Features
•Separate standard interfaces for the workflow engine and the workflow repository
•Out-of-the-box implementations, which may be overridden by custom components
•XML-based workflow definition language
•Optional definition of pre/post conditions for workflow execution
•Shared metadata context for all tasks within a workflow instance
•Scalability:
‣Clients can submit workflow requests to any number of workflow managers
‣A workflow manager can instantiate workflow instances on any number of processing nodes, on local or remote servers
NASA lingo:
•"Process Control System" or PCS: a data processing pipeline expressed as a workflow
•"Product Generation Executable" or PGE: a single task within a PCS
• 31. CMAC draft Obs4MIPs pipeline
CMAC is working on an automatic, deployable data processing pipeline for Obs4MIPs:
•The OODT Crawler is set up to automatically detect new data products:
‣Level-2 HDF files, images, or NcML descriptors of data stored on OpenDAP servers
•The OODT Crawler sends a "publish" event to the Workflow Manager
•The OODT Workflow Manager executes a workflow instance composed of the following tasks (sketched below):
‣Data extraction through a mission-specific algorithm that conforms to a general API
‣Conversion of data to CMIP format (NetCDF/CF)
‣Validation via the CMOR-light checker
‣Metadata harvesting
‣Publication to ESGF
‣The OODT File Manager archives products into the final CMIP standard directory structure
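The draft workflow, sketched as a chain of tasks sharing one metadata context - mirroring the Workflow Manager's shared-context model from slide 30. Every task body is a placeholder:

```python
def extract(ctx):
    ctx["fields"] = ["ta"]                 # mission-specific data extraction

def to_cf(ctx):
    ctx["format"] = "NetCDF/CF"            # conversion to CMIP format

def validate(ctx):
    ctx["valid"] = True                    # CMOR-light style validation

def harvest(ctx):
    ctx["metadata"] = {"variable": "ta"}   # metadata harvesting

def publish(ctx):
    ctx["published"] = True                # publication to ESGF

# One shared metadata context flows through the whole pipeline.
context = {"input": "/data/staging/example.hdf"}  # path is illustrative
for task in (extract, to_cf, validate, harvest, publish):
    task(context)
print(context)
```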
• 32. Summary & Conclusions
•Science projects are increasingly facing "Big Data" challenges similar to e-commerce:
‣Climate science: exabytes expected from next generation sensors and models
‣Radio astronomy: next generation telescopes will collect hundreds of TB per second
•These challenges cannot be met by building time-consuming, expensive in-house solutions; rather, science projects should leverage proven, open source frameworks that can be customized and extended to each project's specific needs
•The Apache SF includes many freely available technologies that can be used in eScience:
‣Solr provides general, powerful, scalable search capabilities
‣OODT is a framework of software components that can be combined to build automatic, flexible data processing pipelines including data movement, metadata extraction, data archiving & product delivery
•The Earth System Grid Federation is currently using Solr and OODT to serve the climate change community world-wide, and is working on even more extensive usage in the future to automate the distribution of large amounts of NASA observations.
•In the future, ESGF is looking at leveraging Solr and OODT to expand to other science domains (biology, radio-astronomy, health sciences, ...)
• 33. Questions?