Big Data: tools and techniques for working with large data sets

Ian Stokes-Rees, PhD
Harvard Medical School, Boston, USA

Workshop on Tools, Technologies and Collaborative Opportunities for HPC in Life Sciences and Healthcare

http://portal.sbgrid.org
ijstokes@hkl.hms.harvard.edu
Big Data - Ian Stokes-Rees        ijstokes@hkl.hms.harvard.edu
Slides and Contact
   ijstokes@hkl.hms.harvard.edu

   http://linkedin.com/in/ijstokes
   http://slidesha.re/ijstokes-thailand2011

   http://www.sbgrid.org
   http://portal.sbgrid.org
   http://www.opensciencegrid.org

About Me

[Figure labels: 2D simple crystal; Patterson map; rotational search; translation search; score model: best peak, R factor, electron density; aggregate and cluster; alternatives; composites]

Protein Structure Studies




Data, Data Everywhere ...
   • We are being overwhelmed with data
      • high temporal resolution due to fast electronics
      • high spatial resolution due to advanced imaging techniques
      • high dimensional data
      • large data sets
      • simulation
      • modeling
   • It is easy to drown in the flood of data
      • storage issues - capacity
      • ownership issues - security and collaboration
      • provenance - origin, access, changes

   Today, we'll think about software, hardware, and models for coping with large quantities of data

Next Generation Sequencing

High Energy Physics
     40 MHz bunch crossing rate
     10 million data channels
     1 kHz level 1 event recording rate
     1-10 MB per event
     14 hours per day, 7+ months / year
     4 detectors
     6 PB of data / year
     globally distribute data for analysis (x2)

Molecular Dynamics Simulations
     1 fs time step
     1 ns snapshot
     1 us simulation
     1e6 steps
     1000 frames
     10 MB / frame
     10 GB / sim
     20 CPU-years
     3 months (wall-clock)

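The molecular dynamics figures fit together if "1e6 steps" is read as the number of time steps between stored snapshots. That reading is an assumption, as are the variable names below; the sketch only reproduces the arithmetic.

```python
# Back-of-the-envelope check of the molecular dynamics numbers on the slide.
# Assumption: "1e6 steps" means time steps per stored snapshot (1 ns / 1 fs).

FS_PER_NS = 1_000_000
FS_PER_US = 1_000_000_000

time_step_fs = 1                  # 1 fs time step
snapshot_interval_fs = FS_PER_NS  # one frame stored every 1 ns
simulation_length_fs = FS_PER_US  # 1 us of simulated time
frame_size_mb = 10                # 10 MB per stored frame

steps_per_snapshot = snapshot_interval_fs // time_step_fs
frames = simulation_length_fs // snapshot_interval_fs
total_steps = simulation_length_fs // time_step_fs
output_gb = frames * frame_size_mb / 1000  # decimal units: 1 GB = 1000 MB

print(f"steps per snapshot: {steps_per_snapshot:,}")
print(f"frames stored:      {frames:,}")
print(f"total time steps:   {total_steps:,}")
print(f"output size:        {output_gb:g} GB")
```

Under this reading the full simulation needs 1e9 time steps, while only 1000 frames (10 GB) are written to disk.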
Electronic Patient Records
     77 page PDF (bespoke report)
     Clinical Document Architecture XML representation
     HTML rendering of XML via XSLT transform

Clinical Imaging Data
     DICOM - Digital Imaging and Communications in Medicine
     2D, 3D, 4D

It is clear there is no shortage of data.

     Potential for great new insights ...

     ... if we can organize, access, share, and analyze this data efficiently

Jumping to the end ...
   • Data can empower rather than overwhelm you
      • but this requires thought and planning
   • Understand your data sources
   • Understand your data consumers
   • Educate yourself on available tools and technology
   • Design your data management system suitably

Problems arising from “Big Data”
   • Where to store
   • How to store
   • How to process
   • Organization, searching, and meta-data
   • How to manage access
   • How to copy, move, and backup
   • Provenance
   • Lifecycle

Where to store (I)
   • RAM
      • fast
      • expensive
      • volatile
   • local disk
      • get a good controller (SATA/SAS2)
      • lots of fast spinning disk (7200+ rpm)
      • high bandwidth possible
      • good first stop for data
      • hard to share, persist, backup
      • SSD good for random reads: lots of small files, unpredictable I/O patterns
      • large files, sequential I/O, spinning disk comparable to SSDs
   • Parallel Filesystem
      • gluster, lustre, gpfs
      • HDFS (Hadoop)
      • auto-replication for parallel decentralized I/O

Where to store (II)
   • SAN with high performance interconnect
      • Storage Area Network
      • fully managed data storage
      • fiber channel (2 Gb/s) or Infiniband (10, 20, 40 Gb/s) interconnect
      • parallel, non-blocking, dedicated routes
   • NAS over ethernet
      • Network Attached Storage
      • Think NFS, CIFS, Samba network interface to storage
      • ethernet 1 Gb/s with contention (effective limit of ~500 Mb/s)
      • SATA (10k rpm, 2 TB, 3 Gb/s)
      • SAS2 (15k rpm, 750 GB, 6 Gb/s)
   • Cloud storage
      • Amazon S3
      • Box.net, Dropbox
      • BackBlaze: bit.ly/backblaze-20
   • Hybrid
      • Create in-house tiered storage

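To get a feel for what the interconnect rates mean in practice, the sketch below estimates ideal transfer times from the link rates quoted on the slide. The 1 TB data set size and the function name are assumptions for illustration, and real throughput will be lower than these ideal figures.

```python
# Time to move a data set over the interconnects named on the slide.
# Rates are nominal link rates in bits per second; real throughput is lower.

LINK_RATES_BPS = {
    "ethernet 1 Gb/s, effective ~500 Mb/s": 500 * 10**6,
    "fiber channel 2 Gb/s": 2 * 10**9,
    "Infiniband 10 Gb/s": 10 * 10**9,
    "Infiniband 40 Gb/s": 40 * 10**9,
}


def transfer_seconds(size_bytes: int, rate_bps: int) -> float:
    """Ideal transfer time: size in bits divided by link rate."""
    return size_bytes * 8 / rate_bps


if __name__ == "__main__":
    one_tb = 10**12  # hypothetical 1 TB data set (decimal terabyte)
    for name, rate in LINK_RATES_BPS.items():
        seconds = transfer_seconds(one_tb, rate)
        print(f"{name:40s} {seconds:8.0f} s  ({seconds / 3600:.2f} h)")
```

Moving the same terabyte takes hours over contended ethernet and minutes over a fast SAN interconnect, which is one reason to plan where data lands first.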
How to store (data formats)
   • ASCII
      • tab delimited
      • comma separated
   • XML
      • DTD definition?
      • Schema definition?
      • Namespaces?
   • JSON
   • NetCDF
   • HDF5
   • DICOM
   • Matlab .MAT format
   • NumPy .NPZ format
   • Bespoke binary
   • SQL DB
      • MySQL
      • sqlite
      • Oracle
      • Access
      • SQL Server
   • Hierarchical DB
      • Berkeley XML DB
      • LDAP
   • Object-Relational Mapper
      • SQL Alchemy (Python)
      • Hibernate (Java, .NET)
      • Django ORM (Python)
   • No-SQL DB
      • MongoDB
      • CouchDB

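As a small illustration of how the choice of format changes what you can do with the data, the sketch below stores the same records as comma-separated ASCII, JSON, and a sqlite table, using only the Python standard library. The record fields and function names are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical measurement records used only for illustration.
records = [
    {"sample": "a1", "resolution": 1.8},
    {"sample": "b2", "resolution": 2.4},
]


def to_csv(rows):
    """Comma-separated ASCII: compact and readable, but every value is text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["sample", "resolution"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """JSON: self-describing and keeps numbers as numbers."""
    return json.dumps(rows)


def to_sqlite(rows):
    """SQL DB: typed columns and queries; in-memory database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (sample TEXT, resolution REAL)")
    conn.executemany("INSERT INTO samples VALUES (:sample, :resolution)", rows)
    return conn


csv_text = to_csv(records)
round_trip = json.loads(to_json(records))
conn = to_sqlite(records)
best = conn.execute(
    "SELECT sample FROM samples ORDER BY resolution LIMIT 1"
).fetchone()[0]

print(csv_text)
print(round_trip)
print(best)
```

The text formats are easy to exchange, while the database form lets the consumer ask questions (here, the sample with the smallest resolution value) without loading everything.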
How to process
   • Analytical software
      • custom programs
      • Matlab
      • Perl
      • R
      • Python
      • SAS, SPSS
      • Tableau
   • Analytical environments
      • multi-core machine - 48+ core systems for under $5000 (USD)
      • GPU
      • compute cluster
      • supercomputers
      • grid computing
      • cloud computing
      • web-based services
      • network of workstations (NOW)
      • Map/Reduce models
      • “screen-saver” computing (BOINC)

48 cores, single system image

GPU Computing
     200-800 stream processing cores per card
     For $500 to $2000 (USD), up to an order of magnitude processing speedups may be possible

Open Science Grid
     www.opensciencegrid.org

Map/Reduce
   • Unix users:
      • cat | grep | sort | uniq > file
   • Map/Reduce equivalent:
      • input | map | shuffle | reduce > output
   • HadoopFS (HDFS)
      • large data set is automatically spread and replicated across local storage resources (disks) of each node in a cluster
   • Map
      • creates a job for each data block in the input
      • maps the computational kernel to each job
      • schedules jobs to nodes with required data block
      • each job produces a set of key/value pairs as its result
   • Reduce
      • collects results from Map stage based on keys (Combine)
      • aggregates values to produce task (final) result

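The input | map | shuffle | reduce pipeline can be sketched in a few lines of plain Python. This is a single-process word count meant only to show the data flow; it does not use Hadoop or its API, and the function names and input blocks are assumptions for illustration.

```python
from collections import defaultdict


def map_block(block):
    """Map: the computational kernel emits (key, value) pairs for one block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group values from all map jobs by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_values(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def map_reduce(blocks):
    """Run map over every block, shuffle by key, then reduce each key."""
    mapped = [pair for block in blocks for pair in map_block(block)]
    grouped = shuffle(mapped)
    return dict(reduce_values(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input, already split into blocks as a distributed
    # filesystem would do.
    blocks = ["big data big insight", "data everywhere"]
    print(map_reduce(blocks))
```

In a real cluster each call to map_block would run as a separate job on the node that already holds that data block, and the shuffle would move pairs across the network.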
Extensions
   • Pig and Hive
      • pig.apache.org, hive.apache.org
      • simplify writing Map/Reduce programs for Hadoop
      • SQL-like query language for datasets available on HDFS
   • Cloudera
      • www.cloudera.com
      • packaged distribution of Hadoop + extensions
      • education + training material
   • Amazon Elastic Map Reduce
      • aws.amazon.com/elasticmapreduce
      • Amazon “cloud-based” hosting of Hadoop for Map/Reduce using EC2 for compute and S3 for storage

Organization, Searching, and Meta-Data
   • Few “software” solutions for this problem
      • iRODS provides some of this
      • Unix “locate” database
      • SAN solutions may index software and provide tools for searching
   • Establish protocols, document, communicate
      • directory hierarchy
      • file naming
      • persisted working space
      • scratch/temporary space
   • Filesystem functionality
      • many file systems have per-file meta-data controls to add arbitrary key/value pairs
   • Augmented web-based view
      • cern_meta Apache module provides key/value pairs in HTTP HEAD
      • ability to assert arbitrary web organization on top of filesystem organization, with searching and graphical views

iRODS
   • www.irods.org
   • File-like paradigm for data-management
   • addition of meta-data
   • can integrate database resources
   • provides rich access policy management
   • automated workflows based on data actions
      • add, remove, modify
   • automated replication
   • built-in provenance
      • information life-cycle management

Search: Apache Lucene
   • lucene.apache.org
   • Java-based
   • full text querying and searching
   • indexing
   • Solr provides web interface

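The idea behind indexing and full text querying can be shown with a toy inverted index. This is not Lucene or its API, only a minimal Python sketch of the underlying technique; the document names and function names are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Indexing: map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Querying: return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Hypothetical document collection.
    docs = {
        "run1.txt": "lysozyme diffraction images",
        "run2.txt": "lysozyme molecular dynamics trajectory",
    }
    index = build_index(docs)
    print(sorted(search(index, "lysozyme dynamics")))
```

The index is built once and queries only touch the terms asked for, which is why searching stays fast as the collection grows.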
Meta-Data: Semantic Media Wiki
   • You know Wikipedia
   • It is built using Mediawiki
   • Semantic Media Wiki adds Semantic Web features
      • Flexible key/value schemas
      • User defined and changeable object classes
      • Built-in knowledge of dates → timelines
      • Built-in knowledge of locations → maps
      • Built-in handling of images → picture galleries

Access Control
   • Need a strong Identity Management environment
      • individuals: identity tokens and identifiers
      • groups: membership lists
      • Active Directory/CIFS (Windows), Open Directory (Apple), FreeIPA (Unix) all LDAP-based
   • Need to manage and communicate Access Control policies
      • institutionally driven
      • user driven
   • Need Authorization System
      • Policy Enforcement Point (shell login, data access, web access, start application)
      • Policy Decision Point (store policies and understand relationship of identity token and policy)

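The split between a Policy Decision Point and a Policy Enforcement Point can be sketched as two small classes. This is an illustrative model under simple assumptions (group-based policies, default deny); the class names, users, groups, and paths are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyDecisionPoint:
    """Stores policies and relates identities (via groups) to permitted actions."""

    group_members: dict = field(default_factory=dict)  # group -> set of user ids
    policies: dict = field(default_factory=dict)  # (group, resource) -> set of actions

    def is_permitted(self, user, action, resource):
        for (group, policy_resource), actions in self.policies.items():
            members = self.group_members.get(group, set())
            if policy_resource == resource and action in actions and user in members:
                return True
        return False  # default deny


class PolicyEnforcementPoint:
    """Sits in front of a resource and asks the PDP before acting."""

    def __init__(self, pdp):
        self.pdp = pdp

    def read(self, user, resource):
        if not self.pdp.is_permitted(user, "read", resource):
            raise PermissionError(f"{user} may not read {resource}")
        return f"contents of {resource}"


if __name__ == "__main__":
    # Hypothetical users, groups, and resource names.
    pdp = PolicyDecisionPoint(
        group_members={"lab-a": {"alice"}},
        policies={("lab-a", "/data/project1"): {"read"}},
    )
    pep = PolicyEnforcementPoint(pdp)
    print(pep.read("alice", "/data/project1"))
```

Keeping the decision logic in one place means every enforcement point (shell login, data access, web access) applies the same policies.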
Case Study: SBGrid
   • www.sbgrid.org
   • computing expertise for protein structure and function research
      • software
      • training
      • technical support
      • storage
      • cluster and grid computing
   • 150 member labs in consortium
      • about 1000 total researchers
   • structure imaging and model building:
      • imaging techniques are data intensive
      • model determination techniques are compute intensive

SBGrid Science Portal
[Figure: architecture of the SBGrid Science Portal @ Harvard Medical School and the external services it uses]
   User: data, computations
   Portal components
      monitoring: Ganglia, Nagios, RSV, pacct
      interfaces: Apache, GridSite, Django, Sage Math, R-Studio, shell CLI
      data: scp, GridFTP, SRM, WebDAV
      computation: Condor, Cycle Server, VDT, Globus, glideinWMS
      ID mgmt: FreeIPA, LDAP, VOMS, GUMS, GACL
      also shown: file server, SQL DB, cluster
   External services
      GlobusOnline @Argonne
      UC San Diego: GUMS, GridFTP + Hadoop, glideinWMS factory
      Open Science Grid
      MyProxy @NCSA, UIUC
      DOEGrids CA @Lawrence Berkeley Labs
      Gratia Acct'ing @FermiLab
      Monitoring @Indiana

Data Model

    • Data Tiers
       •   VO-wide: all sites, admin managed, very stable
       •   User project: all sites, user managed, 1-10 weeks, 1-3 GB
       •   User static: all sites, user managed, indefinite, 10 MB
       •   Job set: all sites, infrastructure managed, 1-10 days, 0.1-1 GB
       •   Job: direct to worker node, infrastructure managed, 1 day, <10 MB
       •   Job indirect: to worker node via UCSD, infrastructure managed, 1 day, <10 GB
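
The tiers above can be written down as a small lookup table, which makes the size and lifetime limits easy to check in scripts. This is only a sketch: the `DataTier` class, its field names, and the `pick_job_tier` helper are hypothetical and not part of the SBGrid portal. Limits come from the slide, with each range reduced to its upper bound (for example, 1-10 weeks becomes 70 days).

```python
from dataclasses import dataclass
from typing import Optional

MB = 10**6
GB = 10**9


@dataclass(frozen=True)
class DataTier:
    name: str
    reach: str                  # where the data is visible
    managed_by: str             # who is responsible for the data
    max_days: Optional[int]     # upper bound on lifetime; None = no fixed limit
    max_bytes: Optional[int]    # upper bound on size; None = not stated on the slide


# Values transcribed from the slide; ranges are reduced to their upper bound.
TIERS = [
    DataTier("VO-wide", "all sites", "admin", None, None),
    DataTier("User project", "all sites", "user", 70, 3 * GB),
    DataTier("User static", "all sites", "user", None, 10 * MB),
    DataTier("Job set", "all sites", "infrastructure", 10, 1 * GB),
    DataTier("Job", "direct to worker node", "infrastructure", 1, 10 * MB),
    DataTier("Job indirect", "to worker node via UCSD", "infrastructure", 1, 10 * GB),
]


def pick_job_tier(size_bytes: int) -> DataTier:
    """Choose between the two per-job tiers by file size (hypothetical rule)."""
    by_name = {tier.name: tier for tier in TIERS}
    if size_bytes < by_name["Job"].max_bytes:
        return by_name["Job"]
    if size_bytes < by_name["Job indirect"].max_bytes:
        return by_name["Job indirect"]
    raise ValueError("file exceeds the largest per-job tier")
```

With these numbers, a 5 MB input would go directly to the worker node, while a 2 GB input would be staged via UCSD.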



Data Management
    • quota
    • du scan
    • tmpwatch
    • conventions
    • workflow integration

Data Movement
    • scp (users)
    • rsync (VO-wide)
    • grid-ftp (UCSD)
    • curl (WNs)
    • cp (NFS)
    • htcp (secure web)
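
To illustrate the "du scan" and "tmpwatch" items, the following sketch totals the size of a directory tree, compares it with a quota, and lists files that have not been modified recently. It is an assumption-laden stand-in, not the portal's actual tooling: the function names are hypothetical, and it uses modification time, whereas the real `tmpwatch` defaults to access time.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def du_scan(root: Path) -> int:
    """Return the total size in bytes of regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks so linked data is not counted twice.
            if path.is_file() and not path.is_symlink():
                total += path.stat().st_size
    return total


def over_quota(root: Path, quota_bytes: int) -> bool:
    """True when the tree under root uses more than quota_bytes."""
    return du_scan(root) > quota_bytes


def stale_files(root: Path, max_age_days: float,
                now: Optional[float] = None) -> List[Path]:
    """List files not modified within max_age_days (cleanup candidates)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_file() and path.stat().st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Listing candidates rather than deleting them keeps the sketch safe to run; an actual cleanup job would act on the returned paths.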




[Diagram: numbered file transfers; red - push files, green - pull files]
    1.  user file upload
    2.  replicate gold standard
    3.  Auto-replicate
    4.  pull files from UCSD to WNs
    5.  pull files from local NFS to WNs
    6.  pull files from SBGrid to WNs
    7.  job results copied back to SBGrid
    8a. large job results copied to UCSD
    8b. later pulled to SBGrid
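
Steps 7, 8a, and 8b describe two return paths for job results. A minimal sketch of that routing decision follows. The slides do not say what counts as "large", so the 10 MB threshold is an assumption borrowed from the "Job" data tier; the function name and the hop tuples are likewise hypothetical, and labelling "copied" transfers as pushes is an interpretation of the diagram legend.

```python
from typing import List, Tuple

MB = 10**6

# Assumption: the slides do not define "large"; 10 MB is borrowed from the
# "Job" data tier limit purely for illustration.
LARGE_RESULT_BYTES = 10 * MB

# (source, destination, "push" or "pull")
Hop = Tuple[str, str, str]


def plan_result_return(size_bytes: int,
                       threshold: int = LARGE_RESULT_BYTES) -> List[Hop]:
    """Return the hops a job result takes from a worker node (WN) to SBGrid."""
    if size_bytes < threshold:
        # Step 7: small results are copied straight back to SBGrid.
        return [("WN", "SBGrid", "push")]
    # Steps 8a and 8b: large results are staged at UCSD, then pulled later.
    return [("WN", "UCSD", "push"), ("UCSD", "SBGrid", "pull")]
```

Staging large results at UCSD means the worker node only has to complete one transfer to a high-capacity store before the job slot is released.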
Copy, Move, Backup
       • Large data sets are difficult to copy, move, replicate, and back up
       • Tools and protocols required, with management
          •   sys admin (technical knowledge)
          •   archivist/curator (domain knowledge)
       • Common structure:
          •   Tier 1 - single master copy of data (live), possible offline tape backup
          •   Tier 2 - multiple reliable T-1 replicas serving a specific community
          •   Tier 3 - temporary "working set" T-2 replicas of required data
       • GridFTP
       • Storage Resource Broker (SRB)
       • GlobusOnline
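
Whatever tool moves the data between tiers, a replica is only useful if it matches its master. The sketch below shows one common way to check that, by comparing SHA-256 checksums after a copy. It uses a local file copy as a stand-in for a real transfer and does not represent the GridFTP, SRB, or GlobusOnline interfaces; the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(master: Path, replica: Path) -> str:
    """Copy master to replica and confirm the copy by comparing checksums."""
    replica.parent.mkdir(parents=True, exist_ok=True)
    # Local copy used here as a placeholder for the real transfer tool.
    shutil.copyfile(master, replica)
    expected = sha256_of(master)
    actual = sha256_of(replica)
    if actual != expected:
        raise IOError(f"replica {replica} does not match master {master}")
    return expected
```

Storing the returned checksum alongside the Tier 1 copy lets Tier 2 and Tier 3 replicas be re-verified later without rereading the master.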

Globus Online: High Performance Reliable 3rd Party File Transfer
    http://www.globusonline.org
    [Diagram labels: portal, cluster, data collection facility, lab file server, desktop, laptop]
Summary
     • Data can empower rather than overwhelm you
        •   but this requires thought and planning
     • Understand your data sources
     • Understand your data consumers
     • Educate yourself on available tools and technology
     • Design your data management system suitably



Acknowledgements & Questions
  • Piotr Sliz
     •   Principal Investigator, head of SBGrid
  • SBGrid System Administrators
     •   Ian Levesque, Peter Doherty
  • Globus Online Team
     •   Steve Tuecke, Ian Foster, Rachana Ananthakrishnan, Raj Kettimuthu
  • Terrence Martin
     •   System administrator at UCSD for assistance and encouragement using 1 PB Hadoop storage array
  • Brian Bockelman
     •   Physics faculty at University of Nebraska
  • Steve Timm
     •   System administrator at FermiLab
  • Ruth Pordes
     •   Director of OSG, for championing SBGrid

Please contact me with any questions:
  • Ian Stokes-Rees
  • ijstokes@hkl.hms.harvard.edu
  • ijstokes@spmetric.com

Look at our work
  • portal.sbgrid.org
  • www.sbgrid.org
  • www.opensciencegrid.org
Big Data: tools and techniques for working with large data sets

  • 1. Big Data: tools and techniques for working with large data sets
      Ian Stokes-Rees, PhD
      Harvard Medical School, Boston, USA
      Workshop on Tools, Technologies and Collaborative Opportunities for HPC in Life Sciences and Healthcare
      http://portal.sbgrid.org
      ijstokes@hkl.hms.harvard.edu
  • 3. Slides and Contact
      ijstokes@hkl.hms.harvard.edu
      http://linkedin.com/in/ijstokes
      http://slidesha.re/ijstokes-thailand2011
      http://www.sbgrid.org
      http://portal.sbgrid.org
      http://www.opensciencegrid.org
      Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 4. About Me
  • 17. [Figure; surviving labels: 2D simple crystal, Patterson map, rotational search, translation search, score (peak, R factor), model (best, alternatives), aggregate composites and cluster, electron density]
  • 18. Protein Structure Studies
  • 33. Data, Data Everywhere ...
      • We are being overwhelmed with data
          • high temporal resolution due to fast electronics
          • high spatial resolution due to advanced imaging techniques
          • high dimensional data
          • large data sets
          • simulation
          • modeling
      • It is easy to drown in the flood of data
          • storage issues - capacity
          • ownership issues - security and collaboration
          • provenance - origin, access, changes
      Today, we’ll think about software, hardware, and models for coping with large quantities of data
  • 34. Next Generation Sequencing
  • 35. High Energy Physics
  • 40. 40 MHz bunch crossing rate
      10 million data channels
      1 kHz level 1 event recording rate
      1-10 MB per event
      14 hours per day, 7+ months / year
      4 detectors
      6 PB of data / year
      globally distribute data for analysis (x2)
  • 42. Molecular Dynamics Simulations
      1 fs time step
      1 ns snapshot
      1 µs simulation
      1e6 steps
      1000 frames
      10 MB / frame
      10 GB / sim
      20 CPU-years
      3 months (wall-clock)
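The molecular dynamics figures quoted on this slide are consistent with one another, which a few lines of arithmetic can confirm. This is only a sketch: it assumes decimal units (1 GB = 1000 MB) and that the 20 CPU-years are spread evenly over the 3 months of wall-clock time. The variable names are illustrative.

```python
# Back-of-envelope check of the molecular dynamics figures quoted above.
# Times are held as integer femtoseconds to avoid floating-point rounding.
FS_PER_NS = 10**6
FS_PER_US = 10**9

time_step_fs = 1                  # 1 fs integration time step
snapshot_fs = 1 * FS_PER_NS       # one frame saved every 1 ns
simulation_fs = 1 * FS_PER_US     # 1 µs of total simulated time
frame_size_mb = 10                # 10 MB per saved frame

steps_per_frame = snapshot_fs // time_step_fs    # time steps between frames
frames = simulation_fs // snapshot_fs            # frames per simulation
total_gb = frames * frame_size_mb / 1000         # decimal GB (assumption)

cpu_years = 20
wall_clock_years = 3 / 12
average_cores = cpu_years / wall_clock_years     # cores busy on average

print(steps_per_frame, frames, total_gb, average_cores)
```

Under these assumptions the 1e6 steps correspond to one snapshot interval, the 1000 frames to the full simulation, and the run keeps roughly 80 cores busy for the three months.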
  • 43. Electronic Patient Records
  • 44. Electronic Patient Records
      77 page PDF (bespoke report)
  • 45. Electronic Patient Records
      Clinical Document Architecture XML representation
  • 46. Electronic Patient Records
      HTML rendering of XML via XSLT transform
  • 47. Clinical Imaging Data
      DICOM - Digital Imaging and Communications in Medicine
      2D, 3D, 4D
  • 48. Clinical Imaging Data
  • 52. It is clear there is no shortage of data.
      Potential for great new insights ...
      ... if we can organize, access, share, and analyze this data efficiently
  • 58. Jumping to the end ...
      • Data can empower rather than overwhelm you
          • but this requires thought and planning
      • Understand your data sources
      • Understand your data consumers
      • Educate yourself on available tools and technology
      • Design your data management system suitably
  • 67. Problems arising from “Big Data”
      • Where to store
      • How to store
      • How to process
      • Organization, searching, and meta-data
      • How to manage access
      • How to copy, move, and backup
      • Provenance
      • Lifecycle
  • 71. Where to store (I)
      • RAM
          • fast
          • expensive
          • volatile
      • local disk
          • get a good controller (SATA/SAS2)
          • lots of fast spinning disk (7200+ rpm)
          • high bandwidth possible
          • good first stop for data
          • hard to share, persist, backup
          • SSD good for random reads: lots of small files, unpredictable I/O patterns
          • large files, sequential I/O: spinning disk comparable to SSDs
      • Parallel Filesystem
          • gluster, lustre, gpfs
          • HDFS (Hadoop)
          • auto-replication for parallel decentralized I/O
  • 75. Where to store (II)
      • SAN with high performance interconnect
          • Storage Area Network
          • fully managed data storage
          • fiber channel (2 Gb/s) or InfiniBand (10, 20, 40 Gb/s) interconnect
          • parallel, non-blocking, dedicated routes
      • Hybrid
          • Create in-house tiered storage
      • NAS over ethernet
          • Network Attached Storage
          • Think NFS, CIFS, Samba network interface to storage
          • ethernet 1 Gb/s with contention (effective limit of ~500 Mb/s)
          • SATA (10k rpm, 2 TB, 3 Gb/s)
          • SAS2 (15k rpm, 750 GB, 6 Gb/s)
      • Cloud storage
          • Amazon S3
          • Box.net, Dropbox
          • BackBlaze: bit.ly/backblaze-20
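To make the link speeds on this slide concrete, the sketch below estimates how long a data set would take to move at each quoted rate. It is an idealized calculation: protocol overhead, latency, and disk limits are ignored, and the 1 TB data set size is a hypothetical value chosen only for illustration.

```python
# Rough time to move a data set over the links quoted on the slide.
# Assumption: the link rate is the only bottleneck (no overhead, no latency).
LINK_SPEEDS_BITS_PER_S = {
    "Ethernet 1 Gb/s with contention (~500 Mb/s)": 500e6,
    "Fibre Channel (2 Gb/s)": 2e9,
    "SATA (3 Gb/s)": 3e9,
    "SAS2 (6 Gb/s)": 6e9,
    "InfiniBand (10 Gb/s)": 10e9,
    "InfiniBand (40 Gb/s)": 40e9,
}


def transfer_seconds(size_bytes: float, bits_per_second: float) -> float:
    """Return the ideal transfer time for size_bytes at the given link rate."""
    return size_bytes * 8 / bits_per_second


one_terabyte = 1e12  # hypothetical data set size, decimal TB
for name, rate in LINK_SPEEDS_BITS_PER_S.items():
    hours = transfer_seconds(one_terabyte, rate) / 3600
    print(f"{name:45s} {hours:6.2f} h")
```

Even in this best case, a contended 1 Gb/s ethernet link needs several hours for a terabyte, which is one reason the choice of interconnect matters as much as raw capacity.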
  • 76. How to store (data formats)
      • ASCII
          • tab delimited
          • comma separated
      • XML
          • DTD definition?
          • Schema definition?
          • Namespaces?
      • JSON
      • NetCDF
      • HDF5
      • DICOM
      • Matlab .MAT format
      • NumPy .NPZ format
      • Bespoke binary
      • SQL DB
          • MySQL
          • sqlite
          • Oracle
          • Access
          • SQL Server
      • Hierarchical DB
          • Berkeley XML DB
          • LDAP
      • Object-Relational Mapper
          • SQL Alchemy (Python)
          • Hibernate (Java, .NET)
          • Django ORM (Python)
      • No-SQL DB
          • MongoDB
          • CouchDB
  • 77. How to process
      • Analytical software
          • custom programs
          • Matlab
          • Perl
          • R
          • Python
          • SAS, SPSS
          • Tableau
          • web-based services
          • Map/Reduce models
      • Analytical environments
          • multi-core machine - 48+ core systems for under $5000 (USD)
          • GPU
          • compute cluster
          • supercomputers
          • grid computing
          • cloud computing
          • network of workstations (NOW)
          • “screen-saver” computing (BOINC)
  • 79. 48 cores, single system image
  • 82. GPU Computing
      200-800 stream processing cores per card
      For $500 to $2000 (USD), processing speedups of up to an order of magnitude may be possible
  • 86. Open Science Grid
      www.opensciencegrid.org
  • 87. Map/Reduce
      • Unix users:
          • cat | grep | sort | uniq > file
      • Map/Reduce equivalent:
          • input | map | shuffle | reduce > output
      • HadoopFS (HDFS)
          • large data set is automatically spread and replicated across local storage resources (disks) of each node in a cluster
      • Map
          • creates a job for each data block in the input
          • maps the computational kernel to each job
          • schedules jobs to nodes with required data block
          • each job produces a set of key/value pairs as its result
      • Reduce
          • collects results from Map stage based on keys (Combine)
          • aggregates values to produce task (final) result
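The input | map | shuffle | reduce pipeline on this slide can be sketched in a single process. The example below is a minimal word count, not Hadoop code: the function names and input blocks are invented for illustration, and everything runs serially where a real cluster would run one map job per data block on the node that holds it.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (key, value) pair for every word in one data block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into a final result."""
    return {key: sum(values) for key, values in grouped.items()}


def map_reduce(blocks):
    # One map call per input block; run serially here for simplicity.
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle_phase(pairs))


blocks = ["big data big", "data sets", "big sets"]  # made-up input blocks
counts = map_reduce(blocks)
print(counts)
```

Because each map call sees only its own block and each reduce sees only one key's values, both stages can be distributed without the jobs needing to communicate with each other.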
  • 88. Extensions
      • Pig and Hive
          • pig.apache.org, hive.apache.org
          • simplify writing Map/Reduce programs for Hadoop
          • SQL-like query language for datasets available on HDFS
      • Cloudera
          • www.cloudera.com
          • packaged distribution of Hadoop + extensions
          • education + training material
      • Amazon Elastic Map Reduce
          • aws.amazon.com/elasticmapreduce
          • Amazon “cloud-based” hosting of Hadoop for Map/Reduce using EC2 for compute and S3 for storage
  • 89. Organization, Searching, and Meta-Data
      • Few “software” solutions for this problem
          • iRODS provides some of this
          • Unix “locate” database
          • SAN solutions may index software and provide tools for searching
      • Establish protocols, document, communicate
          • directory hierarchy
          • file naming
          • persisted working space
          • scratch/temporary space
      • Filesystem functionality
          • many file systems have per-file meta-data controls to add arbitrary key/value pairs
      • Augmented web-based view
          • cern_meta Apache module provides key/value pairs in HTTP HEAD
          • ability to assert arbitrary web organization on top of filesystem organization, with searching and graphical views
  • 90. www.irods.org
      • File-like paradigm for data-management
      • addition of meta-data
      • can integrate database resources
      • provides rich access policy management
      • automated workflows based on data actions
          • add, remove, modify
      • automated replication
      • built-in provenance
      • information life-cycle management
  • 91. Search: Apache
      • lucene.apache.org
      • Java-based
      • full text querying and searching
      • indexing
      • Solr provides web interface
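The indexing and full-text querying mentioned on this slide rest on an inverted index: a map from each term to the documents that contain it. The sketch below is a toy version in plain Python, not the Lucene or Solr API; the file names and contents are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


documents = {  # hypothetical file names and contents
    "run01.log": "lysozyme crystal diffraction data",
    "run02.log": "lysozyme simulation output",
    "notes.txt": "crystal screening notes",
}
index = build_index(documents)
print(sorted(search(index, "lysozyme crystal")))
```

The index is built once and each query is then a handful of set intersections, which is why searching stays fast even when the document collection is large.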
  • 95. Meta-Data: Semantic Media Wiki
      • You know Wikipedia
      • It is built using Mediawiki
      • Semantic Media Wiki adds Semantic Web features
          • Flexible key/value schemas
          • User defined and changeable object classes
          • Built-in knowledge of dates → timelines
          • Built-in knowledge of locations → maps
          • Built-in handling of images → picture galleries
  • 99. Access Control
      • Need a strong Identity Management environment
          • individuals: identity tokens and identifiers
          • groups: membership lists
          • Active Directory/CIFS (Windows), Open Directory (Apple), FreeIPA (Unix) all LDAP-based
      • Need to manage and communicate Access Control policies
          • institutionally driven
          • user driven
      • Need Authorization System
          • Policy Enforcement Point (shell login, data access, web access, start application)
          • Policy Decision Point (store policies and understand relationship of identity token and policy)
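The split between a Policy Decision Point and a Policy Enforcement Point can be illustrated with two small classes. This is a minimal sketch under stated assumptions: group membership and policies are plain in-memory structures rather than an LDAP directory, and all class, user, group, and path names are hypothetical.

```python
class PolicyDecisionPoint:
    """Stores policies and decides whether an identity may act on a resource."""

    def __init__(self, group_members, policies):
        self.group_members = group_members   # group name -> set of user ids
        self.policies = policies             # (group, action, resource) tuples

    def is_allowed(self, user, action, resource):
        return any(
            user in self.group_members.get(group, set())
            for group, allowed_action, allowed_resource in self.policies
            if allowed_action == action and allowed_resource == resource
        )


class PolicyEnforcementPoint:
    """Guards one entry point (here: data access) by asking the PDP."""

    def __init__(self, pdp):
        self.pdp = pdp

    def read(self, user, resource):
        if not self.pdp.is_allowed(user, "read", resource):
            raise PermissionError(f"{user} may not read {resource}")
        return f"contents of {resource}"


# Hypothetical users, groups, and resources.
pdp = PolicyDecisionPoint(
    group_members={"lab-a": {"alice"}, "lab-b": {"bob"}},
    policies=[("lab-a", "read", "/data/project1")],
)
pep = PolicyEnforcementPoint(pdp)
print(pep.read("alice", "/data/project1"))
```

Keeping the decision logic in one place means that shell login, web access, and data access can each have their own enforcement point while sharing a single set of policies.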
  • 100. Case Study: SBGrid
      • www.sbgrid.org
      • computing expertise for protein structure and function research
          • software
          • training
          • technical support
          • storage
          • cluster and grid computing
      • 150 member labs in consortium
          • about 1000 total researchers
      • structure imaging and model building:
          • imaging techniques are data intensive
          • model determination techniques are compute intensive
  • 101. SBGrid Science Portal
      [Architecture diagram. Sites named: Harvard Medical School (SBGrid Science Portal), UC San Diego, Argonne, NCSA/UIUC, Lawrence Berkeley Labs, FermiLab, Indiana, Open Science Grid. Column headings: monitoring, interfaces, data, computation, ID mgmt. Component labels: GlobusOnline, MyProxy, DOEGrids CA, Gratia Acct'ing, Ganglia, Nagios, RSV, pacct, Apache, GridSite, Django, WebDAV, Sage Math, R-Studio, shell CLI, scp, GridFTP, SRM, Hadoop, file server, SQL DB, Condor, Cycle Server, VDT, Globus, glideinWMS, cluster, FreeIPA, LDAP, VOMS, GUMS, GACL.]
  • 102. Data Model
      • Data Tiers
          • VO-wide: all sites, admin managed, very stable
          • User project: all sites, user managed, 1-10 weeks, 1-3 GB
          • User static: all sites, user managed, indefinite, 10 MB
          • Job set: all sites, infrastructure managed, 1-10 days, 0.1-1 GB
          • Job: direct to worker node, infrastructure managed, 1 day, <10 MB
          • Job indirect: to worker node via UCSD, infrastructure managed, 1 day, <10 GB
  • 103. Data Management
      • quota
      • du scan
      • tmpwatch
      • conventions
      • workflow integration
      Data Movement
      • scp (users)
      • rsync (VO-wide)
      • grid-ftp (UCSD)
      • curl (WNs)
      • cp (NFS)
      • htcp (secure web)
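Lifetime limits such as those in the data tiers are typically enforced by a periodic clean-up in the spirit of tmpwatch. The sketch below shows the idea in Python; it is not tmpwatch itself. It uses file modification time as the age criterion (an assumption), and the function names and any paths passed to them are hypothetical.

```python
import os
import time


def find_stale_files(root, max_age_days, now=None):
    """Return paths under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(directory, name)
            if os.stat(path).st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)


def purge(paths, dry_run=True):
    """Delete the given files; with dry_run=True only report them."""
    for path in paths:
        if dry_run:
            print("would remove", path)
        else:
            os.remove(path)
```

A job-set area with a 10 day lifetime could then be checked with `purge(find_stale_files(scratch_dir, max_age_days=10))`, where `scratch_dir` is whatever directory holds that tier. Leaving `dry_run=True` until the listing has been reviewed avoids deleting data by mistake.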
  • 112. [Data flow diagram] red - push files, green - pull files
      1. user file upload
      2. replicate gold standard
      3. Auto-replicate
      4. pull files from UCSD to WNs
      5. pull files from local NFS to WNs
      6. pull files from SBGrid to WNs
      7. job results copied back to SBGrid
      8a. large job results copied to UCSD
      8b. later pulled to SBGrid
  • 126. Copy, Move, Backup
      • Large data sets are difficult to copy, move, replicate, and back up
      • Tools and protocols required, with management
          • sys admin (technical knowledge)
          • archivist/curator (domain knowledge)
      • Common structure:
          • Tier 1 - single master copy of data (live), possible offline tape backup
          • Tier 2 - multiple reliable T-1 replicas serving a specific community
          • Tier 3 - temporary “working set” T-2 replicas of required data
      • GridFTP
      • Storage Resource Broker (SRB)
      • GlobusOnline
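The tier structure depends on replicas that can be trusted to match the master copy. The sketch below illustrates one way to do that with a checksum comparison after each copy. It is a local stand-in, not GridFTP, SRB, or Globus Online: the copy is a plain filesystem copy, the SHA-256 verification step is an assumption added for illustration, and the function names are hypothetical.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(master_path, replica_dirs):
    """Copy a master file into each replica directory and verify each copy."""
    expected = sha256_of(master_path)
    replicas = []
    for directory in replica_dirs:
        os.makedirs(directory, exist_ok=True)
        target = os.path.join(directory, os.path.basename(master_path))
        shutil.copy2(master_path, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        replicas.append(target)
    return replicas
```

Hashing in fixed-size chunks keeps memory use constant regardless of file size, which matters when single files run to gigabytes.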
  • 127. Globus Online: High Performance Reliable 3rd Party File Transfer
      http://www.globusonline.org
      [Diagram labels: portal, cluster, data collection facility, lab file server, desktop, laptop]
  • 129. Summary
      • Data can empower rather than overwhelm you
          • but this requires thought and planning
      • Understand your data sources
      • Understand your data consumers
      • Educate yourself on available tools and technology
      • Design your data management system suitably
  • 131. Acknowledgements & Questions
      • Piotr Sliz
          • Principal Investigator, head of SBGrid
      • SBGrid System Administrators
          • Ian Levesque, Peter Doherty
      • Globus Online Team
          • Steve Tuecke, Ian Foster, Rachana Ananthakrishnan, Raj Kettimuthu
      • Terrence Martin
          • System administrator at UCSD, for assistance and encouragement using 1 PB Hadoop storage array
      • Brian Bockelman
          • Physics faculty at University of Nebraska
      • Steve Timm
          • System administrator at FermiLab
      • Ruth Pordes
          • Director of OSG, for championing SBGrid
      Please contact me with any questions:
          • Ian Stokes-Rees
          • ijstokes@hkl.hms.harvard.edu
          • ijstokes@spmetric.com
      Look at our work:
          • portal.sbgrid.org
          • www.sbgrid.org
          • www.opensciencegrid.org
