An Introduction to Data Intensive Computing
Chapter 1: Introduction

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011
  
1.  Introduction (0830-0900)
     a.  Data clouds (e.g. Hadoop)
     b.  Utility clouds (e.g. Amazon)
2.  Managing Big Data (0900-0945)
     a.  Databases
     b.  Distributed File Systems (e.g. Hadoop)
     c.  NoSQL databases (e.g. HBase)
3.  Processing Big Data (0945-1000 and 1030-1100)
     a.  Multiple Virtual Machines & Message Queues
     b.  MapReduce
     c.  Streams over distributed file systems
4.  Lab using Amazon’s Elastic MapReduce (1100-1200)
  
	
  
Our perspective is to consider data intensive computing from the viewpoint of utility and data clouds.
  	
  	
  




For the most current version of these notes, please see: rgrossman.com
  
Section 1.1
Data Intensive Science

Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
  
Moore’s law also applies to the instruments that are producing data.

This is creating new paradigms: “data intensive science” and “data intensive computing.”

Source: Lincoln Stein
  
Data is Big If It is Measured in MW
•  Data is big if you measure it in megawatts.
•  As in, a good sweet spot for a data center is 15 MW.
•  As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
•  Facebook’s new Prineville data center is 30 MW.
•  Google’s computing infrastructure uses 260 MW.
  
Some Big Data Sciences

Discipline          Duration     Size             # Devices
HEP - LHC           10 years     15 PB/year*      One
Astronomy - LSST    10 years     12 PB/year**     One
Genomics - NGS      2-4 years    0.4 TB/genome    1000’s

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
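
The footnoted figures can be cross-checked with a little unit arithmetic. The sketch below is illustrative only. It assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB) and, for LSST, assumes data is taken on every night of the year, which is a simplification. The function and variable names are hypothetical.

```python
# Unit arithmetic for the data volumes quoted in the footnotes.
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 10**6 GB
TB_PER_PB = 1_000      # assumption: decimal units, 1 PB = 10**3 TB


def gb_to_pb(gigabytes: float) -> float:
    """Convert gigabytes to petabytes."""
    return gigabytes / GB_PER_PB


def nightly_tb_to_pb_per_year(tb_per_night: float, nights: int = 365) -> float:
    """Yearly volume in PB, assuming data is taken on `nights` nights per year."""
    return tb_per_night * nights / TB_PER_PB


# LHC: "more than 15 million Gigabytes of data each year"
lhc_pb_per_year = gb_to_pb(15_000_000)

# LSST: 15 TB raw and 30 TB processed per night
lsst_raw_pb_per_year = nightly_tb_to_pb_per_year(15)
lsst_processed_pb_per_year = nightly_tb_to_pb_per_year(30)

if __name__ == "__main__":
    print(f"LHC:            {lhc_pb_per_year:.2f} PB/year")
    print(f"LSST raw:       {lsst_raw_pb_per_year:.2f} PB/year")
    print(f"LSST processed: {lsst_processed_pb_per_year:.2f} PB/year")
```

Under these assumptions the LHC figure matches the 15 PB/year in the table, and the LSST processed-data estimate comes out near 11 PB/year, the same order of magnitude as the table's 12 PB/year entry. The slides do not say how the table value was derived.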
  
An algorithm and computing infrastructure is “big-data scalable” if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but increases the amount of data that can be processed.

Add capacity with constant time (ACCT)
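
One way to read the ACCT property is as a simple model: if every rack brings both its own data and its own processors, completion time depends on the per-rack ratio and not on the number of racks. The sketch below is an idealized illustration that ignores coordination, network transfer, and data skew. The class name and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    racks: int
    tb_per_rack: float           # data held on each rack (hypothetical)
    tb_per_hour_per_rack: float  # processing rate of each rack (hypothetical)

    def total_data_tb(self) -> float:
        """Total data the cluster holds and can process."""
        return self.racks * self.tb_per_rack

    def completion_hours(self) -> float:
        """Idealized run time: all racks work in parallel at the same rate."""
        total_rate = self.racks * self.tb_per_hour_per_rack
        return self.total_data_tb() / total_rate


small = Cluster(racks=1, tb_per_rack=100.0, tb_per_hour_per_rack=25.0)
large = Cluster(racks=4, tb_per_rack=100.0, tb_per_hour_per_rack=25.0)

if __name__ == "__main__":
    for name, c in (("small", small), ("large", large)):
        print(name, c.total_data_tb(), "TB in", c.completion_hours(), "hours")
```

In this model, adding racks multiplies the data processed while the completion time stays constant, which is the behavior the definition asks for.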
  
Section 1.2
What’s New with Clouds?
  
The Term ‘In the Cloud’ is Annoying
•  “Personally, I find the term ‘in the cloud’ pretentious and annoying. … the world’s marketers and P.R. people seem to think that ‘the cloud’ just means ‘online.’ ” David Pogue, NYT, June 16, 2011.
•  More specifically, he notes that you can think of the cloud as “data and application software stored on remote servers [and accessed via the Internet]”
  
Utility Clouds
Infrastructure as a Service (IaaS)

[Photo: Amazon Data Center]
  
Data Clouds
Large Data Cloud Services
ad targeting

[Photo: Yahoo Data Center]
  
Virtualization

  App   App   App            App   App   App
        OS                   OS    OS    OS
                                Hypervisor
     Computer                    Computer
  
Idea Dates Back to the 1960s

  App   App   App
  CMS   MVS   CMS
     IBM VM/370
    IBM Mainframe

Native (Full) Virtualization
Examples: VMware ESX

•  Virtualization was first widely deployed with IBM VM/370.
  
Scale is New
  
Usage Based Pricing Is New

1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
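
The pricing claim follows from paying per computer-hour. A minimal sketch, assuming a flat rate with no minimums or discounts; the rate used here is a made-up placeholder, not a real price.

```python
def usage_cost_cents(computers: int, hours: int, cents_per_computer_hour: int) -> int:
    """Cost under pure usage-based pricing: pay per computer-hour consumed."""
    return computers * hours * cents_per_computer_hour


RATE_CENTS = 10  # hypothetical price per computer-hour, in cents

one_machine_long = usage_cost_cents(1, 120, RATE_CENTS)     # 1 computer, 120 hours
many_machines_short = usage_cost_cents(120, 1, RATE_CENTS)  # 120 computers, 1 hour

if __name__ == "__main__":
    print(one_machine_long, many_machines_short)
```

Both calls consume 120 computer-hours, so the cost is the same. What changes is the wall-clock time to get the result.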
  
Simplicity is New

[figure] + [figure] ... and you have a computer ready to work.

Elastic, on demand provisioning.

A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
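
To give a sense of why the programming model is quick to learn, here is a minimal word-count sketch written in the MapReduce style. It runs in a single Python process and is not Hadoop's or Google's API. A real framework would distribute the map and reduce steps across machines and handle the grouping step itself. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    emitted = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


counts = word_count(["big data", "data clouds", "big data clouds"])

if __name__ == "__main__":
    print(counts)
```

The programmer writes only the map and reduce functions. Partitioning, grouping, and fault handling are the framework's job, which is where the simplicity comes from.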
  
Section 1.4
Utility Clouds
  
Customer’s Responsibility / Cloud Service Provider’s Responsibility

IaaS           PaaS           SaaS
Apps           Apps           Apps
Frameworks     Frameworks     Frameworks
VM             VM             VM
Hypervisor,    Hypervisor,    Hypervisor,
network        network        network
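
The split between customer and provider can be written down as a small lookup. The sketch below encodes one common reading of the three service models, consistent with the NIST service-model descriptions later in these slides. The exact boundary for each model is an assumption, since the slide text does not say which layers fall on which side. Layer and function names are hypothetical.

```python
from typing import Dict, List

# Layers named on the slide, from top to bottom.
LAYERS: List[str] = ["apps", "frameworks", "vm", "hypervisor_and_network"]

# Assumption: the highest layer the provider is responsible for in each model.
PROVIDER_TOP_LAYER: Dict[str, str] = {
    "IaaS": "hypervisor_and_network",
    "PaaS": "frameworks",
    "SaaS": "apps",
}


def customer_layers(model: str) -> List[str]:
    """Layers above the provider's top layer are left to the customer."""
    top = LAYERS.index(PROVIDER_TOP_LAYER[model])
    return LAYERS[:top]


def provider_layers(model: str) -> List[str]:
    """The provider's top layer and everything beneath it."""
    top = LAYERS.index(PROVIDER_TOP_LAYER[model])
    return LAYERS[top:]


if __name__ == "__main__":
    for model in ("IaaS", "PaaS", "SaaS"):
        print(model, "customer:", customer_layers(model),
              "provider:", provider_layers(model))
```

Read this way, the customer's share shrinks from IaaS to SaaS while the provider's share grows.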
  
Amazon Style Data Cloud

                 Load Balancer
             Simple Queue Service
  SDB     EC2 Instances     EC2 Instances
              S3 Storage Services
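
The pattern in this diagram, a queue feeding a pool of interchangeable instances, can be sketched locally. The example below is a stand-in only: Python's `queue.Queue` plays the role of the queue service and threads play the role of instances. No AWS API is called, and the function name and the squaring "work" are hypothetical.

```python
import queue
import threading
from typing import List


def run_workers(tasks: List[int], num_workers: int = 4) -> List[int]:
    """Process tasks with a pool of workers that pull from a shared queue."""
    task_queue: "queue.Queue[int]" = queue.Queue()
    results: List[int] = []
    results_lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                item = task_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            squared = item * item  # stand-in for real work
            with results_lock:
                results.append(squared)

    # Enqueue everything before starting workers, so an empty queue means done.
    for task in tasks:
        task_queue.put(task)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


processed = run_workers([1, 2, 3, 4, 5])

if __name__ == "__main__":
    print(processed)
```

Because workers only interact through the queue, more workers can be added without changing the code that produces tasks.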
  
NIST Definition
•  Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
  
NIST Definition

Essential Characteristics
  •  On-demand / self-service
  •  Broad network access
  •  Resource pooling
  •  Rapid elasticity
  •  Measured service
Deployment Models
  •  Private
  •  Community
  •  Public
  •  Hybrid
Service Models
  •  Software as a Service (SaaS) – consumer runs provider’s applications on cloud infrastructure
  •  Platform as a Service (PaaS) – consumer runs consumer-created applications on the cloud using tools supported by provider
  •  Infrastructure as a Service (IaaS) – consumer uses provider’s processing, storage, and networks

Section 1.5
Data Clouds
  
Google’s Large Data Cloud (Google’s Stack)

Applications
Compute Services     Google’s MapReduce
Data Services        Google’s BigTable
Storage Services     Google File System (GFS)

Hadoop’s Large Data Cloud (Hadoop’s Stack)

Applications
Compute Services     Hadoop’s MapReduce
Data Services        NoSQL Databases
Storage Services     Hadoop Distributed File System (HDFS)

Questions?
  

Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)

  • 1. An  Introduc+on  to     Data  Intensive  Compu+ng     Chapter  1:  Introduc+on   Robert  Grossman   University  of  Chicago   Open  Data  Group     Collin  BenneB   Open  Data  Group     November  14,  2011   1  
  • 2. 1.  Introduc+on  (0830-­‐0900)   a.  Data  clouds  (e.g.  Hadoop)   b.  U+lity  clouds  (e.g.  Amazon)   2.  Managing  Big  Data  (0900-­‐0945)   a.  Databases   b.  Distributed  File  Systems  (e.g.  Hadoop)   c.  NoSql  databases  (e.g.  HBase)   3.  Processing  Big  Data  (0945-­‐1000  and  1030-­‐1100)   a.  Mul+ple  Virtual  Machines  &  Message  Queues   b.  MapReduce   c.  Streams  over  distributed  file  systems   4.  Lab  using  Amazon’s  Elas+c  Map  Reduce   (1100-­‐1200)    
  • 3. Our  perspec+ve  is  to  consider  data  intensive   compu+ng  from  the  viewpoint  of  u+lity  and   data  clouds.       For  the  most  current  version  of  these  notes,   please  see:     rgrossman.com  
  • 4. Sec+on  1.1     Data  Intensive  Science   Two  of  the  14  high  throughput  sequencers  at  the   Ontario  Ins+tute  for  Cancer  Research  (OICR).       4  
  • 5. Moore’s  law  also   applies  to  the   instruments  that  are   producing  data.     This  is  crea+ng  new   paradigms:  “data   intensive  science”   and  “data  intensive   compu+ng.”  
  • 7. Data  is  Big  If  It  is  Measured  in  MW   •  Data  is  big  if  you  measure  it  in   MegawaBs.   •  As  in,  a  good  sweet  spot  for  a   data  center  is  15  MW.   •  As  in,  Facebook’s  leased  data   centers  are  typically  between   2.5  MW  and  6.0  MW.   •  Facebook’s  new  Pineville  data   center  is  30  MW.   •  Google’s  compu+ng   infrastructure  uses  260  MW.  
  • 8. Some  Big  Data  Sciences   Discipline   Dura-on   Size   #  Devices   HEP  -­‐  LHC   10  years   15  PB/year*   One   Astronomy  -­‐  LSST   10  years   12  PB/year**   One   Genomics  -­‐  NGS   2-­‐4  years   0.4  TB/genome   1000’s   *At  full  capacity,  the  Large  Hadron  Collider  (LHC),  the  world's  largest  par+cle  accelerator,  is  expected  to  produce  more  than  15   million  Gigabytes  of  data  each  year.    …  This  ambi+ous  project  connects  and  combines  the  IT  power  of  more  than  140  computer   centres  in  33  countries.    Source:  hBp://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-­‐en.html     **As  it  carries  out  its  10-­‐year  survey,  LSST  will  produce  over  15  terabytes  of  raw  astronomical  data  each  night  (30  terabytes   processed),  resul+ng  in  a  database  catalog  of  22  petabytes  and  an  image  archive  of  100  petabytes.    Source:  hBp://www.lsst.org/ News/enews/teragrid-­‐1004.html  
  • 9. An algorithm and its computing infrastructure are “big-data scalable” if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but increases the amount of data that can be processed. In short: add capacity with constant time (ACCT).
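The ACCT property can be illustrated with a small idealized model. This is only a sketch under stated assumptions: every rack holds the same amount of data, processes only its own data, and all racks work fully in parallel with no coordination overhead. The function name and the figures (100 TB per rack, 5 TB per hour per rack) are hypothetical and are not taken from the slides.

```python
def completion_time_hours(racks, tb_per_rack, tb_per_hour_per_rack):
    """Idealized completion time when each rack processes its own data in parallel.

    Assumption (not from the slides): no network, scheduling, or merge overhead.
    """
    if racks <= 0:
        raise ValueError("racks must be positive")
    total_data_tb = racks * tb_per_rack                  # data grows with the racks
    total_throughput = racks * tb_per_hour_per_rack      # so does processing capacity
    return total_data_tb / total_throughput


# Hypothetical figures: 100 TB of data and 5 TB/hour of processing per rack.
for racks in (1, 2, 10):
    hours = completion_time_hours(racks, tb_per_rack=100, tb_per_hour_per_rack=5)
    print(f"{racks:>2} racks, {racks * 100:>4} TB of data: {hours:.1f} hours")
```

In this model the completion time stays the same as racks are added while the amount of data processed grows, which is the behavior the definition asks for. Real systems only approximate it, since shared resources add overhead.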
  • 10. Section 1.2: What’s New with Clouds?
  • 11. The Term ‘In the Cloud’ is Annoying
       • “Personally, I find the term ‘in the cloud’ pretentious and annoying. … the world’s marketers and P.R. people seem to think that ‘the cloud’ just means ‘online.’ ” David Pogue, NYT, June 16, 2011.
       • More specifically, he notes that you can think of the cloud as “data and application software stored on remote servers [and accessed via the Internet].”
  • 12. Utility Clouds: Infrastructure as a Service (IaaS). [Photo: Amazon data center]
  • 13. Data Clouds: Large Data Cloud Services (ad targeting). [Photo: Yahoo data center]
  • 14. Virtualization. [Diagram: on one computer, three apps run on a single OS; on another computer, three apps each run on their own OS on top of a hypervisor.]
  • 15. Idea Dates Back to the 1960s
       • Virtualization first widely deployed with IBM VM/370.
       • Native (full) virtualization. Examples: VMware ESX.
       [Diagram: apps running on CMS, MVS, and CMS guests on top of IBM VM/370 on an IBM mainframe.]
  • 16. Scale is New
  • 17. Usage Based Pricing Is New: 1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
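The arithmetic behind usage-based pricing is that cost depends only on the number of computer-hours consumed. A minimal sketch, assuming a flat hourly rate per computer: the rate of 10 cents per computer-hour is hypothetical, and real providers' pricing has more dimensions (instance type, storage, data transfer).

```python
def usage_cost_cents(computers, hours, cents_per_computer_hour):
    """Cost under flat usage-based pricing: pay only for computer-hours used."""
    return computers * hours * cents_per_computer_hour


RATE_CENTS = 10  # hypothetical flat rate, in cents per computer-hour

one_for_long = usage_cost_cents(computers=1, hours=120, cents_per_computer_hour=RATE_CENTS)
many_for_short = usage_cost_cents(computers=120, hours=1, cents_per_computer_hour=RATE_CENTS)

print(f"1 computer for 120 hours:  ${one_for_long / 100:.2f}")
print(f"120 computers for 1 hour:  ${many_for_short / 100:.2f}")
```

Both configurations consume 120 computer-hours, so they cost the same, but the second one finishes in 1 hour instead of 120.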
  • 18. Simplicity is New. [Figure: two images joined by “+”] ... and you have a computer ready to work. Elastic, on-demand provisioning. A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
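To give a sense of why the MapReduce model is quick to learn, here is a word count written as a map step and a reduce step. This is a single-process, in-memory sketch for illustration only. It does not use Hadoop's or Amazon's APIs, and all function names are hypothetical. In a real framework the programmer typically supplies only the mapper and reducer, and the framework handles distribution and grouping.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    """Sum the counts emitted for one word."""
    return sum(counts)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)


lines = ["big data is big", "data clouds process big data"]
counts = word_count(lines)
print(counts)
```

Only `word_mapper` and `count_reducer` are specific to the problem. The other functions stand in for what the framework provides.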
  • 19. Section 1.4: Utility Clouds
  • 20. [Diagram: division between the customer’s responsibility and the cloud service provider’s responsibility for IaaS, PaaS, and SaaS. Each model is drawn as the same stack, top to bottom: Apps; Frameworks; VM; Hypervisor, network.]
  • 21. Amazon Style Data Cloud. [Diagram components: Load Balancer, Simple Queue Service, SDB, multiple groups of EC2 instances, S3 Storage Services.]
  • 22. NIST Definition
       • Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
  • 23. NIST Definition
       Essential Characteristics
       • On-demand / self-service
       • Broad network access
       • Resource pooling
       • Rapid elasticity
       • Measured service
       Deployment Models
       • Private
       • Community
       • Public
       • Hybrid
       Service Models
       • Software as a Service (SaaS): consumer runs provider's applications on cloud infrastructure
       • Platform as a Service (PaaS): consumer runs consumer-created applications on the cloud using tools supported by provider
       • Infrastructure as a Service (IaaS): consumer uses provider's processing, storage, and networks
  • 24. Section 1.5: Data Clouds
  • 25. Google’s Large Data Cloud (Google’s Stack), top to bottom:
       • Applications
       • Compute Services: Google’s MapReduce
       • Data Services: Google’s BigTable
       • Storage Services: Google File System (GFS)
  • 26. Hadoop’s Large Data Cloud (Hadoop’s Stack), top to bottom:
       • Applications
       • Compute Services: Hadoop’s MapReduce
       • Data Services: NoSQL Databases
       • Storage Services: Hadoop Distributed File System (HDFS)