SlideShare a Scribd company logo
1 of 16
Cloud Computing, Data Mining and Cyberinfrastructure Chemistry in the Digital Age Workshop, Penn State University, June 11, 2009 Geoffrey Fox  gcf@indiana.eduwww.infomall.org/salsa http://grids.ucs.indiana.edu/ptliupages/ (Presented by Marlon Pierce) Community Grids Laboratory,  Chair Department of Informatics School of InformaticsIndiana University
Cloud Computing: Infrastructure and Runtimes Cloud infrastructure: outsourcing of servers, computing, data, file space, etc. Handled through Web services that control virtual machine lifecycles. Cloud runtimes:: tools for using clouds to do data-parallel computations.  Apache Hadoop, Google MapReduce, Microsoft Dryad, and others  Designed for information retrieval but are excellent for a wide range of machine learning and science applications. Apache Mahout  Also may be a good match for 32-128 core computers available in the next 5 years. Can also do traditional parallel computing Clustering algorithms: applications for cloud-based data mining.  Run on cloud infrastructure Implement with cloud runtimes.
Commercial Clouds Boldfaced names have open source versions
Open Architecture Clouds Amazon, Google, Microsoft, et al., don’t tell you how to build a cloud. Proprietary knowledge Indiana University and others want to document this publically.  What is the right way to build a cloud? It is more than just running software. What is the minimum-sized organization to run a cloud? Department?  University? University Consortium? Outsource it all? Analogous issues in government, industry, and enterprise. Example issues: What hardware setups work best?  What are you getting into? What is the best virtualization technology for different problems? What is the right way to implement S3- and EBS-like data services?  Content Distribution Systems? Persistent, reliable service hosting?
Cloud Runtimes What science can you do on a cloud?
Data-File Parallelism and Clouds Now that you have a cloud, you may want to do large scale processing with it. Classic problems are to perform the same (sequential) algorithm on fragments of extremely large data sets. Cloud runtime engines manage these replicated algorithms in the cloud. Can be chained together in pipelines (Hadoop) or DAGs(Dryad). Runtimes manage problems like failure control. We are exploring both scientific applications and classic parallel algorithms (clustering, matrix multiplication) using Clouds and cloud runtimes.
7 H n Y Y U U 4n S S 4n M M n D D n X X U U N N MapReduce implemented  by Hadoop map(key, value) reduce(key, list<value>) Example: Word Histogram Start with a set of words Each map task counts number of occurrences in each data partition Reduce phase adds these counts Dryadsupports general dataflow
Geospatial Examples Image processing and mining Ex: SAR Images from Polar Grid project (J. Wang)  Apply to 20 TB of data Flood modeling I Chaining flood models over a geographic area.  Flood modeling II Parameter fits and inversion problems. Real time GPS processing Filter
File/Data Parallel Examples from Biology EST (Expressed Sequence Tag) Assembly:  (Dong) 2 million mRNA sequences generates 540000 files taking 15 hours on 400 TeraGrid nodes (CAP3 run dominates) MultiParanoid/InParanoid gene sequence clustering: (Dong) 476 core years just for Prokaryotes Population Genomics: (Lynch) Looking at all pairs separated by up to 1000 nucleotides Sequence-based transcriptome profiling: (Cherbas, Innes) MAQ, SOAP Systems Microbiology: (Brun) BLAST, InterProScan Metagenomics: (Fortenberry, Nelson) Pairwise alignment of 7243 16s sequence data took 12 hours on TeraGrid All can use Dryad or Hadoop 9
MPI Applications on Clouds
Kmeans Clustering Performance – 128 CPU cores Overhead Perform Kmeans clustering for up to 40 million 3D data points Amount of communication depends only on the number of cluster centers Amount of communication  << Computation and the amount of data processed At the highest granularity VMs show at least 3.5 times overhead compared to bare-metal Extremely large overheads for smaller grain sizes
Deterministic Annealing for Clustering Highly parallelizable algorithms for data-mining
DeterministicAnnealing F({y}, T) Solve Linear Equations for each temperature Nonlinearity effects mitigated by initializing with solution at previous higher temperature Configuration {y}                     Minimum evolving as temperature decreases                     Movement at fixed temperature going to local minima if not initialized “correctly
Various Sequence Clustering Results 14 3000 Points : Clustal MSAKimura2 Distance 4500 Points : Pairwise Aligned 4500 Points : Clustal MSA Map distances to 4D Sphere before MDS
Obesity Patient ~ 20 dimensional data 15 Will use our 8 node Windows HPC system to  run 36,000 records Working with Gilbert Liu IUPUI to map patient clusters to environmental factors 2000 records         6 Clusters Refinement of 3 of clusters to left into 5 4000 records  8 Clusters
Conclusions We believe Cloud Computing will dramatically change scientific computing infrastructure. No more clusters in the closet? Controllable, sustainable computing. But we need to know what we are getting into. Even better, clouds (wherever they are) are well suited for a wide range of scientific and computer science applications. We are exploring some of these in biology, clustering, geospatial processing, and hopefully chemistry.

More Related Content

What's hot

Earth Science Platform
Earth Science PlatformEarth Science Platform
Earth Science PlatformTed Habermann
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)Robert Grossman
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Robert Grossman
 
Twister4Azure - Iterative MapReduce for Azure Cloud
Twister4Azure - Iterative MapReduce for Azure CloudTwister4Azure - Iterative MapReduce for Azure Cloud
Twister4Azure - Iterative MapReduce for Azure CloudThilina Gunarathne
 
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and SparkFOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and SparkRob Emanuele
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationRob Emanuele
 
Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Robert Grossman
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Robert Grossman
 
OpenTopography - Scalable Services for Geosciences Data
OpenTopography - Scalable Services for Geosciences DataOpenTopography - Scalable Services for Geosciences Data
OpenTopography - Scalable Services for Geosciences DataOpenTopography Facility
 
DATACUBES: Conquering Space & Time
DATACUBES: Conquering Space & TimeDATACUBES: Conquering Space & Time
DATACUBES: Conquering Space & Timeplan4all
 
Spark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W uSpark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W uSpark Summit
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Robert Grossman
 
Bdu -stream_processing_with_smack_final
Bdu  -stream_processing_with_smack_finalBdu  -stream_processing_with_smack_final
Bdu -stream_processing_with_smack_finalmanishduttpurohit
 
Machine learning astronomical structure
Machine learning astronomical structureMachine learning astronomical structure
Machine learning astronomical structurePanditNitesh
 
OWL reasoning with WebPIE: calculating the closer of 100 billion triples
OWL reasoning with WebPIE: calculating the closer of 100 billion triplesOWL reasoning with WebPIE: calculating the closer of 100 billion triples
OWL reasoning with WebPIE: calculating the closer of 100 billion triplesMahdi Atawneh
 
Enabling Access to Big Geospatial Data with LocationTech and Apache projects
Enabling Access to Big Geospatial Data with LocationTech and Apache projectsEnabling Access to Big Geospatial Data with LocationTech and Apache projects
Enabling Access to Big Geospatial Data with LocationTech and Apache projectsRob Emanuele
 
Processing Geospatial at Scale at LocationTech
Processing Geospatial at Scale at LocationTechProcessing Geospatial at Scale at LocationTech
Processing Geospatial at Scale at LocationTechRob Emanuele
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptxIan Foster
 
Results from Retrofit for the Future.
Results from Retrofit for the Future.Results from Retrofit for the Future.
Results from Retrofit for the Future.Tortrix Ltd
 

What's hot (20)

Earth Science Platform
Earth Science PlatformEarth Science Platform
Earth Science Platform
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)
 
Twister4Azure - Iterative MapReduce for Azure Cloud
Twister4Azure - Iterative MapReduce for Azure CloudTwister4Azure - Iterative MapReduce for Azure Cloud
Twister4Azure - Iterative MapReduce for Azure Cloud
 
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and SparkFOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis Presentation
 
Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 
Hadoop
HadoopHadoop
Hadoop
 
OpenTopography - Scalable Services for Geosciences Data
OpenTopography - Scalable Services for Geosciences DataOpenTopography - Scalable Services for Geosciences Data
OpenTopography - Scalable Services for Geosciences Data
 
DATACUBES: Conquering Space & Time
DATACUBES: Conquering Space & TimeDATACUBES: Conquering Space & Time
DATACUBES: Conquering Space & Time
 
Spark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W uSpark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W u
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)
 
Bdu -stream_processing_with_smack_final
Bdu  -stream_processing_with_smack_finalBdu  -stream_processing_with_smack_final
Bdu -stream_processing_with_smack_final
 
Machine learning astronomical structure
Machine learning astronomical structureMachine learning astronomical structure
Machine learning astronomical structure
 
OWL reasoning with WebPIE: calculating the closer of 100 billion triples
OWL reasoning with WebPIE: calculating the closer of 100 billion triplesOWL reasoning with WebPIE: calculating the closer of 100 billion triples
OWL reasoning with WebPIE: calculating the closer of 100 billion triples
 
Enabling Access to Big Geospatial Data with LocationTech and Apache projects
Enabling Access to Big Geospatial Data with LocationTech and Apache projectsEnabling Access to Big Geospatial Data with LocationTech and Apache projects
Enabling Access to Big Geospatial Data with LocationTech and Apache projects
 
Processing Geospatial at Scale at LocationTech
Processing Geospatial at Scale at LocationTechProcessing Geospatial at Scale at LocationTech
Processing Geospatial at Scale at LocationTech
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptx
 
Results from Retrofit for the Future.
Results from Retrofit for the Future.Results from Retrofit for the Future.
Results from Retrofit for the Future.
 

Viewers also liked

IEEE Projects, Non-IEEE Projects, Data Mining, Cloud computing, Main Projects...
IEEE Projects, Non-IEEE Projects, Data Mining, Cloud computing, Main Projects...IEEE Projects, Non-IEEE Projects, Data Mining, Cloud computing, Main Projects...
IEEE Projects, Non-IEEE Projects, Data Mining, Cloud computing, Main Projects...Shakas Technologies
 
Mobile security
Mobile securityMobile security
Mobile securityMphasis
 
Cloud Computing Bootcamp On The Google App Engine [v1.1]
Cloud Computing Bootcamp On The Google App Engine [v1.1]Cloud Computing Bootcamp On The Google App Engine [v1.1]
Cloud Computing Bootcamp On The Google App Engine [v1.1]Matthew McCullough
 
Cloud computing 2015 ieee papers Data mining ieee project titles
Cloud computing  2015 ieee papers  Data mining ieee project titlesCloud computing  2015 ieee papers  Data mining ieee project titles
Cloud computing 2015 ieee papers Data mining ieee project titlesDoClick Solutions
 
Data Science, Knowledge Discover, Mining and Learning
Data Science, Knowledge Discover, Mining and LearningData Science, Knowledge Discover, Mining and Learning
Data Science, Knowledge Discover, Mining and LearningEUBrasilCloudFORUM .
 
What is IPX?
What is IPX?What is IPX?
What is IPX?whatisipx
 
محاضرة الحوسبة السحابية لـ د.هبة كردي @SCSWomen #تقنيةوتواصل
محاضرة الحوسبة السحابية لـ د.هبة كردي @SCSWomen #تقنيةوتواصلمحاضرة الحوسبة السحابية لـ د.هبة كردي @SCSWomen #تقنيةوتواصل
محاضرة الحوسبة السحابية لـ د.هبة كردي @SCSWomen #تقنيةوتواصلScswomen
 
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...Essam Obaid
 
Cloud computing and networking course: paper presentation -Data Mining for In...
Cloud computing and networking course: paper presentation -Data Mining for In...Cloud computing and networking course: paper presentation -Data Mining for In...
Cloud computing and networking course: paper presentation -Data Mining for In...Cristian Consonni
 
AWS Security Best Practices and Design Patterns
AWS Security Best Practices and Design PatternsAWS Security Best Practices and Design Patterns
AWS Security Best Practices and Design PatternsAmazon Web Services
 

Viewers also liked (16)

IEEE Projects, Non-IEEE Projects, Data Mining, Cloud computing, Main Projects...
IEEE Projects, Non-IEEE Projects, Data Mining, Cloud computing, Main Projects...IEEE Projects, Non-IEEE Projects, Data Mining, Cloud computing, Main Projects...
IEEE Projects, Non-IEEE Projects, Data Mining, Cloud computing, Main Projects...
 
Mobile security
Mobile securityMobile security
Mobile security
 
Cloud Computing Bootcamp On The Google App Engine [v1.1]
Cloud Computing Bootcamp On The Google App Engine [v1.1]Cloud Computing Bootcamp On The Google App Engine [v1.1]
Cloud Computing Bootcamp On The Google App Engine [v1.1]
 
Cloud computing 2015 ieee papers Data mining ieee project titles
Cloud computing  2015 ieee papers  Data mining ieee project titlesCloud computing  2015 ieee papers  Data mining ieee project titles
Cloud computing 2015 ieee papers Data mining ieee project titles
 
Data Science, Knowledge Discover, Mining and Learning
Data Science, Knowledge Discover, Mining and LearningData Science, Knowledge Discover, Mining and Learning
Data Science, Knowledge Discover, Mining and Learning
 
Curriculum vitae
Curriculum vitaeCurriculum vitae
Curriculum vitae
 
Secure cloud mining By ahlam
Secure cloud mining  By ahlamSecure cloud mining  By ahlam
Secure cloud mining By ahlam
 
What is IPX?
What is IPX?What is IPX?
What is IPX?
 
حكومة القرن الـ 21
حكومة القرن الـ 21حكومة القرن الـ 21
حكومة القرن الـ 21
 
Final ppt
Final pptFinal ppt
Final ppt
 
محاضرة الحوسبة السحابية لـ د.هبة كردي @SCSWomen #تقنيةوتواصل
محاضرة الحوسبة السحابية لـ د.هبة كردي @SCSWomen #تقنيةوتواصلمحاضرة الحوسبة السحابية لـ د.هبة كردي @SCSWomen #تقنيةوتواصل
محاضرة الحوسبة السحابية لـ د.هبة كردي @SCSWomen #تقنيةوتواصل
 
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...
 
Cloud computing and networking course: paper presentation -Data Mining for In...
Cloud computing and networking course: paper presentation -Data Mining for In...Cloud computing and networking course: paper presentation -Data Mining for In...
Cloud computing and networking course: paper presentation -Data Mining for In...
 
Security issues in cloud database
Security  issues  in cloud   database Security  issues  in cloud   database
Security issues in cloud database
 
AWS Security Best Practices and Design Patterns
AWS Security Best Practices and Design PatternsAWS Security Best Practices and Design Patterns
AWS Security Best Practices and Design Patterns
 
Introduction to AWS Security
Introduction to AWS SecurityIntroduction to AWS Security
Introduction to AWS Security
 

Similar to Slide 1

Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010BOSC 2010
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowDaniel S. Katz
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersIan Foster
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingJonathan Dursi
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdfLevLafayette1
 
Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...jaliyae
 
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Derryck Lamptey, MPhil, CISSP
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsRobert Grossman
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Larry Smarr
 
TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and ComputationTal Lavian Ph.D.
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
HPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud TechnologiesHPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud TechnologiesInderjeet Singh
 
Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Stavros Kontopoulos
 

Similar to Slide 1 (20)

Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Eg4301808811
Eg4301808811Eg4301808811
Eg4301808811
 
CLOUD BIOINFORMATICS Part1
 CLOUD BIOINFORMATICS Part1 CLOUD BIOINFORMATICS Part1
CLOUD BIOINFORMATICS Part1
 
Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013
 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance Workflow
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf
 
Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...
 
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013
 
TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and Computation
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
HPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud TechnologiesHPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud Technologies
 
Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Slide 1

  • 1. Cloud Computing, Data Mining and Cyberinfrastructure Chemistry in the Digital Age Workshop, Penn State University, June 11, 2009 Geoffrey Fox gcf@indiana.eduwww.infomall.org/salsa http://grids.ucs.indiana.edu/ptliupages/ (Presented by Marlon Pierce) Community Grids Laboratory, Chair Department of Informatics School of InformaticsIndiana University
  • 2. Cloud Computing: Infrastructure and Runtimes Cloud infrastructure: outsourcing of servers, computing, data, file space, etc. Handled through Web services that control virtual machine lifecycles. Cloud runtimes:: tools for using clouds to do data-parallel computations. Apache Hadoop, Google MapReduce, Microsoft Dryad, and others Designed for information retrieval but are excellent for a wide range of machine learning and science applications. Apache Mahout Also may be a good match for 32-128 core computers available in the next 5 years. Can also do traditional parallel computing Clustering algorithms: applications for cloud-based data mining. Run on cloud infrastructure Implement with cloud runtimes.
  • 3. Commercial Clouds Boldfaced names have open source versions
  • 4. Open Architecture Clouds Amazon, Google, Microsoft, et al., don’t tell you how to build a cloud. Proprietary knowledge Indiana University and others want to document this publically. What is the right way to build a cloud? It is more than just running software. What is the minimum-sized organization to run a cloud? Department? University? University Consortium? Outsource it all? Analogous issues in government, industry, and enterprise. Example issues: What hardware setups work best? What are you getting into? What is the best virtualization technology for different problems? What is the right way to implement S3- and EBS-like data services? Content Distribution Systems? Persistent, reliable service hosting?
  • 5. Cloud Runtimes What science can you do on a cloud?
  • 6. Data-File Parallelism and Clouds Now that you have a cloud, you may want to do large scale processing with it. Classic problems are to perform the same (sequential) algorithm on fragments of extremely large data sets. Cloud runtime engines manage these replicated algorithms in the cloud. Can be chained together in pipelines (Hadoop) or DAGs(Dryad). Runtimes manage problems like failure control. We are exploring both scientific applications and classic parallel algorithms (clustering, matrix multiplication) using Clouds and cloud runtimes.
  • 7. 7 H n Y Y U U 4n S S 4n M M n D D n X X U U N N MapReduce implemented by Hadoop map(key, value) reduce(key, list<value>) Example: Word Histogram Start with a set of words Each map task counts number of occurrences in each data partition Reduce phase adds these counts Dryadsupports general dataflow
  • 8. Geospatial Examples Image processing and mining Ex: SAR Images from Polar Grid project (J. Wang) Apply to 20 TB of data Flood modeling I Chaining flood models over a geographic area. Flood modeling II Parameter fits and inversion problems. Real time GPS processing Filter
  • 9. File/Data Parallel Examples from Biology EST (Expressed Sequence Tag) Assembly: (Dong) 2 million mRNA sequences generates 540000 files taking 15 hours on 400 TeraGrid nodes (CAP3 run dominates) MultiParanoid/InParanoid gene sequence clustering: (Dong) 476 core years just for Prokaryotes Population Genomics: (Lynch) Looking at all pairs separated by up to 1000 nucleotides Sequence-based transcriptome profiling: (Cherbas, Innes) MAQ, SOAP Systems Microbiology: (Brun) BLAST, InterProScan Metagenomics: (Fortenberry, Nelson) Pairwise alignment of 7243 16s sequence data took 12 hours on TeraGrid All can use Dryad or Hadoop 9
  • 11. Kmeans Clustering Performance – 128 CPU cores Overhead Perform Kmeans clustering for up to 40 million 3D data points Amount of communication depends only on the number of cluster centers Amount of communication << Computation and the amount of data processed At the highest granularity VMs show at least 3.5 times overhead compared to bare-metal Extremely large overheads for smaller grain sizes
  • 12. Deterministic Annealing for Clustering Highly parallelizable algorithms for data-mining
  • 13. DeterministicAnnealing F({y}, T) Solve Linear Equations for each temperature Nonlinearity effects mitigated by initializing with solution at previous higher temperature Configuration {y} Minimum evolving as temperature decreases Movement at fixed temperature going to local minima if not initialized “correctly
  • 14. Various Sequence Clustering Results 14 3000 Points : Clustal MSAKimura2 Distance 4500 Points : Pairwise Aligned 4500 Points : Clustal MSA Map distances to 4D Sphere before MDS
  • 15. Obesity Patient ~ 20 dimensional data 15 Will use our 8 node Windows HPC system to run 36,000 records Working with Gilbert Liu IUPUI to map patient clusters to environmental factors 2000 records 6 Clusters Refinement of 3 of clusters to left into 5 4000 records 8 Clusters
  • 16. Conclusions We believe Cloud Computing will dramatically change scientific computing infrastructure. No more clusters in the closet? Controllable, sustainable computing. But we need to know what we are getting into. Even better, clouds (wherever they are) are well suited for a wide range of scientific and computer science applications. We are exploring some of these in biology, clustering, geospatial processing, and hopefully chemistry.