Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes for an HPC Enhanced Cloud and Fog Spanning IoT Big Data and Big Simulations
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing current capabilities of Apache Hadoop, Spark, Flink and Heron as well as MPI and Asynchronous Many Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function as a Service Architecture. Note this "new grid" is focussed on data and IoT; not computing. Use interoperable common abstractions but multiple polymorphic implementations.
5th Multicore World
15-17 February 2016 – Shed 6, Wellington, New Zealand
http://openparallel.com/multicore-world-2016/
We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Simulations) in the same way. This leads to 64 properties divided into 4 views, which are Problem Architecture (Macro pattern); Execution Features (Micro patterns); Data Source and Style; and finally the Processing (runtime) View.
We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) http://hpc-abds.org/kaleidoscope/ and show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack.
We give examples of data analytics running on HPC systems including details on persuading Java to run fast.
Some details can be found at http://dsc.soic.indiana.edu/publications/HPCBigDataConvergence.pdf
High Performance Processing of Streaming Data Geoffrey Fox
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack -- SLAM (Simultaneous Localization & Mapping) and collision avoidance. Performance (response time) studied and improved as example of HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
We present a software model built on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack
We give examples of integrating ABDS with HPC
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
Comparing Big Data and Simulation Applications and Implications for Software ... Geoffrey Fox
At eScience in the Cloud 2014, Redmond WA, April 30 2014
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data intensive computing, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on combining HPC and the Apache software stack that is well used in modern cloud computing.
Initial results on Azure and HPC Clusters are presented
Matching Data Intensive Applications and Hardware/Software Architectures Geoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data... Geoffrey Fox
Advances in high-performance/parallel computing in the 1980s and 90s were spurred by the development of quality high-performance libraries, e.g., SCALAPACK, as well as by well-established benchmarks, such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we argue that such benchmarks should be motivated by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" bigdata stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics.
We hope this will encourage debate
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... inside-BigData.com
DK Panda from Ohio State University presented this deck at the Switzerland HPC Conference.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Memcached on modern HPC clusters. An overview of RDMA-based designs for multiple components of Hadoop (HDFS, MapReduce, RPC and HBase), Spark, and Memcached will be presented. Enhanced designs for these components to exploit in-memory technology and parallel file systems (such as Lustre) will be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video presentation: https://www.youtube.com/watch?v=glf2KITDdVs
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Visualizing and Clustering Life Science Applications in Parallel Geoffrey Fox
HiCOMB 2015, 14th IEEE International Workshop on High Performance Computational Biology at IPDPS 2015, Hyderabad, India.
This talk covers parallel data analytics for bioinformatics. Messages are:
Always run MDS. It gives insight into the data and into the performance of machine learning
Leads to a data browser, as GIS gives for spatial data
3D better than 2D
~20D better than MSA?
Clustering Observations
Do you care about quality, or are you just cutting up space into parts?
Deterministic clustering always makes results more robust
Continuous clustering enables hierarchy
Trimmed clustering cuts off tails
Distinct O(N) and O(N^2) algorithms
Use Conjugate Gradient
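As a rough illustration of the MDS advice above, the following is a minimal sketch of multidimensional scaling that embeds items in 3D by minimizing the pairwise stress. It is not the implementation used in the talk: the function names `stress` and `mds`, the step size, and the use of plain gradient descent (rather than the conjugate gradient the talk recommends) are assumptions made for brevity. The double loop over pairs also shows why MDS falls in the O(N^2) class.

```python
import math
import random


def stress(points, target):
    """Sum over pairs i<j of (embedded distance - target distance)^2."""
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(points[i], points[j]) - target[i][j]) ** 2
    return total


def mds(target, dim=3, steps=500, rate=0.01, seed=0):
    """Embed N items in `dim` dimensions by gradient descent on the stress.

    `target` is an N x N symmetric matrix of desired distances.
    Plain gradient descent is an assumption here; it stands in for
    the conjugate gradient solver mentioned in the talk.
    """
    rng = random.Random(seed)
    n = len(target)
    points = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):  # O(N^2) pair interactions per step
                if i == j:
                    continue
                d = math.dist(points[i], points[j]) or 1e-12  # avoid divide by zero
                scale = 2.0 * (d - target[i][j]) / d
                for k in range(dim):
                    grads[i][k] += scale * (points[i][k] - points[j][k])
        for i in range(n):
            for k in range(dim):
                points[i][k] -= rate * grads[i][k]
    return points
```

A typical use would be to build `target` from sequence or feature dissimilarities, call `mds(target)`, and plot the returned 3D coordinates in a point browser.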
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C... Geoffrey Fox
Describes relations between Big Data and Big Simulation Applications and how this can guide a Big Data - Exascale (Big Simulation) Convergence (as in National Strategic Computing Initiative) and lead to a "complete" set of Benchmarks. Basic idea is to view use cases as "Data" + "Model"
High Performance Data Analytics with Java on Large Multicore HPC Clusters Saliya Ekanayake
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytic applications is challenging. Also, the study of performance characteristics and high performance optimizations is lacking in the literature for these applications. By contrast, these features are well documented in the High Performance Computing (HPC) domain and some of the techniques have potential performance benefits in the big data domain as well. This paper presents the implementation of a high performance big data analytics library - SPIDAL Java - with a comprehensive discussion on five performance challenges, solutions, and speedup results. SPIDAL Java captures a class of global machine learning applications with significant computation and communication that can serve as a yardstick in studying performance bottlenecks with Java big data analytics. The five challenges presented here are the cost of intra-node messaging, inefficient cache utilization, performance costs with threads, overhead of garbage collection, and the costs of heap allocated objects. SPIDAL Java presents its solutions to these and demonstrates significant performance gains and scalability when running on up to 3072 cores in one of the latest Intel Haswell-based multicore clusters.
Big Data HPC Convergence and a bunch of other things Geoffrey Fox
This talk supports the Ph.D. in Computational & Data Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomena, jobs and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC Geoffrey Fox
This proposes an integration of HPC and Apache Technologies. HPC-ABDS+ Integration areas include
File systems,
Cluster resource management,
File and object data management,
Inter process and thread communication,
Analytics libraries,
Workflow
Monitoring
Classification of Big Data Use Cases by different Facets Geoffrey Fox
Ogres classify Big Data applications by multiple facets – each with several exemplars and features. This gives a guide to the breadth and depth of Big Data and allows one to examine which ogres a particular architecture/software supports.
Matching Data Intensive Applications and Hardware/Software Architectures Geoffrey Fox
This document discusses matching data intensive applications to hardware and software architectures. It provides examples of over 50 big data applications and analyzes their characteristics to identify common patterns. These patterns are used to propose a "big data version" of the Berkeley dwarfs and NAS parallel benchmarks for evaluating data-intensive systems. The document also analyzes hardware architectures from clouds to HPC and proposes integrating HPC concepts into the Apache software stack to develop an HPC-ABDS software stack for high performance data analytics. Key aspects of applications, hardware, and software architectures are illustrated with examples and diagrams.
High Performance Data Analytics and a Java Grande Run Time Geoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data intensive computing, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is the value of a high performance Java (Grande) runtime that supports simulations and big data.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS.
The Open Science Data Cloud: Empowering the Long Tail of Science Robert Grossman
The Open Cloud Consortium operates the Open Science Data Cloud, a not-for-profit cloud computing infrastructure that supports scientific research. The Open Cloud Consortium manages cloud computing testbeds and resources donated by universities, companies, government agencies, and international partners. Its goal is to democratize access to data and computing power for scientific discovery through its Open Science Data Cloud.
This document summarizes a seminar presentation on big data analytics. It reviews 25 research papers published between 2011-2014 on issues related to big data analysis, real-time big data analysis using Hadoop in cloud computing, and classification of big data using tools and frameworks. The review process involved a 5-stage analysis of the papers. Key issues identified include big data analysis, real-time analysis using Hadoop in clouds, and classification using tools like Hadoop, MapReduce, HDFS. Promising solutions discussed are MapReduce Agent Mobility framework, PuntStore with pLSM index, IOT-StatisticDB statistical database mechanism, and visual clustering analysis.
This document discusses image search and analysis techniques for remote sensing data. It describes an index management system that takes in data and indexes it using column-based databases. Images are analyzed to extract features that allow for image search based on compression in compressed streams. Queries can be performed on the indexed data to return similar images based on semantic labels and normalized distances from queries. Examples are provided using different remote sensing datasets, including GeoEye, DigitalGlobe, and TerraSAR-X images.
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions csandit
Analyzing interconnection structures among the data through the use of graph algorithms and graph analytics has been shown to provide tremendous value in many application domains (like social networks, protein networks, transportation networks, bibliographical networks, knowledge bases and many more). Nowadays, graphs with billions of nodes and trillions of edges have become very common. In principle, graph analytics is an important big data discovery technique. Therefore, with the increasing abundance of large scale graphs, designing scalable systems for processing and analyzing large scale graphs has become one of the timeliest problems facing the big data research community. In general, distributed processing of big graphs is a challenging task due to their size and the inherent irregular structure of graph computations. In this paper, we present a comprehensive overview of the state-of-the-art to better understand the challenges of developing highly scalable graph processing systems. In addition, we identify a set of the current open research challenges and discuss some promising directions for future research.
The Matsu Project - Open Source Software for Processing Satellite Imagery Data Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
Scientific Application Development and Early results on Summit Ganesan Narayanasamy
The document summarizes Oak Ridge National Laboratory's (ORNL) new supercomputer Summit and its capabilities for scientific applications and early results. Summit is described as the most powerful and smartest supercomputer in the world, with 200 petaflops of performance and capabilities well-suited for machine learning and artificial intelligence applications. ORNL is preparing scientific applications for Summit through its Center for Accelerated Application Readiness program to enable early science results and ensure applications are optimized for Summit's architecture.
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial) Robert Grossman
This document provides an introduction to data intensive computing. It discusses how advances in instruments are producing massive amounts of data, creating new paradigms of "data intensive science" and computing. It also discusses how utility clouds like Amazon and data clouds are addressing this challenge by providing on-demand access to vast computing resources and data storage at large scale. The document outlines different models for responsibility between cloud service providers and customers.
1. The document discusses the limitations of Hadoop for advanced analytics tasks beyond basic statistics like mean and variance.
2. It introduces several distributed data analytics platforms like Spark, Storm, and GraphLab that can perform tasks like linear algebra, graph processing, and iterative machine learning algorithms more efficiently than Hadoop.
3. Specific use cases from companies that moved from Hadoop to these platforms are discussed, where they saw significantly faster performance for tasks like logistic regression, collaborative filtering, and k-means clustering.
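To make the point about iterative algorithms concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is an illustration only and is not taken from the summarized document; the function names `assign`, `update` and `kmeans` are hypothetical. Each iteration is one full pass over the data, which is why a platform that keeps the points in memory between iterations can be expected to avoid the repeated disk reads of a chain of Hadoop jobs.

```python
import math


def assign(points, centers):
    """Map-like step: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def update(points, labels, centers):
    """Reduce-like step: move each center to the mean of its assigned points."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its old center
            continue
        new_centers.append(tuple(sum(m[k] for m in members) / len(members)
                                 for k in range(len(old))))
    return new_centers


def kmeans(points, centers, max_iters=100):
    """Repeat assign/update until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iters):
        labels = assign(points, centers)      # one full pass over the data
        updated = update(points, labels, centers)
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)
```

In a MapReduce setting, `assign` corresponds roughly to the map phase and `update` to the reduce phase, with one job launched per iteration.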
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video: https://youtu.be/iLTYkTandEA
Learn more: http://web.cse.ohio-state.edu/~panda.2/ and http://hpcadvisorycouncil.com
This document provides a history and market overview of Apache Spark. It discusses the motivation for distributed data processing due to increasing data volumes, velocities and varieties. It then covers brief histories of Google File System, MapReduce, BigTable, and other technologies. Hadoop and MapReduce are explained. Apache Spark is introduced as a faster alternative to MapReduce that keeps data in memory. Competitors like Flink, Tez and Storm are also mentioned.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce and HDFS to parallelize tasks, distribute data storage, and provide fault tolerance. Applications of Hadoop include log analysis, data mining, and machine learning using large datasets at companies like Yahoo!, Facebook, and The New York Times.
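The MapReduce model described above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases using word count, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical. In Hadoop, the map and reduce calls would run in parallel on different nodes and the shuffle would move data across the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())
```

For example, `word_count(["big data big compute", "data cloud"])` counts "big" and "data" twice each and the other words once.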
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C...Geoffrey Fox
Describes relations between Big Data and Big Simulation Applications and how this can guide a Big Data - Exascale (Big Simulation) Convergence (as in National Strategic Computing Initiative) and lead to a "complete" set of Benchmarks. Basic idea is to view use cases as "Data" + "Model"
High Performance Data Analytics with Java on Large Multicore HPC ClustersSaliya Ekanayake
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries
such as Apache Hadoop, Spark, and Storm. While these
systems are rich in interoperability and features, developing
high performance big data analytic applications is challenging.
Also, the study of performance characteristics and
high performance optimizations is lacking in the literature for
these applications. By contrast, these features are well documented in the High Performance Computing (HPC) domain and some of the techniques have potential performance benefits in the big data domain as well. This paper presents the implementation of a high performance big data analytics library - SPIDAL Java - with a comprehensive discussion on five performance challenges, solutions, and speedup results. SPIDAL Java captures a class of global machine learning applications with significant computation and communication that can serve as a yardstick in studying performance bottlenecks with Java big data analytics. The five challenges present here are the cost of intra-node messaging, inefficient cache utilization, performance costs with threads, overhead of garbage collection, and the costs of heap allocated objects. SPIDAL Java presents its solutions to these and demonstrates significant performance gains and scalability when running on up to 3072 cores in one of the latest Intel Haswell-based multicore clusters.
Big Data HPC Convergence and a bunch of other thingsGeoffrey Fox
This talk supports the Ph.D. in Computational & Data Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomena, jobs and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC Geoffrey Fox
This proposes an integration of HPC and Apache Technologies. HPC-ABDS+ Integration areas include
File systems,
Cluster resource management,
File and object data management,
Inter process and thread communication,
Analytics libraries,
Workflow
Monitoring
Classification of Big Data Use Cases by different FacetsGeoffrey Fox
Ogres classify Big Data applications by multiple facets – each with several exemplars and features. This gives a
guide to breadth and depth of Big Data and allows one to examine which ogres a particular architecture/software support.
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
This document discusses matching data intensive applications to hardware and software architectures. It provides examples of over 50 big data applications and analyzes their characteristics to identify common patterns. These patterns are used to propose a "big data version" of the Berkeley dwarfs and NAS parallel benchmarks for evaluating data-intensive systems. The document also analyzes hardware architectures from clouds to HPC and proposes integrating HPC concepts into the Apache software stack to develop an HPC-ABDS software stack for high performance data analytics. Key aspects of applications, hardware, and software architectures are illustrated with examples and diagrams.
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However the same is not so true for data intensive even though commercially clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is the value of a high performance Java (Grande) runtime that supports simulations and big data.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS.
The Open Science Data Cloud: Empowering the Long Tail of Science - Robert Grossman
The Open Cloud Consortium operates the Open Science Data Cloud, a not-for-profit cloud computing infrastructure that supports scientific research. The Open Cloud Consortium manages cloud computing testbeds and resources donated by universities, companies, government agencies, and international partners. Its goal is to democratize access to data and computing power for scientific discovery through its Open Science Data Cloud.
This document summarizes a seminar presentation on big data analytics. It reviews 25 research papers published between 2011-2014 on issues related to big data analysis, real-time big data analysis using Hadoop in cloud computing, and classification of big data using tools and frameworks. The review process involved a 5-stage analysis of the papers. Key issues identified include big data analysis, real-time analysis using Hadoop in clouds, and classification using tools like Hadoop, MapReduce, HDFS. Promising solutions discussed are MapReduce Agent Mobility framework, PuntStore with pLSM index, IOT-StatisticDB statistical database mechanism, and visual clustering analysis.
This document discusses image search and analysis techniques for remote sensing data. It describes an index management system that takes in data and indexes it using column-based databases. Images are analyzed to extract features that allow for image search based on compression in compressed streams. Queries can be performed on the indexed data to return similar images based on semantic labels and normalized distances from queries. Examples are provided using different remote sensing datasets, including GeoEye, DigitalGlobe, and TerraSAR-X images.
Big Graph: Tools, Techniques, Issues, Challenges and Future Directions - csandit
Analyzing interconnection structures among the data through the use of graph algorithms and graph analytics has been shown to provide tremendous value in many application domains (like social networks, protein networks, transportation networks, bibliographical networks, knowledge bases and many more). Nowadays, graphs with billions of nodes and trillions of edges have become very common. In principle, graph analytics is an important big data discovery technique. Therefore, with the increasing abundance of large scale graphs, designing scalable systems for processing and analyzing large scale graphs has become one of the timeliest problems facing the big data research community. In general, distributed processing of big graphs is a challenging task due to their size and the inherent irregular structure of graph computations. In this paper, we present a comprehensive overview of the state-of-the-art to better understand the challenges of developing highly scalable graph processing systems. In addition, we identify a set of the current open research challenges and discuss some promising directions for future research.
The Matsu Project - Open Source Software for Processing Satellite Imagery Data - Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
Scientific Application Development and Early results on Summit - Ganesan Narayanasamy
The document summarizes Oak Ridge National Laboratory's (ORNL) new supercomputer Summit and its capabilities for scientific applications and early results. Summit is the most powerful and smartest supercomputer in the world, with 200 petaflops of performance and capabilities well-suited for machine learning and artificial intelligence applications. ORNL is preparing scientific applications for Summit through its Center for Accelerated Application Readiness program to enable early science results and ensure applications are optimized for Summit's architecture.
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial) - Robert Grossman
This document provides an introduction to data intensive computing. It discusses how advances in instruments are producing massive amounts of data, creating new paradigms of "data intensive science" and computing. It also discusses how utility clouds like Amazon and data clouds are addressing this challenge by providing on-demand access to vast computing resources and data storage at large scale. The document outlines different models for responsibility between cloud service providers and customers.
1. The document discusses the limitations of Hadoop for advanced analytics tasks beyond basic statistics like mean and variance.
2. It introduces several distributed data analytics platforms like Spark, Storm, and GraphLab that can perform tasks like linear algebra, graph processing, and iterative machine learning algorithms more efficiently than Hadoop.
3. Specific use cases from companies that moved from Hadoop to these platforms are discussed, where they saw significantly faster performance for tasks like logistic regression, collaborative filtering, and k-means clustering.
Similar to Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes for an HPC Enhanced Cloud and Fog Spanning IoT Big Data and Big Simulations
We present a software model built on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack.
We give examples of integrating ABDS with HPC.
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators.
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... - inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video: https://youtu.be/iLTYkTandEA
Learn more: http://web.cse.ohio-state.edu/~panda.2/ and http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document provides a history and market overview of Apache Spark. It discusses the motivation for distributed data processing due to increasing data volumes, velocities and varieties. It then covers brief histories of Google File System, MapReduce, BigTable, and other technologies. Hadoop and MapReduce are explained. Apache Spark is introduced as a faster alternative to MapReduce that keeps data in memory. Competitors like Flink, Tez and Storm are also mentioned.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce and HDFS to parallelize tasks, distribute data storage, and provide fault tolerance. Applications of Hadoop include log analysis, data mining, and machine learning using large datasets at companies like Yahoo!, Facebook, and The New York Times.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce as a programming model and HDFS for storage. MapReduce divides applications into parallelizable map and reduce tasks that process key-value pairs across large datasets in a reliable and fault-tolerant manner. HDFS stores multiple replicas of data blocks for reliability and allows processing of data in parallel on nodes where the data is located. Hadoop can reliably store and process petabytes of data on thousands of low-cost commodity hardware nodes.
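The map and reduce phases described above can be illustrated with a small single-process sketch. This is not Hadoop code: it only mimics the key-value flow in plain Python, and the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) key-value pairs for one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big compute", "data locality"]))
```

In a real deployment each `map_phase` call could run on the node holding the corresponding data block, which is the data-locality property the description mentions; here everything runs in one process for clarity.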
This document provides an overview of Hadoop and related big data technologies. It discusses the core Hadoop projects like HDFS, MapReduce, Hive and Spark. It also covers ingestion tools like Flume and Sqoop and real-time streaming tools like Storm and Kafka. Example use cases for web analytics, data warehousing and IoT are presented. Finally deployment options on premise and in the cloud are briefly discussed.
This document provides an overview of the SPIDAL Dibbs project which aims to develop middleware and high performance analytics libraries for scalable data science. The project involves multiple universities and focuses on developing tools like HPC-ABDS to enable interoperability between high performance computing and Apache big data stack technologies. It also involves developing applications in various domains, the SPIDAL library for scalable analytics, and benchmarks to evaluate performance.
This document discusses running the Apache Spark framework on HPC clusters at Virginia Tech (VT) for big data analytics and machine learning. It describes implementing Spark on the VT Advanced Research Computing (ARC) clusters, which allow both fine-grained parallelism for machine learning algorithms and coarse-grained parallelism for big data. Evaluation results show the resource utilization of Spark deployed in standalone and YARN modes at different scales. Future work aims to examine scheduler overhead, shared resource contention, running machine learning on real network logs, and analyzing performance on streaming data.
The document discusses accelerating Apache Hadoop through high-performance networking and I/O technologies. It describes how technologies like InfiniBand, RoCE, SSDs, and NVMe can benefit big data applications by alleviating bottlenecks. It outlines projects from the High-Performance Big Data project that implement RDMA for Hadoop, Spark, HBase and Memcached to improve performance. Evaluation results demonstrate significant acceleration of HDFS, MapReduce, and other workloads through the high-performance designs.
This document proposes a container-based sizing framework for Apache Hadoop/Spark clusters that uses a multi-objective genetic algorithm approach. It emulates container execution on different cloud platforms to optimize configuration parameters for minimizing execution time and deployment cost. The framework uses Docker containers with resource constraints to model cluster performance on various public clouds and instance types. Optimization finds Pareto-optimal configurations balancing time and cost across objectives.
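To make the notion of Pareto-optimal configurations concrete, here is a minimal sketch that filters a set of candidate cluster configurations by two objectives to be minimized, execution time and cost. It is not the genetic algorithm from the summarized work; the configuration names and numbers are invented for illustration, and `dominates` and `pareto_front` are hypothetical helper names.

```python
def dominates(a, b):
    """True if a is no worse than b in both objectives and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(configs):
    """Return the sorted names of configurations that no other one dominates.

    configs maps a name to a (time, cost) tuple; both are minimized.
    """
    return sorted(
        name
        for name, objectives in configs.items()
        if not any(dominates(other, objectives) for other in configs.values())
    )


if __name__ == "__main__":
    # Hypothetical (execution time in minutes, cost in dollars) per configuration.
    candidates = {
        "small": (120.0, 1.0),
        "medium": (60.0, 2.5),
        "large": (30.0, 6.0),
        "wasteful": (65.0, 4.0),
    }
    print(pareto_front(candidates))
```

In this made-up data the "wasteful" configuration is both slower and more expensive than "medium", so it drops out, while the other three each represent a different trade-off between time and cost.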
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems - inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the challenges in designing runtime environments for MPI+X (PGAS-OpenSHMEM/UPC/CAF/UPC++, OpenMP and CUDA) programming models by taking into account support for multi-core systems (KNL and OpenPower), high-performance networks, GPGPUs (including GPUDirect RDMA) and energy awareness. Features and sample performance numbers from MVAPICH2 libraries will be presented. For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-enabled Big Data stacks. Finally, we will outline the challenges in moving these middleware to the Cloud environments."
Watch the video: https://youtu.be/i2I6XqOAh_I
Learn more: http://web.cse.ohio-state.edu/~panda.2/ and http://hpcadvisorycouncil.com
Apache Spark and the Emerging Technology Landscape for Big Data - Paco Nathan
The document discusses Apache Spark and its role in big data and emerging technologies for big data. It provides background on MapReduce and the emergence of specialized systems. It then discusses how Spark provides a unified engine for batch processing, iterative jobs, SQL queries, streaming, and more. It can simplify programming by using a functional approach. The document also discusses Spark's architecture and performance advantages over other frameworks.
Sahara is an OpenStack project that provides an abstraction layer for provisioning and managing Apache Hadoop clusters and jobs in OpenStack clouds. It allows users to easily deploy and scale Hadoop clusters on demand without having to manage the underlying infrastructure. Sahara uses plugins to integrate various Hadoop distributions like Hortonworks Data Platform (HDP) and Cloudera Distribution including Apache Hadoop (CDH). It leverages other OpenStack services like Nova, Neutron, Swift, Cinder, Heat etc. to provision, configure and manage the Hadoop clusters and jobs.
Hadoop and OpenStack - Hadoop Summit San Jose 2014 - spinningmatt
This document discusses Hadoop and OpenStack Sahara. Sahara is an OpenStack project that allows users to provision and manage Hadoop clusters within OpenStack. It provides a plugin mechanism to support different Hadoop distributions like Hortonworks Data Platform (HDP). The HDP plugin fully integrates HDP clusters with Sahara using the Ambari API for cluster management. Sahara handles tasks like cluster scaling, integration with Swift for storage, and data locality. Its plugin architecture allows different Hadoop versions and distributions to be deployed and managed through Sahara.
Designing HPC & Deep Learning Middleware for Exascale Systems - inside-BigData.com
DK Panda from Ohio State University presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"This talk will focus on challenges in designing runtime environments for exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPGPUs and Intel MIC), virtualization technologies (KVM, Docker, and Singularity), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video: http://wp.me/p3RLHQ-glW
Learn more: http://hpcadvisorycouncil.com
High Performance Computing and Big Data - Geoffrey Fox
This document proposes a hybrid software stack that combines large-scale data systems from both research and commercial applications. It runs the commodity Apache Big Data Stack (ABDS) using enhancements from High Performance Computing (HPC) to improve performance. Examples are given from bioinformatics and financial informatics. Parallel and distributed runtimes like MPI, Storm, Heron, Spark and Flink are discussed, distinguishing between parallel (tightly-coupled) and distributed (loosely-coupled) systems. The document also discusses optimizing Java performance and differences between capacity and capability computing. Finally, it explains how this HPC-ABDS concept allows convergence of big data, big simulation, cloud and HPC systems.
Big Data Analytics with Hadoop, MongoDB and SQL Server - Mark Kromer
This document discusses SQL Server and big data analytics projects in the real world. It covers the big data technology landscape, big data analytics, and three big data analytics scenarios using different technologies like Hadoop, MongoDB, and SQL Server. It also discusses SQL Server's role in the big data world and how to get data into Hadoop for analysis.
Designing High-Performance and Scalable Middleware for HPC, AI and Data Science - Object Automation
This document discusses designing high-performance middleware for HPC, AI, and data science applications. It provides an overview of the MVAPICH2 project, which develops an open-source MPI library supporting modern HPC architectures and networking technologies. MVAPICH2 aims to provide a converged software stack for HPC, deep learning, and data science through libraries like MVAPICH2, HiDL, and HiBD. The document outlines challenges in communication library design for exascale systems and MVAPICH2's architecture supporting programming models across domains.
The document provides an overview of the Apache Hadoop ecosystem. It describes Hadoop as a distributed, scalable storage and computation system based on Google's architecture. The ecosystem includes many related projects that interact, such as YARN, HDFS, Impala, Avro, Crunch, and HBase. These projects innovate independently but work together, with Hadoop serving as a flexible data platform at the core.
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi... - OpenStack
Audience Level
Intermediate
Synopsis
High performance computing and cloud computing have traditionally been seen as separate solutions to separate problems, dealing with issues of performance and flexibility respectively. In a diverse research environment however, both sets of compute requirements can occur. In addition to the administrative benefits in combining both requirements into a single unified system, opportunities are provided for incremental expansion.
The deployment of the Spartan cloud-HPC hybrid system at the University of Melbourne last year is an example of such a design. Despite its small size, it has attracted international attention due to its design features. This presentation, in addition to providing a grounding on why one would wish to build an HPC-cloud hybrid system and the results of the deployment, provides a complete technical overview of the design from the ground up, as well as problems encountered and planned future developments.
Speaker Bio
Lev Lafayette is the HPC and Training Officer at the University of Melbourne. Prior to that he worked at the Victorian Partnership for Advanced Computing for several years in a similar role.
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput... - Geoffrey Fox
Most things are dominated by Artificial Intelligence (AI). Technology Companies like Amazon, Google, Facebook, and Microsoft are AI First organizations.
Engineering achievement today is highlighted by the AI buried in a vehicle or machine. Industry (Manufacturing) 4.0 focusses on the AI-Driven future of the Industrial Internet of Things.
Software is eating the world.
We can describe much computer systems work as designing, building and using the Global AI and Modelling supercomputer which itself is autonomously tuned by AI. We suggest that this is not just a bunch of buzzwords but has profound significance and examine consequences of this for education and research.
Naively high-performance computing should be relevant for the AI supercomputer but somehow the corporate juggernaut is not making so much use of it. We discuss how to change this.
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC... - Geoffrey Fox
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytic applications is challenging. Also, the study of performance characteristics and high performance optimizations is lacking in the literature for these applications. By contrast, these features are well documented in the High Performance Computing (HPC) domain and some of the techniques have potential performance benefits in the big data domain as well. This paper identifies a class of machine learning applications with significant computation and communication as a yardstick and presents five optimizations to yield high performance in Java big data analytics. Also, it incorporates these optimizations in developing SPIDAL Java - a highly optimized suite of Global Machine Learning (GML) applications. The optimizations include intra-node messaging through memory maps over network calls, improving cache utilization, reliance on processes over threads, zero garbage collection, and employing offheap buffers to load and communicate data. SPIDAL Java demonstrates significant performance gains and scalability with these techniques when running on up to 3072 cores in one of the latest Intel Haswell-based multicore clusters.
http://dsc.soic.indiana.edu/publications/hpc2016-spidal-high-performance-submit-18-public.pdf
http://dsc.soic.indiana.edu/presentations/SPIDALJava.pptx
DTW: 2015 Data Teaching Workshop – 2nd IEEE STC CC and RDA Workshop on Curricula and Teaching Methods in Cloud Computing, Big Data, and Data Science, as part of CloudCom 2015 (http://2015.cloudcom.org/), Vancouver, Nov 30-Dec 3, 2015.
Discusses the Indiana University Data Science Program and experience with online education; the program is available in both online and residential modes. We end by discussing two classes taught both online and residentially by Geoffrey Fox. One is BDAA: Big Data Applications & Analytics; the other is BDOSSP: Big Data Open Source Software and Projects. Links are
http://openedx.scholargrid.org/ BDAA Fall 2015
http://datascience.scholargrid.org/ BDOSSP Spring 2016
http://bigdataopensourceprojects.soic.indiana.edu/ Spring 2015
Lessons from Data Science Program at Indiana University: Curriculum, Students... - Geoffrey Fox
Invited talk at the NSF/TCPP Workshop on Parallel and Distributed Computing Education (EduPar) at IPDPS 2015, May 25, 2015, Hyderabad.
Discusses the Indiana University Data Science Program and experience with online education; the program is available in both online and residential modes. We end by discussing two classes taught both online and residentially by Geoffrey Fox. One is BDAA: Big Data Applications & Analytics https://bigdatacourse.appspot.com/course. The other is BDOSSP: Big Data Open Source Software and Projects http://bigdataopensourceprojects.soic.indiana.edu/
Data Science Curriculum at Indiana University - Geoffrey Fox
The document provides details about the Data Science curriculum at Indiana University. It discusses the background of the School of Informatics and Computing, including its establishment and inclusion of computer science, library and information science programs. It then describes the Data Science certificate and masters programs, including course requirements, tracks, and admissions. The programs aim to provide students with skills in data analysis, lifecycle, management, and applications through coursework in relevant technical areas.
Experience with Online Teaching with Open Source MOOC Technology - Geoffrey Fox
This memo describes experiences with online teaching in Spring Semester 2014. We discuss the technologies used and the approach to teaching/learning.
This work is based on Google Course Builder for a Big Data overview course.
Big Data and Clouds: Research and Education - Geoffrey Fox
Presentation September 9 2013 PPAM 2013 Warsaw
Economic Imperative: There are a lot of data and a lot of jobs
Computing Model: Industry adopted clouds which are attractive for data analytics. HPC also useful in some cases
Progress in scalable robust Algorithms: new data need different algorithms than before
Progress in Data Intensive Programming Models
Progress in Data Science Education: opportunities at universities
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ... - Geoffrey Fox
Keynote at Sixth International Workshop on Cloud Data Management CloudDB 2014 Chicago March 31 2014.
Abstract: We introduce the NIST collection of 51 use cases and describe their scope over industry, government and research areas. We look at their structure from several points of view or facets covering problem architecture, analytics kernels, micro-system usage such as flops/bytes, application class (GIS, expectation maximization) and very importantly data source.
We then propose that in many cases it is wise to combine the well known commodity best practice (often Apache) Big Data Stack (with ~120 software subsystems) with high performance computing technologies.
We describe this and give early results based on clustering running with different paradigms.
We identify key layers where HPC Apache integration is particularly important: File systems, Cluster resource management, File and object data management, Inter process and thread communication, Analytics libraries, Workflow and Monitoring.
See
[1] A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures, Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha and Geoffrey Fox, accepted in IEEE BigData 2014, available at: http://arxiv.org/abs/1403.1528
[2] High Performance High Functionality Big Data Software Stack, G Fox, J Qiu and S Jha, in Big Data and Extreme-scale Computing (BDEC), 2014. Fukuoka, Japan. http://grids.ucs.indiana.edu/ptliupages/publications/HPCandApacheBigDataFinal.pdf
FutureGrid Computing Testbed as a Service - Geoffrey Fox
Describes FutureGrid and its role as a Computing Testbed as a Service. FutureGrid is user-customizable, accessed interactively and supports Grid, Cloud and HPC software with and without VMs. Lessons learnt and example use cases are described.
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp... - Geoffrey Fox
Motivating Introduction to MOOC on Big Data from an applications point of view https://bigdatacoursespring2014.appspot.com/course
Course says:
Geoffrey motivates the study of X-informatics by describing data science and clouds. He starts with striking examples of the data deluge with examples from research, business and the consumer. The growing number of jobs in data science is highlighted. He describes industry trends in both clouds and big data.
He introduces the cloud computing model developed at amazing speed by industry. The 4 paradigms of scientific research are described with growing importance of the data oriented version. He covers 3 major X-informatics areas: Physics, e-Commerce and Web Search followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on data science education and the benefits of using MOOCs.
CTS Conference Web 2.0 Tutorial Part 1 - Geoffrey Fox
The document discusses emerging technologies for distributed computing including Web services, grids, and Web 2.0. It describes how these technologies combine to build electronic infrastructures for applications like e-science, e-business, and net-centric computing. These infrastructures exploit internet technologies and provide integrated access to data, people, and resources as distributed services.
Advanced control scheme of doubly fed induction generator for wind turbine us... - IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024 - Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Embedded machine learning-based road conditions and driving behavior monitoring - IJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... - IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including an impressive global accuracy of 99.286%, a high-class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of our proposed model. These findings underscore the model’s competence in precise brain tumor localization, underscoring its potential to revolutionize medical image analysis and enhance healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, emphasizing addressing false positives and resource efficiency.
artificial intelligence and data science contents.pptxGauravCar
What is artificial intelligence? Artificial intelligence is the ability of a computer or computer-controlled robot to perform tasks that are commonly associated with the intellectual processes characteristic of humans, such as the ability to reason.
› ...
Artificial intelligence (AI) | Definitio
artificial intelligence and data science contents.pptx
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes for an HPC Enhanced Cloud and Fog Spanning IoT Big Data and Big Simulations
1. Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes for an HPC Enhanced Cloud and Fog Spanning IoT Big Data and Big Simulations
Geoffrey Fox, Supun Kamburugamuve, Judy Qiu, Shantenu Jha
June 28, 2017
IEEE Cloud 2017 Honolulu Hawaii
gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/
Department of Intelligent Systems Engineering
School of Informatics and Computing, Digital Science Center
Indiana University Bloomington
2. “Next Generation Grid – HPC Cloud” Problem Statement
• Design a dataflow event-driven FaaS (microservice) framework running across application and geographic domains.
• Build on Cloud best practice but use HPC wherever possible and useful to get high performance
• Smoothly support current paradigms Hadoop, Spark, Flink, Heron, MPI, DARMA …
• Use interoperable common abstractions but multiple polymorphic implementations.
• i.e. do not require a single runtime
• Focus on Runtime but this implicitly suggests programming and execution model
• This next generation Grid based on data and edge devices – not computing as in old Grid
3. Important Trends I
• Data gaining in importance compared to simulations
• Data analysis techniques changing with old and new applications
• All forms of IT increasing in importance; both data and simulations increasing
• Internet of Things and Edge Computing growing in importance
• Exascale initiative driving large supercomputers
• Use of public clouds increasing rapidly
• Clouds becoming diverse with subsystems containing GPUs, FPGAs, high performance networks, storage, memory …
• They have economies of scale; hard to compete with
• Serverless computing attractive to user: “No server is easier to manage than no server”
4. Important Trends II
• Rich software stacks:
• HPC for Parallel Computing
• Apache for Big Data including some edge computing (streaming data)
• On general principles parallel and distributed computing have different requirements even if sometimes similar functionalities
• Apache stack typically uses distributed computing concepts
• For example, Reduce operation is different in MPI (Harp) and Spark
• Important to put grain size into analysis
• It’s easier to make dataflow efficient if grain size large
• Streaming Data ubiquitous including data from edge
• Edge computing has some time-sensitive applications
• Choosing a good restaurant can wait seconds
• Avoiding collisions must be finished in milliseconds
5. Predictions/Assumptions
• Classic Supercomputers will continue for large simulations and may run other applications but these codes will be developed on
• Next-Generation Commodity Systems which are dominant force
• Merge Cloud HPC and Edge computing
• Clouds running in multiple giant datacenters offering all types of computing
• Distributed data sources associated with device and Fog processing resources
• Server-hidden computing for user pleasure
• Support a distributed event driven dataflow computing model covering batch and streaming data
• Needing parallel and distributed (Grid) computing ideas
6. Motivation Summary
• Explosion of Internet of Things and Cloud Computing
• Clouds will continue to grow and will include more use cases
• Edge Computing is adding an additional dimension to Cloud Computing
• Device --- Fog --- Cloud
• Event driven computing is becoming dominant
• Signal generated by a Sensor is an edge event
• Accessing an HPC linear algebra function could be event driven and replace traditional libraries by FaaS (as NetSolve GridSolve Neos did in old Grid)
• Services will be packaged as a powerful Function as a Service FaaS
• Serverless must be important: users not interested in low level details of IaaS or even PaaS?
• Applications will span from Edge to Multiple Clouds
8. Proposed Approach I
• Unit of Processing is an Event driven Function
• Can have state that may need to be preserved in place (Iterative MapReduce)
• Can be hierarchical as in invoking a parallel job
• Functions can be single or 1 of 100,000 maps in large parallel code
• Processing units run in clouds, fogs or devices but these all have similar architecture
• Fog (e.g. car) looks like a cloud to a device (radar sensor) while public cloud looks like a cloud to the fog (car)
• Use polymorphic runtime that uses different implementations depending on environment e.g. on fault-tolerance – latency (performance) tradeoffs
• Data locality (minimize explicit dataflow) properly supported as in HPF alignment commands (specify which data and computing needs to be kept together)
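An event-driven function that keeps its state in place can be sketched as follows. This is a minimal illustration in Python, not code from the talk; the names `Event`, `RunningMean` and `handle` are hypothetical, and the "radar" source simply echoes the car example on the slide.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A hypothetical edge event, e.g. one reading from a sensor."""
    source: str
    value: float


class RunningMean:
    """An event-driven function whose state (count, total) stays in place
    between invocations instead of being reloaded for every event."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def handle(self, event: Event) -> float:
        # Update the preserved state, then return the current mean.
        self.count += 1
        self.total += event.value
        return self.total / self.count


fn = RunningMean()
means = [fn.handle(Event("radar", v)) for v in (2.0, 4.0, 6.0)]
```

A stateless FaaS platform would have to fetch and store `count` and `total` around every call; keeping them in the long-running function is the "preserved in place" option.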
9. Proposed Approach II
• Analyze the runtime of existing systems
• Hadoop, Spark, Flink, Naiad Big Data Processing
• Storm, Heron Streaming Dataflow
• Kepler, Pegasus, NiFi workflow
• Harp Map-Collective, MPI and HPC AMT runtime like DARMA
• And approaches such as GridFTP and CORBA/HLA (!) for wide area data links
• Propose polymorphic unification (given function can have different implementations)
• Choose powerful scheduler (Mesos?)
• Support processing locality/alignment including MPI’s never move model with grain size consideration
• One should integrate HPC and Clouds
11. Components of Big Data Stack
• Google likes to show a timeline; we can build on (Apache version of) this
• 2002 Google File System GFS ~HDFS
• 2004 MapReduce Apache Hadoop
• 2006 Big Table Apache Hbase
• 2008 Dremel Apache Drill
• 2009 Pregel Apache Giraph
• 2010 FlumeJava Apache Crunch
• 2010 Colossus better GFS
• 2012 Spanner horizontally scalable NewSQL database ~CockroachDB
• 2013 F1 horizontally scalable SQL database
• 2013 MillWheel ~Apache Storm, Twitter Heron (Google not first!)
• 2015 Cloud Dataflow Apache Beam with Spark or Flink (dataflow) engine
• Functionalities not identified: Security, Data Transfer, Scheduling, DevOps, serverless computing (assume OpenWhisk will improve to handle robustly lots of large functions)
12. Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
HPC-ABDS: integrated wide range of HPC and Big Data technologies. I gave up updating!
Cross-Cutting Functions
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad,
Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA),
Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA,
Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j,
H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables,
CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud
Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT,
Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq,
Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook
Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco,
Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty,
ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon
SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB,
H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal
Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB,
Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J,
graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame
Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm,
Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS
Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat,
Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes,
Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula,
Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53
21 layers, over 350 software packages (January 29, 2016)
13. What do we need in runtime for distributed HPC FaaS
• Finish examination of all the current tools
• Handle Events
• Handle State
• Handle Scheduling and Invocation of Function
• Define data-flow graph that needs to be analyzed
• Handle data flow execution graph with internal event-driven model
• Handle geographic distribution of Functions and Events
• Design dataflow collective and P2P communication model
• Decide which streaming approach to adopt and integrate
• Design in-memory dataset model for backup and exchange of data in data flow (fault tolerance)
• Support DevOps and server-hidden cloud models
• Support elasticity for FaaS (connected to server-hidden)
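As a rough illustration of the "define and execute a dataflow graph" requirements, the sketch below runs a tiny graph in dependency order. It is an assumption-laden toy, not the runtime proposed in the talk: the graph, its node names and `run_dataflow` are hypothetical, everything runs in one process, and only the standard-library `graphlib.TopologicalSorter` (Python 3.9+) is used.

```python
from graphlib import TopologicalSorter

# Hypothetical three-node dataflow graph: source -> square -> total.
# Each entry maps a node name to (function, list of upstream node names).
graph = {
    "source": (lambda: [1, 2, 3, 4], []),
    "square": (lambda xs: [x * x for x in xs], ["source"]),
    "total": (lambda xs: sum(xs), ["square"]),
}


def run_dataflow(graph):
    """Run every node once, after all of its upstream nodes have produced output."""
    order = TopologicalSorter({name: deps for name, (_, deps) in graph.items()})
    results = {}
    for name in order.static_order():
        func, deps = graph[name]
        # A node's inputs are the outputs of its upstream nodes.
        results[name] = func(*(results[d] for d in deps))
    return results


results = run_dataflow(graph)
```

Because the whole graph is known before execution starts, this corresponds to the statically scheduled case; a real runtime would also have to place nodes on devices, fogs and clouds and handle events arriving over time.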
14. Communication Primitives
• Big data systems do not implement optimized communications
• It is interesting to see no AllReduce implementations
• AllReduce has to be done with Reduce + Broadcast
• No consideration of RDMA except as add-on
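The "Reduce + Broadcast" emulation of AllReduce can be sketched in a few lines. This is a single-process simulation in Python in which each list element stands for one worker's value; the function names are hypothetical and no real messaging library is involved.

```python
from functools import reduce
from operator import add


def reduce_to_root(values):
    """Reduce step: combine every worker's value at a single root."""
    return reduce(add, values)


def broadcast(value, n_workers):
    """Broadcast step: send the root's result back to every worker."""
    return [value] * n_workers


def allreduce_via_reduce_broadcast(values):
    """AllReduce emulated as Reduce followed by Broadcast."""
    return broadcast(reduce_to_root(values), len(values))


# One partial sum per (simulated) worker.
partials = [1, 2, 3, 4]
after = allreduce_via_reduce_broadcast(partials)
```

Every worker ends up with the same total, but the data passes through the root twice, which is the overhead a native AllReduce avoids.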
15. Optimized Dataflow Communications
• Novel feature of our approach
• Optimize the dataflow graph to facilitate different algorithms
• Example - Reduce
• Add subtasks and arrange them according to an optimized algorithm
• Trees, Pipelines
• Preserves the asynchronous nature of dataflow computation
Reduce communication as a dataflow graph modification
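One way to picture "add subtasks and arrange them as a tree" is the sketch below, which replaces a single flat reduce with levels of pairwise subtasks. It is a sequential Python illustration under our own assumptions; `tree_reduce` is a hypothetical name and is not taken from the authors' implementation.

```python
from operator import add


def tree_reduce(values, op):
    """Reduce `values` with a binary tree of pairwise subtasks.

    Each pass of the loop is one level of the tree; the subtasks within a
    level are independent, so a dataflow runtime could run them concurrently.
    Returns the reduced value and the number of levels used.
    """
    if not values:
        raise ValueError("nothing to reduce")
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # the odd element is passed up unchanged
        level = nxt
        depth += 1
    return level[0], depth


total, depth = tree_reduce(list(range(1, 9)), add)
```

For eight inputs the tree needs three levels instead of seven sequential combine steps, which is the benefit of rearranging the graph.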
16. Dataflow Graph State and Scheduling
• State is a key issue and handled differently in systems
• CORBA, AMT, MPI and Storm/Heron have long running tasks that preserve state
• Spark and Flink preserve datasets across dataflow nodes
• All systems agree on coarse grain dataflow; only keep state in exchanged data.
• Scheduling is one key area where dataflow systems differ
• Dynamic Scheduling
• Fine grain control of dataflow graph
• Graph cannot be optimized
• Static Scheduling
• Less control of the dataflow graph
• Graph can be optimized
18. Fault Tolerance
• Similar form of check-pointing mechanism is used in HPC and Big Data
• MPI, Flink, Spark
• Flink and Spark do better than MPI due to use of database technologies; MPI is a bit harder due to richer state
• Checkpoint after each stage of the dataflow graph
• Natural synchronization point
• Generally allows user to choose when to checkpoint (not every stage)
• Executors (processes) don’t have external state, so can be considered as coarse grained operations
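Checkpointing at stage boundaries can be illustrated with a small sketch. It assumes each stage is a plain Python function and uses `pickle` files in a temporary directory as the checkpoint store; `run_with_checkpoints`, `double` and `total` are hypothetical names, and real systems such as Spark or Flink use their own mechanisms.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(stages, data, ckpt_dir):
    """Run `stages` in order, saving each stage's output to `ckpt_dir`.

    On a rerun, a stage whose checkpoint file already exists is skipped and
    its saved output is loaded instead. Returns the final data and the
    indices of the stages that were actually executed.
    """
    executed = []
    for index, stage in enumerate(stages):
        path = os.path.join(ckpt_dir, f"stage_{index}.pkl")
        if os.path.exists(path):
            with open(path, "rb") as fh:
                data = pickle.load(fh)
            continue
        data = stage(data)
        executed.append(index)
        with open(path, "wb") as fh:
            pickle.dump(data, fh)
    return data, executed


def double(xs):
    return [2 * x for x in xs]


def total(xs):
    return sum(xs)


stages = [double, total]
with tempfile.TemporaryDirectory() as ckpt_dir:
    first, ran_first = run_with_checkpoints(stages, [1, 2, 3], ckpt_dir)
    # A second run finds both checkpoints and recomputes nothing.
    second, ran_second = run_with_checkpoints(stages, [1, 2, 3], ckpt_dir)
```

This works only because the stages keep no state outside the data they exchange, which is the coarse-grained property noted on the slide.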
19. Spark Kmeans Flink Streaming Dataflow
P = loadPoints()
C = loadInitCenters()
for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()
}
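The K-means loop above can be made concrete with a small single-process sketch in plain Python. It is not Spark or Flink code: the map step receives the "broadcast" centers as an ordinary argument, the reduce step averages the points assigned to each center, and `nearest` and `kmeans` are hypothetical helper names.

```python
def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Map: every point sees the broadcast centers and emits (index, point).
        assigned = [(nearest(p, centers), p) for p in points]
        # Reduce: average the points assigned to each center.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for idx, p in assigned if idx == k]
            if members:
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members)
                          for d in range(len(old)))
                )
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

The updated centers feed back into the next iteration, so each pass repeats the same broadcast, map and reduce pattern.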
24. Dataflow for a linear algebra kernel
Typical target of HPC AMT System
Danalis 2016
25. Dataflow Frameworks
• Every major big data framework is designed according to dataflow model
• Batch Systems
• Hadoop, Spark, Flink, Apex
• Streaming Systems
• Storm, Heron, Samza, Flink, Apex
• HPC AMT Systems
• Legion, Charm++, HPX-5, Dague, COMPs
• Design choices in dataflow
• Efficient in different application areas
26. HPC Runtime versus ABDS distributed Computing Model on Data Analytics
Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support AllReduce directly; MPI does in-place combined reduce/broadcast and is fastest
Need Polymorphic Reduction capability choosing best implementation
Use HPC architecture with Mutable model
Immutable data
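A "polymorphic reduction" could look roughly like the sketch below: one `allreduce` abstraction with two interchangeable implementations, chosen by environment. Everything here is an assumption made for illustration, including the environment labels "dataflow" and "hpc" and the use of recursive doubling as the HPC-style variant. Both variants are simulated in one process, with each list element standing for one worker's value.

```python
from typing import Callable, Dict, List

AllReduce = Callable[[List[float]], List[float]]


def via_reduce_broadcast(values: List[float]) -> List[float]:
    """Dataflow style: reduce at a root, then broadcast the result."""
    total = sum(values)
    return [total] * len(values)


def via_recursive_doubling(values: List[float]) -> List[float]:
    """HPC style: in each round worker i combines with worker i XOR step."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("this sketch needs a power-of-two worker count")
    vals = list(values)
    step = 1
    while step < n:
        vals = [vals[i] + vals[i ^ step] for i in range(n)]
        step *= 2
    return vals


# Hypothetical registry: the caller names an environment, not an algorithm.
IMPLEMENTATIONS: Dict[str, AllReduce] = {
    "dataflow": via_reduce_broadcast,
    "hpc": via_recursive_doubling,
}


def allreduce(values: List[float], environment: str) -> List[float]:
    """Common abstraction; the implementation depends on the environment."""
    return IMPLEMENTATIONS[environment](values)


partials = [1.0, 2.0, 3.0, 4.0]
dataflow_result = allreduce(partials, "dataflow")
hpc_result = allreduce(partials, "hpc")
```

Both calls return the same values, so application code can stay unchanged while the runtime picks whichever implementation suits the fault-tolerance and latency needs of the environment.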
28. Multidimensional Scaling
MDS execution time on 16 nodes with 20 processes in each node with varying number of points
MDS execution time with 32000 points on varying number of nodes. Each node runs 20 parallel tasks
29. K-Means Clustering in Spark, Flink, MPI
Dataflow for K-means (figure labels): Data Set <Points>; Data Set <Initial Centroids>; Broadcast; Map (nearest centroid calculation); Reduce (update centroids); Data Set <Updated Centroids>
K-Means execution time on 16 nodes with 20 parallel tasks in each node with 10 million points and varying number of centroids. Each point has 100 attributes.
K-Means execution time on varying number of nodes with 20 processes in each node with 10 million points and 16000 centroids. Each point has 100 attributes.
30. Heron High Performance Interconnects
• Infiniband & Intel Omni-Path integrations
• Using Libfabric as a library
• Natively integrated to Heron through Stream Manager without needing to go through JNI
31. Summary of HPC Cloud – Next Generation Grid
• We suggest an event driven computing model built around Cloud and HPC and spanning batch, streaming, and edge applications
• Expand current technology of FaaS (Function as a Service) and server-hidden computing
• We have integrated HPC into many Apache systems with HPC-ABDS
• We have analyzed the different runtimes of Hadoop, Spark, Flink, Storm, Heron, Naiad, DARMA (HPC Asynchronous Many Task)
• There are different technologies for different circumstances but can be unified by high level abstractions such as communication collectives
• Need to be careful about treatment of state – more research needed
Editor's Notes
Note the differences in communication architectures
Note times are in log scale
Bars indicate compute-only times, which are similar across these frameworks
Overhead is dominated by communications in Flink and Spark