CHAPTER 8 Data-Intensive Computing
What is data-intensive computing?
• Data-intensive computing is concerned with
production, manipulation, and analysis of
large-scale data in the range of hundreds of
megabytes (MB) to petabytes (PB)
Characterizing data-intensive
computations
• Data-intensive applications not only deal with
huge volumes of data but, very often, also
exhibit compute-intensive properties
• Datasets are commonly persisted in several
formats and distributed across different
locations.
Challenges ahead
• Scalable algorithms that can search and process
massive datasets
• New metadata management technologies that
can scale to handle complex, heterogeneous, and
distributed data sources
• Advances in high-performance computing
platforms aimed at providing better support for
accessing in-memory multiterabyte data
structures
• High-performance, highly reliable, petascale
distributed file systems
Challenges ahead
• Data signature-generation techniques for data
reduction and rapid processing
• New approaches to software mobility for
delivering algorithms that are able to move the
computation to where the data are located
• Specialized hybrid interconnection architectures
that provide better support for filtering
multigigabyte datastreams coming from high-
speed networks and scientific instruments
• Flexible and high-performance software
integration techniques
Historical perspective
• Data-intensive computing harnesses storage, networking technologies, algorithms,
and infrastructure software all together
• The early age: high-speed wide-area
networking
• Data grids
• Data clouds and “Big Data”
• Databases and data-intensive computing
High-speed wide-area networking
• In 1989, the first experiments in high-speed
networking as a support for remote
visualization of scientific data led the way
• Two years later, the potential of using high-
speed wide area networks for enabling high-
speed, TCP/IP-based distributed applications
was demonstrated at Supercomputing 1991
• Another important milestone was set with the
Clipper project
Data grids
• huge computational power and storage
facilities
• A data grid provides services that help users
discover, transfer, and manipulate large
datasets stored in distributed repositories
• Data grids offer two main functionalities: high-
performance and reliable file transfer for
moving large amounts of data, and scalable
replica discovery and management mechanisms
for easy access to distributed datasets
Characteristics and new challenges of data grids
• Massive datasets
• Shared data collections
• Unified namespace
• Access restrictions
Data clouds and “Big Data”
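• Big Data is not limited to scientific computing
• Companies providing Internet services such as
searching, online advertising, and social media
also collect huge datasets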
• It is critical for such companies to efficiently
analyze these huge datasets because they
constitute a precious source of information
about their customers
• Log analysis is an example
Data clouds and “Big Data”
Cloud technologies support data-intensive
computing in several ways:
• By providing a large number of compute
instances on demand
• By providing a storage system
• By providing frameworks and programming
APIs
Databases and data-intensive
computing
• Distributed Database
Technologies for data-intensive
computing
Data-intensive computing concerns the
development of applications that are mainly
focused on processing large quantities of data.
Storage systems
• Growing popularity of Big Data
• Growing importance of data analytics in the
business chain
• Presence of data in several forms, not only
structured
• New approaches and technologies for
computing
Storage systems
• High-performance distributed file systems and
storage clouds
– Lustre
– IBM General Parallel File System (GPFS)
– Google File System (GFS)
– Sector
– Amazon Simple Storage Service (S3)
• NoSQL systems
– Apache CouchDB and MongoDB
– Amazon Dynamo
– Google Bigtable
– Hadoop HBase
High-performance distributed file
systems and storage clouds
• Lustre
• The Lustre file system is a massively parallel distributed file
system that covers needs ranging from a small workgroup of
clusters to a large-scale computing cluster.
• The file system is used by several of the Top 500
supercomputing systems
• Lustre is designed to provide access to petabytes (PBs) of
storage to serve thousands of clients with an I/O
throughput of hundreds of gigabytes per second (GB/s)
High-performance distributed file
systems and storage clouds
• IBM General Parallel File System (GPFS).
• high-performance distributed file system developed by
IBM
• support for the RS/6000 supercomputer and Linux
computing clusters
• GPFS is built on the concept of shared disks
• GPFS distributes the metadata of the entire file system and
provides transparent access
High-performance distributed file
systems and storage clouds
• Google File System (GFS)
• Distributed applications in Google’s computing cloud
• The system has been designed to be a fault tolerant, highly
available, distributed file system built on commodity
hardware and standard Linux operating systems.
• large files
• workloads primarily consist of two kinds of reads: large
streaming reads and small random reads.
High-performance distributed file
systems and storage clouds
• Sector
• storage cloud that supports the execution of data-intensive
applications
• deployed on commodity hardware across a wide-area
network.
• Compared to other file systems, Sector does not partition
a file into blocks but replicates entire files on multiple
nodes
• The system’s architecture is composed of four kinds of nodes: a
security server, one or more master nodes, slave nodes,
and client machines
High-performance distributed file
systems and storage clouds
• Amazon Simple Storage Service (S3)
• Amazon S3 is the online storage service provided by
Amazon.
• designed to support high availability, reliability, scalability, and
virtually infinite storage
• The system offers a flat storage space organized into
buckets
• Each bucket can store multiple objects, each identified by a
unique key. Objects are identified by unique URLs and
exposed through HTTP.
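The bucket and key model above can be pictured with a small in-memory sketch. This is not the Amazon S3 API: the class FlatStore, its methods, and the host name storage.example.com are hypothetical names chosen only for illustration.

```python
class FlatStore:
    """Toy in-memory model of a flat storage space organized into buckets."""

    def __init__(self, host="storage.example.com"):  # hypothetical host name
        self.host = host
        self.buckets = {}  # bucket name -> {key -> bytes}

    def create_bucket(self, bucket):
        self.buckets.setdefault(bucket, {})

    def put(self, bucket, key, data):
        # Keys are opaque strings: "a/b/c" is one key, not a directory path.
        self.buckets[bucket][key] = data

    def get(self, bucket, key):
        return self.buckets[bucket][key]

    def url(self, bucket, key):
        # Each object is addressable by a unique URL built from bucket and key.
        return f"http://{self.host}/{bucket}/{key}"


store = FlatStore()
store.create_bucket("logs")
store.put("logs", "2011/03/15/container.log", b"DEBUG - started")
print(store.url("logs", "2011/03/15/container.log"))
```

The slashes in the key only look like folders; the namespace inside a bucket stays flat, which is the point of the sketch.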
NoSQL systems
• Document stores (Apache Jackrabbit, Apache
CouchDB, SimpleDB, Terrastore).
• Graphs (AllegroGraph, Neo4j, FlockDB,
Cerebrum).
• Multivalue databases (OpenQM, Rocket U2,
OpenInsight).
• Object databases (ObjectStore, JADE, ZODB).
• Tabular stores (Google Bigtable, Hadoop HBase,
Hypertable).
• Tuple stores (Apache River).
NoSQL systems
• Apache CouchDB and MongoDB
– document stores
– schema-less
– RESTful interface and represent data in JSON
format.
– allow querying and indexing data by using the
MapReduce programming model
– JavaScript as a base language for data querying
and manipulation rather than SQL
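A minimal sketch of the idea of querying schema-less JSON documents with map and reduce functions follows. The systems named above use JavaScript for this; Python is used here only to show the shape of the computation, and the documents and the names map_doc, reduce_values, and build_view are made up for illustration.

```python
import json
from collections import defaultdict

# Schema-less documents: each record is a JSON object with its own fields.
docs = [json.loads(s) for s in (
    '{"_id": "1", "type": "post", "author": "ann", "tags": ["cloud", "nosql"]}',
    '{"_id": "2", "type": "post", "author": "bob", "tags": ["cloud"]}',
    '{"_id": "3", "type": "comment", "author": "ann"}',
)]

def map_doc(doc):
    """Emit (tag, 1) for every tag; documents without tags emit nothing."""
    for tag in doc.get("tags", []):
        yield tag, 1

def reduce_values(key, values):
    """Collapse all values emitted for one key into a single count."""
    return sum(values)

def build_view(documents, map_fn, reduce_fn):
    """Apply map_fn to each document, group by key, reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}

view = build_view(docs, map_doc, reduce_values)
print(view)
```

Because no schema is enforced, the map function has to tolerate missing fields, which is why it reads tags with a default.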
NoSQL systems
• Amazon Dynamo
– The main goal of Dynamo is to provide an
incrementally scalable and highly available storage
system.
– serving 10 million requests per day
– objects are stored and retrieved with a unique
identifier (key)
NoSQL systems
• Google Bigtable
– scale up to petabytes of data across thousands of
servers
– Bigtable provides storage support for several
Google applications
– Bigtable’s key design goals are wide applicability,
scalability, high performance, and high availability.
– Bigtable organizes the data storage in tables of
which the rows are distributed over the
distributed file system supporting the middleware
NoSQL systems
• Apache Cassandra
– managing large amounts of structured data spread
across many commodity servers
– Cassandra was initially developed by Facebook
– Currently, it provides storage support for several
very large Web applications such as Facebook,
Digg, and Twitter
– second-generation distributed database
– column family
NoSQL systems
• Hadoop HBase.
– distributed database
– Hadoop distributed programming platform.
– HBase is designed by taking inspiration from
Google Bigtable
– main goal is to offer real-time read/write
operations for tables with billions of rows and
millions of columns by leveraging clusters of
commodity hardware
Programming platforms
• large quantity of information
• runtime systems able to efficiently manage huge
volumes of data.
• database management systems based on the
relational model - unsuccessful
• unstructured or semistructured
• large size or a huge number of medium-sized files
rather than rows in a database
• Distributed workflows
The MapReduce programming model
• map and reduce
• Introduced by Google for processing large
quantities of data.
• Data transfer and management are completely
handled by the distributed storage
infrastructure
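The two functions of the model can be sketched in a few lines. This is a single-process illustration under simplifying assumptions (no distribution, no fault tolerance); map_reduce, word_map, and word_reduce are hypothetical names, not part of any framework.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run map over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def word_map(line):
    """map: one input line -> a (word, 1) pair per word."""
    for word in line.lower().split():
        yield word, 1

def word_reduce(word, counts):
    """reduce: all counts for one word -> total."""
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "The end"]
counts = map_reduce(lines, word_map, word_reduce)
print(counts["the"])
```

The developer writes only word_map and word_reduce; grouping by key, which a real runtime performs across machines, is hidden inside map_reduce.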
Examples of MapReduce
• Distributed grep
– recognition of patterns within text streams
• Count of URL-access frequency
– key-value pairs: <URL, 1>, <URL, total-count>
• Reverse Web-link graph
– <target, source>, <target, list (source) >
• Term vector per host.
– Word Counting
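Two of the examples above can be sketched as follows, with the grouping step written as an explicit sort and group. The request list and link list are made-up sample data, and shuffle is a hypothetical helper name.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Sort intermediate pairs by key and collect the values of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

# Count of URL-access frequency: map emits <URL, 1>, reduce emits <URL, total-count>.
access_log = ["/index.html", "/about.html", "/index.html"]  # made-up requests
url_pairs = [(url, 1) for url in access_log]
url_counts = {url: sum(ones) for url, ones in shuffle(url_pairs)}

# Reverse Web-link graph: map emits <target, source>, reduce emits <target, list(source)>.
links = [("a.html", "c.html"), ("b.html", "c.html"), ("a.html", "b.html")]  # (source, target)
link_pairs = [(target, source) for source, target in links]
reverse_graph = {target: sources for target, sources in shuffle(link_pairs)}

print(url_counts)
print(reverse_graph)
```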
Examples of MapReduce
• Inverted index
– <word, document-id>, < word, list(document-id)>
• Distributed sort
• Statistical algorithms such as Support Vector
Machines (SVM), Linear Regression (LR), Naive
Bayes (NB), and Neural Network (NN)
• Any computation that can be expressed as two major stages can be
represented in terms of a MapReduce computation.
– Analysis
• operates directly on the data input file
• embarrassingly parallel
– Aggregation
• operates on the intermediate results
• aimed at aggregating, summing, and/or elaborating
the data obtained at the previous stage to present the data in their final form
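The inverted index listed above makes the analysis and aggregation stages easy to see. The sketch below is a single-process illustration; the document ids and contents are made up, and analysis and aggregation are hypothetical function names.

```python
from collections import defaultdict

documents = {  # made-up document ids and contents
    "d1": "cloud storage systems",
    "d2": "cloud programming models",
    "d3": "storage clouds",
}

def analysis(doc_id, text):
    """Analysis stage: works on one input document, emits <word, document-id>."""
    for word in set(text.split()):
        yield word, doc_id

def aggregation(pairs):
    """Aggregation stage: works on intermediate pairs, builds <word, list(document-id)>."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

intermediate = [pair for doc_id, text in documents.items()
                for pair in analysis(doc_id, text)]
inverted_index = aggregation(intermediate)
print(inverted_index["cloud"])
```

Each call to analysis depends only on its own document, which is what makes that stage embarrassingly parallel.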
Variations and extensions of
MapReduce
• MapReduce constitutes a simplified model for
processing large quantities of data
• model can be applied to several different
problem scenarios
• Variations and extensions aim at extending the MapReduce
application space and providing developers
with an easier interface for designing
distributed algorithms.
Frameworks
• Hadoop
• Pig
• Hive
• Map-Reduce-Merge
• Twister
Hadoop
• Apache Hadoop is an open-source software
framework used for distributed storage and
processing of big datasets using the
MapReduce programming model.
• Initially developed and supported by Yahoo!
• 40,000 machines and more than 300,000
cores
• http://hadoop.apache.org/
Pig
• platform that allows the analysis of large
datasets
• high-level language for expressing data
analysis programs
• Developers can express their data analysis
programs in a textual language called Pig
Latin,
• https://pig.apache.org/
Hive.
• Provides a data warehouse infrastructure on
top of Hadoop MapReduce.
• It provides tools for easy data summarization
• classical data warehouse,
• https://hive.apache.org/
Map-Reduce-Merge
• Map-Reduce-Merge is an extension of the
MapReduce model
• merging data already partitioned and sorted
by map and reduce modules
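One way to picture a merge phase is a sort-merge join over the outputs of two separate jobs. The sketch below assumes that each input is already sorted by key and that keys are unique within each input; the data and the name merge_join are hypothetical.

```python
def merge_join(left, right):
    """Merge two key-sorted lists of (key, value) pairs, pairing equal keys."""
    i = j = 0
    merged = []
    while i < len(left) and j < len(right):
        lkey, lval = left[i]
        rkey, rval = right[j]
        if lkey < rkey:
            i += 1          # key only on the left side: skip it
        elif lkey > rkey:
            j += 1          # key only on the right side: skip it
        else:
            merged.append((lkey, lval, rval))
            i += 1
            j += 1
    return merged

# Hypothetical outputs of two MapReduce jobs, each sorted by key.
visits_per_url = [("/about", 3), ("/index", 10), ("/news", 4)]
bytes_per_url = [("/index", 2048), ("/news", 512), ("/shop", 900)]
joined = merge_join(visits_per_url, bytes_per_url)
print(joined)
```

Because both inputs arrive sorted, the merge needs a single pass over each list.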
Twister
• extension of the MapReduce model that allows
the creation of iterative executions of
MapReduce jobs.
• 1. Configure Map
• 2. Configure Reduce
• 3. While Condition Holds True Do
– a. Run MapReduce
– b. Apply Combine Operation to Result
– c. Update Condition
• 4. Close
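The loop above can be sketched with a toy one-dimensional clustering job. The choice of a k-means-style computation is an assumption made for illustration, not something the slides specify, and run_map_reduce, combine, and iterate are hypothetical names rather than Twister APIs.

```python
from collections import defaultdict

def run_map_reduce(points, centers):
    """One MapReduce pass: assign each point to its nearest center, then average."""
    groups = defaultdict(list)
    for p in points:  # map: emit (index of nearest center, point)
        idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        groups[idx].append(p)
    # reduce: the new center is the mean of the points assigned to it
    return {idx: sum(ps) / len(ps) for idx, ps in groups.items()}

def combine(result, centers):
    """Combine step: fold the reduce outputs into a new list of centers."""
    return [result.get(i, c) for i, c in enumerate(centers)]

def iterate(points, centers, tolerance=1e-6, max_iterations=100):
    for _ in range(max_iterations):                   # 3. while the condition holds
        result = run_map_reduce(points, centers)      # a. run MapReduce
        new_centers = combine(result, centers)        # b. apply combine to the result
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift <= tolerance:                        # c. update the condition
            break
    return centers

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = iterate(points, centers=[0.0, 5.0])
print(centers)
```

The map and reduce logic is configured once and reused on every pass; only the centers change between iterations.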
Alternatives to MapReduce
• Sphere.
• All-Pairs.
• DryadLINQ
Sphere.
• Sector Distributed File System (SDFS).
• Sphere implements the stream processing model
(Single Program, Multiple Data)
• user-defined functions (UDFs)
• it is built on top of Sector’s API for data access
• UDFs are expressed in terms of programs that read and
write streams.
• Sphere client sends a request for processing to the
master node, which returns the list of available slaves,
and the client will choose the slaves on which to
execute Sphere processes
All-Pairs
• Biometrics
• (1) model the system;
• (2) distribute the data;
• (3) dispatch batch jobs; and
• (4) clean up the system
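The computation that these four phases distribute is commonly described as comparing every element of one set with every element of another. The sketch below shows only that core computation in a single process, not the distribution phases; the bit-string templates and the similarity function are made-up stand-ins for biometric data.

```python
def all_pairs(set_a, set_b, compare):
    """Return a matrix M where M[i][j] = compare(set_a[i], set_b[j])."""
    return [[compare(a, b) for b in set_b] for a in set_a]

def hamming_similarity(x, y):
    """Fraction of positions where two equal-length strings agree."""
    return sum(c1 == c2 for c1, c2 in zip(x, y)) / len(x)

templates = ["1010", "1110", "0001"]  # made-up sample templates
matrix = all_pairs(templates, templates, hamming_similarity)
for row in matrix:
    print(row)
```

Each row of the matrix is independent of the others, so a row (or a block of rows) is a natural unit to dispatch as a batch job.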
DryadLINQ.
• Microsoft Research project that investigates
programming models for writing parallel and
distributed programs
• small cluster to a large datacenter
• Automatically parallelizing the execution of
applications without requiring the developer
to know about distributed and parallel
programming.
Aneka MapReduce programming
• Developing MapReduce applications on top of
Aneka
• Mapper and Reducer - Aneka MapReduce APIs
• Three classes are of importance for application
development:
– Mapper < K,V >
– Reducer <K,V >
– MapReduceApplication <M,R >
– The submission and execution of a MapReduce
job is performed through the class
MapReduceApplication <M,R >
• Map Function APIs.
• IMapInput<K,V> provides access to the input
key-value pair on which the map operation is
performed
Reduce Function APIs
• Reduce (IReduceInputEnumerator < V > input)
• reduce operation is applied to a collection of
values that are mapped to the same key
• MapReduceApplication <M,R>
• InvokeAndWait method:
ApplicationBase<M,R>
• WordCounterMapper and
WordCounterReducer classes
• The parameters that can be controlled
– Partitions
– Attempts
– SynchReduce
– IsInputReady
– FetchResults
– LogFile
• WordCounter Job. -Program
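A rough Python analogy of how a setting such as Partitions could shape a word-count job is sketched below. It assumes that the parameter fixes the number of reducer partitions, which the slides do not state. This is not the Aneka API (which targets .NET); partition_for and run_job are hypothetical names.

```python
import zlib
from collections import defaultdict

def partition_for(key, partitions):
    """Pick a reducer partition for a key with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % partitions

def run_job(lines, partitions=3):
    # Map phase: each word becomes a (word, 1) pair routed to one partition.
    buckets = [defaultdict(list) for _ in range(partitions)]
    for line in lines:
        for word in line.split():
            buckets[partition_for(word, partitions)][word].append(1)
    # Reduce phase: one reducer per partition sums the values of its keys.
    return [{word: sum(ones) for word, ones in bucket.items()} for bucket in buckets]

results = run_job(["a b a", "c b a"], partitions=3)
for number, part in enumerate(results):
    print(number, part)
```

All occurrences of a word hash to the same partition, so each reducer can total its words without consulting the others.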
Runtime support
• Task Scheduling
– MapReduceScheduler class.
• Task Execution.
– MapReduceExecutor
Task Scheduling
Task Execution.
Distributed file system support
• Unlike the other programming models Aneka supports, the
MapReduce model does not leverage the default
Storage Service for storage and data transfer
• It uses a distributed file system implementation instead
• The requirements in terms of file management are significantly different with
respect to the other models
• Distributed file system implementations
guarantee high availability and better efficiency
• Aneka provides the capability of interfacing
with different storage implementations
• Retrieving the location of files and file chunks
• Accessing a file by means of a stream
• classes SeqReader and SeqWriter
Example application
• MapReduce is a very useful model for
processing large quantities of data
• Semistructured
• logs or Web pages
• logs produced by the Aneka container
Parsing Aneka logs
• Aneka components produce a lot of
information that is stored in the form of log
files
• DD MMM YY hh:mm:ss level - message
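A minimal sketch of parsing lines in the format above and counting entries per level, written as a mapper and a reducer, could look like this. It assumes two-digit day and year fields and a one-word level; the sample lines and component names are made up, and the code is a Python illustration rather than the Aneka implementation.

```python
import re
from collections import Counter

# DD MMM YY hh:mm:ss level - message
LOG_LINE = re.compile(
    r"^(?P<date>\d{2} \w{3} \d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) - (?P<message>.*)$"
)

def map_line(line):
    """Mapper: emit (level, 1) for a well-formed line, nothing otherwise."""
    match = LOG_LINE.match(line)
    if match:
        yield match.group("level"), 1

def reduce_counts(pairs):
    """Reducer: sum the counts emitted for each level."""
    totals = Counter()
    for level, count in pairs:
        totals[level] += count
    return dict(totals)

sample = [  # made-up log lines
    "15 Mar 11 10:30:07 DEBUG - SchedulerService: heartbeat received",
    "15 Mar 11 10:30:08 INFO - StorageService: file staged",
    "15 Mar 11 10:30:09 DEBUG - SchedulerService: task dispatched",
    "malformed line",
]
level_counts = reduce_counts(pair for line in sample for pair in map_line(line))
print(level_counts)
```

Lines that do not match the format are dropped by the mapper, so a few malformed entries do not abort the job.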
Mapper design and implementation
Reducer design and implementation
Driver program
Result

VTU 6th Sem Elective CSE - Module 4 cloud computing

  • 9. What is data-intensive computing? • Data-intensive computing is concerned with production, manipulation, and analysis of large-scale data in the range of hundreds of megabytes (MB) to petabytes (PB)
  • 10. Characterizing data-intensive computations • Data-intensive applications not only deal with huge volumes of data but, very often, also exhibit compute-intensive properties • Datasets are commonly persisted in several formats and distributed across different locations.
  • 12. Challenges ahead • Scalable algorithms that can search and process massive datasets • New metadata management technologies that can scale to handle complex, heterogeneous, and distributed data sources • Advances in high-performance computing platforms aimed at providing a better support for accessing in-memory multiterabyte data structures • High-performance, highly reliable, petascale distributed file systems
  • 13. Challenges ahead • Data signature-generation techniques for data reduction and rapid processing • New approaches to software mobility for delivering algorithms that are able to move the computation to where the data are located • Specialized hybrid interconnection architectures that provide better support for filtering multigigabyte datastreams coming from high-speed networks and scientific instruments • Flexible and high-performance software integration techniques
  • 14. Historical perspective storage, networking technologies, algorithms, and infrastructure software all together • The early age: high-speed wide-area networking • Data grids • Data clouds and “Big Data” • Databases and data-intensive computing
  • 15. High-speed wide-area networking • In 1989, the first experiments in high-speed networking as a support for remote visualization of scientific data led the way • Two years later, the potential of using high-speed wide-area networks for enabling high-speed, TCP/IP-based distributed applications was demonstrated at Supercomputing 1991 • Another important milestone was set with the Clipper project
  • 16. Data grids • huge computational power and storage facilities • A data grid provides services that help users discover, transfer, and manipulate large datasets stored in distributed repositories • Data grids offer two main functionalities: high-performance and reliable file transfer for moving large amounts of data
  • 18. Characteristics and introduce new challenges • Massive datasets • Shared data collections • Unified namespace • Access restrictions
  • 19. Data clouds and “Big Data” • Scientific computing • searching, online advertising, and social media • It is critical for such companies to efficiently analyze these huge datasets because they constitute a precious source of information about their customers • Log analysis is an example
  • 20. Data clouds and “Big Data” Cloud technologies support data-intensive computing in several ways: • By providing a large amount of compute instances on demand • By providing a storage system • By providing frameworks and programming APIs
  • 22. Technologies for data-intensive computing Data-intensive computing concerns the development of applications that are mainly focused on processing large quantities of data.
  • 23. Storage systems • Growing popularity of Big Data • Growing importance of data analytics in the business chain • Presence of data in several forms, not only structured • New approaches and technologies for computing
  • 24. Storage systems • High-performance distributed file systems and storage clouds – Lustre – IBM General Parallel File System (GPFS) – Google File System (GFS) – Sector – Amazon Simple Storage Service (S3)
  • 25. • NoSQL systems – Apache CouchDB and MongoDB – Amazon Dynamo – Google Bigtable – Hadoop HBase
  • 26. High-performance distributed file systems and storage clouds • Lustre • The Lustre file system is a massively parallel distributed file system that covers needs ranging from a small workgroup cluster to a large-scale computing cluster. • The file system is used by several of the Top 500 supercomputing systems. • Lustre is designed to provide access to petabytes (PBs) of storage to serve thousands of clients with an I/O throughput of hundreds of gigabytes per second (GB/s)
  • 27. High-performance distributed file systems and storage clouds • IBM General Parallel File System (GPFS). • high-performance distributed file system developed by IBM • support for the RS/6000 supercomputer and Linux computing clusters • GPFS is built on the concept of shared disks • GPFS distributes the metadata of the entire file system and provides transparent access
  • 28. High-performance distributed file systems and storage clouds • Google File System (GFS) • Distributed applications in Google’s computing cloud • The system has been designed to be a fault tolerant, highly available, distributed file system built on commodity hardware and standard Linux operating systems. • large files • workloads primarily consist of two kinds of reads: large streaming reads and small random reads.
  • 29. High-performance distributed file systems and storage clouds • Sector • storage cloud that supports the execution of data-intensive applications • deployed on commodity hardware across a wide-area network. • Compared to other file systems, Sector does not partition a file into blocks but replicates entire files on multiple nodes • The system’s architecture is composed of four types of nodes: a security server, one or more master nodes, slave nodes, and client machines
  • 30. High-performance distributed file systems and storage clouds • Amazon Simple Storage Service (S3) • Amazon S3 is the online storage service provided by Amazon. • support high availability, reliability, scalability, infinite storage, • The system offers a flat storage space organized into buckets • Each bucket can store multiple objects, each identified by a unique key. Objects are identified by unique URLs and exposed through HTTP,
  • 32. NoSQL systems • Document stores (Apache Jackrabbit, Apache CouchDB, SimpleDB, Terrastore). • Graphs (AllegroGraph, Neo4j, FlockDB, Cerebrum). • Multivalue databases (OpenQM, Rocket U2, OpenInsight). • Object databases (ObjectStore, JADE, ZODB). • Tabular stores (Google BigTable, Hadoop HBase, Hypertable). • Tuple stores (Apache River).
  • 33. NoSQL systems • Apache CouchDB and MongoDB – document stores – schema-less – RESTful interface and represent data in JSON format. – allow querying and indexing data by using the MapReduce programming model – JavaScript as a base language for data querying and manipulation rather than SQL
  • 34. NoSQL systems • Amazon Dynamo – The main goal of Dynamo is to provide an incrementally scalable and highly available storage system. – serving 10 million requests per day – objects are stored and retrieved with a unique identifier (key)
  • 36. NoSQL systems • Google Bigtable – scales up to petabytes of data across thousands of servers – Bigtable provides storage support for several Google applications – Bigtable’s key design goals are wide applicability, scalability, high performance, and high availability. – Bigtable organizes the data storage in tables of which the rows are distributed over the distributed file system supporting the middleware
  • 38. NoSQL systems • Apache Cassandra – managing large amounts of structured data spread across many commodity servers – Cassandra was initially developed by Facebook – Currently, it provides storage support for several very large Web applications such as Facebook, Digg, and Twitter – second-generation distributed database – column family
  • 39. NoSQL systems • Hadoop HBase. – distributed database – Hadoop distributed programming platform. – HBase is designed by taking inspiration from Google Bigtable – main goal is to offer real-time read/write operations for tables with billions of rows and millions of columns by leveraging clusters of commodity hardware
  • 40. Programming platforms • large quantity of information • runtime systems able to efficiently manage huge volumes of data. • database management systems based on the relational model - unsuccessful • unstructured or semistructured • large size or a huge number of medium-sized files rather than rows in a database • Distributed workflows
  • 41. The MapReduce programming model • expresses computation in terms of two functions: map and reduce • introduced by Google for processing large quantities of data. • Data transfer and management are completely handled by the distributed storage infrastructure
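The map and reduce functions named on this slide can be sketched with a word count, the usual introductory case. This is a minimal single-process illustration, not Google's or Hadoop's implementation: `run_mapreduce` is a hypothetical stand-in for the runtime that would normally distribute the work, and the two input records are made up.

```python
from collections import defaultdict


def map_fn(_key, text):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts that were grouped under the same word."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process runtime: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Made-up input records of the form (document id, text).
records = [("doc1", "big data big compute"), ("doc2", "data moves to compute")]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
```

In a real deployment the grouping step (the shuffle) and the movement of data between nodes are what the framework and the distributed storage layer take care of; the developer supplies only the two functions.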
  • 50. Examples of MapReduce • Distributed grep – recognition of patterns within text streams • Count of URL-access frequency – key-value pairs <URL, 1>, <URL, total-count> • Reverse Web-link graph – <target, source>, <target, list(source)> • Term vector per host. – Word counting
  • 51. Examples of MapReduce • Inverted index – <word, document-id>, <word, list(document-id)> • Distributed sort • Statistical algorithms such as Support Vector Machines (SVM), Linear Regression (LR), Naive Bayes (NB), and Neural Network (NN)
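Three of the examples listed on these two slides share the same shape: map emits a key-value pair, the pairs are grouped by key, and reduce collapses each group. The sketch below shows that shape in one process; `group_by_key` and the function names are hypothetical helpers introduced for illustration, and each function takes small in-memory inputs rather than distributed files.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Shuffle step: collect every value emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def url_frequency(requested_urls):
    # Map emits <URL, 1>; reduce sums the ones into <URL, total-count>.
    pairs = ((url, 1) for url in requested_urls)
    return {url: sum(ones) for url, ones in group_by_key(pairs).items()}


def reverse_link_graph(links):
    # Input links are (source, target). Map emits <target, source>;
    # reduce emits <target, list(source)>.
    pairs = ((target, source) for source, target in links)
    return {t: sorted(sources) for t, sources in group_by_key(pairs).items()}


def inverted_index(documents):
    # Map emits <word, document-id>; reduce emits <word, list(document-id)>.
    pairs = (
        (word, doc_id)
        for doc_id, text in documents.items()
        for word in set(text.split())
    )
    return {word: sorted(ids) for word, ids in group_by_key(pairs).items()}
```

For instance, `reverse_link_graph([("p1", "p3"), ("p2", "p3")])` groups both sources under the target `"p3"`.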
  • 52. • Two major stages can be represented in terms of MapReduce computation. – Analysis • operates directly on the data input file • embarrassingly parallel – Aggregation • operates on the intermediate results • aimed at aggregating, summing, and/or elaborating the data obtained at the previous stage to present the data in their final form
  • 54. Variations and extensions of MapReduce • MapReduce constitutes a simplified model for processing large quantities of data • model can be applied to several different problem scenarios • They aim at extending the MapReduce application space and providing developers with an easier interface for designing distributed algorithms.
  • 55. frameworks • Hadoop • Pig • Hive • Map-Reduce-Merge • Twister
  • 56. Hadoop • Apache Hadoop is an open-source software framework used for distributed storage and processing of big datasets using the MapReduce programming model. • Initially developed and supported by Yahoo • 40,000 machines and more than 300,000 cores • http://hadoop.apache.org/
  • 57. Pig • platform that allows the analysis of large datasets • high-level language for expressing data analysis programs • Developers can express their data analysis programs in a textual language called Pig Latin, • https://pig.apache.org/
  • 58. Hive. • Provides a data warehouse infrastructure on top of Hadoop MapReduce. • It provides tools for easy data summarization • classical data warehouse, • https://hive.apache.org/
  • 59. Map-Reduce-Merge • Map-Reduce-Merge is an extension of the MapReduce model • merging data already partitioned and sorted by map and reduce modules
  • 60. Twister • extension of the MapReduce model that allows the creation of iterative executions of MapReduce jobs. • 1. Configure Map • 2. Configure Reduce • 3. While Condition Holds True Do – a. Run MapReduce – b. Apply Combine Operation to Result – c. Update Condition • 4. Close
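The four-step loop on this slide can be sketched as follows. This is not Twister's API: `run_mapreduce` is a hypothetical single-process stand-in for one MapReduce job, and one-dimensional k-means is an example chosen here only because it needs repeated map and reduce passes until a condition stops holding.

```python
def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for one MapReduce job: map, group by key, reduce."""
    groups = {}
    for record in records:
        for key, value in map_fn(record):
            groups.setdefault(key, []).append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def iterative_kmeans_1d(points, initial_centers, tolerance=1e-9, max_rounds=100):
    centers = list(initial_centers)  # copy so the caller's list is not modified

    # 1. Configure Map: assign each point to the index of its nearest center.
    def map_fn(point):
        nearest = min(range(len(centers)), key=lambda i: abs(point - centers[i]))
        yield nearest, point

    # 2. Configure Reduce: the new center is the mean of the assigned points.
    def reduce_fn(_index, assigned):
        return sum(assigned) / len(assigned)

    rounds = 0
    moved = True
    while moved and rounds < max_rounds:  # 3. While condition holds true
        result = run_mapreduce(points, map_fn, reduce_fn)  # 3a. Run MapReduce
        # 3b. Combine: merge the reduce output into one list of centers,
        # keeping the old value for any center that received no points.
        updated = [result.get(i, c) for i, c in enumerate(centers)]
        # 3c. Update condition: continue only if some center moved.
        moved = any(abs(n - o) > tolerance for n, o in zip(updated, centers))
        centers[:] = updated
        rounds += 1
    return centers, rounds  # 4. Close


final_centers, rounds = iterative_kmeans_1d([1.0, 2.0, 9.0, 10.0], [0.0, 5.0])
```

The point of the pattern is that map and reduce are configured once and reused across rounds, while only the small combined result (here, the list of centers) changes between iterations.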
  • 61. Alternatives to MapReduce • Sphere. • All-Pairs. • DryadLINQ
  • 62. Sphere. • Sector Distributed File System (SDFS). • Sphere implements the stream processing model (Single Program, Multiple Data) • user-defined functions (UDFs) • it is built on top of Sector’s API for data access • UDFs are expressed in terms of programs that read and write streams. • Sphere client sends a request for processing to the master node, which returns the list of available slaves, and the client will choose the slaves on which to execute Sphere processes
  • 63. All-Pairs • Biometrics • (1) model the system; • (2) distribute the data; • (3) dispatch batch jobs; and • (4) clean up the system
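The All-Pairs abstraction compares every element of one set with every element of another and stores the results in a matrix, which is why it suits biometric matching. The sketch below shows only that core computation in one process; it leaves out the four phases listed on the slide (modeling, data distribution, batch dispatch, cleanup), and `hamming_similarity` and the bit-string codes are hypothetical stand-ins for a real biometric comparison.

```python
from itertools import product


def all_pairs(set_a, set_b, compare):
    """Return a matrix M where M[i][j] = compare(set_a[i], set_b[j])."""
    matrix = [[None] * len(set_b) for _ in set_a]
    for (i, a), (j, b) in product(enumerate(set_a), enumerate(set_b)):
        matrix[i][j] = compare(a, b)
    return matrix


def hamming_similarity(x, y):
    """Hypothetical comparison: fraction of positions where two equal-length codes agree."""
    return sum(cx == cy for cx, cy in zip(x, y)) / len(x)


# Made-up codes standing in for biometric templates.
codes = ["1010", "1110", "0001"]
scores = all_pairs(codes, codes, hamming_similarity)
```

Because every cell of the matrix is independent of the others, the work can be split into batch jobs over blocks of the matrix, which is what the dispatch phase exploits.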
  • 64. DryadLINQ. • Microsoft Research project that investigates programming models for writing parallel and distributed programs • small cluster to a large datacenter • Automatically parallelizing the execution of applications without requiring the developer to know about distributed and parallel programming.
  • 65. Aneka MapReduce programming • Developing MapReduce applications on top of Aneka • Mapper and Reducer - Aneka MapReduce APIs
  • 67. • Three classes are of importance for application development: – Mapper<K,V> – Reducer<K,V> – MapReduceApplication<M,R> – The submission and execution of a MapReduce job is performed through the class MapReduceApplication<M,R>
  • 68. • Map Function APIs. • IMapInput<K,V> provides access to the input key-value pair on which the map operation is performed
  • 70. Reduce Function APIs • Reduce (IReduceInputEnumerator < V > input) • reduce operation is applied to a collection of values that are mapped to the same key
  • 72. • MapReduceApplication <M,R> • InvokeAndWait method: ApplicationBase<M,R> • WordCounterMapper and WordCounterReducer classes
  • 73. • The parameters that can be controlled – Partitions – Attempts – SynchReduce – IsInputReady – FetchResults – LogFile
  • 75. Runtime support • Task Scheduling – MapReduceScheduler class. • Task Execution. – MapReduceExecutor
  • 78. Distributed file system support • Unlike the other programming models that Aneka supports, the MapReduce model does not leverage the default Storage Service for storage and data transfer • it uses a distributed file system implementation instead • its file management requirements are significantly different from those of the other models • Distributed file system implementations guarantee high availability and better efficiency
  • 79. • Aneka provides the capability of interfacing with different storage implementations • Retrieving the location of files and file chunks • Accessing a file by means of a stream • classes SeqReader and SeqWriter
  • 81. Example application • MapReduce is a very useful model for processing large quantities of data • Semistructured • logs or Web pages • logs produced by the Aneka container
  • 82. Parsing Aneka logs • Aneka components produce a lot of information that is stored in the form of log files • DD MMM YY hh:mm:ss level - message
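The log layout given on this slide (`DD MMM YY hh:mm:ss level - message`) is regular enough to parse with a single pattern, after which a map step can emit one pair per line and a reduce step can total them. The sketch below is in Python rather than Aneka's .NET API; the sample lines, the level names, and the choice to count entries per level are assumptions made for illustration.

```python
import re
from collections import Counter

# One named group per field of "DD MMM YY hh:mm:ss level - message".
LOG_LINE = re.compile(
    r"^(?P<day>\d{2}) (?P<month>[A-Za-z]{3}) (?P<year>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\S+) - (?P<message>.*)$"
)


def map_log_line(line):
    """Map: emit (level, 1) for each well-formed line; skip anything else."""
    match = LOG_LINE.match(line.strip())
    if match:
        yield match.group("level"), 1


def reduce_counts(pairs):
    """Reduce: total the counts emitted for each level."""
    totals = Counter()
    for level, count in pairs:
        totals[level] += count
    return dict(totals)


# Hypothetical log lines that follow the layout on the slide.
sample = [
    "15 Mar 11 10:30:07 DEBUG - hypothetical message one",
    "15 Mar 11 10:30:08 INFO - hypothetical message two",
    "malformed line without the expected layout",
    "15 Mar 11 10:30:09 DEBUG - hypothetical message three",
]
level_counts = reduce_counts(
    pair for line in sample for pair in map_log_line(line)
)
```

The map step corresponds to the analysis stage, since each line is parsed independently of the others, and the reduce step corresponds to the aggregation stage.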
  • 85. Mapper design and implementation
  • 86. Reducer design and implementation