Introduction to Big Data
What’s Big Data?
 Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using on-
hand database management tools or traditional data
processing applications.
 The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
 The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of
related data, as compared to separate smaller sets with
the same total amount of data, allowing correlations to be
found to "spot business trends, determine quality of
research, prevent diseases, link legal citations, combat
crime, and determine real-time roadway traffic conditions.”
Characteristics Of Big Data
Big data can be described by the following
characteristics:
 Volume
 Variety
 Velocity
 Variability
Big Data: 3V’s
Volume (Scale)
 Data Volume
 44x increase from 2009 to 2020
 From 0.8 zettabytes to 35 zettabytes
 Data volume is increasing exponentially
(Chart: exponential increase in collected/generated data)
 Volume – The name Big Data itself is related to a size which is enormous. The size of the data plays a crucial role in determining its value. Whether a particular data set can actually be considered Big Data or not also depends on its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
12+ TB of tweet data every day
25+ TB of log data every day
? TB of data every day
2+ billion people on the Web by end of 2011
30 billion RFID tags today (1.3 billion in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009… 200 million by 2014
Maximilien Brice, © CERN
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
The Earthscope
• The Earthscope is the world's
largest science project. Designed
to track North America's
geological evolution, this
observatory records data over 3.8
million square miles, amassing 67
terabytes of data. It analyzes
seismic slips in the San Andreas
fault, sure, but also the plume of
magma underneath Yellowstone
and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_scien
Variety (Complexity)
 Relational Data
(Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF), …
 Streaming Data
 You can only scan the data once
 A single application can be
generating/collecting many types of data
 Big Public Data (online, weather, finance,
etc)
To extract knowledge all these types of
data need to linked together
 Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
A Single View to the Customer
(Diagram: a single customer view linking Social Media, Gaming, Entertainment, Banking, Finance, Our Known History, and Purchase data)
Velocity (Speed)
 Data is being generated fast and needs to be processed fast
 Online Data Analytics
 Late decisions → missing opportunities
 Examples
 E-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
 Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurement requires immediate reaction
 Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demand determines the real potential in the data.
 Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
Real-time/Fast Data
 Progress and innovation are no longer hindered by the ability to collect data
 But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and
networks
(measuring all kinds of data)
Real-Time Analytics/Decision Requirement
(Diagram: influencing customer behavior in real time)
 Product recommendations that are relevant & compelling
 Friend invitations to join a game or activity that expands business
 Preventing fraud as it is occurring & preventing more proactively
 Learning why customers switch to competitors and their offers, in time to counter
 Improving the marketing effectiveness of a promotion while it is still in play
Some Make it 4V’s
 Variability – This refers to the inconsistency which the data can show at times, hampering the process of handling and managing the data effectively
Big Data Challenges
The major challenges associated with big data are as follows:
 Capturing data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
 Presentation
Harnessing Big Data
 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
The Model Has Changed…
 The Model of Generating/Consuming Data has
Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming
data
What’s driving Big Data
Traditional analytics:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time nature
The Evolution of Business Intelligence
1990's: BI Reporting, OLAP & Data warehouse (Business Objects, SAS, Informatica, Cognos and other SQL reporting tools)
2000's: Interactive Business Intelligence & In-memory RDBMS (QlikView, Tableau, HANA)
2010's: Big Data: Batch Processing & Distributed Data Store (Hadoop/Spark; HBase/Cassandra) and Big Data: Real Time & Single View (Graph Databases)
(Diagram axes: Speed and Scale)
Big Data Analytics
 Big data is more real-time in
nature than traditional DW
applications
 Traditional DW architectures (e.g.
Exadata, Teradata) are not well-
suited for big data apps
 Shared nothing, massively parallel
processing, scale out architectures
are well-suited for big data apps
Big Data Technology
Distributed File System (DFS)
 A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it was stored on the local client machine.
 The DFS makes it convenient to share
information and files among users on a network in
a controlled and authorized way. The server
allows the client users to share files and store
data just as if they are storing the information
locally. However, the servers have full control
over the data, and give access control to the
clients.
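The "as if it were local" property described above is easiest to see in code. Below is a minimal sketch, assuming a DFS share is already mounted at a placeholder path such as /mnt/dfs/team-share: the client reads it with the ordinary local-file API, while the bytes actually live on remote servers.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Minimal illustration of the "looks local" property of a DFS: the mount
// point below is a placeholder; the file is actually stored on remote servers,
// but the client uses the same API it would use for a local file.
public class DfsLooksLocal {
    public static void main(String[] args) throws Exception {
        Path shared = Path.of("/mnt/dfs/team-share/report.txt"); // placeholder DFS mount
        List<String> lines = Files.readAllLines(shared);         // ordinary local-file API
        lines.forEach(System.out::println);
    }
}
```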
 As there has been exceptional growth in network-
based computing, client/server-based applications
have brought revolutions in the process of building
distributed file systems.
 Sharing storage resources and information on a
network is one of the key elements in both local area
networks (LANs) and wide area networks (WANs).
Different technologies like DFS have been developed
to bring convenience and efficiency to sharing
resources and files on a network, as networks
themselves also evolve.
 One process involved in implementing the DFS is giving access control and storage management controls to the client systems in a centralized way.
 A DFS allows efficient and well-managed data
and storage sharing options on a network
compared to other options. Another option for
users in network-based computing is a shared
disk file system. A shared disk file system puts the
access control on the client’s systems, so that the
data is inaccessible when the client system goes
offline. DFS, however, is fault-tolerant, and the
data is accessible even if some of the network
nodes are offline.
The Google File System
 GFS shares many of the same goals as previous
distributed file systems such as performance,
scalability, reliability, and availability. However, its
design has been driven by key observations of
application workloads and technological
environment, both current and anticipated, that
reflect a marked departure from some earlier file
system design assumptions. GFS reexamined
traditional choices and explored radically different
points in the design space.
DESIGN OVERVIEW
In designing the file system, we have been guided by assumptions that offer both challenges and opportunities.
 The system is built from many inexpensive
commodity components that often fail. It must
constantly monitor itself and detect, tolerate, and
recover promptly from component failures on a
routine basis
 The system stores a modest number of large
files. We expect a few million files, each typically
100 MB or larger in size. Multi-GB files are the
common case and should be managed efficiently.
Small files must be supported, but we need not
optimize for them.
 The system must efficiently implement well-defined
semantics for multiple clients that concurrently
append to the same file. Our files are often used as
producer consumer queues or for many-way merging.
Hundreds of producers, running one per machine, will
concurrently append to a file. Atomicity with minimal
synchronization overhead is essential. The file may
be read later, or a consumer may be reading through the file simultaneously as it is appended.
High sustained bandwidth is more important than
low latency. Most of our target applications place a
premium on processing data in bulk at a high rate,
while few have stringent response time
requirements for an individual read or write.
Interface
GFS provides a familiar file system interface, though it
does not implement a standard API such as POSIX.
Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking.
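Google has never published the GFS client library, so the following is only a hypothetical sketch of what a client interface with the operations just described (create, delete, open, read, write, snapshot, record append) might look like; every name and signature here is invented for illustration.

```java
// Hypothetical sketch of a GFS-style client interface; the real client
// library is internal to Google, so these names and signatures are invented.
public interface GfsClient {
    // Namespace operations on hierarchical pathnames.
    void create(String path) throws java.io.IOException;
    void delete(String path) throws java.io.IOException;

    GfsFile open(String path) throws java.io.IOException;

    // Snapshot copies a file or directory tree at low cost.
    void snapshot(String sourcePath, String targetPath) throws java.io.IOException;
}

interface GfsFile extends java.io.Closeable {
    int read(long offset, byte[] buffer) throws java.io.IOException;
    void write(long offset, byte[] data) throws java.io.IOException;

    // Record append: the system chooses the offset and guarantees the record
    // is written atomically at least once; the chosen offset is returned.
    long recordAppend(byte[] record) throws java.io.IOException;
}
```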
Chunk server
 A chunk server stores the actual data (in some systems, the data of virtual machines and containers) and services requests to it. All data is split into chunks and can be stored in the storage cluster in multiple copies called replicas; storing three replicas therefore requires at least three chunk servers to be set up in the cluster.
Architecture
 A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients. Each of these is
typically a commodity Linux machine running a user-level server
process. It is easy to run both a chunkserver and a client on the
same machine, as long as machine resources permit and the
lower reliability caused by running possibly flaky application code
is acceptable.
Files are divided into fixed-size chunks. Each chunk
is identified by an immutable and globally unique
64-bit chunk handle assigned by the master at the
time of chunk creation.
Chunk servers store chunks on local disks as Linux
files and read or write chunk data specified by a
chunk handle and byte range. For reliability, each chunk is replicated on multiple chunk servers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunk servers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
GFS Master
 Having a single master vastly simplifies our
design and enables the master to make
sophisticated chunk placement and replication
decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read or write file data through the master.
Instead, a client asks the master which chunk
servers it should contact. It caches this
information for a limited time and interacts with
the chunkservers directly for many subsequent
operations.
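A minimal sketch of that read path, assuming hypothetical RPC stubs for the master and the chunkservers (GFS's real RPC interfaces are not public): the client asks the master once per chunk, caches the answer, and then reads from a chunkserver replica directly.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the GFS read path described above; MasterStub and ChunkServerStub
// are hypothetical RPC interfaces, not a published API.
class GfsReadPath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;         // 64 MB chunks

    record ChunkLocation(long chunkHandle, List<String> chunkServers) {}

    interface MasterStub {
        // Master maps (file, chunk index) -> chunk handle + replica locations.
        ChunkLocation locateChunk(String path, long chunkIndex);
    }
    interface ChunkServerStub {
        byte[] readChunk(String server, long chunkHandle, long offsetInChunk, int length);
    }

    private final MasterStub master;
    private final ChunkServerStub chunkServers;
    // Clients cache chunk locations so most reads bypass the master entirely.
    private final Map<String, ChunkLocation> cache = new ConcurrentHashMap<>();

    GfsReadPath(MasterStub master, ChunkServerStub chunkServers) {
        this.master = master;
        this.chunkServers = chunkServers;
    }

    byte[] read(String path, long fileOffset, int length) {
        long chunkIndex = fileOffset / CHUNK_SIZE;              // which chunk holds this offset
        String key = path + "#" + chunkIndex;
        ChunkLocation loc = cache.computeIfAbsent(key,
                k -> master.locateChunk(path, chunkIndex));     // single master round trip
        String replica = loc.chunkServers().get(0);             // pick any replica, e.g. the closest
        return chunkServers.readChunk(replica, loc.chunkHandle(),
                fileOffset % CHUNK_SIZE, length);
    }
}
```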
Chunk Size
 Chunk size is one of the key design parameters.
We have chosen 64 MB, which is much larger
than typical file system block sizes. Each chunk
replica is stored as a plain Linux file on a
chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
 A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information.
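A quick back-of-the-envelope illustration of that first advantage, as a small runnable sketch: streaming a 1 GB file needs only 16 location lookups with 64 MB chunks, versus 16,384 with a 64 KB block size.

```java
// Back-of-the-envelope sketch: how many location lookups a client needs to
// stream a 1 GB file, with a 64 MB chunk size versus a 64 KB block size.
public class ChunkSizeMath {
    public static void main(String[] args) {
        long fileSize = 1L << 30;                    // 1 GB
        long largeChunk = 64L << 20;                 // 64 MB (GFS default)
        long smallBlock = 64L << 10;                 // 64 KB (typical FS block)

        System.out.println("Lookups with 64 MB chunks: " + ceilDiv(fileSize, largeChunk)); // 16
        System.out.println("Lookups with 64 KB blocks: " + ceilDiv(fileSize, smallBlock)); // 16384
    }

    static long ceilDiv(long a, long b) { return (a + b - 1) / b; }
}
```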
Chunk Locations
 The master does not keep a persistent record of which chunkservers have a
replica of a given chunk. It simply polls chunkservers for that information at
startup. The master can keep itself up-to-date thereafter because it controls all
chunk placement and monitors chunkserver status with regular HeartBeat
messages.
 We initially attempted to keep chunk location information persistently at the
master, but we decided that it was much simpler to request the data from chunk
servers at startup, and periodically thereafter. This eliminated the problem of
keeping the master and chunkservers in sync as chunkservers join and leave the
cluster, change names, fail, restart, and so on. In a cluster with hundreds of
servers, these events happen all too often. Another way to understand this design
decision is to realize that a chunkserver has the final word over what chunks it
does or does not have on its own disks. There is no point in trying to maintain a
consistent view of this information on the master because errors on a
chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad
and be disabled) or an operator may rename a chunkserver.
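A rough sketch of the reporting side of this design, with invented names for the on-disk layout and message shape: the chunkserver scans its own disk and tells the master what it actually holds, at startup and in each HeartBeat, so the master never needs a persistent record of chunk locations.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch of the startup/heartbeat report described above: the chunkserver,
// not the master, is the authority on which chunk replicas it holds, so it
// simply scans its local disk and reports. All names here are illustrative.
class ChunkServerReporter {
    private final File chunkDir;   // directory of chunk replica files, e.g. "<handle>.chunk"

    ChunkServerReporter(File chunkDir) { this.chunkDir = chunkDir; }

    // Collect the chunk handles actually present on local disk.
    List<Long> scanLocalChunks() {
        List<Long> handles = new ArrayList<>();
        File[] files = chunkDir.listFiles((dir, name) -> name.endsWith(".chunk"));
        if (files == null) return handles;
        for (File f : files) {
            String name = f.getName();
            handles.add(Long.parseLong(name.substring(0, name.length() - ".chunk".length())));
        }
        return handles;
    }

    // Sent at startup and periodically as part of HeartBeat messages;
    // the master rebuilds its chunk-location map from these reports.
    record HeartBeat(String serverId, List<Long> chunkHandles) {}

    HeartBeat buildHeartBeat(String serverId) {
        return new HeartBeat(serverId, scanLocalChunks());
    }
}
```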
CONCLUSIONS
 The Google File System demonstrates the qualities
essential for supporting large-scale data processing
workloads on commodity hardware.
 GFS has successfully met Google's storage needs and is widely used within Google as the storage platform for research and development as well as production data processing. It is an important tool that enables us to continue to innovate and attack problems on the scale of the entire web.
What is Hadoop?
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS)
Why should I use Hadoop?
HDFS: Key Features
Hadoop Distributed File System (HDFS)
Who uses Hadoop?
What features does Hadoop offer?
When should you choose Hadoop?
When should you avoid Hadoop?
Hadoop application usage
Hadoop Distributed File System (HDFS)
How HDFS works: Split Data
How HDFS works: Replication
Hadoop
Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop Distributed File System (HDFS). On a fully configured cluster, "running Hadoop" means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on a single server, while others run across multiple servers.
The daemons include
NameNode
DataNode
Secondary NameNode
JobTracker
TaskTracker
NameNode
The NameNode is the master of HDFS that directs the slave DataNode daemons
to perform the low-level I/O tasks. It is the bookkeeper of HDFS; it keeps
track of how your files are broken down into file blocks, which nodes store
those blocks and the overall health of the distributed filesystem.
The server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, in order to lower the workload on that machine; the NameNode itself is memory and I/O intensive.
There is unfortunately a negative aspect to the importance of the NameNode: it's a single point of failure of your Hadoop cluster. For any of the other daemons, if their host node fails for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or you can quickly restart it. Not so for the NameNode.
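The file-to-block bookkeeping the NameNode does is visible through the standard Hadoop FileSystem API. Below is a small sketch (the namenode address and file path are placeholders) that asks for the block locations of one file, i.e. the block-to-DataNode mapping served from the NameNode's metadata.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// List how a file is split into blocks and which DataNodes hold each block.
// The namenode address and file path are placeholders.
public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);
        // Block-to-node mapping served from the NameNode's metadata.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```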
DataNode
 Each slave machine in your cluster will host a DataNode
daemon to perform the grunt work of the distributed
filesystem - reading and writing HDFS blocks to actual
files on the local file system
When you want to read or write an HDFS file, the file is broken into blocks and the NameNode will tell your client which DataNode each block resides on. Your client communicates directly with the DataNode daemons to process the local files corresponding to the blocks.
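In client code this whole exchange is hidden behind the HDFS client library: you open a path, the library asks the NameNode for block locations, and the bytes are streamed directly from the DataNodes. A minimal sketch, with a placeholder cluster address and path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Read an HDFS file with the standard client: the library consults the
// NameNode for block locations, then streams data from the DataNodes.
// Addresses and paths are placeholders.
public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // bytes come straight from DataNodes
            }
        }
        fs.close();
    }
}
```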
DataNode
 Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy. This ensures that if any one DataNode crashes or becomes inaccessible over the network, you'll still be able to read the files. DataNodes are constantly reporting to the NameNode. Upon initialization, each of the DataNodes informs the NameNode of the blocks it's currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete blocks from the local disk.
Secondary NameNode(SNN)
 The SNN is an assistant daemon for monitoring the state of the cluster's HDFS. Like the NameNode, each
cluster has one SNN, and it typically resides on its
own machine as well. No other DataNode or
TaskTracker daemons run on the same server. The
SNN differs from the NameNode in that this process
doesn't receive or record any real-time changes to
HDFS. Instead, it communicates with the NameNode
to take snapshots of the HDFS metadata at intervals
defined by the cluster configuration.
As mentioned earlier, the NameNode is a single point
of failure for a Hadoop cluster, and the SNN
snapshots help minimize the downtime and loss of
data.
fsimage (filesystem image) file and edits file
 The HDFS namespace is stored by the NameNode.
The NameNode uses a transaction log called
the EditLog to persistently record every change that
occurs to file system metadata. For example, creating
a new file in HDFS causes the NameNode to insert a
record into the EditLog indicating this. Similarly,
changing the replication factor of a file causes a new
record to be inserted into the EditLog. The
NameNode uses a file in its local host OS file system
to store the EditLog. The entire file system
namespace, including the mapping of blocks to files
and file system properties, is stored in a file called
the FsImage. The FsImage is stored as a file in the
NameNode’s local file system too.
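Both kinds of change mentioned above (creating a file and changing its replication factor) can be triggered through the standard FileSystem API, and each one causes the NameNode to append a record to its EditLog. A minimal sketch with placeholder cluster address and paths:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Each namespace change below causes the NameNode to append a record to its
// EditLog, as described above. Cluster address and paths are placeholders.
public class EditLogDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path p = new Path("/demo/new-file.txt");
        try (FSDataOutputStream out = fs.create(p)) {   // EditLog: file creation
            out.writeUTF("hello hdfs");
        }

        fs.setReplication(p, (short) 2);                // EditLog: replication change

        fs.close();
    }
}
```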
JobTracker
 Once you submit your code to your cluster, the JobTracker determines the execution plan by deciding which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
There is only one JobTracker daemon per Hadoop
cluster. It's typically run on a server as a master node
of the cluster.
TaskTracker
 The computing daemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node.
Each TaskTracker is responsible for executing the
individual tasks that the JobTracker assigns. Although
there is a single TaskTracker per slave node, each
TaskTracker can spawn multiple JVMs to handle many
map or reduce tasks in parallel.
One responsibility of the TaskTracker is to constantly
communicate with the JobTracker. If the JobTracker fails
to receive a heartbeat from a TaskTracker within a
specified amount of time, it will assume the TaskTracker
has crashed and will resubmit the corresponding tasks to
other nodes in the cluster.
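The JobTracker/TaskTracker pair corresponds to the original "mapred" MapReduce API, and the classic word-count driver shows how a job reaches them: JobClient.runJob submits the JobConf to the JobTracker, which then hands map and reduce tasks to TaskTrackers. A minimal sketch; input and output paths come from the command line, and the cluster location is assumed to be in the client's Hadoop configuration.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Classic word count using the original "mapred" API that runs on the
// JobTracker/TaskTracker architecture described above.
public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);           // emit (word, 1) for each token
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();          // sum the counts for this word
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);                      // submits the job to the JobTracker
    }
}
```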
Topology of a typical Hadoop
cluster.