2. Outlines
1. INTRODUCTION
2. WHAT IS BIG DATA
3. BIG DATA GENERATORS
4. CHARACTERISTICS OF BIG DATA
5. BENEFITS OF BIG DATA
6. HADOOP
HDFS
Map Reduce
7. BI VS BIG DATA
5. What is Big data
Very large data sets that may be analyzed computationally to reveal patterns,
trends, and associations, especially relating to customer behavior and
interactions.
Big data in general is defined as “high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making.”
A technology term for data that has become too large to be managed by the
means previously known to work.
6. Big Data generators
This data comes from everywhere:
sensors used to gather climate information,
posts to social media sites,
digital pictures,
online shopping,
airlines,
purchase transaction records, and many more…
This data is “ big data.”
8. Volume
It is the size of the data that determines its value and potential. The name
‘Big Data’ itself contains a term related to size, hence this characteristic.
9. Variety
Data today comes in all types of formats: structured numeric data in traditional
databases, unstructured text documents, email, stock ticker data and financial
transactions, and semi-structured data too.
10. Velocity
The speed at which data is generated and processed to meet the demands and
challenges that lie ahead in the path of growth and development.
12. Benefits of Big Data
Cost Reduction from Big Data Technologies
Time Reduction from Big Data
Developing New Big Data-Based Offerings
Supporting Internal Business Decisions
Real-time big data isn’t just a process for storing petabytes or exabytes of data in
a data warehouse; it’s about the ability to make better decisions and take
meaningful actions at the right time.
13. What is Hadoop
Flexible, available architecture for large-scale computation and data
processing on a network of commodity hardware
Framework that allows for distributed processing of large data sets across clusters
of commodity servers
– Store large amounts of data
– Process the large amounts of data stored
14. Why Hadoop?
open source,
highly reliable,
distributed data processing platform
Handles large amounts of data
Stores data in native format
Delivers linear scalability at low cost
Resilient in case of infrastructure failures
Transparent application scalability
15. HDFS: Hadoop Distributed File System
HDFS enables Hadoop to store huge files. It’s a scalable file system
that distributes and stores data across all machines in a Hadoop cluster.
Scale-Out Architecture - Add servers to increase capacity
High Availability - Serve mission-critical workflows and applications
Fault Tolerance - Automatically and seamlessly recover from failures
Load Balancing - Place data intelligently for maximum efficiency and
utilization
Tunable Replication - Multiple copies of each file provide data protection and
computational performance
16. NameNode and DataNode
DataNode - A piece of software that runs on each slave node of the cluster;
slave nodes make up the majority of the machines in a cluster. The NameNode
places data blocks onto these DataNodes.
NameNode - Runs on a master node and tracks and directs the storage of the
cluster. The blocks that make up the original 150 MB file are tracked by this
separate machine, the NameNode. The information stored here is called
metadata.
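As a rough illustration of the split between metadata and storage, the sketch below (plain Python, not real Hadoop code; the block size, replication factor, and node names are all hypothetical) divides a file into fixed-size blocks, replicates each block across several DataNodes, and keeps only the block-to-node mapping as NameNode-style metadata:

```python
# Toy sketch of HDFS-style block placement (hypothetical, not Hadoop's code).
BLOCK_SIZE = 64    # block size in MB (HDFS defaults have been 64 or 128 MB)
REPLICATION = 3    # tunable replication factor

def split_into_blocks(file_size_mb):
    """Divide a file into block IDs of at most BLOCK_SIZE MB each."""
    n_blocks = -(-file_size_mb // BLOCK_SIZE)  # ceiling division
    return [f"blk_{i}" for i in range(n_blocks)]

def place_blocks(blocks, datanodes, replication=REPLICATION):
    """NameNode-style metadata: map each block to `replication` DataNodes."""
    metadata = {}
    for i, blk in enumerate(blocks):
        # simple round-robin placement; real HDFS also considers racks and load
        metadata[blk] = [datanodes[(i + r) % len(datanodes)]
                         for r in range(replication)]
    return metadata

datanodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(150)          # the 150 MB file from the slide
metadata = place_blocks(blocks, datanodes)
for blk, nodes in metadata.items():
    print(blk, "->", nodes)
```

A 150 MB file yields three 64 MB-capped blocks, each stored on three DataNodes, so any single node can fail without data loss.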
19. MapReduce
MapReduce is a programming model for processing large data sets with a parallel,
distributed algorithm on a cluster
Scale-out Architecture - Add servers to increase processing power
Security & Authentication - Works with HDFS security to make sure that only
approved users can operate against the data in the system
Resource Manager - Employs data locality and server resources to determine optimal
computing operations
Optimized Scheduling - Completes jobs according to prioritization
Flexibility - Procedures can be written in virtually any programming language
Resiliency & High Availability - Multiple job and task trackers ensure that jobs fail
independently and restart automatically
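The programming model above can be illustrated with the classic word-count example, sketched here in plain single-process Python (a stand-in for what Hadoop runs in parallel across a cluster, with the shuffle step normally handled by the framework):

```python
from collections import defaultdict

# Word count in the MapReduce style (single-process sketch, not real Hadoop).

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big insight", "data at rest and data in motion"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

Because each mapper works on its own document and each reducer on its own key, both phases parallelize naturally across a cluster, which is exactly what the scale-out and data-locality features above exploit.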