2. Big Data
Big data can be defined as large volumes of data, too complex to be dealt with by traditional processing
technologies.
Organizations want to get insightful information from big data to know their customers and get a
competitive advantage.
[Figure: data then vs. data now]
3. Apache Hadoop
• An open-source framework, written in Java, that processes big data in parallel and stores it on distributed file systems linked together in clusters.
Why use Hadoop?
• Data has grown in volume and variety at high velocity.
• Big organizations wanted to get value from their data
for revenue and profit.
• There was a need for distributed storage machines
where big data could be stored and processed.
4. Components of Hadoop
Hadoop is made up of individual components that enable it to store and process data.
1. Hadoop Common
• Hadoop Common provides a collection of utilities and libraries that support the other Hadoop modules.
• It contains the necessary Java archive (JAR) files and scripts required to start Hadoop.
5. 2. HDFS
• Hadoop Distributed File System
(HDFS)
• Hadoop storage is handled by
HDFS.
• HDFS replicates multiple copies of data across the nodes, which are grouped into racks in a cluster.
• HDFS uses a master-slave architecture.
• Name Node: the master node that monitors the data nodes and holds all metadata.
• Data Nodes: slave nodes that contain the actual data in the form of blocks. They frequently report their status to the Name Node through heartbeat signals.
• Secondary Name Node: keeps a copy of the Name Node's metadata on disk.
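The master-slave split above can be pictured with a toy simulation. This is plain Python, not the real HDFS API; the class and method names (`DataNode`, `NameNode`, `heartbeat`, `write_block`) are illustrative, though the replication factor of 3 does match HDFS's default:

```python
import random

REPLICATION_FACTOR = 3  # HDFS's default; assumed here for illustration

class DataNode:
    """Slave node: holds data blocks and reports status via heartbeats."""
    def __init__(self, name):
        self.name = name
        self.blocks = set()
        self.alive = True

    def heartbeat(self):
        # A real DataNode sends these to the NameNode every few seconds.
        return self.alive

class NameNode:
    """Master node: stores only metadata (block id -> DataNode locations)."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}

    def write_block(self, block_id):
        # Replicate each block to REPLICATION_FACTOR distinct nodes.
        targets = random.sample(self.datanodes, REPLICATION_FACTOR)
        for node in targets:
            node.blocks.add(block_id)
        self.block_map[block_id] = targets

    def locate(self, block_id):
        # Clients ask the NameNode where a block lives, then read the data
        # directly from any live DataNode that still holds a replica.
        return [n for n in self.block_map[block_id] if n.heartbeat()]

nodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(nodes)
nn.write_block("blk_0001")
nodes[0].alive = False            # simulate one DataNode failing
survivors = nn.locate("blk_0001")
print(len(survivors))             # at least 2 replicas remain reachable
```

This is why the slide calls HDFS fault tolerant: losing one node still leaves other replicas readable, and the Name Node detects the failure through missing heartbeats.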
6. 3. MapReduce
• MapReduce is a
component that uses
simple programming
models to process
huge amounts of data
in a parallel and
distributed manner on
large clusters of
commodity hardware.
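The classic illustration of this programming model is word count. Below is a minimal single-process sketch of the three phases; real Hadoop jobs implement Mapper and Reducer classes in Java and the framework runs them distributed across the cluster, so this only mimics the data flow:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit (key, value) pairs, here (word, 1).
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group all values by key, as the framework
    # does automatically between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: aggregate the grouped values for each key.
    return key, sum(values)

lines = ["big data needs big storage", "hadoop stores big data"]
mapped = chain.from_iterable(mapper(l) for l in lines)  # parallel on a real cluster
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["big"])  # 3
```

On a cluster, each mapper runs on the node that already holds its input block (see the HDFS slide), which is what "moving computation to the data" means.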
7. 4. YARN
• Yet Another Resource
Negotiator (YARN).
• YARN is responsible for
allocating system resources to
the various applications
running in a Hadoop cluster
and scheduling tasks to be
executed on different cluster
nodes.
• YARN decentralizes
execution and monitoring of
processing jobs by separating
the various responsibilities
into these components:
1. Resource Manager
2. Node Manager
3. Application Master
4. Containers
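One way to picture how these components divide the work is a toy allocator. The class names mirror YARN's components, but the first-fit scheduling logic and memory figures here are invented for illustration; real YARN schedulers (capacity, fair) are far more sophisticated:

```python
class ResourceManager:
    """Cluster-wide arbiter: grants containers out of NodeManager capacity."""
    def __init__(self, node_capacity):
        # node name -> free memory in MB (toy view of NodeManager reports)
        self.free = dict(node_capacity)

    def allocate(self, mem_mb):
        # Grant a container on the first node with enough free memory.
        for node, free in self.free.items():
            if free >= mem_mb:
                self.free[node] -= mem_mb
                return {"node": node, "mem_mb": mem_mb}
        return None  # request waits until resources free up

class ApplicationMaster:
    """Per-application coordinator: requests containers for its tasks."""
    def __init__(self, rm):
        self.rm = rm

    def run_tasks(self, n_tasks, mem_mb):
        granted = [self.rm.allocate(mem_mb) for _ in range(n_tasks)]
        return [c for c in granted if c is not None]

rm = ResourceManager({"node1": 4096, "node2": 2048})
am = ApplicationMaster(rm)
containers = am.run_tasks(n_tasks=3, mem_mb=1024)
print(len(containers))  # all 3 tasks received a container
```

The key decentralization point survives even in this sketch: the Resource Manager only hands out containers, while each application's own Application Master tracks and drives its tasks.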
8. HADOOP ECOSYSTEM
• Being a framework, Hadoop is
made of several modules that are
supported by a large ecosystem
of technologies.
• All are services that help solve big data problems.
• It includes Apache projects and various commercial tools and solutions that supplement or support the four major components covered in the previous slides.
9. Hadoop Strengths & Weaknesses
Weaknesses
• Hadoop performs poorly when it must access many small files.
• Not good for real-time applications, because it uses batch processing.
• Not good when the work cannot be done in parallel or when there are dependencies in the data.
• Supports Machine Learning and Artificial Intelligence only to a limited extent.
• Not good for intensive calculations on small amounts of data.
Strengths
• Hadoop runs at a lower cost, since it can rely on any type of disk storage for data processing.
• Flexibility: Hadoop can deal with structured or unstructured data.
• Fault tolerant, because data is replicated on multiple nodes.
10. HDFS VS PVFS
Similarities
• Both HDFS and PVFS divide a file into multiple pieces, called chunks in HDFS and
stripe units in PVFS, that are stored on different data servers.
• HDFS and PVFS have a similar high-level design. They are user-level cluster file
systems that store file data and file metadata on different types of servers, i.e., two
different user-level processes that run on separate nodes and use the lower-layer local
file systems for persistent storage.
Differences
• HDFS is designed for data-intensive computing; PVFS is designed for high-performance computing.
• HDFS co-locates compute and storage on the same node (beneficial to the Hadoop/MapReduce model, where computation is moved closer to the data); PVFS uses separate compute and storage nodes (easier manageability and incremental growth).
• HDFS is not optimized for small files; PVFS uses a few optimizations for packing small files.
11. Hadoop Projection
• Big data is increasing exponentially, hence the urgent need for technologies that can store, process, and analyze big data in real time, for reasons such as gaining competitive advantage and increasing profits and revenue.
• Hadoop is the backbone of big data; vendors such as Microsoft Azure, MapR, Databricks, Amazon Web Services, etc. have developed cutting-edge technologies on top of it.
12. Hadoop Projection
• Major players such as NASA, Yahoo, and Adobe are moving towards Apache Spark, which is also developed by the Apache Software Foundation.
• Spark is an open-source distributed computing engine for processing and analyzing huge volumes of data in real time.
• Apache Spark is compact, up to 100x faster in memory and 10x faster on disk than Hadoop. Its ecosystem contains well-built features that are continually being improved.
• Spark can perform Machine Learning through its own MLlib, which performs iterative in-memory ML computations.
• Apache Spark replaces the MapReduce component of Hadoop, but not Hadoop as a whole.
13. References
• A. G. Wendy, R. M. Mohammad, H. Marleen and F. Frans, "Debating big data: A literature review on realizing value from big data," The Journal of Strategic Information Systems, vol. 26, no. 3, pp. 191-209, 2017.
• A. A. Ifeyinwa and F. N. Henry, "Big Data and Business: Trends, Platforms, Success Factors and Applications," Big Data and Cognitive Computing, vol. 3, no. 2, 2019.
• K. Khushboo and G. Neeraj, "Analysis of Hadoop MapReduce scheduling in heterogeneous environment," Ain Shams Engineering Journal, pp. 1101-1110, 2021.
• H. H. Baydaa and R. Z. Subhi, "Improvised Distributions framework of Hadoop: A review," International
journal of Science and Business, pp. 31-41, 2021.
• R. Z. Rizgar, R. M. Z. Subhi, M. S. Hanan and M. H. Lailan, "Characteristics and Analysis of Hadoop Distributed Systems," pp. 1555-1564, 2020.
• A. Otmane and F. Renaud, "Processing of Big Data with Apache Hadoop in the Current Challenging Era
of COVID-19," Big Data and Cognitive Computing, vol. 5, no. 1, 2021.
• W. Meng, W. Chase Q., C. Huiyan, L. Yang, W. Yongqiang and H. Aiqin, "On MapReduce Scheduling in
Hadoop Yarn on Heterogeneous Clusters," in 2018 17th IEEE International Conference On Trust, Security
And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data
Science And Engineering (TrustCom/BigDataSE), 2018.
• Hadoop, "Apache Hadoop YARN," Apache Software Foundation, 21 February 2022. [Online]. Available:
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html.