The book of elephant tattoo

Elephant Tattoo
1 | P a g e
Mohamed Magdy
The Book
OF The
Elephant
Tattoo
ElephantTattoo

Elephant Tattoo
2 | P a g e
Elephant Tattoo
Mohamed Magdy
Mail:
My work :
Transforming Data into Business Value
Every enterprise is powered by data. We take information in & analyze it,
manipulate it, and create more as output. Every application creates data,
whether it is log messages, metrics, user activity, outgoing messages, or
something else. Every byte of data has a story to tell
About : Mohamed Magdy
Magdy is A Big Data Engineer & Data scientist
Master's Degree in Informatics (present)
Professional Diploma in Big Data and Data Science (Nile University)
Bachelor Degree in Information System
(OCP) Oracle Certified Professional
(OCP) Oracle Certified Professional Apps
LinkidIn &Twitter Url :

Elephant Tattoo
4 | P a g e
The Book OF The
Elephant Tattoo

Elephant Tattoo
5 | P a g e
Whats BIG DATA ?
Big data is a term that describes the large volume of data –
both structured and unstructured – that inundates a business
on a day-to-day basis. But it’s not the amount of data that’s
important. It’s what organizations do with the data that
matters. Big data can be analyzed for insights that lead to
better decisions and strategic business moves.

Elephant Tattoo
6 | P a g e
Big Data History & Current Considerations ?
While the term “big data” is relatively new, the act of gathering and storing large
amounts of information for eventual analysis is ages old. The concept gained
momentum in the early 2000s when industry analyst Doug Laney articulated the
now-mainstream definition of big data as the three Vs:
Volume. Organizations collect data from a variety of sources, including business transactions, social
media and information from sensor or machine-to-machine data. In the past, storing it would’ve been
a problem – but new technologies (such as Hadoop) have eased the burden.
Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner.
RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real
time.
Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to
unstructured text documents, email, video, audio, stock ticker data and financial transactions.

Elephant Tattoo
7 | P a g e
Why Big Data ?
The importance of big data doesn’t revolve around how much data you have, but
what you do with it. You can take data from any source and analyze it to find
answers that enable 1) cost reductions, 2) time reductions, 3) new product
development and optimized offerings, and 4) smart decision making. When you
combine big data with high-powered analytics, you can accomplish business-
related tasks such as:
Determining root causes of failures, issues and defects in near-real time.
Generating coupons at the point of sale based on the customer’s buying habits.
Recalculating entire risk portfolios in minutes.
Detecting fraudulent behavior before it affects your organization.
Data, in today’s business and technology world, is indispensable. The Big Data technologies and initiatives are rising
to analyze this data for gaining insights that can help in making strategic decisions. The concept evolved at the
beginning of 21st
century, and every technology giant is now making use of Big Data technologies. Big Data refers to
vast and voluminous data sets that may be structured or unstructured. This massive amount of data is produced every
day by businesses and users. Big Data analytics is the process of examining the large data sets to underline insights
and patterns. The Data analytics field in itself is vast.
Importance of Big Data Analytics
The Big Data analytics is indeed a revolution in the field of Information Technology.
The use of Data analytics by the companies is enhancing every year. The primary
focus of the companies is on customers. Hence the field is flourishing in Business to
Consumer (B2C) applications.We divide the analytics into different types as per the
nature of the environment. We have three divisions of Big Data analytics:
Prescriptive Analytics, Predictive Analytics, and Descriptive Analytics. This field
offers immense potential, and in this blog, we will discuss four perspectives to
explain why big data analytics is so important today?

Elephant Tattoo
8 | P a g e
 Data Science Perspective
 Business Perspective
 Real-time Usability Perspective
 Job Market Perspective
Big Data Analytics and Data Sciences
The analytics involves the use of advanced techniques and tools of analytics on the
data obtained from different sources in different sizes. Big data has the properties of
high variety, volume, and velocity. The data sets come from various online networks,
web pages, audio and video devices, social media, logs and many other sources.
Big Data analytics involves the use of analytics techniques like machine learning,
data mining, natural language processing, and statistics. The data is extracted,
prepared and blended to provide analysis for the businesses.

Elephant Tattoo
9 | P a g e
Businesses and Big Data Analytics
Big Data analytics tools and techniques are rising in demand due to the use of Big Data in
businesses. Organizations can find new opportunities and gain new insights to run their business
efficiently. These tools help in providing meaningful information for making better business
decisions.
The companies can improve their strategies by keeping in mind the customer focus. Big data
analytics efficiently helps operations to become more effective. This helps in improving the
profits of the company.
Big data analytics tools like Hadoop helps in reducing the cost of storage. This further increases
the efficiency of the business. With latest analytics tools, analysis of data becomes easier and
quicker. This, in turn, leads to faster decision making saving time and energy.
What is Hadoop ?
Hadoop is an open-source software framework for storing data and
running applications on clusters of commodity hardware. It provides

Elephant Tattoo
10 | P a g e
massive storage for any kind of data, enormous processing power
and the ability to handle virtually limitless concurrent tasks or jobs.
Hadoop core modules?
Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across
multiple machines without prior organization. Version (HDFS-1), (HDFS-2) ,(HDFS-3)
YARN – (Yet Another Resource Negotiator) provides resource management for the processes
running on Hadoop.

Elephant Tattoo
11 | P a g e
MapReduce – a parallel processing software framework. It is comprised of two steps. Map step is a
master node that takes inputs and partitions them into smaller sub problems and then distributes
them to worker nodes. After the map step has taken place, the master node takes the answers to all
of the sub problems and combines them to produce output.
Hadoop History ?
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally
developed to support distribution for the Nutch search engine project. Doug, who was
working at Yahoo! at the time and is now Chief Architect of Cloudera, named the project
after his son's toy elephant. Cutting's son was 2 years old at the time and just beginning to
talk. He called his beloved stuffed yellow elephant "Hadoop"
Apache Hadoop's MapReduce and HDFS components originally derived respectively from
Google's MapReduce and Google File System (GFS) papers.
What is GFS Google File System ?
a scalable distributed file system for large distributed data-intensive applications. It provides fault
tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance
to a large number of clients. While sharing many of the same goals as previous distributed file systems
designed and implemented the Google File System (GFS) to meet the rapidly growing demands
of Google’s data processing needs. GFS shares many of the same goals as previous distributed
file systems such as performance, scalability, reliability, and availability
Big Data Glossary
While we've attempted to define concepts as we've used them throughout the guide, sometimes it's
helpful to have specialized terminology available in a single place:
 Big data: Big data is an umbrella term for datasets that cannot reasonably be handled by traditional
computers or tools due to their volume, velocity, and variety. This term is also typically applied to
technologies and strategies to work with this type of data.
 Batch processing: Batch processing is a computing strategy that involves processing data in large
sets. This is typically ideal for non-time sensitive work that operates on very large sets of data. The
process is started and at a later time, the results are returned by the system.

Elephant Tattoo
12 | P a g e
 Cluster computing: Clustered computing is the practice of pooling the resources of multiple machines
and managing their collective capabilities to complete tasks. Computer clusters require a cluster
management layer which handles communication between the individual nodes and coordinates work
assignment.
 Data lake: Data lake is a term for a large repository of collected data in a relatively raw state. This is
frequently used to refer to the data collected in a big data system which might be unstructured and
frequently changing. This differs in spirit to data warehouses (defined below).
 Data mining: Data mining is a broad term for the practice of trying to find patterns in large sets of data.
It is the process of trying to refine a mass of data into a more understandable and cohesive set of
information.
 Data warehouse: Data warehouses are large, ordered repositories of data that can be used for
analysis and reporting. In contrast to a data lake, a data warehouse is composed of data that has
been cleaned, integrated with other sources, and is generally well-ordered. Data warehouses are often
spoken about in relation to big data, but typically are components of more conventional systems.
 ETL: ETL stands for extract, transform, and load. It refers to the process of taking raw data and
preparing it for the system's use. This is traditionally a process associated with data warehouses, but
characteristics of this process are also found in the ingestion pipelines of big data systems.
.
 In-memory computing: In-memory computing is a strategy that involves moving the working datasets
entirely within a cluster's collective memory. Intermediate calculations are not written to disk and are
instead held in memory. This gives in-memory computing systems like Apache Spark a huge
advantage in speed over I/O bound systems like Hadoop's MapReduce.
 Machine learning: Machine learning is the study and practice of designing systems that can learn,
adjust, and improve based on the data fed to them. This typically involves implementation of predictive
and statistical algorithms that can continually zero in on "correct" behavior and insights as more data
flows through the system.
 Map reduce (big data algorithm): Map reduce (the big data algorithm, not Hadoop's MapReduce
computation engine) is an algorithm for scheduling work on a computing cluster. The process involves
splitting the problem set up (mapping it to different nodes) and computing over them to produce
intermediate results, shuffling the results to align like sets, and then reducing the results by outputting a
single value for each set.
 NoSQL: NoSQL is a broad term referring to databases designed outside of the traditional relational
model. NoSQL databases have different trade-offs compared to relational databases, but are often
well-suited for big data systems due to their flexibility and frequent distributed-first architecture.
 Stream processing: Stream processing is the practice of computing over individual data items as they
move through a system. This allows for real-time analysis of the data being fed to the system and is
useful for time-sensitive operations using high velocity metrics.

Elephant Tattoo
13 | P a g e
Hadoop Architecture

Elephant Tattoo
14 | P a g e
HDFS Architecture
Hadoop File System was developed using distributed file system design. It is run on
commodity hardware. Unlike other distributed systems, HDFS is highly faulttolerant and
designed using low-cost hardware.
HDFS holds very large amount of data and provides easier access. To store such huge
data, the files are stored across multiple machines. These files are stored in redundant
fashion to rescue the system from possible data losses in case of failure. HDFS also makes
applications available to parallel processing

Elephant Tattoo
15 | P a g e
HDFS Hold Meta Data (information about data Like Location of Blocks and Parts of the
Files and Number of copy of the file ) in A server(PYSICAL COMPUTER) Named :Master
NODE
And The another Servers That Named :Data Node that hold the data it self the file csv or
txt or json file , read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the name node.
What is Block ?
Generally the user data is stored in the files of HDFS. The file in a file system will be
divided into one or more segments and/or stored in individual data nodes. These file
segments are called as blocks. In other words, the minimum amount of data that HDFS
can read or write is called a Block. The default block size is 64MB ,128MB, but it can be
increased as per the need to change in HDFS configuration.

Elephant Tattoo
16 | P a g e
Example Of HDFS
Think of a file that contains the phone numbers for everyone in the United States; the people with a
last name starting with A might be stored on server 1, B on server 2, and so on.
In a Hadoop world, pieces of this phonebook would be stored across the cluster, and to reconstruct
the entire phonebook, your program would need the blocks from every server in the cluster. To
achieve availability as components fail, HDFS replicates these smaller pieces onto two additional
servers by default. (This redundancy can be increased or decreased on a per-file basis or for a whole
environment;
for example, a development Hadoop cluster typically doesn’t need any data redundancy.) This
redundancy offers multiple benefits, the most obvious being higher availability.
In addition, this redundancy allows the Hadoop cluster to break work up into smaller chunks and run
those jobs on all the servers in the cluster for better scalability. Finally, you get the benefit of data
locality, which is critical when working with large data sets

Elephant Tattoo
17 | P a g e
HDFS VERSIONS 1,2,3
HDFS 1 : Supports MapReduce (MR) processing model only. Don’t supports any another tool
for processing
Has limited scaling of nodes. Limited to 4000 nodes per cluster
Works on concepts of slots – slots can run either a Map task or a Reduce task only.
A single Namenode to manage the entire namespace.
Has a limitation to serve as a platform for event processing, streaming and real-time operations
A Namenode failure affects the stack.
HDFS 2:
Allows to work in MR as well as other distributed computing models like Spark, Hama, Giraph, Message
Passing Interface) MPI & HBase coprocessors.
Works on concepts of containers. Using containers can run generic tasks.
Multiple Namenode servers manage multiple namespaces.

Elephant Tattoo
18 | P a g e
Has to feature to overcome SPOF with a standby Namenode and in the case of
Namenode failure, it is configured for automatic recovery.
Can serve as a platform for a wide variety of data analytics-possible to run event processing, streaming and real-time
operations.
The Hadoop stack – Hive, Pig, HBase etc. are all equipped to handle Namenode failure.
High Availability : If the Namenode is down due to some unplanned event such as a machine crash, the
whole Hadoop cluster will be down as well.
Hadoop2.x comes with the solution for this problem, which allows users to configure clusters with
redundant namenodes, removing the chance that a lone name node will become a single point of failure
within a cluster.
HDFS 3:
Support for Erasure Coding in HDFS
Considering the rapid growth trends in data and data center hardware, support for erasure coding in
Hadoop 3.0 is an important feature in years to come. Erasure Coding is a 50 years old technique that
lets any random piece of data to be recovered based on other piece of data i.e. metadata stored
around it. Erasure Coding is more like an advanced RAID technique that recovers data automatically
when hard disk fails.
JDK 8 is the minimum runtime version of JAVA required to run Hadoop 3.x as many dependency
library files have been used from JDK 8.
Storage overhead in Hadoop 3.0 is reduced to 50% with support for Erasure Coding. In this case, if
here are 8 data blocks then a total of only 12 blocks will occupy the storage space.
Hadoop 3.0 supports 2 or more Name Nodes.

Elephant Tattoo
19 | P a g e
Yarn (Yet Another Resource Negotiator)
Apache Hadoop YARN is the resource management and job scheduling technology in the open
source Hadoop distributed processing framework. One of Apache Hadoop's core components, YARN
is responsible for allocating system resources to the various applications running in a Hadoop
cluster and scheduling tasks to be executed on different cluster nodes
In a cluster architecture, Apache Hadoop YARN sits between HDFS and the processing engines
being used to run applications. It combines a central resource manager with containers, application
coordinators and node-level agents that monitor processing operations in individual cluster
nodes. YARN can dynamically allocate resources to applications as needed, a capability designed to
improve resource utilization and application performance compared with Map Reducer’s more static
allocation approach.
Why YARN?
In Hadoop version 1.0 which is also referred to as MRV1 (Map Reduce Version 1), Map Reduce performed both
processing and resource management functions. It consisted of a Job Tracker which was the single master. The
Job Tracker allocated the resources, performed scheduling and monitored the processing jobs. It assigned map
and reduce tasks on a number of subordinate processes called the Task Trackers. The Task Trackers
periodically reported their progress to the Job Tracker.

Elephant Tattoo
20 | P a g e
YARN Architecture
YARN enabled the users to perform operations as per requirement by using a variety of tools
like Spark for real-time processing, Hive for SQL, HBase for NoSQL and others.
Apart from Resource Management, YARN also performs Job Scheduling. YARN performs all your
processing activities by allocating resources and scheduling tasks. Apache Hadoop YARN Architecture
consists of the following main components:
1. Resource Manager: Runs on a master daemon and manages the resource allocation in the cluster.
2. Node Manager: They run on the slave daemons and are responsible for the execution of a task on every
single Data Node.
3. Application Master: Manages the user job lifecycle and resource needs of individual applications. It
works along with the Node Manager and monitors the execution of tasks.
4. Container: Package of resources including RAM, CPU, Network, HDD etc on a single node.
Components of YARN
You can consider YARN as the brain of your Hadoop Ecosystem. The image below represents the
YARN Architecture.

Elephant Tattoo
21 | P a g e
The first component of YARN Architecture is,
Resource Manager
 It is the ultimate authority in resource allocation.
 On receiving the processing requests, it passes parts of requests to corresponding node managers
accordingly, where the actual processing takes place.
 It is the arbitrator of the cluster resources and decides the allocation of the available resources for
competing applications.
 Optimizes the cluster utilization like keeping all resources in use all the time against various constraints
such as capacity guarantees, fairness, and SLAs.
 It has two major components: a) Scheduler b) Application Manager a) Scheduler
 The scheduler is responsible for allocating resources to the various running applications subject to
constraints of capacities, queues etc.
 It is called a pure scheduler in Resource Manager, which means that it does not perform any monitoring
or tracking of status for the applications.
 If there is an application failure or hardware failure, the Scheduler does not guarantee to restart the failed
tasks.

Elephant Tattoo
22 | P a g e
 Performs scheduling based on the resource requirements of the applications.
 It has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the
various applications. There are two such plug-ins: Capacity Scheduler and Fair Scheduler, which are
currently used as Schedulers in Resource Manager.
Application Manager
 It is responsible for accepting job submissions.
 Negotiates the first container from the Resource Manager for executing the application specific
Application Master.
 Manages running the Application Masters in a cluster and provides service for restarting the Application
Master container on failure.
Node Manager
 It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow on the given
node.
 It registers with the Resource Manager and sends heartbeats with the health status of the node.
 Its primary goal is to manage application containers assigned to it by the resource manager.
 It keeps up-to-date with the Resource Manager.
 Application Master requests the assigned container from the Node Manager by sending it a Container
Launch Context (CLC) which includes everything the application needs in order to run. The Node
Manager creates the requested container process and starts it.
 Monitors resource usage (memory, CPU) of individual containers.
 Performs Log management.
 It also kills the container as directed by the Resource Manager.
Application Master
 An application is a single job submitted to the framework. Each such application has a unique Application
Master associated with it which is a framework specific entity.
 It is the process that coordinates an application’s execution in the cluster and also manages faults.

Elephant Tattoo
23 | P a g e
 Its task is to negotiate resources from the Resource Manager and work with the Node Manager to execute
and monitor the component tasks.
 It is responsible for negotiating appropriate resource containers from the ResourceManager, tracking
their status and monitoring progress.
 Once started, it periodically sends heartbeats to the Resource Manager to affirm its health and to update
the record of its resource demands.
The fourth component is:
Container
 It is a collection of physical resources such as RAM, CPU cores, and disks on a single node.
 YARN containers are managed by a container launch context which is container life-cycle(CLC). This
record contains a map of environment variables, dependencies stored in a remotely accessible storage,
security tokens, payload for Node Manager services and the command necessary to create the process.
 It grants rights to an application to use a specific amount of resources (memory, CPU etc.) on a specific
host.
Application Submission in YARN
Refer to the image and have a look at the steps involved in application submission of Hadoop YARN:
1) Submit the job
2) Get Application ID
3) Application Submission Context
4) a: Start Container Launch
B: Launch Application Master
5) Allocate Resources
6) a: Container
B: Launch
7) Execute

Elephant Tattoo
24 | P a g e
Refer to the given image and see the following steps involved in Application workflow of Apache
Hadoop YARN:
1. Client submits an application
2. Resource Manager allocates a container to start Application Manager
3. Application Manager registers with Resource Manager
4. Application Manager asks containers from Resource Manager
5. Application Manager notifies Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts Resource Manager/Application Manager to monitor application’s status
8. Application Manager unregisters with Resource Manager
MapReduce ?
The term "MapReduce" actually refers to two separate and distinct tasks that Hadoop programs
perform. The first is the map job, which takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key/value pairs).
The reduce job takes the output from a map as input and combines those data tuples into a smaller
set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed
after the map job.

Elephant Tattoo
25 | P a g e
Example OF MapReduce job
Let’s look at a simple example.
Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms)
that represent a city and the corresponding temperature recorded in that city for the various
measurement days. Of course we’ve made this example very simple so it’s easy to follow. You can
imagine that a real application won’t be quite so simple, as it’s likely to contain millions or even
billions of rows, and they might not be neatly formatted rows at all; in fact, no matter how big or small
the amount of data you need to analyze, the key principles we’re covering here remain the same.
Either way, in this example, city is the key and temperature is the value. Toronto, 20 Whitby, 25 New
York, 22 Rome, 32 Toronto, 4 Rome, 33 New York, 18 Out of all the data we have collected, we want
to find the maximum temperature for each city across all of the data files (note that each file might
have the same city represented multiple times). Using the MapReduce framework, we can break this
down into five map tasks, where each mapper works on one of the five files and the mapper task
goes through the data and returns the maximum temperature for each city. For example, the results
produced from one mapper task for the data above would look like this: (Toronto, 20) (Whitby, 25)
(New York, 22) (Rome, 33) Let’s assume the other four mapper tasks (working on the other four files
not shown here) produced the following intermediate results: (Toronto, 18) (Whitby, 27) (New York,
32) (Rome, 37)(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)(Toronto, 22) (Whitby, 19) (New
York, 20) (Rome, 31)(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30) All five of these output
streams would be fed into the reduce tasks, which combine the input results and output a single value
for each city, producing a final result set as follows: (Toronto, 32) (Whitby, 27) (New York, 33) (Rome,
38) As an analogy, you can think of map and reduce tasks as the way a census was conducted in
Roman times, where the census bureau would dispatch its people to each city in the empire. Each
census taker in each city would be tasked to count the number of people in that city and then return
their results to the capital city. There, the results from each city would be reduced to a single count
(sum of all cities) to determine the overall population of the empire. This mapping of people to cities,
in parallel, and then combining the results (reducing) is much more efficient than sending a single
person to count every person in the empire in a serial fashion.

Elephant Tattoo
26 | P a g e
Another Big Data Tools

Elephant Tattoo
27 | P a g e
SPARK ?
Spark was introduced by Apache Software Foundation for speeding up the Hadoop
computational computing software process.
As against a common belief, Spark is not a modified version of Hadoop and is not, really,
dependent on Hadoop because it has its own cluster management. Hadoop is just one of
the ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark
has its own cluster management computation, it uses Hadoop for storage purpose only.
Spark is a lightning-fast cluster computing technology, designed for fast computation. It is
based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for
more types of computations, which includes interactive queries and stream processing. The
main feature of Spark is its in-memory cluster computing that increases the processing
speed of an application.
Features of Apache Spark
Apache Spark has following features.
 Speed − Spark helps to run an application in Hadoop cluster, up to 100 times faster in memory,
and 10 times faster when running on disk. This is possible by reducing number of read/write
operations to disk. It stores the intermediate processing data in memory.
 Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python.
Therefore, you can write applications in different languages. Spark comes up with 80 high-level
operators for interactive querying.
 Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also supports SQL
queries, Streaming data, Machine learning (ML), and Graph algorithms.

Elephant Tattoo
28 | P a g e
 The following diagram shows three ways of how Spark can be built with Hadoop
components.
Components of Spark
Apache Spark Core
Spark Core is the underlying general execution engine for spark platform that all other
functionality is built upon. It provides In-Memory computing and referencing datasets in
external storage systems.

Elephant Tattoo
29 | P a g e
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed
Datasets) transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework above Spark because of the distributed
memory-based Spark architecture. It is, according to benchmarks, done by the MLlib
developers against the Alternating Least Squares (ALS) implementations. Spark MLlib is
nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout
gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API
for expressing graph computation that can model the user-defined graphs by using Pregel
abstraction API. It also provides an optimized runtime for this abstraction.
Apache Spark Abstractions & Concepts
In the below section, it is briefly discussed regarding the abstractions and concepts of Spark -
 RDD (Resilient Distributed Dataset) - RDD is the central and significant unit of data in Apache Spark. It is
a distributed collection of components across cluster nodes and can implement parallel operations.
Parallelized Collections, External datasets and Existing RDDs are the three methods for creating RDD.
 DAG (Direct Acyclic Graph) - DAG is a designated graph with no inscribed sequences. It reads data from
HDFS and Map & Reduce operations are applied. It comprises a series of vertices such that every edge is
directed from initial to succeeding in the progression.
 Spark Shell - An interactive Shell that can execute a command line of the application effective because of
the interactive testing and capability to read a large amount of data sources in various types.
 Transformations - It builds a new RDD from the existing one. It transfers the dataset to the function and
then returns the new dataset.
 Actions - It turns final result to driver program or corresponds it to the external data store.
Apache Spark Architecture
Spark is accessible, intense, powerful and proficient Big Data tool for handling different enormous
information challenges. Apache Spark takes after an ace/slave engineering with two primary
Daemons and a Cluster Manager –
 Master Daemon – (Master/Driver Process)
 Worker Daemon – (Slave Process)

Elephant Tattoo
30 | P a g e
A spark cluster has a solitary Master and many numbers of Slaves/Workers. The driver and the
agents run their individual Java procedures and users can execute them on individual machines.
Below are the three methods of building Spark with Hadoop Components (these three components
are strong pillars of Spark Architecture)

The book of elephant tattoo

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (19)

Ähnlich wie The book of elephant tattoo

Ähnlich wie The book of elephant tattoo (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

The book of elephant tattoo