SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Elephant Tattoo
1 | P a g e
Mohamed Magdy
The Book
OF The
Elephant
Tattoo
ElephantTattoo
Elephant Tattoo
2 | P a g e
Elephant Tattoo
Mohamed Magdy
Mail:
My work :
Transforming Data into Business Value
Every enterprise is powered by data. We take information in & analyze it,
manipulate it, and create more as output. Every application creates data,
whether it is log messages, metrics, user activity, outgoing messages, or
something else. Every byte of data has a story to tell
About : Mohamed Magdy
Magdy is A Big Data Engineer & Data scientist
Master's Degree in Informatics (present)
Professional Diploma in Big Data and Data Science (Nile University)
Bachelor Degree in Information System
(OCP) Oracle Certified Professional
(OCP) Oracle Certified Professional Apps
LinkidIn &Twitter Url :
Elephant Tattoo
3 | P a g e
Elephant Tattoo
4 | P a g e
The Book OF The
Elephant Tattoo
Elephant Tattoo
5 | P a g e
Whats BIG DATA ?
Big data is a term that describes the large volume of data –
both structured and unstructured – that inundates a business
on a day-to-day basis. But it’s not the amount of data that’s
important. It’s what organizations do with the data that
matters. Big data can be analyzed for insights that lead to
better decisions and strategic business moves.
Elephant Tattoo
6 | P a g e
Big Data History & Current Considerations ?
While the term “big data” is relatively new, the act of gathering and storing large
amounts of information for eventual analysis is ages old. The concept gained
momentum in the early 2000s when industry analyst Doug Laney articulated the
now-mainstream definition of big data as the three Vs:
Volume. Organizations collect data from a variety of sources, including business transactions, social
media and information from sensor or machine-to-machine data. In the past, storing it would’ve been
a problem – but new technologies (such as Hadoop) have eased the burden.
Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner.
RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real
time.
Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to
unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Elephant Tattoo
7 | P a g e
Why Big Data ?
The importance of big data doesn’t revolve around how much data you have, but
what you do with it. You can take data from any source and analyze it to find
answers that enable 1) cost reductions, 2) time reductions, 3) new product
development and optimized offerings, and 4) smart decision making. When you
combine big data with high-powered analytics, you can accomplish business-
related tasks such as:
Determining root causes of failures, issues and defects in near-real time.
Generating coupons at the point of sale based on the customer’s buying habits.
Recalculating entire risk portfolios in minutes.
Detecting fraudulent behavior before it affects your organization.
Data, in today’s business and technology world, is indispensable. The Big Data technologies and initiatives are rising
to analyze this data for gaining insights that can help in making strategic decisions. The concept evolved at the
beginning of 21st
century, and every technology giant is now making use of Big Data technologies. Big Data refers to
vast and voluminous data sets that may be structured or unstructured. This massive amount of data is produced every
day by businesses and users. Big Data analytics is the process of examining the large data sets to underline insights
and patterns. The Data analytics field in itself is vast.
Importance of Big Data Analytics
The Big Data analytics is indeed a revolution in the field of Information Technology.
The use of Data analytics by the companies is enhancing every year. The primary
focus of the companies is on customers. Hence the field is flourishing in Business to
Consumer (B2C) applications.We divide the analytics into different types as per the
nature of the environment. We have three divisions of Big Data analytics:
Prescriptive Analytics, Predictive Analytics, and Descriptive Analytics. This field
offers immense potential, and in this blog, we will discuss four perspectives to
explain why big data analytics is so important today?
Elephant Tattoo
8 | P a g e
 Data Science Perspective
 Business Perspective
 Real-time Usability Perspective
 Job Market Perspective
Big Data Analytics and Data Sciences
The analytics involves the use of advanced techniques and tools of analytics on the
data obtained from different sources in different sizes. Big data has the properties of
high variety, volume, and velocity. The data sets come from various online networks,
web pages, audio and video devices, social media, logs and many other sources.
Big Data analytics involves the use of analytics techniques like machine learning,
data mining, natural language processing, and statistics. The data is extracted,
prepared and blended to provide analysis for the businesses.
Elephant Tattoo
9 | P a g e
Businesses and Big Data Analytics
Big Data analytics tools and techniques are rising in demand due to the use of Big Data in
businesses. Organizations can find new opportunities and gain new insights to run their business
efficiently. These tools help in providing meaningful information for making better business
decisions.
The companies can improve their strategies by keeping in mind the customer focus. Big data
analytics efficiently helps operations to become more effective. This helps in improving the
profits of the company.
Big data analytics tools like Hadoop helps in reducing the cost of storage. This further increases
the efficiency of the business. With latest analytics tools, analysis of data becomes easier and
quicker. This, in turn, leads to faster decision making saving time and energy.
What is Hadoop ?
Hadoop is an open-source software framework for storing data and
running applications on clusters of commodity hardware. It provides
Elephant Tattoo
10 | P a g e
massive storage for any kind of data, enormous processing power
and the ability to handle virtually limitless concurrent tasks or jobs.
Hadoop core modules?
Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across
multiple machines without prior organization. Version (HDFS-1), (HDFS-2) ,(HDFS-3)
YARN – (Yet Another Resource Negotiator) provides resource management for the processes
running on Hadoop.
Elephant Tattoo
11 | P a g e
MapReduce – a parallel processing software framework. It is comprised of two steps. Map step is a
master node that takes inputs and partitions them into smaller sub problems and then distributes
them to worker nodes. After the map step has taken place, the master node takes the answers to all
of the sub problems and combines them to produce output.
Hadoop History ?
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally
developed to support distribution for the Nutch search engine project. Doug, who was
working at Yahoo! at the time and is now Chief Architect of Cloudera, named the project
after his son's toy elephant. Cutting's son was 2 years old at the time and just beginning to
talk. He called his beloved stuffed yellow elephant "Hadoop"
Apache Hadoop's MapReduce and HDFS components originally derived respectively from
Google's MapReduce and Google File System (GFS) papers.
What is GFS Google File System ?
a scalable distributed file system for large distributed data-intensive applications. It provides fault
tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance
to a large number of clients. While sharing many of the same goals as previous distributed file systems
designed and implemented the Google File System (GFS) to meet the rapidly growing demands
of Google’s data processing needs. GFS shares many of the same goals as previous distributed
file systems such as performance, scalability, reliability, and availability
Big Data Glossary
While we've attempted to define concepts as we've used them throughout the guide, sometimes it's
helpful to have specialized terminology available in a single place:
 Big data: Big data is an umbrella term for datasets that cannot reasonably be handled by traditional
computers or tools due to their volume, velocity, and variety. This term is also typically applied to
technologies and strategies to work with this type of data.
 Batch processing: Batch processing is a computing strategy that involves processing data in large
sets. This is typically ideal for non-time sensitive work that operates on very large sets of data. The
process is started and at a later time, the results are returned by the system.
Elephant Tattoo
12 | P a g e
 Cluster computing: Clustered computing is the practice of pooling the resources of multiple machines
and managing their collective capabilities to complete tasks. Computer clusters require a cluster
management layer which handles communication between the individual nodes and coordinates work
assignment.
 Data lake: Data lake is a term for a large repository of collected data in a relatively raw state. This is
frequently used to refer to the data collected in a big data system which might be unstructured and
frequently changing. This differs in spirit to data warehouses (defined below).
 Data mining: Data mining is a broad term for the practice of trying to find patterns in large sets of data.
It is the process of trying to refine a mass of data into a more understandable and cohesive set of
information.
 Data warehouse: Data warehouses are large, ordered repositories of data that can be used for
analysis and reporting. In contrast to a data lake, a data warehouse is composed of data that has
been cleaned, integrated with other sources, and is generally well-ordered. Data warehouses are often
spoken about in relation to big data, but typically are components of more conventional systems.
 ETL: ETL stands for extract, transform, and load. It refers to the process of taking raw data and
preparing it for the system's use. This is traditionally a process associated with data warehouses, but
characteristics of this process are also found in the ingestion pipelines of big data systems.
.
 In-memory computing: In-memory computing is a strategy that involves moving the working datasets
entirely within a cluster's collective memory. Intermediate calculations are not written to disk and are
instead held in memory. This gives in-memory computing systems like Apache Spark a huge
advantage in speed over I/O bound systems like Hadoop's MapReduce.
 Machine learning: Machine learning is the study and practice of designing systems that can learn,
adjust, and improve based on the data fed to them. This typically involves implementation of predictive
and statistical algorithms that can continually zero in on "correct" behavior and insights as more data
flows through the system.
 Map reduce (big data algorithm): Map reduce (the big data algorithm, not Hadoop's MapReduce
computation engine) is an algorithm for scheduling work on a computing cluster. The process involves
splitting the problem set up (mapping it to different nodes) and computing over them to produce
intermediate results, shuffling the results to align like sets, and then reducing the results by outputting a
single value for each set.
 NoSQL: NoSQL is a broad term referring to databases designed outside of the traditional relational
model. NoSQL databases have different trade-offs compared to relational databases, but are often
well-suited for big data systems due to their flexibility and frequent distributed-first architecture.
 Stream processing: Stream processing is the practice of computing over individual data items as they
move through a system. This allows for real-time analysis of the data being fed to the system and is
useful for time-sensitive operations using high velocity metrics.
Elephant Tattoo
13 | P a g e
Hadoop Architecture
Elephant Tattoo
14 | P a g e
HDFS Architecture
Hadoop File System was developed using distributed file system design. It is run on
commodity hardware. Unlike other distributed systems, HDFS is highly faulttolerant and
designed using low-cost hardware.
HDFS holds very large amount of data and provides easier access. To store such huge
data, the files are stored across multiple machines. These files are stored in redundant
fashion to rescue the system from possible data losses in case of failure. HDFS also makes
applications available to parallel processing
Elephant Tattoo
15 | P a g e
HDFS Hold Meta Data (information about data Like Location of Blocks and Parts of the
Files and Number of copy of the file ) in A server(PYSICAL COMPUTER) Named :Master
NODE
And The another Servers That Named :Data Node that hold the data it self the file csv or
txt or json file , read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the name node.
What is Block ?
Generally the user data is stored in the files of HDFS. The file in a file system will be
divided into one or more segments and/or stored in individual data nodes. These file
segments are called as blocks. In other words, the minimum amount of data that HDFS
can read or write is called a Block. The default block size is 64MB ,128MB, but it can be
increased as per the need to change in HDFS configuration.
Elephant Tattoo
16 | P a g e
Example Of HDFS
Think of a file that contains the phone numbers for everyone in the United States; the people with a
last name starting with A might be stored on server 1, B on server 2, and so on.
In a Hadoop world, pieces of this phonebook would be stored across the cluster, and to reconstruct
the entire phonebook, your program would need the blocks from every server in the cluster. To
achieve availability as components fail, HDFS replicates these smaller pieces onto two additional
servers by default. (This redundancy can be increased or decreased on a per-file basis or for a whole
environment;
for example, a development Hadoop cluster typically doesn’t need any data redundancy.) This
redundancy offers multiple benefits, the most obvious being higher availability.
In addition, this redundancy allows the Hadoop cluster to break work up into smaller chunks and run
those jobs on all the servers in the cluster for better scalability. Finally, you get the benefit of data
locality, which is critical when working with large data sets
Elephant Tattoo
17 | P a g e
HDFS VERSIONS 1,2,3
HDFS 1 : Supports MapReduce (MR) processing model only. Don’t supports any another tool
for processing
Has limited scaling of nodes. Limited to 4000 nodes per cluster
Works on concepts of slots – slots can run either a Map task or a Reduce task only.
A single Namenode to manage the entire namespace.
Has a limitation to serve as a platform for event processing, streaming and real-time operations
A Namenode failure affects the stack.
HDFS 2:
Allows to work in MR as well as other distributed computing models like Spark, Hama, Giraph, Message
Passing Interface) MPI & HBase coprocessors.
Works on concepts of containers. Using containers can run generic tasks.
Multiple Namenode servers manage multiple namespaces.
Elephant Tattoo
18 | P a g e
Has to feature to overcome SPOF with a standby Namenode and in the case of
Namenode failure, it is configured for automatic recovery.
Can serve as a platform for a wide variety of data analytics-possible to run event processing, streaming and real-time
operations.
The Hadoop stack – Hive, Pig, HBase etc. are all equipped to handle Namenode failure.
High Availability : If the Namenode is down due to some unplanned event such as a machine crash, the
whole Hadoop cluster will be down as well.
Hadoop2.x comes with the solution for this problem, which allows users to configure clusters with
redundant namenodes, removing the chance that a lone name node will become a single point of failure
within a cluster.
HDFS 3:
Support for Erasure Coding in HDFS
Considering the rapid growth trends in data and data center hardware, support for erasure coding in
Hadoop 3.0 is an important feature in years to come. Erasure Coding is a 50 years old technique that
lets any random piece of data to be recovered based on other piece of data i.e. metadata stored
around it. Erasure Coding is more like an advanced RAID technique that recovers data automatically
when hard disk fails.
JDK 8 is the minimum runtime version of JAVA required to run Hadoop 3.x as many dependency
library files have been used from JDK 8.
Storage overhead in Hadoop 3.0 is reduced to 50% with support for Erasure Coding. In this case, if
here are 8 data blocks then a total of only 12 blocks will occupy the storage space.
Hadoop 3.0 supports 2 or more Name Nodes.
Elephant Tattoo
19 | P a g e
Yarn (Yet Another Resource Negotiator)
Apache Hadoop YARN is the resource management and job scheduling technology in the open
source Hadoop distributed processing framework. One of Apache Hadoop's core components, YARN
is responsible for allocating system resources to the various applications running in a Hadoop
cluster and scheduling tasks to be executed on different cluster nodes
In a cluster architecture, Apache Hadoop YARN sits between HDFS and the processing engines
being used to run applications. It combines a central resource manager with containers, application
coordinators and node-level agents that monitor processing operations in individual cluster
nodes. YARN can dynamically allocate resources to applications as needed, a capability designed to
improve resource utilization and application performance compared with Map Reducer’s more static
allocation approach.
Why YARN?
In Hadoop version 1.0 which is also referred to as MRV1 (Map Reduce Version 1), Map Reduce performed both
processing and resource management functions. It consisted of a Job Tracker which was the single master. The
Job Tracker allocated the resources, performed scheduling and monitored the processing jobs. It assigned map
and reduce tasks on a number of subordinate processes called the Task Trackers. The Task Trackers
periodically reported their progress to the Job Tracker.
Elephant Tattoo
20 | P a g e
YARN Architecture
YARN enabled the users to perform operations as per requirement by using a variety of tools
like Spark for real-time processing, Hive for SQL, HBase for NoSQL and others.
Apart from Resource Management, YARN also performs Job Scheduling. YARN performs all your
processing activities by allocating resources and scheduling tasks. Apache Hadoop YARN Architecture
consists of the following main components:
1. Resource Manager: Runs on a master daemon and manages the resource allocation in the cluster.
2. Node Manager: They run on the slave daemons and are responsible for the execution of a task on every
single Data Node.
3. Application Master: Manages the user job lifecycle and resource needs of individual applications. It
works along with the Node Manager and monitors the execution of tasks.
4. Container: Package of resources including RAM, CPU, Network, HDD etc on a single node.
Components of YARN
You can consider YARN as the brain of your Hadoop Ecosystem. The image below represents the
YARN Architecture.
Elephant Tattoo
21 | P a g e
The first component of YARN Architecture is,
Resource Manager
 It is the ultimate authority in resource allocation.
 On receiving the processing requests, it passes parts of requests to corresponding node managers
accordingly, where the actual processing takes place.
 It is the arbitrator of the cluster resources and decides the allocation of the available resources for
competing applications.
 Optimizes the cluster utilization like keeping all resources in use all the time against various constraints
such as capacity guarantees, fairness, and SLAs.
 It has two major components: a) Scheduler b) Application Manager a) Scheduler
 The scheduler is responsible for allocating resources to the various running applications subject to
constraints of capacities, queues etc.
 It is called a pure scheduler in Resource Manager, which means that it does not perform any monitoring
or tracking of status for the applications.
 If there is an application failure or hardware failure, the Scheduler does not guarantee to restart the failed
tasks.
Elephant Tattoo
22 | P a g e
 Performs scheduling based on the resource requirements of the applications.
 It has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the
various applications. There are two such plug-ins: Capacity Scheduler and Fair Scheduler, which are
currently used as Schedulers in Resource Manager.
Application Manager
 It is responsible for accepting job submissions.
 Negotiates the first container from the Resource Manager for executing the application specific
Application Master.
 Manages running the Application Masters in a cluster and provides service for restarting the Application
Master container on failure.
Node Manager
 It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow on the given
node.
 It registers with the Resource Manager and sends heartbeats with the health status of the node.
 Its primary goal is to manage application containers assigned to it by the resource manager.
 It keeps up-to-date with the Resource Manager.
 Application Master requests the assigned container from the Node Manager by sending it a Container
Launch Context (CLC) which includes everything the application needs in order to run. The Node
Manager creates the requested container process and starts it.
 Monitors resource usage (memory, CPU) of individual containers.
 Performs Log management.
 It also kills the container as directed by the Resource Manager.
Application Master
 An application is a single job submitted to the framework. Each such application has a unique Application
Master associated with it which is a framework specific entity.
 It is the process that coordinates an application’s execution in the cluster and also manages faults.
Elephant Tattoo
23 | P a g e
 Its task is to negotiate resources from the Resource Manager and work with the Node Manager to execute
and monitor the component tasks.
 It is responsible for negotiating appropriate resource containers from the ResourceManager, tracking
their status and monitoring progress.
 Once started, it periodically sends heartbeats to the Resource Manager to affirm its health and to update
the record of its resource demands.
The fourth component is:
Container
 It is a collection of physical resources such as RAM, CPU cores, and disks on a single node.
 YARN containers are managed by a container launch context which is container life-cycle(CLC). This
record contains a map of environment variables, dependencies stored in a remotely accessible storage,
security tokens, payload for Node Manager services and the command necessary to create the process.
 It grants rights to an application to use a specific amount of resources (memory, CPU etc.) on a specific
host.
Application Submission in YARN
Refer to the image and have a look at the steps involved in application submission of Hadoop YARN:
1) Submit the job
2) Get Application ID
3) Application Submission Context
4) a: Start Container Launch
B: Launch Application Master
5) Allocate Resources
6) a: Container
B: Launch
7) Execute
Elephant Tattoo
24 | P a g e
Refer to the given image and see the following steps involved in Application workflow of Apache
Hadoop YARN:
1. Client submits an application
2. Resource Manager allocates a container to start Application Manager
3. Application Manager registers with Resource Manager
4. Application Manager asks containers from Resource Manager
5. Application Manager notifies Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts Resource Manager/Application Manager to monitor application’s status
8. Application Manager unregisters with Resource Manager
MapReduce ?
The term "MapReduce" actually refers to two separate and distinct tasks that Hadoop programs
perform. The first is the map job, which takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key/value pairs).
The reduce job takes the output from a map as input and combines those data tuples into a smaller
set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed
after the map job.
Elephant Tattoo
25 | P a g e
Example OF MapReduce job
Let’s look at a simple example.
Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms)
that represent a city and the corresponding temperature recorded in that city for the various
measurement days. Of course we’ve made this example very simple so it’s easy to follow. You can
imagine that a real application won’t be quite so simple, as it’s likely to contain millions or even
billions of rows, and they might not be neatly formatted rows at all; in fact, no matter how big or small
the amount of data you need to analyze, the key principles we’re covering here remain the same.
Either way, in this example, city is the key and temperature is the value. Toronto, 20 Whitby, 25 New
York, 22 Rome, 32 Toronto, 4 Rome, 33 New York, 18 Out of all the data we have collected, we want
to find the maximum temperature for each city across all of the data files (note that each file might
have the same city represented multiple times). Using the MapReduce framework, we can break this
down into five map tasks, where each mapper works on one of the five files and the mapper task
goes through the data and returns the maximum temperature for each city. For example, the results
produced from one mapper task for the data above would look like this: (Toronto, 20) (Whitby, 25)
(New York, 22) (Rome, 33) Let’s assume the other four mapper tasks (working on the other four files
not shown here) produced the following intermediate results: (Toronto, 18) (Whitby, 27) (New York,
32) (Rome, 37)(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)(Toronto, 22) (Whitby, 19) (New
York, 20) (Rome, 31)(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30) All five of these output
streams would be fed into the reduce tasks, which combine the input results and output a single value
for each city, producing a final result set as follows: (Toronto, 32) (Whitby, 27) (New York, 33) (Rome,
38) As an analogy, you can think of map and reduce tasks as the way a census was conducted in
Roman times, where the census bureau would dispatch its people to each city in the empire. Each
census taker in each city would be tasked to count the number of people in that city and then return
their results to the capital city. There, the results from each city would be reduced to a single count
(sum of all cities) to determine the overall population of the empire. This mapping of people to cities,
in parallel, and then combining the results (reducing) is much more efficient than sending a single
person to count every person in the empire in a serial fashion.
Elephant Tattoo
26 | P a g e
Another Big Data Tools
Elephant Tattoo
27 | P a g e
SPARK ?
Spark was introduced by Apache Software Foundation for speeding up the Hadoop
computational computing software process.
As against a common belief, Spark is not a modified version of Hadoop and is not, really,
dependent on Hadoop because it has its own cluster management. Hadoop is just one of
the ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark
has its own cluster management computation, it uses Hadoop for storage purpose only.
Spark is a lightning-fast cluster computing technology, designed for fast computation. It is
based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for
more types of computations, which includes interactive queries and stream processing. The
main feature of Spark is its in-memory cluster computing that increases the processing
speed of an application.
Features of Apache Spark
Apache Spark has following features.
 Speed − Spark helps to run an application in Hadoop cluster, up to 100 times faster in memory,
and 10 times faster when running on disk. This is possible by reducing number of read/write
operations to disk. It stores the intermediate processing data in memory.
 Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python.
Therefore, you can write applications in different languages. Spark comes up with 80 high-level
operators for interactive querying.
 Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also supports SQL
queries, Streaming data, Machine learning (ML), and Graph algorithms.
Elephant Tattoo
28 | P a g e
 The following diagram shows three ways of how Spark can be built with Hadoop
components.
Components of Spark
Apache Spark Core
Spark Core is the underlying general execution engine for spark platform that all other
functionality is built upon. It provides In-Memory computing and referencing datasets in
external storage systems.
Elephant Tattoo
29 | P a g e
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed
Datasets) transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework above Spark because of the distributed
memory-based Spark architecture. It is, according to benchmarks, done by the MLlib
developers against the Alternating Least Squares (ALS) implementations. Spark MLlib is
nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout
gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API
for expressing graph computation that can model the user-defined graphs by using Pregel
abstraction API. It also provides an optimized runtime for this abstraction.
Apache Spark Abstractions & Concepts
In the below section, it is briefly discussed regarding the abstractions and concepts of Spark -
 RDD (Resilient Distributed Dataset) - RDD is the central and significant unit of data in Apache Spark. It is
a distributed collection of components across cluster nodes and can implement parallel operations.
Parallelized Collections, External datasets and Existing RDDs are the three methods for creating RDD.
 DAG (Direct Acyclic Graph) - DAG is a designated graph with no inscribed sequences. It reads data from
HDFS and Map & Reduce operations are applied. It comprises a series of vertices such that every edge is
directed from initial to succeeding in the progression.
 Spark Shell - An interactive Shell that can execute a command line of the application effective because of
the interactive testing and capability to read a large amount of data sources in various types.
 Transformations - It builds a new RDD from the existing one. It transfers the dataset to the function and
then returns the new dataset.
 Actions - It turns final result to driver program or corresponds it to the external data store.
Apache Spark Architecture
Spark is accessible, intense, powerful and proficient Big Data tool for handling different enormous
information challenges. Apache Spark takes after an ace/slave engineering with two primary
Daemons and a Cluster Manager –
 Master Daemon – (Master/Driver Process)
 Worker Daemon – (Slave Process)
Elephant Tattoo
30 | P a g e
A spark cluster has a solitary Master and many numbers of Slaves/Workers. The driver and the
agents run their individual Java procedures and users can execute them on individual machines.
Below are the three methods of building Spark with Hadoop Components (these three components
are strong pillars of Spark Architecture)

Weitere ähnliche Inhalte

Was ist angesagt?

Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big DataIndu Khemchandani
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyNishant Gandhi
 
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...Edureka!
 
Big Data’s Big Impact on Businesses
Big Data’s Big Impact on BusinessesBig Data’s Big Impact on Businesses
Big Data’s Big Impact on BusinessesCRISIL Limited
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013boorad
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond Rajesh Kumar
 
Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data Fundamentalsrjain51
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and howbobosenthil
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...IJSRD
 
Big Data Analytics MIS presentation
Big Data Analytics MIS presentationBig Data Analytics MIS presentation
Big Data Analytics MIS presentationAASTHA PANDEY
 
Forecast of Big Data Trends
Forecast of Big Data TrendsForecast of Big Data Trends
Forecast of Big Data TrendsIMC Institute
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance Qubole
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data ScienceAndrew Gardner
 

Was ist angesagt? (19)

Motivation for big data
Motivation for big dataMotivation for big data
Motivation for big data
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 
Big data-ppt
Big data-pptBig data-ppt
Big data-ppt
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
 
Big Data’s Big Impact on Businesses
Big Data’s Big Impact on BusinessesBig Data’s Big Impact on Businesses
Big Data’s Big Impact on Businesses
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
 
Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data Fundamentals
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and how
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
 
BigData Analytics
BigData AnalyticsBigData Analytics
BigData Analytics
 
Big Data Analytics MIS presentation
Big Data Analytics MIS presentationBig Data Analytics MIS presentation
Big Data Analytics MIS presentation
 
Big Data Overview
Big Data OverviewBig Data Overview
Big Data Overview
 
Forecast of Big Data Trends
Forecast of Big Data TrendsForecast of Big Data Trends
Forecast of Big Data Trends
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 

Ähnlich wie The book of elephant tattoo (20)

Big data
Big dataBig data
Big data
 
Big Data
Big DataBig Data
Big Data
 
ANALYTICS OF DATA USING HADOOP-A REVIEW
ANALYTICS OF DATA USING HADOOP-A REVIEWANALYTICS OF DATA USING HADOOP-A REVIEW
ANALYTICS OF DATA USING HADOOP-A REVIEW
 
Big Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsBig Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential Tools
 
Big data peresintaion
Big data peresintaion Big data peresintaion
Big data peresintaion
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
How to tackle big data from a security
How to tackle big data from a securityHow to tackle big data from a security
How to tackle big data from a security
 
Big data (word file)
Big data  (word file)Big data  (word file)
Big data (word file)
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Top 10 renowned big data companies
Top 10 renowned big data companiesTop 10 renowned big data companies
Top 10 renowned big data companies
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
Big Data Hadoop
Big Data HadoopBig Data Hadoop
Big Data Hadoop
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Ab cs of big data
Ab cs of big dataAb cs of big data
Ab cs of big data
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
 
Big Data at a Glance
Big Data at a GlanceBig Data at a Glance
Big Data at a Glance
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
Big data
Big dataBig data
Big data
 

Kürzlich hochgeladen

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 

Kürzlich hochgeladen (20)

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 

The book of elephant tattoo

  • 1. Elephant Tattoo 1 | P a g e Mohamed Magdy The Book OF The Elephant Tattoo ElephantTattoo
  • 2. Elephant Tattoo 2 | P a g e Elephant Tattoo Mohamed Magdy Mail: My work : Transforming Data into Business Value Every enterprise is powered by data. We take information in & analyze it, manipulate it, and create more as output. Every application creates data, whether it is log messages, metrics, user activity, outgoing messages, or something else. Every byte of data has a story to tell About : Mohamed Magdy Magdy is A Big Data Engineer & Data scientist Master's Degree in Informatics (present) Professional Diploma in Big Data and Data Science (Nile University) Bachelor Degree in Information System (OCP) Oracle Certified Professional (OCP) Oracle Certified Professional Apps LinkidIn &Twitter Url :
  • 4. Elephant Tattoo 4 | P a g e The Book OF The Elephant Tattoo
  • 5. Elephant Tattoo 5 | P a g e Whats BIG DATA ? Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
  • 6. Elephant Tattoo 6 | P a g e Big Data History & Current Considerations ? While the term “big data” is relatively new, the act of gathering and storing large amounts of information for eventual analysis is ages old. The concept gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs: Volume. Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the burden. Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
  • 7. Elephant Tattoo 7 | P a g e Why Big Data ? The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making. When you combine big data with high-powered analytics, you can accomplish business- related tasks such as: Determining root causes of failures, issues and defects in near-real time. Generating coupons at the point of sale based on the customer’s buying habits. Recalculating entire risk portfolios in minutes. Detecting fraudulent behavior before it affects your organization. Data, in today’s business and technology world, is indispensable. The Big Data technologies and initiatives are rising to analyze this data for gaining insights that can help in making strategic decisions. The concept evolved at the beginning of 21st century, and every technology giant is now making use of Big Data technologies. Big Data refers to vast and voluminous data sets that may be structured or unstructured. This massive amount of data is produced every day by businesses and users. Big Data analytics is the process of examining the large data sets to underline insights and patterns. The Data analytics field in itself is vast. Importance of Big Data Analytics The Big Data analytics is indeed a revolution in the field of Information Technology. The use of Data analytics by the companies is enhancing every year. The primary focus of the companies is on customers. Hence the field is flourishing in Business to Consumer (B2C) applications.We divide the analytics into different types as per the nature of the environment. We have three divisions of Big Data analytics: Prescriptive Analytics, Predictive Analytics, and Descriptive Analytics. This field offers immense potential, and in this blog, we will discuss four perspectives to explain why big data analytics is so important today?
  • 8. Elephant Tattoo 8 | P a g e  Data Science Perspective  Business Perspective  Real-time Usability Perspective  Job Market Perspective Big Data Analytics and Data Sciences The analytics involves the use of advanced techniques and tools of analytics on the data obtained from different sources in different sizes. Big data has the properties of high variety, volume, and velocity. The data sets come from various online networks, web pages, audio and video devices, social media, logs and many other sources. Big Data analytics involves the use of analytics techniques like machine learning, data mining, natural language processing, and statistics. The data is extracted, prepared and blended to provide analysis for the businesses.
  • 9. Elephant Tattoo 9 | P a g e Businesses and Big Data Analytics Big Data analytics tools and techniques are rising in demand due to the use of Big Data in businesses. Organizations can find new opportunities and gain new insights to run their business efficiently. These tools help in providing meaningful information for making better business decisions. The companies can improve their strategies by keeping in mind the customer focus. Big data analytics efficiently helps operations to become more effective. This helps in improving the profits of the company. Big data analytics tools like Hadoop helps in reducing the cost of storage. This further increases the efficiency of the business. With latest analytics tools, analysis of data becomes easier and quicker. This, in turn, leads to faster decision making saving time and energy. What is Hadoop ? Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides
  • 10. Elephant Tattoo 10 | P a g e massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. Hadoop core modules? Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without prior organization. Version (HDFS-1), (HDFS-2) ,(HDFS-3) YARN – (Yet Another Resource Negotiator) provides resource management for the processes running on Hadoop.
  • 11. Elephant Tattoo 11 | P a g e MapReduce – a parallel processing software framework. It is comprised of two steps. Map step is a master node that takes inputs and partitions them into smaller sub problems and then distributes them to worker nodes. After the map step has taken place, the master node takes the answers to all of the sub problems and combines them to produce output. Hadoop History ? Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project. Doug, who was working at Yahoo! at the time and is now Chief Architect of Cloudera, named the project after his son's toy elephant. Cutting's son was 2 years old at the time and just beginning to talk. He called his beloved stuffed yellow elephant "Hadoop" Apache Hadoop's MapReduce and HDFS components originally derived respectively from Google's MapReduce and Google File System (GFS) papers. What is GFS Google File System ? a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability Big Data Glossary While we've attempted to define concepts as we've used them throughout the guide, sometimes it's helpful to have specialized terminology available in a single place:  Big data: Big data is an umbrella term for datasets that cannot reasonably be handled by traditional computers or tools due to their volume, velocity, and variety. This term is also typically applied to technologies and strategies to work with this type of data.  Batch processing: Batch processing is a computing strategy that involves processing data in large sets. This is typically ideal for non-time sensitive work that operates on very large sets of data. The process is started and at a later time, the results are returned by the system.
  • 12. Elephant Tattoo 12 | P a g e  Cluster computing: Clustered computing is the practice of pooling the resources of multiple machines and managing their collective capabilities to complete tasks. Computer clusters require a cluster management layer which handles communication between the individual nodes and coordinates work assignment.  Data lake: Data lake is a term for a large repository of collected data in a relatively raw state. This is frequently used to refer to the data collected in a big data system which might be unstructured and frequently changing. This differs in spirit to data warehouses (defined below).  Data mining: Data mining is a broad term for the practice of trying to find patterns in large sets of data. It is the process of trying to refine a mass of data into a more understandable and cohesive set of information.  Data warehouse: Data warehouses are large, ordered repositories of data that can be used for analysis and reporting. In contrast to a data lake, a data warehouse is composed of data that has been cleaned, integrated with other sources, and is generally well-ordered. Data warehouses are often spoken about in relation to big data, but typically are components of more conventional systems.  ETL: ETL stands for extract, transform, and load. It refers to the process of taking raw data and preparing it for the system's use. This is traditionally a process associated with data warehouses, but characteristics of this process are also found in the ingestion pipelines of big data systems. .  In-memory computing: In-memory computing is a strategy that involves moving the working datasets entirely within a cluster's collective memory. Intermediate calculations are not written to disk and are instead held in memory. This gives in-memory computing systems like Apache Spark a huge advantage in speed over I/O bound systems like Hadoop's MapReduce.  Machine learning: Machine learning is the study and practice of designing systems that can learn, adjust, and improve based on the data fed to them. This typically involves implementation of predictive and statistical algorithms that can continually zero in on "correct" behavior and insights as more data flows through the system.  Map reduce (big data algorithm): Map reduce (the big data algorithm, not Hadoop's MapReduce computation engine) is an algorithm for scheduling work on a computing cluster. The process involves splitting the problem set up (mapping it to different nodes) and computing over them to produce intermediate results, shuffling the results to align like sets, and then reducing the results by outputting a single value for each set.  NoSQL: NoSQL is a broad term referring to databases designed outside of the traditional relational model. NoSQL databases have different trade-offs compared to relational databases, but are often well-suited for big data systems due to their flexibility and frequent distributed-first architecture.  Stream processing: Stream processing is the practice of computing over individual data items as they move through a system. This allows for real-time analysis of the data being fed to the system and is useful for time-sensitive operations using high velocity metrics.
  • 13. Elephant Tattoo 13 | P a g e Hadoop Architecture
  • 14. Elephant Tattoo 14 | P a g e HDFS Architecture Hadoop File System was developed using distributed file system design. It is run on commodity hardware. Unlike other distributed systems, HDFS is highly faulttolerant and designed using low-cost hardware. HDFS holds very large amount of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes applications available to parallel processing
  • 15. Elephant Tattoo 15 | P a g e HDFS Hold Meta Data (information about data Like Location of Blocks and Parts of the Files and Number of copy of the file ) in A server(PYSICAL COMPUTER) Named :Master NODE And The another Servers That Named :Data Node that hold the data it self the file csv or txt or json file , read-write operations on the file systems, as per client request. They also perform operations such as block creation, deletion, and replication according to the instructions of the name node. What is Block ? Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called as blocks. In other words, the minimum amount of data that HDFS can read or write is called a Block. The default block size is 64MB ,128MB, but it can be increased as per the need to change in HDFS configuration.
  • 16. Elephant Tattoo 16 | P a g e Example Of HDFS Think of a file that contains the phone numbers for everyone in the United States; the people with a last name starting with A might be stored on server 1, B on server 2, and so on. In a Hadoop world, pieces of this phonebook would be stored across the cluster, and to reconstruct the entire phonebook, your program would need the blocks from every server in the cluster. To achieve availability as components fail, HDFS replicates these smaller pieces onto two additional servers by default. (This redundancy can be increased or decreased on a per-file basis or for a whole environment; for example, a development Hadoop cluster typically doesn’t need any data redundancy.) This redundancy offers multiple benefits, the most obvious being higher availability. In addition, this redundancy allows the Hadoop cluster to break work up into smaller chunks and run those jobs on all the servers in the cluster for better scalability. Finally, you get the benefit of data locality, which is critical when working with large data sets
  • 17. Elephant Tattoo 17 | P a g e HDFS VERSIONS 1,2,3 HDFS 1 : Supports MapReduce (MR) processing model only. Don’t supports any another tool for processing Has limited scaling of nodes. Limited to 4000 nodes per cluster Works on concepts of slots – slots can run either a Map task or a Reduce task only. A single Namenode to manage the entire namespace. Has a limitation to serve as a platform for event processing, streaming and real-time operations A Namenode failure affects the stack. HDFS 2: Allows to work in MR as well as other distributed computing models like Spark, Hama, Giraph, Message Passing Interface) MPI & HBase coprocessors. Works on concepts of containers. Using containers can run generic tasks. Multiple Namenode servers manage multiple namespaces.
  • 18. Elephant Tattoo 18 | P a g e Has to feature to overcome SPOF with a standby Namenode and in the case of Namenode failure, it is configured for automatic recovery. Can serve as a platform for a wide variety of data analytics-possible to run event processing, streaming and real-time operations. The Hadoop stack – Hive, Pig, HBase etc. are all equipped to handle Namenode failure. High Availability : If the Namenode is down due to some unplanned event such as a machine crash, the whole Hadoop cluster will be down as well. Hadoop2.x comes with the solution for this problem, which allows users to configure clusters with redundant namenodes, removing the chance that a lone name node will become a single point of failure within a cluster. HDFS 3: Support for Erasure Coding in HDFS Considering the rapid growth trends in data and data center hardware, support for erasure coding in Hadoop 3.0 is an important feature in years to come. Erasure Coding is a 50 years old technique that lets any random piece of data to be recovered based on other piece of data i.e. metadata stored around it. Erasure Coding is more like an advanced RAID technique that recovers data automatically when hard disk fails. JDK 8 is the minimum runtime version of JAVA required to run Hadoop 3.x as many dependency library files have been used from JDK 8. Storage overhead in Hadoop 3.0 is reduced to 50% with support for Erasure Coding. In this case, if here are 8 data blocks then a total of only 12 blocks will occupy the storage space. Hadoop 3.0 supports 2 or more Name Nodes.
  • 19. Elephant Tattoo 19 | P a g e Yarn (Yet Another Resource Negotiator) Apache Hadoop YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework. One of Apache Hadoop's core components, YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes In a cluster architecture, Apache Hadoop YARN sits between HDFS and the processing engines being used to run applications. It combines a central resource manager with containers, application coordinators and node-level agents that monitor processing operations in individual cluster nodes. YARN can dynamically allocate resources to applications as needed, a capability designed to improve resource utilization and application performance compared with Map Reducer’s more static allocation approach. Why YARN? In Hadoop version 1.0 which is also referred to as MRV1 (Map Reduce Version 1), Map Reduce performed both processing and resource management functions. It consisted of a Job Tracker which was the single master. The Job Tracker allocated the resources, performed scheduling and monitored the processing jobs. It assigned map and reduce tasks on a number of subordinate processes called the Task Trackers. The Task Trackers periodically reported their progress to the Job Tracker.
  • 20. Elephant Tattoo 20 | P a g e YARN Architecture YARN enabled the users to perform operations as per requirement by using a variety of tools like Spark for real-time processing, Hive for SQL, HBase for NoSQL and others. Apart from Resource Management, YARN also performs Job Scheduling. YARN performs all your processing activities by allocating resources and scheduling tasks. Apache Hadoop YARN Architecture consists of the following main components: 1. Resource Manager: Runs on a master daemon and manages the resource allocation in the cluster. 2. Node Manager: They run on the slave daemons and are responsible for the execution of a task on every single Data Node. 3. Application Master: Manages the user job lifecycle and resource needs of individual applications. It works along with the Node Manager and monitors the execution of tasks. 4. Container: Package of resources including RAM, CPU, Network, HDD etc on a single node. Components of YARN You can consider YARN as the brain of your Hadoop Ecosystem. The image below represents the YARN Architecture.
  • 21. Elephant Tattoo 21 | P a g e The first component of YARN Architecture is, Resource Manager  It is the ultimate authority in resource allocation.  On receiving the processing requests, it passes parts of requests to corresponding node managers accordingly, where the actual processing takes place.  It is the arbitrator of the cluster resources and decides the allocation of the available resources for competing applications.  Optimizes the cluster utilization like keeping all resources in use all the time against various constraints such as capacity guarantees, fairness, and SLAs.  It has two major components: a) Scheduler b) Application Manager a) Scheduler  The scheduler is responsible for allocating resources to the various running applications subject to constraints of capacities, queues etc.  It is called a pure scheduler in Resource Manager, which means that it does not perform any monitoring or tracking of status for the applications.  If there is an application failure or hardware failure, the Scheduler does not guarantee to restart the failed tasks.
  • 22. Elephant Tattoo 22 | P a g e  Performs scheduling based on the resource requirements of the applications.  It has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various applications. There are two such plug-ins: Capacity Scheduler and Fair Scheduler, which are currently used as Schedulers in Resource Manager. Application Manager  It is responsible for accepting job submissions.  Negotiates the first container from the Resource Manager for executing the application specific Application Master.  Manages running the Application Masters in a cluster and provides service for restarting the Application Master container on failure. Node Manager  It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow on the given node.  It registers with the Resource Manager and sends heartbeats with the health status of the node.  Its primary goal is to manage application containers assigned to it by the resource manager.  It keeps up-to-date with the Resource Manager.  Application Master requests the assigned container from the Node Manager by sending it a Container Launch Context (CLC) which includes everything the application needs in order to run. The Node Manager creates the requested container process and starts it.  Monitors resource usage (memory, CPU) of individual containers.  Performs Log management.  It also kills the container as directed by the Resource Manager. Application Master  An application is a single job submitted to the framework. Each such application has a unique Application Master associated with it which is a framework specific entity.  It is the process that coordinates an application’s execution in the cluster and also manages faults.
  • 23. Elephant Tattoo 23 | P a g e  Its task is to negotiate resources from the Resource Manager and work with the Node Manager to execute and monitor the component tasks.  It is responsible for negotiating appropriate resource containers from the ResourceManager, tracking their status and monitoring progress.  Once started, it periodically sends heartbeats to the Resource Manager to affirm its health and to update the record of its resource demands. The fourth component is: Container  It is a collection of physical resources such as RAM, CPU cores, and disks on a single node.  YARN containers are managed by a container launch context which is container life-cycle(CLC). This record contains a map of environment variables, dependencies stored in a remotely accessible storage, security tokens, payload for Node Manager services and the command necessary to create the process.  It grants rights to an application to use a specific amount of resources (memory, CPU etc.) on a specific host. Application Submission in YARN Refer to the image and have a look at the steps involved in application submission of Hadoop YARN: 1) Submit the job 2) Get Application ID 3) Application Submission Context 4) a: Start Container Launch B: Launch Application Master 5) Allocate Resources 6) a: Container B: Launch 7) Execute
  • 24. Elephant Tattoo 24 | P a g e Refer to the given image and see the following steps involved in Application workflow of Apache Hadoop YARN: 1. Client submits an application 2. Resource Manager allocates a container to start Application Manager 3. Application Manager registers with Resource Manager 4. Application Manager asks containers from Resource Manager 5. Application Manager notifies Node Manager to launch containers 6. Application code is executed in the container 7. Client contacts Resource Manager/Application Manager to monitor application’s status 8. Application Manager unregisters with Resource Manager MapReduce ? The term "MapReduce" actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
  • 25. Elephant Tattoo 25 | P a g e Example OF MapReduce job Let’s look at a simple example. Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms) that represent a city and the corresponding temperature recorded in that city for the various measurement days. Of course we’ve made this example very simple so it’s easy to follow. You can imagine that a real application won’t be quite so simple, as it’s likely to contain millions or even billions of rows, and they might not be neatly formatted rows at all; in fact, no matter how big or small the amount of data you need to analyze, the key principles we’re covering here remain the same. Either way, in this example, city is the key and temperature is the value. Toronto, 20 Whitby, 25 New York, 22 Rome, 32 Toronto, 4 Rome, 33 New York, 18 Out of all the data we have collected, we want to find the maximum temperature for each city across all of the data files (note that each file might have the same city represented multiple times). Using the MapReduce framework, we can break this down into five map tasks, where each mapper works on one of the five files and the mapper task goes through the data and returns the maximum temperature for each city. For example, the results produced from one mapper task for the data above would look like this: (Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33) Let’s assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results: (Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30) All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result set as follows: (Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38) As an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times, where the census bureau would dispatch its people to each city in the empire. Each census taker in each city would be tasked to count the number of people in that city and then return their results to the capital city. There, the results from each city would be reduced to a single count (sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in parallel, and then combining the results (reducing) is much more efficient than sending a single person to count every person in the empire in a serial fashion.
  • 26. Elephant Tattoo 26 | P a g e Another Big Data Tools
  • 27. Elephant Tattoo 27 | P a g e SPARK ? Spark was introduced by Apache Software Foundation for speeding up the Hadoop computational computing software process. As against a common belief, Spark is not a modified version of Hadoop and is not, really, dependent on Hadoop because it has its own cluster management. Hadoop is just one of the ways to implement Spark. Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purpose only. Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which includes interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application. Features of Apache Spark Apache Spark has following features.  Speed − Spark helps to run an application in Hadoop cluster, up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing number of read/write operations to disk. It stores the intermediate processing data in memory.  Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. Spark comes up with 80 high-level operators for interactive querying.  Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also supports SQL queries, Streaming data, Machine learning (ML), and Graph algorithms.
  • 28. Elephant Tattoo 28 | P a g e  The following diagram shows three ways of how Spark can be built with Hadoop components. Components of Spark Apache Spark Core Spark Core is the underlying general execution engine for spark platform that all other functionality is built upon. It provides In-Memory computing and referencing datasets in external storage systems.
  • 29. Elephant Tattoo 29 | P a g e Spark SQL Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Spark Streaming Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data. MLlib (Machine Learning Library) MLlib is a distributed machine learning framework above Spark because of the distributed memory-based Spark architecture. It is, according to benchmarks, done by the MLlib developers against the Alternating Least Squares (ALS) implementations. Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface). GraphX GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model the user-defined graphs by using Pregel abstraction API. It also provides an optimized runtime for this abstraction. Apache Spark Abstractions & Concepts In the below section, it is briefly discussed regarding the abstractions and concepts of Spark -  RDD (Resilient Distributed Dataset) - RDD is the central and significant unit of data in Apache Spark. It is a distributed collection of components across cluster nodes and can implement parallel operations. Parallelized Collections, External datasets and Existing RDDs are the three methods for creating RDD.  DAG (Direct Acyclic Graph) - DAG is a designated graph with no inscribed sequences. It reads data from HDFS and Map & Reduce operations are applied. It comprises a series of vertices such that every edge is directed from initial to succeeding in the progression.  Spark Shell - An interactive Shell that can execute a command line of the application effective because of the interactive testing and capability to read a large amount of data sources in various types.  Transformations - It builds a new RDD from the existing one. It transfers the dataset to the function and then returns the new dataset.  Actions - It turns final result to driver program or corresponds it to the external data store. Apache Spark Architecture Spark is accessible, intense, powerful and proficient Big Data tool for handling different enormous information challenges. Apache Spark takes after an ace/slave engineering with two primary Daemons and a Cluster Manager –  Master Daemon – (Master/Driver Process)  Worker Daemon – (Slave Process)
  • 30. Elephant Tattoo 30 | P a g e A spark cluster has a solitary Master and many numbers of Slaves/Workers. The driver and the agents run their individual Java procedures and users can execute them on individual machines. Below are the three methods of building Spark with Hadoop Components (these three components are strong pillars of Spark Architecture)