This document discusses how Apache Spark overcomes the limitations of Hadoop MapReduce. It explains that Spark is up to 100 times faster than MapReduce by keeping data in-memory between jobs rather than writing to disk. It also supports features beyond batch processing like machine learning, streaming, and graph processing through its libraries. Spark constructs jobs as directed acyclic graphs of operators that can be rearranged and optimized to cut down on reading and writing to disk.
Slide 2 www.edureka.co/apache-spark-scala-training
Agenda
By the end of this webinar, you will know about:
Strengths of MapReduce
Things beyond MapReduce
How MapReduce limitations can be overcome
How Spark fits the bill
Other exciting features in Spark
Strengths of MapReduce
• Simple: Independent of the language of choice, such as Java, C++ or Python.
• Scalability: Can process petabytes of data stored in HDFS on one cluster.
• Fault tolerance: MapReduce takes care of failures using the replicated copies.
• Minimal data motion: Processing moves towards the data to minimize disk I/O.
Limitations Of MapReduce (MR)
• Real-time processing
• Complex algorithms
• Re-reading and parsing data
• Minimal data motion
• Graph processing
• Iterative tasks
• Random access
Feature Comparison with Spark
Hadoop MapReduce | Spark
Fast | Up to 100x faster than MapReduce
Batch processing | Batch and real-time processing
Stores data on disk | Stores data in memory
Written in Java | Written in Scala
Source: Databricks
How MR limitations can be overcome
Overcoming MR limitations
• Cutting down on the number of reads and writes to disk
• Real-time processing
• Cyclic data flows
• Random access
How Spark Implements Features To Make Its Architecture Better Than MR
Spark Cuts Down Read/Write I/O To Disk
Spark tries to keep data in the memory of its distributed workers, allowing significantly faster, lower-latency computations, whereas MapReduce keeps shuffling data in and out of disk.
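The effect on an iterative job can be sketched with a toy model. This is plain Python, not Spark code: `SimulatedDisk`, `mapreduce_style` and `spark_style` are hypothetical names made up for this illustration, and the "disk" is just a counter. It only shows why reusing an in-memory dataset removes repeated reads, assuming each MapReduce-style pass reloads its input.

```python
# Toy model (not Spark code): count simulated disk reads for an iterative job.

class SimulatedDisk:
    """Stands in for HDFS; every read() is counted as one disk read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


def mapreduce_style(disk, iterations):
    """Each iteration is a separate job that re-reads its input from disk."""
    total = 0
    for _ in range(iterations):
        data = disk.read()          # reload the input on every pass
        total += sum(data)
    return total


def spark_style(disk, iterations):
    """Load once, keep the dataset in memory, and reuse it across iterations."""
    cached = disk.read()            # single read; later passes hit memory
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


disk_a = SimulatedDisk(range(1, 6))
disk_b = SimulatedDisk(range(1, 6))
result_a = mapreduce_style(disk_a, iterations=10)
result_b = spark_style(disk_b, iterations=10)
print(result_a == result_b)         # both strategies compute the same total
print(disk_a.reads, disk_b.reads)   # 10 reads versus 1
```

Both functions return the same result; only the number of simulated disk reads differs, which is the point the slide makes about iterative workloads.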
Libraries For ML, Graph Programming …
• Machine learning library (MLlib)
• Graph programming (GraphX)
• Spark interface for RDBMS lovers (Spark SQL)
• Utility for continuous ingestion of data (Spark Streaming)
Cyclic Data Flows
• All jobs in Spark comprise a series of operators and run on a set of data.
• All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
• The DAG is optimized by rearranging and combining operators where possible.
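The idea of recording operators first and combining them before execution can be sketched as follows. This is a toy model in plain Python, not Spark's actual optimizer: the `Pipeline` class is hypothetical, and it handles only a linear chain of operators (the simplest possible DAG). It illustrates the "combining" step by running every recorded map and filter in a single pass, with no intermediate datasets.

```python
# Toy model (not Spark's optimizer): record operators lazily, then fuse a
# chain of map/filter steps into one pass over the data.

class Pipeline:
    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)       # recorded operators; nothing runs yet

    def map(self, fn):
        return Pipeline(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return Pipeline(self.source, self.ops + (("filter", pred),))

    def collect(self):
        """Run all recorded operators in one pass, without intermediate lists."""
        out = []
        for record in self.source:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):    # a filter rejected this record
                    keep = False
                    break
            if keep:
                out.append(record)
        return out


pipeline = (
    Pipeline(range(10))
    .map(lambda x: x * x)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x + 1)
)
print(len(pipeline.ops))    # 3 operators recorded before anything executes
print(pipeline.collect())   # [1, 5, 17, 37, 65]
```

Nothing is computed until `collect()` is called, which is what gives an engine the chance to inspect and rearrange the whole operator graph first.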
Other Spark Features In Demand
New Features In 2015
Data Frames
• Similar API to data frames in R and Pandas
• Automatically optimised via Spark SQL
• Released in Spark 1.3
SparkR
• Released in Spark 1.4
• Exposes DataFrames, RDDs and the ML library in R
Machine Learning Pipelines
• High Level API
• Featurization
• Evaluation
• Model Tuning
External Data Sources
• Platform API to plug data sources into Spark
• Pushes logic into sources
Source: Databricks
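The "pushes logic into sources" point under External Data Sources can be illustrated with a toy model. This is plain Python and does not use Spark's Data Sources API: `ToySource`, its `scan` method and the sample rows are all hypothetical. It assumes the source is able to evaluate a filter itself, and shows that fewer rows then cross the boundary between the source and the engine.

```python
# Toy model (not Spark's Data Sources API): a source that can apply a filter
# itself, so fewer rows are handed over to the processing engine.

class ToySource:
    def __init__(self, rows):
        self._rows = list(rows)
        self.rows_returned = 0      # rows handed over to the engine

    def scan(self, predicate=None):
        """Return rows, applying `predicate` inside the source if given."""
        if predicate is None:
            rows = list(self._rows)
        else:
            rows = [r for r in self._rows if predicate(r)]
        self.rows_returned += len(rows)
        return rows


def is_india(row):
    return row["country"] == "IN"


# Hypothetical sample data: every fourth row is tagged "IN".
rows = [{"id": i, "country": "IN" if i % 4 == 0 else "US"} for i in range(100)]

no_pushdown = ToySource(rows)
filtered_late = [r for r in no_pushdown.scan() if is_india(r)]   # filter in engine

with_pushdown = ToySource(rows)
filtered_early = with_pushdown.scan(predicate=is_india)          # filter in source

print(filtered_late == filtered_early)                           # same rows either way
print(no_pushdown.rows_returned, with_pushdown.rows_returned)    # 100 versus 25
```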