Igniting the Spark,
For the Love of Big Data
ThoughtWorks Gurgaon
By
Achal Aggarwal &
Syed Atif Akhtar
The 3 V’s revisited
Consumer Venue Artist
● Open source framework
● Used for storage and large scale processing of data-sets on clusters of
commodity hardware
● Mainly consists of the following two modules:
- HDFS (Distributed Storage)
- MapReduce (Analysis/Processing)
Hadoop
● Only Batch Processing.
● Hadoop MR API is not functional.
● MR has a bloated computation model.
● A job has no awareness of the surrounding MR pipeline, knowledge that could
be used for optimization.
● Iterative algorithms are difficult to implement.
Limitations with Hadoop MR
● Mappers do not write to file system (by default).
● Uses Akka for data communication between nodes.
● Lazy Computation.
● Functional syntax.
● Better RDD (Resilient Distributed Dataset) API.
● Spark Streaming extension for (near) real-time processing.
Spark to the rescue!
Apache Spark™ is a fast and general engine for large-scale data processing.
-Speed
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x
faster on disk.
Spark has an advanced DAG (directed acyclic graph) execution engine that supports
acyclic data flow and in-memory computing.
-Ease of Use
Write applications quickly in Java, Scala, Python, R.
Spark offers over 80 high-level operators that make it easy to build parallel
apps. And you can use it interactively from the Scala, Python and R shells.
About Spark
-Generality
Combine SQL, streaming, and complex analytics.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these libraries
seamlessly in the same application.
-Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse
data sources including HDFS, Cassandra, HBase, and S3.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN,
or on Apache Mesos. Access data in HDFS, Cassandra, HBase, Hive, Tachyon,
and any Hadoop data source.
About Spark (Cont...)
Spark Architecture
Dig Deeper..
RDDs are large collections of records with the following properties:
Immutable
Partitioned
Fault tolerant
Created by coarse grained operations
Lazily evaluated
Can be persisted
Resilient Distributed Datasets (RDDs)
What is an RDD?
The data within an RDD is split into several partitions.
Properties of partitions:
Partitions never span multiple machines, i.e., tuples in the same partition are
guaranteed to be on the same machine.
Each machine in the cluster contains one or more partitions.
The number of partitions to use is configurable. By default, it equals the total
number of cores on all executor nodes.
Two kinds of partitioning available in Spark:
Hash partitioning
Range partitioning
Partitioning
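The two partitioning schemes named on this slide can be sketched in plain Python. This is a model of the idea only, not Spark's partitioner API: the function names, the CRC32 hash, and the sample records are assumptions made for illustration.

```python
import zlib
from bisect import bisect_right


def hash_partition(key, num_partitions):
    """Pick a partition index from a stable hash of the key."""
    # CRC32 is an assumption of this sketch; it is used because Python's
    # built-in hash() of a string differs between interpreter runs.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, boundaries):
    """Pick a partition index from sorted boundary keys."""
    # Keys below boundaries[0] go to partition 0, and so on.
    return bisect_right(boundaries, key)


def partition_by(records, partitioner, num_partitions):
    """Distribute (key, value) records into num_partitions lists."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partitioner(key)].append((key, value))
    return parts


records = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4), ("banana", 5)]

# Hash partitioning: equal keys always land in the same partition.
hashed = partition_by(records, lambda k: hash_partition(k, 3), 3)

# Range partitioning: partitions hold contiguous, ordered key ranges.
ranged = partition_by(records, lambda k: range_partition(k, ["b", "c"]), 3)
```

In both cases all records sharing a key end up together, which is what lets key-based operations run without moving data between partitions again.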
An RDD keeps track of the chain of transformations (its lineage) that produced it.
If part of an RDD is lost, only the lost partitions are recomputed from that
lineage, not the entire dataset.
Fault Tolerance (Lineage)
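Lineage-based recovery can be modelled in a few lines of plain Python. The `SketchRDD` class below is hypothetical and is not Spark's implementation: each dataset remembers how to compute any one partition from its parent, so a lost partition can be rebuilt on its own.

```python
class SketchRDD:
    """Toy dataset that can recompute any single partition from its lineage."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute      # function: partition index -> list
        self.parent = parent         # lineage pointer to the parent dataset
        self.materialized = {}       # partitions currently held "in memory"
        self.recomputed = []         # log of partition indexes computed

    @classmethod
    def from_partitions(cls, partitions):
        # The source data is treated as stable storage in this sketch.
        return cls(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn):
        parent = self
        return SketchRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in parent.partition(i)],
            parent=self,
        )

    def partition(self, i):
        # Compute the partition only if it is missing.
        if i not in self.materialized:
            self.recomputed.append(i)
            self.materialized[i] = self._compute(i)
        return self.materialized[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = SketchRDD.from_partitions([[1, 2], [3, 4], [5, 6]])
squares = source.map(lambda x: x * x)
first = squares.collect()

# Simulate losing one partition, then ask for the data again.
squares.recomputed.clear()
del squares.materialized[1]
second = squares.collect()
```

After the simulated loss, `squares.recomputed` contains only partition 1, and the result of `collect()` is unchanged.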
Spark RDDs are lazily evaluated, i.e., no actual operation is performed on an RDD until
an action that requires the output is called, e.g., saving to disk or collect().
Lazy Evaluation
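A small plain-Python model makes the laziness visible. `LazyPipeline` and `traced_double` are hypothetical names used only for this sketch: `map` and `filter` merely record a step, and nothing runs until `collect()` is called.

```python
class LazyPipeline:
    """Records transformations and runs them only when collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Returns a new pipeline; the original is left unchanged.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # The "action": apply every recorded step to every element.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_double(x):
    calls.append(x)  # side effect so we can see when the function runs
    return x * 2


pipeline = LazyPipeline([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: nothing has executed yet
result = pipeline.collect()        # the action triggers the computation
```

Deferring the work this way is what gives an engine the chance to look at the whole chain of steps before running any of them.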
Intermediate output from an RDD can be persisted on the worker nodes.
This is worthwhile when an RDD will be reused by later computations.
[Diagram: RDD1, RDD2, RDD3]
Persistence
Accumulators - write-only on executors, read-only on the driver.
Broadcast Variables - written on the driver, read-only on executors.
Shared Variables
An RDD whose elements are key-value pairs/tuples (k, v).
Provides an additional set of operations, such as grouping and reducing by key.
Important for defining joins.
Pair RDDs
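To show why keyed pairs matter for joins, here is an inner join over two lists of (k, v) tuples in plain Python. The function name and the sample venue and event data are made up for illustration. The output shape, (key, (left value, right value)), follows the shape that a join on pair RDDs produces.

```python
from collections import defaultdict


def join_by_key(left, right):
    """Inner join of two lists of (key, value) pairs."""
    # Index the right side by key so each left record is matched quickly.
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_index.get(key, [])
    ]


venues = [("v1", "Arena"), ("v2", "Club")]
events = [("v1", "Rock Night"), ("v1", "Jazz Evening"), ("v3", "Opera")]

# Only keys present on both sides survive an inner join.
joined = join_by_key(venues, events)
```

Keys that appear on only one side ("v2" and "v3" here) are dropped, and a key with several matches yields one output pair per match.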
Transformations - create a new RDD from the original, which itself is left unchanged.
Actions - compute a result from the RDD and return it to the driver or write it out; they do not change the original data.
Types of Operations
https://www.mapr.com/ebooks/spark/03-apache-spark-architecture-overview.html
The Spark Stack
Spark Core
Spark Core (Cont...)
Spark Core - Example Word Count
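The code shown on this slide is not part of the transcript. As a stand-in, the sketch below reproduces the usual word-count shape (split lines into words, emit (word, 1) pairs, sum the values per key) in plain Python. It is not the authors' example and does not use the Spark API. The input lines are invented.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Count word occurrences using a flatMap -> map -> reduce-by-key shape."""
    # "flatMap": one line becomes many words.
    words = chain.from_iterable(line.split() for line in lines)
    # "map": each word becomes a (word, 1) pair.
    pairs = ((word.lower(), 1) for word in words)
    # "reduce by key": sum the 1s for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


lines = ["to be or not to be", "to spark or not to spark"]
counts = word_count(lines)
```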
Spark Streaming - Discretized stream processing
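Discretized stream processing treats a stream as a sequence of small batches, each handled with ordinary batch logic. The plain-Python sketch below shows only that idea. The `discretize` function, the one-second interval, and the timestamped events are assumptions for illustration, not Spark Streaming code.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event falls into exactly one interval-sized bucket.
        batches[int(timestamp // batch_interval)].append(value)
    # Intervals with no events are simply skipped in this sketch.
    return [batches[i] for i in sorted(batches)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "c"), (2.7, "c")]

# With a 1-second interval, the stream becomes three small batches.
batches = discretize(events, 1.0)

# Each batch can now be processed with ordinary batch logic.
batch_counts = [len(batch) for batch in batches]
```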
DataFrame: Can act as a distributed SQL query engine.
Data Sources: Computation over structured data stored in a wide variety of
formats, including Parquet, JSON, and Apache Avro.
JDBC Server: Connects to structured data stored in relational database
tables so that big data analytics can be performed with traditional BI tools.
Spark SQL
Spark Streaming & SQL - Example
Thank You!
Questions?

Geek Night - Functional Data Processing using Spark and Scala
