SlideShare ist ein Scribd-Unternehmen logo
1 von 15
Downloaden Sie, um offline zu lesen
Intro to Apache Spark
Marius Soutier
Freelance Software Engineer
@mariussoutier
Clustered In-Memory Computation
Motivation
• Classical data architectures break down
• RDMBS can’t handle large amounts of data well
• Most RDMBS can’t handle multiple input formats
• Most NoSQLs don’t offer analytics
Problem Running computations on BigData®
The 3 Vs of Big Data
Volume
100s of GB, TB, PB
Variety
Structured, Unstructured,
Semi-Structured
Velocity
Sensors, Realtime
“Fast Data”
Hadoop (1)
• De-facto standard for running computations on large amounts of
different data is Hadoop
• Hadoop consists of
• HDFS distributed, fault-tolerant file system
• Map/Reduce parallelizable computations pioneered by Google
• Hadoop is typically run on a (large) cluster of non-virtualized
commodity hardware
Hadoop (2)
• However, Map/Reduce are batch jobs with high latency
• Not suitable for interactive queries, real-time analytics,
or Machine Learning
• Pure Map/Reduce is hard to develop and maintain
Enter Spark
Spark
is a framework for
clustered
in-memory
data processing
• Developed at UC Berkeley, released in
2010
• Apache Top-Level Project Since February
2014, current version is 1.2.1 / 1.3.0
• USP: Uses cluster-wide available memory
to speed up computations
• Very active community
Apache Spark (1)
• Written in Scala (& Akka), 

APIs for Java and Python
• Programming model is a collection pipeline*
instead of Map/Reduce
• Supports batch, streaming, interactive, 

or all combined using unified API
Apache Spark (2)
* http://martinfowler.com/articles/collection-pipeline/
Spark Ecosystem
Spark Core
Spark SQL
Spark Hive
BlinkDB
Approximate
SQL
Spark
Streaming
MLlib
Machine
Learning
GraphX SparkR
ALPHA
ALPHA
ALPHA
Tachyon
Spark is a framework for clustered in-memory
data processing
Spark is a platform for data driven products.
• Base abstraction Resilient Distributed Dataset (RDD)
• Essentially a distributed collection of objects
• Can be cached in memory or on disk
RDD
RDD Word Count
val sc = new SparkContext()

val input: RDD[String] = sc.textFile("/tmp/word.txt")

val words: RDD[(String, Long)] = input

.flatMap(line => line.toLowerCase.split("s+"))

.map(word => word -> 1L)

.cache()



val wordCountsRdd: RDD[(String, Long)] = words

.reduceByKey(_ + _)

.sortByKey()


val wordCounts: Array[(String, Long)] = wordCountsRdd.collect()
Cluster
Driver
SparkContext
Master
Worker
Executor
Worker
Executor
Tasks
Tasks
• Spark app (driver) builds DAG from RDD operations
• DAG is split into tasks that are executed by workers
Example Architecture
Input
HDFS
Message Queue
Spark
Streaming
Spark
Batch Jobs
SparkSQL
Real-Time
Dashboard
Interactive
SQL
Analytics,
Reports
Demo
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystem
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Kudu demo
Kudu demoKudu demo
Kudu demo
 
Hadoop at ayasdi
Hadoop at ayasdiHadoop at ayasdi
Hadoop at ayasdi
 
An Overview of Apache Spark
An Overview of Apache SparkAn Overview of Apache Spark
An Overview of Apache Spark
 
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy StarzhinskySpark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
 
10 Things About Spark
10 Things About Spark 10 Things About Spark
10 Things About Spark
 
Hadoop
HadoopHadoop
Hadoop
 
The Meta of Hadoop - COMAD 2012
The Meta of Hadoop - COMAD 2012The Meta of Hadoop - COMAD 2012
The Meta of Hadoop - COMAD 2012
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Spark Core
Spark CoreSpark Core
Spark Core
 
Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a Glance
 
Spark and Hadoop Technology
Spark and Hadoop Technology Spark and Hadoop Technology
Spark and Hadoop Technology
 
Cloud Optimized Big Data
Cloud Optimized Big DataCloud Optimized Big Data
Cloud Optimized Big Data
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 

Andere mochten auch

Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Lucidworks
 
Scala the good and bad parts
Scala the good and bad partsScala the good and bad parts
Scala the good and bad parts
benewu
 

Andere mochten auch (10)

Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
 
Intro to Scala.js - Scala UG Cologne
Intro to Scala.js - Scala UG CologneIntro to Scala.js - Scala UG Cologne
Intro to Scala.js - Scala UG Cologne
 
Type Classes in Scala
Type Classes in ScalaType Classes in Scala
Type Classes in Scala
 
Intro to sbt-web
Intro to sbt-webIntro to sbt-web
Intro to sbt-web
 
Scala the good and bad parts
Scala the good and bad partsScala the good and bad parts
Scala the good and bad parts
 
Scala
ScalaScala
Scala
 
Scala - The Simple Parts, SFScala presentation
Scala - The Simple Parts, SFScala presentationScala - The Simple Parts, SFScala presentation
Scala - The Simple Parts, SFScala presentation
 
Scala - the good, the bad and the very ugly
Scala - the good, the bad and the very uglyScala - the good, the bad and the very ugly
Scala - the good, the bad and the very ugly
 
Vert.x vs akka
Vert.x vs akkaVert.x vs akka
Vert.x vs akka
 
A Brief Intro to Scala
A Brief Intro to ScalaA Brief Intro to Scala
A Brief Intro to Scala
 

Ähnlich wie Intro to Apache Spark

Ähnlich wie Intro to Apache Spark (20)

APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Spark 101
Spark 101Spark 101
Spark 101
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topics
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in Azure
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
 

Kürzlich hochgeladen

➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 

Kürzlich hochgeladen (20)

➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 

Intro to Apache Spark

  • 1. Intro to Apache Spark Marius Soutier Freelance Software Engineer @mariussoutier Clustered In-Memory Computation
  • 2. Motivation • Classical data architectures break down • RDMBS can’t handle large amounts of data well • Most RDMBS can’t handle multiple input formats • Most NoSQLs don’t offer analytics Problem Running computations on BigData®
  • 3. The 3 Vs of Big Data Volume 100s of GB, TB, PB Variety Structured, Unstructured, Semi-Structured Velocity Sensors, Realtime “Fast Data”
  • 4. Hadoop (1) • De-facto standard for running computations on large amounts of different data is Hadoop • Hadoop consists of • HDFS distributed, fault-tolerant file system • Map/Reduce parallelizable computations pioneered by Google • Hadoop is typically run on a (large) cluster of non-virtualized commodity hardware
  • 5. Hadoop (2) • However, Map/Reduce are batch jobs with high latency • Not suitable for interactive queries, real-time analytics, or Machine Learning • Pure Map/Reduce is hard to develop and maintain
  • 6. Enter Spark Spark is a framework for clustered in-memory data processing
  • 7. • Developed at UC Berkeley, released in 2010 • Apache Top-Level Project Since February 2014, current version is 1.2.1 / 1.3.0 • USP: Uses cluster-wide available memory to speed up computations • Very active community Apache Spark (1)
  • 8. • Written in Scala (& Akka), 
 APIs for Java and Python • Programming model is a collection pipeline* instead of Map/Reduce • Supports batch, streaming, interactive, 
 or all combined using unified API Apache Spark (2) * http://martinfowler.com/articles/collection-pipeline/
  • 9. Spark Ecosystem Spark Core Spark SQL Spark Hive BlinkDB Approximate SQL Spark Streaming MLlib Machine Learning GraphX SparkR ALPHA ALPHA ALPHA Tachyon
  • 10. Spark is a framework for clustered in-memory data processing Spark is a platform for data driven products.
  • 11. • Base abstraction Resilient Distributed Dataset (RDD) • Essentially a distributed collection of objects • Can be cached in memory or on disk RDD
  • 12. RDD Word Count val sc = new SparkContext()
 val input: RDD[String] = sc.textFile("/tmp/word.txt")
 val words: RDD[(String, Long)] = input
 .flatMap(line => line.toLowerCase.split("s+"))
 .map(word => word -> 1L)
 .cache()
 
 val wordCountsRdd: RDD[(String, Long)] = words
 .reduceByKey(_ + _)
 .sortByKey() 
 val wordCounts: Array[(String, Long)] = wordCountsRdd.collect()
  • 13. Cluster Driver SparkContext Master Worker Executor Worker Executor Tasks Tasks • Spark app (driver) builds DAG from RDD operations • DAG is split into tasks that are executed by workers
  • 14. Example Architecture Input HDFS Message Queue Spark Streaming Spark Batch Jobs SparkSQL Real-Time Dashboard Interactive SQL Analytics, Reports