Boston Apache Spark
User Group
(the Spahk group)
Microsoft NERD Center - Horace Mann
Tuesday, 15 July 2014
Intro to Apache Spark
Matthew Farrellee, @spinningmatt
Updated: July 2014
Background - MapReduce / Hadoop
● Map & reduce have been around for 5+ decades
(McCarthy's Lisp, 1960)
● Dean and Ghemawat demonstrate map and
reduce for distributed data processing
(Google, 2004)
● MapReduce paper timed well with
commodity hardware capabilities of the early
2000s
● Open source implementation in 2006
● Years of innovation improving, simplifying,
expanding
MapReduce / Hadoop difficulties
● Hardware evolved
○ Networks became fast
○ Memory became cheap
● Programming model proved non-trivial
○ Gave birth to multiple attempts to simplify, e.g. Pig,
Hive, ...
● Primarily batch execution mode
○ Begat specialized (non-batch) modes, e.g. Storm,
Drill, Giraph, ...
Some history - Spark
● Started in UC Berkeley AMPLab by Matei
Zaharia, 2009
○ AMP = Algorithms Machines People
○ AMPLab is integrating Algorithms, Machines, and
People to make sense of Big Data
● Open sourced, 2010
● Donated to Apache Software Foundation,
2013
● Graduated to top level project, 2014
● 1.0 release, May 2014
What is Apache Spark?
An open source, efficient and productive cluster
computing system that is interoperable with
Hadoop
Open source
● Top level Apache project
● http://www.ohloh.net/p/apache-spark
○ In a Nutshell, Apache Spark:
○ has had 7,366 commits made by 299 contributors representing 117,823 lines of code
○ is mostly written in Scala with well-commented source code
○ has a codebase with a long source history maintained by a very large development team with increasing year-over-year commits
○ took an estimated 30 years of effort (COCOMO model) starting with its first commit in March, 2010
Efficient
● In-memory primitives
○ Use cluster memory and spill to disk only when
necessary
● High performance
○ https://amplab.cs.berkeley.edu/benchmark/
● General compute graphs, DAGs
○ Not just: Load -> Map -> Reduce -> Store -> Load -> Map -> Reduce -> Store
○ Rich and pipelined: Load -> Map -> Union -> Reduce -> Filter -> Group -> Sample -> Store
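A minimal PySpark sketch of such a pipelined DAG (paths are the same elided placeholders the deck uses, the key-extraction logic is made up, and sc is a SparkContext as in the word-count example later in the deck):

logs_a = sc.textFile("hdfs://...")                          # Load
logs_b = sc.textFile("hdfs://...")                          # Load
result = (logs_a.union(logs_b)                              # Union
                .map(lambda line: (line.split(" ")[0], 1))  # Map to (key, 1)
                .reduceByKey(lambda a, b: a + b)            # Reduce
                .filter(lambda kv: kv[1] > 10)              # Filter
                .sample(False, 0.1))                        # Sample
result.saveAsTextFile("hdfs://...")  # Store -- the one action that runs the whole DAG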
Interoperable
● Read and write data from HDFS (or any
storage system with an HDFS-like API)
● Read and write Hadoop file formats
● Run on YARN
● Interact with Hive, HBase, etc.
Productive
● Unified data model, the RDD
● Multiple execution modes
○ Batch, interactive, streaming
● Multiple languages
○ Scala, Java, Python, R, SQL
● Rich standard library
○ Machine learning, streaming, graph processing, ETL
● Consistent API across languages
● Significant code reduction compared to
MapReduce
Consistent API, less code...
MapReduce (Java):
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Spark (Scala):
import org.apache.spark._
val sc = new SparkContext(new SparkConf().setAppName("word count"))
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark (Python):
from operator import add
from pyspark import SparkContext, SparkConf
sc = SparkContext(conf=SparkConf().setAppName("word count"))
file = sc.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(add))
counts.saveAsTextFile("hdfs://...")
Spark workflow
[Diagram: Load -> RDD -> Transform -> RDD -> Action -> Value, with Save writing RDDs back out]
The RDD
The resilient distributed dataset
A lazily evaluated, fault-tolerant collection of
elements that can be operated on in parallel
RDDs technically
1. a set of partitions ("splits" in Hadoop terms)
2. list of dependencies on parent RDDs
3. function to compute a partition given its
parents
4. optional partitioner (hash, range)
5. optional preferred location(s) for each
partition
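A quick way to see the partition structure from PySpark (a sketch, assuming a SparkContext sc; glom turns each partition into a list, making the splits visible):

rdd = sc.parallelize(range(10), 4)  # ask for 4 partitions
rdd.glom().collect()                # -> four lists, one per partition,
                                    #    e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]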
Load
Create an RDD.
● parallelize - convert a collection
● textFile - load a text file
● wholeTextFiles - load a dir of text files
● sequenceFile / hadoopFile - load using
Hadoop file formats
● More: http://spark.apache.org/docs/latest/programming-guide.html#external-datasets
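A few of these in PySpark (a sketch; paths are placeholders and sc is a SparkContext):

nums  = sc.parallelize([1, 2, 3, 4])     # from a local collection
lines = sc.textFile("hdfs://...")        # one element per line of text
files = sc.wholeTextFiles("hdfs://...")  # (filename, contents) pairs, one per file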
Transform
Lazy operations.
Build compute DAG.
Don't trigger computation.
● map(func) - elements passed through func
● flatMap(func) - func can return >=0 elements
● filter(func) - subset of elements
● sample(..., fraction, ...) - select fraction
● union(other) - union of two RDDs
● distinct - new RDD w/ distinct elements
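Each of these in PySpark (a sketch, assuming a SparkContext sc; results shown informally in comments):

nums = sc.parallelize([1, 2, 2, 3, 4])
nums.map(lambda x: x * 10)          # 10, 20, 20, 30, 40
nums.flatMap(lambda x: [x] * x)     # each element repeated x times (0+ outputs per input)
nums.filter(lambda x: x % 2 == 0)   # 2, 2, 4
nums.distinct()                     # 1, 2, 3, 4
nums.union(sc.parallelize([5, 6]))  # all elements of both RDDs
# None of these has computed anything yet -- the DAG just grew.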
Transform (cont)
● groupByKey - (K, V) -> (K, Seq[V])
● reduceByKey(func) - (K, V) -> (K, func(V...))
● sortByKey - order (K, V) by K
● join(other) - (K, V) + (K, W) -> (K, (V, W))
● cogroup/groupWith(other) - (K, V) + (K, W) -> (K, (Seq[V], Seq[W]))
● cartesian(other) - cartesian product (all
pairs)
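The key-based transformations in PySpark (a sketch, assuming a SparkContext sc; comment values are informal):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
other = sc.parallelize([("a", "x"), ("b", "y")])
pairs.reduceByKey(lambda a, b: a + b)  # ("a", 4), ("b", 2)
pairs.groupByKey()                     # ("a", [1, 3]), ("b", [2])
pairs.sortByKey()                      # ("a", 1), ("a", 3), ("b", 2)
pairs.join(other)                      # ("a", (1, "x")), ("a", (3, "x")), ("b", (2, "y"))
pairs.cogroup(other)                   # ("a", ([1, 3], ["x"])), ("b", ([2], ["y"]))
pairs.cartesian(other)                 # all pairs: (("a", 1), ("a", "x")), ...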
Transform (cont)
More transformations are available in the documentation:
http://spark.apache.org/docs/latest/programming-guide.html#transformations
Action
Active operations.
Trigger execution of DAG.
Result in a value.
● reduce(func) - reduce elements w/ func
● collect - convert to native collection
● count - count elements
● foreach(func) - apply func to elements
● take(n) - return some elements
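The same in PySpark (a sketch, assuming a SparkContext sc; each call triggers execution and returns a plain Python value):

nums = sc.parallelize([1, 2, 3, 4])
nums.reduce(lambda a, b: a + b)  # 10
nums.count()                     # 4
nums.take(2)                     # [1, 2]
nums.collect()                   # [1, 2, 3, 4] -- pulls every element to the driver, so use with care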
Action (cont)
More actions are available in the documentation:
http://spark.apache.org/docs/latest/programming-guide.html#actions
Save
Actions that result in data stored to the file system.
● saveAsTextFile
● saveAsSequenceFile
● saveAsObjectFile/saveAsPickleFile
● More: http://spark.apache.org/docs/latest/programming-guide.html#actions
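For example (a sketch; paths are placeholders, sc is a SparkContext, and format availability varies by release):

counts = sc.parallelize([("spark", 2), ("hadoop", 1)])
counts.saveAsTextFile("hdfs://...")    # one text line per element
counts.saveAsPickleFile("hdfs://...")  # elements serialized with Python's pickle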
Spark workflow
file = sc.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(add))
counts.saveAsTextFile("hdfs://...")
Multiple modes, rich standard library
[Diagram: Apache Spark = Spark Core with SQL, Streaming, MLlib, GraphX, ... libraries on top]
Spark SQL
● Components
○ Catalyst - generic optimization for relational algebra
○ Core - RDD execution; formats: Parquet, JSON
○ Hive support - run HiveQL and use Hive warehouse
● SchemaRDDs and SQL
[Diagram: an RDD of User objects becomes a SchemaRDD with Name, Age, Height columns]
sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
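A sketch of that workflow with the Spark 1.0-era Python API (inferSchema and registerAsTable were later superseded by the DataFrame API; the data is made up and sc is a SparkContext):

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
# An RDD of records (dicts), from which a schema is inferred:
people = sc.parallelize([{"name": "alice", "age": 14},
                         {"name": "bob",   "age": 25}])
schemaRdd = sqlCtx.inferSchema(people)
schemaRdd.registerAsTable("people")
teens = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teens.collect()  # a SchemaRDD is still an RDD, so RDD operations work on it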
Spark Streaming
● Run a streaming computation as a series of time-bound, deterministic batch jobs
● The time bound is used to break the stream into RDDs
[Diagram: Spark Streaming slices the input stream into RDDs, each X seconds wide; Spark then produces results from each RDD]
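A streaming word count sketch in that style, using the PySpark streaming API (note: Python support for streaming arrived in Spark 1.2, after this talk; at the time the streaming API was Scala/Java only, and the host and port here are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming word count")
ssc = StreamingContext(sc, 5)  # 5-second batches: the "time bound"
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # each batch interval yields one RDD of results
ssc.start()
ssc.awaitTermination()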
MLlib
● Machine learning algorithms over RDDs
● Classification - logistic regression, linear support vector
machines, naive Bayes, decision trees
● Regression - linear regression, regression trees
● Collaborative filtering - alternating least squares
● Clustering - K-Means
● Optimization - stochastic gradient descent, limited-memory BFGS
● Dimensionality reduction - singular value
decomposition, principal component analysis
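For instance, clustering with K-Means in PySpark (a sketch with made-up data, assuming a SparkContext sc; MLlib's Python API expects NumPy vectors):

from numpy import array
from pyspark.mllib.clustering import KMeans

points = sc.parallelize([array([0.0, 0.0]), array([1.0, 1.0]),
                         array([9.0, 8.0]), array([8.0, 9.0])])
model = KMeans.train(points, k=2, maxIterations=10)
model.predict(array([0.5, 0.5]))  # cluster id for a new point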
Deploying Spark
[Diagram: driver program with SparkContext, a cluster manager, and worker nodes running executors]
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Deploying Spark
● Driver program - shell or standalone
program that creates a SparkContext
and works with RDDs
● Cluster Manager - standalone, Mesos
or YARN
○ Standalone - the default, simple setup,
master + worker processes on nodes
○ Mesos - a general-purpose manager that runs Hadoop and other services. Two modes of operation: fine-grained and coarse-grained.
○ YARN - Hadoop 2’s resource manager
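Choosing a cluster manager from the driver side (a sketch; host names and ports are placeholders, and YARN deployments typically go through spark-submit instead):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("my app")
        .setMaster("spark://master-host:7077"))  # standalone master
# .setMaster("mesos://mesos-host:5050")          # or a Mesos master
# .setMaster("local[4]")                         # or run locally on 4 cores
sc = SparkContext(conf=conf)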
Highlights from Spark Summit 2014
Spark Summit East: New York in early 2015 - http://spark-summit.org/east/2015
Thanks to...
@pacoid, @rxin and @michaelarmbrust for
letting me crib slides for this introduction