Spark Overview
Lisa Hua
7/14/2014
Overview
● What is Spark?
● How does Spark work?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
What is Spark?
● Hadoop/YARN:
○ strong at processing large files in parallel
○ synchronization barrier when persisting data to
disk
○ MapReduce: launch mappers and reducers, read/write to disk,
then go back to the queue and request resources again
● Spark:
○ in-memory processing
○ iterative and interactive data analysis
○ compared to MapReduce, supports more complex
and interactive applications
Hadoop MapReduce
● Slow due to replication, serialization, and disk IO
● Inefficient for:
○ Iterative algorithms (Machine Learning, Graphs &
Network Analysis)
○ Interactive Data Mining (R, Excel, Searching)
Spark In-memory Processing
1. Extract a working set
2. Cache it
3. Query it repeatedly
Spark Ecosystem
How Spark Works - SparkContext
How Spark Works - RDD
● Partitions of Data
● Dependencies between partitions
Storage levels: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, ...
How Spark Works - RDD operations
Transformations
● Create a new dataset from an existing one.
● Lazy: executed only when an action is performed.
● Examples:
○ map(func)
○ filter(func)
○ distinct()
Actions
● Return a value or export data after performing a computation.
● Examples:
○ count()
○ reduce(func)
○ collect()
○ take(n)
Persistence
● Caches a dataset in memory for future operations.
● Stores it on disk, in RAM, or a mix of both.
● Examples:
○ persist()
○ cache()
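The split between lazy transformations and eager actions can be tried without a cluster. The sketch below is an analogy only: it uses Java's own java.util.stream (Java 8+) as a stand-in for an RDD pipeline, with intermediate operations (map, filter, distinct) playing the role of transformations and collect playing the role of an action. The class name LazyPipelineSketch and the mapCalls counter are hypothetical names introduced for illustration; none of this is Spark API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream stands in for an RDD pipeline; this is not Spark code.
class LazyPipelineSketch {
    // Counts how many times the map function has actually been invoked.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // "Transformations": describes the pipeline but does not run it.
    static Stream<Integer> buildPipeline(List<Integer> data) {
        return data.stream()
                .map(x -> { mapCalls.incrementAndGet(); return x * x; })
                .filter(x -> x % 2 == 1)
                .distinct();
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 3, 4, 5);
        Stream<Integer> pipeline = buildPipeline(data);
        System.out.println("map calls before the action: " + mapCalls.get());

        // The "action": only now are map, filter and distinct executed.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("distinct odd squares: " + result);
        System.out.println("map calls after the action: " + mapCalls.get());
    }
}
```

Building the pipeline costs nothing; the work happens once the terminal operation runs, which is the same behaviour the slide describes for RDD transformations and actions.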
How Spark Works: Word Count
How Spark Works - Actions
● Parallel Operations
How Spark Works - Stages
Each stage is executed as a series
of tasks (one task per
partition).
Stages form a DAG (Directed Acyclic Graph).
Spark Programming - Tasks
A task is the fundamental unit of execution in Spark.
How Spark Works - Summary
● SparkContext
● Resilient Distributed Datasets (RDDs)
● Parallel Operations
● Shared Variables
○ Broadcast Variables - read-only
○ Accumulators
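The two kinds of shared variables listed above can be pictured with plain Java. This is an analogy under stated assumptions, not the Spark API: a parallel stream stands in for Spark tasks, an unmodifiable map stands in for a read-only broadcast variable, and a LongAdder stands in for an add-only accumulator. The class name SharedVariablesSketch, the resolve method and the sample country codes are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

// Analogy only: parallel-stream workers stand in for Spark tasks; this is not the Spark API.
class SharedVariablesSketch {
    // Maps codes to names using a read-only lookup table (the "broadcast" value)
    // and counts unknown codes in an add-only counter (the "accumulator").
    static List<String> resolve(List<String> codes, Map<String, String> lookup, LongAdder unknown) {
        Map<String, String> readOnly = Collections.unmodifiableMap(lookup);
        return codes.parallelStream()
                .map(code -> {
                    String name = readOnly.get(code);
                    if (name == null) {
                        unknown.increment(); // workers only add; the caller reads the total
                        return "?";
                    }
                    return name;
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new HashMap<>();
        lookup.put("us", "United States");
        lookup.put("de", "Germany");
        LongAdder unknown = new LongAdder();
        List<String> names = resolve(Arrays.asList("us", "xx", "de", "yy"), lookup, unknown);
        System.out.println(names + ", unknown codes: " + unknown.sum());
    }
}
```

The point of the design is the direction of data flow: the lookup table flows out to every worker and is never modified, while the counter only flows back and is only ever added to.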
Compare Hadoop and Spark
                 Traditional OS   Hadoop                     Spark
Storage          File System      HDFS                       HDFS
Schedule         Processes        MapReduce                  Computation Graph
I/O              -                Disk                       Cache (in memory) and shared data
Fault Tolerance  -                Duplication and Disk I/O   Hash partition and auto-reconstruction
Spark - LogisticRegressionModel
1. Initialize the Spark JavaSparkContext
2. Prepare the dataset
3. Train the LR model
4. Evaluate
1. Initializing Spark
1. JavaSparkContext: tells Spark how to access the cluster
2. SparkConf: settings, a hashmap of <String, String>
a. required: AppName and Master; further settings have default values
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
2. Prepare Dataset
1. From parallelized collections
2. From external datasets
3. Passing functions to Spark
// 1. From a parallelized collection
List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(numbers);
// 2. From an external dataset: a local file ...
JavaRDD<String> distFile = sc.textFile("data.txt");
// ... or a file on HDFS:
// JavaRDD<String> distFile = sc.textFile("hdfs://data.txt");
// 3. Passing a function to Spark: parse each text line into a LabeledPoint
class ParseLabeledPoint implements Function<String, LabeledPoint> {
  public LabeledPoint call(String s) {...
    for (int i = 0; i < len; i++) {
      x[i] = Double.parseDouble(tokens[i]);
    }
    return new LabeledPoint(y, Vectors.dense(x));
  }
}
---
// The parser takes text lines, so it is applied to distFile (the RDD of strings)
JavaRDD<LabeledPoint> data = distFile.map(new ParseLabeledPoint());
3. Train LogisticRegressionModel
/*
* @param input RDD of (label, array of features) pairs.
* @param numIterations Number of iterations of gradient descent to run.
* @param stepSize Step size to be used for each iteration of gradient descent.
* @param miniBatchFraction Fraction of data to be used per iteration.
*/
// Train the model (train expects a Scala RDD, hence data.rdd())
LogisticRegressionModel lrModel = LogisticRegressionWithSGD.train(data.rdd(), iterations,
    stepSize, miniBatchFraction);
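To make the roles of numIterations, stepSize and miniBatchFraction concrete, here is a minimal plain-Java sketch of mini-batch gradient descent for logistic regression. It is not MLlib's implementation: the class name MiniBatchSgdSketch, the seed parameter, the omission of an intercept and of regularization, and the step-size decay of stepSize / sqrt(iteration) are assumptions chosen for illustration.

```java
import java.util.Random;

// Plain-Java sketch of mini-batch SGD for logistic regression; not MLlib's implementation.
class MiniBatchSgdSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    static double dot(double[] w, double[] x) {
        double s = 0.0;
        for (int j = 0; j < w.length; j++) s += w[j] * x[j];
        return s;
    }

    // Trains weights (no intercept) on features x and labels y in {0, 1}.
    static double[] train(double[][] x, double[] y, int numIterations,
                          double stepSize, double miniBatchFraction, long seed) {
        int d = x[0].length;
        double[] w = new double[d]; // start from all-zero weights
        Random rnd = new Random(seed);
        for (int iter = 1; iter <= numIterations; iter++) {
            double[] grad = new double[d];
            int batchSize = 0;
            for (int i = 0; i < x.length; i++) {
                // Each point joins the mini-batch with probability miniBatchFraction.
                if (rnd.nextDouble() >= miniBatchFraction) continue;
                double err = sigmoid(dot(w, x[i])) - y[i]; // derivative of the log-loss w.r.t. the margin
                for (int j = 0; j < d; j++) grad[j] += err * x[i][j];
                batchSize++;
            }
            if (batchSize == 0) continue; // empty batch: nothing to update
            // Assumed schedule: the step shrinks as stepSize / sqrt(iter).
            double step = stepSize / Math.sqrt(iter);
            for (int j = 0; j < d; j++) w[j] -= step * grad[j] / batchSize;
        }
        return w;
    }

    // Predicts 1.0 when the estimated probability is at least 0.5, else 0.0.
    static double predict(double[] w, double[] x) {
        return sigmoid(dot(w, x)) >= 0.5 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 2}, {2, 1}, {2, 3}, {-1, -2}, {-2, -1}, {-3, -2}};
        double[] y = {1, 1, 1, 0, 0, 0};
        double[] w = train(x, y, 100, 1.0, 0.5, 42L);
        System.out.println("w = [" + w[0] + ", " + w[1] + "]");
        System.out.println("predict([1.5, 1.5]) = " + predict(w, new double[]{1.5, 1.5}));
    }
}
```

A miniBatchFraction of 1.0 uses every point in every iteration (plain batch gradient descent); smaller fractions trade gradient accuracy for less work per iteration.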
4. Calculate Score - Evaluation
1. Convert LogisticRegressionModel to PMML model
pmmlModel = new PMMLSparkLogisticRegressionModel()
    .adaptMLModelToPMML(lrModel, partialPmmlModel);
2. Prepare DataSet and calculate score
// use LogisticRegressionModel
JavaRDD<Vector> evalVectors = lines.map(new ParseVector());
List<Double> evalList = lrModel.predict(evalVectors).collect();
// use PMMLEvaluator
RegressionModelEvaluator evaluator = new RegressionModelEvaluator(pmml);
List<Double> evalResult = evaluator.evaluate(evalData);
// compare the two evaluators' results
for (...) {
    Assert.assertEquals(getPMMLEvaluatorResult(i), evalList.get(i), DELTA);
}
Why Spark? - scalability & performance
1. leverages the memory of the cluster for in-memory
processing
2. Computation graph optimization for parallel
execution
Shark: Spark SQL, Hive in Spark
Hive: manages large datasets in
distributed storage
Why Spark? - compatibility
1. compatible with HDFS, HBase, and any
Hadoop storage system
Why Spark? - Ease of Use API
1. Expressive API in Java, Scala, and Python
2. Supports more parallel operations
Expressive API - MapReduce
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Mapper: emits (word, 1) for every token in the input line
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
// Reducer: sums the counts emitted for each word
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Expressive API - Spark
Scala:
val file = spark.textFile("hdfs://...")
// flatMap (not map) so that each line yields individual words
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
Java 6, Java 7:
JavaRDD<String> file = spark.textFile("hdfs://...");
// Split each line into words (flatMap, since one line yields many words)
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
// Pair each word with a count of 1 (mapToPair in the Spark 1.x Java API)
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
// Sum the counts per word
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
Why Spark? - Third Party Software
● Mahout
○ Say goodbye to MapReduce
○ Support for Apache Spark
■ Mahout-Spark Shell: facilitates working with the Mahout data
structures, such as Matrix, etc.
○ Support for H2O being explored
○ Support for Apache Flink possibly in the future
● H2O
○ Sparkling Water - embraces in-memory
processing with ML algorithms
       Purpose                            Language                     Storage     Stakeholder
H2O    In-memory ML predictive analysis   Java/R                       K/V store   data analyst
Spark  In-memory processing engine        Scala, supports Java/Python  RDD         HDFS user
Why Spark? - Third Party Software
● Pig on Spark - Spork
● Other commercial software
How to use Spark in Shifu?
1. train: LogisticRegressionTrainer
2. stats & normalize
3. eval: add more evaluation metrics
a. precision, recall, F-measure, precision-recall curve:
pr(), precisionByThreshold(), recallByThreshold(), ...
b. area under the curve (AUC): areaUnderPR()
c. receiver operating characteristic (ROC): areaUnderROC(), roc()
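The metrics named above can be computed by hand on a small example, which helps when checking what the library methods return. The following plain-Java sketch is not the MLlib code: the class name BinaryMetricsSketch, its method names, and the sample scores are hypothetical. It computes precision, recall and F1 at one threshold, and the area under the ROC curve as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half.

```java
// Plain-Java sketch of binary classification metrics; not the MLlib implementation.
class BinaryMetricsSketch {
    // Returns {precision, recall, F1} when predicting positive for score >= threshold.
    static double[] precisionRecallF1(double[] scores, int[] labels, double threshold) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean predictedPositive = scores[i] >= threshold;
            if (predictedPositive && labels[i] == 1) tp++;
            else if (predictedPositive && labels[i] == 0) fp++;
            else if (!predictedPositive && labels[i] == 1) fn++;
        }
        // Conventions for empty denominators are a choice made for this sketch.
        double precision = (tp + fp == 0) ? 1.0 : (double) tp / (tp + fp);
        double recall = (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
        double f1 = (precision + recall == 0.0) ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[]{precision, recall, f1};
    }

    // Area under the ROC curve via pairwise comparison (O(n^2), fine for small samples).
    // Requires at least one positive and one negative label.
    static double areaUnderRoc(double[] scores, int[] labels) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (labels[i] != 1) continue;
            for (int j = 0; j < scores.length; j++) {
                if (labels[j] != 0) continue;
                pairs++;
                if (scores[i] > scores[j]) wins += 1.0;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.4, 0.3, 0.1};
        int[] labels = {1, 1, 0, 1, 0, 0};
        double[] prf = precisionRecallF1(scores, labels, 0.5);
        System.out.println("precision=" + prf[0] + " recall=" + prf[1] + " f1=" + prf[2]);
        System.out.println("AUC(ROC)=" + areaUnderRoc(scores, labels));
    }
}
```

On the sample data, the threshold 0.5 gives two true positives, one false positive and one false negative, and eight of the nine positive/negative pairs are ranked correctly.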
Related Projects
1. Bulk Synchronous Parallel (BSP)
a. parallel computing based on message passing
b. BSP: local computation, global communication,
barrier synchronization
c. graph processing: Pregel, Giraph
d. scientific computing: Hama
e. operation DAG optimization: Flink
[Chart not recovered; axis labels: Seconds, Nodes]
Take Away - Big Data has moved in-memory
1. In-memory big data has come of age.
2. Spark leverages the cluster memory for
iterative and interactive operations
3. Spark is compatible with HDFS, HBase, and
any Hadoop storage system
4. Spark powers a stack of high-level tools
including Spark SQL, MLlib for machine
learning, GraphX, and Spark Streaming
5. Spark has an expressive API
Questions
3. Train LogisticRegressionModel (cont.)
2. Calculate weights
val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)
val weights =
  if (addIntercept) { ...
  } else { weightsWithIntercept }
3. Gradient Descent optimize()
4. Training Error - not accessible from LogisticRegressionModel
logInfo("Last 10 stochastic losses %s".format(stochasticLoss.takeRight(10)))
14/07/09 14:10:40 INFO optimization.GradientDescent: Last 10 stochastic losses 0.6931471805599468,
0.5255572298404575,.., 0.3444544005102222, 0.3355921369255156
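The first loss in the log is worth a second look. Assuming training starts from all-zero weights (the slide does not state the initialization, so this is an assumption), the logistic loss of every example equals ln 2, whatever its label, which matches the logged 0.6931... The short derivation, for a label y in {0, 1}:

```latex
\ell(w; x, y) = \log\!\left(1 + e^{w^{\top} x}\right) - y\, w^{\top} x,
\qquad y \in \{0, 1\}

w = 0 \;\Longrightarrow\; \ell(0; x, y) = \log\!\left(1 + e^{0}\right) = \log 2 \approx 0.6931
```

The later values (0.5255..., down to 0.3355...) then show the average loss decreasing as the weights move away from zero.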

Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Kürzlich hochgeladen (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Spark overview

  • 2. Overview ● What is Spark? ● How Spark works? ○ Mechanism ○ Logistic Regression Model ● Why Spark? ● How to leverage Spark?
  • 3. What is Spark ● Hadoop/YARN: ○ strong at processing large files in parallel ○ synchronization barrier when persisting data to disk ○ MapReduce: launches mapper & reducer, reads/writes to disk, then goes back to the queue to get resources ● Spark: ○ in-memory processing ○ iterative and interactive data analysis ○ compared to MapReduce, supports more complex and interactive applications
  • 4. Hadoop MapReduce ● Slow due to replication, serialization, and disk IO ● Inefficient for: ○ Iterative algorithms (Machine Learning, Graphs & Network Analysis) ○ Interactive Data Mining (R, Excel, Searching)
  • 5. Spark In-memory Processing 1.Extract a working set 2.Cache it 3.Query it repeatedly
  • 8. How Spark Works - SparkContext
  • 9. How Spark Works - RDD ● Partitions of Data ● Dependencies between partitions Storage Types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, ...
  • 10. How Spark Works - RDD operations Transformations ● Create a new dataset from an existing one. ● Lazy in nature, executed only when some action is performed. ● Example: ○ map(func) ○ filter(func) ○ distinct() Actions ● Return a value or export data after performing a computation. ● Example: ○ count() ○ reduce(func) ○ collect() ○ take(n) Persistence ● Caches a dataset in memory for future operations ● Stores on disk, in RAM, or mixed ● Example: ○ persist() ○ cache()
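The lazy-transformation / eager-action split on this slide can be illustrated on a single JVM. The sketch below is only an analogy and uses `java.util.stream` rather than Spark: intermediate stream operations stand in for transformations, and the terminal `collect` stands in for an action. The class name `LazyPipelineSketch` and the counter `MAP_CALLS` are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Single-JVM analogy (java.util.stream, not Spark) for lazy transformations and eager actions. */
final class LazyPipelineSketch {

    /** Counts how many times the map function has actually been invoked. */
    static final AtomicInteger MAP_CALLS = new AtomicInteger();

    /** Only describes the pipeline: map, filter and distinct are recorded, nothing runs yet. */
    static Stream<Integer> buildPipeline(List<Integer> data) {
        return data.stream()
                .map(x -> { MAP_CALLS.incrementAndGet(); return x * x; }) // plays the role of map(func)
                .filter(x -> x % 2 == 1)                                   // plays the role of filter(func)
                .distinct();                                               // plays the role of distinct()
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = buildPipeline(Arrays.asList(1, 2, 3, 3, 4, 5));
        System.out.println("map calls before the action: " + MAP_CALLS.get());

        // The terminal operation plays the role of an action such as collect().
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("collected: " + result);
        System.out.println("map calls after the action: " + MAP_CALLS.get());
    }
}
```

One difference worth keeping in mind: a Java stream can be consumed only once, whereas an RDD can be reused by several actions, which is where `persist()` and `cache()` come in.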
  • 11. How Spark Works: Word Count
  • 16. How Spark Works - Actions ● Parallel Operations
  • 18. How Spark Works - Stages Each stage is executed as a series of Tasks (one Task for each Partition). DAG (Directed Acyclic Graph).
  • 19. Spark Programming - Tasks A Task is the fundamental unit of execution in Spark.
  • 20. How Spark Works - Summary ● SparkContext ● Resilient Distributed Datasets (RDDs) ● Parallel Operations ● Shared Variables ○ Broadcast Variables - read-only ○ Accumulators
  • 21. Compare Hadoop and Spark (columns: Traditional OS, Hadoop, Spark) ● Storage: File System, HDFS, HDFS ● Schedule: Processes, MapReduce, Computation Graph ● I/O: Disk; Cache (in memory) and shared data ● Fault Tolerance: Duplication and Disk I/O; Hash partition and auto-reconstruction
  • 23. Spark - LogisticRegressionModel 1. Initialize spark JavaSparkContext 2. Prepare dataSet 3. Train LR model 4. Evaluation
  • 24. 1. Initializing Spark 1. JavaSparkContext: tells Spark how to access the cluster 2. SparkConf: settings, a hashmap of <String,String> a. required: AppName, Master; other settings have defaults SparkConf conf = new SparkConf().setAppName(appName).setMaster(master); JavaSparkContext sc = new JavaSparkContext(conf);
  • 25. 2. Prepare Dataset 1. From Parallelized Collections 2. From External DataSets 3. Passing Functions to Spark List<Integer> data = Arrays.asList(1, 2, 3, 4, 5); JavaRDD<Integer> distData = sc.parallelize(data); JavaRDD<String> distFile = sc.textFile("data.txt"); OR JavaRDD<String> distFile = sc.textFile("hdfs://data.txt"); class ParseLabeledPoint implements Function<String, LabeledPoint> { public LabeledPoint call(String s) {... for (int i = 0; i < len; i++) { x[i] = Double.parseDouble(tokens[i]); } return new LabeledPoint(y, Vectors.dense(x)); }} --- JavaRDD<LabeledPoint> data = distFile.map(new ParseLabeledPoint());
  • 26. 3. Train LogisticRegressionModel /* * @param input RDD of (label, array of features) pairs. * @param numIterations Number of iterations of gradient descent to run. * @param stepSize Step size to be used for each iteration of gradient descent. * @param miniBatchFraction Fraction of data to be used per iteration. */ LogisticRegressionModel lrModel = LogisticRegressionWithSGD.train(data, iterations, stepSize,miniBatchFraction); Train the model
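To make the roles of `numIterations`, `stepSize` and `miniBatchFraction` concrete, here is a minimal plain-Java sketch of mini-batch gradient descent for logistic regression. It is not the MLlib implementation: the class `MiniBatchSgdSketch`, the `seed` parameter, the per-example sampling and the `stepSize / sqrt(iteration)` decay are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

/** Plain-Java sketch of mini-batch gradient descent for logistic regression (not MLlib code). */
final class MiniBatchSgdSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    static double dot(double[] w, double[] x) {
        double s = 0.0;
        for (int j = 0; j < w.length; j++) {
            s += w[j] * x[j];
        }
        return s;
    }

    /**
     * @param x                 features, one row per example
     * @param y                 labels in {0, 1}
     * @param numIterations     number of gradient steps
     * @param stepSize          base step size
     * @param miniBatchFraction probability that an example joins the batch of an iteration
     * @param seed              seed for the sampler (hypothetical parameter, for repeatability)
     */
    static double[] train(double[][] x, double[] y, int numIterations,
                          double stepSize, double miniBatchFraction, long seed) {
        int d = x[0].length;
        double[] w = new double[d]; // assumption: start from all-zero weights
        Random rnd = new Random(seed);
        for (int iter = 1; iter <= numIterations; iter++) {
            double[] grad = new double[d];
            int batchSize = 0;
            for (int i = 0; i < x.length; i++) {
                if (rnd.nextDouble() >= miniBatchFraction) {
                    continue; // example not sampled in this iteration
                }
                double err = sigmoid(dot(w, x[i])) - y[i]; // gradient of the log loss w.r.t. the margin
                for (int j = 0; j < d; j++) {
                    grad[j] += err * x[i][j];
                }
                batchSize++;
            }
            if (batchSize == 0) {
                continue; // nothing sampled, skip the update
            }
            double step = stepSize / Math.sqrt(iter); // assumed decay schedule
            for (int j = 0; j < d; j++) {
                w[j] -= step * grad[j] / batchSize;
            }
        }
        return w;
    }

    /** Predicts 1.0 when the estimated probability is at least 0.5, else 0.0. */
    static double predict(double[] w, double[] x) {
        return sigmoid(dot(w, x)) >= 0.5 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        // First column is a constant bias feature.
        double[][] x = {{1, -2}, {1, -1}, {1, 1}, {1, 2}};
        double[] y = {0, 0, 1, 1};
        double[] w = train(x, y, 100, 1.0, 1.0, 42L);
        System.out.println("weights: " + Arrays.toString(w));
    }
}
```

With `miniBatchFraction = 1.0` every example is used in every iteration, so the loop reduces to full-batch gradient descent; smaller fractions trade gradient accuracy for less work per step.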
  • 27. 4. Calculate Score - Evaluation 1. Convert LogisticRegressionModel to PMML model pmmlModel = new PMMLSparkLogisticRegressionModel() .adaptMLModelToPMML(lrModel, partialPmmlModel); 2. Prepare DataSet and calculate score //use LogisticRegressionModel JavaRDD<Vector> evalVectors = lines.map(new ParseVector()); List<Double> evalList = lrModel.predict(evalVectors).collect(); //use PMMLEvaluator RegressionModelEvaluator evaluator = new RegressionModelEvaluator(pmml); List<Double> evalResult = evaluator.evaluate(evalData); //compare two evaluator results for (...) { Assert.assertEquals(getPMMLEvaluatorResult(i), evalList.get(i), DELTA); }
  • 29. Why Spark? - scalability & performance 1. leverages the memory of the cluster for in-memory processing 2. Computation Graph optimization for parallel execution Shark: Spark SQL, Hive in Spark Hive: manages large datasets in distributed storage
  • 30. Why Spark? - compatibility 1. compatible with HDFS, HBase, and any Hadoop storage system
  • 31. Why Spark? - Ease of Use API 1. Expressive API in Java, Scala, and Python 2. Supports more parallel operations
  • 32. Expressive API - MapReduce public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); }}} public static class WordCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }
  • 33. Expressive API - Spark Scala: val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _)
  • 34. Expressive API - Spark Java 6, Java 7: JavaRDD<String> file = spark.textFile("hdfs://..."); JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } }); JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() { public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); } }); JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() { public Integer call(Integer a, Integer b) { return a + b; }});
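The three word-count steps (split lines into words, pair each word with 1, sum the counts per word) can be traced on a single machine without a cluster. The sketch below uses only the JDK and is not Spark code; `LocalWordCountSketch` and `countWords` are hypothetical names, and `Collectors.toMap` with `Integer::sum` stands in for the pair-and-reduce-by-key steps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Single-machine sketch of the word-count steps using only the JDK (not Spark). */
final class LocalWordCountSketch {

    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // Step 1: one line becomes many words, so a flat-map is needed rather than a map.
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // Steps 2 and 3: emit (word, 1) and merge the values that share a key by summing.
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Arrays.asList("to be or", "not to be"));
        System.out.println(counts);
    }
}
```

The flat-map step is the part that matters most: mapping each line to an array of words would yield a collection of arrays, not a collection of words, and the per-word counting would then operate on the wrong elements.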
  • 35. Why Spark? - Third-Party Software ● Mahout ○ Say goodbye to MapReduce ○ Support for Apache Spark ■ Mahout-Spark Shell: facilitates use of the Mahout data structures, such as Matrix, etc. ○ Support for H2O being explored ○ Support for Apache Flink possibly in future ● H2O ○ Sparkling Water - embraces in-memory processing with ML algorithms ● H2O vs. Spark ○ H2O - Purpose: in-memory ML predictive analysis; Language: Java/R; Storage: K/V store; Stakeholder: data analyst ○ Spark - Purpose: in-memory processing engine; Language: Scala, with Java/Python support; Storage: RDD; Stakeholder: HDFS user
  • 36. Why Spark - Third-Party Software ● Pig on Spark - Spork ● Other commercial software
  • 38. How to use Spark in Shifu? 1. train: LogisticRegressionTrainer 2. stats & normalize 3. eval: add more evaluation metrics a. precision, recall, F-measure, precision-recall curve - pr(), precisionByThreshold(),recallByThreshold().. b. area under the curves (AUC) - areaUnderPR() c. receiver operating characteristic (ROC) - areaUnderROC(), roc()
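The metrics listed on this slide can be computed by hand on a small example. The following plain-Java sketch is not the MLlib metrics code: `BinaryMetricsSketch` and its method names are hypothetical, and the AUC is computed by direct pairwise comparison of positive and negative scores (the rank interpretation of the ROC area), which is a swapped-in technique chosen for brevity rather than what any library does internally.

```java
/** Plain-Java sketch of threshold metrics and ROC AUC for a binary classifier. */
final class BinaryMetricsSketch {

    /** Returns {precision, recall, F1} when scores at or above the threshold count as positive. */
    static double[] precisionRecallF1(double[] scores, int[] labels, double threshold) {
        int tp = 0;
        int fp = 0;
        int fn = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean predictedPositive = scores[i] >= threshold;
            if (predictedPositive && labels[i] == 1) {
                tp++;
            } else if (predictedPositive) {
                fp++;
            } else if (labels[i] == 1) {
                fn++;
            }
        }
        double precision = (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp);
        double recall = (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
        double f1 = (precision + recall == 0.0) ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    /** ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half. */
    static double areaUnderRoc(double[] scores, int[] labels) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (labels[i] != 1) {
                continue;
            }
            for (int j = 0; j < scores.length; j++) {
                if (labels[j] != 0) {
                    continue;
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        return pairs == 0 ? Double.NaN : wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.4, 0.3};
        int[] labels = {1, 0, 1, 0};
        double[] prf = precisionRecallF1(scores, labels, 0.5);
        System.out.println("precision=" + prf[0] + " recall=" + prf[1] + " f1=" + prf[2]);
        System.out.println("AUC=" + areaUnderRoc(scores, labels));
    }
}
```

The pairwise AUC is quadratic in the number of examples, so it suits small sanity checks; a curve-based computation over sorted scores would be the usual choice at scale.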
  • 39. Related Projects 1. Bulk Synchronous Parallel a. parallel computing on message-passing b. BSP: local computation, global communication, barrier synchronization c. graph processing: Pregel, Giraph d. scientific computing: Hama e. optimize operation DAG: Flink
  • 40. Take Away - Big Data has moved in-memory 1. In-memory big data has come of age. 2. Spark leverages the cluster memory for iterative and interactive operations 3. Spark is compatible with HDFS, HBase, and any Hadoop storage system 4. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming 5. Spark has an expressive API
  • 42. 3. Train LogisticRegressionModel (cont.) val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept) val weights = if (addIntercept) { ... } else { weightsWithIntercept } 2. Calculate weights 3. Gradient Descent optimize() 4. Training Error - not accessible from LogisticRegressionModel logInfo("Last 10 stochastic losses %s".format(stochasticLoss.takeRight(10))) 14/07/09 14:10:40 INFO optimization.GradientDescent: Last 10 stochastic losses 0.6931471805599468, 0.5255572298404575,.., 0.3444544005102222, 0.3355921369255156
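The first logged loss, 0.6931..., is close to ln 2, which is what the logistic loss evaluates to when every margin is zero: the predicted probability is then 0.5 for every example, and -ln(0.5) = ln 2 regardless of the label. That would be consistent with the optimizer starting from all-zero weights, although the slide does not state this, so treat it as an assumption. A minimal check in plain Java, with `InitialLogLossSketch` and `meanLogLoss` as hypothetical names:

```java
/** Checks that the mean logistic loss at zero margin equals ln 2 (plain Java, not MLlib). */
final class InitialLogLossSketch {

    /** Mean logistic loss for labels in {0, 1}, given the margin w.x of each example. */
    static double meanLogLoss(double[] margins, int[] labels) {
        double sum = 0.0;
        for (int i = 0; i < margins.length; i++) {
            double p = 1.0 / (1.0 + Math.exp(-margins[i])); // predicted probability of label 1
            sum += (labels[i] == 1) ? -Math.log(p) : -Math.log(1.0 - p);
        }
        return sum / margins.length;
    }

    public static void main(String[] args) {
        // Zero weights give zero margins, whatever the features are.
        double loss = meanLogLoss(new double[] {0.0, 0.0, 0.0}, new int[] {1, 0, 1});
        System.out.println("loss at zero weights: " + loss + ", ln 2: " + Math.log(2));
    }
}
```

The later logged values decrease from that starting point, which is the expected pattern when gradient descent is making progress on the training data.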