INTRODUCTION TO 
APACHE SPARK 
Mohamed Hedi Abidi - Software Engineer @ebiznext 
@mh_abidi
CONTENT 
 Spark Introduction 
 Installation 
 Spark-Shell 
 SparkContext 
 RDD 
 Persistence 
 Simple Spark Apps 
 Deployment 
 Spark SQL 
 Spark GraphX 
 Spark MLlib 
 Spark Streaming 
 Spark & Elasticsearch
INTRODUCTION 
An open source cluster computing framework for data 
analytics 
In-memory data processing 
Up to 100x faster than Hadoop MapReduce for in-memory workloads 
Supports the MapReduce programming model
INTRODUCTION 
 Handles batch, interactive, and real-time workloads within a single 
framework
INTRODUCTION 
 Programming at a higher level of abstraction : faster, 
easier development
INTRODUCTION 
 Highly accessible through standard APIs built in Java, 
Scala, Python, or SQL (for interactive queries), and a rich 
set of machine learning libraries 
 Compatibility with the existing Hadoop v1 (SIMR) and 
2.x (YARN) ecosystems so companies can leverage their 
existing infrastructure.
INSTALLATION 
 Install JDK 1.7+, Scala 2.10.x, sbt 0.13.7, Maven 3.0+ 
 Download and unzip Apache Spark 1.1.0 sources 
Or clone development Version : 
git clone git://github.com/apache/spark.git 
 Run Maven to build Apache Spark 
mvn -DskipTests clean package 
 Launch Apache Spark standalone REPL 
[spark_home]/bin/spark-shell 
 Go to the Spark UI at 
http://localhost:4040
SPARK-SHELL 
 we’ll run Spark’s interactive shell… within the “spark” 
directory, run: 
./bin/spark-shell 
 then from the “scala>” REPL prompt, let’s create some 
data… 
scala> val data = 1 to 10000 
 create an RDD based on that data… 
scala> val distData = sc.parallelize(data) 
 then use a filter to select values less than 10… 
scala> distData.filter(_ < 10).collect()
SPARKCONTEXT 
 The first thing a Spark program must do is to create a 
SparkContext object, which tells Spark how to access a 
cluster. 
 In the shell for either Scala or Python, this is the sc 
variable, which is created automatically 
 Other programs must use a constructor to instantiate a 
new SparkContext 
val conf = new SparkConf().setAppName(appName).setMaster(master) 
new SparkContext(conf)
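
For a standalone application, the same two lines sit inside a main method. A minimal sketch (the object name, master URL, and workload below are illustrative, not taken from the slides):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // "local[2]" is only for local testing; in a cluster the master is usually passed via spark-submit
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val evenCount = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
    println(s"Even numbers: $evenCount")
    sc.stop()
  }
}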
RDDS 
 Resilient Distributed Datasets (RDDs) are the primary 
abstraction in Spark – an immutable, distributed 
collection of data, partitioned across the machines 
in a cluster 
 There are currently two types: 
 parallelized collections : Take an existing Scala collection and 
run functions on it in parallel 
 External datasets : Spark can create distributed datasets from 
any storage source supported by Hadoop, including local file 
system, HDFS, Cassandra, HBase, Amazon S3, etc.
RDDS 
 Parallelized collections 
scala> val data = Array(1, 2, 3, 4, 5) 
data: Array[Int] = Array(1, 2, 3, 4, 5) 
scala> val distData = sc.parallelize(data) 
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at 
parallelize at <console>:14 
 External datasets 
scala> val distFile = sc.textFile("README.md") 
distFile: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[7] at 
textFile at <console>:12
RDDS 
 Two types of operations on RDDs: 
transformations and actions 
 A transformation is a lazy (not computed immediately) 
operation on an RDD that yields another RDD 
 An action is an operation that triggers a computation and 
returns a value back to the driver program, or writes to a stable 
storage system
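
As an illustration of this laziness (a sketch to run in spark-shell; the data and operations are illustrative):

val numbers = sc.parallelize(1 to 1000000)   // RDD created, nothing computed yet
val evens = numbers.filter(_ % 2 == 0)       // transformation: lazy, only records the lineage
val doubled = evens.map(_ * 2)               // still lazy
val total = doubled.count()                  // action: triggers the actual computation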
RDDS : COMMONLY USED TRANSFORMATIONS 
Transformation & Purpose Example & Result 
filter(func) 
Purpose: new RDD by selecting 
those data elements on which 
func returns true 
scala> val rdd = 
sc.parallelize(List("ABC","BCD","DEF")) 
scala> val filtered = rdd.filter(_.contains("C")) 
scala> filtered.collect() 
Result: 
Array[String] = Array(ABC, BCD) 
map(func) 
Purpose: return new RDD by 
applying func on each data 
element 
scala> val rdd=sc.parallelize(List(1,2,3,4,5)) 
scala> val times2 = rdd.map(_*2) 
scala> times2.collect() 
Result: 
Array[Int] = Array(2, 4, 6, 8, 10) 
flatMap(func) 
Purpose: Similar to map but func 
returns a Seq instead of a value. 
For example, mapping a sentence 
into a Seq of words 
scala> val rdd=sc.parallelize(List("Spark is 
awesome","It is fun")) 
scala> val fm=rdd.flatMap(str=>str.split(" ")) 
scala> fm.collect() 
Result: 
Array[String] = Array(Spark, is, awesome, It, is, fun)
RDDS : COMMONLY USED TRANSFORMATIONS 
Transformation & Purpose Example & Result 
reduceByKey(func,[numTasks]) 
Purpose: To aggregate values of a 
key using a function. “numTasks” 
is an optional parameter to specify 
number of reduce tasks 
scala> val word1=fm.map(word=>(word,1)) 
scala> val wrdCnt=word1.reduceByKey(_+_) 
scala> wrdCnt.collect() 
Result: 
Array[(String, Int)] = Array((is,2), (It,1), 
(awesome,1), (Spark,1), (fun,1)) 
groupByKey([numTasks]) 
Purpose: To convert (K,V) to 
(K,Iterable<V>) 
scala> val cntWrd = wrdCnt.map{case (word, 
count) => (count, word)} 
scala> cntWrd.groupByKey().collect() 
Result: 
Array[(Int, Iterable[String])] = 
Array((1,ArrayBuffer(It, awesome, Spark, 
fun)), (2,ArrayBuffer(is))) 
distinct([numTasks]) 
Purpose: Eliminate duplicates 
from RDD 
scala> fm.distinct().collect() 
Result: 
Array[String] = Array(is, It, awesome, Spark, 
fun)
RDDS : COMMONLY USED ACTIONS 
Transformation & Purpose Example & Result 
count() 
Purpose: Get the number of 
data elements in the RDD 
scala> val rdd = sc.parallelize(List('A','B','C')) 
scala> rdd.count() 
Result: 
Long = 3 
collect() 
Purpose: get all the data elements 
in an RDD as an Array 
scala> val rdd = sc.parallelize(List('A','B','C')) 
scala> rdd.collect() 
Result: 
Array[Char] = Array(A, B, C) 
reduce(func) 
Purpose: Aggregate the data 
elements in an RDD using this 
function which takes two 
arguments and returns one 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.reduce(_+_) 
Result: 
Int = 10 
take (n) 
Purpose: fetch first n data 
elements in an RDD. Computed by 
driver program. 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.take(2) 
Result: 
Array[Int] = Array(1, 2)
RDDS : COMMONLY USED ACTIONS 
Transformation & Purpose Example & Result 
foreach(func) 
Purpose: execute function for 
each data element in RDD. 
Usually used to update an 
accumulator (discussed later) or to 
interact with external systems. 
scala> val rdd = sc.parallelize(List(1,2)) 
scala> rdd.foreach(x=>println("%s*10=%s". 
format(x,x*10))) 
Result: 
1*10=10 
2*10=20 
first() 
Purpose: retrieves the first 
data element in RDD. Similar to 
take(1) 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.first() 
Result: 
Int = 1 
saveAsTextFile(path) 
Purpose: Writes the content of 
RDD to a text file or a set of text 
files to local file system/HDFS 
scala> val hamlet = sc.textFile("readme.txt") 
scala> hamlet.filter(_.contains("Spark")). 
saveAsTextFile("filtered") 
Result: 
…/filtered$ ls 
_SUCCESS part-00000 part-00001
RDDS : 
 For a more detailed list of actions and transformations, 
please refer to: 
http://spark.apache.org/docs/latest/programming-guide.html#transformations 
http://spark.apache.org/docs/latest/programming-guide.html#actions
PERSISTENCE 
 Spark can persist (or cache) a dataset in memory across 
operations 
 Each node stores in memory any slices of it that it 
computes and reuses them in other actions on that 
dataset – often making future actions more than 10x 
faster 
 The cache is fault-tolerant: if any partition of an RDD is 
lost, it will automatically be recomputed using the 
transformations that originally created it
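
A minimal caching sketch for spark-shell (the file name is illustrative):

val logs = sc.textFile("README.md")
val sparkLines = logs.filter(_.contains("Spark")).cache()   // MEMORY_ONLY by default
sparkLines.count()   // first action: reads the file and fills the cache
sparkLines.count()   // second action: served from memory, typically much faster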
PERSISTENCE : STORAGE LEVEL 
Storage Level Purpose 
MEMORY_ONLY 
(Default level) 
Store RDD as deserialized Java objects in the JVM. If the RDD does not 
fit in memory, some partitions will not be cached and will be 
recomputed on the fly each time they're needed. This is the default 
level. 
MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not 
fit in memory, store the partitions that don't fit on disk, and read them 
from there when they're needed. 
MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This 
is generally more space-efficient than deserialized objects, especially 
when using a fast serializer, but more CPU-intensive to read. 
MEMORY_AND_DISK_SER Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in 
memory to disk instead of recomputing them on the fly each time 
they're needed. 
DISK_ONLY Store the RDD partitions only on disk. 
MEMORY_ONLY_2, 
MEMORY_AND_DISK_2, etc. 
Same as the levels above, but replicate each partition on two cluster 
nodes.
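
To choose one of the levels above explicitly, call persist instead of cache (a sketch; the level picked here is just an example):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("README.md")
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spilled to disk if needed
lines.count()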
SIMPLE SPARK APPS : WORDCOUNT 
Download the project from GitHub: 
https://github.com/MohamedHedi/SparkSamples 
WordCount.scala: 
val logFile = args(0) 
val conf = new SparkConf().setAppName("WordCount") 
val sc = new SparkContext(conf) 
val logData = sc.textFile(logFile, 2).cache() 
val numApache = logData.filter(line => line.contains("apache")).count() 
val numSpark = logData.filter(line => line.contains("spark")).count() 
println("Lines with apache: %s, Lines with spark: %s".format(numApache, 
numSpark)) 
 sbt 
 compile 
 assembly
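
A minimal build.sbt for such a project could look like the sketch below; the exact versions and the sbt-assembly plugin line are assumptions, not taken from the repository. Marking spark-core as "provided" keeps it out of the assembly jar, since spark-submit supplies it at runtime.

// build.sbt (sketch, versions illustrative)
name := "SparkSamples"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"

// project/plugins.sbt also needs the assembly plugin, e.g. (version assumed):
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")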
SPARK-SUBMIT 
./bin/spark-submit 
--class <main-class> 
--master <master-url> 
--deploy-mode <deploy-mode> 
--conf <key>=<value> 
... # other options 
<application-jar> 
[application-arguments]
SPARK-SUBMIT : LOCAL MODE 
./bin/spark-submit 
--class com.ebiznext.spark.examples.WordCount 
--master local[4] 
--deploy-mode client 
--conf <key>=<value> 
... # other options 
.\target\scala-2.10\SparkSamples-assembly-1.0.jar 
.\ressources\README.md
CLUSTER MANAGER TYPES 
 Spark supports three cluster managers: 
 Standalone – a simple cluster manager included with Spark 
that makes it easy to set up a cluster. 
 Apache Mesos – a general cluster manager that can also run 
Hadoop MapReduce and service applications. 
 Hadoop YARN – the resource manager in Hadoop 2.
MASTER URLS 
Master URL Meaning 
local Run Spark locally with one worker thread (no parallelism at all). 
local[K] Run Spark locally with K worker threads (ideally, set 
this to the number of cores on your machine). 
local[*] Run Spark locally with as many worker threads as 
logical cores on your machine. 
spark://HOST:PORT Connect to the given Spark standalone cluster master. 
Default master port : 7077 
mesos://HOST:PORT Connect to the given Mesos cluster. 
Default mesos port : 5050 
yarn-client Connect to a YARN cluster in client mode. The cluster 
location will be found based on the 
HADOOP_CONF_DIR variable. 
yarn-cluster Connect to a YARN cluster in cluster mode. The cluster 
location will be found based on HADOOP_CONF_DIR.
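
The master URL can be passed with --master on the command line (as on the previous slides) or set programmatically, for example (a sketch; the host name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("spark://localhost:7077")   // any of the master URLs above works here
val sc = new SparkContext(conf)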
SPARK-SUBMIT : STANDALONE CLUSTER 
 ./sbin/start-master.sh 
(Windows users: run spark-class.cmd org.apache.spark.deploy.master.Master) 
 Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER 
 Connect workers to the master 
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT 
 Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER 
./bin/spark-submit --class com.ebiznext.spark.examples.WordCount 
--master spark://localhost:7077 .\target\scala-2.10\SparkSamples-assembly-1.0.jar 
.\ressources\README.md
SPARK SQL 
 Shark is being migrated to Spark SQL 
 Spark SQL blurs the lines between RDDs and relational 
tables 
val conf = new SparkConf().setAppName("SparkSQL") 
val sc = new SparkContext(conf) 
val peopleFile = args(0) 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
import sqlContext._ 
// Define the schema using a case class. 
case class Person(name: String, age: Int) 
// Create an RDD of Person objects and register it as a table. 
val people = sc.textFile(peopleFile).map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) 
people.registerAsTable("people") 
// SQL statements can be run by using the sql methods provided by sqlContext. 
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") 
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations. 
// The columns of a row in the result can be accessed by ordinal. 
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
SPARK GRAPHX 
 GraphX is the new (alpha) Spark API for graphs and graph-parallel 
computation. 
 GraphX extends the Spark RDD by introducing the Resilient Distributed 
Property Graph 
case class Peep(name: String, age: Int) 
val vertexArray = Array( 
(1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)), 
(3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)), 
(5L, Peep("Leslie", 45))) 
val edgeArray = Array( 
Edge(2L, 1L, 7), Edge(2L, 4L, 2), 
Edge(3L, 2L, 4), Edge(3L, 5L, 3), 
Edge(4L, 1L, 1), Edge(5L, 3L, 9)) 
val conf = new SparkConf().setAppName("SparkGraphx") 
val sc = new SparkContext(conf) 
val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray) 
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray) 
val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD) 
val results = g.triplets.filter(t => t.attr > 7) 
for (triplet <- results.collect) { 
println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}") 
}
SPARK MLLIB 
MLlib is Spark’s scalable machine learning library 
consisting of common learning algorithms and utilities. 
Use cases : 
Recommendation Engine 
Content classification 
Ranking 
Algorithms 
Classification and regression : linear regression, decision 
trees, naive Bayes 
 Collaborative filtering : alternating least squares (ALS) 
 Clustering : k-means 
…
SPARK MLLIB 
SparkKMeans.scala 
val sparkConf = new SparkConf().setAppName("SparkKMeans") 
val sc = new SparkContext(sparkConf) 
val lines = sc.textFile(args(0)) 
val data = lines.map(parseVector _).cache() 
val K = args(1).toInt 
val convergeDist = args(2).toDouble 
val kPoints = data.takeSample(withReplacement = false, K, 42).toArray 
var tempDist = 1.0 
while (tempDist > convergeDist) { 
val closest = data.map(p => (closestPoint(p, kPoints), (p, 1))) 
val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) } 
val newPoints = pointStats.map { pair => 
(pair._1, pair._2._1 * (1.0 / pair._2._2)) 
}.collectAsMap() 
tempDist = 0.0 
for (i <- 0 until K) { 
tempDist += squaredDistance(kPoints(i), newPoints(i)) 
} 
for (newP <- newPoints) yield { 
kPoints(newP._1) = newP._2 
} 
println("Finished iteration (delta = " + tempDist + ")") 
} 
println("Final centers:") 
kPoints.foreach(println) 
sc.stop()
SPARK STREAMING 
 Spark Streaming extends the core API to allow high-throughput, fault-tolerant 
stream processing of live data streams 
 Data can be ingested from many sources: Kafka, Flume, Twitter, 
ZeroMQ, TCP sockets… 
 Results can be pushed out to filesystems, databases, live dashboards… 
 Spark’s MLlib algorithms and graph processing algorithms can be 
applied to data streams
SPARK STREAMING 
val ssc = new StreamingContext(sparkConf, Seconds(10)) 
 Create a StreamingContext by providing the configuration and batch 
duration
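
A complete minimal streaming job, assuming a text source on a TCP socket (for instance one started with nc -lk 9999); everything except the StreamingContext line is illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()        // print the first counts of each 10-second batch
ssc.start()
ssc.awaitTermination()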
TWITTER - SPARK STREAMING - ELASTICSEARCH 
1. Twitter access 
val keys = ssc.sparkContext.textFile(args(0), 2).cache() 
val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = keys.take(4) 
// Set the system properties so that the Twitter4j library used by the Twitter stream 
// can use them to generate OAuth credentials 
System.setProperty("twitter4j.oauth.consumerKey", consumerKey) 
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret) 
System.setProperty("twitter4j.oauth.accessToken", accessToken) 
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret) 
2. Streaming from Twitter 
val sparkConf = new SparkConf().setAppName("TwitterPopularTags") 
sparkConf.set("es.index.auto.create", "true") 
val ssc = new StreamingContext(sparkConf, Seconds(10)) 
val keys = ssc.sparkContext.textFile(args(0), 2).cache() 
val stream = TwitterUtils.createStream(ssc, None) 
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#"))) 
val topCounts10 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(10)) 
.map { case (topic, count) => (count, topic) } 
.transform(_.sortByKey(false))
TWITTER - SPARK STREAMING - ELASTICSEARCH 
 Index in Elasticsearch 
 Add the elasticsearch-spark dependency to build.sbt: 
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.0.Beta3" 
 Writing RDD to elasticsearch: 
val sparkConf = new SparkConf().setAppName(appName).setMaster(master) 
sparkConf.set("es.index.auto.create", "true") 
val apache = Map("hashtag" -> "#Apache", "count" -> 10) 
val spark = Map("hashtag" -> "#Spark", "count" -> 15) 
val rdd = ssc.sparkContext.makeRDD(Seq(apache,spark)) 
rdd.saveToEs("spark/hashtag")
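
To index the streaming results rather than a static RDD, one option (a sketch, assuming the topCounts10 DStream from the previous slide and the same elasticsearch-spark dependency) is to write each batch from foreachRDD:

import org.elasticsearch.spark._   // brings saveToEs into scope on RDDs

topCounts10.foreachRDD { rdd =>
  val docs = rdd.map { case (count, topic) => Map("hashtag" -> topic, "count" -> count) }
  docs.saveToEs("spark/hashtag")   // same index/type as above
}
ssc.start()
ssc.awaitTermination()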


Editor's notes

  1. Hadoop is a Java framework that makes it easier to build scalable distributed applications. It lets applications work with thousands of nodes and petabytes of data. MapReduce is an architectural design pattern invented by Google, composed of: a Map phase (computation), where the Map processing is applied to each data set; an intermediate phase, where the data is sorted and related data is grouped so it can be processed by the same node; and a Reduce phase (aggregation), where the data may be aggregated and the results from each node are combined to compute the final result.