Spark Streaming & Spark SQL: An Introduction to Real-Time Processing with Apache Spark

Spark Streaming &
Spark SQL
Yousun Jeong
jerryjung@sk.com

History - Spark
Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become one
of the largest OSS communities in big data, with over
200 contributors in 50+ organizations
“Organizations that are looking at big data challenges – including collection, ETL,
storage, exploration and analytics – should consider Spark for its in-memory
performance and the breadth of its model. It supports advanced analytics solutions
on Hadoop clusters, including the iterative model required for machine learning and
graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)

History - Spark
Some key points about Spark:
• handles batch, interactive, and real-time within a single
framework
• native integration with Java, Python, Scala programming
at a higher level of abstraction
• multi-step Directed Acrylic Graphs (DAGs).  
many stages compared to just Hadoop Map and
Reduce only.

Data Sharing in MR
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012

Benchmark Test
databricks.com/blog/2014/11/05/spark-ofﬁcially- sets-a-new-record-in-large-scale-
sorting.html

RDD
Resilient Distributed Datasets (RDD) are the primary
abstraction in Spark – a fault-tolerant collection of
elements that can be operated on in parallel
There are currently two types:
• parallelized collections – take an existing Scala collection
and run functions on it in parallel
• Hadoop datasets – run functions on each record of a ﬁle
in Hadoop distributed ﬁle system or any other storage
system supported by Hadoop

Fault Tolerance
• An RDD is an immutable, deterministically re-
computable, distributed dataset.
• RDD tracks lineage info rebuild lost data

Beneﬁt of Spark
Spark help us to have the gains in processing speed and implement various
big data applications easily and speedily
▪ Support for Event Stream
Processing
▪ Fast Data Queries in Real Time
▪ Improved Programmer Productivity
▪ Fast Batch Processing of Large Data
Set
Why I use spark …

Big Data
Big Data is not just “big”
The 3V of Big Data

Big Data Processing
1. Batch Processing
• processing data en masse
• big & complex
• higher latencies ex) MR
2. Stream Processing
• one-at-a-time processing
• computations are relatively simple and generally independent
• sub-second latency ex) Storm
3. Micro-Batching
• small batch size (batch+streaming)

Spark Streaming In Action
import org.apache.spark.streaming._  
import org.apache.spark.streaming.StreamingContext._  
 
// create a StreamingContext with a SparkConf conﬁguration
val ssc = new StreamingContext(sparkConf, Seconds(10))  
 
// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)  
 
// split each line into words  
val words = lines.ﬂatMap(_.split(" "))  
 
// count each word in each batch  
val pairs = words.map(word => (word, 1))  
val wordCounts = pairs.reduceByKey(_ + _)  
 
// print a few of the counts to the console
wordCounts.print()  
 
ssc.start() // Start the computation 
ssc.awaitTermination() // Wait for the computation to terminate

Spark SQL In Action
// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingDataTable = sql("""
SELECT e.action, u.age, u.latitude, u.logitude
FROM Users u
JOIN Events e
ON u.userId = e.userId”"")
// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib
val trainingData = trainingDataTable.map { row =>
val features = Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model =
new LogisticRegressionWithSGD().run(trainingData)

Spark SQL In Action
val allCandidates = sql("""
SELECT userId,
age,
latitude,
logitude
FROM Users
WHERE subscribed = FALSE”"")
// Results of ML algorithms can be used as tables
// in subsequent SQL statements.
case class Score(userId: Int, score: Double)
val scores = allCandidates.map { row =>
val features = Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
scores.registerAsTable("Scores")

MR vs RDD - Compute an
Average

RDD vs DF - Compute an
Average
Using RDDs
data = sc.textFile(...).split("t")
data.map(lambda x: (x[0], [int(x[1]), 1]))
.reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
.map(lambda x: [x[0], x[1][0] / x[1][1]])
.collect()
Using DataFrames
sqlCtx.table("people").groupBy("name").agg("name", avg("age")).collect()

Spark 2.0 : Structured
Streaming
• Structured Streaming
• High-level streaming API built on Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
• Uniﬁes streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models

Spark 2.0 Example: Page
View Count
Input: records in Kafka
Query: select count(*) group by page, minute(evtime)
Trigger:“every 5 sec”
Output mode: “update-in-place”, into MySQL sink
logs =
ctx.read.format("json").stream("s3://logs")
logs.groupBy(logs.user_id). 
agg(sum(logs.time))
.write.format("jdbc")
.stream("jdbc:mysql//...")

Spark 2.0 Use Case: Fraud
Detection

Spark Streaming & Spark SQL: An Introduction to Real-Time Processing with Apache Spark

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Spark Streaming & Spark SQL: An Introduction to Real-Time Processing with Apache Spark

Ähnlich wie Spark Streaming & Spark SQL: An Introduction to Real-Time Processing with Apache Spark (20)

Mehr von Yousun Jeong

Mehr von Yousun Jeong (10)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Spark Streaming & Spark SQL: An Introduction to Real-Time Processing with Apache Spark