Tathagata "TD" Das, the lead developer of Spark Streaming at Databricks, provides an overview of Spark Streaming and previews what's to come.
2. Who am I?
Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks
3. What is Databricks?
Founded by the creators of Spark in 2013
Largest organization contributing to Spark
End-to-end hosted service: Databricks Cloud
5. Spark Streaming
Scalable, fault-tolerant stream processing system
[Diagram: inputs (Kafka, Flume, Kinesis, Twitter, HDFS/S3) → Spark Streaming → outputs (file systems, databases, dashboards)]
High-level API: joins, windows, … often 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: integrates with MLlib, SQL, DataFrames, GraphX
6. What can you use it for?
Real-time fraud detection in transactions
React to anomalies in sensors in real time
Find cat videos in tweets as soon as they go viral
7. How does it work?
Data streams are chopped up into batches
Each batch is processed in Spark
Results pushed out in batches
[Diagram: data streams → receivers → Spark Streaming → batches of results]
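The micro-batch model above can be sketched in plain Python, with no Spark required. Incoming records are grouped into small batches, each batch is processed as one job, and results are pushed out batch by batch. (Spark batches by time interval, e.g. `Seconds(1)`; this sketch batches by record count for simplicity.)

```python
from collections import defaultdict

def micro_batch(stream, batch_size):
    """Chop an incoming stream of records into fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # each batch is handed off as one small job
            batch = []
    if batch:                    # flush the final partial batch
        yield batch

def process(batch):
    """Per-batch job: count words in this batch only."""
    counts = defaultdict(int)
    for line in batch:
        for word in line.split():
            counts[word] += 1
    return dict(counts)

# Simulated data stream: results come out one batch at a time.
stream = ["a b", "b c", "c c"]
results = [process(b) for b in micro_batch(stream, batch_size=2)]
# results == [{"a": 1, "b": 2, "c": 1}, {"c": 2}]
```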
8. Streaming Word Count
// create DStream from data over socket
val lines = ssc.socketTextStream("localhost", 9999)
// split lines into words
val words = lines.flatMap(_.split(" "))
// count the words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// print some counts on screen
wordCounts.print()
// start processing the stream
ssc.start()
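To see what each stage of that pipeline produces without a Spark cluster, the same flatMap / map / reduceByKey sequence can be mimicked on an ordinary list (a plain-Python sketch, not Spark's API):

```python
from itertools import chain

lines = ["hello world", "hello spark"]

# flatMap(_.split(" ")): one line -> many words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map(x => (x, 1)): pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): combine the counts for each distinct key
def reduce_by_key(pairs, f):
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

word_counts = reduce_by_key(pairs, lambda a, b: a + b)
# word_counts == {"hello": 2, "world": 1, "spark": 1}
```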
9. Word Count
object NetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
10. Word Count
Spark Streaming
object NetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Storm
public class WordCountTopology {
  public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
      super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
      return null;
    }
  }

  public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      if (count == null)
        count = 0;
      count++;
      counts.put(word, count);
      collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
    Config conf = new Config();
    conf.setDebug(true);
    if (args != null && args.length > 0) {
      conf.setNumWorkers(3);
      StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
    } else {
      conf.setMaxTaskParallelism(3);
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("word-count", conf, builder.createTopology());
      Thread.sleep(10000);
      cluster.shutdown();
    }
  }
}
13. Combine batch and streaming processing
Join data streams with static data sets
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset)
          .filter( ... )
}
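The transform-and-join pattern above boils down to a hash join between each micro-batch and a dataset loaded once up front. A plain-Python sketch of that per-batch join (user IDs and field names are hypothetical):

```python
# Static dataset loaded once, keyed by user id (stand-in for the Hadoop file).
dataset = {"u1": "gold", "u2": "silver"}

def join_batch(batch, static):
    """Inner-join each (user, event) record in the batch with the static table."""
    return [(user, event, static[user]) for user, event in batch if user in static]

batch = [("u1", "click"), ("u3", "view"), ("u2", "click")]
joined = join_batch(batch, dataset)
# u3 is dropped: an inner join keeps only keys present on both sides
```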
[Diagram: Spark Streaming alongside Spark SQL, MLlib, and GraphX, all on top of Spark Core]
14. Combine machine learning with streaming
Learn models offline, apply them online
// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream
kafkaStream.map { event =>
model.predict(event.feature)
}
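The offline-train, online-score split can be sketched without MLlib: "training" produces a set of cluster centroids, and the streaming side only does the cheap nearest-centroid lookup per event (centroid values below are made up for illustration):

```python
import math

# "Offline training": centroids a k-means run might have produced.
centroids = [(0.0, 0.0), (10.0, 10.0)]

def predict(point):
    """Online scoring: assign an incoming event to its nearest centroid."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

# "Online": score each event in an incoming micro-batch.
batch = [(1.0, 2.0), (9.0, 11.0)]
labels = [predict(p) for p in batch]
# labels == [0, 1]
```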
15. Combine SQL with streaming
Interactively query streaming data with SQL
// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.registerTempTable("latestEvents")
}
// Interactively query table
sqlContext.sql("select * from latestEvents")
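The idea here is to overwrite a table with the latest micro-batch and then query it interactively. It can be sketched with sqlite3 standing in for Spark SQL (the `latestEvents` table name comes from the slide; column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE latestEvents (user TEXT, action TEXT)")

def register_batch(batch):
    """Replace the table contents with the latest micro-batch."""
    conn.execute("DELETE FROM latestEvents")
    conn.executemany("INSERT INTO latestEvents VALUES (?, ?)", batch)

register_batch([("u1", "click"), ("u2", "view")])
# Interactive query always sees the most recent batch.
rows = conn.execute("SELECT * FROM latestEvents WHERE action = 'click'").fetchall()
# rows == [("u1", "click")]
```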
16. History
Late 2011 – research idea
AMPLab, UC Berkeley
"We need to make Spark faster." "Okay... umm, how??!?"
17. History
Q2 2012 – prototype
Rewrote large parts of Spark core
Smallest job: 900 ms → <50 ms
Q3 2012 – Spark core improvements
open sourced in Spark 0.6
Feb 2013 – Alpha release
7.7k lines, merged in 7 days
Released with Spark 0.7
18. History
Jan 2014 – Stable release
Graduation with Spark 0.9
22. Python API
Core functionality in Spark 1.2,
with sockets and files as sources
Kafka support coming in Spark 1.3
Other sources coming in future
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()
23. Streaming MLlib algorithms
val model = new StreamingKMeans()
.setK(args(3).toInt)
.setDecayFactor(1.0)
.setRandomCenters(args(4).toInt, 0.0)
// Apply model to DStreams
model.trainOn(trainingDStream)
model.predictOnValues(testDStream.map { lp =>
(lp.label, lp.features) } ).print()
Continuous learning and
prediction on streaming data
StreamingLinearRegression in
Spark 1.1
StreamingKMeans in Spark 1.2
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
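The decay factor in StreamingKMeans controls how quickly old data is forgotten. The core update for a single cluster center can be sketched as an exponentially weighted mean (simplified here to one dimension and one cluster; this mirrors the update rule described in the blog post above, not MLlib's actual code):

```python
def update_center(center, weight, batch, decay):
    """One streaming k-means update for a single 1-D cluster.
    decay=1.0 -> remember all history; decay=0.0 -> use only the new batch."""
    m = len(batch)
    batch_mean = sum(batch) / m
    new_weight = weight * decay + m                     # discount the old evidence
    new_center = (center * weight * decay + batch_mean * m) / new_weight
    return new_center, new_weight

# With decay=0.0 the center jumps straight to the latest batch's mean.
c, w = update_center(center=0.0, weight=10.0, batch=[4.0, 6.0], decay=0.0)
# c == 5.0, w == 2.0
```

With decay=1.0 the same batch would only nudge the center, since the 10 previously seen points still carry full weight.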
24. Other library additions
Amazon Kinesis integration [Spark 1.1]
More fault-tolerant Flume integration [Spark 1.1]
New Kafka API for more native integration [Spark 1.3]
28. Spark Packages
More contributions from the
community in spark-packages
Alternate Kafka receiver
Apache Camel receiver
Cassandra examples
http://spark-packages.org/
30. Spark Summit 2014 Survey
40% of Spark users were using Spark Streaming in production or prototyping
Another 39% were evaluating it
Production: 9% | Prototyping: 31% | Evaluating: 39% | Not using: 21%
33. Intel China builds big data solutions for large enterprises
Multiple streaming applications for different businesses
Real-time risk analysis for a top online payment company
Real-time deal and flow metric reporting for a top online shopping company
34. Complicated stream processing
SQL queries on streams
Join streams with large historical datasets
> 1TB/day passing through Spark Streaming
[Architecture: Kafka and RocketMQ → Spark Streaming on YARN → HBase]
35. One of the largest publishing and education companies, wanting to accelerate its push into digital learning
Needed to combine student activities and domain events to continuously update each student's learning model
Earlier implementation in Storm, but has since moved to Spark Streaming
37. Leading advertising automation company with an exchange
platform for in-feed ads
Process clickstream data for optimizing real-time bidding for ads
[Architecture: Spark Streaming on Mesos + Marathon, integrating Kinesis, MySQL, Redis, RabbitMQ, and SQS]
40. Neuroscience @ Freeman Lab, Janelia Farm
Streaming machine learning
algorithms on time series data of
every neuron
2TB/hour and increasing with
brain size
80 HPC nodes
41. Why are they adopting Spark Streaming?
Easy, high-level API
Unified API across batch and streaming
Integration with Spark SQL and MLlib
Ease of operations
44. Beyond Spark 1.3
Operational Ease
Better flow control
Elastic scaling
Cross-version upgradability
Improved support for non-Hadoop environments
45. Beyond Spark 1.3
Performance
Higher throughput, especially of stateful operations
Lower latencies
Easy deployment of streaming apps in Databricks Cloud!
46. You can help!
Roadmaps are heavily driven by community feedback
We have listened to community demands over the last year
Write Ahead Logs for zero data loss
New Kafka integration for stronger semantics
Let us know what you want to see in Spark Streaming
Spark user mailing list, or tweet me @tathadas
47. Takeaways
Spark Streaming is a scalable, fault-tolerant stream processing
system with a high-level API and a rich set of libraries
Over 80 deployments in industry
More libraries and operational ease on the roadmap
49. Typesafe survey of Spark users
2136 developers, data scientists,
and other tech professionals
http://java.dzone.com/articles/apache-spark-survey-typesafe-0
50. Typesafe survey of Spark users
65% of Spark users are interested
in Spark Streaming
51. Typesafe survey of Spark users
2/3 of Spark users want to process
event streams
53. • Big data solution provider for enterprises
• Multiple applications for different businesses
- Monitoring +optimizing online services of Tier-1 bank
- Fraudulent transaction detection for Tier-2 bank
• Kafka → Spark Streaming → Cassandra, MongoDB
• Built their own Stratio Streaming platform on
Spark Streaming, Kafka, Cassandra, MongoDB
54. • Provides data analytics solutions for Communication
Service Providers
- 4 of 5 top mobile operators, 3 of 4 top internet backbone providers
- Processes >50% of all US mobile traffic
• Multiple applications for different businesses
- Real-time anomaly detection in cell tower traffic
- Real-time call quality optimizations
• Kafka → Spark Streaming
http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark
55. • Runs claims processing applications for healthcare providers
http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims
• Predictive models can look
for claims that are likely to
be held up for approval
• Spark Streaming allows
model scoring in seconds
instead of hours