Apache Spark has emerged over the past year as the likely successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk when required. Spark's powerful yet flexible API lets users write complex applications easily, without worrying about the internal workings of how the data gets processed on the cluster.
Spark also comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems such as Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it arrives.
In this talk, Hari will discuss the basics of Spark Streaming, its API, and its integration with Flume, Kafka, and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application's execution will be presented, which can help the audience understand best practices to follow when writing such an application. Hari will conclude by discussing how to write a custom application and a custom receiver to receive data from other systems.
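To illustrate the kind of code sharing the talk covers, the sketch below writes word-count logic once against RDDs and reuses it from both a batch Spark job and a Spark Streaming job. This is a minimal, hypothetical sketch, not the talk's actual example: the object name, input/output paths, socket source, and 10-second batch interval are all illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountShared {
  // Core logic written once against RDDs, so it can be shared
  // by a batch job and a streaming job unchanged.
  def countWords(lines: RDD[String]): RDD[(String, Long)] =
    lines.flatMap(_.split("\\s+"))
         .map(word => (word, 1L))
         .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountShared")

    // Batch: apply the shared logic to a static file (paths are illustrative).
    val sc = new SparkContext(conf)
    countWords(sc.textFile("hdfs:///input/logs"))
      .saveAsTextFile("hdfs:///output/batch-counts")

    // Streaming: apply the same shared logic to each micro-batch
    // arriving on a socket, every 10 seconds.
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.socketTextStream("localhost", 9999)
       .foreachRDD(rdd => countWords(rdd).foreach(println))
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because Spark Streaming exposes each micro-batch as an ordinary RDD (here via `foreachRDD`), the same `countWords` function serves both execution modes.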
The Narrative:
Vast quantities of streaming data are being generated, and more will be generated thanks to phenomena such as the Internet of Things. The motivation for real-time stream processing is to turn all of this data into valuable insights and actions as soon as the data is generated.
Instant processing of the data also opens the door to new use cases that were not possible before.
NOTE:
Feel free to remove the image of "The Flash" if it feels unprofessional or overly cheesy.
The Narrative:
As you can see from the previous slides, lots of streaming data will be generated. Making this data actionable in real time is very valuable across industries.
Our very own Hadoop is all you need.
Previously, Hadoop was associated just with "big unstructured data". That was Hadoop's selling point.
But now, Hadoop can also handle real-time data (in addition to big unstructured data). So think Hadoop when you think real-time streaming.
Purpose of the slide:
The goal is to associate Hadoop with real-time processing, so that people think of Hadoop when they think of real-time streaming data.
Purpose of this Slide:
Make sure to associate Spark Streaming with Apache Spark, so folks know it is a part of THE Apache Spark that everyone is talking about.
List the key properties and attributes that make Spark Streaming a good platform for stream processing.
Note:
If required, we can mention low latency as well.
Spark Streaming – a Storm-like streaming solution
BlinkDB – approximate query results for Hive
Shark – a Spark-based SQL engine, but Impala is better
GraphX – an alternative to GraphLab and Giraph that provides graph processing
MLbase and MLlib – ML algorithm implementations
Tachyon – a shared filesystem cache so multiple users/frameworks can share Spark data
For now, we will only support Spark and Spark Streaming.
Over time, Pig will move to run natively on Spark, and Spark will provide MapReduce APIs.