Good afternoon, everyone. Today we are going to talk about one of the most popular extensions of Spark: Spark Streaming.
And we will talk about using Spark Streaming to implement a use case in a fast-growing, and simply put, really cool and popular domain: the Internet of Things.
We will walk you through a concrete Internet of Things use case. When we talk about the use case, we will focus on end-to-end architectures.
After covering the use case, we will do a deeper dive into some interesting Spark Streaming features such as sliding windows, streaming state, and ML algorithms, and share some pro tips and best practices with you.
So first, a very quick primer on Spark Streaming:
In Spark Streaming, each incoming data stream is represented by an abstraction called a DStream…which stands for Discretized Stream.
A DStream is a continuous stream of data, broken up into chunks called micro-batches.
Data in each micro-batch becomes an RDD, and is processed by RDD operations.
A batch Spark job is defined as a sequence of transformations and actions on RDDs…..similarly, a streaming job is authored as a sequence of transformations and actions on DStreams.
DStream micro-batch intervals are often 1 second, or even as small as 0.5 seconds.
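To make the micro-batch model concrete, here is a minimal sketch in plain Python (no Spark required) that simulates discretizing a stream into fixed-size micro-batches and applying the same transformation to each one; the function name `micro_batches` and the event fields are our own illustration, not Spark API.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Discretize an event stream into fixed-size micro-batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each micro-batch plays the role of one RDD: the same transformation
# (here, a filter + count) is applied to every batch as it arrives.
events = [{"sensor": i % 3, "temp": 20 + i} for i in range(10)]
hot_counts = [
    sum(1 for e in batch if e["temp"] > 24)
    for batch in micro_batches(events, batch_size=4)
]
print(hot_counts)  # [0, 3, 2]
```

In real Spark Streaming the batching is driven by wall-clock time (the batch interval) rather than a fixed element count, but the per-batch processing model is the same.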
Spark Streaming has seen tremendous adoption over the past year and we are seeing customers deploy it for a wide variety of use cases….and here I have a random collection of examples of use cases in diverse industries.
But today we will talk about an Internet of Things use case: Proactive maintenance and accident prevention in railways.
The Internet of Things is all about sensors continuously sending data back to your data center. In our case, we are talking about sensors fitted to railway locomotives and railway carriages.
The goal is to process this sensor data to identify 2 critical issues:
Damage to the wheels or axles of trains
Damage to railway tracks
At one end, this will help us prevent derailments. Trains are among the safest modes of transportation…much safer than cars. However, many of these accidents are preventable. Also, the proportion of freight trains is a lot higher than trains transporting people. When a freight train derailment happens, there may not always be a loss of life…..and hence it is not covered in the news…but there is a heavy financial loss….all of which can be avoided.
That is one end of the spectrum.
The other piece is simply identifying defects early, so that they can be fixed proactively, thus extending the lifespan of locomotives and rail carriages as well as tracks….fixing issues early, nipping them in the bud so to speak, will invariably save costs.
This example is based on a real-world use case, but it has been heavily modified and simplified to fit a 15-minute slide deck.
Let's do a deeper dive into the sensors we are talking about:
In this diagram of a railway carriage, the red spots on the wheels of the train denote where the sensors will be located.
These sensors will send back information on a regular basis….let's say a couple of measurements per second. The frequency of readings is adjustable.
Each reading will have:
A unique ID, that identifies the sensor
An ID that identifies the locomotive
A speed measurement….while diagnosing an issue it is important to know how fast the train was going
Temperature measurement….if something goes wrong, invariably something is bound to get too hot
Pressure…if the wheels cannot spin freely because something is hindering them, the pressure readings will go up
Acoustic Signals…..basically noise….noise is a good indicator of problems…for example the sound of clanging metal is a lot different than the smooth turning of wheels or humming of engines
GPS coordinates….this is important; we need to know where the train is for many reasons….which we will talk about shortly
Timestamp….you need to know when the reading was taken
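The fields above can be sketched as a simple record; the field names, units, and example values here are our own assumptions about how the payload might be structured, not a real sensor schema.

```python
from dataclasses import dataclass

@dataclass
class SensorReading:
    sensor_id: str        # unique ID of the sensor
    locomotive_id: str    # which locomotive the sensor is fitted to
    speed_kmh: float      # train speed when the reading was taken
    temperature_c: float  # component temperature
    pressure_kpa: float   # wheel/axle pressure
    noise_db: float       # acoustic signal level
    lat: float            # GPS latitude
    lon: float            # GPS longitude
    ts: float             # epoch timestamp of the reading

r = SensorReading("s-42", "loco-7", 80.0, 35.5, 101.3, 62.0,
                  40.7128, -74.0060, 1_700_000_000.0)
print(r.sensor_id, r.temperature_c)
```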
Ok….so given these sensor readings, how do we identify damage to the wheels?
It will manifest as a sustained increase in sensor readings like temperature, pressure, or acoustic noise. It will be a pronounced, lasting increase, possibly getting progressively worse.
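One way to distinguish a sustained increase from a momentary blip is to require that readings stay above a baseline for several consecutive samples. A minimal sketch of that idea; the baseline, factor, and run length are made-up illustrative thresholds, not values from the real system.

```python
def sustained_increase(readings, baseline, factor=1.5, min_run=5):
    """Return True if readings exceed factor * baseline for at least
    min_run consecutive samples: a lasting rise, not a one-off blip."""
    run = 0
    for value in readings:
        if value > factor * baseline:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False

normal = [30, 31, 29, 45, 30, 31, 30, 29]       # a single blip
damaged = [30, 31, 48, 50, 52, 55, 58, 60, 61]  # lasting, worsening rise
print(sustained_increase(normal, baseline=30))   # False
print(sustained_increase(damaged, baseline=30))  # True
```

In the streaming job this check would run over a sliding window of recent readings per sensor, so the state resets naturally as the window moves.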
How do we identify damage to the rail track?
Damage to the rail track is going to be at a specific location…..often on just one side of the track.
When a wheel goes over the problem area, there is bound to be a sudden, sharp, pronounced spike in sensor readings….most likely acoustic noise and pressure. The key thing is that it will be a pronounced spike, at a specific location, after which the sensor readings will come back down to normal.
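The track-damage signature, a sharp spike that returns to normal, could be detected by comparing each sample to its neighbors. A sketch of that, with illustrative thresholds of our own choosing:

```python
def find_spikes(readings, baseline, factor=2.0):
    """Return indices where a reading jumps above factor * baseline
    while both neighbors stay at or below it: a localized spike."""
    spikes = []
    for i in range(1, len(readings) - 1):
        prev_ok = readings[i - 1] <= factor * baseline
        next_ok = readings[i + 1] <= factor * baseline
        if readings[i] > factor * baseline and prev_ok and next_ok:
            spikes.append(i)
    return spikes

noise = [60, 61, 59, 140, 60, 62, 61]   # sharp spike at index 3
print(find_spikes(noise, baseline=60))  # [3]
```

Each detected index can then be joined back with the GPS coordinates of that reading to pinpoint the suspect section of track.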
Cool. Now let's talk about the implementation.
Data gets from the locomotive sensors to our data center….how that happens is not in the scope of this talk…..if you are really curious, you probably need to attend a conference by Cisco or Intel.
Once it gets into your data center, write it to a streaming data channel….we recommend Kafka.
From Kafka, you can read the events in your Spark Streaming job and process them. We recommend using the receiverless direct connector to read from Kafka.
In your Spark Streaming job, you will first need to enrich the data….that is….for each event, attach the relevant metadata required to identify damage.
For example, use the locomotive ID to get information about the locomotive, such as its type…is it freight or passenger, its weight, its type of cargo….if it is carrying dangerous chemicals, you probably want to stop the train even for slight damage….vs. an empty cargo train.
Similarly, join with information about the sensor, such as its location on the train….is it on the right wheel, the left wheel….and so on.
From the GPS information, figure out where the train is….if it is going up a steep incline, temperature readings may go up, and that is ok.
We recommend storing this type of metadata in HBase, which is ideal for random key reads….and HBase comes with the hbase-spark module that makes it easy to call HBase from Spark jobs.
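The enrichment step can be sketched as a lookup join. In this sketch plain dicts stand in for the HBase metadata tables, and all field names (`locomotive_id`, `cargo`, `position`, and so on) are our own illustrative assumptions:

```python
# Plain dicts standing in for the HBase metadata tables.
locomotive_meta = {
    "loco-7": {"type": "freight", "cargo": "chemicals", "weight_t": 1200},
}
sensor_meta = {
    "s-42": {"position": "left-wheel", "axle": 2},
}

def enrich(event):
    """Attach locomotive and sensor metadata to a raw sensor event."""
    enriched = dict(event)
    enriched["locomotive"] = locomotive_meta.get(event["locomotive_id"], {})
    enriched["sensor"] = sensor_meta.get(event["sensor_id"], {})
    return enriched

event = {"sensor_id": "s-42", "locomotive_id": "loco-7", "noise_db": 95.0}
print(enrich(event)["locomotive"]["cargo"])  # chemicals
```

In the real job the lookups would go through hbase-spark against the metadata tables, typically with caching, since these tables change far more slowly than the sensor stream.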
Once you have enriched the data and transformed it…you can determine if the sensor readings signify damage….and that can be rule-based or it can be a trained machine learning model….again….outside the domain of this talk, since we are not a bunch of mechanical engineers.
When potential damage is identified, write an event out to Kafka….say to a topic called “alerts”….and have an application listening to this topic that will send out a pager alert, email alert, or other form of alert to technicians.
Write raw data to HDFS. It will come in handy:
A team of data scientists can do offline analysis
More importantly, the raw data will come in handy when there is a bug in application logic or a faulty sensor and your end results don’t match expectations. So, for auditing purposes….auditing both your application and the incoming data.
So we have identified a potential problem. The next step is for a technician to step in and diagnose.
Diagnosing the issue will require visualizing the sensor readings as time series data….look at how they are trending, look at readings for different windows of time…..compare with readings from a different time window……all of this entails visualizing time series data.
For visualization of time series, Grafana is a popular and useful application….but there are many other options available, and it is also fairly easy to build one with JavaScript.
The technician can manually inspect the data and decide what to do…..send the railway carriage for maintenance or if things seem bad, stop it and get it physically checked out.
We are talking about a lot of sensors producing 1 or more readings per second…that is a lot of data and it needs to be stored in a way that lends itself for time series visualization.
Time series data entails sequential reads….since you look at a continuous window of time….similarly, writing time series data is also sequential, since you will keep appending newer readings….these sequential scans are interspersed with random reads….when, for example, you change your window start and stop times or move back and forth between different points in time.
The ideal storage for this is Kudu: Kudu delivers the best performance for mixed scan and random seek workloads.
Until Kudu is GA, use HBase. HBase performance may not match Kudu performance, but it will certainly work for this use case.
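For HBase, a common row-key pattern for time series is sensor ID first, then a zero-padded timestamp, so that all readings for one sensor are stored contiguously and sort chronologically, which serves the sequential scans described above. A minimal sketch of that key layout; the separator and padding width are our own choices:

```python
def row_key(sensor_id, ts_ms):
    """Compose a time-series row key: sensor ID first so one sensor's
    readings are contiguous, then a zero-padded millisecond timestamp
    so they sort chronologically within that sensor."""
    return f"{sensor_id}#{ts_ms:013d}"

# Lexicographic order of the keys matches time order of the readings.
keys = sorted(row_key("s-42", t) for t in (1700000001000, 1700000000500))
print(keys)
```

A production key design would also consider salting to avoid hot-spotting on the most recent writes, which is out of scope here.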
Could we instead put sensors on the tracks? Sure. But sensors on locomotives are easier to manage, and there are far fewer of them. Rail tracks travel through remote areas; those are the hardest places to put sensors, and those are probably the sections that most need monitoring.
Call out Hue!!