2. Agenda
Sensors in the IoT era
Predictive Maintenance
Predictive Maintenance with sensor data in Utilities industry
Architecture for a real-time distributed sensor data collection, analysis, visualization, and storage system
Modeling imprecise sensor readings
3. Sensors in the IoT era
Sensors
Sensors are a bridge between the physical world and the internet. They will
play an ever-increasing role in just about every field imaginable, powering
the “Internet of Things”.
Potential Uses of Sensor Data
Sensors can be used to monitor machines, infrastructure, and the environment:
ventilation equipment, bridges, energy meters, airplane engines, temperature,
humidity, etc.
One use of this data is predictive maintenance: repairing or replacing items
before they break.
4. Three classes of maintenance
Corrective maintenance (CM) is simply fixing things after they suffer a
breakdown; it is also called reactive maintenance.
Preventive maintenance (PM) is about replacing or replenishing consumables
at scheduled intervals.
Predictive maintenance (PdM), or condition-based maintenance, focuses on
detecting failures before they occur.
PdM incorporates inspections of the system at predetermined intervals to
determine its condition.
Depending on the outcome of each inspection, either a preventive maintenance
activity or no action is performed.
5. Fault Detection Method in Predictive Maintenance
PdM employs many fault or defect detection methods, which compare current
sensor or inspection data with some reference data.
If the reference data are produced by a representation (model) of the real
system, the fault detection method is called model-based.
Two distinct kinds of models are mainly used: analytical models and
machine learning models.
Analytical models are limited to representing linear characteristics, whereas
modern machine learning techniques based on artificial intelligence, such as
neural networks, Bayesian (belief) networks, and support vector machines,
are capable of capturing nonlinearities and complex interdependencies. Even
a relatively "simple" machine learning tool such as a decision tree can allow
for nonlinearities.
6. Machine Learning in Predictive Maintenance
Data mining and machine learning allow systematic classification of patterns
contained in data sets.
Patterns of data ("attributes") containing information about the condition of
physical assets can be represented by "instances" with an associated failure
mode, or "class".
Predictions can then be made from patterns in real-time data.
7. Decision tree model example
Here is an example of building a decision tree model where the strategy is to
either perform maintenance or not, based on the outcomes of several
independent measurements (variables).
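Such a hand-built tree can be sketched in a few lines. The variables (vibration, temperature, hours in service) and the split thresholds below are illustrative assumptions, not values from the slide:

```python
# A tiny, hand-built decision tree for a maintenance decision.
# Variable names and thresholds are hypothetical illustrations.
def maintenance_decision(vibration: float, temperature: float,
                         hours_in_service: int) -> str:
    """Walk the tree and return the recommended action."""
    if vibration > 7.0:                # root split: high vibration
        return "maintain"
    if temperature > 90.0:             # second split: overheating
        # deeper split on accumulated service hours
        return "maintain" if hours_in_service > 10_000 else "inspect"
    return "no-maintenance"
```

In practice such a tree would be learned from labeled instances (attribute vectors with a failure-mode class) rather than written by hand.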
9. Predictive Maintenance in the utility industry
By analyzing the patterns of circumstances surrounding past equipment
failures and power outages and by accessing multiple data sources including
sensors in real time, utility companies can predict and prevent future
failures.
Predictive Maintenance allows utility companies to not only prepare for
known consumption peaks, such as those caused by extreme weather
conditions, but also react quickly to unexpected problems when the warning
signs appear.
Utility companies can spot problems early on:
when some of a sensor's values are abnormal;
when the number of abnormal values exceeds a given threshold;
or when a sensor's values differ significantly from the values of its
neighbors.
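The three early-warning rules can be sketched as three small checks. The normal band, count threshold, and 3-sigma neighbor rule are illustrative assumptions:

```python
from statistics import mean, stdev

def out_of_range(value: float, low: float, high: float) -> bool:
    """Rule 1: a single reading outside its normal band."""
    return value < low or value > high

def too_many_abnormal(values, low, high, max_abnormal: int) -> bool:
    """Rule 2: the count of abnormal readings exceeds a threshold."""
    return sum(out_of_range(v, low, high) for v in values) > max_abnormal

def deviates_from_neighbors(value, neighbor_values, n_sigma=3.0) -> bool:
    """Rule 3: the reading differs significantly (here, more than
    n_sigma standard deviations) from neighboring sensors' readings."""
    mu, sigma = mean(neighbor_values), stdev(neighbor_values)
    return abs(value - mu) > n_sigma * sigma
```

Real deployments would derive the bands from equipment specifications and the neighbor test from the spatial correlation of nearby sensors.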
10. Big and fast sensor data requires a different architecture
Due to rapid advances in sensor technologies, the number of sensors and the
amount of sensor data have been increasing at incredible rates.
The scalability, availability, and speed requirements of sensor data
collection, storage, and analysis solutions therefore call for new
technologies that can efficiently distribute data over many servers and
dynamically add new attributes to data records.
11. Architecture for a real-time distributed sensor data collection, analysis, visualization, and storage system
The new architecture must be able to scale to support a large number of
sensors and big data sizes.
It must be able to automatically gather and analyze large numbers of sensor
measurements over long periods of time, and to apply statistics and machine
learning to execute computationally complex data analysis algorithms with
many influencing factors.
Open source big data frameworks can be utilized for large-scale sensor data
analysis requirements.
13. An example use case
Display all the transformers located in Houston, Texas on a map, and when a
transformer icon is clicked, display the following details in an info window:
Transformer ID, Age, Designed Capacity, exact location, and the current Load
reading.
If a transformer is of Type "Pole-Top" with Rating 230 and Age > 20, its load
exceeds its designed capacity by more than 10 kVA, and the air temperature at
its location is above 100 degrees, we highlight the transformer icon in red.
When the user clicks on a specific transformer, we populate its details,
including its Load reading. Both the transformer icon color and the Load
reading (shown in red or green) continuously update every second in real
time.
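The highlight rule reduces to a single predicate. The field names and Fahrenheit units below are assumptions based on the slide text:

```python
# Sketch of the red-highlight rule from the use case.
# Field names and units (kVA, degrees F) are assumed from the slide.
def icon_color(t_type: str, rating: int, age: int,
               load_kva: float, capacity_kva: float,
               air_temp_f: float) -> str:
    """Return 'red' when all alert conditions hold, 'green' otherwise."""
    overloaded = (load_kva - capacity_kva) > 10     # > 10 kVA over capacity
    if (t_type == "Pole-Top" and rating == 230 and age > 20
            and overloaded and air_temp_f > 100):
        return "red"
    return "green"
```

In the described system this predicate would run inside the streaming job every second, and the result would be pushed to the map UI.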
14. Why Spark?
Spark presents a new distributed memory abstraction, called resilient
distributed datasets (RDDs), which provides a data structure for in-memory
computations on large clusters.
RDDs achieve fault tolerance: if a given task fails for reasons such as
hardware failure or erroneous user code, lost data can be recovered and
reconstructed automatically on the remaining tasks.
Like Hadoop, Spark has a high-level Java API for working with distributed
data, but it presents an in-memory processing solution.
We run Spark on Hortonworks HDP 2.2 in YARN mode, and have also made Spark
1.3.1 work on HDP 2.2 (whose default Spark version is 1.2).
15. Spark Streaming
Spark Streaming is an extension of the core Spark API that enables
high-throughput, fault-tolerant stream processing of live data streams.
It offers an additional abstraction called discretized streams, or
DStreams. DStreams are a continuous sequence of RDDs representing a
stream of data.
DStreams can be created from live incoming data or by transforming other
DStreams.
Spark receives data, divides it into batches, then replicates the batches for
fault tolerance and persists them in memory where they are available for
mathematical operations.
Spark 1.3 offers Streaming K-means Clustering and Streaming Linear
Regression
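The discretization idea, dividing a continuous stream into a sequence of batches, can be illustrated without Spark. This is a plain-Python sketch, not the Spark API, and it batches by record count whereas Spark Streaming batches by time interval:

```python
from itertools import islice

def microbatches(stream, batch_size: int):
    """Group an unbounded iterator of records into fixed-size batches,
    mimicking how Spark Streaming discretizes a live stream into a
    sequence of RDDs (a DStream). Spark uses a time interval rather
    than a record count, but the structure is the same."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:                # stream exhausted
            return
        yield batch

batches = list(microbatches(range(7), 3))
```

Each yielded batch corresponds to one RDD in the DStream; transformations then apply to every batch as it arrives.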
16. Spark SQL
Spark SQL is Spark's module for working with structured data.
The foundation of Spark SQL is a type of RDD, called SchemaRDD (pre-V1.3) or
DataFrame (V1.3), an object similar to a table in a relational database.
Spark SQL can run queries against mixed types of data.
17. Sensor Data Storage – HBase
NoSQL databases provide efficient alternatives for storing large amounts of sensor
data. In this example, we use HBase, a NoSQL key/value store which runs on top of
HDFS. Unlike Hive, HBase operations run in real time on its database rather than
as batch-based MapReduce jobs.
Each key/value pair in HBase is defined as a cell, and each key consists of row-key, column
family, column, and time-stamp. A row in HBase is a grouping of key/value mappings
identified by the row-key.
In our case, we'll store the anomalous sensor data in a table "abnormal_load" in the
format of: key, Transformer_ID, Timestamp, Load, Overload, Location, Air_Temperature
We can query our HBase table by creating an external Hive table, linking the HBase table to
the Hive table, and then running HiveQL:
select Transformer_ID, Timestamp, Overload
from spark_poc.abnormal_load
where Overload > 20 and Air_Temperature > 105
order by Timestamp DESC;
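Because HBase sorts rows lexicographically by row-key, the key layout matters. This sketch shows one common design, a composite key of transformer ID plus a reversed timestamp so the newest reading sorts first; the layout is an assumed illustration, not the table's actual schema:

```python
# Sketch of a composite HBase row-key for the abnormal_load table.
# The "ID#reversed-timestamp" layout is an assumed design choice.
MAX_TS = 10**13   # upper bound on epoch milliseconds, for reversal

def row_key(transformer_id: str, ts_millis: int) -> bytes:
    """Build a row-key that groups rows by transformer and orders each
    transformer's rows newest-first (HBase scans are lexicographic)."""
    reversed_ts = MAX_TS - ts_millis
    return f"{transformer_id}#{reversed_ts:013d}".encode("utf-8")
```

With this layout, a prefix scan on `T42#` returns that transformer's anomalies starting from the most recent one.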
18. Why send all source data to Kafka
In the diagrams in the next 2 slides:
The first shows what happens without Kafka.
Since each source needs to have a connection to each target, it is difficult to
maintain and can cause lots of programming and security issues.
The second diagram uses Kafka, so all sources send data to Kafka.
We only need to develop one interface/program to get all the different data
into Kafka; each kind of data is one topic.
On the consumer side, a consumer only deals with Kafka. When we add a new
source or a new consumer, it does not affect any existing source or target
at all. The design is thus easy to maintain, clean, secure, and scalable.
20. Data pipelines with Kafka
[Diagram: sources → Kafka → targets (HBase, Hive, HDFS, DB)]
21. Why write the analysis result data stream to Kafka before publishing it to the web UI
This is because if we send the data stream (analysis results) to a queue on
the web server and then use a web socket to push it to the browser, the queue
is very tedious to maintain.
Kafka comes in handy as a distributed, persistent message queue that supports
multiple concurrent writers, as well as multiple groups of readers that
maintain their own offsets within the queue (which Kafka calls a 'topic').
This enables us to build applications that consume data from a topic at their
own pace without disrupting access from other groups of readers.
22. Sensor Data Analysis
To analyze data on the aforementioned architecture we use distributed
machine-learning algorithms in Apache Mahout and MLlib by Apache Spark.
MLlib is a Spark component: a fast and flexible iterative computing framework
that implements machine-learning algorithms, including classification,
clustering, linear regression, collaborative filtering, and decomposition,
and aims to create and analyze large-scale data hosted in memory.
We use the k-means algorithm for clustering sensor data and finding
anomalies. K-means is a very popular unsupervised learning algorithm that
aims to assign objects to groups. All of the objects to be grouped need to be
represented as numerical features. The technique iteratively assigns points
to clusters, using distance as a similarity measure, until no point changes
cluster.
We also use Spark’s Streaming K-means.
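The assign-then-update loop at the heart of k-means fits in a few lines. This is a minimal single-machine sketch on 1-D data; MLlib runs the same iterations over distributed RDD partitions:

```python
from statistics import mean

def kmeans_1d(points, centroids, iters=10):
    """Minimal k-means on 1-D data: assign each point to its nearest
    centroid, then move each centroid to the mean of its points.
    Repeats for a fixed number of iterations for simplicity (real
    implementations stop when assignments no longer change)."""
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # empty clusters keep their old centroid
        centroids = [mean(v) if v else c for c, v in clusters.items()]
    return sorted(centroids)
```

For anomaly detection, points whose distance to their nearest final centroid is unusually large are flagged as anomalies.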
23. Modeling imprecise sensor readings
Sensor readings are inherently imprecise because of the noise introduced by
the equipment itself.
Two main approaches have emerged for modeling uncertain data series:
In the first, a Probability Density Function (PDF) over the uncertain values is
estimated by using some a priori knowledge.
In the second, the uncertain data distribution is summarized by repeated
measurements (i.e., samples).
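The two approaches can be contrasted directly. The readings and the 20.5 bound below are made-up illustration values, and the Gaussian choice in (1) stands in for whatever a-priori knowledge is available (`NormalDist` needs Python 3.8+):

```python
from statistics import NormalDist, mean, stdev

readings = [20.1, 19.8, 20.3, 20.0, 19.9]   # repeated noisy measurements

# (1) PDF approach: fit a parametric density (here a Gaussian, an
#     a-priori modeling choice) and query it analytically.
pdf = NormalDist(mean(readings), stdev(readings))
p_over = 1 - pdf.cdf(20.5)                  # P(true value > 20.5)

# (2) Sample approach: keep the samples themselves and answer the
#     same question empirically, with no distributional assumption.
p_over_empirical = sum(r > 20.5 for r in readings) / len(readings)
```

The PDF gives smooth tail probabilities from few samples but is only as good as the assumed distribution; the sample summary makes no assumption but needs enough repeated measurements to be informative.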
24. Dynamic probabilistic models over the sensor readings
The KEN technique builds and maintains dynamic probabilistic models over the
sensor readings, taking into account the spatio-temporal correlations that exist
in the sensor readings.
These models organize the sensor nodes in non-overlapping groups, and are
shared by the sensor nodes and the sink.
The expected values of the probabilistic models are the values that are
recorded by the sink. If the sensors observe that these values are more than εVT
away from the sensed values, then a model update is triggered.
The PAQ and SAF methods employ linear regression and autoregressive
models, respectively, for modeling the measurements produced by the nodes,
with SAF leading to a more accurate model than PAQ.
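The shared-model update trigger common to these schemes can be sketched with the simplest possible model, "the value stays at the last synchronized reading". This constant model is a stand-in for illustration only; KEN, PAQ, and SAF replace it with probabilistic, linear-regression, and autoregressive models respectively:

```python
def updates_needed(sensed, epsilon=1.0):
    """Return the indices at which the node must push a model update:
    whenever the sensed value drifts more than epsilon from the value
    the sink's copy of the model predicts. Here the shared model is a
    constant ('last synchronized reading'), a deliberate simplification."""
    predicted = sensed[0]          # node and sink start synchronized
    updates = []
    for i, x in enumerate(sensed[1:], start=1):
        if abs(x - predicted) > epsilon:
            updates.append(i)      # resynchronize the shared model
            predicted = x
    return updates
```

The better the model predicts the readings (SAF's autoregressive model being more accurate than PAQ's linear regression), the fewer updates are triggered and the less the node has to transmit.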