This document discusses two approaches for distributed clustering of data streams from sensor networks: DGClust and L2GClust. DGClust performs local discretization and representative clustering to improve computation and communication loads for clustering sensor data streams at a central server. L2GClust performs local clustering based on each sensor's sketch of its own data and its neighbors' estimates of the global clustering, allowing each sensor to estimate the overall network clustering with limited resources and communication. Evaluation shows L2GClust achieves high agreement with centralized clustering while reducing storage, communication and sensitivity to uncertainty.
1. Distributed Clustering for Smart Grids
Pedro Rodrigues, João Gama
University of Porto, Portugal
Project KDUS (PTDC/EIA-EIA/98355/2008)
4 September 2011
NGDM '11
2. Smart Grids
Smart Grids: a monitoring and information layer on top of the
electrical grid
Internet-like communications layer
A shift in the way in which power grids are operated
Intelligent monitoring in real time
Interactive with consumers and markets
Optimized to make the best use of resources and equipment
Predictive rather than reactive
Distributed across geographical and organizational boundaries
3. Smart Grids and Data Mining
A smart grid forms a network (possibly decomposable) of distributed
sources of high-speed data streams.
The dynamics of data are unknown:
the topology of network changes over time,
the number of meters tends to increase and
the context where the meter acts evolves over time.
Several data mining tasks are involved: prediction, cluster (profiling)
analysis, event and anomaly detection, correlation analysis, etc.
All these characteristics constitute real challenges and opportunities for
applied research in distributed data mining.
The requirements of near real-time analysis for multiple time horizons
and multiple space aggregations make these analyses an even harder
research challenge.
4. Outline
Rationale
Clustering distributed data streams
Local-to-Global Clustering of data sources
5. Rationale Sensor Networks
Sensors are usually small, low-cost devices capable of sensing some
attribute and of communicating with other sensors.
Sensor networks can include thousands of sensors, each one being
capable of measuring, analysing and transmitting a stream of data.
Resources are scarce, which limits the possibilities for heavy
computation, and sensors operate under limited bandwidth.
6. Rationale Comprehension of Ubiquitous Data Streams
Comprehension
Extract information about global interaction between sources by
looking at the data they produce.
When no other information is available, usual knowledge discovery
approaches are based on unsupervised techniques (e.g. clustering).
However, two different stream clustering problems exist:
clustering streaming data points (e.g. meters' readings)
clustering streaming data sources (e.g. meters)
7. Rationale Comprehension by Clustering Data Points
Information about dense regions of the sensor data space.
[Figure: data points grouped into Cluster A, Cluster B and Cluster C]
8. Rationale Comprehension by Clustering Data Sources
Information about groups of sensors that behave similarly over time.
Possible scenario
Sensors collecting electricity demand data from different homes,
exploring similar consumption patterns.
[Figure: sensors grouped into Cluster A, Cluster B and Cluster C]
9. DGClust Setting and Objective
Setting
Sensors in a wide network produce streams of heterogeneously
distributed data (each sensor produces a univariate stream of data)
Objective
To keep a clustering of the observations that are created by
aggregating each node's data as a feature in a centralized stream.
[Figure: Cluster A, Cluster B, Cluster C]
10. DGClust Problems and Research Question
Problems
high-speed data streams → excessive storage and processing
widely spread network → heavy communication
centralized clustering → high dimensionality
dynamic data → outdated models
Research Question
Do local discretization and representative clustering improve
validity, communication and computation loads when applied to
distributed sensor data streams?
11. DGClust Methodology : Local Step
DGClust – Distributed Grid Clustering (Local Step)
Each sensor keeps an online ordinal discretization of its data.
Partition Incremental Discretization
[Figure: current state of a sensor on its discretization grid, e.g. "low" or "D"]
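The local step can be illustrated with a much simpler stand-in for Partition Incremental Discretization. The sketch below is not the PiD algorithm: it assumes a value range known in advance and equal-width cells, and the name `GridDiscretizer` and its parameters are hypothetical. It only shows the idea that a sensor keeps a single ordinal state and can tell when that state changes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GridDiscretizer:
    """Equal-width ordinal grid over an assumed value range [low, high)."""
    low: float
    high: float
    bins: int
    state: Optional[int] = None  # index of the current cell, None before any data

    def update(self, x: float) -> bool:
        """Map reading x to its grid cell; return True when the state changed."""
        width = (self.high - self.low) / self.bins
        cell = int((x - self.low) // width)
        cell = min(max(cell, 0), self.bins - 1)  # clamp out-of-range readings
        changed = cell != self.state
        self.state = cell
        return changed


sensor = GridDiscretizer(low=0.0, high=10.0, bins=5)
for reading in (1.0, 1.9, 2.0):
    sensor.update(reading)  # the state only changes on the first and last reading
```

A finer grid (more `bins`) makes state changes more frequent, which is the trade-off between granularity and communication discussed in the main findings.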
12. DGClust Methodology : Aggregating Step
DGClust – Distributed Grid Clustering (Aggregating Step)
The central server gathers the global state of the network.
Sensors whose state has not changed since the last communication do
not transmit to the server.
[Figure: the global state held at the server is the vector of sensor states, e.g. (low, low, D, high, high, A, B, B, B, high, low)]
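The aggregating step can be sketched as one round in which each sensor reports only when its discretized state differs from the one it last sent. This is an illustration under assumptions: `run_round`, the dictionary-based bookkeeping and the two-level `discretize` function are hypothetical and do not come from the DGClust implementation.

```python
def run_round(readings, last_sent, global_state, discretize):
    """One time step: each sensor transmits only if its state changed.

    readings     : dict sensor_id -> raw reading for this step
    last_sent    : dict sensor_id -> state each sensor last transmitted
    global_state : dict sensor_id -> state as seen by the server
    Returns the number of messages sent in this round.
    """
    messages = 0
    for sensor_id, x in readings.items():
        state = discretize(x)
        if last_sent.get(sensor_id) != state:
            last_sent[sensor_id] = state     # sensor remembers what it sent
            global_state[sensor_id] = state  # server updates its view
            messages += 1
    return messages


def discretize(x):
    # Hypothetical two-level grid, used only for this example.
    return "low" if x < 5.0 else "high"


last_sent, global_state = {}, {}
first = run_round({"s1": 1.0, "s2": 7.0, "s3": 2.0}, last_sent, global_state, discretize)
second = run_round({"s1": 1.5, "s2": 3.0, "s3": 2.2}, last_sent, global_state, discretize)
# first == 3 (every sensor reports once), second == 1 (only s2 changed state)
```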
13. DGClust Methodology : Representative Step
DGClust – Distributed Grid Clustering (Representative Step)
Server keeps a small list of the most frequent global states.
Space-Saving Frequent Items Monitoring
[Figure: table of the most frequent global states with their counts (#), e.g. 523, 334, 89, ...]
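The representative step relies on Space-Saving frequent-item monitoring. A minimal sketch of that counting scheme follows, assuming global states are encoded as hashable tuples. The class name `SpaceSaving` and its methods are illustrative, and the linear scan for the minimum is a simplification of the constant-time structure normally used.

```python
class SpaceSaving:
    """Keep approximate counts for at most `capacity` distinct items."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.counts = {}  # item -> estimated count
        self.errors = {}  # item -> maximum possible overestimation

    def add(self, item) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Replace the least frequent monitored item; the newcomer
            # inherits its count, which bounds the estimation error.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, n: int):
        """Return the n monitored items with the highest estimated counts."""
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]


monitor = SpaceSaving(capacity=2)
for state in [("low", "D"), ("low", "D"), ("high", "A"), ("high", "B"), ("low", "D")]:
    monitor.add(state)
```

With a capacity of 2, the rarely seen state `("high", "A")` is evicted, while `("low", "D")` keeps an exact count of 3.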
14. DGClust Methodology : Clustering Step
DGClust – Distributed Grid Clustering (Clustering Step)
Server applies partitional clustering to the most frequent states.
Furthest Point Clustering + Online Adaptive K-Means
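The combination named above can be sketched as furthest-point selection of initial centers followed by incremental center updates. The code assumes that frequent states are encoded as numeric tuples of grid-cell indices. The update rule shown is a generic MacQueen-style online k-means step, used here as a stand-in because the slides do not detail the adaptive variant. Function names are hypothetical.

```python
import math


def furthest_point_centers(points, k):
    """Pick up to k centers: start from the first point, then repeatedly
    take the point furthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def online_kmeans_update(centers, counts, point):
    """Move the nearest center toward `point` with step 1/n; return its index."""
    j = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    counts[j] += 1
    centers[j] = tuple(c + (x - c) / counts[j] for c, x in zip(centers[j], point))
    return j


states = [(0, 0), (1, 0), (10, 0), (10, 1), (0, 9)]
centers = furthest_point_centers(states, k=3)
counts = [1] * len(centers)
online_kmeans_update(centers, counts, (2, 0))  # nudges the center at (0, 0)
```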
15. DGClust Example (k=5) Varying Resources
16. DGClust Main Findings
Quality of results does not depend on the number of sensors.
Communication reduction is constant for any number of sensors (as
long as a direct link with the server exists).
higher discretization granularity → higher clustering quality, lower communication reduction
higher number of sensors → more clustering updates
17. L2GClust Setting and Objective
Setting
Sensors in a wide network produce streams of heterogeneously
distributed data (each sensor produces a univariate stream of data)
Objective
To keep, at each node, a clustering of the entire network of sensors.
[Figure: sensors grouped into Cluster A, Cluster B and Cluster C]
18. L2GClust Methodology : Local Sketch
Each sensor keeps a sketch of its most recent data.
The common approach to focusing on recent data is the sliding window [1].
Even within the sliding window, the most recent data point is usually
more important than the oldest one, which is about to be discarded.
In ubiquitous streaming data sources, such as sensor networks,
resources like memory and processing power are scarce.
Sometimes there is not even enough memory to store all the data
points inside the window.
Memoryless α-fading average
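A memoryless fading average can be kept with two numbers per sensor. The sketch below assumes the usual formulation, in which a fading sum S and a fading count N are both decayed by α at every step and the average is S / N. The slides do not spell out the recurrence, so this form, the class name `FadingAverage` and the example value of α are assumptions.

```python
class FadingAverage:
    """Exponentially faded mean that stores no past data points."""

    def __init__(self, alpha: float):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.s = 0.0  # fading sum:   S_t = x_t + alpha * S_{t-1}
        self.n = 0.0  # fading count: N_t = 1   + alpha * N_{t-1}

    def update(self, x: float) -> float:
        """Fold in a new reading and return the current faded average."""
        self.s = x + self.alpha * self.s
        self.n = 1.0 + self.alpha * self.n
        return self.s / self.n


sketch = FadingAverage(alpha=0.5)
sketch.update(10.0)  # 10.0
sketch.update(20.0)  # 25 / 1.5, closer to the newer reading than the plain mean
```

With α = 1 the sketch reduces to the ordinary running mean. Smaller values of α forget old readings faster.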
21. L2GClust Methodology : Local Clustering
Each node's estimate of the global clustering is computed by clustering
the centroids of its direct neighbors' estimates of the global clustering.
Furthest Point Clustering
In effect, each node builds an ensemble of the clusterings of its
direct neighbors.
Instead of broadcasting the sketch of its own data, each node
broadcasts its estimate of the global clustering.
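One local clustering round at a single node might look as follows. This is a sketch under assumptions: streams are univariate, so centroids are plain numbers. The node pools its own sketch with the centroids received from direct neighbors and reduces the pool to k centers by furthest-point selection. The names `furthest_point_1d` and `local_estimate` are hypothetical, and seeding with the node's own sketch is a choice made for this example.

```python
def furthest_point_1d(values, k):
    """Select up to k values, each one furthest from those already chosen."""
    centers = [values[0]]
    while len(centers) < min(k, len(values)):
        centers.append(max(values, key=lambda v: min(abs(v - c) for c in centers)))
    return centers


def local_estimate(own_sketch, neighbor_estimates, k):
    """Estimate the global clustering from the node's own sketch and the
    centroid lists broadcast by its direct neighbors."""
    pool = [own_sketch]
    for centroids in neighbor_estimates:
        pool.extend(centroids)
    return sorted(furthest_point_1d(pool, k))


# A node whose sketch is 10.0 hears two neighbors' current estimates.
estimate = local_estimate(10.0, [[9.5, 50.0], [51.0, 90.0]], k=3)
# estimate == [10.0, 50.0, 90.0]; this list is what the node would broadcast next
```

Only centroid lists travel between neighbors, so the size of each message depends on k rather than on the amount of data a sensor has seen.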
25. L2GClust Evaluation Summary
Comparison was performed with the same strategy executed at a central
server with access to all data.
Measured outcomes were the agreement between a node's clustering
estimate and the centralized clustering, averaged over all nodes.
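Kappa statistic: cluster sanity
Proportion of agreement: cluster validity
K = (P(A) - P(e)) / (1 - P(e))
where P(A) is the observed proportion of agreement and P(e) is the
agreement expected by chance.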
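Both measures can be computed from two label assignments over the same sensors. The sketch below assumes that cluster labels from the node's estimate and from the centralized clustering have already been matched to each other, and that P(e) is estimated from the label marginals, as in Cohen's kappa. The slides do not state how P(e) was obtained, so that choice and the function names are assumptions.

```python
from collections import Counter


def proportion_of_agreement(a, b):
    """P(A): fraction of sensors given the same cluster label by both."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kappa(a, b):
    """K = (P(A) - P(e)) / (1 - P(e)), with P(e) from the label marginals.

    Undefined (division by zero) when P(e) equals 1, i.e. when both
    clusterings put every sensor in the same single cluster.
    """
    n = len(a)
    p_a = proportion_of_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_a - p_e) / (1 - p_e)


node_labels = ["A", "A", "B", "B"]
central_labels = ["A", "A", "B", "A"]
p_a = proportion_of_agreement(node_labels, central_labels)  # 0.75
k = kappa(node_labels, central_labels)                      # 0.5
```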
State-of-the-art Simulator
Each sensor in the simulation (Visual Sense) generates a Gaussian
stream with mean from one of the predefined Gaussian clusters.
Evaluated parameters were number of clusters, network size, and
cluster overlap.
26. L2GClust Results
Average proportion of agreement converges (with small fluctuations).
27. L2GClust Results
Sanity was confirmed with Kappa statistic always above 0.58.
28. L2GClust Results
On real data from electricity demand sensors, L2GClust showed the
ability to improve as more examples are observed.
29. L2GClust Main Properties
Local sketch yields:
memoryless storage of summaries;
a straightforward adaptation to most recent data;
a reduction of the system's sensitivity to uncertainty;
Local clustering with direct neighbors yields:
no forwarding of information (reduced communication);
low dimensionality of the clustering problem;
sensitive information better preserved.
Future Work
Evaluate L2GClust on smart grid sensor networks.