Distributed Clustering for Smart Grids
Pedro Rodrigues, João Gama




                                  University of Porto, Portugal




                             Project KDUS (PTDC/EIA-EIA/98355/2008)
4 September 2011
NGDM '11
Smart Grids
    Smart Grids: monitoring information on top of the electrical
    grid
           Internet-like communications layer
              A shift in the way in which power grids are operated
           Intelligent monitoring in real time
              Interactive with consumers and markets
              Optimized to make the best use of resources and equipment
              Predictive rather than reactive
              Distributed across geographical and organizational boundaries




Smart Grids and Data Mining
    A smart grid forms a network (possibly decomposable) of distributed
    sources of high-speed data streams.
           The dynamics of data are unknown:
           the topology of network changes over time,
           the number of meters tends to increase and
           the context where the meter acts evolves over time.
    Several data mining tasks are involved: prediction, cluster (profiling)
    analysis, event and anomaly detection, correlation analysis, etc.

    All these characteristics constitute real challenges and opportunities for
    applied research in distributed data mining.

    The requirements of near real-time analysis for multiple time horizons
    and multiple space aggregations make these analyses an even harder
    research challenge.




Outline


    Rationale


    Clustering distributed data streams


     Local-to-Global Clustering of data sources




Rationale                                       Sensor Networks



    Sensors are usually small, low-cost devices capable of sensing some
    attribute and of communicating with other sensors.
    Sensor networks can include thousands of sensors, each one being
    capable of measuring, analysing and transmitting a stream of data.




    Resources are scarce, which reduces the possibilities for heavy
    computation, while operating under limited bandwidth.




Rationale            Comprehension of Ubiquitous Data Streams



    Comprehension
    Extract information about global interaction between sources by
    looking at the data they produce.


    When no other information is available, usual knowledge discovery
    approaches are based on unsupervised techniques (e.g. clustering).


    However, two different stream clustering problems exist:
           clustering streaming data points (e.g. meter readings)
           clustering streaming data sources (e.g. meters)



Rationale           Comprehension by Clustering Data Points



    Information about dense regions of the sensor data space.




    [Figure: data points grouped into Cluster A, Cluster B and Cluster C]




Rationale         Comprehension by Clustering Data Sources



    Information about groups of sensors that behave similarly over time.




    [Figure: sensors grouped into Cluster A, Cluster B and Cluster C]

    Possible scenario

    Sensors collecting electricity demand data from different homes,
    exploring similar consumption patterns.




DGClust                                    Setting and Objective



    Setting
    Sensors in a wide network produce streams of heterogeneously
    distributed data (each sensor produces a univariate stream of data)




    [Figure: Cluster A, Cluster B, Cluster C]

    Objective

    To keep a clustering of the observations that are created by
    aggregating each node's data as a feature in a centralized stream.

DGClust                    Problems and Research Question



    Problems
    high-speed data streams        excessive storage and processing
    widely spread network          heavy communication
    centralized clustering         high dimensionality
    dynamic data                   outdated models


    Research Question
    Do local discretization and representative clustering improve
    validity, communication and computation loads when applied to
    distributed sensor data streams?



DGClust                                   Methodology : Local Step



    DGClust – Distributed Grid Clustering (Local Step)
    Each sensor keeps an online ordinal discretization of its data.
                      Partition Incremental Discretization
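As a rough illustration of what an ordinal discretization gives each sensor, the sketch below maps a reading to a cell index on a fixed equal-width grid. It is a deliberately simpler stand-in, not Partition Incremental Discretization itself; the class name `GridDiscretizer` and its parameters (`low`, `high`, `n_states`) are assumptions made for this example.

```python
class GridDiscretizer:
    """Hypothetical helper: maps a univariate reading to an ordinal state
    (a cell index) on a fixed equal-width grid. Not the PiD algorithm."""

    def __init__(self, low: float, high: float, n_states: int) -> None:
        if high <= low or n_states < 1:
            raise ValueError("need high > low and n_states >= 1")
        self.low = low
        self.n_states = n_states
        self.width = (high - low) / n_states

    def state_of(self, value: float) -> int:
        # Index of the cell containing the value; out-of-range readings
        # are clamped into the first or last cell.
        index = int((value - self.low) // self.width)
        return max(0, min(self.n_states - 1, index))


grid = GridDiscretizer(low=0.0, high=100.0, n_states=4)
states = [grid.state_of(v) for v in (10.2, 25.0, 99.9, 150.0, -5.0)]
```

A finer grid (larger `n_states`) makes states change more often, which is the granularity trade-off the deck returns to in its findings.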
    [Figure: sensors annotated with their current states, e.g. low, D]




DGClust                         Methodology : Aggregating Step



    DGClust – Distributed Grid Clustering (Aggregating Step)
    The central server gathers the global state of the network.
    Sensors whose state has not changed since the last communication do not
    transmit to the server.
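The transmit-on-change rule can be sketched as follows. This is a minimal illustration under assumed names (`SensorNode`, `CentralServer`, `run`) and an assumed two-state discretizer; it does not model real network links.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


class SensorNode:
    """Hypothetical sensor that reports its discretized state only when it changes."""

    def __init__(self, sensor_id: str, discretize: Callable[[float], int]) -> None:
        self.sensor_id = sensor_id
        self.discretize = discretize
        self.last_sent: Optional[int] = None

    def observe(self, value: float) -> Optional[int]:
        state = self.discretize(value)
        if state == self.last_sent:
            return None  # unchanged state: nothing is transmitted
        self.last_sent = state
        return state


class CentralServer:
    """Hypothetical server holding the last known state of every sensor."""

    def __init__(self) -> None:
        self.states: Dict[str, int] = {}
        self.messages = 0

    def receive(self, sensor_id: str, state: int) -> None:
        self.states[sensor_id] = state
        self.messages += 1

    def global_state(self) -> Tuple[int, ...]:
        # One entry per sensor, in a fixed (sorted) sensor order.
        return tuple(self.states[k] for k in sorted(self.states))


def run(readings: Iterable[Tuple[str, float]],
        nodes: Dict[str, SensorNode], server: CentralServer) -> None:
    for sensor_id, value in readings:
        state = nodes[sensor_id].observe(value)
        if state is not None:
            server.receive(sensor_id, state)


two_states = lambda v: 0 if v < 50 else 1  # assumed discretizer: 0 = low, 1 = high
nodes = {name: SensorNode(name, two_states) for name in ("a", "b")}
server = CentralServer()
run([("a", 10), ("b", 80), ("a", 12), ("b", 85), ("a", 60)], nodes, server)
```

In this run only three of the five readings cause a message, because two readings leave their sensor in the same state.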


    [Figure: the current state of each sensor (low, high, A, B, D, ...) mirrored at the central server]




DGClust                      Methodology : Representative Step



    DGClust – Distributed Grid Clustering (Representative Step)
    Server keeps a small list of the most frequent global states.
                     Space-Saving Frequent Items Monitoring
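A compact sketch of Space-Saving counting over global states is given below. It uses a linear scan to find the minimum counter, which is fine for a small list but is a simplification; the class name `SpaceSaving` and the example state tuples are assumptions for illustration.

```python
from typing import Dict, Hashable, List, Tuple


class SpaceSaving:
    """Keeps at most `capacity` counters for the most frequent items seen so far."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts: Dict[Hashable, int] = {}
        self.errors: Dict[Hashable, int] = {}  # maximum overestimation per item

    def update(self, item: Hashable) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Replace the least frequent monitored item; the newcomer inherits
            # its count (plus one) and records that count as possible error.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k: int) -> List[Tuple[Hashable, int]]:
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)[:k]


s1 = ("low", "high", "D")   # hypothetical global states: one entry per sensor
s2 = ("low", "low", "A")
s3 = ("high", "high", "B")
monitor = SpaceSaving(capacity=2)
for state in (s1, s1, s1, s2, s3, s1):
    monitor.update(state)
```

Counts are upper bounds: an item that entered by eviction may be overestimated by at most its recorded error.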



    [Figure: list of the most frequent global states (combinations of sensor states such as low, high, A, B, C, D) with their counts (#): 523, 334, 89, ...]




DGClust                            Methodology : Clustering Step



    DGClust – Distributed Grid Clustering (Clustering Step)
    Server applies partitional clustering to the most frequent states.
             Furthest Point Clustering + Online Adaptive K-Means
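The furthest-point part of this step can be sketched as a greedy selection of k centers: start from one point, then repeatedly add the point farthest from its nearest chosen center. The code below illustrates only that seeding, not the online adaptive k-means refinement; the function names and the example points are assumptions.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def furthest_point_centers(points: List[Point], k: int) -> List[Point]:
    """Greedy furthest-point selection of k centers, starting from points[0]."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [points[0]]
    # Distance from every point to its nearest center chosen so far.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx])) for p, d in zip(points, nearest)]
    return centers


def assign(points: List[Point], centers: List[Point]) -> List[int]:
    """Index of the closest center for each point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


# Hypothetical frequent states, encoded as grid-cell coordinates.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 9)]
centers = furthest_point_centers(points, k=3)
labels = assign(points, centers)
```

The result depends on the starting point, so a refinement step such as k-means is a natural follow-up.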




DGClust   Example (k=5) Varying Resources




DGClust                                                Main Findings



     Quality of results does not depend on the number of sensors.


     Communication reduction is constant with any number of sensors (as
     long as a direct link with the server exists).




     higher discretization granularity  ->  higher clustering quality,
                                            lower communication reduction

     higher number of sensors           ->  more clustering updates




L2GClust                                      Setting and Objective



    Setting
    Sensors in a wide network produce streams of heterogeneously
    distributed data (each sensor produces a univariate stream of data)




    [Figure: Cluster A, Cluster B, Cluster C]

    Objective

    To keep, at each node, a clustering of the entire network of sensors.




L2GClust                             Methodology : Local Sketch



    Each sensor keeps a sketch of its most recent data.

    [Figure: a sensor with its sketch value, 10.2]



    The common approach to focusing on recent data is the sliding window.
    Even within the sliding window, the most recent data point is usually
    more important than the oldest one, which is about to be discarded.


    In ubiquitous streaming data sources, such as sensor networks,
    resources like memory and processing power are scarce.
    Sometimes, there is not even enough memory to store all the data
    points inside the window.
                        Memoryless α-fading average
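One common formulation of a memoryless fading average keeps only two numbers, a faded sum and a faded count, and divides them. The sketch below assumes that formulation (S = x + αS, N = 1 + αN, average = S/N); the class name `FadingAverage` is hypothetical, and the slides do not state the exact recurrence used.

```python
class FadingAverage:
    """Memoryless average that gives older observations geometrically less weight."""

    def __init__(self, alpha: float) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.fading_sum = 0.0
        self.fading_count = 0.0

    def update(self, value: float) -> float:
        # Only two numbers are stored, regardless of how many points were seen.
        self.fading_sum = value + self.alpha * self.fading_sum
        self.fading_count = 1.0 + self.alpha * self.fading_count
        return self.fading_sum / self.fading_count


fast = FadingAverage(alpha=0.5)
fast_values = [fast.update(v) for v in (10.0, 20.0)]

plain = FadingAverage(alpha=1.0)  # alpha = 1 reduces to the ordinary mean
plain_values = [plain.update(v) for v in (10.0, 20.0, 30.0)]
```

Smaller values of α forget the past faster, which plays the role of a shorter window without storing any data points.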




L2GClust                  Example : Local Clustering


    [Figure: sensor network with node values 1, 10, 2, 100, 10, 11, 99, 95, 5, 10, 10, 3, 12, 2]




L2GClust                  Example : Local Clustering

    [Figure: sensor network with node values 1, 10, 2, 100, 10, 11, 99, 95, 5, 10, 10, 3, 12, 2; Centroids {6.9, 98.0}]




L2GClust                       Methodology : Local Clustering




    Instead of broadcasting the sketch of its own data, each node
    broadcasts its estimate of the global clustering.


    This estimate is computed by clustering the centroids of direct
    neighbors’ estimates of the global clustering.


                         Furthest Point Clustering


    Basically, each node performs an ensemble of clusterings from its
    direct neighbors.
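A node-level update along these lines might look like the sketch below. It assumes univariate values, pools the node's own sketch with the centroids received from direct neighbors, seeds k centers by furthest-point selection, and then averages each group; the averaging step, the function names, and the own-sketch value 3.0 are assumptions for illustration, while the neighbor centroid sets are borrowed from the example figure.

```python
from typing import List


def furthest_point_1d(values: List[float], k: int) -> List[float]:
    """Greedy furthest-point seeding on univariate values."""
    centers = [values[0]]
    while len(centers) < k:
        centers.append(max(values, key=lambda v: min(abs(v - c) for c in centers)))
    return centers


def local_estimate(own_sketch: float,
                   neighbor_estimates: List[List[float]], k: int) -> List[float]:
    """Hypothetical node update: cluster own sketch plus neighbors' centroids."""
    if k < 1:
        raise ValueError("k must be at least 1")
    pooled = [own_sketch] + [c for est in neighbor_estimates for c in est]
    seeds = furthest_point_1d(pooled, k)
    groups: List[List[float]] = [[] for _ in seeds]
    for v in pooled:
        nearest = min(range(len(seeds)), key=lambda i: abs(v - seeds[i]))
        groups[nearest].append(v)
    # Assumed refinement: each centroid is the mean of its group;
    # empty groups (possible with duplicate values) are skipped.
    return sorted(sum(g) / len(g) for g in groups if g)


neighbors = [[7.71, 97.1], [10.59, 97.38], [5.10, 95.00]]
estimate = local_estimate(own_sketch=3.0, neighbor_estimates=neighbors, k=2)
```

Only k centroids per neighbor are exchanged, so the size of each node's clustering problem depends on its number of direct neighbors rather than on the size of the network.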


L2GClust                           Example : Local Clustering

    [Figure: sensor network with node values 88.07, 87.37, 88.06, 4.19, 2.80, 3.74, 1.21, 3.58, 2.41, 3.50, 88.06, 88.03, 86.31, 88.12; Centroids {6.9, 98.0}; centroid sets {7.71, 97.1}, {10.59, 97.38}, {5.10, 95.00}]




L2GClust                        Example : Local Clustering

    [Figure: sensor network with node values 88.07, 87.37, 88.06, 4.19, 2.80, 3.74, 1.21, 3.58, 2.41, 3.50, 88.06, 88.03, 86.31, 88.12; Centroids {6.9, 98.0}; centroid set {10.36, 97.1}]




L2GClust                                       Evaluation Summary



     Comparison was performed with the same strategy executed at a central
     server with access to all data.
     Measured outcomes were the agreement between a node's clustering
     estimate and the centralized clustering, averaged over all nodes.
     Kappa statistic                            cluster sanity
     Proportion of agreement                    cluster validity
                       K=(P(A)-P(e))/(1-P(e))
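The formula above can be worked through in a few lines. The sketch below assumes that the two clusterings are given as label sequences whose cluster labels are already aligned, and that P(e) is the chance agreement computed from the label frequencies of each assignment; the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def proportion_of_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """P(A): fraction of items given the same label by both assignments."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """K = (P(A) - P(e)) / (1 - P(e)), with P(e) from label frequencies."""
    p_a = proportion_of_agreement(a, b)
    n = len(a)
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both assignments use one identical label; convention chosen here
    return (p_a - p_e) / (1.0 - p_e)


node_labels = [0, 0, 1, 1]      # hypothetical labels from one node's estimate
central_labels = [0, 0, 1, 0]   # hypothetical labels from the centralized clustering
p_a = proportion_of_agreement(node_labels, central_labels)
k_value = kappa(node_labels, central_labels)
```

Here the raw agreement is 0.75, but half of it would be expected by chance, so kappa is 0.5.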
     State-of-the-art Simulator
     Each sensor in the simulation (Visual Sense) generates a Gaussian
     stream with its mean taken from one of the predefined Gaussian clusters.
     Evaluated parameters were number of clusters, network size, and
     cluster overlap.




L2GClust                                                      Results




      [Figure] Average proportion of agreement converges (with small fluctuations).

L2GClust                                                       Results




           [Figure] Sanity was confirmed with Kappa statistic always above 0.58.

L2GClust                                                   Results




           [Figure] Real data from electricity demand sensors showed the
           ability to improve with examples.
L2GClust                                                 Main Properties




    Local sketch yields:
           memoryless storage of summaries;
           a straightforward adaptation to most recent data;
           a reduction of the system's sensitivity to uncertainty;


    Local clustering with direct neighbors yields:
           no forwarding of information (reduced communication);
           low dimensionality of the clustering problem;
           sensitive information better preserved.
    Future Work
           Evaluate L2GClust on smart grid sensor networks.




Thank you!




PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 

Distributed clustering from data streams

  • 1. Distributed Clustering for Smart Grids Pedro Rodrigues, João Gama University of Porto, Portugal Project KDUS (PTDC/EIA-EIA/98355/2008) 4 September 2011 NGDM '11
  • 2. Smart Grids Smart Grids: monitoring information on top of the electrical grid. Internet-like communications layer: a shift in the way in which power grids are operated. Intelligent monitoring in real time: interactive with consumers and markets; optimized to make the best use of resources and equipment; predictive rather than reactive; distributed across geographical and organizational boundaries.
  • 3. Smart Grids and Data Mining A smart grid forms a network (eventually decomposable) of distributed sources of high-speed data streams. The dynamics of the data are unknown: the topology of the network changes over time, the number of meters tends to increase, and the context where each meter acts evolves over time. Several data mining tasks are involved: prediction, cluster (profiling) analysis, event and anomaly detection, correlation analysis, etc. All these characteristics constitute real challenges and opportunities for applied research in distributed data mining. The requirements of near real-time analysis for multiple time horizons and multiple space aggregations make these analyses an even harder research challenge.
  • 4. Outline: Rationale; Clustering distributed data streams; Local-to-Global Clustering of data sources.
  • 5. Rationale Sensor Networks Sensors are usually small, low-cost devices capable of sensing some attribute and of communicating with other sensors. Sensor networks can include thousands of sensors, each capable of measuring, analysing and transmitting a stream of data. Resources are scarce, which reduces the possibilities for heavy computation, while operating under limited bandwidth.
  • 6. Rationale Comprehension of Ubiquitous Data Streams Comprehension: extract information about global interaction between sources by looking at the data they produce. When no other information is available, usual knowledge discovery approaches are based on unsupervised techniques (e.g. clustering). However, two different stream clustering problems exist: clustering streaming data points (e.g. meter readings) and clustering streaming data sources (e.g. meters).
  • 7. Rationale Comprehension by Clustering Data Points Information about dense regions of the sensor data space. (Figure: Cluster A, Cluster B, Cluster C.)
  • 8. Rationale Comprehension by Clustering Data Sources Information about groups of sensors that behave similarly over time. Possible scenario: sensors collecting electricity demand data from different homes, exploring similar consumption patterns. (Figure: Cluster A, Cluster B, Cluster C.)
  • 9. DGClust Setting and Objective Setting: sensors in a wide network produce streams of heterogeneously distributed data (each sensor produces a univariate stream of data). Objective: to keep a clustering of the observations that are created by aggregating each node's data as a feature in a centralized stream. (Figure: Cluster A, Cluster B, Cluster C.)
  • 10. DGClust Problems and Research Question Problems: high-speed data streams (excessive storage and processing); widely spread network (heavy communication); centralized clustering (high dimensionality); dynamic data (outdated models). Research Question: do local discretization and representative clustering improve validity, communication and computation loads when applied to distributed sensor data streams?
  • 11. DGClust Methodology : Local Step DGClust – Distributed Grid Clustering (Local Step) Each sensor keeps an online ordinal discretization of its data. (Figure: partition, incremental discretization, current state.)
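The local step can be sketched as follows. This is only an illustration, assuming an equal-width grid over a value range known in advance; the slides do not specify the discretization scheme, and the class name `OrdinalDiscretizer` and its parameters are hypothetical.

```python
class OrdinalDiscretizer:
    """Maps each reading of one sensor to an ordinal cell index (0 = lowest).

    Assumption: equal-width cells over a known [low, high] range.
    """

    def __init__(self, low, high, n_cells):
        self.low = low
        self.high = high
        self.n_cells = n_cells
        self.counts = [0] * n_cells  # readings seen per cell
        self.state = None            # cell of the most recent reading

    def update(self, x):
        """Consume one reading; return True if the sensor's state changed."""
        width = (self.high - self.low) / self.n_cells
        cell = int((x - self.low) / width)
        cell = max(0, min(self.n_cells - 1, cell))  # clamp out-of-range readings
        self.counts[cell] += 1
        changed = cell != self.state
        self.state = cell
        return changed
```

The boolean returned by `update` is what a sensor could use to decide whether anything needs to be sent at all.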
  • 12. DGClust Methodology : Aggregating Step DGClust – Distributed Grid Clustering (Aggregating Step) The central server gathers the global state of the network. Sensors whose state has not changed since the last communication do not transmit to the server.
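A minimal sketch of the aggregating step, assuming each sensor state is a hashable label and that the global state is simply the tuple of the last state received from every sensor. The names `Server` and `transmit_changes` are hypothetical, and the change check is done against the server's copy only to keep the example short; a real sensor would compare against its own last transmitted state.

```python
class Server:
    """Holds the last known state of every sensor (hypothetical helper)."""

    def __init__(self, sensor_ids):
        self.sensor_ids = list(sensor_ids)
        self.last_state = {sid: None for sid in self.sensor_ids}
        self.messages = 0  # number of transmissions received so far

    def receive(self, sensor_id, state):
        self.last_state[sensor_id] = state
        self.messages += 1

    def global_state(self):
        """Tuple of per-sensor states, in a fixed sensor order."""
        return tuple(self.last_state[sid] for sid in self.sensor_ids)


def transmit_changes(server, new_states):
    """Send only the states that differ from the last transmitted ones."""
    for sensor_id, state in new_states.items():
        if server.last_state[sensor_id] != state:
            server.receive(sensor_id, state)
    return server.global_state()
```

Counting `messages` against the number of readings gives a rough measure of the communication saved by staying silent on unchanged states.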
  • 13. DGClust Methodology : Representative Step DGClust – Distributed Grid Clustering (Representative Step) The server keeps a small list of the most frequent global states (Space-Saving frequent items monitoring). (Figure: monitored global states with their counts, e.g. 523, 334, 89, ...)
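The Space-Saving idea named on this slide can be sketched in a few lines. This is a plain dictionary-based version for illustration, not the authors' implementation; the class name and the `capacity` parameter are assumptions, and global states are assumed to be tuples of per-sensor states.

```python
class SpaceSaving:
    """Keeps approximate counts for at most `capacity` distinct items."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # item -> estimated count
        self.errors = {}  # item -> maximum overestimation of its count

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the least frequent monitored item; the newcomer
            # inherits its count as the possible overestimation.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k):
        """The k monitored items with the highest estimated counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]
```

Memory stays bounded by `capacity` regardless of how many distinct global states the network produces, which is the property the representative step relies on.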
  • 14. DGClust Methodology : Clustering Step DGClust – Distributed Grid Clustering (Clustering Step) The server applies partitional clustering to the most frequent states: Furthest Point Clustering + Online Adaptive K-Means.
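The two techniques named on this slide can be sketched as below, assuming frequent states have already been mapped to numeric points (tuples of floats). The slides do not say how the two are combined, so this pairing (furthest-point seeding followed by per-point centre updates with a 1/n learning rate) is an assumption, as are the function names.

```python
import math


def furthest_point_centers(points, k):
    """Pick k centres: start with the first point, then repeatedly add the
    point that is farthest from its nearest already chosen centre."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def online_kmeans_update(centers, counts, x):
    """Move the centre nearest to x towards x; return that centre's index."""
    j = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
    counts[j] += 1
    eta = 1.0 / counts[j]  # learning rate shrinks as the centre absorbs points
    centers[j] = tuple(c + eta * (xi - c) for c, xi in zip(centers[j], x))
    return j
```

Seeding with furthest points spreads the initial centres over the data, and the online update then lets them drift as new frequent states arrive.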
  • 15. DGClust Example (k=5) Varying Resources
  • 16. DGClust Main Findings Quality of results does not depend on the number of sensors. Communication reduction is constant with any number of sensors (as long as a direct link with the server exists). Higher discretization granularity: higher clustering quality, lower communication reduction. Higher number of sensors: more clustering updates.
  • 17. L2GClust Setting and Objective Setting: sensors in a wide network produce streams of heterogeneously distributed data (each sensor produces a univariate stream of data). Objective: to keep, at each node, a clustering of the entire network of sensors. (Figure: Cluster A, Cluster B, Cluster C.)
  • 18. L2GClust Methodology : Local Sketch Each sensor keeps a sketch of its most recent data. The common approach to focusing on recent data is a sliding window. Even within the sliding window, the most recent data point is usually more important than the oldest one, which is about to be discarded. In ubiquitous streaming data sources, such as sensor networks, resources like memory and processing power are scarce. Sometimes there is not even enough memory to store all the data points inside the window. Memoryless α-fading average.
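One common formulation of an α-fading average keeps only a fading sum and a fading count, so no window of past readings is stored. The slides do not give the update rule, so the recurrences below (S ← x + αS, N ← 1 + αN, average = S / N) and the class name are assumptions used for illustration.

```python
class FadingAverage:
    """Memoryless average in which older readings fade by a factor alpha."""

    def __init__(self, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.fading_sum = 0.0
        self.fading_count = 0.0

    def update(self, x):
        """Consume one reading and return the current fading average."""
        self.fading_sum = x + self.alpha * self.fading_sum
        self.fading_count = 1.0 + self.alpha * self.fading_count
        return self.fading_sum / self.fading_count
```

With `alpha = 1.0` this reduces to the ordinary running mean; smaller values give more weight to the most recent readings.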
  • 19. L2GClust Example : Local Clustering Node values in the figure: 1, 10, 2, 100, 10, 11, 99, 95, 5, 10, 10, 3, 12, 2.
  • 20. L2GClust Example : Local Clustering Centroids {6.9, 98.0}. Node values in the figure: 1, 10, 2, 100, 10, 11, 99, 95, 5, 10, 10, 3, 12, 2.
  • 21. L2GClust Methodology : Local Clustering Each node's estimate of the global clustering is computed by clustering the centroids of its direct neighbors' estimates of the global clustering (Furthest Point Clustering). Basically, each node performs an ensemble of clusterings from its direct neighbors. Instead of broadcasting the sketch of its own data, each node broadcasts its estimate of the global clustering.
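A minimal one-dimensional sketch of this local clustering, assuming a node pools the values it has received into a single list and refines furthest-point seeds with a few k-means iterations. The function names and the pooling are hypothetical; the slides only name Furthest Point Clustering. The input below is the list of node values from the local clustering example.

```python
def furthest_point_seeds(values, k):
    """Start from the first value, then add the value farthest from all seeds."""
    seeds = [values[0]]
    while len(seeds) < k:
        seeds.append(max(values, key=lambda v: min(abs(v - s) for s in seeds)))
    return seeds


def cluster_1d(values, k, iterations=10):
    """Furthest-point seeding followed by plain k-means refinement."""
    centers = furthest_point_seeds(values, k)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous centre.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


received = [1, 10, 2, 100, 10, 11, 99, 95, 5, 10, 10, 3, 12, 2]
estimate = cluster_1d(received, k=2)
print([round(c, 1) for c in estimate])  # [6.9, 98.0]
```

On these values the sketch yields the same centroids as the example slide, {6.9, 98.0}, although that agreement does not show that the authors used exactly this procedure.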
  • 22. L2GClust Example : Local Clustering Centroids {6.9, 98.0}, {7.71, 97.1}, {10.59, 97.38}, {5.10, 95.00}. (Figure values: 88.07, 87.37, 88.06, 4.19, 2.80, 3.74, 1.21, 3.58, 2.41, 3.50, 88.06, 88.03, 86.31, 88.12.)
  • 23. L2GClust Example : Local Clustering Centroids {6.9, 98.0}, {7.71, 97.1}, {10.59, 97.38}, {5.10, 95.00}. (Figure values: 88.07, 87.37, 88.06, 4.19, 2.80, 3.74, 1.21, 3.58, 2.41, 3.50, 88.06, 88.03, 86.31, 88.12.)
  • 24. L2GClust Example : Local Clustering Centroids {6.9, 98.0}, {10.36, 97.1}. (Figure values: 88.07, 87.37, 88.06, 4.19, 2.80, 3.74, 1.21, 3.58, 2.41, 3.50, 88.06, 88.03, 86.31, 88.12.)
  • 25. L2GClust Evaluation Summary: comparison was performed with the same strategy executed at a central server with access to all data. Measured outcomes were the agreement between a node's clustering estimate and the centralized clustering, averaged over all nodes: proportion of agreement (cluster validity) and Kappa statistic (cluster sanity), K = (P(A) - P(e)) / (1 - P(e)). State-of-the-art simulator: each sensor in the simulation (Visual Sense) generates a Gaussian stream with mean from one of the predefined Gaussian clusters. Evaluated parameters were number of clusters, network size, and cluster overlap.
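The Kappa formula on this slide can be illustrated with a small sketch. It assumes agreement is measured over pairs of sensors (both clusterings put the pair in the same cluster, or both separate it) and that P(e) is the chance agreement computed from each clustering's marginal rate of "same cluster" decisions. The slides do not state how P(A) and P(e) were obtained, so this pairwise scheme and the function names are assumptions.

```python
from itertools import combinations


def pair_decisions(labels):
    """For every pair of sensors: True if they share a cluster label."""
    return [labels[i] == labels[j] for i, j in combinations(range(len(labels)), 2)]


def kappa(decisions_a, decisions_b):
    """K = (P(A) - P(e)) / (1 - P(e)) for two lists of boolean decisions."""
    n = len(decisions_a)
    p_agree = sum(x == y for x, y in zip(decisions_a, decisions_b)) / n
    rate_a = sum(decisions_a) / n  # share of "same cluster" decisions in a
    rate_b = sum(decisions_b) / n
    p_chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_chance == 1.0:
        return 1.0  # both clusterings are constant and identical
    return (p_agree - p_chance) / (1 - p_chance)
```

Working on pair decisions makes the measure insensitive to how clusters are numbered, so a node's estimate can be compared with the centralized clustering without first matching labels.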
  • 26. L2GClust Results Average proportion of agreement converges (with small fluctuations).
  • 27. L2GClust Results Sanity was confirmed with Kappa statistic always above 0.58.
  • 28. L2GClust Results Real data from electricity demand sensors showed ability to improve with examples.
  • 29. L2GClust Main Properties Local sketch yields: memoryless storage of summaries; a straightforward adaptation to most recent data; a reduction of the system's sensitivity to uncertainty. Local clustering with direct neighbors yields: no forwarding of information (reduced communication); low dimensionality of the clustering problem; sensitive information better preserved. Future Work: evaluate L2GClust on smart grid sensor networks.
  • 30. Thank you!