Creating Histograms from a Data Stream via MapReduce
Hans-Henning Gabriel




                        © 2012 Datameer, Inc. All rights reserved.

What is a histogram?

•  Distribution of a random variable
•  Bar graph showing frequency
•  Basic algorithm: Batch Discretization
[Figure: a column of sample values (3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8, ...) next to the resulting bar graph "Histogram of Values"; x-axis: Values (2 to 10), y-axis: Frequency (0 to 5)]
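As a rough illustration of batch discretization, the sketch below builds an equal-width histogram when the whole batch, and therefore its minimum and maximum, is known up front. The function name `batch_histogram` and the choice of four bins are assumptions made for this example; the sample values are the ones listed on the slide.

```python
from typing import List, Sequence, Tuple


def batch_histogram(values: Sequence[float], num_bins: int) -> Tuple[List[float], List[int]]:
    """Equal-width histogram over a batch whose full range is known in advance."""
    if not values or num_bins < 1:
        raise ValueError("need at least one value and one bin")
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 when all values are identical (zero range).
    width = (hi - lo) / num_bins or 1.0
    breaks = [lo + i * width for i in range(num_bins + 1)]
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return breaks, counts


sample = [3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8]
breaks, counts = batch_histogram(sample, num_bins=4)
print(breaks)
print(counts)
```

This only works because `min` and `max` can be computed before any value is binned, which is exactly what a stream does not allow.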
What can it be used for?

•  Optimize data processing
•  Probability density estimation
•  Machine learning algorithms
•  Visual impression of the data

Histograms in Datameer




Conditions

•  Data arrives as a stream
     •  minimum and maximum value?
•  Data is distributed
     •  compute and combine bins via MapReduce?
•  No user interaction
     •  how to set parameters?


Outline

•  Partition Incremental Discretization (PiD)
     •  dropping parameters
•  Distribute & Combine
     •  MapReduce
•  Evaluation
     •  small error
•  Conclusion

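PiD: 2-Layer Approach

[Figure: PiD two-layer example. Layer 1 stores counts (7, 3, 10) over breaks (2, 3, 4, 5). Each count is tested against the threshold ("> alpha?"); an interval that exceeds it is split (breaks 2, 3, 3.5, 4, 5, 6 in the example). Values outside the current range trigger a border extension with step=1 (breaks 2, 3, 4, 5, 6). Layer 2 is the resulting "Histogram of Values"; x-axis: Values, y-axis: Frequency.]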
adjustedPiD: Parameters Dropped

•  Splitting threshold alpha:
     $$\frac{\mathrm{count} + 1}{\mathrm{total} + 2} > \alpha$$
     •  the smaller the better
     →  set to a small constant value, e.g. $\alpha = 0.01$

•  Parameter step:
     •  maintain Min and Max values
     •  extend border breaks based on Min and Max
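A minimal sketch of how a layer-1 update along these lines could look, not the author's implementation. The class name `Layer1`, the initial range passed to the constructor, splitting an interval at its midpoint, and giving each half half of the count are all assumptions; the slides only state the threshold test (count + 1) / (total + 2) > alpha and that border breaks follow Min and Max.

```python
from bisect import bisect_right
from typing import List


class Layer1:
    """Hypothetical layer-1 summary: sorted breaks plus one count per interval."""

    def __init__(self, lo: float, hi: float, alpha: float = 0.01) -> None:
        self.breaks: List[float] = [lo, hi]   # assumed initial range
        self.counts: List[float] = [0.0]
        self.total = 0
        self.alpha = alpha

    def update(self, x: float) -> None:
        # Border extension (assumption): add an outer interval reaching the new Min or Max.
        if x < self.breaks[0]:
            self.breaks.insert(0, x)
            self.counts.insert(0, 0.0)
        elif x > self.breaks[-1]:
            self.breaks.append(x)
            self.counts.append(0.0)
        # Locate the interval containing x; the right-most break belongs to the last interval.
        i = min(bisect_right(self.breaks, x) - 1, len(self.counts) - 1)
        self.counts[i] += 1
        self.total += 1
        # Splitting test from the slide.
        if (self.counts[i] + 1) / (self.total + 2) > self.alpha:
            self._split(i)

    def _split(self, i: int) -> None:
        # Assumption: split at the midpoint and give each half half of the count.
        mid = (self.breaks[i] + self.breaks[i + 1]) / 2
        half = self.counts[i] / 2
        self.breaks.insert(i + 1, mid)
        self.counts[i:i + 1] = [half, half]


# alpha=0.5 is used here only to keep the trace short; the slide suggests 0.01.
layer = Layer1(0.0, 10.0, alpha=0.5)
for value in (1.0, 7.0, 12.0):
    layer.update(value)
print(layer.breaks)
print(layer.counts)
```

In this trace the first two values each trigger a split and the third extends the upper border without splitting, so the counts still sum to the number of records seen.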
adjustedPiD: Splitting Behavior

$$\mathrm{count}_{MAX} = 1 + \lim_{s \to \infty} \sum_{x=1}^{s} 2^{-x} = 2, \qquad \frac{\mathrm{count}_{MAX} + 1}{0.01} - 2 = 298$$

[Figure: number of bins (y-axis, 0 to 300) versus number of records (x-axis, 0 to 1000), one curve each for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32]
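The arithmetic on this slide can be checked with a few lines. This is one reading of the formula, stated as an assumption: a count that is repeatedly halved by a split and then incremented approaches 1 + 1/2 + 1/4 + ... = 2, and solving (count + 1) / (total + 2) = alpha for total with count = 2 gives the record count printed on the slide. Both function names are made up for this example.

```python
def max_count_before_split(terms: int = 60) -> float:
    """Partial sum 1 + sum_{x=1}^{terms} 2**-x, which tends to 2 as terms grows."""
    return 1.0 + sum(2.0 ** -x for x in range(1, terms + 1))


def total_at_threshold(alpha: float, count_max: float = 2.0) -> float:
    """Solve (count_max + 1) / (total + 2) = alpha for total."""
    return (count_max + 1.0) / alpha - 2.0


print(max_count_before_split())
# The alpha values are the ones shown in the plot legend.
for alpha in (0.01, 0.02, 0.04, 0.08, 0.16, 0.32):
    print(alpha, round(total_at_threshold(alpha), 3))
```

For alpha = 0.01 this reproduces the value 298 from the slide; larger alpha values reach the threshold after far fewer records.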
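MapReduce: Combine Layer 1

[Figure: two layer-1 summaries with bins A1, A2, A3, A4 and A5, A6, A7, A8 are combined into one: A1, A2+A5, A3+A6, A4+A7, A8. Counts of overlapping bins are added.]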
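The combine step can be sketched as a reduce over partial layer-1 summaries, here assuming each mapper emits a dictionary from interval to count and that the intervals of different partitions line up exactly. That alignment is an assumption of this example (the slides do not say how differing breaks are reconciled), and the counts below are made up.

```python
from collections import Counter
from functools import reduce
from typing import Dict, Tuple

Interval = Tuple[float, float]
Summary = Dict[Interval, float]


def combine_layer1(a: Summary, b: Summary) -> Summary:
    """Add the counts of identical intervals; keep intervals present in only one input."""
    merged = Counter(a)
    merged.update(b)  # Counter.update adds counts instead of replacing them
    return dict(sorted(merged.items()))


# Two partial summaries, standing in for bins A1..A4 and A5..A8.
part1 = {(0.0, 1.0): 4, (1.0, 2.0): 6, (2.0, 3.0): 3, (3.0, 4.0): 1}
part2 = {(1.0, 2.0): 2, (2.0, 3.0): 5, (3.0, 4.0): 7, (4.0, 5.0): 2}

combined = reduce(combine_layer1, [part1, part2])
print(combined)
```

Because addition is associative and commutative, the same function could serve as both combiner and reducer under these assumptions.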
  



Evaluation: Measures

•  Percentage Error

     $$\varepsilon(P, S) = \frac{\sum_{i=1}^{k} |P_i - S_i|}{\sum_{i=1}^{k} S_i}$$

•  Affinity Coefficient

     $$\delta(P, S) = \sum_{i=1}^{k} \sqrt{P_i \cdot S_i}$$
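A small sketch of the two measures, reading the percentage error as the summed absolute difference of bin counts divided by the summed counts of S, and the affinity coefficient as the sum over bins of sqrt(P_i * S_i). The absolute value, the square root, and the normalization of both histograms to relative frequencies before taking the affinity are assumptions of this reading, as are the function names and the example counts.

```python
from math import sqrt
from typing import Sequence


def percentage_error(p: Sequence[float], s: Sequence[float]) -> float:
    """Sum of |P_i - S_i| divided by the sum of S_i (S is treated as the reference)."""
    if len(p) != len(s):
        raise ValueError("histograms must have the same number of bins")
    return sum(abs(pi - si) for pi, si in zip(p, s)) / sum(s)


def affinity_coefficient(p: Sequence[float], s: Sequence[float]) -> float:
    """Sum of sqrt(P_i * S_i) over relative frequencies; equals 1 for identical shapes."""
    if len(p) != len(s):
        raise ValueError("histograms must have the same number of bins")
    p_total, s_total = sum(p), sum(s)
    return sum(sqrt((pi / p_total) * (si / s_total)) for pi, si in zip(p, s))


approx = [10, 20, 30, 40]     # made-up counts from a streaming histogram
reference = [12, 18, 30, 40]  # made-up counts from a batch histogram

print(percentage_error(approx, reference))
print(affinity_coefficient(approx, reference))
```

A percentage error near 0 and an affinity coefficient near 1 both indicate that the two histograms agree closely.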


Evaluation: Varying Distribution

[Figure: three panels (Normal Distribution, Uniform Distribution, Log Normal Distribution), each comparing the original, PiD, and aPiD histograms and annotated with the values below]

Distribution    ε PiD        ε aPiD       δ PiD        δ aPiD
Normal          0.0934349    0.0369968    0.9869035    0.9956227
Uniform         0.0010695    0.0044543    0.9999998    0.9999959
Log Normal      0.0153203    0.0197731    0.9993737    0.9958205
Evaluation: Varying alpha

[Figure: two panels, "Median percentage error" and "Median affinity coefficient", each plotted against alpha (0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32) for PiD and aPiD on uniform, normal, and log normal data]
Conclusion

•  brought together PiD & MapReduce
•  streaming data, distributed, no parameters
•  approach is approximate, error is small




Thank you!

•  Questions & Answers




 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

Creating Histograms from Data Stream via MapReduce

  • 1. Creating Histograms from a Data Stream via MapReduce. Hans-Henning Gabriel. © 2012 Datameer, Inc. All rights reserved.
  • 2. What is a histogram? Distribution of a random variable; bar graph showing frequency; basic algorithm: batch discretization. [Figure: sample input values 3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8, ... and the resulting "Histogram of Values"; x-axis: Values (2 to 10), y-axis: Frequency (0 to 5).]
  • 3. What can it be used for? Optimize data processing; probability density estimation; machine learning algorithms; visual impression of the data.
  • 4. Histograms in Datameer.
  • 5. Conditions. Data arrives as a stream (minimum and maximum value?); data is distributed (compute and combine bins via MapReduce?); no user interaction (how to set parameters?).
  • 6. Outline. Partition Incremental Discretization (PiD): dropping parameters; Distribute & Combine: MapReduce; Evaluation: small error; Conclusion.
  • 7. PiD: 2-Layer Approach. [Figure: layer 1 holds counts over breaks spaced at step = 1; a border extension adds a break at the edge; a bin whose count passes the "> alpha?" test is split at its midpoint (new break at 3.5); the result is drawn as a "Histogram of Values"; x-axis: Values, y-axis: Frequency.]
  • 8. adjustedPiD: Parameters Dropped. Splitting threshold alpha: a bin is split when $\frac{\text{count} + 1}{\text{total} + 2} > \alpha$; the smaller the better, so set it to a small constant value, e.g. $\alpha = 0.01$. Parameter step: maintain Min and Max values; extend border breaks based on Min and Max.
  • 9. adjustedPiD: Splitting Behavior. [Equation: a limit for $s \to \infty$ of a sum over $x = 1, \dots, s$ involving $\text{count}_{MAX}$ and the constant 0.01, evaluating to 298.] [Figure: number of bins (0 to 300) against number of records (0 to 1000) for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32.]
  • 10. MapReduce: Combine Layer 1. [Figure: layer-1 bins A1 to A8 from partial histograms; the bins A2, A3 and A4 that occur in both are added, giving one combined layer 1 over A1 to A8.]
  • 11. Evaluation: Measures. Percentage error: $\varepsilon(P, S) = \frac{\sum_{i=1}^{k} |P_i - S_i|}{\sum_{i=1}^{k} S_i}$. Affinity coefficient: $\delta(P, S) = \sum_{i=1}^{k} \sqrt{P_i \cdot S_i}$.
  • 12. Evaluation: Varying Distribution. [Figure: three panels (Normal Distribution, Uniform Distribution, Log Normal Distribution), each comparing the original histogram with PiD and aPiD. Values printed in the panels, in extraction order: εPiD = 0.0010695, 0.0153203, 0.0934349; εaPiD = 0.0044543, 0.0197731, 0.0369968; δPiD = 0.9993737, 0.9999998, 0.9869035; δaPiD = 0.9958205, 0.9999959, 0.9956227.]
  • 13. Evaluation: Varying alpha.
  • 14. Evaluation: Varying alpha. [Figure: two panels, "Median percentage error" (0.00 to 0.20) and "Median affinity coefficient" (0.97 to 1.00), each plotted against alpha = 0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32 for PiD and aPiD on uniform, normal and log normal data.]
  • 15. Conclusion. Brought together PiD and MapReduce; handles streaming, distributed data with no parameters; the approach is approximate, and the error is small.
  • 16. Thank you! Questions & Answers.
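The split test on slide 8, the layer-1 combination on slide 10 and the two measures on slide 11 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the function names are hypothetical, the measures follow the usual reading of the slide formulas (absolute differences for the percentage error, square roots for the affinity coefficient), and the combine step assumes that partial histograms share the same layer-1 breaks wherever they overlap.

```python
from collections import Counter
from math import sqrt


def should_split(count, total, alpha=0.01):
    """Split test from slide 8: (count + 1) / (total + 2) > alpha."""
    return (count + 1) / (total + 2) > alpha


def combine_layer1(*partials):
    """Add the per-bin counts of partial layer-1 histograms (slide 10).

    Each partial maps a bin's left break to its count. Assumption: bins
    that overlap between partials sit on identical breaks.
    """
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts per key
    return dict(merged)


def percentage_error(p, s):
    """Sum of |P_i - S_i| divided by the sum of S_i (slide 11)."""
    return sum(abs(pi - si) for pi, si in zip(p, s)) / sum(s)


def affinity(p, s):
    """Sum of sqrt(P_i * S_i) over all bins (slide 11)."""
    return sum(sqrt(pi * si) for pi, si in zip(p, s))


if __name__ == "__main__":
    # Two partial histograms, e.g. one per mapper, overlapping in bin 3.
    print(combine_layer1({2: 7, 3: 3}, {3: 2, 4: 5}))
    print(should_split(count=10, total=25))
    print(percentage_error([1, 3], [2, 2]))
    print(affinity([0.5, 0.5], [0.5, 0.5]))
```

With relative frequencies as input, the affinity coefficient reaches 1 when both histograms agree, which matches the values close to 1 reported on slide 12.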