Creating Histograms from a Data Stream via MapReduce
Hans-Henning Gabriel




                        © 2012 Datameer, Inc. All rights reserved.

What is a histogram?

•  Distribution of a random variable
•  Bar graph showing frequency
•  Basic algorithm: Batch Discretization

Sample values: 3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8, ...

[Figure: "Histogram of Values"; x-axis: Values (2 to 10), y-axis: Frequency (0 to 5)]

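Batch discretization can be sketched as follows. This is a minimal illustration that assumes equal-width bins and a complete data set held in memory, which is why the minimum and maximum are known up front. The function name `batch_histogram` and the choice of four bins are illustrative and not taken from the slides.

```python
from typing import List, Sequence, Tuple


def batch_histogram(values: Sequence[float], num_bins: int) -> Tuple[List[float], List[int]]:
    """Equal-width histogram over a complete, in-memory data set."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 if all values are equal
    breaks = [lo + i * width for i in range(num_bins + 1)]
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin instead of past the end.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return breaks, counts


# The seven sample values listed on the slide.
sample = [3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8]
breaks, counts = batch_histogram(sample, num_bins=4)
print(breaks, counts)
```

The call to `min` and `max` before binning is the step that a data stream does not allow, since the full range is not known when the first values arrive.
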
What can it be used for?

•  Optimize data processing
•  Probability density estimation
•  Machine learning algorithms
•  Visual impression of the data

Histograms in Datameer

[Figure: histograms as shown in Datameer]

Conditions

•  Data arrives as a stream
     •  minimum and maximum value?
•  Data is distributed
     •  compute and combine bins via MapReduce?
•  No user interaction
     •  how to set parameters?

Outline

•  Partition Incremental Discretization (PiD)
     •  dropping parameters
•  Distribute & Combine
     •  MapReduce
•  Evaluation
     •  small error
•  Conclusion

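PiD: 2-Layer Approach

[Figure: layer 1 of PiD shown as counts over breaks, next to the resulting "Histogram of Values" (x-axis: Values, y-axis: Frequency). Two update operations are illustrated. Border extension: with step = 1, the breaks 2, 3, 4, 5 are extended to 2, 3, 4, 5, 6. Split: when a bin's share exceeds alpha, the bin is divided in two; in the example a bin with count 10 becomes two bins with count 5 each, giving counts 7, 5, 5, 5, 5 over breaks 2, 3, 3.5, 4, 5, 6.]
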
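The two layer-1 operations can be sketched as below. This is a hedged illustration, not the presenter's implementation: the class `PiDLayer1` and its method names are hypothetical, the split test `(count + 1) / (total + 2) > alpha` is the one this deck gives for the splitting threshold, and halving the count at the bin midpoint is an assumption about how a split redistributes records.

```python
import bisect
from typing import List


class PiDLayer1:
    """Sketch of a PiD first layer: sorted breaks plus one count per bin."""

    def __init__(self, breaks: List[float], step: float, alpha: float) -> None:
        self.breaks = list(breaks)               # len(breaks) == len(counts) + 1
        self.counts = [0.0] * (len(breaks) - 1)
        self.step = step                         # width of bins added at the borders
        self.alpha = alpha                       # splitting threshold
        self.total = 0                           # number of records seen so far

    def update(self, x: float) -> None:
        # Border extension: add empty bins until x lies inside the covered range.
        while x < self.breaks[0]:
            self.breaks.insert(0, self.breaks[0] - self.step)
            self.counts.insert(0, 0.0)
        while x >= self.breaks[-1]:
            self.breaks.append(self.breaks[-1] + self.step)
            self.counts.append(0.0)

        # Locate the bin [breaks[k], breaks[k + 1]) that contains x.
        k = bisect.bisect_right(self.breaks, x) - 1
        self.counts[k] += 1
        self.total += 1

        # Split: halve a bin whose smoothed share of the stream exceeds alpha.
        if (self.counts[k] + 1) / (self.total + 2) > self.alpha:
            mid = (self.breaks[k] + self.breaks[k + 1]) / 2
            half = self.counts[k] / 2
            self.breaks.insert(k + 1, mid)
            self.counts[k] = half
            self.counts.insert(k + 1, half)


# alpha = 0.7 is unrealistically large; it only keeps this demo short.
layer = PiDLayer1(breaks=[2.0, 3.0, 4.0, 5.0], step=1.0, alpha=0.7)
for value in [5.5, 4.2, 4.4, 4.6, 4.8]:
    layer.update(value)
print(layer.breaks, layer.counts)
```

In the demo, the value 5.5 triggers a border extension to 6.0, and the fifth record pushes the bin starting at 4.0 over the threshold so that it is split at 4.5.
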
adjustedPiD: Parameters Dropped

•  Splitting threshold alpha:

     $$\frac{\text{count} + 1}{\text{total} + 2} > \alpha$$

     •  the smaller the better
     → set alpha to a small constant value, e.g. alpha = 0.01

•  Parameter step:
     •  maintain Min and Max values
     •  extend border breaks based on Min and Max

adjustedPiD: Splitting Behavior

$$\text{count}_{\text{MAX}} = 1 + \lim_{s \to \infty} \sum_{x=1}^{s} 2^{-x}, \qquad \frac{\text{count}_{\text{MAX}} + 1}{0.01} - 2 = 298$$

[Figure: number of bins (0 to 300) against number of records (0 to 1000), one curve each for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32]

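The figure 298 follows from rearranging the split test (count + 1) / (total + 2) > alpha for total, which gives total < (count + 1) / alpha - 2. The sketch below evaluates that bound for each alpha in the plot legend. The function name `split_bound` is hypothetical, and reading the bound as the number of records after which a bin of maximum count no longer splits is an interpretation of the slide rather than something it states.

```python
import math


def split_bound(max_count: float, alpha: float) -> float:
    """Upper bound on total for which (max_count + 1) / (total + 2) > alpha holds."""
    return (max_count + 1) / alpha - 2


# Geometric series from the slide: 1 + sum of 2^-x for x = 1..s approaches 2.
count_max = 1 + sum(2.0 ** -x for x in range(1, 60))

for alpha in (0.01, 0.02, 0.04, 0.08, 0.16, 0.32):
    print(f"alpha={alpha}: bound={split_bound(count_max, alpha):.2f}")
```
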
MapReduce: Combine Layer 1

[Figure: layer-1 bins A1 to A8 from separately computed histograms are combined into a single histogram; counts of overlapping bins are added (combined row: A1, A2 + A5, A3 + A6, A4 + A7, A8).]

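A reduce-side combine step might look like the sketch below. The slide only shows that counts of overlapping bins are added; the handling of breaks that do not line up (spreading a source bin's count proportionally over the merged bins it overlaps, which assumes values are uniform inside a bin) is an assumption, as are the names `Histogram` and `combine`.

```python
from typing import List, Tuple

Histogram = Tuple[List[float], List[float]]  # (breaks, counts)


def combine(a: Histogram, b: Histogram) -> Histogram:
    """Merge two layer-1 histograms onto the union of their breaks."""
    breaks = sorted(set(a[0]) | set(b[0]))
    counts = [0.0] * (len(breaks) - 1)
    for src_breaks, src_counts in (a, b):
        for lo, hi, count in zip(src_breaks, src_breaks[1:], src_counts):
            for i, (new_lo, new_hi) in enumerate(zip(breaks, breaks[1:])):
                overlap = min(hi, new_hi) - max(lo, new_lo)
                if overlap > 0:
                    # Assumption: values are spread uniformly inside a source bin.
                    counts[i] += count * overlap / (hi - lo)
    return breaks, counts


# Two partial histograms, as two mappers might emit them.
left = ([0.0, 1.0, 2.0, 3.0], [4.0, 2.0, 6.0])
right = ([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 1.0])
merged = combine(left, right)
print(merged)
```

Because `combine` takes two histograms and returns one of the same shape, a reducer can fold any number of partial histograms with repeated calls.
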
Evaluation: Measures

•  Percentage Error

     $$\varepsilon(P, S) = \frac{\sum_{i=1}^{k} |P_i - S_i|}{\sum_{i=1}^{k} S_i}$$

•  Affinity Coefficient

     $$\delta(P, S) = \sum_{i=1}^{k} \sqrt{P_i \cdot S_i}$$

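Both measures are straightforward to compute over two vectors of k bin frequencies. The sketch below assumes that P and S are relative frequencies over the same bins, that the percentage error uses absolute differences, and that the affinity coefficient takes the square root of each product; the slides do not say which of P and S is the streaming histogram and which is the reference.

```python
import math
from typing import Sequence


def percentage_error(p: Sequence[float], s: Sequence[float]) -> float:
    """Sum of absolute bin differences divided by the total of S."""
    return sum(abs(pi - si) for pi, si in zip(p, s)) / sum(s)


def affinity_coefficient(p: Sequence[float], s: Sequence[float]) -> float:
    """Sum of sqrt(P_i * S_i); 1 when P and S are identical relative frequencies."""
    return sum(math.sqrt(pi * si) for pi, si in zip(p, s))


# Illustrative relative frequencies, not data from the evaluation.
s_freq = [0.2, 0.3, 0.5]
p_freq = [0.25, 0.25, 0.5]
print(percentage_error(p_freq, s_freq), affinity_coefficient(p_freq, s_freq))
```

A small percentage error and an affinity coefficient close to 1 both indicate that the two histograms agree closely.
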
Evaluation: Varying Distribution

[Figure: three panels (Normal Distribution, Uniform Distribution, Log Normal Distribution), each overlaying the original, PiD, and aPiD histograms]

Distribution    εPiD         εaPiD        δPiD         δaPiD
Normal          0.0934349    0.0369968    0.9869035    0.9956227
Uniform         0.0010695    0.0044543    0.9999998    0.9999959
Log Normal      0.0153203    0.0197731    0.9993737    0.9958205

Evaluation: Varying alpha

[Figure: two panels, "Median percentage error" (y-axis 0.00 to 0.20) and "Median affinity coefficient" (y-axis 0.97 to 1.00), each plotted against alpha = 0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32 for PiD and aPiD on uniform, normal, and log normal data]

Conclusion

•  Brought together PiD & MapReduce
•  Streaming data, distributed, no parameters
•  Approach is approximate, error is small

Thank you!

•  Questions & Answers

AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Creating Histograms from Data Stream via MapReduce

  • 1. Creating Histograms from a Data Stream via MapReduce
      Hans-Henning Gabriel
      © 2012 Datameer, Inc. All rights reserved.
  • 2. What is a histogram?
      • Distribution of a random variable
      • Bar graph showing frequency
      • Basic algorithm: Batch Discretization
      Example values: 3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8, ...
      [Figure: "Histogram of Values"; x-axis: Values (2 to 10), y-axis: Frequency (0 to 5)]
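Batch discretization, as named on the slide above, can be sketched as follows. This is a minimal illustration that assumes equal-width bins and a complete batch held in memory; the function name `batch_histogram` and the choice of four bins are hypothetical and not taken from the slides.

```python
def batch_histogram(values, num_bins):
    """Equal-width histogram over a complete, in-memory batch of values."""
    lo, hi = min(values), max(values)   # needs the whole batch up front
    if hi == lo:
        return [lo, hi], [len(values)]  # all values identical: one bin
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp the index so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    breaks = [lo + i * width for i in range(num_bins + 1)]
    return breaks, counts


# The seven example values shown on the slide; num_bins=4 is an assumption.
breaks, counts = batch_histogram([3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8], num_bins=4)
print(counts)
```

The sketch needs the minimum and maximum before it can place a single value, which is exactly what a stream does not provide.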
  • 3. What can it be used for?
      • Optimize data processing
      • Probability density estimation
      • Machine learning algorithms
      • Visual impression of the data
  • 4. Histograms in Datameer
  • 5. Conditions
      • Data arrives as a stream
          • minimum and maximum value?
      • Data is distributed
          • compute and combine bins via MapReduce?
      • No user interaction
          • how to set parameters?
  • 6. Outline
      • Partition Incremental Discretization (PiD)
          • dropping parameters
      • Distribute & Combine
          • MapReduce
      • Evaluation
          • small error
      • Conclusion
  • 7. PiD: 2-Layer Approach
      [Figure: layer 1 holds counts (7, 3, 10) over breaks (2, 3, 4, 5). Border extension with step = 1 extends the breaks to (2, 3, 4, 5, 6). A bin that tests "> alpha?" is split, giving breaks (2, 3, 3.5, 4, 5, 6). Plot: "Histogram of Values"; x-axis: Values, y-axis: Frequency]
  • 8. adjustedPiD: Parameters Dropped
      • Splitting threshold alpha: split a bin when $\frac{count + 1}{total + 2} > \alpha$
          • the smaller the better, so set to a small constant value, e.g. alpha = 0.01
      • Parameter step:
          • maintain Min and Max values
          • extend border breaks based on Min and Max
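The two rules on the slide above (the split test and the Min/Max border handling) might look roughly like this in code. It is a sketch under stated assumptions, not the talk's implementation: the class name `AdjustedPiDLayer1` is hypothetical, border extension is read here as moving the outermost break to the new Min or Max, and a split bin is assumed to hand half of its count to each half.

```python
import bisect


class AdjustedPiDLayer1:
    """Sketch of a layer-1 summary: sorted breaks plus one count per bin."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha   # splitting threshold, small constant as on the slide
        self.breaks = []     # breaks[i] and breaks[i + 1] delimit bin i
        self.counts = []     # counts can become fractional after splits
        self.total = 0

    def update(self, value):
        self.total += 1
        if not self.breaks:
            # First record: start with one degenerate bin [value, value].
            self.breaks = [value, value]
            self.counts = [0.0]
        # Assumed border extension: move the outer break to the new Min/Max.
        if value < self.breaks[0]:
            self.breaks[0] = value
        elif value > self.breaks[-1]:
            self.breaks[-1] = value
        # Locate the bin; the current maximum belongs to the last bin.
        i = min(bisect.bisect_right(self.breaks, value) - 1, len(self.counts) - 1)
        self.counts[i] += 1
        self._maybe_split(i)

    def _maybe_split(self, i):
        lo, hi = self.breaks[i], self.breaks[i + 1]
        mid = (lo + hi) / 2
        # Split test from the slide: (count + 1) / (total + 2) > alpha.
        if lo < mid < hi and (self.counts[i] + 1) / (self.total + 2) > self.alpha:
            half = self.counts[i] / 2          # assumption: halve the count
            self.breaks.insert(i + 1, mid)
            self.counts[i:i + 1] = [half, half]


# Deterministic toy stream (hypothetical data, values between 0.0 and 10.0).
layer1 = AdjustedPiDLayer1(alpha=0.01)
stream = [((i * 37) % 101) / 10 for i in range(500)]
for v in stream:
    layer1.update(v)
print(len(layer1.counts), "bins for", layer1.total, "records")
```

Moving the outer break widens the border bin, so the counts already in it are spread over a larger interval; that is one source of the approximation error.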
  • 9. adjustedPiD: Splitting Behavior
      [Equation: count_MAX, written as 1 plus a limit for s → ∞ of a sum over x = 1, ..., s, evaluates to 298 for alpha = 0.01]
      [Figure: number of bins (0 to 300) versus number of records (0 to 1000) for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32]
  • 10. MapReduce: Combine Layer 1
      [Figure: layer-1 bins A1 to A8 from separate partial histograms; bins A2, A3 and A4 are added (+) when the histograms are combined]
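Combining layer-1 summaries, as on the slide above, can be illustrated with plain Python standing in for a reduce step. The sketch assumes that partial histograms use identical breaks, so matching bins can be summed by key; the bin keys, the counts, and the function name `combine` are made up for the example.

```python
from collections import Counter
from functools import reduce


def combine(left, right):
    """Reduce step: add the counts of bins that share the same (lo, hi) key."""
    merged = Counter(left)
    merged.update(right)   # Counter.update adds counts instead of replacing
    return dict(merged)


# Hypothetical partial histograms, one per mapper, keyed by (lo, hi) breaks.
partials = [
    {(2, 3): 7, (3, 4): 3},
    {(3, 4): 2, (4, 5): 10},
    {(4, 5): 1, (5, 6): 5},
]
combined = reduce(combine, partials, {})
print(combined)
```

Because addition is associative and commutative, the same function could serve as both combiner and reducer. Partial histograms whose breaks differ would first have to be mapped onto common breaks, which this sketch leaves out.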
  • 11. Evaluation: Measures
      • Percentage Error: $\varepsilon(P, S) = \frac{\sum_{i=1}^{k} |P_i - S_i|}{\sum_{i=1}^{k} S_i}$
      • Affinity Coefficient: $\delta(P, S) = \sum_{i=1}^{k} \sqrt{P_i \cdot S_i}$
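The two measures on the slide above can be computed along these lines. This is a sketch: the percentage error is taken as the sum of absolute per-bin differences divided by the sum of S, and the affinity coefficient is assumed to operate on relative frequencies, so the code normalizes both histograms first. The function names and the sample counts are hypothetical.

```python
import math


def percentage_error(p, s):
    """Sum of absolute per-bin differences, relative to the total of s."""
    return sum(abs(pi - si) for pi, si in zip(p, s)) / sum(s)


def affinity_coefficient(p, s):
    """Sum of sqrt(P_i * S_i) over relative frequencies (assumed normalization)."""
    p_total, s_total = sum(p), sum(s)
    return sum(math.sqrt((pi / p_total) * (si / s_total)) for pi, si in zip(p, s))


# Hypothetical bin counts over the same k = 4 bins.
p = [7, 3, 10, 5]
s = [7, 5, 8, 5]
print(percentage_error(p, s))
print(affinity_coefficient(p, s))
```

With this normalization two identical histograms have an affinity coefficient of 1 and a percentage error of 0, which matches the direction of the values reported in the evaluation slides (errors near 0, affinities near 1).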
  • 12. Evaluation: Varying Distribution
      [Figure: three panels (Normal Distribution, Uniform Distribution, Log Normal Distribution) comparing the original histogram with PiD and aPiD]
      Reported values, in transcript order (the assignment to panels is not recoverable):
      εPiD = 0.0010695, 0.0153203, 0.0934349
      εaPiD = 0.0044543, 0.0197731, 0.0369968
      δPiD = 0.9993737, 0.9999998, 0.9869035
      δaPiD = 0.9958205, 0.9999959, 0.9956227
  • 13. Evaluation: Varying alpha
  • 14. Evaluation: Varying alpha
      [Figure: two panels, "Median percentage error" and "Median affinity coefficient", plotted against alpha = 0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32 for PiD and aPiD on uniform, normal and log normal data]
  • 15. Conclusion
      • brought together PiD & MapReduce
      • streaming data, distributed, no parameters
      • approach is approximate, error is small
  • 16. Thank you!
      • Questions & Answers
  • 17. © 2012 Datameer, Inc. All rights reserved.