Creating Histograms from Data Stream via MapReduce
DataWorks Summit
1.
Creating Histograms from a Data Stream via MapReduce
Hans-Henning Gabriel
© 2012 Datameer, Inc. All rights reserved.
2.
What is a histogram?
• Distribution of a random variable
• Bar graph showing frequency
• Basic algorithm: Batch Discretization
[Figure: "Histogram of Values" for the sample values 3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8, ...; x-axis: Values, y-axis: Frequency]
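As a point of reference for the streaming approach, batch discretization can be sketched as an equal-width histogram over data that is fully available up front. This is a minimal illustration, not the implementation used in the talk; the function name `batch_histogram` and the choice of four bins are assumptions, and the sample values are the ones visible on the slide.

```python
from typing import List, Sequence, Tuple


def batch_histogram(values: Sequence[float], num_bins: int) -> Tuple[List[float], List[int]]:
    """Equal-width histogram computed in one batch over fully known data."""
    if not values or num_bins < 1:
        raise ValueError("need at least one value and one bin")
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 if all values are equal (hypothetical convention).
    width = (hi - lo) / num_bins or 1.0
    breaks = [lo + i * width for i in range(num_bins + 1)]
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin instead of a bin of its own.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return breaks, counts


sample = [3.1, 4.9, 9.9, 8.5, 3.3, 8.8, 7.8]
breaks, counts = batch_histogram(sample, num_bins=4)
print(breaks)
print(counts)
```

Note that the batch version needs the minimum and maximum before any value can be binned, which is exactly what a stream does not offer.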
3.
What can it be used for?
• Optimize data processing
• Probability density estimation
• Machine learning algorithms
• Visual impression of the data
4.
Histograms in Datameer
5.
Conditions
• Data arrives as a stream
  • minimum and maximum value?
• Data is distributed
  • compute and combine bins via MapReduce?
• No user interaction
  • how to set parameters?
6.
Outline
• Partition Incremental Discretization (PiD)
  • dropping parameters
• Distribute & Combine
  • MapReduce
• Evaluation
  • small error
• Conclusion
7.
PiD: 2-Layer Approach
[Figure: layer-1 counts (7, 3, 10, ...) over breaks (2, 3, 4, 5, ...) with step = 1, showing border extension and the split of a bin that exceeds alpha (new break at 3.5), next to a "Histogram of Values" plot; x-axis: Values, y-axis: Frequency]
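The first layer shown on this slide can be sketched roughly as follows: bins of width `step` are appended when a value falls outside the current borders, and a bin is halved once it passes the split test. This is a hedged sketch, not the speaker's code. The class name `Layer1`, its constructor arguments, and the even split of a bin's count are assumptions; the split test uses the form (count + 1) / (total + 2) > alpha given on the parameters slide, and the second layer is not shown.

```python
import bisect
from typing import List


class Layer1:
    """Hypothetical sketch of a PiD-style first layer: bins that extend and split."""

    def __init__(self, start: float, step: float, num_bins: int, alpha: float) -> None:
        self.step = step
        self.alpha = alpha
        self.breaks: List[float] = [start + i * step for i in range(num_bins + 1)]
        self.counts: List[float] = [0.0] * num_bins
        self.total = 0

    def update(self, x: float) -> None:
        # Border extension: add bins of width `step` until x is covered.
        while x < self.breaks[0]:
            self.breaks.insert(0, self.breaks[0] - self.step)
            self.counts.insert(0, 0.0)
        while x >= self.breaks[-1]:
            self.breaks.append(self.breaks[-1] + self.step)
            self.counts.append(0.0)
        # Locate the bin i with breaks[i] <= x < breaks[i + 1].
        i = bisect.bisect_right(self.breaks, x) - 1
        self.counts[i] += 1
        self.total += 1
        # Split test: halve the bin if it holds too large a share of the data.
        if (self.counts[i] + 1) / (self.total + 2) > self.alpha:
            mid = (self.breaks[i] + self.breaks[i + 1]) / 2
            half = self.counts[i] / 2  # assumption: count is divided evenly
            self.breaks.insert(i + 1, mid)
            self.counts[i] = half
            self.counts.insert(i + 1, half)


layer = Layer1(start=2.0, step=1.0, num_bins=3, alpha=0.4)
for x in [2.5, 4.1, 4.7, 4.2, 3.3, 4.9, 6.4]:
    layer.update(x)
print(layer.breaks)
print(layer.counts)
```

Because each split only divides an existing count, the counts always sum to the number of records seen, so the structure stays a valid histogram at every step.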
8.
adjustedPiD: Parameters Dropped
• Splitting threshold alpha: split a bin when $\frac{\text{count} + 1}{\text{total} + 2} > \alpha$
  • the smaller the better → set to a small constant value, e.g. alpha = 0.01
• Parameter step:
  • maintain Min and Max values
  • extend border breaks based on Min and Max
9.
adjustedPiD: Splitting Behavior
[Formula, partly lost in extraction: count_MAX is given as a limit for s → ∞ of a sum over x = 1, ..., s and evaluates to 298 for alpha = 0.01]
[Figure: number of bins (0 to 300) versus number of records (0 to 1000), one curve each for alpha = 0.01, 0.02, 0.04, 0.08, 0.16, 0.32]
10.
MapReduce: Combine Layer 1
[Figure: partial layer-1 histograms with bins A1 to A8 combined by addition (+) into a single layer]
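One way to picture the combine step is to add two partial histograms on the union of their breaks. The sketch below is an assumption about how such a merge could work, not the method from the talk: it spreads each old bin's count over the new bins in proportion to overlap, which presumes values are uniform within a bin. The names `rebin`, `combine`, and `Histogram` are hypothetical.

```python
from functools import reduce
from typing import List, Tuple

Histogram = Tuple[List[float], List[float]]  # (breaks, counts)


def rebin(breaks: List[float], counts: List[float], new_breaks: List[float]) -> List[float]:
    """Spread counts onto new_breaks, assuming uniform values inside each old bin."""
    out = [0.0] * (len(new_breaks) - 1)
    for i, c in enumerate(counts):
        lo, hi = breaks[i], breaks[i + 1]
        for j in range(len(new_breaks) - 1):
            overlap = min(hi, new_breaks[j + 1]) - max(lo, new_breaks[j])
            if overlap > 0:
                out[j] += c * overlap / (hi - lo)
    return out


def combine(a: Histogram, b: Histogram) -> Histogram:
    """Add two partial histograms on the union of their breaks."""
    breaks = sorted(set(a[0]) | set(b[0]))
    counts_a = rebin(a[0], a[1], breaks)
    counts_b = rebin(b[0], b[1], breaks)
    return breaks, [x + y for x, y in zip(counts_a, counts_b)]


# Two partial results, e.g. from two mappers (illustrative numbers).
part_a: Histogram = ([2.0, 3.0, 4.0, 5.0], [7.0, 3.0, 10.0])
part_b: Histogram = ([3.0, 3.5, 4.0, 5.0, 6.0], [5.0, 5.0, 5.0, 5.0])

# A reducer can fold any number of partial histograms pairwise.
merged_breaks, merged_counts = reduce(combine, [part_a, part_b])
print(merged_breaks)
print(merged_counts)
```

Since `combine` takes two histograms and returns one of the same shape, it can be applied in any grouping, which is what makes it usable in a reduce or combiner stage.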
11.
Evaluation: Measures
• Percentage Error
  $\epsilon(P, S) = \frac{\sum_{i=1}^{k} |P_i - S_i|}{\sum_{i=1}^{k} S_i}$
• Affinity Coefficient
  $\delta(P, S) = \sum_{i=1}^{k} \sqrt{P_i \cdot S_i}$
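The two measures can be computed for a pair of histograms over the same k bins as in the sketch below. The function names are hypothetical, and normalising both inputs to relative frequencies before taking the affinity coefficient is an assumption made here so that identical histograms score exactly 1.

```python
import math
from typing import Sequence


def percentage_error(p: Sequence[float], s: Sequence[float]) -> float:
    """Sum of absolute bin differences, relative to the total of s."""
    return sum(abs(pi - si) for pi, si in zip(p, s)) / sum(s)


def affinity_coefficient(p: Sequence[float], s: Sequence[float]) -> float:
    """Sum of sqrt(P_i * S_i) over relative frequencies (assumed normalisation)."""
    total_p, total_s = sum(p), sum(s)
    return sum(math.sqrt((pi / total_p) * (si / total_s)) for pi, si in zip(p, s))


reference = [4.0, 6.0]
approximation = [5.0, 5.0]
err = percentage_error(approximation, reference)
aff = affinity_coefficient(approximation, reference)
print(err, aff)
```

A percentage error near 0 and an affinity coefficient near 1 both indicate that the two histograms agree closely.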
12.
Evaluation: Varying Distribution
[Figure: three panels (Normal Distribution, Uniform Distribution, Log Normal Distribution), each comparing the original histogram with PiD and aPiD. Values shown across the panels: εPiD = 0.0010695, 0.0153203, 0.0934349; εaPiD = 0.0044543, 0.0197731, 0.0369968; δPiD = 0.9993737, 0.9999998, 0.9869035; δaPiD = 0.9958205, 0.9999959, 0.9956227]
13.
Evaluation: Varying alpha
14.
Evaluation: Varying alpha
[Figure: two panels, "Median percentage error" and "Median affinity coefficient", plotted against alpha (0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32) for PiD and aPiD on uniform, normal, and log normal data]
15.
Conclusion
• Brought together PiD and MapReduce
• Handles streaming, distributed data without user-set parameters
• The approach is approximate; the error is small
16.
Thank you!
Questions & Answers