© 2014 MapR Technologies
• "Decoder ring" 
• "the next thing I want to do is this" 
• Flajolet
• What's the problem? 
– speed 
– feasibility 
– communication 
– incremental computation 
– tree-based pre-computation 
• What do we need? 
– on-line version 
– associative version
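A running mean is a minimal sketch of what "on-line" and "associative" ask for: the state is updated one sample at a time, and two partial states can be merged in any grouping. The names `MeanState` and `summarize` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeanState:
    """Summary of a stream: enough to recover the mean, nothing more."""
    count: int = 0
    total: float = 0.0

    def add(self, x: float) -> "MeanState":
        # On-line update: one sample at a time, constant memory.
        return MeanState(self.count + 1, self.total + x)

    def merge(self, other: "MeanState") -> "MeanState":
        # Associative combine: partial summaries can be merged in a tree.
        return MeanState(self.count + other.count, self.total + other.total)

    def mean(self) -> float:
        return self.total / self.count


def summarize(values):
    """Fold a sequence of samples into a MeanState."""
    state = MeanState()
    for v in values:
        state = state.add(v)
    return state
```

Summaries built on separate machines, for example `summarize([1.0, 2.0, 3.0])` and `summarize([4.0, 5.0])`, can be combined with `merge` without revisiting the raw data. The median, top-k and count-distinct problems on the following slides have no exact summary of this kind.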
• Why is that hard (impossible)? 
– pathological inputs 
– median ... any element of the first half of the data could be the median 
– k-th most common ... any element could occur often enough in the second half to end up with the biggest count 
– unique elements ... hashing loses information; any compact representation must have false positives or false negatives
• What can we do? 
– give up ... a slow, but exact answer may not be sooo bad 
– give up ... a fast, but inexact answer may not be sooo bad 
• The good news: 
– approximate can be very, very close to exact
The Classic Problems 
• Most common (top-40) 
• Count distinct 
• Quantiles, with focus on extremes
Classic Solutions 
• Leaky counters 
– Forget values, remember uncertainties 
• Count min sketch 
– Many small hash tables 
• Count distinct with HyperLogLog 
– Many hashes again 
• New Solution - Quantiles by t-digest 
– A new low in clustering
Classic Solutions - Leaky counters 
• Intuition: 
– Common elements are rarely rare, rare elements are always rare 
• Leaky counter: 
– new element inserted with count = 1, error = ceiling((N-1)/w) 
– every w samples, drop all elements with f + error < ceiling(N/w) (f = stored count, N = samples seen so far, w = window width) 
• Adaptation to heavy hitters is trivial
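The two leaky-counter rules can be sketched roughly as follows. This is an illustrative reading of the slide, not the speaker's implementation: the class name `LeakyCounter`, the method names, and the choice to count the incoming sample in N are all assumptions.

```python
import math


class LeakyCounter:
    """Approximate frequency counts following the slide's two rules."""

    def __init__(self, w):
        self.w = w          # window width: prune every w samples
        self.n = 0          # N: number of samples seen so far
        self.entries = {}   # element -> [f, error]

    def add(self, x):
        self.n += 1
        if x in self.entries:
            self.entries[x][0] += 1
        else:
            # New element: count = 1, error = ceiling((N-1)/w).
            self.entries[x] = [1, math.ceil((self.n - 1) / self.w)]
        if self.n % self.w == 0:
            # Every w samples: drop elements with f + error < ceiling(N/w).
            limit = math.ceil(self.n / self.w)
            self.entries = {
                key: fe for key, fe in self.entries.items()
                if fe[0] + fe[1] >= limit
            }

    def heavy_hitters(self, min_count):
        """Return surviving elements whose stored count f is at least min_count."""
        return {key: fe[0] for key, fe in self.entries.items() if fe[0] >= min_count}
```

Feeding it a stream in which `"a"` alternates with 50 one-off values (`w=10`) leaves `"a"` with its full count of 50 while the early one-off values are dropped, so `heavy_hitters(10)` returns only `{"a": 50}`.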
Classic Solutions - Count min sketch 
• Intuition: 
– A gazillion hashed counters can't all be wrong 
• Big array of counters, each row has a different hash function 
• Increment the counter in each row determined by hashing 
• Probe by finding the minimum hashed counter for the probe key 
• Oops... finding heavy hitters is tricky ... requires keeping log n sketches
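The insert and probe steps above can be sketched as below. This is a minimal illustration under stated assumptions: the class name, the default `depth` and `width`, and the use of salted `hashlib.blake2b` as the per-row hash function are choices made here, not details from the slides.

```python
import hashlib


class CountMinSketch:
    """A depth x width array of counters, one hash function per row."""

    def __init__(self, depth=4, width=1024):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Assumption: a different salt per row stands in for a different hash function.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        # Insert: increment one counter in every row.
        for i in range(self.depth):
            self.rows[i][self._index(i, key)] += count

    def estimate(self, key):
        # Probe: collisions only inflate counters, so the minimum is the best guess.
        return min(self.rows[i][self._index(i, key)] for i in range(self.depth))
```

Because counters are only ever incremented, `estimate` never undercounts; it overcounts only when the key collides with other keys in every row.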
Increment Hashed Locations to Insert 
[Figure: key a is hashed by each row's hash function h_i(a) to pick the counter to increment]
Probe Using min of Counts 
$\min_i k[h_i(a)]$
Classic Solutions - Count distinct with HyperLogLog 
• Intuition: 
– The smallest of n uniform samples is expected to be about 1/n (exactly 1/(n+1)) 
– Hashing turns anything into a uniform distribution 
– Hashing again turns anything into a new uniform distribution 
• Best done with pictures
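The intuition can be tried out in a few lines. The sketch below is not HyperLogLog itself; it keeps the minimum hashed value under many salted hash functions and inverts the mean of those minima. The function names, the use of `hashlib.blake2b`, and the default of 256 hash functions are assumptions made for illustration.

```python
import hashlib


def unit_hash(item, seed):
    """Map a string to a pseudo-uniform value in [0, 1); each seed acts as a new hash."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def estimate_distinct(items, num_hashes=256):
    """Estimate the number of distinct items from the minimum hash per seed."""
    mins = [1.0] * num_hashes
    for item in items:
        for seed in range(num_hashes):
            h = unit_hash(item, seed)
            if h < mins[seed]:
                mins[seed] = h
    # The minimum of n uniform samples has mean 1/(n+1), so invert the average.
    mean_min = sum(mins) / num_hashes
    return 1.0 / mean_min - 1.0
```

Duplicates hash to the same values and cannot change a minimum, so repeating the input leaves the estimate unchanged. The list of minima is also mergeable: taking the element-wise minimum of two lists gives the list for the combined stream.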
What does hashing look like? 
[Figure: plot labeled "ix", both axes from 0.0 to 1.0] 
[Figure: plot labeled "hash(ix)", both axes from 0.0 to 1.0]
Hashing fixes all ills 
[Figure, two panels. Original distribution: x ~ G(0.2, 0.2); mean = 1, median = 0.1, 5th percentile = 10^-6. After hashing: x-axis from 0.0 to 1.0]
Now the trick … what is the min? 
Repeated Minimum 
10 samples 
Min is ~ 0.1
[Figure: PDF of Min(x). Observed minimum value (100 samples x 10,000 replications) overlaid with the theoretical distribution. Mean = 0.0099]
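The observed mean of 0.0099 agrees with a short calculation for the minimum of n independent uniform samples on [0, 1]. The symbols X_i and n are introduced here for the derivation and do not appear on the slides.

```latex
\begin{align*}
P\left(\min_i X_i > x\right) &= (1 - x)^n, \qquad 0 \le x \le 1 \\
\mathbb{E}\left[\min_i X_i\right] &= \int_0^1 (1 - x)^n \, dx = \frac{1}{n + 1} \\
n = 100: \quad \mathbb{E}\left[\min_i X_i\right] &= \frac{1}{101} \approx 0.0099
\end{align*}
```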
Counting leading zeros is taking the log (almost) 
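The "(almost)" can be made concrete: if a hash value is read as a binary fraction x in (0, 1), its number of leading zero bits z satisfies z < -log2(x) <= z + 1. A small sketch, assuming 32-bit hash values and hypothetical helper names:

```python
import math


def leading_zeros(value, bits=32):
    """Count leading zero bits of a positive integer stored in `bits` bits."""
    return bits - value.bit_length()


def as_fraction(value, bits=32):
    """Read the same bits as a binary fraction in (0, 1)."""
    return value / 2**bits


def check(value, bits=32):
    """True when the leading-zero count brackets -log2 of the fraction."""
    z = leading_zeros(value, bits)
    neg_log = -math.log2(as_fraction(value, bits))
    return z < neg_log <= z + 1
```

So keeping only the leading-zero count of the smallest hash stores roughly the logarithm of the minimum, which needs only a few bits per register.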
[Figure: PDF of Min(x) against the observed minimum log10(value), x-axis from 1e-05 to 0.1, with an error axis. Mean ≈ -2.3, 10^mean ≈ 0.0056]
T-digest for Quantiles 
• Intuition: 
– 1-d k-means with size cap 
– Make size cap depend on distance to nearest end 
• Experimental verification 
– Distribution in cluster very uniform 
– Accuracy far better than alternatives, especially at extremes
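The two intuition bullets can be illustrated with a much simplified batch version. This is not the t-digest algorithm or any library's API: it sorts the data first, the function names and the `compression` parameter are hypothetical, and the cap 4 * n * q * (1 - q) / compression is one assumed way to make the size limit shrink toward both ends.

```python
def build_digest(values, compression=50):
    """Greedily cluster sorted, non-empty data with a quantile-dependent size cap."""
    data = sorted(values)
    n = len(data)
    clusters = []                 # list of (mean, count)
    seen = 0                      # points already placed in closed clusters
    mean, count = data[0], 1
    for x in data[1:]:
        # Approximate quantile of the open cluster's centre.
        q = (seen + count / 2) / n
        # Size cap: largest near the median, 1 near either end.
        cap = max(1.0, 4 * n * q * (1 - q) / compression)
        if count + 1 <= cap:
            mean += (x - mean) / (count + 1)   # running mean of the cluster
            count += 1
        else:
            clusters.append((mean, count))
            seen += count
            mean, count = x, 1
    clusters.append((mean, count))
    return clusters


def quantile(clusters, q):
    """Return the mean of the cluster that contains rank q * total."""
    total = sum(count for _, count in clusters)
    target = q * total
    cumulative = 0
    for mean, count in clusters:
        if cumulative + count >= target:
            return mean
        cumulative += count
    return clusters[-1][0]
```

On `range(10000)` this keeps single-point clusters at both ends and clusters of at most 200 points near the median, so extreme quantiles are answered from nearly exact values while the middle is summarized coarsely.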

Doing-the-impossible
